Evidence of direct complementary interactions between messenger RNAs and their cognate proteins

Anton A. Polyansky and Bojan Zagrovic*

Department of Structural and Computational Biology, Max F. Perutz Laboratories, University of Vienna, Campus Vienna Biocenter 5, A-1030 Vienna, Austria

Received March 20, 2013; Revised May 23, 2013; Accepted May 29, 2013

ABSTRACT

Recently, the ability to interact with messenger RNA (mRNA) has been reported for a number of known RNA-binding proteins, but surprisingly also for different proteins without recognizable RNA binding domains, including several transcription factors and metabolic enzymes. Moreover, direct binding to cognate mRNAs has been detected for multiple proteins, thus creating a strong impetus to search for functional significance and basic physico-chemical principles behind such interactions. Here, we derive interaction preferences between amino acids and RNA bases by analyzing binding interfaces in the known 3D structures of protein–RNA complexes. By applying this tool to the human proteome, we reveal statistically significant matching between the composition of mRNA sequences and base-binding preferences of protein sequences they code for. For example, purine density profiles of mRNA sequences mirror guanine affinity profiles of cognate protein sequences with quantitative accuracy (median Pearson correlation coefficient R = −0.80 across the entire human proteome). Notably, statistically significant anti-matching is seen only in the case of adenine. Our results provide strong evidence for the stereo-chemical foundation of the genetic code and suggest that mRNAs and cognate proteins may in general be directly complementary to each other and associate, especially if unstructured.
INTRODUCTION

In the 50 years since the discovery of messenger RNA (mRNA) (1), the relationship between this key biopolymer and proteins has been studied predominantly in the context of transmission of genetic information and protein synthesis. Recently, however, evidence of direct non-covalent binding between mRNAs and a number of functionally diverse proteins has been provided, including, surprisingly, various metabolic enzymes, transcription factors and scaffolding proteins with hitherto uncharacterized RNA-binding domains (2–5). It has been found that such mRNA–protein complexes frequently participate in the formation of RNA droplets in the cell (e.g. P-bodies), which display all features of a separate cytoplasmic microphase and open up new paradigms in cell biophysics (6–8). What is more, several proteins have been found over the years to directly bind their own cognate mRNAs, including among others thymidylate synthase, dihydrofolate reductase and p53 (2,9–14), with binding sites in both translated and untranslated mRNA regions. The functional significance of such cognate interactions has been clearly ascertained in some cases [e.g. translational feedback control (12)], but it is far from clear how general and functionally relevant they actually are. Kyrpides and Ouzounis hypothesized that cognate protein–mRNA interactions may represent an ancient mechanism for autoregulation of mRNA stability (9,10), but structural and mechanistic aspects of their proposal have never been explored in detail. Altogether, the rapid growth of the number of experimentally verified mRNA-binding proteins, both cognate and non-cognate, has now created a strong incentive to search for the functional significance of such interactions and, even more fundamentally, the basic physico-chemical rules that guide them.
Related to this, we have recently shown that pyrimidine (PYR) density profiles of mRNA sequences tend to closely mirror sequence profiles of the respective cognate proteins capturing their amino-acid affinity for pyridines, chemicals closely related to PYR (15). These findings provided strong support for the stereo-chemical hypothesis concerning the origin of the genetic code, the idea that the specific pairing between individual amino acids and cognate codons stems from direct binding preferences of the two for each other (16–21). However, based on our results, such binding complementarity may exist predominantly at the level of longer polypeptide and mRNA stretches rather than individual amino acids and codons. Stimulated by these findings, we hypothesized that PYR-rich regions in mRNAs and protein stretches encoded by them may bind each other in a complementary fashion, a feature encoded directly in the universal genetic code (15). Although strongly suggestive, these findings remained silent about the potentially equivalent complementarity on the side of purines (PUR), as well as any details concerning specific nitrogenous bases. In addition to confirming our previous results using a completely orthogonal approach, the present study provides strong novel evidence along both of these two key lines.

*To whom correspondence should be addressed. Tel: +43 1 4277 52271; Fax: +43 1 4277 9522; Email: bojan.zagrovic@univie.ac.at

Nucleic Acids Research, 2013, Vol. 41, No. 18, 8434–8443. Published online 18 July 2013. doi:10.1093/nar/gkt618

© The Author(s) 2013. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

MATERIALS AND METHODS

Analysis of contacts between amino-acid side chains and RNA nucleobases

All available structures of protein–RNA complexes (both X-ray and nuclear magnetic resonance structures) were downloaded from the Protein Data Bank (PDB) (22) in September 2012 using the 30% protein sequence identity and 3 Å resolution (for X-ray structures) cutoffs. The initial set was further manually filtered to exclude complexes containing double-stranded RNAs or mature transfer RNAs. The structures of the complete Saccharomyces cerevisiae (23), Escherichia coli (24) and Thermus thermophilus (25) ribosomes with the highest crystallographic resolution, as well as the 50S subunit of the Deinococcus radiodurans (26) and Haloarcula marismortui ribosome, were also included in the set. This resulted in a total of 299 individual PDB structures (Supplementary Table S1). An amino-acid residue and an RNA base were considered to be neighbors and form a contact if their centers of geometry were separated by less than a given cutoff distance. All the results reported in the main manuscript are given for the cutoff of 8 Å, whereas for testing purposes this cutoff was also varied between 6 and 10 Å with a 0.25 Å step. We separately analyzed contact statistics for residues having at least one neighboring base (set '1+', with a total of 25 820 unique contacts for 8 Å cutoff), at least two neighboring bases (set '2+', with a total of 16 331 unique contacts for 8 Å cutoff), or including only the two closest neighboring bases (set '2', with a total of 12 040 unique contacts for 8 Å cutoff).
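The contact definition and the three contact sets described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the input format (one coordinate array per side chain or base) and the use of NumPy are assumptions made here, and parsing of PDB files is left out.

```python
import numpy as np


def center_of_geometry(coords):
    """Unweighted mean position of an (n_atoms, 3) coordinate array."""
    return np.asarray(coords, dtype=float).mean(axis=0)


def find_contacts(side_chains, bases, cutoff=8.0):
    """Return (residue, base, distance) for each pair closer than cutoff (in Å).

    side_chains and bases are lists of per-group atom coordinates
    (hypothetical input format, assumed for this sketch).
    """
    sc = np.array([center_of_geometry(c) for c in side_chains])
    nb = np.array([center_of_geometry(c) for c in bases])
    # Pairwise distances between all side-chain and base centers.
    dist = np.linalg.norm(sc[:, None, :] - nb[None, :, :], axis=-1)
    return [(int(i), int(j), float(dist[i, j]))
            for i, j in np.argwhere(dist < cutoff)]


def contact_sets(contacts):
    """Split contacts into the '1+', '2+' and '2' sets."""
    by_residue = {}
    for i, j, d in contacts:
        by_residue.setdefault(i, []).append((d, j))
    # '1+': residues with at least one neighboring base.
    set_1p = [(i, j) for i, near in by_residue.items() for _, j in near]
    # '2+': residues with at least two neighboring bases.
    set_2p = [(i, j) for i, near in by_residue.items()
              if len(near) >= 2 for _, j in near]
    # '2': as '2+', but keeping only the two closest bases per residue.
    set_2 = [(i, j) for i, near in by_residue.items()
             if len(near) >= 2 for _, j in sorted(near)[:2]]
    return set_1p, set_2p, set_2
```

Scanning the cutoff between 6 and 10 Å in 0.25 Å steps then amounts to calling `find_contacts` with each value of `cutoff` in turn.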

Calculations of amino-acid interaction preferences

Amino acid/nucleobase preferences $e_{ij}$ (with $i = 1, \ldots, 20$ for amino acids and $j = 1, \ldots, 4$ for bases) were estimated using the following standard distance-independent contact potential formalism with the quasi-chemical definition of the reference state (27–31):

\[
e_{ij} = -\ln\frac{N_{ij}^{\mathrm{obs}}}{N_{ij}^{\mathrm{exp}}} = -\ln\frac{N_{ij}^{\mathrm{obs}}}{X_i X_j N_{\mathrm{TOT}}^{\mathrm{obs}}} \tag{1}
\]

where $N_{ij}^{\mathrm{obs}}$ is the number of observed contacts between amino acid side chain of type $i$ and nucleobase of type $j$ in experimental structures, and $N_{ij}^{\mathrm{exp}}$ is the expected number of such contacts. The latter is calculated as the product of molar fractions of amino acid $i$ and base $j$ among all observed contacts ($X_i$ and $X_j$, respectively) and the total number of all observed contacts, $N_{\mathrm{TOT}}^{\mathrm{obs}}$. Interaction preference scales of amino acids were obtained separately for guanine ('G-preference'), adenine ('A-preference'), cytosine ('C-preference'), uracil ('U-preference'), PUR (both G and A, 'PUR-preference') and PYR (both C and U, 'PYR-preference').
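As a minimal sketch of the quasi-chemical estimate, the preferences can be computed from a matrix of observed contact counts as follows. The function name and the count-matrix layout (rows for amino acids, columns for bases) are assumptions made for illustration. The negative sign follows the convention stated in the text that strongly negative values mark the highest affinities.

```python
import numpy as np


def interaction_preferences(n_obs):
    """Return e_ij = -ln(N_obs / N_exp) for a matrix of contact counts.

    n_obs has one row per amino-acid type and one column per base type
    (layout assumed for this sketch). N_exp = X_i * X_j * N_TOT, with
    X_i and X_j the fractions of amino acid i and base j among all contacts.
    """
    n_obs = np.asarray(n_obs, dtype=float)
    n_tot = n_obs.sum()
    x_i = n_obs.sum(axis=1) / n_tot   # fraction of contacts per amino acid
    x_j = n_obs.sum(axis=0) / n_tot   # fraction of contacts per base
    n_exp = np.outer(x_i, x_j) * n_tot
    # Pairs that were never observed give +inf (no detectable preference).
    with np.errstate(divide="ignore", invalid="ignore"):
        return -np.log(n_obs / n_exp)
```

One plausible way to obtain a combined scale such as the PUR-preference, assumed here rather than taken from the source, is to sum the G and A columns of the count matrix before calling the function.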

Proteome data

The sequences of the complete human proteome (17 083 proteins) were extracted from the UniProtKB database (January 2013 release), with maximal-protein-evidence-level set at 4 (i.e. proteins annotated as 'uncertain' were excluded) and with only the reviewed Swiss-Prot (32) entries used for further analysis. The coding sequences of their corresponding mRNAs were extracted using the 'Cross-references' section of each UniProtKB entry, where, out of several possible translated RNA sequences, the first one satisfying the length criterion (RNA length = 3 × protein length + 3) was selected and its sequence downloaded from the European Nucleotide Archive database (http://www.ebi.ac.uk/ena). Only protein and RNA sequences consisting exclusively of canonical amino acids or nucleotides were chosen for analysis. The complete set of mRNA/protein sequences used herein is included in the Supplementary Data. The average content of codons when it comes to individual nucleobases or PYR or PURs for all 20 amino acids ('codon content' scales) was extracted from the thus-obtained cognate mRNA and protein sequences.
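A 'codon content' scale can be sketched as below, assuming the cognate sequences are already available as (coding sequence, protein) string pairs. The function name and input format are illustrative assumptions, and retrieval from UniProtKB or the European Nucleotide Archive is not shown. Pairs that fail the length criterion or contain non-canonical nucleotides are skipped.

```python
from collections import defaultdict


def codon_content_scale(pairs, bases="AG"):
    """Average fraction of the given bases in the codons of each amino acid.

    pairs is an iterable of (coding_sequence, protein_sequence) strings;
    bases="AG" gives the PUR content, "CU" the PYR content, "G" the G content.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for cds, protein in pairs:
        cds = cds.upper().replace("T", "U")
        # Length criterion from the text: three bases per residue plus a stop codon.
        if len(cds) != 3 * len(protein) + 3:
            continue
        if not set(cds) <= set("ACGU"):
            continue
        for k, aa in enumerate(protein):
            codon = cds[3 * k:3 * k + 3]
            totals[aa] += sum(b in bases for b in codon) / 3.0
            counts[aa] += 1
    return {aa: totals[aa] / counts[aa] for aa in counts}
```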

Correlation calculations

Pearson correlation coefficients (R) were calculated between nucleobase preferences and 'codon content' scales, and between sequence profiles of nucleobase content for mRNAs and of different amino-acid preference scales for proteins from the complete human proteome set. Before comparison, the profiles were smoothed using a sliding-window averaging procedure; the window size of 21 residues/codons was used for all calculations.
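The profile comparison for a single mRNA/protein pair can be sketched as follows. The function names are hypothetical, the per-codon base fraction is used as the mRNA density, and the treatment of sequence ends (only full windows are kept) is an assumption, since the text does not specify it.

```python
import numpy as np


def smooth(values, window=21):
    """Sliding-window average that keeps only positions with a full window."""
    values = np.asarray(values, dtype=float)
    if len(values) < window:
        raise ValueError("sequence is shorter than the averaging window")
    return np.convolve(values, np.ones(window) / window, mode="valid")


def profile_correlation(cds, protein, scale, bases="AG", window=21):
    """Pearson R between the smoothed mRNA base-density profile and the
    smoothed protein preference profile of one cognate pair.

    scale maps each amino acid to its interaction preference value.
    """
    codons = [cds[3 * k:3 * k + 3] for k in range(len(protein))]
    density = [sum(b in bases for b in codon) / 3.0 for codon in codons]
    preference = [scale[aa] for aa in protein]
    r = np.corrcoef(smooth(density, window), smooth(preference, window))
    return float(r[0, 1])
```

With preferences defined as in Equation (1), a negative R from this function indicates matching between the two profiles.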

Analysis of statistical significance

Statistical significance (P-values) of the observed correlations was estimated using a randomization procedure involving random shuffling of the interaction preference scales. Each scale was shuffled one million times, and Pearson correlation coefficients (R) against codon content scales, as well as for mRNA/protein profiles, were calculated for each shuffled scale. The reported P-values correspond to the fraction of shuffled scales which exhibit a higher absolute R than the original (|R| > |R_original|) in the case of codon content comparisons, or for which ⟨R⟩ is higher in absolute value than ⟨R_original⟩ in the case of sequence-profile comparisons.


The typical randomized scales, whose distributions of correlation coefficients are depicted in the manuscript, were chosen to be those whose mean and standard deviation are the same as the average mean and the average standard deviation over all 10⁶ randomized scales in each case.
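The randomization test for the codon content comparison can be sketched as below. This is an illustrative version under stated assumptions: the function name, the seeded NumPy random generator and the default number of shuffles are choices made here (the text uses one million shuffles), and the sequence-profile variant, which compares proteome-wide mean correlations, is not shown.

```python
import numpy as np


def shuffle_p_value(scale, codon_content, n_shuffles=10_000, seed=0):
    """Return (R_original, P), where P is the fraction of shuffled scales
    with |R| greater than |R_original|.

    scale and codon_content are equal-length sequences, one value per
    amino acid, given in the same amino-acid order.
    """
    rng = np.random.default_rng(seed)
    scale = np.asarray(scale, dtype=float)
    codon_content = np.asarray(codon_content, dtype=float)
    r_original = np.corrcoef(scale, codon_content)[0, 1]
    hits = 0
    for _ in range(n_shuffles):
        # Reassign preference values to amino acids at random.
        r = np.corrcoef(rng.permutation(scale), codon_content)[0, 1]
        if abs(r) > abs(r_original):
            hits += 1
    return float(r_original), hits / n_shuffles
```

With a finite number of shuffles, a result of zero hits only bounds the P-value from above by roughly one over the number of shuffles.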

Analysis of protein disorder and gene ontology (GO) classification

The average disorder for each protein sequence in the human proteome was predicted using the IUPred server (33). Fourteen subsets of proteins displaying best or worst matching between their interaction preference profiles and nucleobase density profiles of their cognate mRNAs in terms of Pearson R were extracted from the human proteome (top and bottom 10% cohorts) for the six cases of direct correspondence between nucleobase preferences and nucleobase composition profiles (e.g. protein G-preference versus G mRNA content, G_protein-G_mRNA, etc.) and for the G-preference versus PUR mRNA content one (G_protein-PUR_mRNA). Each of these subsets contains 1707 proteins, for which average disorder values were assigned. Means and standard deviations of the 14 thus-obtained distributions of average predicted disorder were compared with those of the entire human proteome (background). The significance of the mean difference from the background was estimated for each of the analyzed subsets using the Wilcoxon signed-rank test. The gene ontology (GO) analysis was performed for the same seven top 10% best-matching protein subsets using the DAVID functional annotation server (34). The entire human proteome was used as background, and only the most significantly enriched functional terms with a DAVID EASE score (P-value) ≤ 10⁻¹⁰ were considered.

Data visualization

The 3D structures of protein–RNA and amino acid/nucleobase complexes were visualized using PyMol (http://www.pymol.org) (35). The contact statistics heat-map was produced using MATLAB (R2009a). Pearson R distributions for mRNA/protein profiles were processed and visualized using Grace (http://plasma-gate.weizmann.ac.il/Grace).

RESULTS

Derivation of amino acid/nucleobase interaction preferences

How differentiated and context-dependent are the preferences of amino acids to interact with specific nitrogenous bases? To address this question, we analyze contact interfaces of 300 high-resolution structures of different protein–RNA complexes, including five ribosomal structures (Supplementary Table S1). We use distances between centers of geometry of amino-acid side chains and nucleotide nitrogenous bases, in combination with a fixed cutoff, to define contacting neighbors (Figure 1A). In this way, we isolate sequence-specific protein–RNA contacts (36–38), while ignoring non-specific interactions defined exclusively by protein or RNA backbones. We first present results for the distance cutoff of 8 Å, following Shakhnovich et al., who established cutoffs between 7 and 8 Å to be optimal for residue-based statistical potentials describing protein–DNA interactions, albeit with a slightly different definition of reference points (28). However, all of our principal findings hold qualitatively for cutoffs between 6 and 9 Å, as discussed later in the text. Finally, to differentiate cases in which an amino acid interacts with a single base only from denser, potentially more stereospecific contacts with more than one neighboring base within the cutoff, we separately merge contact statistics over the whole set of studied structures for amino acids having at least one neighboring base (set '1+') or at least two neighboring bases (set '2+', Figure 1A) within the cutoff.

Using standard distance-independent contact potential formalism (27–31), we subsequently derive scales of amino acid/nucleobase interaction preferences (Figure 1B and Supplementary Table S2) and use them to address the following questions: (i) how does the average composition of mRNA codons coding for a given amino acid relate to the preferences of this amino acid to interact with different nucleobases at protein/RNA interfaces and (ii) how does sequence density of different bases in mRNA-coding sequences relate to sequence profiles of amino-acid interaction preferences for these and other bases in cognate protein sequences?

Figure 1. Derivation of amino acid/nucleobase interaction preference scales from known structures of RNA/protein complexes. (A) We define amino-acid side chains and RNA bases in a given complex to be contacting neighbors if their centers of geometry are less than a given cutoff radius R apart (left and middle) and merge contact statistics over the entire set of studied structures (right, '2+' set with applied 8 Å cutoff). (B) Interaction preference scales of amino acids (in arbitrary units) for binding to guanines (G), PYR and PUR obtained from set '2+' statistics using 8 Å cutoff (panel A, right). The scales are statistical analogs of relative free energy of binding (see 'Materials and Methods' section), with the prominently negative values corresponding to amino acid side chains having the highest affinities for bases of a given type and vice versa.

Amino acid interaction preferences and their codon content

We first focus on contact statistics from set '2+'. Dinucleotides were found previously to exhibit potential for specific recognition of amino acids at protein–RNA interfaces (39) and have also been suggested as potential catalysts for amino acid synthesis in pre-biotic environments (40). Moreover, set '2+' by definition also includes all instances where triplets of bases directly contact a given amino acid, which may be relevant in the context of the genetic code. Using set '2+' statistics, we observe a remarkably strong correlation between preferences of amino acids to interact with guanine (G-preference, Figure 1B) and the average PUR content of their respective codons as derived from the complete human proteome, with Pearson correlation coefficient R of −0.84 (Figure 2A). Negative Pearson correlation coefficients indicate matching between amino acid preferences and codon content, owing to the way preference is defined (see 'Materials and Methods' section). Put differently, amino acids which are predominantly encoded by PURs display a strong tendency to co-localize with G at protein–RNA interfaces. This is also true, albeit at a somewhat weaker level of correlation, for matching between PUR composition of individual codons from the standard genetic table and the respective G-preferences, if the statistics of codon usage in the human proteome is not included (R = −0.68, Supplementary Figure S1). The observed signal for G is statistically highly significant, as evidenced by randomization calculations (P-value < 10⁻⁶, Figure 2B). Related to this, G-preference of amino acids inversely correlates with C and U content of their codons (Figure 2B). Somewhat less prominent, but still extremely significant correlations are observed for G- and C-preference of amino acids and the average G- and C-content of their codons (R of −0.47 and −0.58, respectively). On the other hand, the interface statistics for adenine (A) and uracil (U) do not correlate with their average usage in codons. In particular, the A-preference of amino acids correlates inversely with the A-content (R = 0.59) or directly with the U-content of their codons (R = −0.51), whereas the U-preference exhibits relatively low correlations throughout (Figure 2B). Finally, both PYR and PUR binding preferences of amino acids (Figure 1B) display significant correlations with PYR and PUR fraction in their codons, with R of −0.54 and −0.53, respectively, and P-values < 10⁻⁶ in both cases. In other words, amino acids coded for by PYR-rich codons prefer to co-localize with PYR, and those coded for by PUR-rich codons with PUR, at RNA–protein interfaces. Although similar in the present case, PYR- and PUR-preference scales need not necessarily be inverses of each other, owing to the way preferences are defined, and we therefore here report and discuss both.

Matching between sequence profiles of mRNAs and their cognate proteins

How do these observations translate if one compares complete mRNA-coding sequences with their cognate protein sequences? Owing to codon usage bias and non-uniform amino-acid composition of the human proteome, these results could in principle deviate significantly from the results obtained for individual codons and amino acids. To address this question, we calculate a Pearson R for every cognate mRNA/protein pair in the human proteome, capturing the correlation between each mRNA sequence composition profile and the base-binding preference profile of its cognate protein sequence. Remarkably, we observe an extremely high level of matching between PUR density profiles of mRNAs and G-preference profiles of cognate protein sequences, with a median Pearson R (R_median) over the entire human proteome of −0.80 and a low P-value (<10⁻⁶), as determined by randomization (Figure 2C). In particular, the distribution of Pearson R values for this scale over the human proteome is significantly left shifted and shows only marginal overlap with the one calculated for a typical randomized interaction preference scale (Figure 2C). For illustration, we present sequence profiles for proteins of most abundant length (300–400 amino acids, Supplementary Figure S2) displaying typical (i.e. exhibiting a Pearson R equal to the population median) or best levels of correlation (Figure 2D). As is evident, the PUR density of mRNAs is quantitatively extremely well predicted by the G-binding preference profiles of cognate proteins, even for typical human proteins (R_median = −0.80 and P < 10⁻⁶). We also observe significant matching between C-preference profiles for protein sequences and both C- and PYR-density profiles of their cognate mRNAs, with R_median of −0.55 and −0.47, respectively (Figure 2E). In contrast, the A-preferences display significant matching with PYR-density profiles on the side of mRNA (Figure 2E; see also Supplementary Table S2 for the full report of profile correlations), with R_median of −0.53. Finally, a strong and significant level of matching is observed for PYR-binding preferences of amino acids and PYR mRNA profiles, as well as PUR-binding preferences of amino acids and PUR profiles (R_median of −0.58 in both cases and P-values of 8.6 × 10⁻³ and 7.9 × 10⁻³, respectively, Figure 3A and C). From the exemplary typical and best profiles (Figure 3B and D), it is clear that the PYR- and PUR-rich regions in mRNA code for stretches of amino acids in cognate proteins which prefer to co-localize with PYR and PUR bases, respectively, at protein–RNA interfaces in the known 3D PDB structures. The typical level of similarity between sequence profiles is actually greater than what one might infer from R_median values, suggesting that Pearson correlation coefficient might not even be the optimal measure of deviation in this case. Importantly, this direct physico-chemical complementarity between mRNA and cognate protein sequences may be indicative of pronounced potential for complex formation between them, especially under circumstances when 'peak' regions become available for such interactions. Given the fact that a significant matching of profiles is detected at the level of primary sequences, we propose that the presence of extended unstructured protein and mRNA segments may be required for such binding. This suggestion agrees well with recent knowledge-based studies, where RNA loops and bulges were found to be more likely to interact with amino-acid side chains in a specific manner (38,41).

How sensitive is the level of matching to the choice of cutoff distance used to define contacting amino acids and nucleobases in protein/RNA complexes? To address this question, we have repeated the aforementioned analysis for a range of different cutoff values going from 6 to 10 Å in steps of 0.25 Å (Figure 4). Overall, for set '2+', our findings are largely robust to the choice of the exact cutoff in this range, albeit with a somewhat lower level of significance for longer cutoffs. However, the majority of the signal is lost if one uses the '1+' set, except for G-preference and PUR-content (Figure 5A) and A-preference and PYR-content (Supplementary Table S2). This observation strongly suggests that close, dense packing of nucleobases around amino acids may be required for specificity in cognate complex formation. Although interfaces may be dynamic and liquid-like, as we have suggested before, they may still need to be densely packed. Interestingly, if one reduces the '2+' set by including only the two closest bases in contact with a given amino acid (set '2'), the signal for G-preference/PUR-content even further improves by several percentage points (Figure 5A), and the same holds for C-preference/C-content and A-preference/PYR-content (Supplementary Table S2).

Figure 2. Relationship between nucleobase-binding preferences of amino acids and mRNA content at multiple levels. (A) Correlation between G interaction preferences of amino acids (Figure 1B) and the average PUR content of their codons in mRNAs of the entire human proteome. (B) Pairwise Pearson correlation coefficients (R) between base-binding preference scales of amino acids ('scl') and average base content of their codons ('cdn'). (C) Distributions of correlation coefficients (R) between window-averaged PUR-content profiles of individual mRNA coding sequences and window-averaged G-preference sequence profiles of the respective proteins for the entire human proteome (window size = 21). The dashed curve depicts the distribution of correlation coefficients calculated for a typical randomized G-preference scale. Inset: the distribution of the means of sequence-profile correlation coefficients for the human proteome (⟨R⟩) calculated for 10⁶ randomized G-preference scales. The ⟨R⟩ for the original G-preference scale is shown with an arrow. (D) Typical (R = R_median) and best pairs of mRNA PUR-content (black curves) and protein G-preference profiles (red dashed curves) for human proteins. (E) Median pairwise Pearson correlation coefficients for comparison between nucleobase content profiles of mRNAs (subscript 'mRNA', x-axis) and base-preference-weighted protein sequence profiles (subscript 'protein', y-axis) over the entire human proteome. All results are based on the analysis of set '2+' statistics. All data reported for preference scales are obtained using an 8 Å cutoff.

To further study the role of protein structural disorder in matching, we have analyzed the levels of the predicted disorder of the top and the bottom 10% of proteins when it comes to the degree of mRNA/protein profile matching, as captured by Pearson R coefficient (see 'Materials and Methods' section). We have done this for the six cases of direct comparison, whereby the same base type is used for both protein preference and mRNA profile density (G_protein-G_mRNA, A_protein-A_mRNA, C_protein-C_mRNA, U_protein-U_mRNA, PUR_protein-PUR_mRNA and PYR_protein-PYR_mRNA), and also for the case displaying the strongest signal in our analysis (G_protein-PUR_mRNA). Importantly, in the case of G_protein-G_mRNA, A_protein-A_mRNA and C_protein-C_mRNA matching, we do observe a pronounced tendency for the top and the bottom 10% cohorts to be significantly enriched (top 10%) and depleted (bottom 10%) in disordered proteins (Supplementary Table S3), whereas in the case of U_protein-U_mRNA matching, the situation is reversed. Interestingly, for PUR_protein-PUR_mRNA, PYR_protein-PYR_mRNA and G_protein-PUR_mRNA matching, one observes slight disorder enrichment in both top and bottom cohorts. The most prominent shift of the distribution of predicted average disorder toward higher disorder, as compared with background, is observed for the top 10% cohort of proteins displaying strong matching between C-preference profiles of their sequences and the C-content of their cognate mRNAs (C_protein-C_mRNA, Supplementary Table S3, Supplementary Figure S3). One might argue that this effect could just be related to compositional properties of such protein and mRNA pairs, whereby disordered proteins are simply encoded by C-rich sequences. However, the differences between nucleobase compositions of mRNAs from the C_protein-C_mRNA top 10% cohort and the complete proteome are minor, suggesting that the underlying explanation might be more complex (Supplementary Figure S3).

Which biological functions might be associated with a high level of complementarity between proteins and cognate mRNAs? To address this question, we have performed GO analysis for seven different top 10% subsets of proteins displaying strong matching with cognate mRNAs (see 'Materials and Methods' section for details). In Supplementary Table S4, we report the most significantly enriched biological functions (using a P-value cutoff of 10⁻¹⁰) shared by proteins from the analyzed cohorts. In a striking agreement with our hypothesis, in most cases we observe pronounced enrichment of terms related to nucleic-acid/protein interactions, including regulation of RNA metabolic processes, ribonucleoprotein complexes and transcription. The latter, in particular, allows one to speculate that protein tendencies to associate with cognate mRNA might be used by the cells to modulate gene expression pathways. What is more, PUR or PYR density profiles of mRNAs are identical to PUR or PYR density profiles of coding-strand DNA sequences (with Us being replaced by Ts). Although, based on our statistical potentials, we cannot say anything about T-binding preferences of amino acids, it is possible that our results may be generalizable even to DNA-protein interactions, as well as other RNA-protein interactions. One should also mention that, depending on the particular type of matching, other biological functions also tend to be enriched. For instance, the U_protein-U_mRNA top 10% subset displays significant enrichment of membrane proteins, whereas the G_protein-PUR_mRNA top cohort seems to be populated by extracellular proteins, and particularly those involved in the functioning of the innate immune system. Altogether, our preliminary GO analysis illustrates significant functional differences between proteins that strongly complement their cognate mRNAs and the rest of the human proteome, and these findings will be further explored in another manuscript.

Figure 3. PYR/PUR mRNA sequence profiles strongly match PYR/PUR-preferences of cognate protein sequences. PYR (A and B) and PUR (C and D) amino-acid preference scales are given in Figure 1B. For details, please see the analogous captions to Figure 2C and D.

DISCUSSION

High levels of matching between base-binding-preference profiles of proteins and PYR- or PUR-density profiles of cognate mRNA-coding sequences, defined primarily by amino acid preferences to co-localize with G and C bases at RNA/protein interfaces, allow one to speculate that direct complementary binding interactions may be a key element underlying the whole mRNA/protein relationship, when it comes to both its evolutionary development and present-day biology (Figure 5B). This agrees well with, and significantly extends, our previous findings, where we have shown that protein sequence profiles of amino acid affinity for PYR analogs (42–44) mirror PYR density profiles of cognate mRNA sequences (15). It should be emphasized, however, that our present results are based exclusively on the statistics of direct amino acid/nucleobase contacts at RNA/protein interfaces. It is therefore still possible that the driving force for interactions between mRNAs and cognate proteins is non-specific (e.g. binding of positively charged amino acid side chains to RNA phosphate groups), whereas complementary interactions actually confer specificity to binding.

Figure 4. Effect of cutoff radius used to define protein–RNA contacts on observed correlations. (A) Dependence of Pearson correlation coefficients (R) between amino acid preference scales and average codon content on the cutoff radius for the two sets of statistics studied (‘1+’, ‘2+’). The total number of unique contacts in the ‘1+’ and ‘2+’ (given in parentheses) sets obtained for each of the cutoff radii used is indicated at the top of the panel. (B) Cutoff radius dependence of median pairwise Pearson correlation coefficients (Rmedian) for comparison between nucleobase content profiles of mRNAs and base-preference-weighted protein sequence profiles over the entire human proteome (color code the same as in panel A).

Figure 5. Physico-chemical origins of the mRNA/protein relationship. (A) Correlation coefficients (R and <R> with standard deviations) between PYR or PUR average codon content (‘Codon content’) and respective mRNA profiles (‘Profiles’), calculated for G- (blue), PUR- (red) and PYR- (green) binding preferences of amino acids, which were obtained using different amino acid neighbor statistics (‘1+’, ‘2+’ or ‘2’). (B) A model of physico-chemical complementarity between proteins and cognate mRNAs. Preferential interactions of amino acids with PYR or PUR define their codon content in the genetic table and facilitate complementary interactions between PYR/PUR-rich mRNA regions and PYR/PUR-preferring regions in proteins. The opposite behavior of adenines and guanines adds an additional layer of complexity in the case of PURs, as signified by dashed arrows in the model. Note: polymer sizes not drawn to scale.

Moreover, our results provide a clear evolutionary perspective concerning the physico-chemical origins of translation, in line with the stereo-chemical hypothesis of the origin of the genetic code (16–21). In particular, our results give strong support to the possibility of direct templating of proteins from mRNAs in the era before the development of ribosomal decoding, and of the code’s fixation in that era (17,45). In this framework, ancient amino acids associated with mRNA directly, following their intrinsic physico-chemical preferences as outlined here. However, the fact that an analogous effect is not seen for all bases, especially adenine and uracil, supports the possibility that, in addition to physico-chemical rationales in the context of direct binding, other evolutionary forces were also responsible for shaping the genetic code, as suggested before (19). Our results are most consistent with the possibility that the early stereo-chemical phase in the code’s development was dominated by G- and C-rich codons, as the strongest correlations are seen for precisely these bases. If the basic structure of the early genetic code was defined by such codons, but was later modulated by the inclusion of A and U bases, this might explain why G-affinity of amino acids in present-day protein sequences closely follows PUR density profiles in cognate mRNAs. Interestingly, Trifonov and coworkers have suggested that the first codons were G- and C-rich on the basis of a consensus analysis of 40 different criteria (46).

Importantly, it should be emphasized that the stereo-chemical hypothesis of the code’s origin may differ from the cognate mRNA/protein complementary interaction hypothesis in terms of its evolutionary underpinnings. Direct templating of proteins from mRNAs in ancient systems (the coding aspect of the stereo-chemical hypothesis) does not necessarily imply that modern proteins directly interact with their own mRNA (complementary interaction hypothesis). However, our findings support the possibility that the origin of the genetic code and potential complementarity between proteins and cognate mRNAs might have the same physico-chemical background. It is well possible that other, independent influences have shaped both effects, and the two hypotheses leave ample room for such refinements. However, we would like to stress that, in our view, the two hypotheses are interlinked: cognate binding is, on the one hand, a reasonable consequence of the stereochemical hypothesis, but on the other hand, it also gives a potential biological rationale for the early development of the code to begin with, such as stabilization of RNA structures by bound polypeptides, as has been suggested before (45).

There are a number of open challenges concerning the aforementioned proposal. First and foremost, the structural features of mRNAs and cognate proteins impose severe constraints on any putative complementarity between the two. Namely, with the contour length of the mRNA coding part being 4.5 times longer than that of a cognate protein, it is not clear what structural arrangements may be consistent with any complementary interactions. We would like to suggest that structures of such complexes may be dynamic and liquid-like, with mRNA stretches enveloping and solubilizing cognate protein stretches (15). Second, with many mRNAs and proteins being well-folded and compact for most of the time, it remains to be studied when and how opportunities could arise for the complementarity between their primary sequences to be of relevance. It is possible that, if at all realistic, such complementary binding might be functionally important precisely in those situations where both polymers are unstructured, such as during translation, export and degradation, as a consequence of thermal stress, or in the case of intrinsically unstructured proteins. However, we do not exclude the possibility of complementary interactions even in the folded state. Finally, concerning the origin of the genetic code, it is not clear how the final, well-defined structure of the code could have arisen based on still partially non-specific, large-scale binding interactions between mRNAs and cognate proteins. As suggested before, it is possible that the answer lies in a combination of different influences (19). Future research should shed light on these and related questions.

These challenges notwithstanding, our findings provide strong evidence that the ability to interact with mRNA might be a widespread phenomenon in the cell, involving not only cognate proteins, but also other proteins based on similar principles. The significance of such physico-chemical complementarity between mRNAs and proteins potentially extends to all facets of nucleic acid and protein biology in the modern cell, including transcription/translation regulation (9,10,47,48), mRNA transport and localization (49,50), processing and decay (51), structure of ribonucleoproteins (52) and others (2–5,53,54). Our preliminary GO analysis has demonstrated a significant enrichment of functions related to association with nucleic acids for the subsets of proteins that complement their cognate mRNAs strongly, and these findings will be explored in more detail in future work.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

ACKNOWLEDGEMENTS

The authors thank S. Dorner, R. Schroeder, A. Vaziri, G. Warren and members of the Laboratory of Computational Biophysics at MFPL for useful advice and critical reading of the manuscript.

FUNDING

This work was supported in part by the Austrian Science Fund FWF [START grant Y514-B11 to B.Z.]; European Research Council [ERC Starting Independent grant 279408 to B.Z.]. Funding for open access charge: Austrian Science Fund FWF.

Conflict of interest statement. None declared.

REFERENCES

1. Brenner,S., Jacob,F. and Meselson,M. (1961) An unstable intermediate carrying information from genes to ribosomes for protein synthesis. Nature, 190, 576–581.

2. Anders,G., Mackowiak,S.D., Jens,M., Maaskola,J., Kuntzagk,A., Rajewsky,N., Landthaler,M. and Dieterich,C. (2012) doRiNA: a database of RNA interactions in post-transcriptional regulation. Nucleic Acids Res., 40, D180–D186.

3. Baltz,A.G., Munschauer,M., Schwanhausser,B., Vasile,A., Murakawa,Y., Schueler,M., Youngs,N., Penfold-Brown,D., Drew,K., Milek,M. et al. (2012) The mRNA-bound proteome and its global occupancy profile on protein-coding transcripts. Mol. Cell, 46, 674–690.

4. Castello,A., Fischer,B., Eichelbaum,K., Horos,R., Beckmann,B.M., Strein,C., Davey,N.E., Humphreys,D.T., Preiss,T., Steinmetz,L.M. et al. (2012) Insights into RNA biology from an atlas of mammalian mRNA-binding proteins. Cell, 149, 1393–1406.

5. Konig,J., Zarnack,K., Luscombe,N.M. and Ule,J. (2012) Protein-RNA interactions: new genomic technologies and perspectives. Nat. Rev. Genet., 13, 221–221.

6. Mitchell,S.F., Jain,S., She,M. and Parker,R. (2013) Global analysis of yeast mRNPs. Nat. Struct. Mol. Biol., 20, 127–133.

7. Weber,S.C. and Brangwynne,C.P. (2012) Getting RNA and protein in phase. Cell, 149, 1188–1191.

8. Han,T.N.W., Kato,M., Xie,S.H., Wu,L.C., Mirzaei,H., Pei,J.M., Chen,M., Xie,Y., Allen,J., Xiao,G.H. et al. (2012) Cell-free formation of RNA granules: bound RNAs identify features and components of cellular assemblies. Cell, 149, 768–779.

9. Kyrpides,N.C. and Ouzounis,C.A. (1993) Mechanisms of specificity in messenger-RNA degradation - autoregulation and cognate interactions. J. Theor. Biol., 163, 373–392.

10. Ouzounis,C.A. and Kyrpides,N.C. (1994) Reverse interpretation - a hypothetical selection mechanism for adaptive mutagenesis based on autoregulated messenger-RNA stability. J. Theor. Biol., 167, 373–379.

11. Chu,E., Copur,S.M., Ju,J., Chen,T.M., Khleif,S., Voeller,D.M., Mizunuma,N., Patel,M., Maley,G.F., Maley,F. et al. (1999) Thymidylate synthase protein and p53 mRNA form an in vivo ribonucleoprotein complex. Mol. Cell. Biol., 19, 1582–1594.

12. Tai,N., Schmitz,J.C., Liu,J., Lin,X., Bailly,M., Chen,T.M. and Chu,E. (2004) Translational autoregulation of thymidylate synthase and dihydrofolate reductase. Front. Biosci., 9, 2521–2526.

13. Schuttpelz,M., Schoning,J.C., Doose,S., Neuweiler,H., Peters,E., Staiger,D. and Sauer,M. (2008) Changes in conformational dynamics of mRNA upon AtGRP7 binding studied by fluorescence correlation spectroscopy. J. Am. Chem. Soc., 130, 9507–9513.

14. Zhao,X., Liu,M., Wu,N., Ding,L., Liu,H. and Lin,X. (2010) Recovery of recombinant zebrafish p53 protein from inclusion bodies and its binding activity to p53 mRNA in vitro. Protein Expr. Purif., 72, 262–266.

15. Hlevnjak,M., Polyansky,A.A. and Zagrovic,B. (2012) Sequence signatures of direct complementarity between mRNAs and cognate proteins on multiple levels. Nucleic Acids Res., 40, 8874–8882.

16. Woese,C.R. (1965) Order in genetic code. Proc. Natl Acad. Sci. USA, 54, 71–75.

17. Woese,C.R. (1965) On the evolution of the genetic code. Proc. Natl Acad. Sci. USA, 54, 1546–1552.

18. Yarus,M. (1998) Amino acids as RNA ligands: a direct-RNA-template theory for the code’s origin. J. Mol. Evol., 47, 109–117.

19. Koonin,E.V. and Novozhilov,A.S. (2009) Origin and evolution of the genetic code: the universal enigma. IUBMB Life, 61, 99–111.

20. Yarus,M., Widmann,J.J. and Knight,R. (2009) RNA-amino acid binding: a stereochemical era for the genetic code. J. Mol. Evol., 69, 406–429.

21. Johnson,D.B. and Wang,L. (2010) Imprints of the genetic code in the ribosome. Proc. Natl Acad. Sci. USA, 107, 8298–8303.

22. Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The protein data bank. Nucleic Acids Res., 28, 235–242.

23. Ben-Shem,A., de Loubresse,N.G., Melnikov,S., Jenner,L., Yusupova,G. and Yusupov,M. (2011) The structure of the eukaryotic ribosome at 3.0 angstrom resolution. Science, 334, 1524–1529.

24. Dunkle,J.A., Wang,L.Y., Feldman,M.B., Pulk,A., Chen,V.B., Kapral,G.J., Noeske,J., Richardson,J.S., Blanchard,S.C. and Cate,J.H.D. (2011) Structures of the bacterial ribosome in classical and hybrid states of tRNA binding. Science, 332, 981–984.

25. Polikanov,Y.S., Blaha,G.M. and Steitz,T.A. (2012) How hibernation factors RMF, HPF and YfiA turn off protein synthesis. Science, 336, 915–918.

26. Harms,J.M., Wilson,D.N., Schluenzen,F., Connell,S.R., Stachelhaus,T., Zaborowska,Z., Spahn,C.M. and Fucini,P. (2008) Translational regulation via L11: molecular switches on the ribosome turned on and off by thiostrepton and micrococcin. Mol. Cell, 30, 26–38.

27. Miyazawa,S. and Jernigan,R.L. (1985) Estimation of effective interresidue contact energies from protein crystal-structures - quasi-chemical approximation. Macromolecules, 18, 534–552.

28. Donald,J.E., Chen,W.W. and Shakhnovich,E.I. (2007) Energetics of protein-DNA interactions. Nucleic Acids Res., 35, 1039–1047.

29. Jonikas,M.A., Radmer,R.J., Laederach,A., Das,R., Pearlman,S., Herschlag,D. and Altman,R.B. (2009) Coarse-grained modeling of large RNA molecules with knowledge-based potentials and structural filters. RNA, 15, 189–199.

30. Perez-Cano,L., Solernou,A., Pons,C. and Fernandez-Recio,J. (2010) Structural prediction of protein-RNA interaction by computational docking with propensity-based statistical potentials. Pac. Symp. Biocomput., 15, 269–280.

31. Tuszynska,I. and Bujnicki,J.M. (2011) DARS-RNP and QUASI-RNP: new statistical potentials for protein-RNA docking. BMC Bioinformatics, 12, 348.

32. Gasteiger,E., Gattiker,A., Hoogland,C., Ivanyi,I., Appel,R.D. and Bairoch,A. (2003) ExPASy: the proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res., 31, 3784–3788.

33. Dosztanyi,Z., Csizmok,V., Tompa,P. and Simon,I. (2005) IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics, 21, 3433–3434.

34. Huang da,W., Sherman,B.T. and Lempicki,R.A. (2009) Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc., 4, 44–57.

35. The PyMOL Molecular Graphics System, Version 1.3r1 (2010) Schrodinger, LLC. http://www.pymol.org/citing (4 July 2013, date last accessed).

36. Treger,M. and Westhof,E. (2001) Statistical analysis of atomic contacts at RNA-protein interfaces. J. Mol. Recognit., 14, 199–214.

37. Hoffman,M.M., Khrapov,M.A., Cox,J.C., Yao,J., Tong,L. and Ellington,A.D. (2004) AANT: the amino acid-nucleotide interaction database. Nucleic Acids Res., 32, D174–D181.

38. Gupta,A. and Gribskov,M. (2011) The role of RNA sequence and structure in RNA-protein interactions. J. Mol. Biol., 409, 574–587.

39. Fernandez,M., Kumagai,Y., Standley,D.M., Sarai,A., Mizuguchi,K. and Ahmad,S. (2011) Prediction of dinucleotide-specific RNA-binding sites in proteins. BMC Bioinformatics, 12(Suppl. 13), S5.

40. Copley,S.D., Smith,E. and Morowitz,H.J. (2005) A mechanism for the association of amino acids with their codons and the origin of the genetic code. Proc. Natl Acad. Sci. USA, 102, 4442–4447.

41. Iwakiri,J., Tateishi,H., Chakraborty,A., Patil,P. and Kenmochi,N. (2012) Dissecting the protein-RNA interface: the role of protein surface shapes and RNA secondary structures in protein-RNA recognition. Nucleic Acids Res., 40, 3299–3306.


42. Woese,C.R., Dugre,D.H., Saxinger,W.C. and Dugre,S.A. (1966) The molecular basis for the genetic code. Proc. Natl Acad. Sci. USA, 55, 966–974.

43. Woese,C.R. (1973) Evolution of the genetic code. Naturwissenschaften, 60, 447–459.

44. Mathew,D.C. and Luthey-Schulten,Z. (2008) On the physical basis of the amino acid polar requirement. J. Mol. Evol., 66, 519–528.

45. Noller,H.F. (2012) Evolution of protein synthesis from an RNA world. Cold Spring Harb. Perspect. Biol., 4, 1–U20.

46. Trifonov,E.N., Kirzhner,A., Kirzhner,V.M. and Berezovsky,I.N. (2001) Distinct stages of protein evolution as suggested by protein sequence analysis. J. Mol. Evol., 53, 394–401.

47. Vaquerizas,J.M., Kummerfeld,S.K., Teichmann,S.A. and Luscombe,N.M. (2009) A census of human transcription factors: function, expression and evolution. Nat. Rev. Genet., 10, 252–263.

48. Sonenberg,N. and Hinnebusch,A.G. (2009) Regulation of translation initiation in eukaryotes: mechanisms and biological targets. Cell, 136, 731–745.

49. Lecuyer,E., Yoshida,H., Parthasarathy,N., Alm,C., Babak,T., Cerovina,T., Hughes,T.R., Tomancak,P. and Krause,H.M. (2007) Global analysis of mRNA localization reveals a prominent role in organizing cellular architecture and function. Cell, 131, 174–187.

50. Martin,K.C. and Ephrussi,A. (2009) mRNA localization: gene expression in the spatial dimension. Cell, 136, 719–730.

51. Moore,M.J. and Proudfoot,N.J. (2009) Pre-mRNA processing reaches back to transcription and ahead to translation. Cell, 136, 688–700.

52. Glisovic,T., Bachorik,J.L., Yong,J. and Dreyfuss,G. (2008) RNA-binding proteins and post-transcriptional gene regulation. FEBS Lett., 582, 1977–1986.

53. Bellucci,M., Agostini,F., Masin,M. and Tartaglia,G.G. (2011) Predicting protein associations with long noncoding RNAs. Nat. Methods, 8, 444–445.

54. Rinn,J.L. and Chang,H.Y. (2012) Genome regulation by long noncoding RNAs. Annu. Rev. Biochem., 81, 145–166.


results, such binding complementarity may exist predominantly at the level of longer polypeptide and mRNA stretches, rather than individual amino acids and codons. Stimulated by these findings, we hypothesized that PYR-rich regions in mRNAs and protein stretches encoded by them may bind each other in a complementary fashion, a feature encoded directly in the universal genetic code (15). Although strongly suggestive, these findings remained silent about the potentially equivalent complementarity on the side of purines (PUR), as well as any details concerning specific nitrogenous bases. In addition to confirming our previous results using a completely orthogonal approach, the present study provides strong novel evidence along both of these key lines.

MATERIALS AND METHODS

Analysis of contacts between amino-acid side chains and RNA nucleobases

All available structures of protein–RNA complexes (both X-ray and nuclear magnetic resonance structures) were downloaded from the Protein Data Bank (PDB) (22) in September 2012, using the 30% protein sequence identity and 3 Å resolution (for X-ray structures) cutoffs. The initial set was further manually filtered to exclude complexes containing double-stranded RNAs or mature transfer RNAs. The structures of the complete Saccharomyces cerevisiae (23), Escherichia coli (24) and Thermus thermophilus (25) ribosomes with the highest crystallographic resolution, as well as the 50S subunits of the Deinococcus radiodurans (26) and Haloarcula marismortui ribosomes, were also included in the set. This resulted in a total of 299 individual PDB structures (Supplementary Table S1). An amino-acid residue and an RNA base were considered to be neighbors and form a contact if their centers of geometry were separated by less than a given cutoff distance. All the results reported in the main manuscript are given for the cutoff of 8 Å, whereas for testing purposes this cutoff was also varied between 6 and 10 Å with a 0.25 Å step. We separately analyzed contact statistics for residues having at least one neighboring base (set ‘1+’, with a total of 25 820 unique contacts for 8 Å cutoff), at least two neighboring bases (set ‘2+’, with a total of 16 331 unique contacts for 8 Å cutoff), or including only the two closest neighboring bases (set ‘2’, with a total of 12 040 unique contacts for 8 Å cutoff).
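The contact definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names, the dictionary-based input (identifier mapped to a list of atom coordinates in Å) and the toy coordinates are all assumptions made for the example, and reading structures from PDB files is left out.

```python
import math

def center_of_geometry(atoms):
    """Unweighted mean position of a list of (x, y, z) atom coordinates."""
    n = len(atoms)
    return tuple(sum(a[k] for a in atoms) / n for k in range(3))

def neighbor_sets(side_chains, bases, cutoff=8.0):
    """Group side-chain/base contacts into the '1+', '2+' and '2' sets.

    side_chains, bases: dict mapping an identifier to its atom coordinates
    (hypothetical input format). Two groups are in contact if their centers
    of geometry are closer than `cutoff` (in the same units as the input).
    """
    base_centers = {b: center_of_geometry(a) for b, a in bases.items()}
    set_1plus, set_2plus, set_2 = {}, {}, {}
    for res, atoms in side_chains.items():
        rc = center_of_geometry(atoms)
        # Bases within the cutoff, sorted from closest to farthest.
        near = sorted(
            (d, b)
            for b, bc in base_centers.items()
            if (d := math.dist(rc, bc)) < cutoff
        )
        names = [b for _, b in near]
        if len(names) >= 1:
            set_1plus[res] = names          # at least one neighboring base
        if len(names) >= 2:
            set_2plus[res] = names          # at least two neighboring bases
            set_2[res] = names[:2]          # only the two closest bases
    return set_1plus, set_2plus, set_2
```

Scanning the cutoff from 6 to 10 Å in 0.25 Å steps, as described in the text, would then amount to calling `neighbor_sets` repeatedly with different `cutoff` values.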

Calculations of amino-acid interaction preferences

Amino acid/nucleobase preferences $e_{ij}$ (with $i = 1, \ldots, 20$ for amino acids and $j = 1, \ldots, 4$ for bases) were estimated using the following standard distance-independent contact potential formalism with the quasi-chemical definition of the reference state (27–31):

$$ e_{ij} = -\ln\frac{N_{ij}^{\mathrm{obs}}}{N_{ij}^{\mathrm{exp}}} = -\ln\frac{N_{ij}^{\mathrm{obs}}}{X_i X_j N_{\mathrm{TOT}}^{\mathrm{obs}}} \qquad (1) $$

where $N_{ij}^{\mathrm{obs}}$ is the number of observed contacts between amino acid side chain of type $i$ and nucleobase of type $j$ in experimental structures, and $N_{ij}^{\mathrm{exp}}$ is the expected number of such contacts. The latter is calculated as the product of molar fractions of amino acid $i$ and base $j$ among all observed contacts ($X_i$ and $X_j$, respectively) and the total number of all observed contacts, $N_{\mathrm{TOT}}^{\mathrm{obs}}$.

Interaction preference scales of amino acids were obtained separately for guanine (‘G-preference’), adenine (‘A-preference’), cytosine (‘C-preference’), uracil (‘U-preference’), PUR (both G and A, ‘PUR-preference’) and PYR (both C and U, ‘PYR-preference’).
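As a worked illustration of Equation (1), the sketch below converts a table of contact counts into preference values. It assumes that the molar fractions are the marginal fractions of each amino acid and each base among all counted contacts, and that the sign is such that over-represented pairs receive negative values (in line with the Figure 1 caption); the function name and the toy counts are hypothetical.

```python
import math

def interaction_preferences(n_obs):
    """Quasi-chemical contact preferences from observed contact counts.

    n_obs: dict mapping (amino_acid, base) to the number of observed contacts.
    Returns e[(amino_acid, base)] = -ln(N_obs / N_exp), where
    N_exp = X_aa * X_base * N_total. Pairs with zero counts are skipped,
    because the logarithm is undefined for them.
    """
    total = sum(n_obs.values())
    aa_total, base_total = {}, {}
    for (aa, base), n in n_obs.items():
        aa_total[aa] = aa_total.get(aa, 0) + n
        base_total[base] = base_total.get(base, 0) + n

    prefs = {}
    for (aa, base), n in n_obs.items():
        if n == 0:
            continue
        x_aa = aa_total[aa] / total        # fraction of contacts involving aa
        x_base = base_total[base] / total  # fraction of contacts involving base
        n_exp = x_aa * x_base * total
        prefs[(aa, base)] = -math.log(n / n_exp)
    return prefs

# Toy counts (made up for illustration only).
toy_counts = {("ARG", "G"): 30, ("ARG", "C"): 10, ("LEU", "G"): 10, ("LEU", "C"): 30}
toy_prefs = interaction_preferences(toy_counts)
```

In the toy table, ARG contacts G more often than expected (30 observed versus 20 expected), so its G-preference value is negative, whereas its C-preference value is positive.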

Proteome data

The sequences of the complete human proteome (17 083 proteins) and coding sequences of their corresponding mRNAs were extracted from the UniProtKB database (January 2013 release), with maximal-protein-evidence-level set at 4 (i.e. proteins annotated as ‘uncertain’ were excluded) and with only the reviewed Swiss-Prot (32) entries used for further analysis. The coding sequences of their corresponding mRNAs were extracted using the ‘Cross-references’ section of each UniProtKB entry, where out of several possible translated RNA sequences, the first one satisfying the length criterion (RNA length = 3 × protein length + 3) was selected and its sequence downloaded from the European Nucleotide Archive Database (http://www.ebi.ac.uk/ena). The protein as well as RNA sequences with only canonical amino acids or nucleotides were chosen for analysis. The complete set of mRNA/protein sequences used herein is included in the Supplementary Data. The average content of codons when it comes to individual nucleobases or PYRs or PURs for all 20 amino acids (‘codon content’ scales) was extracted from the thus-obtained cognate mRNA and protein sequences.
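The length criterion and the ‘codon content’ scales can be illustrated as follows. This is a sketch under stated assumptions: coding sequences are given as RNA strings that start at the start codon and end with a stop codon, each codon is paired with the residue at the same index, and the helper names are invented for the example. Downloading sequences from UniProtKB or the European Nucleotide Archive is not shown.

```python
def passes_length_criterion(cds, protein):
    """RNA length must equal 3 x protein length + 3 (the extra 3 is the stop codon)."""
    return len(cds) == 3 * len(protein) + 3

def codon_content_scale(pairs, bases="AG"):
    """Average fraction of the given bases in the codons used for each amino acid.

    pairs: iterable of (coding_sequence, protein_sequence) strings.
    bases: bases to count, e.g. "AG" for PUR, "CU" for PYR, "G" for guanine.
    Pairs failing the length criterion are skipped.
    """
    sums, counts = {}, {}
    for cds, protein in pairs:
        if not passes_length_criterion(cds, protein):
            continue
        for k, aa in enumerate(protein):
            codon = cds[3 * k:3 * k + 3]
            fraction = sum(codon.count(b) for b in bases) / 3
            sums[aa] = sums.get(aa, 0.0) + fraction
            counts[aa] = counts.get(aa, 0) + 1
    return {aa: sums[aa] / counts[aa] for aa in sums}

# Toy mRNA/protein pairs (made up); the third one fails the length criterion.
toy_pairs = [("AUGAAAUUUUAA", "MKF"), ("AUGAAGUAA", "MK"), ("AUGAAA", "MK")]
toy_pur_scale = codon_content_scale(toy_pairs, bases="AG")
```

Because the averages are taken over all codon occurrences in the sequence set, a scale built this way reflects codon usage as well as the genetic table itself.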

Correlation calculations

Pearson correlation coefficients (R) were calculated between nucleobase preferences and ‘codon content’ scales, and between sequence profiles of nucleobase content for mRNAs and of different amino-acid preference scales for proteins from the complete human proteome set. Before comparison, the profiles were smoothed using a sliding-window averaging procedure; the window size of 21 residues/codons was used for all calculations.
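A minimal sketch of the profile comparison is given below. The treatment of sequence ends is an assumption (only full windows are kept, so a smoothed profile is shorter than the raw one), as are the function names and the toy preference scale; the text above specifies only the window size and the use of Pearson R.

```python
import math

def smooth(values, window=21):
    """Sliding-window average over full windows only (odd window size)."""
    if window % 2 == 0 or window > len(values):
        raise ValueError("window must be odd and no longer than the profile")
    half = window // 2
    return [
        sum(values[i - half:i + half + 1]) / window
        for i in range(half, len(values) - half)
    ]

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def profile_correlation(cds, protein, scale, bases="AG", window=21):
    """R between the smoothed base-density profile of an mRNA coding sequence
    and the smoothed preference profile of its cognate protein."""
    density = [
        sum(cds[3 * k:3 * k + 3].count(b) for b in bases) / 3
        for k in range(len(protein))
    ]
    preference = [scale[aa] for aa in protein]
    return pearson(smooth(density, window), smooth(preference, window))

# Toy example with a made-up two-letter scale and a short window.
toy_scale = {"K": -1.0, "F": 1.0}
toy_r = profile_correlation("AAA" * 3 + "UUU" * 3 + "AAA" * 3 + "UAA",
                            "KKKFFFKKK", toy_scale, bases="AG", window=3)
```

In the toy case the residue with the more negative (stronger) preference value is encoded by purine-only codons, so the two smoothed profiles are perfectly anti-correlated; with the energy-like sign convention of the preference scales, a negative R of this kind corresponds to matching.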

Analysis of statistical significance

Statistical significance (P-values) of the observed correlations was estimated using a randomization procedure involving random shuffling of the interaction preference scales. Each scale was shuffled one million times, and Pearson correlation coefficients (R) against codon content scales, as well as for mRNA/protein profiles, were calculated for each shuffled scale. The reported P-values correspond to the fraction of shuffled scales which exhibit a higher absolute R than the original (|R| > |Roriginal|) in the case of codon content comparisons, or for which <R> is higher in absolute value than <Roriginal> in the case of sequence-profile comparisons. The typical randomized scales, whose distributions of correlation coefficients are depicted in the manuscript, were chosen to be those whose mean and standard deviation are the same as the average mean and the average standard deviation over all 10^6 randomized scales in each case.
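The randomization test for the codon-content comparison can be sketched as below. The function names, the seeded generator and the reduced default number of shuffles are choices made for this illustration (the text reports one million shuffles per scale); the proteome-wide variant based on <R> would follow the same pattern with a different statistic.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def shuffle_p_value(preference, codon_content, n_shuffles=10000, seed=0):
    """Fraction of shuffled preference scales with |R| above the original |R|.

    preference, codon_content: dicts keyed by amino acid (same keys).
    Shuffling reassigns the preference values randomly among amino acids.
    """
    rng = random.Random(seed)
    keys = sorted(preference)
    x = [preference[k] for k in keys]
    y = [codon_content[k] for k in keys]
    r_original = abs(pearson(x, y))
    shuffled = list(x)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if abs(pearson(shuffled, y)) > r_original:
            hits += 1
    return hits / n_shuffles
```

With a finite number of shuffles the smallest non-zero P-value that can be resolved is 1/n_shuffles, which is why a bound such as P < 10^-6 requires on the order of a million shuffles.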

Analysis of protein disorder and gene ontology (GO) classification

The average disorder for each protein sequence in the human proteome was predicted using the IUPred server (33). Fourteen subsets of proteins displaying best or worst matching between their interaction preference profiles and nucleobase density profiles of their cognate mRNAs in terms of Pearson R were extracted from the human proteome (top and bottom 10% cohorts) for the six cases of direct correspondence between nucleobase preferences and nucleobase composition profiles (e.g. protein G-preference versus G mRNA content, Gprotein-GmRNA, etc.) and for the G-preference versus PUR mRNA content one (Gprotein-PURmRNA). Each of these subsets contains 1707 proteins, for which average disorder values were assigned. Means and standard deviations of the 14 thus-obtained distributions of average predicted disorder were compared with those of the entire human proteome (background). The significance of the mean difference from the background was estimated for each of the analyzed subsets using the Wilcoxon signed-rank test. The gene ontology (GO) analysis was performed for the same seven top 10% best-matching protein subsets using the DAVID functional annotation server (34). The entire human proteome was used as background, and only the most significantly enriched functional terms, with a DAVID EASE score (P-value) meeting the 10^-10 cutoff, were considered.

Data visualization

The 3D structures of protein–RNA and amino acid/nucleobase complexes were visualized using PyMol (http://www.pymol.org) (35). The contact statistics heat-map was produced using MATLAB (R2009a). Pearson R distributions for mRNA/protein profiles were processed and visualized using Grace (http://plasma-gate.weizmann.ac.il/Grace).

RESULTS

Derivation of amino acid/nucleobase interaction preferences

How differentiated and context-dependent are the preferences of amino acids to interact with specific nitrogenous bases? To address this question, we analyze contact interfaces of 300 high-resolution structures of different protein–RNA complexes, including five ribosomal structures (Supplementary Table S1). We use distances between centers of geometry of amino-acid side chains and nucleotide nitrogenous bases, in combination with a fixed cutoff, to define contacting neighbors (Figure 1A). In this way, we isolate sequence-specific protein–RNA contacts (36–38), while ignoring non-specific interactions defined exclusively by protein or RNA backbones. We first present results for the distance cutoff of 8 Å, following Shakhnovich et al., who established cutoffs between 7 and 8 Å to be optimal for residue-based statistical potentials describing protein–DNA interactions, albeit with a slightly different definition of reference points (28). However, all of our principal findings hold qualitatively for cutoffs between 6 and 9 Å, as discussed later in the text. Finally, to differentiate cases in which an amino acid interacts with a single base only from denser, potentially more stereospecific contacts with more than one neighboring base within the cutoff, we separately merge contact statistics over the whole set of studied structures for amino acids having at least one neighboring base (set ‘1+’) or at least two neighboring bases (set ‘2+’, Figure 1A) within the cutoff.

Using standard distance-independent contact potential formalism (27–31), we subsequently derive scales of amino acid/nucleobase interaction preferences (Figure 1B and Supplementary Table S2) and use them to address the following questions: (i) how does the average composition of mRNA codons coding for a given amino acid relate to the preferences of this amino acid to interact with different nucleobases at protein/RNA interfaces, and (ii) how does sequence density of different bases in mRNA-coding sequences relate to sequence profiles of amino-acid interaction preferences for these and other bases in cognate protein sequences?

Figure 1. Derivation of amino acid/nucleobase interaction preference scales from known structures of RNA/protein complexes. (A) We define amino-acid side chains and RNA bases in a given complex to be contacting neighbors if their centers of geometry are less than a given cutoff radius R apart (left and middle), and merge contact statistics over the entire set of studied structures (right, ‘2+’ set with applied 8 Å cutoff). (B) Interaction preference scales of amino acids (in arbitrary units) for binding to guanines (G), PYR and PUR, obtained from set ‘2+’ statistics using 8 Å cutoff (panel A, right). The scales are statistical analogs of relative free energy of binding (see ‘Materials and Methods’ section), with the prominently negative values corresponding to amino acid side chains having the highest affinities for bases of a given type, and vice versa.

Amino acid interaction preferences and their codon content

We first focus on contact statistics from set ‘2+’. Dinucleotides were found previously to exhibit potential for specific recognition of amino acids at protein–RNA interfaces (39) and have also been suggested as potential catalysts for amino acid synthesis in pre-biotic environments (40). Moreover, set ‘2+’ by definition also includes all instances where triplets of bases directly contact a given amino acid, which may be relevant in the context of the genetic code. Using set ‘2+’ statistics, we observe a remarkably strong correlation between preferences of amino acids to interact with guanine (G-preference, Figure 1B) and the average PUR content of their respective codons as derived from the complete human proteome, with Pearson correlation coefficient R of −0.84 (Figure 2A). Negative Pearson correlation coefficients indicate matching between amino acid preferences and codon content, owing to the way preference is defined (see ‘Materials and Methods’ section). Put differently, amino acids which are predominantly encoded by PURs display a strong tendency to co-localize with G at protein–RNA interfaces. This is also true, albeit at a somewhat weaker level of correlation, for matching between PUR composition of individual codons from the standard genetic table and the respective G-preferences, if the statistics of codon usage in the human proteome is not included (R = −0.68, Supplementary Figure S1). The observed signal for G is statistically highly significant, as evidenced by randomization calculations (P-value < 10^-6, Figure 2B). Related to this, G-preference of amino acids inversely correlates with C and U content of their codons (Figure 2B). Somewhat less prominent, but still extremely significant, correlations are observed for G- and C-preference of amino acids and the average G- and C-content of their codons (R of −0.47 and −0.58, respectively). On the other hand, the interface statistics for adenine (A) and uracil (U) do not correlate with their average usage in codons. In particular, the A-preference of amino acids correlates inversely with the A-content (R = 0.59) or directly with the U-content of their codons (R = −0.51), whereas the U-preference exhibits relatively low correlations throughout (Figure 2B). Finally, both PYR and PUR binding preferences of amino acids (Figure 1B) display significant correlations with PYR and PUR fraction in their codons, with R of −0.54 and −0.53, respectively, and P-values < 10^-6 in both cases. In other words, amino acids coded for by PYR-rich codons prefer to co-localize with PYR, and those coded for by PUR-rich codons with PUR, at RNA–protein interfaces. Although similar in the present case, PYR- and PUR-preference scales need not necessarily be inverses of each other, owing to the way preferences are defined, and we therefore here report and discuss both.

Matching between sequence profiles of mRNAs and their cognate proteins

How do these observations translate if one compares complete mRNA-coding sequences with their cognate protein sequences? Owing to codon usage bias and non-uniform amino-acid composition of the human proteome, these results could in principle deviate significantly from the results obtained for individual codons and amino acids. To address this question, we calculate a Pearson R for every cognate mRNA/protein pair in the human proteome, capturing the correlation between each mRNA sequence composition profile and the base-binding preference profile of its cognate protein sequence. Remarkably, we observe an extremely high level of matching between PUR density profiles of mRNAs and G-preference profiles of cognate protein sequences, with a median Pearson R (Rmedian) over the entire human proteome of −0.80 and a low P-value (<10⁻⁶), as determined by randomization (Figure 2C). In particular, the distribution of Pearson R values for this scale over the human proteome is significantly left shifted and shows only marginal overlap with the one calculated for a typical randomized interaction preference scale (Figure 2C). For illustration, we present sequence profiles for proteins of most abundant length (300–400 amino acids, Supplementary Figure S2) displaying typical (i.e. exhibiting a Pearson R equal to the population median) or best levels of correlation (Figure 2D). As is evident, the PUR density of mRNAs is quantitatively extremely well predicted by the G-binding preference profiles of cognate proteins, even for typical human proteins (Rmedian = −0.80 and P < 10⁻⁶). We also observe significant matching between C-preference profiles for protein sequences and both C- and PYR-density profiles of their cognate mRNAs, with Rmedian of −0.55 and −0.47, respectively (Figure 2E). In contrast, the A-preferences display significant matching with PYR-density profiles on the side of mRNA (Figure 2E, see also Supplementary Table S2 for the full report of profile correlations), with Rmedian of −0.53. Finally, a strong and significant level of matching is observed for PYR-binding preferences of amino acids and PYR mRNA profiles, as well as PUR-binding preferences of amino acids and PUR profiles (Rmedian of −0.58 in both cases and P-values of 8.6 × 10⁻³ and 7.9 × 10⁻³, respectively, Figure 3A and C). From the exemplary typical and best profiles (Figure 3B and D), it is clear that the PYR- and PUR-rich regions in mRNA code for stretches of amino acids in cognate proteins which prefer to co-localize with PYR and PUR bases, respectively, at protein–RNA interfaces in the known 3D PDB structures. The typical level of similarity between sequence profiles is actually greater than what one might infer from Rmedian values, suggesting that the Pearson correlation coefficient might not even be the optimal measure of deviation in this case. Importantly, this direct physico-chemical complementarity between mRNA and cognate protein sequences may be indicative of pronounced potential for complex formation between them, especially under circumstances when ‘peak’ regions become available for such interactions. Given the fact that a significant matching of profiles is detected at the level of primary sequences, we propose that the presence of extended unstructured protein and mRNA segments may be required for such binding. This suggestion agrees well with recent knowledge-based studies, where RNA loops and bulges were found to be more likely to interact with amino-acid side chains in a specific manner (38,41).

Nucleic Acids Research, 2013, Vol. 41, No. 18 8437

Figure 2. Relationship between nucleobase-binding preferences of amino acids and mRNA content at multiple levels. (A) Correlation between G interaction preferences of amino acids (Figure 1B) and the average PUR content of their codons in mRNAs of the entire human proteome. (B) Pairwise Pearson correlation coefficients (R) between base-binding preference scales of amino acids (‘scl’) and average base content of their codons (‘cdn’). (C) Distributions of correlation coefficients (R) between window-averaged PUR-content profiles of individual mRNA coding sequences and window-averaged G-preference sequence profiles of the respective proteins for the entire human proteome (window size = 21). The dashed curve depicts the distribution of correlation coefficients calculated for a typical randomized G-preference scale. Inset: the distribution of the means of sequence-profile correlation coefficients for the human proteome (<R>) calculated for 10⁶ randomized G-preference scales. The R for the original G-preference scale is shown with an arrow. (D) Typical (R = Rmedian) and best pairs of mRNA PUR-content (black curves) and protein G-preference profiles (red dashed curves) for human proteins. (E) Median pairwise Pearson correlation coefficients for comparison between nucleobase content profiles of mRNAs (subscript ‘mRNA’, x-axis) and base-preference-weighted protein sequence profiles (subscript ‘protein’, y-axis) over the entire human proteome. All results are based on the analysis of set ‘2+’ statistics. All data reported for preference scales are obtained using an 8 Å cutoff.

How sensitive is the level of matching to the choice of cutoff distance used to define contacting amino acids and nucleobases in protein/RNA complexes? To address this question, we have repeated the aforementioned analysis for a range of different cutoff values, going from 6 to 10 Å in steps of 0.25 Å (Figure 4). Overall, for set ‘2+’, our findings are largely robust to the choice of the exact cutoff in this range, albeit with a somewhat lower level of significance for longer cutoffs. However, the majority of the signal is lost if one uses the ‘1+’ set, except for G-preference and PUR-content (Figure 5A) and A-preference and PYR-content (Supplementary Table S2). This observation strongly suggests that close, dense packing of nucleobases around amino acids may be required for specificity in cognate complex formation. Although interfaces may be dynamic and liquid-like, as we have suggested before, they may still need to be densely packed. Interestingly, if one reduces the ‘2+’ set by including only the two closest bases in contact with a given amino acid (set ‘2’), the signal for G-preference/PUR-content even further improves by several percentage points (Figure 5A), and the same holds for C-preference/C-content and A-preference/PYR-content (Supplementary Table S2).
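The sequence-profile comparison described in this section can be sketched in Python as follows. The sketch assumes that both profiles are averaged over a sliding window of 21 positions (the window size quoted in the caption of Figure 2), counted in codons for the mRNA and in residues for the protein, and that only full windows are kept. The function names, the edge handling and the `preference` dictionary are hypothetical and serve only to make the idea concrete.

```python
import math


def codon_purine_profile(cds):
    """Per-codon purine (A/G) fraction; a trailing partial codon is ignored."""
    usable = len(cds) - len(cds) % 3
    return [sum(b in "AG" for b in cds[i:i + 3]) / 3 for i in range(0, usable, 3)]


def window_average(values, window=21):
    """Sliding-window mean over full windows only (edge handling is an assumption)."""
    if len(values) < window:
        raise ValueError("sequence shorter than window")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def pearson_r(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("profiles must have equal length")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def profile_correlation(cds, protein, preference, window=21):
    """Pearson R between the mRNA purine-density profile and the protein
    preference profile; `preference` maps one-letter residues to scale values."""
    mrna_profile = window_average(codon_purine_profile(cds), window)
    protein_profile = window_average([preference[aa] for aa in protein], window)
    return pearson_r(mrna_profile, protein_profile)
```

Repeating such a calculation for every cognate pair and taking the median of the resulting R values would give a quantity analogous to Rmedian, under the assumptions stated above.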

To further study the role of protein structural disorder in matching, we have analyzed the levels of the predicted disorder of the top and the bottom 10% of proteins when it comes to the degree of mRNA/protein profile matching, as captured by the Pearson R coefficient (see ‘Materials and Methods’ section). We have done this for the six cases of direct comparison, whereby the same base type is used for both protein preference and mRNA profile density (Gprotein-GmRNA, Aprotein-AmRNA, Cprotein-CmRNA, Uprotein-UmRNA, PURprotein-PURmRNA and PYRprotein-PYRmRNA), and also for the case displaying the strongest signal in our analysis (Gprotein-PURmRNA). Importantly, in the case of Gprotein-GmRNA, Aprotein-AmRNA and Cprotein-CmRNA matching, we do observe a pronounced tendency for the top and the bottom 10% cohorts to be significantly enriched (top 10%) and depleted (bottom 10%) in disordered proteins (Supplementary Table S3), whereas in the case of Uprotein-UmRNA matching, the situation is reversed. Interestingly, for PURprotein-PURmRNA, PYRprotein-PYRmRNA and Gprotein-PURmRNA matching, one observes slight disorder enrichment in both top and bottom cohorts. The most prominent shift of the distribution of predicted average disorder toward higher disorder, as compared with background, is observed for the top 10% cohort of proteins displaying strong matching between C-preference profiles of their sequences and the C-content of their cognate mRNAs (Cprotein-CmRNA, Supplementary Table S3, Supplementary Figure S3). One might argue that this effect could just be related to compositional properties of such protein and mRNA pairs, whereby disordered proteins are simply encoded by C-rich sequences. However, the differences between nucleobase compositions of mRNAs from the Cprotein-CmRNA top 10% cohort and the complete proteome are minor, suggesting that the underlying explanation might be more complex (Supplementary Figure S3).

Figure 3. PYR/PUR mRNA sequence profiles strongly match PYR/PUR-preferences of cognate protein sequences. PYR (A and B) and PUR (C and D) amino-acid preference scales are given in Figure 1B. For details, please see the analogous captions to Figure 2C and D.

Which biological functions might be associated with a high level of complementarity between proteins and cognate mRNAs? To address this question, we have performed GO analysis for seven different top 10% subsets of proteins displaying strong matching with cognate mRNAs (see ‘Materials and Methods’ section for details). In Supplementary Table S4, we report the most significantly enriched biological functions (using a P-value cutoff of 10⁻¹⁰) shared by proteins from the analyzed cohorts. In a striking agreement with our hypothesis, in most cases we observe pronounced enrichment of terms related to nucleic-acid/protein interactions, including regulation of RNA metabolic processes, ribonucleoprotein complexes and transcription. The latter, in particular, allows one to speculate that protein tendencies to associate with cognate mRNA might be used by the cells to modulate gene expression pathways. What is more, PUR or PYR density profiles of mRNAs are identical to PUR or PYR density profiles of coding-strand DNA sequences (with Us being replaced by Ts). Although, based on our statistical potentials, we cannot say anything about T-binding preferences of amino acids, it is possible that our results may be generalizable even to DNA-protein interactions as well as other RNA-protein interactions. One should also mention that, depending on the particular type of matching, other biological functions also tend to be enriched. For instance, the Uprotein-UmRNA top 10% subset displays significant enrichment of membrane proteins, whereas the Gprotein-PURmRNA top cohort seems to be populated by extracellular proteins, particularly those involved in the functioning of the innate immune system. Altogether, our preliminary GO analysis illustrates significant functional differences between proteins that strongly complement their cognate mRNAs and the rest of the human proteome, and these findings will be further explored in another manuscript.

DISCUSSION

High levels of matching between base-binding-preference profiles of proteins and PYR- or PUR-density profiles of cognate mRNA-coding sequences, defined primarily by amino acid preferences to co-localize with G and C bases at RNA/protein interfaces, allow one to speculate that direct complementary binding interactions may be a key element underlying the whole mRNA/protein relationship when it comes to both its evolutionary development and present-day biology (Figure 5B). This agrees well with and significantly extends our previous findings, where we have shown that protein sequence profiles of amino acid affinity for PYR analogs (42–44) mirror PYR density profiles of cognate mRNA sequences (15). It should be emphasized, however, that our present results are based exclusively on the statistics of direct amino acid/nucleobase contacts at RNA/protein interfaces. It is therefore still possible that the driving force for interactions between mRNAs and cognate proteins is non-specific (e.g. binding of positively charged amino acid side chains to RNA phosphate groups), whereas complementary interactions actually confer specificity to binding.

Figure 4. Effect of cutoff radius used to define protein–RNA contacts on observed correlations. (A) Dependence of Pearson correlation coefficients (R) between amino acid preference scales and average codon content on the cutoff radius for the two sets of statistics studied (‘1+’, ‘2+’). The total number of unique contacts in ‘1+’ and ‘2+’ (given in parentheses) sets obtained for each of the used cutoff radii is indicated at the top of the panel. (B) Cutoff radius dependence of median pairwise Pearson correlation coefficients (Rmedian) for comparison between nucleobase content profiles of mRNAs and base-preference-weighted protein sequence profiles over the entire human proteome (color code the same as in panel A).

Figure 5. Physico-chemical origins of the mRNA/protein relationship. (A) Correlation coefficients (R and <R>, with standard deviations) between PYR or PUR average codon content (‘Codon content’) and respective mRNA profiles (‘Profiles’) calculated for G- (blue), PUR- (red) and PYR- (green) binding preferences of amino acids, which were obtained using different amino acid neighbor statistics (1+, 2+ or 2). (B) A model of physico-chemical complementarity between proteins and cognate mRNAs. Preferential interactions of amino acids with PYR or PUR define their codon content in the genetic table and facilitate complementary interactions between PYR/PUR-rich mRNA regions and PYR/PUR-preferring regions in proteins. The opposite behavior of adenines and guanines adds an additional layer of complexity in the case of PURs, as signified by dashed arrows in the model. Note: polymer sizes not drawn to scale.

Moreover, our results provide a clear evolutionary perspective concerning the physico-chemical origins of translation, in line with the stereo-chemical hypothesis of the origin of the genetic code (16–21). In particular, our results give strong support to the possibility of direct templating of proteins from mRNAs in the era before the development of ribosomal decoding and code’s fixation in that era (17,45). In this framework, ancient amino acids associated with mRNA directly, following their intrinsic physico-chemical preferences as outlined here. However, the fact that an analogous effect is not seen for all bases, especially adenine and uracil, supports the possibility that, in addition to physico-chemical rationales in the context of direct binding, other evolutionary forces were also responsible for shaping the genetic code, as suggested before (19). Our results are most consistent with the possibility that the early stereo-chemical phase in the code’s development was dominated by G- and C-rich codons, as the strongest correlations are seen for precisely these bases. If the basic structure of the early genetic code was defined by such codons, but was later modulated by the inclusion of A and U bases, this might explain why G-affinity of amino acids in present-day protein sequences closely follows PUR density profiles in cognate mRNAs. Interestingly, Trifonov and coworkers have suggested that the first codons were G- and C-rich on the basis of a consensus analysis of 40 different criteria (46).

Importantly, it should be emphasized that the stereo-chemical hypothesis of the code’s origin may differ from the cognate mRNA/protein complementary interaction hypothesis in terms of its evolutionary underpinnings. Direct templating of proteins from mRNAs in ancient systems (the coding aspect of the stereo-chemical hypothesis) does not necessarily imply that modern proteins directly interact with their own mRNA (complementary interaction hypothesis). However, our findings support the possibility that the origin of the genetic code and potential complementarity between proteins and cognate mRNAs might have the same physico-chemical background. It is well possible that other, independent influences have shaped both effects, and the two hypotheses leave ample room for such refinements. However, we would like to stress that, in our view, the two hypotheses are interlinked: cognate binding is, on the one hand, a reasonable consequence of the stereochemical hypothesis, but, on the other hand, it also gives a potential biological rationale for the early development of the code to begin with, such as stabilization of RNA structures by bound polypeptides, as has been suggested before (45).

There are a number of open challenges concerning the aforementioned proposal. First and foremost, the structural features of mRNAs and cognate proteins impose severe constraints on any putative complementarity between the two. Namely, with the contour length of the mRNA coding part being about 4.5 times longer than that of a cognate protein, it is not clear what structural arrangements may be consistent with any complementary interactions. We would like to suggest that structures of such complexes may be dynamic and liquid-like, with mRNA stretches enveloping and solubilizing cognate protein stretches (15). Second, with many mRNAs and proteins being well-folded and compact for most of the time, it remains to be studied when and how opportunities could arise for the complementarity between their primary sequences to be of relevance. It is possible that, if at all realistic, such complementary binding might be functionally important precisely in those situations where both polymers are unstructured, such as during translation, export and degradation, as a consequence of thermal stress, or in the case of intrinsically unstructured proteins. However, we do not exclude the possibility of complementary interactions even in the folded state. Finally, concerning the origin of the genetic code, it is not clear how the final, well-defined structure of the code could have arisen based on still partially non-specific, large-scale binding interactions between mRNAs and cognate proteins. As suggested before, it is possible that the answer lies in a combination of different influences (19). Future research should shed light on these and related questions.

These challenges notwithstanding, our findings provide strong evidence that the ability to interact with mRNA might be a widespread phenomenon in the cell, involving not only cognate proteins, but also other proteins, based on similar principles. The significance of such physico-chemical complementarity between mRNAs and proteins potentially extends to all facets of nucleic acid and protein biology in the modern cell, including transcription/translation regulation (9,10,47,48), mRNA transport and localization (49,50), processing and decay (51), structure of ribonucleoproteins (52) and others (2–5,53,54). Our preliminary GO analysis has demonstrated a significant enrichment of functions related to association with nucleic acids for the subsets of proteins that complement their cognate mRNAs strongly, and these findings will be explored in more detail in future work.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

ACKNOWLEDGEMENTS

The authors thank S. Dorner, R. Schroeder, A. Vaziri, G. Warren and members of the Laboratory of Computational Biophysics at MFPL for useful advice and critical reading of the manuscript.

FUNDING

This work was supported in part by the Austrian Science Fund FWF [START grant Y514-B11 to B.Z.]; European Research Council [ERC Starting Independent grant 279408 to B.Z.]. Funding for open access charge: Austrian Science Fund FWF.

Conflict of interest statement. None declared.

REFERENCES

1. Brenner, S., Jacob, F. and Meselson, M. (1961) An unstable intermediate carrying information from genes to ribosomes for protein synthesis. Nature, 190, 576–581.

2. Anders, G., Mackowiak, S.D., Jens, M., Maaskola, J., Kuntzagk, A., Rajewsky, N., Landthaler, M. and Dieterich, C. (2012) doRiNA: a database of RNA interactions in post-transcriptional regulation. Nucleic Acids Res., 40, D180–D186.

3. Baltz, A.G., Munschauer, M., Schwanhausser, B., Vasile, A., Murakawa, Y., Schueler, M., Youngs, N., Penfold-Brown, D., Drew, K., Milek, M. et al. (2012) The mRNA-bound proteome and its global occupancy profile on protein-coding transcripts. Mol. Cell, 46, 674–690.

4. Castello, A., Fischer, B., Eichelbaum, K., Horos, R., Beckmann, B.M., Strein, C., Davey, N.E., Humphreys, D.T., Preiss, T., Steinmetz, L.M. et al. (2012) Insights into RNA biology from an atlas of mammalian mRNA-binding proteins. Cell, 149, 1393–1406.

5. Konig, J., Zarnack, K., Luscombe, N.M. and Ule, J. (2012) Protein-RNA interactions: new genomic technologies and perspectives. Nat. Rev. Genet., 13, 221–221.

6. Mitchell, S.F., Jain, S., She, M. and Parker, R. (2013) Global analysis of yeast mRNPs. Nat. Struct. Mol. Biol., 20, 127–133.

7. Weber, S.C. and Brangwynne, C.P. (2012) Getting RNA and protein in phase. Cell, 149, 1188–1191.

8. Han, T.N.W., Kato, M., Xie, S.H., Wu, L.C., Mirzaei, H., Pei, J.M., Chen, M., Xie, Y., Allen, J., Xiao, G.H. et al. (2012) Cell-free formation of RNA granules: bound RNAs identify features and components of cellular assemblies. Cell, 149, 768–779.

9. Kyrpides, N.C. and Ouzounis, C.A. (1993) Mechanisms of specificity in messenger-RNA degradation - autoregulation and cognate interactions. J. Theor. Biol., 163, 373–392.

10. Ouzounis, C.A. and Kyrpides, N.C. (1994) Reverse interpretation - a hypothetical selection mechanism for adaptive mutagenesis based on autoregulated messenger-RNA stability. J. Theor. Biol., 167, 373–379.

11. Chu, E., Copur, S.M., Ju, J., Chen, T.M., Khleif, S., Voeller, D.M., Mizunuma, N., Patel, M., Maley, G.F., Maley, F. et al. (1999) Thymidylate synthase protein and p53 mRNA form an in vivo ribonucleoprotein complex. Mol. Cell. Biol., 19, 1582–1594.

12. Tai, N., Schmitz, J.C., Liu, J., Lin, X., Bailly, M., Chen, T.M. and Chu, E. (2004) Translational autoregulation of thymidylate synthase and dihydrofolate reductase. Front. Biosci., 9, 2521–2526.

13. Schuttpelz, M., Schoning, J.C., Doose, S., Neuweiler, H., Peters, E., Staiger, D. and Sauer, M. (2008) Changes in conformational dynamics of mRNA upon AtGRP7 binding studied by fluorescence correlation spectroscopy. J. Am. Chem. Soc., 130, 9507–9513.

14. Zhao, X., Liu, M., Wu, N., Ding, L., Liu, H. and Lin, X. (2010) Recovery of recombinant zebrafish p53 protein from inclusion bodies and its binding activity to p53 mRNA in vitro. Protein Expr. Purif., 72, 262–266.

15. Hlevnjak, M., Polyansky, A.A. and Zagrovic, B. (2012) Sequence signatures of direct complementarity between mRNAs and cognate proteins on multiple levels. Nucleic Acids Res., 40, 8874–8882.

16. Woese, C.R. (1965) Order in genetic code. Proc. Natl Acad. Sci. USA, 54, 71–75.

17. Woese, C.R. (1965) On the evolution of the genetic code. Proc. Natl Acad. Sci. USA, 54, 1546–1552.

18. Yarus, M. (1998) Amino acids as RNA ligands: a direct-RNA-template theory for the code’s origin. J. Mol. Evol., 47, 109–117.

19. Koonin, E.V. and Novozhilov, A.S. (2009) Origin and evolution of the genetic code: the universal enigma. IUBMB Life, 61, 99–111.

20. Yarus, M., Widmann, J.J. and Knight, R. (2009) RNA-amino acid binding: a stereochemical era for the genetic code. J. Mol. Evol., 69, 406–429.

21. Johnson, D.B. and Wang, L. (2010) Imprints of the genetic code in the ribosome. Proc. Natl Acad. Sci. USA, 107, 8298–8303.

22. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. and Bourne, P.E. (2000) The protein data bank. Nucleic Acids Res., 28, 235–242.

23. Ben-Shem, A., de Loubresse, N.G., Melnikov, S., Jenner, L., Yusupova, G. and Yusupov, M. (2011) The structure of the eukaryotic ribosome at 3.0 angstrom resolution. Science, 334, 1524–1529.

24. Dunkle, J.A., Wang, L.Y., Feldman, M.B., Pulk, A., Chen, V.B., Kapral, G.J., Noeske, J., Richardson, J.S., Blanchard, S.C. and Cate, J.H.D. (2011) Structures of the bacterial ribosome in classical and hybrid states of tRNA binding. Science, 332, 981–984.

25. Polikanov, Y.S., Blaha, G.M. and Steitz, T.A. (2012) How hibernation factors RMF, HPF and YfiA turn off protein synthesis. Science, 336, 915–918.

26. Harms, J.M., Wilson, D.N., Schluenzen, F., Connell, S.R., Stachelhaus, T., Zaborowska, Z., Spahn, C.M. and Fucini, P. (2008) Translational regulation via L11: molecular switches on the ribosome turned on and off by thiostrepton and micrococcin. Mol. Cell, 30, 26–38.

27. Miyazawa, S. and Jernigan, R.L. (1985) Estimation of effective interresidue contact energies from protein crystal-structures - quasi-chemical approximation. Macromolecules, 18, 534–552.

28. Donald, J.E., Chen, W.W. and Shakhnovich, E.I. (2007) Energetics of protein-DNA interactions. Nucleic Acids Res., 35, 1039–1047.

29. Jonikas, M.A., Radmer, R.J., Laederach, A., Das, R., Pearlman, S., Herschlag, D. and Altman, R.B. (2009) Coarse-grained modeling of large RNA molecules with knowledge-based potentials and structural filters. RNA, 15, 189–199.

30. Perez-Cano, L., Solernou, A., Pons, C. and Fernandez-Recio, J. (2010) Structural prediction of protein-RNA interaction by computational docking with propensity-based statistical potentials. Pac. Symp. Biocomput., 15, 269–280.

31. Tuszynska, I. and Bujnicki, J.M. (2011) DARS-RNP and QUASI-RNP: new statistical potentials for protein-RNA docking. BMC Bioinformatics, 12, 348.

32. Gasteiger, E., Gattiker, A., Hoogland, C., Ivanyi, I., Appel, R.D. and Bairoch, A. (2003) ExPASy: the proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res., 31, 3784–3788.

33. Dosztanyi, Z., Csizmok, V., Tompa, P. and Simon, I. (2005) IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics, 21, 3433–3434.

34. Huang da, W., Sherman, B.T. and Lempicki, R.A. (2009) Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc., 4, 44–57.

35. The PyMOL Molecular Graphics System, Version 1.3r1 (2010) Schrodinger, LLC. http://www.pymol.org/citing (4 July 2013, date last accessed).

36. Treger, M. and Westhof, E. (2001) Statistical analysis of atomic contacts at RNA-protein interfaces. J. Mol. Recognit., 14, 199–214.

37. Hoffman, M.M., Khrapov, M.A., Cox, J.C., Yao, J., Tong, L. and Ellington, A.D. (2004) AANT: the amino acid-nucleotide interaction database. Nucleic Acids Res., 32, D174–D181.

38. Gupta, A. and Gribskov, M. (2011) The role of RNA sequence and structure in RNA-protein interactions. J. Mol. Biol., 409, 574–587.

39. Fernandez, M., Kumagai, Y., Standley, D.M., Sarai, A., Mizuguchi, K. and Ahmad, S. (2011) Prediction of dinucleotide-specific RNA-binding sites in proteins. BMC Bioinformatics, 12(Suppl. 13), S5.

40. Copley, S.D., Smith, E. and Morowitz, H.J. (2005) A mechanism for the association of amino acids with their codons and the origin of the genetic code. Proc. Natl Acad. Sci. USA, 102, 4442–4447.

41. Iwakiri, J., Tateishi, H., Chakraborty, A., Patil, P. and Kenmochi, N. (2012) Dissecting the protein-RNA interface: the role of protein surface shapes and RNA secondary structures in protein-RNA recognition. Nucleic Acids Res., 40, 3299–3306.


42. Woese, C.R., Dugre, D.H., Saxinger, W.C. and Dugre, S.A. (1966) The molecular basis for the genetic code. Proc. Natl Acad. Sci. USA, 55, 966–974.

43. Woese, C.R. (1973) Evolution of the genetic code. Naturwissenschaften, 60, 447–459.

44. Mathew, D.C. and Luthey-Schulten, Z. (2008) On the physical basis of the amino acid polar requirement. J. Mol. Evol., 66, 519–528.

45. Noller, H.F. (2012) Evolution of protein synthesis from an RNA world. Cold Spring Harb. Perspect. Biol., 4, 1–U20.

46. Trifonov, E.N., Kirzhner, A., Kirzhner, V.M. and Berezovsky, I.N. (2001) Distinct stages of protein evolution as suggested by protein sequence analysis. J. Mol. Evol., 53, 394–401.

47. Vaquerizas, J.M., Kummerfeld, S.K., Teichmann, S.A. and Luscombe, N.M. (2009) A census of human transcription factors: function, expression and evolution. Nat. Rev. Genet., 10, 252–263.

48. Sonenberg, N. and Hinnebusch, A.G. (2009) Regulation of translation initiation in eukaryotes: mechanisms and biological targets. Cell, 136, 731–745.

49. Lecuyer, E., Yoshida, H., Parthasarathy, N., Alm, C., Babak, T., Cerovina, T., Hughes, T.R., Tomancak, P. and Krause, H.M. (2007) Global analysis of mRNA localization reveals a prominent role in organizing cellular architecture and function. Cell, 131, 174–187.

50. Martin, K.C. and Ephrussi, A. (2009) mRNA localization: gene expression in the spatial dimension. Cell, 136, 719–730.

51. Moore, M.J. and Proudfoot, N.J. (2009) Pre-mRNA processing reaches back to transcription and ahead to translation. Cell, 136, 688–700.

52. Glisovic, T., Bachorik, J.L., Yong, J. and Dreyfuss, G. (2008) RNA-binding proteins and post-transcriptional gene regulation. FEBS Lett., 582, 1977–1986.

53. Bellucci, M., Agostini, F., Masin, M. and Tartaglia, G.G. (2011) Predicting protein associations with long noncoding RNAs. Nat. Methods, 8, 444–445.

54. Rinn, J.L. and Chang, H.Y. (2012) Genome regulation by long noncoding RNAs. Annu. Rev. Biochem., 81, 145–166.


The typical randomized scales, whose distributions of correlation coefficients are depicted in the manuscript, were chosen to be those whose mean and standard deviation are the same as the average mean and the average standard deviation over all 10⁶ randomized scales in each case.
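As a hedged illustration of this selection step, the Python sketch below picks, from a collection of per-scale distributions of R values, the one whose mean and standard deviation lie closest to the averages over all scales. Using the closest match (rather than an exact one), the squared-distance criterion, the population standard deviation and the function name are all assumptions introduced here.

```python
import statistics


def pick_typical_scale(r_distributions):
    """Index of the randomized scale whose R distribution is closest, in mean
    and standard deviation, to the averages over all randomized scales."""
    means = [statistics.fmean(d) for d in r_distributions]
    sds = [statistics.pstdev(d) for d in r_distributions]
    target_mean = statistics.fmean(means)
    target_sd = statistics.fmean(sds)
    return min(
        range(len(r_distributions)),
        key=lambda i: (means[i] - target_mean) ** 2 + (sds[i] - target_sd) ** 2,
    )
```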

Analysis of protein disorder and gene ontology (GO) classification

The average disorder for each protein sequence in the human proteome was predicted using the IUPred server (33). Fourteen subsets of proteins displaying best or worst matching between their interaction preference profiles and nucleobase density profiles of their cognate mRNAs in terms of Pearson R were extracted from the human proteome (top and bottom 10% cohorts) for the six cases of direct correspondence between nucleobase preferences and nucleobase composition profiles (e.g. protein G-preference versus G mRNA content, Gprotein-GmRNA, etc.) and for the G-preference versus PUR mRNA content one (Gprotein-PURmRNA). Each of these subsets contains 1707 proteins, for which average disorder values were assigned. Means and standard deviations of the 14 thus-obtained distributions of average predicted disorder were compared with those of the entire human proteome (background). The significance of the mean difference from the background was estimated for each of the analyzed subsets using the Wilcoxon signed-rank test. The gene ontology (GO) analysis was performed for the same seven top 10% best-matching protein subsets using the DAVID functional annotation server (34). The entire human proteome was used as background, and only the most significantly enriched functional terms, with a DAVID EASE score (P-value) within a cutoff of 10⁻¹⁰, were considered.
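The cohort extraction can be sketched in Python as below. The sketch assumes that proteins are ranked by their profile Pearson R, that more negative R means better matching (consistent with the sign convention of the preference scales), and that each cohort holds 10% of the ranked list. The function names and the dictionaries keyed by protein identifier are hypothetical, and the Wilcoxon test itself is not reproduced here.

```python
def split_cohorts(r_by_protein, percent=10):
    """Return (best, worst) cohorts of protein IDs ranked by Pearson R.

    Assumption: more negative R means better matching, so the 'best'
    cohort is taken from the low end of the ranking.
    """
    ranked = sorted(r_by_protein, key=r_by_protein.get)
    k = len(ranked) * percent // 100
    return ranked[:k], ranked[len(ranked) - k:]


def disorder_shift(disorder_by_protein, cohort):
    """Mean predicted disorder of a cohort minus the mean over all proteins."""
    background = sum(disorder_by_protein.values()) / len(disorder_by_protein)
    cohort_mean = sum(disorder_by_protein[p] for p in cohort) / len(cohort)
    return cohort_mean - background
```

A positive shift for a cohort would correspond to enrichment in disordered proteins relative to the background, and a negative shift to depletion.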

Data visualization

The 3D structures of protein–RNA and amino acid/nucleobase complexes were visualized using PyMOL (http://www.pymol.org) (35). The contact statistics heat-map was produced using MATLAB (R2009a). Pearson R distributions for mRNA/protein profiles were processed and visualized using Grace (http://plasma-gate.weizmann.ac.il/Grace).

RESULTS

Derivation of amino acid/nucleobase interaction preferences

How differentiated and context-dependent are the preferences of amino acids to interact with specific nitrogenous bases? To address this question, we analyze contact interfaces of 300 high-resolution structures of different protein–RNA complexes, including five ribosomal structures (Supplementary Table S1). We use distances between centers of geometry of amino-acid side chains and nucleotide nitrogenous bases, in combination with a fixed cutoff, to define contacting neighbors (Figure 1A). In this way, we isolate sequence-specific protein–RNA contacts (36–38), while ignoring non-specific interactions defined exclusively by protein or RNA backbones. We first present results for the distance cutoff of 8 Å, following Shakhnovich et al., who established cutoffs between 7 and 8 Å to be optimal for residue-based statistical potentials describing protein–DNA interactions, albeit with a slightly different definition of reference points (28). However, all of our principal findings hold qualitatively for cutoffs between 6 and 9 Å, as discussed later in the text. Finally, to differentiate cases in which an amino acid interacts with a single base only from denser, potentially more stereospecific contacts with more than one neighboring base within the cutoff, we separately merge contact statistics over the whole set of studied structures for amino acids having at least one neighboring base (set ‘1+’) or at least two neighboring bases (set ‘2+’, Figure 1A) within the cutoff.
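The contact definition and the two sets of statistics can be sketched in Python as follows. The sketch assumes that side chains and bases are given as plain lists of atomic coordinates in Å and that a contact requires the two centers of geometry to be less than the cutoff apart; the function names and the input layout are hypothetical, and parsing of PDB files is left out.

```python
import math
from collections import Counter


def center_of_geometry(atoms):
    """Unweighted mean position of a list of (x, y, z) coordinates."""
    n = len(atoms)
    return tuple(sum(a[i] for a in atoms) / n for i in range(3))


def contacting_bases(side_chain_atoms, bases, cutoff=8.0):
    """Base types whose center of geometry lies within `cutoff` of the side chain.

    `bases` is a list of (base_type, atom_coordinates) pairs.
    """
    sc = center_of_geometry(side_chain_atoms)
    return [btype for btype, atoms in bases
            if math.dist(sc, center_of_geometry(atoms)) < cutoff]


def merge_contacts(residues, bases, cutoff=8.0):
    """Count (amino acid, base) contacts for sets '1+' and '2+'.

    `residues` is a list of (amino_acid, side_chain_atom_coordinates) pairs.
    """
    set_1plus, set_2plus = Counter(), Counter()
    for aa, atoms in residues:
        hits = contacting_bases(atoms, bases, cutoff)
        if len(hits) >= 1:
            set_1plus.update((aa, b) for b in hits)
        if len(hits) >= 2:
            set_2plus.update((aa, b) for b in hits)
    return set_1plus, set_2plus
```

Calling `merge_contacts` with different `cutoff` values would mimic, in a simplified way, the cutoff scan discussed later in the text.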

Using standard distance-independent contact potential formalism (27–31), we subsequently derive scales of amino acid/nucleobase interaction preferences (Figure 1B and Supplementary Table S2) and use them to address the following questions: (i) how does the average composition of mRNA codons coding for a given amino acid relate to the preferences of this amino acid to interact with different nucleobases at protein/RNA interfaces and (ii) how does sequence density of different bases in mRNA-coding sequences relate to sequence profiles of amino-acid interaction preferences for these and other bases in cognate protein sequences?

Figure 1. Derivation of amino acid/nucleobase interaction preference scales from known structures of RNA/protein complexes. (A) We define amino-acid side chains and RNA bases in a given complex to be contacting neighbors if their centers of geometry are less than a given cutoff radius R apart (left and middle) and merge contact statistics over the entire set of studied structures (right, ‘2+’ set with applied 8 Å cutoff). (B) Interaction preference scales of amino acids (in arbitrary units) for binding to guanines (G), PYR and PUR obtained from set ‘2+’ statistics using 8 Å cutoff (panel A, right). The scales are statistical analogs of relative free energy of binding (see ‘Materials and Methods’ section), with the prominently negative values corresponding to amino acid side chains having the highest affinities for bases of a given type and vice versa.

Amino acid interaction preferences and their codon content

We first focus on contact statistics from set ‘2+’. Dinucleotides were found previously to exhibit potential for specific recognition of amino acids at protein–RNA interfaces (39) and have also been suggested as potential catalysts for amino acid synthesis in pre-biotic environments (40). Moreover, set ‘2+’ by definition also includes all instances where triplets of bases directly contact a given amino acid, which may be relevant in the context of the genetic code. Using set ‘2+’ statistics, we observe a remarkably strong correlation between preferences of amino acids to interact with guanine (G-preference, Figure 1B) and the average PUR content of their respective codons as derived from the complete human proteome, with Pearson correlation coefficient R of -0.84 (Figure 2A). Negative Pearson correlation coefficients indicate matching between amino acid preferences and codon content, owing to the way preference is defined (see ‘Materials and Methods’ section). Put differently, amino acids that are predominantly encoded by PURs display a strong tendency to co-localize with G at protein–RNA interfaces. This is also true, albeit at a somewhat weaker level of correlation, for matching between PUR composition of individual codons from the standard genetic table and the respective G-preferences, if the statistics of codon usage in the human proteome is not included (R = -0.68, Supplementary Figure S1). The observed signal for G is statistically highly significant, as evidenced by randomization calculations (P-value < 10^-6, Figure 2B). Related to this, G-preference of amino acids inversely correlates with C and U content of their codons (Figure 2B). Somewhat less prominent, but still highly significant, correlations are observed for G- and C-preference of amino acids and the average G- and C-content of their codons (R of -0.47 and -0.58, respectively). On the other hand, the interface statistics for adenine (A) and uracil (U) do not correlate with their average usage in codons. In particular, the A-preference of amino acids correlates inversely with the A-content (R = 0.59) or directly with the U-content of their codons (R = -0.51), whereas the U-preference exhibits relatively low correlations throughout (Figure 2B). Finally, both PYR and PUR binding preferences of amino acids (Figure 1B) display significant correlations with PYR and PUR fraction in their codons, with R of -0.54 and -0.53, respectively, and P-values < 10^-6 in both cases. In other words, amino acids coded for by PYR-rich codons prefer to co-localize with PYR, and those coded for by PUR-rich codons with PUR, at RNA–protein interfaces. Although similar in the present case, PYR- and PUR-preference scales need not necessarily be inverses of each other, owing to the way preferences are defined, and we therefore here report and discuss both.
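The codon-level comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the unweighted standard genetic table (the variant without human codon-usage statistics), a made-up placeholder scale in place of the real G-preference values from Supplementary Table S2, and a simple one-sided shuffling test in which scale values are permuted among amino acids. All function and variable names are hypothetical.

```python
import random

BASES = "UCAG"
# Standard genetic code in the conventional UCAG ordering ('*' marks stop codons).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))

def mean_codon_content(bases_of_interest="AG"):
    """Average fraction of the given bases (default: purines) over the codons of each amino acid."""
    per_aa = {}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue
        frac = sum(b in bases_of_interest for b in codon) / 3
        per_aa.setdefault(aa, []).append(frac)
    return {aa: sum(v) / len(v) for aa, v in per_aa.items()}

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def randomization_p(scale, content, n_shuffles=1000, seed=0):
    """Observed R and the fraction of shuffled scales giving an equal or more negative R."""
    aas = sorted(scale)
    x = [scale[a] for a in aas]
    y = [content[a] for a in aas]
    observed = pearson(x, y)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_shuffles):
        shuffled = x[:]
        rng.shuffle(shuffled)
        if pearson(shuffled, y) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_shuffles + 1)

pur_content = mean_codon_content("AG")
# Placeholder scale (NOT real data): negative values stand for high affinity.
toy_scale = {aa: -c + 0.05 * (i % 3)
             for i, (aa, c) in enumerate(sorted(pur_content.items()))}
observed_r, p_value = randomization_p(toy_scale, pur_content)
```

The test is one-sided because, under the sign convention used in the text, matching corresponds to negative R. Weighting each codon by its usage in the human proteome would require codon counts that are not reproduced here.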

Matching between sequence profiles of mRNAs and their cognate proteins

How do these observations translate if one compares complete mRNA-coding sequences with their cognate protein sequences? Owing to codon usage bias and non-uniform amino-acid composition of the human proteome, these results could in principle deviate significantly from the results obtained for individual codons and amino acids. To address this question, we calculate a Pearson R for every cognate mRNA/protein pair in the human proteome, capturing the correlation between each mRNA sequence composition profile and the base-binding preference profile of its cognate protein sequence. Remarkably, we observe an extremely high level of matching between PUR density profiles of mRNAs and G-preference profiles of cognate protein sequences, with a median Pearson R (Rmedian) over the entire human proteome of -0.80 and a low P-value (< 10^-6), as determined by randomization (Figure 2C). In particular, the distribution of Pearson R values for this scale over the human proteome is significantly left shifted and shows only marginal overlap with the one calculated for a typical randomized interaction preference scale (Figure 2C). For illustration, we present sequence profiles for proteins of most abundant length (300–400 amino acids, Supplementary Figure S2), displaying typical (i.e. exhibiting a Pearson R equal to the population median) or best levels of correlation (Figure 2D). As is evident, the PUR density of mRNAs is quantitatively extremely well predicted by the G-binding preference profiles of cognate proteins, even for typical human proteins (Rmedian = -0.80 and P < 10^-6). We also observe significant matching between C-preference profiles for protein sequences and both C- and PYR-density profiles of their cognate mRNAs, with Rmedian of -0.55 and -0.47, respectively (Figure 2E). In contrast, the A-preferences display significant matching with PYR-density profiles on the side of mRNA (Figure 2E; see also Supplementary Table S2 for the full report of profile correlations), with Rmedian of -0.53. Finally, a strong and significant level of matching is observed for PYR-binding preferences of amino acids and PYR mRNA profiles, as well as PUR-binding preferences of amino acids and PUR profiles (Rmedian of -0.58 in both cases and P-values of 8.6 × 10^-3 and 7.9 × 10^-3, respectively; Figure 3A and C). From the exemplary typical and best profiles (Figure 3B and D), it is clear that the PYR- and PUR-rich regions in mRNA code for stretches of amino acids in cognate proteins that prefer to co-localize with PYR and PUR bases, respectively, at protein–RNA interfaces in the known 3D PDB structures. The typical level of similarity between sequence profiles is actually greater than what one might infer from Rmedian values, suggesting that Pearson correlation coefficient might not even be the optimal measure of deviation in this case. Importantly, this direct physico-chemical complementarity between mRNA and cognate protein sequences may be indicative of pronounced potential for complex formation between them, especially under circumstances when ‘peak’ regions become available for such interactions. Given the fact that a significant matching of profiles is detected at the level of primary sequences, we propose that the presence of extended unstructured protein and mRNA segments may be required for such binding. This suggestion agrees well with recent knowledge-based studies, where RNA loops and bulges were found to be more likely to interact with amino-acid side chains in a specific manner (38,41).

Figure 2. Relationship between nucleobase-binding preferences of amino acids and mRNA content at multiple levels. (A) Correlation between G interaction preferences of amino acids (Figure 1B) and the average PUR content of their codons in mRNAs of the entire human proteome. (B) Pairwise Pearson correlation coefficients (R) between base-binding preference scales of amino acids (‘scl’) and average base content of their codons (‘cdn’). (C) Distributions of correlation coefficients (R) between window-averaged PUR-content profiles of individual mRNA coding sequences and window-averaged G-preference sequence profiles of the respective proteins for the entire human proteome (window size = 21). The dashed curve depicts the distribution of correlation coefficients calculated for a typical randomized G-preference scale. Inset: the distribution of the means of sequence-profile correlation coefficients for the human proteome (<R>) calculated for 10^6 randomized G-preference scales. The R for the original G-preference scale is shown with an arrow. (D) Typical (R = Rmedian) and best pairs of mRNA PUR-content (black curves) and protein G-preference profiles (red dashed curves) for human proteins. (E) Median pairwise Pearson correlation coefficients for comparison between nucleobase content profiles of mRNAs (subscript ‘mRNA’, x-axis) and base-preference-weighted protein sequence profiles (subscript ‘protein’, y-axis) over the entire human proteome. All results are based on the analysis of set ‘2+’ statistics. All data reported for preference scales are obtained using an 8 Å cutoff.

How sensitive is the level of matching to the choice of cutoff distance used to define contacting amino acids and nucleobases in protein/RNA complexes? To address this question, we have repeated the aforementioned analysis for a range of different cutoff values, going from 6 to 10 Å in steps of 0.25 Å (Figure 4). Overall, for set ‘2+’, our findings are largely robust to the choice of the exact cutoff in this range, albeit with a somewhat lower level of significance for longer cutoffs. However, the majority of the signal is lost if one uses the ‘1+’ set, except for G-preference and PUR-content (Figure 5A) and A-preference and PYR-content (Supplementary Table S2). This observation strongly suggests that close, dense packing of nucleobases around amino acids may be required for specificity in cognate complex formation. Although interfaces may be dynamic and liquid-like, as we have suggested before, they may still need to be densely packed. Interestingly, if one reduces the ‘2+’ set by including only the two closest bases in contact with a given amino acid (set ‘2’), the signal for G-preference/PUR-content improves even further, by several percentage points (Figure 5A), and the same holds for C-preference/C-content and A-preference/PYR-content (Supplementary Table S2).
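The per-pair profile comparison used throughout this section (window-averaged profiles with a window size of 21, followed by a Pearson R for each mRNA/protein pair) can be sketched as below. This is an assumption-laden illustration: the mRNA profile is taken here as the PUR fraction of each codon so that both profiles have one value per residue, only full windows are averaged, and the scale values and sequences are invented placeholders. The exact profile construction is given in the ‘Materials and Methods’ section and may differ in detail.

```python
def window_average(values, window=21):
    """Sliding mean over full windows only (no padding at the sequence ends)."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def purine_profile(cds):
    """Fraction of purines (A or G) in each successive codon of a coding sequence."""
    return [sum(b in "AG" for b in cds[i:i + 3]) / 3
            for i in range(0, len(cds), 3)]

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def profile_correlation(cds, protein, scale, window=21):
    """Pearson R between window-averaged mRNA PUR content and protein preference profiles."""
    if len(cds) != 3 * len(protein):
        raise ValueError("coding sequence must contain exactly one codon per residue")
    mrna = window_average(purine_profile(cds), window)
    prot = window_average([scale[aa] for aa in protein], window)
    return pearson(mrna, prot)

# Placeholder inputs (NOT real data): a three-letter toy alphabet with made-up scale values.
toy_scale = {"K": -1.0, "F": 1.0, "G": -0.5}
codon_for = {"K": "AAA", "F": "UUU", "G": "GGC"}
protein = ("K" * 15 + "F" * 15 + "G" * 15) * 2
cds = "".join(codon_for[aa] for aa in protein)
r = profile_correlation(cds, protein, toy_scale)
```

Repeating such a calculation for every mRNA/protein pair and taking the median of the resulting R values would give a quantity analogous to Rmedian. The cutoff scan discussed above would then amount to recomputing the scale for each cutoff between 6 and 10 Å before rerunning the comparison.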

To further study the role of protein structural disorder in matching, we have analyzed the levels of the predicted disorder of the top and the bottom 10% of proteins when it comes to the degree of mRNA/protein profile matching, as captured by Pearson R coefficient (see ‘Materials and Methods’ section). We have done this for the six cases of direct comparison, whereby the same base type is used for both protein preference and mRNA profile density (Gprotein-GmRNA, Aprotein-AmRNA, Cprotein-CmRNA, Uprotein-UmRNA, PURprotein-PURmRNA and PYRprotein-PYRmRNA), and also for the case displaying the strongest signal in our analysis (Gprotein-PURmRNA). Importantly, in the case of Gprotein-GmRNA, Aprotein-AmRNA and Cprotein-CmRNA matching, we do observe a pronounced tendency for the top and the bottom 10% cohorts to be significantly enriched (top 10%) and depleted (bottom 10%) in disordered proteins (Supplementary Table S3), whereas in the case of Uprotein-UmRNA matching, the situation is reversed. Interestingly, for PURprotein-PURmRNA, PYRprotein-PYRmRNA and Gprotein-PURmRNA matching, one observes slight disorder enrichment in both top and bottom cohorts. The most prominent shift of the distribution of predicted average disorder toward higher disorder, as compared with background, is observed for the top 10% cohort of proteins displaying strong matching between C-preference profiles of their sequences and the C-content of their cognate mRNAs (Cprotein-CmRNA; Supplementary Table S3, Supplementary Figure S3). One might argue that this effect could just be related to compositional properties of such protein and mRNA pairs, whereby disordered proteins are simply encoded by C-rich sequences. However, the differences between nucleobase compositions of mRNAs from the Cprotein-CmRNA top 10% cohort and the complete proteome are minor, suggesting that the underlying explanation might be more complex (Supplementary Figure S3).

Figure 3. PYR/PUR mRNA sequence profiles strongly match PYR/PUR-preferences of cognate protein sequences. PYR (A and B) and PUR (C and D) amino-acid preference scales are given in Figure 1B. For details, please see the analogous captions to Figure 2C and D.

Which biological functions might be associated with a high level of complementarity between proteins and cognate mRNAs? To address this question, we have performed GO analysis for seven different top 10% subsets of proteins displaying strong matching with cognate mRNAs (see ‘Materials and Methods’ section for details). In Supplementary Table S4, we report the most significantly enriched biological functions (using a P-value cutoff of 10^-10) shared by proteins from the analyzed cohorts. In striking agreement with our hypothesis, in most cases we observe pronounced enrichment of terms related to nucleic-acid/protein interactions, including regulation of RNA metabolic processes, ribonucleoprotein complexes and transcription. The latter in particular allows one to speculate that protein tendencies to associate with cognate mRNA might be used by the cells to modulate gene expression pathways. What is more, PUR or PYR density profiles of mRNAs are identical to PUR or PYR density profiles of coding-strand DNA sequences (with Us being replaced by Ts). Although, based on our statistical potentials, we cannot say anything about T-binding preferences of amino acids, it is possible that our results may be generalizable even to DNA-protein interactions, as well as other RNA-protein interactions. One should also mention that, depending on the particular type of matching, other biological functions also tend to be enriched. For instance, the Uprotein-UmRNA top 10% subset displays significant enrichment of membrane proteins, whereas the Gprotein-PURmRNA top cohort seems to be populated by extracellular proteins and particularly those involved in the functioning of the innate immune system. Altogether, our preliminary GO analysis illustrates significant functional differences between proteins that strongly complement their cognate mRNAs and the rest of the human proteome, and these findings will be further explored in another manuscript.
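The cohort selection underlying the disorder and GO analyses above can be illustrated with a short sketch. It assumes that each protein already has a profile-matching R and a predicted average disorder value, and that "top" means strongest matching, i.e. the most negative R under the sign convention used here. The identifiers, R values and disorder values below are invented placeholders, and the function names are hypothetical.

```python
def split_cohorts(r_by_protein, fraction=0.10):
    """Return (top, bottom) cohorts: the most negative and the least negative R values."""
    ranked = sorted(r_by_protein, key=r_by_protein.get)  # most negative R first
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]

def mean_over(cohort, values):
    """Average of a per-protein quantity (e.g. predicted disorder) over a cohort."""
    return sum(values[p] for p in cohort) / len(cohort)

# Placeholder inputs (NOT real data).
toy_r = {f"p{i}": -0.9 + 0.05 * i for i in range(20)}
toy_disorder = {f"p{i}": (19 - i) / 19 for i in range(20)}

top, bottom = split_cohorts(toy_r, fraction=0.10)
top_disorder = mean_over(top, toy_disorder)
bottom_disorder = mean_over(bottom, toy_disorder)
```

In an actual analysis, the per-cohort disorder distributions would be compared against the whole-proteome background with an appropriate statistical test rather than by means alone.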

DISCUSSION

High levels of matching between base-binding-preference profiles of proteins and PYR- or PUR-density profiles of cognate mRNA-coding sequences, defined primarily by amino acid preferences to co-localize with G and C bases at RNA/protein interfaces, allow one to speculate that direct complementary binding interactions may be a key element underlying the whole mRNA/protein relationship, when it comes to both its evolutionary development and present-day biology (Figure 5B). This agrees well with, and significantly extends, our previous findings, where we have shown that protein sequence profiles of amino acid affinity for PYR analogs (42–44) mirror PYR density profiles of cognate mRNA sequences (15). It should be emphasized, however, that our present results are based exclusively on the statistics of direct amino acid/nucleobase contacts at RNA/protein interfaces. It is, therefore, still possible that the driving force for interactions between mRNAs and cognate proteins is non-specific (e.g. binding of positively charged amino acid side chains to RNA phosphate groups), whereas complementary interactions actually confer specificity to binding.

Figure 4. Effect of cutoff radius used to define protein–RNA contacts on observed correlations. (A) Dependence of Pearson correlation coefficients (R) between amino acid preference scales and average codon content on the cutoff radius for the two sets of statistics studied (‘1+’, ‘2+’). The total number of unique contacts in ‘1+’ and ‘2+’ (given in parentheses) sets obtained for each of the used cutoff radii is indicated at the top of the panel. (B) Cutoff radius dependence of median pairwise Pearson correlation coefficients (Rmedian) for comparison between nucleobase content profiles of mRNAs and base-preference-weighted protein sequence profiles over the entire human proteome (color code the same as in panel A).

Figure 5. Physico-chemical origins of the mRNA/protein relationship. (A) Correlation coefficients (R and <R> with standard deviations) between PYR or PUR average codon content (‘Codon content’) and respective mRNA profiles (‘Profiles’) calculated for G- (blue), PUR- (red) and PYR- (green) binding preferences of amino acids, which were obtained using different amino acid neighbor statistics (1+, 2+ or 2). (B) A model of physico-chemical complementarity between proteins and cognate mRNAs. Preferential interactions of amino acids with PYR or PUR define their codon content in the genetic table and facilitate complementary interactions between PYR/PUR-rich mRNA regions and PYR/PUR preferring regions in proteins. The opposite behavior of adenines and guanines adds an additional layer of complexity in the case of PURs, as signified by dashed arrows in the model. Note: polymer sizes not drawn to scale.

Moreover, our results provide a clear evolutionary perspective concerning the physico-chemical origins of translation, in line with the stereo-chemical hypothesis of the origin of the genetic code (16–21). In particular, our results give strong support to the possibility of direct templating of proteins from mRNAs in the era before the development of ribosomal decoding, and the code’s fixation in that era (17,45). In this framework, ancient amino acids associated with mRNA directly, following their intrinsic physico-chemical preferences, as outlined here. However, the fact that an analogous effect is not seen for all bases, especially adenine and uracil, supports the possibility that, in addition to physico-chemical rationales in the context of direct binding, other evolutionary forces were also responsible for shaping the genetic code, as suggested before (19). Our results are most consistent with the possibility that the early stereo-chemical phase in the code’s development was dominated by G- and C-rich codons, as the strongest correlations are seen for precisely these bases. If the basic structure of the early genetic code was defined by such codons, but was later modulated by the inclusion of A and U bases, this might explain why G-affinity of amino acids in present-day protein sequences closely follows PUR density profiles in cognate mRNAs. Interestingly, Trifonov and coworkers have suggested that the first codons were G- and C-rich on the basis of a consensus analysis of 40 different criteria (46).

Importantly, it should be emphasized that the stereo-chemical hypothesis of the code’s origin may differ from the cognate mRNA/protein complementary interaction hypothesis in terms of its evolutionary underpinnings. Direct templating of proteins from mRNAs in ancient systems (the coding aspect of the stereo-chemical hypothesis) does not necessarily imply that modern proteins directly interact with their own mRNA (complementary interaction hypothesis). However, our findings support the possibility that the origin of the genetic code and potential complementarity between proteins and cognate mRNAs might have the same physico-chemical background. It is well possible that other, independent influences have shaped both effects, and the two hypotheses leave ample room for such refinements. However, we would like to stress that, in our view, the two hypotheses are interlinked: cognate binding is, on the one hand, a reasonable consequence of the stereochemical hypothesis, but on the other hand, it also gives a potential biological rationale for the early development of the code to begin with, such as stabilization of RNA structures by bound polypeptides, as has been suggested before (45).

There are a number of open challenges concerning the aforementioned proposal. First and foremost, the structural features of mRNAs and cognate proteins impose severe constraints on any putative complementarity between the two. Namely, with the contour length of the mRNA coding part being about 4.5 times longer than that of a cognate protein, it is not clear what structural arrangements may be consistent with any complementary interactions. We would like to suggest that structures of such complexes may be dynamic and liquid-like, with mRNA stretches enveloping and solubilizing cognate protein stretches (15). Second, with many mRNAs and proteins being well-folded and compact for most of the time, it remains to be studied when and how opportunities could arise for the complementarity between their primary sequences to be of relevance. It is possible that, if at all realistic, such complementary binding might be functionally important precisely in those situations where both polymers are unstructured, such as during translation, export and degradation, as a consequence of thermal stress, or in the case of intrinsically unstructured proteins. However, we do not exclude the possibility of complementary interactions even in the folded state. Finally, concerning the origin of the genetic code, it is not clear how the final, well-defined structure of the code could have arisen based on still partially non-specific, large-scale binding interactions between mRNAs and cognate proteins. As suggested before, it is possible that the answer lies in a combination of different influences (19). Future research should shed light on these and related questions.

These challenges notwithstanding, our findings provide strong evidence that the ability to interact with mRNA might be a widespread phenomenon in the cell, involving not only cognate proteins, but also other proteins based on similar principles. The significance of such physico-chemical complementarity between mRNAs and proteins potentially extends to all facets of nucleic acid and protein biology in the modern cell, including transcription/translation regulation (9,10,47,48), mRNA transport and localization (49,50), processing and decay (51), structure of ribonucleoproteins (52) and others (2–5,53,54). Our preliminary GO analysis has demonstrated a significant enrichment of functions related to association with nucleic acids for the subsets of proteins that complement their cognate mRNAs strongly, and these findings will be explored in more detail in future work.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

ACKNOWLEDGEMENTS

The authors thank S. Dorner, R. Schroeder, A. Vaziri, G. Warren and members of the Laboratory of Computational Biophysics at MFPL for useful advice and critical reading of the manuscript.

FUNDING

This work was supported in part by the Austrian Science Fund FWF [START grant Y514-B11 to B.Z.]; European Research Council [ERC Starting Independent grant 279408 to B.Z.]. Funding for open access charge: Austrian Science Fund FWF.

Conflict of interest statement. None declared.

REFERENCES

1. Brenner, S., Jacob, F. and Meselson, M. (1961) An unstable intermediate carrying information from genes to ribosomes for protein synthesis. Nature, 190, 576–581.

2. Anders, G., Mackowiak, S.D., Jens, M., Maaskola, J., Kuntzagk, A., Rajewsky, N., Landthaler, M. and Dieterich, C. (2012) doRiNA: a database of RNA interactions in post-transcriptional regulation. Nucleic Acids Res., 40, D180–D186.

3. Baltz, A.G., Munschauer, M., Schwanhausser, B., Vasile, A., Murakawa, Y., Schueler, M., Youngs, N., Penfold-Brown, D., Drew, K., Milek, M. et al. (2012) The mRNA-bound proteome and its global occupancy profile on protein-coding transcripts. Mol. Cell, 46, 674–690.

4. Castello, A., Fischer, B., Eichelbaum, K., Horos, R., Beckmann, B.M., Strein, C., Davey, N.E., Humphreys, D.T., Preiss, T., Steinmetz, L.M. et al. (2012) Insights into RNA biology from an atlas of mammalian mRNA-binding proteins. Cell, 149, 1393–1406.

5. Konig, J., Zarnack, K., Luscombe, N.M. and Ule, J. (2012) Protein-RNA interactions: new genomic technologies and perspectives. Nat. Rev. Genet., 13, 221–221.

6. Mitchell, S.F., Jain, S., She, M. and Parker, R. (2013) Global analysis of yeast mRNPs. Nat. Struct. Mol. Biol., 20, 127–133.

7. Weber, S.C. and Brangwynne, C.P. (2012) Getting RNA and protein in phase. Cell, 149, 1188–1191.

8. Han, T.N.W., Kato, M., Xie, S.H., Wu, L.C., Mirzaei, H., Pei, J.M., Chen, M., Xie, Y., Allen, J., Xiao, G.H. et al. (2012) Cell-free formation of RNA granules: bound RNAs identify features and components of cellular assemblies. Cell, 149, 768–779.

9. Kyrpides, N.C. and Ouzounis, C.A. (1993) Mechanisms of specificity in messenger-RNA degradation - autoregulation and cognate interactions. J. Theor. Biol., 163, 373–392.

10. Ouzounis, C.A. and Kyrpides, N.C. (1994) Reverse interpretation - a hypothetical selection mechanism for adaptive mutagenesis based on autoregulated messenger-RNA stability. J. Theor. Biol., 167, 373–379.

11. Chu, E., Copur, S.M., Ju, J., Chen, T.M., Khleif, S., Voeller, D.M., Mizunuma, N., Patel, M., Maley, G.F., Maley, F. et al. (1999) Thymidylate synthase protein and p53 mRNA form an in vivo ribonucleoprotein complex. Mol. Cell. Biol., 19, 1582–1594.

12. Tai, N., Schmitz, J.C., Liu, J., Lin, X., Bailly, M., Chen, T.M. and Chu, E. (2004) Translational autoregulation of thymidylate synthase and dihydrofolate reductase. Front. Biosci., 9, 2521–2526.

13. Schuttpelz, M., Schoning, J.C., Doose, S., Neuweiler, H., Peters, E., Staiger, D. and Sauer, M. (2008) Changes in conformational dynamics of mRNA upon AtGRP7 binding studied by fluorescence correlation spectroscopy. J. Am. Chem. Soc., 130, 9507–9513.

14. Zhao, X., Liu, M., Wu, N., Ding, L., Liu, H. and Lin, X. (2010) Recovery of recombinant zebrafish p53 protein from inclusion bodies and its binding activity to p53 mRNA in vitro. Protein Expr. Purif., 72, 262–266.

15. Hlevnjak, M., Polyansky, A.A. and Zagrovic, B. (2012) Sequence signatures of direct complementarity between mRNAs and cognate proteins on multiple levels. Nucleic Acids Res., 40, 8874–8882.

16. Woese, C.R. (1965) Order in genetic code. Proc. Natl Acad. Sci. USA, 54, 71–75.

17. Woese, C.R. (1965) On the evolution of the genetic code. Proc. Natl Acad. Sci. USA, 54, 1546–1552.

18. Yarus, M. (1998) Amino acids as RNA ligands: a direct-RNA-template theory for the code’s origin. J. Mol. Evol., 47, 109–117.

19. Koonin, E.V. and Novozhilov, A.S. (2009) Origin and evolution of the genetic code: the universal enigma. IUBMB Life, 61, 99–111.

20. Yarus, M., Widmann, J.J. and Knight, R. (2009) RNA-amino acid binding: a stereochemical era for the genetic code. J. Mol. Evol., 69, 406–429.

21. Johnson, D.B. and Wang, L. (2010) Imprints of the genetic code in the ribosome. Proc. Natl Acad. Sci. USA, 107, 8298–8303.

22. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. and Bourne, P.E. (2000) The protein data bank. Nucleic Acids Res., 28, 235–242.

23. Ben-Shem, A., de Loubresse, N.G., Melnikov, S., Jenner, L., Yusupova, G. and Yusupov, M. (2011) The structure of the eukaryotic ribosome at 3.0 angstrom resolution. Science, 334, 1524–1529.

24. Dunkle, J.A., Wang, L.Y., Feldman, M.B., Pulk, A., Chen, V.B., Kapral, G.J., Noeske, J., Richardson, J.S., Blanchard, S.C. and Cate, J.H.D. (2011) Structures of the bacterial ribosome in classical and hybrid states of tRNA binding. Science, 332, 981–984.

25. Polikanov, Y.S., Blaha, G.M. and Steitz, T.A. (2012) How hibernation factors RMF, HPF and YfiA turn off protein synthesis. Science, 336, 915–918.

26. Harms, J.M., Wilson, D.N., Schluenzen, F., Connell, S.R., Stachelhaus, T., Zaborowska, Z., Spahn, C.M. and Fucini, P. (2008) Translational regulation via L11: molecular switches on the ribosome turned on and off by thiostrepton and micrococcin. Mol. Cell, 30, 26–38.

27. Miyazawa, S. and Jernigan, R.L. (1985) Estimation of effective interresidue contact energies from protein crystal-structures - quasi-chemical approximation. Macromolecules, 18, 534–552.

28. Donald, J.E., Chen, W.W. and Shakhnovich, E.I. (2007) Energetics of protein-DNA interactions. Nucleic Acids Res., 35, 1039–1047.

29. Jonikas, M.A., Radmer, R.J., Laederach, A., Das, R., Pearlman, S., Herschlag, D. and Altman, R.B. (2009) Coarse-grained modeling of large RNA molecules with knowledge-based potentials and structural filters. RNA, 15, 189–199.

30. Perez-Cano, L., Solernou, A., Pons, C. and Fernandez-Recio, J. (2010) Structural prediction of protein-RNA interaction by computational docking with propensity-based statistical potentials. Pac. Symp. Biocomput., 15, 269–280.

31. Tuszynska, I. and Bujnicki, J.M. (2011) DARS-RNP and QUASI-RNP: new statistical potentials for protein-RNA docking. BMC Bioinformatics, 12, 348.

32. Gasteiger, E., Gattiker, A., Hoogland, C., Ivanyi, I., Appel, R.D. and Bairoch, A. (2003) ExPASy: the proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res., 31, 3784–3788.

33. Dosztanyi, Z., Csizmok, V., Tompa, P. and Simon, I. (2005) IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics, 21, 3433–3434.

34. Huang da, W., Sherman, B.T. and Lempicki, R.A. (2009) Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc., 4, 44–57.

35. The PyMOL Molecular Graphics System, Version 1.3r1 (2010) Schrodinger, LLC. http://www.pymol.org/citing (4 July 2013, date last accessed).

36. Treger, M. and Westhof, E. (2001) Statistical analysis of atomic contacts at RNA-protein interfaces. J. Mol. Recognit., 14, 199–214.

37. Hoffman, M.M., Khrapov, M.A., Cox, J.C., Yao, J., Tong, L. and Ellington, A.D. (2004) AANT: the amino acid-nucleotide interaction database. Nucleic Acids Res., 32, D174–D181.

38. Gupta, A. and Gribskov, M. (2011) The role of RNA sequence and structure in RNA–protein interactions. J. Mol. Biol., 409, 574–587.

39. Fernandez, M., Kumagai, Y., Standley, D.M., Sarai, A., Mizuguchi, K. and Ahmad, S. (2011) Prediction of dinucleotide-specific RNA-binding sites in proteins. BMC Bioinformatics, 12(Suppl. 13), S5.

40. Copley, S.D., Smith, E. and Morowitz, H.J. (2005) A mechanism for the association of amino acids with their codons and the origin of the genetic code. Proc. Natl Acad. Sci. USA, 102, 4442–4447.

41. Iwakiri, J., Tateishi, H., Chakraborty, A., Patil, P. and Kenmochi, N. (2012) Dissecting the protein-RNA interface: the role of protein surface shapes and RNA secondary structures in protein-RNA recognition. Nucleic Acids Res., 40, 3299–3306.


42. Woese, C.R., Dugre, D.H., Saxinger, W.C. and Dugre, S.A. (1966) The molecular basis for the genetic code. Proc. Natl Acad. Sci. USA, 55, 966–974.

43. Woese, C.R. (1973) Evolution of the genetic code. Naturwissenschaften, 60, 447–459.

44. Mathew, D.C. and Luthey-Schulten, Z. (2008) On the physical basis of the amino acid polar requirement. J. Mol. Evol., 66, 519–528.

45. Noller, H.F. (2012) Evolution of protein synthesis from an RNA world. Cold Spring Harb. Perspect. Biol., 4, 1–U20.

46. Trifonov, E.N., Kirzhner, A., Kirzhner, V.M. and Berezovsky, I.N. (2001) Distinct stages of protein evolution as suggested by protein sequence analysis. J. Mol. Evol., 53, 394–401.

47. Vaquerizas, J.M., Kummerfeld, S.K., Teichmann, S.A. and Luscombe, N.M. (2009) A census of human transcription factors: function, expression and evolution. Nat. Rev. Genet., 10, 252–263.

48. Sonenberg, N. and Hinnebusch, A.G. (2009) Regulation of translation initiation in eukaryotes: mechanisms and biological targets. Cell, 136, 731–745.

49. Lecuyer, E., Yoshida, H., Parthasarathy, N., Alm, C., Babak, T., Cerovina, T., Hughes, T.R., Tomancak, P. and Krause, H.M. (2007) Global analysis of mRNA localization reveals a prominent role in organizing cellular architecture and function. Cell, 131, 174–187.

50. Martin, K.C. and Ephrussi, A. (2009) mRNA localization: gene expression in the spatial dimension. Cell, 136, 719–730.

51. Moore, M.J. and Proudfoot, N.J. (2009) Pre-mRNA processing reaches back to transcription and ahead to translation. Cell, 136, 688–700.

52. Glisovic, T., Bachorik, J.L., Yong, J. and Dreyfuss, G. (2008) RNA-binding proteins and post-transcriptional gene regulation. FEBS Lett., 582, 1977–1986.

53. Bellucci, M., Agostini, F., Masin, M. and Tartaglia, G.G. (2011) Predicting protein associations with long noncoding RNAs. Nat. Methods, 8, 444–445.

54. Rinn, J.L. and Chang, H.Y. (2012) Genome regulation by long noncoding RNAs. Annu. Rev. Biochem., 81, 145–166.

Nucleic Acids Research 2013 Vol 41 No 18 8443

nucleobases at proteinRNA interfaces and (ii) how doessequence density of different bases in mRNA-coding se-quences relate to sequence profiles of amino-acid inter-action preferences for these and other bases in cognateprotein sequences

Amino acid interaction preferences and their codoncontent

We first focus on contact statistics from set ‘2+’. Dinucleotides were found previously to exhibit potential for specific recognition of amino acids at protein–RNA interfaces (39) and have also been suggested as potential catalysts for amino acid synthesis in pre-biotic environments (40). Moreover, set ‘2+’ by definition also includes all instances where triplets of bases directly contact a given amino acid, which may be relevant in the context of the genetic code. Using set ‘2+’ statistics, we observe a remarkably strong correlation between preferences of amino acids to interact with guanine (G-preference, Figure 1B) and the average PUR content of their respective codons as derived from the complete human proteome, with a Pearson correlation coefficient R of −0.84 (Figure 2A). Negative Pearson correlation coefficients indicate matching between amino acid preferences and codon content, owing to the way preference is defined (see ‘Materials and Methods’ section). Put differently, amino acids which are predominantly encoded by PURs display a strong tendency to co-localize with G at protein–RNA interfaces. This is also true, albeit at a somewhat weaker level of correlation, for matching between the PUR composition of individual codons from the standard genetic table and the respective G-preferences, if the statistics of codon usage in the human proteome are not included (R = −0.68, Supplementary Figure S1). The observed signal for G is statistically highly significant, as evidenced by randomization calculations (P-value < 10⁻⁶, Figure 2B). Related to this, G-preference of amino acids inversely correlates with the C and U content of their codons (Figure 2B). Somewhat less prominent, but still extremely significant, correlations are observed for G- and C-preference of amino acids and the average G- and C-content of their codons (R of −0.47 and −0.58, respectively). On the other hand, the interface statistics for adenine (A) and uracil (U) do not correlate with their average usage in codons. In particular, the A-preference of amino acids correlates inversely with the A-content (R = 0.59) or directly with the U-content of their codons (R = −0.51), whereas the U-preference exhibits relatively low correlations throughout (Figure 2B). Finally, both PYR and PUR binding preferences of amino acids (Figure 1B) display significant correlations with PYR and PUR fraction in their codons, with R of −0.54 and −0.53, respectively, and P-values < 10⁻⁶ in both cases. In other words, amino acids coded for by PYR-rich codons prefer to co-localize with PYR, and those coded for by PUR-rich codons with PUR, at RNA–protein interfaces. Although similar in the present case, PYR- and PUR-preference scales need not necessarily be inverses of each other, owing to the way preferences are defined, and we therefore here report and discuss both.
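To make the codon-level comparison concrete, the following is a minimal Python sketch of how the unweighted PUR content of each amino acid's codons can be derived from the standard genetic table and correlated with a preference scale. It is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, codon usage weighting is omitted, the preference scale must be supplied by the user (no values from Figure 1B are reproduced here), and the permutation test is one plausible reading of the "randomization calculations", whose exact protocol is given in the Materials and Methods section.

```python
import itertools
import random
import statistics

BASES = "UCAG"
# Standard genetic code in the conventional UCAG ordering ('*' marks stop codons).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(itertools.product(BASES, repeat=3), AMINO_ACIDS)
}


def purine_fraction(seq):
    """Fraction of purines (A or G) in an RNA string."""
    return sum(base in "AG" for base in seq) / len(seq)


def codon_purine_content():
    """Unweighted mean purine fraction over the codons of each amino acid."""
    per_aa = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            per_aa.setdefault(aa, []).append(purine_fraction(codon))
    return {aa: statistics.mean(vals) for aa, vals in per_aa.items()}


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(scale, content, n_perm=10000, seed=0):
    """Assumed test: fraction of shuffled scales with |R| >= the observed |R|."""
    aas = sorted(content)
    xs = [content[aa] for aa in aas]
    ys = [scale[aa] for aa in aas]
    observed = abs(pearson_r(xs, ys))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        shuffled = ys[:]
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    return hits / n_perm
```

Given a dictionary `scale` mapping the 20 one-letter amino acid codes to preference values, `pearson_r` applied to the matched values of `scale` and `codon_purine_content()` gives the unweighted analogue of the correlation discussed above; weighting each codon by its usage in the proteome would be the natural extension.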

Matching between sequence profiles of mRNAs and their cognate proteins

How do these observations translate if one compares complete mRNA-coding sequences with their cognate protein sequences? Owing to codon usage bias and non-uniform amino-acid composition of the human proteome, these results could in principle deviate significantly from the results obtained for individual codons and amino acids. To address this question, we calculate a Pearson R for every cognate mRNA/protein pair in the human proteome, capturing the correlation between each mRNA sequence composition profile and the base-binding preference profile of its cognate protein sequence. Remarkably, we observe an extremely high level of matching between PUR density profiles of mRNAs and G-preference profiles of cognate protein sequences, with a median Pearson R (Rmedian) over the entire human proteome of −0.80 and a low P-value (<10⁻⁶) as determined by randomization (Figure 2C). In particular, the distribution of Pearson R values for this scale over the human proteome is significantly left shifted and shows only marginal overlap with the one calculated for a typical randomized interaction preference scale (Figure 2C). For illustration, we present sequence profiles for proteins of the most abundant length (300–400 amino acids, Supplementary Figure S2), displaying typical (i.e. exhibiting a Pearson R equal to the population median) or best levels of correlation (Figure 2D). As is evident, the PUR density of mRNAs is quantitatively extremely well predicted by the G-binding preference profiles of cognate proteins, even for typical human proteins (Rmedian = −0.80 and P < 10⁻⁶). We also observe significant matching between C-preference profiles for protein sequences and both C- and PYR-density profiles of their cognate mRNAs, with Rmedian of −0.55 and −0.47, respectively (Figure 2E). In contrast, the A-preferences display significant matching with PYR-density profiles on the side of mRNA (Figure 2E; see also Supplementary Table S2 for the full report of profile correlations), with Rmedian of −0.53. Finally, a strong and significant level of matching is observed for PYR-binding preferences of amino acids and PYR mRNA profiles, as well as PUR-binding preferences of amino acids and PUR profiles (Rmedian of −0.58 in both cases and P-values of 8.6 × 10⁻³ and 7.9 × 10⁻³, respectively; Figure 3A and C). From the exemplary typical and best profiles (Figure 3B and D), it is clear that the PYR- and PUR-rich regions in mRNA code for stretches of amino acids in cognate proteins which prefer to co-localize with PYR and PUR bases, respectively, at protein–RNA interfaces in the known 3D PDB structures. The typical level of similarity between sequence profiles is actually greater than what one might infer from Rmedian values, suggesting that the Pearson correlation coefficient might not even be the optimal measure of deviation in this case. Importantly, this direct physico-chemical complementarity between mRNA and cognate protein sequences may be indicative of pronounced potential for complex formation between them, especially under circumstances when ‘peak’ regions become available for such interactions. Given the fact that

Nucleic Acids Research 2013 Vol 41 No 18 8437

a significant matching of profiles is detected at the level of primary sequences, we propose that the presence of extended, unstructured protein and mRNA segments may be required for such binding. This suggestion agrees well with recent knowledge-based studies, where RNA loops and bulges were found to be more likely to interact with amino-acid side chains in a specific manner (38,41).

How sensitive is the level of matching to the choice of cutoff distance used to define contacting amino acids and nucleobases in protein/RNA complexes? To address this question, we have repeated the aforementioned analysis for a range of different cutoff values, going from 6 to 10 Å in steps of 0.25 Å (Figure 4). Overall, for set ‘2+’, our findings are largely robust to the choice of the exact cutoff in this range, albeit with a somewhat lower level of significance for longer cutoffs. However, the majority of the signal is lost if one uses the ‘1+’ set, except for G-preference and PUR-content (Figure 5A) and A-preference and PYR-content (Supplementary Table S2). This observation strongly suggests that close, dense packing of nucleobases around amino acids may be required for specificity in cognate complex formation. Although interfaces may be dynamic and liquid-like, as we have suggested before, they may still need to be densely packed. Interestingly, if one reduces the ‘2+’ set by including only the two closest bases in contact with a given amino acid (set ‘2’), the signal for G-preference/PUR-content improves even further by several percentage points (Figure 5A), and the same holds for C-preference/C-content and A-preference/PYR-content (Supplementary Table S2).

Figure 2. Relationship between nucleobase-binding preferences of amino acids and mRNA content at multiple levels. (A) Correlation between G interaction preferences of amino acids (Figure 1B) and the average PUR content of their codons in mRNAs of the entire human proteome. (B) Pairwise Pearson correlation coefficients (R) between base-binding preference scales of amino acids (‘scl’) and average base content of their codons (‘cdn’). (C) Distributions of correlation coefficients (R) between window-averaged PUR-content profiles of individual mRNA coding sequences and window-averaged G-preference sequence profiles of the respective proteins for the entire human proteome (window size = 21). The dashed curve depicts the distribution of correlation coefficients calculated for a typical randomized G-preference scale. Inset: the distribution of the means of sequence-profile correlation coefficients for the human proteome (<R>) calculated for 10⁶ randomized G-preference scales. The R for the original G-preference scale is shown with an arrow. (D) Typical (R = Rmedian) and best pairs of mRNA PUR-content (black curves) and protein G-preference profiles (red dashed curves) for human proteins. (E) Median pairwise Pearson correlation coefficients for comparison between nucleobase content profiles of mRNAs (subscript ‘mRNA’, x-axis) and base-preference-weighted protein sequence profiles (subscript ‘protein’, y-axis) over the entire human proteome. All results are based on the analysis of set ‘2+’ statistics. All data reported for preference scales are obtained using an 8 Å cutoff.
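The per-pair profile comparison used throughout this section (window-averaged mRNA content profile versus window-averaged protein preference profile, window size 21 as stated for Figure 2C) can be sketched in Python as follows. This is a hedged illustration rather than the published implementation: the function names are hypothetical, the mRNA profile is assumed to be built from the purine fraction of each codon so that both profiles have one value per residue, only full windows are averaged, and the preference scale is a user-supplied dictionary. The exact windowing protocol is described in the Materials and Methods section.

```python
def window_average(values, window=21):
    """Moving average over full windows only (output length n - window + 1)."""
    if len(values) < window:
        raise ValueError("sequence shorter than the averaging window")
    total = sum(values[:window])
    out = [total / window]
    for i in range(window, len(values)):
        total += values[i] - values[i - window]
        out.append(total / window)
    return out


def codon_purine_profile(cds):
    """Per-codon purine (A/G) fraction of a coding sequence (stop codon removed)."""
    cds = cds.upper().replace("T", "U")
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return [sum(base in "AG" for base in codon) / 3 for codon in codons]


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def profile_correlation(cds, protein, scale, window=21):
    """Pearson R between the smoothed mRNA purine profile and the smoothed
    protein preference profile; `scale` maps one-letter codes to preferences."""
    mrna = window_average(codon_purine_profile(cds), window)
    prot = window_average([scale[aa] for aa in protein], window)
    if len(mrna) != len(prot):
        raise ValueError("coding sequence and protein lengths do not match")
    return pearson_r(mrna, prot)
```

Applying `profile_correlation` to every cognate pair and taking the median of the resulting values would give the analogue of Rmedian reported above; under the sign convention of the text, strongly negative values correspond to matching.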

To further study the role of protein structural disorder in matching, we have analyzed the levels of the predicted disorder of the top and the bottom 10% of proteins when it comes to the degree of mRNA/protein profile matching, as captured by the Pearson R coefficient (see ‘Materials and Methods’ section). We have done this for the six cases of direct comparison, whereby the same base type is used for both protein preference and mRNA profile density (Gprotein-GmRNA, Aprotein-AmRNA, Cprotein-CmRNA, Uprotein-UmRNA, PURprotein-PURmRNA and PYRprotein-PYRmRNA), and also for the case displaying the strongest signal in our analysis (Gprotein-PURmRNA). Importantly, in the case of Gprotein-GmRNA, Aprotein-AmRNA and Cprotein-CmRNA matching, we do observe a pronounced tendency for the top and the bottom 10% cohorts to be significantly enriched (top 10%) and depleted (bottom 10%) in disordered proteins (Supplementary Table S3), whereas in the case of Uprotein-UmRNA matching the situation is reversed. Interestingly, for PURprotein-PURmRNA, PYRprotein-PYRmRNA and Gprotein-PURmRNA matching, one observes slight disorder enrichment in both top and bottom cohorts. The most prominent shift of the distribution of predicted average disorder toward higher disorder, as compared with background, is observed for the top 10% cohort of proteins displaying strong matching between C-preference profiles of their sequences and the C-content of their cognate mRNAs (Cprotein-CmRNA, Supplementary Table S3, Supplementary Figure S3). One might argue that this effect could just be related to compositional properties of such protein and mRNA pairs, whereby disordered proteins are simply encoded by C-rich sequences. However, the differences between nucleobase compositions of mRNAs from the Cprotein-CmRNA top 10% cohort and the complete proteome are minor, suggesting that the underlying explanation might be more complex (Supplementary Figure S3).

Which biological functions might be associated with a high level of complementarity between proteins and cognate mRNAs? To address this question, we have performed GO analysis for seven different top 10% subsets of proteins displaying strong matching with cognate mRNAs (see ‘Materials and Methods’ section for details). In Supplementary Table S4, we report the most significantly enriched biological functions (using a P-value cutoff of 10⁻¹⁰) shared by proteins from the analyzed cohorts. In striking agreement with our hypothesis, in most cases we observe pronounced enrichment of terms related to nucleic-acid/protein interactions, including regulation of RNA metabolic processes, ribonucleoprotein complexes and transcription. The latter, in particular, allows one to speculate that protein tendencies to associate with cognate mRNA might be used by the cells to modulate gene expression pathways. What is more, PUR or PYR density profiles of mRNAs are identical to PUR or PYR density profiles of coding-strand DNA sequences (with Us being replaced by Ts). Although, based on our statistical potentials, we cannot say anything about T-binding preferences of amino acids, it is possible that our results may be generalizable even to DNA-protein interactions, as well as other RNA-protein interactions. One should also mention that, depending on the particular type of matching, other biological functions also tend to be enriched. For instance, the Uprotein-UmRNA top 10% subset displays significant enrichment of membrane proteins, whereas the Gprotein-PURmRNA top cohort seems to be populated by extracellular proteins, particularly those involved in the functioning of the innate immune system. Altogether, our preliminary GO analysis illustrates significant functional differences between proteins that strongly complement their cognate mRNAs and the rest of the human proteome, and these findings will be further explored in another manuscript.

Figure 3. PYR/PUR mRNA sequence profiles strongly match PYR/PUR-preferences of cognate protein sequences. PYR (A and B) and PUR (C and D) amino-acid preference scales are given in Figure 1B. For details, please see the analogous captions to Figure 2C and D.
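The cohort construction behind the disorder and GO analyses above (top and bottom 10% of proteins ranked by their profile-matching R) can be sketched as follows. This is an illustrative Python sketch under assumptions: the function names are hypothetical, "top" is taken to mean the most negative R values because negative coefficients indicate matching in this work, and the per-protein scores (for example a predicted average disorder) are supplied by the user. No actual proteome data are reproduced.

```python
def split_cohorts(r_by_protein, fraction=0.10):
    """Return (top, bottom) cohorts of protein identifiers.

    Proteins are ranked by Pearson R in ascending order, so the 'top' cohort
    holds the most negative values (assumed here to be the strongest matching)
    and the 'bottom' cohort holds the least negative or positive values.
    """
    ranked = sorted(r_by_protein, key=r_by_protein.get)
    n = max(1, round(len(ranked) * fraction))
    return ranked[:n], ranked[-n:]


def mean_score(cohort, score_by_protein):
    """Average of a per-protein score (e.g. predicted disorder) over a cohort."""
    return sum(score_by_protein[p] for p in cohort) / len(cohort)
```

Comparing `mean_score` for each cohort against the same average over all proteins gives a simple first look at enrichment or depletion; a proper significance estimate would still require the statistical treatment described in the Materials and Methods section.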

DISCUSSION

High levels of matching between base-binding-preference profiles of proteins and PYR- or PUR-density profiles of cognate mRNA-coding sequences, defined primarily by amino acid preferences to co-localize with G and C bases at RNA/protein interfaces, allow one to speculate that direct complementary binding interactions may be a key element underlying the whole mRNA/protein relationship, when it comes to both its evolutionary development and present-day biology (Figure 5B). This agrees well with, and significantly extends, our previous findings, where we have shown that protein sequence profiles of amino acid affinity for PYR analogs (42–44) mirror PYR density profiles of cognate mRNA sequences (15). It should be emphasized, however, that our present results are based exclusively on the statistics of direct amino acid/nucleobase contacts at RNA/protein interfaces. It is therefore still possible that the driving force for interactions between mRNAs and cognate proteins is non-specific (e.g. binding of positively charged amino acid side chains to RNA phosphate groups), whereas complementary interactions actually confer specificity to binding.

Figure 4. Effect of cutoff radius used to define protein–RNA contacts on observed correlations. (A) Dependence of Pearson correlation coefficients (R) between amino acid preference scales and average codon content on the cutoff radius for the two sets of statistics studied (‘1+’, ‘2+’). The total number of unique contacts in ‘1+’ and ‘2+’ (given in parentheses) sets obtained for each of the used cutoff radii is indicated at the top of the panel. (B) Cutoff radius dependence of median pairwise Pearson correlation coefficients (Rmedian) for comparison between nucleobase content profiles of mRNAs and base-preference-weighted protein sequence profiles over the entire human proteome (color code the same as in panel A).

Figure 5. Physico-chemical origins of the mRNA/protein relationship. (A) Correlation coefficients (R and <R>, with standard deviations) between PYR or PUR average codon content (‘Codon content’) and respective mRNA profiles (‘Profiles’) calculated for G- (blue), PUR- (red) and PYR- (green) binding preferences of amino acids, which were obtained using different amino acid neighbor statistics (1+, 2+ or 2). (B) A model of physico-chemical complementarity between proteins and cognate mRNAs. Preferential interactions of amino acids with PYR or PUR define their codon content in the genetic table and facilitate complementary interactions between PYR/PUR-rich mRNA regions and PYR/PUR preferring regions in proteins. The opposite behavior of adenines and guanines adds an additional layer of complexity in the case of PURs, as signified by dashed arrows in the model. Note: polymer sizes not drawn to scale.

Moreover, our results provide a clear evolutionary perspective concerning the physico-chemical origins of translation, in line with the stereo-chemical hypothesis of the origin of the genetic code (16–21). In particular, our results give strong support to the possibility of direct templating of proteins from mRNAs in the era before the development of ribosomal decoding and the code’s fixation in that era (17,45). In this framework, ancient amino acids associated with mRNA directly, following their intrinsic physico-chemical preferences as outlined here. However, the fact that an analogous effect is not seen for all bases, especially adenine and uracil, supports the possibility that, in addition to physico-chemical rationales in the context of direct binding, other evolutionary forces were also responsible for shaping the genetic code, as suggested before (19). Our results are most consistent with the possibility that the early stereo-chemical phase in the code’s development was dominated by G- and C-rich codons, as the strongest correlations are seen for precisely these bases. If the basic structure of the early genetic code was defined by such codons, but was later modulated by the inclusion of A and U bases, this might explain why G-affinity of amino acids in present-day protein sequences closely follows PUR density profiles in cognate mRNAs. Interestingly, Trifonov and coworkers have suggested that the first codons were G- and C-rich on the basis of a consensus analysis of 40 different criteria (46).

Importantly, it should be emphasized that the stereo-chemical hypothesis of the code’s origin may differ from the cognate mRNA/protein complementary interaction hypothesis in terms of its evolutionary underpinnings. Direct templating of proteins from mRNAs in ancient systems (the coding aspect of the stereo-chemical hypothesis) does not necessarily imply that modern proteins directly interact with their own mRNA (complementary interaction hypothesis). However, our findings support the possibility that the origin of the genetic code and potential complementarity between proteins and cognate mRNAs might have the same physico-chemical background. It is well possible that other independent influences have shaped both effects, and the two hypotheses leave ample room for such refinements. However, we would like to stress that, in our view, the two hypotheses are interlinked: cognate binding is, on the one hand, a reasonable consequence of the stereochemical hypothesis, but, on the other hand, it also gives a potential biological rationale for the early development of the code to begin with, such as stabilization of RNA structures by bound polypeptides, as has been suggested before (45).

There are a number of open challenges concerning the aforementioned proposal. First and foremost, the structural features of mRNAs and cognate proteins impose severe constraints on any putative complementarity between the two. Namely, with the contour length of the mRNA coding part being 4.5 times longer than that of a cognate protein, it is not clear what structural arrangements may be consistent with any complementary interactions. We would like to suggest that structures of such complexes may be dynamic and liquid-like, with mRNA stretches enveloping and solubilizing cognate protein stretches (15). Second, with many mRNAs and proteins being well-folded and compact for most of the time, it remains to be studied when and how opportunities could arise for the complementarity between their primary sequences to be of relevance. It is possible that, if at all realistic, such complementary binding might be functionally important precisely in those situations where both polymers are unstructured, such as during translation, export and degradation, as a consequence of thermal stress, or in the case of intrinsically unstructured proteins. However, we do not exclude the possibility of complementary interactions even in the folded state. Finally, concerning the origin of the genetic code, it is not clear how the final, well-defined structure of the code could have arisen based on still partially non-specific, large-scale binding interactions between mRNAs and cognate proteins. As suggested before, it is possible that the answer lies in a combination of different influences (19). Future research should shed light on these and related questions.

These challenges notwithstanding, our findings provide strong evidence that the ability to interact with mRNA might be a widespread phenomenon in the cell, involving not only cognate proteins but also other proteins based on similar principles. The potential significance of such physico-chemical complementarity between mRNAs and proteins potentially extends to all facets of nucleic acid and protein biology in the modern cell, including transcription/translation regulation (9,10,47,48), mRNA transport and localization (49,50), processing and decay (51), structure of ribonucleoproteins (52) and others (2–5,53,54). Our preliminary GO analysis has demonstrated a significant enrichment of functions related to association with nucleic acids for the subsets of proteins that complement their cognate mRNAs strongly, and these findings will be explored in more detail in future work.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

ACKNOWLEDGEMENTS

The authors thank S. Dorner, R. Schroeder, A. Vaziri, G. Warren and members of the Laboratory of Computational Biophysics at MFPL for useful advice and critical reading of the manuscript.

FUNDING

This work was supported in part by the Austrian Science Fund FWF [START grant Y514-B11 to B.Z.]; European Research Council [ERC Starting Independent grant 279408 to B.Z.]. Funding for open access charge: Austrian Science Fund FWF.

Conflict of interest statement. None declared.

REFERENCES

1. Brenner, S., Jacob, F. and Meselson, M. (1961) An unstable intermediate carrying information from genes to ribosomes for protein synthesis. Nature, 190, 576–581.

2. Anders, G., Mackowiak, S.D., Jens, M., Maaskola, J., Kuntzagk, A., Rajewsky, N., Landthaler, M. and Dieterich, C. (2012) doRiNA: a database of RNA interactions in post-transcriptional regulation. Nucleic Acids Res., 40, D180–D186.

3. Baltz, A.G., Munschauer, M., Schwanhausser, B., Vasile, A., Murakawa, Y., Schueler, M., Youngs, N., Penfold-Brown, D., Drew, K., Milek, M. et al. (2012) The mRNA-bound proteome and its global occupancy profile on protein-coding transcripts. Mol. Cell, 46, 674–690.

4. Castello, A., Fischer, B., Eichelbaum, K., Horos, R., Beckmann, B.M., Strein, C., Davey, N.E., Humphreys, D.T., Preiss, T., Steinmetz, L.M. et al. (2012) Insights into RNA biology from an atlas of mammalian mRNA-binding proteins. Cell, 149, 1393–1406.

5. Konig, J., Zarnack, K., Luscombe, N.M. and Ule, J. (2012) Protein-RNA interactions: new genomic technologies and perspectives. Nat. Rev. Genet., 13, 221–221.

6. Mitchell, S.F., Jain, S., She, M. and Parker, R. (2013) Global analysis of yeast mRNPs. Nat. Struct. Mol. Biol., 20, 127–133.

7. Weber, S.C. and Brangwynne, C.P. (2012) Getting RNA and protein in phase. Cell, 149, 1188–1191.

8. Han, T.N.W., Kato, M., Xie, S.H., Wu, L.C., Mirzaei, H., Pei, J.M., Chen, M., Xie, Y., Allen, J., Xiao, G.H. et al. (2012) Cell-free formation of RNA granules: bound RNAs identify features and components of cellular assemblies. Cell, 149, 768–779.

9. Kyrpides, N.C. and Ouzounis, C.A. (1993) Mechanisms of specificity in messenger-RNA degradation - autoregulation and cognate interactions. J. Theor. Biol., 163, 373–392.

10. Ouzounis, C.A. and Kyrpides, N.C. (1994) Reverse interpretation - a hypothetical selection mechanism for adaptive mutagenesis based on autoregulated messenger-RNA stability. J. Theor. Biol., 167, 373–379.

11. Chu, E., Copur, S.M., Ju, J., Chen, T.M., Khleif, S., Voeller, D.M., Mizunuma, N., Patel, M., Maley, G.F., Maley, F. et al. (1999) Thymidylate synthase protein and p53 mRNA form an in vivo ribonucleoprotein complex. Mol. Cell. Biol., 19, 1582–1594.

12. Tai, N., Schmitz, J.C., Liu, J., Lin, X., Bailly, M., Chen, T.M. and Chu, E. (2004) Translational autoregulation of thymidylate synthase and dihydrofolate reductase. Front. Biosci., 9, 2521–2526.

13. Schuttpelz, M., Schoning, J.C., Doose, S., Neuweiler, H., Peters, E., Staiger, D. and Sauer, M. (2008) Changes in conformational dynamics of mRNA upon AtGRP7 binding studied by fluorescence correlation spectroscopy. J. Am. Chem. Soc., 130, 9507–9513.

14. Zhao, X., Liu, M., Wu, N., Ding, L., Liu, H. and Lin, X. (2010) Recovery of recombinant zebrafish p53 protein from inclusion bodies and its binding activity to p53 mRNA in vitro. Protein Expr. Purif., 72, 262–266.

15. Hlevnjak, M., Polyansky, A.A. and Zagrovic, B. (2012) Sequence signatures of direct complementarity between mRNAs and cognate proteins on multiple levels. Nucleic Acids Res., 40, 8874–8882.

16. Woese, C.R. (1965) Order in genetic code. Proc. Natl Acad. Sci. USA, 54, 71–75.

17. Woese, C.R. (1965) On the evolution of the genetic code. Proc. Natl Acad. Sci. USA, 54, 1546–1552.

18. Yarus, M. (1998) Amino acids as RNA ligands: a direct-RNA-template theory for the code’s origin. J. Mol. Evol., 47, 109–117.

19. Koonin, E.V. and Novozhilov, A.S. (2009) Origin and evolution of the genetic code: the universal enigma. IUBMB Life, 61, 99–111.

20. Yarus, M., Widmann, J.J. and Knight, R. (2009) RNA-amino acid binding: a stereochemical era for the genetic code. J. Mol. Evol., 69, 406–429.

21. Johnson, D.B. and Wang, L. (2010) Imprints of the genetic code in the ribosome. Proc. Natl Acad. Sci. USA, 107, 8298–8303.

22. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. and Bourne, P.E. (2000) The protein data bank. Nucleic Acids Res., 28, 235–242.

23. Ben-Shem, A., de Loubresse, N.G., Melnikov, S., Jenner, L., Yusupova, G. and Yusupov, M. (2011) The structure of the eukaryotic ribosome at 3.0 angstrom resolution. Science, 334, 1524–1529.

24. Dunkle, J.A., Wang, L.Y., Feldman, M.B., Pulk, A., Chen, V.B., Kapral, G.J., Noeske, J., Richardson, J.S., Blanchard, S.C. and Cate, J.H.D. (2011) Structures of the bacterial ribosome in classical and hybrid states of tRNA binding. Science, 332, 981–984.

25. Polikanov, Y.S., Blaha, G.M. and Steitz, T.A. (2012) How hibernation factors RMF, HPF and YfiA turn off protein synthesis. Science, 336, 915–918.

26. Harms, J.M., Wilson, D.N., Schluenzen, F., Connell, S.R., Stachelhaus, T., Zaborowska, Z., Spahn, C.M. and Fucini, P. (2008) Translational regulation via L11: molecular switches on the ribosome turned on and off by thiostrepton and micrococcin. Mol. Cell, 30, 26–38.

27. Miyazawa, S. and Jernigan, R.L. (1985) Estimation of effective interresidue contact energies from protein crystal-structures - quasi-chemical approximation. Macromolecules, 18, 534–552.

28. Donald, J.E., Chen, W.W. and Shakhnovich, E.I. (2007) Energetics of protein-DNA interactions. Nucleic Acids Res., 35, 1039–1047.

29. Jonikas, M.A., Radmer, R.J., Laederach, A., Das, R., Pearlman, S., Herschlag, D. and Altman, R.B. (2009) Coarse-grained modeling of large RNA molecules with knowledge-based potentials and structural filters. RNA, 15, 189–199.

30. Perez-Cano, L., Solernou, A., Pons, C. and Fernandez-Recio, J. (2010) Structural prediction of protein-RNA interaction by computational docking with propensity-based statistical potentials. Pac. Symp. Biocomput., 15, 269–280.

31. Tuszynska, I. and Bujnicki, J.M. (2011) DARS-RNP and QUASI-RNP: new statistical potentials for protein-RNA docking. BMC Bioinformatics, 12, 348.

32. Gasteiger, E., Gattiker, A., Hoogland, C., Ivanyi, I., Appel, R.D. and Bairoch, A. (2003) ExPASy: the proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res., 31, 3784–3788.

33. Dosztanyi, Z., Csizmok, V., Tompa, P. and Simon, I. (2005) IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics, 21, 3433–3434.

34. Huang da, W., Sherman, B.T. and Lempicki, R.A. (2009) Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc., 4, 44–57.

35. The PyMOL Molecular Graphics System, Version 1.3r1 (2010) Schrodinger, LLC. http://www.pymol.org/citing (4 July 2013, date last accessed).

36. Treger, M. and Westhof, E. (2001) Statistical analysis of atomic contacts at RNA-protein interfaces. J. Mol. Recognit., 14, 199–214.

37. Hoffman, M.M., Khrapov, M.A., Cox, J.C., Yao, J., Tong, L. and Ellington, A.D. (2004) AANT: the amino acid-nucleotide interaction database. Nucleic Acids Res., 32, D174–D181.

38. Gupta, A. and Gribskov, M. (2011) The role of RNA sequence and structure in RNA-protein interactions. J. Mol. Biol., 409, 574–587.

39. Fernandez, M., Kumagai, Y., Standley, D.M., Sarai, A., Mizuguchi, K. and Ahmad, S. (2011) Prediction of dinucleotide-specific RNA-binding sites in proteins. BMC Bioinformatics, 12(Suppl. 13), S5.

40. Copley, S.D., Smith, E. and Morowitz, H.J. (2005) A mechanism for the association of amino acids with their codons and the origin of the genetic code. Proc. Natl Acad. Sci. USA, 102, 4442–4447.

41. Iwakiri, J., Tateishi, H., Chakraborty, A., Patil, P. and Kenmochi, N. (2012) Dissecting the protein-RNA interface: the role of protein surface shapes and RNA secondary structures in protein-RNA recognition. Nucleic Acids Res., 40, 3299–3306.

42. Woese, C.R., Dugre, D.H., Saxinger, W.C. and Dugre, S.A. (1966) The molecular basis for the genetic code. Proc. Natl Acad. Sci. USA, 55, 966–974.

43. Woese, C.R. (1973) Evolution of the genetic code. Naturwissenschaften, 60, 447–459.

44. Mathew, D.C. and Luthey-Schulten, Z. (2008) On the physical basis of the amino acid polar requirement. J. Mol. Evol., 66, 519–528.

45. Noller, H.F. (2012) Evolution of protein synthesis from an RNA world. Cold Spring Harb. Perspect. Biol., 4, 1–U20.

46. Trifonov, E.N., Kirzhner, A., Kirzhner, V.M. and Berezovsky, I.N. (2001) Distinct stages of protein evolution as suggested by protein sequence analysis. J. Mol. Evol., 53, 394–401.

47. Vaquerizas, J.M., Kummerfeld, S.K., Teichmann, S.A. and Luscombe, N.M. (2009) A census of human transcription factors: function, expression and evolution. Nat. Rev. Genet., 10, 252–263.

48. Sonenberg, N. and Hinnebusch, A.G. (2009) Regulation of translation initiation in eukaryotes: mechanisms and biological targets. Cell, 136, 731–745.

49. Lecuyer, E., Yoshida, H., Parthasarathy, N., Alm, C., Babak, T., Cerovina, T., Hughes, T.R., Tomancak, P. and Krause, H.M. (2007) Global analysis of mRNA localization reveals a prominent role in organizing cellular architecture and function. Cell, 131, 174–187.

50. Martin, K.C. and Ephrussi, A. (2009) mRNA localization: gene expression in the spatial dimension. Cell, 136, 719–730.

51. Moore, M.J. and Proudfoot, N.J. (2009) Pre-mRNA processing reaches back to transcription and ahead to translation. Cell, 136, 688–700.

52. Glisovic, T., Bachorik, J.L., Yong, J. and Dreyfuss, G. (2008) RNA-binding proteins and post-transcriptional gene regulation. FEBS Lett., 582, 1977–1986.

53. Bellucci, M., Agostini, F., Masin, M. and Tartaglia, G.G. (2011) Predicting protein associations with long noncoding RNAs. Nat. Methods, 8, 444–445.

54. Rinn, J.L. and Chang, H.Y. (2012) Genome regulation by long noncoding RNAs. Annu. Rev. Biochem., 81, 145–166.

Nucleic Acids Research 2013 Vol 41 No 18 8443

a significant matching of profiles is detected at the level ofprimary sequences we propose that the presence of ex-tended unstructured protein and mRNA segments maybe required for such binding This suggestion agrees wellwith recent knowledge-based studies where RNA loopsand bulges were found to be more likely to interact withamino-acid side chains in a specific manner (3841)How sensitive is the level of matching to the choice of

cutoff distance used to define contacting amino acids and

nucleobases in proteinRNA complexes To address thisquestion we have repeated the aforementioned analysisfor a range of different cutoff values going from 6 to10 A in steps of 025 A (Figure 4) Overall for set lsquo2+rsquoour findings are largely robust to the choice of the exactcutoff in this range albeit with a somewhat lower level ofsignificance for longer cutoffs However the majority ofthe signal is lost if one uses the lsquo1+rsquo set except forG-preference and PUR-content (Figure 5A) and

Figure 2 Relationship between nucleobase-binding preferences of amino acids and mRNA content at multiple levels (A) Correlation between Ginteraction preferences of amino acids (Figure 1B) and the average PUR content of their codons in mRNAs of the entire human proteome(B) Pairwise Pearson correlation coefficients (R) between base-binding preference scales of amino acids (lsquosclrsquo) and average base content of theircodons (lsquocdnrsquo) (C) Distributions of correlation coefficients (R) between window-averaged PUR-content profiles of individual mRNA coding sequencesand window-averaged G-preference sequence profiles of the respective proteins for the entire human proteome (window-size=21) The dashed curvedepicts the distribution of correlation coefficients calculated for a typical randomized G-preference scale Inset the distribution of the means of sequence-profile correlation coefficients for the human proteome (ltRgt) calculated for 106 randomized G-preference scales The R for the original G-preferencescale is shown with an arrow (D) Typical (R=Rmedian) and best pairs of mRNA PUR-content (black curves) and protein G-preference profiles (reddashed curves) for human proteins (E) Median pairwise Pearson correlation coefficients for comparison between nucleobase content profiles of mRNAs(subscript lsquomRNArsquo x-axis) and base-preference-weighted protein sequence profiles (subscript lsquoproteinrsquo y-axis) over the entire human proteome Allresults are based on the analysis of set lsquo2+rsquo statistics All data reported for preference scales are obtained using an 8 A cutoff

8438 Nucleic Acids Research 2013 Vol 41 No 18

A-preference and PYR-content (Supplementary TableS2) This observation strongly suggests that close densepacking of nucleobases around amino acids may berequired for specificity in cognate complex formationAlthough interfaces may be dynamic and liquid-like aswe have suggested before they may still need to bedensely packed Interestingly if one reduces the 2+ setby including only the two closest bases in contact with agiven amino acid (set lsquo2rsquo) the signal for G-preferencePUR-content even further improves by several percentagepoints (Figure 5A) and the same holds for C-preferenceC-content and A-preferencePYR-content (Supplemen-tary Table S2)

To further study the role of protein structural disorder in matching, we have analyzed the levels of predicted disorder of the top and the bottom 10% of proteins, ranked by the degree of mRNA/protein profile matching as captured by the Pearson R coefficient (see ‘Materials and Methods’ section). We have done this for the six cases of direct comparison, whereby the same base type is used for both protein preference and mRNA profile density (G_protein-G_mRNA, A_protein-A_mRNA, C_protein-C_mRNA, U_protein-U_mRNA, PUR_protein-PUR_mRNA and PYR_protein-PYR_mRNA), and also for the case displaying the strongest signal in our analysis (G_protein-PUR_mRNA). Importantly, in the case of G_protein-G_mRNA, A_protein-A_mRNA and C_protein-C_mRNA matching, we do observe a pronounced tendency for the top and the bottom 10% cohorts to be significantly enriched (top 10%) and depleted (bottom 10%) in disordered proteins (Supplementary Table S3), whereas in the case of U_protein-U_mRNA matching the situation is reversed. Interestingly, for PUR_protein-PUR_mRNA, PYR_protein-PYR_mRNA and G_protein-PUR_mRNA matching, one observes slight disorder enrichment in both top and bottom cohorts. The most prominent shift of the distribution of predicted average disorder toward higher disorder, as compared with background, is observed for the top 10% cohort of proteins displaying strong matching between C-preference profiles of their sequences and the C-content of their cognate mRNAs (C_protein-C_mRNA; Supplementary Table S3, Supplementary Figure S3). One might argue that this effect could just be related to compositional properties of such protein and mRNA pairs, whereby disordered proteins are simply encoded by C-rich sequences. However, the differences between nucleobase compositions of mRNAs from the C_protein-C_mRNA top 10% cohort and the complete proteome are minor, suggesting that the underlying explanation might be more complex (Supplementary Figure S3).

Which biological functions might be associated with a high level of complementarity between proteins and cognate mRNAs? To address this question, we have performed GO analysis for seven different top 10% subsets of proteins displaying strong matching with cognate mRNAs

Figure 3. PYR/PUR mRNA sequence profiles strongly match PYR/PUR-preferences of cognate protein sequences. PYR (A and B) and PUR (C and D) amino-acid preference scales are given in Figure 1B. For details, please see the analogous captions to Figure 2C and D.


(see ‘Materials and Methods’ section for details). In Supplementary Table S4, we report the most significantly enriched biological functions (using a P-value cutoff of 10^-10) shared by proteins from the analyzed cohorts. In striking agreement with our hypothesis, in most cases we observe pronounced enrichment of terms related to nucleic-acid/protein interactions, including regulation of RNA metabolic processes, ribonucleoprotein complexes and transcription. The latter, in particular, allows one to speculate that protein tendencies to associate with cognate mRNA might be used by cells to modulate gene expression pathways. What is more, PUR or PYR density profiles of mRNAs are identical to PUR or PYR density profiles of coding-strand DNA sequences (with Us being replaced by Ts). Although, based on our statistical potentials, we cannot say anything about T-binding preferences of amino acids, it is possible that our results may be generalizable even to DNA-protein interactions as well as other RNA-protein interactions. One should also mention that, depending on the particular type of matching, other biological functions also tend to be enriched. For instance, the U_protein-U_mRNA top 10% subset displays significant enrichment of membrane proteins, whereas the G_protein-PUR_mRNA top cohort seems to be populated by extracellular proteins, particularly those involved in the functioning of the innate immune system. Altogether, our preliminary GO analysis illustrates significant functional differences between proteins that strongly complement their cognate mRNAs and the rest of the human proteome, and these findings will be further explored in another manuscript.
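The top and bottom 10% cohorts used in the disorder and GO analyses above amount to ranking proteins by their profile-matching R and cutting the ranked list at both ends. A minimal sketch follows, assuming per-protein R values and per-protein mean predicted disorder scores are already available as dictionaries. The function names and the simple difference-of-means summary are assumptions made for illustration; the statistical tests behind Supplementary Table S3 are not reproduced here.

```python
from statistics import mean

def split_cohorts(r_by_protein, fraction=0.10):
    """Rank proteins by Pearson R (highest first) and return the
    (top, bottom) cohorts, each holding `fraction` of all proteins."""
    ranked = sorted(r_by_protein, key=r_by_protein.get, reverse=True)
    n = max(1, round(len(ranked) * fraction))
    return ranked[:n], ranked[-n:]

def disorder_shift(cohort, disorder_by_protein):
    """Mean predicted disorder of a cohort minus the mean over all proteins.
    Positive values suggest enrichment in disorder, negative values depletion."""
    background = mean(list(disorder_by_protein.values()))
    return mean([disorder_by_protein[p] for p in cohort]) - background
```

The returned cohort lists could then be handed to any enrichment tool as gene lists, with the complete proteome as background.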

DISCUSSION

High levels of matching between base-binding-preference profiles of proteins and PYR- or PUR-density profiles of cognate mRNA-coding sequences, defined primarily by amino acid preferences to co-localize with G and C bases at RNA/protein interfaces, allow one to speculate that direct complementary binding interactions may be a key element underlying the whole mRNA/protein relationship when it comes to both its evolutionary development and present-day biology (Figure 5B). This agrees well with, and significantly extends, our previous findings, where we have shown that protein sequence profiles of amino acid affinity for PYR analogs (42–44) mirror PYR density profiles of cognate mRNA sequences

Figure 5. Physico-chemical origins of the mRNA/protein relationship. (A) Correlation coefficients (R and ⟨R⟩ with standard deviations) between PYR or PUR average codon content (‘Codon content’) and respective mRNA profiles (‘Profiles’) calculated for G- (blue), PUR- (red) and PYR- (green) binding preferences of amino acids, which were obtained using different amino acid neighbor statistics (1+, 2+ or 2). (B) A model of physico-chemical complementarity between proteins and cognate mRNAs. Preferential interactions of amino acids with PYR or PUR define their codon content in the genetic table and facilitate complementary interactions between PYR/PUR-rich mRNA regions and PYR/PUR-preferring regions in proteins. The opposite behavior of adenines and guanines adds an additional layer of complexity in the case of PURs, as signified by dashed arrows in the model. Note: polymer sizes not drawn to scale.

Figure 4. Effect of cutoff radius used to define protein–RNA contacts on observed correlations. (A) Dependence of Pearson correlation coefficients (R) between amino acid preference scales and average codon content on the cutoff radius for the two sets of statistics studied (‘1+’, ‘2+’). The total number of unique contacts in ‘1+’ and ‘2+’ (given in parentheses) sets obtained for each of the cutoff radii used is indicated at the top of the panel. (B) Cutoff radius dependence of median pairwise Pearson correlation coefficients (Rmedian) for comparison between nucleobase content profiles of mRNAs and base-preference-weighted protein sequence profiles over the entire human proteome (color code the same as in panel A).


(15). It should be emphasized, however, that our present results are based exclusively on the statistics of direct amino acid/nucleobase contacts at RNA/protein interfaces. It is therefore still possible that the driving force for interactions between mRNAs and cognate proteins is non-specific (e.g. binding of positively charged amino acid side chains to RNA phosphate groups), whereas complementary interactions actually confer specificity to binding.

Moreover, our results provide a clear evolutionary perspective concerning the physico-chemical origins of translation, in line with the stereo-chemical hypothesis of the origin of the genetic code (16–21). In particular, our results give strong support to the possibility of direct templating of proteins from mRNAs in the era before the development of ribosomal decoding, and the code’s fixation in that era (17,45). In this framework, ancient amino acids associated with mRNA directly, following their intrinsic physico-chemical preferences as outlined here. However, the fact that an analogous effect is not seen for all bases, especially adenine and uracil, supports the possibility that, in addition to physico-chemical rationales in the context of direct binding, other evolutionary forces were also responsible for shaping the genetic code, as suggested before (19). Our results are most consistent with the possibility that the early stereo-chemical phase in the code’s development was dominated by G- and C-rich codons, as the strongest correlations are seen for precisely these bases. If the basic structure of the early genetic code was defined by such codons, but was later modulated by the inclusion of A and U bases, this might explain why G-affinity of amino acids in present-day protein sequences closely follows PUR density profiles in cognate mRNAs. Interestingly, Trifonov and coworkers have suggested that the first codons were G- and C-rich on the basis of a consensus analysis of 40 different criteria (46).

Importantly, it should be emphasized that the stereo-chemical hypothesis of the code’s origin may differ from the cognate mRNA/protein complementary interaction hypothesis in terms of its evolutionary underpinnings. Direct templating of proteins from mRNAs in ancient systems (the coding aspect of the stereo-chemical hypothesis) does not necessarily imply that modern proteins directly interact with their own mRNA (complementary interaction hypothesis). However, our findings support the possibility that the origin of the genetic code and potential complementarity between proteins and cognate mRNAs might have the same physico-chemical background. It is well possible that other, independent influences have shaped both effects, and the two hypotheses leave ample room for such refinements. However, we would like to stress that, in our view, the two hypotheses are interlinked: cognate binding is, on the one hand, a reasonable consequence of the stereo-chemical hypothesis, but on the other hand, it also gives a potential biological rationale for the early development of the code to begin with, such as stabilization of RNA structures by bound polypeptides, as has been suggested before (45).

There are a number of open challenges concerning the aforementioned proposal. First and foremost, the structural features of mRNAs and cognate proteins impose severe constraints on any putative complementarity between the two. Namely, with the contour length of the mRNA coding part being 4.5 times longer than that of a cognate protein, it is not clear what structural arrangements may be consistent with any complementary interactions. We would like to suggest that structures of such complexes may be dynamic and liquid-like, with mRNA stretches enveloping and solubilizing cognate protein stretches (15). Second, with many mRNAs and proteins being well-folded and compact for most of the time, it remains to be studied when and how opportunities could arise for the complementarity between their primary sequences to be of relevance. It is possible that, if at all realistic, such complementary binding might be functionally important precisely in those situations where both polymers are unstructured, such as during translation, export and degradation, as a consequence of thermal stress, or in the case of intrinsically unstructured proteins. However, we do not exclude the possibility of complementary interactions even in the folded state. Finally, concerning the origin of the genetic code, it is not clear how the final, well-defined structure of the code could have arisen based on still partially non-specific, large-scale binding interactions between mRNAs and cognate proteins. As suggested before, it is possible that the answer lies in a combination of different influences (19). Future research should shed light on these and related questions.

These challenges notwithstanding, our findings provide strong evidence that the ability to interact with mRNA might be a widespread phenomenon in the cell, involving not only cognate proteins but also other proteins, based on similar principles. The potential significance of such physico-chemical complementarity between mRNAs and proteins extends to all facets of nucleic acid and protein biology in the modern cell, including transcription/translation regulation (9,10,47,48), mRNA transport and localization (49,50), processing and decay (51), structure of ribonucleoproteins (52) and others (2–5,53,54). Our preliminary GO analysis has demonstrated a significant enrichment of functions related to association with nucleic acids for the subsets of proteins that complement their cognate mRNAs strongly, and these findings will be explored in more detail in future work.
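The contour-length mismatch raised among the open challenges above can be checked with a back-of-the-envelope estimate. The sketch below assumes fully extended chains, with a per-nucleotide contour length of about 0.6 nm for single-stranded RNA and about 0.38 nm per amino acid residue. These are typical textbook-scale values adopted here as assumptions, not numbers taken from the present analysis. For a protein of N residues encoded by 3N nucleotides:

```latex
\[
\frac{L_{\mathrm{mRNA}}}{L_{\mathrm{protein}}}
  = \frac{3N\,\ell_{\mathrm{nt}}}{N\,\ell_{\mathrm{aa}}}
  = \frac{3\,\ell_{\mathrm{nt}}}{\ell_{\mathrm{aa}}}
  \approx \frac{3 \times 0.6\ \mathrm{nm}}{0.38\ \mathrm{nm}}
  \approx 4.7
\]
```

The ratio is independent of N and falls between 4 and 5 for these assumed segment lengths, which is the scale of the mismatch discussed above.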

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

ACKNOWLEDGEMENTS

The authors thank S. Dorner, R. Schroeder, A. Vaziri, G. Warren and members of the Laboratory of Computational Biophysics at MFPL for useful advice and critical reading of the manuscript.

FUNDING

This work was supported in part by the Austrian Science Fund FWF [START grant Y514-B11 to B.Z.]; European Research Council [ERC Starting Independent grant 279408 to B.Z.]. Funding for open access charge: Austrian Science Fund FWF.

Conflict of interest statement. None declared.

REFERENCES

1. Brenner,S., Jacob,F. and Meselson,M. (1961) An unstable intermediate carrying information from genes to ribosomes for protein synthesis. Nature, 190, 576–581.

2. Anders,G., Mackowiak,S.D., Jens,M., Maaskola,J., Kuntzagk,A., Rajewsky,N., Landthaler,M. and Dieterich,C. (2012) doRiNA: a database of RNA interactions in post-transcriptional regulation. Nucleic Acids Res., 40, D180–D186.

3. Baltz,A.G., Munschauer,M., Schwanhausser,B., Vasile,A., Murakawa,Y., Schueler,M., Youngs,N., Penfold-Brown,D., Drew,K., Milek,M. et al. (2012) The mRNA-bound proteome and its global occupancy profile on protein-coding transcripts. Mol. Cell, 46, 674–690.

4. Castello,A., Fischer,B., Eichelbaum,K., Horos,R., Beckmann,B.M., Strein,C., Davey,N.E., Humphreys,D.T., Preiss,T., Steinmetz,L.M. et al. (2012) Insights into RNA biology from an atlas of mammalian mRNA-binding proteins. Cell, 149, 1393–1406.

5. Konig,J., Zarnack,K., Luscombe,N.M. and Ule,J. (2012) Protein-RNA interactions: new genomic technologies and perspectives. Nat. Rev. Genet., 13, 221–221.

6. Mitchell,S.F., Jain,S., She,M. and Parker,R. (2013) Global analysis of yeast mRNPs. Nat. Struct. Mol. Biol., 20, 127–133.

7. Weber,S.C. and Brangwynne,C.P. (2012) Getting RNA and protein in phase. Cell, 149, 1188–1191.

8. Han,T.N.W., Kato,M., Xie,S.H., Wu,L.C., Mirzaei,H., Pei,J.M., Chen,M., Xie,Y., Allen,J., Xiao,G.H. et al. (2012) Cell-free formation of RNA granules: bound RNAs identify features and components of cellular assemblies. Cell, 149, 768–779.

9. Kyrpides,N.C. and Ouzounis,C.A. (1993) Mechanisms of specificity in messenger-RNA degradation - autoregulation and cognate interactions. J. Theor. Biol., 163, 373–392.

10. Ouzounis,C.A. and Kyrpides,N.C. (1994) Reverse interpretation - a hypothetical selection mechanism for adaptive mutagenesis based on autoregulated messenger-RNA stability. J. Theor. Biol., 167, 373–379.

11. Chu,E., Copur,S.M., Ju,J., Chen,T.M., Khleif,S., Voeller,D.M., Mizunuma,N., Patel,M., Maley,G.F., Maley,F. et al. (1999) Thymidylate synthase protein and p53 mRNA form an in vivo ribonucleoprotein complex. Mol. Cell. Biol., 19, 1582–1594.

12. Tai,N., Schmitz,J.C., Liu,J., Lin,X., Bailly,M., Chen,T.M. and Chu,E. (2004) Translational autoregulation of thymidylate synthase and dihydrofolate reductase. Front. Biosci., 9, 2521–2526.

13. Schuttpelz,M., Schoning,J.C., Doose,S., Neuweiler,H., Peters,E., Staiger,D. and Sauer,M. (2008) Changes in conformational dynamics of mRNA upon AtGRP7 binding studied by fluorescence correlation spectroscopy. J. Am. Chem. Soc., 130, 9507–9513.

14. Zhao,X., Liu,M., Wu,N., Ding,L., Liu,H. and Lin,X. (2010) Recovery of recombinant zebrafish p53 protein from inclusion bodies and its binding activity to p53 mRNA in vitro. Protein Expr. Purif., 72, 262–266.

15. Hlevnjak,M., Polyansky,A.A. and Zagrovic,B. (2012) Sequence signatures of direct complementarity between mRNAs and cognate proteins on multiple levels. Nucleic Acids Res., 40, 8874–8882.

16. Woese,C.R. (1965) Order in genetic code. Proc. Natl Acad. Sci. USA, 54, 71–75.

17. Woese,C.R. (1965) On the evolution of the genetic code. Proc. Natl Acad. Sci. USA, 54, 1546–1552.

18. Yarus,M. (1998) Amino acids as RNA ligands: a direct-RNA-template theory for the code’s origin. J. Mol. Evol., 47, 109–117.

19. Koonin,E.V. and Novozhilov,A.S. (2009) Origin and evolution of the genetic code: the universal enigma. IUBMB Life, 61, 99–111.

20. Yarus,M., Widmann,J.J. and Knight,R. (2009) RNA-amino acid binding: a stereochemical era for the genetic code. J. Mol. Evol., 69, 406–429.

21. Johnson,D.B. and Wang,L. (2010) Imprints of the genetic code in the ribosome. Proc. Natl Acad. Sci. USA, 107, 8298–8303.

22. Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The protein data bank. Nucleic Acids Res., 28, 235–242.

23. Ben-Shem,A., de Loubresse,N.G., Melnikov,S., Jenner,L., Yusupova,G. and Yusupov,M. (2011) The structure of the eukaryotic ribosome at 3.0 angstrom resolution. Science, 334, 1524–1529.

24. Dunkle,J.A., Wang,L.Y., Feldman,M.B., Pulk,A., Chen,V.B., Kapral,G.J., Noeske,J., Richardson,J.S., Blanchard,S.C. and Cate,J.H.D. (2011) Structures of the bacterial ribosome in classical and hybrid states of tRNA binding. Science, 332, 981–984.

25. Polikanov,Y.S., Blaha,G.M. and Steitz,T.A. (2012) How hibernation factors RMF, HPF, and YfiA turn off protein synthesis. Science, 336, 915–918.

26. Harms,J.M., Wilson,D.N., Schluenzen,F., Connell,S.R., Stachelhaus,T., Zaborowska,Z., Spahn,C.M. and Fucini,P. (2008) Translational regulation via L11: molecular switches on the ribosome turned on and off by thiostrepton and micrococcin. Mol. Cell, 30, 26–38.

27. Miyazawa,S. and Jernigan,R.L. (1985) Estimation of effective interresidue contact energies from protein crystal-structures - quasi-chemical approximation. Macromolecules, 18, 534–552.

28. Donald,J.E., Chen,W.W. and Shakhnovich,E.I. (2007) Energetics of protein-DNA interactions. Nucleic Acids Res., 35, 1039–1047.

29. Jonikas,M.A., Radmer,R.J., Laederach,A., Das,R., Pearlman,S., Herschlag,D. and Altman,R.B. (2009) Coarse-grained modeling of large RNA molecules with knowledge-based potentials and structural filters. RNA, 15, 189–199.

30. Perez-Cano,L., Solernou,A., Pons,C. and Fernandez-Recio,J. (2010) Structural prediction of protein-RNA interaction by computational docking with propensity-based statistical potentials. Pac. Symp. Biocomput., 15, 269–280.

31. Tuszynska,I. and Bujnicki,J.M. (2011) DARS-RNP and QUASI-RNP: new statistical potentials for protein-RNA docking. BMC Bioinformatics, 12, 348.

32. Gasteiger,E., Gattiker,A., Hoogland,C., Ivanyi,I., Appel,R.D. and Bairoch,A. (2003) ExPASy: the proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res., 31, 3784–3788.

33. Dosztanyi,Z., Csizmok,V., Tompa,P. and Simon,I. (2005) IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics, 21, 3433–3434.

34. Huang da,W., Sherman,B.T. and Lempicki,R.A. (2009) Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc., 4, 44–57.

35. The PyMOL Molecular Graphics System, Version 1.3r1. (2010) Schrodinger, LLC. http://www.pymol.org/citing (4 July 2013, date last accessed).

36. Treger,M. and Westhof,E. (2001) Statistical analysis of atomic contacts at RNA-protein interfaces. J. Mol. Recognit., 14, 199–214.

37. Hoffman,M.M., Khrapov,M.A., Cox,J.C., Yao,J., Tong,L. and Ellington,A.D. (2004) AANT: the amino acid-nucleotide interaction database. Nucleic Acids Res., 32, D174–D181.

38. Gupta,A. and Gribskov,M. (2011) The role of RNA sequence and structure in RNA-protein interactions. J. Mol. Biol., 409, 574–587.

39. Fernandez,M., Kumagai,Y., Standley,D.M., Sarai,A., Mizuguchi,K. and Ahmad,S. (2011) Prediction of dinucleotide-specific RNA-binding sites in proteins. BMC Bioinformatics, 12(Suppl. 13), S5.

40. Copley,S.D., Smith,E. and Morowitz,H.J. (2005) A mechanism for the association of amino acids with their codons and the origin of the genetic code. Proc. Natl Acad. Sci. USA, 102, 4442–4447.

41. Iwakiri,J., Tateishi,H., Chakraborty,A., Patil,P. and Kenmochi,N. (2012) Dissecting the protein-RNA interface: the role of protein surface shapes and RNA secondary structures in protein-RNA recognition. Nucleic Acids Res., 40, 3299–3306.


42. Woese,C.R., Dugre,D.H., Saxinger,W.C. and Dugre,S.A. (1966) The molecular basis for the genetic code. Proc. Natl Acad. Sci. USA, 55, 966–974.

43. Woese,C.R. (1973) Evolution of the genetic code. Naturwissenschaften, 60, 447–459.

44. Mathew,D.C. and Luthey-Schulten,Z. (2008) On the physical basis of the amino acid polar requirement. J. Mol. Evol., 66, 519–528.

45. Noller,H.F. (2012) Evolution of protein synthesis from an RNA world. Cold Spring Harb. Perspect. Biol., 4, 1–U20.

46. Trifonov,E.N., Kirzhner,A., Kirzhner,V.M. and Berezovsky,I.N. (2001) Distinct stages of protein evolution as suggested by protein sequence analysis. J. Mol. Evol., 53, 394–401.

47. Vaquerizas,J.M., Kummerfeld,S.K., Teichmann,S.A. and Luscombe,N.M. (2009) A census of human transcription factors: function, expression and evolution. Nat. Rev. Genet., 10, 252–263.

48. Sonenberg,N. and Hinnebusch,A.G. (2009) Regulation of translation initiation in eukaryotes: mechanisms and biological targets. Cell, 136, 731–745.

49. Lecuyer,E., Yoshida,H., Parthasarathy,N., Alm,C., Babak,T., Cerovina,T., Hughes,T.R., Tomancak,P. and Krause,H.M. (2007) Global analysis of mRNA localization reveals a prominent role in organizing cellular architecture and function. Cell, 131, 174–187.

50. Martin,K.C. and Ephrussi,A. (2009) mRNA localization: gene expression in the spatial dimension. Cell, 136, 719–730.

51. Moore,M.J. and Proudfoot,N.J. (2009) Pre-mRNA processing reaches back to transcription and ahead to translation. Cell, 136, 688–700.

52. Glisovic,T., Bachorik,J.L., Yong,J. and Dreyfuss,G. (2008) RNA-binding proteins and post-transcriptional gene regulation. FEBS Lett., 582, 1977–1986.

53. Bellucci,M., Agostini,F., Masin,M. and Tartaglia,G.G. (2011) Predicting protein associations with long noncoding RNAs. Nat. Methods, 8, 444–445.

54. Rinn,J.L. and Chang,H.Y. (2012) Genome regulation by long noncoding RNAs. Annu. Rev. Biochem., 81, 145–166.


5 KonigJ ZarnackK LuscombeNM and UleJ (2012) Protein-RNA interactions new genomic technologies and perspectivesNat Rev Genet 13 221ndash221

6 MitchellSF JainS SheM and ParkerR (2013) Globalanalysis of yeast mRNPs Nat Struct Mol Biol 20 127ndash133

7 WeberSC and BrangwynneCP (2012) Getting RNA andprotein in phase Cell 149 1188ndash1191

8 HanTNW KatoM XieSH WuLC MirzaeiH PeiJMChenM XieY AllenJ XiaoGH et al (2012) Cell-freeformation of RNA granules bound RNAs identify features andcomponents of cellular assemblies Cell 149 768ndash779

9 KyrpidesNC and OuzounisCA (1993) Mechanisms ofspecificity in messenger-rna degradation - autoregulation andcognate interactions J Theor Biol 163 373ndash392

10 OuzounisCA and KyrpidesNC (1994) Reverse interpretation-ahypothetical selection mechanism for adaptive mutagenesis basedon autoregulated messenger-RNA stability J Theor Biol 167373ndash379

11 ChuE CopurSM JuJ ChenTM KhleifS VoellerDMMizunumaN PatelM MaleyGF MaleyF et al (1999)Thymidylate synthase protein and p53 mRNA form an in vivoribonucleoprotein complex Mol Cell Biol 19 1582ndash1594

12 TaiN SchmitzJC LiuJ LinX BaillyM ChenTM andChuE (2004) Translational autoregulation of thymidylatesynthase and dihydrofolate reductase Front Biosci 9 2521ndash2526

13 SchuttpelzM SchoningJC DooseS NeuweilerH PetersEStaigerD and SauerM (2008) Changes in conformationaldynamics of mRNA upon AtGRP7 binding studied byfluorescence correlation spectroscopy J Am Chem Soc 1309507ndash9513

14 ZhaoX LiuM WuN DingL LiuH and LinX (2010)Recovery of recombinant zebrafish p53 protein from inclusionbodies and its binding activity to p53 mRNA in vitro ProteinExpr Purif 72 262ndash266

15 HlevnjakM PolyanskyAA and ZagrovicB (2012) Sequencesignatures of direct complementarity between mRNAs andcognate proteins on multiple levels Nucleic Acids Res 408874ndash8882

16 WoeseCR (1965) Order in genetic code Proc Natl Acad SciUSA 54 71ndash75

17 WoeseCR (1965) On the evolution of the genetic code ProcNatl Acad Sci USA 54 1546ndash1552

18 YarusM (1998) Amino acids as RNA ligands A direct-RNA-template theory for the codersquos origin J Mol Evol 47 109ndash117

19 KooninEV and NovozhilovAS (2009) Origin and evolution ofthe genetic code the universal enigma IUBMB Life 61 99ndash111

20 YarusM WidmannJJ and KnightR (2009) RNA-amino acidbinding a stereochemical era for the genetic code J Mol Evol69 406ndash429

21 JohnsonDB and WangL (2010) Imprints of the genetic code inthe ribosome Proc Natl Acad Sci USA 107 8298ndash8303

22 BermanHM WestbrookJ FengZ GillilandG BhatTNWeissigH ShindyalovIN and BournePE (2000) The proteindata bank Nucleic Acids Res 28 235ndash242

23 Ben-ShemA de LoubresseNG MelnikovS JennerLYusupovaG and YusupovM (2011) The structure of theeukaryotic ribosome at 30 angstrom resolution Science 3341524ndash1529

24 DunkleJA WangLY FeldmanMB PulkA ChenVBKapralGJ NoeskeJ RichardsonJS BlanchardSC andCateJHD (2011) Structures of the bacterial ribosome inclassical and hybrid states of tRNA binding Science 332981ndash984

25 PolikanovYS BlahaGM and SteitzTA (2012) Howhibernation factors RMF HPF and YfiA turn off proteinsynthesis Science 336 915ndash918

26 HarmsJM WilsonDN SchluenzenF ConnellSRStachelhausT ZaborowskaZ SpahnCM and FuciniP (2008)Translational regulation via L11 molecular switches on theribosome turned on and off by thiostrepton and micrococcinMol Cell 30 26ndash38

27 MiyazawaS and JerniganRL (1985) Estimation of effectiveinterresidue contact energies from protein crystal-structures -quasi-chemical approximation Macromolecules 18 534ndash552

28 DonaldJE ChenWW and ShakhnovichEI (2007) Energeticsof protein-DNA interactions Nucleic Acids Res 35 1039ndash1047

29 JonikasMA RadmerRJ LaederachA DasR PearlmanSHerschlagD and AltmanRB (2009) Coarse-grained modeling oflarge RNA molecules with knowledge-based potentials andstructural filters RNA 15 189ndash199

30 Perez-CanoL SolernouA PonsC and Fernandez-RecioJ(2010) Structural prediction of protein-RNA interaction bycomputational docking with propensity-based statistical potentialsPac Symp Biocomput 15 269ndash280

31 TuszynskaI and BujnickiJM (2011) DARS-RNP and QUASI-RNP new statistical potentials for protein-RNA docking BMCBioinformatics 12 348

32 GasteigerE GattikerA HooglandC IvanyiI AppelRD andBairochA (2003) ExPASy the proteomics server for in-depthprotein knowledge and analysis Nucleic Acids Res 31 3784ndash3788

33 DosztanyiZ CsizmokV TompaP and SimonI (2005) IUPredweb server for the prediction of intrinsically unstructured regionsof proteins based on estimated energy content Bioinformatics 213433ndash3434

34 Huang daW ShermanBT and LempickiRA (2009) Systematicand integrative analysis of large gene lists using DAVIDbioinformatics resources Nat Protoc 4 44ndash57

35 The PyMOL Molecular Graphics System Version 13r1 (2010)Schrodinger LLC httpwwwpymolorgciting (4 July 2013date last accessed)

36 TregerM and WesthofE (2001) Statistical analysis of atomiccontacts at RNA-protein interfaces J Mol Recognit 14199ndash214

37 HoffmanMM KhrapovMA CoxJC YaoJ TongL andEllingtonAD (2004) AANT the amino acid-nucleotideinteraction database Nucleic Acids Res 32 D174ndashD181

38 GuptaA and GribskovM (2011) The role of RNA sequence andstructure in RNAmdashprotein interactions J Mol Biol 409 574ndash587

39 FernandezM KumagaiY StandleyDM SaraiAMizuguchiK and AhmadS (2011) Prediction of dinucleotide-specific RNA-binding sites in proteins BMC Bioinformatics12(Suppl 13) S5

40 CopleySD SmithE and MorowitzHJ (2005) A mechanismfor the association of amino acids with their codons and theorigin of the genetic code Proc Natl Acad Sci USA 1024442ndash4447

41 IwakiriJ TateishiH ChakrabortyA PatilP and KenmochiN(2012) Dissecting the protein-RNA interface the role of proteinsurface shapes and RNA secondary structures in protein-RNArecognition Nucleic Acids Res 40 3299ndash3306

8442 Nucleic Acids Research 2013 Vol 41 No 18

42 WoeseCR DugreDH SaxingerWC and DugreSA (1966)The molecular basis for the genetic code Proc Natl Acad SciUSA 55 966ndash974

43 WoeseCR (1973) Evolution of the genetic codeNaturwissenschaften 60 447ndash459

44 MathewDC and Luthey-SchultenZ (2008) On the physicalbasis of the amino acid polar requirement J Mol Evol 66519ndash528

45 NollerHF (2012) Evolution of protein synthesis from an RNAworld Cold Spring Harb Perspect Biol 4 1ndashU20

46 TrifonovEN KirzhnerA KirzhnerVM and BerezovskyIN(2001) Distinct stages of protein evolution as suggested by proteinsequence analysis J Mol Evol 53 394ndash401

47 VaquerizasJM KummerfeldSK TeichmannSA andLuscombeNM (2009) A census of human transcription factorsfunction expression and evolution Nat Rev Genet 10 252ndash263

48 SonenbergN and HinnebuschAG (2009) Regulation oftranslation initiation in eukaryotes mechanisms and biologicaltargets Cell 136 731ndash745

49 LecuyerE YoshidaH ParthasarathyN AlmC BabakTCerovinaT HughesTR TomancakP and KrauseHM (2007)Global analysis of mRNA localization reveals a prominentrole in organizing cellular architecture and function Cell 131174ndash187

50 MartinKC and EphrussiA (2009) mRNA localization geneexpression in the spatial dimension Cell 136 719ndash730

51 MooreMJ and ProudfootNJ (2009) Pre-mRNA processingreaches back to transcription and ahead to translation Cell 136688ndash700

52 GlisovicT BachorikJL YongJ and DreyfussG (2008) RNA-binding proteins and post-transcriptional gene regulation FEBSLett 582 1977ndash1986

53 BellucciM AgostiniF MasinM and TartagliaGG (2011)Predicting protein associations with long noncoding RNAs NatMethods 8 444ndash445

54 RinnJL and ChangHY (2012) Genome regulation by longnoncoding RNAs Annu Rev Biochem 81 145ndash166

Nucleic Acids Research 2013 Vol 41 No 18 8443

(see 'Materials and Methods' section for details). In Supplementary Table S4, we report the most significantly enriched biological functions (using a P-value cutoff of 10^-10) shared by proteins from the analyzed cohorts. In striking agreement with our hypothesis, in most cases we observe pronounced enrichment of terms related to nucleic acid–protein interactions, including regulation of RNA metabolic processes, ribonucleoprotein complexes and transcription. The latter, in particular, allows one to speculate that protein tendencies to associate with cognate mRNA might be used by the cells to modulate gene expression pathways. What is more, PUR or PYR density profiles of mRNAs are identical to PUR or PYR density profiles of coding-strand DNA sequences (with Us being replaced by Ts). Although, based on our statistical potentials, we cannot say anything about T-binding preferences of amino acids, it is possible that our results may be generalizable even to DNA–protein interactions, as well as other RNA–protein interactions. One should also mention that, depending on the particular type of matching, other biological functions also tend to be enriched. For instance, the U_protein-U_mRNA top 10 subset displays significant enrichment of membrane proteins, whereas the G_protein-PUR_mRNA top cohort seems to be populated by extracellular proteins, particularly those involved in the functioning of the innate immune system. Altogether, our preliminary GO analysis illustrates significant functional differences between proteins that strongly complement their cognate mRNAs and the rest of the human proteome, and these findings will be further explored in another manuscript.
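The statement that PUR density profiles of an mRNA and of its coding-strand DNA coincide can be checked with a minimal sketch. The function name `purine_density_profile`, the toy sequence and the window length of 3 are illustrative assumptions and are not taken from the analysis described here.

```python
def purine_density_profile(seq, window=3):
    """Sliding-window fraction of purines (A, G) along a nucleotide sequence.

    The window length is an arbitrary illustrative choice.
    """
    seq = seq.upper()
    if window < 1 or window > len(seq):
        raise ValueError("window must be between 1 and len(seq)")
    is_purine = [1 if base in "AG" else 0 for base in seq]
    return [
        sum(is_purine[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


mrna = "AUGGCUUACGGAAAGUGA"          # toy sequence, not a real transcript
coding_dna = mrna.replace("U", "T")  # coding-strand DNA equivalent

mrna_profile = purine_density_profile(mrna)
dna_profile = purine_density_profile(coding_dna)

# U and T are both pyrimidines, so the purine mask is unchanged.
profiles_identical = mrna_profile == dna_profile
print(profiles_identical)
```

Because the profile depends only on which positions hold A or G, swapping U for T cannot change it; the same holds for PYR density, which is simply one minus the PUR density at each window.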

Figure 4. Effect of cutoff radius used to define protein–RNA contacts on observed correlations. (A) Dependence of Pearson correlation coefficients (R) between amino acid preference scales and average codon content on the cutoff radius for the two sets of statistics studied ('1+', '2+'). The total number of unique contacts in '1+' and '2+' (given in parentheses) sets obtained for each of the used cutoff radii is indicated at the top of the panel. (B) Cutoff radius dependence of median pairwise Pearson correlation coefficients (Rmedian) for comparison between nucleobase content profiles of mRNAs and base-preference-weighted protein sequence profiles over the entire human proteome (color code the same as in panel A).

Figure 5. Physico-chemical origins of the mRNA–protein relationship. (A) Correlation coefficients (R and ⟨R⟩ with standard deviations) between PYR or PUR average codon content ('Codon content') and respective mRNA profiles ('Profiles') calculated for G- (blue), PUR- (red) and PYR- (green) binding preferences of amino acids, which were obtained using different amino acid neighbor statistics (1+, 2+ or 2). (B) A model of physico-chemical complementarity between proteins and cognate mRNAs. Preferential interactions of amino acids with PYR or PUR define their codon content in the genetic table and facilitate complementary interactions between PYR/PUR-rich mRNA regions and PYR/PUR-preferring regions in proteins. The opposite behavior of adenines and guanines adds an additional layer of complexity in the case of PURs, as signified by dashed arrows in the model. Note: polymer sizes not drawn to scale.

8440 Nucleic Acids Research, 2013, Vol. 41, No. 18

DISCUSSION

High levels of matching between base-binding-preference profiles of proteins and PYR- or PUR-density profiles of cognate mRNA-coding sequences, defined primarily by amino acid preferences to co-localize with G and C bases at RNA–protein interfaces, allow one to speculate that direct complementary binding interactions may be a key element underlying the whole mRNA–protein relationship, when it comes to both its evolutionary development as well as present-day biology (Figure 5B). This agrees well with and significantly extends our previous findings, where we have shown that protein sequence profiles of amino acid affinity for PYR analogs (42–44) mirror PYR density profiles of cognate mRNA sequences (15). It should be emphasized, however, that our present results are based exclusively on the statistics of direct amino acid–nucleobase contacts at RNA–protein interfaces. It is therefore still possible that the driving force for interactions between mRNAs and cognate proteins is non-specific (e.g. binding of positively charged amino acid side chains to RNA phosphate groups), whereas complementary interactions actually confer specificity to binding.
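The kind of profile matching discussed above can be illustrated with a minimal sketch: a per-codon PUR content profile of a coding sequence is compared with a preference-weighted profile of the encoded protein via the Pearson correlation coefficient. Everything specific here is an assumption made for illustration: the helper names, the toy sequences, the window length of 3 and, above all, the `HYPOTHETICAL_PREFERENCE` values, which are placeholder numbers with no physical meaning and are not the knowledge-based scales derived in this work.

```python
from statistics import mean


def window_average(values, window):
    """Moving average over `window` consecutive values."""
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def codon_purine_content(mrna):
    """Fraction of purines (A, G) in each successive codon."""
    return [
        sum(base in "AG" for base in mrna[i:i + 3]) / 3
        for i in range(0, len(mrna) - 2, 3)
    ]


# Placeholder per-residue preference values (hypothetical, for illustration only).
HYPOTHETICAL_PREFERENCE = {
    "M": 0.1, "A": -0.3, "Y": 0.4, "G": -0.6, "K": -0.5,
    "E": -0.4, "F": 0.6, "L": 0.5, "R": -0.7, "S": 0.2,
}

mrna = "AUGGCUUACGGAAAGGAAUUUCUGAGAUCC"  # toy coding sequence
protein = "MAYGKEFLRS"                   # its translation, one residue per codon
WINDOW = 3                               # arbitrary smoothing window (assumption)

mrna_profile = window_average(codon_purine_content(mrna), WINDOW)
protein_profile = window_average(
    [HYPOTHETICAL_PREFERENCE[aa] for aa in protein], WINDOW
)
r = pearson_r(mrna_profile, protein_profile)
print(f"Pearson R between profiles: {r:.2f}")
```

With a real preference scale, the sign of R depends on how the scale is defined (for example, whether lower values denote stronger preference), so the value obtained from this toy input should not be read as evidence for or against complementarity.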

Moreover, our results provide a clear evolutionary perspective concerning the physico-chemical origins of translation, in line with the stereo-chemical hypothesis of the origin of the genetic code (16–21). In particular, our results give strong support to the possibility of direct templating of proteins from mRNAs in the era before the development of ribosomal decoding and the code's fixation in that era (17,45). In this framework, ancient amino acids associated with mRNA directly, following their intrinsic physico-chemical preferences as outlined here. However, the fact that an analogous effect is not seen for all bases, especially adenine and uracil, supports the possibility that, in addition to physico-chemical rationales in the context of direct binding, other evolutionary forces were also responsible for shaping the genetic code, as suggested before (19). Our results are most consistent with the possibility that the early stereo-chemical phase in the code's development was dominated by G- and C-rich codons, as the strongest correlations are seen for precisely these bases. If the basic structure of the early genetic code was defined by such codons, but was later modulated by the inclusion of A and U bases, this might explain why G-affinity of amino acids in present-day protein sequences closely follows PUR density profiles in cognate mRNAs. Interestingly, Trifonov and coworkers have suggested that the first codons were G- and C-rich on the basis of a consensus analysis of 40 different criteria (46).

Importantly, it should be emphasized that the stereo-chemical hypothesis of the code's origin may differ from the cognate mRNA–protein complementary interaction hypothesis in terms of its evolutionary underpinnings. Direct templating of proteins from mRNAs in ancient systems (the coding aspect of the stereo-chemical hypothesis) does not necessarily imply that modern proteins directly interact with their own mRNA (complementary interaction hypothesis). However, our findings support the possibility that the origin of the genetic code and potential complementarity between proteins and cognate mRNAs might have the same physico-chemical background. It is well possible that other, independent influences have shaped both effects, and the two hypotheses leave ample room for such refinements. However, we would like to stress that, in our view, the two hypotheses are interlinked: cognate binding is, on the one hand, a reasonable consequence of the stereo-chemical hypothesis, but, on the other hand, it also gives a potential biological rationale for the early development of the code to begin with, such as stabilization of RNA structures by bound polypeptides, as has been suggested before (45).

There are a number of open challenges concerning the aforementioned proposal. First and foremost, the structural features of mRNAs and cognate proteins impose severe constraints on any putative complementarity between the two. Namely, with the contour length of the mRNA coding part being 4–5 times longer than that of a cognate protein, it is not clear what structural arrangements may be consistent with any complementary interactions. We would like to suggest that structures of such complexes may be dynamic and liquid-like, with mRNA stretches enveloping and solubilizing cognate protein stretches (15). Second, with many mRNAs and proteins being well-folded and compact for most of the time, it remains to be studied when and how opportunities could arise for the complementarity between their primary sequences to be of relevance. It is possible that, if at all realistic, such complementary binding might be functionally important precisely in those situations where both polymers are unstructured, such as during translation, export and degradation, as a consequence of thermal stress or in the case of intrinsically unstructured proteins. However, we do not exclude the possibility of complementary interactions even in the folded state. Finally, concerning the origin of the genetic code, it is not clear how the final, well-defined structure of the code could have arisen based on still partially non-specific, large-scale binding interactions between mRNAs and cognate proteins. As suggested before, it is possible that the answer lies in a combination of different influences (19). Future research should shed light on these and related questions.

These challenges notwithstanding, our findings provide strong evidence that the ability to interact with mRNA might be a widespread phenomenon in the cell, involving not only cognate proteins, but also other proteins, based on similar principles. The potential significance of such physico-chemical complementarity between mRNAs and proteins extends to all facets of nucleic acid and protein biology in the modern cell, including transcription/translation regulation (9,10,47,48), mRNA transport and localization (49,50), processing and decay (51), structure of ribonucleoproteins (52) and others (2–5,53,54). Our preliminary GO analysis has demonstrated a significant enrichment of functions related to association with nucleic acids for the subsets of proteins that complement their cognate mRNAs strongly, and these findings will be explored in more detail in future work.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

ACKNOWLEDGEMENTS

The authors thank S. Dorner, R. Schroeder, A. Vaziri, G. Warren and members of the Laboratory of Computational Biophysics at MFPL for useful advice and critical reading of the manuscript.

FUNDING

This work was supported in part by the Austrian Science Fund FWF [START grant Y514-B11 to B.Z.]; European Research Council [ERC Starting Independent grant 279408 to B.Z.]. Funding for open access charge: Austrian Science Fund FWF.

Conflict of interest statement. None declared.

REFERENCES

1. Brenner,S., Jacob,F. and Meselson,M. (1961) An unstable intermediate carrying information from genes to ribosomes for protein synthesis. Nature, 190, 576–581.

2. Anders,G., Mackowiak,S.D., Jens,M., Maaskola,J., Kuntzagk,A., Rajewsky,N., Landthaler,M. and Dieterich,C. (2012) doRiNA: a database of RNA interactions in post-transcriptional regulation. Nucleic Acids Res., 40, D180–D186.

3. Baltz,A.G., Munschauer,M., Schwanhausser,B., Vasile,A., Murakawa,Y., Schueler,M., Youngs,N., Penfold-Brown,D., Drew,K., Milek,M. et al. (2012) The mRNA-bound proteome and its global occupancy profile on protein-coding transcripts. Mol. Cell, 46, 674–690.

4. Castello,A., Fischer,B., Eichelbaum,K., Horos,R., Beckmann,B.M., Strein,C., Davey,N.E., Humphreys,D.T., Preiss,T., Steinmetz,L.M. et al. (2012) Insights into RNA biology from an atlas of mammalian mRNA-binding proteins. Cell, 149, 1393–1406.

5. Konig,J., Zarnack,K., Luscombe,N.M. and Ule,J. (2012) Protein-RNA interactions: new genomic technologies and perspectives. Nat. Rev. Genet., 13, 221–221.

6. Mitchell,S.F., Jain,S., She,M. and Parker,R. (2013) Global analysis of yeast mRNPs. Nat. Struct. Mol. Biol., 20, 127–133.

7. Weber,S.C. and Brangwynne,C.P. (2012) Getting RNA and protein in phase. Cell, 149, 1188–1191.

8. Han,T.N.W., Kato,M., Xie,S.H., Wu,L.C., Mirzaei,H., Pei,J.M., Chen,M., Xie,Y., Allen,J., Xiao,G.H. et al. (2012) Cell-free formation of RNA granules: bound RNAs identify features and components of cellular assemblies. Cell, 149, 768–779.

9. Kyrpides,N.C. and Ouzounis,C.A. (1993) Mechanisms of specificity in messenger-RNA degradation - autoregulation and cognate interactions. J. Theor. Biol., 163, 373–392.

10. Ouzounis,C.A. and Kyrpides,N.C. (1994) Reverse interpretation - a hypothetical selection mechanism for adaptive mutagenesis based on autoregulated messenger-RNA stability. J. Theor. Biol., 167, 373–379.

11. Chu,E., Copur,S.M., Ju,J., Chen,T.M., Khleif,S., Voeller,D.M., Mizunuma,N., Patel,M., Maley,G.F., Maley,F. et al. (1999) Thymidylate synthase protein and p53 mRNA form an in vivo ribonucleoprotein complex. Mol. Cell. Biol., 19, 1582–1594.

12. Tai,N., Schmitz,J.C., Liu,J., Lin,X., Bailly,M., Chen,T.M. and Chu,E. (2004) Translational autoregulation of thymidylate synthase and dihydrofolate reductase. Front. Biosci., 9, 2521–2526.

13. Schuttpelz,M., Schoning,J.C., Doose,S., Neuweiler,H., Peters,E., Staiger,D. and Sauer,M. (2008) Changes in conformational dynamics of mRNA upon AtGRP7 binding studied by fluorescence correlation spectroscopy. J. Am. Chem. Soc., 130, 9507–9513.

14. Zhao,X., Liu,M., Wu,N., Ding,L., Liu,H. and Lin,X. (2010) Recovery of recombinant zebrafish p53 protein from inclusion bodies and its binding activity to p53 mRNA in vitro. Protein Expr. Purif., 72, 262–266.

15. Hlevnjak,M., Polyansky,A.A. and Zagrovic,B. (2012) Sequence signatures of direct complementarity between mRNAs and cognate proteins on multiple levels. Nucleic Acids Res., 40, 8874–8882.

16. Woese,C.R. (1965) Order in genetic code. Proc. Natl Acad. Sci. USA, 54, 71–75.

17. Woese,C.R. (1965) On the evolution of the genetic code. Proc. Natl Acad. Sci. USA, 54, 1546–1552.

18. Yarus,M. (1998) Amino acids as RNA ligands: a direct-RNA-template theory for the code's origin. J. Mol. Evol., 47, 109–117.

19. Koonin,E.V. and Novozhilov,A.S. (2009) Origin and evolution of the genetic code: the universal enigma. IUBMB Life, 61, 99–111.

20. Yarus,M., Widmann,J.J. and Knight,R. (2009) RNA-amino acid binding: a stereochemical era for the genetic code. J. Mol. Evol., 69, 406–429.

21. Johnson,D.B. and Wang,L. (2010) Imprints of the genetic code in the ribosome. Proc. Natl Acad. Sci. USA, 107, 8298–8303.

22. Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The protein data bank. Nucleic Acids Res., 28, 235–242.

23. Ben-Shem,A., de Loubresse,N.G., Melnikov,S., Jenner,L., Yusupova,G. and Yusupov,M. (2011) The structure of the eukaryotic ribosome at 3.0 angstrom resolution. Science, 334, 1524–1529.

24. Dunkle,J.A., Wang,L.Y., Feldman,M.B., Pulk,A., Chen,V.B., Kapral,G.J., Noeske,J., Richardson,J.S., Blanchard,S.C. and Cate,J.H.D. (2011) Structures of the bacterial ribosome in classical and hybrid states of tRNA binding. Science, 332, 981–984.

25. Polikanov,Y.S., Blaha,G.M. and Steitz,T.A. (2012) How hibernation factors RMF, HPF and YfiA turn off protein synthesis. Science, 336, 915–918.

26. Harms,J.M., Wilson,D.N., Schluenzen,F., Connell,S.R., Stachelhaus,T., Zaborowska,Z., Spahn,C.M. and Fucini,P. (2008) Translational regulation via L11: molecular switches on the ribosome turned on and off by thiostrepton and micrococcin. Mol. Cell, 30, 26–38.

27. Miyazawa,S. and Jernigan,R.L. (1985) Estimation of effective interresidue contact energies from protein crystal-structures - quasi-chemical approximation. Macromolecules, 18, 534–552.

28. Donald,J.E., Chen,W.W. and Shakhnovich,E.I. (2007) Energetics of protein-DNA interactions. Nucleic Acids Res., 35, 1039–1047.

29. Jonikas,M.A., Radmer,R.J., Laederach,A., Das,R., Pearlman,S., Herschlag,D. and Altman,R.B. (2009) Coarse-grained modeling of large RNA molecules with knowledge-based potentials and structural filters. RNA, 15, 189–199.

30. Perez-Cano,L., Solernou,A., Pons,C. and Fernandez-Recio,J. (2010) Structural prediction of protein-RNA interaction by computational docking with propensity-based statistical potentials. Pac. Symp. Biocomput., 15, 269–280.

31. Tuszynska,I. and Bujnicki,J.M. (2011) DARS-RNP and QUASI-RNP: new statistical potentials for protein-RNA docking. BMC Bioinformatics, 12, 348.

32. Gasteiger,E., Gattiker,A., Hoogland,C., Ivanyi,I., Appel,R.D. and Bairoch,A. (2003) ExPASy: the proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res., 31, 3784–3788.

33. Dosztanyi,Z., Csizmok,V., Tompa,P. and Simon,I. (2005) IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics, 21, 3433–3434.

34. Huang da,W., Sherman,B.T. and Lempicki,R.A. (2009) Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc., 4, 44–57.

35. The PyMOL Molecular Graphics System, Version 1.3r1. (2010) Schrodinger, LLC. http://www.pymol.org/citing (4 July 2013, date last accessed).

36. Treger,M. and Westhof,E. (2001) Statistical analysis of atomic contacts at RNA-protein interfaces. J. Mol. Recognit., 14, 199–214.

37. Hoffman,M.M., Khrapov,M.A., Cox,J.C., Yao,J., Tong,L. and Ellington,A.D. (2004) AANT: the amino acid-nucleotide interaction database. Nucleic Acids Res., 32, D174–D181.

38. Gupta,A. and Gribskov,M. (2011) The role of RNA sequence and structure in RNA–protein interactions. J. Mol. Biol., 409, 574–587.

39. Fernandez,M., Kumagai,Y., Standley,D.M., Sarai,A., Mizuguchi,K. and Ahmad,S. (2011) Prediction of dinucleotide-specific RNA-binding sites in proteins. BMC Bioinformatics, 12(Suppl. 13), S5.

40. Copley,S.D., Smith,E. and Morowitz,H.J. (2005) A mechanism for the association of amino acids with their codons and the origin of the genetic code. Proc. Natl Acad. Sci. USA, 102, 4442–4447.

41. Iwakiri,J., Tateishi,H., Chakraborty,A., Patil,P. and Kenmochi,N. (2012) Dissecting the protein-RNA interface: the role of protein surface shapes and RNA secondary structures in protein-RNA recognition. Nucleic Acids Res., 40, 3299–3306.


42. Woese,C.R., Dugre,D.H., Saxinger,W.C. and Dugre,S.A. (1966) The molecular basis for the genetic code. Proc. Natl Acad. Sci. USA, 55, 966–974.

43. Woese,C.R. (1973) Evolution of the genetic code. Naturwissenschaften, 60, 447–459.

44. Mathew,D.C. and Luthey-Schulten,Z. (2008) On the physical basis of the amino acid polar requirement. J. Mol. Evol., 66, 519–528.

45. Noller,H.F. (2012) Evolution of protein synthesis from an RNA world. Cold Spring Harb. Perspect. Biol., 4, 1–U20.

46. Trifonov,E.N., Kirzhner,A., Kirzhner,V.M. and Berezovsky,I.N. (2001) Distinct stages of protein evolution as suggested by protein sequence analysis. J. Mol. Evol., 53, 394–401.

47. Vaquerizas,J.M., Kummerfeld,S.K., Teichmann,S.A. and Luscombe,N.M. (2009) A census of human transcription factors: function, expression and evolution. Nat. Rev. Genet., 10, 252–263.

48. Sonenberg,N. and Hinnebusch,A.G. (2009) Regulation of translation initiation in eukaryotes: mechanisms and biological targets. Cell, 136, 731–745.

49. Lecuyer,E., Yoshida,H., Parthasarathy,N., Alm,C., Babak,T., Cerovina,T., Hughes,T.R., Tomancak,P. and Krause,H.M. (2007) Global analysis of mRNA localization reveals a prominent role in organizing cellular architecture and function. Cell, 131, 174–187.

50. Martin,K.C. and Ephrussi,A. (2009) mRNA localization: gene expression in the spatial dimension. Cell, 136, 719–730.

51. Moore,M.J. and Proudfoot,N.J. (2009) Pre-mRNA processing reaches back to transcription and ahead to translation. Cell, 136, 688–700.

52. Glisovic,T., Bachorik,J.L., Yong,J. and Dreyfuss,G. (2008) RNA-binding proteins and post-transcriptional gene regulation. FEBS Lett., 582, 1977–1986.

53. Bellucci,M., Agostini,F., Masin,M. and Tartaglia,G.G. (2011) Predicting protein associations with long noncoding RNAs. Nat. Methods, 8, 444–445.

54. Rinn,J.L. and Chang,H.Y. (2012) Genome regulation by long noncoding RNAs. Annu. Rev. Biochem., 81, 145–166.

Nucleic Acids Research 2013 Vol 41 No 18 8443

(15) It should be emphasized however that our presentresults are based exclusively on the statistics of directamino acidnucleobase contacts at RNAprotein inter-faces It is therefore still possible that the driving forcefor interactions between mRNAs and cognate proteins isnon-specific (eg binding of positively charged amino acidside chains to RNA phosphate groups) whereas comple-mentary interactions actually confer specificity to binding

Moreover our results provide a clear evolutionary per-spective concerning the physico-chemical origins of trans-lation in line with the stereo-chemical hypothesis of theorigin of the genetic code (16ndash21) In particular ourresults give strong support to the possibility of directtemplating of proteins from mRNAs in the era beforethe development of ribosomal decoding and codersquosfixation in that era (1745) In this framework ancientamino acids associated with mRNA directly followingtheir intrinsic physico-chemical preferences as outlinedhere However the fact that an analogous effect is notseen for all bases especially adenine and uracil supportsthe possibility that in addition to physico-chemical ration-ales in the context of direct binding other evolutionaryforces were also responsible for shaping the genetic codeas suggested before (19) Our results are most consistentwith the possibility that the early stereo-chemical phase incodersquos development was dominated by G- and C-richcodons as strongest correlations are seen for preciselythese bases If the basic structure of the early geneticcode was defined by such codons but was later modulatedby the inclusion of A and U bases this might explain whyG-affinity of amino acids in present-day protein sequencesclosely follows PUR density profiles in cognate mRNAsInterestingly Trifonov and coworkers have suggested thatthe first codons were G- and C-rich on the basis of a con-sensus analysis of 40 different criteria (46)

Importantly it should be emphasized that the stereo-chemical hypothesis of the codersquos origin may differ fromthe cognate mRNAprotein complementary interactionhypothesis in terms of its evolutionary underpinningsDirect templating of proteins from mRNAs in ancientsystems (the coding aspect of the stereo-chemical hypoth-esis) does not necessarily imply that modern proteinsdirectly interact with their own mRNA (complementaryinteraction hypothesis) However our findings support thepossibility that the origin of the genetic code and potentialcomplementarity between proteins and cognate mRNAsmight have the same physico-chemical background It iswell possible that other independent influences haveshaped both effects and the two hypotheses leave ampleroom for such refinements However we would like tostress that in our view the two hypotheses are inter-linked cognate binding is on the one hand a reasonableconsequence of the stereochemical hypothesis but on theother hand it also gives a potential biological rationale forthe early development of the code to begin with such asstabilization of RNA structures by bound polypeptides ashas been suggested before (45)

There are a number of open challenges concerning the aforementioned proposal. First and foremost, the structural features of mRNAs and cognate proteins impose severe constraints on any putative complementarity between the two. Namely, with the contour length of the mRNA coding part being 4.5 times longer than that of a cognate protein, it is not clear what structural arrangements may be consistent with any complementary interactions. We would like to suggest that structures of such complexes may be dynamic and liquid-like, with mRNA stretches enveloping and solubilizing cognate protein stretches (15). Second, with many mRNAs and proteins being well-folded and compact for most of the time, it remains to be studied when and how opportunities could arise for the complementarity between their primary sequences to be of relevance. It is possible that, if at all realistic, such complementary binding might be functionally important precisely in those situations where both polymers are unstructured, such as during translation, export and degradation, as a consequence of thermal stress or in the case of intrinsically unstructured proteins. However, we do not exclude the possibility of complementary interactions even in the folded state. Finally, concerning the origin of the genetic code, it is not clear how the final, well-defined structure of the code could have arisen based on still partially non-specific, large-scale binding interactions between mRNAs and cognate proteins. As suggested before, it is possible that the answer lies in a combination of different influences (19). Future research should shed light on these and related questions.

These challenges notwithstanding, our findings provide strong evidence that the ability to interact with mRNA might be a widespread phenomenon in the cell, involving not only cognate proteins but also other proteins, based on similar principles. The significance of such physico-chemical complementarity between mRNAs and proteins potentially extends to all facets of nucleic acid and protein biology in the modern cell, including transcription/translation regulation (9,10,47,48), mRNA transport and localization (49,50), processing and decay (51), structure of ribonucleoproteins (52) and others (2–5,53,54). Our preliminary GO analysis has demonstrated a significant enrichment of functions related to association with nucleic acids for the subsets of proteins that complement their cognate mRNAs strongly, and these findings will be explored in more detail in future work.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

ACKNOWLEDGEMENTS

The authors thank S. Dorner, R. Schroeder, A. Vaziri, G. Warren and members of the Laboratory of Computational Biophysics at MFPL for useful advice and critical reading of the manuscript.

FUNDING

This work was supported in part by the Austrian Science Fund FWF [START grant Y514-B11 to B.Z.]; European Research Council [ERC Starting Independent grant 279408 to B.Z.]. Funding for open access charge: Austrian Science Fund FWF.

Nucleic Acids Research, 2013, Vol. 41, No. 18, 8441

Conflict of interest statement. None declared.

REFERENCES

1. Brenner,S., Jacob,F. and Meselson,M. (1961) An unstable intermediate carrying information from genes to ribosomes for protein synthesis. Nature, 190, 576–581.

2. Anders,G., Mackowiak,S.D., Jens,M., Maaskola,J., Kuntzagk,A., Rajewsky,N., Landthaler,M. and Dieterich,C. (2012) doRiNA: a database of RNA interactions in post-transcriptional regulation. Nucleic Acids Res., 40, D180–D186.

3. Baltz,A.G., Munschauer,M., Schwanhausser,B., Vasile,A., Murakawa,Y., Schueler,M., Youngs,N., Penfold-Brown,D., Drew,K., Milek,M. et al. (2012) The mRNA-bound proteome and its global occupancy profile on protein-coding transcripts. Mol. Cell, 46, 674–690.

4. Castello,A., Fischer,B., Eichelbaum,K., Horos,R., Beckmann,B.M., Strein,C., Davey,N.E., Humphreys,D.T., Preiss,T., Steinmetz,L.M. et al. (2012) Insights into RNA biology from an atlas of mammalian mRNA-binding proteins. Cell, 149, 1393–1406.

5. Konig,J., Zarnack,K., Luscombe,N.M. and Ule,J. (2012) Protein-RNA interactions: new genomic technologies and perspectives. Nat. Rev. Genet., 13, 221–221.

6. Mitchell,S.F., Jain,S., She,M. and Parker,R. (2013) Global analysis of yeast mRNPs. Nat. Struct. Mol. Biol., 20, 127–133.

7. Weber,S.C. and Brangwynne,C.P. (2012) Getting RNA and protein in phase. Cell, 149, 1188–1191.

8. Han,T.N.W., Kato,M., Xie,S.H., Wu,L.C., Mirzaei,H., Pei,J.M., Chen,M., Xie,Y., Allen,J., Xiao,G.H. et al. (2012) Cell-free formation of RNA granules: bound RNAs identify features and components of cellular assemblies. Cell, 149, 768–779.

9. Kyrpides,N.C. and Ouzounis,C.A. (1993) Mechanisms of specificity in messenger-RNA degradation - autoregulation and cognate interactions. J. Theor. Biol., 163, 373–392.

10. Ouzounis,C.A. and Kyrpides,N.C. (1994) Reverse interpretation - a hypothetical selection mechanism for adaptive mutagenesis based on autoregulated messenger-RNA stability. J. Theor. Biol., 167, 373–379.

11. Chu,E., Copur,S.M., Ju,J., Chen,T.M., Khleif,S., Voeller,D.M., Mizunuma,N., Patel,M., Maley,G.F., Maley,F. et al. (1999) Thymidylate synthase protein and p53 mRNA form an in vivo ribonucleoprotein complex. Mol. Cell. Biol., 19, 1582–1594.

12. Tai,N., Schmitz,J.C., Liu,J., Lin,X., Bailly,M., Chen,T.M. and Chu,E. (2004) Translational autoregulation of thymidylate synthase and dihydrofolate reductase. Front. Biosci., 9, 2521–2526.

13. Schuttpelz,M., Schoning,J.C., Doose,S., Neuweiler,H., Peters,E., Staiger,D. and Sauer,M. (2008) Changes in conformational dynamics of mRNA upon AtGRP7 binding studied by fluorescence correlation spectroscopy. J. Am. Chem. Soc., 130, 9507–9513.

14. Zhao,X., Liu,M., Wu,N., Ding,L., Liu,H. and Lin,X. (2010) Recovery of recombinant zebrafish p53 protein from inclusion bodies and its binding activity to p53 mRNA in vitro. Protein Expr. Purif., 72, 262–266.

15. Hlevnjak,M., Polyansky,A.A. and Zagrovic,B. (2012) Sequence signatures of direct complementarity between mRNAs and cognate proteins on multiple levels. Nucleic Acids Res., 40, 8874–8882.

16. Woese,C.R. (1965) Order in genetic code. Proc. Natl Acad. Sci. USA, 54, 71–75.

17. Woese,C.R. (1965) On the evolution of the genetic code. Proc. Natl Acad. Sci. USA, 54, 1546–1552.

18. Yarus,M. (1998) Amino acids as RNA ligands: a direct-RNA-template theory for the code's origin. J. Mol. Evol., 47, 109–117.

19. Koonin,E.V. and Novozhilov,A.S. (2009) Origin and evolution of the genetic code: the universal enigma. IUBMB Life, 61, 99–111.

20. Yarus,M., Widmann,J.J. and Knight,R. (2009) RNA-amino acid binding: a stereochemical era for the genetic code. J. Mol. Evol., 69, 406–429.

21. Johnson,D.B. and Wang,L. (2010) Imprints of the genetic code in the ribosome. Proc. Natl Acad. Sci. USA, 107, 8298–8303.

22. Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The protein data bank. Nucleic Acids Res., 28, 235–242.

23. Ben-Shem,A., de Loubresse,N.G., Melnikov,S., Jenner,L., Yusupova,G. and Yusupov,M. (2011) The structure of the eukaryotic ribosome at 3.0 angstrom resolution. Science, 334, 1524–1529.

24. Dunkle,J.A., Wang,L.Y., Feldman,M.B., Pulk,A., Chen,V.B., Kapral,G.J., Noeske,J., Richardson,J.S., Blanchard,S.C. and Cate,J.H.D. (2011) Structures of the bacterial ribosome in classical and hybrid states of tRNA binding. Science, 332, 981–984.

25. Polikanov,Y.S., Blaha,G.M. and Steitz,T.A. (2012) How hibernation factors RMF, HPF, and YfiA turn off protein synthesis. Science, 336, 915–918.

26. Harms,J.M., Wilson,D.N., Schluenzen,F., Connell,S.R., Stachelhaus,T., Zaborowska,Z., Spahn,C.M. and Fucini,P. (2008) Translational regulation via L11: molecular switches on the ribosome turned on and off by thiostrepton and micrococcin. Mol. Cell, 30, 26–38.

27. Miyazawa,S. and Jernigan,R.L. (1985) Estimation of effective interresidue contact energies from protein crystal-structures - quasi-chemical approximation. Macromolecules, 18, 534–552.

28. Donald,J.E., Chen,W.W. and Shakhnovich,E.I. (2007) Energetics of protein-DNA interactions. Nucleic Acids Res., 35, 1039–1047.

29. Jonikas,M.A., Radmer,R.J., Laederach,A., Das,R., Pearlman,S., Herschlag,D. and Altman,R.B. (2009) Coarse-grained modeling of large RNA molecules with knowledge-based potentials and structural filters. RNA, 15, 189–199.

30. Perez-Cano,L., Solernou,A., Pons,C. and Fernandez-Recio,J. (2010) Structural prediction of protein-RNA interaction by computational docking with propensity-based statistical potentials. Pac. Symp. Biocomput., 15, 269–280.

31. Tuszynska,I. and Bujnicki,J.M. (2011) DARS-RNP and QUASI-RNP: new statistical potentials for protein-RNA docking. BMC Bioinformatics, 12, 348.

32. Gasteiger,E., Gattiker,A., Hoogland,C., Ivanyi,I., Appel,R.D. and Bairoch,A. (2003) ExPASy: the proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res., 31, 3784–3788.

33. Dosztanyi,Z., Csizmok,V., Tompa,P. and Simon,I. (2005) IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics, 21, 3433–3434.

34. Huang da,W., Sherman,B.T. and Lempicki,R.A. (2009) Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc., 4, 44–57.

35. The PyMOL Molecular Graphics System, Version 1.3r1 (2010) Schrodinger, LLC. http://www.pymol.org/citing (4 July 2013, date last accessed).

36. Treger,M. and Westhof,E. (2001) Statistical analysis of atomic contacts at RNA-protein interfaces. J. Mol. Recognit., 14, 199–214.

37. Hoffman,M.M., Khrapov,M.A., Cox,J.C., Yao,J., Tong,L. and Ellington,A.D. (2004) AANT: the amino acid-nucleotide interaction database. Nucleic Acids Res., 32, D174–D181.

38. Gupta,A. and Gribskov,M. (2011) The role of RNA sequence and structure in RNA–protein interactions. J. Mol. Biol., 409, 574–587.

39. Fernandez,M., Kumagai,Y., Standley,D.M., Sarai,A., Mizuguchi,K. and Ahmad,S. (2011) Prediction of dinucleotide-specific RNA-binding sites in proteins. BMC Bioinformatics, 12(Suppl. 13), S5.

40. Copley,S.D., Smith,E. and Morowitz,H.J. (2005) A mechanism for the association of amino acids with their codons and the origin of the genetic code. Proc. Natl Acad. Sci. USA, 102, 4442–4447.

41. Iwakiri,J., Tateishi,H., Chakraborty,A., Patil,P. and Kenmochi,N. (2012) Dissecting the protein-RNA interface: the role of protein surface shapes and RNA secondary structures in protein-RNA recognition. Nucleic Acids Res., 40, 3299–3306.

42. Woese,C.R., Dugre,D.H., Saxinger,W.C. and Dugre,S.A. (1966) The molecular basis for the genetic code. Proc. Natl Acad. Sci. USA, 55, 966–974.

43. Woese,C.R. (1973) Evolution of the genetic code. Naturwissenschaften, 60, 447–459.

44. Mathew,D.C. and Luthey-Schulten,Z. (2008) On the physical basis of the amino acid polar requirement. J. Mol. Evol., 66, 519–528.

45. Noller,H.F. (2012) Evolution of protein synthesis from an RNA world. Cold Spring Harb. Perspect. Biol., 4, 1–U20.

46. Trifonov,E.N., Kirzhner,A., Kirzhner,V.M. and Berezovsky,I.N. (2001) Distinct stages of protein evolution as suggested by protein sequence analysis. J. Mol. Evol., 53, 394–401.

47. Vaquerizas,J.M., Kummerfeld,S.K., Teichmann,S.A. and Luscombe,N.M. (2009) A census of human transcription factors: function, expression and evolution. Nat. Rev. Genet., 10, 252–263.

48. Sonenberg,N. and Hinnebusch,A.G. (2009) Regulation of translation initiation in eukaryotes: mechanisms and biological targets. Cell, 136, 731–745.

49. Lecuyer,E., Yoshida,H., Parthasarathy,N., Alm,C., Babak,T., Cerovina,T., Hughes,T.R., Tomancak,P. and Krause,H.M. (2007) Global analysis of mRNA localization reveals a prominent role in organizing cellular architecture and function. Cell, 131, 174–187.

50. Martin,K.C. and Ephrussi,A. (2009) mRNA localization: gene expression in the spatial dimension. Cell, 136, 719–730.

51. Moore,M.J. and Proudfoot,N.J. (2009) Pre-mRNA processing reaches back to transcription and ahead to translation. Cell, 136, 688–700.

52. Glisovic,T., Bachorik,J.L., Yong,J. and Dreyfuss,G. (2008) RNA-binding proteins and post-transcriptional gene regulation. FEBS Lett., 582, 1977–1986.

53. Bellucci,M., Agostini,F., Masin,M. and Tartaglia,G.G. (2011) Predicting protein associations with long noncoding RNAs. Nat. Methods, 8, 444–445.

54. Rinn,J.L. and Chang,H.Y. (2012) Genome regulation by long noncoding RNAs. Annu. Rev. Biochem., 81, 145–166.
