Understanding hierarchical protein evolution from first principles


doi:10.1006/jmbi.2001.4949 available online at http://www.idealibrary.com on J. Mol. Biol. (2001) 312, 289–307

Understanding Hierarchical Protein Evolution from First Principles

Nikolay V. Dokholyan* and Eugene I. Shakhnovich

Department of Chemistry, Harvard University, 12 Oxford Street, Cambridge, MA 02138, USA

E-mail address of the corresponding author: [email protected]

0022-2836/01/010289–19 $35.00/0

We propose a model that explains the hierarchical organization of proteins in fold families. The model, which is based on the evolutionary selection of proteins by their native state stability, reproduces patterns of amino acids conserved across protein families. Due to its dynamic nature, the model sheds light on the evolutionary time-scales. By studying the relaxation of the correlation function between consecutive mutations at a given position in proteins, we observe separation of the evolutionary time-scales: at short time intervals families of proteins with similar sequences and structures are formed, while at long time intervals the families of structurally similar proteins that have low sequence similarity are formed. We discuss the evolutionary implications of our model. We provide a "profile" solution to our model and find agreement between predicted patterns of conserved amino acids and those actually observed in nature.

© 2001 Academic Press

Keywords: protein evolution; energy gap model; profile solution

*Corresponding author

Introduction

Understanding protein evolution remains a major challenge in molecular biology.1–16 While the mechanisms of mutations in DNA sequences that code for proteins are known,17 the selective fixation of these mutations in proteins is far from clear. Mutations occurring in DNA are directly governed by physical-chemical processes, and their fixation depends on whether cellular repair mechanisms are able to protect nucleotide(s) from modifications. Mutations occurring in protein sequences may drastically alter their physical, chemical, and biological properties. Thus, in the course of evolution nature exerts pressure to preserve those amino acids that play an important role in the folding kinetics, functionality and stability of proteins. Our goal is to understand evolution from the statistical mechanics perspective.

There are several principal facts observed in proteins: (i) a protein sequence folds into a unique three-dimensional structure (there might be exceptions, e.g. prions); (ii) protein sequences are selected, i.e. a randomly chosen polypeptide most likely aggregates in solution without forming any definite three-dimensional structure; (iii) proteins taken from various species and having sequence identity (ID) of at least 25–30 % have similar three-dimensional structures (native state)15,17–24 and are said to belong to the same fold family; (iv) some pairs of proteins sharing the same fold have sequence similarity as low as expected for random sequences, ID ≈ 8–9 %;11,25,26 (v) within the same fold family, protein sequences have only 3–4 % "anchored" amino acids.9

Here, we call homologs a set of proteins that have at least 25 % sequence similarity and are structurally similar. A set of structurally similar proteins that may have less than 25 % sequence similarity we call a group of analogous proteins or analogs. Analogs include several families of homologs and generally constitute a larger set of proteins than homologs. Known homologs and analogs are collected in the HSSP22 and FSSP25 databases, respectively, and are the subject of our study. Because our conclusions suggest that the concepts of homology and analogy may be ill-defined in some cases, we do not follow the conventional definition of homologs (analogs) as proteins sharing (not sharing) the same ancestor.27 Rather, we define homologs as proteins that have substantial sequence similarity (ID ≥ 25 %), while analogs are proteins that share the same fold but have low sequence similarity. This definition is natural for our considerations, which focus on physical aspects of protein evolution, such as stability and, to a lesser extent, folding kinetics.


Here, we propose a model of evolution (the energy gap model) that, based on facts (i) and (ii), attempts to reproduce the remaining principal observations (iii)-(v) described above. The energy gap model is based on the design of a set of structurally identical sequences by Z-score minimization.28–31 The idea is to find the similarities in the sequences of such a set and to recover those residues that are conserved across this set. Protein folding theory32,33 suggests that Z-score minimization is equivalent to maximizing the energy gap between misfolded or unfolded conformations and the native state of a protein. It has been pointed out that such maximization results in stable and fast-folding proteins.29,34 Thus, by designing sequences that have the same fold, we attempt to mimic evolution in diversifying protein sequences for the same fold family. In addition, the energy gap model is a dynamical model, i.e. there is an implicit time-scale that allows one to follow the evolution of sequences during the design procedure. The model is discussed in detail and a profile approximation to this model is outlined below. We show that our view of evolution proposed below is consistent with the implications of the proposed model. Next, we discuss our scenario of protein evolution.

We conjecture that hierarchical organization of structurally similar proteins may be the result of the separation of the evolutionary time-scales, shown schematically in Figure 1. On a time-scale $\tau_o$, a set of mutations occurs that does not affect those amino acids that play crucial thermodynamic, kinetic and/or functional roles. As a result, there is little variation in sequences at the important sites of proteins. If a mutation occurs at the thermodynamically, kinetically and/or functionally important sites, it usually substitutes amino acids with close physical properties so that the core, nucleus and/or functional site are not disrupted and the protein folds into its family fold, is stable in this fold, and its function is preserved. At this time, a family of homologs is born.

Rarely, at time-scale $\tau$, correlated mutations occur35–37 that modify several amino acids at the core, nucleus and/or functional site, so that the stability and kinetics of proteins are not altered. Such a set of mutations can drastically modify the sequence of the protein. However, within the time-scale $\tau_o$, a family of homologs is born within which there is conservation of (already new) amino acids in the specific (important) sites of homologous proteins. Although there are alterations in the specific sites of the proteins at the time-scale $\tau$, these sites are more preserved than the rest of the sequence. The proposed view of protein evolution is consistent with the observations of hierarchical organization of structurally similar proteins in families of homologs. Sets of families of homologs are organized, in turn, in super-families of analogs.

The time in our discussion is associated with the number of mutations that accumulate in the course of evolution. Because the rates may vary between families and even proteins, there is only a hypothetical relation of evolutionary time to physical time. Evolutionary time can be rigorously defined statistically as the number of mutations that occur in a fold family, averaged over all family members. The real time for one family may be different from that of another.

Energy Gap Model

We start with a random protein amino acid sequence and perform a Monte Carlo search for the mutations that energetically favor interactions in such a sequence. The Monte Carlo design algorithm is based on the minimization of the so-called Z-score, defined as:

$$Z = \frac{E_{NS} - \langle E \rangle}{\sigma(E)} \tag{1}$$

which measures the energy gap between the native-state energy, $E_{NS}$, of the selected sequence and the average energy, $\langle E \rangle$, of structurally unrelated conformations (decoys),38–40 in units of $\sigma(E)$, the standard deviation of the energies of all decoys.41

Since Z-score minimization is equivalent to maximizing the energy gap between misfolded or unfolded conformations and the native state of the protein,32,33,38 such maximization results in stable and fast-folding proteins. The energy gap must be "significant", meaning that $E_{NS}$ must deviate from $\langle E \rangle$ by many standard deviations $\sigma$: $E_{NS} \ll \langle E \rangle - \sigma$. Many researchers have pointed out (e.g. see Shakhnovich34) that minimization of the Z-score corresponds to the stabilization of the protein in its native state.

The design proceeds as follows: (i) we select an amino acid $\sigma_i$ at a random position $1 \le i \le N$; (ii) we substitute this amino acid by $\sigma_i'$ with probability $p$:

$$p = \begin{cases} 1, & \text{if } \delta Z < 0 \\ \exp(-\delta Z / T_{des}), & \text{if } \delta Z > 0 \end{cases} \tag{2}$$

where $\delta Z = Z(\sigma_i') - Z(\sigma_i)$ is the difference between the Z-scores of the mutated and the original proteins. We design each of $N_s = 100$ sequences by running the simulations for $N_m$ Monte Carlo steps at some design temperature, $T_{des}$.
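The acceptance rule of equation (2) is a Metropolis criterion applied to the Z-score. The following is a minimal sketch, not the authors' implementation: the function names, the `z_score` callable and the string alphabet are hypothetical placeholders, and the case $\delta Z = 0$ (not covered by equation (2)) is treated as an accepted move.

```python
import math
import random


def acceptance_probability(delta_z: float, t_des: float) -> float:
    """Equation (2): accept with probability 1 if Z decreases, else exp(-dZ/T_des)."""
    if delta_z < 0:
        return 1.0
    # delta_z == 0 gives exp(0) = 1, i.e. the move is accepted (an assumption).
    return math.exp(-delta_z / t_des)


def design_step(sequence, z_score, alphabet, t_des, rng):
    """One Monte Carlo step: mutate a random position, accept per equation (2).

    `z_score` is a hypothetical callable mapping a sequence to its Z-score.
    Returns the new sequence (a list) if accepted, otherwise a copy of the old one.
    """
    i = rng.randrange(len(sequence))          # random position 1 <= i <= N
    proposal = list(sequence)
    proposal[i] = rng.choice(alphabet)        # trial amino acid sigma_i'
    delta_z = z_score(proposal) - z_score(sequence)
    if rng.random() < acceptance_probability(delta_z, t_des):
        return proposal
    return list(sequence)


if __name__ == "__main__":
    # Toy Z-score that simply rewards "H" residues; for illustration only.
    toy_z = lambda s: -float(sum(a == "H" for a in s))
    rng = random.Random(1)
    seq = list("PPPPPPPPPP")
    for _ in range(1000):
        seq = design_step(seq, toy_z, "HP", 0.25, rng)
    print("".join(seq), toy_z(seq))
```

Passing the random generator explicitly keeps runs reproducible, which matters when many independent design runs are compared.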

Figure 1. A schematic representation of the evolutionary processes that result in conservation patterns of amino acids. For a given family of folds, e.g. Ig in this diagram, there are several alternative minima (three) in the hypothetical free energy landscape in the sequence space as a function of the "evolutionary" reaction coordinate (e.g. time). Each of these minima is formed by mutations in protein sequences at time-scales, $\tau_o$, that do not alter the protein's thermodynamically and/or kinetically important sites, forming families of homologous proteins. Transitions from one minimum to another occur at time-scales $\tau \sim \tau_0 \exp(\Delta G/T)$. At time-scale $\tau$, mutations occur that would alter several amino acids at the important sites of the proteins in such a way that the protein properties are not compromised. At time-scale $\tau$, the family of analogs is formed. In three minima we present three families of homologs (1TEN, 1FNF, and 1CFB), each comprised of six homologous proteins. We show ten positions in the aligned proteins: from 18 to 28. It can be observed that at position 4 (marked by blocks) in each of the families presented in the diagram amino acids are conserved within each family of homologs, but vary between these families. This position corresponds to position 21 in the Ig fold alignment (to 1TEN) and is conserved (see Figure 12(a)).

Computation of $\langle E \rangle$ and $\sigma(E)$ is straightforward:

$$\langle E \rangle = \frac{1}{2} \sum_{i \ne j} U(\sigma_i, \sigma_j) f_{ij} \tag{3}$$

and:

$$\sigma^2(E) = \langle (E - \langle E \rangle)^2 \rangle = \frac{1}{2} \sum_{i \ne j} f_{ij}(1 - f_{ij}) U^2(\sigma_i, \sigma_j) + O(f_{ij}^2) \tag{4}$$

where $f_{ij}$ is the frequency of a contact between monomers $i$ and $j$ in a set of decoys, i.e.:

$$f_{ij} = \langle \Delta_{ij} \rangle \tag{5}$$
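Equations (1), (3) and (4) can be evaluated directly once a contact potential and contact frequencies are available. The sketch below is illustrative only: the nested-dictionary potential `U`, the callable `f` returning the contact frequency for a sequence separation, and the function names are assumptions, and the $O(f_{ij}^2)$ correction in equation (4) is dropped.

```python
import math


def decoy_energy_stats(seq, U, f):
    """Return (<E>, sigma^2(E)) following equations (3) and (4).

    seq : sequence of residue labels
    U   : nested dict, U[a][b] = contact energy between residue types a and b
    f   : callable, f(l) = contact frequency at sequence separation l = |i - j|
    The O(f^2) term of equation (4) is neglected (an assumption of this sketch).
    """
    n = len(seq)
    mean = 0.0
    var = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            u = U[seq[i]][seq[j]]
            fij = f(abs(i - j))
            mean += 0.5 * u * fij                  # equation (3)
            var += 0.5 * fij * (1.0 - fij) * u * u  # equation (4), leading term
    return mean, var


def z_score(e_native, mean, var):
    """Equation (1): Z = (E_NS - <E>) / sigma(E)."""
    return (e_native - mean) / math.sqrt(var)


if __name__ == "__main__":
    U = {"H": {"H": -1.0, "P": 0.0}, "P": {"H": 0.0, "P": 0.0}}
    mean, var = decoy_energy_stats("HPHH", U, lambda l: 0.2)
    print(mean, var, z_score(-2.0, mean, var))
```

The native-state energy is passed in as a number because its evaluation depends on the contact map of the target structure, which is defined elsewhere in the paper.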

We estimate frequencies of contacts by making two assumptions about the set of decoys: (1) the distribution, $P(\ell = |i - j|; i, j)$, of the contact distances, $\ell = |i - j|$, between various amino acids at the positions $i$ and $j$ is universal among globular proteins; and (2) the actual frequency of contacts between various amino acids, $i$ and $j$, is only a function of the absolute value of the length of contacts, $|i - j|$, and is equal to the distribution of the contact lengths, i.e.:

$$f_{ij} = f_{|i-j|} \equiv f(\ell) \tag{6}$$

The distribution $P(\ell)$ is then:

$$P(\ell) = \frac{f(\ell)}{\sum_{\ell=1}^{N} f(\ell)} \tag{7}$$

Both assumptions, (1) and (2), are motivated by the fact that the variety of protein structures known to date samples adequately the conformational space of proteins under study.

In order to estimate frequencies, $f_{ij}$, according to equation (7) we compute the distribution of contacts of length $\ell = |i - j|$ in the ensemble of approximately $10^3$ representative globular proteins in the Protein Data Bank (PDB).42,43 The distribution shown in Figure 2 is obtained using the $C_\beta$-representation of proteins. The contacts are defined by equation (20).
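As a rough illustration of how $f(\ell)$ and $P(\ell)$ might be tabulated from a set of structures, the sketch below counts contacts by sequence separation. It assumes each structure is given as a list of $C_\beta$ coordinates (in Å) and uses the 7.5 Å cutoff quoted in the caption of Figure 2; normalizing each $f(\ell)$ by the number of residue pairs at that separation is an assumption of this sketch, and the function name is hypothetical.

```python
import math


def contact_length_distribution(structures, cutoff=7.5):
    """Tabulate f(l) and P(l) (equations (6) and (7)) from C-beta coordinates.

    structures : list of structures, each a list of (x, y, z) tuples in angstroms
    cutoff     : two residues are in contact if their distance is below this value
    Returns (f, P): dicts keyed by the sequence separation l = |i - j|.
    """
    contacts = {}
    pairs = {}
    for coords in structures:
        n = len(coords)
        for i in range(n):
            for j in range(i + 1, n):
                l = j - i
                pairs[l] = pairs.get(l, 0) + 1
                if math.dist(coords[i], coords[j]) < cutoff:
                    contacts[l] = contacts.get(l, 0) + 1
    # Frequency of a contact at separation l (assumed per-pair normalization).
    f = {l: contacts.get(l, 0) / pairs[l] for l in pairs}
    total = sum(f.values())
    # Equation (7): normalize f(l) so that P(l) sums to one.
    P = {l: (f[l] / total if total > 0 else 0.0) for l in f}
    return f, P


if __name__ == "__main__":
    # A straight toy chain with 3.8 angstrom spacing; real input would come from PDB files.
    chain = [(3.8 * i, 0.0, 0.0) for i in range(6)]
    f, P = contact_length_distribution([chain])
    print(f)
    print(P)
```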

The estimation of contact frequencies $f_{ij}$ is one of the key ingredients of protein design. An alternative approach based on sampling of homopolymer conformations appears to be less efficient, so we omit it in the present study. Nevertheless, due to its importance and potential for other studies, we discuss this approach in the Appendix.

After we obtain $N_s$ designed sequences, we compute the probability of an amino acid $\sigma_k$ to be in the $k$th position, $P_Z(\sigma_k)$, as the frequency of occurrence of this amino acid:

$$P_Z(\sigma_k) = N(\sigma_k)/N_s \tag{8}$$

where $N(\sigma_k)$ is the total number of occurrences of an amino acid $\sigma_k$ at the position $k$. Next, using equation (22) we compute the sequence entropy, $S_Z(k)$.
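The frequency count of equation (8) and the resulting per-position entropy can be sketched as follows. Equation (22) is not reproduced in this part of the text, so the sketch assumes the Shannon form $S(k) = -\sum_\sigma P(\sigma_k) \ln P(\sigma_k)$, the same form as equation (13); the function names are hypothetical.

```python
import math
from collections import Counter


def position_frequencies(sequences, k):
    """Equation (8): P_Z(sigma_k) = N(sigma_k) / N_s at alignment position k."""
    counts = Counter(seq[k] for seq in sequences)
    n_s = len(sequences)
    return {aa: c / n_s for aa, c in counts.items()}


def sequence_entropy(probabilities):
    """Shannon entropy -sum p ln p (assumed form of the sequence entropy)."""
    return -sum(p * math.log(p) for p in probabilities.values() if p > 0)


if __name__ == "__main__":
    designed = ["HPHH", "HPPH", "HHPH", "HPPP"]  # toy designed sequences
    for k in range(4):
        p = position_frequencies(designed, k)
        print(k, p, sequence_entropy(p))
```

A fully conserved position has zero entropy, while a position at which all letters of the alphabet are equally likely reaches the maximum, the logarithm of the alphabet size.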

Figure 2. The double-logarithmic plot of the probability distribution $P(\ell)$ that any two amino acids, positioned $i$th and $j$th along the protein chain so that their contact length $|i - j|$ is equal to $\ell$, are in physical contact, as a function of the contact length $\ell$. Residues positioned $i$th and $j$th along the protein chain are defined to be in physical contact if their corresponding $C_\beta$ ($C_\alpha$ in the case of Gly) are separated by less than 7.5 Å (equation (20)). $P(\ell)$ is computed for the ensemble of approximately $10^3$ representative globular proteins in PDB.42,43 The parallel line in the range of length $20 \le \ell \le 200$ indicates the power-law behavior of $P(\ell)$ in this region, $P(\ell) \sim \ell^{-1.64}$. The region $5 \le \ell \le 20$ is specific to proteins and has been discussed in detail by Berezovsky et al.100


Profile Solution

We develop a profile solution to the energy gap model that provides a rationale for conservatism patterns caused by selection for stability. Our solution is of equilibrium evolution that maintains stability and other properties achieved at an earlier, prebiotic stage. To this end we propose that stability selection accepts only those mutations that keep the energy of the native protein, $E$, below a certain threshold $E_c$ necessary to maintain an energy gap.8,29,44,45 The requirement to maintain an energy threshold for the viable sequences makes the equilibrium ensemble of sequences analogous to a microcanonical ensemble. In analogy with statistical mechanics, a more convenient and realistic description of the sequence ensemble is a canonical ensemble, whereby the strict requirement on the energy of the native state is replaced by a "soft" evolutionary pressure that allows energy fluctuations from sequence to sequence but makes sequences with high energy in the native state unlikely. In the canonical ensemble of sequences, the probability of finding a particular sequence, $\{\sigma\}$, in the ensemble follows the Boltzmann distribution:8,29,44,46

$$P(\{\sigma\}) = \frac{\exp(-E_{\{\sigma\}}/T)}{Z} \tag{9}$$

where $T$ is the effective temperature of the canonical ensemble of sequences that serves as a measure of evolutionary pressure and $Z = \sum_{\{\sigma\}} \exp(-E_{\{\sigma\}}/T)$ is the partition function taken in sequence space.

Next, we apply a profile approximation that replaces all multiparticle interactions between amino acids with the interaction of each amino acid with an effective field $\Phi$ acting on this amino acid from the rest of the protein, so that each amino acid experiences the exact field of its neighbors. This approximation presents $P(\{\sigma\})$ in a multiplicative form as $\prod_{k=1}^{N} p(\sigma_k)$ of probabilities to find an amino acid $\sigma$ at position $k$.47 $p(\sigma_k)$ also obeys Boltzmann statistics:

$$p(\sigma_k) = \frac{\exp(-\Phi(\sigma_k)/T)}{\sum_{\sigma} \exp(-\Phi(\sigma_k)/T)} \tag{10}$$

The profile potential $\Phi(\sigma_k)$ is the effective potential energy between amino acid $\sigma_k$ and all amino acids interacting with it, i.e.:

$$\Phi(\sigma_k) = \sum_{i \ne k}^{N} U(\sigma_k, \sigma_i) \Delta_{ki} \tag{11}$$

The potential $\Phi$ is similar in spirit to the protein profile introduced by Bowie et al.48 to identify protein sequences that fold into a specific 3D structure.
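Equations (10) and (11) translate into a few lines of code. The sketch below is a hedged illustration rather than the authors' program: it assumes a nested-dictionary contact potential `U`, a 0/1 contact matrix `contacts` for the native structure, and hypothetical function names. The smallest potential is subtracted before exponentiation for numerical stability, which leaves the probabilities of equation (10) unchanged.

```python
import math


def profile_potential(k, aa, seq, U, contacts):
    """Equation (11): Phi(sigma_k) = sum over i != k of U(sigma_k, sigma_i) * Delta_ki."""
    return sum(
        U[aa][seq[i]] * contacts[k][i] for i in range(len(seq)) if i != k
    )


def profile_probabilities(k, seq, U, contacts, alphabet, T):
    """Equation (10): Boltzmann probabilities of each residue type at position k."""
    phi = {aa: profile_potential(k, aa, seq, U, contacts) for aa in alphabet}
    phi_min = min(phi.values())  # shift for numerical stability only
    weights = {aa: math.exp(-(phi[aa] - phi_min) / T) for aa in alphabet}
    norm = sum(weights.values())
    return {aa: w / norm for aa, w in weights.items()}


if __name__ == "__main__":
    U = {"H": {"H": -1.0, "P": 0.0}, "P": {"H": 0.0, "P": 0.0}}
    contacts = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # toy native contact map
    print(profile_probabilities(1, "HPH", U, contacts, "HP", T=0.25))
```

Lowering `T` concentrates the probability on the residue type with the lowest profile potential, which is how the effective temperature acts as a measure of evolutionary pressure.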

For each member, $m$, of the fold family (FSSP database25) presented in Figure 1, we compute the profile probability, $p_m(\sigma_k)$, using equation (10). This probability, $p_m(\sigma_k)$, for each fold family member corresponds to the frequency of amino acids, $\sigma_k$, at positions, $k$, for a given family of homologs. Then, we compute the average profile probability over all members of the fold family:

$$p_P(\sigma_k) = \frac{1}{N_s} \sum_{m=1}^{N_s} p_m(\sigma_k) \tag{12}$$

This quantity corresponds to $P_{acr}(\sigma_k)$.35

Equations (10)-(12), along with a properly selected energy function, $U$, make it possible to predict, from the native structure of a protein, the probabilities of all amino acid types and the sequence entropy $S_P(k)$ at each position $k$:

$$S_P(k) = -\sum_{\sigma} p_P(\sigma_k) \ln p_P(\sigma_k) \tag{13}$$

The summation is taken over all possible values of $\sigma$.

If stability selection is a factor in the evolution of proteins and our model captures it, then we should observe a correlation between the predicted profile-based sequence entropies, $S_P(k)$, and the actual sequence entropies $S_{acr}(k)$ in real proteins. Thus, the question is: can we find a $T$ such that the predicted conservatism profile $S_P(k)$ matches the real one, $S_{acr}(k)$?

Figure 3. Three-dimensional structures of the representative proteins of the five folds under study (Ig, OB, R, α/β-P and TIM folds): (a) 1TEN protein; (b) 1MJC; (c) 3CHY; (d) 2ACY; and (e) 2EBN.

By varying the values of the temperature $T$ in the range $0.1 \le T \le 4.0$, we minimize the distance, $D^2 = \sum_{k=1}^{N} (S_P(k) - S_{acr}(k))^2$, between the predicted and observed conservatism profiles. We exclude from this sum such positions in structurally aligned sequences that have more than 50 % gaps in the structural (FSSP) alignment. We denote by $T_{sel}$ the temperature that minimizes $D$.
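The search for $T_{sel}$ can be pictured as a one-dimensional scan over the temperature range. The sketch below assumes a uniform grid with step 0.01 (the text does not state how the minimization is carried out) and a hypothetical callable `predict_entropy(T)` that returns the predicted profile $S_P(k)$ at temperature $T$; positions with more than 50 % gaps are skipped as described above.

```python
def squared_distance(s_pred, s_obs, gap_fraction, max_gap=0.5):
    """D^2 = sum_k (S_P(k) - S_acr(k))^2 over positions with at most 50 % gaps."""
    return sum(
        (p - o) ** 2
        for p, o, g in zip(s_pred, s_obs, gap_fraction)
        if g <= max_gap
    )


def select_temperature(predict_entropy, s_obs, gap_fraction,
                       t_min=0.1, t_max=4.0, steps=391):
    """Grid scan of T in [t_min, t_max]; returns (T_sel, D^2 at T_sel).

    predict_entropy : hypothetical callable, T -> list of S_P(k) values
    steps           : number of grid points (391 gives a step of 0.01; an assumption)
    """
    best_t, best_d2 = None, float("inf")
    for n in range(steps):
        t = t_min + (t_max - t_min) * n / (steps - 1)
        d2 = squared_distance(predict_entropy(t), s_obs, gap_fraction)
        if d2 < best_d2:
            best_t, best_d2 = t, d2
    return best_t, best_d2


if __name__ == "__main__":
    # Toy model in which the predicted entropy grows linearly with T.
    toy_model = lambda t: [t, 2.0 * t, 0.5 * t]
    observed = [0.3, 0.55, 0.2]
    gaps = [0.0, 0.1, 0.8]   # the last position is excluded (80 % gaps)
    print(select_temperature(toy_model, observed, gaps))
```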

The proposed profile solution has a dual role. On one hand, it allows us to understand the selective temperature scale, $T_{sel}$, which is the measure of evolutionary optimization. On the other hand, the correlation coefficient between $S_P(k)$ and $S_{acr}(k)$ does not vary strongly in the range of $T_{sel}$ from 0.19 to 0.34, which allows one to use the effective temperature $T_{sel} = 0.25$ to predict the actual conservatism profiles of proteins (see Table 1).

Results and Discussion

We study five folds: immunoglobulin fold (Ig), oligonucleotide-binding fold (OB), Rossmann fold (R), α/β-plait (α/β-P), and TIM-barrel fold (TIM). The three-dimensional structures of the representative proteins of these five folds are shown in Figure 3: (a) tenascin (third fibronectin type III repeat), pdb:1TEN; (b) major cold shock protein 7.4 (CspA (Cs 7.4)) of Escherichia coli, pdb:1MJC; (c) chemotactic protein CheY, pdb:3CHY; (d) acylphosphatase (common type) from bovine testis, pdb:2ACY; (e) endo-β-N-acetylglucosaminidase F1, pdb:2EBN. We compute the correlation coefficient47 between values of $S_P(k)$, obtained at $T_{sel}$, and $S_{acr}(k)$ for all five folds. The results are summarized in Table 1. The plots of $S_P(k)$ and $S_{acr}(k)$ versus $k$, as well as their scatter plots, are shown in Figures 12-16, below.

Energy gap model

We find that the correlation between $S_Z(k)$ and $S_{acr}(k)$ strongly depends on the number of mutations, $N_m$, we introduce during the design of a protein. This fact is in accord with our view (see Figure 1) of protein evolution. On a short time-scale, $\tau_o \approx 10^2$ Monte Carlo steps, mutations rarely alter amino acids with specific important properties such as participation in stabilization of proteins and/or in the nucleation processes in folding kinetics of proteins. These mutations diversify the family, $m$, of homologs, $M_h^m$. On a larger scale, $\tau \gg \tau_o$, correlated mutations35–37 modify core and/or nucleus site(s) of the proteins without compromising their stability, folding rates or function(s). Thus, at the time-scale $\tau$, evolution moves from one family of homologs to another, diversifying the underlying family of analogs, $M_a$; $\bigcup_m M_h^m \subset M_a$. The ensemble of analogs is still much smaller than the ensemble, $M_o$, of all possible sequences ($M_a \subset M_o$), which is of the size $6^N$ (in a six-letter alphabet); for an $N \approx 100$ residue protein this number is of the order of $10^{80}$. These results are in agreement with the theoretical predictions5,32,34,50–52 that there is a large number (of the order $e^{1.9N}$)34 of fast-folding sequences with a given native structure and pronounced stability gaps $\Delta = E_{NS} - \langle E \rangle$.

Table 1. The values of the correlation coefficient $r$ for the linear regression of $S_P(k)$ and $R(k)$ versus $S_{acr}$ for Ig, OB, R, α/β-P, and TIM folds and the corresponding optimal values of the temperature $T = T_{sel}$ for the $S_P(k)$ versus $S_{acr}$ linear regression

| Fold | $N_s$ | Representative protein, PDB code40,41 | $N$ | $r$, $R(k)$ versus $S_{acr}(k)$ | $r$, $S_P(k)$ versus $S_{acr}(k)$ | $T_{sel}$ | $r$, $S_P(k)$ versus $S_{acr}(k)$ ($T_{sel} = 0.25$) |
|---|---|---|---|---|---|---|---|
| Ig | 51 | 1TEN | 89 | 0.57 | 0.63 | 0.34 | 0.57 |
| OB | 18 | 1MJC | 69 | 0.67 | 0.69 | 0.19 | 0.69 |
| R | 166 | 3CHY | 128 | 0.74 | 0.71 | 0.25 | 0.71 |
| α/β-P | 29 | 2ACY | 98 | 0.45 | 0.54 | 0.26 | 0.53 |
| TIM | 49 | 2EBN | 285 | 0.54 | 0.50 | 0.23 | 0.50 |

The last column corresponds to the correlation coefficient for the studied folds at a fixed selective temperature $T = 0.25$. To obtain the rates of mutations, $R(k)$, we perform Z-score design of the sequences for $t_d = 10^8$ Monte Carlo steps at $T_{des} = 0.25$.

Figure 4. The temperature-dependence of the Z-score (equation (1)) for five representative proteins (1MJC, 1TEN, 3CHY, 2ACY, and 2EBN) of the five folds under study. For each fold and for a given temperature $T$, we compute the Z-score after performing sequence design runs for $10^5$ Monte Carlo steps. We average the Z-score over 100 sequence design runs. Due to the normalization of contacts' frequencies extracted from the PDB database (equation (7)), the scales of the computed Z-scores are different from the correctly normalized ones, defined by equation (1). There is a sharp transition of $Z(T)$ at $T = T_c \approx 0.25$ for all studied proteins. Above $T_c$, protein design results in unstable sequences, while at temperatures much lower than $T_c$ many of the residues "freeze" in their original states.

It is important that for a small number of mutations we find correlation between the entropies of the designed sequences, $S_Z(k)$, and the empirically observed ones, $S_{acr}(k)$. This correlation depends on the input random number, indicating that the selected sequences constitute a family of homologs, $M_h^m$, that is closer or more distant to an original sequence family of homologs, $M_h^0$ (both $M_h^m$ and $M_h^0$ belong to a given family of analogs, $M_a$). Here, we present the results for the selected ensembles of the designed sequences, $M_h^m$, after being optimized during $N_m$ mutations. More important than the correlation between $S_Z(k)$ and $S_{acr}(k)$, we find that the profiles of $S_Z(k)$ and $S_{acr}(k)$ are in visible concert with each other.

The temperature-dependence of the Z-score exhibits a sharp transition at $T = T_c \approx 0.25$ (Figure 4) for all studied proteins. Above $T_c$, protein design results in unstable sequences, while at temperatures much lower than $T_c$ many of the residues "freeze" in their original states. Thus, we select $T_c$ as our design temperature.

Degree of divergence of sequences

To assess the degree of similarity or divergence of sequences in the course of Z-score design at various time-scales, we compute the distribution of Hamming distances at these time-scales. The Hamming distance,53 also known as the p-distance,54 $H_d(\{\sigma\}^{(1)}, \{\sigma\}^{(2)})$, between two sequences, $\{\sigma\}^{(1)}$ and $\{\sigma\}^{(2)}$, is defined as the number of distinct amino acids at equal positions in these two sequences divided by the length of the sequences, $N$:

$$H_d(\{\sigma\}^{(1)}, \{\sigma\}^{(2)}) = \frac{1}{N} \sum_{k=1}^{N} \left(1 - \delta(\sigma_k^{(1)} - \sigma_k^{(2)})\right) \tag{14}$$

The Hamming distance has a simple interpretation: it is the degree of divergence between two sequences. When $H_d$ is equal to 1, the sequences have no amino acids in common at equal positions; when $H_d$ is equal to 0, the sequences are exactly the same.
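Equation (14) and the all-against-all comparison used for the histograms can be sketched in a few lines. The function names are hypothetical, and the sequences are assumed to be already aligned and of equal length.

```python
from itertools import combinations


def hamming_distance(seq_a, seq_b):
    """Equation (14): fraction of positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def pairwise_distances(sequences):
    """Hamming distances between all distinct pairs of sequences."""
    return [hamming_distance(a, b) for a, b in combinations(sequences, 2)]


if __name__ == "__main__":
    designed = ["HPHPHH", "HPHPPH", "PPHHHP", "HHHPHH"]  # toy designed sequences
    distances = pairwise_distances(designed)
    print(distances)
    print(sum(distances) / len(distances))  # mean Hamming distance <H_d>
```

To mimic a similarity threshold such as the one applied to case (b) in the text, the list returned by `pairwise_distances` can be filtered before building the histogram.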

We compute the distribution of Hamming distances between all designed sequences for two design times: (a) $t_d = 10^3 \gg \tau_o$ (Figure 5(a)) and (b) $t_d = 10^2 \approx \tau_o$ Monte Carlo steps (Figure 5(b)), where $\tau_o$ is a characteristic time-scale. We use the 1MJC family of homologs (OB fold) as an example throughout this subsection. We also performed a similar analysis with other folds, and the results are qualitatively the same (not shown).

In the computation of the distribution in case (b) we omit all sequences with sequence similarity less than ID = 55 % to mimic sequence collection in the HSSP database. This threshold sequence similarity, ID, is chosen so that the Hamming distance distribution derived from the actual sequences in the HSSP database (Figure 5(d)) is similar to ours. Given that we use a six-letter alphabet, the correspondence between the ID used in HSSP and our ID is not well defined. Because in (b) we select only sequences with minimal threshold similarity, ID, there are no events with $H_d > 0.55$ in Figure 5(b). In addition, the events with low $H_d \to 0$ are over-represented in our simulations, since we do not account for additional pressure that exists in real protein sequences due to function or kinetics. Therefore, the distribution in our simulations (Figure 5(b)) has a more pronounced tail at $H_d \to 0$ than that in real proteins (Figure 5(d)).

At the long time-scales (Figure 5(a)) we find that most of the sequences are divergent from each other, with average $\langle H_d \rangle \approx 0.7$. We observe the same result by computing the distribution of the Hamming distances between all analogs belonging to the OB fold family present in the FSSP database (Figure 5(c)). The only difference between simulated (Figure 5(a)) and observed (Figure 5(c)) distributions of Hamming distances is the tail present in the simulated distribution corresponding to the sequences with a significant degree of similarity. This tail is due to the fact that we compare all sequences with all sequences, thus effectively including similar sequences in our histogram. In the FSSP database, on the other hand, only distant sequences are present so that the tail corresponding to the close sequences in Figure 5(c) is absent.

Figure 5. The histograms of Hamming distances (equation (14)) between all designed sequences for the OB fold family for two design times: (a) $t_d = 10^3 \gg \tau_o$ and (b) $t_d = 10^2 \approx \tau_o$ Monte Carlo steps. The histograms of Hamming distances for the 1MJC family of actual protein sequences: (c) analogs taken from the FSSP database; and (d) homologs taken from the HSSP database. In the computation of the histograms in case (b) we omit all sequences with sequence similarity less than ID = 55 % to mimic sequence collection in the HSSP database. The threshold sequence similarity, ID, is chosen so that the Hamming distance histogram in (b) is similar to that derived from the actual sequences in the HSSP database (d). All histograms are normalized to unit area.

The distributions of Hamming distances in protein families are in qualitative agreement with those observed in simulations and with our picture of hierarchical protein evolution. At short time-scales sequences are not strongly separated from each other, forming families of homologs, while at long time-scales a family of analogs is formed, comprised of strongly separated sequences but structurally similar proteins.

Determination of the family formation time-scale

To quantify our observation of evolutionary time-scales separation, we compute the relaxation times of the correlation function (at each protein position, $k$) in the course of Z-score design, defined as:

$$C_k(\tau) = \frac{1}{t_d N_s} \sum_{a=1}^{N_s} \int_0^{t_d} \chi_k^{(a)}(t, \tau)\, dt \equiv \langle\langle \chi_k(\tau) \rangle\rangle_{t_d, N_s} \tag{15}$$

where $\langle\langle \cdots \rangle\rangle_{t_d, N_s}$ denotes the average over the simulation design time, $t_d$, and the number, $N_s$, of initial sequences. $\chi_k^{(a)}(t, \tau)$ is a Boolean indicator of whether the amino acid $\sigma_k(t + \tau)$ at position $k$ at time $t + \tau$ is the same as the amino acid $\sigma_k(t)$ at the same position at time $t$:

$$\chi_k^{(a)}(t, \tau) = \begin{cases} 1, & \sigma_k(t) = \sigma_k(t + \tau) \\ 0, & \sigma_k(t) \ne \sigma_k(t + \tau) \end{cases} \tag{16}$$

$C_k(\tau)$ measures the probability that a mutation does not occur at the position $k$ in time $\tau$. For most equilibrium systems this function decays exponentially:

$$C(\tau) = \exp(-\tau/\tau_o) \tag{17}$$

where $\tau_o$ is the relaxation time, i.e. the average time between subsequent mutations. The quantity inverse to the relaxation time ($1/\tau_o$) is proportional to the average substitution rate of a site.53
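A discrete-time version of equations (15)-(17) can be sketched as follows. The sketch assumes that each design run is stored as a list of sequences, one per Monte Carlo step, replaces the integral over $t$ by a sum over the available time windows, and estimates $\tau_o$ by a least-squares fit of $\ln C(\tau) = -\tau/\tau_o$ through the origin. The function names and the fitting choice are assumptions, not the procedure stated in the text.

```python
import math


def correlation_function(trajectories, k, lag):
    """Discrete analog of equation (15) for position k at time lag `lag`.

    trajectories : list of N_s runs, each a list of sequences indexed by time step
    Returns the fraction of (run, t) pairs with sigma_k(t) == sigma_k(t + lag).
    """
    same = 0
    total = 0
    for traj in trajectories:
        for t in range(len(traj) - lag):
            same += traj[t][k] == traj[t + lag][k]   # indicator of equation (16)
            total += 1
    return same / total


def relaxation_time(lags, values):
    """Fit ln C(lag) = -lag / tau_o through the origin and return tau_o.

    Requires 0 < C <= 1 for every value, with at least one value below 1.
    """
    numerator = sum(l * l for l in lags)
    denominator = -sum(l * math.log(c) for l, c in zip(lags, values))
    return numerator / denominator


if __name__ == "__main__":
    # Synthetic exponential decay with a known relaxation time, for illustration.
    lags = list(range(10, 160, 10))
    values = [math.exp(-l / 170.0) for l in lags]
    print(relaxation_time(lags, values))
```

Because $1/\tau_o$ is proportional to the substitution rate of a site, positions with large fitted $\tau_o$ are the slowly mutating, conserved ones.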

We also find that the correlation function computed for N_s = 10^3 and for t_d = 10^3 decays exponentially (see Figure 6) and that the relaxation times τ_o depend strongly on the positions of the amino acids under consideration. For example, the relaxation times of the correlation functions for positions 1 (Ser in 1MJC) and 31 (Val) in the 1MJC design differ by more than a factor of 2: τ_o(Ser1) = 143 and τ_o(Val31) = 387 Monte Carlo steps. The fact that τ_o(Val31) is more than twice as large as τ_o(Ser1) indicates that Ser1 is likely to mutate more than twice in the time-span of a single Val31 mutation.

In addition, the distribution of the relaxation times (see Figure 6(b)) exhibits a pronounced peak at τ_o ≈ 170 Monte Carlo steps, indicating that for most protein positions relaxation occurs with this typical relaxation time. The relaxation times found from the correlation function analysis are in agreement with our observations. The long non-Gaussian tail in the histogram of the relaxation times also suggests the presence of the conserved positions. In fact, this tail, composed of the conserved positions, strongly deviates from the rest of the

Figure 6. (a) Plot of the correlation functions versus time for two positions in 1MJC, Ser1 and Val31, obtained in the course of Z-score design of 10^3 sequences for 10^3 Monte Carlo steps. On a semilogarithmic scale C_{k=1}(τ) and C_{k=31}(τ) are straight lines with slopes corresponding to τ_o = 143 and τ_o = 387 Monte Carlo steps, respectively. (b) The histogram of the relaxation times τ_o (number of positions versus τ_o) for all positions in 1MJC obtained in the course of Z-score design of 10^3 sequences for 10^3 Monte Carlo steps. The histogram is well fit by a Gaussian function in the region 100 < τ_o < 250 (continuous line). (The regression coefficient for the Gaussian fit is r = 0.98.) The long tail that strongly deviates from the Gaussian distribution (over seven standard deviations) indicates the presence of the conserved positions in the course of design.


distribution, which is well approximated by a Gaussian distribution.

Rates of amino acid substitutions and conservatism

A number of authors suggested56–58 (and see Dokholyan et al., http://arxiv.org/abs/cond-mat/0007084) that the study of the conserved amino acids in families of structurally similar proteins can shed light on the functionally, kinetically and thermodynamically important amino acids in proteins. The basic belief behind the majority of such studies is that evolution optimizes, to a certain extent, the properties of proteins so that they become more stable and have better folding and functional properties. Here, we use the "optimization" hypothesis of molecular evolution to understand the universe of protein sequences by implication of molecular evolution. The link between conserved amino acids and their role in proteins has been widely studied.58,59–63

A recent study by Mirny & Shakhnovich35 identified the presence of universally conserved amino acids across the families of proteins sharing the same fold. These conserved residues have been linked to protein stability, kinetic properties or function. Various experiments59,64–73 have shown that some of the conserved residues have the predicted specific roles.

Direct evidence of the relationship between conservatism and the physical properties of amino acids can be obtained by calculating the rates of amino acid substitutions in the course of the Z-score design. By comparing mutational rates at various positions of the proteins, we attempt to reconstruct the conservatism of these positions across the family of analogous proteins. Starting with the sequence of a representative protein of a given fold we perform Z-score design for t_d = 10^8 Monte Carlo steps. The substitution rates are defined as:

$$R(k) = \frac{N_m(k)}{\tilde{t}_d} = \frac{N}{t_d} \sum_{t=1}^{t_d} \left[ 1 - \delta\big(s_k(t) - s_k(t-1)\big) \right] \qquad (18)$$

where N_m(k) is the number of mutations that occurred at position k, δ(x) is a Kronecker function, equal to 1 if x = 0 and 0 otherwise, s_k(t) is the amino acid s at position k at time t, N is the length of the protein, and t̃_d = t_d/N is the average number of attempted mutations per position in a protein. Thus, R(k) from equation (18) is inversely proportional to the average time between subsequent substitutions of amino acids at position k; the lower the R(k) value, the longer the amino acid at position k remains unchanged and, therefore, the more conserved this position is in the course of design.
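The counting in equation (18) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the trajectory is taken to be a list of equal-length sequences, one per Monte Carlo step, and the function name `substitution_rates` is hypothetical.

```python
def substitution_rates(traj):
    """R(k) = (N / t_d) * N_m(k), following equation (18).

    traj[t] is the sequence at Monte Carlo step t (t = 0 .. t_d).
    """
    t_d = len(traj) - 1          # number of Monte Carlo steps
    n = len(traj[0])             # protein length N
    counts = [0] * n             # N_m(k): observed changes at position k
    for prev, cur in zip(traj, traj[1:]):
        for k in range(n):
            counts[k] += prev[k] != cur[k]   # 1 - delta(s_k(t) - s_k(t-1))
    return [n * c / t_d for c in counts]

# Toy trajectory of a three-residue "protein" over four steps.
rates = substitution_rates(["AAA", "ABA", "ABA", "CBA", "CCA"])
```

In this toy case position 2 never changes, so its rate is zero, which corresponds to the most conserved position in the sense used in the text.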

We find that the rates of substitutions, R(k), correlate with the conservatism patterns, S_acr (see Figures 7–11). Since there is no obvious relation between R(k) and S_acr and, moreover, there is no reason to assume a linear relation between these quantities, the linear regression serves only to illustrate the correlations observed between R(k) and S_acr (see Figures 7(b)–11(b) and Table 1). Despite the likely lack of a linear relation between the rates and the entropy, the correlation obtained under the assumption of a linear dependence between R(k) and S_acr is still meaningful.

The computation of mutational rates, R(k), does not involve the tuning of any parameters. We can

Figure 7. (a) The values R(k) (black line) and S_acr(k) (red line) for all positions, k, for the Ig-fold. The lower the value of R(k), the more conserved the amino acid at that position. (b) The scatter plot of R(k) versus observed S_acr(k). The linear regression correlation coefficients are shown in Table 1. The blue line is the linear regression approximation. In both (a) and (b) rates are multiplied by the length of the representative protein.

Figure 9. (a), (b) The same as Figure 7 but for the R-fold.


choose any non-zero temperature, provided that the total number of Monte Carlo steps, t_d, in the course of design is large enough to obtain statistically significant values of R(k). We also find that at T_des = 0.25 the data for R(k) after 10^7 Monte Carlo steps are identical to those after 10^8 Monte Carlo steps, so the values of R(k) are statistically significant.

Interestingly, the fastest rates are at most twice as fast as the slowest rates. Such variability of rates

Figure 8. (a), (b) The same as Figure 7 but for the OB-fold.

might be due to the variability in physical properties of amino acids. It has been shown74 that there are only two principal eigenvalues of the Miyazawa-Jernigan energy matrix75,76 and the remaining eigenvalues are close to each other. Such a "degeneracy" in eigenvalues accounts for the similarities in physical properties of amino acids. In addition, the limitations in the derivation of the knowledge-based potential of amino acid interactions may be partially responsible for the observed range of rate variability.

Another possible reason for such a limited range of rate variability is the absence of side-chains in our model. The side-chains are an additional factor that slows the rates because of the frustrations caused by the multiple side-chain conformations.77

Figure 10. (a), (b) The same as Figure 7 but for the α/β-P-fold.

Figure 11. (a), (b) The same as Figure 7 but for the TIM-fold.

Despite the limitations of our model, the correlation between R(k) and S_acr is significant, which indicates that the model does qualitatively capture the evolutionary selection of proteins.

Profile solution

The correlation between S_P(k) and S_acr(k) is remarkable for all five folds and indicates that our profile approximation is able to select conserved amino acids in protein fold families and properly describe the formation of families on short time-scales (Table 1). It is fully expected that the correlation coefficient is smaller than 1: the computation of S_P(k) takes into account evolutionary selection for stability only, and does not take into account possible additional pressure to optimize kinetic or functional properties.

The additional evolutionary pressure due to the kinetic or functional importance of amino acids results in pronounced deviations of S_P from S_acr for a few amino acids that may be kinetically or functionally important. A number of amino acids whose conservatism is much greater than predicted by our model form a group of "outliers" from the otherwise very close correspondence between S_P and S_acr. To demonstrate that some of those amino acids are important for folding kinetics and, as such, can be under additional evolutionary pressure, we color data points on the S_P versus S_acr scatter plot according to the range of φ-values78 that the corresponding amino acids fall into. The thermodynamic and kinetic roles of individual amino acids were studied extensively (i) by Hamill et al.80 for the TNfn3 (1TEN) protein, (ii) by López-Hernández & Serrano66 for the chemotactic protein (CheY, pdb:3CHY), and (iii) by Chiti et al.78 for muscle acylphosphatase (AcP, pdb:2ACY).

We use the φ-values for individual amino acids obtained by López-Hernández & Serrano and by Hamill et al.66,79 We observe that: (i) for the TNfn3 protein most of the points in Figure 12(b) that belong to the outlier group have φ-values ranging from 0.2 to 1; (ii) for the CheY protein most of the points (for which φ-values are known) in Figure 14(b) that belong to the outlier group have φ-values ranging from 0.3 to 1; and (iii) for the AcP protein, one nucleus amino acid, Tyr11, is strongly conserved, more than predicted by the profile solution, while the second amino acid belonging to the nucleus,80 Pro54, does not appear to be conserved. The third nucleus amino acid, Phe94, in the AcP protein is excluded from our analysis due to the lack of data at position 94. The discrepancy between the conservatism of Pro54 and its kinetic role may be attributed to the poor statistical significance of the S_acr(k) calculation at this position. Figures 12(b), 14(b), and 15(b) demonstrate that additional evolutionary pressure due to the kinetic importance of amino acids results in stronger conservatism of specific positions than predicted by the profile solution.

It has been conjectured (e.g. see Rost11) that on average only 3-4 % of residues are "anchor residues", i.e. those that are more significantly conserved than the rest of the residues. In fact, this observation is supported by the S_acr(k) profile of the sequences and their profile estimates S_P(k) (see Figures 12(a)–16(a)). These 3-4 % of anchor residues are the principal "gates" to the structure/kinetics of a given family of proteins. For example, it has been shown77 that the number of residues that belong to the nucleus of a model protein is about 5 %; we expect the same low percentage of residues to determine the kinetics of real proteins. The number of key residues that form a functional site is also a small fraction of the total number of residues in proteins.

In order to demonstrate the statistical significance of the outliers' kinetic importance, we show that the number of sites with high values of φ found among the outliers is larger than that expected if such sites were randomly distributed across all values of S_P. For tenascin, the total number of residues is N_tot = 89, the number of sites with φ > 0.2 is N_tot(φ > 0.2) = 17, the number of outliers is N_out = 13, and the expected number of sites with φ > 0.2 among outliers is N_out^exp(φ > 0.2) = N_tot(φ > 0.2) N_out/N_tot ≈ 2.5. The observed number of sites with φ > 0.2 among outliers is N_out(φ > 0.2) = 8, which is over three times more than expected. A similar estimate for φ > 0.5 gives N_out^exp(φ > 0.5) ≈ 0.75 and N_out(φ > 0.5) = 2, which is nearly three times more than expected. For CheY, the total number of residues is N_tot = 128, the number of sites with φ > 0.3 is N_tot(φ > 0.3) = 11, the number of outliers is N_out = 22, and the expected number of sites with

Figure 13. (a)-(c) The same as Figure 12 but for the OB-fold.

Figure 12. (a) The values S_P(k) (black line) and S_acr(k) (red line) for all positions, k, for the Ig-fold. The lower the value of S_P(k), the more conserved the amino acid at that position. (b) The scatter plot of predicted S_P(k) versus observed S_acr(k). The linear regression correlation coefficients are shown in Table 1. The blue line is the linear regression; its slope differs from 1, the slope of the red line corresponding to the S_P(k) = S_acr(k) relation. (c) The histogram of the relative differences between S_P(k) and S_acr(k), (S_P − S_acr)/S_P. In (b) we assign colors to data points corresponding to amino acids with the specific range of φ-values:79 red, if 0.5 < φ < 1; yellow, if 0.2 < φ < 0.5; magenta, if 0.1 < φ < 0.2; violet, if φ < 0.1; and black if φ-values are not determined.

Figure 14. (a)-(c) The same as Figure 12 but for the R-fold. In (b) we assign colors to data points corresponding to amino acids with the specific range of φ-values:64 red, if 0.3 < φ < 1; yellow, if 0.1 < φ < 0.3; violet, if φ < 0.1; and black if φ-values are not determined.


φ > 0.3 among outliers is N_out^exp(φ > 0.3) = N_tot(φ > 0.3) N_out/N_tot ≈ 1.9. The observed number of sites with φ > 0.3 among outliers is N_out(φ > 0.3) = 4, which is over twice that expected. These crude estimates demonstrate that outliers have, in fact, a higher than expected number of residues with a pronounced kinetic role, hinting at an additional evolutionary pressure exerted on kinetically important amino acids.
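The estimates above can be reproduced with a short Python sketch. The function `expected_in_outliers` follows the formula in the text, N_out^exp = N_tot(φ) N_out/N_tot. The hypergeometric tail probability in `hypergeom_tail` is an addition for illustration only, assuming that outliers are drawn at random without replacement; it is not a calculation reported in the text, and both function names are hypothetical.

```python
import math

def expected_in_outliers(n_tot, n_high, n_out):
    """Expected number of high-phi sites among outliers if such sites
    were spread at random: N_tot(phi) * N_out / N_tot."""
    return n_high * n_out / n_tot

def hypergeom_tail(n_tot, n_high, n_out, k_obs):
    """P(X >= k_obs) when n_out sites are drawn without replacement from
    n_tot sites of which n_high have high phi (assumed null model)."""
    ways = sum(
        math.comb(n_high, k) * math.comb(n_tot - n_high, n_out - k)
        for k in range(k_obs, min(n_high, n_out) + 1)
    )
    return ways / math.comb(n_tot, n_out)

# Tenascin, phi > 0.2: N_tot = 89, N_tot(phi) = 17, N_out = 13, observed 8.
tenascin_expected = expected_in_outliers(89, 17, 13)
tenascin_p = hypergeom_tail(89, 17, 13, 8)

# CheY, phi > 0.3: N_tot = 128, N_tot(phi) = 11, N_out = 22, observed 4.
chey_expected = expected_in_outliers(128, 11, 22)
chey_p = hypergeom_tail(128, 11, 22, 4)
```

The expected counts round to the 2.5 and 1.9 quoted in the text; the tail probabilities give one possible way to attach a significance level to the "crude estimates", under the random-draw assumption stated above.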

Convergent or divergent evolution?

It has been a long-standing question11,13,82–84 whether the presently known proteins have evolved from a smaller family of prebiotic proteins ("divergent" evolution scenario) or whether they evolved from ancestors with distant homology and, due to thermodynamic, kinetic, and functional pressure exerted by evolution, converged to structurally similar proteins ("convergent" evolution scenario). The model of evolution proposed in this work does not rule out either of these scenarios. However, the similarity between the distribution of hamming distances in the family of homologous proteins produced by our model and that taken from nature is striking (Figure 5), serving as a hint in favor of divergent evolution.

Proteins' functions are related to their structures in several cases,14,17 so there is an evolutionary pressure to preserve structures. If a protein were to change its structure in the course of evolution, its functionality would be affected (there are, of course, possible exceptions). There is growing evidence78,85,86 that a small subset of all amino

Figure 15. (a)-(c) The same as Figure 12 but for the α/β-P-fold. In (b) we color red two out of the three nucleus amino acids, Tyr11 and Pro54.80

Figure 16. (a)-(c) The same as Figure 12 but for the TIM-fold.


acids of proteins is responsible for the thermodynamic stability of the native structure and the folding kinetics. Thus, an important argument favoring divergent evolution is that nature has to preserve only a small subset of amino acids that contribute the most to protein stability and folding kinetics. However, Murzin14 proposed a way for a functionless protein to survive by fusing with another functional protein and then evolving as a unit into a multi-functional protein. One of the most prominent examples is the DNA polymerases, which are composed of similar domains with different sequence composition.87–89

If we set aside multi-domain proteins, the fact that there is a limited number of folds (<1000 according to Chothia4 or <7920 according to Orengo6) has been extensively used to favor convergent evolution.5,8,10,12,14 Li et al.10 used the designability principle to show from full enumeration of lattice protein models that the number of members of a fold family depends on the stability gap, Δ. This dependence means that many unrelated sequences search in the course of evolution for stable conformations and, as soon as they reach the basin of a certain fold with a large enough energy gap, they stay within that basin. The proposed scenario5,8,10 also explains why various fold families are unequally populated:6 the number of family members depends on the energy gap. The more pronounced the energy gap is, the more mutations such a fold can tolerate. Buchler & Goldstein16 argued that the energy gap depends on the number of non-local contacts of a given fold.

There are several questions about that scenario. First, it is not clear if nature exploits all possible folds:6 even though there are only 1000 folds,4 it is possible that nature simply does not need more of them. Second, Chothia & Gerstein12 argued that the restriction on the divergence of proteins from one another does not come from a stability requirement (which is, of course, important) but from the separation of the mutated residue from the active site. Thus, the extent of sequence divergence is inversely proportional to that of protein function(s). The experiments by Gassner et al.90 and Axe et al.91 support the arguments of Chothia & Gerstein.12 In these experiments substitution of several amino acids in the hydrophobic core of T4 lysozyme conserved, to a certain extent, the function and the structure of the protein. Third, we note that there is a limited number of types of chemical elements that are part of the ligand structures bound by the active site. Thus, we expect that there are groups of evolutionarily unrelated proteins with similar binding sites and structure. In fact, there are examples of proteins sharing the same site, also called a "super-site". For example, both the transforming protein p21H-RAS-1 fragment (pdb:1CTQA) and the chemotactic protein 3CHY have similar binding sites; the root-mean-square deviation of one protein from the other is 3.2 Å while there are only 13 identical residues (i.e. ID < 10 %).92,93 Interestingly, the active site of 1CTQA is centered around Mg2+, while the active site of 3CHY is built by residues only (Asp12, Asp13, Met17, and Asp57).

It is possible that evolution follows several paths at the same time and the question of whether evolution is divergent or convergent is just ill-posed. To understand which pathways are favored by nature we need more evidence to make statistically valid conclusions.


Roles of stability, folding kinetics and function

There are three essential ingredients of proteins that nature exerts pressure on in the course of evolution: their thermodynamic stability, folding kinetics, and function. These ingredients are not mutually independent. For example, disruption of the folding kinetics or function of a protein may eliminate this protein from the cell's life. Out of these three ingredients, the evolutionary selection of proteins by their function may result in the strongest conservatism of amino acids. However, in many proteins this pressure may be localized to a small number of amino acids (e.g. binding site) that are responsible for its function. Therefore, there may be only a few functionally important amino acids that are under functional evolutionary pressure. Of course there are exceptions, such as histones, where function is the structure of these proteins and, thus, most amino acids are conserved in the course of evolution.17 In addition, the loss of a protein function may not result in elimination of this protein from the cell's life, because there are mechanisms by which functionless proteins can be fused with another functional protein14 and evolve as a unit of a multi-functional protein.

The evolutionary pressure to preserve the folding kinetics of proteins may be crucial and necessary for proteins to survive. It has been pointed out57,81,94 that there are few amino acids, the protein folding nucleus, that are responsible for the folding kinetics of the proteins. Thus, to preserve folding properties, nature may exert evolutionary pressure on the few amino acids that are part of the folding nucleus.

The evolutionary selection of proteins by their thermodynamic stability may be most robust, because if the stability of a protein is lost, the function and folding kinetics of this protein may become irrelevant. It seems that a larger number of amino acids are important for protein stabilization than for the folding kinetics.81,95,96 Thus, evolutionary pressure to preserve thermodynamic stability of proteins may be less specific and affect more amino acids.

It should be noted that our model, while not accounting explicitly for protein function, does so implicitly. Conservation of function (not considered explicitly in our model) may be a primary reason for conservation of stable structure, which is the major premise of our analysis of superfamily formation. There are several examples of protein superfamilies (according to SCOP classification) that have similar or identical function and similar structures but vastly different sequences, whose similarity is undetectable by the most sensitive sequence alignment methods such as PSI-BLAST (e.g. Zn-dependent exopeptidases, glutathione synthetase ATP-binding domain-like, serine/threonine and tyrosine kinases). In terms of our model this situation fits the hierarchy of time-scales whereby the time-scale for functional divergence τ_F is much greater than the longest time-scale τ_o on which a superfamily is born. This reflects the well-known fact of considerable plasticity of sequences preserving function (and structure).

As we already pointed out, conservation of stable structure has the most wide-ranging impact on protein sequences, and the main goal of our analysis here is to understand the implications of preservation of protein structure and stability for sequence evolution.

Homology in proteins: a rigorous way out of a terminological muddle

In 1987, 11 leading evolutionary biologists27 made a statement asking the scientific community for the appropriate usage of the term "homology". Two proteins are said to be homologous if they possess a common evolutionary origin (e.g. Fitch97). Because many proteins that have high sequence similarity are homologous, this term has been used loosely in the discussion of any proteins with high sequence identity. Proteins that have no common ancestor, but possess structural similarity, are called analogs.

If the sequence identity of two structurally similar proteins is high (ID > 25-30 %), there is a high probability that these proteins share a common ancestor, and thus, statistically, one would rarely be mistaken when calling these two proteins homologs. If the sequence similarity of two structurally similar proteins is low (ID < 25 %), it is difficult to establish whether these proteins are homologs or analogs. In fact, despite clever efforts,97 it is still questionable whether there is a unique solution to the problem of determining whether two proteins with low sequence identity are homologs or analogs, i.e. whether they evolved by divergent or convergent evolution.

Two proteins are likely to be homologs that diverged from the same root if they still carry the same function (i.e. if the evolutionary time elapsed from their common divergence point is smaller than the functional relaxation time τ_F). However, if two structurally similar proteins with low sequence identity have significantly different functions, then there is little information with which to identify them as homologs or analogs. These two proteins might be homologs, although one of them has evolved to possess a new function.14 However, these two proteins can also be analogs whose similarity in structure is purely accidental or, for example, is due to a potential similarity of the structure of the binding site. The question then becomes, how can we retrace the history of these two proteins?

Our results suggest that it may be impossible to retrace the history of two structurally similar proteins with low sequence identity. In this case, the ancestral relation classification terms, homologs and analogs, become meaningless. There are two reasons we believe this to be so. Firstly, the correlation function (equation (15)) decays exponentially, so that beyond the correlation function


relaxation time one cannot relate the sequences. Secondly, it does not make a difference if we start our design procedure from one sequence or from two unrelated sequences. After τ ≫ τ_o, sequences diverge so much from each other that one cannot say what initial sequence we used in the design procedure. Furthermore, our results suggest that some degree of homology may occur even between sequences that converged from unrelated roots to the same structure, i.e. in clear analogs. The reason is that, as we show here, some positions may feature conserved residues due to the physical requirement of stability of a common fold. Physical conservation of certain classes of amino acids at some positions in protein folds may be reflected at the genetic level due to the specifics of the genetic code. Such conservation in some cases may be confused with homology due to the origin of sequences in divergent evolution.

A rigorous definition of analogs and homologs can therefore come only from the understanding of the correlation times τ between consecutive mutations. If the time-scale is smaller than τ_o then the homology is well defined: the homologous sequences in this case have high sequence similarity, while the analogous sequences have low sequence similarity. At a longer time-scale τ ≫ τ_o, unless there is a high sequence similarity between sequences, the notion of homology and analogy becomes meaningless. Thus, we just use the terms analogs or homologs to refer to their sequence identity. In our model we know exactly the ancestral information during the design procedure. Since this is not the case for real fold families, we use similar terminology applied to real protein folds to avoid confusion.

Conclusion

We present a hierarchical model that attempts to explain sequence conservation caused by the most basic and universal evolutionary pressure in proteins to maintain stability. Using this model, we show that separation of basic time-scales (that constitute a broad distribution with long tails) in evolution is a plausible scenario for the sequence heterogeneity of analogous proteins. The two basic time-scales are τ_o and τ ≫ τ_o: (i) at time-scale τ_o, most mutations that occur in protein sequences do not alter the protein's thermodynamically and/or kinetically important sites and form families of homologous proteins; (ii) at time-scales τ ≫ τ_o, mutations occur that would alter several amino acids at the important sites of the proteins in such a way that the properties of the proteins are not compromised. At time-scale τ, the family of analogs is formed. Mutational rates, directly computed during Z-score design, show agreement with the conservatism profiles of the fold families.

The profile solution predicts sequence entropy reasonably well for the majority of, but not all, amino acids. The amino acids that exhibit considerably higher conservatism than predicted from stability pressure alone are likely to be important for function and/or folding. Comparison of the "base-level" stability conservatism S_P(k) with S_acr(k), the actual conservatism profile of a protein fold, allows one to identify functionally and kinetically important amino acid residues and potentially gain specific insights into folding and function of a protein.

Analysis of the correlation function confirms (i) the presence of an intrinsic time-scale, τ_o, at which designed sequences are similar and beyond which they differ strongly, and (ii) the presence of the conserved positions in the course of Z-score design. The distributions of hamming distances between sequences reveal "clustering" of similar sequences (with low H_d) at short time-scales τ ≈ τ_o and disappearance of similarity at larger time-scales τ ≫ τ_o. The above distributions are in accord with the distribution of hamming distances in the families of homologs, taken from the HSSP database, and with that of analogs, taken from the FSSP database, respectively.

The proposed study offers a plausible explanation of the clustering of structurally similar proteins into families of homologs and analogs. From the perspective of the proposed view of evolution, the conserved amino acids appear as thermodynamically and kinetically important centers, mutations of which result in other (possibly strong) sequence modifications to preserve the physical properties of the parental proteins. Such modifications result in a new family of homologous proteins. In addition, the proposed model can be utilized to search for the thermodynamically and kinetically important amino acids in silico.

Evolution is an extremely complex phenomenon, driven by numerous factors, such as history, preservation of function, folding kinetics and stability of proteins in response to change in cell/body environment. It is remarkable, however, that our simple model was able to qualitatively capture certain aspects of protein evolution without any adjustable parameters (except for the contact definition threshold and the empirical matrix of amino acid pairwise interactions).

In addition, our model provides a possible scenario of divergent evolution. Possibly both divergent and convergent paths exist in nature. However, the question of whether evolution prefers a divergent or a convergent path is yet to be resolved. Extensive theoretical, phenomenological and experimental effort may bring insight to this puzzle.

Materials and Methods

Protein model

We use the Cβ representation of proteins in which each pair of amino acids is in contact if their Cβ atoms (Cα in the case of Gly) are within a distance of 7.5 Å.98 We use the Miyazawa-Jernigan (MJ)76 matrix of pair potentials to

Table 2. A six-letter potential derived from the MJ 20-letter potential

        l       r       p       +       -       s
l     -0.31   -0.39   -0.22    0.01   -0.41   -0.12
r     -0.39   -0.27   -0.32   -0.02   -0.28   -0.12
p     -0.22   -0.32   -0.41   -0.25    0.07   -0.29
+      0.01   -0.02   -0.25   -0.10   -0.18   -0.18
-     -0.41   -0.28    0.07   -0.18    0.01   -0.05
s     -0.12   -0.12   -0.29   -0.18   -0.05    0.04

The symbols l, r, p, +, - and s denote the six corresponding groups of amino acids: aliphatic (A, V, L, I, M and C), aromatic (F, W, Y and H), polar (S, T, N and Q), positively charged (K and R), negatively charged (D and E), and special (reflecting their special conformational properties) (G and P).

Understanding hierarchical protein evolution 303

represent the interaction between each pair of 20 aminoacids. The total potential energy of the protein can bewritten as follows:

$$E = \frac{1}{2} \sum_{i \neq j}^{N} U(\sigma_i, \sigma_j)\, \Delta_{ij} \tag{19}$$

where $N$ is the length of the protein and $\sigma_i$ is the amino acid at position $i = 1, \ldots, N$. $U(\sigma_i, \sigma_j)$ is the corresponding element of the MJ matrix of pairwise interactions between amino acids $\sigma_i$ and $\sigma_j$. $\Delta_{ij}$ is the element of the contact matrix, which is defined to be 1 if a contact between amino acids $i$ and $j$ exists (i.e. the distance between these amino acids in the native (ground) state is smaller than 7.5 Å), and 0 if this contact does not exist:

$$\Delta_{ij} = \begin{cases} 1, & |r_i^{NS} - r_j^{NS}| \le 7.5\ \text{\AA} \\ 0, & |r_i^{NS} - r_j^{NS}| > 7.5\ \text{\AA} \end{cases} \tag{20}$$

where $r_i^{NS}$ is the position of the $i$th residue when the protein is in the native conformation.
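As an illustration of equations (19) and (20), the following minimal sketch builds a contact matrix from coordinates and evaluates the total energy. The two-letter potential and the three-residue coordinates are hypothetical toy values chosen only so the result can be checked by hand; they are not Miyazawa-Jernigan entries or real protein data, and the function names are ours.

```python
import math

CUTOFF = 7.5  # contact threshold in angstroms, as stated in the text


def contact_matrix(coords, cutoff=CUTOFF):
    """Return a symmetric 0/1 matrix: 1 where two distinct residues lie within cutoff."""
    n = len(coords)
    delta = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                delta[i][j] = delta[j][i] = 1
    return delta


def total_energy(sequence, delta, potential):
    """Half the sum over ordered pairs i != j of U(s_i, s_j) * Delta_ij."""
    n = len(sequence)
    energy = 0.0
    for i in range(n):
        for j in range(n):
            if i != j and delta[i][j]:
                energy += potential[(sequence[i], sequence[j])]
    return 0.5 * energy


# Hypothetical two-letter potential (assumption: NOT Miyazawa-Jernigan values).
toy_potential = {
    ("A", "A"): -1.0, ("A", "B"): 0.5,
    ("B", "A"): 0.5, ("B", "B"): -0.5,
}
# Hypothetical C-beta coordinates (angstroms) for a three-residue chain.
toy_coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)]

toy_delta = contact_matrix(toy_coords)
toy_energy = total_energy("ABA", toy_delta, toy_potential)
print(toy_delta, toy_energy)
```

In this toy chain residues 1-2 and 2-3 are 5 Å apart and form contacts, while residues 1-3 are 10 Å apart and do not, so only the two A-B contacts contribute to the energy. The sketch counts sequence neighbors as contacts because the text does not state that they are excluded.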

The six-letter potential

Due to the similarities in the properties of the 20 types of amino acid, one can classify these amino acids into six distinct groups: aliphatic (A, V, L, I, M and C), aromatic (F, W, Y and H), polar (S, T, N and Q), positive (K and R), negative (D and E), and special (reflecting their special conformational properties) (G and P). We construct the potential of interaction, $U_6(\hat\sigma_i, \hat\sigma_j)$, between the six groups of amino acids, $\hat\sigma$, by computing the average interaction between these groups, i.e.:

$$U_6(\hat\sigma_i, \hat\sigma_j) = \frac{1}{N_{\hat\sigma_i} N_{\hat\sigma_j}} \sum_{\sigma_k \in \hat\sigma_i,\ \sigma_l \in \hat\sigma_j} U_{20}(\sigma_k, \sigma_l) \tag{21}$$

where $\sigma$ denotes amino acids in the 20-letter representation and $U_{20}(\sigma_k, \sigma_l)$ is the 20-letter MJ matrix of interactions; $\hat\sigma$ denotes amino acids in the six-letter representation. $N_{\hat\sigma}$ is the number of actual amino acids of type $\hat\sigma$, e.g. for the aliphatic group $N_{\hat\sigma} = 6$. The six-letter interaction potential for the MJ 20-letter potential is given in Table 2.
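The group averaging of equation (21) can be sketched as follows. The grouping of the 20 amino acids is taken from the text; the 20-letter matrix used here is a hypothetical placeholder (−1 for identical letters, 0 otherwise), not the MJ matrix, and the group names and function name are ours.

```python
from itertools import product

# Six groups of amino acids, as listed in the text.
GROUPS = {
    "aliphatic": "AVLIMC",
    "aromatic": "FWYH",
    "polar": "STNQ",
    "positive": "KR",
    "negative": "DE",
    "special": "GP",
}


def coarse_grain(u20, groups=GROUPS):
    """Average a 20-letter pair potential over all member pairs of each group pair."""
    u6 = {}
    for (name_a, members_a), (name_b, members_b) in product(groups.items(), repeat=2):
        total = sum(u20[(a, b)] for a in members_a for b in members_b)
        u6[(name_a, name_b)] = total / (len(members_a) * len(members_b))
    return u6


letters = "".join(GROUPS.values())
# Hypothetical placeholder potential (assumption: NOT the MJ matrix).
toy_u20 = {(a, b): (-1.0 if a == b else 0.0) for a in letters for b in letters}

toy_u6 = coarse_grain(toy_u20)
print(toy_u6[("positive", "positive")], toy_u6[("aliphatic", "aliphatic")])
```

With this placeholder, a group of size n has n identical-letter pairs out of n² member pairs, so its self-interaction averages to −1/n, which makes the normalization by the two group sizes easy to verify.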

The measure of the information content of the sequences

In both the energy gap model and the profile solution, to study the information content of the sequences, we compute the sequence entropy, $S_X(k)$, at each position, $k$, of the sequence:

$$S_X(k) = -\sum_{\sigma_k} P_X(\sigma_k) \ln P_X(\sigma_k) \tag{22}$$

where $P_X(\sigma_k)$ is the probability that we observe an amino acid $\sigma_k$ at the $k$th position. The subscript $X = Z$ or $P$ denotes the energy gap model or the profile solution, respectively. The summation is taken over all possible values of $\sigma_k$.
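A minimal sketch of the per-position entropy of equation (22) is given below. It assumes that the probabilities are estimated as observed frequencies in a set of equal-length aligned sequences; the four toy sequences are hypothetical and the function names are ours.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (natural log) of the letter frequencies in one column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def sequence_entropy(alignment):
    """Entropy at every position k of a list of equal-length sequences."""
    length = len(alignment[0])
    return [column_entropy([seq[k] for seq in alignment]) for k in range(length)]


# Hypothetical toy alignment: position 0 is fully conserved, the others are not.
toy_alignment = ["ACD", "ACE", "AGD", "AGE"]
toy_entropy = sequence_entropy(toy_alignment)
print(toy_entropy)
```

A fully conserved position has zero entropy, while a position split evenly between two letters has entropy ln 2, so low values mark conserved positions.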

The effect of switching from the 20-letter representation of amino acids to a six-letter representation is that the values of the sequence entropy $S_6(k)$ are typically smaller than those of $S_{20}(k)$. For an $M$-letter alphabet with all letters equally represented, i.e. $P_X(\sigma_k) = 1/M$, the entropy is equal to $\ln M$. Thus, we expect that the difference between the typical values of $S_{20}(k)$ and $S_6(k)$ is approximately $\ln(20/6) \approx 1.2$. The case when all letters of an $M$-letter alphabet are equally represented corresponds to the maximal value of the entropy, i.e.:

$$S_M(k) \le \ln M \tag{23}$$
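For reference, the estimate of the entropy offset quoted above follows directly from the uniform-distribution values for the two alphabets. This is only a worked version of the arithmetic, not an additional result:

```latex
\begin{align*}
S_{20}^{\max} &= \ln 20 \approx 2.996, \\
S_{6}^{\max}  &= \ln 6 \approx 1.792, \\
S_{20}^{\max} - S_{6}^{\max} &= \ln\frac{20}{6} \approx 1.204 .
\end{align*}
```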

The entropy of the protein fold families

Theoretical predictions from statistical-mechanical analysis can be compared with data on real proteins. In order to determine conservatism in real proteins we assume that the space of sequences that fold into the same protein structure presents a two-tier system, where homologous sequences are grouped into families and there is no recognizable sequence homology between families despite the fact that they fold into closely related structures.11,35,99

Using the database of protein families with close sequence similarity (HSSP database22), we compute the frequencies of amino acids at each position, $k$, of aligned sequences, $P_m(\sigma_k)$, for a given, $m$th, family of proteins. We average these frequencies across all $N_s$ families sharing the same fold that are present in the FSSP database:25

$$P_{acr}(\sigma_k) = \frac{1}{N_s} \sum_{m=1}^{N_s} P_m(\sigma_k) \tag{24}$$

Next, we determine the sequence entropy, $S_{acr}(k)$, at each position, $k$, of structurally aligned protein analogs:

$$S_{acr}(k) = -\sum_{\sigma_k} P_{acr}(\sigma_k) \ln P_{acr}(\sigma_k) \tag{25}$$
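The two-step procedure of equations (24) and (25) can be sketched as below: frequencies are first computed within each family and then averaged with equal weight per family. The sketch assumes the sequences are already aligned to a common set of positions; the toy families are hypothetical and are not HSSP or FSSP data, and the function names are ours.

```python
import math
from collections import Counter


def family_frequencies(family, k):
    """Frequencies of letters at position k within one family of aligned sequences."""
    column = [seq[k] for seq in family]
    counts = Counter(column)
    return {aa: c / len(column) for aa, c in counts.items()}


def across_family_entropy(families, k):
    """Entropy of the family-averaged frequencies at position k."""
    n_families = len(families)
    p_acr = Counter()
    for family in families:
        for aa, p in family_frequencies(family, k).items():
            p_acr[aa] += p / n_families  # each family carries equal weight
    return -sum(p * math.log(p) for p in p_acr.values() if p > 0)


# Hypothetical toy data: two families of different sizes sharing one fold.
toy_families = [["AC", "AC"], ["AD", "AD", "AD"]]
s_acr_0 = across_family_entropy(toy_families, 0)
s_acr_1 = across_family_entropy(toy_families, 1)
print(s_acr_0, s_acr_1)
```

Position 1 is conserved within each toy family but differs between them, so the across-family entropy is ln 2 even though the second family contains more sequences; averaging per family keeps large families from dominating.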


Acknowledgments

We thank R. S. Dokholyan for careful reading of the manuscript and S. V. Buldyrev, A. V. Finkelstein, N. V. Grishin, A. Yu. Grosberg, and L. A. Mirny for helpful discussions. The profile solution was developed with L. A. Mirny. N.V.D. is supported by NIH postdoctoral fellowship GM20251-01. E.I.S. is supported by NIH grant RO1-52126.

References

1. Govindarajan, S. & Goldstein, R. A. (1997). The foldability landscape of model protein. Biopolymers, 42, 427-438.

2. Govindarajan, S. & Goldstein, R. A. (1997). Evolution of model proteins on a foldability landscape. Proteins: Struct. Funct. Genet. 29, 461-466.

3. Levitt, M. & Chothia, C. (1976). Structural patterns in globular proteins. Nature, 261, 552-558.

4. Chothia, C. (1992). One thousand families for the molecular biologist. Nature, 357, 543-544.

5. Finkelstein, A. V., Gutin, A. & Badretdinov, A. (1993). Why are some protein structures so common? FEBS Letters, 325, 23-28.

6. Orengo, C. A., Jones, D. T. & Thornton, J. M. (1994). Protein superfamilies and domain superfolds. Nature, 372, 631-634.

7. Davidson, A. R. & Sauer, R. T. (1994). Folded proteins occur frequently in libraries of random amino acid sequences. Proc. Natl Acad. Sci. USA, 91, 2146-2150.

8. Finkelstein, A. V., Gutin, A. & Badretdinov, A. (1995). Why are the same protein folds used to perform different functions? Proteins: Struct. Funct. Genet. 23, 142-149.

9. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536-540.

10. Li, H., Helling, R., Tang, C. & Wingreen, N. S. (1996). Emergence of preferred structures in a simple model of protein folding. Science, 273, 666-669.

11. Rost, B. (1997). Protein structures sustain evolutionary drift. Fold. Des. 2, S19-S24.

12. Chothia, C. & Gerstein, M. (1997). Protein evolution - how far can sequences diverge? Nature, 385, 579.

13. Grishin, N. V. (1997). Estimation of evolutionary distances from protein spatial structures. J. Mol. Evol. 45, 359-369.

14. Murzin, A. G. (1998). How far divergent evolution goes in proteins. Curr. Opin. Struct. Biol. 8, 380-387.

15. Holm, L. (1998). Unification of protein families. Curr. Opin. Struct. Biol. 8, 372-379.

16. Buchler, N. E. G. & Goldstein, R. A. (2000). Surveying determinants of protein structure designability across different energy models and amino-acid alphabets: a consensus. J. Chem. Phys. 112, 2533-2547.

17. Alberts, B., Bray, D., Lewis, J., Raff, M., Roberts, K. & Watson, J. D. (1994). Molecular Biology of the Cell, Garland Publishing, New York.

18. Sander, C. & Schneider, R. (1991). Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins: Struct. Funct. Genet. 9, 56-68.

19. Flaherty, K. M., McKay, D. B., Kabsch, W. & Holmes, K. C. (1991). Similarity of the three-dimensional structures of actin and the ATPase fragment of a 70k heat-shock cognate protein. Proc. Natl Acad. Sci. USA, 88, 5041-5045.

20. Holmes, K. C., Sander, C. & Valencia, A. (1993). A new ATP-binding fold in actin, hexokinase and Hsc70. Trends Cell Biol. 3, 53-59.

21. Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B. & Thornton, J. M. (1997). CATH - a hierarchic classification of protein domain structures. Structure, 5, 1093-1108.

22. Dodge, C., Schneider, R. & Sander, C. (1998). The HSSP database of protein structure-sequence alignments and family profiles. Nucl. Acids Res. 26, 313-315.

23. Sánchez, R., Pieper, U., Melo, F., Eswar, N., Marti-Renom, M. A., Madhusudhan, M. S. et al. (2000). Protein structure modeling for structural genomics. Nature Struct. Biol. 7, 986-990.

24. Pearl, F. M. G., Lee, D., Bray, J. E., Sillitoe, I., Todd, A. E., Harrison, A. P. et al. (2000). Assigning genomic sequences to CATH. Nucl. Acids Res. 28, 277-282.

25. Holm, L. & Sander, C. (1993). Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233, 123-138.

26. Holm, L. & Sander, C. (1997). An evolutionary treasure: unification of a broad set of amidohydrolases related to urease. Proteins: Struct. Funct. Genet. 28, 72-82.

27. Reeck, G. R., de Haen, C., Teller, D. C., Doolittle, R. F., Fitch, W. M., Dickerson, R. E. et al. (1987). "Homology" in proteins and nucleic acids: a terminology muddle and a way out of it. Cell, 50, 667.

28. Goldstein, R., Luthey-Schulten, Z. A. & Wolynes, P. (1992). Optimal protein-folding codes from spin-glass theory. Proc. Natl Acad. Sci. USA, 89, 4918-4922.

29. Shakhnovich, E. I. & Gutin, A. M. (1993). Engineering of stable and fast folding sequences of model proteins. Proc. Natl Acad. Sci. USA, 90, 7195-7199.

30. Abkevich, V. I., Gutin, A. M. & Shakhnovich, E. I. (1996). Improved design of stable and fast-folding model proteins. Fold. Des. 1, 221-230.

31. Shakhnovich, E. I. (1997). Theoretical studies of protein-folding thermodynamics and kinetics. Curr. Opin. Struct. Biol. 7, 29-40.

32. Bryngelson, J. D. & Wolynes, P. G. (1987). Spin glasses and the statistical mechanics of protein folding. Proc. Natl Acad. Sci. USA, 84, 7524-7528.

33. Abkevich, V. I., Gutin, A. M. & Shakhnovich, E. I. (1998). Theory of kinetic partitioning in protein folding with possible applications to prions. Proteins: Struct. Funct. Genet. 31, 335-344.

34. Shakhnovich, E. I. (1998). Protein design: a perspective from simple tractable models. Fold. Des. 3, R45-R58.

35. Mirny, L. A. & Shakhnovich, E. I. (1999). Universally conserved positions in protein folds: reading evolutionary signals about stability, folding kinetics and function. J. Mol. Biol. 291, 177-196.

36. Altschuh, D., Vernet, T., Berti, P., Moras, D. & Nagai, K. (1988). Coordinated amino acid changes in homologous protein families. Protein Eng. 2, 193-199.

37. Thomas, D., Casari, G. & Sander, C. (1996). The prediction of protein contacts from multiple sequence alignments. Protein Eng. 9, 941-948.


38. Gutin, A. M., Sali, A., Abkevich, V. I., Karplus, M. & Shakhnovich, E. I. (1998). Temperature dependence of the folding rate in a simple protein model: search for a "glass" transition. J. Chem. Phys. 108, 6466-6483.

39. Buchler, N. E. G. & Goldstein, R. A. (1999). Universal correlation between energy gap and foldability for the random energy model and lattice proteins. J. Chem. Phys. 111, 6599-6609.

40. Mirny, L. A., Finkelstein, A. V. & Shakhnovich, E. I. (2000). Statistical significance of protein structure prediction by threading. Proc. Natl Acad. Sci. USA, 97, 9978-9983.

41. Gutin, A. M. & Shakhnovich, E. I. (1993). Ground state of random copolymers and the discrete random energy model. J. Chem. Phys. 98, 8174-8177.

42. Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F., Jr, Brice, M. D., Rodgers, J. R. et al. (1977). The protein data bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. 112, 535-542.

43. Abola, E. E., Bernstein, F. C., Bryant, S. H., Koetzle, T. F. & Weng, J. (1987). Protein data bank. In Crystallographic Databases-Information Content, Software Systems, Scientific Applications (Allen, F. H., Bergerhoff, G. & Sievers, R., eds), pp. 107-132, Data Commission of the International Union of Crystallography, Cambridge.

44. Ramanathan, S. & Shakhnovich, E. I. (1994). Statistical mechanics of proteins with evolutionary "selected" sequences. Phys. Rev. ser. E, 50, 1303-1312.

45. Sali, A., Shakhnovich, E. I. & Karplus, M. (1994). Kinetics of protein folding. A lattice model study for the requirements for folding to the native state. J. Mol. Biol. 235, 1614-1636.

46. Pande, V. S., Grosberg, A. Y. & Tanaka, T. (1995). Freezing transition of random heteropolymers consisting of arbitrary sets of monomers. Phys. Rev. ser. E, 51, 3381-3393.

47. Saven, J. & Wolynes, P. (1997). Statistical mechanics of the combinatorial synthesis and analysis of folding molecules. J. Phys. Chem. ser. B, 101, 8375-8389.

48. Bowie, J. U., Luthy, R. & Eisenberg, D. (1991). A method to identify protein sequences that fold into a known 3-dimensional structure. Science, 253, 164-170.

49. Press, W. H., Flannery, B. P., Teukolsky, S. A. & Vetterling, W. T. (1989). Numerical Recipes, Cambridge University Press, Cambridge.

50. Shakhnovich, E. I. & Gutin, A. M. (1989). Formation of unique structure in polypeptide chains: theoretical investigation with the aid of a replica approach. Biophys. Chem. 34, 187-199.

51. Shakhnovich, E. I. & Gutin, A. M. (1990). Implications of thermodynamics of protein folding for evolution of primary sequences. Nature, 346, 773-775.

52. Pande, V. S., Grosberg, A. Y. & Tanaka, T. (1997). Statistical mechanics of simple models of protein folding and design. Biophys. J. 73, 3192-3210.

53. Hamming, R. W. (1950). Error detecting and error correcting codes. Bell System Tech. J. 29, 147-160.

54. Rzhetsky, A. & Sitnikova, T. (1996). When is it safe to use an oversimplified substitution model in tree-making? Mol. Biol. Evol. 13, 1255-1265.

55. Feng, D.-F. & Doolittle, R. F. (1997). Converting amino acid alignment scores into measures of evolutionary time: a simulation study of various relationships. J. Mol. Evol. 44, 361-370.

56. Ptitsyn, O. B. (1998). Protein folding and protein evolution: common folding nucleus in different subfamilies of c-type cytochromes? J. Mol. Biol. 278, 655-666.

57. Mirny, L. A., Abkevich, V. I. & Shakhnovich, E. I. (1998). How evolution makes proteins fold quickly. Proc. Natl Acad. Sci. USA, 95, 4976-4981.

58. Ptitsyn, O. B. & Ting, K.-L. H. (1999). Non-functional conserved residues in globins and their possible role as a folding nucleus. J. Mol. Biol. 291, 671-682.

59. Krebs, H., Schmid, F. X. & Jaenicke, R. (1983). Folding of homologous proteins. J. Mol. Biol. 169, 619-635.

60. Lesk, A. M. & Chothia, C. (1980). How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of the globins. J. Mol. Biol. 136, 225-270.

61. Hollecker, M. & Creighton, T. E. (1983). Evolutionary conservation and variation of protein folding pathways. J. Mol. Biol. 168, 409-437.

62. Plaxco, K. W., Spitzfaden, C., Campbell, I. D. & Dobson, C. M. (1997). A comparison of the folding kinetics and thermodynamics of two homologous fibronectin type III modules. J. Mol. Biol. 270, 763-770.

63. Nishimura, C., Prytulla, S., Dyson, H. J. & Wright, P. E. (2000). Conservation of folding pathways in evolutionarily distant globin sequences. Nature Struct. Biol. 7, 679-686.

64. Lorch, M., Mason, J., Clarke, A. & Parker, M. (1999). Effects of core mutations on the folding of a beta-sheet protein: implications for backbone organization in the i-state. Biochemistry, 38, 1377-1385.

65. Schindler, T., Perl, D., Graumann, P., Sieber, V., Marahiel, M. & Schmid, F. (1998). Surface-exposed phenylalanines in the rnp1/rnp2 motif stabilize the cold-shock protein CspB from Bacillus subtilis. Proteins: Struct. Funct. Genet. 30, 401-406.

66. López-Hernández, E. & Serrano, L. (1996). Structure of the transition state for folding of the 129 aa protein CheY resembles that of a smaller protein, CI-2. Fold. Des. 1, 43-55.

67. Russell, R., Sasieni, P. & Sternberg, M. (1998). Supersites within superfolds. Binding site similarity in the absence of homology. J. Mol. Biol. 282, 903-918.

68. Welch, M., Oosawa, K., Aizawa, S. & Eisenbach, M. (1994). Effects of phosphorylation, Mg2+, and conformation of the chemotaxis protein CheY on its binding to the flagellar switch protein FliM. Biochemistry, 33, 10470-10476.

69. Bellsolell, L., Cronet, P., Majolero, M., Serrano, L. & Coll, M. (1996). The three-dimensional structure of two mutants of the signal transduction protein CheY suggests its molecular activation mechanism. J. Mol. Biol. 257, 116-128.

70. Wilcock, D., Pisabarro, M. T., López-Hernández, E., Serrano, L. & Coll, M. (1998). Structure analysis of two CheY mutants: importance of the hydrogen bond contribution to protein stability. Acta Crystallog. sect. D, 54, 378-385.

71. Villegas, V., Martínez, J. C., Avilés, F. X. & Serrano, L. (1998). Structure of the transition state in the folding process of human procarboxypeptidase A2 activation domain. J. Mol. Biol. 283, 1027-1036.

72. van Nuland, N. A. J., Chiti, F., Taddei, N., Raugei, G., Ramponi, G. & Dobson, C. (1998a). Slow folding of muscle acylphosphatase in the absence of intermediates. J. Mol. Biol. 283, 883-891.

73. van Nuland, N. A. J., Meijberg, W., Warner, J., Forge, V., Scheek, R., Robillard, G. & Dobson, C. (1998b). Slow cooperative folding of a small globular protein HPr. Biochemistry, 37, 622-637.

74. Li, H., Tang, C. & Wingreen, N. S. (1997). Nature of driving force for protein folding: a result from analyzing the statistical potential. Phys. Rev. Letters, 79, 765-768.

75. Miyazawa, S. & Jernigan, R. L. (1985). Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation. Macromolecules, 18, 534-552.

76. Miyazawa, S. & Jernigan, R. L. (1996). Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J. Mol. Biol. 256, 623-644.

77. Kussell, E., Shimada, J. & Shakhnovich, E. I. (2001). Excluded volume in protein sidechain packing. J. Mol. Biol. 311, 183-193.

78. Itzhaki, L. S., Otzen, D. E. & Fersht, A. R. (1995). The structure of the transition-state for folding of chymotrypsin inhibitor-2 analyzed by protein engineering methods - evidence for a nucleation-condensation mechanism for protein folding. J. Mol. Biol. 254, 260-288.

79. Hamill, S. J., Steward, A. & Clarke, J. (2000). The folding of an immunoglobulin-like Greek key protein is defined by a common-core nucleus and regions constrained by topology. J. Mol. Biol. 297, 165-178.

80. Chiti, F., Taddei, N., White, P. M., Bucciantini, M., Magherini, F., Stefani, M. & Dobson, C. M. (1999). Mutational analysis of acylphosphatase suggests the importance of topology and contact order in protein folding. Nature Struct. Biol. 6, 1005-1009.

81. Dokholyan, N. V., Buldyrev, S. V., Stanley, H. E. & Shakhnovich, E. I. (2000). Identifying the protein folding nucleus using molecular dynamics. J. Mol. Biol. 296, 1183-1188.

82. Zhang, G., Liu, Y., Ruoho, A. E. & Hurley, J. H. (1997). Structure of the adenylyl cyclase catalytic core. Nature, 386, 247-253.

83. Artymiuk, P. J., Poirrette, A. R., Rice, D. W. & Willett, P. (1997). A polymerase I palm in adenylyl cyclase. Nature, 388, 33-34.

84. Bryant, S. H., Madej, T., Liu, Y., Ruoho, A. E., Zhang, G. & Hurley, J. H. (1997). A polymerase I palm in adenylyl cyclase-reply. Nature, 388, 34.

85. Kim, D. E., Gu, H. & Baker, D. (1998). The sequences of small proteins are not extensively optimized for rapid folding by natural selection. Proc. Natl Acad. Sci. USA, 95, 4982-4986.

86. Gregoret, L. M. & Sauer, R. T. (1998). Tolerance of a protein helix to multiple alanine and valine substitutions. Fold. Des. 3, 119-126.

87. Wang, J., Sattar, A. K. M. A., Wang, C. C., Karam, J. D., Konigsberg, W. H. & Steitz, T. A. (1997). Crystal structure of a pol α family replication DNA polymerase from bacteriophage RB69. Cell, 89, 1087-1099.

88. Doublié, S., Tabor, S., Long, A. M., Richardson, C. C. & Ellenberger, T. (1998). Crystal structure of a bacteriophage T7 DNA replication complex at 2.2 Å resolution. Nature, 391, 251-258.

89. Kiefer, J. R., Mao, C., Braman, J. C. & Beese, L. S. (1998). Visualizing DNA replication in a catalytically active Bacillus DNA polymerase crystal. Nature, 391, 304-307.

90. Gassner, N. C., Baase, W. A. & Matthews, B. W. (1996). A test of the "jigsaw puzzle" model for protein folding by multiple methionine substitutions within the core of T4 lysozyme. Proc. Natl Acad. Sci. USA, 93, 12155-12158.

91. Axe, D. D., Foster, N. W. & Fersht, A. R. (1996). Active barnase variants with completely random hydrophobic core. Proc. Natl Acad. Sci. USA, 93, 5590-5594.

92. Bellsolell, L., Prieto, J., Serrano, L. & Coll, M. (1994). Magnesium binding to the bacterial chemotaxis protein CheY results in large conformational changes involving its functional surface. J. Mol. Biol. 238, 489-495.

93. Berger, J. M., Fass, D., Wang, J. C. & Harrison, S. C. (1998). Structural similarities between topoisomerases that cleave one or both DNA strands. Proc. Natl Acad. Sci. USA, 95, 7876-7881.

94. Fersht, A. R. (1997). Nucleation mechanisms in protein folding. Curr. Opin. Struct. Biol. 7, 3-9.

95. Dokholyan, N. V., Buldyrev, S. V., Stanley, H. E. & Shakhnovich, E. I. (1998). Molecular dynamics studies of folding of a protein-like model. Fold. Des. 3, 577-587.

96. Scala, A., Dokholyan, N. V., Buldyrev, S. V. & Stanley, H. E. (2001). Thermodynamically important contacts in folding of model proteins. Phys. Rev. ser. E, 63, 032901.

97. Fitch, W. M. (1970). Distinguishing homologous from analogous proteins. Syst. Zool. 19, 99-113.

98. Jernigan, R. L. & Bahar, I. (1996). Structure-derived potentials and protein simulations. Curr. Opin. Struct. Biol. 6, 195-209.

99. Tiana, G., Broglia, R. & Shakhnovich, E. I. (2000). Hiking in the energy landscape in sequence space: a bumpy road to good folders. Proteins: Struct. Funct. Genet. 39, 244-251.

100. Berezovsky, I. N., Grosberg, A. Y. & Trifonov, E. N. (2000). Closed loops of nearly standard size: common basic element of protein structure. FEBS Letters, 466, 283-286.

Appendix

Determination of Contact Frequencies from Homopolymer Conformations

The estimation of frequencies is one of the key ingredients in protein design. An alternative approach to that proposed above is to assume that the set of conformational decoys is the set of all possible random coil states of a homopolymer collapsed at temperatures below the theta-point temperature, $T < T_\theta$; these are the states that decoy random heteropolymers explore at the folding transition temperature. Thus, we can determine the frequencies of contacts in an ensemble of random heteropolymers by taking the time average of a contact matrix element $\Delta_{ij}$ over the possible conformations of a homopolymer at $T < T_\theta$.

To compute the frequencies of contacts for a homopolymer of length $N$, we use discrete molecular dynamics simulations.1-3 We model a homopolymer by $N$ beads on a string with the interaction distances scaled to 7.5 Å (see Dokholyan et al.2 for a detailed description of the model and the algorithm). We run the simulation at the temperature $T_\theta$ (the ε parameter2 is set to −1) for $10^7$ time units. (In the discrete molecular dynamics algorithm, the time unit is the average time between subsequent collisions.) After $10^7$ time units of simulation we compute the frequency $f_{ij}$ of each of the $N(N-1)/2$ contacts in our homopolymer.
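The averaging step described above can be sketched as follows. The sketch assumes that the trajectory has already been saved as equally spaced snapshots of bead coordinates, so that the time average reduces to a plain average over snapshots; it does not implement the discrete molecular dynamics simulation itself. The two three-bead snapshots are hypothetical and the function name is ours.

```python
import math


def contact_frequencies(snapshots, cutoff=7.5):
    """Fraction of snapshots in which each pair of beads (i < j) is within cutoff."""
    n = len(snapshots[0])
    counts = {(i, j): 0 for i in range(n) for j in range(i + 1, n)}
    for coords in snapshots:
        for i, j in counts:
            if math.dist(coords[i], coords[j]) <= cutoff:
                counts[(i, j)] += 1
    return {pair: c / len(snapshots) for pair, c in counts.items()}


# Hypothetical snapshots (angstroms): an extended and a bent three-bead chain.
toy_snapshots = [
    [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (5.0, 5.0, 0.0)],
]
toy_freq = contact_frequencies(toy_snapshots)
print(toy_freq)
```

For three beads there are N(N−1)/2 = 3 pairs; the end-to-end pair is in contact only in the bent snapshot, so its frequency is 0.5, while the two bonded pairs are always in contact.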

There are two principal drawbacks of the second method: (1) the probability of occurrence of stable elements of the structure in homopolymers resembling secondary structure in proteins is so low that the distribution of contact lengths, $P(\ell)$, in a homopolymer (not shown) drastically differs from that shown for real proteins in Figure 2 of the main text. (2) The model of a homopolymer used in the simulations strongly differs (e.g. in flexibility) from real proteins. In fact, the problem of building an appropriate model for chain flexibility is so important that small variations in it result in kinetics drastically different from realistic ones (e.g. the appearance of intermediate states) (Borreguero et al., unpublished results; Ding et al., unpublished results). We find that both of these drawbacks make the "homopolymer" approach of estimating the frequencies very inefficient for existing protein models, so we omit it in our studies.

There are two strong advantages of this approach, which make it worthwhile to explore it in the future, upon the availability of realistic protein models: (i) the possibility to generate a large amount of decoy conformations and, thus, achieve statistically highly significant contact frequency spectra; (ii) the independence of the produced decoys from various database biases.

References

1. Dokholyan, N. V., Buldyrev, S. V., Stanley, H. E. & Shakhnovich, E. I. (2000). Identifying the protein folding nucleus using molecular dynamics. J. Mol. Biol. 296, 1183-1188.

2. Dokholyan, N. V., Buldyrev, S. V., Stanley, H. E. & Shakhnovich, E. I. (1998). Molecular dynamics studies of folding of a protein-like model. Fold. Des. 3, 577-587.

3. Zhou, Y. & Karplus, M. (1997). Folding thermodynamics of a three-helix-bundle protein. Proc. Natl Acad. Sci. USA, 94, 14429-14432.

Edited by J. Thornton

(Received 9 April 2001; received in revised form 17 July 2001; accepted 19 July 2001)