What is a Desirable Statistical Energy Function for Proteins and How Can It Be Obtained?


REVIEW

© Copyright 2006 by Humana Press Inc. All rights of any nature whatsoever reserved. 1085-9195/(Online)1559-0283/06/46:00–00/$30.00

Cell Biochemistry and Biophysics 165 Volume 46, 2006

Yaoqi Zhou,1,* Hongyi Zhou,1 Chi Zhang,1 and Song Liu1

Howard Hughes Medical Institute Center for Single Molecule Biophysics, Department of Physiology and Biophysics, State University of New York at Buffalo, 124 Sherman Hall, Buffalo, NY 14214

*Author to whom all correspondence and reprint requests should be addressed. E-mail: [email protected]

Abstract

Can one obtain a physical energy function for proteins from statistical analysis of protein structures? A direct answer to this question is likely “no.” A less demanding question is whether one can produce a statistical energy function that has the desirable features of a physical-based energy function. Such a desirable energy function would be founded on a physical basis with few or no adjustable parameters, reproduce the known physical characteristics of amino acid residues, be mostly database independent and transferable, and, more importantly, be reasonably accurate in various applications. In this review, we show how such a desirable energy function can be obtained by introducing a simple physical-based reference state called DFIRE (Distance-scaled, Finite, Ideal-gas REference state).

Index Entries: knowledge-based potential; hydrophobicity; protein–ligand binding affinity; database dependence.

INTRODUCTION

Most biological activities are directed or regulated by proteins that are made of a gene-specified sequence of 20 amino acid residue types. As a result, almost all diseases can be related to the function or malfunction of proteins. Proteins perform their function through their unique, self-assembled three-dimensional shapes (or structures) and through their specific binding to small molecules, to DNA/RNA (e.g., transcription factors that regulate gene expression), or to other proteins (e.g., molecular recognition in signal transduction). These functional activities are driven by the physical interactions between amino acid residues in aqueous solution and between amino acid residues and biologically active molecules. The physical interactions are described by energy functions. One bottleneck to the solution of the problems of how proteins fold or bind is the lack of an accurate energy function.

The energy functions that are used by the computational biology community are obtained through either a physical-based (1–4) or a “bioinformatics-based” statistical approach (5). In the physical-based approach, energy functions are derived from the laws of physics. However, because they are limited by the computing power available to solve quantum mechanical equations, the physical-based energy functions have to be approximately decomposed into many empirical energetic terms, each of which involves many empirical parameters. Moreover, the contribution of entropy (except for solvation entropy in some models [6–9]) is often not included.

In the bioinformatics-based approach, the free-energy function (potential of mean force) is extracted directly from statistical analysis of known protein structures (5). The resulting energy functions are commonly referred to as knowledge-based statistical energy functions. They have proven effective in many applications such as structure prediction, docking, mutation, and protein–ligand binding (for recent reviews, see refs. 10–14). There are many different forms of statistical energy functions, depending on how statistics are calculated and how proteins are modeled: examples are distance-independent contact energies (15–18), solvent-accessible surface potential (19–21), backbone torsion potential (21–24), and distance-dependent potentials (25–32). Both coarse-grained residue level (15,25–88) and atomic level (16,17,29,30) potentials were developed. This review focuses on all-atom, distance-dependent, pairwise statistical energy functions because distance-dependent, pairwise interaction is the dominant term in physical-based energy functions.

Statistical energy functions, however, are not considered to be physically rigorous because collecting the statistics from the structures of unrelated proteins implies that the unrelated protein structures correspond to different snapshots of the same thermodynamic ensemble (15,33). That is, the chain connectivity and the differences in protein sizes and in the compositions of amino acid residues within a single protein and between different proteins are all neglected. This severe approximation seems to produce unphysical long-range repulsion between hydrophobic residues (33) in the statistical energy function based on the commonly used Sippl approximation (25).

The drastic thermodynamic approximation also seems to cause strong dependence of statistical energy functions on the databases of protein structures that are used to extract the energy functions. Proteins are inhomogeneous mixtures of amino acid residues. The compositions of amino acid residues at the surface, core, and interface of proteins are all different from each other (34–36). Thus structural databases of surfaces, cores, and interfaces of proteins will produce different statistical outcomes (i.e., energy functions). For example, it was shown that the energy function extracted from all-α protein structures is quantitatively different from that extracted from all-β protein structures (37). The structural database of single-chain proteins and the structural database of dimeric interfaces also yield different statistical energy functions for folding and binding (34,38). In practice, the database dependence is often used to produce a system-specific or environment-dependent statistical energy function (34,37–41) to improve the accuracy of the energy function for a specific application. The strong database dependence, however, is not desirable for an energy function designed for proteins because different proteins can differ significantly in their amino acid compositions and three-dimensional structures. It is not possible to develop a specific energy function for every specific case.

However, there is no real proof that the thermodynamic approximation is the cause of the previously mentioned unphysical or undesirable characteristics of statistical energy functions. Nor is there any evidence that multibody interactions are stronger than two-body (pairwise) interactions in proteins, as suggested by the strong database dependence of statistical energy functions. The possible origins of multibody interactions are polarization, entropic effect, and solvent-induced effect. The entropic and solvent-induced effects, however, may be reasonably approximated by pairwise interactions, as in the case of the approximation of the polarization effect by pairwise van der Waals interactions. If true, a pairwise, statistical energy function would be mostly transferable. This review focuses on the search for a “desirable” statistical, pairwise energy function that would have certain characteristics of a commonly used physical-based energy function. More specifically, we desire a statistical energy function built on a physical basis so that it has few or no adjustable parameters, an energy function that captures physical characteristics of amino acid residues, an energy function mostly transferable so that its applications are not limited to specific cases, and an energy function that is reasonably accurate in various applications. This review illustrates how an appropriate reference state for a statistical potential can make all of these happen.

REFERENCE STATE FOR STATISTICAL POTENTIAL

The master equation for deriving a pairwise statistical energy function $u(i,j,r)$ is given by

$$u(i,j,r) = -RT \ln \frac{N_{\mathrm{obs}}(i,j,r)}{N_{\mathrm{exp}}(i,j,r)} \tag{1}$$

in which $R$ is the gas constant, $T$ is the temperature, $N_{\mathrm{obs}}(i,j,r)$ is the number of pairs of atoms $i$ and $j$ separated by a distance between $r - \Delta r/2$ and $r + \Delta r/2$ observed in known protein structures, and $N_{\mathrm{exp}}(i,j,r)$ is the expected number of such pairs in the absence of any interactions (the reference state). Removing the effect of the reference state from the observed state is a necessary step to obtain the net effect of residue–residue and residue–solvent interactions. Clearly, statistical energy functions differ mainly in the choice of the reference state because the observed number of atomic pairs $N_{\mathrm{obs}}(i,j,r)$ for a given database is the same.

Many approaches based on various physical or statistical principles (14,42–47) have been developed to construct reference states. Most commonly used statistical energy functions employ a state statistically averaged over atomic types or over interaction distance to represent the reference state. For example, the reference states used in the Sippl approximation (25), in the residue-specific all-atom probability discriminatory function (RAPDF) (29), and in the atomic knowledge-based potential (KBP) (30), regardless of their physical or statistical origins, all satisfy the equation $\sum_{i,j} N_{\mathrm{exp}}(i,j,r) = \sum_{i,j} N_{\mathrm{obs}}(i,j,r)$. In other words, the distance dependence of the number of atomic pairs in a zero-interaction reference state is the same as that in the observed state with the full interaction strength.
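As a minimal sketch of how Eq. 1 turns pair counts into an energy for one atom pair and one distance bin, the following Python function may help. The function name, the 300 K default temperature, and the choice to return zero for empty bins are assumptions made for illustration; they are not taken from the works reviewed here.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol K)

def statistical_energy(n_obs, n_exp, temperature=300.0):
    """Eq. 1: u = -RT ln(N_obs / N_exp) for one atom pair type and one distance bin."""
    if n_obs <= 0 or n_exp <= 0:
        # Assumption: a bin without counts carries no information, so return zero.
        return 0.0
    return -R_KCAL * temperature * math.log(n_obs / n_exp)

# More pairs observed than expected gives a favorable (negative) energy,
# fewer pairs than expected gives an unfavorable (positive) energy.
print(statistical_energy(200, 100))
print(statistical_energy(50, 100))
```

Everything that distinguishes one statistical potential from another enters through `n_exp`, which is why the choice of reference state matters so much.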

A PHYSICAL REFERENCE STATE BASED ON IDEAL GASES CONFINED IN FINITE PROTEIN-SIZE SPHERES

One can search for an appropriate reference state by starting from the fundamental equation of statistical mechanics where the atom–atom potential of mean force $u(i,j,r)$ is calculated from the pair distribution function $g_{ij}(r)$ as follows (48):

$$u(i,j,r) = -RT \ln g_{ij}(r) = -RT \ln \frac{N_{\mathrm{obs}}(i,j,r)}{N_i N_j \left(4\pi r^2 \Delta r\right)/V}, \tag{2}$$

in which $V$ is the volume and $N_i$ and $N_j$ are the total numbers of atoms $i$ and $j$, respectively. This is a formally exact equation in which the reference state [$N_{\mathrm{exp}}(i,j,r) = N_i N_j (4\pi r^2 \Delta r)/V$] is an ideal gas mixture with uniform number densities of $N_i/V$ and $N_j/V$ for atoms $i$ and $j$, respectively.

To apply Eq. 2 to proteins, two approximations were made (49). First, the number of pairs of ideal-gas points [$N(r)$] in finite protein-size spheres is proportional to the distance $r$ to the power of a fractional number $\alpha$ that is smaller than 2, the exact exponent in an infinite system. That is, $N_{\mathrm{exp}}(i,j,r) = N_i N_j (4\pi r^{\alpha} \Delta r)/V$. Second, the potential of mean force has a finite interaction range with a cutoff distance, $r_{\mathrm{cut}}$. That is, $u(i,j,r) = 0$ for $r > r_{\mathrm{cut}}$. The two approximations lead to the Distance-scaled, Finite, Ideal-gas REference (DFIRE) state for the statistical potential that satisfies (49):

$$u^{\mathrm{DFIRE}}(i,j,r) = \begin{cases} -RT \ln \dfrac{N_{\mathrm{obs}}(i,j,r)}{\left(\dfrac{r}{r_{\mathrm{cut}}}\right)^{\alpha} \dfrac{\Delta r}{\Delta r_{\mathrm{cut}}}\, N_{\mathrm{obs}}(i,j,r_{\mathrm{cut}})}, & r < r_{\mathrm{cut}}, \\[2ex] 0, & r > r_{\mathrm{cut}}, \end{cases} \tag{3}$$

in which $\Delta r_{\mathrm{cut}}$ is the width of the distance bin $\Delta r$ at $r = r_{\mathrm{cut}}$. The DFIRE-based energy function was generated with $r_{\mathrm{cut}}$ = 14.5 Å, $\Delta r$ = 2 Å for r < 2 Å, $\Delta r$ = 0.5 Å for 2 Å < r < 8 Å, and $\Delta r$ = 1 Å for 8 Å < r < 15 Å. The value of $N_{\mathrm{obs}}(i,j,r)$ was obtained from a structural database of 1011 nonhomologous (less than 30% homology) proteins with resolution better than 2 Å (http://chaos.fccc.edu/research/labs/dunbrack/culledpdb.html) (50). Residue-specific atom types (29,30) were used (167 atom types).

Figure 1 compares the DFIRE interaction energy between the Cβ atoms of Leu and Asp [$u^{\mathrm{DFIRE}}_{C_\beta^{\mathrm{Leu}},\,C_\beta^{\mathrm{Asp}}}(r)$] with that between the Cβ atoms of Leu and Ile [$u^{\mathrm{DFIRE}}_{C_\beta^{\mathrm{Leu}},\,C_\beta^{\mathrm{Ile}}}(r)$] as a function of the distance r (in Å) between the atoms. Here the unit for DFIRE energy (kcal/mol) is obtained by multiplying by a prefactor (0.0157) so that the regression slope between predicted and experimentally measured mutation-induced changes in protein stability is 1 (49).

Although a Cβ atom is a nonpolar atom, its association with different types of amino acid residues (hydrophilic or hydrophobic) leads to different interaction energies between them. The interaction between the Cβ atoms of two hydrophobic residues (Leu and Ile) is always attractive (except in the region of the hard-repulsive core), whereas the interaction between the atoms in hydrophobic and hydrophilic residues oscillates between attractive and repulsive. The significant difference between $u^{\mathrm{DFIRE}}_{C_\beta^{\mathrm{Leu}},\,C_\beta^{\mathrm{Ile}}}(r)$ and $u^{\mathrm{DFIRE}}_{C_\beta^{\mathrm{Leu}},\,C_\beta^{\mathrm{Asp}}}(r)$ reflects the effect of solvation. In another example of the solvation effect, the DFIRE interaction between the hydrophobic atom Leu Cβ and the hydrophilic atom Asp Oδ1 is repulsive within a distance of 7 Å, which is significantly longer than typical van der Waals diameters of around 4 Å (not shown).

In the physical-based CHARMM (Chemistry at Harvard Molecular Mechanics) 22 parameter set (1), the Cβ atoms of Leu, Ile, and Asp belong to the same atom type CT2. That is, the Lennard-Jones potential between the Cβ atoms of Leu and Ile is the same as that of Leu and Asp. Figure 1 shows that, compared with the Lennard-Jones potential, both $u^{\mathrm{DFIRE}}_{C_\beta^{\mathrm{Leu}},\,C_\beta^{\mathrm{Asp}}}(r)$ and $u^{\mathrm{DFIRE}}_{C_\beta^{\mathrm{Leu}},\,C_\beta^{\mathrm{Ile}}}(r)$ are less repulsive in the hard-repulsive core because of the solvent-induced hydrophobic attraction between two nonpolar Cβ atoms and less attractive between 4 and 6 Å. The latter is likely the result of entropic and enthalpic compensation in protein stability (51). Thus the fundamental difference between a physical-based force field and a pairwise knowledge-based energy function is that the former is an energy function, whereas the latter is a free energy function that attempts to incorporate entropic and solvation effects via a pairwise approximation. It should be noted that in the physical-based CHARMM force field, the Cβ atoms of Leu, Ile, and Asp also differ from each other by their different partial charges (–0.18, –0.09, and –0.28, respectively), although not by their van der Waals parameters.

Fig. 1. The interaction (in kcal/mol) between Cβ atoms as a function of distance between them (in Å). Long dashed and short dashed lines are between the atoms of residues Leu and Asp and between residues Leu and Ile, respectively. The Lennard-Jones (LJ, dash-dotted line) potential for the Cβ atoms is from the CHARMM force field. The dotted line denotes the location of zero interaction.
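The DFIRE reference state of Eq. 3 can be sketched in a few lines of Python for a single atom pair type. The bin layout, cutoff, and exponent follow the values quoted in the text; the function names, the value of RT (taken here at roughly 300 K), and the treatment of empty bins are assumptions of this sketch, and the 0.0157 prefactor mentioned above is not applied.

```python
import math

ALPHA = 1.61   # exponent quoted in the text
RT = 0.5962    # kcal/mol at about 300 K (assumed temperature)

def dfire_bin_edges():
    """Bin edges in Å: one 2 Å bin, 0.5 Å bins up to 8 Å, then 1 Å bins up to 15 Å."""
    edges = [0.0, 2.0]
    edges += [2.0 + 0.5 * k for k in range(1, 13)]  # 2.5, 3.0, ..., 8.0
    edges += [8.0 + 1.0 * k for k in range(1, 8)]   # 9.0, 10.0, ..., 15.0
    return edges

def dfire_energy(n_obs, edges, alpha=ALPHA, rt=RT):
    """Eq. 3 for one atom pair type; n_obs[k] is the observed pair count in bin k."""
    centers = [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]
    widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
    cut = len(centers) - 1  # the last bin (14-15 Å) is centred on r_cut = 14.5 Å
    energies = []
    for r, dr, n in zip(centers, widths, n_obs):
        # Expected count: the count at the cutoff, rescaled by distance and bin width.
        n_exp = (r / centers[cut]) ** alpha * (dr / widths[cut]) * n_obs[cut]
        if n > 0 and n_exp > 0:
            energies.append(-rt * math.log(n / n_exp))
        else:
            energies.append(0.0)  # assumption: bins without counts get zero energy
    return centers, energies

edges = dfire_bin_edges()
# Synthetic counts that follow the reference state exactly give zero energy everywhere.
ideal = [10.0 * ((lo + hi) / 2) ** ALPHA * (hi - lo) for lo, hi in zip(edges, edges[1:])]
centers, energies = dfire_energy(ideal, edges)
print(max(abs(e) for e in energies))
```

By construction the energy in the cutoff bin is zero, so only deviations of the observed counts from the $r^{\alpha}$ scaling contribute to the potential.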

The DFIRE energy function was compared to the knowledge-based potentials RAPDF and KBP (49). Because the three distance-dependent curves of the RAPDF, KBP, and DFIRE energy functions share a very similar pattern, the results of RAPDF and KBP are not shown in Fig. 1 for clarity. Comparing individual curves of the three methods reveals only small differences among the three potentials. The DFIRE energy function, when averaged over all atomic types, is approximately 10% more attractive than the KBP and RAPDF energy functions in the distance range of 4–12 Å (49).

ASSESSING THE DFIRE ENERGY FUNCTION

Is the DFIRE energy function the desirable statistical energy function that we are looking for? We tested the DFIRE energy function as follows. First, a desirable energy function should have few or no freely adjustable parameters. The DFIRE energy function, built on a reference state of ideal gases confined in protein-size spheres, has only one parameter α, which reflects the finite-size effect of proteins. The question is whether the parameter α derived from the physical reference state of ideal gases yields an optimal (or near optimal) performance. Second, an energy function for proteins should, at a minimum, capture the solvent-induced hydrophobic effect, one of the most important physical characteristics of amino acid residues. Third, it is desirable to have an energy function for proteins that is mostly database independent and transferable across different interaction systems (e.g., protein folding, protein–protein, protein–ligand, protein–DNA interactions) without significant change in its performance. The database independence is required for possible prediction of protein structures with new folds. Finally, the energy function should be reasonably accurate in various applications.

Establishment and Confirmation of the Physical Basis of the DFIRE-Based Energy Function

To establish the DFIRE-based energy function on a firm physical basis, the value of α should be determined by the proposed physical system of ideal-gas points in finite protein-size spheres. The value of α was determined by the best fit of $r^{\alpha}$ to the actual distance-dependent number of pairs of ideal gas points, N(r), in 1011 finite protein-size spheres. The parameter α was found to be 1.61 based on a structural database of 1011 nonhomologous, high-resolution proteins (49).

To confirm the physical basis of the derived energy function, the α value determined by the previously described physical method should yield the optimal performance for the energy function. Indeed, it was shown that the DFIRE energy function with α = 1.61 gives the best performance in native structure discrimination from the decoy sets of 32 proteins and selection of near-native structures from the decoy sets of 11 loops when compared with the DFIRE energy function with α = 1.50 and α = 1.70 (49).

This result, however, does not preclude the possibility of finding an optimized α value that is different from 1.61 for a specific application. The existence of many approximations in deriving the DFIRE energy function may somewhat shift the actual optimal α value for various applications.
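To see why the exponent falls below 2 in a finite volume, the fit of $r^{\alpha}$ to N(r) can be illustrated with a single sphere. The sketch below is not the procedure of ref. 49, which used 1011 protein-size spheres; it assumes one sphere of hypothetical radius, uses the standard analytic distribution of distances between two uniformly random points in a sphere, and fits the slope of ln N(r) against ln r over an assumed range of 2 to 14.5 Å.

```python
import math

def pair_density_in_sphere(r, radius):
    """Probability density of the distance r between two uniform random points in a sphere."""
    x = r / radius
    if x <= 0 or x >= 2:
        return 0.0
    return 3 * x**2 * (1 - 0.75 * x + x**3 / 16) / radius

def fit_exponent(radius, r_min=2.0, r_max=14.5, step=0.5):
    """Least-squares slope of ln N(r) versus ln r, i.e. alpha in N(r) ~ r**alpha."""
    rs = [r_min + step * k for k in range(int((r_max - r_min) / step) + 1)]
    xs = [math.log(r) for r in rs]
    ys = [math.log(pair_density_in_sphere(r, radius)) for r in rs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical sphere radii in Å; the last one mimics an effectively infinite system.
for radius in (15.0, 20.0, 30.0, 1.0e6):
    print(radius, fit_exponent(radius))
```

The fitted exponent approaches 2 as the sphere grows and drops below 2 for protein-size spheres, because pairs at large separation are increasingly cut off by the boundary.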

PARTITION OF HYDROPHOBIC AND HYDROPHILIC RESIDUES IN SOLUBLE PROTEINS

Hydrophobic interactions play an important role in protein folding and binding (52,53). Thus the ability to reproduce the hydrophobicity of 20 amino acid residues should be a basic requirement for an energy function of proteins. To test the DFIRE-based energy function in this aspect, one can evaluate the average contribution of each residue to the total DFIRE energy of 200 randomly selected proteins (called the DFIRE-based stability scale) (54). It was shown that the DFIRE-based stability scale has an excellent correlation with the hydrophobic-residue portion of many oil-to-water transfer free energy scales (for example, the correlation coefficient R = 0.95 with the octanol-to-water transfer scale [55]). The lack of correlation for hydrophilic residues arises because organic solvent is not as good an approximation as the interior of proteins for hydrophilic residues (56), as indicated by the different transfer scales of hydrophilic residues produced by different solvents (57,58). To avoid the uncertainty regarding the hydrophobicity scale of hydrophilic residues, the stability scale of 20 amino acid residues was extracted directly from more than 1000 experimental data of mutation-induced change in stability (54,56). The correlation between DFIRE-based and mutation-extracted stability scales is now significant for both hydrophobic and hydrophilic residues. (The correlation coefficient R = 0.91, 0.93, and 0.79 for all, hydrophobic, and hydrophilic residues, respectively [54].)

The significant correlation between the DFIRE-based and mutation-extracted stability scales indicates that the DFIRE energy function can quantitatively explain the hydrophobic character of 20 amino acid residues. The next question is: Can the DFIRE energy function reproduce the fact that hydrophobic residues tend to be more favorably buried inside proteins than hydrophilic ones? To address this question, the average contribution of a residue to the total DFIRE energies of 200 randomly selected proteins was obtained as a function of its buried accessible surface area (ASA) (54). It was found that the average contributions for all 20 amino acid residues are linearly dependent on their respective buried ASA (R = 0.79 for Cys and R ≥ 0.88 for the other 19 residues). The regression intercepts for all 20 residues (at zero buried ASA) are similar in magnitude (1.9 kcal/mol for Cys and 0.4–1.4 kcal/mol for the rest). That is, fully exposed residues make similar contributions to the stability of proteins because their side chains do not interact strongly with the rest of the protein (except the bonded part). However, the regression slopes vary significantly for different residues. This suggests that the values of the regression slopes (called “buriability”) drive the tendency of a residue to be buried or to be exposed to solvent. Indeed, as shown in Fig. 2, there is an excellent correlation (R = 0.91) between the buriability of a residue (derived from the DFIRE energy function) and the likelihood of that residue to be buried (estimated from 200 randomly selected soluble proteins) (54). The large buriability gap observed between hydrophobic and hydrophilic residues is responsible for the burial of hydrophobic residues in soluble proteins.
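The buriability described above is the slope of a linear regression of a residue's energy contribution against its buried ASA. A minimal sketch of that regression follows; the helper name and the data points are hypothetical and serve only to show the calculation, not to reproduce any values from ref. 54.

```python
import math

def linear_fit(xs, ys):
    """Return (slope, intercept, pearson_r) of an ordinary least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx, sxy / math.sqrt(sxx * syy)

# Made-up data for one residue type: buried ASA (Å^2) and energy contribution (kcal/mol).
buried_asa = [0.0, 40.0, 80.0, 120.0, 160.0]
contribution = [1.0, 0.1, -0.9, -1.7, -2.6]

buriability, intercept, r = linear_fit(buried_asa, contribution)
print(buriability, intercept, r)
```

Repeating the fit for each of the 20 residue types would give one slope per residue, which can then be compared with the observed fraction of buried surface area.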

Reproducing the partitioning of hydrophobic and hydrophilic residues in soluble proteins by the DFIRE energy function is not the trivial consequence of the fact that the energy function was derived from protein structures that are made of a hydrophilic surface and a hydrophobic core. It was shown (54) that the average contribution to total energy given by the statistical energy function of either RAPDF (29) or atomic KBP (30) does not have a significant positive correlation with the buried ASA for most hydrophilic residues. The results on the contribution of residue Asn to protein stability are shown in Fig. 3. The stability contributions given by the DFIRE energy function are in semiquantitative agreement with the results independently obtained from mutation data for both hydrophobic and hydrophilic residues (54) (only Asn is shown here). However, the results given by atomic KBP or RAPDF (not shown) are qualitatively different (54).

Fig. 2. The buriability of amino acid residues derived from the DFIRE energy function has a significant correlation with the likelihood of a residue to be buried (the average fraction of buried accessible surface area obtained from 200 randomly selected protein structures). R = 0.91, 0.86, and 0.84 for all, hydrophobic (filled circles), and hydrophilic (open circles) residues, respectively. There is a large buriability gap between hydrophobic and hydrophilic residues that explains the partitioning of hydrophobic and hydrophilic residues in a single protein.

Fig. 3. The contribution to stability of residue Asn as a function of its buried accessible surface area was extracted from experimental mutation data (filled circles), DFIRE energy function (open circles), and atomic KBP (open diamonds). The two solid lines are from linear regression fits to the DFIRE and KBP results, respectively. Both correlations are significant but different (one positive, one negative), which reveals that DFIRE and KBP are qualitatively different statistical energy functions.

DEPENDENCE ON THE DATABASE OF STRUCTURES

Dependence on Secondary Structures

The database dependence was tested by constructing two separate databases of all-α and all-β proteins (59). All-α and all-β proteins are dominated by local and nonlocal interactions, respectively. The database dependence for the DFIRE-based energy function was compared with that of RAPDF and that of atomic KBP. Figure 4 shows one such comparison, in which the database dependence is illustrated by comparing the energy of proteins given by the α-protein-trained energy function with that given by the β-protein-trained energy function. The relative difference is used (i.e., the energy difference is divided by the average energy predicted by the two energy functions). The average absolute values of the relative energy differences are 0.29 for RAPDF, 0.65 for atomic KBP, and 0.11 for DFIRE. Thus DFIRE is about three (or six) times less database dependent than RAPDF (or KBP). This and other results (59) all indicate that the extracted DFIRE energy function is less dependent on the secondary structure environment surrounding a residue than atomic KBP (or RAPDF).
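The database-dependence measure used here, the average absolute relative energy difference, can be written out explicitly. The sketch below assumes that "average energy" means the arithmetic mean of the two predicted energies for each protein and that its absolute value is used as the denominator; the function name and the toy energies are illustrative only.

```python
def mean_abs_relative_difference(energies_a, energies_b):
    """Average of |E_a - E_b| / |(E_a + E_b) / 2| over a set of proteins."""
    if len(energies_a) != len(energies_b) or not energies_a:
        raise ValueError("need two non-empty energy lists of equal length")
    total = 0.0
    for e_a, e_b in zip(energies_a, energies_b):
        mean_energy = (e_a + e_b) / 2
        total += abs(e_a - e_b) / abs(mean_energy)
    return total / len(energies_a)

# Toy energies of three proteins scored by two differently trained energy functions.
trained_on_alpha = [-100.0, -250.0, -80.0]
trained_on_beta = [-90.0, -240.0, -85.0]
print(mean_abs_relative_difference(trained_on_alpha, trained_on_beta))
```

A value of zero means the two training databases give identical energies for every protein, so smaller values indicate weaker database dependence.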

Dependence on Solvent Exposure

The database dependence was also tested by extracting statistical energy functions from the surface portion (residues whose fraction of ASA >20%) or the core portion (residues whose fraction of ASA <20%) of the 1011 protein structures. Figure 5 compares the energies of 200 randomly selected proteins (54) given by the energy function trained by protein surfaces with those given by the energy function trained by protein cores. The average absolute values of the relative energy differences are 0.27 for RAPDF, 1.01 for atomic KBP, and 0.04 for DFIRE. Thus the DFIRE energy function has only a negligible database dependence (4%). That is, significantly different environments (mostly hydrophilic residues at the surface and hydrophobic residues at the core) yield essentially the same DFIRE energy function. This explains why treating surface and core residues as separate residue types (39) did not improve the accuracy of the DFIRE energy function in structure discrimination (49).

Fig. 4. Energies of native protein structures given by α-protein-trained energy functions vs those given by β-protein-trained energy functions for 64 all-α proteins (filled circles) and 28 all-β proteins (open circles). The results for RAPDF, KBP, and DFIRE are shown as green, blue, and red, respectively. The dashed line indicates the location if there is no database dependence. To plot data in one figure, the energies of all proteins given by an energy function are scaled by the maximum energy in all-β proteins for the energy function trained by the database of β proteins. The average absolute values of relative differences between the energies given by α-protein-trained and β-protein-trained energy functions are 0.29 for RAPDF, 0.65 for atomic KBP, and 0.11 for DFIRE.

Fig. 5. Energies of 200 randomly selected native protein structures given by energy functions trained by the structural database of protein surfaces versus those given by energy functions trained by the structural database of protein cores. The results for RAPDF, KBP, and DFIRE are shown as green, blue, and red, respectively. The dashed line indicates the location if there is no database dependence. To plot data in one figure, the energies of all proteins given by an energy function are scaled by the maximum energy from the energy function trained by the database of protein surfaces. The average absolute values of relative differences between the energies given by the energy functions trained by two different databases are 0.27 for RAPDF, 1.01 for atomic KBP, and 0.04 for DFIRE. (For clarity, not all data for KBP are shown in this figure.)

TRANSFERABILITY

The strong database dependence of typical statistical energy functions makes it difficult to transfer them across different applications. For example, it is best to apply an α-protein-trained energy function to helical proteins. This can be demonstrated from the fact that the energy of native structures of α proteins is substantially lower (more favorable) when given by the α-protein-trained RAPDF (or KBP) energy function than when given by the β-protein-trained RAPDF (or KBP) energy function (Fig. 4). It was also shown that the success rate of discriminating true interfaces from crystal interfaces increases from 59 to 86% as the residue-based KBP changes the training database from monomeric proteins to the interfacial regions of dimers (34).

In contrast, the DFIRE-based energy function is transferable without significant change in its performance. For example, the monomer-based DFIRE potential can be applied directly to study protein–protein interactions with high accuracy (60). The potential was shown to provide an accurate prediction of the binding free energy of protein–peptide and protein–protein complexes (a correlation coefficient R of 0.87 and an RMSD of 1.76 kcal/mol with 69 experimental data points).

The transferability of the DFIRE-based energy function is also revealed by a simple version of the DFIRE energy function that was based on 19 atom types, including atoms not present in amino acid residues (61). This is a significant reduction from the residue-specific atom types (167 atom types) in the original version (49). This DFIRE energy function (trained by 200 protein–ligand complexes) produces reasonably accurate predictions not only for protein–ligand binding affinities (R ≈ 0.63 for several training and testing sets) but also for the binding affinities of protein–protein (peptide) complexes (R = 0.73 for 82 complexes) and protein–DNA complexes (R = 0.83 for 45 complexes). The results for protein–DNA complexes are shown in Fig. 6. A reduction of atom types, however, does decrease the accuracy of the DFIRE energy function. It was shown that the correlation coefficient between theoretically predicted and experimentally measured binding affinities of 82 protein–protein complexes is reduced from 0.79 to 0.73 when the number of atom types decreases from 167 to 19 (61).

THE ACCURACY OF THE DFIRE ENERGY FUNCTION

The DFIRE-based statistical energy function is more physically reasonable and more transferable than several existing statistical energy functions based on the criteria described previously. As a result, there is little to gain from developing a system-specific DFIRE energy function to improve its performance in a specific application. This limitation, however, was found not to affect the performance of the DFIRE energy function across different applications. For example, the monomer-trained DFIRE energy function is even more accurate than a statistical potential trained with interfacial structures of dimers in distinguishing native complex structures from docking decoys (100 vs 52% success rate in 21 dimer/trimer decoy sets) (60). The DFIRE energy function was also found to be at least comparable in accuracy to some physical-based energy functions equipped with various state-of-the-art solvation models (illustrated in loop prediction (62)) or empirical energy functions with many adjustable parameters (illustrated in docking (60) and prediction of protein–ligand binding affinities (61)). The residue-level version of the DFIRE energy function (63) is also more successful in native structure discrimination than several residue-level statistical energy functions that were compared. The accuracy of the DFIRE energy function was also independently verified in predicting protein stability of arc repressor mutants (64) (also see note added after acceptance). Thus, more transferable also means quantitatively more accurate, at least for the limited studies conducted so far.

Fig. 6. The DFIRE energy function trained by the protein–ligand database provides an excellent prediction for the binding affinities of 45 protein–DNA complexes. (The energy function was scaled and shifted to facilitate comparison.)

PROBLEMS AND OUTLOOK

Although the DFIRE energy function is reasonably accurate and useful in many applications, it is far from perfect. For example, the success rate based on the first rank in energy for the native structure is only 29% (9/31) for the DFIRE energy function in Rosetta docking decoys (60). The DFIRE-based energy function fails to provide a reasonable prediction for the mutation-induced change in stability for residues on the surface of proteins (49). The RMSD of the near-native structure selected from loop decoys is about 2 Å larger than that of the best near-native structure in the loop decoys for loops longer than 10 residues (62). A recent application of DFIRE to CAPRI IV and V (Critical Assessment of PRediction of Interactions) was only moderately successful (65). Thus the establishment of the DFIRE energy function represents only a small step toward the realistic free energy function that is responsible for the folding and binding phenomena.
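
As an illustration of how a first-rank success rate such as the 9/31 above is counted, the following minimal sketch ranks each native structure against its decoys by energy. The data layout (one native energy plus a list of decoy energies per set) and the tie rule (only strictly lower decoy energies push the native down) are assumptions made for this example and are not taken from ref. 60.

```python
def native_rank(native_energy, decoy_energies):
    """Rank of the native structure; 1 means no decoy has a strictly lower energy."""
    return 1 + sum(1 for e in decoy_energies if e < native_energy)

def first_rank_success_rate(decoy_sets):
    """Fraction of decoy sets in which the native structure is ranked first."""
    hits = sum(1 for native, decoys in decoy_sets
               if native_rank(native, decoys) == 1)
    return hits / len(decoy_sets)

# Hypothetical energies in arbitrary units, for illustration only.
example_sets = [
    (-10.0, [-5.0, -3.0]),   # native has the lowest energy: success
    (-2.0, [-4.0, -1.0]),    # one decoy is lower: native ranked second
    (-7.0, [-7.5]),          # the only decoy is lower: native ranked second
]
print(first_rank_success_rate(example_sets))
```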

The previously mentioned problems may be caused by the following limitations of the DFIRE energy function. First, the DFIRE energy function, as with many other statistical energy functions, is a pairwise energy function that depends only on distance. As a result, minimization using the energy function will tend to make the structures of proteins more globule-shaped than native ones (66). Second, the solvent contribution to protein interaction is only implicit in most statistical energy functions, including the DFIRE energy function. Consequently, the energy function cannot handle flexible or disordered regions (67), for which the direct interaction with solvent is essential. Third, the orientation dependence of hydrogen-bonding and polar-polar interactions is not explicitly taken into account in the DFIRE energy function. Thus the introduction of multibody potentials (68–73) and orientation-dependent potentials (73–75) may well be useful to make the DFIRE energy function (or its variants) physically and quantitatively more accurate. Work in these areas is in progress.
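
The first limitation can be made concrete with a minimal sketch of how a pairwise, distance-only energy is evaluated: the total is a sum of table lookups indexed by the two atom types and a distance bin, so no orientation or multibody information can enter. This is a generic illustration, not the DFIRE implementation. The table layout, the bin width, the cutoff argument, and the atom-type labels are all hypothetical, and practical details such as excluding pairs within the same residue are omitted.

```python
import math
from itertools import combinations

def pairwise_energy(atoms, table, bin_width, r_cut):
    """Sum tabulated pair energies over all atom pairs closer than r_cut.

    atoms: list of (atom_type, (x, y, z)) tuples.
    table: dict mapping a sorted (type_a, type_b) tuple to a list of
           energies, one per distance bin of width bin_width.
    """
    total = 0.0
    for (type_i, pos_i), (type_j, pos_j) in combinations(atoms, 2):
        r = math.dist(pos_i, pos_j)          # only the distance is used
        if r >= r_cut:
            continue                         # no contribution beyond the cutoff
        key = tuple(sorted((type_i, type_j)))
        total += table[key][int(r // bin_width)]
    return total

# Hypothetical three-atom system and two-bin table, for illustration only.
toy_atoms = [("C", (0.0, 0.0, 0.0)), ("N", (0.5, 0.0, 0.0)), ("C", (1.5, 0.0, 0.0))]
toy_table = {("C", "N"): [-1.0, 0.5], ("C", "C"): [2.0, -0.25]}
print(pairwise_energy(toy_atoms, toy_table, bin_width=1.0, r_cut=2.0))
```

Because each term depends only on the pair distance, rotating one atom group about a fixed partner leaves the energy unchanged, which is why orientation-dependent or multibody terms would have to be added as separate contributions.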

The development of an accurate and computationally efficient energy function for proteins will remain a challenging task in the foreseeable future. A physically more realistic and quantitatively more accurate energy function will make a lasting impact in every field related to protein structure and function. This includes, to name a few, a significant improvement in the accuracy of protein structure prediction (10,13,14), structure-based ligand/protein design (11,12), prediction of protein–DNA binding (76), and macromolecular assemblies (77).

NOTE ADDED AFTER ACCEPTANCE

In a recently published article, Grigoryan and Keating (78) compared the DFIRE statistical potential with several energy functions in their abilities to order the stabilities of more than 30,000 pairs of dimeric complexes. The DFIRE potential achieves a success rate of 70% in ordering, compared with 60% by RosettaDesign and 63% by FOLD-X, in the most difficult test set. Other energy functions that perform better than DFIRE use either adjustable parameters or experimental data as input. Zhu, Xie, and Honig (Proteins, in press, 2006) applied the DFIRE energy function to the structural refinement of segments containing secondary-structure elements (helix, strand, and hairpin). In one test, more than 80% of the conformations from 104 segments of 81 proteins (4 Å RMSD on average from the native conformation) are refined to within 2 Å. These studies independently confirm the accuracy of the DFIRE energy function for folding and binding.

ACKNOWLEDGMENTS

This work was supported by NIH (R01 GM 966049 and R01 GM 068530), a grant from HHMI to SUNY Buffalo, and by the Center for Computational Research and the Keck Center for Computational Biology at SUNY Buffalo. Y.Z. is also in part supported by a two-base fund from the National Science Foundation of China.

REFERENCES

1. Brooks, B. R., Bruccoleri, R. E., Olafson, B. D., States, D. J., Swaminathan, S., and Karplus, M. (1983) CHARMM: a program for macromolecular energy, minimization, and dynamics calculations. J. Comp. Chem. 4, 187–217.

2. Weiner, S. J., Kollman, P., Nguyen, D., and Case, D. (1986) An all atom force field for simulations of proteins and nucleic acids. J. Comp. Chem. 7, 230–252.

3. Jorgensen, W. L., Maxwell, D. S., and Tirado-Rives, J. (1996) Development and testing of the OPLS all-atom force field on conformational energetics and properties of organic liquids. J. Am. Chem. Soc. 118, 11225–11236.

4. Scott, W. R. P., Hünenberger, P. H., Tironi, I. G., et al. (1999) The GROMOS biomolecular simulation program package. J. Phys. Chem. A 103, 3596–3607.

5. Tanaka, S., and Scheraga, H. A. (1976) Medium- and long-range interaction parameters between amino acids for predicting three-dimensional structures of proteins. Macromolecules 9, 945–950.

6. Ghosh, A., Rapp, C. S., and Friesner, R. A. (1998) Generalized Born model based on a surface integral formulation. J. Phys. Chem. B 102, 10983–10990.

7. Lazaridis, T., and Karplus, M. (1999) Effective energy function for proteins in solution. Proteins 35, 133–152.

8. Zhang, L. Y., Gallicchio, E., Friesner, R. A., and Levy, R. M. (2001) Solvent models for protein-ligand binding: Comparison of implicit solvent Poisson and surface generalized Born models with explicit solvent simulations. J. Comp. Chem. 22, 591–607.

9. Dominy, B. N., and Brooks III, C. L. (2002) Identifying native-like protein structures using physics-based potentials. J. Comp. Chem. 23, 147–160.

10. Lazaridis, T., and Karplus, M. (2000) Effective energy function for protein structure prediction. Curr. Opin. Struct. Biol. 10, 139–145.

11. Gohlke, H., and Klebe, G. (2001) Statistical potentials and scoring functions applied to protein-ligand binding. Curr. Opin. Struct. Biol. 11, 231–235.

12. Russ, W. P., and Ranganathan, R. (2002) Knowledge-based potential functions in protein design. Curr. Opin. Struct. Biol. 12, 447–452.

13. Meller, J., and Elber, R. (2002) Protein recognition by sequence-to-structure fitness: Bridging efficiency and capacity of threading models. Adv. Chem. Phys. 120, 77–130.

14. Buchete, N.-V., Straub, J. E., and Thirumalai, D. (2004) Development of novel statistical potentials for protein fold recognition. Curr. Opin. Struct. Biol. 14, 225–232.

15. Miyazawa, S., and Jernigan, R. L. (1985) Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation. Macromolecules 18, 534–552.

16. DeBolt, S. E., and Skolnick, J. (1996) Evaluation of atomic level mean force potentials via inverse folding and inverse refinement of protein structures: atomic burial position and pairwise non-bonded interactions. Protein Eng. 9, 637–655.

17. Zhang, C., Vasmatzis, G., Cornette, J., and DeLisi, C. (1997) Determination of atomic desolvation energies from the structures of crystallized proteins. J. Mol. Biol. 267, 707–726.

18. Skolnick, J., Kolinski, A., and Ortiz, A. (2000) Derivation of protein-specific pair potentials based on weak sequence fragment similarity. Proteins 38, 3–16.

19. Sippl, M. J. (1993) Boltzmann’s principle, knowledge-based mean fields and protein folding. An approach to the computational determination of protein structures. J. Comp. Aid. Mol. Design 7, 473–501.

20. Melo, F., and Feytmans, E. (1998) Assessing protein structures using a non-local atomic interaction energy. J. Mol. Biol. 277, 1141–1152.

21. Zhou, H., and Zhou, Y. (2004) Single-body knowledge-based energy score combined with sequence-profile and secondary structure information for fold recognition. Proteins 55, 1005–1013.

22. Rooman, M. J., Kocher, J.-P. A., and Wodak, S. J. (1991) Prediction of protein backbone conformation based on seven structure assignments. Influence of local interactions. J. Mol. Biol. 211, 961–979.

23. Rooman, M. J., Kocher, J.-P. A., and Wodak, S. J. (1992) Extracting information on folding from the amino acid sequence: accurate predictions for protein regions with preferred conformation in the absence of tertiary interactions. Biochemistry 31, 10226–10238.

24. Kocher, J.-P. A., Rooman, M. J., and Wodak, S. J. (1994) Factors influencing the ability of knowledge-based potentials to identify native sequence-structure matches. J. Mol. Biol. 235, 1598–1613.

25. Sippl, M. J. (1990) Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins. J. Mol. Biol. 213, 859–883.

26. Hendlich, M., Lackner, P., Weitckus, S., et al. (1990) Identification of native protein folds amongst a large number of incorrect models. The calculation of low energy conformations from potentials of mean force. J. Mol. Biol. 216, 167–180.

27. Tobi, D., and Elber, R. (2000) Distance-dependent, pair potential for protein folding: Results from linear optimization. Proteins 41, 40–46.

28. Jones, D. T., Taylor, W. R., and Thornton, J. M. (1992) A new approach to protein fold recognition. Nature 358, 86–89.

29. Samudrala, R., and Moult, J. (1998) An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction. J. Mol. Biol. 275, 895–916.

30. Lu, H., and Skolnick, J. (2001) A distance-dependent atomic knowledge-based potential for improved protein structure selection. Proteins 44, 223–232.

31. Rojnuckarin, A., and Subramaniam, S. (1999) Knowledge-based interaction potentials for proteins. Proteins 36, 54–67.

32. Hardin, C., Eastwood, M. P., Luthey-Schulten, Z., and Wolynes, P. G. (2000) Associative memory Hamiltonians for structure prediction without homology: alpha-helical proteins. Proc. Natl. Acad. Sci. USA 97, 14235–14240.

33. Thomas, P. D., and Dill, K. A. (1996) Statistical potentials extracted from protein structures: how accurate are they? J. Mol. Biol. 257, 457–469.

34. Lu, H., Lu, L., and Skolnick, J. (2003) Development of unified statistical potentials describing protein-protein interactions. Biophys. J. 84, 1895–1901.

35. Glaser, F., Sternberg, D. M., Vakser, I., and Ben-Tal, N. (2001) Residue frequencies and pairing preferences at protein-protein interfaces. Proteins 43, 89–102.

36. Ofran, Y., and Rost, B. (2003) Analyzing six types of protein-protein complexes. J. Mol. Biol. 325, 377–387.

37. Furuichi, E., and Koehl, P. (1998) Influence of protein structure databases on the predictive power of statistical pair potentials. Proteins 31, 139–149.

38. Moont, G., Gabb, H., and Sternberg, M. (1999) Use of pair potentials across protein interfaces in screening predicted docked complexes. Proteins 35, 364–373.

39. Simons, K. T., Kooperberg, C., Huang, E., and Baker, D. (1997) Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J. Mol. Biol. 268, 209–225.

40. Eyrich, V. A., Standley, D. M., and Friesner, R. A. (2002) Ab initio protein structure prediction using a size-dependent tertiary folding potential. Adv. Chem. Phys. 120, 223–263.

41. Dehouck, Y., Gilis, D., and Rooman, M. (2004) Database-derived potentials dependent on protein size for in silico folding and design. Biophys. J. 87, 171–181.

42. Jernigan, R. L., and Bahar, I. (1996) Structure-derived potentials and protein simulations. Curr. Opin. Struct. Biol. 6, 195–209.

43. Moult, J. (1997) Comparison of database potentials and molecular mechanics force fields. Curr. Opin. Struct. Biol. 7, 194–199.

44. Betancourt, M. R., and Thirumalai, D. (1999) Pair potentials for protein folding: choice of reference states and sensitivity of predicted native states to variations in the interaction schemes. Protein Sci. 8, 361–369.

45. Mitchell, J. B. O., Laskowski, R. A., Alex, A., and Thornton, J. M. (1999) BLEEP - potential of mean force describing protein-ligand interactions: I. generating potential. J. Comp. Chem. 20, 1165–1176.

46. Vijayakumar, M., and Zhou, H.-X. (2000) Prediction of residue-residue pair frequencies in proteins. J. Phys. Chem. B 104, 9755–9764.

47. McConkey, B. J., Sobolev, V., and Edelman, M. (2003) Discrimination of native protein structures using atom-atom contact scoring. Proc. Natl. Acad. Sci. USA 100, 3215–3220.

48. Friedman, H. L. A Course in Statistical Mechanics. Prentice-Hall, Inc., Englewood Cliffs, NJ, 1985.

49. Zhou, H., and Zhou, Y. (2002) Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Sci. 11, 2714–2726.

50. Wang, G., and Dunbrack Jr., R. L. (2003) PISCES: a protein sequence culling server. Bioinformatics 19, 1589–1591.

51. Privalov, P. L. (1979) Stability of proteins: Small globular proteins. Adv. Protein Chem. 33, 167–241.

52. Kauzmann, W. (1959) Some factors in the interpretation of protein denaturations. Adv. Protein Chem. 14, 1–63.

53. Dill, K. A. (1990) Dominant forces in protein folding. Biochemistry 29, 7133–7155.

54. Zhou, H., and Zhou, Y. (2004) Quantifying the effect of burial of amino acid residues on protein stability. Proteins 54, 315–322.

55. Sharp, K. A., Nicholls, A., Friedman, R., and Honig, B. (1991) Extracting hydrophobic free energies from experimental data: relationship to protein folding and theoretical models. Biochemistry 30, 9686–9697.

56. Zhou, H., and Zhou, Y. (2002) The stability scale and atomic solvation parameters extracted from 1023 mutation experiments. Proteins 49, 483–492.

57. Karplus, P. A. (1997) Hydrophobicity regained. Protein Sci. 6, 1302–1307.

58. Chan, H. S. Amino acid side-chain hydrophobicity. In Encyclopedia of Life Sciences. Nature Publishing Group, London, UK, 2001.

59. Zhang, C., Liu, S., Zhou, H., and Zhou, Y. (2004) The dependence of all-atom statistical potentials on training structural database. Biophys. J. 86, 3349–3358.

60. Liu, S., Zhang, C., Zhou, H., and Zhou, Y. (2004) A physical reference state unifies the structure-derived potential of mean force for protein folding and binding. Proteins 56, 93–101.

61. Zhang, C., Liu, S., Zhu, Q., and Zhou, Y. (2005) A knowledge-based energy function for protein-ligand, protein-protein and protein-DNA complexes. J. Med. Chem. 48, 2325–2335.

62. Zhang, C., Liu, S., and Zhou, Y. (2004) Accurate and efficient loop selections using DFIRE-based all-atom statistical potential. Protein Sci. 13, 391–399.

63. Zhang, C., Liu, S., Zhou, H., and Zhou, Y. (2004) An accurate residue-level pair potential of mean force for folding and binding based on the distance-scaled ideal-gas reference state. Protein Sci. 13, 400–411.

64. de Armas, R. R., Diaz, H. G., Molina, R., and Uriarte, E. (2004) Markovian backbone negentropies: Molecular descriptors for protein research. I. Predicting protein stability in arc repressor mutants. Proteins 56, 715–723.

65. Zhang, C., Liu, S., and Zhou, Y. (2005) Docking prediction using biological information, ZDOCK sampling technique and clusterization guided by the DFIRE statistical energy function. (CAPRI Special Issue) Proteins 60, 314–318.

66. Li, H., and Zhou, Y. (2005) Fold helical proteins by energy minimization in dihedral space and a DFIRE-based statistical energy function. J. Bioinform. Comp. Biol. 3, 1151–1170.

67. Dunker, A. K., Brown, C. J., and Obradovic, Z. (2002) Identification and functions of usefully disordered proteins. Adv. Protein Chem. 62, 25–49.

68. Kolinski, A., Galazka, W., and Skolnick, J. (1996) On the origin of the cooperativity of protein folding: Implication from model simulations. Proteins 26, 271–287.

69. Liwo, A., Oldziej, S., Pincus, M. R., Wawak, R. J., Rackovsky, S., and Scheraga, H. A. (1997) A united-residue force field for off-lattice protein-structure simulations. I. functional forms and parameters of long-range sidechain interaction potentials from protein crystal data. J. Comp. Chem. 18, 849–872.

70. Gan, H. H., Tropsha, A., and Schlick, T. (2001) Lattice protein folding with two and four-body statistical potentials. Proteins 43, 161–174.

71. Krishnamoorthy, B., and Tropsha, A. (2003) Development of a four-body statistical pseudo-potential to discriminate native from non-native protein conformations. Bioinformatics 19, 1540–1548.

72. Ejtehadi, M. R., Avall, S. P., and Plotkin, S. S. (2004) Three-body interactions improve the prediction of rate and mechanism in protein folding models. Proc. Natl. Acad. Sci. USA 101, 15088–15093.

73. Grishaev, A., and Bax, A. (2004) An empirical backbone-backbone hydrogen-bonding potential in proteins and its applications to NMR structure refinement and validation. J. Am. Chem. Soc. 126, 7281–7292.

74. Kortemme, T., Morozov, A., and Baker, D. (2003) An orientation-dependent hydrogen bonding potential improves prediction of specificity and structure for proteins and protein-protein complexes. J. Mol. Biol. 326, 1239–1259.

75. Buchete, N.-V., Straub, J. E., and Thirumalai, D. (2004) Orientational potentials extracted from protein structures improve native fold recognition. Protein Sci. 13, 862–874.

76. Sternberg, M. J. E., Gabb, H. A., and Jackson, R. M. (1998) Predictive docking of protein-protein and protein-DNA complexes. Curr. Opin. Struct. Biol. 8, 250–256.

77. Devos, D., Dokudovskaya, S., Alber, F., et al. (2004) Components of coated vesicles and nuclear pore complexes share a common molecular architecture. PLoS Biol. 2, 1–9.

78. Grigoryan, G., and Keating, A. E. (2006) Structure-based prediction of bZIP partnering specificity. J. Mol. Biol. 355, 1125–1142.
