
J. Mol. Biol. (1996) 255, 254–266

Description of RNA Folding by “Simulated Annealing”

Michael Schmitz and Gerhard Steger*

Institut für Physikalische Biologie, Heinrich-Heine-Universität Düsseldorf, Germany

An algorithm is proposed which describes the thermodynamically as well as the kinetically controlled folding process of RNA. The algorithm, based on a special Monte Carlo procedure known as “Simulated Annealing”, takes into account the probabilities for opening and closing of single base-pairs. Thus, the algorithm is able to reach structures and structure distributions near the global minimum of structure space, and is not restricted by the tendency to halt in local minima. Three types of structural folding processes may be analysed by this algorithm. Firstly, using thermodynamic data, structure ensembles comparable to those obtained by dynamic programming are achieved. Secondly, using kinetic data, the processes of structure formation and structural rearrangement may be simulated. Thirdly, additionally taking into account RNA polymerase chain elongation rates, the process of “sequential folding” during transcription may be described. Analysis of all types of structural folding and refolding is performed for RNA sequences related to potato spindle tuber viroid (PSTVd). The computed results are in accordance with experimental data and biological functions known for PSTVd.

© 1996 Academic Press Limited

Keywords: Monte Carlo; kinetics; sequential folding; secondary structure; potato spindle tuber viroid

*Corresponding author

Introduction

RNA secondary structure plays an essential role in the biological function of many RNAs (for a review, see Wyatt et al., 1989), either by directly mediating the function or by supporting the formation of an essential tertiary structure motif. Thus, reliable prediction of RNA structure is an important prerequisite for the analysis of functional aspects of RNA. The best mathematical algorithm available today for such predictions is based on graph theory (Bellman & Kalaba, 1960) and dynamic programming, which was adapted to the RNA problem by several authors (Nussinov et al., 1978; Waterman & Smith, 1978; Zuker & Stiegler, 1981; Steger et al., 1984; Williams & Tinoco, 1986; Jaeger et al., 1990). By minimizing the free energy $\Delta G^0$, the algorithm is guaranteed to find the optimal and near-optimal secondary structures for a given sequence. Calculation of equilibrium structure ensembles of RNA secondary structures by similar algorithms (McCaskill, 1990; Schmitz & Steger, 1992) allows the prediction of calorimetric and UV melting curves, which are in good accordance with experimental data obtained under high ionic strength conditions. These algorithms, however, are based on the assumption that the RNA is in thermodynamic equilibrium. This is certainly not always the case, for example during or directly after synthesis of the RNA.

In each case where a significant presence of thermodynamically metastable structures has to be assumed, the thermodynamic methods described above cannot be used to gain insight into the partitioning of structures based on individual activation barriers for structural rearrangements. Furthermore, thermodynamic calculations yield no information on the pathway of folding into optimal structures. In such cases a kinetic approach has to be used. Kinetic approaches to RNA structure prediction were introduced by Martinez (1984), Mironov et al. (1985) and Mironov & Kister (1986). Both groups propose a Monte Carlo construction of secondary structure(s) based on probability or rate constants for iterative addition of complete helical regions to some already existing structure. This approach, however, is close to a gradient-descent method due to the high activation barriers involved in manipulation of complete helical segments. Consequently, this approach tends to halt in local minima of the energy landscape of RNA secondary structure; i.e. it does not allow description of the kinetically controlled pathway of refolding from metastable into thermodynamically optimal structures.

Abbreviations used: nt(s), nucleotide(s); PSTVd, potato spindle tuber viroid.

0022–2836/96/010254–13 $12.00/0 © 1996 Academic Press Limited

Therefore, we decided to develop a less coarse-grained approach taking into account the closing and opening of single base-pairs. However, a pure Monte Carlo approach still tends to produce mainly energetically unfavourable structures because the selection of structures is not biased by free energy. A more efficient sampling of favourable conformations can be achieved by using a Boltzmann-weighted selection of conformational changes (Metropolis et al., 1953). Due to the discrete nature of conformational changes in RNA secondary structures, the average free energy change required for opening a single base-pair is greater than the thermal energy at room temperature. Thus, the resulting transition probabilities for an escape from local minima are extremely low, leading to a highly inefficient and slow optimization process. For such cases Kirkpatrick et al. (1983) suggested a method, termed “Simulated Annealing”, to achieve a continuous transition from “random walk” selection of conformational changes at the beginning of an optimization to Metropolis-like behaviour at the end of the process. Choosing this approach, we developed a program able to perform three types of structural folding. Firstly, using thermodynamic data, structure ensembles comparable to those obtained by dynamic programming are achieved. Secondly, using kinetic data, the processes of structure formation and structural rearrangement may be simulated. Thirdly, the process of “sequential folding” during transcription may be described by additionally taking into account chain elongation rates of RNA polymerases for the process of increasing chain length.

Application of the Simulated Annealing algorithm developed for modelling the RNA folding process is demonstrated for RNAs related to PSTVd (for a review, see Diener, 1987). The mature RNA of PSTVd consists of a covalently closed circle of 359 nts; this naked RNA is able to infect certain plants. Neither (+)stranded nor (−)stranded PSTVd RNAs are translated into proteins. Thus, for all its biological functions PSTVd depends on providing different substructures as recognition elements for host factors. The native structure of PSTVd is a rod-like series of short helices and loops without bifurcations or any tertiary structure. This native structure, experimentally determined by several biophysical (for a review, see Riesner et al., 1979) and biochemical (Gross et al., 1978) methods, is predicted accurately by dynamic programming methods (Steger et al., 1984; Schmitz & Steger, 1992). In contrast to this native and thermodynamically optimal structure of the mature molecule, the multimeric RNAs produced as intermediates during replication are supposed to exist as highly branched, thermodynamically metastable structures (Gruner et al., 1995) characterized by G·C-rich hairpins that are not part of their thermodynamically optimal, native structure. The infecting circular PSTVd and the (−)strand intermediate are replicated by DNA-dependent RNA polymerase II (Schindler & Mühlbach, 1992). For replication of the (+)strand circle into multimeric (−)strands a helix that is part of the native structure seems to be critical, whereas for replication of the (−)strand into multimeric (+)strands a metastable hairpin is important (Qu et al., 1993). This hairpin, designated hairpin II, has a length of ten base-pairs including nine G·C base-pairs but is not part of the native structure. Newly synthesized (+)strands have to rearrange from metastable structures into thermodynamically favourable structures for processing (Steger et al., 1992); i.e. only such structures are cleaved and ligated by host enzymes into mature circles (Tsagris et al., 1987). Owing to these structural features, the various PSTVd RNAs are ideal objects for analysis by our Simulated Annealing algorithm.

Theory

The algorithm described below aims at modelling the structure formation of RNA stochastically, thus mimicking the “trial and error” behaviour of a folding RNA molecule in vivo. This stochastic modelling approach requires Monte Carlo methods to simulate the development of the system. Structure formation or rearrangement is simulated by iteratively generating conformational changes and randomly sampling conformations. The allowed conformational changes are formation and disruption of single base-pairs in the secondary structure. The conformational changes are accompanied by energy changes that are used in both the selection and averaging steps of the Monte Carlo process. Depending on the type of process intended, the energy changes, $\Delta E$, are either changes in free energy, $\Delta\Delta G^0$, or free activation energy, $\Delta E^\ddagger$. We use stability parameters (Steger et al., 1984) that successfully predict the thermodynamic behaviour of viroids (Schmitz & Steger, 1992) or mRNA (Rosenbaum et al., 1993).

Simple Monte Carlo methods choose conformational changes randomly from all changes possible at a given state, regardless of the resulting energy change of the system. This non-biased selection results in sampling of many conformations with unfavourable energy, which are suppressed later during the averaging process by assigning them a weight corresponding to the Boltzmann factor $\exp(-\Delta E/RT)$. To improve sampling of significant conformations, we applied the Boltzmann weight to the selection of conformations, as proposed as a generally applicable method by Metropolis et al. (1953). In this scheme, each conformational change is accepted with a probability dependent on the energy change. In the case of a favourable energy change the new conformation is always accepted, whereas in the case of an unfavourable energy change the probability of accepting the new conformation equals the corresponding Boltzmann factor. If a conformation is rejected, the old conformation is used again for averaging as well as for generating the next conformational change. For an infinite number of conformational changes and using free energies for calculation of the transition probabilities, this procedure leads to a Boltzmann equilibrium distribution. If the difference between the activation energies of inverse processes equals the free energy difference between corresponding states, the application of activation energies also leads to an equilibrium distribution.
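The Metropolis selection rule described above can be illustrated with a minimal sketch. It is not the authors' program: the two-state toy system, the function names and the step count are assumptions made only for illustration. The sketch shows that re-counting the old state after a rejection drives the sampled populations towards the Boltzmann distribution.

```python
import math
import random

R = 8.314462618e-3  # gas constant in kJ/(mol K)


def metropolis_accept(delta_e, temperature, rng):
    """Accept favourable changes always, unfavourable ones with exp(-dE/RT)."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / (R * temperature))


def sample_two_state(delta_g, temperature, steps, seed=0):
    """Fraction of steps spent in the upper state of a hypothetical two-state system.

    State 1 lies delta_g (kJ/mol) above state 0; each step proposes a switch.
    """
    rng = random.Random(seed)
    state = 0
    upper_count = 0
    for _ in range(steps):
        delta_e = delta_g if state == 0 else -delta_g
        if metropolis_accept(delta_e, temperature, rng):
            state = 1 - state
        # After a rejection the old state is counted again, as in the text.
        upper_count += state
    return upper_count / steps


# Compare the sampled population with the Boltzmann expectation.
boltzmann = math.exp(-2.0 / (R * 298.15))
expected_upper = boltzmann / (1.0 + boltzmann)
sampled_upper = sample_two_state(delta_g=2.0, temperature=298.15, steps=200_000)
```

For a free energy gap of 2 kJ/mol at 298.15 K the sampled fraction should approach the Boltzmann value of roughly 0.31 as the number of steps grows.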

Because of the highly favourable free energy of base-pair stacking, any conformational change of an RNA secondary structure is accompanied by a change in free energy comparable with or even significantly greater than the thermal energy at physiological temperatures. Thus, the resulting transition probabilities are extremely low, leading to a highly inefficient and slow optimization process. To avoid this fundamental disadvantage, we used a method, termed Simulated Annealing (Kirkpatrick et al., 1983), which achieves a continuous transition from “random walk” selection of conformational changes at the beginning of an optimization process to Metropolis-like behaviour, or even further to strict gradient-descent optimization, at the end. In this procedure the thermodynamic temperature controls the energy-dependent transition probabilities. To achieve the behaviour described above, the temperature has to vary from an initially high level, resulting in random walk selection, to an intermediate value, providing a more energy-directed selection, or even to a value of 0 K for freezing the system into a “steepest descent” regime at the end. Most optimization problems solved by Simulated Annealing do not possess an intrinsic temperature; i.e. their energy functions are not explicitly temperature-dependent. Thus, the temperature may be introduced as a controlling parameter without perturbing the energy function used as the target of optimization. Since the free energy of an RNA structure itself is temperature-dependent, this procedure cannot be applied directly to the RNA folding problem. Instead, we have introduced a “distribution parameter”, $U$, that controls the simulation process through the Boltzmann probability distribution function, $p = \exp(-\Delta E(T)/RU)$, whereas the thermodynamic temperature $T$ is used to calculate the free energy of the structure.
In order to switch from simulation phases of less pronounced energy-dependent selection of conformational changes to phases with predominantly Boltzmann-weighted selection, we varied $U$ independently of $T$. With the initially high value of $U$ ($U \gg T$), unfavourable structural rearrangements are allowed and global features of the structure are optimized during this random walk regime of simulation. Structural details are optimized locally in the final, Metropolis-like regime ($U \gtrsim T$) of the process. For variation of the distribution parameter $U$ with iteration steps $m$ during the optimization, we used a simple hyperbolic function:

$$U(m) = T + U_{\mathrm{init}} \times \left( \frac{n_{1/2} + 1}{n_{1/2} + m} - \frac{n_{1/2} + 1}{n_{1/2} + n_{\mathrm{total}}} \right)$$

To retain a Boltzmann-like distribution of structures, $U$ starts with an initially high value, $T + U_{\mathrm{init}}$, and drops to the thermodynamic temperature $T$. The half value between initial and final value of $U$:

$$U(n_{1/2}) = T + U_{\mathrm{init}}/2$$

is reached after a user-selectable number of steps, $n_{1/2}$, that is typically a small fraction of the total number of iterations, $n_{\mathrm{total}}$. The resulting algorithm is shown schematically in Figure 1. Details of the algorithm's crucial elements are explained in the following paragraphs.

Figure 1. Schematic representation of the Simulated Annealing algorithm developed for the description of ssRNA folding. The outer Repeat/Until loop has to be executed for each iteration cycle; this loop includes the option of performing the optimization either iteratively in a number of cycles with a varying temperature $T$ or repeatedly at a constant temperature $T$. In the inner Repeat/Until loop conformational changes are generated while decreasing the distribution parameter $U$ from $U_{\mathrm{init}} + T$ to $T$ hyperbolically in $n_{\mathrm{total}}$ steps. A conformational change is accepted if either the transition energy (free energy difference $\Delta\Delta G^0$ for thermodynamic selection or $\Delta E^\ddagger_{i\cdot j}$ for kinetic selection) is negative or a random number $r$ is less than the transition probability.
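The hyperbolic schedule for $U(m)$ given above can be evaluated directly. The following is a minimal sketch, not the authors' code; the function name is an assumption, and the numerical values are those quoted elsewhere in the text for the 359-nt circular PSTVd ($n_{\mathrm{total}} = 10 \times N^2$, half-decay fraction 0.05, $U_{\mathrm{init}} = 1000$ K).

```python
def distribution_parameter(m, temperature, u_init, n_half, n_total):
    """Hyperbolic schedule U(m) from the text; m runs from 1 to n_total."""
    return temperature + u_init * (
        (n_half + 1) / (n_half + m) - (n_half + 1) / (n_half + n_total)
    )


N = 359                    # sequence length in nucleotides
n_total = 10 * N**2        # total number of steps per iteration cycle
n_half = 0.05 * n_total    # half-decay point
T = 298.15                 # thermodynamic temperature in K
U_INIT = 1000.0            # initial excess of U over T in K

u_start = distribution_parameter(1, T, U_INIT, n_half, n_total)
u_half = distribution_parameter(n_half, T, U_INIT, n_half, n_total)
u_end = distribution_parameter(n_total, T, U_INIT, n_half, n_total)
```

With this functional form the end value is exactly $T$, whereas the start and half values equal $T + U_{\mathrm{init}}$ and $T + U_{\mathrm{init}}/2$ only approximately: both are lowered by the constant term $U_{\mathrm{init}} \times (n_{1/2} + 1)/(n_{1/2} + n_{\mathrm{total}})$, which is below 5% of $U_{\mathrm{init}}$ for a half-decay fraction of 0.05.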


Generation of conformational changes

In our model admissible conformational changes are restricted to single base-pair changes. Such changes were found to be sufficient during preliminary tests. Extending the algorithm to incorporate “base-pair sliding” leads to computational overhead during determination of possible correlated changes.

To generate a conformational change, a first base is chosen randomly from the sequence and its base-pairing state is determined. If an unpaired base was selected, a second base is chosen randomly for pairing with the first one. According to the topological characteristics of secondary structure, the second base is picked from the loop containing the first base, thereby excluding “nonsense” moves leading to disallowed conformations. If the two selected bases are not able to form a base-pair, the change is rejected. If the first base is part of a base-pair, the obvious move is to open that base-pair. However, it is necessary to select the second base randomly as well, from the helix containing the first base. Otherwise a statistically biased selection of closing and opening moves would result. If the second base is not the base-pair partner of the base selected first, the opening of the base-pair is rejected. As the next step, a final decision is made on the acceptance or rejection of the generated change; this decision depends on the transition energy of the change.
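The move generation described above may be sketched as follows. This is an illustration under several assumptions that the text does not specify: a linear (not circular) chain, a pairing table `pair` with `-1` for unpaired positions, Watson-Crick plus G·U pairing rules, a minimum of three unpaired bases in a hairpin loop, and a "helix" defined as an uninterrupted stack of base-pairs. All function names are hypothetical.

```python
import random

# Assumed pairing rules: Watson-Crick pairs plus the G.U wobble pair.
CAN_PAIR = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def loop_members(pair, i):
    """Unpaired positions sharing a loop with unpaired position i (linear chain)."""
    members = []
    k = i + 1
    while k < len(pair):              # walk towards the 3' end
        if pair[k] == -1:
            members.append(k)
            k += 1
        elif pair[k] > k:             # branching helix: jump over it
            k = pair[k] + 1
        else:                         # closing base of the enclosing pair
            break
    k = i - 1
    while k >= 0:                     # walk towards the 5' end
        if pair[k] == -1:
            members.append(k)
            k -= 1
        elif pair[k] < k:             # branching helix: jump over it
            k = pair[k] - 1
        else:                         # opening base of the enclosing pair
            break
    return members


def helix_members(pair, i):
    """All positions of the uninterrupted stack containing paired position i."""
    a, b = min(i, pair[i]), max(i, pair[i])
    while a > 0 and b + 1 < len(pair) and pair[a - 1] == b + 1:  # extend outwards
        a, b = a - 1, b + 1
    members = []
    while a < b and pair[a] == b:                                # collect inwards
        members += [a, b]
        a, b = a + 1, b - 1
    return members


def propose_move(seq, pair, rng):
    """Return ('close', i, j), ('open', i, j), or None for a rejected proposal."""
    i = rng.randrange(len(seq))
    if pair[i] == -1:
        candidates = loop_members(pair, i)
        if not candidates:
            return None
        j = rng.choice(candidates)
        # Assumed constraints: bases must be complementary, hairpin loop >= 3 nt.
        if (seq[i], seq[j]) not in CAN_PAIR or abs(i - j) < 4:
            return None
        return ("close", min(i, j), max(i, j))
    # Paired first base: the second base is drawn from the same helix.
    j = rng.choice([k for k in helix_members(pair, i) if k != i])
    if j != pair[i]:
        return None
    return ("open", min(i, j), max(i, j))
```

A proposal returned by `propose_move` is only a candidate; whether it is applied still depends on the transition energy, as stated in the text.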

Calculation of transition energies

The free energy change of a structural change caused by opening or closing a single base-pair is calculated directly from the stability parameters by taking into account the structural environment of the base-pair. Opening of a stacked base-pair either disrupts a single stack and increases the size of the adjacent loop or disrupts two stacks and creates a new internal loop of two bases. A non-stacked base-pair, which is required as a transient intermediate for nucleation of new helical segments, causes the two adjacent loops to be joined into a single one upon disruption. Upon closing of a base-pair, the corresponding complementary processes occur: either stacks are formed and loop size is decreased, or a loop is partitioned into two loops separated by a single non-stacked pair.

To calculate the free energy difference, $\Delta\Delta G^0$, of a base-pair change, the structural environments before (state 1) and after the change (state 2) are taken into account by the equation:

$$\Delta\Delta G^0 = \Delta G^0_2 - \Delta G^0_1 = (\Delta H^{0,s}_2 - \Delta H^{0,s}_1) - T \times \left( (\Delta S^{0,s}_2 - \Delta S^{0,s}_1) + (\Delta S^{0,l}_2 - \Delta S^{0,l}_1) \right)$$

$\Delta H^{0,s}_i$ denotes the enthalpic contribution to the free energy of structure $i$. $\Delta S^{0,s}_i$ and $\Delta S^{0,l}_i$ denote the entropic contributions resulting from stacking $s$ and loop closure $l$, respectively.
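The free energy difference defined above reduces to simple arithmetic once the stacking and loop terms of both states are known. The sketch below assumes a hypothetical container `LocalState` for these terms; the numerical values are placeholders chosen for illustration and are not stability parameters from the cited literature.

```python
from dataclasses import dataclass


@dataclass
class LocalState:
    """Thermodynamic terms of the environment of one base-pair (hypothetical)."""
    dh_stack: float  # stacking enthalpy in kJ/mol
    ds_stack: float  # stacking entropy in kJ/(mol K)
    ds_loop: float   # loop-closure entropy in kJ/(mol K)


def delta_delta_g(before, after, temperature):
    """Free energy difference between state 2 (after) and state 1 (before)."""
    d_h = after.dh_stack - before.dh_stack
    d_s = (after.ds_stack - before.ds_stack) + (after.ds_loop - before.ds_loop)
    return d_h - temperature * d_s


# Placeholder values only: closing a pair that adds one stack and shrinks a loop.
open_state = LocalState(dh_stack=0.0, ds_stack=0.0, ds_loop=-0.030)
closed_state = LocalState(dh_stack=-50.0, ds_stack=-0.130, ds_loop=-0.028)
ddg_close = delta_delta_g(open_state, closed_state, temperature=298.15)
ddg_open = delta_delta_g(closed_state, open_state, temperature=298.15)
```

Because the expression is a difference of state functions, opening and closing the same base-pair give values of equal magnitude and opposite sign.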

Calculation of free activation energies, $\Delta E^\ddagger$, is based on the following assumptions. For disruption of a base-pair the transition state consists of the non-stacked state of the base-pair to be changed. For formation of a base-pair the transition state reflects the change in the loop conformation. This results in

$$\Delta E^\ddagger_{d,i\cdot j} = -(\Delta H^{0,s}_{1,i\cdot j} - T \times \Delta S^{0,s}_{1,i\cdot j})$$

for disruption of the base-pair $i\cdot j$ and

$$\Delta E^\ddagger_{f,i\cdot j} = -T \times (\Delta S^{0,l}_2 - \Delta S^{0,l}_1)$$

for formation of the base-pair $i\cdot j$, where state 1 denotes the start conformation and state 2 the final conformation of the structure. The kinetic rate constants of a base-pair disruption or formation are given by the Arrhenius equation:

$$k_{i\cdot j} = k_{\mathrm{cal}} \times \exp(-\Delta E^\ddagger_{i\cdot j}/RT)$$

with a calibration constant $k_{\mathrm{cal}}$. For correlation of a structure formation process with a macroscopic process, like chain polymerization, the rate constants have to be calibrated based on experimental data:

$$k_{\mathrm{cal}} = k_{\mathrm{exp}}/\exp(-\Delta E^\ddagger/RT)$$

The rate constant $k_{\mathrm{exp}}$ has to belong to a homogeneous one-step transition determined experimentally; the theoretical activation energy $\Delta E^\ddagger$ is calculated using the equations given above. As such a calibration process we selected from the literature the denaturation of a hairpin consisting of a helix of ten base-pairs (Randles et al., 1982) with a melting temperature $T_m = 358.48$ K (extrapolated from 10 mM to 1 M NaCl) and $k_{\mathrm{exp}} = 750\ \mathrm{s}^{-1}$. With

$$\Delta E^\ddagger_d = -(\Delta H^0_{\mathrm{helix}} - T_m \times \Delta S^0_{\mathrm{helix}}) = 25.04\ \mathrm{kJ/mol} = \Delta E^\ddagger_f$$

a calibration constant $k_{\mathrm{cal}} = 3.34 \times 10^6\ \mathrm{s}^{-1}$ is derived. This value is in the range of $0.5 \times 10^6$ to $15 \times 10^6\ \mathrm{s}^{-1}$ generally accepted in the literature (see Tinoco et al., 1990; Turner et al., 1990). The folding time for a certain process is computed by summing up the reciprocal rate constants of appropriate reaction steps.
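The calibration above can be reproduced numerically from the quoted values ($k_{\mathrm{exp}} = 750\ \mathrm{s}^{-1}$, $\Delta E^\ddagger = 25.04$ kJ/mol, $T_m = 358.48$ K). The following sketch uses hypothetical function names; only the numbers are taken from the text.

```python
import math

R = 8.314462618e-3  # gas constant in kJ/(mol K)


def rate_constant(k_cal, activation_energy, temperature):
    """Arrhenius rate constant for one base-pair opening or closing step."""
    return k_cal * math.exp(-activation_energy / (R * temperature))


def calibration_constant(k_exp, activation_energy, temperature):
    """Calibration constant from an experimentally determined rate constant."""
    return k_exp / math.exp(-activation_energy / (R * temperature))


def folding_time(rate_constants):
    """Folding time as the sum of reciprocal rate constants of the steps taken."""
    return sum(1.0 / k for k in rate_constants)


# Values quoted in the text for the ten base-pair hairpin used for calibration.
k_cal = calibration_constant(k_exp=750.0, activation_energy=25.04, temperature=358.48)
```

Evaluating `k_cal` gives about $3.34 \times 10^6\ \mathrm{s}^{-1}$, in agreement with the value stated in the text; `folding_time` then converts the sequence of accepted steps into an elapsed time.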

Selection of conformational changes and averaging

For the actual acceptance or rejection of a conformational change, the appropriate transition probability is calculated as $p = \exp(-\Delta E(T)/RU)$, where $\Delta E$ denotes either the free energy difference $\Delta\Delta G^0_{i\cdot j}$ for thermodynamic selection or the activation energy $\Delta E^\ddagger_{i\cdot j}$ for kinetic selection of the conformational change. Changes with negative transition energies are always accepted. With unfavourable, positive transition energies, the calculated transition probability is compared with a random number uniformly distributed between 0 and 1. For random numbers smaller than the transition probability the change is accepted.
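The effect of the distribution parameter on this acceptance rule can be made concrete with a short sketch. The function names are assumptions; the energy of 5.8 kJ/mol is used only as an illustrative magnitude.

```python
import math
import random

R = 8.314462618e-3  # gas constant in kJ/(mol K)


def transition_probability(delta_e, u):
    """p = exp(-dE/(R*U)) for unfavourable changes, 1 for favourable ones."""
    if delta_e <= 0.0:
        return 1.0
    return math.exp(-delta_e / (R * u))


def accept(delta_e, u, rng):
    """Accept if a uniform random number in [0, 1) is below the probability."""
    return rng.random() < transition_probability(delta_e, u)


# The same unfavourable change of 5.8 kJ/mol at the end (U = T) and near the
# start (U = T + 1000 K) of an annealing cycle at T = 298.15 K.
p_end = transition_probability(5.8, 298.15)
p_start = transition_probability(5.8, 1298.15)
```

At $U = T$ the change is accepted in roughly one of ten attempts, whereas at the elevated value of $U$ it is accepted in more than half of the attempts, which is what permits escape from local minima early in a cycle.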

Figure 2. Schematic representation of the thermodynamically optimal structure of circular PSTVd at 25°C. The regions marked by I/I', II/II', and III/III' are complementary to each other but form thermodynamically stable hairpins I, II, and III, respectively, only at temperatures above the main transition in which the rod-shaped structure is denatured completely.

For averaging the structures obtained during the optimization process, their thermodynamic quantities have to be weighted with a Boltzmann factor. The weight of conformations sampled during the random walk regime of the annealing process is overestimated owing to the selection with a Boltzmann factor of $p = \exp(-\Delta E(T)/RU)$ and $U > T$. This overrepresentation of individual contributions sampled at $U > T$ is corrected to $U = T$ by a weighting factor:

$$p_{\mathrm{corr}} = \exp\left( -\frac{U - T}{U} \times \frac{\Delta\Delta G^0}{RT} \right)$$

Multiplication of the weighted selection of conformational changes with $p_{\mathrm{corr}}$ results in the Boltzmann factor as total weighting factor in averaging.
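As a consistency check (a worked step added for clarity, assuming thermodynamic selection so that $\Delta E = \Delta\Delta G^0$), the product of the selection probability and the correction factor indeed reduces to the Boltzmann factor at the thermodynamic temperature:

```latex
\begin{aligned}
p \times p_{\mathrm{corr}}
  &= \exp\left(-\frac{\Delta\Delta G^0}{RU}\right)
     \exp\left(-\frac{U - T}{U} \times \frac{\Delta\Delta G^0}{RT}\right) \\
  &= \exp\left(-\frac{\Delta\Delta G^0}{R} \times \frac{T + (U - T)}{UT}\right)
   = \exp\left(-\frac{\Delta\Delta G^0}{RT}\right)
\end{aligned}
```

For $U = T$ the correction factor equals 1, so conformations sampled in the final Metropolis-like regime enter the average unchanged.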

Results

In the following, applications of the Simulated Annealing algorithm to RNAs related to PSTVd are described. The thermodynamically most stable, native structure of this RNA, as predicted by dynamic programming (Steger et al., 1984) and proven experimentally (for a review, see Riesner et al., 1979), is depicted in Figure 2. As an alternative to the rod-shaped structure a bifurcated structure is predicted to exist in the left terminal region; this structure is still very similar to the native structure but energetically less favoured (see Figure 3(c)). At higher temperatures, above the main transition at 75°C in 1 M NaCl, a bifurcated structure consisting mainly of three G·C-rich hairpins (regions marked I/I', II/II', and III/III' in Figure 2) is thermodynamically most stable. These hairpin helices (I, II, and III) are not part of the native structure, but do exist in metastable structures formed by sequential folding during synthesis (Loss et al., 1991; Qu et al., 1993; Gruner et al., 1995). The program is applied to three types of PSTVd RNA folding. Folding of circular PSTVd from the completely denatured state into the native structure has to be compared with results from dynamic programming. Folding of circular PSTVd from the metastable hairpin structure into the native structure demonstrates the capability of the Simulated Annealing algorithm to describe a kinetically controlled rearrangement. Sequential folding is described for a linear, longer-than-unit-length RNA of PSTVd sequence.

Parameter adjustment

Three parameters used during a simulation have to be fixed: the total number of iterations, $n_{\mathrm{total}}$, the half-decay fraction, $n_{1/2}/n_{\mathrm{total}}$, and the initial value of the distribution parameter, $U_{\mathrm{init}}$. The major contribution to an efficient optimization process results from careful adjustment of the parameter $U_{\mathrm{init}}$. Its value has to be a compromise between the random walk phase being sufficiently long to allow escape from structures of local minima and the remaining Metropolis phase being long enough to improve and to stabilize the final distribution.

According to our investigations, the initial value of $U$ can be estimated by a simple rule. The value of $R \times U(n_{1/2})$ should be about $-1.5 \times \Delta G^0_{\mathrm{sp}}$, where $\Delta G^0_{\mathrm{sp}}$ denotes the free energy specific to a structural element or structure that inevitably has to be rearranged during optimization. This element might be found by thermodynamic calculation as the most stable element, or by kinetic calculation as the element blocking rearrangement into thermodynamically favourable structures. $\Delta G^0_{\mathrm{sp}}$ is calculated as the free energy of this structural element divided by the number of its base-pairs. This criterion ensures that during the first phase of the simulation, as adjusted by $n_{1/2}$, a sufficient probability for a rearrangement of this element is obtained. For example, the structure shown in Figure 4(a) (top left), which has to be rearranged completely during optimization, has $\Delta G^0_{\mathrm{sp}}(298\ \mathrm{K}) = -5.8$ kJ/mol, which gives $U_{\mathrm{init}} \approx 1200$ K.

For all examples shown below we have used $n_{\mathrm{total}} = 10 \times N^2$ for a sequence of length $N$. This value is based on our assumption that 10% of the attempted changes to the structure are actually accepted, and that any possible base-pair should be selected at least once for changing the structure. With this choice of $n_{\mathrm{total}}$ the algorithm is of order $N^2$. On the DEC 3000/800 actual CPU time for a single iteration cycle was about ten minutes for $N = 359$ nts. The half-decay fraction $n_{1/2}/n_{\mathrm{total}}$ proved to be of little influence on the optimization when compared with the influence of $U_{\mathrm{init}}$. For our purpose, a half-decay fraction of 0.05 was sufficient.

De novo folding of circular PSTVd from the denatured state

As a first example for the simulation of RNA structure formation, the folding of circular PSTVd viroid RNA into the native structure is demonstrated. Here the starting conformation of the optimization process does not contain any base-pairs (see Figure 3(a), left). In principle the optimization process depends on which base-pairs are formed first during the first iteration. Using $U_{\mathrm{init}} = 1000\ \mathrm{K} + T$, actually only little dependence was found since almost any conformational change is allowed during the first 5% of each iteration ($n_{1/2}/n_{\mathrm{total}} = 0.05$).

By cooling down the distribution parameter $U$, the system is trapped in a low-energy state at the end of the first iteration cycle (see Figure 3(a), first iteration). This low-energy state does not correspond to the complete native structure, but the right half of the native structure has already been formed. This behaviour stresses the necessity for multiple consecutive iteration cycles consisting of heating $U$ to $U_{\mathrm{init}} + T$ and slowly cooling down to $T$, each cycle continuing with the conformation that was present when the previous cycle stopped. In the example, during consecutive iteration cycles more and more parts of the native structure are formed until the complete structure is finally reached. This iterative refinement over several cycles is also seen in the average free energy plots (Figure 3(b)). At the end of the first cycle an energy of about −400 kJ/mol is reached, whereas in the second half of the cycles final energies of about −600 kJ/mol prevail.
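The cycle structure described above (repeated heating of $U$ and cooling to $T$, each cycle starting from the conformation left by the previous one) can be sketched on a toy problem. The one-dimensional energy landscape, its values and all names below are assumptions for illustration only; an RNA simulation would replace the index moves by single base-pair changes and the energy list by $\Delta\Delta G^0$ or $\Delta E^\ddagger$.

```python
import math
import random

R = 8.314462618e-3  # gas constant in kJ/(mol K)


def u_schedule(m, t, u_init, n_half, n_total):
    """Hyperbolic decrease of the distribution parameter from about T + u_init to T."""
    return t + u_init * (
        (n_half + 1) / (n_half + m) - (n_half + 1) / (n_half + n_total)
    )


def anneal(energies, start, t, u_init, n_total, n_half, cycles, seed=0):
    """Run heating/cooling cycles on a toy landscape; return the final state of each."""
    rng = random.Random(seed)
    state = start
    finals = []
    for _ in range(cycles):                  # outer loop: one iteration cycle each
        for m in range(1, n_total + 1):      # inner loop: cool U towards T
            u = u_schedule(m, t, u_init, n_half, n_total)
            trial = state + rng.choice((-1, 1))
            if not 0 <= trial < len(energies):
                continue                     # move off the landscape: rejected
            delta_e = energies[trial] - energies[state]
            if delta_e <= 0 or rng.random() < math.exp(-delta_e / (R * u)):
                state = trial
        finals.append(state)                 # the next cycle continues from here
    return finals


# Hypothetical landscape in kJ/mol with local minima at indices 1, 3, 5 and 7.
toy_energies = [0.0, -30.0, 20.0, -80.0, 10.0, -200.0, 30.0, -50.0, 40.0]
final_states = anneal(toy_energies, start=0, t=298.15, u_init=3000.0,
                      n_total=5000, n_half=250, cycles=3)
```

With these toy parameters each cycle is expected to end in one of the local minima; which minimum is reached depends on the random seed, which mirrors the need for several consecutive cycles noted in the text.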

The structure distribution reached finally (Figure 3(a), right bottom) consists solely of base-pairs corresponding to the native structure (Figure 3(c)) as determined by dynamic programming. This demonstrates that thermodynamic simulation of structure formation actually leads to the native conformation.

Rearrangement of circular PSTVd from a metastable hairpin structure

To investigate the ability of our algorithm to simulate structural rearrangements from metastable to more stable structures, we started the simulation with an initial conformation containing the three hairpins I, II, and III (Figure 4(a), left). This conformation is only stable at high temperatures but could be shown to be metastable for long periods when produced by a fast jump to temperatures below the PSTVd main conformational transition (Henco, 1979; Henco et al., 1979). For our purpose this can be used as a worst-case metastable structure. Since the rearrangement from metastable into native structures is basically a kinetic phenomenon, the activation energy of the conformational changes was used for computing the transition probability, and the time for each process was estimated from the actual rate constant for the process.

The parameter $U_{\mathrm{init}}$ was set to 1200 K. This choice of $U_{\mathrm{init}}$ is in accordance with our criterion that $U_{\mathrm{init}}$ has to correspond to the specific stability of the start conformation. Lower values, like those used for de novo structure formation, were found to be too inefficient to rearrange the hairpins. Only with the increased value does a sufficient fraction of simulations reach the native conformation. Due to the Monte Carlo nature of the simulation, still not every simulation results in complete formation of the native structure. To visualize these statistics of the rearrangement process, we superimposed the base-pair probability data from ten independent simulations to obtain an average representation of the rearrangement process (Figure 4(a), right).

The base-pair probability matrix after the first iteration (Figure 4(a), right) demonstrates that the metastable structure can actually be disrupted. The native structure was formed as soon as hairpin II, the most stable of the three hairpins, was disrupted. After the 11th iteration (Figure 4(a), right), the right half of the native structure was still not occupied fully and hairpin III was still present. This points to metastable conformations in which the left half or two-thirds of the native structure occur together with hairpin III. The hairpins with small loop size visible near the diagonal at the 5' and 3' end of the diagram illustrate the existence of alternative bifurcated conformations at the left terminus of the native structure.

A plot of computed total folding time versus iteration number would be expected to show a monotonic straight line, illustrating a constant time requirement for the folding processes on average. Instead, this plot (Figure 4(b)) exhibits a stepwise increase of folding time at the points where U was set to Uinit + T. These points correspond to the random walk regime of the individual iteration cycles. In this regime unlikely and mostly global rearrangements with low rate constants are selected. Thus, the simulation implies an imbalance of more global versus local folding events. For example, with sequential folding this would lead to an iterative global refolding of the growing RNA chain caused by the ‘‘tick’’ of the annealing scheme alone, instead of refolding occurring as a function of the newly accessible structures dependent on the growing chain length.

Figure 3. Thermodynamically controlled structure formation of circular PSTVd from the denatured state. The parameters used are T = 298.15 K (= 25°C), 11 iterations with Uinit = 1000 K, ntotal = 10 × N², N = 359 nts, n1/2 = 0.05 × ntotal. (a) Schematic representation of the structure distribution during the iteration cycles. The base-pairs of the secondary structures are presented in a dot plot (cf. Tinoco et al., 1971) with horizontal axes i and j, respectively, representing the sequence from 5' to 3' and the vertical axis representing the probability p for formation of base-pair i·j. Helical regions appear as vertical bars on top of diagonal base lines; diagonals are not drawn if the probability of any base-pair on that diagonal is below 0.01. Helices that are part of the native structure of PSTVd (cf. Figure 2 and 3(c)) are given as grey bars. The hairpins I, II, and III are given as black bars. Left: the starting conformation for the simulation is the open state without any base-pairs. Right from top to bottom: structure distributions after first, second, and 11th iteration, respectively. During the first iteration the right terminal part of the native secondary structure was formed; during the second iteration hairpin I (marked I) and part of the left terminal native structure appeared. During the following iterations the thermodynamically metastable hairpin I was rearranged in favour of the native structure, which was present with slight deviations already after the third iteration. (b) Diagram showing the decrease in free energy, ΔG⁰, of the structures during the simulation versus the number of steps, m, divided by the sequence length, N. Iterations of which the final structure distribution is given in (a) are marked by arrows. The free energy of the thermodynamically optimal secondary structure (see Figure 2), ΔGopt, is shown as a straight line at −630 kJ/mol. Each iteration starts with U = Uinit + T; this high value of the distribution parameter allows opening of favourable base-pairs and/or formation of unfavourable base-pairs both leading to initially low values of the free energy. (c) Schematic drawing of the thermodynamically optimal structure distribution at 25°C (Schmitz & Steger, 1992) as calculated by dynamic programming. The thermodynamically optimal secondary structure is rod-like (see Figure 2); bifurcations are improbable and visible only as tiny bars at the lower left and upper right corners of the triangle (base-pairs 2 to 8/15 to 21 and 341 to 347/352 to 358).

Figure 4. Kinetically controlled structural rearrangement of circular PSTVd from a conformation containing the three hairpins I, II, and III. The parameters used are T = 298.15 K (= 25°C), overlay of ten simulations each consisting of 11 iterations with Uinit = 1200 K, ntotal = 10 × N², N = 359 nts, n1/2 = 0.05 × ntotal. (a) Schematic representation of the structure distribution during the iteration cycles. Left: the starting conformation for the simulation is a thermodynamically metastable state. Right from top to bottom: structure distributions after first and 11th iteration, respectively. During the first iteration, part of the left terminal region of the native secondary structure was formed, and the probability for the hairpins was diminished. After the 11th iteration the native structure was present prominently; only formation of the right terminus was partially blocked by hairpin III. (b) Diagram showing the computed total folding time versus iteration steps. The stepwise increase in folding time reflects the setting of U to high values during each cycle.

To avoid this dilemma we abandoned the strict annealing scheme of ‘‘heating up’’ and slowly ‘‘cooling’’ the distribution parameter U. Instead of a continuous decrease of U in a hyperbolic manner starting at Uinit + T and converging to T, the value for U was chosen randomly for each individual step of the simulation. Since the hyperbolic function for U used in the former approach had proven to be a good compromise between global and local optimization, the same profile was used for the random selection of U.


Figure 5. Kinetically controlled structural rearrangement of circular PSTVd from a conformation containing the three hairpins I, II, and III. The parameters used are T = 328.15 K (= 55°C), overlay of 20 simulations each consisting of ten iterations with randomized U, Uinit = 1800 K, ntotal = 10 × N², N = 359 nts, n1/2 = 0.05 × ntotal. (a) Schematic representation of the structure distribution during the iteration cycles. Left: the starting conformation for the simulation is a thermodynamically metastable state identical to that of Figure 4. Right from top to bottom: structure distributions after first, fifth, and tenth iteration, respectively. During the first iteration, only two helices of the left terminus and a single helix of the right terminus of the native secondary structure were formed. After the tenth iteration the native structure, especially the central part, was present prominently; the probability of hairpins I to III was diminished greatly. (b) Diagram showing the computed total folding time versus iteration steps.

Thus, the random selection of U from this profile can be considered a mere random permutation of the former continuous function. In fact, the permutation avoids all correlations between the course of the distribution parameter and the simulation that had arisen due to the continuous variation scheme of U. As the hyperbolic annealing scheme was basically responsible for the successful optimization, the randomization of U now results in a decreased optimization efficiency. To compensate at least partially for this loss of efficiency, the value of Uinit had to be increased. Simulations performed with this randomized distribution parameter demonstrated that, on the one hand, the discontinuous behaviour of the simulation, resulting in a discontinuous time course and iterative occupation of intermediate high-energy structures, vanished and, on the other hand, a continuous increase of total folding time together with a continuous decrease of the average free energy resulted.
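The relationship between the continuous hyperbolic profile and its randomized counterpart can be sketched as follows. The functional form U(m) = T + Uinit/(1 + m/n1/2) is an assumption, chosen only because it starts at Uinit + T, converges to T, and reaches T + Uinit/2 at m = n1/2; the function names are hypothetical, and the exact expression used in the original programs may differ.

```python
import random


def hyperbolic_u(step, t, u_init, n_half):
    """Assumed hyperbolic profile: U(0) = u_init + t, U -> t for large step."""
    return t + u_init / (1.0 + step / n_half)


def continuous_schedule(n_total, t, u_init, n_half):
    """Strict annealing: U decreases monotonically over the n_total steps."""
    return [hyperbolic_u(m, t, u_init, n_half) for m in range(n_total)]


def randomized_schedule(n_total, t, u_init, n_half, rng=None):
    """Same set of U values, visited in random order (a permutation)."""
    if rng is None:
        rng = random.Random()
    values = continuous_schedule(n_total, t, u_init, n_half)
    rng.shuffle(values)
    return values
```

Because the randomized schedule contains exactly the same values as the continuous one, the overall balance between high and low U is preserved, while the correlation between step number and U is removed.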

In Figure 5 an example of the rearrangement of the metastable hairpin structure into more stable structures is shown. With the randomized U, Uinit = 1800 K, and T = 328.15 K (= 55°C), a reasonably good convergence into the optimal structure was obtained. Slow temperature jump experiments (Henco et al., 1979) from a temperature above the PSTVd main transition to a final temperature of 26°C in 10 mM NaCl (equivalent to about 55°C in 1 M NaCl) resulted in rearrangements into the native structure with a characteristic relaxation time constant of about ten minutes. This experimental result is at least qualitatively similar to that obtained by our computations (Figure 5(b)).

With T = 298.15 K (= 25°C) and otherwise identical parameters the simulation could not produce any rearrangement of the hairpin structure (data not shown). This result is in accordance with slow temperature jumps from temperatures above the PSTVd main transition to a final temperature of 10°C (equivalent to 35°C in 1 M NaCl), which resulted in relaxation time constants in the range of hours to days.

Sequential folding of a linear longer-than-unit-length PSTVd RNA during transcription

Since simulations using the randomized distribution parameter U showed a continuous and almost linear time course of the folding time, we used this modification of our algorithm for simulation of structure formation and rearrangement of linear PSTVd RNA during transcription. For these simulations, the first iteration cycle was started with a chain length of 10 nts. During the iterations the sequence length was increased by one nucleotide at every time point where the incremental folding time that had passed since the last polymerization event exceeded the reciprocal polymerization rate. The polymerization rate was set to 50 s⁻¹.
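The elongation rule can be written down as a short sketch. The function name and the list of folding-event durations are hypothetical inputs chosen for illustration; only the rule itself (one nucleotide added once the time since the last polymerization event exceeds 1/rate) is taken from the description above.

```python
def elongate(chain_length, full_length, event_durations, rate=50.0):
    """Apply the single-nucleotide elongation rule to a series of folding events.

    event_durations: time (s) consumed by each accepted folding event.
    rate: polymerization rate in nucleotides per second.
    Returns the final chain length and the total folding time.
    """
    interval = 1.0 / rate          # reciprocal polymerization rate
    since_last = 0.0               # time since the last polymerization event
    total_time = 0.0
    for dt in event_durations:
        total_time += dt
        since_last += dt
        if since_last > interval and chain_length < full_length:
            chain_length += 1      # only one nucleotide, however long dt was
            since_last = 0.0
    return chain_length, total_time
```

With the illustrative durations `[0.005, 0.03, 0.5, 0.001]` the chain grows from 10 to 12 nts in 0.536 s, far below the nominal 50 nts/s, because the slow 0.5 s event adds only one nucleotide.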

The example shown in Figure 6 demonstrates that the folding events of the growing RNA chain are relatively slow in comparison with the elongation rate. Due to many folding events slower than k = 50 s⁻¹ and due to the addition of only a single nucleotide after these events, the effective elongation rate drops even below the assigned value of 50 s⁻¹. The full chain length is already present at some time during the second iteration cycle, but only short hairpin structures corresponding to local base-pairings can be seen in the base-pair probability matrix. The average free energy plot (Figure 6(c), left) shows a rather steep descent during the elongation phase, representing the gain of stability due to the formation of short hairpins during chain growth. The phenomenon of fast formation of short hairpins is also observed in simulations starting from the denatured state at full chain length, so there is no dramatic difference in the overall folding rate caused by the sequential folding process. In the subsequent rearrangement phase after completion of chain elongation, a slow rearrangement into the native structure combined with a slow decrease of the average free energy can be observed.

According to these simulations, the structural capability of (+)stranded linear PSTVd RNA seems to be optimized for fast refolding into the native structure even in the situation of sequential folding. This is in accordance with the biological function of (+)stranded linear PSTVd oligomers; i.e. these RNAs have to be processed by host enzyme(s) to the circular mature RNA, and the structure recognized by the enzyme(s) is mainly rod-like (Steger et al., 1992). The different substructures possible by alternative pairings of 5' and 3' end and their structural rearrangements will be discussed elsewhere in connection with experimental data (Baumstark & Riesner, 1995). Our parameter setting for the elongation rate, however, is only a rough guess based on the rates observed with polymerases acting on their natural templates; the transcription rate of PSTVd RNA by the DNA-dependent RNA polymerase II is not known. In case the in vivo transcription rate should be lower than 50 s⁻¹, the sequential folding might favour a more dominant folding into a metastable structure containing hairpins I to III. Furthermore, the position of the 5' end of in vivo replication intermediates is not known; of course, this starting point is able to influence the probability for formation of certain metastable structures.

Figure 6. Sequential folding of a linear longer-than-unit-length PSTVd RNA. The parameters used are T = 298.15 K (= 25°C), overlay of 20 simulations each consisting of ten iterations with randomized U, Uinit = 1200 K, ntotal = 10 × N², N = 380 nts, n1/2 = 0.05 × ntotal, and 50 nts/s elongation rate. (a) Schematic representation of the RNA used for calculation. This RNA consists of a G plus a full-length linear PSTVd RNA from nts 80 to 359 to 1 to 96 plus AAUU, i.e. this RNA contains a duplication of 17 nts. The numbering of the sequence corresponds to that of circular PSTVd (see Figure 2). Note that the region I, which is able to form a helix with region I', is present twice. This allows for different structures and structural rearrangements necessary for processing (compare with Steger et al., 1992). (b) Schematic representation of the structure distribution during the iteration cycles. The numbering of the sequence is from 5' to 3'. The diagonals on which the left (nts 80 to 179/180 to 282 according to (a); nts 2 to 101/102 to 204 counted from the 5' end) and right (nts 285 to 359/1 to 73 according to (a); nts 207 to 281/282 to 354 counted from the 5' end) halves of the native structure reside are marked by lh and rh, respectively. During the first iteration the RNA chain was synthesized only to a length of about 100 nts; the predominant structure contains hairpin I. During the second iteration synthesis was finished; besides hairpin I, mainly hairpins with small loops were formed. In the following iterations hairpins II and III show up to a low extent. After the tenth iteration the rod-shaped structure was present prominently. The additional alternative helices, connecting 5' and 3' end, are critical for processing (Baumstark & Riesner, 1995). (c) Diagrams showing the gain in free energy versus total folding time (left) and the computed total folding time versus iteration steps (right). Arrows mark the positions of the end of iterations shown in (b).

Discussion

The algorithm described is based on Simulated Annealing. An independent distribution parameter U is employed for controlling the simulation process through the Boltzmann probability distribution function. As shown by the examples described above, the algorithm is able to simulate both the thermodynamically and the kinetically controlled folding process of an RNA secondary structure. In both cases the simulation may start either from the completely denatured state or from some given metastable structure. Using adequate values for the U function, the algorithm is sufficiently efficient to reach structure distributions close to the global minimum. However, a reasonable description of the kinetics of folding is only possible if a randomized distribution parameter U is used instead of a continuous variation of this parameter. Folding intermediates and metastable conformations can be sampled along the individual folding pathway. The distribution of structures during folding may be presented as three-dimensional probability maps. Individual structures, for example those optimal at certain points, may be extracted and presented by schematic drawings (not shown).

The examples shown for PSTVd RNA agree with experimental data. Using thermodynamic criteria for folding of circular PSTVd, a secondary structure distribution was obtained which is very similar to that obtained by dynamic programming (Schmitz & Steger, 1992). Additionally, the predicted optimal structures are in accordance with experimental data (Riesner et al., 1979). From the kinetically controlled folding of circular (Figure 5) as well as of linear PSTVd (Figure 6) the importance of hairpins I to III for the folding pathway is obvious. This is in accordance with experimental data from slow temperature jumps (Henco et al., 1979) and from mutagenesis studies (Loss et al., 1991; Qu et al., 1993).

There are at least two different possibilities for enhancement of our algorithm. The first one is related mainly to the sequential folding process. According to the algorithm described, any slow folding step blocks both the chain elongation and any further fast folding process that may proceed in parallel without interfering with the slower process. Thus, chain elongation could be accelerated by adding a number of nucleotides proportional to the time passed and not only a single nucleotide. An algorithmic implementation of folding processes taking place in parallel would be much more complex. The second possibility for enhancement of the algorithm could be a partial or a total removal of the topological constraint for secondary structures. In the case of nucleotides located in hairpin loops, pairing nucleotides could be searched for in the regions neighbouring the hairpin stem; this would allow for H-type pseudoknots (for a review, see Pleij, 1990). A total removal of this constraint would allow for any tertiary structure.
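The first proposed enhancement, elongation proportional to the elapsed time, might look like the following sketch. It is not part of the algorithm as implemented; the function name, the carry-over of unused time, and the event durations in the usage note are assumptions made for illustration.

```python
def elongate_proportional(chain_length, full_length, event_durations, rate=50.0):
    """Add nucleotides in proportion to the time consumed by folding events.

    event_durations: time (s) consumed by each accepted folding event.
    rate: polymerization rate in nucleotides per second.
    Time not yet converted into a whole nucleotide is carried over.
    """
    carry = 0.0
    for dt in event_durations:
        carry += dt
        due = int(carry * rate)                    # whole nucleotides due
        added = min(due, full_length - chain_length)
        chain_length += added
        carry -= due / rate                        # keep the fractional remainder
    return chain_length
```

With the illustrative durations `[0.005, 0.03, 0.5, 0.001]` and 50 nts/s, a chain of 10 nts grows to 36 nts, so a single slow folding event no longer stalls chain growth.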

Methods

The algorithm described in this paper was implemented as a set of FORTRAN programs using standard FORTRAN77 elements and a few VAX-specific language extensions. Simulations were performed on either a VAXStation 4000/60 with operating system VAX/VMS V5.5 (DEC Digital Equipment Corporation, Maynard, MA), a DEC 3000/800 workstation with AXP/VMS V1.5, or a Convex C220 computer with Unix (ConvexOS). Evaluation of the simulation data was implemented using the GKS graphics standard (DEC GKS V5.2) or Xlib routines.

Energy averages and other thermodynamical and kinetical data describing the optimization process were plotted as HPGL data using the GLE V3.3c graphics package (C. Pugmire). All HPGL data could be converted to raster images for printing by means of the hp2xx filter program (H. W. Werntges, ftp.rz.uni-duesseldorf.de).

The RNA sequences used for calculations are derived from the sequence of PSTVd with GenBank/EMBL database accession nos. V01465 (Gross et al., 1978) and X58388 (Schnolzer et al., 1985).

Acknowledgements

We are indebted to Dr Detlev Riesner for stimulating discussions and support throughout the course of this work. We especially thank Tilman Baumstark for helpful comments. The work was supported by grants from the Deutsche Forschungsgemeinschaft (Ste 465/2).

References

Baumstark, T. & Riesner, D. (1995). Only one of four possible secondary structures of the central conserved region of potato spindle tuber viroid is a substrate for processing in a potato nuclear extract. Nucl. Acids Res. 23(21), in the press.

Bellman, R. & Kalaba, R. (1960). On kth best policies. SIAM J. Appl. Math. 8, 582–588.

Diener, T. O. (1987). Editor of The Viroids. Plenum Press, New York.

Gross, H. J., Domdey, H., Lossow, C., Jank, P., Raba, M., Alberty, H. & Sanger, H. L. (1978). Nucleotide sequence and secondary structure of potato spindle tuber viroid. Nature, 273, 203–208.


Gruner, R., Fels, A., Qu, F., Zimmat, R., Steger, G. & Riesner, D. (1995). Interdependence of pathogenicity and replicability with potato spindle tuber viroid (PSTVd). Virology, 209, 60–69.

Henco, K. (1979). Viroids: a class of ribonucleic acids with an extraordinary structural and dynamic principle. Thesis, Technische Universitat Hannover.

Henco, K., Sanger, H. L. & Riesner, D. (1979). Fine structure melting of viroids as studied by kinetic methods. Nucl. Acids Res. 6, 3041–3059.

Jaeger, J. A., Turner, D. H. & Zuker, M. (1990). Predicting optimal and suboptimal secondary structure for RNA. Methods Enzymol. 183, 281–306.

Kirkpatrick, S., Gelatt, C. D. & Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220, 671–680.

Loss, P., Schmitz, M., Steger, G. & Riesner, D. (1991). Formation of a thermodynamically metastable structure containing hairpin II is critical for infectivity of potato spindle tuber viroid. EMBO J. 10, 719–727.

Martinez, H. M. (1984). An RNA folding rule. Nucl. Acids Res. 12, 323–334.

McCaskill, J. S. M. (1990). The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29, 1105–1119.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. & Teller, E. (1953). Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 1087–1092.

Mironov, A. & Kister, A. (1986). RNA secondary structure formation during transcription. J. Biomol. Struct. Dynam. 4, 1–9.

Mironov, A. A., Dyakonova, L. P. & Kister, A. E. (1985). A kinetic approach to the prediction of RNA secondary structures. J. Biomol. Struct. Dynam. 2, 953–962.

Nussinov, R., Pieczenik, G., Griggs, J. R. & Kleitman, D. J. (1978). Algorithms for loop matchings. SIAM J. Appl. Math. 35, 68–82.

Pleij, C. W. A. (1990). Pseudoknots: a new motif in the RNA game. Trends Biochem. Sci. 15, 143–147.

Qu, F., Heinrich, C., Loss, P., Steger, G., Tien, P. & Riesner, D. (1993). Multiple pathways of reversion in viroids for conservation of structural elements. EMBO J. 12, 2129–2139.

Randles, J. W., Steger, G. & Riesner, D. (1982). Structural transitions in viroid-like RNAs associated with cadang-cadang disease, velvet tobacco mottle virus, and Solanum nodiflorum mottle virus. Nucl. Acids Res. 10, 5569–5586.

Riesner, D., Henco, K., Rokohl, U., Klotz, G., Kleinschmidt, A. K., Gross, H. J., Domdey, H., Jank, P. & Sanger, H. L. (1979). Structure and structure formation of viroids. J. Mol. Biol. 133, 85–115.

Rosenbaum, V., Klahn, T., Lundberg, U., Holmgren, E., von Gabain, A. & Riesner, D. (1993). Co-existing structures of an mRNA stability determinant. The 5' region of the Escherichia coli and Serratia marcescens ompA mRNA. J. Mol. Biol. 229, 656–670.

Schindler, I.-M. & Muhlbach, H.-P. (1992). Involvement of nuclear DNA-dependent RNA polymerases into potato spindle tuber viroid replication: a re-evaluation. Plant Sci. 84, 221–229.

Schmitz, M. & Steger, G. (1992). Base-pair probability profiles of RNA secondary structures. Comp. Appl. Biosci. 8, 389–399.

Schnolzer, M., Haas, B., Ramm, K., Hofmann, H. & Sanger, H. L. (1985). Correlation between structure and pathogenicity of potato spindle tuber viroid (PSTV). EMBO J. 4, 2181–2190.

Steger, G., Hofmann, H., Fortsch, J., Gross, H. J., Randles, J. W., Sanger, H. L. & Riesner, D. (1984). Conformational transitions in viroids and virusoids: comparison of results from energy minimization algorithm and from experimental data. J. Biomol. Struct. Dynam. 2, 543–571.

Steger, G., Baumstark, T., Morchen, M., Riesner, D., Tabler, M., Tsagris, M. & Sanger, H. L. (1992). Structural requirements for viroid processing: correlation between processing data, thermodynamic studies and model calculations. J. Mol. Biol. 227, 719–737.

Tinoco, I., Jr, Uhlenbeck, O. C. & Levine, M. D. (1971). Estimation of secondary structure in ribonucleic acids. Nature, 230, 363–367.

Tinoco, I., Jr, Puglisi, J. D. & Wyatt, J. R. (1990). RNA folding. Nucl. Acids Mol. Biol. 4, 205–226.

Tsagris, M., Tabler, M. & Sanger, H. L. (1987). Oligomeric potato spindle tuber viroid RNA does not process autocatalytically under conditions where other RNAs do. Virology, 157, 227–231.

Turner, D. H., Sugimoto, N. & Freier, S. M. (1990). Thermodynamics and kinetics of base-pairing and of DNA and RNA self-assembly and helix-coil transition. In Landolt-Bornstein, Group VII: Biophysics, vol. 1 Nucleic Acids, subvol. c, Spectroscopic and Kinetic Data, Physical Data I. (Saenger, W., ed.), pp. 194–243, Springer Verlag, Berlin.

Waterman, M. S. & Smith, T. F. (1978). RNA secondary structure: a complete mathematical analysis. Math. Biosci. 42, 257–266.

Williams, A. L. & Tinoco, I., Jr (1986). A dynamic programming algorithm for finding alternative RNA secondary structures. Nucl. Acids Res. 14, 299–315.

Wyatt, J. R., Puglisi, J. D. & Tinoco, I., Jr (1989). RNA folding: pseudoknots, loops and bulges. Bioessays, 11, 100–106.

Zuker, M. & Stiegler, P. (1981). Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucl. Acids Res. 9, 133–148.

Edited by R. Huber

(Received 27 June 1995; accepted 3 October 1995)