Predicting Protein Disorder and Induced Folding: From Theoretical Principles to Practical...


Current Protein and Peptide Science, 2007, 8, 135-149

1389-2037/07 $50.00+.00 © 2007 Bentham Science Publishers Ltd.

Predicting Protein Disorder and Induced Folding: From Theoretical Principles to Practical Applications

Jean M. Bourhis, Bruno Canard and Sonia Longhi*

Architecture et Fonction des Macromolécules Biologiques, UMR 6098 CNRS et Universités Aix-Marseille I et II, Case 932, 163 Avenue de Luminy, 13288 Marseille Cedex 09, France

Abstract: In recent years there has been an increasing amount of experimental evidence indicating that a large number of proteins are either fully or partially disordered (unstructured). Intrinsically disordered proteins are ubiquitous proteins that fulfil essential biological functions while lacking highly populated and uniform secondary and tertiary structure under physiological conditions. Despite the large abundance of disorder, disordered regions are still poorly detected. Recognition of disordered regions in a protein is instrumental for reducing spurious sequence similarity between disordered regions and ordered ones, and for delineating boundaries of protein domains amenable to crystallization. As none of the automated methods currently available for prediction of protein disorder can be taken as fully reliable on its own, we present a brief overview of the methods currently employed, highlighting their philosophy. We show a few practical examples of how they can be combined to avoid pitfalls and to achieve more reliable predictions. We also describe the currently available methods for the identification of regions involved in induced folding and provide a few practical examples in which the accuracy of predictions was experimentally confirmed.

Keywords: Intrinsic disorder, intrinsically unstructured proteins, induced folding, prediction methods.

1. INTRODUCTION

In recent years there has been an increasing amount of experimental evidence pointing to the large abundance of intrinsic disorder within the living world. Intrinsically disordered proteins, also referred to as "natively unfolded proteins" [1], are ubiquitous proteins that fulfil essential biological functions while lacking highly populated and uniform secondary and tertiary structure under physiological conditions [2-9]. Recent studies predict that long intrinsically disordered regions occur in 33% of eukaryotic proteins [10], with 12% of them being fully disordered [11].

Most of the functions of intrinsically disordered proteins are related to molecular recognition, cell cycle regulation, signal transduction, and transcription [2,12-15]. The functional relevance of disorder resides in an increased plasticity that allows binding of multiple partners and enables formation of larger interaction surfaces in protein-protein complexes [2,12-16].

Intrinsically disordered proteins show an extremely wide diversity in their structural properties: indeed they can attain extended conformations (random coil-like) or remain globally collapsed (molten globule-like), where the latter possess regions of fluctuating secondary structure [17]. Conformational and spectroscopic analyses showed that random coil-like proteins can be subdivided in their turn into two major groups. While the first group consists of proteins with extended maximum dimensions typical of random coils with no (or little) secondary structure, the second group comprises the so-called premolten globules, which are more compact (but still less compact than globular or molten globule proteins) and conserve some residual secondary structure [2,4]. It has been proposed that the residual intramolecular interactions that typify the premolten globule state may enable a more efficient start of the folding process induced by a partner [3,18-20]. Indeed, many intrinsically disordered proteins adopt a well-defined conformation upon interacting with a target molecule [13,14,18,21-23]. The driving force of this disorder-to-order transition, which is referred to as "induced folding", is often the burying of hydrophobic residues of the intrinsically disordered protein at the protein-partner interface. Notably, these disorder-to-order transitions may or may not be accompanied by the gain of regular secondary structure elements [7,14,23-25]. The coupled process by which an intrinsically disordered protein folds and binds to its target bears some resemblance to the induced fit concept of enzyme-substrate binding and allostery. However, in induced fit, ligand binding perturbs an equilibrium between two compact, well-defined protein conformations, whereas binding of an intrinsically disordered protein to a target involves a disorder-to-order transition of the intrinsically disordered protein concomitant with formation of a macromolecular complex. The target may have a compact, well-defined conformation, or may be an intrinsically disordered protein itself. In the final state, the complex between the intrinsically disordered protein and the target has a compact structure with classical protein folds (at least as a core), possibly with one or more flexible appendages.

*Address correspondence to this author at the Architecture et Fonction des Macromolécules Biologiques, UMR 6098 CNRS et Universités Aix-Marseille I et II, Case 932, 163 Avenue de Luminy, 13288 Marseille Cedex 09, France; Tel: (33) 4 91 82 55 80; Fax: (33) 4 91 26 67 20; E-mail: [email protected]

A protein region is defined as disordered if it is devoid of stable secondary structure and if it has a large number of conformations as seen using methods such as X-ray crystallography (lack of electron density), nuclear magnetic resonance (NMR), circular dichroism (CD), small angle X-ray scattering and various hydrodynamic measurements [26,27]. However, this definition embraces several categories of disorder: molten globules, partially unstructured proteins (pre-molten globules), and random coils (listed in order of increasing mobility and decreasing residual secondary structure content; see Uversky [4]).

Intrinsically disordered proteins possess distinctive sequence features, including paucity of hydrophobic residues and enrichment in hydrophilic residues (see below), which allow them to be predicted with a rather good accuracy.

What is the practical interest of identifying disordered regions? First, disorder prediction is an essential prerequisite to protein sequence analysis. Disordered regions often have a biased amino acid composition that can lead to spurious sequence similarity with unrelated proteins. The recognition of regions of disorder is thus crucial to avoid spurious sequence alignments with sequences of structured proteins (for examples, see Iyer et al. [28]). Moreover, the recognition of disordered regions facilitates the identification of Eukaryotic Linear Motifs (ELMs), which are short functional motifs occurring mainly (>70%) within disordered regions (e.g. binding targets for calmodulin and SH3, PDZ, phosphorylation, acetylation, methylation, fatty acylation and ubiquitination sites) [12,29-31].

Secondly, disordered regions often prevent crystallization of proteins, or the generation of interpretable NMR data. Therefore, structural biologists use disorder predictions to delineate compact domains in order to solve their 3D structure, or to dissect target sequences into a set of independently folded domains in order to facilitate tertiary structure and threading predictions [32].

Although the identification of disordered regions of less than 20 residues in length is generally thought to be less accurate [33], recent results suggest that progress has been made in predicting short disordered regions [34,35].

As in other areas of bioinformatics, the reliability of disorder prediction benefits from the use of several methods relying on different concepts, different physico-chemical parameters, or different implementations. In our experience, a single disorder predictor is not sufficient to achieve predictions good enough to decipher the modular organization of a protein (for examples see [36-40]). Herein, we briefly review the sequence features of disordered proteins, and the various disorder predictors currently available. We present a general suggested procedure for disorder prediction (Fig. (1)) and provide a few practical examples of prediction of disorder and of induced folding. These examples illustrate well how prediction methods need to be combined to achieve accurate disorder predictions.

2. SEQUENCE FEATURES OF DISORDERED PROTEINS

Statistical analysis showed that amino acid sequences encoding for disordered regions are significantly different from those of ordered proteins on the basis of local amino acid composition, flexibility, hydropathy, charge, coordination number etc. In the following paragraphs, we discuss the main sequence features typifying disordered proteins.

Fig. (1). General scheme for prediction of disordered and ordered regions in a protein.

2.1. Sequence Composition

Intrinsically disordered proteins generally have a biased amino acid composition. A consensus of two independent studies, focusing respectively on the amino acids preferred at the surface of globular proteins or on those found less frequently in secondary structures [41,42], established the following empirical rule: G, S, P are disorder-promoting amino acids, W, F, I, Y, V, L are order-promoting amino acids, while H, T are considered neutral with respect to disorder. Using sequence composition as the sole predictive parameter of disorder is not reliable. For instance, the RNA cap 2’-O-methyltransferase domain of dengue virus polymerase is structured [43] and yet is heavily depleted in some order-promoting residues and markedly enriched in some disorder-promoting residues (data not shown). However, Weathers et al. recently reported that amino acid composition alone could allow recognition of intrinsically disordered proteins with a good accuracy [44]. In any case, it is recommended to always analyse the sequence composition of proteins prior to further sequence analysis [45].
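The empirical rule above lends itself to a quick composition check. The following is a minimal sketch in Python, not a published method: the function name `composition_bias` and the example sequence are hypothetical, and only the two residue classes (G, S, P versus W, F, I, Y, V, L) come from the text. As stressed above, such fractions are an indicator to inspect before further analysis, not a predictor of disorder on their own.

```python
from collections import Counter

# Residue classes taken from the empirical rule quoted in the text.
DISORDER_PROMOTING = set("GSP")
ORDER_PROMOTING = set("WFIYVL")


def composition_bias(sequence: str) -> dict:
    """Return the fraction of disorder- and order-promoting residues.

    Hypothetical helper for illustration only; it reports raw fractions
    and makes no order/disorder call.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    n = len(seq)
    return {
        "disorder_promoting": sum(counts[a] for a in DISORDER_PROMOTING) / n,
        "order_promoting": sum(counts[a] for a in ORDER_PROMOTING) / n,
    }


# Toy sequence (not a real protein): 6 of 10 residues are G, S or P.
example = composition_bias("GSPGSPWFIL")
```

Comparing these fractions with those of a reference set of globular proteins is one way to spot a biased composition; the choice of reference set is left open here.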

2.2. Low Predicted Secondary Structure Content

Secondary structure prediction is based on the propensity of each amino acid to belong to each type of secondary structure element, computed along sliding windows. Long (> 70 aa) regions devoid of predicted secondary structure elements (as judged by using a combination of methods) are generally disordered. There are a few exceptions, called "loopy proteins", which have no regular secondary structure and yet are ordered, like the Kringle domain, a triple-looped, disulphide-linked domain, found in some serine proteases and in some plasma proteins [46].

2.3. Low Sequence Complexity

Low complexity regions are regions with a biased composition (homopolymeric runs, short-period repeats, and more subtle over-representation of a few residues), making use of fewer types of amino acids. Intrinsically disordered proteins tend to have a low sequence complexity, although it is not a general rule [47,48]. It has been shown recently that the more low-complexity regions a eukaryotic protein has, the less likely it is to be expressed in a soluble form in bacteria [49]. This might be related to the fact that low-complexity regions, which are more frequent in eukaryotic proteins than in bacterial proteins, are more sensitive to proteolytic degradation [49]. Given the tendency of intrinsically disordered proteins to have low complexity, one could expect that they are less soluble than globular proteins. However, this is not a general rule. For instance, the intrinsically disordered N-terminal domain of the measles virus phosphoprotein is even more soluble than the structured C-terminal domain (see Karlin et al. [50] and Longhi et al. [51]). Likewise, the intrinsically disordered C-terminal domain of the measles virus nucleoprotein has a solubility comparable to that of its structured domain (see Karlin et al. [52] and Longhi et al. [51]). Furthermore, intrinsically disordered proteins are less prone to aggregation as compared to globular proteins [53,54], which facilitates their purification and conservation.

Some special cases of low-complexity sequences are found in proteins with a certain amino acid periodicity (such as coiled-coils) and other non-globular, yet ordered proteins (collagen for example). It is recommended to always look for low complexity regions, coiled-coils, and repeats in a protein prior to further sequence analysis (using programs such as Paircoil [55] and Multicoil [56]). More subtle parameters to discriminate between globular and non-globular proteins using the program SEG are discussed in [45,48].

2.4. High Sequence Variability

Disordered regions have been reported to be on average much more variable than ordered ones [57]. It has been proposed that sequences of disordered proteins evolve faster than those of ordered ones thanks to less restrictive amino acid substitutions and the lack of structural ramifications [57]. In support of a reduction in the restrictiveness of structural constraints of disordered regions, a recent study by the group of Dunker indicates that protein segments affected by alternative splicing are most often intrinsically disordered such that alternative splicing enables functional and regulatory diversity while avoiding structural complications [58]. In contrast with these observations, a more recent study indicates that most conserved disordered regions have sequence conservation greater than or equal to that in conserved ordered regions within the same protein [59]. It should be pointed out however that this study systematically excluded disordered regions that possess high sequence variability [15].

The relationship between sequence variability and flexibility is well known by crystallographers: when a protein does not crystallize despite repeated attempts, crystallographers are used to removing hypervariable regions, presumed to be flexible linkers. High sequence variability is not by itself evidence of disorder, but only an indicator. A simple method to assess sequence variability is visual inspection of a multiple sequence alignment. However, it can sometimes be misleading. Programs that rely on nucleotide substitution rate (as described by Brown et al. [57]) can be very informative and should be used for a more rigorous analysis [60].

3. PREDICTORS OF DISORDER

Recently, a number of groups have published predictors of disorder, the majority of which are available on the web (see http://www.disprot.org) (for reviews see [14,61]). These predictors are all based on the assumption that the absence of a rigid structure is encoded in specific features of the amino acid sequence (reviewed above). We herein briefly describe the various predictors and their philosophy.

Different predictors rely on different physico-chemical parameters. Therefore, a given predictor can perform better at detecting a given feature of a disordered protein. Thus, predictors are complementary, a point well illustrated in the section focused on practical examples (see section 5).
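One simple way to exploit this complementarity is a per-residue vote across predictors. The sketch below is only an illustration under stated assumptions: it supposes that each predictor's output has already been converted to a binary per-residue call (1 = disordered), and the predictor names, the majority threshold and the helper functions are all hypothetical. It does not reproduce the procedure of Fig. (1), which also relies on human inspection.

```python
def consensus_disorder(calls, min_votes=None):
    """Combine binary per-residue disorder calls from several predictors.

    calls: dict mapping a predictor name to a list of 0/1 values, one per
    residue. A residue is flagged when at least `min_votes` predictors
    agree (default: strict majority). Illustrative assumption only.
    """
    lengths = {len(v) for v in calls.values()}
    if len(lengths) != 1:
        raise ValueError("all predictors must cover the same sequence length")
    if min_votes is None:
        min_votes = len(calls) // 2 + 1
    n = lengths.pop()
    return [sum(c[i] for c in calls.values()) >= min_votes for i in range(n)]


def to_segments(flags):
    """Turn a boolean mask into 1-based inclusive (start, end) segments."""
    segments, start = [], None
    for i, flagged in enumerate(flags):
        if flagged and start is None:
            start = i
        elif not flagged and start is not None:
            segments.append((start + 1, i))
            start = None
    if start is not None:
        segments.append((start + 1, len(flags)))
    return segments


# Toy calls from three hypothetical predictors over a 6-residue sequence.
toy_calls = {
    "predictor_a": [1, 1, 1, 0, 0, 0],
    "predictor_b": [1, 1, 0, 0, 0, 1],
    "predictor_c": [0, 1, 1, 0, 1, 1],
}
toy_segments = to_segments(consensus_disorder(toy_calls))
```

Regions where the predictors disagree are often the most informative ones, so in practice the per-predictor calls are worth keeping alongside the consensus.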


As there is no consensus on what disorder means, it is necessary to know precisely what is predicted by each method. For instance, "long disordered regions" predicted by PONDR correspond to regions that are not rigid, including random coils, partially unstructured regions, and molten globules. Conversely, if a protein is predicted to be unstructured by the charge/hydropathy method, it means that it is probably fully unstructured (random coil). This issue has been recently addressed by a systematic comparison between these two prediction methods [62].

Most predictors rely on training against a dataset of disordered protein regions. These datasets are either entirely built up by the authors or represent improvements of existing datasets. Despite continuous efforts, these datasets retain some inconsistencies and are necessarily biased, since large regions of disorder can prevent crystallization. Furthermore, these datasets contain relatively few disordered proteins. Indeed, DisProt (http://www.disprot.org/) [63], which is the largest publicly available database of disordered proteins whose disorder has been experimentally assessed, contains only about 470 entries. For these reasons, it is useful to distinguish two kinds of predictors: those that have been trained on datasets of disordered proteins (PONDR, Globplot, Disembl, Disopred2, RONN, PreLink, DisPro, SPRITZ), and those that have not, namely the charge/hydropathy method (and its derivative Foldindex), NORSp, IUPred, FoldUnfold and DRIP-PRED. While the former identify disordered regions on the basis of the peculiar sequence properties that characterize them, the latter identify disorder as lack of ordered 3D structure. The second group of predictors avoids the shortcomings and biases associated with the disordered datasets. Therefore, they are expected to perform better than the former methods on disordered proteins presently under-represented in training datasets (i.e. fully or mostly disordered proteins).

3.1. Predictors Trained on Datasets of Disordered Proteins

PONDR (Predictor of Natural Disordered Regions) (see http://www.pondr.com), a neural network based on local amino acid composition, flexibility, and other sequence features, was the first predictor to be developed [47,64,65]. Notably, PONDR has found an application in the study of structured proteins. Indeed, there is a strong inverse correlation between the VL-XT score within ordered regions and the presence of dehydrons, which are underwrapped backbone hydrogen bonds, recently identified as a major determinant of protein-protein interactions [66,67]. PONDR is available in various versions, each having its own specificities. VSL1 performs better at identifying short regions of disorder, while VL3 should be preferred to delineate domains as it gives smoother predictions. Notably, VL-XT can highlight potential protein-binding regions, indicated by sharp drops in the middle of long disordered regions (for examples see [62]). However, the accuracy of all these predictors is limited for short disordered regions (<30 residues). Accordingly, the group of Dunker has recently developed a new predictor, VSL2, which is intended to give accurate predictions regardless of the length of the disordered region [68]. The VSL2 predictor is based on a support vector machine. The dataset, obtained from both DisProt and PDB, has been split into two groups on the basis of the length of disorder (i.e. >30 and <30 residues). VSL2 turned out to behave well with both sub-groups and to be able to identify short disordered regions that were mispredicted by the previous PONDR predictors [68]. The publicly available VSL2 server (see http://www.ist.temple.edu/disprot/predictorVSL2.php) consists of two variants of the VSL2 predictor: VSL2B is the baseline model that uses only 26 features calculated from the amino acid sequence, while the more accurate VSL2P uses 22 additional features derived from PSI-BLAST [69] profiles. The VSL2 predictor integrating the full set of different features (including residue features, PSI-BLAST profiles, and secondary structure PHD and PSI-PRED predictions) can be downloaded from http://www.ist.temple.edu/disprot/predictorVSL2.php.

Globplot (see http://globplot.embl.de) uses a new scale called "Russell/Linding", specially developed to express the propensity for a given amino acid to be in "random coil" or in "regular secondary structure" [42]. It also provides an easy overview of modular organization of large proteins thanks to user-friendly, built-in SMART, PFAM and low complexity predictions. Note that in Globplot outputs, changes of slope often correspond to domain boundaries.

Disembl (see http://dis.embl.de) is based on a neural network and consists of three separate predictors, trained on separate datasets, that comprise respectively residues within "loops/coils" (as defined by DSSP [70]), "hot loops" (loops with high B-factors – i.e. very mobile from X-ray crystal structure), or that are missing from the PDB X-ray structures (called "Remark 465") [71]. The partitioning of residues into different flexibility groups is very useful depending upon the user's goals (for instance "hot loop" may be used to correlate certain functional aspects of proteins with mobile loops, while "Remark 465" may be used to detect linkers likely to affect crystallization). This predictor also provides prediction of low sequence complexity and aggregation propensity.

Disopred2 (see http://bioinf.cs.ucl.ac.uk/disopred) is based on support vector machine classifiers trained on PSI-BLAST profiles [72]. It therefore incorporates information from multiple sequence alignments since its inputs are derived from sequence profiles generated by PSI-BLAST. Hence, prediction accuracy is lower if there are few homologues.

RONN (see http://www.strubi.ox.ac.uk/RONN) uses a novel approach, a bio-basis function neural network. It relies on the calculation of "distances", as determined by sequence alignment, from well-characterized prototype sequences (ordered, disordered or a mixture of both). Its key feature is that amino acid side chain properties are not considered at any stage [73].

DISPRO (see http://www.ics.uci.edu/~baldig/dispro.html) is based on a neural network [74]. It combines several parameters, including sequence profiles obtained by PSI-BLAST, secondary structure predictions and solvent accessibility. This predictor was trained on disordered sequences (i.e. regions of missing atomic coordinates) derived from the PDB.

The SPRITZ server (see http://protein.cribi.unipd.it/spritz) takes into account sequence profiles obtained by PSI-BLAST and structure predictions. SPRITZ uses two separate predictors based on support vector machines trained on different datasets [75]. The training dataset of short disordered regions (less than 45 residues) was derived from a subset of PDB sequences with short regions of missing density, while the training dataset of long regions was derived from both DisProt and PDBselect25 (which is a subset of the PDB [76]). This server allows the submission of several sequences at one time and offers the possibility of choosing between predictions of short or of long disordered regions.

PreLink (see http://genomics.eu.org) relies on amino acid composition and on low hydrophobic cluster content [77]. In this respect, it is a derivative of the Hydrophobic Cluster Analysis (HCA), a powerful approach that is discussed below. PreLink predicts regions that are expected to be unstructured in all conditions, regardless of the presence of a binding partner. Thus, it generally predicts as ordered those disordered regions that have the potential to become ordered in the presence of a partner (i.e. to undergo induced folding). PreLink is the first predictor that statistically proved the ability of HCA to detect linkers, an ability that had long been noticed but never previously demonstrated.

3.2. Predictors that have not been Trained on Disordered Proteins

The charge/hydropathy analysis (see http://www.pondr.com) is based on the elegant reasoning that folding of a protein is governed by a balance between attractive forces (of hydrophobic nature) and repulsive forces (electrostatic, between similarly charged residues) [1]. Thus, globular proteins can be distinguished from unstructured ones based on the ratio of their net charge versus their hydropathy. A drawback of this approach is that it gives only a global (i.e. not positional) indication, which is not valid if the protein is composed of both ordered and disordered regions. It can only be applied to protein domains, implying that prior knowledge of the modular organization of the protein is required. A derivative of that method, Foldindex (see http://bip.weizmann.ac.il/fldbin/findex), solves this problem by computing the charge/hydropathy ratio along the protein [78]. However, Foldindex does not provide reliable predictions for the N- and C-termini and is therefore not recommended for proteins with less than 100 residues.
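The global charge/hydropathy test can be sketched in a few lines of Python. Several elements below are assumptions not stated in this text: hydropathy is taken from the Kyte-Doolittle scale rescaled to the 0-1 range, the net charge counts only K and R as positive and D and E as negative, and the boundary coefficients (2.785 and 1.151) are the values commonly quoted for this plot. The function names are hypothetical, and the sketch does not reproduce the implementation of any particular server.

```python
# Kyte-Doolittle hydropathy values (assumed scale; rescaled to 0-1 below).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(seq: str) -> float:
    """Mean hydropathy on a 0-1 scale (Kyte-Doolittle shifted and scaled)."""
    return sum((KYTE_DOOLITTLE[a] + 4.5) / 9.0 for a in seq) / len(seq)


def mean_net_charge(seq: str) -> float:
    """Absolute mean net charge, counting K/R as +1 and D/E as -1."""
    positive = sum(seq.count(a) for a in "KR")
    negative = sum(seq.count(a) for a in "DE")
    return abs(positive - negative) / len(seq)


def charge_hydropathy_call(sequence: str) -> dict:
    """Global (non-positional) call for a single domain.

    Assumed boundary: <H> = (<R> + 1.151) / 2.785; sequences with a mean
    hydropathy below the boundary are called natively unfolded.
    """
    seq = sequence.upper()
    unknown = set(seq) - set(KYTE_DOOLITTLE)
    if not seq or unknown:
        raise ValueError(f"empty sequence or unsupported residues: {sorted(unknown)}")
    h, r = mean_hydropathy(seq), mean_net_charge(seq)
    boundary_h = (r + 1.151) / 2.785
    return {"H": h, "R": r, "predicted_unfolded": h < boundary_h}


# Toy sequences (not real proteins): one highly charged, one hydrophobic.
charged_call = charge_hydropathy_call("EEEEKKEEDD")
hydrophobic_call = charge_hydropathy_call("LLLLIIVVAA")
```

Because the result is a single value per sequence, the function should be applied to one domain at a time, in line with the limitation noted above.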

NORSp (No Ordered Regular Secondary structure predictor) (see http://cubic.bioc.columbia.edu/services/NORSp) generates multiple sequence alignments and relies on the principle that long regions predicted to be devoid of secondary structure and accessible to the solvent are generally unstructured [79]. However, this is not always true, as in the case of the Kringle domain mentioned above.

IUPred (see http://iupred.enzim.hu) uses a novel algorithm that evaluates the energy resulting from inter-residue interactions [80]. Although it was derived from the analysis of the sequences of globular proteins only, it allows the recognition of disordered proteins based on their lower interaction energy. This provides a new way to look at the lack of a well-defined structure, which can be viewed as a consequence of a significantly lower capacity to form favourable contacts, correlating with studies by the group of Galzitskaya (see below).

The FoldUnfold predictor (see http://skuld.protres.ru/~mlobanov/ogu/ogu.cgi) calculates the expected average number of contacts per residue from the amino acid sequence alone [81,82]. The average number of contacts per residue was computed from a dataset of globular proteins. A region is considered as natively unfolded when the expected number of close residues is less than 20.4 for its amino acids and the region is greater than or equal in size to the averaging window.
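The decision rule described for FoldUnfold (a window-averaged score below 20.4 over a stretch at least as long as the averaging window) can be sketched as follows. This is an illustration under loud assumptions: the per-residue values in `TOY_SCALE` are invented placeholders and are not the published expected-contact values, the window length of 11 is arbitrary, and the function names are hypothetical. Only the 20.4 threshold and the minimum-length condition come from the text.

```python
def window_average(values, window):
    """Centered sliding-window mean; windows are truncated at the termini."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def unfolded_regions(sequence, scale, window=11, threshold=20.4):
    """Return 1-based inclusive regions whose smoothed score stays below
    `threshold` over at least `window` consecutive residues."""
    values = [scale[a] for a in sequence.upper()]
    smoothed = window_average(values, window)
    regions, start = [], None
    # The appended sentinel (equal to the threshold) closes a trailing run.
    for i, score in enumerate(smoothed + [threshold]):
        if score < threshold and start is None:
            start = i
        elif score >= threshold and start is not None:
            if i - start >= window:
                regions.append((start + 1, i))
            start = None
    return regions


# Placeholder scale: hydrophobic residues get 23.0, all others 18.0.
# These numbers are purely illustrative, NOT the FoldUnfold values.
TOY_SCALE = {a: (23.0 if a in "WFIYVLMC" else 18.0) for a in "ACDEFGHIKLMNPQRSTVWY"}

toy_regions = unfolded_regions("S" * 20 + "L" * 20, TOY_SCALE)
```

With a real contact scale substituted for the placeholder, the same two functions would return candidate natively unfolded regions for a query sequence.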

DRIP-PRED (Disordered Regions In Proteins PREDiction) (see http://www.forcasp.org/paper2127.html) is based on a search for sequence patterns obtained by PSI-BLAST that are not typically found in the PDB [83]. If a sequence profile is not well represented in the PDB, then it is expected to have no ordered 3D structure. For a query sequence, sequence profile windows are extracted and compared to the reference sequence profile windows, and then an estimation of disorder is performed for each position. As a last step, the results of this comparison are weighted by PSI-PRED predictions.

3.3. "Non-Conventional" Disorder Predictors

The program SEG [48], which computes sequence complexity, was not developed to detect disordered regions but has been used successfully for that purpose by the group of Koonin [45]. The SEG program can be downloaded from ftp://ftp.ncbi.nih.gov/pub/seg/seg, while simplified versions with default settings can be run at either http://mendel.imp.univie.ac.at/METHODS/seg.server.html or http://www.ncbi.nlm.nih.gov/BLAST. The stringency of the search for low-complexity segments is determined by three user-defined parameters: trigger window length [W], trigger complexity [K(1)] and extension complexity [K(2)]. Typical parameters for disorder prediction of long non-globular domains are [W]=45, [K(1)]=3.4 and [K(2)]=3.75, while those for short non-globular domains are [W]=25, [K(1)]=3.0 and [K(2)]=3.3.
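To give a feel for what the trigger window [W] and trigger complexity [K(1)] control, the sketch below computes the Shannon entropy (in bits) of each window and reports the windows that fall at or below a trigger value. It is a deliberately simplified stand-in and not the SEG algorithm itself: it assumes that complexity can be approximated by the Shannon entropy of the window composition, it omits the extension step governed by [K(2)], and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def shannon_entropy(segment: str) -> float:
    """Shannon entropy of the residue composition of a segment, in bits."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def trigger_windows(sequence: str, window: int = 45, k1: float = 3.4) -> List[int]:
    """Return the 1-based start positions of windows whose entropy is <= k1.

    The defaults mirror the [W] and [K(1)] values quoted in the text for
    long non-globular domains; the extension step of SEG is not modelled.
    """
    starts = []
    for i in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[i:i + window]) <= k1:
            starts.append(i + 1)
    return starts
```

A homopolymeric window has an entropy of 0 bits, while a window in which all 20 amino acids are equally represented reaches log2(20), about 4.32 bits, which is why lowering [K(1)] makes the search more stringent.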

Another non-automated method that is very useful for unveiling unstructured regions is HCA (see http://bioserv.rpbs.jussieu.fr/RPBS/cgi-bin/) [84]. HCA makes use of a two-dimensional helical representation of protein sequences in which hydrophobic clusters are plotted along the sequence [84]. HCA stands apart from other predictors, since the latter only give insights into the extent of disorder/order, but do not correlate this information with the sequence itself. Furthermore, there is little one can actually learn from comparing the output of these predictors for homologous proteins. In contrast, HCA provides a representation of the short range environment of each amino acid, thus giving information not only on order/disorder but also on the folding potential (see section 6). Although HCA does not provide a quantitative prediction of disorder and rather requires human interpretation, it provides additional, qualitative information as compared to automated predictors, a point illustrated in the section focused on practical examples (see section 5). In particular, HCA highlights coiled-coils and regions with a biased composition, regions with potential for induced folding and very short potential globular domains. Finally, it allows meaningful comparison with related proteins and enables a better definition of the boundaries of disordered regions.


3.4. Error Rate of Predictors

A general error rate is difficult to evaluate, since it depends on the definition of disorder used, on the evaluation set, and on the criteria of evaluation. These points are well illustrated by the evaluation of disorder predictors within the recent Critical Assessment of Protein Structure Prediction, CASP6, where very different rankings were obtained as a function of the criteria used [85]. Moreover, the accuracy of a given predictor can be limited when predicting a type of disorder different from that against which it was trained.

Another reason that prevents the meaningful calculation of a precise error rate is the fact that a protein can be disordered by itself, and yet adopt a structure either in a cellular context (when binding to a partner, a phenomenon called "induced folding") or because of artefacts (crystal contacts during crystallization, for instance, or structure solved in a non-aqueous medium such as trifluoroethanol (TFE)). Many prediction “errors” fall in fact in these categories, as discussed in two recent articles by the groups of Poupon and of Dunker (see references therein cited, and (Figs. 4-5) in reference [77] and Fig. 6 in reference [62]).

For all the reasons stated above, we avoided indicating error rates for the various predictors. In general, predictors are more reliable in predicting order than in predicting disorder, as (i) ordered sequences comprise only a very narrow portion of sequence space, i.e. their sequence properties are much more recognizable, and (ii) because of the limited number of disordered protein sequences available for predictor training. From the authors' personal communications, a conservative accuracy for ab initio methods, such as Disopred2, Disembl and PONDR, is around 60-70% for predicting disorder, and about 80% for predicting order. Reportedly, the charge/hydropathy method has the best overall accuracy (83%, [81]). However, this method requires prior knowledge of the domain boundaries. Despite the inherent difficulty of estimating meaningful error rates, recent studies have pointed out that disorder predictors can be roughly classified into three categories [73,80]. These categories are by no means absolute, and are presented only for convenience, to allow a better interpretation of the results provided by the predictors and to optimize their use in combination.

Some predictors perform better on short disordered regions in the context of globally ordered proteins: Disopred2, Prelink and Disembl (Remark465). Actually, they were specifically developed with that aim. These predictors also have a good specificity (i.e. they predict relatively few ordered residues to be disordered), but a moderate sensitivity (i.e. they miss a significant number of disordered residues). IUPred performs comparatively well for predicting long disordered segments, and has a good sensitivity. Finally, although no method has both a very high specificity and a very high sensitivity, some predictors are "polyvalent" (RONN, PONDR VSL1, Disembl "hot loops", FoldIndex and Globplot).
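The definitions of specificity and sensitivity used above can be made concrete with a short per-residue calculation. The sketch below assumes a hypothetical encoding in which observed and predicted states are given as strings of equal length, with `D` for disordered and any other character for ordered; the function name is illustrative only.

```python
from typing import Tuple


def sensitivity_specificity(observed: str, predicted: str) -> Tuple[float, float]:
    """Per-residue sensitivity and specificity for disorder prediction.

    Sensitivity: fraction of truly disordered residues predicted as disordered.
    Specificity: fraction of truly ordered residues predicted as ordered.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == "D" and p == "D")
    fn = sum(1 for o, p in pairs if o == "D" and p != "D")
    tn = sum(1 for o, p in pairs if o != "D" and p != "D")
    fp = sum(1 for o, p in pairs if o != "D" and p == "D")
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both ordered and disordered residues are required")
    return tp / (tp + fn), tn / (tn + fp)
```

For a toy protein observed as `DDDDOOOOOO` and predicted as `DDOOOOOOOD`, the sensitivity is 0.5 (two of four disordered residues recovered) and the specificity is 5/6 (one ordered residue wrongly called disordered), the profile described above for predictors with good specificity but moderate sensitivity.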

However, we would like to point out that these observations are only aimed at guiding the user, and are in no way intended to provide a quantitative comparison. The section focused on practical examples (see section 5) will also give a qualitative idea of the respective sensitivities and specificities of these methods.

4. PREDICTING INDUCED FOLDING

The analysis of hydrophobic clusters and of secondary structures is of major interest for studying induced folding, because burial of hydrophobic residues provides the major driving force in protein folding. This force is in turn regulated by secondary structures that play a role in guiding the folding pathway. In some cases, hydrophobic clusters are found within secondary structure elements that are unstable in the native protein, but can stably fold upon binding to a partner. Therefore, HCA can be very informative in highlighting potential induced folding (for examples, see section 6).

Molecular Recognition Elements (MoREs) are regions within an intrinsically disordered protein that have a propensity to bind to a partner and thereby to undergo induced folding [86-88]. It has long been noticed that PONDR VL-XT can highlight potential MoREs [86,87]. For instance, a fine analysis of PONDR plots led to the identification of segments of increased structural propensity (i.e. prone to induced folding) in the RNA degradosome-organizing domain of the E. coli ribonuclease RNase E [89]. Moreover, the group of Dunker recently developed a program to identify α-helix-forming MoREs (called α-MoREs) from the amino acid sequence [87]. Linding also showed how neural network predictions of disorder can indicate the propensity of ELMs to undergo induced folding [30].

5. PRACTICAL EXAMPLES OF DISORDER PREDICTION

Fig. (1) illustrates a general sequence analysis scheme that integrates the peculiarities of each method to predict globular and disordered regions. As a first step, one should perform an analysis of sequence composition [90] and complexity [48], a search for signal peptides, transmembrane regions [91], leucine zippers [92], and coiled-coil regions [93-95], to premark regions of biased composition. This step is crucial in that it can avoid pitfalls that can lead to mispredictions, as exemplified in section 5.1.

It is also recommended to use DIpro [96] to identify possible disulfide bridges and to search for possible metal-binding regions by looking for conserved Cys3-His or Cys2-His2 motifs in multiple sequence alignments. Indeed, the presence of conserved cysteines and/or of metal-binding motifs prevents meaningful local predictions of disorder within these regions, as they may display features typifying disorder while gaining structure upon disulfide formation or upon binding to metal ions [1].

Then, ab initio methods, such as Globplot, Disembl, PONDR, Disopred2, IUPred, RONN, Prelink, Foldindex etc. can be combined to define a consensus on both globular and unstructured regions. Of course, any supplemental information, such as sequence similarity of a protein region to multi-domain proteins, is valuable for defining domain boundaries. Once a rough domain architecture for the protein of interest is established, the case of domains whose structural state is uncertain can be settled using the charge/hydropathy method, which has a quite low error rate (see above).
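One simple way to derive a consensus from several predictors is a per-residue vote. The sketch below is only one possible scheme, assumed here for illustration: each predictor contributes a list of disordered regions (1-based, inclusive), and a residue is retained when at least `min_votes` predictors agree. The function names and the voting rule are hypothetical and are not part of any of the cited programs.

```python
from typing import Dict, List, Tuple

Region = Tuple[int, int]


def flags_to_regions(flags: List[bool]) -> List[Region]:
    """Collapse per-residue flags into 1-based, inclusive (start, end) regions."""
    regions = []
    start = None
    for i, flagged in enumerate(flags):
        if flagged and start is None:
            start = i
        elif not flagged and start is not None:
            regions.append((start + 1, i))
            start = None
    if start is not None:
        regions.append((start + 1, len(flags)))
    return regions


def consensus_regions(predictions: Dict[str, List[Region]],
                      length: int,
                      min_votes: int) -> List[Region]:
    """Regions in which at least min_votes predictors call disorder."""
    votes = [0] * length
    for regions in predictions.values():
        covered = set()
        for start, end in regions:
            covered.update(range(start, end + 1))
        # Each predictor votes at most once per residue.
        for pos in covered:
            if 1 <= pos <= length:
                votes[pos - 1] += 1
    return flags_to_regions([v >= min_votes for v in votes])
```

Applied to the disordered regions quoted in section 5.2 for measles virus N (PONDR VSL2: 105-142 and 401-514; IUPred: 113-119, 125-132 and 397-525; Disopred2: 401-492; sequence length 525), a three-out-of-three vote returns the single region 401-492, whereas a two-out-of-three vote additionally retains 113-119 and 125-132 and extends the C-terminal region to 401-514.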

5.1. An Example of a Pitfall: A Coiled-Coil

To illustrate a possible pitfall in disorder prediction, we have chosen the Heat Shock Factor-binding Protein 1. Biophysical and biochemical analyses have shown that it consists of a long, trimeric coiled-coil, with the N and C-termini (respectively aa 1-8 and 58-76) being disordered [97]. PONDR VSL1 predicts the whole protein as disordered, while RONN predicts borderline disorder for most of the protein (Fig. (2)). SEG does not detect any long, non-globular region (Fig. (2)). However, when using more sensitive parameters, SEG detects a medium-length region of biased composition (aa 7-33). All other automated predictors predict the central region as ordered and the N and C-termini as disordered (Fig. (2)). A preliminary analysis using Multicoil [56] and HCA would have solved these discrepancies. In fact, Multicoil gives a high probability of coiled-coil over aa 30-60 (not shown), while the HCA plot is typical of a coiled-coil (a long and horizontally extended hydrophobic cluster encompassing aa 8-56) (Fig. (2)). Furthermore, it is quite obvious from the HCA plot that the protein has a biased composition, being rich in Q and D residues (Fig. (2)). In particular, the Q-rich region roughly corresponds to the low-complexity region detected by SEG (aa 7-33).

Fig. (2). Analysis of the human Heat Shock Factor-binding Protein 1 sequence (Genbank accession number: AF068754) using different predictors. Top, structural model of the protein based on data from [97]. The N and C-termini (thin lines) are disordered whereas the central region forms a triple coiled-coil. Numbers correspond to the amino acid boundaries of these regions. The graphical output of Disopred2 and RONN and the corresponding interpretation are shown. The precise boundaries of ordered and disordered regions were derived from the corresponding text output (not shown). SEG parameters are explained in the text. HCA conventions are explained in the caption.

Thus, performing the preliminary analysis shown in Fig. (1) would have allowed detection of a coiled-coil (which fooled some predictors into giving a wrong prediction of borderline disorder) and would have overcome this pitfall, while giving valuable information on the protein (biased composition). Once the structural status of the region 8-60 has been established as a coiled-coil, the comparative analysis of the ensemble of the results gives a more accurate prediction. Indeed, almost all predictors correctly predict disordered N and C-termini with reasonably accurate boundaries (Fig. (2)). This example also illustrates the advantage of using predictors that rely on different principles: for instance, since Prelink is based on HCA, it is expected to correctly predict coiled-coils as ordered. As another example, PONDR VL-XT gives a correct prediction, whereas another version of PONDR, VSL1, optimized to detect short disordered regions, is completely fooled by the coiled-coil, and predicts it as disordered.

5.2. Domain Identification

Fig. (3) illustrates the approach we used to study the domain organization of the nucleoprotein (N) of measles virus, a protein that encapsidates the viral RNA. As shown in Fig. (3A), most ab initio methods converge to show the presence of a disordered region at its C-terminus (consensus is aa 401-492), and of a globular core (consensus is aa 145-370). PONDR VSL2 predicts two long disordered regions (aa 105-142 and 401-514), and IUPred predicts three disordered regions (aa 113-119, 125-132 and 397-525). Interestingly, Foldindex highlights a very hydrophilic region (aa 100-150) (Fig. (3A)) that is also visible as a short plateau (aa 131-144) in the output of the Disembl Remark 465 predictor (data not shown). Moreover, this region is hypervariable in sequence among Morbillivirus members (not shown). Disopred2 predicts a long disordered region (aa 401-492). Finally, the analysis of the Globplot and PONDR VL-XT outputs (data not shown) indicates the presence of an ordered region spanning aa 145-344 and aa 113-308, respectively. From this analysis one would suspect the following domain organization: a first domain or sub-domain encompassing residues 1-105, which might be ordered although some predictors (PONDR VSL2, IUPred and Disopred2) predict a short region of disorder at the N-terminus; an exposed loop spanning aa 106-149 that is predicted by PONDR VSL2, Foldindex, Globplot and Disembl Remark 465 (Fig. (3A) and data not shown); a second, more compact domain (aa 150-400); and a disordered domain encompassing aa 401-525 (see Fig. (3B)). Furthermore, the charge/hydropathy method predicts that the region encompassing both suspected sub-domains (NCORE, aa 1-400) is ordered and confirms that the C-terminal domain (NTAIL, aa 401-525) is disordered (Fig. (3C)). Finally, analysis of the amino acid composition of NTAIL indicates that this region has the typical compositional bias observed in disordered proteins, being depleted in order promoting residues and enriched in disorder promoting residues (Fig. (3C)).
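The compositional bias mentioned above can be inspected with a simple count. The sketch below assumes the commonly used classification of residues into order promoting (W, C, F, I, Y, V, L, N) and disorder promoting (A, R, G, Q, S, P, E, K) classes, with H, M, T and D treated as neutral; this grouping is an assumption made for illustration and is not stated in the present text. It also reports raw fractions only, and does not reproduce the comparison with average PDB values shown in Fig. (3C).

```python
from typing import Tuple

# Assumed residue classes (not taken from the present article).
ORDER_PROMOTING = set("WCFIYVLN")
DISORDER_PROMOTING = set("ARGQSPEK")


def promoting_fractions(sequence: str) -> Tuple[float, float]:
    """Return (order-promoting fraction, disorder-promoting fraction)."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    seq = sequence.upper()
    n_order = sum(1 for aa in seq if aa in ORDER_PROMOTING)
    n_disorder = sum(1 for aa in seq if aa in DISORDER_PROMOTING)
    return n_order / len(seq), n_disorder / len(seq)
```

A region such as NTAIL would be expected to show a low first value and a high second value, but the fractions become informative only when compared with a reference set of ordered proteins.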

Experimental data available indicate that N is organized into 2 regions, NCORE (aa 1-400) and NTAIL (aa 401-525), respectively ordered [52] and disordered [98,99]. Indeed, limited proteolysis experiments showed that the NTAIL region is hypersensitive to trypsin, while the NCORE region is resistant to proteolysis [52] (see Fig. (4A)). In addition, spectroscopic experiments carried out on the purified NTAIL domain pointed out the lack of highly populated secondary structure (Fig. (4B)) [51,100].

The hypervariable region of N (aa 131-149) is indeed accessible to antibodies and thus exposed to the solvent [101]. However, a wealth of mutational data (see Karlin et al. [52] and references therein) indicates that NCORE cannot be divided into independent modules, but rather that the sub-domains indicated in Fig. (3B) (aa 1-105 and aa 150-400) probably fold cooperatively. Thus, the exposed region (aa 131-149) is probably a loop and not a linker that would connect two mobile domains. Whether it is disordered or not is not known. However, as it is not sensitive to proteolysis [52] it is probably at least partially ordered. These unsolved issues nicely illustrate the present limits of disorder prediction.

This example illustrates well how no single prediction method, nor even a combination of two predictors, could successfully unveil the organization of measles virus N, whereas the combined use of many predictors proved to be much more powerful in terms of domain boundary recognition.

Using a similar approach, we could decipher the modular organization of Paramyxoviridae phosphoproteins (see also [36]). The information about the modularity of the measles virus phosphoprotein (P) led us to delineate protein domains amenable to crystallization: indeed we obtained crystals of the multimerization domain (PMD, aa 304-375) (Longhi et al., unpublished data) and solved the crystal structure of the C-terminal domain of P (XD, aa 459-507) [102] (see Fig. (5)).

5.3. Refinement of Boundaries

Fig. (6) illustrates a frequently encountered case in disorder prediction, namely the occurrence of an extended region of intermediate length (20 < aa < 40) at one extremity of the protein followed by a globular domain (~70 aa). We have used the example of the Ubiquitin-like domain of hPLIC-2, whose structure has been solved by NMR [103]. As shown in Fig. (6) (bottom), the region encompassing residues 1-30 is devoid of regular secondary structure elements and is extended in solution. All predictors detect a disordered region in the N-terminal region (Fig. (6)); however, the predicted boundaries of the disordered region vary from one predictor to another, with a predicted length ranging from 19 to 69 residues (see Fig. (6)). Globplot predicts disorder for the 3-19 region, four predictors (Prelink, DisEMBL, VL-XT and IUPred) define the C-terminal boundary of the disordered region around residue 28, two predictors (Disopred2 and Foldindex) predict a disordered region spanning residues 1-32, two predictors (VSL1 and RONN) extend the prediction of disorder to the region encompassing residues 1-60 (see Fig. (6)), and SEG predicts a potential non-globular region within residues 2-69.


Fig. (3). Measles virus nucleoprotein (N) (accession number: P35972) sequence analysed with different predictors. A. The graphical output of the various methods and the corresponding interpretation are shown. The precise boundaries of ordered and disordered regions were derived from the corresponding text output (not shown). B. Predicted domain organization of measles virus N. C. Left. Net charge/hydrophobicity plot of measles virus N. The mean net charge (R) and the mean hydrophobicity (H) were calculated as described in [50]. The charge/hydrophobicity diagram is divided into 2 regions by a line, which corresponds to the equation H = (R+1.151)/2.785. In the left part of the diagram (where H < (R+1.151)/2.785), a protein is predicted as disordered, whereas it is predicted as ordered in the right part. Right. Deviation in amino acid composition from the average values in the PDB of the NTAIL region of measles virus N. The relative enrichment in order promoting (white bars) and disorder promoting (shaded) residues is shown.
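The boundary line given in the caption of Fig. (3) lends itself to a short worked example. The sketch below tests whether a sequence falls on the disordered side of the line H = (R+1.151)/2.785. Two points are assumptions made here rather than details taken from the text, which refers to [50] for the calculation: the mean hydrophobicity is computed from the Kyte-Doolittle scale rescaled to the 0-1 range, and the mean net charge is the absolute difference between the numbers of K/R and D/E residues divided by the sequence length (histidine is treated as neutral). Function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values (assumed scale, rescaled to 0-1 below).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydrophobicity(sequence: str) -> float:
    """Mean hydrophobicity H, with each residue rescaled to the 0-1 range."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum((KYTE_DOOLITTLE[aa] + 4.5) / 9.0 for aa in sequence) / len(sequence)


def mean_net_charge(sequence: str) -> float:
    """Mean net charge R: |(K + R) - (D + E)| divided by the length."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    positive = sum(sequence.count(aa) for aa in "KR")
    negative = sum(sequence.count(aa) for aa in "DE")
    return abs(positive - negative) / len(sequence)


def is_predicted_disordered(sequence: str) -> bool:
    """True when H < (R + 1.151) / 2.785, i.e. left of the boundary line."""
    h = mean_hydrophobicity(sequence)
    r = mean_net_charge(sequence)
    return h < (r + 1.151) / 2.785
```

Because the test is applied to a whole segment, it is meaningful only once domain boundaries have been delineated; in the example of measles virus N it would be run separately on NCORE and NTAIL rather than on the full-length protein.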


Fig. (4). A. Top. Schematic representation of the measles virus N protein showing that it consists of a globular domain (NCORE, aa 1-400) and a disordered domain (NTAIL, aa 401-525). 15% SDS-PAGE analysis of limited proteolysis of N leading to digestion of the NTAIL region and to obtaining the resistant NCORE fragment. Data are taken from [52]. B. Far-UV CD (top) and two-dimensional 1H NMR spectrum (bottom) of NTAIL in 10 mM sodium phosphate buffer at pH 7. Data are taken from [51].

Fig. (5). Modular organization of measles virus P protein. Thin boxes correspond to disordered regions, while large boxes correspond to structured regions. Crystals of the polymerization domain of P (PMD), as well as the cartoon representation of the crystal structure of the X domain (XD, PDB code 1OKS) are shown. The picture of XD was obtained using Pymol.


Fig. (6). Analysis of the sequence of the Ubiquitin-like protein domain of hPLIC-2 (accession number: Q9UHD9) using different predictors. The ribbon representation of the structure is shown (PDB code 1J8C). The picture was obtained using Pymol. The graphical output of various prediction methods and the corresponding interpretation are shown. The precise boundaries of ordered and disordered regions were derived from the corresponding text output (not shown). A black bar highlights the disordered region for each graphical output. The disordered region predicted by VLXT is highlighted with a dotted line.

Based on these results, one can be confident that the N-terminal moiety of the protein is disordered, although the exact C-terminal boundary of the unstructured region remains uncertain (predictions vary from residue 19 to residue 69). Use of HCA helps to reduce this uncertainty. The HCA plot clearly allows the identification of the 1-30 region as disordered (based on its almost total depletion in hydrophobic clusters), and of the 53-103 region as ordered (given its high density in such clusters). Thus one can confidently predict that the protein is organized into two moieties, with HCA having narrowed the boundary between these two regions down to the 30-52 region. In the absence of functional or biochemical clues (such as limited proteolysis studies), the production of various truncated versions of each moiety is recommended in view of functional or structural studies. Such constructs should start or end at incremental positions between residues 30 and 52 (for instance at the ends of predicted secondary structure elements (not shown)). Indeed, the experience that we gained in the context of past and present structural genomics projects developed in our laboratory (see SPINE and VIZIER projects at http://www.afmb.univ-mrs.fr/-The-Spine-Program- and http://www.afmb.univ-mrs.fr/-VIZIER-) has shown that a critical factor in obtaining good-quality protein crystals is the number of constructs generated around the predicted boundary of the domain under study.
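The construct design strategy described above can be written down as a short enumeration. The sketch below lists candidate boundaries at a fixed step within the uncertain interval and derives the corresponding N-terminal and C-terminal constructs. The fixed step is an arbitrary simplification assumed here; in practice the boundaries would rather be placed at the ends of predicted secondary structure elements, as recommended in the text. Function names are hypothetical.

```python
from typing import List, Tuple

Region = Tuple[int, int]


def candidate_boundaries(low: int, high: int, step: int) -> List[int]:
    """Boundary positions from low to high (inclusive), spaced by step."""
    if step <= 0 or low > high:
        raise ValueError("step must be positive and low must not exceed high")
    positions = list(range(low, high + 1, step))
    if positions[-1] != high:
        positions.append(high)  # always include the upper limit
    return positions


def design_constructs(protein_length: int, low: int, high: int,
                      step: int) -> Tuple[List[Region], List[Region]]:
    """Return (N-terminal constructs, C-terminal constructs), 1-based inclusive.

    Each boundary b gives one N-terminal construct (1, b) and one
    C-terminal construct (b + 1, protein_length).
    """
    bounds = candidate_boundaries(low, high, step)
    n_terminal = [(1, b) for b in bounds]
    c_terminal = [(b + 1, protein_length) for b in bounds if b < protein_length]
    return n_terminal, c_terminal
```

For the hPLIC-2 example, taking residue 103 as the end of the analysed sequence and a step of 11 residues between positions 30 and 52 gives the N-terminal constructs 1-30, 1-41 and 1-52 and the C-terminal constructs 31-103, 42-103 and 53-103.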

6. PRACTICAL EXAMPLES OF INDUCED FOLDING PREDICTION

Fig. (7) illustrates the approach we used to identify the regions involved in induced folding within the C-terminal disordered NTAIL domain of measles virus N. As shown in Fig. (7), the density of hydrophobic clusters and the secondary structure predictions indicate that the region spanning residues 400 to 525 cannot be ordered by itself: indeed the hydrophobic clusters in the 494-525 region are not long enough to lead to the formation of a compact domain, contrary to the regions encompassing the two globular domains within NCORE (aa 1-400). We suspected that the isolated hydrophobic cluster with a predicted α-helix within the disordered NTAIL domain (aa 494-504, see Fig. (7)) could correspond to a binding region for one of the partners of N, thereby undergoing induced folding. This region was successfully identified [99] by using the already mentioned α-MoRE predictor developed by the group of Dunker [62]. As shown in Fig. (7), PONDR VL-XT was also able to identify this NTAIL region with potential for induced folding: it appears as a sharp drop in the middle of the disordered region. That the α-MoRE does interact with the C-terminal X domain of measles virus P (XD), thus undergoing induced folding, was indeed experimentally proven using both biochemical and crystallographic approaches [100,102,104].
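The visual criterion used above, a sharp drop within an otherwise disordered region of a disorder profile, can be approximated by a simple scan. The sketch below is not the α-MoRE predictor and does not reproduce PONDR output: it assumes a per-residue score in which values of 0.5 or more denote predicted disorder, and it reports short below-threshold stretches that are flanked on both sides by disordered residues. The function name and the maximum dip length are hypothetical choices.

```python
from typing import List, Tuple


def find_dips(scores: List[float],
              threshold: float = 0.5,
              max_len: int = 30) -> List[Tuple[int, int]]:
    """Return 1-based, inclusive regions scoring below the threshold that are
    at most max_len residues long and flanked on both sides by residues
    scoring at or above the threshold."""
    dips = []
    i, n = 0, len(scores)
    while i < n:
        if scores[i] < threshold:
            j = i
            while j < n and scores[j] < threshold:
                j += 1
            # Stretches touching either end of the profile are not flanked
            # by disorder on both sides and are therefore ignored.
            flanked = i > 0 and j < n
            if flanked and (j - i) <= max_len:
                dips.append((i + 1, j))
            i = j
        else:
            i += 1
    return dips
```

Candidate regions obtained in this way remain hypotheses: as in the case of NTAIL, they gain weight when they coincide with an isolated hydrophobic cluster and a predicted secondary structure element, and they require experimental confirmation.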


Notably, a drop in the VL-XT output similar to that corresponding to the α-MoRE can also be observed for the 400-420 region (see Fig. (7)). The HCA plot of this region shows the presence of a small hydrophobic cluster, consistent with an additional induced folding region (see Fig. (7)). Interestingly, we have recently reported that this region has the potential to fold in the presence of TFE, as judged from the decrease in the mobility of a spin-label grafted at position 407, monitored by EPR spectroscopy [105].

Another example of identification of regions of induced folding is illustrated in Fig. (8). Two distinct methods for the prediction of disorder, namely PONDR and the method of the hydrophobicity/mean charge ratio, both converge to show that the N-terminal domain of measles virus P (PNT, aa 1-230) is disordered (see [50]). The disordered nature of PNT was then experimentally proven using hydrodynamic and spectroscopic approaches [50]. The HCA plot of PNT (see Fig. (8A)) is typical of a disordered protein, being depleted in large hydrophobic clusters. However, one such small cluster can be detected within residues 20-55, which may correspond to a putative induced folding region (Fig. (8A)). The incubation of PNT in the presence of increasing concentrations of TFE induces a pronounced gain of α-helicity (see Fig. (8B)). TFE is widely used as a probe for regions that have a propensity to undergo induced folding.

Accordingly, we carried out limited proteolysis experiments in the presence of TFE, which led to the identification of a thermolysin-resistant fragment. N-terminal sequencing and mass spectrometry analysis of this fragment showed that it encompasses residues 27 to 99 (see Fig. (8B), and [50]). This fragment does indeed contain a predicted α-helix (aa 27-38) that occurs within the above-mentioned hydrophobic cluster in the HCA plot (see Fig. (8A)). This α-helix may represent one of the secondary structure elements involved in the possible unstructured-to-structured transition of PNT upon binding to a partner.

7. CONCLUSION

As we have seen from a few examples, no single predictor can reveal the structural organization of a protein. However, in combination they provide relatively accurate results. Thus, there is room for improvement of predictors by combining features of several programs. It would also be of major interest to check whether known regions of induced folding correlate well with isolated hydrophobic clusters, corresponding to predicted α-helices, within disordered regions. Other improvements may arise from a better understanding of the different types, or "flavors" of disorder [106].

Fig. (7). HCA plot of measles virus N. Predicted secondary structure elements (as obtained by the JPRED program (http://barton.ebi.ac.uk/servers/jpred.html) [108]) are shown. Globular regions (see dotted line) are characterized by a thick distribution of hydrophobic clusters, while unstructured regions are poor or devoid of hydrophobic clusters. The long C-terminal disordered region is highlighted by a grey bar. The induced folding region (α-MoRE) is underlined. The graphical output of the VLXT prediction of the NTAIL domain is also shown, and the α-MoRE is highlighted by a black bar. The structure of the α-MoRE (dark grey α-helix) is presented in complex with the C-terminal domain of the measles virus P [109] (PDB code: 1T6O). The picture of the measles virus P was obtained using Pymol. Inset. Effect of removal of the α-MoRE in terms of NTAIL ability to bind to P and to undergo induced folding. The α-MoRE and the predicted α-helix are shown.


Fig. (8). A. HCA plot of measles virus P (accession number: NP_056919) N-terminal domain (PNT) showing the scarcity of hydrophobic clusters. Predicted secondary structure elements (as obtained by the JPRED program (http://barton.ebi.ac.uk/servers/jpred.html) [108]) are shown. The induced folding region is shaded. B. Left. Far-UV CD spectra of PNT in the presence of increasing TFE concentrations in sodium phosphate buffer pH 7, showing progressive gain of α-helicity. Right. 15% SDS-PAGE time-course analysis of a limited proteolytic digestion of PNT by thermolysin in the presence of 15% TFE. Arrows 1 and 2 show undigested PNT and a fragment resistant to proteolysis, respectively. N-terminal sequencing and mass spectrometry (MS) analysis of band 2 (framed) indicated that it spans residues 27 to 99. Data of panel B are taken from [50].

As a last, optimistic note, one should never give up carrying out a crystallization experiment because of an order/disorder prediction: recently, Mavrakis et al. submitted the phosphoprotein of rabies virus to crystallization trials, despite the fact that the N-terminal moiety, which accounts for more than half of the protein, was predicted to be disordered. Crystals formed in the drop. In fact the N-terminus had been cleaved off by contaminating proteases and the crystals were made of the C-terminal part … whose structure was readily solved [107]!

ACKNOWLEDGEMENTS

We thank K. Dunker, P. Romero, J. Ward, R. Linding, V. Uversky, J. Wootton, I. Callebaut, A. Poupon, R. Esnouf and Z. Dosztányi for their useful comments on their respective predictors. We are also grateful to François Ferron and David Karlin who contributed to this work.

REFERENCES

[1] Uversky, V. N., Gillespie, J. R. and Fink, A. L. (2000) Proteins, 41, 415-427.

[2] Dunker, A. K., Lawson, J. D., Brown, C. J., Williams, R. M., Romero, P., Oh, J. S., Oldfield, C. J., Campen, A. M., Ratliff, C. M., Hipps, K. W., Ausio, J., Nissen, M. S., Reeves, R., Kang, C., Kissinger, C. R., Bailey, R. W., Griswold, M. D., Chiu, W., Garner, E. C. and Obradovic, Z. (2001) J. Mol. Graph Model, 19, 26-59.

[3] Tompa, P. (2002) Trends Biochem. Sci., 27, 527.

[4] Uversky, V. N. (2002) Protein Sci., 11, 739-756.

[5] Tompa, P. (2003) J. Mol. Struct. (Theochem.), 666-67, 361-371.

[6] Fink, A. L. (2005) Curr. Opin. Struct. Biol., 15, 35-41.

[7] Dyson, H. J. and Wright, P. E. (2005) Nat. Rev. Mol. Cell Biol., 6, 197-208.

[8] Uversky, V. N., Oldfield, C. J. and Dunker, A. K. (2005) J. Mol. Recognit., 18, 343-384.

[9] Hansen, J. C., Lu, X., Ross, E. D. and Woody, R. W. (2006) J. Biol. Chem., 281, 1853-1856.

[10] Ward, J. J., Sodhi, J. S., McGuffin, L. J., Buxton, B. F. and Jones, D. T. (2004) J. Mol. Biol., 337, 635-645.

[11] Bogatyreva, N. S., Finkelstein, A. V. and Galzitskaya, O. V. (2006) J. Bioinformat. Comput. Biol., 4, 597-608.

[12] Iakoucheva, L. M., Brown, C. J., Lawson, J. D., Obradovic, Z. and Dunker, A. K. (2002) J. Mol. Biol., 323, 573-584.

[13] Dunker, A. K., Cortese, M. S., Romero, P., Iakoucheva, L. M. and Uversky, V. N. (2005) FEBS J., 272, 5129-5148.

[14] Uversky, V. N., Oldfield, C. J. and Dunker, A. K. (2005) J. Mol. Recognit., 18, 343-384.

[15] Chen, J. W., Romero, P., Uversky, V. N. and Dunker, A. K. (2006) J. Proteome Res., 5, 888-898.

[16] Gunasekaran, K., Tsai, C. J., Kumar, S., Zanuy, D. and Nussinov, R. (2003) Trends Biochem. Sci., 28, 81-85.

[17] Dunker, A. K. and Obradovic, Z. (2001) Nat. Biotechnol., 19, 805-806.

[18] Fuxreiter, M., Simon, I., Friedrich, P. and Tompa, P. (2004) J. Mol. Biol., 338, 1015-1026.

[19] Bienkiewicz, E. A., Adkins, J. N. and Lumb, K. J. (2002) Biochemistry, 41, 752-759.


[20] Lacy, E. R., Filippov, I., Lewis, W. S., Otieno, S., Xiao, L., Weiss,

S., Hengst, L. and Kriwacki, R. W. (2004) Nat. Struct. Mol. Biol., 11, 358-364.

[21] Kriwacki, R. W., Hengst, L., Tennant, L., Reed, S. I. and Wright, P. E. (1996) Proc. Natl. Acad. Sci. USA, 93, 11504-11509.

[22] Uversky, V. N. (2002) Eur. J. Biochem., 269, 2-12. [23] Dyson, H. J. and Wright, P. E. (2002) Curr. Opin. Struct. Biol., 12,

54-60. [24] Fletcher, C. M., McGuire, A. M., Gingras, A. C., Li, H., Matsuo,

H., Sonenberg, N. and Wagner, G. (1998) Biochemistry, 37, 9-15. [25] Bourhis, J. M., Receveur-Bréchot, V., Oglesbee, M., Zhang, X.,

Buccellato, M., Darbon, H., Canard, B., Finet, S. and Longhi, S. (2005) Protein Sci., 14, 1975-1992.

[26] Tompa, P. (2002) Trends Biochem. Sci., 27, 527-533. [27] Receveur-Bréchot, V., Bourhis, J. M., Uversky, V. N., Canard, B.

and Longhi, S. (2006) Proteins: Structure, Function and Bioinfor-matics, 62, 24-45.

[28] Iyer, L. M., Aravind, L., Bork, P., Hofmann, K., Mushegian, A. R., Zhulin, I. B. and Koonin, E. V. (2001) Genome Biol., 2,

RESEARCH0051. [29] Puntervoll, P., Linding, R., Gemund, C., Chabanis-Davidson, S.,

Mattingsdal, M., Cameron, S., Martin, D. M., Ausiello, G., Brannetti, B., Costantini, A., Ferre, F., Maselli, V., Via, A.,

Cesareni, G., Diella, F., Superti-Furga, G., Wyrwicz, L., Ramu, C., McGuigan, C., Gudavalli, R., Letunic, I., Bork, P., Rychlewski, L.,

Kuster, B., Helmer-Citterich, M., Hunter, W. N., Aasland, R. and Gibson, T. J. (2003) Nucleic Acids Res., 31, 3625-3630.

[30] Linding, R. (2004) Linear Functional Modules. Implication for protein function. In. Bioinformatics, Heidelberg

[31] Neduva, V., Linding, R., Su-Angrand, I., Stark, A., Masi, F. D., Gibson, T. J., Lewis, J., Serrano, L. and Russell, R. B. (2005) PLoS Biol., 3, e405.

[32] Friedberg, I., Jaroszewski, L., Ye, Y. and Godzik, A. (2004) Curr. Opin. Struct. Biol., 14, 307-312.

[33] Melamud, E. and Moult, J. (2003) Proteins, 53 Suppl 6, 561-565.

[34] Obradovic, Z., Peng, K., Vucetic, S., Radivojac, P. and Dunker, A. K. (2005) Proteins, 61, 166-182.

[35] Dosztanyi, Z., Csizmok, V., Tompa, P. and Simon, I. (2005) J. Mol. Biol., 347, 827-839.

[36] Karlin, D., Ferron, F., Canard, B. and Longhi, S. (2003) J. Gen. Virol., 84, 3239-3252.

[37] Ferron, F., Rancurel, C., Longhi, S., Cambillau, C., Henrissat, B. and Canard, B. (2005) J. Gen. Virol., 86, 743-749.

[38] Severson, W., Xu, X., Kuhn, M., Senutovitch, N., Thokala, M., Ferron, F., Longhi, S., Canard, B. and Jonsson, C. B. (2005) J. Virol., 79, 10032-10039.

[39] Ferron, F. P. (2005) Approches bioinformatiques et structurales des réplicase virales. In. Ecole Doctorale des Sciences de la Vie et de la Santé, Aix-Marseille II, Marseille

[40] Llorente, M. T., Barreno-Garcia, B., Calero, M., Camafeita, E., Lopez, J. A., Longhi, S., Ferron, F., Varela, P. F. and Melero, J. A. (2006) J. Gen. Virol., 87, 159-169.

[41] Dunker, A. K., Lawson, J. D., Brown, C. J., Williams, R. M., Romero, P., Oh, J. S., Oldfield, C. J., Campen, A. M., Ratliff, C. M., Hipps, K. W., Ausio, J., Nissen, M. S., Reeves, R., Kang, C., Kissinger, C. R., Bailey, R. W., Griswold, M. D., Chiu, W., Garner, E. C. and Obradovic, Z. (2001) J. Mol. Graph Model, 19, 26-59.

[42] Linding, R., Russell, R. B., Neduva, V. and Gibson, T. J. (2003) Nucleic Acids Res., 31, 3701-3708.

[43] Egloff, M. P., Benarroch, D., Selisko, B., Romette, J. L. and Canard, B. (2002) EMBO J., 21, 2757-2768.

[44] Weathers, E. A., Paulaitis, M. E., Woolf, T. B. and Hoh, J. H. (2004) FEBS Lett., 576, 348-352.

[45] Koonin, E. V. and Galperin, M. (2003) Sequence-Evolution-Function: Computational Approaches in Comparative Genomics, Kluwer Academic Publishers

[46] Liu, J., Tan, H. and Rost, B. (2002) J. Mol. Biol., 322, 53-64.

[47] Romero, P., Obradovic, Z., Li, X., Garner, E. C., Brown, C. J. and Dunker, A. K. (2001) Proteins, 42, 38-48.

[48] Wootton, J. C. (1994) Comput. Chem., 18, 269-285.

[49] Dyson, M. R., Shadbolt, S. P., Vincent, K. J., Perera, R. L. and McCafferty, J. (2004) BMC Biotechnol., 4, 32.

[50] Karlin, D., Longhi, S., Receveur, V. and Canard, B. (2002) Virology, 296, 251-262.

[51] Longhi, S., Receveur-Brechot, V., Karlin, D., Johansson, K., Darbon, H., Bhella, D., Yeo, R., Finet, S. and Canard, B. (2003) J. Biol. Chem., 278, 18638-18648.

[52] Karlin, D., Longhi, S. and Canard, B. (2002) Virology, 302, 420-432.

[53] Linding, R., Schymkowitz, J., Rousseau, F., Diella, F. and Serrano, L. (2004) J. Mol. Biol., 342, 345-353.

[54] Tartaglia, G. G., Pellarin, R., Cavalli, A. and Caflisch, A. (2005) Protein Sci., 14, 2735-2740.

[55] Berger, B., Wilson, D. B., Wolf, E., Tonchev, T., Milla, M. and Kim, P. S. (1995) Proc. Natl. Acad. Sci. USA, 92, 8259-8263.

[56] Wolf, E., Kim, P. S. and Berger, B. (1997) Protein Sci., 6, 1179-1189.

[57] Brown, C. J., Takayama, S., Campen, A. M., Vise, P., Marshall, T. W., Oldfield, C. J., Williams, C. J. and Dunker, A. K. (2002) J. Mol. Evol., 55, 104-110.

[58] Romero, P. R., Zaidi, S., Fang, Y. Y., Uversky, V. N., Radivojac, P., Oldfield, C. J., Cortese, M. S., Sickmeier, M., LeGall, T., Obradovic, Z. and Dunker, A. K. (2006) Proc. Natl. Acad. Sci. USA, 103, 8390-8395.

[59] Chen, J. W., Romero, P., Uversky, V. N. and Dunker, A. K. (2006) J. Proteome Res., 5, 879-887.

[60] Hurst, L. D. (2002) Trends Genet., 18, 486.

[61] Ferron, F., Longhi, S., Canard, B. and Karlin, D. (2006) Proteins: Structure, Function and Bioinformatics, In press

[62] Oldfield, C. J., Cheng, Y., Cortese, M. S., Brown, C. J., Uversky, V. N. and Dunker, A. K. (2005) Biochemistry, 44, 1989-2000.

[63] Vucetic, S., Obradovic, Z., Vacic, V., Radivojac, P., Peng, K., Iakoucheva, L. M., Cortese, M. S., Lawson, J. D., Brown, C. J., Sikes, J. G., Newton, C. D. and Dunker, A. K. (2005) Bioinformatics, 21, 137-140.

[64] Romero, P., Obradovic, Z., Kissinger, C. R., Villafranca, J. E. and Dunker, A. K. (1997) Identifying disordered regions in proteins from amino acid sequences. In. Proceedings of the IEEE International Conference on Neural Networks.

[65] Li, X., Romero, P., Rani, M., Dunker, A. K. and Obradovic, Z. (1999) Genome Informatics, 10, 30-40.

[66] Fernandez, A. and Berry, R. S. (2004) Proc. Natl. Acad. Sci. USA, 101, 13460-13465.

[67] Fernandez, A., Scott, R. and Berry, R. S. (2004) Proc. Natl. Acad. Sci. USA, 101, 2823-2827.

[68] Peng, K., Radivojac, P., Vucetic, S., Dunker, A. K. and Obradovic, Z. (2006) BMC Bioinformatics, 7, 208.

[69] Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D. J. (1997) Nucleic Acids Res., 25, 3389-3402.

[70] Kabsch, W. and Sander, C. (1983) Biopolymers, 22, 2577-2637.

[71] Linding, R., Jensen, L. J., Diella, F., Bork, P., Gibson, T. J. and Russell, R. B. (2003) Structure (Camb), 11, 1453-1459.

[72] Ward, J. J., McGuffin, L. J., Bryson, K., Buxton, B. F. and Jones, D. T. (2004) Bioinformatics, 20, 2138-2139.

[73] Yang, Z. R., Thomson, R., McNeil, P. and Esnouf, R. M. (2005) Bioinformatics, 21, 3369-3376.

[74] Cheng, J., Sweredoski, M. and Baldi, P. (2005) Data Min. Knowl. Disc., 11, 213-222.

[75] Vullo, A., Bortolami, O., Pollastri, G. and Tosatto, S. C. (2006) Nucleic Acids Res., 34, W164-168.

[76] Hobohm, U. and Sander, C. (1994) Protein Sci., 3, 522-524.

[77] Coeytaux, K. and Poupon, A. (2005) Bioinformatics, 21, 1891-1900.

[78] Zeev-Ben-Mordehai, T., Rydberg, E. H., Solomon, A., Toker, L., Auld, V. J., Silman, I., Botti, S. and Sussman, J. L. (2003) Proteins, 53, 758-767.

[79] Liu, J. and Rost, B. (2003) Nucleic Acids Res., 31, 3833-3835.

[80] Dosztanyi, Z., Csizmok, V., Tompa, P. and Simon, I. (2005) Bioinformatics, 21, 3433-3434.

[81] Garbuzynskiy, S. O., Lobanov, M. Y. and Galzitskaya, O. V. (2004) Protein Sci., 13, 2871-2877.

[82] Galzitskaya, O. V., Garbuzynskiy, S. O. and Lobanov, M. Y. (2006) Mol. Biol. (Moscow), 40, 341-348.

[83] MacCallum, R. M. (2006) FORCASP-CASP Forums Site,

[84] Callebaut, I., Courvalin, J. C., Worman, H. J. and Mornon, J. P. (1997) Biochem. Biophys. Res. Commun., 235, 103-107.

[85] Jin, Y. and Dunbrack, R. L., Jr. (2005) Proteins, 61, 167-175.

[86] Garner, E., Romero, P., Dunker, A. K., Brown, C. and Obradovic, Z. (1999) Genome Inform. Ser Workshop Genome Inform., 10, 41-50.

[87] Oldfield, C. J., Cheng, Y., Cortese, M. S., Romero, P., Uversky, V. N. and Dunker, A. K. (2005) Biochemistry, 44, 12454-12470.

[88] Mohan, A., Oldfield, C. J., Radivojac, P., Vacic, V., Cortese, M., Dunker, A. K. and Uversky, V. N. (2006) J. Mol. Biol., 362, 1043-1059.

[89] Callaghan, A. J., Aurikko, J. P., Ilag, L. L., Gunter Grossmann, J., Chandran, V., Kuhnel, K., Poljak, L., Carpousis, A. J., Robinson, C. V., Symmons, M. F. and Luisi, B. F. (2004) J. Mol. Biol., 340, 965-979.

[90] Wilkins, M. R., Gasteiger, E., Bairoch, A., Sanchez, J. C., Williams, K. L., Appel, R. D. and Hochstrasser, D. F. (1999) Methods Mol. Biol., 112, 531-552.

[91] Gasteiger, E., Gattiker, A., Hoogland, C., Ivanyi, I., Appel, R. D. and Bairoch, A. (2003) Nucleic Acids Res., 31, 3784-3788.

[92] Bornberg-Bauer, E., Rivals, E. and Vingron, M. (1998) Nucleic Acids Res., 26, 2740-2746.

[93] Lupas, A. (1996) Methods Enzymol., 266, 513-525.

[94] Lupas, A., Van Dyke, M. and Stock, J. (1991) Science, 252, 1162-1164.

[95] Lupas, A. (1997) Curr. Opin. Struct. Biol., 7, 388-393.

[96] Baldi, P., Cheng, J. and Vullo, A. (2004) Adv. Neural. Inf. Process Syst., 17, 97-104.

[97] Tai, L. J., McFall, S. M., Huang, K., Demeler, B., Fox, S. G., Brubaker, K., Radhakrishnan, I. and Morimoto, R. I. (2002) J. Biol. Chem., 277, 735-745.

[98] Longhi, S., Receveur-Brechot, V., Karlin, D., Johansson, K., Darbon, H., Bhella, D., Yeo, R., Finet, S. and Canard, B. (2003) J. Biol. Chem., 278, 18638-18648.

[99] Bourhis, J. M., Johansson, K., Receveur-Brechot, V., Oldfield, C. J., Dunker, A. K., Canard, B. and Longhi, S. (2004) Virus Res., 99, 157-167.

[100] Bourhis, J., Johansson, K., Receveur-Bréchot, V., Oldfield, C. J., Dunker, A. K., Canard, B. and Longhi, S. (2004) Virus Res., 99, 157-167.

[101] Giraudon, P., Jacquier, M. F. and Wild, T. F. (1988) Virus Res., 10, 137-152.

[102] Johansson, K., Bourhis, J. M., Campanacci, V., Cambillau, C., Canard, B. and Longhi, S. (2003) J. Biol. Chem., 278, 44567-44573.

[103] Walters, K. J., Kleijnen, M. F., Goh, A. M., Wagner, G. and Howley, P. M. (2002) Biochemistry, 41, 1767-1777.

[104] Kingston, R. L., Baase, W. A. and Gay, L. S. (2004) J. Virol., 78, 8630-8640.

[105] Morin, B., Bourhis, J. M., Belle, V., Woudstra, M., Carrière, F., Guigliarelli, B., Fournel, A. and Longhi, S. (2006) J. Phys. Chem., 110, 20596-20598.

[106] Vucetic, S., Brown, C., Dunker, K. and Obradovic, Z. (2003) Proteins, 52, 573-584.

[107] Mavrakis, M., McCarthy, A. A., Roche, S., Blondel, D. and Ruigrok, R. W. (2004) J. Mol. Biol., 343, 819-831.

[108] Cuff, J. A., Clamp, M. E., Siddiqui, A. S., Finlay, M. and Barton, G. J. (1998) Bioinformatics, 14, 892-893.

[109] Kingston, R. L., Hamel, D. J., Gay, L. S., Dahlquist, F. W. and Matthews, B. W. (2004) Proc. Natl. Acad. Sci. USA, 101, 8301-8306.

Received: September 15, 2006 Revised: October 01, 2006 Accepted: October 03, 2006