Automatic definition of recurrent local structure motifs in proteins



J. Mol. Biol. (1990) 213, 327-336

Automatic Definition of Recurrent Local Structure Motifs in Proteins

Marianne J. Rooman, Joaquin Rodriguez and Shoshana J. Wodak

Unité de Conformation des Macromolécules Biologiques, Université Libre de Bruxelles, CP160, P2, Av. P. Héger

1050 Bruxelles, Belgium

(Received 7 August 1989; accepted 3 January 1990)

An automatic procedure for defining recurrent folding motifs in proteins of known structure is described. These motifs are formed by short polypeptide fragments of equal size containing between four and seven residues. The method applies a classical clustering algorithm that operates on distances between selected backbone atoms. In one application, we use it to cluster all protein fragments into only four structural classes. This classification is rough considering the observed diversity of local structures, but comparable in homogeneity to the four classes of secondary structure (α-helix, β-strand, turn and coil). Yet, it discriminates between extended and curved coil and distinguishes β-bulges from β-strands. In a second application, the clustering procedure is combined with assignment of backbone dihedral angles to allowed regions in the Ramachandran map. This produces an exhaustive repertoire of highly homogeneous families of structural motifs that contains all the β-hairpins, βα- and αβ-loops previously defined by manual procedures, and new structural families of which two examples, a βα-loop and an α-helix beginning, are analyzed in detail. The described automatic procedures should be useful in categorizing structure information in proteins, thereby increasing our ability to analyze relations between structure and sequence.

1. Introduction

The most widely used definitions of local folding motifs in proteins are those of the well-known secondary structures α-helix (H), β-sheet (E), turn (T) and coil (C) (Pauling et al., 1951; Pauling & Corey, 1951; Venkatachalam, 1968; Chandrasekaran et al., 1973; Lewis et al., 1973; Smith & Pease, 1980; Rose et al., 1985; Flory, 1969). Their locations in proteins can be readily determined from visual inspection of the three-dimensional model. Precise objective criteria are required, however, to assign a particular residue in a protein to one of the classes of secondary structure in a reliable and uniform way. On the basis of such criteria, several procedures for secondary structure assignments have been devised. In the most widely used ones, assignments are made according to the pattern of H bonds made by the peptide units (Kabsch & Sander, 1983). Other related methods use criteria derived either directly from Cα co-ordinates (Levitt & Greer, 1977), or by comparing inter-Cα distances in the protein to those of idealized models of secondary structures (Richards & Kundrot, 1988).

Each of these procedures, when applied to a set of known protein structures, yields a self-consistent, though slightly different, set of secondary structure assignments. The largest discrepancies usually occur in defining the precise boundaries of secondary structure elements, owing to different built-in tolerance levels to local distortions from idealized geometries, such as those occurring at helix ends (Richardson & Richardson, 1988; Presta & Rose, 1988) and within β-sheets (Richardson et al., 1978). In addition, because of the strict requirement to form characteristic H bonds, it is not uncommon to see residues adopting very similar conformations assigned to different secondary structure classes. Hence, extended segments can be found in β-strands as well as in coil regions, while highly curved segments can occur within a helix, as well as in turns or in coil regions.

These observations suggest that different definitions of local folding motifs, based on direct measures of conformational resemblance, could be a useful alternative to secondary structures, in particular for investigating relations between structure and amino acid sequence. For that purpose, the analysis of specific folding motifs is also of great interest, particularly since it has long been recognized that proteins tend to display similar substructures and folding motifs (Rao & Rossmann, 1973; Levitt & Chothia, 1976; Richardson, 1981). Analyses of a growing number of well-resolved protein crystal structures have led to the identification and classification of specific local motifs embodied in the different classes of β-hairpins (Sibanda & Thornton, 1985; Milner-White & Poet, 1986), αβ-loops (Edwards et al., 1987) and αα-loops (Efimov, 1986; Thornton, 1988). Such classification and identification of local structures have proven to be extremely useful in modeling protein structures (Chothia et al., 1986; Rees & de la Paz, 1986; Blundell et al., 1987). Moreover, the concept of recurrent local conformations has further been extended to a genuine "spare parts" approach (Jones & Thirup, 1986; Claessens et al., 1989), in which fragments from known protein structures are used to interpret electron density maps and to model structural changes in engineered proteins.

0022-2836/90/100327-10 $03.00/0 © 1990 Academic Press Limited

This paper describes an automatic procedure for defining recurrent structure motifs in proteins. In this procedure, a classical clustering algorithm (Jambu, 1976; Jambu & Lebeaux, 1978) is applied to short polypeptide fragments of uniform lengths containing four to seven residues. These fragments are taken from a database of 75 highly resolved (≤2.5 Å; 1 Å = 0.1 nm) and refined protein structures with a low level of sequence identity (Huysmans et al., 1987; M. Huysmans, J. Richelle & S. Wodak, unpublished results). The criteria for clustering are based on conformational resemblance, assessed from distances between selected atoms in the fragment backbones.

First, the procedure is used to classify all protein fragments into only four classes, with the ultimate purpose of comparing structure prediction schemes based on this classification with those based on the well-known classes of secondary structure, as described in the accompanying paper (Rooman et al., 1990). We show that this results in populated classes that display a rather large degree of structural variability. Yet, the classification is able to discriminate between extended and curved coil, or to distinguish β-bulges from β-strands.

In a second application, the automatic clustering procedure is used as a tool for comprehensive and systematic classification of protein fragments into highly specific and homogeneous structural families. This is not a straightforward problem. Indeed, in a sufficiently large sample of proteins, short fragments span a virtually continuous range of conformations. Strictly speaking, measures of interatomic distances, or for that purpose of root-mean-square (r.m.s.†) deviations of atomic co-ordinates, can be used only to rank these conformations relative to each other by order of resemblance. On the basis of this ranking, families of conformations may be defined, but such definitions are not unique, since they require applications of arbitrary distance cutoffs. Nevertheless, automatic structure classification procedures along these lines have been reported (Jones, 1987; Unger et al., 1989) and shown to have useful applications.

† Abbreviation used: r.m.s., root-mean-square.

Here, a somewhat different approach is taken. A clustering procedure based on distances between Cα atoms is combined with criteria on backbone dihedral angles. The latter are not used to assess structural resemblance between fragments, since dihedral angles are generally ill-suited for homogeneous sampling of conformational space, but to assign observed backbone conformation to allowed regions of the Ramachandran map. This yields a discrete description of local structures, which has a physical basis, as allowed regions of the Ramachandran map represent energetically favorable conformations, separated from each other by energy barriers. This approach is able to redefine known classes of structures such as all β-hairpins, βα- and αβ-loops previously reported (Sibanda & Thornton, 1985; Edwards et al., 1987), and it is able to find new structure families, of which two examples are described in detail.

2. Materials and Methods

(a) Protein data

An important asset in the present study has been the use of a database (Huysmans et al., 1987, and unpublished results) containing information on 75 highly resolved and refined protein structures (see the legend to Fig. 3 for detailed list) from the Brookhaven databank (Bernstein et al., 1977) that exhibit a low level of sequence homology. This information includes atomic co-ordinates, amino acid sequence and secondary structure assignments obtained from the program DSSP (Kabsch & Sander, 1983), as well as higher-level descriptions of geometric and topologic parameters.

(b) Clustering algorithm

A clustering algorithm of the hierarchical ascending type (Jambu, 1976; Jambu & Lebeaux, 1978) is used to define families of most-resembling structures of protein fragments of fixed length. The algorithm starts by considering that every fragment forms a class or family of its own. Then, classes containing fragments whose structure is most similar are joined in pairs, until all fragments are grouped into a single class. The classification obtained by this procedure requires the definition of 2 key parameters: a measure of the distance between any 2 specific structural fragments, and a similarity index characterizing a given structural class that reflects the distance dispersion within the class.

The distance measure between 2 fragments X and Y is defined as the r.m.s. deviation of the inter-Cα distances:

$$\mathrm{r.m.s.}(X, Y) = \sqrt{\frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \left(d^{X}_{ij} - d^{Y}_{ij}\right)^{2}} \tag{1}$$

where n is the number of residues in fragments X and Y, and $d^{X}_{ij}$ and $d^{Y}_{ij}$ are the Euclidean distances between the Cα atoms of residues i and j of fragments X and Y, respectively. The advantage of this measure over that based on the r.m.s. deviation of atomic positions, computed after co-ordinate superposition (McLachlan, 1979), is its computational efficiency, a key factor when repeated comparisons of over 10,000 fragments must be performed. But it has the disadvantage of being invariant under reflection (as it is based only on distances) and is therefore a less accurate similarity measure (Cohen & Sternberg, 1980). This problem can, however, be readily corrected at subsequent stages, using finer measures based either on backbone dihedral angles or co-ordinate superpositions.
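The distance measure of eqn (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a fragment is represented as a list of Cα co-ordinates given as (x, y, z) tuples in Å, and the function names `inter_ca_distances` and `distance_rms` are ours, not taken from the original programs.

```python
import math
from itertools import combinations


def inter_ca_distances(coords):
    """Return the n(n-1)/2 pairwise C-alpha distances of one fragment."""
    return [math.dist(a, b) for a, b in combinations(coords, 2)]


def distance_rms(frag_x, frag_y):
    """Eqn (1): r.m.s. deviation between the inter-C-alpha distances of 2 fragments."""
    dx = inter_ca_distances(frag_x)
    dy = inter_ca_distances(frag_y)
    if len(dx) != len(dy):
        raise ValueError("fragments must contain the same number of residues")
    # 2 / (n(n-1)) times the double sum equals the mean over the n(n-1)/2 pairs
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(dx, dy)) / len(dx))


# A fragment and its mirror image have identical internal distances,
# which illustrates the invariance under reflection noted in the text.
frag = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.6, 0.0), (8.0, 4.5, 2.0)]
mirror = [(x, y, -z) for x, y, z in frag]
print(distance_rms(frag, mirror))  # 0.0
```

No superposition step is needed, which is the source of the computational advantage; the price is that the two mirror-related fragments above are scored as identical.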

The similarity index of a given class A is commonly defined as its inertia coefficient I(A), computed as the sum of the squared distances between all pairs of elements in the class or, equivalently, by the sum of the squared distances of individual elements in the class to its center of mass g (Jambu, 1976):

$$I(A) = \frac{1}{2P(A)} \sum_{X \in A} \sum_{Y \in A} \mathrm{r.m.s.}^{2}(X, Y) = \sum_{X \in A} \mathrm{r.m.s.}^{2}(X, g) \tag{2}$$

with

$$d^{g}_{ij} = \frac{1}{P(A)} \sum_{X \in A} d^{X}_{ij} \tag{3}$$

where X and Y are individual elements in class A, P(A) is the number of elements in that class, and g its center of mass, expressed as a pseudo-fragment with average inter-Cα distances $d^{g}_{ij}$ (eqn (3)). At each clustering level, the operation of merging 2 classes is subject to the constraint of keeping the sum of the inertia coefficients of all classes at a minimum.
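The equivalence of the 2 forms of the inertia coefficient in eqn (2) can be checked numerically. The sketch below assumes that each fragment is already reduced to its vector of inter-Cα distances (in a fixed pair order); the helper names and the toy numbers are illustrative only.

```python
import math


def rms2(u, v):
    """Squared r.m.s. between 2 inter-C-alpha distance vectors (eqn (1), squared)."""
    return sum((p - q) ** 2 for p, q in zip(u, v)) / len(u)


def center_of_mass(cls):
    """Eqn (3): pseudo-fragment holding the average of each inter-C-alpha distance."""
    return [sum(column) / len(cls) for column in zip(*cls)]


def inertia_pairwise(cls):
    """First form of eqn (2): double sum over all pairs, divided by 2 P(A)."""
    return sum(rms2(x, y) for x in cls for y in cls) / (2 * len(cls))


def inertia_centroid(cls):
    """Second form of eqn (2): squared distances of the elements to the center of mass."""
    g = center_of_mass(cls)
    return sum(rms2(x, g) for x in cls)


# Toy class of 3 three-residue fragments (distances in angstroms, invented values).
toy_class = [[3.8, 5.5, 3.8], [3.8, 6.5, 3.8], [3.8, 7.1, 3.8]]
print(math.isclose(inertia_pairwise(toy_class), inertia_centroid(toy_class)))  # True
```

The second form needs only P(A) evaluations instead of P(A) squared, which is why the centers of mass are carried along during clustering.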

The use of squared distances rather than distances in the summation of eqn (2) has the advantage that at each clustering level only the distances between centers of mass of the different classes need to be computed, and not the distances between all elements. Indeed, the inertia coefficient of the new class $A \cup B$ (created by merging classes A and B) is simply the sum of the individual inertia coefficients of A and B plus a weighted squared distance $D^{2}$ between the centers of mass $g_A$ and $g_B$ of the 2 classes (Jambu, 1976):

$$I(A \cup B) = I(A) + I(B) + D^{2}(g_A, g_B), \tag{4}$$

where

$$D^{2}(g_A, g_B) = \frac{P(A)\,P(B)}{P(A \cup B)}\, \mathrm{r.m.s.}^{2}(g_A, g_B). \tag{5}$$

Thus, at each clustering level it is sufficient to compute $D^{2}$ between the centers of mass of all classes in the level, to join the pair of classes for which $D^{2}$ is minimum, and to compute the center of mass $g_{A \cup B}$ and the number of elements in the new class $A \cup B$:

$$g_{A \cup B} = \frac{P(A)\,g_A + P(B)\,g_B}{P(A \cup B)},$$

where

$$P(A \cup B) = P(A) + P(B). \tag{6}$$
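The merging loop defined by eqns (4) to (6) can be sketched as follows. This is a naive illustration, not the implementation of Jambu (1976): it assumes each fragment is given as a vector of inter-Cα distances, recomputes all pairwise $D^{2}$ values at every level, and stops when a requested number of classes remains. The name `ward_cluster` and the one-dimensional toy data are ours.

```python
def rms2(u, v):
    """Squared r.m.s. between 2 distance vectors (eqn (1), squared)."""
    return sum((p - q) ** 2 for p, q in zip(u, v)) / len(u)


def ward_cluster(vectors, n_classes):
    """Join classes 2 at a time, always choosing the pair with minimum D^2 (eqn (5))."""
    if n_classes < 1:
        raise ValueError("n_classes must be at least 1")
    # Each class is (center of mass, member indices); every fragment starts alone.
    classes = [(list(v), [i]) for i, v in enumerate(vectors)]
    while len(classes) > n_classes:
        best = None
        for a in range(len(classes)):
            for b in range(a + 1, len(classes)):
                (ga, ma), (gb, mb) = classes[a], classes[b]
                pa, pb = len(ma), len(mb)
                d2 = pa * pb / (pa + pb) * rms2(ga, gb)  # eqn (5)
                if best is None or d2 < best[0]:
                    best = (d2, a, b)
        _, a, b = best
        (ga, ma), (gb, mb) = classes[a], classes[b]
        pa, pb = len(ma), len(mb)
        # eqn (6): population-weighted average of the 2 centers of mass
        g_new = [(pa * x + pb * y) / (pa + pb) for x, y in zip(ga, gb)]
        classes = [c for k, c in enumerate(classes) if k not in (a, b)]
        classes.append((g_new, ma + mb))
    return [sorted(members) for _, members in classes]


toy_vectors = [[1.0], [1.1], [5.0], [5.2], [9.0]]
print(ward_cluster(toy_vectors, 2))  # [[0, 1], [2, 3, 4]]
```

Because the merged center of mass is obtained directly from eqn (6), the members of a class are never revisited once it has been formed. For a database of over 10,000 fragments a more economical search for the minimum $D^{2}$ would be needed than the full double loop used here.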

(c) Structural families

The described clustering algorithm is applied to define structural families of fragments of length 4, 5, 6 and 7 residues belonging to the 75 structures of our database. The fragments are defined by sliding a probe corresponding to the required length along each polypeptide chain, 1 residue at a time. Residues whose description in the Brookhaven databank is incomplete (missing atoms), and residues whose Cα distance to the next residue is greater than 4.2 Å (to avoid including fragments spanning chain gaps), are excluded from the analysis.
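The sliding-probe extraction can be illustrated with a short sketch. It assumes that a chain is supplied as a list of Cα co-ordinates in Å and that a residue with an incomplete description is marked by `None`; both conventions, and the function name `fragments`, are choices made for this example only.

```python
import math

MAX_CA_STEP = 4.2  # angstroms; consecutive C-alpha atoms farther apart indicate a chain gap


def fragments(ca_coords, length):
    """Yield (start index, window) for every gap-free window of `length` residues."""
    for start in range(len(ca_coords) - length + 1):
        window = ca_coords[start:start + length]
        if any(c is None for c in window):
            continue  # residue with missing atoms
        if any(math.dist(a, b) > MAX_CA_STEP for a, b in zip(window, window[1:])):
            continue  # window spans a chain gap
        yield start, window


# Toy chain laid out along x: regular 3.8 A steps, then a 6.0 A gap after the 4th residue.
chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
         (11.4, 0.0, 0.0), (17.4, 0.0, 0.0), (21.2, 0.0, 0.0)]
print([start for start, _ in fragments(chain, 4)])  # [0]
```

Overlapping windows are kept on purpose, since every position of the probe defines a fragment.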

Figure 1. Illustration of a hierarchical clustering tree of the type used here. At the bottom level, 13 individual fragments marked A through M form classes on their own (filled circles). Moving upwards, classes are joined 2 at a time (filled squares), until all fragments are grouped into a single class (filled square at the top). In this scheme, classes are positioned along the vertical co-ordinate in order of increasing inertia coefficient I, as indicated at the right-hand side.

For each fragment length, the clustering procedure yields a tree of classes (or families), ordered by increasing inertia coefficient, as shown in Fig. 1. Depending on the level at which the tree is cut, one obtains either a large number of families, each containing highly similar fragments (cut performed at lower levels), or only a few families, each containing less well-resembling fragments (cut performed at higher levels). The appropriate level at which the tree is to be cut depends on the purpose of the classification. Here, we focus on 2 different applications that require cutting the clustering tree at quite different levels.

In the first application, the aim is to obtain a small number of structural families of short fragments, to be used as a definition of local structure that is different from secondary structure. To facilitate comparison with the 4 well-known secondary structure classes, helix (H), β-strand (E), turn (T) and coil (C)†, the clustering tree is cut at the level containing N = 4 classes. This description is rather rough, considering the observed diversity of local conformations. Therefore, to limit structural heterogeneity within each class, fragments longer than 7 residues are not considered.

Once the clustering procedure is complete, all overlapping fragments (of fixed length) in each of the parent proteins are assigned to one of the 4 classes. This information is then mapped into the linear sequence by assigning the fragment class to the middle residue of each fragment. The result, illustrated in Fig. 2, is a sequence of local structure assignments analogous to that obtained with conventional secondary structure definitions, in which the first and last residues of every polypeptide chain are not assigned.

Figure 2. Amino acid sequence of phospholipase A2, its secondary structure following the definitions of DSSP (Kabsch & Sander, 1983: H, α-helix; G, 3₁₀-helix; E, extended β; B, isolated β; T, turn; S, bend; C, coil) and assignments into the 4 structural families η, ε, ζ and λ, containing fragments of length 6, obtained by our clustering procedure.

† We use here the DSSP assignment for secondary structure (Kabsch & Sander, 1983). H includes α, π and 3₁₀-helices, E extended and isolated β, and T turns and bends.
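The mapping of fragment classes onto the linear sequence can be sketched as below. The text does not state which residue counts as the middle one for fragments of even length; this example assumes the lower of the 2 central positions, and the function name, the "-" symbol for unassigned residues and the toy assignments are likewise illustrative.

```python
def assign_to_middle(n_residues, length, fragment_classes, unassigned="-"):
    """Give each fragment's class label to its middle residue.

    fragment_classes maps the start index of a fragment to its class label.
    """
    offset = (length - 1) // 2  # assumption: lower middle position for even lengths
    labels = [unassigned] * n_residues
    for start, label in fragment_classes.items():
        labels[start + offset] = label
    return "".join(labels)


# Chain of 8 residues, fragments of length 5 starting at residues 0 to 3.
toy_classes = {0: "η", 1: "η", 2: "λ", 3: "ε"}
print(assign_to_middle(8, 5, toy_classes))  # --ηηλε--
```

The residues at both chain ends stay unassigned because no fragment is centered on them, as noted in the text.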

In the second application, the clustering procedure is used as a tool for comprehensive and systematic classification of protein fragments into highly homogeneous structural families. The level at which the clustering tree has to be cut for that purpose is, however, not uniquely defined. Indeed, in a sufficiently large sample of proteins, short fragments span a virtually continuous range of conformations. Strictly speaking, measures of interatomic distances or r.m.s. deviations of atomic co-ordinates can only rank these conformations relative to each other by order of conformational resemblance. To define conformational families based on this ranking, arbitrary criteria must be used. One can impose a limit on N, the number of families, or equivalently on the average family size, but this limit is arbitrary since neither quantity is known a priori. On the basis of previous knowledge, one may suspect that some structural families will contain only a few members (Edwards et al., 1987), while others, corresponding to portions of α-helix or β-strand, will be highly populated. Criteria based on measures of structural similarity can be used. Then, the clustering tree is cut at the level corresponding to a maximum allowed value (say 1 Å) of the weighted distance between the centers of mass of the merged classes. Such criteria are equally arbitrary and possibly biased, as one would want to avoid imposing a preconceived idea about structural likeness at this stage.

Quite a different approach to analyzing the clustering tree, which yields only a subset of potentially interesting structural families, does not cut the tree at a specific level. Instead, it selects from the entire clustering tree only those families whose members display a given pattern of secondary structure, with the additional requirement that a selected family not be contained in a larger retained class.

These different ways of analyzing the clustering tree yield, in general, different sets of structural classes that have the common feature of being reasonably homogeneous. Nevertheless, they occasionally contain quite dissimilar fragments, due to the fact that they are described by their Cα co-ordinates instead of a full backbone representation, and that the similarity measure is based on inter-Cα distances and not on conventional r.m.s. deviations of atomic co-ordinates computed after co-ordinate superposition (Cohen & Sternberg, 1980). To further reduce conformational diversity, additional information must be used. A limit may be set on the conventional r.m.s. deviation of all backbone atoms from the average fragment representing the class. Alternatively, backbone dihedral angles may be computed and required to be in the range of, say, ±20° from the average values of the class. It is, however, meaningless to compute representative fragments or average backbone angles at this stage, since families may contain dissimilar conformations. It would make more sense to further divide the families into more homogeneous subclasses, before computing their representative conformations. This may be done by assigning observed conformations to allowed regions of the Ramachandran map, based on values of backbone dihedral angles. These regions represent energetically favorable conformations, which are separated by energy barriers and therefore yield a discrete description of conformational space.

Here, we discuss examples of structural families composed of fragments 7 residues long and generated by cutting the clustering tree at the level where classes contain, on average, 50 members. These classes are further subdivided, whenever necessary, to contain fragments with φ, ψ angles in the same domains of the Ramachandran plot, as defined by Wilmot & Thornton (1989). To define the core of each class, a filter is applied, based on r.m.s. deviation of N, Cα and C atoms, computed after co-ordinate superposition (McLachlan, 1979). Only fragments that display less than 1 Å r.m.s. deviation relative to the central fragment, which exhibits the lowest average r.m.s. deviation relative to all other members of the class, are analyzed in detail.
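The selection of the central fragment and of the class core can be sketched as follows. The example takes as input a symmetric matrix of pairwise r.m.s. deviations (in Å) between the members of one class, assumed to have been computed beforehand after co-ordinate superposition; the superposition itself is not shown, and the function names and the toy matrix are ours.

```python
def central_fragment(rmsd):
    """Index of the member with the lowest average r.m.s. deviation to all others."""
    n = len(rmsd)
    return min(
        range(n),
        key=lambda i: sum(rmsd[i][j] for j in range(n) if j != i) / max(n - 1, 1),
    )


def class_core(rmsd, cutoff=1.0):
    """Return the central fragment and the members closer to it than `cutoff` angstroms."""
    center = central_fragment(rmsd)
    core = [j for j in range(len(rmsd)) if rmsd[center][j] < cutoff]
    return center, core


# Toy class of 4 fragments; the last one is an outlier.
toy_rmsd = [
    [0.0, 0.4, 0.6, 1.8],
    [0.4, 0.0, 0.5, 1.5],
    [0.6, 0.5, 0.0, 1.2],
    [1.8, 1.5, 1.2, 0.0],
]
print(class_core(toy_rmsd))  # (2, [0, 1, 2])
```

Here fragment 2 has the lowest average deviation (about 0.77 Å), and the outlier at 1.2 Å from it is left out of the core.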

3. Results

(a) Four classes of local structures

Clustering trees have been computed for short protein fragments of uniform lengths ranging from four to seven residues, and then cut four levels before the top (see Fig. 1) to yield four classes, each of which contains fragments that are only roughly alike. A representative fragment from each class, chosen so that its r.m.s. deviation with respect to all other members of its class is minimal, is depicted in Figure 3, together with the plot of its φ, ψ angles.

The four classes are designated according to the type of structures they contain as follows: helical, η; extended, ε; coil, ζ; and loop, λ. The same type of structural classes are found for all fragment lengths (4 to 7 residues). As expected, each class contains a majority of fragments with similar secondary structure, as shown in Figure 3: η is composed mainly of helices, ε of β-strands, ζ of everything but helices and λ of turns and coil. Yet, this classification discriminates between extended and curved coil, classifies some turns into the helical class and others into the loop class, divides β-strands into more or less extended β-strands and distinguishes bulges from β-strands. This is not surprising, since the definition of structure used here is based only on the relative positions of Cα atoms, and not on H bonds, emphasizing resemblance of a simplified backbone.

Automatic Definitions of Protein Structure Motifs 331

TI4 ¢4 ~4

, I I ~1~ ~1~ - . ~ ~o~[ ~'~ J - ~ ° I ............................. . ....... ~o,.~r. ....................... ~ . . . . . . . . ' . . . . .

--'IT O ~ --'IT , O 11" - - ~ O qT -- ' f f O ~T

(rad.) , (rod.) ~ (rod.) ~ (rod.)

E H T C E H T C E H T C E H T C

(a)

~lS cS ~5 ~-S

o ........ ~ i i . . . . . . . . . . . . . . . . . . . .

--'IT I - ~ r O 'IT

(rod.)

5~'~,,~_ ~

_ , f f - _

--'IT O "IT - -~r O "IT - ~ r O ¢T

(rod.) ~ (rod.) ~ (rQd.)

0 0 0 0 E H T C 1:. H T C E H T C E H T C

( b )

Fig. 3.

332 M . J . Rooman et al.

116 E6 ~6

*rr - - ' r i " I i I - - -- ' IT 0 qi" - - W 0 ' f f --11" 0 ' I f - - ' iT 0 "IT

(rod.) ~ (rod.) ¢ (rad.) ~ (rod.)

E H T C E H T C E H T C E H T C

(c)

117 E7 ~7 ~.7

o ................. i .....................

- ' l l r O ' r r

(rad.) - - ' f f O IT - ~ r O 11" -11" O

(rod.) ¢ (rod.) ~6 (rQd.) 100 0 0 0 0

E H T C E H T C E H T C E H T C

(a)

Figure 3. Four families of local structure obtained by applying the clustering procedure as described in the text, for fragments of uniform length containing from 4 to 7 residues. For each class, the backbone and the ~, ~ map of the representative fragment are displayed. Note that the ¢ value of the first residue and the ~/value of the last in each fragment are irrelevant, because their computation requires the 2 residues flanking the fragment that are not taken into account in the clustering. The fragment backbones are plotted using the molecular modeling package BRUGEL (Delhaise et al., 1985). In addition, the distribution (in %) of fragments in each class that contain the usual secondary structures E, H, T and C in at least 1 position, is given. Hence, fragments containing all 4 secondary structures are counted 4 times, once for each secondary structure. The 75 proteins used in this study are given by their Brookhaven

Automatic Definitions of Protein Structure Motifs 333

Structural classes contain between 1200 and 4900 members each, out of a total of approximately 13,000 fragments in the database. The average r.m.s, deviation between two fragments, based on inter-C ffi distances, is always large in the 2 (loop) family, irrespective of fragment length (see the legend to Fig. 3), reflecting the conformational diversity of this class. On the other hand, the helical family (r/), which is consistently highly populated for all fragment lengths, contains much more similar structures, confirming the more limited structural variability of this class. I t is noteworthy that most a-helical fragments are contained in r/, while 31o helices are rather evenly distributed among all classes except the extended one. Indeed, 31o helices tend to be shorter, rarely extending over the entire fragment length and, as a result, fragments that contain them display more extensive structural variability.

(b) Families of highly specific local structures

Another application of the clustering procedure is the automatic and comprehensive definition of recurrent local structural motifs in proteins. We present here results obtained for fragments seven residues long, using the criteria for defining struc- tural families described in Materials and Methods. The clustering procedure clearly recognizes all ft-hairpins reported by Sibanda & Thornton (1985), i.e. the two-residue ft-hairpins of types I' and II', and the four and five-residue ft-hairpins, as well as the types of aft and fta-loops reported by Edwards et al. (1987): aft1, aft3, fta3 and ftaO. In addition, new specific families are obtained. They comprise a three-residue ft-hairpin, an aft-loop and a fta-loop, which occur, respectively, four, six and ten times in the database. As an illustration, the most populated of these families (a fta-loop referred to as fta2 to be consistent with the above notation) is described in detail.

The superimposed backbones of the ten fragments composing the fta2 family are depicted in Figure 4(a), together with their ~b, ~b plots. The ¢, ~O angles of residue 2 to residue 6 fall, respectively, in domains ft~, aR, ftE, aR and ~t of the Ramachandran map, following the conventions of Wilmot &

Thornton (1989). The average r.m.s, deviation, obtained after co-ordinate superposition of N, C ~ and C atoms, with respect to the central fragment of the class (for definition, see Fig. 4), is 0"66 A. The territory of this class in the database, the corres- ponding amino acid sequences and the secondary structures are given in the legend to Figure 4(a). The fragments occur in rather unrelated proteins, such as y-crystallin, dihydrofolate reductase, crambin, sea lamprey hemoglobin, chains E and I from proteinase B, trypsin inhibitor, staphylococcal nuclease and carbonic anhydrase B.

In some instances, the helix portion adopts a 31o or a turn conformation (respectively, G and T in DSSP (Kabsch & Sander, 1983) notations), and in others the ft-strand is replaced by coil (C). In all fragments except two, an (i, i+3 ) main-chain H bond is formed between the fourth and seventh residue in the fragment. In seven out of ten cases, a H bond between the side-chain of residue 4, most commonly Ser/Thr or Asn/Asp, and the main chain of residue 7, is observed. Finally, in three cases, a H bond between the side-chains of these same residues is formed. Other side-chain-main-chain or side-chain-side-chain H bonds occur occasionally but their pattern is not conserved in all fragments.

The corresponding amino acid sequence displays no strictly conserved residues. A consensus sequence pattern can, however, be defined in terms of residue properties as follows: X-h-X-p-X-X-X, with h and p designating, respectively, hydrophobic and polar residues (Taylor, 1986), and X any residue. More- over, the hydrophobic position is occupied by aro- matic residues in 60 ~/o of the eases, while the polar position contains Ser/Thr and Asn/Asp in nine of the ten fragments.

Our procedure also finds several classes representing different types of beginnings and ends of α-helix and β-strand. An example of an α-helix beginning, termed Cα, is shown in Figure 4(b). It happens to be different from classes previously reported by Presta & Rose (1988), as it contains two instead of three residues before the helix. In this class, the φ, ψ angles of residues two to six fall, respectively, in domains β~, αR, αR, αR and αR of the Ramachandran map, and the average r.m.s. deviation, obtained after co-ordinate superposition of N,

databank code (Bernstein et al., 1977): 1APR, 2ACT, 1ACX, 4ADH, 5CHA, 2ABX, 2ALP, 1PPT, 2AZA, 1TPP, 3ICB, 2CAB, 5CPA, 8CAT, 2MT2, 2CTS, 2CNA, 1CRN, 2SOD, 2B5C, 2CCY, 1CCR, 2CYP, 3C2C, 2CDV, 1CY3, 1CC5, 155C, 451C, 3DFR, 2EST, 4APE, 2FD1, 3FXC, 4FXN, 1GCR, 2GN5, 1GPI, 1HMZ, 1ECD, 4HHB, 2LHB, 1HIP, 1FB4, 1INS, 2PKA, 1CTF, 4LDH, 2LH4, 2LZM, 1LZI, 2MDH, 1MLT, 1MBD, 1NXB, 9PAP, 1CPV, 1BP2, 1PCY, 2PAB, 2SGA, 3SGB, 3RP2, 1RHD, 1RN3, 4RXN, 2STV, 1SN3, 2SNS, 1SBT, 3TLN, 1TIM, 5PTI, 3WGA, 2YHX.
(a) Fragment families of length 4. Classes 74, 84, ~4 and ~ contain, respectively, 4858, 1795, 3151 and 2949 fragments; the average r.m.s. deviation between 2 fragments, based on inter-Cα distances, is 0.33 Å, 0.24 Å, 0.39 Å and 0.60 Å, respectively.
(b) Fragment families of length 5. Classes ~s, 8s, ~s and ~s contain, respectively, 3917, 2639, 3377 and 2700 fragments; the average r.m.s. deviation between 2 fragments, based on inter-Cα distances, is 0.48 Å, 0.44 Å, 0.82 Å and 1.07 Å, respectively.
(c) Fragment families of length 6. Classes r/6, 86, ~e and ~6 contain, respectively, 3645, 4845, 1246 and 2779 fragments; the average r.m.s. deviation between 2 fragments, based on inter-Cα distances, is 0.76 Å, 1.00 Å, 1.08 Å and 1.32 Å, respectively.
(d) Fragment families of length 7. Classes r/7, 87, ~7 and 2~ contain, respectively, 3997, 3469, 2652 and 2284 fragments; the average r.m.s. deviation between 2 fragments, based on inter-Cα distances, is 1.14 Å, 1.06 Å, 1.67 Å and 1.56 Å, respectively.

[Figure 4: panels (a) and (b); φ, ψ axes in radians, from −π to π.]

Figure 4. Backbone representation and φ, ψ map for 2 highly specific structural families containing 7 residues, obtained by applying the clustering algorithm. The backbone and φ, ψ maps were obtained using the molecular modeling package BRUGEL (Delhaise et al., 1985). Fragments of each family are superimposed by the algorithm AZFIT (McLachlan, 1979) implemented in BRUGEL. All fragments of a family are fitted onto a central fragment, defined as the fragment that displays the lowest average

Cα and C atoms with respect to the central fragment of the class, is 0.39 Å. The territory of this class in the database, the corresponding amino acid sequences and the secondary structures are given in the legend to Figure 4(b). These fragments occur once more in unrelated proteins, such as cytochrome c, calcium-binding protein, sea lamprey hemoglobin, thermolysin, parvalbumin B, citrate synthase, triose phosphate isomerase, rhodanese, endothiapepsin, Spirulina ferredoxin, lactate dehydrogenase and alcohol dehydrogenase.

The first and second residues are nearly always in coil conformation, exceptionally in β or bend conformation, while the last five residues are α-helical. In the majority of the cases (10 out of 15 fragments), side-chain-main-chain H bonds are formed between the second and fifth residues in the fragment. These occur in nearly all possible combinations, some of which follow observations made by Presta & Rose (1988). In five cases, the side-chains of both residues participate in the interactions; in five other cases, only one side-chain, either that of the second or that of the fifth residue, is involved. Interestingly, we observe that this interaction pattern disappears as the fragment

r.m.s. deviation, after superposition of N, Cα and C atoms, relative to the other fragments of the class. The residues in the fragments are numbered in sequential order from 1 to 7. The first and last residues of each fragment are not represented in the φ, ψ map, as the computation of the φ value of the first residue and the ψ value of the last requires the 2 residues flanking the fragment that are not taken into account in the clustering. The territory of each structural family is given by the Brookhaven databank code of the proteins and by the Brookhaven databank number of the 1st residue of each fragment in the parent polypeptide chain (Bernstein et al., 1977). The corresponding amino acid sequence and the secondary structure are given. Here, all 7 types of secondary structure defined in DSSP (Kabsch & Sander, 1983) are considered separately: H, α-helix; G, 3₁₀-helix; E, extended β; B, isolated β; T, turn; S, bend; C, coil.
(a) Two-residue turn βα2 between β-strand and α-helix. The territory of this structural class in the proteins of the database is:
1GCR 61, DYPDYQQ, EESSGGG
1GCR 150, EYRRYLD, EECSGGG
3DFR 75, VVHDVAA, EESSHHH
1CRN 3, CCPSIVA, ECSSHHH
2LHB 57, GLTTADE, TCCSHHH
3SGB I30, TYGNKCN, EESSHHH
5PTI 44, NFKSAED, CBSSHHH
3SGB E66, WWANSAR, EESSTTC
2SNS 16, KAIDGDT, EECSSSE
2CAB 127, KYSSLAE, TCSCHHH
(b) α-Helix beginning, denoted Cα. The territory of this structural class in the proteins of the database is:
2CYP 163, MNDREVV, CCHHHHH
3ICB 23, LSKEELK, BCHHHHH
2LHB 11, LSAAEKT, CCHHHHH
3ICB 44, STLDELF, SCHHHHH
3TLN 279, SNFSQLR, CCHHHHH
1CPV 97, IGVDEFT, EEHHHHH
2CTS 102, PTEEQVS, CCHHHHH
1TIM A16, GKRKSLG, CCHHHHH
1TIM A129, EKLDERE, ECHHHHH
1CCR 67, WEENTLY, CSHHHHH
1RHD 75, PSEAGFA, CCHHHHH
4APE 302, FGDVALK, ECHHHHT
3FXC 24, TYILDAA, SCHHHHH
4LDH 107, ESRLNLV, CCHHHHH
4ADH 200, LGGVGLS, CSHHHHH


structures diverge and as their r.m.s. deviation from the central fragment increases.

Here too, the amino acid sequence displays no strictly conserved position. Yet, in many instances, the amino acid preceding the helix, position 2 in the fragment, is occupied by residues that are statistically frequent at that position (Richardson & Richardson, 1988), i.e. Ser, Thr, Gly and Asn. However, residues such as Tyr, Gln and Lys, statistically less common at that position, also occur. The other conserved features are a hydrophobic or Arg residue at position 6, and a strong preference for Glu or Gln at position 5, contributing to the above-mentioned H bonding interactions. The latter preference is observed in eight cases, corresponding to fragments with the smallest r.m.s. deviation relative to the central one.
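
The central fragment referred to above is defined in the legend to Figure 4 as the fragment with the lowest average r.m.s. deviation relative to the other fragments of its class. A minimal sketch of that selection, assuming a precomputed symmetric matrix of pairwise deviations (the function name and the input layout are hypothetical):

```python
import numpy as np

def central_fragment(dist):
    """Return (index, mean deviations) for a symmetric (n, n) matrix of pairwise
    r.m.s. deviations with a zero diagonal."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # The diagonal is zero, so a row sum divided by n - 1 is the average
    # deviation of one fragment to all the other fragments of the class.
    mean_dev = dist.sum(axis=1) / (n - 1)
    return int(np.argmin(mean_dev)), mean_dev
```

Ranking the class members by their deviation to the returned fragment gives the ordering along which the H bond and sequence preferences are said to fade.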

4. Discussion

We describe automatic procedures for generating repertoires of recurrent three-dimensional structure motifs formed by short polypeptide fragments in 75 highly resolved and refined protein structures. These procedures have in common a classical clustering algorithm that operates on distances between Cα atoms. We show that the approach is general enough to be flexibly adapted to different applications.
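
A dissimilarity based on distances between Cα atoms can be sketched as follows. The exact formula is an assumption made for illustration: the r.m.s. difference over all intra-fragment Cα-Cα distances of two fragments of equal length. Function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def ca_distance_vector(ca):
    """All intra-fragment CA-CA distances for an (n_res, 3) co-ordinate array."""
    ca = np.asarray(ca, dtype=float)
    return np.array([np.linalg.norm(ca[i] - ca[j])
                     for i, j in combinations(range(len(ca)), 2)])

def distance_rms(ca_a, ca_b):
    """R.m.s. difference between the distance vectors of two equal-length fragments."""
    da = ca_distance_vector(ca_a)
    db = ca_distance_vector(ca_b)
    return float(np.sqrt(np.mean((da - db) ** 2)))
```

Such a measure needs no superposition, since internal distances are unchanged by rotation and translation. It is also unchanged by reflection, so it cannot by itself tell a fragment from its mirror image.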

Using the clustering alone, it is possible to group all protein fragments into four highly populated, and thus weakly specific, structural families. These classes are analogous in their conformational homogeneity to the commonly defined secondary structures, but they group a distinct set of conformations, since fragments assigned to the same secondary structure class may be segregated into different classes by ours. The purpose of this exercise is to generalize the concept of local structure in proteins by producing alternative descriptions of structural information, with the aim of providing new insights into the link between amino acid sequence and three-dimensional structure, as described in the accompanying paper (Rooman et al., 1990).

The automatic generation of an exhaustive repertoire of specific local structure motifs is a more difficult problem. Indeed, short fragments adopt a virtually continuous range of conformations, a property that is particularly apparent in a sufficiently large sample of proteins such as that considered here. We demonstrate that ways of defining distinct structural families can nevertheless be found. For fragments seven residues long, they require applying the classical clustering algorithm up to a level where classes contain about 50 members, and then subdividing these classes further so that they contain fragments with φ, ψ angles in the same allowed regions of the Ramachandran map. It is this latter step that leads to a discrete description of conformational space, since the allowed regions correspond to energetically favorable conformations that are separated by energy barriers.
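
The subdivision step described above can be sketched as grouping the members of a cluster by their string of Ramachandran region labels. The region boundaries below are coarse placeholders chosen for illustration and are not the domain definitions used in this work; angles are assumed to be in degrees, and all names are hypothetical.

```python
from collections import defaultdict

def region(phi, psi):
    """Coarse illustrative Ramachandran label (boundaries are assumptions)."""
    if phi < 0:
        return "alpha_R" if -120 <= psi < 50 else "beta"
    return "alpha_L" if -60 <= psi < 100 else "other"

def subdivide(cluster):
    """Split {fragment_id: [(phi, psi), ...]} into families sharing one label string."""
    families = defaultdict(list)
    for frag_id, angles in cluster.items():
        key = tuple(region(phi, psi) for phi, psi in angles)
        families[key].append(frag_id)
    return dict(families)

# Made-up angles for three toy fragments of two residues each.
toy = {"f1": [(-60, -45), (-120, 130)],
       "f2": [(-65, -40), (-110, 140)],
       "f3": [(-60, -45), (60, 40)]}
families = subdivide(toy)
```

Here f1 and f2 end up in one family and f3 in another, because a single residue crossing into a different region changes the key. This is what makes the resulting description discrete.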

Both this procedure and the inverse one, in which fragments are first assigned to allowed regions of the Ramachandran map and then subjected to clustering in Cartesian space (under study in our laboratory), have advantages over previously published methods (Jones, 1987; Unger et al., 1989), in that they avoid the use of arbitrary distance thresholds, usually 1 Å r.m.s. deviation in Cα co-ordinates, to limit conformational diversity within a family or to divide it into subclusters. They should be more effective than clustering procedures that operate entirely in backbone dihedral angle space, which can be designed with equal ease. There is no doubt that the latter would be faster than procedures based on distances between backbone atoms, owing to the reduced number of parameters. They would, however, be less effective for measuring conformational resemblance, as they do not provide homogeneous sampling of conformational space, as is evident from the simple consideration that a rotation about a bond in the middle of a fragment entails a much larger deformation than the same rotation performed at an extremity.
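
The last point can be checked on a toy model. The sketch below assumes a planar zigzag chain of seven points, which is not a real polypeptide, and measures deformation as the r.m.s. change over all interatomic distances; both choices are illustrative assumptions, as are the function names.

```python
import numpy as np
from itertools import combinations

def rotate_about_bond(coords, k, theta):
    """Rotate atoms k+2, k+3, ... by theta radians about the bond from atom k to k+1."""
    coords = np.asarray(coords, dtype=float).copy()
    origin = coords[k]
    axis = coords[k + 1] - origin
    axis /= np.linalg.norm(axis)
    v = coords[k + 2:] - origin
    # Rodrigues' rotation formula applied to all moving atoms at once.
    rotated = (v * np.cos(theta)
               + np.cross(axis, v) * np.sin(theta)
               + np.outer(v @ axis, axis) * (1.0 - np.cos(theta)))
    coords[k + 2:] = rotated + origin
    return coords

def distance_rms(a, b):
    """R.m.s. change over all interatomic distances between two conformations."""
    pairs = list(combinations(range(len(a)), 2))
    da = np.array([np.linalg.norm(a[i] - a[j]) for i, j in pairs])
    db = np.array([np.linalg.norm(b[i] - b[j]) for i, j in pairs])
    return float(np.sqrt(np.mean((da - db) ** 2)))

chain = np.array([[i, i % 2, 0.0] for i in range(7)])  # toy zigzag, 7 atoms
theta = np.pi / 2
middle = distance_rms(chain, rotate_about_bond(chain, 2, theta))    # bond 2-3
end = distance_rms(chain, rotate_about_bond(chain, 4, theta))       # bond 4-5
terminal = distance_rms(chain, rotate_about_bond(chain, 0, theta))  # bond 0-1
```

In this toy chain the same rotation deforms the fragment more when applied at the middle bond than near the end, and a rotation about the terminal bond leaves all internal distances unchanged. Equal steps in dihedral angle therefore do not correspond to equal steps in structure.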

An important test of how effective automatic procedures are in defining families of folding motifs consists of analyzing in detail the families they define. Doing that, we find that the method described here is able to re-define known classes of structure such as all the β-hairpin types and all the βα- and αβ-loops previously derived from manual inspection (Sibanda & Thornton, 1985; Edwards et al., 1987) and is able to find new structural families not previously observed. Analyzing all of them is outside the scope of this paper. As an illustration, however, two are described in detail. We find that fragments defining each motif occur mainly in unrelated proteins. A consensus sequence pattern can be derived from each motif, but these patterns are only weakly specific, with not enough information to uniquely characterize the motifs. Characteristic H bond patterns, which seem to have a stabilizing role, are observed in both motifs. They persist in the fragments that display low r.m.s. deviation with respect to the central one, and disappear as the fragment structures diverge.

Procedures described here should thus be of great help in classifying and categorizing structural information in proteins, and thereby contribute to improving our capacity to investigate the relations between structure and amino acid sequence.

We are grateful to J. Richelle and M. Huysmans for use of the SESAM protein database and for valuable discussions, and thank J. Thornton and C. Wilmot for many useful suggestions. We acknowledge Ph. Delhaise and M. Bardiaux for help in using the BRUGEL package, Ph. Berthet for friendly assistance with the computer systems, and S. De Moor for excellent secretarial support. J.R. and M.J.R. acknowledge support from the Institut pour l'Encouragement de la Recherche Scientifique dans l'Industrie et l'Agriculture (IRSIA).


References

Bernstein, F., Koetzle, T., Williams, G., Meyer, E., Brice, M., Rodgers, J., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). J. Mol. Biol. 112, 535-542.

Blundell, T., Sibanda, B., Sternberg, M. & Thornton, J. (1987). Nature (London), 326, 347-352.

Chandrasekaran, R., Lakshminarayanan, A., Pandya, U. & Ramachandran, G. (1973). Biochim. Biophys. Acta, 303, 14-27.

Chothia, C., Lesk, A., Levitt, M., Amit, A., Mariuzza, R., Phillips, S. & Poljak, R. (1986). Science, 233, 755-758.

Claessens, M., Van Cutsem, E., Lasters, I. & Wodak, S. (1989). Protein Eng. 2, 335-345.

Cohen, F. & Sternberg, M. (1980). J. Mol. Biol. 138, 321-333.

Delhaise, P., Van Belle, D., Bardiaux, M. & Wodak, S. (1985). J. Mol. Graph. 3, 116-119.

Edwards, M., Sternberg, M. & Thornton, J. (1987). Protein Eng. 1, 173-181.

Efimov, A. (1986). J. Mol. Biol. 20, 250-260.

Flory, P. (1969). Statistical Mechanics of Chain Molecules, Wiley, New York.

Huysmans, M., Richelle, J., Rodriguez, J., Rooman, M., Moreau, M., Denecker, M., Wodak, S. & Willems, Y. (1987). In Enzyme Engineering: Protein Design and Applications in Biocatalysis, edit. EEC, pp. 63-64, Capri.

Jambu, M. (1976). Cah. An. Données, 1, 77-92.

Jambu, M. & Lebeaux, M.-O. (1978). Classification Automatique pour l'Analyse des Données, Dunod, Paris.

Jones, A. (1987). Collected Abstracts IUC, 14th International Congress, Perth, edit. ML18-1. H. Freeman.

Jones, A. & Thirup, T. (1986). EMBO J. 5, 819-822.

Kabsch, W. & Sander, C. (1983). Biopolymers, 22, 2577-2637.

Levitt, M. & Chothia, C. (1976). Nature (London), 261, 552-558.

Levitt, M. & Greer, J. (1977). J. Mol. Biol. 114, 181-239.

Lewis, P., Momany, F. & Scheraga, H. (1973). Biochim. Biophys. Acta, 303, 211-229.

McLachlan, A. (1979). J. Mol. Biol. 128, 49-79.

Milner-White, E. & Poet, R. (1986). Biochem. J. 240, 289-292.

Pauling, L. & Corey, R. (1951). Proc. Nat. Acad. Sci., U.S.A. 37, 729-740.

Pauling, L., Corey, R. & Branson, H. (1951). Proc. Nat. Acad. Sci., U.S.A. 37, 205-211.

Presta, L. & Rose, G. (1988). Science, 240, 1632-1641.

Rao, S. & Rossmann, M. (1973). J. Mol. Biol. 76, 241-256.

Rees, A. & de la Paz, P. (1986). Trends Biochem. Sci. 11, 144-148.

Richards, F. & Kundrot, C. (1988). Proteins, 3, 71-84.

Richardson, J. (1981). Advan. Protein Chem. 34, 167-339.

Richardson, J. & Richardson, D. (1988). Science, 240, 1648-1652.

Richardson, J., Getzoff, E. & Richardson, D. (1978). Proc. Nat. Acad. Sci., U.S.A. 75, 2574-2578.

Rooman, M., Rodriguez, J. & Wodak, S. (1990). J. Mol. Biol. 213, 337-350.

Rose, G., Gierasch, L. & Smith, J. (1985). Advan. Protein Chem. 37, 1-109.

Sibanda, B. & Thornton, J. (1985). Nature (London), 316, 170-174.

Smith, J. & Pease, L. (1980). Crit. Rev. Biochem. 8, 315-399.

Taylor, W. (1986). J. Mol. Biol. 188, 233-258.

Thornton, J. (1988). ICSU Short Reports, 8, 8-9.

Unger, R., Harel, D., Wherland, S. & Sussman, J. (1989). Proteins, 5, 355-373.

Venkatachalam, C. (1968). Biopolymers, 5, 1425-1436.

Wilmot, C. & Thornton, J. (1990). Protein Eng. in the press.

Edited by R. Huber