Discovery of local packing motifs in protein structures



Inge Jonassen,1* Ingvar Eidhammer,1 and William R. Taylor1,2

1 Department of Informatics, University of Bergen, Bergen, Norway
2 Division of Mathematical Biology, National Institute for Medical Research, London, United Kingdom

ABSTRACT We present a language for describing structural patterns of residues in protein structures and a method for the discovery of such patterns that recur in a set of protein structures. The patterns impose restrictions on the spatial position of each residue, their order along the amino acid chain, and which amino acids are allowed in each position. Unlike other methods for comparing sets of protein structures, our method is not based on the use of pairwise structure comparisons, which are often time consuming and can produce inconsistent results. Instead, the method simultaneously takes into account information from all structures in the search for conserved structure patterns, which are potential structure motifs. The method is based on describing the spatial neighborhoods of each residue in each structure as a string and applying a sequence pattern discovery method to find patterns common to subsets of these strings. Finally, it is checked whether the similarities between the neighborhood strings correspond to spatially similar substructures. We apply the method to analyze sets of very disparate proteins from four different protein families: serine proteases, cupredoxins, cysteine proteinases, and ferredoxins. The motifs found by the method correspond well to the site and motif information given in the annotation of these proteins in PDB, Swiss-Prot, and PROSITE. Furthermore, the motifs are confirmed by using the motif data to constrain the structural alignment of the proteins obtained with the program SAP. This gave the best superposition/alignment of the proteins given the motif assignment. Proteins 1999;34:206–219. © 1999 Wiley-Liss, Inc.

Key words: protein structure; motif; pattern discovery; algorithm

INTRODUCTION

Many of the most interesting functional and evolutionary relationships among proteins are so ancient that they cannot be reliably detected through sequence analysis and are apparent only through a comparison of the tertiary structures. These features often take the form of structural motifs, consisting of perhaps a few secondary structures or residues that adopt a common substructure, typically in the structural core or functional center of the proteins. As these relationships are ephemeral, confidence in their reality can be greatly boosted when they are found in more than a pair of proteins. The standard approach to the analysis of more than two protein structures is based on the comparison of all pairs of proteins followed by post-analysis to search for, or check the veracity of, a suspected motif. This approach is indirect, as there is no cooperation between the separate pairwise comparisons and the critical component (the identification of the motif) is left to an often subjective subsequent analysis.

In this work we present an approach to the problem that avoids the often time-consuming and inconsistent pairwise comparison of structures. The method is based on an algorithm previously developed for the extraction of sequence motifs, which directly incorporates a “simultaneous view” of all structures.

Summary of Previous Methods

A large number of methods for the comparison of pairs of protein structures have been developed in recent years (see Brown et al. (1996) for a review).1 These divide into the dynamic-programming-based methods that require sequential co-linearity (Taylor and Orengo, 1989;2 Šali and Blundell, 19903) and other (more varied) algorithms, often based on graph analysis, that do not (Nussinov and Wolfson, 1991;4 Holm and Sander, 1993;5 Grindley et al., 1993;6 Artymiuk et al., 1994;7 Gibrat et al., 19978). Some of these methods have been extended to analyze more than two structures by applying the pairwise comparison method several times to build up similarities between an increasingly larger set of structures (Johnson et al., 1990;9 Taylor et al., 199410).

The faster methods represent the structures at the level of packing of secondary structure elements (SSEs), which can be an advantage when all pairs need to be compared but can be “blind” to motifs that are not composed of SSEs.

Overview of the Current Method

In this paper we present a new method for describing structural patterns of residues in which the pattern can impose constraints on the spatial position of each residue, their sequential order, and which amino acid types are allowed in each position. The language can also be extended to incorporate other types of constraints such as solvent exposure and secondary structure state of the residues. The patterns are called local packing patterns and such a pattern is called a local packing motif if it recurs in one structure or in a set of related structures. The method we describe can, in principle, simultaneously analyze an arbitrary number of protein structures (one or more) and find recurring local packing patterns which constitute potential motifs.

Grant sponsor: Norwegian Research Council.

*Correspondence to: Inge Jonassen, Department of Informatics, University of Bergen, Hoyteknologisentret, N-5020 Bergen, Norway.

Received 19 May 1998; Accepted 15 September 1998

PROTEINS: Structure, Function, and Genetics 34:206–219 (1999)

© 1999 Wiley-Liss, Inc.

Following the ideas of Karlin and co-workers (Karlin et al., 1994;11 Karlin and Zhu, 1996;12 Zhu and Karlin, 199613), the current method represents the spatial neighborhood of each residue (in each structure) as a string (or sequence) of the surrounding residues. A sequence motif discovery method (Jonassen et al., 1995;14 Jonassen, 199715) is then used to find motifs in the resulting set of neighbor strings.† The latter approach is the critical component in the current method since, even with the string encoding of the local structure, for two proteins with N residues, a conventional comparison method would still need to compare N² residue pairs (neighbor strings), as is done in the double dynamic-programming algorithm of Taylor and Orengo (1989).2

The Pratt algorithm of Jonassen (1997),15 however, operates directly on the individual neighbor strings in all the encoded proteins, irrespective of the protein in which they occur. Thus there is no pairwise comparison between proteins, and common patterns will be identified even if they recur within a single protein. Although the worst-case time complexity of Pratt is (in theory) exponential, there is a heavy dependency on the chosen pattern (solution) space, which in most applications, including the experiments done in this work, results in a running time of the order of minutes on a UNIX workstation.

The method described in this paper has been implemented in a program called SPratt.

METHODS

Definitions

Packing pattern

Given a protein sequence S formed from the alphabet of 20 amino acids ($\Sigma$), a packing pattern is a tuple

$$P = p_1 \ldots p_n \tag{1}$$

for some $n \geq 1$, where each $p_i$ can be written in the form

$$p_i = (x_i, y_i, z_i, A_i, a_i), \tag{2}$$

where $x_i, y_i, z_i$ are real numbers (coordinates), $A_i$ is an amino acid match-set ($A_i \subseteq \Sigma$), and $a_i$ describes constraints on the secondary structure, with $\alpha$, $\beta$, and $l$ representing α-helix, β-sheet, and loop, respectively ($a_i \subseteq \{\alpha, \beta, l\}$). This definition can be extended to include other discrete properties (e.g., exposed/buried).

Matching a packing pattern

A packing pattern $P = (x_1, y_1, z_1, A_1, a_1) \ldots (x_n, y_n, z_n, A_n, a_n)$ matches a protein (of sequence S) if there is an ordered list of residues $(s_1, \ldots, s_n)$ in S so that

1. the residues $s_1, \ldots, s_n$ are unique (each occurs only once),

2. the amino acid of $s_i$ is in the match-set $A_i$,

3. the secondary structure of $s_i$ is in $a_i$,

4. the residues $s_1, \ldots, s_n$ are in order from the amino (N) to the carboxy (C) terminal of S,

5. the coordinates of $s_1, \ldots, s_n$ can be superimposed onto the coordinates given in $p_1, \ldots, p_n$ within a root mean square deviation of at most f (the coordinates of each $s_i$ paired with $(x_i, y_i, z_i)$ for $i = 1, \ldots, n$).

There are many ways to define a single set of coordinates (x, y, z) for a residue: we took a simple measure, the mean coordinate of the side-chain atoms of the residue (with the exception of glycine, for which the α-carbon is taken).
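The residue representation and the four non-geometric matching conditions can be sketched as follows. This is an illustrative sketch only: the `Residue` class, the helper names, the backbone atom list, and the fallback to the α-carbon when no side-chain atoms are present are all assumptions made for the example, not part of the described program. Condition 5 (the superposition test) is not covered here.

```python
from dataclasses import dataclass

# Hypothetical container for one residue; field names are assumptions.
@dataclass
class Residue:
    index: int        # position along the chain (N to C)
    amino_acid: str   # one-letter code
    sec_struct: str   # 'a' (helix), 'b' (sheet) or 'l' (loop)
    atoms: dict       # atom name -> (x, y, z)

BACKBONE = {"N", "CA", "C", "O"}  # assumed backbone atom names (PDB style)

def residue_point(res):
    """Mean side-chain coordinate; the alpha-carbon for glycine.

    Falling back to CA when no side-chain atoms are listed is an assumption.
    """
    side = [xyz for name, xyz in res.atoms.items() if name not in BACKBONE]
    if res.amino_acid == "G" or not side:
        return res.atoms["CA"]
    n = len(side)
    return tuple(sum(c[i] for c in side) / n for i in range(3))

def satisfies_discrete_conditions(pattern, residues):
    """Check conditions 1-4 for a candidate residue list.

    pattern: list of (match_set, sec_struct_set) pairs, one per position.
    """
    if len(pattern) != len(residues):
        return False
    idx = [r.index for r in residues]
    # Conditions 1 and 4: strictly increasing chain indices imply that the
    # residues are unique and in N-to-C order.
    if any(a >= b for a, b in zip(idx, idx[1:])):
        return False
    # Conditions 2 and 3: amino acid and secondary structure constraints.
    return all(r.amino_acid in A and r.sec_struct in s
               for r, (A, s) in zip(residues, pattern))
```

A candidate that passes these checks would still need to satisfy the rmsd condition before it counts as a match.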

Packing motifs

A packing motif is a packing pattern P that has multiple matches (occurrences) in a set of (or within one) protein structure(s). Packing patterns (or motifs) describe clusters of residues coming spatially close together but which are not necessarily local in the sequence.

Overview of Approach

Figure 1 shows an overview of the approach to structure motif discovery. This can be summarized as:

1. analyze the structure(s) and represent each residue’s neighborhood as a neighbor string,

2. use the sequence pattern discovery method Pratt (Jonassen et al., 1995;14 Jonassen, 199715) to find patterns occurring in several neighbor strings,

3. assess the spatial similarity between neighborhoods matching the same sequence pattern and report sets of spatially similar neighborhoods as motifs.

If we have structures $S_1, \ldots, S_n$ ($n \geq 1$) then we want to find motifs occurring in at least k of them (or, if $k > n$, allowing internal repetition).

Generating Neighbor Strings

Each residue in a structure is represented by one point p = (x, y, z) (representing the side-chain centroid) and, using these coordinates, the spatial distances between all pairs of residues were calculated. A residue b was included in another residue a’s neighborhood (and vice versa) if the distance between a and b was less than a pre-defined threshold (dmax, typically 6–10 Å). The neighborhood of a residue a is then represented as two strings, the first containing the neighboring residues that precede a in the sequence, and the second containing the residues that follow.

We use two strings because this makes the pattern discovery task for Pratt more manageable. Pratt is run twice: once analyzing forward strings and once analyzing backward strings. In each run it can restrict the search to patterns which match starting from the first letter in any neighbor string, helping to prune the search.

†The term sequence is avoided in this context to prevent confusion with its more common use to describe consecutive residues in the polypeptide chain.


Precise algorithm

The method can be defined more explicitly as: read the PDB structure files for proteins $S_1, \ldots, S_n$, and make neighbor strings. For each residue a, we make two neighbor strings $CR_a$ and $NR_a$. The string $CR_a$ starts with a followed by all the residues C-terminal to a whose spatial distance to a is below dmax. The residues are ordered in N-to-C chain order. Similarly, the string $NR_a$ starts with a and contains in C-to-N order the residues preceding a in the chain which are spatially close to a (distance below dmax).
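The construction of the two neighbor strings can be illustrated with a short sketch. It assumes each residue has already been reduced to a one-letter code and a single representative point; the function name `neighbor_strings` and the input layout are hypothetical, and PDB parsing is left out.

```python
import math

def neighbor_strings(residues, dmax):
    """Build (CR_a, NR_a) for every residue a.

    residues: list of (one_letter_code, (x, y, z)) in N-to-C chain order.
    dmax: distance threshold; residues closer than dmax are neighbors.
    """
    out = []
    for i, (aa, p) in enumerate(residues):
        # Indices of all spatial neighbors of residue i, in chain order.
        close = [j for j, (_, q) in enumerate(residues)
                 if j != i and math.dist(p, q) < dmax]
        # CR_a: the residue itself, then C-terminal neighbors in N-to-C order.
        cr = aa + "".join(residues[j][0] for j in close if j > i)
        # NR_a: the residue itself, then N-terminal neighbors in C-to-N order.
        nr = aa + "".join(residues[j][0] for j in reversed(close) if j < i)
        out.append((cr, nr))
    return out
```

This all-pairs version is quadratic in chain length, which is adequate for a sketch; a spatial grid would be a natural refinement for long chains.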

Discovering Patterns in Neighbor Strings

For pattern discovery, we use a version of Pratt (Jonassen et al., 1995;14 Jonassen, 199715) which is able to discover patterns similar to the ones used in the PROSITE database (Bairoch et al., 1997).16 Using Pratt, one defines a class of patterns, for example by setting limits on the length and flexibility of wildcards (e.g., x(2,4)) and specifying which pattern symbols can be used. The program searches through this class of patterns in order to find the “best” patterns that match some minimum number of the input sequences (or strings).

Pratt operates directly on the individual neighbor strings in all the encoded proteins, irrespective of the protein in which they occur. Thus there is no pairwise comparison between proteins, and common patterns will be identified even if they recur within a single protein. Pratt achieves this by encoding all the sequences into a common data structure in which the sequences are encoded as words (or tuples) of a fixed length. Within each word, the location of each amino acid type is recorded, allowing immediate access to all words, say, that have “H” in position 2. Using this indexing, combinations of amino acid positions can quickly be found; for example, sequences containing the pattern HxxC (where “x” is any amino acid) are given by the intersection of the sets of words with H-at-1 and C-at-4. The extension to more complex patterns is described more fully in Jonassen et al. (1995)14 and Jonassen (1997).15
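The positional indexing idea can be sketched with sets keyed by (offset, symbol). This is a simplified illustration, not Pratt's actual data structure: the function names are hypothetical, and offsets are 0-based here, so "H-at-1 and C-at-4" in the text corresponds to offsets 0 and 3.

```python
from collections import defaultdict

def build_index(strings, word_len):
    """Index every fixed-length word by the symbol at each offset.

    Returns a mapping (offset, symbol) -> set of (string_id, start) words.
    """
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for start in range(len(s)):
            for off, sym in enumerate(s[start:start + word_len]):
                index[(off, sym)].add((sid, start))
    return index

def occurrences(index, components):
    """Words containing all fixed components, found by set intersection.

    components: dict offset -> symbol, e.g. {0: 'H', 3: 'C'} for HxxC.
    """
    sets = [index.get((off, sym), set()) for off, sym in components.items()]
    return set.intersection(*sets) if sets else set()
```

For example, `occurrences(build_index(strings, 4), {0: "H", 3: "C"})` returns the words that match HxxC, and the distinct string ids among them give the number of sequences matched.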

Fig. 1. Overview of method for structure motif discovery. The protein structure (tangled line, top left) is decomposed into neighbor strings (top right) in which each string represents a central residue (leftmost in the string) and all others surrounding it within a given radius. The strings are processed by the sequence pattern discovery program Pratt which finds common motifs. The fragments of structure corresponding to these are then extracted (bottom).

Pattern discovery is approached by starting with simple fixed patterns which are extended by the addition of new pattern components: for example, the simplest pattern “H” might be extended to “HC,” then “HxC,” then “HxxC.” With each modification, the new pattern is checked against the sequences and, if it occurs in the required minimum number, it is scored by its complexity. (See Methods, Scoring the Packing Motifs, for further details.) The search for patterns progresses by a constrained combinatorial search. The search is pruned, for example, by not considering extensions of patterns that themselves do not match the required minimum number of sequences. For example, if HxxC only matches 2 sequences then none of its extensions, HxxCA, HxxCxA, . . ., HxxCxxxxY, can possibly match any more than two. Depending on user parameters, the search in Pratt is heuristic or can be guaranteed to find the best patterns (best with respect to the internally used scoring function).
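The pruning argument can be illustrated with a toy depth-first search. This is a deliberately simplified sketch and not the Pratt algorithm: patterns use only single symbols and the fixed wildcard “x”, they are anchored at the first symbol of each string, and the names `support`, `discover`, `max_gap`, and `max_len` are hypothetical.

```python
def support(pattern, strings):
    """Number of strings matched from their first symbol ('x' = any symbol)."""
    def ok(s):
        return len(s) >= len(pattern) and all(
            p == "x" or p == c for p, c in zip(pattern, s))
    return sum(ok(s) for s in strings)

def discover(strings, k, max_gap=2, max_len=8):
    """Return all patterns matching at least k strings, by pruned extension."""
    alphabet = sorted(set("".join(strings)))
    found = []

    def extend(pattern):
        if support(pattern, strings) < k:
            return  # prune: no extension can match more strings than this
        found.append(pattern)
        # Extend with 0..max_gap wildcards followed by one fixed symbol.
        for gap in range(max_gap + 1):
            for sym in alphabet:
                new = pattern + "x" * gap + sym
                if len(new) <= max_len:
                    extend(new)

    for sym in alphabet:
        extend(sym)
    return found
```

Because every extension only adds constraints, its support can never exceed that of the pattern it extends, which is what makes the early return safe.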

We have tailored the program so that pattern matches are restricted to start from the beginning of each neighbor string, and we forbid pattern matches to cross from one neighbor string to the next. Additionally, this tailored version of Pratt finds and reports all good patterns (and not only the best pattern) for each “starting point” in the pattern graph (see Jonassen (1997)15 for details). This is because we do not know whether the best pattern in terms of structure motif is also the best in terms of the internal pattern scoring function in Pratt.

From String Pattern to Substructure Similarities

For each pattern found by Pratt:

1. for each match, we retrieve from the PDB file the coordinates of the residues corresponding to the match in the neighbor string,

2. for all pairs of matches, we calculate the unweighted root mean square deviation (rmsd), and discard the pattern if not all pairwise rmsd values are below some threshold,

3. report the good matches and print out (1) a PDB format file with the structures superimposed to minimize rmsd between one (arbitrarily) chosen substructure and each of the other substructures, and (2) a RasMol script file to highlight the matching substructures.

Scoring the Packing Motifs

To help the user interpret the output from the algorithm, we give a score for each reported pattern in which the potentially most interesting motifs are assigned the highest score: specifically, those that contain many residues and whose occurrences in the structures superimpose with low rmsd values. The score also depends on the amino acid constraints on each of the positions: the stricter the constraints, the higher the score assigned.

A set of motif occurrences O, all matching a pattern P, is scored by the formula:

$$S(P, O) = \frac{I(P)}{\mathrm{rmsd}(O)} \tag{3}$$

where I(P) is the information content of the sequence pattern and rmsd(O) is the maximum rmsd between any pair of occurrences in O. Only patterns of more than three residues were considered since the rmsd of co-planar atoms can otherwise be trivially small.

Representing the pattern P as a regular expression,

$$P = A_1 - x(i_1, j_1) - A_2 - x(i_2, j_2) - \ldots - x(i_{p-1}, j_{p-1}) - A_p, \tag{4}$$

(where each $A_i$ is a match-set and each $x(i, j)$ is a mismatch-range), then the information content of P is:

$$I(P) = \sum_{i=1}^{p} I'(A_i) - c \cdot \sum_{k=1}^{p-1} (j_k - i_k), \tag{5}$$

where c is a constant (normally the value 0.5 is used), and $I'(A_i)$ is the information content of the pattern component $A_i$. The information of each component can in turn be calculated as:

$$I'(A_i) = -\sum_{a \in \Sigma} \left(p_a \cdot \log(p_a)\right) + \sum_{a \in A_i} \left(\frac{p_a}{p_{A_i}} \cdot \log\left(\frac{p_a}{p_{A_i}}\right)\right), \tag{6}$$

where $p_a$ is the a priori probability of amino acid a, $\Sigma$ is the set of all amino acids, and $p_{A_i} = \sum_{a \in A_i} p_a$. The probabilities $p_a$ are calculated from the frequency of each amino acid a in the protein sequence database Swiss-Prot (Bairoch and Boeckmann, 1991).16 Further details can be found in Jonassen et al. (1995).14
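Equations 3, 5, and 6 translate directly into a few lines of code. The sketch below makes two assumptions that are not taken from the text: it uses uniform background probabilities (1/20 per amino acid) instead of Swiss-Prot frequencies, and it takes logarithms to base 2. The function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: uniform background; the paper uses Swiss-Prot frequencies.
UNIFORM = {a: 1 / 20 for a in AMINO_ACIDS}

def component_information(match_set, prob=UNIFORM):
    """I'(A_i) of Eq. 6 (log base 2 is an assumption)."""
    background = -sum(p * math.log2(p) for p in prob.values())
    p_set = sum(prob[a] for a in match_set)
    within = sum((prob[a] / p_set) * math.log2(prob[a] / p_set)
                 for a in match_set)
    return background + within

def pattern_information(match_sets, wildcards, c=0.5, prob=UNIFORM):
    """I(P) of Eq. 5.

    match_sets: [A_1, ..., A_p]; wildcards: [(i_1, j_1), ..., (i_{p-1}, j_{p-1})].
    """
    flexibility = sum(j - i for i, j in wildcards)
    return sum(component_information(A, prob) for A in match_sets) - c * flexibility

def motif_score(match_sets, wildcards, max_rmsd, c=0.5, prob=UNIFORM):
    """S(P, O) of Eq. 3, with max_rmsd the largest pairwise rmsd in O."""
    return pattern_information(match_sets, wildcards, c, prob) / max_rmsd
```

Under the uniform assumption a component reduces to log2(20/|A_i|), so a single fixed residue contributes about 4.32 bits and a three-residue match-set such as [MLQ] about 2.74 bits, which shows how stricter match-sets raise the score.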

Data Selection

For testing the method, we selected a set of superfamilies from SCOP (Murzin et al., 1995),17 such that, for each of the chosen superfamilies, there was a (binding or active) site that was common to all of the members. A superfamily in SCOP can contain a number of families, which in turn can contain a number of domains. For each domain, a number of proteins can be given, and these are often very similar (e.g., the same protein from different species). To eliminate this redundancy, for each of the chosen superfamilies, we picked one structure from each domain belonging to a family in the superfamily. Then we removed structures from this set in order to find a subset where all pairwise sequence similarities are 30% or less while keeping the structures with the best resolution.
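The text does not specify how the non-redundant subset was computed. One plausible procedure, shown here purely as an assumption, is a greedy pass that visits structures from best to worst resolution and keeps a structure only if it is at most 30% similar to everything already kept. The function name and the input layout are hypothetical.

```python
def select_nonredundant(structures, similarity, max_similarity=30.0):
    """Greedy non-redundant selection (an assumed procedure, not the paper's).

    structures: list of (name, resolution_in_angstrom) pairs.
    similarity: function (name_a, name_b) -> pairwise similarity in percent.
    """
    kept = []
    # Lower resolution values are better, so visit those first.
    for name, _resolution in sorted(structures, key=lambda s: s[1]):
        if all(similarity(name, other) <= max_similarity for other in kept):
            kept.append(name)
    return kept
```

A greedy pass does not guarantee the largest possible subset, but it always favors the better-resolved member of any pair that is too similar.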

As a basis for assessing the motifs found by SPratt we use the sites as specified in both the PDB and Swiss-Prot database entries and consider it a success if SPratt is able to find motifs whose occurrences match these sites or corresponding residues identified by matching the motifs given in the PROSITE database.

Summary information on the different proteins and the SPratt runs is given in Table I.

Presentation of Results

For each family, we present information about which structures have been included in our analysis, and summary information about the relevant sites in these structures as given in the PDB and Swiss-Prot files and identified through matching the patterns in the PROSITE database. Results of running SPratt on these sets of structures are given and compared to the site information. Most of the information is presented in the form of tables, and for each motif we give two tables. The first gives information about the occurrences of the motif in the analyzed structures; the residues included in the occurrences are marked to point to corresponding PDB, Swiss-Prot, and PROSITE sites. The symbols “*,” “†,” “‡,” and “°” are used to point to different PROSITE sites, and the symbol “+” is used to indicate residues identified as sites in either PDB or Swiss-Prot.

A second table gives the root mean square deviation (rmsd) between all pairs of motif occurrences and also the rmsd over the best residues as found by the SAP program. The data from SAP (a refined version of the original program described by Taylor and Orengo2) are included to give some information about the overall similarities of the structures analyzed. However, some of the most divergent pairs of structures were reanalyzed, as it was clear from their rmsd values that SAP had not found a good superposition. For this, a modified form of SAP was used that incorporated the SPratt motif information, giving these residues a high weight in the final superposition. The original values were retained in the Tables, however, as they give a better indication of the most difficult comparisons.

The patterns found by Pratt in the neighbor strings are written in a notation different from that used in PROSITE to make it easier to distinguish neighbor string patterns from the PROSITE patterns. Wildcards are written as a sequence of the symbols “x” and “-,” where “x” matches exactly one arbitrary symbol and “-” matches zero or one arbitrary symbol (in the neighbor string). It should be remembered that symbols in the neighbor string do not correspond to sequence patterns but can have any number of residues inserted between any pair of symbols.
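The wildcard notation maps naturally onto ordinary regular expressions. The sketch below is an illustration using Python's standard `re` module, with hypothetical helper names: “x” becomes `.`, “-” becomes `.?`, and match-sets such as [MLQ] are passed through unchanged as character classes. Matching is anchored at the first symbol, in line with the restriction that pattern matches start at the beginning of a neighbor string.

```python
import re

def to_regex(pattern):
    """Translate a neighbor-string pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "x":
            parts.append(".")    # exactly one arbitrary symbol
        elif ch == "-":
            parts.append(".?")   # zero or one arbitrary symbol
        else:
            parts.append(ch)     # amino acid letter or part of a [..] match-set
    return re.compile("".join(parts))

def matches(pattern, neighbor_string):
    """True if the pattern matches starting at the first symbol of the string."""
    return to_regex(pattern).match(neighbor_string) is not None
```

For instance, `to_regex("AxHC--Dx-SG").pattern` is `A.HC.?.?D..?SG`, so the two “-” symbols after C allow up to two extra neighbors before the D.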

Implementation

We used Pratt version 2.2 with a command line option -smotif which triggers the modifications of the standard algorithm. A UNIX C-shell script, SPratt, has been made that calls Pratt as well as two new programs written in ANSI C. The script SPratt is called with three obligatory command line parameters (name of file containing structure names, maximum distance to a neighbor residue, and minimum number of pattern occurrences). Any additional parameters are passed on to Pratt and can, for instance, be used to change the restrictions on the patterns to be considered by Pratt. (Details of Pratt can be found at the WWW site: http://www.ii.uib.no/~inge/Pratt.html). Source code for the current program is available by request ([email protected]) and the structure superposition program SAP can be obtained at: ftp://glycine.nimr.mrc.ac.uk/pub/sap/.

RESULTS

Serine Proteases

Proteins were selected from SCOP’s superfamily “Trypsin-like serine proteases” (from class “All beta” and fold “trypsin-like serine proteases”) and three test sets were formed by splitting the full family into proteins from prokaryotic and eukaryotic organisms (the third set being the full, combined family). We first analyzed the eukaryotic and prokaryotic proteins separately.

Eukaryotic proteins

From the 19 eukaryotic domains, we selected {1try, 2hlcA, 1fonA, 1cghA, 1ton, 1rtfB, 1hcg} as having less than 30% pairwise sequence similarity and good resolution. All belong to the PROSITE family TRYPSIN_HIS (PS00134), with consensus pattern [LIVM]-[ST]-A-[STAG]-H-C (*), and the family TRYPSIN_SER (PS00135), with consensus pattern [DNSTAGC]-[GSTAPIMVQH]-x(2)-G-[DE]-S-G-[GS]-[SAPHV]-[LIVMFYWH]-[LIVMFYSTANQH] (†). The active sites consist of His-Asp-Ser.

Many patterns occurring in all structures were found, some examples of which are shown in Table II. As with the other proteins described in this work, most of the top-scoring patterns consist of trivial variants on a common motif. Details of one of these patterns, AxHC---DxxSGGxxS, are given in Table III. Although not the highest scoring by its SPratt score, this motif is the highest that includes both N-terminal and C-terminal components. Improved criteria for the automatic choice of a unique pattern will be assessed further below.

Prokaryotic proteins

This family consists of 8 domains, from which we selected the set {1arb, 1tal, 1sgt, 2sfa, 1agjA}.

1tal, 1sgt, and 2sfa belong to the same PROSITE families as the eukaryotic ones, while 1agj belongs to the PROSITE family “Serine proteases, V8 family, histidine active site” (PS00672), with consensus pattern [ST]-G-[LIVMFYW](3)-[GN]-x(2)-T-[LIVM]-x-T-x(2)-H (‡), and to “Serine proteases, V8 family, serine active site” (PS00673), with consensus pattern T-x(2)-[GC]-[NQ]-S-G-S-x-[LIVM]-[FY] (°). (1arb is not classified by PROSITE.)

Details of the highest scoring pattern, GSxGxxx--Dx-H, matching all structures are given in Table IV.

TABLE I. Summary Information About the Different Experiments Done†

Family         nr.str   av.len   nr.match   dmax   nr.flex   Time
Cupro.         11       263      10         8.0    2         2 min
Cys.prot.      5        560      5          10.0   2         58 min
4Fe-4S         5        86       5          10.0   2         11 s
               5        86       3          10.0   2         2 min
2Fe-2S         4        358      4          10.0   2         2 min
xFe-xS         9        207      5          10.0   2         18 min
Ser.prot Eu    7        238      7          8.0    2         40 s
Ser.prot Pr    5        221      5          10.0   2         4 min
Ser.prot       12       231      10         8.0    2         12 min

†For each of the test families (Cupro = cupredoxins, Cys.prot = cysteine proteinases, 4Fe-4S = Fe4S4-binding ferredoxins, 2Fe-2S = Fe2S2-binding ferredoxins, xFe-xS = combined ferredoxins, Ser.prot Eu = eukaryotic serine proteases, Ser.prot Pr = prokaryotic serine proteases, Ser.prot = combined serine proteases), the number of structures in the set (nr.str), their average length (av.len), the number of structures matching the pattern (nr.match), the maximum distance to a neighbor (dmax), the maximum number of flexible positions in the pattern (nr.flex), and the computation time (Time) are tabulated.


Combined Eukaryotic and Prokaryotic proteins

The 12 structures analyzed in the preceding sections were put in one set. Only the pair (1try, 1sgt) had sequence identity over 30% (39%). No common pattern was found for this set. The match constraint was then relaxed to require that at least 10 out of the 12 structures should match. This allowed the appearance of the pattern AxHC--Dx-SG among the highest scoring (Table V). (The top-scoring pattern was similar but included only 4 residues.)

The rmsd values over all pairs of serine proteases were sufficiently good, both for the motifs and the whole molecules (Table V), that no recalculation of any superposition was necessary to check the motif accuracy.

Cupredoxins

The Cupredoxin superfamily in SCOP (All-beta, cupredoxin fold) contains three families (Plastocyanin/azurin-like, Periplasmic domain of cytochrome c oxidase subunit II, and Multidomain cupredoxins), each of which contains a number of domains (see Murphy et al. (1997)18 for additional details). We picked one representative structure from each domain, giving altogether 12 structures from which a subset was selected with pairwise sequence similarities of 30% or less. This subset contained the 10 proteins: 2plt, 1pmy, 1rcy, 1aac, 1kcw, 1jer, 1cyw, 1nif, 1aozA, and 4azuA. Most of these proteins are members of the COPPER_BLUE and MULTICOPPER_OXIDASE (1 and 2) families in PROSITE, in which a copper atom is coordinated by four residues: normally histidines, a cysteine, and a methionine (which is sometimes substituted by glutamine or leucine). Summary data about the proteins are given in Table VI.

We used SPratt to search for patterns containing any single amino acid symbol and the match-set [MLQ]. Without this latter constraint, either the two proteins with the atypical (L and Q) substitutions were not identified or a pattern over all the proteins did not contain the typical methionine ligand. SPratt took 2 minutes and found three patterns, all variations of patterns matching around the copper binding sites. The pattern matching the most sites is shown in Table VII(a).

From the low rmsd values for the motif residues TableVII(b), it is clear that closely matching motif clusters havebeen found. Some of the whole molecule rms deviations,however, are rather high for the number of residuesequivalenced and those over 3.0 Å rmsd were reanalysedby the modified SAP program. This uses the motif residuesequivalenced by SPratt as constraints and, where appro-priate, with the relevant domain excised. The most deviantpair in Table VII is ascorbate oxidase (1aozA) with azurin(4azu), having 5.1 Å rmsd over only 26 residue pairs. Theresuperposition of these proteins resulted in the greatlyimproved value of 1.56 Å over the best 63 pairs (8.9 Å overall 123 aligned residues) and clearly revealed the common

TABLE II. Top-Scoring Patterns in the Eukaryotic Serine Proteases†

(a) N-patterns, dmax = 8              (b) C-patterns, dmax = 8
Pattern                   Score       Pattern              Score
GGSD-Gx-AxxGGC            30.31       AxHC---DxxSGGxxS     25.58
GGSDxx--AxxGGC            30.22       GGxx-CGx--DSGG       28.20
GGSD-Gx-AxxGxC            25.83       CGGxA-Cx--SGG        30.45
GGSD-Gx-AxxxGC            26.09       A-HC---DxxSGGxxS     25.18
GGSx-Gx-AxxGGC            30.84       GGx--CGx--DSGG       27.76
GGxD-Gx-AxxGGC            25.40       GGxxx-Gx--DSGG       24.00
GxSD-Gx-AxxGGC            26.17       AxHC---DxxSGxxxS     21.30
GGSD-Gx-AxxxxC            22.18       AxHC---DxxSxGxxS     22.34
GGSD-GxxxxxG-C            17.29       AxHC---DxxxGGxxS     22.32
GGSDxx--AxxGxC            25.41       AxHx---DxxSGGxxS     21.43

(c) N-patterns, dmax = 10             (d) C-patterns, dmax = 10
Pattern                   Score       Pattern              Score
GGSDGx-Gxxx-GxxxxxGGC     13.21       CGGx-A-HC            11.22
GGSDCxx-Gxx--CxxxxxGGC    14.63       AxHCx----Dxxx-GxG     6.20
GGSDGx-Gxxx-CxxxxxGxC     11.28       GDSGG                 6.43
GGSDxx-Gxxx-CxxxxxGGC     19.66       GGP-GxxS             22.50
GGSxGx-Gxxx-CxxxxxGGC     12.56       GDSG--G               5.58
GGxDGx-Gxxx-CxxxxxGGC     11.39       Gx-Lxxx-ACV           5.55
GxSDGx-Gxxx-CxxxxxGGC     11.30       CGGx-Ax-C             8.83
GGSDGxx-Gxx-CxxxxxGxC     12.66       CGGxx--HC            16.87
GGSDxxx-Gxx-CxxxxxGGC     17.48       CGxx-A-HC             8.70
GGSxGxx-Gxx--CxxxxxGGC    14.66       C-GxxA-HC             8.99

†The patterns are ranked by Pratt by their information content, and the score obtained by dividing by the maximum rmsd of the matched residues (Eq. 3) is shown on the right. The patterns found on the N-terminal side and the C-terminal side are shown for values of the cutoff (dmax) of 8 (a and b) and 10 (c and d). Note that in Tables a and c, the residues are in reverse order. The leftmost residue in each pattern is the central residue around which the others are found within a distance of dmax.


β-sheet core (Figure 2). Although the equivalent motif residues are joined by dashed lines in this figure, they are too close to be distinguished clearly (but lie on the right of the conserved core). The difficulty of the alignment can be appreciated from the extensive loops in ascorbate oxidase, which amount to over half the size of the common core.
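The reanalysis rule used here (whole-molecule superpositions with rmsd over 3.0 Å are passed back to SAP with motif constraints) amounts to a simple filter over the lower-left triangle of Table VII(b). The following is a minimal sketch of that filter, not part of SPratt or SAP; the function name and the dictionary layout are assumptions made for illustration, and only a handful of the table entries are included.

```python
# Whole-chain superposition results copied from the lower-left triangle of
# Table VII(b): (protein A, protein B) -> (rmsd in Angstrom, residues superposed).
# Only a few entries are listed; the layout is an illustrative assumption.
WHOLE_CHAIN = {
    ("1pmy", "2plt"): (0.8, 72),
    ("1kcw", "2plt"): (3.6, 33),
    ("1nif", "1pmy"): (3.0, 58),
    ("1nif", "1aac"): (4.2, 61),
    ("4azuA", "1jer"): (4.1, 44),
    ("4azuA", "1aozA"): (5.1, 26),
}


def pairs_to_reanalyse(results, cutoff=3.0):
    """Return (pair, rmsd, n_residues) for pairs strictly over the cutoff, worst first."""
    flagged = [(pair, rmsd, n) for pair, (rmsd, n) in results.items() if rmsd > cutoff]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for (a, b), rmsd, n in pairs_to_reanalyse(WHOLE_CHAIN):
        print(f"{a} vs {b}: {rmsd:.1f} A over {n} residues")
```

With these entries the most deviant pair is 1aozA/4azuA, the pair discussed in the text; the 1nif/1pmy entry at exactly 3.0 Å is not flagged because the comparison is strict.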

Cysteine Proteinases

The proteins were selected from SCOP’s “Cysteine proteinases” super-family (which is in the “alpha and beta” class and “Cysteine proteinases” fold). A representative subset having maximum resolution was selected so that the pairwise similarity between any pair of sequences was 30% or less, leaving: 1hucAB, 1gcb, 1fieA, 1aim, and 1ppn. Summary information about the proteins, their active sites (as given in the PDB file) and the relevant PROSITE families is given in Table VIII.
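The text does not say how the representative subset was chosen beyond the two criteria (best resolution, pairwise similarity of 30% or less). One plausible reading is a greedy selection, sketched below under that assumption. The protein names, resolutions and identity values are hypothetical toy data, and "maximum resolution" is taken to mean the smallest value in Å.

```python
def select_representatives(resolution, identity, max_identity=30.0):
    """Greedy sketch: visit structures from best (lowest) resolution to worst and
    keep one only if it is at most `max_identity` percent identical to every
    structure already kept. This is an assumed procedure, not the authors' own."""
    kept = []
    for name in sorted(resolution, key=resolution.get):
        if all(identity[frozenset((name, other))] <= max_identity for other in kept):
            kept.append(name)
    return kept


# Hypothetical toy data: resolutions in Angstrom and pairwise percent identities.
RESOLUTION = {"protA": 1.6, "protB": 2.1, "protC": 1.9, "protD": 2.5}
IDENTITY = {
    frozenset(("protA", "protB")): 45.0,
    frozenset(("protA", "protC")): 22.0,
    frozenset(("protA", "protD")): 18.0,
    frozenset(("protB", "protC")): 25.0,
    frozenset(("protB", "protD")): 12.0,
    frozenset(("protC", "protD")): 30.0,
}

if __name__ == "__main__":
    print(select_representatives(RESOLUTION, IDENTITY))
```

A greedy pass does not guarantee the largest possible subset; it simply favours the best-resolved structures, which matches the stated preference.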

SPratt took 58 min of computing time (most of it used by Pratt) and found a large number of patterns. The highest scoring pattern is summarized in Table IX. Although the motif rmsd values were relatively small, the rmsd values of the whole-molecule superpositions (Table IX(b)) include several entries where the superposition has clearly failed, giving a high rmsd value or incorporating only a trivial number of residues. This is not unexpected given the size and multi-domain nature of these proteins (for example, 1fieA has 705 residues). The worst comparisons were recalculated using extracted domains and the motif assignments, resulting in greatly improved similarities that confirmed the motifs identified by SPratt as correct in all the comparisons. One of these superpositions, for cathepsin-B (1huc) and human coagulation factor XIII (1fieA), is shown in Figure 3. For this comparison, using the motif data, the rmsd reduced to 1.24 Å over the best 34 residues (10 Å over 126 residues).

TABLE III. Eukaryotic Serine Protease Pattern AxHC---DxxSGGxxS

(a) Pattern
Protein   1try     2hlcA    1fonA    1cghA    1ton     1rtfB    1hcg
Length    224      230      232      224      228      244      287
A         A55*     A55*     A41*     A55*     A55*     A55*     A55*
H         H57+*    H57+*    H43+*    H57+*    H57+*    H57+*    H57+*
C         C58*     C58*     C44*     C58*     C58*     C58*     C58*
D         D102+    D102+    D93+     D102+    D102+    D102+    D102+
S         S195+†   S195+†   S187+†   S195+†   S195+†   S195+†   S195+†
G         G196†    G196†    G188†    G196†    G196†    G196†    G196†
G         G197†    G197†    G189†    G197†    G197†    G197†    G197†
S         S214     S214     S207     S214     S214     S214     S214

(b) rmsd
          1try     2hlcA    1fonA    1cghA    1ton     1rtfB    1hcg
1try      -        0.3      0.7      0.9      0.7      0.9      0.2
2hlcA     0.8/163  -        0.5      0.9      0.9      0.9      0.3
1fonA     1.3/135  1.2/130  -        0.9      1.2      0.9      0.6
1cghA     0.9/167  0.7/157  1.2/129  -        1.2      0.2      0.8
1ton      0.8/167  0.8/157  1.4/124  0.8/175  -        1.2      0.8
1rtfB     1.1/176  0.8/160  1.4/139  0.9/162  0.9/159  -        0.8
1hcg      0.9/167  1.0/160  1.2/137  1.0/162  1.0/163  0.9/176  -

a) Occurrences of the motif in the structures (by PDB code), where ‘+’ indicates a site residue, while ‘*’ and ‘†’ indicate PROSITE patterns. (See Results, Eukaryotic proteins, for details.) b) rmsd values for the superposition of motif residues (top-right triangle) and similarities of complete chains (bottom-left triangle) along with the number of residues superposed.

TABLE IV. Prokaryotic Serine Protease Pattern GSxGxxx--Dx-H

(a) Pattern
Protein   1arb     1tal     1sgt     2sfa     1agjA
Length    263      188      223      191      242
G         G195     G196†    G196†    G148†    G196°
S         S194+    S195+†   S195+†   S147?†   S195+°
G         G192     G193†    G193†    G145†    G193°
D         D113+    D102+    D102+    D65+     D120+
H         H57+     H57+*    H57+*    H35+*    H72+‡
Missing: S211

(b) rmsd
          1arb     1tal     1sgt     2sfa     1agj     1agjA
1arb      -        0.3      0.3      0.3      0.4      0.4
1tal      1.4/74   -        0.3      0.3      0.3      0.3
1sgt      1.2/105  1.2/68   -        0.3      0.5      0.5
2sfa      1.4/82   0.7/138  1.6/78   -        0.4      0.4
1agj      1.3/102  1.8/80   1.1/114  2.0/89   -        0.0
1agjA     1.3/102  1.8/80   1.1/114  2.0/89   -        -

a) The motif in the structures (by PDB code), where ‘+’ indicates a site residue, while ‘*’, ‘†’, and ‘‡’ indicate PROSITE patterns. (See Results, Prokaryotic proteins, for details.) The ‘?’ (in 2sfa) indicates that the Swiss-Prot active-site records identified Asp.146 as the catalytic serine. b) rmsd values for the superposition of motif residues (top-right triangle) and similarities of complete chains (bottom-left triangle) along with the number of residues superposed.


Ferredoxins

Fe4S4-binding family

Proteins were selected from SCOP’s superfamily “4Fe-4S ferredoxins” (from class “alpha and beta” and fold “Ferredoxin-like alpha + beta sandwich . . .”). The superfamily consists of 4 families, with a total of 10 domains. By the 30% maximal identity criterion this was reduced to the structures: {1blu, 6fd1, 1xer, 1fxrA, 2fxb}.

All proteins belong to the PROSITE family 4FE-4S_FERREDOXIN (PS00198) with consensus pattern: C-x(2)-C-x(2)-C-x(3)-C-[PEG]. However, two of the sequences (1fxrA, 2fxb) do not contain this pattern.
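A PROSITE consensus pattern such as the one quoted above maps directly onto a regular expression, which makes it easy to check whether a given sequence contains it. The sketch below handles the common PROSITE elements (`x`, `[...]`, `{...}`, repeat counts and the `<`/`>` anchors); the function name is an assumption, and the two test sequences are hypothetical, not taken from the structures analysed here.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into a Python regular expression string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        anchor_start = element.startswith("<")
        anchor_end = element.endswith(">")
        element = element.strip("<>")
        # Split off an optional repeat count such as (2) or (0,2).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":                      # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # any residue except those listed
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high is not None else "") + "}"
        parts.append(("^" if anchor_start else "") + core + repeat + ("$" if anchor_end else ""))
    return "".join(parts)


FERREDOXIN_4FE4S = "C-x(2)-C-x(2)-C-x(3)-C-[PEG]"

# Hypothetical sequences used only to exercise the pattern.
WITH_MOTIF = "AAYVITDECIACGACKPECPVNAI"
WITHOUT_MOTIF = "MKTCAACGHHC"

if __name__ == "__main__":
    regex = re.compile(prosite_to_regex(FERREDOXIN_4FE4S))
    for seq in (WITH_MOTIF, WITHOUT_MOTIF):
        hit = regex.search(seq)
        print(seq, hit.span() if hit else None)
```

A scan of this kind only tests for the sequence pattern; it says nothing about whether the matched cysteines are spatially arranged to bind a cluster, which is the question the structural patterns address.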

The structure of these proteins consists of the duplication of a βαβ “domain” of 26 residues, each of which contains four cysteine residues that bind to a 4Fe-4S center. In some bacterial ferredoxins, one of the two duplicated domains has lost one or more of the four conserved cysteines and, as a consequence, has either lost its iron-sulfur binding property or binds to a 3Fe-3S center instead. The highest scoring pattern found was CIxCx-C; however, this pattern was sequentially very local (only seven residues long) and, as discussed above (for the serine proteases), the second-highest scoring pattern was taken as it was distributed over the full length of the proteins. This pattern was PC-Cxx---C, which had seven occurrences, details of which are given in Table X.

The only value in Table X that was recalculated derived from the 6fd1/1blu comparison. Although not unreasonable (4.4 Å over 50 residues), this was high compared to the others, and examination of the original superposition revealed the spurious association of two unequivalent helices. Using the motif assignments, this was avoided, giving a new value of 1.05 Å over 50 residues (Fig. 4).

The pattern described in Table X includes only three of the four cysteines that are common to all the proteins. A subset of the proteins (1blu, 1fxrA, 2fxb), however, contain four conserved cysteines, and to try to find this extended site, we ran SPratt requiring the patterns to occur in at least 3 of the structures. One of the highest scoring patterns (and the highest with four cysteines) was PCxCx-Cx-C, details of which are given in Table XI.

Fe2S2-binding family

These proteins were selected from SCOP’s superfamily “2Fe-2S ferredoxin” (from class “alpha and beta” and fold “beta-Grasp . . .”). The superfamily consists of one family, with four domains. These proteins bind a single 2Fe-2S iron-sulphur cluster and are unrelated to the four iron-sulphur cluster binding family. As 30% maximal identity set we used {1frd, 1put, 1alo, 2pia}.

All except 1put belong to PROSITE family 2FE-2S_FERREDOXIN (PS00197) with consensus pattern: C-{C}-{C}-[GA]-{C}-C-[GAST]-{CPDEKRHFYW}-C (*), while 1put belongs to PROSITE family ADX (PS00814) with consensus pattern: C-x(2)-[STAQ]-x-[STAMV]-C-[STA]-T-C-[HR] (†).

The best pattern found matching all structures was Cxxxx-CxxCx--C, details of which can be found in Table XII, where it can be seen that all motif rmsd values are low.

TABLE V. Pattern AxHC--Dx-SG in the Combined Set of Serine Proteases

(a) Patterns
Protein   1try     2hlcA    1fonA    1cghA    1ton     1rtfB    1arb     1tal     1sgt     2sfa
Length    224      230      232      224      228      244      263      188      223      191
A         A55*     A55*     A41*     A55*     A55*     A55*     A55      A55      A55      A33
H         H57+*    H57+*    H43+*    H57+*    H57+*    H57+*    H57+     H57+*    H57+*    H35+*
C         C58*     C58*     C44*     C58*     C58*     C58*     C58      C58      C58      C36
D         D102+    D102+    D93+     D102+    D102+    D102+    D113+    D102+    D102+    D65+
S         S195+†   S195+†   S187+†   S195+†   S195+†   S195+†   S194+    S195+†   S195+†   S147+†
G         G196†    G196†    G188†    G196†    G196†    G196†    G195     G196†    G196†    G148†

(b) rmsd
          1try     2hlcA    1fonA    1cghA    1ton     1rtfB    1arb     1tal     1sgt     2sfa
1try      -        0.3      0.7      0.9      0.7      0.9      0.3      0.1      0.2      0.3
2hlcA     0.8/163  -        0.6      0.9      0.9      0.9      0.2      0.3      0.2      0.3
1fonA     1.3/135  1.2/130  -        0.8      1.3      0.8      0.6      0.7      0.5      0.7
1cghA     0.9/167  0.7/157  1.2/129  -        1.3      0.2      0.8      0.9      0.8      0.8
1ton      0.8/167  0.8/157  1.4/124  0.8/175  -        1.3      0.9      0.8      0.9      0.8
1rtfB     1.1/176  0.8/160  1.4/139  0.9/162  0.9/159  -        0.8      0.8      0.8      0.7
1arb      1.6/116  1.4/103  1.4/89   1.4/105  1.5/107  1.3/111  -        0.3      0.3      0.3
1tal      1.4/74   1.9/84   1.6/72   1.4/76   1.4/74   1.2/72   1.4/74   -        0.2      0.2
1sgt      0.8/170  0.8/147  1.1/134  0.9/156  0.8/162  0.9/158  1.2/105  1.2/68   -        0.3
2sfa      1.5/84   1.5/87   1.8/77   1.4/83   1.7/85   1.8/84   1.4/82   0.7/138  1.6/78   -

a) The motif in the structures (by PDB code), where ‘+’ indicates a site residue, while ‘*’ and ‘†’ indicate PROSITE patterns. (See text for details.) b) rmsd values for the superposition of motif residues (top-right triangle) and similarities of complete chains (bottom-left triangle) along with the number of residues superposed.


The large value of 18 Å for the 1alo/2pia comparison resulted from their large size and reduced to 1.2 Å over 38 residues when recalculated using the motifs and extracted domains. The same result was also obtained using the full chains, illustrating the power of the motif data to focus SAP onto local similarities (Fig. 5), where the small ferredoxin domain can be seen to be matched among the otherwise unrelated parts of the two large molecules.

Combined ferredoxins

As a control experiment, we also ran SPratt on the combined set of ferredoxins consisting of the five 4Fe-4S and the four 2Fe-2S proteins. Using default parameters, no pattern with at least 4 residues was found common to all the structures. Even when the match constraints were relaxed to require that any pattern should occur in at least five structures, only the highest scoring pattern, PCx-Cxxx--C, was found, and this had been seen previously in the 4Fe-4S structures. As the two ferredoxin families have different folds, this result indicates that the program can correctly avoid false associations even among structurally and functionally similar protein families.

DISCUSSION

The results presented above have illustrated that the current method has the power to find common local

TABLE VI. Site Data for the Cupredoxins, Showing the Metal Binding Site Location, and Family Membership in the SCOP and PROSITE Databases

(a) SCOP sites
Protein   SCOP family                            Site given in the PDB file
2plt      Plastocyanin/azurin-like               none
1pmy      Plastocyanin/azurin-like               H40, C78, H81, M86
1rcy      Plastocyanin/azurin-like               H85, C138, H143, M148
1aac      Plastocyanin/azurin-like               H53, C92, H95, M98
1kcw      Multidomain cupredoxins                (type I) H276, C319, H324, L329
                                                 (type I) H637, C680, H685, M690
                                                 (labile) E597, H602, D684, E971
                                                 (type I) H975, C1021, H1026, M1031
                                                 (labile) E272, E935, H940, D1025
1jer      Plastocyanin/azurin-like               H46, C89, H94
1cyw      Periplasmic domain of cyt. c ox. II    none
1nif      Multidomain cupredoxins                H95 C136 H145 M150
1aozA     Multidomain cupredoxins                H445, C507, H512, M517 (M517 from Swiss-Prot, type I)
4azuA     Plastocyanin/azurin-like               H46, C112, H199

(b) PROSITE sites
Protein   PROSITE family          Site (from pdbmotif)
2plt      COPPER_BLUE             77–92
1pmy      COPPER_BLUE             72–84
1rcy      MULTICOPPER_OXIDASE1    132–152
1rcy      COPPER_BLUE             131–148
1aac      COPPER_BLUE             85–98
1kcw      MULTICOPPER_OXIDASE1    313–333, 674–694, 1015–1035
1kcw      MULTICOPPER_OXIDASE2    1020–1031
1jer      COPPER_BLUE             83–99
1cyw      not found               none
1nif      not found               none
1aozA     MULTICOPPER_OXIDASE1    501–521
1aozA     MULTICOPPER_OXIDASE2    506–517
4azuA     COPPER_BLUE             105–121

(c) PROSITE patterns
Key   PROSITE family          AC        Pattern
*     COPPER_BLUE             PS00196   [GA]-x(0,2)-[YSA]-x(0,1)-[VFY]-x-C-x(1,2)-[PG]-x(0,1)-H-x(2,4)-[MQ]
†     MULTICOPPER_OXIDASE1    PS00079   G-x-[FYW]-x-[LIVMFYW]-x-[CST]-x(8)-G-[LM]-x(3)-[LIVMFYW]
‡     MULTICOPPER_OXIDASE2    PS00080   H-C-H-x(3)-H-x(3)-[AG]-[LM]

a) For each protein (PDB code), its SCOP family is shown along with binding site residues as defined in the PDB file. b) The structures classified by PROSITE family, giving the family, and the site matching the PROSITE pattern (these definitions are produced automatically by the program PdbMotif and so multiple family matches can occur). c) Details of the PROSITE patterns (used in b) including the key by which they are identified in the tables of results.


structural motifs in protein families that include very disparate members, containing both large insertions and additional unequivalent domains. This behavior is typified by the correct identification of the (type-I) Cu++-binding site across the wide variety of cupredoxin domains, including the three equivalent sites in ceruloplasmin (1kcw), a large protein with six similar domains, all of which bind copper. Similarly, the cysteine protease and ferredoxin families provided examples of sites that were identified despite the presence of large insertions among family members and a wide variety of large unrelated domains. The serine protease domain, which is often taken as an example in structure motif search methods, was, for the current method, the simplest test, in which an extensive motif was found across all family members.

While results of comparable quality have been reported by other methods (for example, Fischer et al., 1994 [19]; Wallace et al., 1997 [20]), the distinction of the current method is that the patterns we report were not specified at the start. In these other methods, the motif (or structure containing it) has been prespecified and used simply as a search probe. While this clearly tests the power of the structural search algorithms, it is quite unrelated to the ab initio discovery of structural motifs achieved in the current work. Closest in spirit to the current approach, but quite

Fig. 2. Cupredoxin domain superposition. Left: residues 340–525 from ascorbate oxidase (1aozA) are superposed on a complete azurin molecule (4azu). The chains are colored green and red, respectively. Right: the two chains are colored by their degree of local similarity (as measured by the program SAP). Red indicates greatest similarity, decreasing (in chromatic progression) to a dark blue that indicates unmatched regions.

Fig. 3. Cysteine protease domain superposition. Left: cathepsin-B (1huc) superimposed on residues 260–490 from human coagulation factor XIII (1fieA). Colored green and red, respectively. Right: the two chains colored by local similarity (as measured by SAP). Red indicates greatest similarity, decreasing to a dark blue that indicates unmatched regions.


TABLE VII. Cupredoxin Pattern Hx----CxH[LMQ]

(a) Patterns
Protein   2plt     1pmy     1rcy      1aac     1kcw     1kcw     1kcw       1jer     1nif     1aozA     4azuA
Length    98       123      151       104      1017     1017     1017       110      333      552       128
H         H37      H40+     H85+      H53+     H276+    H637+    H975+      H46+     H95+     H445+     H46+
C         C84*     C78*+    C138*†+   C92*+    C319†+   C680†+   C1021†‡+   C89*+    C136+    C507†‡+   C112*+
H         H87*     H81*+    H143*+    H95*+    H324+    H685+    H1026‡+    H94*+    H145+    H512‡+    H117*
[LMQ]     M92*     M86*+    M148*†+   M98*+    L329†+   M690†+   M1031†‡+   Q99*+    M150+    M517†‡+   M121*
Missing: H199

(b) rmsd
          2plt     1pmy     1rcy     1aac     1kcw     1kcw     1kcw     1jer     1nif     1aozA    4azuA
2plt      -        0.2      0.1      0.2      0.5      0.3      0.2      0.2      0.4      0.1      0.2
1pmy      0.8/72   -        0.2      0.2      0.6      0.5      0.4      0.1      0.2      0.3      0.4
1rcy      2.3/65   2.0/62   -        0.1      0.5      0.3      0.2      0.2      0.4      0.1      0.2
1aac      0.9/65   0.8/62   2.4/61   -        0.6      0.4      0.3      0.2      0.3      0.2      0.2
1kcw      3.6/33   3.4/27   2.5/37   2.2/54   -        0.4      0.6      0.7      0.7      0.5      0.7
1kcw      3.6/33   3.4/27   2.5/37   2.2/54   -        -        0.3      0.5      0.7      0.3      0.4
1kcw      3.6/33   3.4/27   2.5/37   2.2/54   -        -        -        0.4      0.6      0.2      0.2
1jer      2.0/48   1.2/50   2.0/52   2.0/44   2.8/32   2.8/32   2.8/32   -        0.3      0.2      0.3
1nif      2.4/62   3.0/58   2.3/67   4.2/61   1.4/159  1.4/159  1.4/159  1.8/47   -        0.5      0.5
1aozA     1.9/22   2.8/33   1.5/68   3.1/22   2.3/180  2.3/180  2.3/180  2.8/56   2.1/133  -        0.2
4azuA     2.1/57   2.0/59   1.2/68   2.1/58   1.5/28   1.5/28   1.5/28   4.1/44   1.2/59   5.1/26   -

a) Occurrences of the motif in the structures (by PDB code), where ‘+’ indicates a site residue and the symbols ‘*’, ‘†’ and ‘‡’ indicate PROSITE matches. (See Table VI for details.) b) rmsd values for the pairwise superposition of motif residues (top-right triangle) and pairwise similarities of complete chains (bottom-left triangle) along with the number of residues superposed.

TABLE VIII. Site Data for the Cysteine Proteinases, Showing the Active Site Location and Family Membership in the SCOP and PROSITE Databases

(a) SCOP sites
Protein   SCOP family        Site given in the PDB file
1hucA     Papain-like        C29, H199, N219
1gcb      Papain-like        C73, H369, N392
1fieA     Transglutaminase   C314, H373, D396
1aim      Papain-like        C25, H159, N175
1ppn      Papain-like        C25, H159, N175

(b) PROSITE sites
Protein   PROSITE family        Site (from pdbmotif)
1hucAB    THIOL_PROTEASE_CYS    23-34
1gcb      THIOL_PROTEASE_CYS    67-78
          THIOL_PROTEASE_HIS    367-377
1fieA     TRANSGLUTAMINASES     312-329
1aim      THIOL_PROTEASE_CYS    19-30
          THIOL_PROTEASE_HIS    157-167
          THIOL_PROTEASE_ASN    170-189
1ppn      THIOL_PROTEASE_CYS    19-30
          THIOL_PROTEASE_HIS    157-167
          THIOL_PROTEASE_ASN    170-189

(c) PROSITE patterns
Key   PROSITE family        AC        Pattern
*     THIOL_PROTEASE_CYS    PS00139   Q-x(3)-[GE]-x-C-[YW]-x(2)-[STAGC]-[STAGCV]
†     THIOL_PROTEASE_HIS    PS00639   [LIVMGSTAN]-x-H-[GSACE]-[LIVM]-x-[LIVMAT](2)-G-x-[GSADNH]
‡     THIOL_PROTEASE_ASN    PS00640   [FYCH]-[WI]-[LIVT]-x-[KRQAG]-N-[ST]-W-x(3)-[FYW]-G-x(2)-G-[LFYW]-[LIVMFYG]-x-[LIVMF]
°     TRANSGLUTAMINASES     PS00547   [GT]-Q-[CA]-W-V-x-[SA]-[GA]-[IVT]-x(2)-T-x-[LMSC]-R-[CSA]-[LV]-G

a) For each protein (PDB code), its SCOP family is shown along with active site residues as defined in the PDB file. b) The structures classified by PROSITE family, giving the family, and the site matching the PROSITE pattern (see Table VI for details). c) Details of the PROSITE patterns (used in b) including the key by which they are identified in the following tables of results.


unrelated algorithmically, is the method of Koch et al. (1996) [21], in which the discovery of common structural features is reported using a graph analysis method (maximal sub-cliques) on a network of secondary structure elements (SSEs). For comparison, we are currently testing our method on this level of data representation. Since our approach is based on simple string operations, it would be expected to be much faster than finding maximal sub-cliques over multiple networks, which is combinatorially complex.

The results show that the time usage of SPratt varies considerably. Most of the time is used by Pratt to find the most promising patterns matching the minimum number of neighbor strings. Pratt’s time usage increases with the number and lengths of input strings, which in SPratt depend heavily on the chosen size of neighborhoods (i.e., the dmax value): bigger neighborhoods result in longer neighbor strings. Also, the time usage depends heavily on how general the patterns are allowed to become in the search, and on the choice between greedy (heuristic) and exhaustive search. In the experiments reported here, we have used the heuristic search mode, which is much faster than the exhaustive version and in practice often produces equally good results. A more detailed discussion of the time usage of Pratt is found in Jonassen, 1997 [15].
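The dependence of neighbor-string length on dmax can be illustrated with a small sketch. It follows the description given with Table II: each string starts with the central residue, the N-string lists the neighbors on the N-terminal side in reverse order, and the C-string lists those on the C-terminal side. Everything else is an assumption made for illustration: one coordinate point per residue, plain concatenation of one-letter codes (how SPratt encodes sequence gaps between neighbors is not reproduced), and a hypothetical straight-line toy chain.

```python
import math


def neighbor_strings(sequence, coords, dmax):
    """Return one (N-string, C-string) pair per residue.

    A neighbor is any other residue whose point lies within dmax of the
    central residue. Both strings begin with the central residue itself.
    """
    pairs = []
    for i, center in enumerate(coords):
        near = [j for j, point in enumerate(coords)
                if j != i and math.dist(center, point) <= dmax]
        # N-terminal side, read outwards from the center (reverse sequence order).
        n_side = "".join(sequence[j] for j in reversed(near) if j < i)
        # C-terminal side, in sequence order.
        c_side = "".join(sequence[j] for j in near if j > i)
        pairs.append((sequence[i] + n_side, sequence[i] + c_side))
    return pairs


def total_length(pairs):
    """Total number of symbols over all neighbor strings."""
    return sum(len(n) + len(c) for n, c in pairs)


# Hypothetical toy chain: six residues on a straight line, 3.0 Angstrom apart.
SEQUENCE = "GSDAHC"
COORDS = [(3.0 * i, 0.0, 0.0) for i in range(len(SEQUENCE))]

if __name__ == "__main__":
    for dmax in (8.0, 10.0):
        pairs = neighbor_strings(SEQUENCE, COORDS, dmax)
        print(dmax, pairs[3], total_length(pairs))
```

Even in this tiny example, raising dmax from 8 to 10 lengthens the strings handed to the pattern search, which is the effect on running time described above.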

TABLE IX. Cysteine Protease Pattern S[DN]xxxxHxxxx--FxC

(a) Patterns
Protein   1hucAB   1gcb     1fieA    1aim     1ppn
Length    252      452      705      215      212
S         S220     S393     S397     S176‡    S176‡
[DN]      N219+    N392+    D396+    N175‡+   N175‡+
H         H199+    H369†+   H373+    H159†+   H159†+
F         F32      F76      F317     F28      F28
C         C29*+    C73*+    C314°+   C25*+    C25*+

(b) rmsd
          1hucAB   1gcb     1fieA    1aim     1ppn
1hucAB    -        0.2      1.6      0.6      0.3
1gcb      1.9/10   -        1.5      0.7      0.2
1fieA     5.2/4    20.5/42  -        1.8      1.6
1aim      0.9/34   1.2/100  13.1/11  -        0.6
1ppn      1.1/33   2.1/98   6.1/17   0.8/159  -

a) Occurrences of the motif in the structures (by PDB code), where ‘+’ indicates a site residue and the symbols ‘*’, ‘†’, ‘‡’ and ‘°’ indicate PROSITE matches. (See Table VIII for details.) b) rmsd values for the superposition of motif residues (top-right triangle) and complete chains (bottom-left triangle) along with the number of residues superposed.

TABLE X. 4Fe-4S Ferredoxin Pattern PC-Cxx---C

(a) Pattern
Protein   1blu     6fd1.a   6fd1.b   1xer     1fxrA    2fxb.a   2fxb.b
Length    80       106      106      102      64       81       81
P         P54      P50*     P50*     P94*     P55      P62      P62
C         C53+     C49+*    C49+*    C93+*    C54+     C61+     C61+
C         C14+*    C16+     C16+     C51+     C17+     C17+     C17+
C         C8+*     C11      C8+      C45+     C11+     C14+     C11+
Missing: C11 C8 C14 C11 C14

(b) rmsd
          1blu     6fd1.a   6fd1.b   1xer     1fxrA    2fxb.a   2fxb.b
1blu      -        1.8      0.7      0.2      0.3      2.0      0.2
6fd1.a    4.4/50   -        1.4      1.8      1.8      2.3      1.8
6fd1.b    4.4/50   -        -        0.7      0.8      2.3      0.8
1xer      1.2/46   1.2/43   1.2/43   -        0.2      2.0      0.2
1fxrA     1.0/38   1.2/30   1.2/30   0.8/45   -        2.0      0.1
2fxb.a    1.8/39   1.1/32   1.1/32   1.1/42   0.8/49   -        1.9
2fxb.b    1.8/39   1.1/32   1.1/32   1.1/42   0.8/49   -        -

a) Occurrences of the motif in the structures (by PDB code), where ‘+’ indicates a site residue and ‘*’ indicates a PROSITE pattern match. (See Results, Fe4S4-binding family, for details.) Alternative matches are distinguished by the suffix ‘.a’ or ‘.b’ on the PDB code. The bottom line (missing) gives residues that are in the site definition but were unmatched. b) rmsd values for the pairwise superposition of motif residues (top-right triangle) and pairwise similarities of complete chains (bottom-left triangle) along with the number of residues superposed.

TABLE XI. 4Fe-4S Ferredoxin Pattern PCxCx-Cx-C Found in Three Out of Five 4Fe-4S Structures Analyzed

(a) Pattern
Protein   1blu     1fxrA    2fxb
Length    80       64       81
P         P54      P55      P62
C         C53+     C54+     C61+
C         C14+*    C17+     C17+
C         C11+*    C14+     C14+
C         C8+*     C11+     C11+

(b) rmsd
          1blu     1fxrA    2fxb
1blu      -        0.4      0.3
1fxrA     1.0/38   -        0.1
2fxb      1.8/39   0.8/49   -

a) Occurrences of the motif in the structures (by PDB code), where ‘+’ indicates a site residue and ‘*’ indicates a PROSITE pattern match. (See Results, Fe4S4-binding family, for details.) b) rmsd values for the pairwise superposition of motif residues (top-right triangle) and pairwise similarities of complete chains (bottom-left triangle) along with the number of residues superposed.

TABLE XII. 2Fe-2S Ferredoxin Pattern Cxxxx-CxxCx--C Found in All Four 2Fe-2S Structures Analyzed

(a) Pattern
Protein   1frd     1alo     2pia     1put
Length    98       907      321      106
C         C41+*    C40+*    C272+*   C39+†
C         C46+*    C45+*    C277+*   C45+†
C         C49+*    C48+*    C280+*   C48+†
C         C79+     C60+     C308+    C86+

(b) rmsd
          1frd     1alo     2pia     1put
1frd      -        0.6      0.3      0.6
1alo      1.3/45   -        0.7      0.5
2pia      1.6/40   18.3/29  -        0.7
1put      1.5/52   2.2/46   2.1/28   -

a) Occurrences of the motif in the structures (by PDB code), where ‘+’ indicates a site residue, while ‘*’ and ‘†’ indicate PROSITE pattern matches. (See Results, Fe2S2-binding family, for details.) b) rmsd values for the pairwise superposition of motif residues (top-right triangle) and pairwise similarities of complete chains (bottom-left triangle) along with the number of residues superposed.


Whether the correct motif had been found in each example was assessed both by the site data recorded in the PDB and Swiss-Prot files and by matching the sequence motifs found in the PROSITE databank. While these data themselves substantially confirmed each assignment by SPratt, they were sometimes incomplete and inconsistent (for example, with different PROSITE patterns matching different proteins, or even no match at all). Independent automatic verification was provided by the rmsd over the motif residues, but with a small number of residues, the rmsd value was often difficult to interpret and might sometimes have been obtained by chance. The best confirmation of the motifs came from a combination of methods, in which the motif data were used to constrain the structural alignment of the proteins obtained with the program SAP. This gave the best superposition/alignment of the proteins given the motif assignment; if the latter were incorrect, a good alignment of the proteins would not be obtained. The combined technique was able to match the equivalent structures correctly in even the most disparate pairs of proteins (see, for example, Figures 3 and 5). This combination of techniques, however, assumes that the motifs are correct, and a fuller treatment of the problem without this assumption will be described elsewhere.

Many of the comparison methods that specialize in local similarities are insensitive to the sequential ordering of the residues in the motif. This can be desirable when the juxtaposition of the active site residues is of primary interest and can be used to find examples of possible

Fig. 4. 4Fe-4S Ferredoxin superposition. Left: 1blu (red) is superposed on 6fd1 (green). Right: the two chains are colored by their degree of local similarity (SAP score). Red indicates greatest similarity, decreasing to dark blue indicating unmatched regions.

Fig. 5. 2Fe-2S Ferredoxin superposition. Left: 1alo (red) is superposed on 2pia (green). Right: the two chains are colored by local similarity (SAP score). Red indicates greatest similarity, decreasing to dark blue in unmatched regions.


“convergent” evolution. The current method finds only patterns that have the same ordering of components along the sequence. Because of this constraint, it is much more likely that the patterns discovered will derive from a common ancestor, and we therefore expect that the method will find its greatest application in the search for very distant relationships among proteins. Although sequence co-linearity is encoded in the current method, it is not a fundamental constraint and could be avoided by choosing a different encoding of the neighbor strings. For example, these might be ordered by a geometric property such as those used by Karlin and co-workers (Karlin et al., 1994) [11].

The greatest outstanding problem with the current method is how to score the patterns in terms of their importance. For pure sequence patterns, the score used by Pratt is rigorous and informative, but it cannot be directly used with structural data since, with these, the geometric correspondence is paramount. We have tried a simple compromise score based on dividing the Pratt score by the maximum rmsd found between all sets of residues (motif occurrences) matching the pattern. While useful, this measure is not ideal as it is too sensitive to the incorporation of spurious matches that give rise to odd high rmsd values. Even when all the matches are correct, the rmsd value is still sensitive to the number of residues involved and, as was seen in Table II, does not produce any clear favorite among a large number of related motif variants. One refinement suggested by these results may be to incorporate a component measuring sequence dispersion into the score, as an “expert” on protein structure motifs might value the significance of a motif more highly when it is constructed from the assembly of sequentially remote elements. While we have not pursued a solution to the problem in the current work, we feel that an answer might be found in the combination of the motif data with the comparison program SAP along the lines discussed above.
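The compromise score, the Pratt score divided by the largest pairwise motif rmsd, and its sensitivity to a single spurious match can be sketched in a few lines. The rmsd values below are the motif entries from Table XI(b); the Pratt score of 20.0 and the spurious rmsd of 2.0 Å are hypothetical numbers chosen only for illustration, and the guard against a zero rmsd is an assumption about how such a case would be handled.

```python
def compromise_score(pratt_score, pairwise_rmsds):
    """Pratt score divided by the maximum pairwise rmsd over motif occurrences."""
    worst = max(pairwise_rmsds)
    if worst <= 0.0:
        # Identical occurrences would give a division by zero; handling is an assumption.
        raise ValueError("maximum rmsd must be positive")
    return pratt_score / worst


# Motif rmsd values (Angstrom) from the top-right triangle of Table XI(b).
TABLE_XI_RMSDS = [0.4, 0.3, 0.1]
# Hypothetical Pratt information score, for illustration only.
PRATT_SCORE = 20.0

if __name__ == "__main__":
    clean = compromise_score(PRATT_SCORE, TABLE_XI_RMSDS)
    # One hypothetical spurious occurrence with a 2.0 Angstrom deviation.
    spoiled = compromise_score(PRATT_SCORE, TABLE_XI_RMSDS + [2.0])
    print(f"without spurious match: {clean:.1f}")
    print(f"with spurious match:    {spoiled:.1f}")
```

Because only the maximum enters the denominator, one outlying pair determines the score regardless of how well all the other occurrences agree, which is the weakness noted above.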

ACKNOWLEDGEMENTS

Jaap Heringa is thanked for useful comments on the manuscript. Inge Jonassen was supported by the Norwegian Research Council.

REFERENCES

1. Brown NP, Orengo CA, Taylor WR. A protein structure comparison methodology. Comput Chem 1996;20:359–380.

2. Taylor WR, Orengo CA. Protein structure alignment. J Mol Biol 1989;208:1–22.

3. Sali A, Blundell TL. Definition of general topological equivalence in protein structures: A procedure involving comparison of properties and relationship through simulated annealing and dynamic programming. J Mol Biol 1990;212:403–428.

4. Nussinov R, Wolfson HJ. Efficient detection of 3-dimensional structural motifs in biological macromolecules by computer vision techniques. Proc Natl Acad Sci USA 1991;88:10495–10499.

5. Holm L, Sander C. Protein-structure comparison by alignment of distance matrices. J Mol Biol 1993;233:123–138.

6. Grindley HM, Artymiuk PJ, Rice DW, Willett P. Identification of tertiary structure resemblance in proteins using a maximal common subgraph isomorphism algorithm. J Mol Biol 1993;229:707–721.

7. Artymiuk PJ, Poirrette AR, Grindley HM, Rice DW, Willett P. A graph-theoretic approach to the identification of three-dimensional patterns of amino acid side-chains in protein structures. J Mol Biol 1994;243:327–344.

8. Gibrat JF, Madej T, Spouge JL, Bryant SH. The VAST protein structure comparison method. Biophys J 1997;72:MP298.

9. Johnson MS, Sutcliffe MJ, Blundell TL. Molecular anatomy: Phyletic relationships derived from 3-dimensional structures of proteins. J Mol Evol 1990;30:43–59.

10. Taylor WR, Flores TP, Orengo CA. Multiple protein structure alignment. Protein Sci 1994;3:1858–1870.

11. Karlin S, Zuker M, Brocchieri L. Measuring residue associations in protein structures: possible implications for protein folding. J Mol Biol 1994;239:227–248.

12. Karlin S, Zhu Z-Y. Characterizations of diverse residue clusters in protein three-dimensional structures. Proc Natl Acad Sci USA 1996;93:8344–8349.

13. Zhu Z-Y, Karlin S. Clusters of charged residues in protein three-dimensional structures. Proc Natl Acad Sci USA 1996;93:8350–8355.

14. Jonassen I, Collins JF, Higgins DG. Finding flexible patterns in unaligned protein sequences. Protein Sci 1995;4:1587–1595.

15. Jonassen I. Efficient discovery of conserved patterns using a pattern graph. Comput Appl Biosci 1997;13:509–522.

16. Bairoch A, Boeckmann B. The SWISS-PROT protein sequence data bank. Nucleic Acids Res 1991;19:2247–2249.

17. Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: A structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995;247:536–540.

18. Murphy MEP, Lindley PF, Adman ET. Structural comparison of cupredoxin domains: Domain recycling to construct proteins with novel functions. Protein Sci 1997;6:761–770.

19. Fischer D, Wolfson H, Lin SL, Nussinov R. 3-Dimensional, sequence order-independent structural comparison of a serine-protease against the crystallographic database reveals active-site similarities: potential implications to evolution and to protein-folding. Protein Sci 1994;3:769–778.

20. Wallace AC, Borkakoti N, Thornton JM. TESS: A geometric hashing algorithm for deriving 3D coordinate templates for searching structural databases. Application to enzyme active sites. Protein Sci 1997;6:2308–2323.

21. Koch I, Lengauer T, Wanke E. An algorithm for finding maximal common subtopologies in a set of protein structures. J Comput Biol 1996;3:289–306.

22. Bairoch A, Bucher P, Hofmann K. The Prosite database, its status in 1997. Nucleic Acids Res 1997;25:217–221.
