Assignment of Homology to Genome Sequences using a Library of Hidden Markov Models that Represent all Proteins of Known Structure

Julian Gough1*, Kevin Karplus2, Richard Hughey2 and Cyrus Chothia1

1MRC, Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, UK
2Department of Computer Engineering, Jack Baskin School of Engineering, University of California, Santa Cruz CA 95064, USA

*Corresponding author

J. Mol. Biol. (2001) 313, 903–919. doi:10.1006/jmbi.2001.5080, available online at http://www.idealibrary.com. 0022-2836/01/040903–17 $35.00/0. © 2001 Academic Press

E-mail address of the corresponding author: [email protected]

Abbreviations used: HMM, hidden Markov model; FIM, free insertion module; PDB, Protein Data Bank.

Of the sequence comparison methods, profile-based methods perform with greater selectivity than those that use pairwise comparisons. Of the profile methods, hidden Markov models (HMMs) are apparently the best. The first part of this paper describes calculations that (i) improve the performance of HMMs and (ii) determine a good procedure for creating HMMs for sequences of proteins of known structure. For a family of related proteins, more homologues are detected using multiple models built from diverse single seed sequences than from one model built from a good alignment of those sequences. A new procedure is described for detecting and correcting those errors that arise at the model-building stage of the procedure. These two improvements greatly increase selectivity and coverage.

The second part of the paper describes the construction of a library of HMMs, called SUPERFAMILY, that represent essentially all proteins of known structure. The sequences of the domains in proteins of known structure, that have identities less than 95 %, are used as seeds to build the models. Using the current data, this gives a library with 4894 models.

The third part of the paper describes the use of the SUPERFAMILY model library to annotate the sequences of over 50 genomes. The models match twice as many target sequences as are matched by pairwise sequence comparison methods. For each genome, close to half of the sequences are matched in all or in part and, overall, the matches cover 35 % of eukaryotic genomes and 45 % of bacterial genomes. On average roughly 15 % of genome sequences are labelled as being hypothetical yet homologous to proteins of known structure. The annotations derived from these matches are available from a public web server at: http://stash.mrc-lmb.cam.ac.uk/SUPERFAMILY. This server also enables users to match their own sequences against the SUPERFAMILY model library.

Keywords: genome; superfamily; hidden Markov model; structure; homology

Introduction

Protein structure prediction, to discover the fold and hence information about the probable function of the sequence of a gene about which nothing is known, is possible via homology to a sequence of known structure. Protein homology searching methods have been the central tool in sequence analysis for many years, and with the growth in the experimental determination of new sequences following Moore's law (largely due to the genome projects), they are of increasing value. There are more than 50 completely sequenced genomes at the time of writing, including five eukaryotes. They comprise approximately a quarter of a million sequences. Improvements in the speed of homology methods enable larger-scale studies, and improvements in their selectivity lead to discovery of novel relationships. The system presented here probably offers the best selectivity currently available for genomic-scale studies.

Of the sequence comparison methods available, pairwise searches perform much less selectively than profile-based methods, of which hidden Markov models1–4 (HMMs) are apparently the best. The work by Park et al.,5 which is the basis of this assertion, is supported by our more recent calculations.6

The SAM HMM package (http://www.cse.ucsc.edu/research/compbio/sam.html) includes an iterative model building procedure called T99,7 which improves remote homology detection. Of the HMM procedures available, SAM T99 is the most effective.

The implementation of HMMs is not entirely straightforward. In the first part of this paper we describe calculations that (i) improve the performance of SAM HMMs and (ii) determine a good procedure for creating SAM HMMs for sequences of proteins of known structure.

In the second part of the paper we describe the construction of a library of HMMs called SUPERFAMILY, that represent essentially all proteins of known structure. This library of models extends both aspects of performance: speed and selectivity. A sequence can be searched against it over an order of magnitude faster than building a model from a query sequence and searching an equivalent database of target sequences. Expert creation and selection of the models, coupled with consensus information, has greatly improved the selectivity of the library.

In the third part of the paper we describe the use of the SUPERFAMILY library to annotate the sequences of over 50 genomes. For each genome, matches are made to sequences that form roughly half the cytoplasmic proteins. These annotations are available from a public web server at: http://stash.mrc-lmb.cam.ac.uk/SUPERFAMILY/cgi-bin/gen_list.cgi. This server also enables users to match their own sequences against the SUPERFAMILY model library.

Sequences, Domains and Homologies of the Proteins of Known Structure

Many small proteins contain single domains, whereas nearly all large proteins contain two or more that have been linked by recombination and which occur in other proteins in isolation, in combination with different partners, or in both these states.8–10 This means that to search for homologies in large sets of sequences, it is more effective to use HMMs that represent protein domains (whole small proteins or the evolutionary units of large proteins). Thus to build HMMs representing all proteins of known structure we must first have the sequences of representative sets of proteins, the definition of their domain structure and a description of their homologies. These data are available from the SCOP and ASTRAL databases.

SCOP database

This database10 contains a structural and evolutionary classification of the proteins in the PDB11 and usually keeps up to date to within two to six months. In SCOP, a multi-domain protein is split up into its constituent domains, which are then considered separately. These protein domains are evolutionary units in that for a protein to be divided into domains they must be observed in isolation or in different combinations in nature. The fundamental units used in the work described here are the protein domains as classified by SCOP. Note that a domain which has another domain inserted within it will comprise multiple chains or regions of sequence.

SCOP is a hierarchical classification, and the level of classification relevant to this work is the "superfamily". A superfamily is defined as a group of domains which have structural and functional evidence for their descent from a common evolutionary ancestor. The level below superfamily is the "family" level which groups together those domains that have clear sequence similarities. The level above superfamily is the "fold" which groups domains that have the same major secondary structure with the same chain topology. Superfamilies clustered at this level have either no evidence to suggest an evolutionary relationship or only very weak evidence that requires confirmation.

The superfamily level contains the most distantly related domains and so is the highest level for useful remote homology detection. Proteins in the same superfamily often have the same function, and usually but not always have related functions. Since SCOP classifies protein domains separately, a multi-domain protein may have contributions to its overall function from the different domains.

ASTRAL database

Files in this database12 provide sequences corresponding to the SCOP domain definitions and are derived from the SEQRES entries in PDB files.

Domains which are non-contiguous in sequence, i.e. parts of the domain separated by the insertion of another domain, are treated as a whole. Their sequences are marked with separators between the fragments representing regions belonging to other domains. All PDB entries with sequences shorter than 20 residues, or with poor quality or no sequence information are omitted.

ASTRAL also has available sequence files filtered to different levels of residue identity. The ASTRAL entries are generated entirely automatically and hence have a small number of documented errors.

SUPERFAMILY database

The sequences which are used for the work presented here are available from the SUPERFAMILY server, described in a later section, and are generated from the ASTRAL sequences yet differ in the following ways: (1) SUPERFAMILY sequence files have any sequence shorter than 30 residues removed rather than 20 in ASTRAL. Domains which are split across more than one chain had separate entries in ASTRAL which had to be joined to make a single entry in SUPERFAMILY. The ordering of the chains was obtained from NRDB and NRDB90.13 Subsequent releases of ASTRAL however now include these joined domains. (2) A small number of documented ASTRAL errors which are significant are corrected by hand. (3) Some errors in domain definitions in the SCOP classification were detected and corrected in the SUPERFAMILY sequence files. These were mostly due to typographical mistakes and have been corrected in a subsequent release of SCOP. (4) Sequences which are merely redundant shorter parts of other sequences are removed when filtering on sequence identity.
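Two of the differences listed above, the 30-residue length cut-off in (1) and the removal of redundant shorter sequences in (4), are simple enough to illustrate. The following Python is a minimal sketch only, not the SUPERFAMILY pipeline: the function name, the dictionary layout and the domain identifiers are hypothetical, and "redundant shorter part" is assumed here to mean an exact substring of a longer retained sequence.

```python
MIN_LENGTH = 30  # SUPERFAMILY cut-off given in the text (ASTRAL uses 20)


def filter_domain_sequences(seqs, min_length=MIN_LENGTH):
    """Drop short sequences and sequences contained in a longer kept one.

    `seqs` maps a (hypothetical) domain identifier to its sequence.
    """
    # Difference (1): remove anything shorter than the length cut-off.
    long_enough = {name: s for name, s in seqs.items() if len(s) >= min_length}

    # Difference (4), simplified: visit the longest sequences first and
    # discard any sequence that occurs verbatim inside one already kept.
    kept = {}
    ordered = sorted(long_enough.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for name, s in ordered:
        if not any(s in longer for longer in kept.values()):
            kept[name] = s
    return kept


example = {
    "d1": "A" * 40 + "C" * 10,  # kept
    "d2": "A" * 35,             # exact fragment of d1, removed
    "d3": "G" * 10,             # shorter than 30 residues, removed
    "d4": "W" * 30,             # kept
}
filtered = filter_domain_sequences(example)
```

An exact-substring test is the strictest reading of "redundant"; a real filter working on sequence identity would also have to tolerate point differences between the fragment and its parent.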

The SAM-T99 HMM procedure

The SAM-T99 HMM procedure was developed by Kevin Karplus and his colleagues at Santa Cruz. Here we give an outline of the procedure, which is described in detail elsewhere.7

SAM-T99 starts from an initial alignment of homologous sequences or a single query sequence; this sequence or alignment is called the "seed". The default procedure follows these steps: (1) Using the initial sequence(s) and the WU-BLASTP procedure (http://blast.wustl.edu/blast/), a search of a large non-redundant protein database is carried out to find two sets of sequences. The first set consists of close homologues of the query sequence: those that match it with E-values of 0.00003 or less, and these are used to create the initial HMM. The second set of sequences consists of those that match the query with E-values of 500 or less and, therefore, will probably include more distant homologues of the query sequence. (2) An initial HMM is built from the first set of close homologues from step (1). This is then used to search the second set from step (1) for more homologues, which are added to the first set and realigned to create a new and larger alignment from which a new and better model can be built. (3) The initial HMM is extended with additional homologues by repeating step (2) for four iterations. In step (1), the use of low E-values by WU-BLAST and a strict HMM scoring threshold ensures that only close homologues are used. In the four iterations of step (2), thresholds for the HMM score are gradually decreased. (4) From the final alignment produced by the iterations in step (3), a model is built using one of the scripts provided by the SAM package. These scripts filter and weight the sequences in the alignment before building the model.
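The control flow of steps (1) to (4) can be sketched as a generic seed-and-extend loop. The Python below is an illustration under stated assumptions and does not call SAM or WU-BLAST: iterative_model_build, toy_build_model and toy_score are hypothetical names introduced here, the "model" is a toy set of residue letters, and the score is simply assumed to be a number for which lower means a better match. The per-iteration thresholds are supplied by the caller.

```python
def iterative_model_build(seed, close_set, distant_set,
                          build_model, score, thresholds):
    """Seed-and-extend loop in the spirit of steps (1)-(4).

    build_model(sequences) -> model and score(model, sequence) -> number
    (lower is better) are caller-supplied stand-ins for the real tools.
    """
    # Step (1): the close homologues seed the first alignment.
    members = [seed] + [s for s in close_set if s != seed]
    # Step (2): build the initial model from the close set.
    model = build_model(members)
    # Step (3): search the distant set, add what passes, rebuild, repeat.
    for threshold in thresholds:
        accepted = [s for s in distant_set
                    if s not in members and score(model, s) <= threshold]
        members.extend(accepted)
        model = build_model(members)
    # Step (4): the model built from the final member set is returned.
    return model, members


# Toy stand-ins (assumptions for illustration only).
def toy_build_model(sequences):
    """The 'model' is just the set of letters seen in the members."""
    return set("".join(sequences))


def toy_score(model, sequence):
    """Number of distinct letters the model has not seen (0 is best)."""
    return len(set(sequence) - model)


model, members = iterative_model_build(
    seed="AB",
    close_set=["ABC"],
    distant_set=["BCD", "DEF", "XYZ"],
    build_model=toy_build_model,
    score=toy_score,
    thresholds=[1, 2, 2, 2],
)
```

In this toy run "DEF" is only accepted after "BCD" has been absorbed into the model, which is the point of iterating: each rebuilt model can reach sequences that the previous one could not, while the unrelated "XYZ" is never admitted.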

The UCSC SAM package used to create the HMMs and to score sequences with the HMMs is available from http://www.cse.ucsc.edu/research/compbio/sam.

Performance of a model from an alignment of multiple homologous sequences versus multiple models from single homologous sequences

Previous practice by this group14 and others15 has been to build one HMM model for each protein family or superfamily using an accurate alignment of the sequences of diverse family members. However, there are two problems with this approach: one practical and one theoretical. The practical problem is that producing accurate sequence alignments is not a trivial problem and for very diverse sequences requires expert human intervention. The theoretical problem is that it has not been demonstrated that using one model built from a good alignment of selected diverse sequences produces better results than using multiple models built from different single seed sequences and their homologues (as described above). To investigate this second problem various methods of modeling five chosen superfamilies were compared.

The five superfamilies were selected because detailed structural and sequence analyses were available, along with the expert knowledge acquired from these analyses (Bashford et al.16 and our unpublished work). These provide both accurate structure-based hand-built sequence alignments, and the means with which to verify the results of homology searches.

The different models were built for each superfamily and searched against NRDB90, and then checked and compared. All of the models were built using the SAM T99 iterative procedure described above.

The following four variations on the method were investigated: (1) The accurate structure-based alignments of the superfamily members were used as the seed alignment for the default T99 procedure which generates the final models. (2) The accurate hand alignments were used as the seed, but an additional constraint was applied: the structurally conserved core regions of the alignment were fixed throughout the iterative procedure, which would usually re-align at every step. (3) Completely automatic alignments of the same sequences used in (1) and (2) were created with ClustalW.17 These automatic alignments (with many observed errors) were used as the T99 seed. (4) The individual sequence members of the above alignments were each used separately as seeds for a set of models. The results from all models were concatenated to give one result as in the other procedures.

The number of homologous sequences found by these four procedures is given in Table 1. The homology criteria were as follows: any hit with a "reverse" score lower (better) than −15 was taken as a homologue; hence the comparisons presented in Table 1 depend on the scores produced by the different methods being roughly equivalent. This value (−15) was the score found to produce a 1 % rate of errors per query when using T98 (a slightly older version of T99) to score all of PDB versus PDB.5 The reverse score is offset against the score of the sequence in reverse; this filters out low-complexity matches.

Table 1. The numbers of hits found for different SCOP families using procedures 1-4 described in the text

Total number of hits

Superfamily      Cupredoxins   Cytochrome c   Flavodoxins   Globins   Iset
Procedure (1)    94            22             121           492       8440
Procedure (2)    63            21             121           489       8431
Procedure (3)    82            22             119           492       8298
Procedure (4)    106           22             130           505       9687

These data were obtained with SAM-T98 on SCOP release 1.39. The nrdb90 database from 1998 was used including sequences available at the time.
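The reverse score used as the homology criterion can be illustrated with a small sketch. It assumes, as one plausible reading of the description, that the reverse score is the score of the sequence minus the score of the same residues in reverse order; the exact calculation inside SAM may differ. The scoring function toy_score and both example sequences are hypothetical, and only the −15 threshold comes from the text.

```python
HOMOLOGUE_THRESHOLD = -15.0  # reverse-score cut-off quoted in the text


def reverse_score(score, sequence):
    """Score of `sequence` offset by the score of its reversal.

    `score` is any callable returning a number where lower is better.
    """
    return score(sequence) - score(sequence[::-1])


def is_homologue(score, sequence, threshold=HOMOLOGUE_THRESHOLD):
    """Apply the criterion: reverse score lower (better) than threshold."""
    return reverse_score(score, sequence) < threshold


def toy_score(sequence):
    """Toy scorer: rewards an ordered motif and, more weakly, composition."""
    return -10 * sequence.count("HGK") - sequence.count("A")


low_complexity = "A" * 30      # scores well forwards AND backwards
with_motif = "HGKHGKAAA"       # scores well only in the forward direction
```

A low-complexity sequence has the same composition in both directions, so its forward and reversed scores cancel and it fails the test, whereas a sequence whose score depends on residue order keeps its advantage.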

The target sequences were checked by hand for false homologues using a combination of annotation, alignments, structural knowledge and further sequence searching. A small number of unannotated potential false homologues were found to be in the search results, and only a couple of certain false homologues. The immunoglobulins were not checked because the homologues were too many, and in the case of the flavodoxins the nitric oxide reductases were not counted as false because they are a well-known case of sequence similarity between different proteins.18

The results indicate that the use of multiple models where each starts with a single sequence produces the best results. When sequence alignments are used the addition of the constraints described in (2) has little effect. The Cupredoxin hand alignment included only members of one of the three families within the superfamily and in this case procedure (2) did badly. The data also show that there is some loss in performance using automatic alignments for seeding the T99 procedure, but not a great deal.

Further analysis of these data shows that not only did the multiple models procedure (4) provide more hits but it found everything found by the others. The individual models in this procedure mostly found the same hits. Some models were completely redundant with respect to others and some found outliers which the others did not find (see below for a more detailed analysis of model redundancy).
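The bookkeeping behind procedure (4), pooling the hits of several single-seed models and asking which models contribute nothing new, amounts to set operations. The sketch below is illustrative only; the model names and hit identifiers are invented, and the functions are not part of SAM or SUPERFAMILY.

```python
def combine_models(hits_by_model):
    """Union of the hit sets of all models, as in procedure (4)."""
    combined = set()
    for hits in hits_by_model.values():
        combined |= hits
    return combined


def redundant_models(hits_by_model):
    """Models whose every hit is also found by the other models together."""
    redundant = []
    for name, hits in hits_by_model.items():
        others = set().union(
            *(h for other, h in hits_by_model.items() if other != name)
        )
        if hits <= others:
            redundant.append(name)
    return redundant


# Hypothetical hit lists for three single-seed models of one superfamily.
hits_by_model = {
    "model_1": {"seqA", "seqB", "seqC"},
    "model_2": {"seqB", "seqC"},   # adds nothing beyond model_1
    "model_3": {"seqC", "seqD"},   # finds an outlier, seqD
}
all_hits = combine_models(hits_by_model)
redundant = redundant_models(hits_by_model)
```

Note that two models with identical hit sets would each be reported as redundant with respect to the other, so the list indicates candidates for removal rather than a set that can all be dropped at once.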

These results solve the theoretical problem of whether one model or multiple models are most effective, and hence remove the need for solving the practical problem of accurately aligning distantly related sequences for the purpose of generating good hidden Markov models.

The SUPERFAMILY set of HMM Models

Seed sequences for the HMM library

Given that multiple models are to be built for each SCOP superfamily, the question remains: how should the models be generated? The model building procedure comparisons described show that it is best to create models for a superfamily starting with a set of single seed sequences. Here we use as seeds for the models sets of what we call "SUPERFAMILY" sequences: these are based on the sequences found for each SCOP superfamily in ASTRAL filtered to remove any that have identities greater than 95 %. Thus in the SUPERFAMILY model library each superfamily is represented by one or more models depending on how many structures there are with less than 95 % sequence identity in the given superfamily. Using the current data this produces a model library of 4849 models which is computationally viable both on a genomic scale and on the scale of a fast, single query (see below).
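Filtering a set of sequences to a maximum pairwise identity can be sketched as a greedy pass. This is an assumption-laden illustration, not the ASTRAL or SUPERFAMILY filter: the identity measure below is a toy that only works on pre-aligned, equal-length strings, the order in which sequences are visited decides which representative survives, and the function names are hypothetical.

```python
def percent_identity(a, b):
    """Toy identity for two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)


def select_seeds(sequences, max_identity=95.0, identity=percent_identity):
    """Keep a sequence only if it is at most `max_identity` percent
    identical to every seed already kept (greedy, order-dependent)."""
    seeds = []
    for seq in sequences:
        if all(identity(seq, kept) <= max_identity for kept in seeds):
            seeds.append(seq)
    return seeds


s1 = "A" * 20
s2 = "A" * 19 + "C"   # 95 % identical to s1: not greater than 95 %, so kept
s3 = "A" * 20         # identical to s1, removed
s4 = "C" * 20         # unrelated, kept
seeds_95 = select_seeds([s1, s2, s3, s4], max_identity=95.0)
seeds_40 = select_seeds([s1, s2, s3, s4], max_identity=40.0)
```

Lowering the threshold shrinks the seed set, and with it the number of models, which is the trade-off between library size and coverage that the text goes on to examine.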

Studies on the effect of using models built from seeds filtered to different percentages of sequence identity showed that there is a strong fall off in the coverage achieved, beginning when using seeds filtered to 40 % identity, and falling steadily from 30 % and below (Figure 1). Since reducing the library from 95 % to 40 % only gives a reduction of one-third in the computational cost, and some performance is lost (e.g. ~50 assignments for an average-sized bacterial genome), the seeds filtered to 95 % are still used. In fact, the main advantage in using the larger library is an improvement in the quality of the assignments, rather than an increase in coverage. This is discussed below in the section on the assignment procedure.

Model building parameters

There are many parameters which it is possible to vary during the model building process. The effects of a basic set of five variations were examined as a guide to the best method for building the final set of models. For each set of parameters a model library was built and scored against the sequences as described below. The parameters which were varied were: (1) the number of iterations in the T99 procedure. The default is 4 and up to 6 iterations were tried. (2) The cut-off E-values of the iterations in T99. At each iteration there is a cut-off E-value used to choose new homologues to be included in the next model. (3) The limit on the score threshold, and the maximum number of sequences included in the large set of culled sequences used in the T99 procedure. (4) The final model-building script (mostly affecting the filtering of sequences in the final alignment). Again, the SAM package includes a selection of scripts to use. (5) The Dirichlet mixtures used in the regularizer for model-building19 in T99. These encapsulate information about the nature of the residue distributions which are expected to be found in match states. The SAM package includes several different prior libraries of Dirichlet mixtures which can be used.

Figure 1. The effect of filtering the seed sequences on percentage sequence identity is shown by counting the number of domains found in four different genomes. As the percentage sequence identity allowed between the seed sequences used to build the models increases, so more seeds, and hence models, are allowed. There is a general trend that using more models finds more domains in the genomes.

It was generally found that variations in (1), (2), and (3) produced mostly linear changes. It was possible to alter the parameters in a more "loose" direction, improving the ratio of true to false homologues (Figure 2(a)), but also increasing the absolute error rate (Figure 2(b)). "Tightening" the parameters made it possible to reduce the number of badly built models, but also reduced the coverage (Figure 2(c)). This can be attributed to the fact that looser parameters include more information in the models but increase the risk of including incorrect information, which leads to bad models. It was found not only that the absolute coverage, which is dominated by a few large superfamilies, is increased for looser models, but also that the average coverage per superfamily (excluding singletons) increased in a similar way. To optimise the model library in its final form the loosest parameters were chosen to improve the coverage, and the models were analysed by hand to rebuild the bad models, bringing the high error rate down. In fact, doing this to the library removes the bad models, leaving only statistical (scoring) errors which match the theoretical E-value very closely, an error rate much lower than that of even the tightest libraries (see below and Figure 3(b)). This procedure cannot be automated because of the complexity of the decision-making process involved in classifying inter-superfamily relationships as truly false. Some very distant inter-superfamily relationships are more acceptable than others depending on structural and functional similarities. Once the curation has been carried out on the model library the benefits are inherent within it, and carried forward to all future scoring.

Looking in more detail at the various parameters there are a few points worthy of note. The more iterations which are used, and the larger the culled set, the more computationally expensive the model building becomes. Once again, an expensive procedure is affordable if it is only to be carried out once, because the models can be used again and again at no further cost. The changes of (4) and (5) are of no extra cost. The model-building script "W0.5" and the prior library "recode3.20.comp" were found to be improvements on the defaults in work carried out at UCSC, which this work confirms. Tightening the threshold in the final iteration (also suggested by UCSC) reduces the number of bad models but does little or nothing to improve the ratio of true to false hits. For an automatic procedure this is very desirable, but in the case of the SUPERFAMILY model library which has bad models re-built by hand (see below), doing this reduces the ratio of true to false hits.

Free insertion modules

A Free Insertion Module (FIM) in a model allows the free insertion of any number of residues at that point in the sequence without penalty to the score. If you have a non-contiguous domain, it is possible to replace the inserted domain with a FIM in the model, thus giving the same score to a non-contiguous sequence with a domain inserted as a contiguous sequence. This amounts to removing any gap penalties at one point which might penalise an inserted domain where one is expected. The FIMs may be (i) inserted before the T99 process, or (ii) inserted in the final model after the T99 process, or (iii) not used at all.

Models were built for all non-contiguous domains in the three possible ways, scored against the PDB as before and compared. It was found that FIMs did not make much difference, but that using them ((i) and (ii) above) was slightly detrimental. The models without FIMs ((iii) above) were found to have an HMM segment with very high insert transition probabilities at the point of the inserted domain. In fact the T99 procedure successfully detects the point of insertion and allows large insertions at very low penalty without the need for intervention.

Curation of the model library

Once a library has been created it is tested. Its models are scored against every SUPERFAMILY sequence. The hits which the models make are then classified as true or false depending on whether or not they are classified by SCOP in the same superfamily.
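The true/false labelling described above can be sketched as a lookup on the SCOP classification. The sketch assumes that every domain carries a dotted classification string of the form class.fold.superfamily.family, so that two domains are in the same superfamily when the first three fields agree. The identifiers, classification strings and E-values below are invented for illustration and do not refer to real SCOP entries.

```python
def superfamily_of(classification):
    """Return the class.fold.superfamily prefix of a dotted string."""
    parts = classification.split(".")
    if len(parts) < 3:
        raise ValueError(f"not a full classification: {classification!r}")
    return ".".join(parts[:3])


def classify_hits(hits, classification):
    """Label each (seed, target, e_value) hit as true or false.

    A hit is true when the model's seed and the target sequence are
    classified in the same superfamily.
    """
    labelled = []
    for seed, target, e_value in hits:
        same = (superfamily_of(classification[seed])
                == superfamily_of(classification[target]))
        labelled.append((seed, target, e_value, same))
    return labelled


# Hypothetical domains: domB shares seedA's superfamily, domC only its fold.
classification = {
    "seedA": "b.1.1.1",
    "domB": "b.1.1.4",
    "domC": "b.1.2.1",
}
hits = [("seedA", "domB", 1e-5), ("seedA", "domC", 0.003)]
labelled = classify_hits(hits, classification)
```

A hit to a different superfamily within the same fold is counted as false under this rule, which is why some of the "errors" discussed later turn out to be plausible relationships rather than mistakes.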

It was discovered from this that false positives can arise for two reasons: (1) Errors in scoring. These are random scoring errors which occur because a sequence from another superfamily is assigned a score with a higher significance than it biologically deserves because of a chance similarity in sequence. Statistically this is expected to happen. The E-value is a prediction of how often this will happen by chance per query. (2) Errors in model-building. These are errors arising at the model-building stage. Such a problem will cause a model to consistently score certain false homologues with a significant score, because it is inherently built in to the model to do so. For example if members of other superfamilies somehow get into the alignment during the T99 procedure, then the model will also be built from these false sequences, causing it to match them with a significant score.

Models which have errors at the building stage ((2) above) can be recognised because their false homologues will all be related to each other, occur frequently and are assigned very significant scores. Scoring errors ((1) above) will inevitably produce very few false homologues, which are unrelated and will tend to have borderline scores. It is possible to plot the proportion of the models which make errors against different potential E-value thresholds. A change in behaviour (characterised by a jump discontinuity of the derivative) is observed at the point where the errors cease to be dominated by the model building errors, of which there will be many more with lower E-values than scoring errors.

Examination of the numbers of models involved in errors of both kinds shows that a small percentage of models which have model-building errors produce most of the observed errors. By excluding less than 4 % of the models over 90 % of the errors are removed. Figure 3 compares a library with and without the 4 % of badly built models against a full library. This theory of two types of errors is confirmed by the errors of a library which has had badly built models removed. The error rate is then very much closer to the theoretical curve for statistical errors than that given by the full library (Figure 3).
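The observation that a few models account for most of the false hits can be checked with a simple tally. The following is a minimal sketch using invented data; the function names and the (model, target, is_true) layout are assumptions made for this example and are not taken from the SUPERFAMILY software.

```python
from collections import Counter


def rank_models_by_false_hits(labelled_hits):
    """Count false hits per model and list the worst offenders first."""
    counts = Counter(model for model, _target, is_true in labelled_hits
                     if not is_true)
    return counts.most_common()


def error_fraction_removed(labelled_hits, excluded_models):
    """Fraction of all false hits that vanish if the models are excluded."""
    false_models = [model for model, _target, is_true in labelled_hits
                    if not is_true]
    if not false_models:
        return 0.0
    removed = sum(1 for model in false_models if model in excluded_models)
    return removed / len(false_models)


# Invented example: one badly built model, one chance error, one clean model.
labelled_hits = (
    [("m1", f"t{i}", False) for i in range(9)]   # systematic false hits
    + [("m1", "w1", True)]
    + [("m2", "u1", False)]                      # isolated scoring error
    + [("m3", "v1", True)]
)
ranking = rank_models_by_false_hits(labelled_hits)
fraction = error_fraction_removed(labelled_hits, {"m1"})
```

In this toy data set excluding the single model "m1" removes nine of the ten false hits, mirroring on a small scale the pattern reported above in which under 4 % of the models produce over 90 % of the errors.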

Figure 2. (a) The ability for models to discriminate between true and false positives is shown for five different model-building parameters in an all-against-all validation test of SCOP 1.53 sequences filtered to 95 % sequence identity. (b) The same validation test shows the proportion of models which produce errors at a given E-value threshold. The theoretical curve shows what would be unavoidably expected by chance. (c) The same validation test shows the average coverage per model per superfamily excluding singleton superfamilies at any given E-value. Singletons are excluded because they will have an average coverage of 100 % by merely finding themselves, and thus skew the distribution. In these Figures five different model building parameters were compared. Modlib1 refers to the default parameters with the exception of using an alternative Dirichlet mixture, and a very slightly lower E-value for the last iteration, and the W0.5 script. Modlib2 refers to using five iterations, and a slightly larger culled set. Modlib3 refers to the default SAM-T99 parameters with the W0.5 script. Modlib4 refers to the default parameters with no model-building script. Modlib5 refers to using six iterations, a much larger culled set, and higher E-value thresholds.

These excluded models are analysed by hand, comparing the structures to find the cause of their persistent errors. About half of these problem models appeared to be genuinely badly built; these are re-run with more restrictive parameters (see next section) and re-checked until they are behaving properly. In this way all of the superfamilies are properly represented. By doing this the final model library has 90 % of its false homologues removed and the error rate becomes very close to the theoretical value.

The other half of the problem models turned out to be behaving well, and to involve either technical limitations in SCOP or the current lack of structural data for plausible superfamily relationships:

(1) The majority of these models detect inter-superfamily relationships between the few superfamilies which are in fact related, but classified separately for technical reasons. Most notable of these are the NAD(P) Rossmann domains, and the FAD/NAD(P) and nucleotide-binding Rossmann-like domains (see SCOP annotation). There were many relationships detected between these three superfamilies and a few others such as the N-terminal of MurD superfamily.

(2) Members of SCOP superfamilies are classified according to whether, in the light of the currently known structural, sequence and functional evidence, they possess a common evolutionary ancestor. Use of this evidence is conservative in that proteins for which evidence of an evolutionary relationship is not strong are placed in the same fold category but as separate superfamilies. This means that some models may well collect sequences that allow them to detect relationships between different superfamilies that go beyond that available from the current structural, sequence, and functional evidence. This is particularly true for TIM barrel superfamilies.5,20 Also, 1c20 and 1bmy (ARID-like domains) have sequence homology but were classified separately in SCOP (Table 2). They have subsequently been re-classified.

Figure 3. (a) The error rate produced by models built from five different sets of model-building parameters in an all-against-all validation test of SCOP 1.53 sequences filtered to 95 % sequence identity. The theoretical line shows the unavoidable error rate expected by chance alone. (b) The error rates of the five sets of model-building parameters are shown here once all models with false hits below an E-value of 0.01 are removed. This demonstrates that a small number (~2 %) of bad models are causing most of the errors.

(3) There are domains with local structural similarities which are detected by the models, but whose overall domain structure is different. 1b0p and 1fum (Figure 4) have an interesting identical discontinuous structural motif, but the rest of the domain does not even share the same secondary structure, one being all alpha and the other being mostly beta. 1psd (residues 108-295 on chain a form a Rossmann domain) and 1b6r (residues 1-78 on chain a form a Biotin carboxylase N-terminal domain-like domain) superpose 35 residues to within 1.14 angstroms, but the rest of the structure does not superpose at all (Figure 5).

(4) Another common cause of these problems is multi-domain proteins which have similar (e.g. a long alpha-helical linker) or very variable sequence around the domain boundaries. An example is the glyceraldehyde-3-phosphate dehydrogenase-like two-domain proteins. As a result the models blur the domain boundaries and may detect part of the other domain. A related problem occurs when the domain boundary definitions vary slightly from one protein to another; in these cases they can be made consistent in the sequence files. 1qtm (residues 423-447) and 1kfs (residues 519-544) show an example of this: there is a six-turn alpha-helix which belongs to one domain in one definition, and to the other domain in the other definition. These domain boundaries have subsequently been re-classified.

(5) A further cause is when two proteins have different domain boundaries which cause the parts to be classified differently. In the case of 1hfe (chain s) and 1feh (residues 520-574 of chain a), which are Fe-only hydrogenases, a fragment is classified differently when it is a separate chain and when it is part of the main domain. In the case of 1qax (residues 4-108 of chain a) and 1dqa (residues 462-586 of chain a), a section which forms part of a domain is classified as part of another domain when the rest of its domain is not present. This has subsequently been re-classified.

(6) The last case is where there is a simple disagreement between the SCOP classification and the homology suggested by the scores. In most of these cases the homology suggested by the scores is simply wrong. The errors are scoring errors (natural statistical fluctuations) and are not caused by badly built models. For example, 1sig (sigma70 subunit fragment from RNA polymerase) and 2fb4 (immunoglobulin) have 16 residues which align almost identically, yet the structures are unrelated. The section in question is beta-sheet in one structure and helical in the other (Table 3). A rare counter-example is the pair 1zrn and 1ek1 (HAD-like proteins), which match each other and superpose over 100 residues to within 1.655 angstroms r.m.s. deviation, indicating that they are clearly related structures (Figure 6); they have subsequently been re-classified in SCOP.

A further check was carried out by examination of the lengths of hits to unknown sequences. The library was scored against the Escherichia coli genome and the lengths of the hits were compared to the lengths of the models. It was found that most discrepancies in length were due to genuine insertions (sometimes of entire domains). Others were caused by matches to repeat domains, where there is a strong sequence similarity across several domains and the model matches parts from different domains. The EF hands were a common source of differences in length, as well as some circularly permuted sequences. As these cases were relatively few, and as there is no way to automatically separate genuine inserted domains from mis-matches, these models were left as they were. This does not greatly impair the homology detection of the model, merely its ability to distinguish the domain boundaries correctly.

Redundancy of models

Once the model library has been constructed it may be examined and filtered for redundancy. Since the redundancy of the seeds of the models is only indirectly linked to the percentage sequence identity of the models, other measures of similarity must be introduced. Two measures of redundancy can be defined. These are based on the common percentage of: (1) the sequences of the multiple alignment from which the final models are built, (2) the sequences hit by the models from 50 complete genomes (see below).

Table 2. An example of an alignment of two unrelated sequences of known structure which produced a false hit

Figure 4. (a) The structure of a single domain of the multi-domain protein 1fum, which is classified in SCOP in the globin-like fold, and the superfamily is alpha-helical ferredoxin. This domain covers residues 106-243 of chain b. There is a structural motif (green) which consists of two helices which are sequentially joined by part of the supporting helical structure. (b) The structure of a single domain of the multi-domain protein 1b0p, which is classified in SCOP in the ferredoxin-like fold, and the 4Fe-4S ferredoxins superfamily. This domain covers residues 669-785 of chain a. In this case the same structural motif (green) as in (a) has the two helices sequentially joined by part of the supporting beta-sheet structure.

Figure 5. Shown here is the superposition of a part of two domains from the structures 1psd (residues 108-295 on chain a) and 1b6r (residues 1-78). The coloured region covers 35 residues which superpose with an r.m.s. deviation of 1.14 angstroms. These domains belong to two different superfamilies: Biotin carboxylase N-terminal domain-like, and NAD(P)-binding Rossmann-fold domains.

For each superfamily, every model was compared to every other model in the same superfamily using the criteria defined above. Although no two models in the library have exactly the same state and transition probabilities, under the two criteria defined above 30 % of the models were found to be 100 % redundant on both counts. The covariance between the percentage sequence identity of the seed, model-building sequences, and genome sequence hits shows that there is no significant correlation of the sequence identity between model seeds and the sequence identity between genome hits made by the models (r = 0.135). This means that sequence identity of the seeds of models is a very poor measure of similarity. Only 58 % of models score their seed sequence higher than any other seed sequence, and only 33 % score their seed higher than any sequence in the completed genomes. This means that a model really does represent its superfamily first and foremost, and not its seed sequence as might be thought.
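The pairwise comparison of models can be illustrated as follows. This is only a sketch under stated assumptions: the text does not give a formula for the "common percentage", so `common_percentage` below takes it to be the shared sequences as a percentage of all sequences hit by either model. The function names, the sequence identifiers and the example values are hypothetical.

```python
from math import sqrt


def common_percentage(hits_a, hits_b):
    """Percentage of sequences shared by two models, relative to all
    sequences hit by either model (an assumed definition)."""
    a, b = set(hits_a), set(hits_b)
    union = a | b
    if not union:
        return 0.0
    return 100.0 * len(a & b) / len(union)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical genome hits for three models from the same superfamily.
model_a = {"seq1", "seq2", "seq3", "seq4"}
model_b = {"seq1", "seq2", "seq3", "seq4"}
model_c = {"seq3", "seq4", "seq5", "seq6"}

print(common_percentage(model_a, model_b))  # fully redundant pair
print(common_percentage(model_a, model_c))  # partially redundant pair
```

Given one list of seed identities and one list of common percentages over the same model pairs, `pearson_r` would give a correlation coefficient of the kind quoted in the text.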

Figure 6. The superposition of PDB structures 1zrn and 1ek1 (residues 4-225 of chain a), which are in different superfamilies in SCOP, yields 91 residues (shown) with 1.49 angstroms r.m.s. deviation from each other.

Table 3. An example of an excellent alignment of 16 residues, which can occur by chance in unrelated proteins


Superfamily Assignment Procedure

Once the model library has been built it should be used in the best way possible. The aim is to identify the superfamilies of domains in sequences. Since most domains are part of multi-domain proteins, the domain boundaries must often be identified as well as their assignment to a superfamily. This can be particularly difficult in the case of non-contiguous domains.

For a given query sequence for which the domains and their superfamilies are not known, every model is scored (using local Viterbi scoring) across the whole sequence, detecting any occurrences of a domain belonging to the superfamily which the model represents. For each region (domain) which is hit, the model which scores the highest has its superfamily assigned to that region. This reduces scoring (inevitable statistical) errors by nearly a factor of 2, since incorrect matches are only kept if they are not better matched by another superfamily (unpublished calculations). If the incorrect match is a member of a superfamily for which there is a structure, then the model(s) belonging to the correct superfamily will score higher, thus knocking out the incorrect assignment. Using a model library built from seed sequences with less than 95 % identity rather than 40 % does not give a large improvement in the coverage, but does, via the assignment procedure, increase the quality of the assignments.
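The rule that the best-scoring model decides the superfamily of a region can be sketched as below. This is an illustration only: it assumes that the regions have already been delimited, and it uses a hypothetical input of (region, superfamily, E-value) tuples with invented scores.

```python
def assign_regions(hits):
    """For each region keep the superfamily of the best-scoring model.

    hits: iterable of (region, superfamily, e_value) tuples; a lower
    E-value is a better score.
    """
    best = {}
    for region, superfamily, e_value in hits:
        if region not in best or e_value < best[region][1]:
            best[region] = (superfamily, e_value)
    return best


# Hypothetical hits: two superfamilies compete for region_1.
hits = [
    ("region_1", "NAD(P)-binding Rossmann-fold domains", 1e-20),
    ("region_1", "FAD/NAD(P)-binding domain", 1e-5),
    ("region_2", "EF-hand", 3e-9),
]
assignments = assign_regions(hits)
for region, (superfamily, e_value) in sorted(assignments.items()):
    print(region, superfamily, e_value)
```

The weaker hit to region_1 is discarded because another superfamily matches the same region with a better score, which is how a chance match is knocked out when the correct superfamily has a model in the library.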

The greatest problem arises in the definition of the region which is matched. Often the hits will overlap, and when there is a domain inserted in another, the assignment will be completely within the other. Unfortunately, using the domain definitions from the regions hit by the models alone did not prove adequate. The assignments to 20 genomes were analysed and a complicated procedure was developed to cope with such things as overlapping and inserted domains. This procedure was rejected because it allowed some false assignments to remain.

The solution is that every sequence region which is hit by a model is aligned to that model. For any given query sequence all alignments are compared to each other and the number of residues aligned to the same position is calculated. This number is declared as the overlap. This means that the number of residues overlapping between alignments is the sum of the residues with a match state in both models. For a full sequence the regions are assigned one by one, beginning with the highest scoring and adding each subsequent non-conflicting lower score in turn. A conflict is defined as an overlap of 20 % or more. A detailed study of non-contiguous assignments, looking at all inserted domains, long sequence matches and overlapping assignments, found only four errors out of 3041 assignments. The value of 20 % was suggested by studies on the initial assignment procedure which was ultimately rejected. This step adds significant time to the overall procedure, but it is worth the cost, and the added time is not sufficient to threaten the viability of the whole procedure on a genomic scale.
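The greedy assignment described above can be sketched as follows. This is a hedged illustration rather than the SUPERFAMILY implementation. It assumes that each alignment is reduced to the set of query positions placed in match states, and that the 20 % is measured against the shorter of the two alignments, which is a denominator the text does not specify. All names and data are hypothetical.

```python
def overlap(residues_a, residues_b):
    """Number of query residues in a match state in both alignments."""
    return len(set(residues_a) & set(residues_b))


def conflicts(residues_a, residues_b, threshold=0.2):
    """True if the overlap is at least `threshold` of the shorter
    alignment (the choice of denominator is an assumption)."""
    shorter = min(len(set(residues_a)), len(set(residues_b)))
    if shorter == 0:
        return False
    return overlap(residues_a, residues_b) / shorter >= threshold


def assign(alignments):
    """Greedy assignment: best score first, skip anything that conflicts.

    alignments: list of (superfamily, e_value, aligned_residues) tuples,
    where aligned_residues are the query positions in match states.
    """
    accepted = []
    for candidate in sorted(alignments, key=lambda a: a[1]):
        if not any(conflicts(candidate[2], kept[2]) for kept in accepted):
            accepted.append(candidate)
    return accepted


# Hypothetical query: domain B is inserted inside the non-contiguous
# domain A, and C is a weaker hit covering largely the same residues as A.
alignments = [
    ("A", 1e-30, list(range(1, 61)) + list(range(141, 201))),
    ("B", 1e-12, list(range(65, 136))),
    ("C", 1e-4, list(range(20, 70))),
]
result = [name for name, _, _ in assign(alignments)]
print(result)
```

Because the overlap is counted on aligned residues rather than on start and end coordinates, the inserted domain B does not conflict with A even though it lies entirely within the span of A.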

Assignment of superfamilies to genome sequences

A model library based on the 1.53 release of SCOP was used to carry out structural superfamily assignments of 56 complete genomes. The domains within sequences which hit with a significant score were uniquely assigned to a superfamily. The creation of the SUPERFAMILY structural assignment procedure and model library has allowed comprehensive descriptions of the domains of genome sequences on a scale not previously possible (Table 4). The assignments cover roughly 45 % of prokaryotic and 35 % of eukaryotic genome peptide sequences.
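Two different coverage figures are reported for each genome: the percentage of genes with at least one domain assigned, and the percentage of the sequence itself that is covered by assigned domains. A minimal sketch of how the two differ is given below. The input layout (gene length plus a list of 1-based inclusive domain regions) and all values are assumptions made for illustration.

```python
def coverage(genes):
    """Return (percentage of genes with an assignment,
               percentage of residues covered by assigned domains).

    genes: mapping gene id -> (length, list of (start, end) regions),
    with 1-based inclusive coordinates.
    """
    n_assigned = 0
    total_len = 0
    covered_len = 0
    for length, regions in genes.values():
        total_len += length
        covered = set()
        for start, end in regions:
            covered.update(range(start, end + 1))
        covered_len += len(covered)
        if regions:
            n_assigned += 1
    return 100.0 * n_assigned / len(genes), 100.0 * covered_len / total_len


# Hypothetical genome of four genes.
genes = {
    "gene_1": (300, [(10, 150)]),             # one domain assigned, rest unmatched
    "gene_2": (200, [(1, 100), (101, 200)]),  # fully covered by two domains
    "gene_3": (250, []),                      # no assignment
    "gene_4": (250, []),                      # no assignment
}
gene_pct, residue_pct = coverage(genes)
print(gene_pct, residue_pct)
```

The residue-level figure is lower than the gene-level figure here because a multi-domain gene counts as matched even when only some of its domains are assigned.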

No significant difference in the distribution of the scores given to each assignment was found between genomes. Differences were found, however, between the distributions for each genome of the sequence identities between the matched genome sequences and the sequences used to seed the HMMs (Figure 7). For a few genomes, namely Haemophilus influenzae and Buchnera, a very high coverage of the genome was observed (the high number of E. coli genes with 100 % identity is due to its over-representation in the PDB). This was accompanied by a markedly different sequence identity distribution across the genome. They have proportionally far more sequences with high identity to sequences of known structure than other genomes, which accounts for the high number of assignments.

Table 4. Genome assignments using the SUPERFAMILY model library and procedure

A. The extent of the SUPERFAMILY assignments to the sequences from 56 genomes

B. The extent of the SUPERFAMILY assignments from 11 miscellaneous sequence sets including five alternative human gene sets and some incomplete genomes

A, The genome assignments for 56 genomes using the model library and assignment procedure. For each genome the table shows in order: the name of the species of the genome; a two-letter code (ab); the number of genes comprising the genome; the number of genes which have at least one SCOP domain assigned; the percentage of genes with at least one domain assigned; the percentage of the actual sequence covered by SCOP domains, because multi-domain genes may have some domains assigned but not others; the total number of domains assigned; the total number (out of the possible 859) of superfamilies represented by at least one domain in the genome.

B, The assignments for 11 miscellaneous sequence sets including, amongst other things, five alternative human gene sets and some incomplete genomes. In A the current Ensembl (version 1.1) is used for Homo sapiens.

As well as being usable in new investigations into genomes, the assignments provide potential new annotations. An analysis of sequences previously unassigned showed that in most genomes the SUPERFAMILY HMMs detect large numbers of novel assignments. The extent of this varies greatly with the quality of annotation carried out on the genomes by the sequencing projects (Figure 8). It is also worth noting that the novel assignments do not in general have marginal scores as one might expect; 63 % of the novel assignments to eukaryotes scored with an E-value lower than 10^-8, which is an excellent score.

Figure 7. This Figure shows high-order polynomial curves fitted to the number of domains found with a given percentage sequence identity to a PDB sequence for 38 genomes. The curves are normalised on the total number of domains found.

Figure 8. The genomes shown here were searched for entries with no annotation using keywords specific to each genome such as "unknown" and "hypothetical protein". The number of SCOP domains found by SUPERFAMILY in sequences with no previous annotation is shown. Also shown are the number of previously unannotated sequences which have a hit to a SCOP domain with a P-value of less than 0.001 using WU-BLAST. All genomes were provided with some novel annotation, and an average of 15 % of the genes in a genome were provided with potential annotation where previously there was none. Please refer to Table 4 for a key to the genome names.

A comparison with the homologies found for genome sequences by pairwise comparison and other methods

To assess the improvement in the detection of homology made by the SAM HMMs described here, we compared their performance with that of a pairwise sequence comparison method. For this comparison we used WU-BLAST because previous work has shown that this is one of the most effective of the pairwise methods.21

The 4849 SCOP sequences that were used to seed the SUPERFAMILY HMM models were directly matched by WU-BLAST to the sequences of the genomes in Table 4. For this calculation we used WU-BLAST version 2.0a19 with default parameters and matrix (BLOSUM62). All those WU-BLAST matches with E-values of less than 0.001 were taken as significant. This E-value is expected to select matches whose significance is the same as the SAM HMM scores (see Brenner et al.21). The ratio of the number of sequences matched by the two methods is very similar for the different genomes: the WU-BLAST procedure makes matches to half the number of sequences that are matched by SAM HMMs.
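The comparison of the two methods reduces to counting the sequences that each matches at a significant score. The sketch below shows that count on invented data. The hit lists are hypothetical, and applying the same numerical threshold of 0.001 to both methods is a simplifying assumption made here for illustration.

```python
def matched_sequences(hits, threshold=0.001):
    """Set of sequence ids with at least one hit below the E-value threshold.

    hits: iterable of (sequence_id, e_value) pairs.
    """
    return {seq_id for seq_id, e_value in hits if e_value < threshold}


# Hypothetical results for one small genome.
blast_hits = [("g1", 1e-10), ("g2", 0.5), ("g3", 2e-4)]
hmm_hits = [("g1", 1e-30), ("g2", 1e-6), ("g3", 1e-8), ("g4", 5e-5), ("g5", 0.02)]

blast_matched = matched_sequences(blast_hits)
hmm_matched = matched_sequences(hmm_hits)
ratio = len(blast_matched) / len(hmm_matched)
print(sorted(blast_matched), sorted(hmm_matched), ratio)
```

In this toy case the pairwise method matches half as many sequences as the HMMs, the ratio reported in the text for the real genomes.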

On a more detailed level we compared the number of "hypothetical" sequences matched by the WU-BLAST and SAM HMM procedures. At this level WU-BLAST matched between 4 % and 36 % of those matched by the SAM HMMs (see Figure 8).

The coverage presented in Table 4 shows that 54 % of the genes of Mycoplasma genitalium have a structural assignment. The current SUPERFAMILY HMM library, which has recently been updated to release 1.55 of SCOP, has structural assignments to 61 % of the genes. It is possible to compare coverage for this genome with many other methods because it is very small (480 sequences) and so most methods have a comparable analysis. GenTHREADER22 covers 53 % (Jones, personal communication) of the genes, but is computationally costly and as a consequence has not been applied to many genomes. PSI-BLAST23 has been used by several groups, yielding coverages of 37 % (Huynen et al.24), 39 % (Wolf et al.25), and 41 % (Teichmann et al.26). However, these figures were obtained using the fewer PDB sequences available at the time. Much more extensive use has been made recently by Gene3D (http://www.biochem.ucl.ac.uk/bsm/cath_new/Gene3D), which covers 41 % of Mycoplasma genitalium proteins and has been applied to over 30 genomes including two of the smaller eukaryotes. None of these methods has been applied as extensively as the work presented here, however, and most notably there is a lack of analysis of larger eukaryote genomes.

The SUPERFAMILY public server

As well as being used to produce genome assignments the library can be used to carry out structural assignments for other sequences of interest. A public web server has been set up to serve the model library at http://stash.mrc-lmb.cam.ac.uk/SUPERFAMILY.

The library may be searched with many (amino acid or nucleotide) sequences at a time by entering them in FASTA format.27 It includes a BLAST pre-filter to remove the most obvious assignments without running the more costly HMM searches. The server will return all of the structural domains present in the sequences with the regions and links to SCOP. The service has already been extensively used, in particular by structural genomics groups to aid the selection of targets.
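For readers unfamiliar with the input format, the sketch below parses FASTA text into (header, sequence) pairs: each record starts with a ">" header line, followed by one or more lines of sequence. It is a generic illustration, not code from the server, and the example records are invented.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical query sequences.
example = """>query_1 hypothetical protein
MKTAYIAKQR
QISFVKSHFS
>query_2
MSDNELLK
"""
records = read_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```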

The server also makes the assignments to the genomes described here publicly available along with some analysis which has been carried out. The assignments can be browsed starting with either a superfamily or a genome, selecting the second from a list.

Also available on the server is an alignment procedure which allows multiple sequences to be aligned to the models in the library. Query sequences used in searches of the library may be aligned to the models, or sequences can be pasted in or uploaded in FASTA format. Genome sequences and PDB sequences are available on the server to include in the multiple alignments without the need to enter them.

It is the intention that the SUPERFAMILY models themselves will be made available to download. The library will be kept updated with every release of SCOP and is being used to provide a feedback loop for the testing and improvement of SCOP.

As the structural coverage of sequence space increases, especially from the work of structural genomics projects, the coverage of the library will increase, as will the quality of the assignments.

Validation

The model library performs extremely well in validation tests against SCOP but, as it has been heavily trained on these validation tests themselves, the rigorous test is one of blind prediction such as the CASP28 test. SUPERFAMILY was submitted to the LiveBench (http://bioinfo.pl/LiveBench) continuous benchmarking of prediction servers, which is based on the CASP concept and offers difficult targets of recently solved structures. All targets have a BLAST29 E-value higher than 0.1 to all other members of the PDB. Out of the 203 targets in the LiveBench-2 project (collected between 13.4.00 and 29.12.00), 45 were assigned to the correct SCOP superfamily by SUPERFAMILY and one was falsely assigned. The one incorrect assignment involved a very short cysteine-rich protein, and sequences of this kind are known to produce false matches with good scores.5
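The LiveBench-2 counts can be turned into two rates, as worked through below. The counts are those given in the text; the variable names and the choice of these two summary rates are ours.

```python
targets = 203    # LiveBench-2 targets
correct = 45     # assigned to the correct SCOP superfamily
incorrect = 1    # falsely assigned

assigned = correct + incorrect
coverage_pct = 100.0 * correct / targets   # share of targets correctly assigned
error_pct = 100.0 * incorrect / assigned   # share of assignments that were wrong

print(f"correctly assigned: {coverage_pct:.1f} % of targets")
print(f"errors: {error_pct:.1f} % of assignments made")
```

On these difficult targets roughly a fifth are assigned correctly, and about one assignment in 46 is wrong.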

Discussion

The SUPERFAMILY HMM library

Here, we have described and assessed new procedures for the determination of homology by hidden Markov models; the construction of a library of models that represents almost all proteins of known structure; and the demonstration that sequences that comprise some 45 % of bacterial genomes and 35 % of eukaryotic genomes match these models.

It has been established that by using multiple models, each starting from a single seed sequence, at least as many, if not more, homologues can be found than by using an accurate structurally based hand alignment to build a single model. This removes the need for accurate structural alignments for building a good model library. The principle does not extend to sequences of unknown structure because the SCOP classification, which is obtained by detailed structural analysis, is essential for determining the set of overlapping models which should represent a superfamily.

A model library has been made for all proteins of known structure. A superfamily is represented by many separate models, each of which is attempting to model the whole superfamily. The model-building procedure is very sensitive, which is why a different seed sequence will produce a different model even though they are attempting to model the same superfamily. This is also the reason that the models are built from a 95 % non-redundant set rather than a less redundant one. Thus, though the models will mostly hit the same sequences, they do so with different scores for the common matches, and some models uniquely match different outliers.

The two major factors which contribute to the performance of this library are (i) each superfamily is represented by many models rather than a single model and (ii) the ability to use SCOP/PDB sequence analysis to improve the models' performance. This second factor leads to the rejection of some models, the re-building of others and guidance in homology decision. Further to this, sequences queried against the library will be tested against every model so that the results are cross-compared, i.e. a sequence hitting one model may not be considered true if it hits another model with a higher significance. This is an extremely important contribution in genomic assignments, where regions will frequently hit many models, especially within a single superfamily. The library leans heavily on the extensive and accurate information contained in the SCOP classification, ensuring that the assignments are as good as they can be given what is currently known.

Possible inter-superfamily relationships for which there is not yet structural evidence have been detected, most of which are structurally plausible.


Since the models for the library are created once and then re-used on a large scale, more effort can be spent on making the models as good as possible. There are two areas in which the models can be improved at greater expense than would be practical for a single use: computationally expensive model-building parameters can be used on large computing resources over a long period of time, and the models can be assessed and tuned using expert knowledge both in individual cases and across the whole library.

The models described here have been used for high-throughput genomic studies. The results of this work and the models themselves provide the basis of a public web service (http://stash.mrc-lmb.cam.ac.uk/SUPERFAMILY).

The genome sequences matched by the HMMs

The library has been used to assign structures to sequences from over 50 genomes with a coverage of roughly 35 % of eukaryotic sequences and 45 % of prokaryotic sequences. Membrane proteins are believed to form 20-30 % of all sequences. There are only a few models for membrane sequences in SUPERFAMILY. This means that the SUPERFAMILY HMMs match somewhat more than half of cytoplasmic proteins in bacteria and half, or a little less, of those in eukaryotes.
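The step from overall coverage to coverage of non-membrane proteins is a simple rescaling, worked through below. The sketch assumes that essentially none of the membrane sequences are matched, so that all matches fall in the remaining 70-80 % of sequences; the function name is hypothetical.

```python
def non_membrane_coverage(overall, membrane_fraction):
    """Coverage of non-membrane sequences, assuming no membrane
    sequences are matched (a simplifying assumption)."""
    return overall / (1.0 - membrane_fraction)


for label, overall in (("prokaryotic", 0.45), ("eukaryotic", 0.35)):
    low = non_membrane_coverage(overall, 0.20)   # membrane proteins = 20 %
    high = non_membrane_coverage(overall, 0.30)  # membrane proteins = 30 %
    print(f"{label}: {100 * low:.0f}-{100 * high:.0f} % of non-membrane sequences")
```

Under that assumption the prokaryotic figure comes out at about 56-64 % and the eukaryotic figure at about 44-50 %, in line with "somewhat more than half" and "half, or a little less".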

Since the library is based on known structures, as more novel folds are discovered through structural genomics projects, the number of matched genome sequences is expected to rise quickly.

The applications of this model library are several and it is already being used by many other projects within this laboratory and outside. The annotation provided by the genome assignments is already being used by many genome projects either as an aid to annotation or as annotation in its own right: for example those for mouse (http://www.gsc.riken.go.jp/e/FANTOM) and Arabidopsis thaliana (http://www.arabidopsis.org). The procedure has recently been added to the ENSEMBL (http://www.ensembl.org) pipeline for annotation of the human genome. The library is also being used by structural genomics projects to make predictions of the structure of their targets, and to suggest potential targets, e.g. SPiNE (http://spine.mbb.yale.edu/spine). The SUPERFAMILY web server is heavily accessed for genome assignments and sequence alignments. The server has already processed nearly 10,000 requests for sequence queries against the model library this year.

The genome projects themselves usually annotate the genes, but the annotation is just free text entries which are understandably very minimal and incomplete due to the volume of data involved (between 500 and 30,000 genes per genome), and the methods used. The free text is of little use for global studies of a genome because it requires human interpretation not possible on this scale. There are no standards of annotation between genomes, so inter-genome comparisons can be difficult, if not impossible. Using the model library to assign structural domains to all genome sequences provides both the necessary information about the genes, and the framework of classification (SCOP) for new comparative studies consistent across all genomes. Thus the SUPERFAMILY genome assignments have formed the basis of several comparative studies, for example on domain recombination30 and on the evolution and formation of small molecule metabolic pathways.31

Acknowledgments

We thank Alexey Murzin for helpful discussions.

References

1. Krogh, A., Brown, M., Mian, I. S., Sjolander, K. & Haussler, D. (1994). Hidden Markov models in computational biology: applications to protein modeling. J. Mol. Biol. 235, 1501-1531.

2. Eddy, S. R. (1995). Multiple alignment using hidden Markov models. Proc. Third Int. Conf. Intelligent Systems for Molecular Biology (Rawlings, C. et al., eds), pp. 114-120, AAAI Press, Menlo Park.

3. Eddy, S. R. (1996). Hidden Markov models. Curr. Opin. Struct. Biol. 6, 361-365.

4. Hughey, R. & Krogh, A. (1996). Hidden Markov models for sequence analysis: extension and analysis of the basic method. CABIOS, 12, 95-107.

5. Park, J., Karplus, K., Barrett, C., Hughey, R., Haussler, D., Hubbard, T. & Chothia, C. (1998). Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol. 284, 1201-1210.

6. Madera, M. & Gough, J. (2001). A comparison of Hidden Markov Model software for remote homology detection. Bioinformatics, in the press.

7. Karplus, K., Barrett, C. & Hughey, R. (1998). Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14, 846-856.

8. Rossmann, M. G., Moras, D. & Olsen, K. W. (1974). Chemical and biological evolution of a nucleotide-binding protein. Nature, 250, 194-199.

9. Patthy, L. (1991). Introns and exons. Curr. Opin. Struct. Biol. 4, 383-392.

10. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536-540.

11. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H. et al. (2000). The Protein Data Bank. Nucl. Acids Res. 28, 235-242.

12. Brenner, S. E., Koehl, P. & Levitt, M. (2000). The ASTRAL compendium for sequence and structure analysis. Nucl. Acids Res. 28, 254-256.

13. Holm, L. & Sander, C. (1998). Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics, 14, 423-429.

14. Teichmann, S. A. & Chothia, C. (2000). Immunoglobulin superfamily proteins in C. elegans. J. Mol. Biol. 296, 1371-1387.

15. Bateman, A., Birney, E., Durbin, R., Eddy, S. R., Howe, K. L. & Sonnhammer, E. L. (2000). The pfam protein families database. Nucl. Acids Res. 28, 263-266.

16. Bashford, D., Chothia, C. & Lesk, A. M. (1987). Determinants of a protein fold: unique features of the globin amino acid sequences. J. Mol. Biol. 196, 199-216.

17. Thompson, J. D., Higgins, D. G. & Gibson, T. J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucl. Acids Res. 22, 4673-4680.

18. Bredt, D. S., Hwang, P. M., Glatt, C. E., Lowenstein, C., Reed, R. R. & Snyder, S. H. (1991). Cloned and expressed nitric oxide synthase structurally resembles cytochrome P-450 reductase. Nature, 351, 714-718.

19. Sjolander, K., Karplus, K., Brown, M., Hughey, R., Krogh, A., Mian, S. & Haussler, D. (1996). Dirichlet mixtures: a method for improving detection of weak but significant protein sequence homology. CABIOS, 12, 327-345.

20. Copley, R. R. & Bork, P. (2000). Homology among (βα)8 barrels: implications for the evolution of metabolic pathways. J. Mol. Biol. 303, 627-640.

21. Brenner, S. E., Chothia, C. & Hubbard, T. J. P. (1998). Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl Acad. Sci. USA, 95, 6073-6078.

22. Jones, D. T. (1999). GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J. Mol. Biol. 287, 797-815.

23. Altschul, S. F., Madden, A. A., Schaffer, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res. 25, 3389-3402.

24. Huynen, M., Doerks, T., Eisenhaber, F., Orengo, C., Sunyaev, S., Yuan, Y. & Bork, P. (1998). Homology-based fold predictions for Mycoplasma genitalium proteins. J. Mol. Biol. 280, 323-326.

25. Wolf, Y. I., Brenner, S. E., Bash, P. A. & Koonin, E. V. (1999). Distribution of protein folds in the three superkingdoms of life. Genome Res. 9, 17-26.

26. Teichmann, S. A., Park, J. & Chothia, C. (1998). Structural assignments to the Mycoplasma genitalium proteins show extensive gene duplications and domain rearrangements. Proc. Natl Acad. Sci. USA, 95, 14658-14663.

27. Pearson, W. R. & Lipman, D. J. (1988). Improved tools for biological sequence analysis. Proc. Natl Acad. Sci. USA, 85, 2444-2448.

28. Moult, J., Pederson, J. T., Judson, R. & Fidelis, K. (1995). A large-scale experiment to assess protein structure prediction methods. Proteins: Struct. Funct. Genet. 23, ii-iv.

29. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). Basic local alignment search tool. J. Mol. Biol. 215, 403-410.

30. Apic, G., Gough, J. & Teichmann, S. A. (2001). Domain combinations in archaeal, eubacterial and eukaryotic proteins. J. Mol. Biol. 310, 311-325.

31. Teichmann, S. A., Rison, S. C. G., Thornton, J. M., Riley, M., Gough, J. & Chothia, C. (2001). The evolution and structural anatomy of the small molecule metabolic pathways in Escherichia coli. J. Mol. Biol. 311, 693-708.

Edited by G. Von Heijne

(Received 18 May 2001; received in revised form 12 September 2001; accepted 12 September 2001)