
Mathematical and Computer Modelling 43 (2006) 401–412 — www.elsevier.com/locate/mcm

Gauss-integral based representation of protein structure for predicting the fold class from the sequence

Bjørn G. Nielsen a,*, Peter Røgen b, Henrik G. Bohr a

a Quantum Protein Centre (QuP), Department of Physics, Technical University of Denmark, Bldg. 309, DK-2800, Kongens Lyngby, Denmark
b Department of Mathematics, Technical University of Denmark, Bldg. 303, DK-2800, Kongens Lyngby, Denmark

Received 15 December 2004; received in revised form 2 November 2005; accepted 10 November 2005

Abstract

A representative subset of protein chains were selected from the CATH 2.4 database [C.A. Orengo, A.D. Michie, S. Jones, D.T. Jones, M.B. Swindells, J.M. Thornton, CATH—a hierarchic classification of protein domain structures, Structure 5 (8) (1997) 1093–1108], and were used for training a feed-forward neural network in order to predict protein fold classes by using as input the dipeptide frequency matrix and as output a novel representation of the protein chains in R^30 space, based on knot invariant values [P. Røgen, B. Fain, Automatic classification of protein structure by using Gauss integrals, Proceedings of the National Academy of Sciences of the United States of America 100 (1) (2003) 119–124; P. Røgen, H.G. Bohr, A new family of global protein shape descriptors, Mathematical Biosciences 182 (2) (2003) 167–181]. In the general case when excluding singletons (proteins representing a topology or a sequence homology as unique members of these sets), the success rates for the predictions were 77% for class level, 60% for architecture, and 48% for topology. The total number of fold classes that are included in the present data set (∼500) is ten times that which has been reported in earlier attempts, so this result represents an improvement on previous work (reporting on a few handpicked folds). Furthermore, distance analysis of the network outputs resulting from singletons shows that it is possible to detect novel topologies with very high confidence (∼85%), and the network can in these cases be used as a sorting mechanism that identifies sequences which might need special attention. Also, a direct measure of prediction confidence may be obtained from such distance analysis.
© 2006 Elsevier Ltd. All rights reserved.

Keywords: Proteins; fold class prediction; CATH; neural networks

1. Introduction

Biological function is closely tied up with the biophysical properties of a class of molecules known as proteins. In brief, proteins are linear polymeric compounds whose building blocks are amino acids. There are 20 naturally occurring species of amino acids, and to form proteins they join end to end by peptide bonds (hence the alternative term polypeptide). These molecules play important roles in connection with all of life’s processes, including functions such as creating and maintaining biological structures (e.g., actin and tubulin), movement of organisms (e.g., myosin

* Corresponding author. E-mail address: [email protected] (B.G. Nielsen).

0895-7177/$ - see front matter © 2006 Elsevier Ltd. All rights reserved. doi:10.1016/j.mcm.2005.11.014


and kinesin), improving chemical reaction rates (e.g., alcohol dehydrogenase), cellular transmission (e.g., insulin), transport through cellular membranes (e.g., aquaporin, various ion channels), and even as transducers of physical parameters into activation of biochemical pathways and ionic flows (e.g., photosensitive rhodopsin, thermosensitive TRP channels).

The function of a protein is determined by its three-dimensional (3D) structure, that is, by the way in which these polymers are folded in their native environment (see [4] for a brief review). The main tenet leading to this conclusion is that the individual amino acid side-chains should be positioned in such a way as to create localized environments which are permissive for specific functions [5]. Thus, for example, it is observed that enzymatically active proteins are characterized by having “reaction centers” within which specific chemical processes are facilitated (oxidation, reduction, phosphorylation, etc.) and occur at a much higher rate (e.g., by a factor >10^6) than is the norm in the surrounding bulk water [6]. In a similar vein, ion channel forming proteins in contact with a cellular membrane fold in such a way as to form small holes that puncture the membrane [7], and through which only very specific molecules may flow, sometimes only under very special circumstances [8]. In 2003, Peter Agre and Roderick MacKinnon shared the Nobel prize in chemistry for their work on precisely this type of channel.

The remarkable variability of function attainable by proteins clearly constitutes the main factor behind the booming success that the field of molecular biology has enjoyed in recent years, particularly following the recent publication of the first drafts of the whole human genome [9,10]. However, the number of published protein sequences made available in Swiss-Prot (162 781 entries in release 44.7 of 11 October 2004) vastly outnumbers the number of determined protein tertiary structures made available in the Protein Data Bank (the update of 19 October 2004 contained 27 761 structures). And this discrepancy increases by the year: in 2003 the Protein Data Bank released 3500 new structures while Swiss-Prot grew by approximately 20 000 entries. The experimental techniques most often used for obtaining high resolution protein tertiary structure, namely X-ray crystallography and/or NMR, are both experimentally costly, can be extremely time-consuming, and sometimes do not yield results due to proteins’ resistance to crystallization or their tendency to precipitate. It is of course in these cases that one must depend on computational methods to give at least an estimate of the structure and/or function associated with a given amino acid sequence, and it is to these methods that we now turn.

Defining the problem

The well known protein folding problem is simply defined as the problem of finding a protein’s three-dimensional configuration in a given environment, as a function of the particular combination of amino acids forming the chain [4]. It has proven useful to subdivide protein structure into several levels [11–13], namely:

(1) Primary structure: the linear sequence of amino acids constituting a protein.
(2) Secondary structure: the folding of short segments of the protein into very basic structures such as alpha-helices, beta-sheets and random coils.
    (a) Super-secondary structure: strictly referring to frequently recurring specific local arrangements of secondary structure elements across various types of tertiary structure. A label is often needed in the prediction literature to describe the relative total content of secondary structures in a protein, leading to a classification into four super-secondary structure classes: mostly α, mostly β, mixed αβ and little secondary structure content.
(3) Tertiary structure: the full folding structure of a single strand of protein.
    (a) Motifs and domains: the folding of secondary structure elements into highly recognizable and often recurring structures (helix–turn–helix, beta-barrels, etc.).
(4) Quaternary structure: the structural and functional interaction of various protein strands.

So, more specifically, the problem consists of finding secondary, tertiary and/or quaternary structures based solely on the protein’s primary structure (for a concise introduction to protein structure and function refer to [13]).

In principle one should be able to write up the equations of motion for all the atoms involved and then simply solve the system (see [14] for a concise review of the mathematics involved in such simulations). This is the approach taken in Molecular Dynamics (MD), and it is sometimes used to solve aspects of the protein folding problem. However, MD is not suitable for solving the full problem (all the way from primary to tertiary structure) due to several problems which make this an untenable approach:


• Initial conditions: What should the starting configuration of the protein be for the simulation? A straight line? A random coil?

• Environment: Fundamental questions still remain related to the underlying physical causalities that determine a protein’s structure as a function of its environment (water, lipid bilayer, vacuum) and conditions (temperature, pH).

• Simulation time: According to some estimates, protein folding in vivo takes on the order of milliseconds to seconds to complete [5], but even state-of-the-art MD simulation packages (Gaussian, NAMD, Charmm) are limited to protein dynamics in the nanosecond range (thus a protein folding simulation would have to run for several years to complete).

• Energy minima: Even if a stable “folded” configuration were obtained following a simulation, this would not mean that the conformation found is the bioactive form; it could just be one among a multitude of stable conformations [15,16].

For these reasons, much effort has been invested in finding alternative methods which can give good estimates of protein structure and function without the need to perform a detailed modelling of these highly complex molecules. As discussed in [17], the utility of a structural estimate ultimately depends on its resolution. Thus, at resolutions better than 1.5 Å (so far only obtainable with X-ray crystallography and NMR) one may study catalytic mechanisms and docking, whereas any estimate with a resolution above 4 Å limits the study to the categorization of proteins into functional families based on structural similarity, finding conserved surface residues and providing rough estimates of the functional sites. In principle, such estimates may also be used to partly constrain ab initio MD simulations, e.g. by restricting the initial configurational search space to only a handful of highly probable starting conformations, or could be used as templates to monitor whether an MD simulation is converging onto an unnatural fold (related to the numerous energy minima that are possible), allowing one to stop the simulation in time.

In the present work we will elaborate on one method for obtaining structural estimates at the level of topological fold classes, ultimately aiming at a categorization of new protein sequences into functional families as defined in, for example, the CATH database [1]. To this end we use artificial neural networks (ANN) to extract the statistical relationship that must exist between the different levels of description of protein structure, in order to find the fold class to which the protein belongs. The topological fold class of a protein is a highly useful intermediate description of protein structure, residing somewhere between secondary structure and tertiary structure. However, being mostly a descriptive categorization, it does not provide exact tertiary structure; it only provides a ballpark estimate of how the protein has folded. According to some estimates [18], more than 800 different folds are currently known. At present, determination of these folds is completely dependent on the availability of fully determined structures, and methods for determining the fold classes of proteins that are not included in the Protein Data Bank are therefore sorely needed.

2. Protein fold class predictions—state of the art

In the general framework of machine learning algorithms, artificial neural networks (ANN) hold a prominent place as a highly effective method for extracting statistically significant relationships from complex data sets (see [19] for a concise introduction, and [20,21] for a review of some associated methods), and it is in this capacity, as non-linear mapping functions, that we shall use ANN here. Artificial neural networks have been used on several occasions in attempts to bridge the statistical gap between protein structural levels. In an early study by Qian and Sejnowski [22], neural networks were used to predict secondary structure from local sequence information, with a success rate of 64% at predicting α, β or coil structure. This success rate was also shown to be the best attainable if only local sequence information is used, and is thus a measure of the statistical relevance of local information for global processes like protein folding. Zhang [23] reports a success rate of 66.4% using a hybrid method combining neural networks with elements from artificial intelligence. In one notable study, Rackovsky [24] found that in a single-residue representation only 69% of the sequence elements actually code for structure (or 60% when using a dipeptide frequency code), while the remaining elements are very flexible. There are thus clear physical limits on the amount of global structural information that may be extracted from local information.

Other early studies utilizing perceptron class networks [25] and back-propagation of error networks [26] reached similar results on the prediction of secondary and super-secondary structure. Later models using larger data sets, non-local sequence information (e.g. distance matrices and dipeptide frequency matrices), different architectures and homology searches (searching databases for similar primary structures) have reported success rates of up to 80% on super-secondary structure prediction [27–32].


However, there is still some controversy as to the general validity of such results. For example, based on evidence from a component-coupled method, Wang and Yuan [33] have reported that when dealing with proteins which are non-homologous with respect to the training set (i.e. with less than ∼30% sequence identity), 60% correctness will be the upper limit for a super-secondary structure class prediction from sequence alone (when using the five types defined in [34]: α, β, separated α + β domains, intertwined α/β domains and irregular).

Sequence based prediction of protein tertiary structure has been slow to advance, partly due to lack of data, but also due to the sheer complexity of the problem. In work by Grassmann et al. [35], which considered 42 handpicked folds (less than one tenth of the total possible from the SCOP or CATH databases, see below) and used a set of 268 sequences (defined in [36]), even the best training attempts reached less than 50% accuracy on non-homologous proteins, that is, proteins with less than 35% sequence identity. For homologous proteins, the success rate was, not surprisingly, much higher, at 80%. Similar results were obtained in a more recent survey of statistical methods for protein fold class prediction [32]. By using the “Travelling Salesman Approach” [37] and “Genetic Algorithms” [38], some success in predicting the tertiary structure of a handful of proteins based only on sequence information has been reported. However, the generalizability of these methods has not been properly established, and their requirement for sometimes very special starting conditions voids their usability as general protein fold prediction schemes from sequence alone.

Like several of the works reviewed above, we use the “supervised learning” approach here, which requires the availability of correctly classified input–output data pairs. A fraction of these pairs is presented to the network during an initial training phase in conjunction with the corresponding output set, in such a way that the internal parameters gradually adapt to the data set presented. Performance is subsequently gauged during what is called the test phase, simply by presenting the system with preclassified data which was not included in the training data set and measuring how well the network reproduces the right output. Upon presentation of completely new input data (i.e., not in the original data sets), it is expected that the network will output the best estimate based on what it has learnt of the statistics and rules connecting the known data sets. However, what makes this work unique is our use of very large data sets (the full CATH 2.4), the format of the target data (knot invariants in R^30 space), and the inclusion of singletons only in the test set. Here we define singletons as test proteins whose topological fold classes and/or homology families are not included in the training data set. More generally, singletons are proteins which are unique members of a topology or a sequence homology.

3. Methods and data selection

Criteria for selection of data from the CATH database

According to the CATH database the number of topological folds approaches 500 [1], and this number is even larger in the classification scheme used for the SCOP database, namely 700 [39]. Work done by Zhang and DeLisi [18] argues that the number might be even higher, perhaps >800 according to their estimate. In the present work we will be considering only data available in the CATH database, which according to Getz et al. [40] is in good agreement with the SCOP database, at least at the level of topological folds (in a CATH→SCOP translation, 93% of cases are correctly translated, and in a SCOP→CATH translation 82% of cases are correctly translated). The CATH database classifies proteins according to a hierarchy which includes: class (secondary structure composition), architecture (overall shape of domain structures, ignoring connectivity of secondary structures), topology (fold family which describes overall shape plus secondary structure connectivity) and homologous superfamily (homologous protein domains thought to share a common ancestor). For a full account of this database, and the methods used for the classifications, see [1] or consult the online documentation currently at http://www.biochem.ucl.ac.uk/bsm/cath/cath_info.html.
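CATH codes themselves are four dot-separated integers, one per level of this hierarchy. As a trivial illustration (our own helper, not part of CATH’s tooling):

```python
# Split a CATH code such as "3.40.50.300" into the four hierarchy levels
# discussed above (class, architecture, topology, homologous superfamily).
def parse_cath(code: str) -> dict:
    c, a, t, h = (int(x) for x in code.split("."))
    return {"class": c, "architecture": a, "topology": t, "homology": h}
```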

The training and test sets used in the present work are based on the set of more than 20 000 domains available in version 2.4 of the CATH database, which was used in [2] to construct an automatic protein structure classification algorithm. From each of the 1053 homology classes represented in this data set one representative is chosen for the test set. In 172 of the homology classes we find that only one representative exists, and we refer to these special cases as singletons. These singletons are included only in the test set and are therefore not represented in the training set, so after subtraction of the test set we are left with only 881 homology classes, some of which are multiply represented in the training set. To avoid this bias, and at the same time to get the largest possible sequence diversity in the training


set, we chose the members of the training set for each homology class as follows: the remaining 881 domains are sorted according to the three hierarchical levels in the CATH classification below the homology class level. From this sorted list, 20 representatives with equal spacing in the list are chosen for the training set for each homology class. If the list has fewer than 20 members in a homology class, some domains are chosen repeatedly.
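As a sketch of this sampling step (our own illustration under assumed data structures, not the authors’ code; the `sub_levels` sort key is a hypothetical stand-in for the CATH levels below H):

```python
import numpy as np

def pick_representatives(members, n_reps=20):
    """Choose n_reps equally spaced domains from one homology class,
    sorted by the CATH levels below the homology level; lists shorter
    than n_reps yield repeated picks, as described in the text."""
    ordered = sorted(members, key=lambda d: d["sub_levels"])
    idx = np.round(np.linspace(0, len(ordered) - 1, n_reps)).astype(int)
    return [ordered[i] for i in idx]
```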

Input—dipeptide frequency matrix

In the data that we are considering (see above), the length of proteins varies by more than an order of magnitude, approximately from 16 to 759 amino acids. This makes it impractical to represent the proteins by their sequence directly; instead, it is useful to find a representation which has an equal size for all the proteins, either by running a window of fixed size along the protein backbone (see, e.g., [22] and others), or by transforming each sequence into a fixed-length representation. A well tested method (see [32,41–44]) consists of finding the frequencies of n-tuples of amino acids, which conserves many of the sequence attributes and additionally provides neighborhood information. Furthermore, it was recently shown [45] that an analysis of dipeptide frequency and bias may be used to identify conserved sites and protein motifs within families of proteins, evidence that some structural information is definitely present at the dipeptide frequency level. With 20 amino acids, there are 20^n possible combinations of amino acid n-tuples (one must respect the usual N-terminus to C-terminus directionality of the sequence). In this work we will use n = 2, and we will thus be considering the relative frequencies of 400 dipeptides. To construct a dipeptide frequency matrix one must count all the incidences of a particular dipeptide combination {a_i, a_{i+1}} and divide this by the total number of dipeptides in the sequence (a_i and a_{i+1} represent the i-th and the (i+1)-th amino acids in the sequence along the N→C direction). This procedure is repeated for all the sequences in the test and training data sets as defined above.
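A minimal sketch of this computation (our illustration, not the authors’ code):

```python
from itertools import product
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Index of each ordered dipeptide; "AC" != "CA" (N->C directionality).
DIPEPTIDE_INDEX = {a + b: k for k, (a, b) in
                   enumerate(product(AMINO_ACIDS, repeat=2))}

def dipeptide_frequencies(sequence):
    """Map a sequence to its 400-element vector of relative dipeptide
    frequencies: counts divided by the total number of dipeptides."""
    counts = np.zeros(len(DIPEPTIDE_INDEX))
    for i in range(len(sequence) - 1):
        counts[DIPEPTIDE_INDEX[sequence[i:i + 2]]] += 1
    return counts / max(len(sequence) - 1, 1)
```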

Output—Gauss integral measures of protein structure

Røgen and Bohr [3] and Røgen and Fain [2] have used aspects of knot theory to transform the tertiary structure of any protein into a simple R^30 metric based on Gauss integral measures of protein 3D structure. Essentially, these measures quantify the amount of entanglement of the protein backbone and the correlations of entanglement along the backbone (as illustrated by the example below). Thus, for each dipeptide matrix used as input to the network we have as output a vector containing 30 protein structure measures which are calculated using the methods reported in [2] and [3].

The carbon alpha path of a protein is a piecewise linear curve on which, e.g., the generalized Gauss integral I_(1,4)(2,3) is given as the sum

$$I_{(1,4)(2,3)} = \sum_{0 < i_1 < i_2 < i_3 < i_4 < \#\text{residues}} W(i_1, i_4)\, W(i_2, i_3),$$

where each W(i, j) is a measure of how entangled the i-th and the j-th line segments are [3]. This entanglement measure corresponds to the induction on one of the line segments coming from a unit current through the other line segment. As the induction between coplanar line segments is zero, the only non-zero W(i, j) on the curves shown in Fig. 1 occur when the curve passes over or under itself. In the limit where the curves get squeezed into the plane of this paper, the W(i, j) are normalized to give ±1 for each crossing, where the sign is given by the usual right-hand rule. If the left curve in Fig. 1 is considered to consist of many small line segments and is traversed from left to right, then the left crossing, involving line segments i*_1 and i*_2 with i*_1 < i*_2, and the right crossing, involving line segments i*_3 and i*_4 with i*_3 < i*_4, each give one non-zero W(i, j). Thus we have W(i*_1, i*_2) ≈ +1 and W(i*_3, i*_4) ≈ +1, where i*_1 < i*_2 < i*_3 < i*_4, and all other W(i, j) ≈ 0. The only non-zero product of the form W(i, j)W(k, l) with i < j, k < l and i < k is thus W(i*_1, i*_2)W(i*_3, i*_4) ≈ (+1)(+1) = 1, where i*_1 < i*_2 < i*_3 < i*_4. Hence, the Gauss integral I_(1,2)(3,4) ≈ 1, whereas the Gauss integrals I_(1,3)(2,4) and I_(1,4)(2,3) are both ≈ 0, as neither of them contains this non-zero term. Fig. 1 illustrates how three curves are identified by these three generalized Gauss integrals. The Gauss integrals considered in [2] include all possible combinations of two and of three crossings.
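For intuition, the two-crossing integrals can be written directly as sums over ordered quadruples. The quadruple loop below is the literal definition (our illustration; a practical implementation would avoid the O(n^4) loop, and W here is assumed to be a precomputed matrix of pairwise segment entanglements):

```python
import numpy as np

def gauss_integrals_order2(W):
    """Given W[i, j] for segment pairs i < j, return the two-crossing
    Gauss integrals I_(1,2)(3,4), I_(1,3)(2,4), I_(1,4)(2,3) by summing
    products over all quadruples i1 < i2 < i3 < i4."""
    n = W.shape[0]
    I1234 = I1324 = I1423 = 0.0
    for i1 in range(n):
        for i2 in range(i1 + 1, n):
            for i3 in range(i2 + 1, n):
                for i4 in range(i3 + 1, n):
                    I1234 += W[i1, i2] * W[i3, i4]
                    I1324 += W[i1, i3] * W[i2, i4]
                    I1423 += W[i1, i4] * W[i2, i3]
    return I1234, I1324, I1423
```

On the idealized left curve of Fig. 1, with W[i*_1, i*_2] = W[i*_3, i*_4] = 1 and all other entries zero, this returns (1, 0, 0), matching the example above.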

Neural network architecture

In the present work we have chosen to use the standard method of error back-propagation, because this method’s efficiency, limitations and implementation issues have been copiously documented [19,46–52], allowing us to focus on the task of describing and validating the input/output data sets that were used for training and testing. We used a simple network with one input layer, one hidden layer and one output layer.


Fig. 1. Three curves that are each traversed from left to right. The Gauss integral I_(1,2)(3,4) is very close to 1, 0, and 0 when evaluated on these curves. The Gauss integral I_(1,3)(2,4) is very close to 0, 1, and 0, and finally the Gauss integral I_(1,4)(2,3) is very close to 0, 0, and −1 when evaluated on these curves. The minus occurs as the lower crossing of the right curve is the only negative crossing on the figure. The three curves can thus be identified based on the values of the three Gauss integrals considered.

The number of nodes in the input layer, n_in, depended on whether it had to receive a full dipeptide frequency matrix with 400 elements, or only the first 87 principal components of such a matrix. The number of nodes in the hidden layer, n_hid, was varied from 25 to 250 (the best results were obtained for 150 nodes). The output layer always contained n_out = 30 nodes, corresponding to the 30 Gauss integrals necessary to capture the topology of each protein [2,3]. In the following description, nodes belonging to the input, hidden and output layers are indexed by i, j and k respectively.

For the readers’ convenience we include a brief description of the back-propagation rule as we have used it here, but see [19,51] for a more detailed derivation. During training, the activation levels of all the units in the input layer, A_i, were updated using the values in the dipeptide frequency matrix, D_i, corresponding to the currently presented protein. Activation levels of the hidden units, A_j, were obtained as

$$A_j = g\Bigl(\sum_{i=1}^{n_{\text{in}}} w_{ij} A_i\Bigr) \qquad (1)$$

where w_{ij} is the connection strength between the i-th neuron in the input layer and the j-th neuron in the hidden layer, and g(x) is a sigmoidal function of the form

$$g(x) = \frac{1}{1 + \exp(-\lambda x)}. \qquad (2)$$

In a similar fashion, the activation levels of the output units, A_k, were calculated from the hidden layer activation:

$$A_k = g\Bigl(\sum_{j=1}^{n_{\text{hid}}} w_{jk} A_j\Bigr). \qquad (3)$$

The resulting activation of the output units was then compared to the pre-calculated target values, T_k, corresponding to the protein presented, in order to calculate the error of each output unit, ε_k = T_k − A_k. To minimize this output error, we perform gradient descent on the error surface by applying the delta rule (see, e.g., [19,51] for a very clear derivation of this rule). Thus, for each output unit k, we modify the connection strengths, w_{jk}, coming from each and every hidden unit j, according to the following rule:

$$\Delta w_{jk} = \rho\, \varepsilon_k\, g'(A_k)\, A_j \qquad (4)$$

where ρ ≪ 1 is the learning rate, and the derivative of the activation function is g'(x) = λ g(x)(1 − g(x)). For the hidden layer we do not have an explicit error expression, so instead we use an estimated error obtained by propagating the output error backwards through the network, weighting the contributions according to the connection strengths to the output layer:

$$\varepsilon_j = \sum_{k=1}^{n_{\text{out}}} \varepsilon_k\, g'(A_k)\, w_{jk} \qquad (5)$$


Fig. 2. Schematic representation of the information flow through the prediction system developed here. The three different levels shown are: sequence preprocessing to produce dipeptide frequency matrices, neural network simulation, and translation of the resulting knot invariants into a CATH classification (network outputs correspond to directions in the R^30 knot invariant space). During training of the neural network, dipeptide frequency matrices are imposed onto the input layer and the weights throughout are updated using the back-propagation of error method with precomputed knot invariant values as targets (the CATH translation is not used in this situation). During network testing, dipeptide frequency matrices obtained from sequences in the test set are imposed onto the input layer, and the resulting network outputs are fed directly into the knot-invariant-to-CATH translation system. The resulting CATH values are subsequently compared to the correct CATH values, and performance is evaluated.

thus leading to a simple learning rule for the connections to the hidden layer, namely:

$$\Delta w_{ij} = \rho\, \varepsilon_j\, g'(A_j)\, A_i. \qquad (6)$$
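To make Eqs. (1)–(6) concrete, here is a compact NumPy sketch of one training presentation (our illustration, not the authors’ C++/MPI code; the layer sizes follow the text, while λ, ρ and the weight initialization are assumed values):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 400, 150, 30   # layer sizes from the text
lam, rho = 1.0, 0.01                # assumed gain and learning rate
w_ij = rng.normal(scale=0.05, size=(n_in, n_hid))   # input -> hidden
w_jk = rng.normal(scale=0.05, size=(n_hid, n_out))  # hidden -> output

def g(x):                 # Eq. (2): sigmoid with gain lambda
    return 1.0 / (1.0 + np.exp(-lam * x))

def g_prime(a):           # g'(x) = lam * g(x) * (1 - g(x)), via activations
    return lam * a * (1.0 - a)

def train_step(D, T):
    """One presentation: D is a (400,) dipeptide frequency vector,
    T the (30,) knot invariant target. Returns the output RMS error."""
    global w_ij, w_jk
    A_i = D
    A_j = g(A_i @ w_ij)                                  # Eq. (1)
    A_k = g(A_j @ w_jk)                                  # Eq. (3)
    eps_k = T - A_k                                      # output error
    eps_j = (eps_k * g_prime(A_k)) @ w_jk.T              # Eq. (5)
    w_jk += rho * np.outer(A_j, eps_k * g_prime(A_k))    # Eq. (4)
    w_ij += rho * np.outer(A_i, eps_j * g_prime(A_j))    # Eq. (6)
    return float(np.sqrt(np.mean(eps_k ** 2)))
```

An epoch in the sense used below then amounts to one call of train_step per training pair, with the returned errors averaged to monitor convergence.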

During training the network is repeatedly presented with all the predetermined input–output pairs until the output root mean squared error of the network stabilizes or until the maximum number of epochs is reached (an epoch is here defined as one complete presentation of all the training vectors). Two training protocols were followed: (1) the number of hidden units was varied, and training proceeded until reaching the maximum number of epochs; and (2) the number of hidden units was held constant and the number of epochs required for completion was varied. In all cases, the networks corresponding to the different training protocols were implemented using custom-made software (programmed in C++) within a parallel processing environment (MPI) residing on a 64-node G4 Mac OS X cluster. Given the large set of data that was used, parallel processing was necessary because the CPU time required for training a network with 150 hidden units for 12 000 epochs on a single G4 processor was 700 h.

Network evaluation—class, architecture, topology and homology

According to the results reported in [2], topological folds in the CATH 2.4 database may be clustered within an R^30 metric in such a way that a highly accurate and noise tolerant transformation between the two types of description may be obtained. By feeding the output from the neural network into such a cluster classification system, the network’s output may thus be directly translated into a CATH classification code. This gives us a direct measure of the network’s efficacy. Fig. 2 shows a schematic example of the process from input sequence to output CATH code.
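A sketch of this translation step (our illustration; `cluster_centres` and `cluster_cath` are hypothetical stand-ins for the CATH clusters of [2]):

```python
import numpy as np

def to_cath(output_vec, cluster_centres, cluster_cath):
    """Assign a network output in R^30 the CATH code of its nearest
    cluster; the distance itself can serve as a confidence measure.
    output_vec: (30,); cluster_centres: (n_clusters, 30)."""
    d = np.linalg.norm(cluster_centres - output_vec, axis=1)
    nearest = int(np.argmin(d))
    return cluster_cath[nearest], float(d[nearest])
```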

The clustered-continuum nature of protein structure space has also been suggested by Rackovsky [53], who samples the frequencies with which a set of approximately 60 representative local 4-mer configurations occur along each protein backbone. Rackovsky hereby represents protein structures in a Euclidean feature space subject to what is denoted a “short length structural degeneracy” in [53], which means that two different folds may be built from the same set of local geometries in a different order and hence be assigned the same feature vector. In other words, a Rackovsky feature vector cannot be transformed into, e.g., a CATH classification code, as is possible with the feature vectors used in this work. In the following section we will therefore report our results using the R^30 metric developed by Røgen and Fain [2].


Table 1
Prediction scores measured as the percentage of correctly classified proteins in different subclasses of the test set

                                             Class (4) (%)   Architecture (37) (%)   Topology (591) (%)
Non-singletons                                    77                 60                     48
Homological (H-)singletons                        41                 17                      3
Homological and structural (HS-)singletons        44                 25                      1
Unbiased random guessing score                    25                  2.7                    0.17

Values in parentheses indicate the total number of subdivisions within each classification in CATH 2.4. See the text for further details.


4. Results

The best results when testing the network’s ability to generalize (testing singletons) were obtained when using 150 hidden units and training for 12 000 epochs. However, we observed no significant difference in performance between using the full dipeptide frequency matrix and a reduced version composed of 85 principal components. In what follows we will report the results obtained when using the full dipeptide frequency matrix as input.
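For the reduced-input variant, the preprocessing could look as follows (our sketch; the paper does not state which PCA implementation was used):

```python
from sklearn.decomposition import PCA

# X_train, X_test: dipeptide frequency matrices of shape (n_proteins, 400).
pca = PCA(n_components=85).fit(X_train)   # fit on training data only
X_train_pc, X_test_pc = pca.transform(X_train), pca.transform(X_test)
```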

4.1. Prediction scores

We have considered three different subclasses in the test set: 881 sequences for which members of their homologous superfamily were included in the training set (non-singletons), 77 singletons representing fold classes included in the training set, here called homological singletons (H-singletons), and 95 singletons representing fold classes not included in the training set, here called homological and structural singletons (HS-singletons). Table 1 shows the prediction scores measured as the percentage of correctly classified proteins, related to the training set and to the different subdivisions of the test set.

As expected, the best scores were obtained for non-singleton sequences, again highlighting the importance of homology for accurate prediction. At the super-secondary structural level (Class or C-level), 77% correct prediction was possible, which compares favourably with earlier work as reported by Kneller et al. [27], Rost and Sander [28], Ruggiero et al. [29], Petersen et al. [30], Andersen et al. [31] and Edler et al. [32]. At the topology level ∼50% correct prediction was possible, which at first might seem low if compared directly with the ∼70% scores reported by Grassmann et al. [35] and Edler et al. [32]. However, these previous reports considered only the 42 fold classes defined in [36], whereas in our work we have considered a set of fold classes ten times larger (∼500), and hence our network is able to correctly classify ∼250 folds, which is an improvement on previous work (and about 240 times better than random guessing). As expected, singleton sequences did not score very well at the topology level, and only a modest difference was observed in the scoring results between H-singletons and HS-singletons.

As mentioned in the Methods section, the network’s output corresponds to a representation of protein structure in R^30, where the dimensions consist of knot invariant values. According to the analysis given by Røgen and Fain [2], proteins represented in such an R^30 metric are found to be distributed into well separated clusters, where each cluster corresponds to a particular CATH classification (henceforth a CATH cluster). Thus, in order to obtain a correct CATH classification using their translation scheme from knot invariants [2], the candidate protein’s knot invariant values must bring it into the vicinity of one such CATH cluster. Conversely, it follows that most of the misclassifications that we have observed in our simulations are due to the fact that the neural network in some cases produces output vectors which are far removed from any candidate CATH cluster. This allows for some interesting possibilities which we will now look into.

Inspection of the network’s outputs reveals two highly interesting features of the output data: it is possible to determine very directly the certainty that may be attributed to a prediction, and it is possible to distinguish between singletons and non-singletons with a high degree of confidence. In order to do so it is necessary to calculate the Euclidean distance between the network’s output (for each protein in the test set) and the target data (for each protein in the training set). For each of the network’s output vectors we then find the nearest and the second-nearest distance to clusters in the training vector set. Fig. 3 shows a plot of the nearest versus the second-nearest cluster in the two different cases: (A) when determining distances between the target test set and the target training set, and (B) when using the distances between the neural network’s test outputs and the target training set.

Fig. 3. Plot representing the minimum versus the next-to-minimum distances in R^30. The upper plot shows a distance analysis for the target values in the test set against the target values in the training set. The lower plot shows the corresponding analysis of the actual output from the network, when presented with test data, against the target values in the training set.

If we consider only the minimum distances, and we sort them according to whether they correspond to correctly predicted non-singleton proteins (CN), falsely predicted non-singletons (FN), homological singletons (H-S) or homological and structural singletons (HS-S), one obtains the probability distributions indicated in Fig. 4 (respectively in that order from left to right). Using these distributions we may sort any given network output, for example one corresponding to a novel protein sequence, into one of these categories. By normalizing these probability distributions with respect to the sum of probabilities at every minimum distance, a prediction confidence plot is obtained (see Fig. 5). Thus, for example, if we have a novel protein sequence for which we have predicted the folding class, and for which the minimum distance to proteins in the training set is close to one, we may consult Fig. 5 to conclude that we are 70% confident in the correctness of our fold classification, and that we are over 95% certain that it is a non-singleton sequence. Conversely, if the minimum distance is larger than five, we must conclude that our predicted fold class is most probably incorrect, but we will be 85% certain that it is a singleton protein, which in any case would require special attention.
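The distance analysis itself is straightforward to reproduce in outline (our sketch, with assumed array shapes; the cutoff of 5.0 simply mirrors the minimum-distance-above-five reading of Fig. 5):

```python
import numpy as np

def nearest_two(outputs, train_targets):
    """outputs: (n_test, 30) network outputs; train_targets: (n_train, 30).
    Returns the minimum and next-to-minimum Euclidean distances per output."""
    d = np.linalg.norm(outputs[:, None, :] - train_targets[None, :, :], axis=2)
    d.sort(axis=1)
    return d[:, 0], d[:, 1]

def flag_probable_singletons(d_min, cutoff=5.0):
    # A large minimum distance means the output is far from every training
    # cluster, i.e. a probable singleton needing special attention.
    return d_min > cutoff
```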

5. Discussion and future work

Ideally, the protein fold prediction problem should be solved by finding the “stepping stones” that causally interconnect the different levels of protein structure description. However, it has proven very difficult to bridge the gaps that exist between these levels, both in theory (biophysical analysis) and in practice (bioinformatics). Several attempts have been made at producing useful predictions of secondary structure based on primary structure (see the review in Section 2), but a limit has been reached at about 80% prediction accuracy. Although this seems quite impressive, some reports have concluded that the amount of information that may be gained from noisy predictions of secondary structure is not sufficient to tell us much about the tertiary and quaternary structures. Thus, Solis and Rackovsky [54] conclude that less than a third of the information encoded in secondary structure may be retrieved even with a prediction accuracy of 75%.


Fig. 4. Approximate probability distribution of the minimum distances calculated between the output from the network when presented with the test data and the actual test targets. Four different categories are considered, namely CN (correctly identified non-singletons), FN (failed identification of non-singletons), HS (homological singletons) and HSS (homological and structural singletons).

Fig. 5. Prediction confidence that a given guess of fold class is correct when the minimum distance between the network’s test output and the test targets is known. Subdivisions are as in Fig. 4.

This is a strong argument in favour of developing methods, like the one presented here, that can predict fold classes directly rather than having first to go through a secondary structure prediction phase. Multivariate analysis, for example in the guise of artificial neural networks, may help us in this endeavour in various ways, simply because such methods only depend on the availability of good quality structural data and do not require a full description of the intervening levels.

Even though the number of published sequences far outnumbers the number of known tertiary structures for proteins, we think that the number of examples in some of the larger databases is sufficient to provide at least a representative view of protein fold space. The daunting task that remains is to close the statistical gap that exists between the description of a protein as an amino acid sequence and its description as a fold class. Here we have proposed bridging part of this gap by using a recently developed representation of protein structure [2,3], in which proteins exist as points in an R^30 space representing knot invariant values. In this R^30 space, which is continuous in all directions, protein fold classes exist as clearly identifiable clouds, or clusters, and it is precisely this property which makes this representation interesting from a neural network point of view: inputs are defined in a continuous and closed space, as are the outputs, thereby tremendously simplifying the mapping task left to the network. As reported above, this method produces predictions of fold class topology with an accuracy of 48% out of a total fold class library containing ∼500 folds [1]. This result has to be measured against the accuracy when guessing at random (∼0.2%), and it also represents an improvement on previous work [32,35] that reported higher scores but only considered 42 handpicked folds, in a set-up where a test with new folds (topological singletons) could not be performed in a meaningful way.

As a further gain, we have observed that when the network is unable to perform the correct mapping, for example when subjected to singletons (here defined as protein sequences representing fold classes not existing in the training set), the network will produce a point in R^30 space which is far removed from all other points in the output training set. By plotting the distance of these singleton points to the two nearest neighbors in the training data set, one obtains a diagonal of unitary slope. Points on such a diagonal usually coincide with the presence of a novel or unknown fold class, so in this way the network may be used to detect fold classes which might need closer (manual) inspection. Also, for any predicted fold class, a direct quantification of the prediction confidence can be given based on the minimum distance of the R^30 spatial coordinate of the corresponding protein with respect to proteins in the training set.

In the present work, as in the work by the other groups reviewed earlier, much effort has been invested in trying to maximize the prediction performance of the reported methods: for example, by changing the number of hidden units, by tweaking the various learning parameters, by searching for an optimal number of training epochs, by using a variety of optimized error functions, by applying different network architectures, by using a large number of different data sets, and even by changing the representation, format and interpretation of the data. Nevertheless, correctly predicting fold classes for non-homologous proteins still remains an elusive goal. The clue to this lack of success might be hidden in an observation made by Hornik et al. [52], after showing that feedforward neural networks are a class of universal approximators: “. . . any lack of success in applications must arise from inadequate learning, insufficient numbers of hidden units or the lack of a deterministic relationship between input and target” (italics added). What could be the nature of the non-deterministic content of the data sets thus suggested? One obvious omission in most work on statistically based protein fold prediction is the effect that the environment necessarily must have on the folding process. It is therefore suggested that the folding environment might account for a large fraction of the non-determinacy of the commonly used data sets. Further research into these matters is therefore warranted and is currently being pursued in our group.

References

[1] C.A. Orengo, A.D. Michie, S. Jones, D.T. Jones, M.B. Swindells, J.M. Thornton, CATH—a hierarchic classification of protein domain structures, Structure 5 (8) (1997) 1093–1108.
[2] P. Røgen, B. Fain, Automatic classification of protein structure by using Gauss integrals, Proceedings of the National Academy of Sciences of the United States of America 100 (1) (2003) 119–124.
[3] P. Røgen, H.G. Bohr, A new family of global protein shape descriptors, Mathematical Biosciences 182 (2) (2003) 167–181.
[4] H.S. Chan, K.A. Dill, The protein folding problem, Physics Today (February) (1993) 24–32.
[5] C.B. Anfinsen, Principles that govern the folding of protein chains, Science 181 (4096) (1973) 223–230.
[6] P. Atkins, J. de Paula, Physical Chemistry, 7th edition, W. Freeman, 2004.
[7] B. Hille, Ion Channels of Excitable Membranes, 3rd edition, Sinauer Associates, 2001.
[8] E. Tajkhorshid, P. Nollert, M. Jensen, L.J.W. Miercke, J. O’Connell, R.M. Stroud, K. Schulten, Control of the selectivity of the aquaporin water channel family by global orientational tuning, Science 296 (April) (2002) 525–530.
[9] J.D. McPherson et al., A physical map of the human genome, Nature 409 (2001) 934–941.
[10] J.C. Venter et al., The sequence of the human genome, Science 291 (5507) (2001) 1304–1351.
[11] K.U. Linderstrøm-Lang, The Lane Medical Lectures, Stanford University Press, Stanford, California, 1952.
[12] K.U. Linderstrøm-Lang, J.A. Shellman, Protein structure and enzyme activity, in: P.D. Boyer (Ed.), The Enzymes, vol. 1, Academic Press, New York, 1959, pp. 443–510.
[13] A.M. Lesk, Introduction to Protein Structure, Oxford University Press, Oxford, UK, 2001.
[14] A. Neumaier, Molecular modeling of proteins and mathematical prediction of protein structure, SIAM Review 39 (3) (1997) 407–460.
[15] S. Abdali, M. Jensen, H.G. Bohr, Energy levels and quantum states of [leu]enkephalin conformations based on theoretical and experimental investigations, Journal of Physics: Condensed Matter 15 (2003) S1853–S1860.
[16] B.G. Nielsen, M.Ø. Jensen, H.G. Bohr, The probability distribution of side-chain conformations in [leu] and [met]enkephalin determines the potency and selectivity to µ and δ opiate receptors, Biopolymers (Peptide Science) 71 (2003) 577–592.
[17] D. Baker, A. Sali, Protein structure prediction and structural genomics, Science 294 (2001) 93–96.
[18] C. Zhang, C. DeLisi, Estimating the number of protein folds, Journal of Molecular Biology 284 (1998) 1301–1305.
[19] J. Hertz, A. Krogh, R.G. Palmer, Introduction to the Theory of Neural Computation, in: Santa Fe Institute Lecture Notes, vol. I, Addison-Wesley, Redwood City, CA, 1991.
[20] F. Murtagh, Neural networks and related ‘massively parallel’ methods for statistics: A short overview, International Statistical Review 62 (3) (1994) 275–288.
[21] B.D. Ripley, Neural networks and related methods for classification, Journal of the Royal Statistical Society Series B 56 (3) (1994) 409–456.
[22] N. Qian, T.J. Sejnowski, Predicting the secondary structure of globular proteins using neural network models, Journal of Molecular Biology 202 (1988) 865–884.
[23] X. Zhang, A hybrid algorithm for determining protein structure, IEEE Expert 9 (4) (1994) 66–74.
[24] S. Rackovsky, On the nature of the protein folding code, Proceedings of the National Academy of Sciences of the United States of America 90 (1993) 644–648.
[25] H.G. Bohr, J. Bohr, S. Brunak, R.M.J. Cotterill, B. Lautrup, L. Nørskov, O.H. Olsen, S.B. Petersen, Protein secondary structure and homology by neural networks, FEBS Letters 241 (1988) 223–228.
[26] L.H. Holley, M. Karplus, Protein secondary structure prediction with a neural network, Proceedings of the National Academy of Sciences of the United States of America 86 (1989) 152–156.
[27] D.G. Kneller, F.E. Cohen, R. Langridge, Improvements in protein secondary-structure prediction by enhanced neural networks, Journal of Molecular Biology 214 (1990) 171–182.
[28] B. Rost, C. Sander, Prediction of protein secondary structure at better than 70% accuracy, Journal of Molecular Biology 232 (1993) 584–599.
[29] C. Ruggiero, R. Sacile, G. Rauch, Peptides secondary structure prediction with neural networks: A criterion for building appropriate learning sets, IEEE Transactions on Biomedical Engineering 40 (11) (1993) 1114–1121.
[30] T.N. Petersen, C. Lundegaard, M. Nielsen, H.G. Bohr, J. Bohr, S. Brunak, G.P. Gippert, O. Lund, Prediction of protein secondary structure at 80% accuracy, PROTEINS: Structure, Function and Genetics 41 (2000) 17–20.
[31] C.A. Andersen, H.G. Bohr, S. Brunak, Protein secondary structure: category assignment and predictability, FEBS Letters 507 (2001) 6–10.
[32] L. Edler, J. Grassmann, S. Suhai, Role and results of statistical methods in protein fold class prediction, Mathematical and Computer Modelling 33 (2001) 1401–1417.
[33] Z. Wang, Z. Yuan, How good is prediction of protein structural class by the component-coupled method, PROTEINS: Structure, Function, and Genetics 38 (2000) 165–175.
[34] H. Nakashima, K. Nishikawa, T. Ooi, The folding type of a protein is relevant to the amino acid composition, Journal of Biochemistry 99 (1986) 153–162.
[35] J. Grassmann, M. Reczko, S. Suhai, L. Edler, Protein fold class prediction: New methods of statistical classification, in: Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, 1999, pp. 106–112.
[36] M. Reczko, H.G. Bohr, The DEF data base of sequence based protein fold class predictions, Nucleic Acids Research 22 (17) (1994) 3616–3619.
[37] H.G. Bohr, S. Brunak, A travelling salesman approach to protein conformation, Complex Systems 3 (1989) 9–28.
[38] X. Guan, R.J. Mural, E.C. Uberbacher, Protein structure prediction using hybrid AI methods, in: Proceedings of the 10th Conference on Artificial Intelligence Applications, 1994, pp. 471–473.
[39] A.G. Murzin, S.E. Brenner, T. Hubbard, C. Chothia, SCOP: a structural classification of the protein database for the investigation of sequence and structures, Journal of Molecular Biology 247 (1995) 536–540.
[40] G. Getz, M. Vendruscolo, D. Sachs, E. Domany, Automated assignment of SCOP and CATH protein structure classifications from FSSP scores, PROTEINS: Structure, Function and Genetics 46 (2002) 405–415.
[41] B. Rost, C. Sander, Combining evolutionary information and neural networks to predict protein secondary structure, PROTEINS: Structure, Function and Genetics 19 (1994) 55–72.
[42] N. Stambuk, P. Konjevoda, New computational algorithm for the prediction of protein folding types, International Journal of Quantum Chemistry 84 (2002) 13–22.
[43] R. Luo, Z. Feng, J. Liu, Prediction of protein structural class by amino acid and polypeptide composition, European Journal of Biochemistry 269 (2002) 4219–4225.
[44] C. Yu, J. Wang, J. Yang, P. Lyu, C. Lin, J. Hwang, Fine-grained protein fold assignment by support vector machines using generalized n-peptide coding schemes and jury voting from multiple-parameter sets, Proteins: Structure, Function and Genetics 50 (2003) 531–536.
[45] S.R. Campion, A.S. Ameen, L. Lai, J.M. Kin, T.N. Munzenmaier, Dipeptide frequency/bias analysis identifies conserved sites of nonrandomness shared by cysteine-rich motifs, Proteins: Structure, Function and Genetics 44 (2001) 321–328.
[46] S.I. Amari, Theory of adaptive pattern classifiers, IEEE Transactions on Electronic Computers EC-16 (1967) 299–307.
[47] A.E. Bryson, Y.-C. Ho, Applied Optimal Control, Blaisdell, New York, 1969.
[48] P. Werbos, Beyond regression: New tools for prediction and analysis in the behavioral sciences, Ph.D. Thesis, Harvard University, Cambridge, Mass., 1974.
[49] D.B. Parker, Learning logic, Technical Report TR-47, Massachusetts Institute of Technology, Cambridge, Mass., 1985.
[50] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning internal representations by error propagation, in: D.E. Rumelhart, J.L. McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, MIT Press, Cambridge, Mass., 1986.
[51] M.H. Hassoun, Fundamentals of Artificial Neural Networks, The MIT Press, Cambridge, Massachusetts, 1995.
[52] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural Networks 2 (1989) 359–366.
[53] S. Rackovsky, Quantitative organization of the known protein X-ray structures. I. Methods and short-length-scale results, Proteins: Structure, Function and Genetics 7 (1990) 378–402.
[54] A.D. Solis, S. Rackovsky, On the use of secondary structure in protein structure prediction: a bioinformatic analysis, Polymer 45 (2004) 525–546.