A new family of global protein shape descriptors


Peter Røgen a,*,1, Henrik Bohr b,2

a Department of Mathematics, Technical University of Denmark, Building 303, DK-2800 Kongens Lyngby, Denmark

b QUT, The Quantum Protein Center, Department of Physics, Technical University of Denmark, Building 303, DK-2800 Kongens Lyngby, Denmark

Received 28 June 2002; received in revised form 7 October 2002; accepted 7 November 2002

Abstract

A family of global geometric measures is constructed for protein structure classification. These measures originate from integral formulas of Vassiliev knot invariants and give rise to a unique classification scheme. Our measures can better discriminate between many known protein structures than the simple measures of the secondary structure content of these protein structures.

© 2003 Elsevier Science Inc. All rights reserved.

Keywords: Protein structure classification; Writhe; Average crossing number; Gauss integrals; Crossing configuration

1. Introduction

In this paper we address the issue of finding global measures that can geometrically characterize, and therefore classify, the 3-dimensional structures of proteins. One reason for seeking global geometric measures is that it is proven in [1] that, if protein backbones are represented as framed space curves, then any local geometric description, such as curvature and torsion, is discontinuous and thus unsuited for describing protein structures.

The most obvious classification of proteins is derived from the primary sequences of their amino acids and, although not fully automated and uniquely defined, this is performed by utilizing various alignment schemes. As long as one can successfully carry out a multiple alignment of the various primary sequences, one can define a similarity measure by counting the number of identical residue matches that arise from the optimal alignment. However, such a classification does not necessarily have anything to do with the similarity of the proteins' 3-dimensional structures. Thus one can define protein families and super families based on the proteins' sequence identity by matching pairs in the alignments.

Mathematical Biosciences 182 (2003) 167–181

www.elsevier.com/locate/mbs

* Corresponding author. Tel.: +45-4525 3044/3031; fax: +45-4588 1399. E-mail addresses: [email protected] (P. Røgen), [email protected] (H. Bohr).

1 The work was supported by a DTU-grant and by a grant from Carlsbergfondet.

2 The work was supported by Grundforskningsfonden.

0025-5564/03/$ - see front matter © 2003 Elsevier Science Inc. All rights reserved.

doi:10.1016/S0025-5564(02)00216-X

The concept that concerns the 3-dimensional structural homology of proteins is often termed protein folds and protein fold classification. It is done partly by visual inspection using computer graphics [2,3], visualizing the 3-dimensional protein structure from the Cartesian coordinates given in the crystallographic databases [4]. For NMR structures one also has to deal with the multiple backbone structures that arise from that methodology. An important element in the identification of protein structures from computer graphics is the periodic secondary structure elements, the helices, the sheets, and the turns, which appear as 'landmarks' that are often highlighted by ribbon representations. The basic problem with these graphical representations is the difficulty of going from a structure in 3 dimensions to various 2-dimensional projections seen on the computer screen. It is crucial to choose the right projections in order to get the right understanding of the 'topology' of the secondary structures, i.e., to determine how the various secondary structures are connected.

The problem with all the possible 2-dimensional projections made us think of a rigorous mathematical way of measuring the 3-dimensional structure that models observation of planar projections. This led us to use geometric measures known from knot theory, namely the writhe and generalizations thereof. Examples of such work are found in Ref. [5] and the references therein, in which the possibility of seeing n crossings of a backbone from an arbitrarily chosen direction in space is considered. Here we will not consider crossing number frequency distributions. Instead, the measures we introduce count the number of times any given configuration of crossings is seen in the case of an almost planar curve. The two simplest of the structural measures we consider are called the writhe and the average crossing number and have previously been applied to analyse protein structures. Levitt uses the writhe to distinguish different chain threadings in Ref. [6]. Arteca and Tapia use the average crossing number and the most probable overcrossing number as protein shape descriptors in Ref. [7].

We shall first discuss the various phenomenological classification schemes for protein structures and then turn to the knot-theoretically inspired treatments of the protein backbones.

2. A phenomenological look at protein folds

The individual 3-dimensional protein structures determined by NMR and X-ray crystallography can be grouped into a smaller number of characteristic structural classes. These structural classes consist of domains from homologous proteins with similar 'chain topological' configurations of their backbone atoms [8]. Classification of protein structures has given rise to several useful databases, such as SCOP [3] and CATH [2]. These structural domains, or the so-called folds of the proteins, were introduced in order to clarify the notion of structural similarity. Such fold classes could contain entire proteins or well-defined sub-domains of proteins, depending on the length and the complexity of the given protein.

Pascarella and Argos [8] have used chain topological similarity as a measure of fold class homology, while Holm and Sander [9] have used similarity of distance matrices to determine fold class membership. Orengo et al. [10] have reported a classification of all proteins in the protein structural database into more than 100 folds from structural comparison. Common to these studies, except perhaps for that of Holm and Sander, is that they define fold classes in a rather ad hoc manner. They do not provide a unique way of determining which fold class an entirely new protein belongs to. In the case of Holm and Sander there is a procedure for determining the fold class, provided the proteins have well-structured distance matrices (i.e., with clearly separated sub-domains).

While it is feasible to determine membership of a fold class once the 3-dimensional structure of the protein is constructed, earlier efforts to predict fold classes from sequences alone have achieved little success [11]. In the elaborate methods of Jones and co-workers [12] the fold class membership of a protein is determined by fitting primary as well as tertiary structures. Similarly, the methods of Sippl [13] and Gribskov et al. [14] make use of primary structure fitting through multiple alignment, potentials and profiles.

Successful fold class predictions from sequences are those cases where there is significant sequence homology between the protein whose structure is to be determined and one whose structure is established. Most frequently, sequences that have large mutual homology are known to belong to the same fold class. There are, however, cases where the members of a particular fold class have little mutual sequence homology. For example, the proteins adenosine deaminase (1add), aldolase A (1ald), aldose reductase (1ads), the first domain of cyclodextrin glycosyltransferase (1cdg), beta-amylase (1btc), endo-1,4-beta-D-glucanase (1tml), the second domain of chloromuconate cycloisomerase (1chr.A), etc. [14], all belong to the 'barrel' class, and the sequence homology between any pair of these is insignificant. Their mutual structural homology is, of course, large.

In most definitions of fold classes, the members would have more than 50% sequence identity to each other, although domains with far less sequence similarity could belong to the same class. It is important that each protein within a class has a structure with a large topological similarity and a similar packing pattern to the other members of the class. The details of the primary sequence in itself are less important. The notion of fold classes is important for predicting new protein structures using homology modeling. In homology modeling an unknown 3-dimensional protein structure is inferred from other known 3-dimensional protein structures whose amino acid sequences are similar to the sequence of the protein in question.

One can make a crude classification of protein domains into what we call super fold classes by simply distinguishing them by their content of secondary structures. Such a super classification might actually also turn out to be deeply connected to the folding process and could also give rise to a measure of distance among the fold classes, in the sense that the folds that differ most in secondary structure content are farthest apart. We thus define four superclasses: (1) the class of pure alpha helices (denoted α), (2) the class with only beta sheets (denoted β), (3) the class with alpha helices and beta sheets clearly separated (written α+β) and finally, (4) the class of folds having alpha helices and beta sheets entangled (denoted α/β). These four classes are very well illustrated by the four prototypical proteins shown in Figs. 1 and 2.

3. Motivation for the measures

The methodology we propose for classifying space curves representing proteins is, as mentioned above, concerned with crossings seen in planar projections of the curves. In Knot Theory such planar projections are called knot diagrams and serve as a fundamental tool for dealing with knots. Here Knot Theory can only be a matter of inspiration, since knots are defined on closed curves while we treat open curves for proteins. Nevertheless, it turned out to be useful to employ a family of integrals for open curves that are inspired by generalized Gauss integrals involved in formulas for the so-called Vassiliev knot invariants. Let us therefore introduce a few concepts from Knot Theory.

A knot is a closed curve without self-intersections in 3-space. A knot is said to be oriented if equipped with a preferred direction of traversal. A plane projection of a knot is called a shadow of the knot. Indicating over- and under-crossings on a shadow of a knot, as done in Fig. 3, a knot diagram of the knot is obtained, provided that the shadow has only transversal double points. That is, triple points are not allowed on the shadow, and at each double point the two tangents of the curve are not parallel. The crossings of an oriented knot diagram may be assigned a sign using the usual right-hand rule. This is also shown in Fig. 3. As the signs of the crossings are unchanged under a change of the orientation of the knot, this gives signs of crossings in unoriented knot diagrams.

Fig. 1. Typical members from the super fold class α (256B:A, left) and β (1ACX, right).

Fig. 2. Typical members from the super fold class α+β (1PAZ, left) and α/β (3TIM:A, right).

The first structural measure to be considered here is the writhe of a space curve, known from the famous Călugăreanu–Pohl–White self-linking formula. The writhe, $\mathrm{Wr}$, of a closed space curve, $\gamma$, may be calculated using the Gauss integral

$$\mathrm{Wr}(\gamma) = \frac{1}{4\pi} \iint_{\gamma\times\gamma\setminus D} \omega(t_1,t_2)\,dt_1\,dt_2,$$

where

$$\omega(t_1,t_2) = \frac{[\gamma'(t_1),\ \gamma(t_1)-\gamma(t_2),\ \gamma'(t_2)]}{|\gamma(t_1)-\gamma(t_2)|^{3}}$$

and $D$ is the diagonal of $\gamma\times\gamma$. Notice that the triple scalar product $[\gamma'(t_1),\ \gamma(t_1)-\gamma(t_2),\ \gamma'(t_2)]$ equals the oriented volume of the parallelepiped spanned by $\gamma'(t_1)$, $\gamma(t_1)-\gamma(t_2)$, and $\gamma'(t_2)$. Hereby $\omega(t_1,t_2) = \omega(t_2,t_1)$. As we may assume that $\gamma$ is parametrized by $[0,1]$, it is enough to calculate the above integral on the 2-simplex $\Delta^2 = \{(t_1,t_2);\ 0 < t_1 < t_2 < 1\}$. Setting $I_{(1,2)} = \int_{\Delta^2} \omega(t_1,t_2)\,dt_1\,dt_2$, we have $\mathrm{Wr} = (1/2\pi)\,I_{(1,2)}$.

Let $e$ be a unit vector in 3-space and let $p_e(\gamma)$ be the projection of the closed curve $\gamma$ along $e$. For almost all $e$, the planar curve $p_e(\gamma)$ gives a knot diagram of $\gamma$. The writhe of a knot diagram is the sum of the signs of the crossings in the knot diagram. A well-known geometric interpretation of the writhe of a closed space curve is that $4\pi\,\mathrm{Wr}$ equals the integral over all projections of the writhe of the knot diagrams. That is, $\mathrm{Wr}$ is the average number of signed crossings seen when looking at the knot from all directions in 3-space. To explain this, let $e(t_1,t_2)$ be the unit vector from $\gamma(t_1)$ to $\gamma(t_2)$. Then $\omega(t_1,t_2)$ is the pull-back of the area 2-form on the unit 2-sphere. Hence, the integral $\iint_{\gamma\times\gamma\setminus D} \omega(t_1,t_2)\,dt_1\,dt_2$ measures the signed area on the unit 2-sphere swept out by $e(t_1,t_2)$. For any $t_1$ and $t_2 \neq t_1$ the projection along $e(t_1,t_2)$ shows a crossing corresponding to the two points $\gamma(t_1)$ and $\gamma(t_2)$. This crossing has the same sign as $\omega(t_1,t_2)$.

Taking the absolute value of the integrand in the writhe integral, one gets the average crossing number, that is, the average number of crossings (without signs) seen when looking at the knot from all directions in 3-space. By analogy to the above, we introduce the integral $I_{|1,2|} = \int_{\Delta^2} |\omega(t_1,t_2)|\,dt_1\,dt_2$.

Note that both the writhe and the average crossing number, by their pointwise counting of crossings, have the same geometric interpretations if applied to open curves.

Consider the integral $I_{(1,3)(2,4)} = \int_{\Delta^4} \omega(t_1,t_3)\,\omega(t_2,t_4)\,dt_1\,dt_2\,dt_3\,dt_4$, where $\Delta^4$ is the 4-simplex given by $0 < t_1 < t_2 < t_3 < t_4 < 1$. This integral is involved in an integral formulation of a Vassiliev knot invariant of second order and comes from the perturbative expansion of Witten's Chern–Simons path integral associated with a knot in 3-space. The geometric interpretation of $I_{(1,3)(2,4)}$ given in [15], in the case of an almost planar space curve, can be formulated as follows: consider a knot diagram with two crossings having the property that, among the parameter values $0 < t_1 < t_2 < t_3 < t_4 < 1$ corresponding to these two crossings, $t_1$ and $t_3$ correspond to one of the two crossings and $t_2$ and $t_4$ correspond to the other. Denote such a crossing configuration by $(1,3)(2,4)$. This crossing configuration is counted as occurring once if both crossings are positive or if both are negative. The crossing configuration is counted as one negative occurrence if the two crossings have opposite signs. In the limit where the knot becomes planar, the integral $I_{(1,3)(2,4)}$ converges to a constant times the number of crossings (without signs) plus a constant times the number of occurrences of the crossing configuration $(1,3)(2,4)$, counted with sign.

Fig. 3. Left: a knot diagram of the right-handed trefoil knot. Right: the signs of crossings according to the usual right-hand rule.

A heuristic geometric-statistical explanation of why $I_{(1,3)(2,4)}$ is strongly connected to the frequency of $(1,3)(2,4)$ crossing configurations and to the frequency of crossings is as follows. Fix $t_1$ and $t_3$. Hereby $e(t_1,t_3)$ is fixed. When $t_2$ and $t_4$ vary, the vector $e(t_2,t_4)$ sweeps out an area on the unit 2-sphere. If this area is, e.g., 7.3 times the area of the unit 2-sphere, then the equation $e(t_2,t_4) = e(t_1,t_3)$ is expected to be fulfilled 7.3 times on average if $e(t_2,t_4)$ and $e(t_1,t_3)$ are uncorrelated. However, in the limit $t_2 \to t_1$ and $t_4 \to t_3$ the vectors $e(t_2,t_4)$ and $e(t_1,t_3)$ become equal. Hence, one would expect the equation $e(t_2,t_4) = e(t_1,t_3)$ to be fulfilled more often than the uncorrelated expectation, which is in perfect agreement with the limit of the integral found in [15] for almost planar space curves. Note that the above geometric-statistical argument does not depend on the fact that the curve is closed.

For a configuration of $n$ crossings given by an ordered pairing of the integers $1,\ldots,2n$ of the form $(i_1,i_2)(i_3,i_4)\cdots(i_{2n-1},i_{2n})$ with $i_1 < i_2$, $i_3 < i_4$, $\ldots$, $i_{2n-1} < i_{2n}$, and $i_1 < i_3 < \cdots < i_{2n-1}$, we define the integral (see footnote 3)

$$I_{(i_1,i_2)(i_3,i_4)\cdots(i_{2n-1},i_{2n})} = \int_{\Delta^{2n}} \omega(t_{i_1},t_{i_2})\,\omega(t_{i_3},t_{i_4})\cdots\omega(t_{i_{2n-1}},t_{i_{2n}})\,dt_1\cdots dt_{2n},$$

where $\Delta^{2n}$ is the $2n$-simplex given by $0 < t_1 < t_2 < \cdots < t_{2n} < 1$. We refer to the number $I_{(i_1,i_2)(i_3,i_4)\cdots(i_{2n-1},i_{2n})}$ as a geometric measure of order $n$. Note that all of these measures are independent of translation, rotation, and scale. Furthermore, they all depend continuously on deformation of the curve.

An argument analogous to the above geometric-statistical argument concerning the integral $I_{(1,3)(2,4)}$ applies also to the general integrals $I_{(i_1,i_2)(i_3,i_4)\cdots(i_{2n-1},i_{2n})}$. That is, we expect the geometric measure of order $n$ denoted by $I_{(i_1,i_2)(i_3,i_4)\cdots(i_{2n-1},i_{2n})}$ to be correlated with the average occurrence of the crossing configuration $(i_1,i_2)(i_3,i_4)\cdots(i_{2n-1},i_{2n})$, see Fig. 4, plus constants times the geometric measures of lower order.

The main goal of this paper is to introduce the above integrals, or rather discrete versions applying to polygonal curves, as measures of protein structure. We will not elaborate on the validity of the above heuristic geometric-statistical argument.

3 It is more natural to define the integrals times $(2/4\pi)^n$, where $4\pi$ normalizes the area of the unit 2-sphere and the factor of two takes into account that a crossing seen in one direction is also seen from the antipodal direction. As we are going to rescale the values of the integrals later, we have avoided this multiplicative constant.

4. Evaluation on polygonal curves

In the following we shall derive reductions of the generalized Gauss integrals introduced. We have decided to represent proteins by the polygonal curve connecting the Cα-atoms. This representation is not faithful, as the positions of three Cα-atoms in sequence can in general be realized by two different pairs of dihedral angles associated with the middle Cα-atom (see [1]). From a structural point of view, this local ambiguity is unimportant, as the large-scale shape of the protein backbone, which is the main issue of consideration here, is unchanged.

We start by finding an expression for $I_{(1,2)}$ evaluated on two line segments. We found numerical integration too slow, but have used it to test the explicit formula given below.

Let $l_1$ be the line segment from $r_{11}$ to $r_{12}$ and let $l_2$ be the line segment from $r_{21}$ to $r_{22}$. We need to find the signed area on the unit 2-sphere of the domain, $D$, swept out by the unit vectors, $e$, starting at points on $l_1$ with directions to points on $l_2$. First consider the boundary, $\partial D$, of the domain $D$. This boundary consists of four great circle segments. The reason for this is as follows: consider the unit vectors, denoted by $e(r_{11}, l_2)$, that start at $r_{11}$ with directions to the points of $l_2$. If the point $r_{11}$ does not lie on the line defined by $l_2$, then $r_{11}$ and $l_2$ define a plane. Hereby, $e(r_{11}, l_2)$ lies on the intersection between a plane and the unit 2-sphere. The set $e(r_{11}, l_2)$ is thus a part of a great circle on the unit 2-sphere. The line $l_1$ lies on one side of the plane through $r_{11}$ and $l_2$. Hence, the domain $D$ is contained in one of the two hemispheres into which the above plane divides the unit 2-sphere. We thus conclude that the domain $D$ is convex and is contained in a hemisphere.

The Gauss–Bonnet theorem asserts that if $R$ is a simply connected region in a surface bounded by a piecewise differentiable curve $\alpha$, with pieces $\alpha_1, \alpha_2, \ldots, \alpha_n$, making exterior angles $\epsilon_1, \epsilon_2, \ldots, \epsilon_n$ at the vertices of $\alpha$, then

$$\sum_{j=1}^{n} \int_{\alpha_j} k_g(s)\,ds + \iint_R K\,dA = 2\pi - \sum_{j=1}^{n} \epsilon_j,$$

where $k_g$ is the geodesic curvature and $K$ is the Gauss curvature. The unit 2-sphere has Gauss curvature one everywhere and great circles have zero geodesic curvature everywhere. Therefore the Gauss–Bonnet theorem becomes

$$A = \iint_R 1\,dA = 2\pi - \sum_{j=1}^{4} \epsilon_j.$$

By convexity, the domain $D$ is always a simply connected region. If the boundary of $D$ is traversed in the positive direction, then the enclosed area is directly given by the Gauss–Bonnet theorem as $A = 2\pi - \sum_{j=1}^{4} \epsilon_j$. If the boundary of $D$ is traversed in the negative direction, the area of the complement of $D$ is $A^c = 2\pi - \sum_{j=1}^{4} \epsilon_j$, by the Gauss–Bonnet theorem. The area of the unit 2-sphere is $4\pi$. Hence the positive area of $D$ is $4\pi - A^c = 2\pi + \sum_{j=1}^{4} \epsilon_j$. However, $D$ is negatively traversed, so the signed area of $D$ is $A = -2\pi - \sum_{j=1}^{4} \epsilon_j$. The two expressions for the signed area of $D$ are finally collected in the formula

$$A = \mathrm{sign}\Big(\sum_{j=1}^{4} \epsilon_j\Big)\,2\pi - \sum_{j=1}^{4} \epsilon_j.$$

Fig. 4. Left: a knot diagram of a helix. Right: a corresponding crossing configuration.

To obtain the desired expression for $I_{(1,2)}$ we need to find the exterior angles $\epsilon_1, \ldots, \epsilon_4$ given two line segments $l_1$ and $l_2$. Let $a$, $b$, and $c$ be three unit vectors. As we may assume that $a$, $b$, and $c$ lie on the same open hemisphere, the great circle pieces connecting $a$ to $b$ to $c$ are well-defined. Denote the exterior angle at $b$ by $\angle(a,b,c)$. Standard vector calculus gives that

$$\cos(\angle(a,b,c)) = \frac{a\times b}{|a\times b|}\cdot\frac{b\times c}{|b\times c|} = \frac{(a\cdot b)(b\cdot c) - (a\cdot c)(b\cdot b)}{\sqrt{1-(a\cdot b)^2}\,\sqrt{1-(b\cdot c)^2}} = \frac{(a\cdot b)(b\cdot c) - (a\cdot c)}{\sqrt{1-(a\cdot b)^2}\,\sqrt{1-(b\cdot c)^2}},$$

$$\sin(\angle(a,b,c)) = -\frac{a\times b}{|a\times b|}\cdot\frac{b\times(b\times c)}{|b\times c|} = -\frac{a\times b}{\sqrt{1-(a\cdot b)^2}}\cdot\frac{(b\cdot c)\,b - (b\cdot b)\,c}{\sqrt{1-(b\cdot c)^2}} = \frac{(a\times b)\cdot c}{\sqrt{1-(a\cdot b)^2}\,\sqrt{1-(b\cdot c)^2}}.$$

Let $p_1, \ldots, p_L$ denote the vertices of a polygonal line, let $l_j$ denote the line segment from $p_j$ to $p_{j+1}$, and let $\omega(i,j)$ denote the value of $I_{(1,2)}$ when restricted to the line segments $l_i$ and $l_j$. The unit vector with direction from $p_i$ to $p_j$ is denoted $e(i,j)$. Following the positive orientation $(i,j) \mapsto (i+1,j) \mapsto (i+1,j+1) \mapsto (i,j+1) \mapsto (i,j) \mapsto \cdots$ of the boundary of the planar square with the same points as corners, the sum of the exterior angles is given by

$$\begin{aligned} S(i,j) = \sum_{k=1}^{4} \epsilon_k ={}& \angle\big(e(i,j),\, e(i+1,j),\, e(i+1,j+1)\big) + \angle\big(e(i+1,j),\, e(i+1,j+1),\, e(i,j+1)\big) \\ &+ \angle\big(e(i+1,j+1),\, e(i,j+1),\, e(i,j)\big) + \angle\big(e(i,j+1),\, e(i,j),\, e(i+1,j)\big). \end{aligned}$$

And we finally find

$$\omega(i,j) = \mathrm{sign}(S(i,j))\,2\pi - S(i,j)$$

for $0 < i < j-1 < L$. If $j = i+1$ then the two line segments $l_i$ and $l_{i+1}$ lie in a plane and $\omega(i,i+1) = 0$.
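The closed-form evaluation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `exterior_angle`, `omega`, and `omega_matrix` are hypothetical, indices are 0-based, the two-argument arctangent is used to combine the sine and cosine expressions, and degenerate (coplanar, non-adjacent) segment pairs, where the spherical quadrilateral collapses, are not treated specially. The sign convention is that of the discrete formula and has not been cross-checked against the continuous integrand.

```python
import numpy as np

def exterior_angle(a, b, c):
    """Signed exterior angle at b along the great-circle path a -> b -> c."""
    n = np.cross(a, b)
    # Numerators of the sin and cos expressions from the text; their shared
    # positive denominator |a x b| |b x c| cancels inside arctan2.
    return np.arctan2(np.dot(n, c), np.dot(n, np.cross(b, c)))

def omega(p_i, p_i1, p_j, p_j1):
    """omega(i, j) for the segments p_i -> p_i1 and p_j -> p_j1."""
    def e(p, q):
        d = q - p
        return d / np.linalg.norm(d)
    # Corners in the order (i,j), (i+1,j), (i+1,j+1), (i,j+1).
    c = [e(p_i, p_j), e(p_i1, p_j), e(p_i1, p_j1), e(p_i, p_j1)]
    s = sum(exterior_angle(c[k - 1], c[k], c[(k + 1) % 4]) for k in range(4))
    return np.sign(s) * 2.0 * np.pi - s

def omega_matrix(points):
    """Upper-triangular matrix of omega(i, j) for a polygonal curve."""
    pts = np.asarray(points, dtype=float)
    m = len(pts) - 1                      # number of segments
    w = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 2, m):         # adjacent segments contribute 0
            w[i, j] = omega(pts[i], pts[i + 1], pts[j], pts[j + 1])
    return w

# Two perpendicular segments at unit distance: seen from the first segment,
# the second sweeps out one face of a cube centred on the viewer.
a0, a1 = np.array([-1.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0])
b0, b1 = np.array([0.0, -1.0, 1.0]), np.array([0.0, 1.0, 1.0])
val = omega(a0, a1, b0, b1)
```

In this example the swept domain is one sixth of the unit sphere, so `val` has magnitude $2\pi/3$. Summing the entries of `omega_matrix(points)` gives the discrete $I_{(1,2)}$, and summing their absolute values gives the discrete $I_{|1,2|}$.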

The $2n$-fold integral $I_{(i_1,i_2)(i_3,i_4)\cdots(i_{2n-1},i_{2n})}$ is now replaced by the $2n$-fold sum of products of the $\omega(i,j)$s

$$I_{(i_1,i_2)(i_3,i_4)\cdots(i_{2n-1},i_{2n})} = \sum_{0 < j_1 < j_2 < \cdots < j_{2n} < L} \omega(j_{i_1},j_{i_2})\,\omega(j_{i_3},j_{i_4})\cdots\omega(j_{i_{2n-1}},j_{i_{2n}}).$$

By tabulating some partial sums, these $2n$-fold sums are reduced to at most $n$-fold sums, which is important as a typical protein has 100–400 residues. Finally we note that the integrals are invariant under permutation inside the brackets, e.g., $I_{(1,2)(3,4)} = I_{(2,1)(3,4)}$, and invariant under permutation of the ordering of the brackets, e.g., $I_{(1,2)(3,4)} = I_{(3,4)(1,2)}$.
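To make the role of the partial sums concrete, the sketch below compares a brute-force evaluation of the $2n$-fold sum with a reduced evaluation for the single measure $I_{(1,2)(3,4)}$. It is an illustration under assumptions rather than the scheme used in the paper: the names `measure_brute` and `i_12_34_fast` are hypothetical, the input `w` is assumed to be an upper-triangular matrix with `w[a, b]` holding $\omega$ for segments `a < b` (0-based), and only one of the possible reductions is shown.

```python
import itertools
import numpy as np

def measure_brute(w, pairing):
    """Direct 2n-fold sum; pairing uses the 1-based labels of the text."""
    m = w.shape[0]
    total = 0.0
    for js in itertools.combinations(range(m), 2 * len(pairing)):
        prod = 1.0
        for a, b in pairing:
            prod *= w[js[a - 1], js[b - 1]]
        total += prod
    return total

def i_12_34_fast(w):
    """I_(1,2)(3,4) as a 2-fold sum using one tabulated partial sum."""
    m = w.shape[0]
    col = np.triu(w, 1).sum(axis=0)   # col[b] = sum over a < b of w[a, b]
    prefix = np.cumsum(col)           # prefix[k] = sum over a < b <= k of w[a, b]
    total = 0.0
    for j3 in range(1, m):
        for j4 in range(j3 + 1, m):
            total += prefix[j3 - 1] * w[j3, j4]
    return total

# Small random stand-in for an omega matrix (not protein data).
rng = np.random.default_rng(0)
w = np.triu(rng.normal(size=(9, 9)), 1)
slow = measure_brute(w, [(1, 2), (3, 4)])
fast = i_12_34_fast(w)
```

The brute-force loop visits every increasing index tuple and is only practical for very short chains, whereas the tabulated version needs a 2-fold loop for this order-2 measure, in line with the reduction to at most $n$-fold sums stated above.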


5. Application to proteins

With the aim of describing protein geometry, it is most natural to start with the simplest of the $2n$-fold integral measures $I_{(i_1,i_2)(i_3,i_4)\cdots(i_{2n-1},i_{2n})}$. Here we have implemented the measures corresponding to $n = 1$, $n = 2$, and $n = 3$. Due to the mentioned symmetries we have: $I_{(1,2)}$, $I_{|1,2|}$, $I_{(1,3)(2,4)}$, $I_{(1,2)(3,4)}$, $I_{(1,4)(2,3)}$, $I_{(1,2)(3,4)(5,6)}$, $I_{(1,2)(3,5)(4,6)}$, $I_{(1,2)(3,6)(4,5)}$, $I_{(1,3)(2,4)(5,6)}$, $I_{(1,3)(2,5)(4,6)}$, $I_{(1,3)(2,6)(4,5)}$, $I_{(1,4)(2,3)(5,6)}$, $I_{(1,4)(2,5)(3,6)}$, $I_{(1,4)(2,6)(3,5)}$, $I_{(1,5)(2,3)(4,6)}$, $I_{(1,5)(2,4)(3,6)}$, $I_{(1,5)(2,6)(3,4)}$, $I_{(1,6)(2,3)(4,5)}$, $I_{(1,6)(2,4)(3,5)}$, and $I_{(1,6)(2,5)(3,4)}$. There are 105 different integral measures with $n = 4$, causing $n = 3$ to be a natural limit for this first investigation.
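The counts quoted above (3 measures of order 2, 15 of order 3, and 105 of order 4) are the numbers of ways to split $1,\ldots,2n$ into pairs. A minimal sketch that enumerates these pairings is given below; the generator name `pairings` is hypothetical and the enumeration order is simply lexicographic.

```python
def pairings(items):
    """Yield every split of a sorted list into pairs (i, j) with i < j,
    with the pairs ordered by their first element."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for k, partner in enumerate(rest):
        remaining = rest[:k] + rest[k + 1:]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail

# Number of crossing configurations for n = 1, 2, 3, 4.
counts = {n: sum(1 for _ in pairings(list(range(1, 2 * n + 1))))
          for n in (1, 2, 3, 4)}
order_three = list(pairings([1, 2, 3, 4, 5, 6]))
```

Here `counts` comes out as `{1: 1, 2: 3, 3: 15, 4: 105}`, and `order_three` lists the fifteen order-3 configurations in the same order as the text, starting with $(1,2)(3,4)(5,6)$ and ending with $(1,6)(2,5)(3,4)$.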

To explore the range of these measures, we have chosen a test sample of 69 proteins (see footnote 4) with as large a structural diversity as possible, except for pairs like 256B:A and 256B:B, which are included in order to also have some very similar structures. The set of 69 is taken from Rost and Sander's classical training set [18], chosen such that each protein has less than 25% sequence similarity to the others. This structural diversity is illustrated in Fig. 5. We have normalized (see footnote 5) each of the measures such that each measure takes values between −1 and 1.

For the pair 256B:A and 256B:B it is found that 14 of the measures vary less than 3%, and the only big relative variation is that $I_{(1,4)(2,6)(3,5)}(\text{256B:A}) = 0.003$ and $I_{(1,4)(2,6)(3,5)}(\text{256B:B}) = 0.001$; however, the absolute value of this variation is small. The range of each of the measures on the 69 proteins is spread over more than two decades. Hence, the relatively small variations of the measure values on structurally similar proteins show that the finite resolution of protein data does not constitute a problem and, more interestingly, they show that the measures give a clear separation of the different protein structures.

In order to see how we in general can discriminate between protein structures, we explain some of the measures in terms of secondary structures. Looking at images of proteins it is obvious that, e.g., α-helices and β-sheets give very different crossings in planar projections. To check if the measures just gauge the contents of α-helices, β-sheets, and the remainder, denoted coil, we have made least squares fits of the measure values by polynomials in the numbers of residues contained in α-helices, β-sheets, and coil, respectively.

The measure $I_{|1,2|}$ ($2\pi$ times the average crossing number, by analogy with $\mathrm{Wr} = (1/2\pi)\,I_{(1,2)}$) shown in Fig. 6 is well approximated by a third degree polynomial in α-helix, β-sheet, and coil content. The remaining measures cannot be properly approximated by polynomials. As an example, Fig. 6 illustrates the least squares fit of $I_{(1,4)(2,5)(3,6)}$ with fourth order polynomials. Note that there are 35 parameters and 69 measurements to fit. Even the seventh degree fits, having 120 parameters, cannot reproduce the 69 measurements for any of the geometric measures. The fits are never good for proteins without β-strands. We conclude that, except for the average crossing number, the measures are not approximately given by the secondary structures and size of a protein, and therefore the measures give more information about a protein than its secondary structure content. Of course, there are clear tendencies, e.g., that a large α-helix content gives large absolute scores on most measures, but clearly there is also a dependence on the relative positions of the secondary elements.

4 1AZU, 1CBH, 1L58, 1PYP, 1R09:2, 2AAT, 2MEV:4, 2OR1:L, 2STV, 2TGP:I, 2UTG:A, 3EBX, 3GAP:A, 3HMG:A, 3RNT, 2FOX, 5ER2:E, 6CPA, 7RSA, 9API:B, 9WGA:A, 1BBP:A, 1CC5, 1CRN, 1ETU, 1FC2:C, 1MRT, 1OVO:A, 1TGS:I, 1BKS:A, 2CCY:A, 2GBP, 1A45, 2MHU, 2RSP:A, 2WRP:R, 1CYO, 3TIM:A, 4CPV, 4RXN, 5LDH, 8ABP, 1BDS, 1CDT:A, 1FDL:H, 1DUR:A, 1GD1:O, 1PAZ, 1PPT, 1RBP, 1S01, 2I1B, 2LHB, 2TMV:P, 3CD4, 3SDH:A, 5HVP:A, 6TMN:E, 9PAP, 256B:A, 256B:B, 1CRN, 2CCY:B, 1ACX, 2LH4, 1CSE:E, 1CSE:I, 2PAB:A, 2PAB:B.

5 The normalization factors are one over 146, 1277, 119, 101 023, 1206, 477 989, 6612, 23 946, 6448, 203, 1884, 54 581, 172, 258, 1246, 293, 1396, 36 143, 442, and 2468, respectively, for the measures in the ordering above.
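The parameter counts quoted for the polynomial fits (35 for fourth order, 120 for seventh degree) are the numbers of monomials in three variables up to the given total degree. The sketch below reproduces these counts and shows one possible way to set up such a least squares fit with NumPy. It is an assumption-laden illustration: the helper names `monomial_exponents`, `design_matrix`, and `fit_rms_error` are hypothetical, and the paper does not state how its fits were computed.

```python
from itertools import combinations_with_replacement
import numpy as np

def monomial_exponents(degree, n_vars=3):
    """All exponent tuples (i, j, k) with i + j + k <= degree."""
    exps = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(range(n_vars), d):
            exps.append(tuple(combo.count(v) for v in range(n_vars)))
    return exps

def design_matrix(alpha, beta, coil, degree):
    """One column per monomial alpha^i * beta^j * coil^k."""
    x = np.column_stack([alpha, beta, coil]).astype(float)
    return np.column_stack([np.prod(x ** np.array(e), axis=1)
                            for e in monomial_exponents(degree)])

def fit_rms_error(alpha, beta, coil, values, degree):
    """Root-mean-square residual of the least squares polynomial fit."""
    a_mat = design_matrix(alpha, beta, coil, degree)
    coef, *_ = np.linalg.lstsq(a_mat, values, rcond=None)
    return float(np.sqrt(np.mean((a_mat @ coef - values) ** 2)))

# Number of fit parameters for third, fourth, and seventh degree.
n_params = {d: len(monomial_exponents(d)) for d in (3, 4, 7)}
```

Here `n_params` evaluates to `{3: 20, 4: 35, 7: 120}`, matching the counts in the text. With real residue counts the columns differ by many orders of magnitude, so rescaling the inputs before fitting is advisable.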

To check if the measures are pairwise independent we have plotted the scores of all the proteins on each measure against the scores on any other measure. We comment on some of these plots. The left-hand side of Fig. 7 is equivalent to the writhe vs. the average crossing number, i.e., signed vs. unsigned crossings. This plot shows a clear separation of the secondary structure contents. The right-hand side of this figure shows $I_{(1,2)}$ vs. $I_{(1,2)(3,4)}$, which clearly, and not surprisingly, are correlated. The same is the case for the measures $I_{(1,2)}$ and $I_{(1,2)(3,4)(5,6)}$. Hereby, $I_{(1,2)(3,4)}$ and $I_{(1,2)(3,4)(5,6)}$ are also correlated, but all other measures are pairwise uncorrelated. To check if the observed correlations are artifacts of the measures, rather than a consequence of the fact that they have been applied to correctly folded proteins, we have generated more than 200 random self-avoiding proteins of various lengths and radii corresponding to the length/radius distribution of the set of real proteins. However, these random proteins generally have very low scores on all measures, giving only insignificant deviation from the correlations found on real proteins. In general the random proteins have measures similar to those of the all-beta proteins.

Fig. 5. This figure illustrates the set of 69 proteins. Each disc represents a protein, being centered at the relative contents of α-helix and β-strand, respectively, and having a radius proportional to the radius of the protein. This radius is calculated as one half of the longest distance between two Cα-atoms.

Fig. 6. Left: the relative scores of $I_{|1,2|}$ (the max score is set to 100) shown as a function of the numbers of residues contained in α-helices and β-strands. The gray part of each bar is the value of the least squares fitted third degree polynomial in α, β, and coil content and the remaining black part is the error in this fit. Right: the same kind of data as in the left plot, but for the measure $I_{(1,4)(2,5)(3,6)}$ fitted with fourth order polynomials.

The plot of $I_{(1,3)(2,4)}$ vs. $I_{(1,5)(2,4)(3,6)}$, shown in Fig. 8, is one of the few plots on which the all beta proteins are spread over a larger area. Also the plot of $I_{(1,4)(2,6)(3,5)}$ vs. $I_{(1,4)(2,5)(3,6)}$, shown in the same figure, is special, as here the all alpha proteins have the lowest scores.

We call a measure symmetric if it is invariant under reversal of the protein's orientation. Except for the pairs $I_{(1,2)(3,5)(4,6)}$ and $I_{(1,3)(2,4)(5,6)}$, $I_{(1,2)(3,6)(4,5)}$ and $I_{(1,4)(2,3)(5,6)}$, $I_{(1,3)(2,6)(4,5)}$ and $I_{(1,5)(2,3)(4,6)}$, and the pair $I_{(1,4)(2,6)(3,5)}$ and $I_{(1,5)(2,4)(3,6)}$, which are interchanged under reversal of orientation, all measures are symmetric. The general behaviour is that the all alpha and the all beta proteins have equal scores on the pairs of asymmetric measures. See Fig. 9, where the horizontally elongated all alpha and the vertically elongated all beta proteins lie on the line of identity whereas the mixed structures generally lie off this line. This shows that these measures can detect the packing of the secondary structure elements.
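The pairing of the asymmetric measures can be checked combinatorially. The sketch below assumes that reversing the orientation acts on the subscript of a measure by relabeling the k-th point along the curve as point n + 1 - k, where n is the number of points (6 for the third order measures). The first measure is read as $I_{(1,2)(3,5)(4,6)}$, as in the caption of Fig. 9. The function names are hypothetical.

```python
def reverse_orientation(pairing, n_points=6):
    """Relabel a pairing of the points 1..n_points after reversing the curve.

    Assumption: reversal sends the k-th point to position n_points + 1 - k.
    Pairs and the list of pairs are re-sorted so results can be compared.
    """
    flipped = [
        tuple(sorted((n_points + 1 - i, n_points + 1 - j))) for i, j in pairing
    ]
    return tuple(sorted(flipped))


def is_symmetric(pairing, n_points=6):
    """True if the pairing is unchanged by reversal of orientation."""
    return reverse_orientation(pairing, n_points) == tuple(sorted(pairing))


# The four pairs of asymmetric measures named in the text.
asymmetric_pairs = [
    (((1, 2), (3, 5), (4, 6)), ((1, 3), (2, 4), (5, 6))),
    (((1, 2), (3, 6), (4, 5)), ((1, 4), (2, 3), (5, 6))),
    (((1, 3), (2, 6), (4, 5)), ((1, 5), (2, 3), (4, 6))),
    (((1, 4), (2, 6), (3, 5)), ((1, 5), (2, 4), (3, 6))),
]
for left, right in asymmetric_pairs:
    # Each measure in a pair is the reversal of the other one.
    assert reverse_orientation(left) == right
    assert reverse_orientation(right) == left

print(is_symmetric(((1, 4), (2, 5), (3, 6))))
print(is_symmetric(((1, 3), (2, 4)), n_points=4))
```

Under this relabeling the four listed pairs are exactly interchanged, while for example $I_{(1,4)(2,5)(3,6)}$ and the second order measure $I_{(1,3)(2,4)}$ map to themselves.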

6. Remarks on the average crossing number

In the literature, the average crossing number has been investigated in mathematical theory as well as in applications to protein structures. Arteca and Tapia [7] use the average crossing number, denoted by the mean overcrossing number, and the most probable overcrossing number as protein shape descriptors.

Fig. 7. Left: measure $I_{(1,2)}$ vs. $I_{|1,2|}$. Right: measure $I_{(1,2)}$ vs. $I_{(1,2)(3,4)}$. The signatures are as in Fig. 5.

The most probable overcrossing number has the disadvantage of not being continuous on the set of protein structures on which it is evaluated, although continuity is a natural requirement for a shape descriptor. Consider, e.g., a protein structure with two dominant peaks in its overcrossing spectrum, as seen in Figs. 2–4 in Ref. [7]. A small deformation of such a protein structure can cause the most probable overcrossing number to make the big jump from one dominant peak to the other.
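The jump can be illustrated with a toy spectrum. The sketch below assumes that an overcrossing spectrum is given as a list of probabilities indexed by the overcrossing number; the two spectra are hypothetical and differ only by a shift of 0.01 between their two dominant peaks.

```python
def most_probable(spectrum):
    """Overcrossing number at which the spectrum attains its maximum."""
    return max(range(len(spectrum)), key=lambda n: spectrum[n])


def mean_value(spectrum):
    """Mean overcrossing number of a normalized spectrum."""
    return sum(n * p for n, p in enumerate(spectrum))


# Hypothetical spectra with dominant peaks at N = 2 and N = 7.
before = [0.02, 0.10, 0.30, 0.10, 0.03, 0.05, 0.10, 0.29, 0.01]
after = [0.02, 0.10, 0.29, 0.10, 0.03, 0.05, 0.10, 0.30, 0.01]

for name, spectrum in (("before", before), ("after", after)):
    print(name, most_probable(spectrum), round(mean_value(spectrum), 2))
```

The mean moves only slightly under this small change, whereas the most probable value jumps from one peak to the other.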

In the abstract of Ref. [7] it is claimed that the mean overcrossing number and the most probable overcrossing number relate to the content of secondary structure in a protein as well as its global 3-dimensional organization. In the case of the average crossing number, this is in contrast to the fact that we, in this paper, have found the average crossing number to be the only structural measure that essentially is given by the secondary structure and coil contents alone. However, our set of protein structures may not be large enough to show a dependency on the global 3-dimensional organization.

Fig. 8. Left: measure $I_{(1,3)(2,4)}$ vs. $I_{(1,5)(2,4)(3,6)}$. Right: measure $I_{(1,4)(2,6)(3,5)}$ vs. $I_{(1,4)(2,5)(3,6)}$. The signatures are as in Fig. 5.

Fig. 9. Measure $I_{(1,2)(3,5)(4,6)}$ vs. $I_{(1,3)(2,4)(5,6)}$. The signatures are as in Fig. 5.

A more interesting remark concerns the scaling laws of the average crossing number. Arteca [7] has found the average crossing numbers of a set of protein structures, on average, to be proportional to their lengths to the power of $1.37 \pm 0.02$. The upper bound, given by Buck [16], on the average crossing number of a space curve with constant diameter, i.e., a flexible self-avoiding tube, is given by a constant times the length/diameter to the power $4/3 \approx 1.33$. It thus seems that proteins generally are as entangled as self-avoiding tubes can be.

The reason why the exponent $1.37 \pm 0.02$, found on proteins, is a little greater than the geometric limit, 4/3, is that shorter proteins generally are further away from the geometric limit, i.e., shorter proteins are relatively less entangled than longer proteins. This is seen from Arteca's detailed treatment of this scaling law in [17], Figs. 4 and 6. Intuitively, this effect may be explained by the fact that a shorter protein has only little conformational freedom to entangle due to its few but relatively big and rigid secondary structure elements. However, for a longer protein, the secondary structure elements get so small, when compared to its length, that the protein on a larger scale is free to entangle. A final illustration of this effect is that Arteca, in [17, Table 2], gives the exponent $1.50 \pm 0.08$ for shorter proteins and the exponent $1.33 \pm 0.25$ for longer proteins. That is, for increasing length, proteins catch up on the entanglement offset of smaller proteins until the geometric limit is reached.
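The argument that an offset for short chains raises the fitted exponent above 4/3 can be made concrete with synthetic numbers. The sketch below fits a power law by least squares in log-log coordinates. The prefactor 0.05 and the deficit factor 1 - 20/n are hypothetical choices made only for illustration; they are not taken from [7], [16] or [17].

```python
from math import exp, log


def fit_power_law(lengths, values):
    """Least-squares fit of log(value) = log(c) + p * log(length).

    Returns the exponent p and the prefactor c.
    """
    xs = [log(length) for length in lengths]
    ys = [log(value) for value in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    exponent = sxy / sxx
    prefactor = exp(mean_y - exponent * mean_x)
    return exponent, prefactor


lengths = [50, 100, 200, 400]

# Synthetic values lying exactly on a 4/3 power law (hypothetical prefactor).
at_limit = [0.05 * n ** (4 / 3) for n in lengths]

# The same values with a relative deficit that shrinks as the chain grows,
# mimicking short proteins that are further away from the geometric limit.
below_limit = [value * (1 - 20 / n) for n, value in zip(lengths, at_limit)]

p_limit, _ = fit_power_law(lengths, at_limit)
p_below, _ = fit_power_law(lengths, below_limit)
print(f"exponent at the limit: {p_limit:.3f}")
print(f"exponent with short-chain deficit: {p_below:.3f}")
```

In this toy model every value stays below the 4/3 law, yet the fitted exponent exceeds 4/3, which is the mechanism described above.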

7. A pseudo metric on the space of protein structures

The collection of the above introduced measures gives, for each protein, a 20-dimensional vector containing the scores of the protein on each of the measures. Due to the normalization, each measure lies between minus one and plus one. Introducing, e.g., the euclidean or the maximum norm on this vector space, a pseudo metric on the space of protein structures is obtained. The configuration of a protein backbone is generally given by the pairs of dihedral angles at the Cα-atoms. Once the measures have been calculated for a set of proteins, all against all comparison of the proteins is very fast, as it only involves a norm in 20-space.
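The all against all comparison can be sketched as follows. The vectors are hypothetical three-component stand-ins for the 20-component measure vectors, and the function names are chosen for this illustration only.

```python
from math import sqrt


def euclidean_distance(u, v):
    """Euclidean norm of the difference of two measure vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def maximum_distance(u, v):
    """Maximum norm of the difference of two measure vectors."""
    return max(abs(a - b) for a, b in zip(u, v))


def distance_matrix(vectors, distance):
    """All against all comparison of a list of measure vectors."""
    return [[distance(u, v) for v in vectors] for u in vectors]


# Hypothetical 3-component stand-ins for the 20-component measure vectors.
toy_vectors = [
    [0.9, -0.2, 0.1],
    [0.8, -0.1, 0.1],
    [-0.3, 0.4, -0.6],
]
matrix = distance_matrix(toy_vectors, euclidean_distance)
for row in matrix:
    print(" ".join(f"{d:.3f}" for d in row))
```

The result is only a pseudo metric on protein structures because two different structures may share the same measure vector and then have distance zero.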

The configuration space of a 100 residue long protein backbone is 200-dimensional. A simple dimension count shows that the above pseudo metric cannot be a metric. If one introduces all the fourth order measures and, moreover, introduces all the variations of the measures caused by taking the absolute value of one or more of the $\omega(i,j)$'s involved in the sums, then more than 1600 measures are obtained. Even with this many measures, a norm such as that above will, strictly speaking, not be a norm but only a pseudo norm. This is due to the fact that all measures are zero on planar curves. Since proteins are not planar, this does not constitute a problem where proteins are concerned.

To the left in Fig. 10 is shown the distance matrix for a subset of the 69 protein test set. This subset of proteins starts with all alpha proteins like 256B:A (index 3) and ends with the all beta protein 1ACX (index 15). In between, 2FOX (index 6) is an example of an α/β protein and 1PAZ (index 10) is an example of an α+β protein. It is obvious from this distance matrix that 2FOX serves as a separator between the all alpha block at the lower left corner and more and more beta dominated proteins to the upper right from 2FOX. However, having looked at all the measures of all the proteins, it is obvious that much information is lost when collecting all the measures in one pseudo metric.

Due to the polynomial-like behaviour of the measures it makes sense to transform the measures by the map $x \mapsto \operatorname{sign}(x)\,|x|^{1/3}$. Note that, due to the normalization of the measures, they remain within the interval between minus one and one under this transformation. The distance matrix obtained using this transformation (Fig. 10 to the right) becomes much whiter off the diagonal, except for the three pairs of similar structures. As most of the set of proteins is chosen from the Rost List [18], which is constructed to have as much structural diversity as possible, this is exactly the result a good metric should give.
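A minimal sketch of this rescaling, applied entry by entry before taking the euclidean distance, is given below. The score vectors are hypothetical; the point is only that small normalized scores are spread out while everything stays inside the interval from minus one to one.

```python
from math import copysign, sqrt


def rescale(x):
    """Map x to sign(x) * |x|**(1/3)."""
    return copysign(abs(x) ** (1.0 / 3.0), x)


def rescaled_distance(u, v):
    """Euclidean distance between two measure vectors after rescaling each entry."""
    return sqrt(sum((rescale(a) - rescale(b)) ** 2 for a, b in zip(u, v)))


# Hypothetical normalized scores, all inside [-1, 1].
u = [0.001, -0.008, 0.5]
v = [0.008, 0.001, 0.5]

print([round(rescale(x), 3) for x in u])
print(round(rescaled_distance(u, v), 3))
```

In this toy case the two vectors are almost indistinguishable before the transformation and clearly separated after it.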

8. Conclusion

We have in the present paper constructed global geometric measures for characterizing and classifying protein structures. These measures are given as generalized Gauss integrals and are borrowed from integral formulas for the Vassiliev knot invariants. Such measures are shown to detect more than secondary structure content of proteins and to detect packing of secondary structure elements. We have calculated and compared these measures to the usual fold classification and found that they follow the usual classification of [2,3]. This encourages a full scale comparison between the introduced geometric measures and the usual classification of protein structures, which is a subject of future work of the first author.

The measures give such a classification the advantage of being rigorous and automatic, and they are better suited for identification of entirely new proteins in the genome. Moreover, when using these measures, proteins of different sizes can be compared directly without the use of, e.g., alignment and gap penalties. Since the measures are real numbers, one can talk about being 0.7 class one, 0.24 class two, etc., which might be an advantage for classification when many more protein structures are known. The approach surely has potential for structural prediction schemes.

Fig. 10. This figure visualizes the euclidean distance matrix of the measure vectors (left) and the rescaled measure vectors (right) of the proteins: 2CCY:A, 2CCY:B, 256B:A, 256B:B, 2WRP:R, 2FOX, 4CPV, 2RSP:A, 3RNT, 1PAZ, 1AZU, 7RSA, 2PAB:A, 2PAB:B, and 1ACX, chosen to have between 100 and 140 residues.


References

[1] P. Røgen, H. Bohr, On representation of protein backbones with (framed) space curves, MAT-Report No. 2002-14, Department of Mathematics, Technical University of Denmark, ISSN 0904-7611, Technical Report.

[2] C.A. Orengo et al., CATH: A hierarchical classification of protein domain structures, Structure 5 (1997) 1093.

[3] L.L. Conte et al., SCOP database in 2002: refinements accommodate structural genomics, Nucleic Acids Res. 30 (2002) 264.

[4] Protein Data Bank, Available from <http://www.rcsb.org/pdb/>.

[5] G.A. Arteca, Overcrossing spectra of protein backbones: characterization of 3-dimensional molecular shape and global structural homologies, Biopolymers 33 (1993) 1829.

[6] M. Levitt, Protein folding by restrained energy minimization and molecular dynamics, J. Molec. Biol. 170 (1983) 723.

[7] G.A. Arteca, O. Tapia, Characterization of fold diversity among proteins with the same number of amino acid residues, J. Chem. Inf. Comput. Sci. 39 (1999) 642.

[8] S. Pascarella, P. Argos, A data-bank merging related protein structures and sequences, Protein Eng. 5 (1992) 121.

[9] L. Holm, C. Sander, Mapping the protein universe, Science 273 (1996) 595.

[10] C.A. Orengo, T.P. Flores, W.R. Taylor, J.M. Thornton, Identification and classification of protein fold families, Protein Eng. 6 (1993) 485.

[11] I. Dubchak, S.R. Holbrook, S.H. Kim, Prediction of protein folding class from amino acid-composition, Proteins 16 (1993) 79.

[12] D.T. Jones, W.R. Taylor, J.M. Thornton, A new approach to protein fold recognition, Nature 358 (1992) 86.

[13] M.J. Sippl, Calculation of conformational ensembles from potentials and mean forces, J. Molec. Biol. 213 (1990) 859.

[14] M. Gribskov, A.D. McLachlan, D. Eisenberg, Profile analysis: detection of distantly related proteins, Proc. Natl. Acad. Sci. USA 84 (1987) 4355.

[15] X.-S. Lin, Z. Wang, Integral geometry of plane curves and knot invariants, J. Differ. Geom. 44 (1996) 74.

[16] G. Buck, Four-thirds power law for knots and links, Nature 392 (1998) 238.

[17] G.A. Arteca, Scaling regimes of molecular size and self-entanglements in very compact proteins, Phys. Rev. E 51 (1995) 2600.

[18] B. Rost, C. Sander, Prediction of protein secondary structure at better than 70% accuracy, J. Molec. Biol. 232 (1993) 584.
