Protein Classification by Matching and Clustering Surface Graphs

M.A. Lozano, F. Escolano ∗

Departamento de Ciencia de la Computación e Inteligencia Artificial, Universidad de Alicante, E-03080, Alicante, Spain

Abstract

In this paper we address the problem of comparing and classifying protein surfaces with graph-based methods. Comparison relies on matching surface graphs, extracted from the surfaces by considering concave and convex patches, through a kernelized version of the Softassign graph-matching algorithm. On the other hand, classification is performed by clustering the surface graphs with an EM-like algorithm, also relying on kernelized Softassign, and then calculating the distance of an input surface graph to the closest prototype. We present experiments showing the suitability of kernelized Softassign for both comparing and classifying surface graphs.

Key words: protein classification, graph matching, energy minimization, graph clustering, EM-algorithms

PACS: 89.80.+h, 42.30.Sy, 42.30.Tz, 87.10.+e

∗ Corresponding author. Fax: +34-965-903902. Email address: [email protected] (F. Escolano).

Preprint submitted to Elsevier Science, 9th August 2005

1 Introduction

In recent years, there has been a growing interest in exploiting the 3D information derived from the molecular surfaces of proteins in order to infer similarities between them. This is because, as molecular function occurs at the protein surface, such similarities contribute to understanding protein function and, in addition, they reveal interesting evolutionary links. In particular, there are two computational problems in which surface-based methods play a key role: registration, that is, determining whether two proteins have a similar shape, and thus develop a similar function; and docking, that is, determining whether two proteins have a complementary shape and thus can potentially bind to form a complex (in antigen-antibody reactions, enzyme catalysis, and so on). In this regard, the two central elements needed to solve both problems are a proper surface description and an efficient and reliable matching algorithm [1]. Surfaces are often described by the interest points associated to concave, convex or toroidal patches of the so-called Connolly surface [2], the part of the van der Waals surface accessible/excluded to/by a probe with a typical radius of 1.4 Å. Given such a description, and assuming molecular rigidity, it is possible to find the optimal 3D transformation satisfying the condition of shape coincidence or complementarity. Geometric hashing [3][4], which infers the most effective transformation needed to make the interest points of both protein surfaces compatible, has proved to be one of the most effective approaches in this context.

Other approaches related to computer vision come from graph theory and usually exploit discrete techniques to find the maximal clique [5] or the maximum common subgraph [6] in graphs whose vertices encode information about parts of the surfaces and whose edges represent topological or metric relations between these parts. However, the advent of more efficient, though approximate, continuous methods for matching [7][8][9][10] recommends an exhaustive evaluation of graph-based surface registration and docking under this perspective. One of these methods is the graduated assignment algorithm known as Softassign, proposed by Gold and Rangarajan [7]. Softassign optimizes an energy function, which is quadratic with respect to the assignment variables, through a continuation process (deterministic annealing) which updates the assignment variables while ensuring that a tentative solution represents a feasible match, at least in the continuous case. After such a process, a cleanup heuristic translates the continuous solution into a discrete one. However, it has been reported that the performance of this algorithm decays significantly with increasing levels of structural corruption (number of deleted nodes), and also that such a decay can be attenuated by optimizing an alternative non-quadratic cost function [8]. Furthermore, in our preliminary experiments with random graphs, we have reported a similar attenuation of this decay when the original quadratic function is properly weighted [11]. Such a weighting relies on distributional information coming from kernel computations on graphs. The key idea is that when working with non-attributed graphs there is a high level of structural ambiguity, and such ambiguity is not solved by classical Softassign. However, when using kernels on graphs [12][13] we obtain structural features associated to the vertices which contribute to removing ambiguities, because they help to choose a proper attractor in contexts of high structural ambiguity. Kernels on graphs are derived from recent results in spectral graph theory [14], particularly the so-called diffusion kernels, and provide a measure of similarity between pairs of vertices in the same graph. In the case of diffusion kernels, such a similarity relies on the probability that a lazy random walk (a random walk with a given probability of resting at the current vertex) reaches a given node from another one.

On the other hand, graph matching is only one of the two sides of structure classification (through protein comparison in this case). The other side is structure categorization, in terms of inferring groups of similar structures, which must be addressed in order to yield efficient comparisons after having obtained useful abstractions. The so-called graph-clustering problem has only been addressed recently [15][16][17][18][29][30], although early approaches mainly dealt with attributed relational graphs (ARGs). This is the case of random graphs [31][32], whose node and edge attributes are random variables whose joint probability distribution defines a probability measure over the space of all graphs inside a class (outcome graphs). However, although clustering techniques have been applied to categorize 3D folds [19] or parts of them like domains [20] (considered basic recurrent substructures), the categorization of protein surfaces, or parts of them like active sites (parts where proteins interact with other proteins), has not been addressed so far, and this is partially due to the complexity of the surface/structural registration problem. Moreover, having an acceptable solution for the registration of structures or substructures is not enough. One also needs a good algorithm for discovering the optimal number of structural families or clusters. In this regard, we have recently introduced an EM-like unsupervised and adaptive algorithm [21] which is an adaptation to the domain of graphs of the ACM algorithm proposed for clustering vectorial data [22][23]. Our clustering algorithm is able to cluster structures by iteratively discovering both the prototypical graph of each class and the optimal number of classes (this latter feature was not considered in the ACM algorithm). Although the driving registration algorithm in the original proposal was Comb searching (population-based matching), originally proposed for MRF-labelling problems [24], we have replaced it by the kernelized version of Softassign, and our initial experiments were promising [25]. Thus, kernelized Softassign is the key element both for structure comparison and for structure categorization.

In this paper, we first address the problem of protein surface comparison based on the surface graphs extracted after labelling interest points/patches as concave or convex. This labelling provides application-driven attributes, and we are interested in knowing the role of structural attributes coming from kernel computation in this context. In Section 2, we present the kernelized Softassign with attributes. Secondly, we address the problem of clustering surface graphs, and thus in Section 3 we present the details of both the incremental procedure for building graph prototypes and the EM algorithm for performing the clustering itself. In Section 4 we present and discuss representative experiments of both matching and clustering surface graphs with the graph-based techniques previously presented. Finally, in Section 5, we present our conclusions and outline our future work in this area.

2 Kernelized Softassign

Consider two graphs G_X = (V_X, E_X) and G_Y = (V_Y, E_Y), with m = |V_X| and n = |V_Y|, and adjacency matrices X_{ab} and Y_{ij} (X_{ab} = 1 if (a, b) ∈ E_X, and the same holds for Y). In the Softassign formulation, a feasible solution to the matching problem is encoded by a matrix M of size m × n, with M_{ai} = 1 if a ∈ V_X matches i ∈ V_Y and 0 otherwise, satisfying the constraints that each node in V_X must match either a unique node in V_Y or none of them, and vice versa. Thus, following the Gold and Rangarajan formulation, we are interested in finding the feasible solution M that maximizes the following cost function:

F(G_X, G_Y; M) = \frac{1}{2} \sum_{a=1}^{m} \sum_{i=1}^{n} \sum_{b=1}^{m} \sum_{j=1}^{n} M_{ai} M_{bj} C_{aibj},   (1)

where typically C_{aibj} = X_{ab} Y_{ij}. The latter cost function means that when a ∈ V_X matches i ∈ V_Y, it is convenient that nodes b adjacent to a (X_{ab} ≠ 0) also match nodes j adjacent to i (Y_{ij} ≠ 0). This is the well-known rectangle rule (we want to close as many rectangles M_{ai} Y_{ij} M_{bj} X_{ba} as possible).

Furthermore, let C^K_{aibj} be a redefinition of C_{aibj} that considers the information contained in the diffusion kernels derived from X and Y. Such kernels are, respectively, m × m and n × n matrices satisfying the semi-definite condition for kernels, and they are derived from the Laplacians of X and Y. The Laplacian of X is the m × m matrix L_X = D_X − X, where D_X is a diagonal matrix registering the degree of each vertex, and the same holds for L_Y. It turns out that the kernels K_X and K_Y result from the matrix exponentiation of the respective Laplacians:

K_X = e^{−β_m L_X} and K_Y = e^{−β_n L_Y},

where the normalization factors β_m and β_n, which depend on the number of vertices of each graph, have been introduced to make both kernels comparable. Then, the definition of C^K_{aibj} involves considering the values of K_X and K_Y as structural attributes for the corresponding vertices (see Fig. 1), because, as the Laplacians encode information about the local structure of each graph, its global structure emerges in the kernels. However, we have found that, as edit operations will give different kernels in terms of the diffusion processes, it is not adequate to build attributes on the individual values of the kernels. In particular, as for a given vertex (row) these values represent the probabilities of reaching the rest of the nodes (columns), a given row can be regarded as a probability distribution (see Fig. 1), and we may use a characterization of such a distribution to build structural attributes. Consequently, we define C^K_{aibj} as follows:

C^K_{aibj} = X_{ab} Y_{ij} δ_{aibj} K_{aibj},   (2)

where

K_{aibj} = \exp\left(−\left[(H^{K_X}_a − H^{K_Y}_i)^2 + (H^{K_X}_b − H^{K_Y}_j)^2\right]\right),   (3)

with H^{K_X} and H^{K_Y} the entropies of the probability distributions associated to the vertices and induced by K_X and K_Y, and δ_{aibj} = 1 if a matches a vertex i with the same curvature and b matches a vertex j with a similar curvature; otherwise it returns −1. This definition integrates the surface attributes into the cost function. When curvature compatibility arises, the latter definition of C^K_{aibj} ensures that C^K_{aibj} ≤ C_{aibj}, with equality only when nodes a and i have similar entropies, and the same for nodes b and j. In practice, this weights the rectangles in such a way that rectangles with compatible entropies at their opposite vertices are preferred; otherwise they are underweighted and do not attract the continuation process.
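As an illustration of how these structural attributes can be computed, the following Python sketch builds the diffusion kernel of a graph from its Laplacian and derives the per-vertex entropies used in Eq. 3. The row normalization and the value β = 1 are assumptions of ours for the example; the paper does not spell out these implementation details.

import numpy as np
from scipy.linalg import expm

def diffusion_kernel(A, beta=1.0):
    """Diffusion kernel K = exp(-(beta/n) L) of a graph with adjacency matrix A."""
    n = A.shape[0]
    L = np.diag(A.sum(axis=1)) - A        # Laplacian L = D - A
    return expm(-(beta / n) * L)          # matrix exponential of the scaled Laplacian

def vertex_entropies(K):
    """Entropy of the probability distribution induced by each kernel row."""
    P = K / K.sum(axis=1, keepdims=True)  # treat each row as a distribution
    P = np.clip(P, 1e-12, None)           # guard against log(0)
    return -(P * np.log(P)).sum(axis=1)

# Toy example: a 4-cycle; all entropies coincide since the graph is vertex-transitive.
A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
H = vertex_entropies(diffusion_kernel(A))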

In this continuation process, the matching variables are updated as follows (before being normalized):

M_{ai} = \exp\left[\frac{1}{T}\,\frac{\partial F}{\partial M_{ai}}\right] = \exp\left[\frac{1}{2T} \sum_{b=1}^{m} \sum_{j=1}^{n} M_{bj}\, C^K_{aibj}\right].   (4)

To see intuitively how the kernelized version works, Fig. 1 shows the matching preferred by the kernelized Softassign just before the clean-up heuristic is applied (vertices in dark grey represent convex patches, whereas vertices in light grey represent concave ones). Such a matching is the most coherent one in terms of structural subgraph compatibility.
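For concreteness, here is a minimal sketch of the graduated-assignment loop with the kernelized compatibilities of Eq. 2, assuming a plain Sinkhorn normalization without slack variables and a geometric annealing schedule; the parameter values (T0, annealing rate, iteration counts) are illustrative choices of ours, not those of the paper.

import numpy as np

def sinkhorn(M, iters=30):
    """Alternate row and column normalization (slack handling omitted)."""
    for _ in range(iters):
        M = M / M.sum(axis=1, keepdims=True)
        M = M / M.sum(axis=0, keepdims=True)
    return M

def kernelized_softassign(C, m, n, T0=1.0, T_final=0.01, rate=0.9):
    """Graduated assignment with compatibilities C[a, i, b, j] = C^K_aibj.
    Returns a near-doubly-stochastic match matrix, to be discretized by a
    clean-up heuristic afterwards."""
    M = np.full((m, n), 1.0 / n)               # uniform initial assignment
    T = T0
    while T > T_final:
        Q = np.einsum('aibj,bj->ai', C, M)     # Q_ai = sum_bj M_bj C_aibj
        M = np.exp((Q - Q.max()) / (2.0 * T))  # update of Eq. 4, shifted for stability
        M = sinkhorn(M)                        # project towards feasible matches
        T *= rate                              # deterministic annealing
    return M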

3 EM Graph Clustering

3.1 Graph Prototypes

3.1.1 Incremental Fusion

Suppose that we have N input graphs G_i = (V_i, E_i) and we want to obtain a representative prototype \bar{G} = (\bar{V}, \bar{E}) of them. We define such a prototype as

\bar{G} = \bigoplus_{i=1}^{N} (π_i ⊙ G_i), where \sum_{i=1}^{N} π_i = 1.   (5)

Firstly, the ⊙ operator implements the weighting of each graph G_i by π_i, which we interpret in terms of the probability of belonging to the class defined by the prototype. The definition of the resulting graph \hat{G}_i = π_i ⊙ G_i, with \hat{V}_i = V_i and \hat{E}_i = E_i, is extended in order to include a node-interpretation function µ_i : \hat{V}_i → [0, 1], addressed to register the relative frequency of each node, and an edge-interpretation function δ_i : \hat{E}_i → [0, 1], which must register the relative frequency of each edge. Then, weighting consists of assigning the following priors:

µ_i(a) = π_i ∀a ∈ \hat{V}_i, and δ_i(a, b) = π_i ∀(a, b) ∈ \hat{E}_i.   (6)

Secondly, given a set of weighted graphs, the ⊕ operator builds the prototype \bar{G} by fusing all of them. Ideally, the result should not depend on the order of presentation of the weighted graphs. However, such a solution would be too computationally demanding, because we would need to find a high number of isomorphisms, that is, to solve a high number of NP-complete problems. In this work we propose an incremental approach. Initially we randomly select a weighted graph \hat{G}_X = π_X ⊙ G_X as a temporary prototypical graph \bar{G}. Then we select, also randomly, a different graph \hat{G}_Y = π_Y ⊙ G_Y. The new temporary prototype results from fusing these graphs:

\bar{G} = \hat{G}_X = π_X ⊙ G_X
\bar{G} = \hat{G}_X ⊕ \hat{G}_Y = (π_X ⊙ G_X) ⊕ (π_Y ⊙ G_Y)
...

An individual fusion \hat{G}_X ⊕ \hat{G}_Y relies on finding the matching M maximizing

F(\hat{G}_X, \hat{G}_Y; M) = \sum_{a=1}^{m} \sum_{i=1}^{n} M_{ai} (1 + ε(µ_X(a) + µ_Y(i))) \sum_{b=1}^{m} \sum_{j=1}^{n} M_{bj} C^K_{aibj},   (7)

where C^K_{aibj} = X_{ab} Y_{ij} δ_{aibj} K_{aibj}, with \hat{X} = X and \hat{Y} = Y the respective adjacency matrices, and ε > 0 a small positive constant close to zero (in our implementation, ε = 0.0001). Such a cost function is an extension of Eq. 1 addressed to deal with ambiguity: if there are many matchings with equal cost in terms of F(\hat{G}_X, \hat{G}_Y; M), it prefers the alternative in which the nodes with the highest weights so far are matched.

The fused graph, that is, the new prototype \bar{G} = \hat{G}_X ⊕ \hat{G}_Y, registers the weights of its nodes and edges through its interpretation functions µ : \bar{V} → [0, 1] and δ : \bar{E} → [0, 1]. Such a registration relies on the optimal match

M^* = \arg\max_M F(\hat{G}_X, \hat{G}_Y; M).   (8)

The weights of the nodes m ∈ \bar{V} are updated as follows:

µ(m) = \begin{cases} µ_X(a) + µ_Y(i) & \text{if } ∃a ∈ V_X, ∃i ∈ V_Y : M^*_{ai} = 1 \\ µ_X(a) & \text{if } a ∈ V_X \\ µ_Y(i) & \text{if } i ∈ V_Y, \end{cases}   (9)

that is, we consider that m = a = i in the first case, m = a in the second, and m = i in the third.

Similarly, the weights of the edges (m, n) ∈ \bar{E} are updated by

δ(m, n) = \begin{cases} δ_X(a, b) + δ_Y(i, j) & \text{if } ∃(a, b) ∈ E_X, ∃(i, j) ∈ E_Y : M^*_{ai} = M^*_{bj} = 1 \\ δ_X(a, b) & \text{if } (a, b) ∈ E_X \\ δ_Y(i, j) & \text{if } (i, j) ∈ E_Y. \end{cases}   (10)

As before, we consider that (m, n) = (a, b) = (i, j) in the first case, (m, n) = (a, b) in the second, and (m, n) = (i, j) in the third. Consequently, weights are updated by considering that: (i) two matched nodes correspond to the same node in the prototype; (ii) edges connecting matched nodes are also the same edge; and (iii) nodes and edges existing only in one of the graphs must also be integrated into the prototype. In cases (i) and (ii) frequencies are added, whereas in case (iii) we retain the original frequencies.

Once all graphs in the set are fused, the resulting prototype \bar{G} must be pruned in order to retain only those nodes a ∈ \bar{V} with µ(a) ≥ 0.5 and those edges (a, b) ∈ \bar{E} with δ(a, b) ≥ 0.5. This results in a first-order median graph containing the nodes and edges with significant frequencies; these frequencies are discarded afterwards, that is, weights are only taken into account to obtain the prototype. The first-order median graph is implicitly defined by the updating Equations 9 and 10 for node and edge weights. Although the formal properties of this median graph are not addressed in the present paper, this graph retains the most significant frequencies of nodes and edges, and this is why it represents a first-order intersection of the graph set, that is, the noiseless structure which is common to all of the graphs in the set.
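A minimal sketch of this fusion bookkeeping is given below, assuming each weighted graph is stored as a dictionary with a node-weight map 'mu' and an edge-weight map 'delta', and that an external matcher (e.g., the kernelized Softassign sketched above) supplies the optimal correspondence M^* as a node-to-node dictionary; the helper names fuse and prune are ours, not the paper's.

def fuse(proto, graph, match):
    """Fuse a weighted graph into the current prototype.
    match maps nodes of graph to nodes of proto (the optimal M* of Eq. 8)."""
    for i, w in graph['mu'].items():
        if i in match:                          # matched node: add frequencies
            proto['mu'][match[i]] += w
        else:                                   # unmatched node: insert as new
            proto['mu'][('new', i)] = w
    for (i, j), w in graph['delta'].items():
        a = match.get(i, ('new', i))
        b = match.get(j, ('new', j))
        if i in match and j in match:           # edge between matched nodes
            proto['delta'][(a, b)] = proto['delta'].get((a, b), 0.0) + w
        else:                                   # edge existing only in this graph
            proto['delta'][(a, b)] = w
    return proto

def prune(proto, thr=0.5):
    """Keep only nodes and edges whose accumulated frequency reaches thr."""
    nodes = {a for a, w in proto['mu'].items() if w >= thr}
    edges = {e: w for e, w in proto['delta'].items()
             if w >= thr and e[0] in nodes and e[1] in nodes}
    return {'mu': {a: proto['mu'][a] for a in nodes}, 'delta': edges}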

3.1.2 Fusion Example

In order to illustrate our incremental fusion algorithm we have built a prototype from four graphs (upper row of Fig. 2). The fusion process proceeds as follows (lower row of Fig. 2, from left to right): (i) the first graph is taken as the temporary prototype, and each of its nodes and edges has a weight of 0.25; (ii) as the second graph is a supergraph of the current prototype, matched nodes and edges now receive a weight of 0.5, whereas the rest keep a weight of 0.25; (iii) when the third graph is incorporated, the latter weights are incremented; (iv) finally, the nodes and edges surviving the 0.5 pruning yield an X-shaped graph.

3.2 ACM for Graphs

Given N input graphs G_i = (V_i, E_i), the Asymmetric Clustering Model (ACM) for graphs finds the K graph prototypes \bar{G}_α = (\bar{V}_α, \bar{E}_α) and the class-membership variables I_{iα} ∈ {0, 1} maximizing the following cost function:

L(\bar{G}, I) = −\sum_{i=1}^{N} \sum_{α=1}^{K} I_{iα} (1 − F_{iα}),   (11)

where the F_{iα} are the values of a symmetric and normalized similarity measure between the individual graphs G_i and the prototypes \bar{G}_α (so 1 − F_{iα} acts as a dissimilarity). Such a measure is defined by

F_{iα} = \frac{\max_M [F(G_i, \bar{G}_α; M)]}{\max[F_{ii}, F_{αα}]},   (12)

where F_{ii} = F(G_i, G_i; I_{|V_i|}) and F_{αα} = F(\bar{G}_α, \bar{G}_α; I_{|\bar{V}_α|}), with I_{|V_i|} and I_{|\bar{V}_α|} the identity matrices defining self-matchings. This results in F_{iα} = F_{αi} ∈ [0, 1].
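For illustration, the normalized score of Eq. 12 reduces to a one-liner, assuming match_cost(Gi, Ga) returns max_M F(G_i, \bar{G}_α; M) (e.g., via the Softassign sketch above) and self_cost(G) evaluates F under the identity self-matching; both helper names are ours.

def normalized_similarity(Gi, Ga, match_cost, self_cost):
    """F_ia of Eq. 12: the optimal matching cost normalized by the larger of
    the two self-matching costs, so that F_ia = F_ai lies in [0, 1]."""
    return match_cost(Gi, Ga) / max(self_cost(Gi), self_cost(Ga))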

Prototypical graphs are built from all the individual graphs assigned to each class, but such an assignment depends on the membership variables. Here we adapt to the domain of graphs the EM approach proposed by Hofmann and Puzicha [22][23]. As in that approach the class memberships are hidden (unobserved) variables, we start by providing good initial estimations of both the prototypes and the memberships, and feed them to an iterative process in which we alternate the estimation of the expected memberships with the re-estimation of the prototypes.

3.2.1 Initialization.

Initial prototypes are selected by a greedy procedure: the first prototype is a randomly selected graph, and each of the following ones is the graph most dissimilar from the prototypes selected so far. Initial memberships I^0_{iα} are then given by:

I^0_{iα} = \begin{cases} 1 & \text{if } α = \arg\min_β [1 − F_{iβ}] \\ 0 & \text{otherwise.} \end{cases}

3.2.2 E-step.

This step consists of estimating the expected membership variables I_{iα} ∈ [0, 1] given the current estimation of each prototypical graph \bar{G}_α:

I^{t+1}_{iα} = \frac{ρ^t_α \exp[−(1 − F_{iα})/T]}{\sum_{β=1}^{K} ρ^t_β \exp[−(1 − F_{iβ})/T]}, where ρ^t_α = \frac{1}{N} \sum_{i=1}^{N} I^t_{iα},   (13)

that is, these variables encode the probability of assigning each graph G_i to class c_α at iteration t, and T is the temperature, a control parameter which is reduced at each iteration (we use the deterministic-annealing version of the E-step because it is less prone to local maxima than the un-annealed one).

As in this step we need to compute the dissimilarities 1 − F_{iα}, we need to solve N × K NP-complete problems.
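A compact sketch of this annealed E-step follows, assuming D is the N × K matrix of dissimilarities 1 − F_{iα} and I holds the current memberships; the names are ours.

import numpy as np

def e_step(D, I, T):
    """Annealed E-step (Eq. 13): soft memberships from the dissimilarity
    matrix D (D[i, a] = 1 - F_ia) and the current memberships I (N x K)."""
    rho = I.mean(axis=0)                     # class priors rho_a = (1/N) sum_i I_ia
    W = rho * np.exp(-D / T)                 # unnormalized responsibilities
    return W / W.sum(axis=1, keepdims=True)  # normalize over the K classes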

3.2.3 M-step.

Given the expected membership variables I^{t+1}_{iα}, the prototypical graphs are re-estimated as follows:

\bar{G}^{t+1}_α = \bigoplus_{i=1}^{N} (π_{iα} ⊙ G_i), where π_{iα} = \frac{I^{t+1}_{iα}}{\sum_{k=1}^{N} I^{t+1}_{kα}},   (14)

where the variables π_{iα} are the current probabilities of belonging to each class c_α. This step is completed after re-estimating the K prototypes, that is, after solving (N − 1) × K NP-complete problems. Moreover, the un-weighted median graph \bar{G}^{t+1}_α resulting from the fusion will be used in the E-step to re-estimate the similarities F_{iα} through maximizing F(G_i, \bar{G}_α; M) with kernelized Softassign.

3.2.4 Adaptation.

Assuming that the iterative process is divided into epochs, our adaptation mechanism consists of starting with a high number of classes K_max and then reducing that number, if appropriate, at the end of each epoch. At that moment we select the two closest prototypes \bar{G}_α and \bar{G}_β as candidates to be fused, and we compute h_α, the heterogeneity of c_α:

h_α = \sum_{i=1}^{N} (1 − F_{iα}) π_{iα},   (15)

obtaining h_β in a similar way. Then, we compute the fused prototype \bar{G}_γ by applying Equation 14 and considering that I_{iγ} = I_{iα} + I_{iβ}, that is,

\bar{G}_γ = \bigoplus_{i=1}^{N} (π_{iγ} ⊙ G_i).   (16)

Finally, we fuse c_α and c_β whenever h_γ < (h_α + h_β)µ, where µ ∈ [0, 1] is a merge factor addressed to facilitate class fusion (usually we set µ = 0.6). After such a decision a new epoch begins, and we wait until convergence before trying to fuse two other classes.

Testing whether two classes must be fused requires solving N − 1 NP-complete problems, but if we decide to fuse, the number of NP-complete problems to solve in each iteration of the next epoch is reduced by 2N − 1 (a reduction of N for the E-step and of N − 1 for the M-step).
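The merge test can be sketched as follows, reusing the membership matrix I and the dissimilarity matrix D from the E-step sketch; heterogeneity and should_merge are our names, and d_gamma stands for the dissimilarities 1 − F_{iγ} to the tentatively fused prototype.

import numpy as np

def heterogeneity(D, I, a):
    """h_a of Eq. 15, with pi_ia = I_ia / sum_k I_ka."""
    pi = I[:, a] / I[:, a].sum()
    return (D[:, a] * pi).sum()

def should_merge(D, I, a, b, d_gamma, merge_factor=0.6):
    """Fuse classes a and b iff the fused prototype's heterogeneity h_gamma
    stays below the weighted sum of the individual heterogeneities."""
    I_gamma = I[:, a] + I[:, b]              # I_ig = I_ia + I_ib (Eq. 16 weights)
    pi = I_gamma / I_gamma.sum()
    h_gamma = (d_gamma * pi).sum()
    return h_gamma < (heterogeneity(D, I, a) + heterogeneity(D, I, b)) * merge_factor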

4 Experimental Results

The purpose of this paper is to test the reliability and efficiency of the kernelized Softassign and the EM clustering described above for comparing and clustering protein surfaces. Surface graphs are obtained as follows. Given the triangulated Connolly surface of each protein, and considering the classification of each surface point as belonging to a concave, convex or toroidal patch, we cluster the points into patches and associate each cluster with a vertex in the surface graph. Retaining only concave and convex patches/vertices, we obtain the edges of the surface graph through a 3D triangulation of the centroids of all patches.
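A rough sketch of this construction is shown below, assuming the patch centroids and their concave/convex labels are already available; we use scipy's 3D Delaunay tetrahedralization and keep its edges, which is one plausible reading of the triangulation step described above.

import numpy as np
from itertools import combinations
from scipy.spatial import Delaunay

def surface_graph(centroids, labels):
    """Build a surface graph from patch centroids (k x 3 array) and
    per-patch curvature labels ('concave' or 'convex')."""
    tri = Delaunay(np.asarray(centroids))   # 3D Delaunay tetrahedralization
    edges = set()
    for simplex in tri.simplices:           # each simplex is a tetrahedron
        for a, b in combinations(simplex, 2):
            edges.add((min(a, b), max(a, b)))
    return {'labels': list(labels), 'edges': edges}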

4.1 Comparing Active Sites

A simple initial experiment consists of testing kernelized Softassign in matching substructures, that is, active sites. In order to do so we have chosen the experimental set presented by Pickering et al. in Ref. [6], where they propose to combine the Bron and Kerbosch algorithm [26], which solves the maximum common subgraph via clique enumeration, with superimposing the matched surfaces and minimizing the alignment error. In order to test their approach they selected several proteins from the Protein Data Bank (PDB) [27] and then extracted their active sites from the PDB files. The proteins considered were: 1a7k, 1agn, 1axg, 1hdx, 1hdy, 1hdz, 3hud, 1deh, 1hsz, 1dlt, and 6adh. For each protein, the sites are numbered, for instance, as 1a7k−0, 1a7k−1, and so on. In Fig. 3 we show sites corresponding to four proteins, and in Fig. 4 we give an example of a good matching, using kernelized Softassign, between two sites of the same protein, and an example of a bad matching between sites of different proteins.

To analyze the matching performance for sites of the same protein, Pickering et al. recorded the number of features matched. In Fig. 5 we compare the fraction of common features obtained by them with our normalized cost (the optimal matching cost between two graphs divided by the minimum of the self-matching costs of the compared graphs). In Fig. 5(a)-(c) we show the all-against-all comparisons for the four sites of 1a7k. In Fig. 5(d) we compare site 1agn−1 with the rest of the sites in the same molecule, and the same holds for 1hdx−1 in Fig. 5(f). From these comparisons it follows that our kernelized graph-matching algorithm is consistent with Bron & Kerbosch combined with alignment. However, such consistency decays when comparing sites of distantly related proteins (see Fig. 6). In this latter case, the higher similarity indexes given by our method are due to the fact that only graphs are compared and no alignment is performed. Consequently, as the 3D alignment constraints are not applied, graphs are allowed to deform to increase compatibility. This effect is not so pronounced when comparing sites within the same protein.

4.2 Comparing PDB Proteins

Before considering several protein families to test whether the approach is useful for matching complete proteins, we have measured the decay of our normalized cost as we introduce more and more structural noise. In Fig. 7 we show the results obtained for matching the protein 1crn with itself in the ideal case (isomorphism) and after artificially removing some atoms. We observe a progressive decay of the normalized cost.

In order to build a representative experimental set for comparing groups of proteins, we have considered proteins of five families extracted from the Protein Data Bank: Crambin/plant proteins, Ureases, Seryl, Hemoglobins, and Hydrolases (see Fig. 8). The number of vertices in the surface graphs ranges from 79 to 442, that is, we have heterogeneous families. We want to demonstrate that the normalized cost is a useful distance for predicting whether two proteins belong to the same family. Such a normalized cost is bounded by 0 and 1, and the higher the cost, the more similar the proteins should be. In Fig. 9 we show some representative matching results. Comparing proteins of the same family results in globally smooth matchings and normalized costs above 0.5 (1jxt−1jxu in Crambin). However, the cost falls below 0.1 for many proteins of different families (comparing an urease and a seryl, 1a5k−1set, we obtain a cost of 0.0701), because in these cases it is not possible to find globally consistent matches (only pairs of common subgraphs are matched).

4.3 Clustering Proteins

Another important question is whether the kernelized version significantly improves the performance of the classical version of Softassign. In our preliminary experiments for the non-attributed case, where the effect of structural ambiguity is higher, we found that kernelization yields, in comparison with classical Softassign, a slower performance decay with increasing structural corruption. However, this does not happen in our surface-comparison experiments, because the concavity/convexity attributes constrain the number of allowable correspondences so much that few purely structural ambiguities remain. Nevertheless, although using kernelized or classical Softassign has a negligible impact on the normalized cost obtained, we obtain noise-free representative graph prototypes for the different families. These prototypical graphs are interesting because the ACM algorithm for graphs, described above, yields a useful abstraction for subsequent protein queries. First of all, we test whether the normalized cost yields higher similarities between proteins in the same family than between proteins of different families. In Fig. 10 (left) we show the all-against-all similarities between proteins of different families (8 Hemoglobins, 5 Crambins, 5 Ureases, 4 Seryls, and 3 Hydrolases), which yields good clustering along the diagonal (white/grey level indicates high similarity). In Fig. 10 (right) we show the similarities between each protein and the prototypes of all classes. These similarities indicate that the prototypical graphs obtained through kernelized graph-matching and incremental fusion are very representative.

Finally, we compare the similarities due to surface comparison with the similarities due to 3D structure comparison (such information comes from the backbone of the protein, not from its surface). One of the most effective methods for 3D fold comparison is Dali [28] 1 . In Fig. 11 we show the 9 proteins closest, in terms of 3D structure, to 1b8qa. When analyzing these proteins in terms of their surfaces we obtain three sparse classes: {1pdr, 1qava, 1kwaa, 1be9a}, {1m5za, 3pdza, 1qlca, 1b8qa}, and {1i16, 1nf3d}.

1 http://www.bioinfo.biocenter.helsinki.fi:8080/dali/index.html

5 Conclusions and Future Work

In this paper we have presented a graph-matching method, relying on discrete kernels on graphs, to solve the problem of protein surface comparison, and an EM graph-clustering algorithm, relying on the same graph-matching method, to learn classes of protein surfaces. We extract surface graphs describing the topology of the surfaces and perform efficient and reliable graph matching by exploiting both application-driven and structural attributes. In this work we do not use metric attributes, due to the high computational demand (space requirements and temporal cost) associated with introducing edge-based attributes, and our experimental set and the derived conclusions focus on the effectiveness of purely graph-based matching and clustering methods in this context.

In the latter regard, our experimental results on substructure comparison are consistent with the combination of graph matching and 3D alignment, provided that we compare active sites of the same protein. However, when we compare sites of distantly related proteins, our score is higher than that of the method combining graph matching and alignment, because the 3D alignment constraints are not applied and graphs are allowed to deform in order to maximize the normalized cost.

When comparing complete proteins coming from five different families, our experiments show that the normalized cost derived from the kernelized Softassign algorithm is a useful similarity measure to cluster proteins and also to build structural prototypes for each family when the EM graph-clustering algorithm is applied to these data. These prototypes may be very useful for simplifying subsequent protein queries. We believe that this is an interesting starting point for finding clusters of protein surfaces in an unsupervised manner.

Our main conclusion is that the graph-based approach to comparing and clustering protein surfaces is promising, but some extensions related to the subsequent alignment of proteins should be introduced. This is why our future work in this field includes, besides the improvement of both the kernelized Softassign and the EM-based clustering algorithms, the integration of the graph-matching and graph-clustering algorithms with 3D alignment ones.

Acknowledgments

This work was partially supported by research grant PR2004-0260 of the Spanish Government.

References

[1] Halperin, I., Ma, B., Wolfson, H. & Nussinov, R., Principles of docking: An overview of search algorithms and a guide to scoring functions. Proteins 47, pp. 409-443 (2002).

[2] Connolly, M., Analytical molecular surface calculation. J Appl Cryst 16, pp. 548-558 (1983).

[3] Wolfson, H.J. & Lamdan, Y., Geometric hashing: A general and efficient model-based recognition scheme. In Proc of the IEEE Int Conf on Computer Vision, pp. 238-249 (1988).

[4] Nussinov, R. & Wolfson, H.J., Efficient detection of three-dimensional motifs in biological macromolecules by computer vision techniques. Proc of the Natl Acad Sci USA 88, pp. 10495-10499 (1991).

[5] Gardiner, E.J., Willett, P. & Artymiuk, P.J., Graph-theoretic techniques for macromolecular docking. J Chem Inform Comput Sci 40, pp. 273-279 (2000).

[6] Pickering, S.J., Bulpitt, A.J., Efford, N., Gold, N.D. & Westhead, D.R., AI-based algorithms for protein surface comparisons. Computers and Chemistry 26, pp. 79-84 (2001).

[7] Gold, S. & Rangarajan, A., A graduated assignment algorithm for graph matching. IEEE Trans on Patt Anal and Mach Int 18 (4), pp. 377-388 (1996).

[8] Finch, A.M., Wilson, R.C. & Hancock, E., An energy function and continuous edit process for graph matching. Neural Computation 10 (7), pp. 1873-1894 (1998).

[9] Pelillo, M., Replicator equations, maximal cliques, and graph isomorphism. Neural Computation 11, pp. 1933-1955 (1999).

[10] Luo, B. & Hancock, E.R., Structural graph matching using the EM algorithm and singular value decomposition. IEEE Trans on Patt Anal and Mach Int 23 (10), pp. 1120-1136 (2001).

[11] Lozano, M.A. & Escolano, F., A significant improvement of softassign with diffusion kernels. In Proc S+SPR'04, LNCS 3138, pp. 76-84 (2004).

[12] Kondor, R. & Lafferty, J., Diffusion kernels on graphs and other discrete input spaces. In Proc of the Intl Conf on Mach Learn, Morgan Kaufmann, pp. 315-322 (2002).

[13] Smola, A. & Kondor, R.I., Kernels and regularization on graphs. In Proc of Intl Conf on Comp Learn Theo and 7th Kernel Workshop, LNCS 2777, pp. 144-158 (2003).

[14] Chung, F.R.K., Spectral graph theory. Conference Board of the Mathematical Sciences (CBMS) 92, American Mathematical Society (1997).

[15] Jiang, X., Munger, A. & Bunke, H., On median graphs: Properties, algorithms, and applications. IEEE Trans on Patt Anal and Mach Int 23 (10), pp. 1144-1151 (2001).

[16] Luo, B., Wilson, R.C. & Hancock, E.R., A spectral approach to learning structural variations in graphs. In Proc of ICVS'03, LNCS 2626, pp. 407-417 (2003).

[17] Serratosa, F., Alquezar, R. & Sanfeliu, A., Function-described graphs for modelling objects represented by sets of attributed graphs. Pattern Recognition, Vol. 23, No. 3, pp. 781-798 (2003).

[18] Sanfeliu, A., Serratosa, F. & Alquezar, R., Clustering of attributed graphs and unsupervised synthesis of function-described graphs. In Proc of ICPR2000, 15th International Conference on Pattern Recognition, Barcelona, Spain, Vol. 2, pp. 1026-1029 (2000).

[19] Holm, L. & Sander, C., Mapping the protein universe. Science 273 (5275), pp. 595-603 (1996).

[20] Heger, A. & Holm, L., Exhaustive enumeration of protein domain families. J Mol Biol 328 (3), pp. 749-767 (2003).

[21] Lozano, M.A. & Escolano, F., EM algorithm for clustering an ensemble of graphs with comb matching. In Proc of EMMCVPR'03, LNCS 2683, pp. 52-67 (2003).

[22] Hofmann, T. & Puzicha, J., Statistical models for co-occurrence data. MIT AI-Memo 1625, Cambridge, MA (1998).

[23] Puzicha, J., Histogram clustering for unsupervised segmentation and image retrieval. Pattern Recognition Letters 20, pp. 899-909 (1999).

[24] Li, S.Z., Toward global solution to MAP image estimation: Using common structure of local solutions. In Proc of EMMCVPR'97, LNCS 1223, pp. 361-374 (1997).

[25] Lozano, M.A. & Escolano, F., Structural recognition with kernelized softassign. In Proc of IBERAMIA'04, LNCS 3315, pp. 626-635 (2004).

[26] Bron, C. & Kerbosch, J., Finding all cliques of an undirected graph. Communications of the ACM 16, pp. 575-577 (1973).

[27] Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. & Bourne, P.E., The Protein Data Bank. Nucleic Acids Research 28, pp. 235-242 (2000).

[28] Holm, L. & Sander, C., Protein structure comparison by alignment of distance matrices. J Mol Biol 233 (1), pp. 123-138 (1993).

[29] Gunter, S. & Bunke, H., Self-organizing map for clustering in the graph domain. Pattern Recognition Letters 23, pp. 401-417 (2002).

[30] Gunter, S. & Bunke, H., Validation indices for graph clustering. Pattern Recognition Letters 24 (8), pp. 1107-1113 (2003).

[31] Wong, A., Constant, J. & You, M., Random graphs. In Syntactic and Structural Pattern Recognition: Theory and Applications 4, pp. 197-234 (1990).

[32] Bunke, H., Foggia, P., Guidobaldi, C. & Vento, M., Graph clustering using the weighted minimum common supergraph. In Proc of GbR'03, LNCS 2726, pp. 235-246 (2003).

Vitae

Miguel Angel Lozano received the B.S. degree in computer science from the University of Alicante, Spain, in 2001. Since 2004 he has been a teaching assistant with the Department of Computer Science and Artificial Intelligence of the University of Alicante. He has visited Edwin Hancock's Computer Vision & Pattern Recognition Lab at the University of York, and more recently, Liisa Holm's Bioinformatics Lab at the University of Helsinki. He is a member of the Robot Vision Group of the University of Alicante, and his research interests include computer vision and robotics.

Francisco Escolano received the B.S. degree in computer science from the Polytechnical University of Valencia, Spain, in 1992, and the Ph.D. degree in computer science from the University of Alicante in 1997. Since 1998 he has been an associate professor with the Department of Computer Science and Artificial Intelligence of the University of Alicante. He has been a postdoctoral fellow with Dr. Norberto M. Grzywacz at the Biomedical Engineering Department of the University of Southern California, Los Angeles, CA, and he has also collaborated with Dr. Alan Yuille at the Smith-Kettlewell Eye Research Institute, San Francisco, CA. Recently he has visited Liisa Holm's Bioinformatics Lab at the University of Helsinki. He is the head of the Robot Vision Group of the University of Alicante, whose research interests focus on the development of efficient and reliable computer-vision algorithms for biomedical applications, active vision and robotics, and video analysis.

Summary. In this paper we address the problem of comparing and classifying protein surfaces with graph-based methods. Given the so-called Connolly surface, the part of the van der Waals surface of the protein accessible/excluded to/by a probe of given radius, the surface is described by considering the centroids of all concave and convex patches and then forming a 3D Delaunay triangulation. Therefore, each node of the so-called surface graph resulting from the triangulation corresponds to a patch and is labelled as concave or convex, and the edges come from the metric relations between the centroids of neighboring patches. Distances between centroids (edge attributes) are not considered in this work, in order to reduce the spatial and computational cost of the representation.

Given two surface graphs, protein comparison relies on matching them through a kernelized version of the Softassign graph-matching algorithm. Softassign optimizes an energy function which is quadratic with respect to the assignment variables. In our preliminary experiments with random graphs, we have reported that the performance decay of the algorithm with increasing levels of noise (node deletion) may be attenuated if we properly weight the original quadratic function. Such a weighting relies on distributional information coming from kernel computations on graphs. When using kernels on graphs we obtain structural features associated to the vertices which contribute to removing ambiguities. When applied to matching surface graphs at the substructure level (active sites), we find results very consistent with recent experiments combining topology and metric information, provided that we match sites within the same protein. When matching complete structures we obtain a useful similarity measure for differentiating between proteins of the same family and proteins of different families. However, in this latter case (different families) the decay of the score (normalized cost) is significant, although attenuated by kernelization, when the sizes of the compared proteins are different.

On the other hand, classification is performed by clustering the surface graphs with an EM-like algorithm, also relying on kernelized Softassign, and then calculating the distance of an input surface graph to the closest prototype. Our clustering algorithm is able to group structures by iteratively discovering the prototypical graph of each class and adjusting the optimal number of classes. Extracting the temporary prototypical graph of a given class (M-step) is addressed by an incremental algorithm that fuses all graphs weighted by their current probabilities of belonging to the class. The E-step consists of updating the probabilities of belonging to each class on the basis of the new prototypes. Starting from a high number of potential classes, fusion is performed if it yields a reduction of heterogeneity. When clustering complete proteins the results are coherent with the a priori existing families.

Our main conclusion is that the application of kernelized matching and clustering to classify proteins yields a useful topological measure in many cases, but it should be complemented by metric information within complementary alignment processes, and this is a task to address in the future.

Figure 1. Kernelized Softassign. (a) Kernel information for node 1 of graph Y and for all nodes in graph X. (b) Matching with kernelized Softassign. (c) Kernel matrices.

Figure 2. Fusion example (step-by-step). In the upper row we show the graphs to fuse (vertices in dark grey are convex and those in light grey are concave). In the lower row we show the state of the prototype as each of these graphs is incorporated.

Figure 3. Examples of active sites from four proteins: (a) 1a7k−0, (b) 1agn−1, (c) 1axg−0, (d) 1hdy−1.

Figure 4. Matching of active sites. (a) Correct matching between sites 1a7k−1 and 1a7k−3. (b) Incorrect matching between sites 1a7k−3 and 1axg−3.

Figure 5. Comparing our method with the method of Pickering et al. The y-axis represents the obtained score. In each case the listed sites are compared with: (a) 1a7k−0, (b) 1a7k−1, (c) 1a7k−2, (d) 1agn−1, (e) 1axg−0, (f) 1hdx−1.

Figure 6. More comparisons between our method and the method of Pickering et al., comparing sites with (a) 1axg−0 and (b) 1gd10.

Figure 7. Matching results for 1crn (328 atoms) with artificial noise. (a) Isomorphism (cost = 1), (b) removing 10 atoms (cost = 0.9561), (c) removing 20 atoms (cost = 0.9203), (d) removing 30 atoms (cost = 0.9027), (e) removing 100 atoms (cost = 0.5528), (f) removing 150 atoms (cost = 0.547).

Figure 8. Examples of the five families used in the experiments. From left to right: Hemoglobins (1a00, 1a0u, 1a0z), Crambin/plant proteins (1jxt, 1jxw, 1jxy), Ureases (1a5k, 1a5m, 1a5o), Seryl (1ser, 1ses, 1sry), and Hydrolases (1tca, 1tcb, 1tcc). In the surfaces, convex patches are represented in dark grey whereas concave patches are shown in light grey. In the bottom row we show the graphs for the surfaces shown in the first row (1a00, 1jxt, 1a5k, 1ser, 1tca).

Figure 9. Some matching results. (a) Correct matching between two proteins of the same family (1jxt−1jxu, with cost 0.7515). (b) Incorrect matching between proteins of different families (1a5k−1set, with cost 0.0701).

Figure 10. Matching results for several families: Hemoglobins (1a00, 1a01, 1a0u, 1a0v, 1a0w, 1a0x, 1a0z, 1gzx), Crambin (1jxt, 1jxu, 1jxw, 1jxx, 1jxy), Ureases (1a5k, 1a5l, 1a5m, 1a5n, 1a5o), Seryl (1ser, 1ses, 1set, 1sry) and Hydrolases (1tca, 1tcb, 1tcc). (a) All-against-all comparisons. (b) Comparisons with all the prototypes.

Figure 11. Comparison with 3D structure. Top: 1b8qa and the 9 proteins closest to it according to Dali scores, ordered by surface compatibility: (1) 1pdr, (2) 1qava, (3) 1kwaa, (4) 1be9a, (5) 1m5za, (6) 3pdza, (7) 1qlca, (8) 1b8qa, (9) 1i16, (10) 1nf3d. Bottom left (a): pairwise normalized distances. Bottom right (b): RMS error given by Dali.