Protein Classification by Matching and
Clustering Surface Graphs
M.A. Lozano, F. Escolano ∗
Departamento de Ciencia de la Computacion e Inteligencia Artificial, Universidad
de Alicante, E-03080, Alicante, Spain
Abstract
In this paper we address the problem of comparing and classifying protein surfaces
with graph-based methods. Comparison relies on matching surface graphs, extracted
from the surfaces by considering concave and convex patches, through a kernelized
version of the Softassign graph-matching algorithm. On the other hand, classifica-
tion is performed by clustering the surface graphs with an EM-like algorithm, also
relying on kernelized Softassign, and then calculating the distance of an input sur-
face graph to the closest prototype. We present experiments showing the suitability
of kernelized Softassign for both comparing and classifying surface graphs.
Key words: protein classification, graph matching, energy minimization, graph
clustering, EM-algorithms
PACS: 89.80.+h, 42.30.Sy, 42.30.Tz, 87.10.+e
∗ Corresponding author. Fax: +34-965-903902. Email address: sco@dccia.ua.es (F. Escolano).
Preprint submitted to Elsevier Science 9th August 2005
1 Introduction
In recent years, there has been a growing interest in exploiting the 3D informa-
tion derived from the molecular surfaces of proteins in order to infer similarities
between them. This is due to the fact that, as molecular function occurs at the
protein surface, these similarities contribute to understanding their function
and, in addition, reveal interesting evolutionary links between proteins. In
particular, there are two computational problems in which surface-based methods
play a key role: registration, that is, determining whether two proteins have
a similar shape, and thus develop a similar function; and docking, that is,
determining whether two proteins have a complementary shape and thus can
potentially bind to form a complex (in antigen-antibody reactions, enzyme
catalysis, and so on). In this regard, the two central elements needed
to solve both problems are: a proper surface description and an efficient and
reliable matching algorithm [1]. Surfaces are often described by the interest
points associated with concave, convex or toroidal patches of the so-called
Connolly surface [2], the part of the van der Waals surface accessible to, or
excluded by, a probe with a typical radius of 1.4 Å. Given such a description,
and assuming molecular rigidity, it is possible to find the optimal 3D
transformation satisfying the condition of shape coincidence or complementarity.
Geometric hashing [3][4], which infers the most effective transformation needed
to make the interest points of both protein surfaces compatible, has proved to
be one of the most effective approaches in this context.
Other approaches related to computer vision come from graph theory and usu-
ally exploit discrete techniques to find the maximal clique [5] or the maximum
common subgraph [6] in graphs with vertices encoding information of parts
of the surfaces and edges representing topological or metric relations between
these parts. However, the advent of more efficient, though approximate, continuous
methods for matching [7][8][9][10] recommends an exhaustive evaluation
of graph-based surface registration and docking under this perspective. One of
these methods is the graduated assignment algorithm known as Softassign,
proposed by Gold and Rangarajan [7]. Softassign optimizes an energy func-
tion, which is quadratic with respect to the assignment variables, through a
continuation process (deterministic annealing) which updates the assignment
variables while ensuring that a tentative solution represents a feasible match,
at least in the continuous case. After such process, a cleanup heuristic trans-
lates the continuous solution to a discrete one. However, it has been reported
that the performance of this algorithm decays significantly with increasing
levels of structural corruption (number of deleted nodes), and also that such a
decay can be attenuated by optimizing an alternative non-quadratic cost func-
tion [8]. Furthermore, in our preliminary experiments with random graphs,
we have reported a similar attenuation of this decay when the original quadratic
function is properly weighted [11]. Such a weighting relies on distributional information coming
from kernel computations on graphs. The key idea is that when working with
non-attributed graphs there is a high level of structural ambiguity and such
ambiguity is not solved by classical Softassign. However, when using kernels on
graphs [12][13] we obtain structural features associated with the vertices, which
contribute to removing ambiguities because they help to choose a proper
attractor in contexts of high structural ambiguity. Kernels on graphs are derived
from recent results in spectral graph theory [14], particularly the so called dif-
fusion kernels, and provide a measure of similarity between pairs of vertices in
the same graph. In the case of diffusion kernels, such a similarity relies on the
probability that a lazy random walk (a random walk with a given probability
of resting at the current vertex) reaches a given node from another one.
On the other hand, graph-matching is only one of the two sides of structure
classification (through protein comparison in this case). The other side refers
to structure categorization, in terms of inferring groups of similar structures,
which must be addressed in order to yield efficient comparisons after having
obtained useful abstractions. The so-called graph-clustering problem has only
been addressed recently [15][16][17][18][29][30], although early approaches were
mainly devoted to attributed relational graphs (ARGs). This is the case of
random graphs [31][32], whose node and edge attributes are random variables
whose joint probability distribution defines a probability measure over the
space of all graphs inside a class (outcome graphs). However,
although clustering techniques have been applied to categorize 3D folds [19],
or parts of them like domains [20] (considered basic recurrent substructures),
the categorization of protein surfaces, or parts of them like active sites
(regions where proteins interact with other proteins), has not been addressed
so far, and this is partially due to the complexity of the surface/structural
registration problem. However, having an acceptable solution for the registration
of structures or substructures is not enough. One needs a good algorithm for dis-
covering the optimal number of structural families or clusters. In this regard,
we have recently introduced an EM-like unsupervised and adaptive algorithm
[21] which is an adaptation to the domain of graphs of the ACM algorithm
proposed for clustering vectorial data [22][23]. Our clustering algorithm is able
to cluster structures by iteratively discovering the prototypical graph of each
class while adjusting the optimal number of classes (this latter feature
was not considered in the ACM algorithm). Although the driving registration
algorithm in the original proposal is Comb searching (population-based
matching), originally proposed for MRF-labelling problems [24], we have
replaced it by the kernelized version of Softassign, and our initial experiments
were promising [25]. Thus, kernelized Softassign is the key element both for
structure comparison and for structure categorization.
In this paper, we address firstly the problem of protein surface comparison
from the surface graphs extracted after labelling interest points/patches as
concave or convex. This labelling provides application-driven attributes, and
we are interested in knowing the role of structural attributes coming from
kernel computation in this context. In Section 2, we present the kernelized
Softassign with attributes. Secondly, we address the problem of clustering
surface graphs and thus in Section 3 we present the details of both the in-
cremental procedure for building graph prototypes and the EM algorithm for
performing the clustering itself. In Section 4 we present and discuss represen-
tative experiments of both matching and clustering surface graphs with our
graph-based techniques previously presented. Finally, in Section 5, we present
our conclusions and outline our future work in this area.
2 Kernelized Softassign
Consider two graphs G_X = (V_X, E_X) and G_Y = (V_Y, E_Y), with m = |V_X| and
n = |V_Y|, and adjacency matrices X and Y (X_{ab} = 1 if (a, b) ∈ E_X, and
the same holds for Y). In the Softassign formulation a feasible solution to the
matching problem is encoded by a matrix M of size m × n, with M_{ai} = 1 if
a ∈ V_X matches i ∈ V_Y and 0 otherwise, satisfying the constraints that
each node in V_X must match either a unique node in V_Y or none of them,
and vice versa. Thus, following the Gold and Rangarajan formulation, we are
interested in finding the feasible solution M that maximizes the following cost
function

F(G_X, G_Y; M) = \frac{1}{2} \sum_{a=1}^{m} \sum_{i=1}^{n} \sum_{b=1}^{m} \sum_{j=1}^{n} M_{ai} M_{bj} C_{aibj},   (1)
where typically C_{aibj} = X_{ab} Y_{ij}. The latter cost function means that when
a ∈ V_X matches i ∈ V_Y, it is convenient that nodes b adjacent to a (X_{ab} ≠ 0)
also match nodes j adjacent to i (also Y_{ij} ≠ 0). This is the well-known
rectangle rule (we want to obtain as many closed rectangles
M_{ai} X_{ab} Y_{ij} M_{bj} as possible).
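As a concrete illustration, the quadratic cost of Eq. (1) with C_{aibj} = X_{ab} Y_{ij} can be evaluated as a single tensor contraction. The following is a minimal sketch in numpy; the function name and the toy graphs are ours, not part of the original formulation:

```python
import numpy as np

def matching_cost(X, Y, M):
    """Quadratic cost of Eq. (1) with C_aibj = X_ab * Y_ij.

    X (m x m) and Y (n x n) are adjacency matrices; M (m x n) is a
    (possibly continuous) assignment matrix.  The contraction counts
    the closed rectangles M_ai X_ab Y_ij M_bj, halved as in Eq. (1)."""
    return 0.5 * np.einsum('ai,ab,ij,bj->', M, X, Y, M)

# Matching a triangle to itself with the identity assignment:
K3 = np.ones((3, 3)) - np.eye(3)         # adjacency of the 3-cycle
print(matching_cost(K3, K3, np.eye(3)))  # 3.0: six directed edges, halved
```

The contraction makes the rectangle rule explicit: every nonzero term requires both an edge (a, b) in G_X and an edge (i, j) in G_Y whose endpoints are matched.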
Furthermore, let C^K_{aibj} be a redefinition of C_{aibj} that considers the
information contained in the diffusion kernels derived from X and Y. Such kernels
are respectively m × m and n × n matrices, satisfying the semi-definite condition
for kernels, which are derived from the Laplacians of X and Y. The Laplacian
of X is the m × m matrix L_X = D_X − X, where D_X is a diagonal matrix
registering the degree of each vertex, and the same holds for L_Y. It turns out
that the kernels K_X and K_Y result from the matrix exponentiation of their
respective Laplacians:

K_X = e^{−(β/m) L_X}  and  K_Y = e^{−(β/n) L_Y},

where the normalization factors depending on the number of vertices of each
graph have been introduced to make both kernels comparable. Then, the
definition of C^K_{aibj} involves considering the values of K_X and K_Y as
structural attributes for the corresponding vertices (see Fig. 1) because, as
the Laplacians encode information about the local structure of the graph, its global
structure emerges in the kernels. However, we have found that, as edit
operations will give different kernels in terms of the diffusion processes, it is
not an adequate choice to build attributes on the individual values of the kernels.
Particularly, as for a given vertex (row) these values represent probabilities
of reaching the rest of the nodes (columns), it can be considered that a given
row represents a probability distribution (see Fig. 1) and we may use a char-
acterization of such distribution to build structural attributes. Consequently,
we define C^K_{aibj} as follows:

C^K_{aibj} = X_{ab} Y_{ij} δ_{aibj} K_{aibj},   (2)

where

K_{aibj} = \exp\left[ −\left( (H^{K_X}_a − H^{K_Y}_i)^2 + (H^{K_X}_b − H^{K_Y}_j)^2 \right) \right],   (3)

where H^{K_X} and H^{K_Y} are the entropies of the probability distributions
associated to the vertices and induced by K_X and K_Y, and δ_{aibj} = 1 if a
matches a vertex i with the same curvature type and b also matches a vertex j
with the same curvature type; otherwise it is −1. This definition integrates the
surface attributes into the cost function. When curvature compatibility arises,
the latter definition ensures that C^K_{aibj} ≤ C_{aibj}, with equality only
when nodes a and i have similar entropies, and the same for nodes b and j. In
practice, this weights the rectangles in such a way that rectangles with
compatible entropies at their opposite vertices are preferred; otherwise they
are underweighted and do not attract the continuation process.
In such a process, the matching variables are updated as follows (before being
normalized):

M_{ai} = \exp\left[ \frac{1}{T} \frac{\partial F}{\partial M_{ai}} \right] = \exp\left[ \frac{1}{T} \sum_{b=1}^{m} \sum_{j=1}^{n} M_{bj} C^K_{aibj} \right].   (4)
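A single annealing step of the update in Eq. (4), followed by the alternating row/column (Sinkhorn) normalization that keeps M close to a doubly stochastic feasible match, can be sketched as follows. This is a simplified illustration of our own in numpy: the graphs have equal size, the full graduated assignment algorithm also handles unmatched nodes via an extra slack row and column (omitted here), and the small diagonal bonus added to C is a stand-in for the attribute terms of Eqs. (2)-(3) that break structural ambiguity:

```python
import numpy as np

def softassign_step(M, C, T, n_norm=30):
    """One deterministic-annealing step: exponentiate the gradient of
    the quadratic cost and alternately normalize rows and columns
    (Sinkhorn) so the solution stays close to a feasible match."""
    Q = np.einsum('bj,aibj->ai', M, C)   # dF/dM_ai = sum_bj M_bj C_aibj
    M = np.exp(Q / T)
    for _ in range(n_norm):
        M = M / M.sum(axis=1, keepdims=True)
        M = M / M.sum(axis=0, keepdims=True)
    return M

# Toy example: a 3-node path matched to itself.
X = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
C = np.einsum('ab,ij->aibj', X, X)       # C[a,i,b,j] = X_ab * X_ij
for a in range(3):
    for b in range(3):
        C[a, a, b, b] += 0.1             # attribute-like tie-breaking bonus
M = np.full((3, 3), 1.0 / 3.0)
for T in [1.0, 0.5, 0.2, 0.1, 0.05]:     # annealing schedule
    M = softassign_step(M, C, T)
print(M.argmax(axis=1))                  # converges to the identity match
```

Without the diagonal bonus the path graph is ambiguous (it can be matched to itself forwards or backwards), which is precisely the kind of situation the entropy attributes are meant to resolve.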
To see intuitively how the kernelized version works, in Fig. 1 (vertices in dark
grey represent convex patches, whereas vertices in light grey represent concave
ones) we show the matching preferred by the kernelized Softassign just before
performing the clean-up heuristic. Such a matching is the most coherent one
in terms of structural subgraph compatibility.
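The structural attributes of Eqs. (2)-(3) follow directly from the definitions above: build the Laplacian, exponentiate it, read each kernel row as the distribution of a lazy random walk, and take its entropy. A minimal sketch with numpy/scipy (function and variable names are ours):

```python
import numpy as np
from scipy.linalg import expm

def entropy_attributes(X, beta=1.0):
    """Per-vertex structural attributes for the kernelized Softassign.

    K = expm(-(beta/m) * L) with L = D - X, as defined in Section 2;
    each row of K is read as a probability distribution over the
    vertices, and its entropy H_a becomes the attribute of vertex a."""
    m = X.shape[0]
    L = np.diag(X.sum(axis=1)) - X               # graph Laplacian D - X
    K = expm(-(beta / m) * L)                    # diffusion kernel
    P = K / K.sum(axis=1, keepdims=True)         # rows as distributions
    H = -(P * np.log(P + 1e-12)).sum(axis=1)     # row entropies
    return K, H

# On a 3-node path the two endpoints get identical entropies, while the
# middle vertex, which diffuses to both sides, gets a different one.
X = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
K, H = entropy_attributes(X)
```

Since the Laplacian has zero row sums, the kernel rows already sum to one; the explicit normalization simply makes the probabilistic reading explicit.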
3 EM Graph Clustering
3.1 Graph Prototypes
3.1.1 Incremental Fusion
Suppose that we have N input graphs G_i = (V_i, E_i) and we want to obtain a
representative prototype \bar{G} = (\bar{V}, \bar{E}) of them. We define such a
prototype as

\bar{G} = \bigoplus_{i=1}^{N} (π_i ⊙ G_i),  where  \sum_{i=1}^{N} π_i = 1.   (5)
Firstly, the ⊙ operator implements the weighting of each graph G_i by π_i,
which we interpret in terms of the probability of belonging to the class
defined by the prototype. The definition of the resulting graph
\hat{G}_i = π_i ⊙ G_i, with \hat{V}_i = V_i and \hat{E}_i = E_i, is extended
in order to include a node-interpretation function µ_i : V_i → [0, 1],
addressed to register the relative frequency of each node, and an
edge-interpretation function δ_i : E_i → [0, 1], which must register the
relative frequency of each edge. Then, weighting consists of assigning the
following priors:

µ_i(a) = π_i ∀a ∈ V_i,  and  δ_i(a, b) = π_i ∀(a, b) ∈ E_i.   (6)
Secondly, given a set of weighted graphs, the ⊕ operator builds the prototype
\bar{G} by fusing all of them. Ideally, the result should not depend on the
order of presentation of the weighted graphs. However, such a solution would be
too computationally demanding, because we would need to find a high number of
isomorphisms, that is, to solve a high number of NP-complete problems. In this
work we propose an incremental approach. Initially we randomly select a weighted
graph \hat{G}_X = π_X ⊙ G_X as a temporary prototypical graph \bar{G}. Then we
select, also randomly, a different graph \hat{G}_Y = π_Y ⊙ G_Y. The new
temporary prototype results from fusing these graphs:

\bar{G} = \hat{G}_X = π_X ⊙ G_X
\bar{G} = \hat{G}_X ⊕ \hat{G}_Y = (π_X ⊙ G_X) ⊕ (π_Y ⊙ G_Y)
. . .
An individual fusion \hat{G}_X ⊕ \hat{G}_Y relies on finding the matching M
maximizing

F(\hat{G}_X, \hat{G}_Y; M) = \sum_{a=1}^{m} \sum_{i=1}^{n} M_{ai} (1 + ε(µ_X(a) + µ_Y(i))) \sum_{b=1}^{m} \sum_{j=1}^{n} M_{bj} C^K_{aibj},   (7)

where C^K_{aibj} = \hat{X}_{ab} \hat{Y}_{ij} δ_{aibj} K_{aibj}, being
\hat{X} = X and \hat{Y} = Y the respective adjacency matrices, and ε > 0 a
small positive constant close to zero (in our implementation ε = 0.0001). Such
a cost function is an extension of Eq. 1 addressed to deal with ambiguity: if
there are many matchings with equal cost in terms of F(\hat{G}_X, \hat{G}_Y; M),
it prefers the alternative in which the nodes with the highest weights so far
are matched.
The fused graph, that is, the new prototype \bar{G} = \hat{G}_X ⊕ \hat{G}_Y,
registers the weights of its nodes and edges through its interpretation
functions µ : \bar{V} → [0, 1] and δ : \bar{E} → [0, 1]. Such a registration
relies on the optimal match

M^* = \arg\max_M F(\hat{G}_X, \hat{G}_Y; M).   (8)
The weights of the nodes m ∈ \bar{V} are updated as follows:

µ(m) = \begin{cases} µ_X(a) + µ_Y(i) & \text{if } ∃a ∈ V_X, ∃i ∈ V_Y : M^*_{ai} = 1 \\ µ_X(a) & \text{if } a ∈ V_X \\ µ_Y(i) & \text{if } i ∈ V_Y \end{cases}   (9)

that is, we consider that m = a = i in the first case, m = a in the second, and
m = i in the third.
Similarly, the weights of the edges (m, n) ∈ \bar{E} are updated by

δ(m, n) = \begin{cases} δ_X(a, b) + δ_Y(i, j) & \text{if } ∃(a, b) ∈ E_X, ∃(i, j) ∈ E_Y : M^*_{ai} = M^*_{bj} = 1 \\ δ_X(a, b) & \text{if } (a, b) ∈ E_X \\ δ_Y(i, j) & \text{if } (i, j) ∈ E_Y. \end{cases}   (10)
As before we consider that (m,n) = (a, b) = (i, j) in the first case, (m,n) =
(a, b) in the second, and (m,n) = (i, j) in the third. Consequently, weights are
updated by considering that: (i) two matched nodes correspond to the same
node in the prototype; (ii) edges connecting matched nodes are also the same
edge; and (iii) nodes and edges existing only in one of the graphs must be
also integrated in the prototype. In cases (i) and (ii) frequencies are added,
whereas in case (iii) we retain the original frequencies.
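The updates of Eqs. (9)-(10) reduce to simple bookkeeping of relative frequencies. A minimal node-weight sketch in plain Python (the dictionary representation and the helper names are ours; edge weights follow the same pattern):

```python
def fuse_node_weights(mu_X, mu_Y, matches):
    """Eq. (9): matched nodes add their frequencies, unmatched nodes
    keep their own.  mu_X / mu_Y map node ids to weights; `matches`
    maps a in V_X to its partner i in V_Y under the optimal M*."""
    matched_Y = set(matches.values())
    mu = {}
    for a, w in mu_X.items():                  # cases (i), m = a = i, and m = a
        mu[('X', a)] = w + mu_Y.get(matches.get(a), 0.0)
    for i, w in mu_Y.items():                  # case (iii): nodes only in G_Y
        if i not in matched_Y:
            mu[('Y', i)] = w
    return mu

def prune(weights, threshold=0.5):
    """Keep only nodes/edges with significant frequency (Section 3.1.1)."""
    return {k: w for k, w in weights.items() if w >= threshold}

# Two graphs, each weighted 0.25; nodes 0 and 1 are matched across them.
mu = fuse_node_weights({0: 0.25, 1: 0.25}, {0: 0.25, 1: 0.25, 2: 0.25},
                       matches={0: 0, 1: 1})
# mu == {('X', 0): 0.5, ('X', 1): 0.5, ('Y', 2): 0.25}
```

The subsequent 0.5 pruning then keeps only the two matched nodes, illustrating how the first-order median graph emerges from the accumulated frequencies.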
Once all graphs in the set are fused, the resulting prototype \bar{G} must be
pruned in order to retain only those nodes a ∈ \bar{V} with µ(a) ≥ 0.5 and
those edges (a, b) ∈ \bar{E} with δ(a, b) ≥ 0.5. This results in first-order
median graphs containing only nodes and edges with significant frequencies;
the frequencies themselves are discarded afterwards, that is, the weights are
taken into account only to obtain the prototype. The first-order median graph
is implicitly defined by the updating Equations 9 and 10 for the node and edge
weights. Although the formal properties of this median graph are not addressed
in the present paper, it retains the most significant frequencies of nodes and
edges, and this is why it represents a first-order intersection of the graph
set, that is, the noiseless structure which is common to all of the graphs in
the set.
3.1.2 Fusion Example
In order to illustrate our incremental fusion algorithm we have built a prototype
from four graphs (upper row of Fig. 2). The fusion process proceeds as follows
(lower row of Fig. 2, from left to right): (i) the first graph is taken as the
temporary prototype, and each node and edge has a weight of 0.25; (ii) as the
second graph is a supergraph of the current prototype, matched nodes and edges
now receive a weight of 0.5, whereas the rest keep a weight of 0.25; (iii) when
the third graph is incorporated, the latter weights are incremented; (iv) finally,
the nodes and edges surviving the 0.5 pruning give an X-shaped graph.
3.2 ACM for Graphs
Given N input graphs Gi = (Vi, Ei) the Asymmetric Clustering Model (ACM)
for graphs finds the K graph prototypes Gα = (Vα, Eα) and the class-membership
variables I_{iα} ∈ {0, 1} maximizing the following cost function

L(\bar{G}, I) = −\sum_{i=1}^{N} \sum_{α=1}^{K} I_{iα} (1 − F_{iα}),   (11)
where 1 − F_{iα} is a symmetric and normalized dissimilarity measure between
the individual graph G_i and the prototype \bar{G}_α, with the similarity
F_{iα} defined by

F_{iα} = \frac{\max_M F(G_i, \bar{G}_α; M)}{\max[F_{ii}, F_{αα}]},   (12)

where F_{ii} = F(G_i, G_i; I_{|V_i|}) and F_{αα} = F(\bar{G}_α, \bar{G}_α; I_{|\bar{V}_α|}),
being I_{|V_i|} and I_{|\bar{V}_α|} the identity matrices defining self-matchings.
This results in F_{iα} = F_{αi} ∈ [0, 1].
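The normalization of Eq. (12) is straightforward to reproduce for any graph-matching cost. A minimal sketch with numpy, using the plain quadratic cost of Eq. (1) in place of the kernelized one (function names are ours):

```python
import numpy as np

def quadratic_cost(X, Y, M):
    """Eq. (1) with C_aibj = X_ab * Y_ij."""
    return 0.5 * np.einsum('ai,ab,ij,bj->', M, X, Y, M)

def normalized_similarity(X, Y, M_opt):
    """Eq. (12): the optimal matching cost divided by the larger of the
    two self-matching costs, giving a symmetric value in [0, 1]."""
    f_xy = quadratic_cost(X, Y, M_opt)
    f_xx = quadratic_cost(X, X, np.eye(len(X)))
    f_yy = quadratic_cost(Y, Y, np.eye(len(Y)))
    return f_xy / max(f_xx, f_yy)

K3 = np.ones((3, 3)) - np.eye(3)                           # triangle
P3 = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])  # path
print(normalized_similarity(K3, K3, np.eye(3)))            # 1.0 (isomorphic)
```

Self-matching a graph yields the maximal cost, so identical graphs score exactly 1 and the score decays as edges are missed by the best match.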
Prototypical graphs are built on all individual graphs assigned to each class,
but such an assignment depends on the membership variables. Here we adapt
to the domain of graphs the EM approach proposed by Hofmann and Puzicha
[22][23]. As in such approach the class-memberships are hidden or unobserved
variables, we start by providing good initial estimations of both the prototypes
and the memberships, feeding with them an iterative process in which we
alternate the estimation of expected memberships with the re-estimation of
the prototypes.
3.2.1 Initialization.
Initial prototypes are selected by a greedy procedure: the first prototype is a
randomly selected graph, and each of the following ones is the graph most
dissimilar from any of the prototypes selected so far. Initial memberships
I^0_{iα} are then given by:

I^0_{iα} = \begin{cases} 1 & \text{if } α = \arg\min_β [1 − F_{iβ}] \\ 0 & \text{otherwise.} \end{cases}
3.2.2 E-step.
This step consists of estimating the expected membership variables
I_{iα} ∈ [0, 1] given the current estimate of each prototypical graph \bar{G}_α:

I^{t+1}_{iα} = \frac{ρ^t_α \exp[−(1 − F_{iα})/T]}{\sum_{β=1}^{K} ρ^t_β \exp[−(1 − F_{iβ})/T]},  \text{ being }  ρ^t_α = \frac{1}{N} \sum_{i=1}^{N} I^t_{iα},   (13)

that is, these variables encode the probability of assigning each graph G_i to
class c_α at iteration t, and T is the temperature, a control parameter which
is reduced at each iteration (we use the deterministic-annealing version of the
E-step because it is less prone to local maxima than the un-annealed one).
As this step requires computing F_{iα} for each graph and prototype, we need to
solve N × K NP-complete problems.
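The annealed E-step itself is a softmax over class dissimilarities. A minimal vectorized sketch of Eq. (13) in numpy (array names are ours; the exponent is written with a minus sign so that a low dissimilarity 1 − F_{iα} yields a high membership, consistent with the initialization by arg min_β [1 − F_{iβ}]):

```python
import numpy as np

def e_step(F, I_prev, T):
    """Annealed E-step of Eq. (13).  F is the (N, K) matrix of
    similarities F_i_alpha, I_prev the previous (N, K) memberships.
    Returns the expected memberships at temperature T."""
    rho = I_prev.mean(axis=0)                  # class priors rho_alpha
    W = rho * np.exp(-(1.0 - F) / T)           # Gibbs weights on dissimilarity
    return W / W.sum(axis=1, keepdims=True)    # normalize over classes

# Two graphs, two classes: graph 0 is close to class 0, graph 1 to class 1.
F = np.array([[0.9, 0.1], [0.2, 0.8]])
I = e_step(F, np.full((2, 2), 0.5), T=0.1)
# At low temperature the memberships are nearly hard assignments.
```

Lowering T sharpens the softmax, so the memberships interpolate between the uniform initialization and the hard assignments of the initialization rule.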
3.2.3 M-step.
Given the expected membership variables I^{t+1}_{iα}, the prototypical graphs
are re-estimated as follows:

\bar{G}^{t+1}_α = \bigoplus_{i=1}^{N} (π_{iα} ⊙ G_i),  \text{ where }  π_{iα} = \frac{I^{t+1}_{iα}}{\sum_{k=1}^{N} I^{t+1}_{kα}},   (14)
where variables πiα are the current probabilities of belonging to each class cα.
This step is completed after re-estimating the K prototypes, that is, after solv-
ing (N − 1) × K NP-complete problems. Moreover, the un-weighted median
graph Gt+1α resulting from the fusion will be used in the E-step to re-estimate
the dissimilarities Fiα through maximizing F (Gi, Gα; M) with kernelized Sof-
tassign.
3.2.4 Adaptation.
Assuming that the iterative process is divided into epochs, our adaptation
mechanism consists of starting with a high number of classes K_max and then
reducing that number, if appropriate, at the end of each epoch. At that moment
we select the two closest prototypes \bar{G}_α and \bar{G}_β as candidates to
be fused, and we compute h_α, the heterogeneity of c_α:

h_α = \sum_{i=1}^{N} (1 − F_{iα}) π_{iα},   (15)

obtaining h_β in a similar way. Then, we compute the fused prototype \bar{G}_γ
by applying Equation 14 and considering that I_{iγ} = I_{iα} + I_{iβ}, that is,

\bar{G}_γ = \bigoplus_{i=1}^{N} (π_{iγ} ⊙ G_i).   (16)

Finally, we fuse c_α and c_β whenever h_γ < (h_α + h_β)µ, where µ ∈ [0, 1] is a
merge factor addressed to facilitate class fusion (usually we set µ = 0.6). After
such a decision a new epoch begins. We wait until convergence before trying to
fuse two other classes.
Testing whether two classes must be fused requires solving N − 1 NP-complete
problems, but if we decide to fuse, the number of NP-complete problems to solve
in each iteration of the next epoch is reduced by 2N − 1 (a reduction of N for
each E-step and of N − 1 for each M-step).
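The class-fusion test reduces to comparing heterogeneities. A minimal sketch of Eq. (15) and the merge criterion in plain Python (function names are ours):

```python
def heterogeneity(F_col, pi_col):
    """Eq. (15): h_alpha = sum_i (1 - F_i_alpha) * pi_i_alpha."""
    return sum((1.0 - f) * p for f, p in zip(F_col, pi_col))

def should_merge(h_alpha, h_beta, h_gamma, mu=0.6):
    """Fuse c_alpha and c_beta when the fused class is not much more
    heterogeneous than the two separate ones (merge factor mu)."""
    return h_gamma < (h_alpha + h_beta) * mu

h = heterogeneity([0.8, 0.6], [0.5, 0.5])   # 0.2*0.5 + 0.4*0.5 = 0.3
print(should_merge(0.3, 0.3, h_gamma=0.2))  # True: 0.2 < 0.36
print(should_merge(0.3, 0.3, h_gamma=0.5))  # False: classes stay separate
```

With µ = 0.6 a merge is accepted only when the fused prototype absorbs both classes without much loss of within-class similarity.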
4 Experimental Results
The purpose of this paper is to test the reliability and efficiency of the ker-
nelized Softassign and the EM clustering described above for comparing and
clustering protein surfaces. Surface graphs are obtained as follows. Given the
triangulated Connolly surface of each protein, and considering the classification
of each surface point as belonging to a concave, convex or toroidal patch, we
cluster the points into patches and associate each cluster with a vertex of the
surface graph. Retaining only concave and convex patches/vertices, we obtain
the edges of the surface graph through a 3D triangulation of the centroids of
all patches.
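The construction above can be sketched as follows. This is a toy illustration with scipy, not the authors' exact pipeline: `centroids` stands for the 3D patch centroids, `labels` for their concave/convex tags, and the edges are taken from the tetrahedra of a 3D Delaunay triangulation of the centroids:

```python
import numpy as np
from scipy.spatial import Delaunay

def surface_graph(centroids, labels):
    """Vertices = patch centroids (with curvature labels); edges = all
    tetrahedron edges of the 3D Delaunay triangulation of the centroids."""
    tri = Delaunay(centroids)
    edges = set()
    for simplex in tri.simplices:            # each simplex is a tetrahedron
        for u in range(len(simplex)):
            for v in range(u + 1, len(simplex)):
                i, j = int(simplex[u]), int(simplex[v])
                edges.add((min(i, j), max(i, j)))
    return sorted(edges), list(labels)

# Five patch centroids: a tetrahedron plus an interior point.
pts = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.],
                [0., 0., 1.], [0.25, 0.25, 0.25]])
edges, labels = surface_graph(pts, ['concave', 'convex', 'convex',
                                    'concave', 'convex'])
```

The curvature labels attached to the vertices are exactly the application-driven attributes used by the δ term of the kernelized cost.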
4.1 Comparing Active Sites
A simple initial experiment consists of testing kernelized Softassign in matching
substructures, that is, active sites. In order to do so we have chosen the
experimental set presented by Pickering et al. in Ref. [6], where they propose to
combine the Bron-Kerbosch algorithm, for solving the maximum common
subgraph via clique enumeration, with superimposing the matched surfaces
and minimizing the alignment error. In order to test their approach they
selected several proteins from the Protein Data Bank (PDB) [27] and then
extracted their active sites from the PDB files. The proteins considered were:
1a7k, 1agn, 1axg, 1hdx, 1hdy, 1hdz, 3hud, 1deh, 1hsz, 1dlt, and 6adh. For each
protein, the sites are numbered, for instance, as 1a7k-0, 1a7k-1, and so on.
In Fig. 3 we show sites corresponding to four proteins, and in Fig. 4 we give an
example of a good matching, using kernelized Softassign, between two sites in
the same protein, and an example of bad matching between sites of different
proteins.
To analyze the matching performance for sites of the same protein, Pickering
et al. registered the number of features matched. In Fig. 5 we compare the
fraction of common features obtained by them with our normalized cost (the
optimal matching cost between two graphs divided by the minimum of the
self-matching costs of the compared graphs). In Fig. 5(a)-(b) we show the
all-for-all comparisons for the four sites of 1a7k. In Fig. 5(c) we compare site
1agn-1 with the rest of the sites in the same molecule, and the same holds for
1hdx-1 in Fig. 5(d). From these comparisons it follows that our kernelized
graph-matching algorithm is consistent with Bron-Kerbosch combined with
alignment. However, such consistency decays when comparing sites of distantly
related proteins (see Fig. 6). In this latter case, the higher similarity indexes
given by our method are due to the fact that only graphs are compared and no
alignment is performed. Consequently, as the 3D alignment constraints are not
applied, graphs are allowed to deform to increase compatibility. This effect is
not so pronounced when comparing sites within the same protein.
4.2 Comparing PDB Proteins
Before considering several protein families to test whether the approach is
useful or not for matching complete proteins, we have measured the decay
of our normalized cost as we introduce more and more structural noise. In
Fig. 7 we show the results obtained for matching the protein 1crn with itself
in the ideal case (isomorphism) and after artificially removing some atoms.
We observe a progressive decay of the normalized cost.
In order to build a representative experimental set for comparing groups of
proteins, we have considered proteins of five families extracted from the Pro-
tein Data Bank: Crambin/Plant proteins, Ureases, Seryl, Hemoglobins, and
Hydrolases (see Fig. 8). The number of vertices in the surface graphs ranges from
79 to 442, that is, we have heterogeneous families. We want to demonstrate
that the normalized cost is a useful measure for predicting whether two
proteins belong to the same family or not. Such a normalized cost is bounded
by 0 and 1, and the higher the cost, the more similar the proteins should be.
In Fig. 9 (top row) we show some representative matching results. Comparing
proteins of the same family results in globally smooth matchings and normal-
ized costs above 0.5 (1jxt-1jxu in Crambin). However, the cost falls below
0.1 for many proteins of different families (comparing an urease and a seryl,
1a5k-1set, we obtain a cost of 0.0701) because in these cases it is not possi-
ble to find globally consistent matches (only pairs of common subgraphs are
matched).
4.3 Clustering Proteins
Another important question to test is whether the kernelized version
significantly improves the performance of the classical version of Softassign. In our pre-
liminary experiments for the non-attributed case, where the effect of struc-
tural ambiguity is higher, we have found that kernelization yields, in compar-
ison with the classical Softassign, a slower performance decay with increasing
structural corruption. However, this does not happen in our surface compari-
son experiments because the concavity/convexity attributes constrain so much
the number of allowable correspondences that there are few pure structural
ambiguities. However, although using kernelized or classical Softassign has a
negligible impact on the normalized cost obtained, we obtain noise-free
representative graph prototypes for different families. These prototypical
graphs are interesting because the ACM algorithm for graphs, described above,
yields a useful abstraction for subsequent protein queries. First of all, we test whether
the normalized cost yields higher similarities between proteins in the same
family than between proteins of different families. In Fig. 10 (left) we show the
all-for-all similarities between proteins of different families (8 Hemoglobins, 5
Crambins, 5 Ureases, 4 Seryls, and 3 Hydrolases), which yields a clear block
structure along the diagonal (white/grey levels indicate high similarities). In Fig. 10 (right)
we show the similarities between each protein and the prototype of all classes.
These similarities indicate that the prototypical graphs obtained through both
kernelized graph-matching and incremental fusion are very representative.
Finally, we compare the similarities due to surface comparison to the similar-
ities due to 3D structure comparison (such information comes from the back-
bone of the protein but not from its surface). One of the most effective methods
for 3D fold comparison is Dali [28] 1 . In Fig. 11 we show the 9 proteins closest,
in terms of 3D structure, to 1b8qa. When analyzing these proteins in terms
of their surfaces we obtain three sparse classes: {1pdr, 1qava, 1kwaa, 1be9a},
{1m5za, 3pdza, 1qlca, 1b8qa}, and {1i16, 1nf3d}.
1 http://www.bioinfo.biocenter.helsinki.fi:8080/dali/index.html
5 Conclusions and Future Work
In this paper we have presented a graph-matching method, relying on discrete
kernels on graphs, to solve the problem of protein surface comparison, and
an EM graph-clustering algorithm, relying also on the latter graph-matching
algorithm, to learn classes of protein surfaces. We extract the surface graphs
describing the topology of the surfaces and perform efficient and reliable graph
matching by exploiting both application-driven and structural attributes. In
this work we do not use metric attributes, due to the high computational
demand (space requirements and temporal cost) associated with introducing
edge-based attributes, and our experimental set and the derived conclusions
focus on the effectiveness of purely graph-based matching and clustering
methods in this context.
In the latter regard, our experimental results related to substructure compar-
isons are consistent with the combination of graph-matching and 3D alignment
provided that we compare active sites of the same protein. However, when we
compare sites of distantly related proteins we find that our score is higher
than that of the method combining graph-matching and alignment, and this
is due to the fact that 3D alignment constraints are not applied and graphs
are allowed to deform in order to maximize the normalized cost.
When comparing complete proteins coming from five different families our
experiments show that the normalized cost derived from the kernelized Sof-
tassign algorithm is a useful similarity measure to cluster proteins and also
to build structural prototypes for each family when the EM graph-clustering
algorithm is applied to these data. These prototypes may be very useful for
simplifying subsequent protein queries. We believe that this is an interest-
ing starting point for finding clusters of protein surfaces in an unsupervised
manner.
Our main conclusion is that the graph-based approach to compare and cluster
protein surfaces is promising but some extensions related to the subsequent
alignment of proteins should be introduced. This is why our future work in
this field includes, besides the improvement of both kernelized Softassign and
EM-based clustering algorithms, the integration of the graph-matching and
graph-clustering algorithms with 3D alignment ones.
Acknowledgments
This work was partially supported by the research grant PR2004 − 0260 of
the Spanish Government.
References
[1] Halperin, I., Ma, B., Wolfson, H. & Nussinov, R., Principles of docking: An
overview of search algorithms and a guide to scoring functions. Proteins 47,
pp. 409-443 (2002).
[2] Connolly, M., Analytical molecular surface calculation. J Appl Crys 16 pp.
548-558 (1983).
[3] Wolfson, H.J. & Lamdan, Y., Geometric hashing: A general and efficient model-
based recognition scheme. In Proc of the IEEE Int Conf on Computer Vision
238-249 (1988).
[4] Nussinov, R., & Wolfson, H.J., Efficient detection of three-dimensional motifs
in biological macromolecules by computer vision techniques. In Proc of the Natl
Acad Sci USA 88 pp. 10495-10499 (1991).
[5] Gardiner, E.J., Willett, P., & Artymiuk, P.J., Graph-theoretic techniques for
macromolecular docking. J Chem Inform Comput Sci 40 273-279 (2000).
[6] Pickering, S.J., Bulpitt, A.J., Efford, N., Gold, N.D., & Westhead, D.R., AI-
based algorithms for protein surface comparisons. Computers and Chemistry
26 79-84 (2001).
[7] Gold, S., & Rangarajan, A., A graduated assignment algorithm for graph
matching. IEEE Trans on Patt Anal and Mach Int 18 (4) 377-388 (1996).
[8] Finch, A.M., Wilson, R.C., & Hancock, E., An energy function and continuous
edit process for graph matching. Neural Computation 10 (7) 1873-1894 (1998).
[9] Pelillo, M., Replicator equations, maximal cliques, and graph isomorphism.
Neural Computation 11 1933-1955 (1999).
[10] Luo, B., & Hancock, E.R., Structural graph matching using the EM algorithm
and singular value decomposition. IEEE Trans on Patt Anal and Mach Int 23
(10) 1120-1136 (2001).
[11] Lozano, M.A., & Escolano, F., A significant improvement of softassign with
diffusion kernels. In Proc S+SPR’04 LNCS 3138 76-84 (2004).
[12] Kondor, R., & Lafferty, J., Diffusion kernels on graphs and other discrete input
spaces. In Proc of the Intl Conf on Mach Learn. Morgan-Kauffman 315-322
(2002).
[13] Smola, A., & Kondor, R.I., Kernels and Regularization on Graphs. In Proc. of
Intl Conf on Comp Learn Theo and 7th Kernel Workshop. LNCS 2777 144-158
(2003).
[14] Chung, F.R.K., Spectral graph theory. Conference Board of the Mathematical
Sciences (CBMS) 92. American Mathematical Society (1997).
[15] Jiang, X., Munger, A., & Bunke, H., On Median Graphs: Properties, Algorithms,
and Applications. IEEE Trans on Patt Anal and Mach Int 23 (10) 1144-1151
(2001).
[16] Luo, B., Wilson, R.C., & Hancock, E.R., A Spectral Approach to Learning
Structural Variations in Graphs. In Proc of ICVS'03. LNCS 2626 407-417
(2003).
[17] Serratosa, F., Alquezar, R., & Sanfeliu, A., Function-described graphs for
modelling objects represented by sets of attributed graphs. Pattern Recognition
23 (3) 781-798 (2003).
[18] Sanfeliu, A., Serratosa, F., & Alquezar, R., Clustering of Attributed Graphs and
Unsupervised Synthesis of Function-Described Graphs. In Proc of ICPR 2000,
15th International Conference on Pattern Recognition, Barcelona, Spain, Vol. 2,
1026-1029 (2000).
[19] Holm, L., & Sander, C., Mapping the protein universe. Science 273 (5275)
595-603 (1996).
[20] Heger, A., & Holm, L., Exhaustive enumeration of protein domain families. J Mol
Biol 328 (3) 749-767 (2003).
[21] Lozano, M.A. & Escolano, F., EM algorithm for clustering an ensemble of
graphs with comb matching. In Proc. of EMMCVPR’03. LNCS 2683 52-67
(2003).
[22] Hofmann, T., & Puzicha, J., Statistical Models for Co-occurrence Data. MIT
AI-Memo 1625, Cambridge, MA (1998).
[23] Puzicha, J., Histogram Clustering for Unsupervised Segmentation and Image
Retrieval. Pattern Recognition Letters 20 899-909 (1999).
[24] Li, S.Z., Toward Global Solution to MAP Image Estimation: Using Common
Structure of Local Solutions. In Proc. of EMMCVPR’97. LNCS 1223 361-374
(1997)
[25] Lozano, M.A., & Escolano, F., Structural recognition with kernelized softassign.
In Proc of IBERAMIA’04. LNCS 3315 626–635 (2004).
[26] Bron, C., & Kerbosch, J., Finding all cliques of an undirected graph.
Communications of the ACM 16 575-577 (1973).
[27] Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig,
H., Shindyalov, I.N., & Bourne, P.E., The Protein Data Bank. Nucleic Acids
Research 28 pp. 235-242 (2000).
[28] Holm, L., Sander, C., Protein structure comparison by alignment of distance
matrices. J Mol Biol 233(1): 123-38 (1993).
[29] Gunter, S., Bunke, H., Self-organizing map for clustering in the graph domain.
In Pattern Recognition Letters, 23 pp. 401-417 (2002).
[30] Gunter, S., Bunke, H., Validation indices for graph clustering. In Pattern
Recognition Letters, 24 (8) pp. 1107-1113 (2003).
[31] Wong, A., Constant, J., You, M., Random Graphs. In Syntactic and Structural
Pattern Recognition: Theory and Applications 4 197-234 (1990).
[32] Bunke, H., Foggia, P., Guidobaldi, C., Vento, M., Graph Clustering Using the
Weighted Minimum Common Supergraph. In Proc. of GbR’03. LNCS 2726
235-246 (2003).
Vitae
Miguel Angel Lozano received the B.S. degree in computer science from the
University of Alicante, Spain, in 2001. Since 2004 he has been a teaching
assistant with the Department of Computer Science and Artificial Intelligence
of the University of Alicante. He has visited Edwin Hancock's Computer Vision
& Pattern Recognition Lab at the University of York and, more recently, Liisa
Holm's Bioinformatics Lab at the University of Helsinki. He is a member of
the Robot Vision Group of the University of Alicante, and his research interests
include computer vision and robotics.
Francisco Escolano received the B.S. degree in computer science from the
Polytechnical University of Valencia, Spain, in 1992, and the Ph.D. degree in
computer science from the University of Alicante in 1997. Since 1998 he has
been an associate professor with the Department of Computer Science and
Artificial Intelligence of the University of Alicante. He has been a postdoctoral
fellow with Dr. Norberto M. Grzywacz at the Biomedical Engineering
Department of the University of Southern California, Los Angeles, CA, and he
has also collaborated with Dr. Alan Yuille at the Smith-Kettlewell Eye Research
Institute, San Francisco, CA. Recently he has visited Liisa Holm's
Bioinformatics Lab at the University of Helsinki. He is the head of the Robot
Vision Group of the University of Alicante, whose research interests are
focused on the development of efficient and reliable computer-vision
algorithms for biomedical applications, active vision and robotics, and
video analysis.
Summary. In this paper we address the problem of comparing and classifying
protein surfaces with graph-based methods. Given the so-called Connolly
surface, that is, the part of the van der Waals surface of the protein accessible
to, or excluded by, a probe of given radius, we describe it by considering the
centroids of all concave and convex patches and then forming a 3D Delaunay
triangulation. Each node of the so-called surface graph resulting from the
triangulation therefore corresponds to a patch and is labelled as concave or
convex, while the edges come from the metric relations between the centroids
of neighbouring patches. Distances between centroids (edge attributes) are not
considered in this work, in order to reduce the spatial and computational cost
of the representation.
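The surface-graph construction described above can be sketched as follows. This is a minimal illustration, assuming the concave/convex patches and their centroids have already been extracted from the Connolly surface; the function name and data layout are ours, not the authors' implementation.

```python
import numpy as np
from scipy.spatial import Delaunay

def build_surface_graph(centroids, labels):
    """centroids: (n, 3) array of patch centroids; labels: 'concave'/'convex'
    per patch. Returns the node labels and the undirected edge set induced by
    a 3D Delaunay triangulation of the centroids."""
    tri = Delaunay(centroids)
    edges = set()
    for simplex in tri.simplices:        # each simplex is a tetrahedron (4 vertices)
        for i in range(4):
            for j in range(i + 1, 4):
                a, b = simplex[i], simplex[j]
                edges.add((min(a, b), max(a, b)))
    return labels, sorted(edges)

# Toy example: five patch centroids in 3D
pts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]], float)
labs = ['convex', 'concave', 'convex', 'concave', 'convex']
nodes, edges = build_surface_graph(pts, labs)
```

Dropping the centroid distances, as in the paper, leaves a purely topological representation: only the adjacency and the concave/convex node labels are kept.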
Given two surface graphs, protein comparison relies on matching them through
a kernelized version of the Softassign graph-matching algorithm. Softassign
optimizes an energy function which is quadratic with respect to the assignment
variables. In our preliminary experiments with random graphs, we have
found that the performance decay of the algorithm under increasing levels
of noise (node deletion) may be attenuated by properly weighting the original
quadratic function. Such a weighting relies on distributional information
coming from kernel computations on graphs. Kernels on graphs yield structural
features, associated with the vertices, which help to remove matching
ambiguities. When applied to match surface graphs at the substructure level
(active sites), we obtain results that are very consistent with recent
experiments combining topology and metric information, provided that we match
sites within the same protein. When matching complete structures, we obtain a
similarity measure which is useful for differentiating between proteins of the
same family and proteins of different families. However, in this latter case
(different families) the decay of the score (normalized cost) is significant,
although attenuated by kernelization, when the sizes of the compared proteins
differ.
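The kernel computations underlying the weighting can be illustrated with the diffusion kernel of Kondor and Lafferty [12], K = exp(-βL), where L = D - A is the graph Laplacian. The following is a minimal sketch; the function name and the choice of β are illustrative, not prescribed by the paper.

```python
import numpy as np
from scipy.linalg import expm

def diffusion_kernel(adj, beta=1.0):
    """Diffusion kernel K = exp(-beta * L) with L = D - A the graph Laplacian.
    Entry K[i, j] measures how much 'heat' diffuses from node i to node j."""
    adj = np.asarray(adj, float)
    lap = np.diag(adj.sum(axis=1)) - adj
    return expm(-beta * lap)

# Toy graph: a path on three vertices, 0 - 1 - 2
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], float)
K = diffusion_kernel(A, beta=0.5)
```

The resulting kernel matrix is symmetric and positive definite, and for a connected graph every entry is strictly positive, which is what makes it usable as a per-vertex structural signature in the weighting.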
On the other hand, classification is performed by clustering the surface graphs
with an EM-like algorithm, also relying on kernelized Softassign, and then
calculating the distance of an input surface graph to the closest prototype. Our
clustering algorithm is able to group structures by iteratively discovering
the prototypical graph of each class and adjusting the optimal number
of classes. Extracting the temporary prototypical graph of a given class
(M-step) is addressed by an incremental algorithm that fuses all graphs,
weighted by their current probabilities of belonging to the class. The E-step
consists of updating the probabilities of belonging to each class on the basis
of the new prototypes. Starting from a high number of potential classes, fusion
is performed whenever it yields a reduction of heterogeneity. When clustering
complete proteins, the results are coherent with the a priori existing families.
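The E-step/M-step alternation can be sketched schematically. In this toy analogue the prototypes are probability-weighted means of vectors rather than fused graphs, so only the structure of the loop is illustrated; the kernelized matching, graph fusion, and class-fusion steps are omitted, and all names and parameters are ours.

```python
import numpy as np

def em_like(data, init_idx, n_iter=20, temp=1.0):
    """Schematic EM-like loop: alternate soft memberships (E-step) with
    probability-weighted prototype updates (M-step). data: (n, d) array;
    init_idx: indices of the data points used as initial prototypes."""
    protos = data[np.array(init_idx)].copy()
    for _ in range(n_iter):
        # E-step: membership probabilities from squared distances to prototypes
        d2 = ((data[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
        p = np.exp(-d2 / temp)
        p /= p.sum(axis=1, keepdims=True)
        # M-step: each prototype becomes the probability-weighted "fusion"
        protos = (p.T @ data) / p.sum(axis=0)[:, None]
    return protos, p

# Two well-separated toy "families" of five items each
X = np.vstack([np.zeros((5, 2)), np.full((5, 2), 5.0)])
protos, p = em_like(X, init_idx=(0, 9))
```

In the actual algorithm the M-step fuses graphs incrementally under the current membership probabilities, and classes are merged when doing so reduces heterogeneity; here only the alternation pattern survives.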
Our main conclusion is that applying kernelized matching and clustering to
classify proteins yields a useful topological measure in many cases, but it
should be complemented with metric information through complementary
alignment processes; this is a task to address in the future.
(a)
(b)
(c)
Figure 1. Kernelized Softassign. (a) Kernel information for node 1 of graph Y and for
all nodes in graph X. (b) Matching with kernelized Softassign. (c) Kernel matrices.
Figure 2. Fusion example (step-by-step). In the upper row we show the graphs
to fuse (vertices in dark grey are convex and those in light grey are concave).
In the lower row we show the state of the prototype as each of these graphs is
incorporated.
(a) (b)
(c) (d)
Figure 3. Examples of active sites from four proteins: (a) 1a7k − 0 (b) 1agn− 1 (c)
1axg − 0 (d) 1hdy − 1.
(a) (b)
Figure 4. Matching of active sites. (a) Correct matching between sites 1a7k − 1 and
1a7k − 3. (b) Incorrect matching between sites 1a7k − 3 and 1axg33.
(a) (b)
(c) (d)
(e) (f)
Figure 5. Comparing our method with the method of Pickering et al. The y-axis
represents the obtained score. In each case the sites listed are compared with: (a)
1a7k − 0 (b) 1a7k − 1 (c) 1a7k − 2 (d) 1agn − 1 (e) 1axg − 0 (f) 1hdx − 1
(a) (b)
Figure 6. More comparisons between our method and the method of Pickering et
al. Comparing sites with (a) 1axg − 0, and (b) 1gd10.
(a) (b) (c)
(d) (e) (f)
Figure 7. Matching results for 1crn (328 atoms) with artificial noise. (a) Isomor-
phism (cost = 1), (b) removing 10 atoms (cost = 0.9561), (c) removing 20 atoms
(cost = 0.9203), (d) removing 30 atoms (cost = 0.9027), (e) removing 100 atoms
(cost = 0.5528), (f) removing 150 atoms (cost = 0.547).
[Figure panels, by family: Hemoglobins 1a00, 1a0u, 1a0z; Crambin 1jxt, 1jxw,
1jxy; Ureases 1a5k, 1a5m, 1a5o; Seryl 1ser, 1ses, 1sry; Hydrolases 1tca, 1tcb,
1tcc; bottom row: graphs for 1a00, 1jxt, 1a5k, 1ser, 1tca.]
Figure 8. Examples of the five families used in the experiments. From left to right
and top to bottom: Hemoglobins, Crambin (plant proteins), Ureases, Seryl,
Hydrolases. In the surfaces, convex patches are shown in dark grey whereas
concave patches are shown in light grey. In the bottom row we show the graphs
for the surfaces shown in the first row.
(a) (b)
Figure 9. Some matching results. (a) Correct matching between two proteins of
the same family (1jxt − 1jxu, with cost 0.7515). (b) Incorrect matching between
proteins of different families (1a5k − 1set, with cost 0.0701).
(a) (b)
Figure 10. Matching results for several families: Hemoglobins (1a00, 1a01, 1a0u,
1a0v, 1a0w, 1a0x, 1a0z, 1gzx), Crambin (1jxt, 1jxu, 1jxw, 1jxx, 1jxy), Ureases
(1a5k, 1a5l, 1a5m, 1a5n, 1a5o), Seryl (1ser, 1ses, 1set, 1sry) and Hydrolases (1tca,
1tcb, 1tcc). (a) All-against-all comparisons. (b) Comparisons with all the prototypes.
(1) 1pdr (2) 1qava (3) 1kwaa (4) 1be9a (5) 1m5za
(6) 3pdza (7) 1qlca (8) 1b8qa (9) 1i16 (10) 1nf3d
(a) (b)
Figure 11. Comparison with 3D structure. Top: 1b8qa and the nine closest proteins
according to Dali scores, ordered by surface compatibility. Bottom left: pairwise
normalized distances. Bottom right: RMS error given by Dali.