Propagating semantic information in biochemical network models

12
Propagating semantic information in biochemical network models Schulz et al. Schulz et al. BMC Bioinformatics 2012, 13:18 http://www.biomedcentral.com/1471-2105/13/18 (30 January 2012)

Transcript of Propagating semantic information in biochemical network models

Propagating semantic information in biochemicalnetwork modelsSchulz et al.

Schulz et al. BMC Bioinformatics 2012, 13:18http://www.biomedcentral.com/1471-2105/13/18 (30 January 2012)

METHODOLOGY ARTICLE Open Access

Propagating semantic information in biochemicalnetwork modelsMarvin Schulz1*, Edda Klipp1 and Wolfram Liebermeister2

Abstract

Background: To enable automatic searches, alignments, and model combination, the elements of systems biologymodels need to be compared and matched across models. Elements can be identified by machine-readablebiological annotations, but assigning such annotations and matching non-annotated elements is tedious work andcalls for automation.

Results: A new method called “semantic propagation” allows the comparison of model elements based not onlyon their own annotations, but also on annotations of surrounding elements in the network. One may eitherpropagate feature vectors, describing the annotations of individual elements, or quantitative similarities betweenelements from different models. Based on semantic propagation, we align partially annotated models and findannotations for non-annotated model elements.

Conclusions: Semantic propagation and model alignment are included in the open-source library semanticSBML,available on sourceforge. Online services for model alignment and for annotation prediction can be used at http://www.semanticsbml.org.

BackgroundSystems biologists aim to understand the dynamic beha-viour of cellular pathways with the help of quantitativemodels. To construct large-scale models flexibly fromexisting parts, models need to be retrieved, compared,and combined. For this purpose, modellers need to findand match equivalent model elements, for instance, vari-ables describing identical metabolites. In the future,such alignments should be automated or, at least, sensi-ble suggestions should be provided by software.A simple way to match model elements is by compar-

ing their names or identifiers as they appear in the mod-el’s code. Greedy alignments and model combinationbased on element names have been used in [1-4], butthese approaches may fail if models stem from differentsources and therefore use different naming schemes. Asafer and more general approach is to compare modelelements by the biological objects or concepts theystand for. Semantic annotations, for instance the MIR-IAM-compliant annotations [5] used in the Systems

Biology Markup Language (SBML) [6], provide a quali-fied naming scheme by relating model elements toentries in public web resources or ontologies. Knowl-edge from these ontologies can be used to comparealike biological objects and to define quantitative simi-larity scores between them [7,8].Obviously, such comparisons will fail if annotations

are missing. However, there are also algorithms thatcompare biological networks by their structures and cantherefore handle missing annotations. Since the compar-ison of graph structures is computationally hard [9],these algorithms either yield approximative results [10]or are restricted to models having a simple structurelike a path [11,12] or a tree [13,14]. The initial compari-son of network nodes, which is later refined using struc-tural information, also varies from approach toapproach. While some of them use plain node labels,others compare nodes by chemical structures [10,15] orsemantic information like EC numbers [16] or GeneOntology terms [17]. With this information, they caneither refine the alignments [18] or speed up the com-putations by reducing the search space [19]. A recentreview on these and similar works can be found in [20].

* Correspondence: [email protected] für Biologie, Theoretische Biophysik, Humboldt-Universität zu Berlin,Invalidenstr. 42, 10115 Berlin, GermanyFull list of author information is available at the end of the article

Schulz et al. BMC Bioinformatics 2012, 13:18http://www.biomedcentral.com/1471-2105/13/18

© 2012 Schulz et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative CommonsAttribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction inany medium, provided the original work is properly cited.

In this article, we propose new heuristics for aligningnetwork models with missing annotations. The basicidea is simple: a reaction, for instance in an SBMLmodel, refers to its substrates and products. If two reac-tions are not annotated but their reactants are, we cantrace the reactants, evaluate their annotations, and usethis information for comparing the reactions. This logiccan be applied whenever model elements show cross-references. Instead of collecting and combining thisinformation step by step for each element, we developeda method to propagate all semantic information simulta-neously across the network. Starting from an originaldirect similarity score, which compares elements only bytheir own annotations, we obtain a new inferred similar-ity score that incorporates information obtained fromother elements.Two applications of semantic propagation are pre-

sented in this article. The first one is the improvedalignment of systems biology models. The second one isa method for predicting missing annotations in a modelby aligning it to fully annotated network models. Ourpresent implementation, which is a part of the toolsemanticSBML [21], works for SBML models and forspecific similarity measures [8]. However, semantic pro-pagation is a general approach that applies to a widerange of network models and similarity scores and thatcan be combined with different model matching algo-rithms, including the ones mentioned above.

ResultsAlgorithmWe have developed methods for aligning partially anno-tated biological networks and implemented them forSBML models with MIRIAM-compliant annotations. Amodel alignment is based on two kinds of knowledge:similarities between model elements, which are com-puted from semantic annotations, and referencesbetween elements within each model (called here the“network structure”).The alignment of two models is computed in two

steps. In a first step, semantic information is propagatedacross both network structures. This ensures that all ele-ments, even elements lacking annotations, obtainsemantic information, which makes them comparablebetween models. This claim of improved comparabilityis illustrated with an example of a propagation of “col-our information” shown in Figure 1. Using the networkstructure, we refine the direct similarity s and obtain aninferred similarity measure ψ, which can be nonzeroeven if one of the two compared elements does notcarry any annotations.In the second step, the actual model alignment, ele-

ments are matched between two or more modelsaccording to their inferred similarities. We use a greedy

heuristic that arranges similar elements into tuples, sup-posed to be matched: all elements in a tuple have thesame type (e.g. reaction), stem from different models,and each element is part of one tuple (possibly of size1). Elements within a tuple have to show a high similar-ity, while elements in different tuples are supposed to bedissimilar.Feature propagationThe semantic propagation can be carried out in twoways, as feature propagation (FP) or as similarity propa-gation (SP). For feature propagation, each element hasto be associated with a feature vector. The componentsof this vector (usually numbers between 0 and 1)describe how closely the element resembles certain bio-logical concepts. A model species describing a phos-phorylated MAP kinase, for example, could carryannotations referring to UniProt entry P28482 (MAPK1)and KEGG Compound entry C00562 (Phosphoprotein).The feature vector corresponding to this species wouldcontain mostly zeros except for the two entries specify-ing the relation to these web resource entries. Two ele-ments are similar if their feature vectors point to similardirections, i.e. if they are related to similar biologicalconcepts like a phosphorylated and a non-phosphory-lated MAP kinase. The size of the feature vectordepends on the number of biological concepts consid-ered, e.g. the entries of all web resources being referredto in BioModels Database [22].In feature propagation elements inherit semantic

information from their neighbour nodes, which mightcontribute to the definition of the elements’ identity.The inferred feature vectors do therefore not only char-acterise the model element itself, but also its surround-ing elements in the network. In the case of a reactionelement, they do not only describe the reaction, but alsoits reactants, enzyme, or regulators. The transition fromthe original to the inferred feature vectors is shown by

Figure 1 Propagation of colour features. Feature propagation ina small example network (circles: compounds; squares: reactions). A:Network nodes carry feature vectors v = (r, g, b)T shown by RGBcolours. Feature vectors for non-annotated elements are unknownand set to zero by definition (nodes shown in grey). B: After featurepropagation, all nodes have distinct feature vectors, i.e. colours.Matching nodes by their colours would now allow to self-align theentire graph unambiguously.

Schulz et al. BMC Bioinformatics 2012, 13:18http://www.biomedcentral.com/1471-2105/13/18

Page 2 of 11

an abstract example in Figure 1A. In this example thecolours red, green, and blue correspond to the featuresin a colour vector of size three, which encode forsemantic information. The propagation of this informa-tion (Figure 1B) leads to mixed colours in the center ofthe network, which give each node its own identity andallow for a unique self-alignment of the network. Themathematical details of feature propagation areexplained in the Methods section.Similarity propagationThe second approach, similarity propagation, is compu-tationally harder, but also applies to similarity measuresthat are not based on feature vectors. In contrast to FP,we do not consider single model elements, but pairs ofelements from different models to be compared, e.g. thereactions x and y from Figure 2A. We require that, iftwo such elements refer to other elements that share ahigh similarity, this will increase their own similarity(Figure 2B). Our method determines these similarities ina self-consistent manner and uses a propagation processsimilar to feature propagation. In contrast to featurepropagation the process is not carried out on the reac-tion network, but on an element pair graph. In thisgraph each node represents a pair of elements, one fromeach model. Two nodes are connected if the corre-sponding elements are connected in the reaction net-works. The mathematical details are explained in theMethods section and a mathematical relation betweenfeature and similarity propagation is discussed in addi-tional file 1.Annotation predictionModel alignments can be used to complete missingannotations in a model. To do so, one may align the

model to a large, annotated map of all physiologicalpathways and copy annotations from elements in themap to the model elements they have been aligned to.Even if not all of these annotations are correct, theymay still be presented to users as suggestions duringmanual model annotation.Our semanticSBML web page for model annotation

(http://semanticsbml.org) provides this functionality anduses BioModels Database as a replacement for the large,annotated pathway map. The idea behind the annotationprediction is to apply feature propagation to a sparselyannotated model and all BioModels. Afterwards, thesimilarities between non-annotated model elements andall database elements are calculated. Finally, the annota-tions from the most similar elements contained in thedatabase are presented to the user as suggestions fornew annotations. Details of this procedure are describedin the Methods section.

TestingSelf-alignment of a linear chainTo illustrate how semantic propagation can improvemodel alignment, let us consider a linear metabolicpathway in which only the first and the last species areannotated (see Figure 3). If we align two copies of thispathway based on direct similarities, only these two ele-ments can be matched. With the help of feature propa-gation or similarity propagation, all reactions andmetabolites are matched correctly.Alignment between MAP kinase pathwaysAs a real-world example, we aligned two well-annotatedsignal transduction models, model 9 [23] and model 11[24] from BioModels Database. These models containelements that represent proteins in different phosphory-lation states but carry identical annotations. An align-ment based on direct similarities would not be able todistinguish between these different states, but featurepropagation and similarity propagation both manage toimprove the results (see Figure 4).In practice, models are often less carefully annotated

than these example models and the quality of a modelalignment will depend on the fraction of annotated ele-ments. To study this in detail, we considered the sametwo models, removed some of the annotations in Bio-Model 9 randomly, aligned the models, and scored thequality of the alignments by precision and recall withrespect to a manually chosen alignment. This procedurewas repeated 25 times for each number of removedannotations. Figure 5 shows the results: without seman-tic propagation, both precision and recall increasealmost linearly with the number of annotations present.This is not surprising because the only possible reasonfor alignment errors are ambiguities caused by elementscarrying identical annotations. If we use feature or

Figure 2 Similarity propagation. A: Alignment of two modelsdescribing the same biochemical reaction (circles: compounds;squares: reactions). The reactants a and p have a known similaritysap (dashed line; accordingly for products b and q), while thesimilarity between the reactions x and y is initially unknown. Todetermine how well the reactions match (dotted line), we computethe inferred similarities ψsp. B: Propagation graph. Red arrows showhow information is propagated from species to reactions (apotential propagation back is shown in blue). The similaritybetween the reactions x and y is supported by two paths (x ¬ a↔ p ® y and x ¬ b ↔ q ® y). The respective terms a2sap anda2sbq yield the inferred similarityψ

spxy = α2(σap + σbq).

Schulz et al. BMC Bioinformatics 2012, 13:18http://www.biomedcentral.com/1471-2105/13/18

Page 3 of 11

similarity propagation, the quality increases morerapidly. As expected, the mean recall - i.e. the fractionof correct matches recovered - is improved because pre-viously incomparable pairs can now be matched. Theprecision - counting how many predicted matches areactually correct - depends on how many elements areannotated. If many annotations have to be guessed, bothinferred similarity scores yield a lower precision thanthe direct similarity. Nevertheless, with a higher numberof annotations present, both methods can surpass thesimple matching. SP outperforms FP for both precisionand recall, but its computational effort is also muchhigher. Similar tests with different models lead to similarresults (data not shown).Despite the indifference of the results with respect to

the models which are compared, the quality of theresults varied in between examples in which only speciesor reaction annotations had been removed. The more

detailed the annotations in a specific subset of modelelements was, the more dramatic the decline in match-ing quality has been after their removal. For a deeperanalysis of the matching quality after the removal ofeither species or reaction annotations from BioModel 9,the reader is referred to additional file 1.Automatic suggestion of element annotationsIn order to test the quality of our annotation predic-tion heuristic, we applied it to BioModel 61, a well-studied model of glycolysis [25]. For our first evalua-tion, we removed all annotations from the model andtried to re-predict them via the web interface. The testconfirms that the predictions are not perfect, but thetop 10 results usually contain relevant annotations. Ingeneral, the predictions become better as the fractionof correctly annotated model elements increases and aswe scan more of the top results for the correctannotation.

Figure 3 Alignments of computational models. A: Linear pathway; only the first and the last species are annotated. B: Self-alignment of thelinear pathway. Depicted are the pairwise element similarities according to the three different similarity measures. Each table entry contains sixnumerical values. The entries in the upper line denote the direct similarity s (blue), the similarity ψfp obtained by feature propagation (red), andthe similarity ψsp from similarity propagation (green). Values between 0 and 1 are shown by colour intensities. Similarities between differenttypes of elements (e.g. species and reactions) were set to 0, while ψsp values larger than 1 were set to 1. The lower three entries show theelement matching obtained from these similarities. Thick boxes indicate the correct matching.

Schulz et al. BMC Bioinformatics 2012, 13:18http://www.biomedcentral.com/1471-2105/13/18

Page 4 of 11

To study this systematically, we randomly removedannotations from some of the elements, predicted newannotations for them, and checked if the correct annota-tion appeared in the top results. As shown in Figure 6, asignificant number of annotations were predicted correctly

using our propagation schemes. The results suggest thatthe probability of predicting an annotation correctlyincreases linearly with the number of annotations presentin the model. Considering more of the topmost results didnot strongly increase the quality of the prediction. This

Figure 4 Alignment of MAP kinase cascades. Alignment of BioModels 9 [23] and 11 [24] as shown in Figure 4. The two alignments based oninferred similarities contain previously unmatched pairs (E2, RAFPH), (P_KKK, RAFp), and (KKPase, MEKPH) and some previously false matches areswapped. These incorrect matches resulted from the fact that elements represent proteins in different phosphorylation states but carry the sameannotations. The inferred similarity measures differ in two details: the matching of RAF and RAFpRAFPH, which is correctly predicted fromsimilarity propagation and the matching of Reaction19 and Reaction25, which is correctly obtained from feature propagation (data not shown).

Schulz et al. BMC Bioinformatics 2012, 13:18http://www.biomedcentral.com/1471-2105/13/18

Page 5 of 11

shows that correct annotations, if they can be predicted atall, will rank relatively high in the results.

ImplementationWe have implemented both propagation methods inPython. The source code of this implementation isincluded in our library semanticSBML, which is freelyavailable (GPL 3) from SourceForge (http://sourceforge.net/projects/semanticsbml/) and can be used to e.g.annotate and merge SBML models.Furthermore, we have included the semantic propaga-

tion methods into our web tool semanticSBML http://www.semanticsbml.org. First, this web tool allows usersto visually compare the results of the three methods tocompare model elements (direct similarity, feature pro-pagation, and similarity propagation). Second, for modelmerging, all three methods can be used to suggest aninitial matching of model elements. Third, for modelannotation, new element annotations based on propa-gated features from BioModels will be suggested afterthe “Predict annotations” button has been clicked.

DiscussionAnnotating the elements of systems biology models islaborious. Interactive software can facilitate this work by

keyword searches and by proposing annotations basedon element names. Our software semanticSBML [21,26]helps modellers to annotate and combine SBML modelsbased on MIRIAM-compliant annotations [5]. In thepresent work, we addressed an important open issue,the alignment of models with missing annotations.As the examples have shown, element matchings

based on our inferred similarity measures perform wellin practice. In our tests, semantic propagation increasedthe recall of correct element pairs, but at the price of alower precision if less than half of the elements wereannotated. Our two approaches, feature and similaritypropagation, differ slightly in their quality, but also intheir computational costs. While similarity propagationtends to yield better results, the computational effort ismuch higher than in feature propagation (O(|M|3 · |N|3)instead of O(|M|3 + |N|3), where |M| and |N| denotethe numbers of elements in the two models).We furthermore showed how new annotations for

model elements can be suggested based on the annota-tions already present in a model and an annotated path-way map. For our predictions, we have selected theBioModels Database as a resource for highly curatedannotation data. Analyses using other resources (see addi-tional file 1 for details) show no loss in prediction quality

Figure 5 Assessing the quality of model alignments. The quality of model alignments increases with the fraction of annotated modelelements. Alignments between BioModel 9 and BioModel 11 (see Figure 4) were compared to a manually chosen, correct alignment and scoredby their recall (left boxes) and precision (right boxes). When annotations are randomly removed, recall and precision (y-axis) decrease with thefraction of removed annotations (x-axis). Boxes in different rows show results from alignments with direct similarities, FP similarities, and SPsimilarities, as well as a comparison of their three mean values. The straight lines in the three topmost boxes show trends and stem from alinear regression.

Schulz et al. BMC Bioinformatics 2012, 13:18http://www.biomedcentral.com/1471-2105/13/18

Page 6 of 11

and we plan to extend our background pathway map byusing various other pathway resources in the future.The details on how information is transfered during

semantic propagation depends on the choice of propaga-tion weights. In our tests, changing their values withinreasonable ranges had little influence on the modelalignment. In any case, the scaling factor l (see Meth-ods) allowed to avoid an overly strong propagation,which might lead to spurious similarities. As soon as a

sufficient number of predefined model matchings will beavailable as a gold standard, the propagation weightsmay be further improved by machine learning.

ConclusionsCompared to existing approaches, our methods formodel alignment have three advantages. First, by proces-sing semantic annotations instead of ad-hoc node labels,we can compare models from different sources and

Figure 6 Evaluation of the annotation prediction in BioModel 61. Probability to find a correct annotation within in the first n predictedannotations (y-axis), for varying numbers of annotations present in the model (x-axis). Boxes show the cases n = 1, n = 5, and n = 10. Trends inthe data are shown by straight lines.

Schulz et al. BMC Bioinformatics 2012, 13:18http://www.biomedcentral.com/1471-2105/13/18

Page 7 of 11

compare elements with similar, yet different annotations.Second, considering the network structure allows us todistinguish between model elements carrying identicalannotations, e.g. proteins in different phosphorylationstates. Third, our similarity measures can be combinedwith various structure-based model comparison algo-rithms, e.g. [13]. Although our approach has only beenimplemented for SBML models, it can be extended toany computational models that include structural infor-mation and semantic annotations.

MethodsSemantic similarityThe biological meaning of a model element can bedescribed with the help of annotations, which point toentries in different web resources. Since different webresources overlap in their content (e.g. ChEBI [27] andKEGG Compound [28]), entries from differentresources can have the same biological meaning, i.e.describe the same biological concept, e.g. as ChEBIentry CHEBI:17925 and KEGG Compound entryC00267 both describing b-D-glucose. Given a list ofbiological concepts, we can describe the meaning of amodel element x, e.g. a species, a reaction, or a com-partment, by a feature vector vx. This vector containsnon-zero values vx,i only for those biological conceptsthat appear in the element’s annotations. If, for exam-ple, element x has been annotated using the aforemen-tioned ChEBI entry CHEBI:17925 or the KEGGCompound entry C00267 (and assuming that the jth

biological concept is describing b-D-glucose), the valueof vx,i will be larger than zero. Similarities s betweentwo model elements x and y can be computed by tak-ing the cosine of the angle between their feature vec-tors [29]

σ exy =

vTxvy√

vTx vx

√vT

y vy

,

setting σ exy = 0 if either vx = �0 or vy = �0. This similarity

measure assumes that different biological concepts arelogically independent and completely dissimilar: the basisvectors, each representing one concept, are orthogonaland therefore share similarities of 0. To allow for potentialsimilarities between them, e.g. to allow for a non-zerosimilarity between b-D-glucose and D-glucose, we maypredefine similarities Sij between the biological conceptsand replace the scalar product by a quadratic form basedon the matrix S. The new similarity measure reads

σxy =vT

xSvy√vT

xSvx

√vT

y Svy(1)

with sxy = 0 whenever one of the elements is notannotated [30]. For ways to derive such a matrix S, thereader is referred to [8].

Propagation of semantic informationTo compare non-annotated model elements by theirbiological meaning (e.g. the reactions in Figure 2A), wehave developed two alternative propagation schemes. Infeature propagation (FP) elements receive informationabout semantic annotations from their neighbours,while in similarity propagation element pairs receiveinformation about pairwise similarities between neigh-bouring elements. In both methods information is pro-pagated along the references between model elements(Figure 2B). The strength of information transfer isdetermined by a |M| × |M| propagation matrix rM

where |M| is the number of elements in model M. Thesparse structure of this matrix reflects the connectionsbetween model elements, and its positive, real-valuedentries ρM

xa determine the strength of the informationtransfer between the different types of elements. Forinstance, we may choose the values ρM

xa = α and ρMax = β

whenever a reaction x refers to a species a as its reac-tant or product, and ρM

xa = 0 if a does not appear inreaction x. The parameter a controls the informationtransfer from species to reactions, while the parameter bcontrols information transfer in the opposite direction(Figure 2B). To prevent semantic propagation betweenunrelated reactions, we stop the propagation at cofac-tors, which might be highly connected hubs in the net-work structure. Therefore, we set ρM

ax = ρMxa = 0 whenever

the annotations on a species a suggest that it is a cofac-tor (see additional file 1 for details).Feature propagation applies to all similarity scores

that are computed from feature vectors. For each directfeature vector vx, we define a new inferred feature vectorwx, which is supposed to resemble the inferred featurevectors of the neighbouring elements in the network(see Figure 1B). It is defined by the equation

wx = vx + λ∑

a∈MρM

xawa, (2)

with l being a scaling factor which is determinedbelow. If an element x does not receive any informationfrom other elements (i.e. ∀a∈M:ρ

Mxa = 0), its inferred and

its direct feature vector are identical, i.e. wx = vx. In allother cases, we add the inferred feature vectors of allreferenced elements, multiplied by their propagationweights rxa and l. To compute wx from Eq. (2), we con-sider each feature i separately. We collect the ith compo-nents of all feature vectors vx in a column vector v*,i andthe ith components of all vectors wx in a vector w*,i.With the |M| × |M| matrix R = (ρM

xa), each component

Schulz et al. BMC Bioinformatics 2012, 13:18http://www.biomedcentral.com/1471-2105/13/18

Page 8 of 11

of the vectors in Eq. (2) can be written as

w∗,i = v∗,i + λR w∗,i

= (I − λR)−1v∗,i(3)

After choosing a scaling factor l that is smaller thanthe inverse of the biggest absolute eigenvalue r of R(e.g.λ = 1

2r ), we can replace the matrix inverse in Eq. (3)by

w∗,i =∞∑k=0

(λR)kv∗,i. (4)

The series expansion shows that the inferred featurevectors can be obtained by summing up a number ofterms, describing how the direct features are propa-gated step by step through the network, e.g. from com-partments to molecular species, from reactants toreactions, and so on. If information is propagated onlyin one direction on an acyclic graph, these contribu-tions can be computed successively and the infiniteseries contains only a finite number of non-zero terms.Moreover, the series expansion shows that the inferredfeature vectors are non-negative as long as the directfeature vectors and the propagation weights are nonne-gative. By collecting the vectors v*,i and w*,i in matricesV and W, the equations (4) for all elements i can bejoined into a single matrix equation

W = (I − λR)−1V. (5)

In analogy to Eq. (1), the element similarity withinferred features reads

ψ fpxy =

wTx Swy√

wTxSwx

√wT

y Swy

. (6)

Similarity propagationOur second scheme, similarity propagation (SP), worksfor any similarity score. Given the direct similarities sxy

(from Eq. (1) or elsewhere) between elements x inmodel M and elements y in model N, we define theinferred similarities by

ψ spxy = σxy + λ

∑a∈M,p∈N

ρMxa ψ

spapρ

Nyp. (7)

The idea is analogous to Eq. (2): the inferred similarityψ

spxy represents the direct similarity sxy plus the weighted

inferred similarities between all elements a and p towhich x and y refer. To solve Eq. (7), we merge doublesubscripts (as indicated by brackets) and rewrite it as

ψ spxy = σ(xy) + λ

∑(ap)

Q(xa)(yp)ψsp(ap),

where Q(xa)(yp) = ρMxaρ

Nyp. Using the vectors ψ sp = (ψ sp

(xa))

and s = (s(xa)) and the matrix Q = (Q(xa)(yp)), the solu-tion can again be written as

ψ sp = (I − λQ)−1σ . (8)

Like in Eq. (4), if s is non-negative, l can be chosensuch that ψsp is also non-negative. As an example, expli-cit calculations of both similarity measures for a partiallyannotated model of the phosphoglucoisomerase reactionare shown in section 5 in additional file 1.Analogy to diffusion processesFeature propagation resembles a diffusion process on thereaction network. Unlike diffusion in the strict sense, ourprocess is not symmetric and does not conserve mass.Methods with a similar background idea have alreadybeen successfully applied to assign protein functions inprotein-protein interaction networks [31,32].Likewise, similarity propagation can be seen as a diffu-

sion-like process on a pair propagation graph in whicheach node corresponds to a pair (x, y) of elements fromthe two models. An example is shown in Figure 7. Anedge from node (a, p) to node (x, y) in this graph(with weight ρM

xaρNyp) means that semantic information is

propagated from a to x (with weight ρMxa) and from p to

y (with weight ρMyp ). The matrix Q(xa)(yp) serves as a non-

symmetric, non-mass-conserving diffusion matrix forthis graph. To describe semantic propagation as a tem-poral process on this graph, we may introduce a produc-tion term s and a linear degradation term. We thenobtain the equation

ddtψ sp = σ + Q ψ sp − ψ sp,

whose stationary distribution is given by Eq. (8).The structure of the pair propagation graph deter-

mines how information about element similarities canspread during propagation. For instance, the graph inFigure 7B shows that an initial matching between spe-cies (edges 1, 2, 4, and 5) or between reactions (edge 9)cannot lead to a matching between species and reactions(edges 3, 6, 7, and 8) because they appear in separatesubgraphs.

Model alignmentA main task in model merging is to detect and combinecorresponding elements between models. In such amodel alignment the element matching has to be non-ambiguous (an element cannot be matched to severalelements in another model) and transitive (if x, y, and zare elements from three different models and if thepairs (x,y) and (y,z) are matched, the pair (x,z) ismatched automatically).

Schulz et al. BMC Bioinformatics 2012, 13:18http://www.biomedcentral.com/1471-2105/13/18

Page 9 of 11

Given two models M and N with elements x and y,a model alignment can be defined as a setP = {(x, y), (a, p), (b, q), ...} of disjoint element pairs. Ifmore than two models are aligned, we may also obtaintriples or larger tuples of elements. All remaining singleelements are also treated as tuples. To find a good align-ment P, we use precalculated element similarities ψxy

and evaluate the score function

f =∑

(xy)∈Pψxy. (9)

Instead of maximising the score exactly (which wouldbe computationally hard), we employ a greedy algo-rithm: first, it chooses the element pair with the highestψ value. If this value is positive, the two elements arematched and the pair is removed from the list. Then,we continue to match the element pairs with the highestremaining ψ value until no positive ψ values are left. Ifthis leads to inconsistencies (i.e, if one of the two ele-ments already has a different matching partner in the

other model), the next best matching pair is consideredinstead. In the end, the alignment score is given by thesum of all ψ values collected.For aligning three or more models, we use the same

matching score (for element pairs), but consider thetransitivity constraint: Whenever we match two ele-ments, we also match all their previously determinedmatching partners (and consider the respective mutualsimilarities in the score function).

Finding element annotations based on model alignmentsSemantic propagation can be used to suggest annotationsfor non-annotated model elements. For this purpose theconsidered partially annotated model is aligned to a fullyannotated large pathway map. Afterwards, annotations aretransferred from elements in the map to the correspond-ing, non-annotated elements in the model.As a prerequisite, we collected many annotated model

elements appearing in BioModels Database, whichserved as a replacement for the large, annotated pathwaymap, and calculated their propagated feature vectors.For each element (x Î M) in each of the models, itspropagated feature vector (wx) is computed. For everyannotation i on the element (∀i:v∗,i �=0) we do the follow-ing. The corresponding feature is removed from the vec-tor (wx ,i = 0) and the pair of the feature and thispropagated feature vector is added to the collection.To find annotations for a non-annotated element of

interest, we calculate its propagated feature vectorwithin its model and check if there are similar vectorsin our collection. Afterwards, the annotations associatedwith these vectors are presented to the user. The generalidea is that given a model from the BioModels Databasethat lacks annotations for a single feature on a given ele-ment, this method would be able to predict annotationsfor this element with perfect accuracy.For annotation prediction on the semanticSBML web-

site, the feature propagation algorithm has been slightlymodified. To reduce the number of annotations sug-gested, we set the propagation values r between reaction- reactant/product to 1

2 and all others to 0. Moreover,information is propagated only to direct neighbours (i.e.the sum in Eq. (4) does not go to infinity but to 1).

Additional material

Additional file 1: Appendix. Text document containing details on theimplementation, a further comparison of the methods, an additionalnumerical example, and further analyses.

AcknowledgementsThe authors thank Isabel Rojas for suggesting the idea of semanticpropagation, Falko Krause for implementation work in semanticSBML, Ron

Figure 7 Similarity propagation resembles diffusion of a graph.A: Propagation graph from Figure 2, redrawn with labels forelement pairs (diamonds). B: Similarity propagation on a pairpropagation graph. Nodes in the graph correspond to element pairsin the original networks. Edges indicate that information can bepropagated simultaneously for both elements in a pair. For instance,similarity information can be propagated from the pair (a,p) to thepair (x,y), resulting in the edge 1 ® 9 with a weight a · a. C:Feature propagation entails a similar process with more paths forpropagation. For instance, similarity information could now bepropagated from the pair (a,y) to the pair (x,y), resulting in the edge5 ® 9 with weight a · 1.

Schulz et al. BMC Bioinformatics 2012, 13:18http://www.biomedcentral.com/1471-2105/13/18

Page 10 of 11

Milo for kindly hosting W.L. at the Weizmann Institute of Science, and ananonymous referee for valuable suggestions.This work was supported by the German Research Foundation [CRC 618 andGZ: LI 1676/2-1], the International Max Planck Research School forComputational Biology and Scientific Computing, and the EuropeanCommission [grant number LSHG-CT-2006-037469].

Author details1Institut für Biologie, Theoretische Biophysik, Humboldt-Universität zu Berlin,Invalidenstr. 42, 10115 Berlin, Germany. 2Institut für Biochemie, Charité-Universitätsmedizin Berlin, Seestr. 73, 13347 Berlin, Germany.

Authors’ contributionsMS and WL developed the semantic propagation and wrote the paper. MSimplemented the algorithms and analysed the data. MS, EK, and WLdesigned computational experiments. All authors read and approved thefinal manuscript.

Received: 2 November 2011 Accepted: 30 January 2012Published: 30 January 2012

References1. Goodfellow M, Wilson J, Hunt E: Biochemical network matching and

composition. Proceedings of the 2010 EDBT Workshops, ACM 2010, 1-7.2. Randhawa R, Shaffer C, Tyson J: Model aggregation: a building-block

approach to creating large macromolecular regulatory networks.Bioinformatics 2009, 25(24):3289.

3. Moutselos K, Kanaris I, Chatziioannou A, Maglogiannis I, Kolisis F:KEGGconverter: a tool for the in-silico modelling of metabolic networksof the KEGG Pathways database. BMC bioinformatics 2009, 10:324.

4. Wang YT, Huang YH, Chen YC, Hsu CL, Yang UC: PINT: PathwaysINtegration Tool. Nucleic Acids Res 2010.

5. Le Novère N, Finney A, Hucka M, Bhalla U, Campagne F, Collado-Vides J,Crampin E, Halstead M, Klipp E, Mendes P, Nielsen P, Sauro H, Shapiro B,Snoep J, Spence H, Wanner B: Minimum information requested in theannotation of biochemical models (MIRIAM). Nature biotechnology 2005,23(12):1509-1515.

6. Hucka M, Finney A, Sauro H, Bolouri H, Doyle J, Kitano H, Arkin A,Bornstein B, Bray D, Cornish-Bowden A, Cuellar AA, Dronov S, Gilles E,Ginkel M, Gor V, Goryanin II, Hedley WJ, Hodgman TC, Hofmeyr JH,Hunter PJ, Juty NS, Kasberger JL, Kremling A, Kummer U, Le Novère N,Loew LM, Lucio D, Mendes P, Minch E, Mjolsness ED, Nakayama Y,Nelson MR, Nielsen PF, Sakurada T, Schaff JC, Shapiro BE, Shimizu TS,Spence HD, Stelling J, Takahashi K, Tomita M, Wagner J, Wang J: Thesystems biology markup language (SBML): a medium for representationand exchange of biochemical network models. Bioinformatics 2003,19(4):524-531.

7. Sevilla J, Segura V, Podhorski A, Guruceaga E, Mato J, Martinez-Cruz L,Corrales F, Rubio A: Correlation between gene expression and GOsemantic similarity. IEEE/ACM Transactions on Computational Biology andBioinformatics (TCBB) 2005, 2(4):330-338.

8. Schulz M, Krause F, Le Novère N, Klipp E, Liebermeister W: Retrieval,alignment, and clustering of computational models based on semanticannotations. Mol Syst Biol 2011, 19(7):513.

9. Gay S, Soliman S, Fages F: A graphical method for reducing and relatingmodels in systems biology. Bioinformatics 2010, 26(18):i575.

10. Yang Q, Sze S: Path matching and graph matching in biologicalnetworks. Journal of Computational Biology 2007, 14:56-67.

11. Kelley BP, Yuan B, Lewitter F, Sharan R, Stockwell BR, Ideker T: PathBLAST: atool for alignment of protein interaction networks. Nucleic Acids Res 2004,1(32 Web Server):W83.

12. Shlomi T, Segal D, Ruppin E, Sharan R: QPath: a method for queryingpathways in a protein-protein interaction network. BMC bioinformatics2006, 7:199.

13. Pinter R, Rokhlenko O, Yeger-Lotem E, Ziv-Ukelson M: Alignment ofmetabolic pathways. Bioinformatics 2005, 21(16):3401.

14. Dost B, Shlomi T, Gupta N, Ruppin E, Bafna V, Sharan R: Research inComputational Molecular Biology Springer; 2007, 1-15.

15. Hattori M, Okuno Y, Goto S, Kanehisa M: Development of a chemicalstructure comparison method for integrated analysis of chemical and

genomic information in the metabolic pathways. Journal of the AmericanChemical Society 2003, 125(39):11853-11865.

16. Tohsato Y, Matsuda H, Hashimoto A: A multiple alignment algorithm formetabolic pathway analysis using enzyme hierarchy. Proceedings of the8th International Conference on Intelligent Systems for Molecular Biology(ISMB’00) 2000, 376-383.

17. Lord P, Stevens R, Brass A, Goble C: Investigating semantic similaritymeasures across the Gene Ontology: the relationship between sequenceand annotation. Bioinformatics 2003, 19(10):1275.

18. Gamalielsson J, Olsson B: GOSAP: Gene Ontology-Based SemanticAlignment of Biological Pathways. International Journal of BioinformaticsResearch and Applications 2008, 4(3):274-294.

19. Wernicke S, Rasche F: Simple and fast alignment of metabolic pathwaysby exploiting local diversity. Bioinformatics 2007, 23(15):1978-1985.

20. Fionda V, Palopoli L: Biological Network Querying Techniques: Analysisand Comparison. Journal of Computational Biology 2011, 18(4):595-625.

21. Krause F, Uhlendorf J, Lubitz T, Schulz M, Klipp E, Liebermeister W:Annotation and merging of SBML models with semanticSBML.Bioinformatics 2010, 26(3):421.

22. Le Novère N, Bornstein B, Broicher A, Courtot M, Donizelli M, Dharuri H,Li L, Sauro H, Schilstra M, Shapiro B, Snoep J, Hucka M: BioModelsDatabase: a free, centralized database of curated, published,quantitative kinetic models of biochemical and cellular systems. NucleicAcids Research 2006, 34:D689-D691.

23. Huang C, Ferrell J: Ultrasensitivity in the mitogen-activated protein kinasecascade. Proceedings of the National Academy of Sciences 1996,93(19):10078.

24. Levchenko A, Bruck J, Sternberg P: Scaffold proteins may biphasicallyaffect the levels of mitogen-activated protein kinase signaling andreduce its threshold properties. Proceedings of the National Academy ofSciences of the United States of America 2000, 97(11):5818.

25. Hynne F, Danø S, Sørenson P: Full-scale model of glycolysis inSaccharomyces cerevisiae. Biophysical Chemistry 2001, 94:121-163.

26. Schulz M, Uhlendorf J, Klipp E, Liebermeister W: SBMLmerge, a System forCombining Biochemical Network Models. Genome Informatics 2006,17:62-71.

27. Degtyarenko K, de Matos P, Ennis M, Hastings J, Zbinden M, McNaught A,Alcantara R, Darsow M, Guedj M, Ashburner M: ChEBI: a database andontology for chemical entities of biological interest. Nucleic acids res2008, , 36 Database: D344-D350.

28. Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh M, Katayama T,Kawashima S, Okuda S, Tokimatsu T, et al: KEGG for linking genomes tolife and the environment. Nucleic acids research 2008, 36(suppl 1):D480.

29. Salton G: The SMART retrieval system - experiments in automatic documentprocessing Prentice-Hall, Inc. Upper Saddle River, NJ, USA; 1971.

30. Becker J, Kuropka D: Topic-based vector space model. Proceedings of the6th International Conference on Business Information Systems 2003, 7-12.

31. Nabieva E, Jim K, Agarwal A, Chazelle B, Singh M: Whole-proteomeprediction of protein function via graph-theoretic analysis of interactionmaps. Bioinformatics 2005, 21(suppl 1):i302.

32. Singh R, Xu J, Berger B: Global alignment of multiple protein interactionnetworks with application to functional orthology detection. Proceedingsof the National Academy of Sciences 2008, 105(35):12763.

doi:10.1186/1471-2105-13-18Cite this article as: Schulz et al.: Propagating semantic information inbiochemical network models. BMC Bioinformatics 2012 13:18.

Schulz et al. BMC Bioinformatics 2012, 13:18http://www.biomedcentral.com/1471-2105/13/18

Page 11 of 11