Fault Tolerance for Large Scale Protein 3D Reconstruction from Contact Maps

Fault Tolerance for Large Scale Protein 3D Reconstruction from Contact Maps Luciano Margara Marco Vassura Pietro di Lena Filippo Medri Piero Fariselli Rita Casadio Technical Report UBLCS-2007-16 May 2007 Department of Computer Science University of Bologna Mura Anteo Zamboni 7 40127 Bologna (Italy)



Fault Tolerance for Large Scale Protein 3D Reconstruction from Contact Maps

Luciano Margara (1), Marco Vassura (1), Pietro di Lena (1), Filippo Medri (1), Piero Fariselli (2), Rita Casadio (2)

Technical Report UBLCS-2007-16

May 2007

Abstract. In this paper we describe FT-COMAR, an algorithm that improves the fault tolerance of COMAR, our heuristic algorithm previously described for protein reconstruction [10]. COMAR (Contact Map Reconstruction) can reconstruct the three-dimensional (3D) structure of a real protein from its contact map with 100% efficiency when tested on 1760 proteins from different structural classes. Here we test the performance of COMAR on native contact maps perturbed with random errors, in order to simulate possible scenarios of reconstruction from predicted (and therefore highly noisy) contact maps. Our analysis shows that the algorithm reconstructs blurred contact maps better when contacts are under-predicted than when they are over-predicted. Moreover, we modify the algorithm into FT-COMAR (Fault Tolerant COMAR) so that it can be used with incomplete contact maps. FT-COMAR can ignore up to 75% of the contact map and still recover, from the remaining 25% of the entries, a three-dimensional structure whose root mean square deviation (RMSD) from the native one is less than 4 Å. Our results indicate that the quality, more than the quantity, of predicted contacts is relevant to protein 3D reconstruction, and that some hints about "unsafe" areas in the predicted contact maps can be useful to improve reconstruction quality. To this end, we implement a very simple filtering procedure to detect unsafe areas in contact maps and we show that, in the presence of errors, it significantly improves the performance of the algorithm. Furthermore, we show that both COMAR and FT-COMAR outperform a previous state-of-the-art algorithm for the same task [13].

Contact: [email protected]; [email protected]
http://vassura.web.cs.unibo.it/cmap23derr/

1 Introduction

One of the unsolved problems in structural bioinformatics is ab initio Protein Structure Prediction (PSP), i.e. the problem of determining the three-dimensional structure (tertiary structure) of a protein from its one-dimensional chain of amino acid residues (primary structure) [9]. Predicting the tertiary structure of a protein directly from its primary structure is a complex problem. A typical alternative approach is to identify a set of sub-problems, such as the prediction of protein secondary structure, solvent accessibility and/or residue contacts, and to search for specific solutions to them. Among the different possibilities, the prediction of protein contact maps from the protein chain is particularly promising, since even a partial solution can significantly help the prediction of the protein structure [6].

1. Computer Science Department, University of Bologna.
2. Biocomputing Group, Department of Biology, University of Bologna.


A contact map of a given protein 3D structure is a two-dimensional symmetric binary matrix M such that $M_{i,j} = 1$ iff the Euclidean distance between amino acids i and j is less than or equal to a pre-assigned threshold t. The general problem of computing a set of three-dimensional coordinates consistent with a given contact map has been shown to be NP-hard [5]. A series of heuristic algorithms have been developed to solve the problem. Galaktionov and Marshall [7] reconstructed the structures of five small proteins by using information on the residue coordination numbers. Other approaches rely on steepest descent with distance inequality constraints [4] and on an algorithm that minimizes a continuous cost function embodying constraints associated with contact and angle maps [11]. On average these methods reconstruct protein structures without completely satisfying the contact map, in the sense that the reconstructed structures may have contact maps that differ slightly from the native ones. Vendruscolo et al. [12, 13] described a method based on simulated annealing with the contact map as a target potential. They achieved an average RMSD of 2.5 Å on some 20 protein structures, and their method is considered the state-of-the-art solution.

In [10] we proposed COMAR, a heuristic algorithm to find a set of three-dimensional coordinates consistent with a given native contact map. Our algorithm has been tested on a non-redundant data set of 1760 proteins. For the whole data set it always produces three-dimensional coordinates consistent with the native contact maps (computed with contact thresholds ranging from 8 to 18 Å [10]). Moreover, the algorithm shows good reconstruction performance in terms of RMSD and outperforms, to our knowledge, all other reconstruction techniques documented in the literature so far [10]. Performance analysis of our algorithm shows that there exist native contact maps with numerous different structures consistent with them. In general, the reconstruction quality is better for contact maps with thresholds between 10 and 18 Å, suggesting that contact maps with higher thresholds are more informative than those with lower thresholds. However, despite its good performance, the algorithm cannot be used directly for protein structure prediction. This is to some extent a consequence of the poor performance of contact map predictors in predicting the physical contact maps of proteins. Our previous version of the algorithm was tested on native contact maps [10]. Predicted contact maps, however, are highly blurred and noisy, and can be non-physical, i.e. not consistent with any set of three-dimensional coordinates.

In this paper we analyze and improve the fault tolerance of COMAR for protein reconstruction. For the purposes of this investigation we introduce three different classes of random errors: general errors, errors on contacts (that is, errors on 1-entries of contact maps) and errors on non-contacts (that is, errors on 0-entries of contact maps). We perform extensive tests of the reconstruction quality of our algorithm on a set of 120 non-redundant protein chains and compare the reconstruction performance in terms of RMSD on the three classes of errors. Our analysis shows that in general the reconstruction quality decreases with the length of the protein and that our algorithm largely tolerates errors on contacts. In particular, the experimental results show that the reconstruction quality of contact maps with 50% errors on contacts is comparable to that of contact maps with 1% errors on non-contacts. That is, our algorithm is much more tolerant to under-prediction than to over-prediction of contacts. We further tested this hypothesis by analyzing incomplete contact maps with an improved version of our algorithm, called FT-COMAR (Fault Tolerant COMAR). Experimental tests show that FT-COMAR can ignore up to 75% of the contact map and still obtain a three-dimensional protein structure whose RMSD from the native one is less than 4 Å. Furthermore, the reconstruction quality is independent of protein length. This suggests that, to improve protein reconstruction from contact maps, contact map prediction should put much more emphasis on prediction quality than on quantity. A simple way to improve the quality of reconstruction is to pre-process the contact map in order to detect unsafe contact regions. This filtering step indicates to FT-COMAR which areas of the contact map have to be ignored. In this paper we compare pre-processing computed according to a perfect filtering procedure (which detects all the wrong contacts and non-contacts and labels them as non-determined) with a simple real filter based on second connectivity information in the contact map. As expected, the perfect filter gives the upper limit of reconstruction efficiency. However, our analysis shows that even with the simple filter the reconstruction quality is overall better than with COMAR and, furthermore, the results are independent of the length of the protein for errors below 8%. To conclude, we compare the performance of our algorithms with the results of the state-of-the-art reconstruction algorithm [13] and find that both COMAR and FT-COMAR achieve better reconstruction quality.

2 Protein structure reconstruction from contact maps

2.1 Protein representation with contact maps

In this paper we adopt the widely used Cα representation of the protein backbone, where residues are considered as single entities. The contact map of a given protein is a binary symmetric matrix CM such that CM[i, j] = 1 iff the Euclidean distance between residues i and j is less than or equal to a pre-assigned threshold t (Fig. 1a, area above the diagonal). Typical values of t considered in the literature vary between 7 and 12 Å. As we showed in [10], higher threshold values allow better reconstruction, and in this work we adopt t = 12 Å. An introduction to the reconstruction of protein structures from contact maps can be found in [3]. To measure the similarity between two three-dimensional protein structures, described by sets of coordinates $C, C' \in \mathbb{R}^{3 \times n}$, we use the Root Mean Square Deviation (RMSD); it is defined as the smallest distance

$$D_k = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( C'[i] - C_k[i] \right)^2},$$

where $C_k \in \mathbb{R}^{3 \times n}$ is obtained by rotating and translating the coordinate set $C$.
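The two definitions above can be sketched in a few lines of Python with NumPy. This is only an illustration, not the code used for the experiments: the function names are hypothetical, coordinates are stored as an (n, 3) array rather than the 3×n layout of the text, and the minimization over rotations and translations is done with the standard Kabsch superposition, which the report does not name.

```python
import numpy as np


def contact_map(coords: np.ndarray, t: float = 12.0) -> np.ndarray:
    """Binary symmetric contact map of an (n, 3) array of C-alpha coordinates."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # all pairwise Euclidean distances
    return (dist <= t).astype(np.int8)         # CM[i, j] = 1 iff distance <= t


def rmsd(c_ref: np.ndarray, c: np.ndarray) -> float:
    """Smallest RMSD between c_ref and any rotated and translated copy of c."""
    p = c - c.mean(axis=0)                     # centering removes the translation
    q = c_ref - c_ref.mean(axis=0)
    u, _, vt = np.linalg.svd(p.T @ q)          # SVD of the 3x3 covariance matrix
    # Kabsch step (assumed here): force a proper rotation, no reflection.
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    p_rot = p @ rot.T
    return float(np.sqrt(((p_rot - q) ** 2).sum() / len(q)))
```

Calling `contact_map(coords, 12.0)` on native Cα coordinates gives the kind of native map used throughout the report, and `rmsd(native, reconstructed)` gives the quality measure reported in the figures.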

2.2 Description of COMAR and FT-COMAR

COMAR (Contact Map Reconstruction) finds a set of three-dimensional coordinates consistent with a given native contact map [10]. COMAR consists of two phases (see the pseudocode below). In the first phase it generates an initial set of 3D coordinates $C \in \mathbb{R}^{3 \times n}$, while in the second phase it iteratively refines the set of coordinates by applying a correction/perturbation procedure to C. The refinement continues until the set of coordinates is consistent with the given contact map or until a control parameter ε becomes 0. The control parameter ε initially has a positive value and is decremented after a certain number of refinement steps. If ε reaches 0 and a correct solution has still not been found, a new initial random solution is generated and the refinement process starts over again.

COMAR(CM ∈ {0,1}^{n×n}, t ∈ N)
1: while coordinates set C is not correct do
       // First phase: initial solution generation
2:     C ← RANDOM-PREDICT(CM, t)
       // Second phase: refinement
3:     C ← CORRECT(CM, C, t)
4:     set ε to a strictly positive value
5:     while coordinates set C is not consistent with CM and ε > 0 do
6:         C ← PERTURBATE(CM, C, t, ε)
7:         C ← CORRECT(CM, C, t)
8:         decrement slightly ε
9: return C

Extended tests on native contact maps and a detailed description of the algorithm can be found in [10]. To test the reliability of our reconstruction technique on faulty contact maps we need to modify the termination condition of COMAR: in this paper the algorithm always stops after the first run of the main cycle, i.e. the while loop of line 1 is executed just once. This modification is necessary because a faulty contact map can be non-physical, i.e. there may be no three-dimensional structure consistent with it, and the termination condition of our original algorithm (COMAR line 1) would make the procedure run forever when applied to a non-physical contact map.
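The control flow of COMAR, including the single-run termination used in this paper, might be sketched as follows. The sketch deliberately leaves RANDOM-PREDICT, CORRECT and PERTURBATE as caller-supplied functions, since their internals are described in [10] and not here. The name `comar_loop`, the parameters `eps_start`, `eps_step` and `max_restarts`, and their default values are assumptions made for illustration only.

```python
from typing import Callable, Optional

import numpy as np

Coords = np.ndarray  # assumption: coordinates stored as an (n, 3) array


def comar_loop(
    cm: np.ndarray,
    t: float,
    random_predict: Callable[[np.ndarray, float], Coords],
    correct: Callable[[np.ndarray, Coords, float], Coords],
    perturbate: Callable[[np.ndarray, Coords, float, float], Coords],
    is_consistent: Callable[[np.ndarray, Coords, float], bool],
    eps_start: float = 1.0,            # hypothetical initial value of epsilon
    eps_step: float = 0.1,             # hypothetical decrement of epsilon
    max_restarts: Optional[int] = 1,   # 1 = single run; None = original unbounded loop
) -> Coords:
    restarts = 0
    while True:
        coords = random_predict(cm, t)             # first phase: initial solution
        coords = correct(cm, coords, t)            # second phase: refinement
        eps = eps_start
        while not is_consistent(cm, coords, t) and eps > 0:
            coords = perturbate(cm, coords, t, eps)
            coords = correct(cm, coords, t)
            eps -= eps_step
        restarts += 1
        if is_consistent(cm, coords, t):
            return coords
        if max_restarts is not None and restarts >= max_restarts:
            return coords                          # best effort on a possibly non-physical map
```

With `max_restarts=None` the loop behaves like the original termination condition and may never return on a non-physical map. With the default of 1 it returns whatever coordinates the first run produced.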


To reconstruct partial and blurred contact maps we develop FT-COMAR (Fault Tolerant COMAR), a simple improvement of COMAR. FT-COMAR can work on incomplete contact maps, i.e. contact maps with some unknown entries: FT-RANDOM-PREDICT, FT-CORRECT and FT-PERTURBATE are simple modifications of RANDOM-PREDICT, CORRECT and PERTURBATE which do not consider unknown entries during processing. Moreover, to deal with blurred contact maps, the reconstruction phase of FT-COMAR is preceded by a pre-processing of the contact map (FILTER) that detects unsafe entries of the contact map and marks them as unknown. FT-COMAR is general enough to accept any type of filtering procedure. In this work we analyze the performance of FT-COMAR with a perfect FILTER, i.e. one that detects and marks as unknown exactly the faulty entries of the contact map (Sect. 3.4), and with a basic real filtering algorithm (Sect. 3.5).

FT-COMAR(CM ∈ {-1,0,1}^{n×n}, t ∈ N)
       // Pre-processing phase: error filtering
1: CM' ← FILTER(CM)
       // First phase: initial solution generation
2: C ← FT-RANDOM-PREDICT(CM', t)
       // Second phase: refinement
3: C ← FT-CORRECT(CM', C, t)
4: set ε to a strictly positive value
5: while coordinates set C is not consistent with CM' and ε > 0 do
6:     C ← FT-PERTURBATE(CM', C, t, ε)
7:     C ← FT-CORRECT(CM', C, t)
8:     decrement slightly ε
9: return C
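The consistency test on line 5 has to skip unknown entries. A minimal sketch of such a test is given below, assuming that the value -1 in CM' marks an unknown entry (the signature above allows {-1, 0, 1}, but the encoding is not spelled out in the text) and that coordinates are stored as an (n, 3) array. The function name is hypothetical.

```python
import numpy as np

UNKNOWN = -1  # assumption: -1 marks an unknown (skipped) entry


def ft_is_consistent(cm: np.ndarray, coords: np.ndarray, t: float) -> bool:
    """True if coords reproduce every known entry of the partial contact map cm."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    in_contact = dist <= t                     # contacts implied by the coordinates
    known = cm != UNKNOWN                      # unknown entries are not checked
    return bool(np.all(in_contact[known] == (cm[known] == 1)))
```

On a map without unknown entries this reduces to the plain consistency check of COMAR.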

3 Experimental results

3.1 Data set

We selected proteins from SCOP [2] release 1.67 having X-ray structures in the PDB with resolution < 2.5 Å and no missing internal residues. We removed sequence redundancy using BLAST [1], ending up with a data set of 1760 protein chains with sequence similarity lower than 25%. Among these we selected 120 proteins, distributed (not uniformly) over lengths between 50 and 1100 residues. To avoid contact maps that are known to be consistent with very different structures [10], we chose proteins whose three-dimensional structure can be reconstructed by COMAR to within 1 Å RMSD of the native structure. The distribution of the resulting protein set over the SCOP structural classes is: 8 all-Alpha, 20 all-Beta, 58 Alpha/Beta and 14 Alpha+Beta among mono-domain proteins; 3 Multi-{B,C,D} and 17 Other among multi-domain proteins, for a total of 100 mono-domain and 20 multi-domain proteins (see footnote 3).

3.2 Error generation and tests configuration

To study how well our algorithm can reconstruct protein 3D structures from faulty contact maps we introduce three classes of random errors:

• Err. Errors are generated by flipping the entry at a randomly chosen row and column of the contact map (Fig. 1a, area below the diagonal). To introduce x% errors we generate x errors for every 100 pairs of residues, that is $\frac{x}{100} \cdot \frac{n(n-1)}{2}$ total errors.

• Err-0 (designed to preserve contacts). Errors are generated as before, but the entry of the contact map is flipped only if it is not a contact (Fig. 1b, below the diagonal). Here x% errors means $\frac{x}{100} \left( \frac{n(n-1)}{2} - \#\text{contacts} \right)$ total errors.

• Err-1 (designed to preserve non-contacts). The entry of the contact map is flipped only if it is a contact (Fig. 1b, above the diagonal). Here x% errors means $\frac{x}{100} \cdot \#\text{contacts}$ total errors.

3. The complete list is available at the URL http://vassura.web.cs.unibo.it/protlist120.tgz

Figure 1. Contact map of the Asn102 mutant of trypsin (PDB code: 1trmA). The contact map is computed with a threshold of 12 Å: gray areas are contacts, white areas are non-contacts and black areas are errors. (a) Above the diagonal: native map, 24753 pairs of residues, 3595 contacts, 21158 non-contacts, and no errors. (a) Below the diagonal: Err 5%, that is (5% of 24753 =) 1237 random errors. (b) Above the diagonal: Err-1 5%, that is (5% of 3595 =) 179 random errors on contacts. (b) Below the diagonal: Err-0 5%, that is (5% of 21158 =) 1057 random errors on non-contacts. The protein is also the test case shown in [13].

In our tests, for each protein contact map and for each percentage of errors considered, we generate 100 different faulty contact maps. Thus, having 120 proteins in our set, we run 12000 tests for each percentage of errors. Our test results must therefore always be read as average values over the 100 different instances we generate. All test runs were executed on personal computers equipped with an Intel Pentium 4 processor with a clock rate of 2.8 GHz and 1 GB of RAM. Times reported are Unix user CPU times, measured using the time() C library function. The heuristic is freely available for testing on the web at the following URL: http://vassura.web.cs.unibo.it/cmap23derr/.
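The three error classes of this section could be generated along the following lines. This is a sketch under stated assumptions rather than the authors' generator: the function name `add_errors` is hypothetical, erroneous pairs are drawn without repetition directly among the eligible pairs, and the number of errors is rounded down, which matches the counts quoted in the caption of Fig. 1 (for example 5% of 24753 giving 1237).

```python
import numpy as np


def add_errors(cm, percent, kind="Err", rng=None):
    """Return a copy of cm with percent% of the eligible residue pairs flipped.

    kind: "Err" (any pair), "Err-0" (non-contacts only), "Err-1" (contacts only).
    """
    rng = np.random.default_rng() if rng is None else rng
    rows, cols = np.triu_indices(cm.shape[0], k=1)   # every pair i < j exactly once
    values = cm[rows, cols]
    if kind == "Err":
        eligible = np.arange(rows.size)
    elif kind == "Err-0":
        eligible = np.flatnonzero(values == 0)
    elif kind == "Err-1":
        eligible = np.flatnonzero(values == 1)
    else:
        raise ValueError(f"unknown error class: {kind}")
    n_errors = int(eligible.size * percent // 100)   # assumption: round down
    chosen = rng.choice(eligible, size=n_errors, replace=False)
    noisy = cm.copy()
    i, j = rows[chosen], cols[chosen]
    noisy[i, j] = 1 - noisy[i, j]                    # flip the selected entries
    noisy[j, i] = noisy[i, j]                        # keep the map symmetric
    return noisy
```

Repeating the call 100 times per protein and per percentage, with different random states, would reproduce the test configuration described above.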

3.3 Structure reconstruction from faulty contact maps

In this section we show experimental results on the behavior of COMAR with faulty contact maps. We perform tests by introducing from 1% up to 10% random errors of class Err. The average RMSD of the reconstructions from those faulty contact maps is shown in Fig. 2. The results indicate that the quality of the protein 3D structure reconstruction depends on the protein size: proteins with fewer than 150 residues are reconstructed with an RMSD (from the native structure) of less than 5 Å even when 10% random errors are introduced. For proteins with between 150 and 400 residues, the quality of the reconstruction decreases as errors increase, but the average RMSD still remains below 5 Å for small percentages of errors. For proteins with more than 400 residues our algorithm performs poorly (RMSD > 5 Å) even for small percentages of errors, including 1%. Note that the absolute number of errors corresponding to the same percentage increases with size: for example, 10% random errors for a protein of size 100 means 495 errors, while 1% random errors for a protein of size 400 means 798 errors.

Figure 2. Reconstruction quality (RMSD) as a function of the number of residues in the protein (Size) and of the percentage of random errors on the total pairs of residues (Err%). Better reconstructions have darker colors. As expected, reconstruction quality decreases for bigger proteins and higher percentages of errors. Note that the absolute number of errors corresponding to the same percentage increases with size: 10% random errors for a protein of size 100 means 495 errors, while 1% random errors for a protein of size 400 means 798 errors (12000 contact maps are analyzed).

We analyze how the reconstruction quality varies among SCOP categories, with the aim of highlighting whether some categories can be reconstructed better than others. In Fig. 3 we show how reconstruction quality varies for different SCOP categories when we introduce 5% random errors. As shown in Fig. 2, the mean RMSD from the native structure increases with protein size, with some exceptions. The most notable exception is the CDK4/6 inhibitory protein p18INK4c (1ihb chain A, size 156), which is in the SCOP Alpha+Beta category. It appears (Fig. 3) that exceptions to the length-dependent behavior of the reconstruction quality are rare and distributed among SCOP categories, so that it cannot be concluded that one SCOP category is more difficult to reconstruct from faulty contact maps than another.

We analyze how different types of errors influence the quality of reconstruction. In particular, in Fig. 4 we compare the performance of COMAR on the three classes of errors Err, Err-0 (errors on non-contacts) and Err-1 (errors on contacts) introduced in Section 3.2. As shown in Fig. 4, on average COMAR copes better with Err-1 errors than with Err-0 errors. For example, contact maps with 50% errors on contacts are reconstructed with the same quality as contact maps with 1% errors on non-contacts (which means about 10% extra contacts).

Figure 3. Reconstruction quality (RMSD) with 5% Err errors as a function of protein length (Size), grouped by SCOP category. As expected, the quality is better for small mono-domain proteins, with few exceptions. Note that the exceptions do not belong to the same SCOP category, so that no category is reconstructed better by COMAR than the others. (The number of contact maps is as in Fig. 2).


Figure 4. Average RMSD to the native structure of structures reconstructed from contact maps, as a function of the percentage of errors with respect to (wrt) each error class: Err refers to random errors, Err-1 refers to errors on contacts and Err-0 refers to errors on non-contacts. Note that reconstruction quality is better in the presence of Err-1 errors. (The number of contact maps is as in Fig. 2).

3.4 Improving the reconstruction from faulty contact maps

Our tests give some clues on how the quality of contact map prediction could influence the reconstruction phase. This is much more evident if we analyze the reconstruction quality of FT-COMAR on faulty contact maps under the assumption of a perfect filtering procedure, i.e. a procedure that is able to detect all errors in a faulty contact map. To test this approach we generate random incomplete contact maps by randomly choosing a row and a column of the contact map and marking that entry, corresponding to a detected error, as not safe (not to be considered during the reconstruction routine). As shown in Fig. 5, FT-COMAR with perfect filtering can skip up to 75% of the contact map area and still compute a reconstructed 3D structure with an RMSD < 4 Å from the native structure. Furthermore, this reconstruction quality is independent of protein size. This unexpected result is due to the fact that FT-COMAR does not consider skipped entries in the refinement phase (see Section 2.2 for the description of the algorithm). In this way FT-COMAR does not use wrong information during the refinement phase, avoiding the propagation of errors. The drawback is that this holds only if the remaining entries of the contact map are correct, i.e. only in the presence of a perfect filter. As shown in Fig. 6, even if we skip only 25% of the entries, the reconstruction quality decreases rapidly as errors on the remaining 75% of the map increase. Note that in this case the reconstruction quality again depends on the length of the protein. We interpret these results as evidence that the quality of the reconstruction is affected more negatively by erroneous predictions of some contacts than by ignoring a substantial subset of contacts during the reconstruction.
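The incomplete maps used in the perfect-filter experiment could be produced as in the sketch below. It assumes, as an illustration only, that unknown entries are encoded as -1, that skipped pairs are drawn without repetition among the n(n-1)/2 residue pairs, and that the count is rounded down. The function name is hypothetical.

```python
import numpy as np

UNKNOWN = -1  # assumption: -1 marks an entry that the reconstruction must ignore


def skip_random_entries(cm, percent, rng=None):
    """Return a copy of cm in which percent% of the residue pairs are marked unknown."""
    rng = np.random.default_rng() if rng is None else rng
    rows, cols = np.triu_indices(cm.shape[0], k=1)   # every pair i < j exactly once
    n_skip = int(rows.size * percent // 100)
    chosen = rng.choice(rows.size, size=n_skip, replace=False)
    partial = cm.astype(np.int8)                     # signed copy, so that -1 fits
    partial[rows[chosen], cols[chosen]] = UNKNOWN
    partial[cols[chosen], rows[chosen]] = UNKNOWN    # keep the map symmetric
    return partial
```

With `percent=75` this leaves only a quarter of the pairs known, which is the most extreme setting reported in Fig. 5.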

Figure 5. Reconstruction quality (RMSD) as a function of the number of residues in the protein chain (Size) and of the percentage of randomly skipped pairs on the total pairs of residues (see legend). Lower percentages of Skip have darker colors: note that we reconstruct with RMSD < 4 Å with up to 75% unknown entries of the contact map for proteins of any size. (The number of contact maps is as in Fig. 2).

Figure 6. Reconstruction quality (RMSD) as a function of the number of residues in the protein (Size) when 25% of the input contact map is skipped. Increasing percentages of random errors (Err) on the remaining 75% of the map are shown (see legend). Lower percentages of Err have darker colors: note that we reconstruct with RMSD < 4 Å only for low percentages of errors and that reconstruction quality decreases with increasing protein size. (The number of contact maps is as in Fig. 2).

3.5 Error filters preprocessing with FT-COMAR

The experimental results in Section 3.4 show that we can reconstruct the 3D structure of a protein much more reliably if we are able to predict which areas of the contact map are unsafe. This suggests that prediction quality is more important than the quantity of predicted contacts: for instance, comparing Fig. 2 and Fig. 5 it is evident that it is better to predict 25% of the contact map with no errors than 100% of the contact map with 5% errors. This holds especially for proteins with a high number of residues. At present there is no way to predict contact maps with high reliability. Labeling unsafe contact map areas therefore seems a viable alternative. Various properties can be used to test the "safeness" of contact map areas, from physical constraints to graph properties. Here we propose a simple filtering procedure based on the so-called second connectivity property, namely the number of common contacts of two nodes in the undirected graph defined by the contact map, and we analyze how this procedure improves the reconstructions of our algorithm on faulty contact maps. The second connectivity property roughly assumes that two residues i, j are in contact if and only if they share a high number of neighbors, i.e. there is a high number of residues which are close to both i and j. Experimentally, in our data set of 1760 non-redundant protein chains only 6% of the residue pairs which are in contact share fewer than 10 neighbors, and just 0.7% of the residue pairs which are not in contact share more than 18 neighbors. Thus our second connectivity filtering procedure skips the entry i, j if:

• CM[i, j] = 1 (i and j are in contact) and i, j share fewer than 10 neighbors, i.e. residue i is in contact with fewer than 10 residues which are also in contact with residue j;

• CM[i, j] = 0 (i and j are not in contact) and i, j share more than 18 neighbors, i.e. residue i is in contact with more than 18 residues which are also in contact with residue j.
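The two rules above translate almost directly into a matrix computation, since the number of neighbors shared by i and j is the (i, j) entry of the squared adjacency matrix. The sketch below is one possible reading, not the authors' implementation: the function name, the -1 encoding of skipped entries, and the choice not to count a residue as its own neighbor are assumptions.

```python
import numpy as np

UNKNOWN = -1  # assumption: -1 marks a skipped (unsafe) entry


def second_connectivity_filter(cm, min_shared=10, max_shared=18):
    """Mark as unknown the entries of cm that violate the second connectivity rules."""
    adj = (cm == 1).astype(np.int64)
    np.fill_diagonal(adj, 0)                   # assumption: no residue is its own neighbor
    shared = adj @ adj                         # shared[i, j] = number of common neighbors
    weak_contact = (cm == 1) & (shared < min_shared)
    crowded_gap = (cm == 0) & (shared > max_shared)
    filtered = cm.astype(np.int8)              # signed copy, so that -1 fits
    filtered[weak_contact | crowded_gap] = UNKNOWN
    np.fill_diagonal(filtered, cm.diagonal())  # leave the diagonal untouched
    return filtered
```

The defaults 10 and 18 are the thresholds stated in the text. The result can be passed to FT-COMAR as CM' in place of the raw faulty map.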

Results for the reconstruction quality of FT-COMAR with the simple filter described above are shown in Fig. 7. For percentages of errors below 8% the reconstruction quality is independent of protein length, as in Fig. 5. This means that the filter skips faulty areas large enough to avoid their negative influence on the whole reconstruction. When errors exceed 16% the reconstruction quality decreases with increasing protein length. To avoid this behavior, a better adjustment of the filtering parameters (based on the number of expected contacts) or other types of filtering procedures should be considered. Nevertheless, the overall reconstruction quality with this simple filter is significantly improved, as emerges from the comparison of Fig. 2 and Fig. 7. We also remark that our algorithms run within minutes, so that they can be used for large-scale prediction. The reconstruction times of FT-COMAR for our data set of 120 proteins are shown in Fig. 8.

Figure 7. Reconstruction quality (RMSD) of FT-COMAR as a function of the number of residues in the protein (Size). Lower percentages of random errors (Err%) on the whole contact map are shown with darker colors. Note that we reconstruct with RMSD < 4 Å for 1-8% of errors for proteins of any size, while above 16% of errors the simple filtering pre-processing adopted is not able to skip enough errors to keep reconstruction quality independent of protein size. (The number of contact maps is as in Fig. 2).

3.6 Comparison with previous work

In Fig. 9 we use the protein 1trm chain A as a target to compare with the previous state-of-the-art reconstruction algorithm of Vendruscolo et al. [13]. The reconstruction quality is shown as a function of the number of random errors introduced. Both COMAR and FT-COMAR (with the filtering procedure described in Section 3.5) obtain better reconstruction quality. To compare this result with the other tests described in this work, note that 1000 errors are approximately 4% of the total number of residue pairs and 4000 errors are approximately 16% of the residue pairs.
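As a worked check of these percentages, using the 223 residues of 1trm chain A (caption of Fig. 9) and the corresponding 24753 residue pairs (caption of Fig. 1):

```latex
\[
\frac{n(n-1)}{2} = \frac{223 \cdot 222}{2} = 24753, \qquad
\frac{1000}{24753} \approx 4.0\%, \qquad
\frac{4000}{24753} \approx 16.2\%.
\]
```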

Figure 8. Average FT-COMAR reconstruction times in seconds for our data set of 120 proteins as a function of protein length, for four percentages of random errors: 1%, 8%, 16% and 64%. Note that for 64% errors the execution time of FT-COMAR decreases. In this case the quality of the reconstruction also decreases (Fig. 7). (The number of contact maps is as in Fig. 2).

Figure 9. Average reconstruction quality (RMSD) for the protein 1trm (chain A, 223 residues) as a function of the number of random errors included in the native contact map. Vend refers to the performance described in [13]. 1000 errors are approximately 4% of the number of pairs of residues.

4 Conclusions and perspectives

In this paper we develop FT-COMAR, an algorithm that improves the fault tolerance of COMAR, our heuristic algorithm previously described for protein reconstruction [10]. We performed extensive tests of the reconstruction quality of COMAR on a set of 120 non-redundant protein chains and compared the reconstruction performance in terms of RMSD on three different classes of errors: general errors, errors on contacts (that is, errors on 1-entries of contact maps) and errors on non-contacts (that is, errors on 0-entries of contact maps). The experimental results show that the reconstruction quality of contact maps with 50% errors on contacts is comparable to that of contact maps with 1% errors on non-contacts. That is, COMAR is much more tolerant to errors on contacts than to errors on non-contacts. FT-COMAR can work on incomplete contact maps, i.e. contact maps with a set of unknown entries. We showed that FT-COMAR can ignore up to 75% of the contact map and still recover, from the remaining 25% of the entries, a three-dimensional structure with an RMSD from the native one of less than 4 Å. Our conclusion is therefore that, in order to improve structure reconstruction from contact maps, more emphasis should be put on the quality than on the quantity of contact predictions. This is also corroborated by the better results obtained when a simple filter is used to detect unsafe (randomly perturbed) contact map areas. The very basic filtering algorithm we developed is based on the second connectivity property of contacts, and its performance is tested against the reconstruction quality obtained with the unfiltered faulty contact maps. The reconstruction quality of FT-COMAR with this simple filtering procedure is overall better and, furthermore, it turns out to be independent of the length of the protein for percentages of errors below 8%. We think that, along this line, other more complex filtering procedures will further improve the reconstruction efficiency.

References

[1] S.F. Altschul, T.L. Madden, A.A. Schäffer, J. Zhang, Z. Zhang, W. Miller, D.J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17):3389-3402, 1997.

[2] A. Andreeva, D. Howorth, S.E. Brenner, T.J. Hubbard, C. Chothia, A.G. Murzin. SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res. 32(Database issue):D226-D229, 2004.

[3] L. Bartoli, E. Capriotti, P. Fariselli, P.L. Martelli, R. Casadio. The pros and cons of predicting protein contact maps.

[4] J. Bohr, et al. Protein structures from distance inequalities. J. Mol. Biol. 231:861-869, 1993.

[5] H. Breu, D.G. Kirkpatrick. Unit disk graph recognition is NP-hard. Computational Geometry 9:3-24, 1998.

[6] P. Fariselli, O. Olmea, A. Valencia, R. Casadio. Progress in predicting inter-residue contacts of proteins with neural networks and correlated mutations. Proteins 45(Suppl 5):157-162, 2001.

[7] S.G. Galaktionov, G.R. Marshall. Properties of intraglobular contacts in proteins: an approach to prediction of tertiary structure. In System Sciences, 1994. Vol. V: Biotechnology Computing, Proceedings of the Twenty-Seventh Hawaii International Conference, Vol. 5, 4-7 Jan. 1994, pp. 326-335.

[8] T.F. Havel. Distance Geometry: Theory, Algorithms, and Chemical Applications. In Encyclopedia of Computational Chemistry, 1998.

[9] A. Lesk. Introduction to Bioinformatics. Oxford University Press, 2006.

[10] L. Margara, M. Vassura, P. Di Lena, F. Medri, P. Fariselli, R. Casadio. Reconstruction of the Protein Structures from Contact Maps. To appear in Lecture Notes in Bioinformatics, proceedings of ISBRA07, Atlanta.

[11] G. Pollastri, A. Vullo, P. Frasconi, P. Baldi. Modular DAG-RNN Architectures for Assembling Coarse Protein Structures. J. Comp. Biol. 13(3):631-650, 2006.
