Heuristic Strategy for Geometric Hashing based Protein Structure Comparison of Ellipsoidal Representation

Yhi Shiau1,2, Jia-Nan Wang1, Yu-Feng Huang1, Chien-Kang Huang3∗

1Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan

2Chunghwa Telecom Laboratories, Tauyuan 326, Taiwan

3Department of Engineering Science and Ocean Engineering, National Taiwan University, Taipei 106, Taiwan

∗ To whom correspondence should be addressed. Tel: +886 2 3366 5736; Fax: +886 2 2932 9885; Email: [email protected]

Abstract

Many protein structure comparison methods use secondary structure information to perform a fast structure similarity search that finds initial alignments, and then refine the candidate solutions by iterative dynamic programming to optimize the final results. In this paper, we develop a method, Ellipsoidal Model Protein Structure Comparison, based on the concept of secondary structure element alignment followed by iterative refinement. In order to use all available structure information and obtain alternative solutions for further analysis, we use an ellipsoidal model to represent not only α-helices and β-sheets but also the remaining fragments for structural alignment. Different heuristic filters and a geometric hashing based global alignment estimation are applied to find better initial alignments quickly. We also provide top-N solutions, rather than only the best solution as in previous works, without increasing the computational time. We provide an online web service, Ballerina (http://ballerina.csie.ntu.edu.tw/), for protein structure comparison.

1. Introduction

Since Beccari discovered the first protein of vegetable origin in 1747, proteins have been known to play an important role in biochemical reactions. Among the related research topics, protein structure comparison (PSC) is one of the most basic and important for detecting the evolutionary and functional relationships between proteins. The function of a protein is related to its 3D structure [2]; that is, proteins with similar substructures may have similar functions. Therefore, improving the methodology and tools of PSC has been an important issue in molecular biology and bioinformatics for many years [1, 3, 8, 14, 18, 24].

In order to detect the functional or evolutionary relationships between proteins, PSC algorithms try to define the similarity between protein structures. The purpose of PSC is to identify a maximal set of equivalent Cα atoms upon which the 3D structures of the compared proteins can be aligned optimally. Previously proposed PSC algorithms exploit many different computing approaches, including Monte Carlo [9], dynamic programming [7, 17, 23, 24], 3D clustering [25], graph theory [29], spline approximation [4] and geometric hashing [5]. More recently, some non-sequential PSC algorithms have been proposed [14, 27]. The SCALI (Structural Core ALIgnment) program [27] can efficiently find conserved packing arrangements even if they are not sequentially ordered in space. Its algorithm starts from secondary structure and applies distance matrices and hidden Markov models (HMMs).

Based on previous studies, we propose a heuristic strategy that is more efficient than conventional PSC approaches and does not have the limits of FLASH [21]. In this paper, we combine heuristic strategies based on an ellipsoidal model, geometric hashing, and filtering criteria to compare protein structures; we call the method Ellipsoidal Model Protein Structure Comparison (EMPSC) [20]. The most important concept in EMPSC is the ellipsoidal representation: we build an ellipsoidal model for each secondary structure element (SSE) identified with DSSP [12] and for the coil/loop structure. If a protein structure has few or no α-helices or β-sheets, our approach segments the remaining parts and represents them with the ellipsoidal model. Rather than using a single-vector representation as FLASH does, we use an ellipsoid to represent each sequential segment. The initial alignment is selected from segment pairs (mainly SSEs) instead of residue pairs. EMPSC can also output multiple solutions. Like VAST and FLASH, we believe that SSEs are the more important elements of a protein's structural conformation, and that abstracting the structure information with SSEs (α-helix, β-sheet and coil) can be better than using CE's AFPs or ProSup's seeds. In addition, as in CE and ProSup, the initial alignment in EMPSC is found from the viewpoint of local alignment.

2007 IEEE International Conference on Bioinformatics and Biomedicine

0-7695-3031-1/07 $25.00 © 2007 IEEE. DOI 10.1109/BIBM.2007.41

2. Materials and method

The EMPSC algorithm basically has four steps – (1) Preprocessing: segment the proteins into SSEs with DSSP, and generate the ellipsoidal representation for each segment (mainly α-helix, β-sheet); (2) Initial alignment: generate the potential aligned segment pairs, looking for a good initial alignment via heuristic filtering; (3) Refinement: iteratively apply a dynamic programming algorithm to refine the initial alignment, which is a well-known procedure in most PSC algorithms [18, 30, 32]; (4) Final evaluation: evaluate the refined alignments, and provide the number of corresponding residues and root mean square deviation (RMSD) of alignment solutions.

2.1. Preprocessing: ellipsoidal representation

Step (1) of the EMPSC algorithm generates an ellipsoidal representation for each segment of the target protein. First, in step (1a), we use DSSP to identify the main secondary structures, α-helices and β-sheets. In step (1b), the remaining residues of the protein, which carry the coil/loop information, are clustered into new segments (coil sub-segments) according to residue adjacency, so that each run of contiguous remaining residues forms one segment. After that, each segment is represented by a 3D ellipsoidal model, and PCA (Principal Component Analysis) is applied to find its three orthogonal eigenvectors and their respective eigenvalues. In step (1c), the protein is thus decomposed into a set of residue segments and their ellipsoidal representations.
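The PCA step can be illustrated with a short NumPy sketch. This is not the EMPSC implementation: the function names `fit_ellipsoid` and `ideal_helix` are hypothetical, the input is assumed to be one Cα coordinate per residue, and the test segment is an idealized α-helix (radius 2.3 Å, rise 1.5 Å and 100° per residue) rather than DSSP output.

```python
import numpy as np

def fit_ellipsoid(coords):
    """Return (center, eigenvalues, axes) of the PCA ellipsoid of one segment.

    coords: (n, 3) array of atom coordinates (assumed: one C-alpha per residue).
    Eigenvalues are sorted in descending order; axes[k] is the unit
    eigenvector that belongs to eigvals[k].
    """
    pts = np.asarray(coords, dtype=float)
    center = pts.mean(axis=0)
    centered = pts - center
    # 3x3 covariance matrix of the centered coordinates
    cov = centered.T @ centered / len(pts)
    # eigh is meant for symmetric matrices and returns ascending eigenvalues
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    return center, eigvals[order], eigvecs[:, order].T

def ideal_helix(n_residues, radius=2.3, rise=1.5, turn_deg=100.0):
    """Idealized C-alpha trace of an alpha-helix along the z axis (test data)."""
    i = np.arange(n_residues)
    theta = np.radians(turn_deg) * i
    return np.column_stack([radius * np.cos(theta),
                            radius * np.sin(theta),
                            rise * i])

center, eigvals, axes = fit_ellipsoid(ideal_helix(12))
print("center:", center)
print("eigenvalues:", eigvals)
print("principal axis:", axes[0])
```

For the helix, the first axis should lie close to the helix axis, and the two smaller eigenvalues describe its cross-section. The same call applies unchanged to strands and coil sub-segments.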

2.2. Initial alignment: good superimposed transformation for the initial alignments

In step (2a), we generate all possible initial alignments of the compared proteins: every pair of SSE segments (α-helix or β-sheet), one from each protein, can serve as the center of a new coordinate system, while the remaining coil sub-segments are used in the biochemical filtering of step (2b). In step (2b), EMPSC filters the initial alignments with a heuristic filtering function that calculates the similarity between the two mapped SSEs of each pair. We currently implement several successive filters to remove unmatched or dissimilar pairs.

We define three filtering criteria: type filter, mass filter, and biochemical filter. The type filter requires the matching segments to have the same secondary structure type (α-α or β-β). The mass filter requires the difference in the number of residues between the two matching segments to be less than four. The biochemical filter checks the similarity of biochemical properties between two segments; in this filtering process, EMPSC makes sure that the biochemical features of the surrounding coil sub-segments are similar.
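Candidate generation with the type and mass filters can be sketched as below. The `Segment` class, the one-letter type labels and the function names are illustrative assumptions, not part of EMPSC. The biochemical filter is left out because its scoring is not specified here; it would be one more function in the `filters` tuple.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Segment:
    sse_type: str    # assumed labels: "H" helix, "E" strand, "C" coil
    n_residues: int

def type_filter(a, b):
    """Keep only pairs of the same secondary structure type."""
    return a.sse_type == b.sse_type

def mass_filter(a, b, max_diff=4):
    """Keep pairs whose residue counts differ by less than max_diff."""
    return abs(a.n_residues - b.n_residues) < max_diff

def candidate_pairs(segs_a, segs_b, filters=(type_filter, mass_filter)):
    """All helix/strand pairs (one segment per protein) passing every filter."""
    sse_a = [s for s in segs_a if s.sse_type in ("H", "E")]
    sse_b = [s for s in segs_b if s.sse_type in ("H", "E")]
    return [(a, b) for a, b in product(sse_a, sse_b)
            if all(f(a, b) for f in filters)]

protein_a = [Segment("H", 12), Segment("E", 6), Segment("C", 5)]
protein_b = [Segment("H", 10), Segment("H", 17), Segment("E", 9), Segment("C", 5)]
for a, b in candidate_pairs(protein_a, protein_b):
    print(a, "<->", b)
```

Passing the filters as a tuple of plain functions keeps them interchangeable, so a different combination can be tried without touching the pair generation.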

In step (2c), EMPSC aligns the geometric centers and the three principal eigenvectors of the candidate mapping segments, which generates new coordinates for the two compared proteins. In step (2d), a fast global alignment estimation based on geometric hashing [16] estimates the quality of each initial alignment. Finally, only the top-N superimposed transformations are output as good initial alignment candidates for further refinement.

2.3. Refinement and final evaluation

Step (3) applies the same refinement process as most heuristic PSC algorithms. The least-squares method is applied in the refinement process, and step (3) repeatedly refines each initial alignment until the number of corresponding residues converges. Finally, in step (4), EMPSC outputs the refined alignments of the top-N candidates from step (2) as the N alternative solutions.

2.4. Complexity analysis

The ellipsoid clustering is very fast; its time complexity is O(r), where r is the number of residues in the segments. In the initial alignment finding stage, the time complexity of EMPSC is O(e log e + pn) with the scoring function based on the fast O(n) hash function, where n is the number of residues in the protein, e is the number of segments in the protein, and p is the number of candidate mapping SSE segment pairs; p is much smaller than e. The complexity of the refinement stage is O(Cn²), where C is the number of iterations before the refinement process converges for each initial alignment. In the discussion section, we examine how the refinement process affects the execution time of EMPSC.

3. Experiments

Three experiments are designed to test EMPSC under different conditions of the protein structure comparison problem. In each experiment, EMPSC provides at most 10 alternative solutions; therefore, the Top-10 initial alignments in EMPSC are selected. The results show the efficiency and effectiveness of EMPSC in comparison with Dali [9], CE [22], VAST [17], ProSup [15], and FLASH [21]. The results of Dali and ProSup come from the original papers, and the results of CE and FLASH are gathered from our own experimental environment and are consistent with the original papers. We run all experiments on a Linux workstation with dual Intel Xeon 3.06 GHz CPUs and 2 GB of memory; none of the tested programs is parallelized.

3.1. One-against-all search for structural neighbors

Following previous work, we choose cAMP-dependent protein kinase for the one-against-all search for structural neighbors. In order to compare with the existing results of Dali, CE, VAST, ProSup, and FLASH, the parameter dc (the distance cutoff for alignment construction) is set to 6 Å. For all methods, we list the maximal number of corresponding residues and the minimal RMSD in Table 1. Since the CE and FLASH programs are available, we also list their execution times in our computing environment. In terms of RMSD and number of corresponding residues, EMPSC performs as well as the previous methods.
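The two reported quantities can be made concrete with a small sketch. It assumes the two structures are already superimposed and that `coords_a[i]` and `coords_b[i]` are putatively equivalent Cα atoms. How EMPSC applies dc internally is not described here, so dropping pairs farther apart than dc is an assumption, and the function name `evaluate_alignment` is hypothetical.

```python
import numpy as np

def evaluate_alignment(coords_a, coords_b, dc=6.0):
    """Count residue pairs within dc (in angstroms) and compute their RMSD.

    Assumption: row i of coords_a is matched to row i of coords_b, and both
    structures are already in the same (superimposed) coordinate frame.
    """
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    dist = np.linalg.norm(a - b, axis=1)   # per-pair distance
    keep = dist <= dc                      # pairs accepted under the cutoff
    n_res = int(keep.sum())
    rmsd = float(np.sqrt(np.mean(dist[keep] ** 2))) if n_res else float("nan")
    return n_res, rmsd

a = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
b = [[0.0, 0.0, 1.0], [1.0, 0.0, 1.0], [2.0, 0.0, 10.0]]
print(evaluate_alignment(a, b))   # the third pair lies beyond the 6 A cutoff
```

Counting more residues and lowering the RMSD pull in opposite directions, which is why the tables report both values for every method.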

3.2. 10 Difficult cases

In this experiment, we use a well-known data set, the 10 difficult cases [6] reported by Fischer et al. in 1996. Table 2 shows the structure alignment results for the 10 difficult cases. EMPSC performs worse in the case of 1ten:_(89) vs. 3hhr:B(195), but FLASH fails to output any statistically significant solution in some cases.

4. Discussions

4.1. Characteristic of EMPSC

EMPSC has two major features that make it a good choice among PSC algorithms. First, the ellipsoidal representation provides a good summary of the 3D information of residue segments. α-helices and β-sheets differ in shape: an α-helix is hard to bend, whereas a β-sheet is usually bent or curved. A single vector is therefore a proper representation for an α-helix, but for a β-sheet it drops some structure information, and the length of the identified β-sheet seriously affects the effectiveness of the derived vector. Moreover, EMPSC can abstract α-helices, β-sheets and loop or coil structures with the ellipsoidal model, and it abstracts curved β-sheets effectively because the three orthogonal eigenvectors of the ellipsoid keep more information about the spatial distribution of the residues.
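A toy comparison illustrates the argument. The two synthetic segments below (a straight trace and a 90° arc, both made up for illustration) have end-to-end vectors of equal length, so a single vector records only a length and a direction for each. The ellipsoid eigenvalues also register the bend. The function names are hypothetical.

```python
import numpy as np

def end_to_end_vector(points):
    """Single-vector summary: last point minus first point."""
    pts = np.asarray(points, dtype=float)
    return pts[-1] - pts[0]

def ellipsoid_eigenvalues(points):
    """Covariance eigenvalues of the point cloud, in descending order."""
    pts = np.asarray(points, dtype=float)
    centered = pts - pts.mean(axis=0)
    cov = centered.T @ centered / len(pts)
    return np.linalg.eigvalsh(cov)[::-1]

n = 8
# Straight segment of length 10 along the x axis
straight = np.column_stack([np.linspace(0.0, 10.0, n), np.zeros(n), np.zeros(n)])
# 90-degree arc whose chord (end-to-end distance) is also 10
radius = 10.0 / np.sqrt(2.0)
angles = np.linspace(0.0, np.pi / 2.0, n)
arc = np.column_stack([radius * np.cos(angles), radius * np.sin(angles), np.zeros(n)])

for name, seg in (("straight", straight), ("arc", arc)):
    length = np.linalg.norm(end_to_end_vector(seg))
    print(name, "vector length:", length, "eigenvalues:", ellipsoid_eigenvalues(seg))
```

The second eigenvalue is essentially zero for the straight trace and clearly positive for the arc, which is the extra shape information the ellipsoid keeps.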

Second, EMPSC provides a platform into which different filters can be plugged for different purposes. Through different combinations of filters, EMPSC can filter the candidate mapping segment pairs according to domain-specific requirements. In our current experimental results, the combination of type filter, mass filter and biochemical filter achieves good accuracy and efficiency in most cases. In addition, we found that the biochemical filter is especially effective for comparing similar proteins of the same family.

4.2. Efficiency and number of alternative solutions

Although we list the execution time of every comparison in Table 1 and Table 2, it is hard to see the performance relationship between CE, FLASH and EMPSC from the tables. Therefore, we sum the residue numbers of the two compared proteins and plot execution time against total residues in Figure 1. The trend lines for each method in both Figure 1 and Figure 2 are polynomial regressions of order 2 produced by the trend function of Microsoft Excel. Figure 1 shows that EMPSC is faster than CE, especially for large protein structure comparisons. However, EMPSC is slower than FLASH.
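The same kind of order-2 trend line can be reproduced with `numpy.polyfit` in place of Excel. The sketch below uses only the EMPSC timings and residue counts of the ten pairs in Table 2, so its coefficients need not match the plotted trend lines, which may be based on more data points.

```python
import numpy as np

# Total residues (protein 1 + protein 2) and EMPSC time in seconds, read
# from the ten pairs in Table 2, in table order.
total_residues = np.array([280, 202, 291, 844, 172, 284, 290, 771, 249, 213],
                          dtype=float)
empsc_seconds = np.array([0.44, 0.49, 1.19, 9.3, 0.47, 1.01, 1.15, 8.96, 0.88, 0.65])

def fit_trend(x, y, order=2):
    """Least-squares polynomial fit; coefficients come highest power first."""
    return np.polyfit(x, y, order)

coeffs = fit_trend(total_residues, empsc_seconds)
trend = np.polyval(coeffs, total_residues)

print("coefficients (a, b, c) of a*x^2 + b*x + c:", coeffs)
print("largest absolute residual (s):", np.max(np.abs(trend - empsc_seconds)))
```

A quadratic trend is a natural choice here because the refinement stage is stated to cost O(Cn²).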

To find out whether EMPSC can be sped up further, we ran additional experiments with different numbers of alternative solutions. We repeated the experiments of the previous section with Top-3 and Top-5 alternative solutions and compared them with the Top-10 results and with FLASH. Figure 2 plots the execution time for the different numbers of alternative solutions. In this diagram, the execution time of EMPSC is proportional to the number of alternative solutions. Profiling shows that EMPSC spends most of its execution time in the alignment refinement process. If FLASH provides alternative solutions, it spends about the same execution time as EMPSC; two data points of FLASH in Figure 2 show such cases. This conclusion applies to any PSC algorithm that claims to be fast but provides only one solution (like FAST), except hash-based alignment refinement algorithms. According to these observations, EMPSC is a good choice for solving protein structure comparison problems. In addition, we conclude that further enhancement of PSC algorithms should focus on the alignment refinement process.
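Since refinement dominates the running time, it helps to see what one least-squares superposition involves. The sketch below uses the SVD-based Kabsch algorithm, a standard way to solve this problem. The solver EMPSC actually uses is not stated, so this choice and the function names are assumptions.

```python
import numpy as np

def kabsch(P, Q):
    """Rotation R and translation t minimizing sum ||R @ p_i + t - q_i||^2.

    P, Q: (n, 3) arrays of matched points (row i of P corresponds to row i of Q).
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)               # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cq - R @ cp
    return R, t

def superposed_rmsd(P, Q):
    """RMSD between Q and P after the optimal rigid-body fit of P onto Q."""
    R, t = kabsch(P, Q)
    moved = np.asarray(P, dtype=float) @ R.T + t
    return float(np.sqrt(np.mean(np.sum((moved - Q) ** 2, axis=1))))

# Self-check: rotate and shift a small point set, then recover the transform
P = np.array([[0, 0, 0], [1, 0, 0], [0, 2, 0], [0, 0, 3], [1, 1, 1]], dtype=float)
c, s = np.cos(np.radians(30.0)), np.sin(np.radians(30.0))
Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
Q = P @ Rz.T + np.array([5.0, -2.0, 1.0])
print("RMSD after fit:", superposed_rmsd(P, Q))
```

In an iterative refinement, each round would recompute the residue correspondences (for example by dynamic programming) and call the fit again, which is where the O(Cn²) cost of Section 2.4 comes from.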

Table 1. An experimental set of structural neighbors of cAMP-dependent protein kinase (1atp:E) identified by different PSC methods.

A sample set of structural neighbors of cAMP-dependent protein kinase (1atp:E)(336)

| Protein (residues) | Dali RMSD/#res | CE RMSD/#res/sec | VAST RMSD/#res | ProSup RMSD/#res | FLASH RMSD/#res/sec | EMPSC RMSD/#res/sec |
|---|---|---|---|---|---|---|
| 2cpk:E(336) | 0.4 / 336 | 0.37/336/4.9 | 0.4 / 334 | 0.4 / 336 | 0.37/336/0.38 | 0.37/336/6.1 |
| 1apm:E(341) | 0.3 / 336 | 0.33/336/4.94 | 0.3 / 334 | 0.3 / 336 | 0.33/336/0.51 | 0.32/336/6.19 |
| 1cdk:A(343) | 0.4 / 336 | 0.38/336/4.95 | 0.4 / 334 | 0.4 / 336 | 0.38/336/0.39 | 0.38/336/6.36 |
| 1ydt:E(334) | 0.5 / 334 | 0.45/336/4.5 | 0.5 / 334 | 0.5 / 334 | 0.45/336/0.31 | 0.45/334/6.03 |
| 1bkx:A(337) | 0.8 / 334 | 0.76/336/4.61 | 0.7 / 314 | 0.8 / 336 | 0.76/336/0.09 | 0.75/334/6.16 |
| 1bx6:_(337) | 1.0 / 334 | 1.01/336/4.67 | 1.0 / 314 | 1.0 / 336 | 1.01/336/0.43 | 1.01/334/6.09 |
| 1stc:E(334) | 1.1 / 334 | 1.1/336/4.98 | 1.1 / 333 | 1.1 / 334 | 1.1/336/0.07 | 1.09/334/5.25 |
| 1cmk:E(350) | 2.0 / 335 | 2/336/5.82 | 2.0 / 331 | 1.5 / 316 | 1.72/330/0.77 | 1.71/330/6.72 |
| 1daw:A(327) | 3.1 / 267 | 2.77/266/10.99 | 2.8 / 259 | 2.0 / 239 | 1.87/250/0.53 | 1.92/252/5.91 |
| 1qmz:C(296) | 2.5 / 259 | 2.07/252/6.94 | 2.3 / 233 | 1.9 / 239 | 1.9/251/0.67 | 1.96/253/5.38 |
| 1day:A(327) | 2.7 / 263 | 2.61/262/9.96 | 2.9 / 262 | 2.0 / 239 | 1.96/252/0.49 | 1.98/253/5.99 |
| 1koa:_(447) | 2.8 / 261 | 2.7/258/10.81 | 2.4 / 225 | 2.1 / 233 | 2.16/249/0.42 | 2.14/249/8.26 |
| 1jnk:_(346) | 2.8 / 253 | 2.49/194/11.97 | 3.0 / 240 | 2.2 / 220 | 2.19/242/0.29 | 2.23/244/6.21 |
| 1gag:A(300) | 2.8 / 265 | 2.87/267/7.78 | 2.7 / 247 | 2.3 / 232 | 2.36/251/0.67 | 2.46/254/5.37 |
| 1bl7:A(351) | 3.5 / 254 | 3.14/246/8.79 | 3.1 / 223 | 2.3 / 220 | 2.39/235/0.54 | 2.4/236/6.33 |
| 1cja:B(327) | 4.7 / 165 | 4.19/165/10.9 | - | 2.7 / 115 | 2.85/143/0.48 | 3.01/149/6.36 |
| 1e7v:A(850) | 4.0 / 159 | 4.43/165/43.51 | - | 2.8 / 116 | 3/142/0.75 | 3.1/155/15.59 |
| 1bo1:B(318) | 3.9 / 138 | 3.9/145/12.25 | - | 3.0 / 103 | 2.98/136/0.16 | 2.9/135/5.71 |
| 1b40:A(517) | 3.4 / 45 | 5.68/83/21.4 | - | 2.9 / 57 | 3/107/0.67 | 3.36/105/9.44 |
| 1lar:B(533) | 2.6 / 34 | 5.77/123/23.11 | - | 3.0 / 66 | 3.07/88/0.79 | 3.21/86/10.14 |

Table 2. Comparison of different structure alignment results for 10 difficult cases.

10 difficult cases (Fischer 1996)

| Protein 1 (#res) | Protein 2 (#res) | Dali RMSD/#res | CE RMSD/#res/sec | VAST RMSD/#res | ProSup RMSD/#res | FLASH RMSD/#res/sec | EMPSC RMSD/#res/sec |
|---|---|---|---|---|---|---|---|
| 1bge:B(159) | 2gmf:A(121) | 3.3 / 94 | 4.02/102/2.59 | 2.3 / 71 | 2.4 / 87 | -/-/- (a) | 2.56/95/0.44 |
| 1cew:I(108) | 1mol:A(94) | 2.3 / 81 | 2.34/81/2.07 | 2.0 / 71 | 1.9 / 76 | 1.92/79/0.07 | 2.11/81/0.49 |
| 1cid:_(177) | 2rhe:_(114) | 3.1 / 96 | 2.97/98/2.4 | 2.0 / 78 | 2.3 / 84 | 2.24/94/0.24 | 2.23/94/1.19 |
| 1crl:_(534) | 1ede:_(310) | 3.6 / 212 | 3.91/220/16.29 | 3.7 / 186 | 2.6 / 161 | 2.49/191/0.79 | 2.7/199/9.3 |
| 1fxi:A(96) | 1ubq:_(76) | 2.5 / 52 | 2.79/64/1.79 | 2.1 / 48 | 2.6 / 54 | 2.47/62/0.03 | 2.56/63/0.47 |
| 1ten:_(89) | 3hhr:B(195) | 1.9 / 86 | 1.9/87/2.14 | 1.5 / 76 | 1.7 / 85 | 1.73/86/0.21 | 2.2/76/1.01 |
| 1tie:_(166) | 4fgf:_(124) | 3.1 / 114 | 2.86/115/2.23 | 1.6 / 76 | 2.4 / 104 | 2.28/108/0.29 | 2.44/113/1.15 |
| 2sim:_(381) | 1nsb:A(390) | 3.2 / 289 | 2.99/276/9.24 | 4.2 / 299 | 2.6 / 248 | 2.61/276/7.8 | 2.71/282/8.96 |
| 2aza:A(129) | 1paz:_(120) | 3.0 / 82 | 2.9/85/1.94 | 2.1 / 70 | 2.6 / 82 | 2.34/81/0.1 | 2.22/82/0.88 |
| 3hla:B(99) | 2rhe:_(114) | 3.0 / 74 | 3.46/85/2.49 | 2.3 / 58 | 2.7 / 71 | 2.94/79/0.09 | 2.75/77/0.65 |

a This result is available in the original FLASH paper, but we could not get any result while running the FLASH program provided by the authors.

[Figure 1: plot of execution time (seconds, 0 to 90) against total residues of compared proteins (0 to 1600) for CE, EMPSC and FLASH, with a trend line for each method.]

Figure 1. The execution time of CE, FLASH and EMPSC, given different total residues of compared proteins.

[Figure 2: plot of execution time (seconds, 0 to 35) against total residues of compared proteins (0 to 1600) for FLASH and for EMPSC Top-3, Top-5 and Top-10, with a trend line for each series.]

Figure 2. The execution time of FLASH and EMPSC Top-3, Top-5, Top-10 alternative solutions.

5. Application: mining conserved local structure

Protein function is highly correlated with localized regions of the structure; therefore, we apply a purely structure-based comparison method, EMPSC, to mine conserved local structures from a functional hierarchical classification [10]. For example, in EC 2.3.1.74 we find four conserved local structures with substrate contact, and 10 protein chains share this conserved local structure: 1I86:A, 1I88:A, 1I89:A, 1I8B:A, 1JWX:A, 1D6I:A, 1CGK:A, 1CML, 1CHW, and 1U0V. In addition, these four conserved local structures are similar to each other. Furthermore, as shown in Figure 3, the conserved local structure is close to the substructure CSD (3-SULFINOALANINE) in 1I86:A, 1I88:A, 1I89:A, 1I8B:A [11], 1JWX:A, and 1D6I:A, NAR (NARINGENIN) in 1CGK:A, PIN (PIPERAZINE-N, N'-BIS) in 1CML:A, and HXC (HEXANOYL-COENZYME A) in 1CHW:A. There is no substrate information in 1U0V:A. Substrate contact is defined as the residues within 6.5 Å of a substrate [19].

6. Web service and visualization tool

Based on EMPSC, we built a web service for protein structure alignment and analysis, Ballerina, which can be accessed at http://ballerina.csie.ntu.edu.tw/. This web service provides an interactive interface for pairwise protein structure comparison. The interactive interface is modified from JMOL 10.0 (http://jmol.sourceforge.net/) to facilitate viewing and analysis of aligned protein structures. We provide both online and offline visualization for further analysis. In the current version, results of other PSC methods can be loaded in a specified format. As shown in Figure 4, (a) is the interactive interface of the online service and (b) is the visualization tool for result demonstration.

[Figure 3 panels: (a) 1CGK:A, (b) 1CHW:A, (c) 1CML:A, (d) 1D6I:A, (e) 1I8B:A, (f) 1I86:A, (g) 1I88:A, (h) 1I89:A, (i) 1JWX:A, (j) 1U0V:A]

Figure 3. Conserved local structure (blue) and its substrate contact (in CPK mode) of EC 2.3.1.74.

Figure 4. (a) Online web service for protein structure comparison and analysis. (b) The visualization tool for result demonstration developed based on JMOL.

7. Conclusions

In this work, we combine different heuristic strategies to modify the original framework of protein structure comparison based on SSE alignment. Without losing secondary structure information, we use a 3D ellipsoidal model to represent α-helices and β-sheets as well as the remaining coil fragments. We apply a geometric hashing approach to quickly find possible optimal coordinate systems in the rough alignment stage. Within the combination of filter strategies, the biochemical property filter is very useful for comparing candidate pairs, as the experimental results show. In addition, multiple solutions are necessary for further analysis in protein structure comparison, and they are provided without extra computation time. Moreover, we apply EMPSC to mining conserved local structures in an EC family; the results show that conserved local structures can be mined and substrate contacts can be discovered in the EC family.

8. References

[1] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne, "The Protein Data Bank," Nucleic Acids Res, vol. 28, pp. 235-42, Jan 1 2000.

[2] C.-I. Brändén and J. Tooze, Introduction to protein structure, 2nd ed. New York: Garland Pub., 1999.

[3] N. P. Brown, C. A. Orengo, and W. R. Taylor, "A protein structure comparison methodology," Comput Chem, vol. 20, pp. 359-80, 1996.

[4] T. Can and Y. F. Wang, "CTSS: A Robust and Efficient Method for Protein Structure Alignment Based on Local Geometrical and Biological Features," Proc IEEE Comput Soc Bioinform Conf, vol. 2, pp. 169-79, 2003.

[5] P.-K. Chang, C.-C. Chen, and M. Ouhyoung, "A Tool for Structure Alignment of Molecules," IEEE Sixth International Symposium on Multimedia Software Engineering - Special Session on Bioinformatics, pp. 354-61, 2004.

[6] D. Fischer, A. Elofsson, D. Rice, and D. Eisenberg, "Assessing the performance of fold recognition methods by means of a comprehensive benchmark," Pac Symp Biocomput, pp. 300-18, 1996.

[7] M. Gerstein and M. Levitt, "Using iterative dynamic programming to obtain accurate pairwise and multiple alignments of protein structures," Proc Int Conf Intell Syst Mol Biol, vol. 4, pp. 59-67, 1996.

[8] J. F. Gibrat, T. Madej, and S. H. Bryant, "Surprising similarities in structure comparison," Curr Opin Struct Biol, vol. 6, pp. 377-85, Jun 1996.

[9] L. Holm and C. Sander, "Dali: a network tool for protein structure comparison," Trends Biochem Sci, vol. 20, pp. 478-80, Nov 1995.

[10] J. Y.-F. Huang, C.-J. Sheu, T.-W. Hsu, and C.-K. Huang, "Mining Conserved Local Structure from Functional Hierarchical Classification via Local Structure Comparison," Proceeding of the International Computer Symposium 2006, vol. 3, pp. 1361-1367, 2006.

[11] J. M. Jez, M. E. Bowman, and J. P. Noel, "Structure-guided programming of polyketide chain-length determination in chalcone synthase," Biochemistry, vol. 40, pp. 14829-38, Dec 11 2001.

[12] W. Kabsch and C. Sander, "Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features," Biopolymers, vol. 22, pp. 2577-637, Dec 1983.

[13] P. Koehl, "Protein structure similarities," Curr Opin Struct Biol, vol. 11, pp. 348-53, Jun 2001.

[14] Kolbeck et al., "Connectivity independent protein-structure alignment: a hierarchical approach," BMC Bioinformatics, 7:510, 2006.

[15] P. Lackner, W. A. Koppensteiner, M. J. Sippl, and F. S. Domingues, "ProSup: a refined tool for protein structure alignment," Protein Eng, vol. 13, pp. 745-52, Nov 2000.

[16] A. M. Lesk, Protein architecture : a practical approach. Oxford England ; New York: IRL Press, 1991.

[17] T. Madej, J. F. Gibrat, and S. H. Bryant, "Threading a database of protein cores," Proteins, vol. 23, pp. 356-69, Nov 1995.

[18] C. Orengo, "Classification of protein folds," Curr Opin Struct Biol, vol. 4, pp. 429-40, June 1994.

[19] B. P. Pandey, C. Zhang, X. Yuan, J. Zi, and Y. Zhou, "Protein flexibility prediction by an all-atom mean-field statistical theory," Protein Sci, vol. 14, pp. 1772-7, Jul 2005.

[20] Y. Shiau, J.-N. Wang, Y.-F. Huang, and C.-K. Huang, "EMPSC: A New Method Based on Ellipsoidal Model for Protein Structure Comparison," Technical report, Department of Engineering Science and Ocean Engineering, National Taiwan University, 2006. URL http://www.csie.ntu.edu.tw/~yfhuang/papers/EMPSC.pdf

[21] E. S. Shih and M. J. Hwang, "Protein structure comparison by probability-based matching of secondary structure elements," Bioinformatics, vol. 19, pp. 735-41, Apr 12 2003.

[22] I. N. Shindyalov and P. E. Bourne, "Protein structure alignment by incremental combinatorial extension (CE) of the optimal path," Protein Eng, vol. 11, pp. 739-47, Sep 1998.

[23] S. Subbiah, D. V. Laurents, and M. Levitt, "Structural similarity of DNA-binding domains of bacteriophage repressors and the globin core," Curr Biol, vol. 3, pp. 141-8, Mar 1993.

[24] W. R. Taylor, "Protein structure comparison using iterated double dynamic programming," Protein Sci, vol. 8, pp. 654-65, Mar 1999.

[25] G. Vriend and C. Sander, "Detection of common three-dimensional substructures in proteins," Proteins, vol. 11, pp. 52-8, 1991.

[26] H. J. Wolfson and I. Rigoutsos, "Geometric hashing: an overview," Computational Science and Engineering, IEEE [see also Computing in Science & Engineering], vol. 4, pp. 10-21, 1997.

[27] X. Yuan and C. Bystroff, "Non-sequential structure-based alignments reveal topology-independent core packing arrangements in proteins," Bioinformatics, vol. 21, pp. 1010-9, 2005.

[28] Z. Zhang, "Iterative point matching for registration of free-form curves and surfaces," Int. J. Comput. Vision, vol. 13, pp. 119-52, 1994.

[29] J. Zhu and Z. Weng, "FAST: a novel protein structure alignment algorithm," Proteins, vol. 58, pp. 618-27, Feb 15 2005.

[30] M. Zuker and R. L. Somorjai, "The alignment of protein structures in three dimensions," Bull Math Biol, vol. 51, pp. 55-78, 1989.
