EC-PSI: Associating Enzyme Commission Numbers with Pfam Domains

Seyed Ziaeddin ALBORZI
PhD Student at Université de Lorraine & Inria Nancy Grand-Est

July 1, 2015
JOBIM 2015 at Clermont-Ferrand


Motivation

• Need for annotating protein structures at the domain level.
  – Relating protein structure (at domain level) to protein function.
  – Using standard function descriptions, e.g. Enzyme Commission (EC) numbers.
• Available resources
  – SIFTS, UniProt: relate 3D protein structures and sequences to Pfam domains and to EC numbers.
  – InterPro: relates Pfam domains to EC numbers.
• => Inferring new EC-Pfam associations
  – Statistical training-based approach: EC-PSI = EC-Pfam Statistical Inference

Presentation Outline

• Introduction
• EC-PSI method
• Results and statistics
• Conclusion and perspectives

State of the art: Domain Annotation with EC Numbers

• Enzyme Commission classification
  – 6 main branches (1-digit)
    • e.g. 5: Isomerases
  – Hierarchical organization
    • 2-digit: EC 5.3
    • 3-digit: EC 5.3.1
    • 4-digit: EC 5.3.1.16
• Protein domain annotation is incomplete
  – InterPro: less than 10% of entries have EC numbers
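The EC hierarchy above can be illustrated with a small helper that cuts a 4-digit EC number down to a coarser level. This is only a sketch: the function name `truncate_ec` is hypothetical, and the dash padding follows the notation used elsewhere in these slides (e.g. EC 2.4.2.-).

```python
def truncate_ec(ec, digits):
    """Keep the first `digits` levels of a 4-level EC number; pad the rest with '-'."""
    parts = ec.split(".")
    if len(parts) != 4 or not 1 <= digits <= 4:
        raise ValueError("expected a 4-level EC number and 1 <= digits <= 4")
    return ".".join(parts[:digits] + ["-"] * (4 - digits))


# The isomerase example from the slide, viewed at each level of the hierarchy.
levels = [truncate_ec("5.3.1.16", d) for d in (1, 2, 3, 4)]
```

Grouping associations by such truncated numbers is one way the 3-digit level could be explored.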

The Case of PDB: Protein Data Bank

Resource                        Number of entries
PDB (protein 3D structures)     >108,000
Pfam domains (in PDB)           16,230 (2,606)
EC numbers (in PDB)             6,277 (2,575)

=> Do it in an automatic, computational fashion!

Zoom Into One Example

• 1JVN: Crystal structure of imidazole glycerol phosphate synthase (EC 2.4.2.-)
  – PF00117: Glutamine amidotransferase. EC: ?
  – PF00977: Histidine biosynthesis protein. EC: ?


The Data Sources

• SIFTS: Structure Integration with Function, Taxonomy and Sequence (SIFTS) is an up-to-date resource for mapping between PDB entries and IntEnz, GO, Pfam, …
• SwissProt: Swiss-Prot is the manually annotated and reviewed section of the UniProt Knowledgebase, a non-redundant protein sequence database.
• TrEMBL: TrEMBL is the automatically annotated and unreviewed section of the UniProt Knowledgebase.
• Enzyme: IntEnz is a resource for enzyme nomenclature and classification of enzyme-catalysed reactions.
• PDB: The PDB archive contains information about experimentally determined structures of proteins, nucleic acids, and complex assemblies.
• Pfam: Pfam is a database of protein domains and families that includes their annotations.
• InterPro: InterPro is a database of protein families, domains and functional sites. It provides functional analysis of proteins by classifying them into families and predicting domains and important sites.

SIFTS Associations

1) From SIFTS data, extract associations between PDB chains and
   I. 4-digit EC numbers
   II. Pfam domains

Graph-like Relations

2) Draw a graph-like set of relations for all EC numbers and Pfam domains using all associated PDB chains.
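Steps 1 and 2 can be sketched as joining two chain-level mappings into EC-Pfam edges, each edge remembering the PDB chains that support it. The function name, the dictionary layout and the toy identifiers below are assumptions made for illustration; they are not the SIFTS file format or real annotations.

```python
from collections import defaultdict


def build_relations(chain_to_ec, chain_to_pfam):
    """Return {(ec, pfam): set of chains} for every EC-Pfam pair sharing a chain."""
    relations = defaultdict(set)
    for chain, ecs in chain_to_ec.items():
        for ec in ecs:
            for pfam in chain_to_pfam.get(chain, ()):
                relations[(ec, pfam)].add(chain)
    return dict(relations)


# Toy, made-up identifiers (not real SIFTS records).
chain_to_ec = {"c1": {"EC_1"}, "c2": {"EC_1"}, "c3": {"EC_2"}}
chain_to_pfam = {"c1": {"PF_A"}, "c2": {"PF_A", "PF_B"}, "c3": {"PF_B"}}
relations = build_relations(chain_to_ec, chain_to_pfam)
```

Each key of `relations` is one edge of the graph-like structure; the size of its chain set is the raw evidence that later steps turn into a score.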

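Frequency Score

3) Calculate the EC-Pfam frequency score from SIFTS (PPFEC): for a given EC number, the fraction of its associated PDB chains that contain a given Pfam domain.

$$\mathrm{PPFEC}_{m,n} = \frac{\left|\{P_i^m \;:\; P_n \in P_i^m,\ i = 1, \dots, C_m\}\right|}{C_m}$$

where $P_i^m$ is the set of Pfam domains of the $i$-th PDB chain associated with $EC_m$, $P_n$ is Pfam domain $n$, and $C_m$ is the number of PDB chains associated with $EC_m$.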
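Step 3 can be read as: among the PDB chains annotated with a given EC number, what fraction also carries a given Pfam domain. The sketch below implements that reading; the function name, input layout and toy identifiers are assumptions, not the authors' code.

```python
def frequency_scores(chain_to_ec, chain_to_pfam):
    """For each (EC, Pfam) pair: chains carrying both / chains carrying that EC."""
    chains_per_ec = {}
    for chain, ecs in chain_to_ec.items():
        for ec in ecs:
            chains_per_ec.setdefault(ec, set()).add(chain)

    counts = {}
    for ec, chains in chains_per_ec.items():
        for chain in chains:
            # Pfam annotations are sets, so a domain counts once per chain.
            for pfam in chain_to_pfam.get(chain, ()):
                counts[(ec, pfam)] = counts.get((ec, pfam), 0) + 1

    return {pair: n / len(chains_per_ec[pair[0]]) for pair, n in counts.items()}


# Toy, made-up identifiers (not real SIFTS records).
chain_to_ec = {"c1": {"EC_1"}, "c2": {"EC_1"}, "c3": {"EC_2"}}
chain_to_pfam = {"c1": {"PF_A"}, "c2": {"PF_A", "PF_B"}, "c3": {"PF_B"}}
ppfec = frequency_scores(chain_to_ec, chain_to_pfam)
```

In the toy data both chains of `EC_1` contain `PF_A`, while only one of them contains `PF_B`, so the two pairs receive different frequencies.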

Confidence Score

4) Calculate the corresponding frequencies from SwissProt (PSFEC) and TrEMBL (PTFEC).

5) Aggregate the three frequency scores into one confidence score for each (EC_m, Pfam_n) association:

$$\mathrm{ConfidenceScore}_{m,n} = \frac{a \times \mathrm{PPFEC}_{m,n} + b \times \mathrm{PSFEC}_{m,n} + c \times \mathrm{PTFEC}_{m,n}}{a + b + c}$$
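The aggregation in step 5 is a weighted average of the three frequencies. A minimal sketch follows; the function name is hypothetical, and treating a pair that is absent from a source as frequency 0 is an assumption rather than something the slides state.

```python
def confidence_score(ppfec, psfec, ptfec, a=1.0, b=1.0, c=1.0):
    """Weighted average of the SIFTS, SwissProt and TrEMBL frequency scores."""
    return (a * ppfec + b * psfec + c * ptfec) / (a + b + c)


# Example: a pair seen in half of the SIFTS chains, in all SwissProt entries,
# and not at all in TrEMBL (assumed to contribute 0), with weights 1, 10, 1.
score = confidence_score(0.5, 1.0, 0.0, a=1, b=10, c=1)
```

Because the weights are normalised by their sum, the result stays in the same 0 to 1 range as the input frequencies.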

Training (1): Data Source Weights

6) Find the best values for the weighting factors a, b, c using the InterPro "Gold-Standard" as training set.
   I. a, b and c varied from 1 to 10 in steps of 1.
   II. For each step, "true" and "false" associations retrieved and scored.
   III. ROC plot drawn.
   IV. Highest AUC value chosen to select the best three values.

Results: weight values
– SIFTS (PPFEC): a = 1
– SwissProt (PSFEC): b = 10
– TrEMBL (PTFEC): c = 1
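The weight search in step 6 can be sketched as an exhaustive grid over a, b, c, scoring each combination by ROC AUC. The AUC here is computed as the probability that a true association outscores a false one, which equals the area under the ROC curve. Function names and the toy frequency triples are assumptions for illustration only.

```python
from itertools import product


def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties = 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))


def best_weights(positives, negatives, values=range(1, 11)):
    """positives / negatives: lists of (ppfec, psfec, ptfec) triples."""
    def score(t, a, b, c):
        return (a * t[0] + b * t[1] + c * t[2]) / (a + b + c)

    best = None
    for a, b, c in product(values, repeat=3):
        value = auc([score(t, a, b, c) for t in positives],
                    [score(t, a, b, c) for t in negatives])
        if best is None or value > best[0]:  # keep the first maximum found
            best = (value, (a, b, c))
    return best


# Toy, made-up triples: positives are strong in SwissProt, negatives are not.
positives = [(0.2, 0.9, 0.1), (0.1, 0.8, 0.3)]
negatives = [(0.6, 0.1, 0.2), (0.3, 0.2, 0.9)]
best_auc, (a, b, c) = best_weights(positives, negatives)
```

With real data many weight triples can tie on AUC, so the tie-breaking rule (here: first one encountered) is a design choice that would need to be stated.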

Training (2): Score Threshold

7) Filter all possible associations according to a threshold on the confidence score.
   a. Divide the InterPro Gold-Standard into two sets (training and test).
   b. Vary the confidence score threshold from 1 to 100% in steps of 1.
   c. For each step, calculate the F-measure on the training set:

$$F\text{-measure} = \frac{2 \times TP}{2 \times TP + FP + FN}$$

   d. Highest F-measure (82%) found for threshold = 8%.
   e. Similar F-measure found on the test set (81%).
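The threshold sweep in step 7 can be sketched as follows. The function names, the input layout and the choice of `>=` for "passes the threshold" are assumptions; the toy scores are made up and do not reproduce the 8% result.

```python
def f_measure(tp, fp, fn):
    """F-measure = 2TP / (2TP + FP + FN); 0.0 when nothing is predicted or expected."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(scored, gold):
    """scored: {pair: confidence in [0, 1]}; gold: {pair: True or False}."""
    best = (-1.0, None)
    for step in range(1, 101):  # 1% to 100% in steps of 1
        t = step / 100
        tp = fp = fn = 0
        for pair, is_true in gold.items():
            predicted = scored.get(pair, 0.0) >= t
            if predicted and is_true:
                tp += 1
            elif predicted and not is_true:
                fp += 1
            elif is_true:
                fn += 1
        f = f_measure(tp, fp, fn)
        if f > best[0]:  # keep the lowest threshold reaching the maximum
            best = (f, t)
    return best


# Toy, made-up training set.
scored = {"p1": 0.9, "p2": 0.4, "n1": 0.3, "n2": 0.05}
gold = {"p1": True, "p2": True, "n1": False, "n2": False}
best_f, threshold = best_threshold(scored, gold)
```

The same `f_measure` call on the held-out half of the gold standard, using the selected threshold, gives the test-set figure.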

Summary: EC-PSI Flow-Chart

[Figure: EC-PSI flow-chart]


Statistics on EC-Pfam Associations

Dataset                          EC-Pfam assoc.   4-digit EC no.   Pfam domains
SIFTS                            6204             2575             2606
SwissProt                        9879             3959             3147
TrEMBL                           28572            3538             5839
Merged                           32018            4588             6290

Gold-Standard InterPro           1493             676              1273

EC-PSI (Calculated)              8329             4436             2462
Common to EC-PSI and InterPro    1089             593              944
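Two figures follow directly from the association counts above: the share of Gold-Standard InterPro associations that EC-PSI also finds, and the number of EC-PSI associations not in the Gold-Standard. The short calculation below only rearranges the table values; the variable names are illustrative.

```python
gold_standard = 1493   # Gold-Standard InterPro associations
ec_psi = 8329          # EC-PSI (calculated) associations
common = 1089          # associations common to EC-PSI and InterPro

recovered_fraction = common / gold_standard   # share of the gold standard recovered
not_in_gold_standard = ec_psi - common        # EC-PSI associations absent from it
```

About 73% of the Gold-Standard associations are recovered, and 7,240 EC-PSI associations fall outside it.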

Distribution of Associations per EC Number and Pfam Domain

A. Average number of EC-Pfam associations
   1. per EC
   2. per Pfam
B. Percentage of Pfam domains according to their number of associations with EC numbers.

=> Increase in multiple associations for Pfam domains.

Returning to Our Example

• PDB entry 1JVN in SIFTS
  – 2 Pfam domains:
    • N-terminal: Glutamine amidotransferase class-1 (PF00117).
    • C-terminal: Histidine biosynthesis protein (PF00977).
  – 1 global EC annotation
    • EC 2.4.2.-: Pentosyl transferase.
• EC-PSI retrieved the following annotations for each domain:
  – 8 EC numbers for PF00117
    • with a majority of EC 6.3.-.-: Ligases forming carbon-nitrogen bonds.
  – 1 EC number for PF00977
    • EC 5.3.1.16: a specific isomerase, part of the histidine biosynthesis pathway.

=> These new annotations (not present in InterPro) enrich the annotation of this PDB structure.

Fully Annotated PDB Entry
• PF00117: EC 6.3.-.-
• PF00977: EC 5.3.1.16


EC-PSI Conclusion

• Summary
  – Problem
    • Annotating PDB protein chains at the domain level with standardized functional annotations.
  – Method
    • Collect existing associations between EC numbers and protein chains or sequences.
    • Design EC-PSI
      – A statistical training-based scoring method to analyse the many-to-many relations embedded in input data.
  – Result
    • Inferred a total of 8,329 EC-Pfam associations.
    • Found over six times as many associations as InterPro.
    • Completely automatic fashion.

EC-PSI Perspectives

• Perspective
  – Contribute considerably to enriching the annotations of PDB protein chains.
  – Facilitate a better understanding and exploitation of structure-function relationships at the protein domain level.
• Future Work
  – Explore 3-digit EC numbers
    • Better recall of InterPro positive associations.
  – Explore multiple EC-number assignments for Pfam domains.
  – Get our results reviewed by a domain expert.

Acknowledgement

Supervised by: Dave RITCHIE & Marie-Dominique DEVIGNES


Extra Slide for Q&A

Result for First-Digit EC

• Increase in EC-Pfam associations depending on the top-level EC branch (first digit)

1: Oxidoreductases; 2: Transferases; 3: Hydrolases; 4: Lyases; 5: Isomerases; 6: Ligases; All: all EC numbers.

InterPro Gold-Standard

[Figure: Venn diagram of SIFTS, SwissProt and TrEMBL]

• Intersection of 3 input data sources
• Positive data: from InterPro (star in the middle)
• Negative data: associations from at least 2 sources with the lowest confidence scores (same size as InterPro)
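The negative-set rule can be sketched as: keep associations supported by at least two sources, sort by confidence, and take as many of the lowest-scoring ones as there are positives. The function name and toy data are hypothetical, and excluding known positives from the candidates is an assumption that the slide does not state.

```python
def select_negatives(confidence, sources, positives):
    """Lowest-confidence associations seen in >= 2 sources, one per positive."""
    candidates = [
        pair for pair in confidence
        if pair not in positives and len(sources.get(pair, ())) >= 2
    ]
    # Sort by score, then by pair, so ties are broken deterministically.
    candidates.sort(key=lambda pair: (confidence[pair], pair))
    return candidates[:len(positives)]


# Toy, made-up associations and scores.
confidence = {("E1", "A"): 0.90, ("E1", "B"): 0.02, ("E2", "A"): 0.05,
              ("E2", "B"): 0.01, ("E3", "C"): 0.03}
sources = {("E1", "A"): {"SIFTS", "SwissProt"},
           ("E1", "B"): {"SIFTS", "TrEMBL"},
           ("E2", "A"): {"SwissProt", "TrEMBL"},
           ("E2", "B"): {"TrEMBL"},
           ("E3", "C"): {"SIFTS", "SwissProt", "TrEMBL"}}
positives = {("E1", "A"), ("E9", "Z")}
negatives = select_negatives(confidence, sources, positives)
```

In the toy data the lowest-scoring pair `("E2", "B")` is skipped because only one source supports it.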