EC-PSI: Associating Enzyme Commission Numbers with Pfam Domains
Seyed Ziaeddin ALBORZI
PhD Student at Université de Lorraine & Inria Nancy Grand-Est
July 1, 2015, JOBIM 2015 at Clermont-Ferrand
Motivation
• Need for annotating protein structures at the domain level
  – Relating protein structure (at domain level) to protein function
  – Using a standard function description, e.g. Enzyme Commission (EC) numbers
• Available resources
  – SIFTS, UniProt: relate 3D protein structures and sequences to Pfam domains and to EC numbers
  – InterPro: relates Pfam domains to EC numbers
• => Inferring new EC-Pfam associations
  – Statistical training-based approach: EC-PSI = EC-Pfam Statistical Inference
Presentation Outline
• Introduction
• EC-PSI method
• Results and statistics
• Conclusion and perspectives
State of the art: Domain Annotation with EC Numbers
• Enzyme Commission classification
  – 6 main branches (1-digit), e.g. 5: Isomerases
  – Hierarchical organization
    • 2-digit: EC 5.3
    • 3-digit: EC 5.3.1
    • 4-digit: EC 5.3.1.16
• Protein domain annotation is incomplete
  – InterPro: fewer than 10% of entries have EC numbers
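The hierarchy above can be illustrated with a short sketch that truncates a 4-digit EC number to a coarser level, writing unspecified levels as dashes (the notation this talk uses for EC 2.4.2.- and EC 6.3.-.-). The helper name `ec_prefix` is hypothetical and not part of EC-PSI.

```python
def ec_prefix(ec_number: str, digits: int) -> str:
    """Return the EC number truncated to its first `digits` levels (1 to 4)."""
    parts = ec_number.split(".")
    if len(parts) != 4 or not 1 <= digits <= 4:
        raise ValueError(f"expected a 4-field EC number and 1..4 digits, got {ec_number!r}")
    # Levels below the requested depth are replaced by dashes.
    return ".".join(parts[:digits] + ["-"] * (4 - digits))


# The four levels of the isomerase example from the slide.
levels = [ec_prefix("5.3.1.16", d) for d in range(1, 5)]
print(levels)
```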
The Case of PDB: Protein Data Bank
Resource                        Number of entries
PDB (protein 3D structures)     >108,000
Pfam domains (in PDB)           16,230 (2,606)
EC numbers (in PDB)             6,277 (2,575)
=> Do it in an automatic, computational fashion!
Zoom Into One Example
• 1JVN: Crystal structure of imidazole glycerol phosphate synthase (EC 2.4.2.-)
  – PF00117: Glutamine amidotransferase, EC: ?
  – PF00977: Histidine biosynthesis protein, EC: ?
The Data Sources
• SIFTS: Structure Integration with Function, Taxonomy and Sequence, an up-to-date resource for mapping between PDB entries and IntEnz, GO, Pfam, …
• Swiss-Prot: the manually annotated and reviewed section of the UniProt Knowledgebase, a non-redundant protein sequence database.
• TrEMBL: the automatically annotated and unreviewed section of the UniProt Knowledgebase.
• IntEnz (Enzyme): a resource for enzyme nomenclature and classification of enzyme-catalysed reactions.
• PDB: an archive of information about experimentally determined structures of proteins, nucleic acids, and complex assemblies.
• Pfam: a database of protein domains and families that includes their annotations.
• InterPro: a database of protein families, domains and functional sites. It provides functional analysis of proteins by classifying them into families and predicting domains and important sites.
SIFTS Associations
1) From SIFTS data, extract associations between PDB chains and
   I. 4-digit EC numbers
   II. Pfam domains
Graph-like Relations
2) Draw a graph-like set of relations for all EC numbers and Pfam domains using all associated PDB chains.
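Steps 1 and 2 can be sketched as follows. This is a minimal illustration, assuming the SIFTS mappings have already been parsed into (chain, EC) and (chain, Pfam) pairs; the identifiers and the function `build_relations` are placeholders, not the actual EC-PSI code or real SIFTS records.

```python
from collections import defaultdict

# Hypothetical parsed rows; chain, EC and Pfam identifiers are placeholders.
chain_to_ec = [("chain1", "EC_A"), ("chain2", "EC_A"), ("chain3", "EC_B")]
chain_to_pfam = [("chain1", "PF_X"), ("chain1", "PF_Y"),
                 ("chain2", "PF_X"), ("chain3", "PF_Y")]


def build_relations(chain_to_ec, chain_to_pfam):
    """Return {(ec, pfam): set of PDB chains that link the two nodes}."""
    pfams_of_chain = defaultdict(set)
    for chain, pfam in chain_to_pfam:
        pfams_of_chain[chain].add(pfam)

    relations = defaultdict(set)
    for chain, ec in chain_to_ec:
        # Every Pfam domain found on a chain is linked to every EC number of that chain.
        for pfam in pfams_of_chain.get(chain, ()):
            relations[(ec, pfam)].add(chain)
    return dict(relations)


relations = build_relations(chain_to_ec, chain_to_pfam)
for (ec, pfam), chains in sorted(relations.items()):
    print(ec, pfam, sorted(chains))
```

Keeping the supporting chains on each edge, rather than a bare count, makes it possible to count chains per EC number afterwards.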
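Frequency Score
3) Calculate the EC-Pfam frequency score: for EC number m and Pfam domain n, the fraction of the PDB chains linked to EC_m that also contain Pfam_n.
$$\mathrm{PPFEC}_{m,n} = \frac{\left|\{\, P_i^m \;;\; f_n \in P_i^m,\ i = 1, \dots, E_m \,\}\right|}{E_m}$$
Here P_i^m is the i-th PDB chain linked to EC_m, f_n is Pfam domain n, and E_m is the number of PDB chains linked to EC_m.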
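A minimal sketch of the frequency score, assuming it is the fraction of chains annotated with a given EC number that also carry a given Pfam domain. The input dictionaries, the identifiers and the function name `frequency_scores` are hypothetical.

```python
from collections import Counter


def frequency_scores(ecs_of_chain, pfams_of_chain):
    """Map (ec, pfam) to the fraction of chains linked to `ec` that contain `pfam`."""
    chains_per_ec = Counter()
    chains_per_pair = Counter()
    for chain, ecs in ecs_of_chain.items():
        for ec in ecs:
            chains_per_ec[ec] += 1
            # Domains are stored as sets, so each chain counts once per (ec, pfam) pair.
            for pfam in pfams_of_chain.get(chain, ()):
                chains_per_pair[(ec, pfam)] += 1
    return {pair: n / chains_per_ec[pair[0]] for pair, n in chains_per_pair.items()}


# Placeholder data: three chains carry EC_A, one chain carries EC_B.
ecs_of_chain = {"c1": {"EC_A"}, "c2": {"EC_A"}, "c3": {"EC_A"}, "c4": {"EC_B"}}
pfams_of_chain = {"c1": {"PF_X", "PF_Y"}, "c2": {"PF_X"}, "c3": {"PF_X"}, "c4": {"PF_Y"}}

scores = frequency_scores(ecs_of_chain, pfams_of_chain)
print(scores)
```

In this toy input PF_X appears on all three EC_A chains while PF_Y appears on only one of them, so PF_X receives the higher score for EC_A.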
Confidence Score
4) Calculate the corresponding frequencies from SwissProt (PSFEC) and TrEMBL (PTFEC).
5) Aggregate the three frequency scores into one confidence score for each (EC_m, Pfam_n) association:
$$\mathrm{ConfidenceScore}_{m,n} = \frac{a \times \mathrm{PPFEC}_{m,n} + b \times \mathrm{PSFEC}_{m,n} + c \times \mathrm{PTFEC}_{m,n}}{a + b + c}$$
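The aggregation in step 5 is a weighted average, sketched below. The default weights are the values reported on the training slide (a = 1, b = 10, c = 1); treating a pair that is absent from a source as having frequency 0 is an assumption of this sketch, not something the slides state.

```python
def confidence_score(ppfec, psfec, ptfec, a=1.0, b=10.0, c=1.0):
    """Weighted average of the SIFTS, SwissProt and TrEMBL frequency scores (each in 0..1)."""
    return (a * ppfec + b * psfec + c * ptfec) / (a + b + c)


# A pair fully supported by all three sources keeps the maximum score.
full_support = confidence_score(1.0, 1.0, 1.0)

# Assumption: a pair seen only in SIFTS has frequency 0 in the other two sources.
sifts_only = confidence_score(1.0, 0.0, 0.0)

print(f"{full_support:.3f} {sifts_only:.3f}")
```

Under these weights a pair with full SIFTS support and no other evidence scores 1/12, about 8.3%, which gives a sense of the scale of the percentage thresholds used later.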
Training (1): Data Source Weights
6) Find the best values for the weighting factors a, b, c using the InterPro "Gold-Standard" as training set.
   I. a, b and c are varied from 1 to 10 in steps of 1.
   II. For each step, "true" and "false" associations are retrieved and scored.
   III. A ROC plot is drawn.
   IV. The combination of a, b and c with the highest AUC value is selected.
Results: weight values
   – SIFTS (PPFEC): a = 1
   – SwissProt (PSFEC): b = 10
   – TrEMBL (PTFEC): c = 1
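The weight search in step 6 can be sketched as an exhaustive grid search. This is an illustration under assumptions: the AUC is computed with the rank-based (Mann-Whitney) formulation rather than from a drawn ROC plot, ties between weight combinations are broken by keeping the first one found, and the training examples are invented toy values.

```python
from itertools import product


def auc(scores, labels):
    """Area under the ROC curve as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # A tie between a positive and a negative counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_weights(examples, labels, grid=range(1, 11)):
    """examples: (ppfec, psfec, ptfec) triples. Returns (best AUC, (a, b, c))."""
    best_auc, best_abc = -1.0, None
    for a, b, c in product(grid, repeat=3):
        scores = [(a * p + b * s + c * t) / (a + b + c) for p, s, t in examples]
        value = auc(scores, labels)
        if value > best_auc:  # strict comparison keeps the first best combination
            best_auc, best_abc = value, (a, b, c)
    return best_auc, best_abc


# Toy training set in which only the SwissProt frequency separates true from false pairs.
examples = [(0.9, 0.1, 0.9), (0.1, 0.9, 0.1), (0.8, 0.2, 0.7), (0.2, 0.8, 0.3)]
labels = [False, True, False, True]

top_auc, weights = best_weights(examples, labels)
print(top_auc, weights)
```

On this toy input the search has to weight the SwissProt term more heavily than the SIFTS term to reach a perfect ranking.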
Training (2): Score Threshold
7) Filter all possible associations according to a threshold on the confidence score.
   a. Divide the InterPro Gold-Standard into two sets (training and test).
   b. Vary the confidence score threshold from 1 to 100% in steps of 1.
   c. For each step, calculate the F-measure on the training set:
$$\text{F-measure} = \frac{2 \times TP}{2 \times TP + FP + FN}$$
   d. Highest F-measure (82%) found for threshold = 8%.
   e. Similar F-measure found on the test set (81%).
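The threshold sweep in step 7 can be sketched as below. Accepting an association when its score is greater than or equal to the threshold, and keeping the lowest threshold among equally good ones, are assumptions of this sketch; the scores and labels are invented toy values in percent.

```python
def f_measure(tp, fp, fn):
    """F-measure = 2*TP / (2*TP + FP + FN); defined as 0 when nothing is predicted or expected."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(scores, labels, thresholds=range(1, 101)):
    """scores are confidence scores in percent. Returns (best F-measure, threshold)."""
    best_f, best_t = -1.0, None
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        f = f_measure(tp, fp, fn)
        if f > best_f:  # strict comparison keeps the lowest best threshold
            best_f, best_t = f, t
    return best_f, best_t


# Toy training set: true associations score 9, 30 and 60; false ones score 2, 4 and 5.
train_scores = [2, 5, 9, 30, 60, 4]
train_labels = [False, False, True, True, True, False]

result = best_threshold(train_scores, train_labels)
print(result)  # (1.0, 6)
```

The selected threshold would then be applied unchanged to the held-out test set to check that the F-measure is similar, as in items d and e.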
Statistics on EC-Pfam Associations
Dataset                          EC-Pfam assoc.   4-digit EC no.   Pfam domains
SIFTS                            6204             2575             2606
SwissProt                        9879             3959             3147
TrEMBL                           28572            3538             5839
Merged                           32018            4588             6290
Gold-Standard InterPro           1493             676              1273
EC-PSI (Calculated)              8329             4436             2462
Common to EC-PSI and InterPro    1089             593              944
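Two derived figures follow directly from the association counts on this slide. The short calculation below assumes that the InterPro gold standard is the reference set of positive associations, so that the associations common to both sets measure recall.

```python
# Association counts taken from the statistics slide.
ec_psi = 8329    # associations calculated by EC-PSI
interpro = 1493  # associations in the InterPro gold standard
common = 1089    # associations found in both sets

recall = common / interpro          # share of gold-standard associations recovered
new_associations = ec_psi - common  # EC-PSI associations absent from the gold standard

print(f"recall {recall:.1%}, {new_associations} new associations")
```

EC-PSI therefore recovers roughly 73% of the gold-standard associations and proposes 7,240 that the gold standard does not contain.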
Distribution of Associations per EC Number and Pfam Domain
[Figure A: Average number of EC-Pfam associations (1) per EC number and (2) per Pfam domain.]
[Figure B: Percentage of Pfam domains according to their number of associations with EC numbers.]
=> Increase in multiple associations for Pfam domains.
Returning to Our Example
• PDB entry 1JVN in SIFTS
  – 2 Pfam domains:
    • N-terminal: Glutamine amidotransferase class-1 (PF00117)
    • C-terminal: Histidine biosynthesis protein (PF00977)
  – 1 global EC annotation:
    • EC 2.4.2.-: Pentosyl transferase
• EC-PSI retrieved the following annotations for each domain:
  – 8 EC numbers for PF00117
    • with a majority of EC 6.3.-.-: ligases forming carbon-nitrogen bonds
  – 1 EC number for PF00977
    • EC 5.3.1.16: a specific isomerase, part of the histidine biosynthesis pathway
=> These new annotations (not present in InterPro) enrich the annotation of this PDB structure.
Fully annotated PDB entry: PF00117 with EC 6.3.-.-; PF00977 with EC 5.3.1.16.
EC-PSI Conclusion
• Summary
  – Problem
    • Annotating PDB protein chains at the domain level with standardized functional annotations.
  – Method
    • Collect existing associations between EC numbers and protein chains or sequences.
    • Design EC-PSI: a statistical training-based scoring method to analyse the many-to-many relations embedded in the input data.
  – Result
    • Inferred a total of 8,329 EC-Pfam associations.
    • Found over six times as many associations as InterPro.
    • Completely automatic.
EC-PSI Perspectives
• Perspective
  – Contribute considerably to enriching the annotations of PDB protein chains.
  – Facilitate a better understanding and exploitation of structure-function relationships at the protein domain level.
• Future Work
  – Explore 3-digit EC numbers
    • Better recall of InterPro positive associations.
  – Explore multiple EC-number assignments for Pfam domains.
  – Get our results reviewed by a domain expert.
Result for First-Digit EC
• Increase in EC-Pfam associations depending on the top-level EC branch (first digit)
[Figure legend: 1: Oxidoreductases; 2: Transferases; 3: Hydrolases; 4: Lyases; 5: Isomerases; 6: Ligases; All: all EC numbers.]