Michigan Molecular Interactions (MiMI): putting the jigsaw puzzle together

Michigan Molecular Interactions (MiMI) putting thejigsaw puzzle togetherMagesh Jayapandian Adriane Chapman V Glenn Tarcea Cong Yu Aaron Elkiss

Angela Ianni Bin Liu Arnab Nandi Carlos Santos Philip Andrews Brian Athey

David States and H V Jagadish

Center for Computational Medicine and Biology University of Michigan Ann Arbor MI 48109 USA

Received August 15 2006 Revised September 13 2006 Accepted October 1 2006

ABSTRACT

Protein interaction data exists in a number of repo-sitories Each repository has its own data formatmolecule identifier and supplementary informationMichigan Molecular Interactions (MiMI) assistsscientists searching through this overwhelmingamount of protein interaction data MiMI gathersdata from well-known protein interaction databasesand deep-merges the information Utilizing an iden-tity function molecules that may have differentidentifiers but represent the same real-world objectare merged Thus MiMI allows the users to retrieveinformation from many different databases at oncehighlighting complementary and contradictoryinformation To help scientists judge the usefulnessof a piece of data MiMI tracks the provenance of alldata Finally a simple yet powerful user interfaceaids users in their queries and frees them from theonerous task of knowing the data format or learninga query language MiMI allows scientists to query alldata whether corroborative or contradictory andspecify which sources to utilize MiMI is part of theNational Center for Integrative BiomedicalInformatics (NCIBI) and is publicly available athttpmimincibiorg

1 INTRODUCTION

Both the volume and number of data sources in molecularbiology are increasing rapidly Often multiple resourcesprovide overlapping partial and polymorphic views of thesame data These data are stored and published in a diverseset of data sources Each source is distinct with respect toits biological focus (eg SNPs gene promoters etc)

organism (eg fly) and format (eg tab delimited filerelational database etc) Even after narrowing the problemdown to a subset of biological information such as proteininteraction information there is a deluge of informationWith such a rich variety of sources to choose from ascientist who wishes to visualize the full picture concerninga particular protein must visit a myriad of sites learn aplethora of names aliases and identifiers compile informa-tion from journal papers and then piece the resultingjigsaw puzzle together This task becomes even more onerousdue to several complicating factors First no naming or iden-tification scheme has been agreed upon Thus the scientistmust painstakingly map her protein of interest to a series ofdifferent names and identifiers Second many interactiondatabases or even lab web pages place an interactionin the public domain even if it is supported by only oneexperiment This forces scientists to search through multipledatabases for conflicting or corroborating evidence Thirdheterogeneous sources storing information in their ownunique formats force scientists to become programmers inorder to trawl through large volumes of data andreorganize it into an understandable format Finally once aresearcher has gathered data from several sources siftedthrough it and amalgamated it there is usually no trail leftlinking the data to their original sources At this stage ifthe scientist discovers mutually exclusive pieces of informa-tion existing in her amalgamated view she has no way ofmaking an informed decision about how to correct the data

The work of (12) minimizes the burden on the user byintegrating a large number of disparate sources containinginformation over a range of attributes such as expressionstructure and family However while the integration isfrom a large number of heterogeneous sources it is a shallowintegration Michigan Molecular Interactions (MiMI) helpsscientists search through large quantities of information byintegrating all information from participating data sourcesthrough the process of deep-merging As a result redundantdata are removed and related data are combined Moreover

To whom correspondence should be addressed at Department of Electrical Engineering and Computer Science University of Michigan 2260 Hayward AvenueAnn Arbor MI 48109 USA Tel +1 734 763 4433 Fax +1 734 763 8094 Email apchapmaumichedu

The authors wish it to be known that in their opinion the first two authors should be regarded as joint First Authors

2006 The Author(s)This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (httpcreativecommonsorglicensesby-nc20uk) which permits unrestricted non-commercial use distribution and reproduction in any medium provided the original work is properly cited

D566ndashD571 Nucleic Acids Research 2007 Vol 35 Database issue Published online 27 November 2006doi101093nargkl859

the provenance of each piece of information is trackedthroughout the system allowing scientists to choose whichdata to trust (3) MiMI allows users to ask more advancedquestions than each of its component databases can answerindependently MiMI attempts to relieve scientists of theburden of tracking down multiple sources mapping multipleidentifiers and merging redundancies By integrating well-known datasets such as HPRD (4) and BIND (5) MiMIcreates a deep-merged repository that is a synergy of all themerged datasets By such integration MiMI shows scientistswhen facts are corroborated by different datasets and whenfacts are contradicted among datasets Moreover the proven-ance of each data item is annotated allowing scientists toview information from only the sources they trust and facili-tating understanding of contradictory information MiMIrsquosintegration is distinctly different from the approach usedby the International Molecular Exchange Consortium(IMEx) While several of the integral components of MiMIalso belong to IMEx the tasks are very different IMEx isattempting to increase the rate of data curation by separatingcuration tasks among different groups Once curation isdone the information is shared among all However regard-less of any cooperation between data curation sites or parti-tioning of resources there will always be some data overlapor redundancy among them MiMI does not attempt to findnew data to curate but to augment known information byhighlighting redundancy and contradictions AdditionallyIMEx itself is in the very infancy of data exchange andthere is as yet no cohesive united and deeply merged datasetproduced by it

MiMI provides a simple interface that allows new users topose complex queries Utilizing a simple point and clickmethod our user interface allows scientist to formulateadvanced queries without specialized programming skillsAdditionally MiMI output complies with the PSI-MI format

and Cytoscape (6) allowing users to take advantage of indus-try tools for viewing interactions The following is a briefdescription of the underlying concepts of MiMI as well asa detailed list of datasets employed by MiMI

2 DATABASE CONSTRUCTION

MiMI uses XML as its data model XML is the current linguafranca of biological data exchange and gives the MiMI datamodel the flexibility to change as biological understandingincreases The physical storage of MiMI is built upon Timber(7) a native XML database MiMI is a component of theNational Center for Integrative Biomedical Informatics(httpwwwncibiorg) and is publicly available at httpmimincibiorg

Datasets

MiMI currently has 117 549 molecules and 256 757 inter-actions and is the result of integrating BIND (5) DIP (8)BioGRID (9) HPRD (4) and IntAct (10) as well as datasetsfrom Center for Cancer Systems Biology at Harvard (11)and the Max Delbrueck Center (12) Additionally supple-mentary protein information was integrated from GO (13)InterPro (14) IPI (15) miBLAST (16) OrganelleDB (17)OrthoMCL (18) PFam (19) and ProtoNet (20)

Identity

The issue of identity is determining when two databaseentries refer to the same real-world object For instance toa human it is obvious that an Hsp10p molecule found inyeast and listed in DIP is the same as the Hsp10p moleculefound in yeast and listed in BIND In the simplest case iden-tity is defined by the uniqueness of a key attribute (set) inthis case the name However in protein identification no

Figure 1 Sample protein data for Hsp10 from IntAct NCBI and BIND

Nucleic Acids Research 2007 Vol 35 Database issue D567

key exists across all datasets necessitating keyless identityfunctions

For example in BioGRID there is an entry for HSP10and an external reference to NCBIrsquos RefSeq NP_014663In BIND there is an entry for Hsp10 with an externalreference to NCBIrsquos GI 6324594 From a human perspectiveit is obvious these are the same proteins with differentcapitalizations However there is no linking identifier ineither BioGRID or BIND Further searching reveals thatNCBIrsquos records hold a link between this GI and RefSeqGiven less obviously matching names for this protein suchas CPN10 (BioGRID) and Yor020p (BIND) and differentexternal identifiers matching identical records is not a trivialtask for an individual Combine this with the thirteen knownnames for this object in BIND DIP BioGRID and IntActand the human user is bound to miss incorporating somerelevant data MiMI utilizes keyless identity functions to

determine which proteins represent the same real-worldobject Once this identity has been determined the entriesare deep-merged

Deep-merge

Datasets frequently have overlapping and sometimes evencontradictory information content Our goal is to fuse informa-tion frommultiple sources even when these sources have over-lapping or contradictory information and present a cohesiveresult to the user We call this process deep-merging or deepintegration (In contrast shallow integration performs just theschema translations and groups the datasets together) Toappreciate the issues involved let us consider an example

Example 1 Figure 1 shows a brief look at some of the entriesforHsp10 Each database has different identifiers and names forthe molecule Bind in 1 calls the protein Hsp10 while IntAct in1 calls it ch10_yeast NCBI itself has at least four versions ofthis protein with the exact same sequence and different sup-portive information Assuming that an appropriate identityfunction is found that integrates all six molecules shallow inte-gration would result in 15 listed interactions However thereare only 13 non-redundant interactions reported in the datasetsA similar problem occurs for other information on the mole-cule such as PTMs Figure 2 shows a view of the resultingdeep-merging process

There is significant redundancy across data sources Table 1shows the number of molecules and interactions provided byeach source It also shows the resulting number of moleculesand interactions after a deep-merge For molecules there is awhopping 49 redundancy rate while 40 of the interactionsare redundant across sources

Figure 2 The Hsp10 information from Figure 1 after a Deep Merge

Table 1 Number of molecules and interactions for each source as well as total

deep-merged molecules and interactions in MiMI v26

Source Molecules Interactions

BIND 111 394 175 678IntAct 62 667 67 955HPRD 18 839 66 723BioGRID 15 687 53 378DIP 19 050 54 511Center for Cancer Systems Biology dataset 3134 6726Max delbrueck center dataset 1909 3269MiMI 117 549 256 757

D568 Nucleic Acids Research 2007 Vol 35 Database issue

Provenance

Using the identity functions discussed above datasets can bedeep-merged into MiMI However not all datasets are createdequal Some are the result of careful curation and fact checkingwhile others can be from a single lab after one round of

experiments Knowing where the data came from augmentsits reliability MiMI tracks the provenance of each data itemallowing the user to determine which sources to use or ignoreMoreover database queries can use sources as a search crite-rion returning only trusted information to the user

(a)

(b)


3 DATABASE ACCESS

MiMI is stored in the Timber native XML database (7) Thismeans that the data format follows a specific schema and canbe queried using XQuery However learning somebodyelsersquos internal schema is an onerous task Additionally writ-ing declarative queries in a database query language such asXQuery has a steep learning curve Our goal is to makeMiMI easily accessible in an intuitive manner that does notrely on specialized skills

Traditional search options

MiMI provides the user with several traditional searchoptions such as browse and keyword search From thebrowse interface a user may view the list of molecules inter-actions or organisms included in MiMI However whilebrowsing gives a general overview of the data it is a slowway to find a particular protein To this end we have includeda keyword query facility to allow the user quick and easyaccess to specific information Figure 3a and b depict thesetraditional options in MiMI This is a standard form-basedsearch option that is similar to forms found in many onlinebiological databases

MQuery

Traditional approaches such as keyword forms can bestifling and restrictive to the scientist However thealternativemdashwriting a query in a declarative database querylanguagemdashis prohibitive Writing efficient XQuery is an

acquired skill browsing and keyword-based searching res-tricts the userrsquos ability to pose non-trivial queries MQueryaddresses this dilemma It combines the ease of form-basedqueries with the power of custom query writing (21) Addition-ally the XQuery produced by MQuery is tuned to the underly-ing database technology and will produce an efficient XQuery

One of the major obstacles to writing declarative queries isthe need to understand the underlying document structureMQuery allows the user to point and click on various schemaelements place conditions on them and combine these condi-tions conjunctively or disjunctively Figure 3c depicts a querya beginner user could easily build by browsing through theschema on the left clicking on fields of interest and fillingin search words In this case the user chose lsquocellularCompo-nentrsquo and lsquomoleculeFunctionrsquo and specified that she was onlyinterested in proteins in the cytoplasm that are involved inbinding Once the user has created a customized form accord-ing to her specifications she presses lsquoGenerate XQueryrsquo andthe appropriate XQuery statement is generated and displayedto the user This allows the user to learn the general formof an XQuery statement and revise the current MQuery ifneeded before submitting it to the database to obtain theresults of the query

Viewing information

Information from MiMI can be viewed in a variety offormats The simplest way is to view all information viathe web browser Each page succinctly shows all recorded

(c)

Figure 3 Database access options (a) Keyword query (b) browse (c) MQuery


information for each protein as well as the provenance asso-ciated with each piece of data A second option is to view allinteractions via Cytoscape We package all information suchthat it can be easily loaded and viewed in a Cytoscapebrowser Finally all information is downloadable in threedifferent formats XML using MiMIrsquos internal schema PSI-MI format (version 25) and plain text Information can bedownloaded from several different places the individualmolecule display page the interaction display page and theMQuery result page

4 FUTURE DEVELOPMENTS

MiMI is continuously changing We are constantly lookingfor public datasets that would either complement the existingdata or expand it We actively encourage biologists and otherusers to inform us of deficiencies in either the data or theusability of the website Our aim is to create an essentialcomprehensive and biologist-friendly database of proteininteractions

Pathway information from sites such as Reactome (22)will be included in the next release of MiMI In addition tomolecules and interactions complexes polymers biochemi-cal reactions and pathways will also be merged Thus auser viewing a complex found in both Reactome and BINDwill be able to see the deep-merged record with data fromboth sources

ACKNOWLEDGEMENTS

This work has been supported in part by grants R01 LM008106and U54 DA021519 from the National Institutes of Healthby grant IIS 0219513 from the National Science Foundationand by the Michigan Center for Biological Information

Conflict of interest statement None declared

REFERENCES

1 BirklandA and YonaG (2006) The BIOZON database a hub ofheterogeneous biological data Nucleic Acids Res 34 D235ndashD242

2 DavidsonSB CrabtreeJ BrunkBP SchugJ TannenV OvertonGCand StoeckertCJJr (2001) K2Kleisli andGUS experiments in integratedaccess to genomic data sources IBM Syst J 40 512ndash531

3 BunemanP ChapmanA and CheneyJ (2006) Provenancemanagement in curated databases ACM Sigmod 539ndash550

4 PeriS NavarroJD AmanchyR KristiansenTZJonnalagaddaCK SurendranathV NiranjanV MuthusamyBGandhiTKB GronborgM et al (2003) Development of humanprotein reference database as an initial platform for approachingsystems biology in humans Genome Res 13 2363ndash2371

5 BaderG BetelD and HogueCW (2003) BIND the biomoleculeinteraction network database Nucleic Acids Res 31 248ndash250

6 ShannonP MarkielA OzierO BaligaNS WangJT RamageDAminN SchwikowskiB and IdekerT (2003) Cytoscape a softwareenvironment for integrated models of biomolecular interactionnetworks Genome Res 13 2498ndash2504

7 JagadishHV Al-KhalifaS ChapmanA LakshmananLVNiermanA PaparizosS PatelJM SrivastavaD WiwatwattanaNWuY et al (2002) Timber a native XML database VLDB J 11274ndash291

8 XenariosI SalwinskiL DuanXJ HigneyP KinS-M andEisenbergD (2002) DIP the database of interacting proteins aresearch tool for studying cellular networks of protein interactionsNucleic Acids Res 30 303ndash305

9 StarkC BreitkreutzB-J RegulyT BoucherL BreitkreutzA andTyersStarkM (2006) BioGRID a general repository for interactiondatasets Nucleic Acids Res 34 D535ndashD539

10 HermjakobH Montecchi-PalazziL LewingtonC MudaliSKerrienS OrchardS VingronM RoechertB RoepstorffPValenciaA et al (2004) IntActmdashan open source molecular interactiondatabase Nucleic Acids Res 32 D452ndashD455

11 HanJD BertinN HaoT GoldbergDS BerrizGF ZhangLVDupuyD WalhoutAJ CusickME RothFP et al (2004) Evidencefor dynamically organized modularity in the yeast protein-proteininteraction network Nature 430 88ndash93

12 StelzlU WormU LalowskiM HaenigC BrembeckFHGoehlerH StroedickeM ZenknerM SchoenherrA KoeppenSet al (2005) A human proteinndashprotein interaction network a resourcefor annotating the proteome Cell 122 957ndash968

13 Gene Ontology Consortium (2001) Creating the geneontology resource design and implementation Genome Res 81425ndash1433

14 MulderNJ ApweilerR AttwoodTK BairochA BatemanABinnsD BradleyP BorkP BucherP CeruttiL et al (2005)Interpro progress and status in 2005 Nucleic Acids Res 33D201ndashD205

15 KerseyPJ DuarteJ WilliamsA KaravidopoulouY BirneyE andApweilerR (2004) The international protein index an integrateddatabase for proteomics experiments Proteomics 4 1985ndash1988

16 KimYJ BoydA AtheyBD and PatelJM (2005) miBLASTscalable evaluation of a batch of nucleotide sequence queries withblast Nucleic Acids Res 33 4335ndash4344

17 WiwatwattanaN and KumarA (2005) Organelle DB a cross-speciesdatabase of proteinlocalization and function Nucleic Acids Res 33D598ndashD604

18 ChenF MackeyAJ StoeckertCJJr and RoosDS (2006)OrthoMCL-DB querying a comprehensive multi-species collectionof ortholog groups Nucleic Acids Res 34 D363ndashD368

19 FinnRD MistryJ Schuster-BocklerB Griffiths-JonesSHollichV LassmannT MoxonS MarshallM KhannaADurbinR et al (2006) PFam clans web tools and servicesNucleic Acids Res 34 D247ndashD251

20 SassonO VaakninA FleischerH PortugalyE BiluY LinialNand LinialM (2003) ProtoNet hierarchical classification of the proteinspace Nucleic Acids Res 31 348ndash352

21 JayapandianM and JagadishHV (2006) Automating the design andconstruction of query forms In Proceedings of ICDE ConferenceAtlanta Georgia pp 125ndash137

22 Joshi-TopeG GillespieM VastrikI DrsquoEustachioP SchmidtE deBonoB JassalB GopinathGR WuGR MatthewsL et al (2005)Reactome a knowledgebase of biological pathways Nucleic AcidsRes 33 D428ndashD432


the provenance of each piece of information is trackedthroughout the system allowing scientists to choose whichdata to trust (3) MiMI allows users to ask more advancedquestions than each of its component databases can answerindependently MiMI attempts to relieve scientists of theburden of tracking down multiple sources mapping multipleidentifiers and merging redundancies By integrating well-known datasets such as HPRD (4) and BIND (5) MiMIcreates a deep-merged repository that is a synergy of all themerged datasets By such integration MiMI shows scientistswhen facts are corroborated by different datasets and whenfacts are contradicted among datasets Moreover the proven-ance of each data item is annotated allowing scientists toview information from only the sources they trust and facili-tating understanding of contradictory information MiMIrsquosintegration is distinctly different from the approach usedby the International Molecular Exchange Consortium(IMEx) While several of the integral components of MiMIalso belong to IMEx the tasks are very different IMEx isattempting to increase the rate of data curation by separatingcuration tasks among different groups Once curation isdone the information is shared among all However regard-less of any cooperation between data curation sites or parti-tioning of resources there will always be some data overlapor redundancy among them MiMI does not attempt to findnew data to curate but to augment known information byhighlighting redundancy and contradictions AdditionallyIMEx itself is in the very infancy of data exchange andthere is as yet no cohesive united and deeply merged datasetproduced by it

MiMI provides a simple interface that allows new users topose complex queries Utilizing a simple point and clickmethod our user interface allows scientist to formulateadvanced queries without specialized programming skillsAdditionally MiMI output complies with the PSI-MI format

and Cytoscape (6) allowing users to take advantage of indus-try tools for viewing interactions The following is a briefdescription of the underlying concepts of MiMI as well asa detailed list of datasets employed by MiMI

2 DATABASE CONSTRUCTION

MiMI uses XML as its data model XML is the current linguafranca of biological data exchange and gives the MiMI datamodel the flexibility to change as biological understandingincreases The physical storage of MiMI is built upon Timber(7) a native XML database MiMI is a component of theNational Center for Integrative Biomedical Informatics(httpwwwncibiorg) and is publicly available at httpmimincibiorg

Datasets

MiMI currently has 117 549 molecules and 256 757 inter-actions and is the result of integrating BIND (5) DIP (8)BioGRID (9) HPRD (4) and IntAct (10) as well as datasetsfrom Center for Cancer Systems Biology at Harvard (11)and the Max Delbrueck Center (12) Additionally supple-mentary protein information was integrated from GO (13)InterPro (14) IPI (15) miBLAST (16) OrganelleDB (17)OrthoMCL (18) PFam (19) and ProtoNet (20)

Identity

The issue of identity is determining when two databaseentries refer to the same real-world object For instance toa human it is obvious that an Hsp10p molecule found inyeast and listed in DIP is the same as the Hsp10p moleculefound in yeast and listed in BIND In the simplest case iden-tity is defined by the uniqueness of a key attribute (set) inthis case the name However in protein identification no

Figure 1 Sample protein data for Hsp10 from IntAct NCBI and BIND





Deep-merge










Provenance



(a)

(b)


3 DATABASE ACCESS




MQuery




Viewing information


(c)







ACKNOWLEDGEMENTS



REFERENCES



























Deep-merge










Provenance



(a)

(b)


3 DATABASE ACCESS




MQuery




Viewing information


(c)







ACKNOWLEDGEMENTS



REFERENCES
























Provenance



(a)

(b)


3 DATABASE ACCESS




MQuery




Viewing information


(c)







ACKNOWLEDGEMENTS



REFERENCES
























3 DATABASE ACCESS




MQuery




Viewing information


(c)







ACKNOWLEDGEMENTS



REFERENCES




























ACKNOWLEDGEMENTS



REFERENCES
























Michigan Molecular Interactions (MiMI): putting the jigsaw puzzle together

Documents

Transcript of Michigan Molecular Interactions (MiMI): putting the jigsaw puzzle together