Introducing glycomics data into the Semantic Web

JOURNAL OF BIOMEDICAL SEMANTICS

Aoki-Kinoshita et al. Journal of Biomedical Semantics 2013, 4:39
http://www.jbiomedsem.com/content/4/1/39

SHORT REPORT Open Access

Kiyoko F Aoki-Kinoshita1, Jerven Bolleman2, Matthew P Campbell3, Shin Kawano4, Jin-Dong Kim4, Thomas Lütteke5, Masaaki Matsubara6, Shujiro Okuda7,8, Rene Ranzinger9, Hiromichi Sawaki10, Toshihide Shikanai10, Daisuke Shinmachi10, Yoshinori Suzuki10, Philip Toukach11, Issaku Yamada6, Nicolle H Packer3 and Hisashi Narimatsu10*

Abstract

Background: Glycoscience is a research field focusing on complex carbohydrates (otherwise known as glycans)a, which can, for example, serve as “switches” that toggle between different functions of a glycoprotein or glycolipid. Due to the advancement of glycomics technologies that are used to characterize glycan structures, many glycomics databases are now publicly available and provide useful information for glycoscience research. However, these databases have almost no link to other life science databases.

Results: In order to implement support for the Semantic Web most efficiently for glycomics research, the developers of major glycomics databases agreed on a minimal standard for representing glycan structure and annotation information using RDF (Resource Description Framework). Moreover, all of the participants implemented this standard prototype and generated preliminary RDF versions of their data. To test the utility of the converted data, all of the data sets were uploaded into a Virtuoso triple store, and several SPARQL queries were tested as “proofs-of-concept” to illustrate the utility of the Semantic Web for cross-database queries that were originally difficult to implement.

Conclusions: We were able to successfully retrieve information by linking UniCarbKB, GlycomeDB and JCGGDB in a single SPARQL query to obtain our target information. We also tested queries linking UniProt with GlycoEpitope as well as lectin data with GlycomeDB through PDB. As a result, we have been able to link proteomics data with glycomics data through the implementation of Semantic Web technologies, allowing for more flexible queries across these domains.

Keywords: BioHackathon, Carbohydrate, Data integration, Glycan, Glycoconjugate, SPARQL, RDF standard, Carbohydrate structure database

* Correspondence: [email protected]
10Research Center for Medical Glycoscience, National Institute of Advanced Industrial Science and Technology, Tsukuba Central-2, Umezono 1-1-1, Tsukuba 305-8568, Japan
Full list of author information is available at the end of the article

© 2013 Aoki-Kinoshita et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

It is widely acknowledged that developing a mechanism to handle multiple databases in an integrated manner is key to making glycomics accessible to other -omic disciplines. The National Academy of Sciences published a report called “Transforming Glycoscience: A Roadmap for the Future” that exemplifies the hurdles and problems faced by the glycomics research community due to the disconnected and incomplete nature of existing databases [1]. Within the last decade, a large number of carbohydrate structure (sequence) databases have become available on the web, all providing their own unique data resources and functionalities [2]. After the conclusion of the CarbBank project [3], the German Cancer Research Center used the available data to develop their GLYCOSCIENCES.de database [4], which in general focuses on the three-dimensional conformations of carbohydrates. KEGG GLYCAN was added to the KEGG resources as a new glycan structure database that is linked to their genomic and pathway information [5]. The Consortium for Functional Glycomics also developed a glycan structure database to supplement their data resources storing experimental data from glycan arrays, glycan profiling by mass spectrometry, glyco-gene knockout mice and glyco-gene microarrays [6]. In Russia, the Bacterial Carbohydrate Structure Database (BCSDB) was developed, which contains carbohydrate structures from bacterial species collected from the scientific literature [7]. Additionally, small databases used in local laboratories have been developed, and so the GlycomeDB database was developed to integrate all the records in these databases and to provide a web portal that allows researchers to search across all supported databases for particular structures [8]. The developers of GlycomeDB were a part of the EUROCarbDB project, which was an EU-funded initiative for developing a framework for storing and sharing experimental data of carbohydrates [9]. Several resources were developed under the EUROCarbDB framework, including MonosaccharideDB [10], a database for organizing monosaccharide information, and the HPLC-focused database GlycoBase [11]. MonosaccharideDB is an important database for integrating carbohydrate structures from different resources, since different representations are often used for the same monosaccharides. Unfortunately, funding support for the EUROCarbDB project ended; however, the data resources and software, all of which are available as open source, were taken on by the UniCarbKB project [12].
Meanwhile in Japan, the Japan Consortium for Glycobiology and Glycotechnology Database (JCGGDB) was developed to integrate all the carbohydrate resources in Japan [13]. However, despite all of these efforts to develop useful and valuable glycomics databases, a lack of interoperability is hampering the development of ‘mashup’ applications that are capable of integrating glycan-related data with other -omics data.

Almost all databases mentioned above provide their information using web pages, restricting the query possibilities to the limited search options provided by the developers. In addition, only a few databases provide web services that allow retrieval of data in a machine-readable non-HTML format. The few implemented web service interfaces return proprietary non-standard formats, making it hard to retrieve and integrate data from several resources into a single result. Despite some efforts to standardize and exchange their data [14,15], most glycomics databases are still regarded as “disconnected islands” [1]. Standardization of carbohydrate primary structures is more difficult than in genomics or proteomics, mainly because of the inherent structural complexity of oligosaccharides, exemplified by complex branching, glycosidic linkages, anomericity and residue modifications. Individual databases developed their own formats to cope with these problems and encode glycan primary structures in a machine-readable way [2].

Collaboration agreement

In order to integrate data in the life sciences using RDF (Resource Description Framework), several annual BioHackathons (Biology + Hacking + Marathon) sponsored by the National Bioscience Database Center (NBDC) and Database Center for Life Science (DBCLS) in Japan have been held since 2008. The 5th BioHackathon was held in Toyama city, Japan, from September 2nd to 7th, 2012 [16]. The glycan RDF subgroup convened in Toyama to discuss and implement the initial version of a contextualized RDF document (GlycoRDF) representing the respective glycan database contents in a standardized RDF format.

For a better understanding of the processes that glycans are involved in, the participants all agreed that information on primary structures alone is not sufficient. Associated metadata must also be taken into consideration, such as the biological contexts the glycans have been found in (including information on the proteins that glycans are linked to), specification of glycan-binding proteins, associated publications and experimental data. Such data are spread over various resources, which (e.g. in the context of proteins) are not limited to glyco-related databases. A better integration of all these data collections will allow researchers to answer more complex biological questions than simply using individual databases or only cross-linking primary structures. Connecting glycomics resources with other kinds of life science data will also significantly improve the integration of glycan information into systems biology approaches.

Each of the glycan databases already has an existing tool chain and infrastructure in place. Therefore, the glycan databases were first translated into an agreed-upon RDF data model. This RDFication process is unique for each resource due to their respective data contents. However, a minimal agreement was made by which the databases could be linked with one another. The following generalization illustrates some examples of the RDF data generated by the databases used in the proof-of-concept queries. Note that a unified prefix “glyco:” was agreed upon, as well as the use of identifiers.org as the URI to be used when referencing external databases. As a result, glycan structures, monosaccharides, biological sources, literature references and experimental evidence data could be RDFized.
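As a rough illustration of the minimal agreement described above, the following Python sketch emits two N-Triples lines in which a glycan entry carries a GlycoCT sequence and points to an external UniProt record through an identifiers.org URI. The namespace bound to the “glyco:” prefix, the predicate names (has_sequence, has_protein), the entry URI and the accession are placeholders chosen for this sketch; the text does not specify them.

```python
# Assumption: the real namespace behind the agreed "glyco:" prefix is not
# given in the text, so a placeholder namespace is used here.
GLYCO = "http://example.org/glyco#"
IDENTIFIERS_ORG = "http://identifiers.org/"


def external_uri(database: str, accession: str) -> str:
    """Build an identifiers.org URI for a record in an external database."""
    return f"{IDENTIFIERS_ORG}{database}/{accession}"


def escape_literal(text: str) -> str:
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def triple(subject: str, predicate: str, obj: str, literal: bool = False) -> str:
    """Serialize one triple as a single N-Triples line."""
    rendered = f'"{escape_literal(obj)}"' if literal else f"<{obj}>"
    return f"<{subject}> <{predicate}> {rendered} ."


def describe_glycan(entry_uri: str, glycoct: str, uniprot_accession: str) -> list:
    """Return triples for one glycan entry (hypothetical predicate names)."""
    return [
        triple(entry_uri, GLYCO + "has_sequence", glycoct, literal=True),
        triple(entry_uri, GLYCO + "has_protein",
               external_uri("uniprot", uniprot_accession)),
    ]


if __name__ == "__main__":
    # Placeholder entry URI, GlycoCT fragment and accession.
    for line in describe_glycan("http://example.org/structure/1",
                                "RES\n1b:b-dglc-HEX-1:5", "P00000"):
        print(line)
```

Because every database would write the external reference in the same form, two data sets that mention the same protein end up with the same node, and no string manipulation is needed to join them.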

Table 1 RDFized glycan databases in this study

DB name | URL | Number of entries as of May 2013 | Number of triples | Reference
UniCarbKB | http://www.unicarbkb.org/ | Over 3300 glycan structures, approximately 9000 structure and protein associations, with over 900 publications | 1977 triples for structure and protein data of one experiment | [12]
BCSDB | http://csdb.glycoscience.ru/bacterial/ | Over 10,000 structures, over 4000 publications, over 5000 taxons, and over 2500 NMR spectra | 2,595,411 triples from all data | [7]
GlycomeDB | http://www.glycome-db.org/ | 37,140 glycan structure entries | 518,733 triples from all data | [8]
MonosaccharideDB | http://www.monosaccharidedb.org/ | About 700 monosaccharide entries | 1911 triples of Basetypes, 14,692 triples of Monosaccharides, and 275 triples of Substituents | [10]
GlycoEpitope | http://www.glyco.is.ritsumei.ac.jp/epitope2/ | 174 glycoepitopes recognized by 613 antibodies and a wide range of biochemical information related to the glycoepitopes and antibodies | 220,545 triples from all data | [17]
GlycoProtDB | http://jcggdb.jp/rcmg/glycodb/LectinSearch | 1,830 entries of mouse glycoproteins and 701 of C. elegans glycoproteins | 2,337,104 triples | [18]
LfDB (Lectin frontier DataBase) | http://jcggdb.jp/rcmg/glycodb/LectinSearch | 479 entries of lectin data, including PDB information, and their glycan interaction data | 902 triples of lectin-PDB relationship data | [19]








Proof-of-concept SPARQL queries

At the time of this writing, UniCarbKB, BCSDB, GlycomeDB, MonosaccharideDB, GlycoEpitope [17], GlycoProtDB [18] and Lectin frontier DataBase (LfDB) [19] have implemented RDF versions of all or part of their data using a minimal RDF standard (Table 1).

After the conversion of these data into RDF, we set up a local triplestore using Virtuoso [20], uploaded all of the data and tested the following queries to see if the target data could be retrieved.
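Once the data are in a triple store, each proof-of-concept query reduces to an HTTP request against the store's SPARQL endpoint. The Python sketch below only builds such a request with the standard library and does not send it. The endpoint address (a common local default for Virtuoso), the placeholder “glyco:” namespace and the predicate name are assumptions for illustration, and the query text is not one of the queries used in this study.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: address of a locally running SPARQL endpoint.
ENDPOINT = "http://localhost:8890/sparql"

# Illustrative query only; namespace and predicate are placeholders.
QUERY = """
PREFIX glyco: <http://example.org/glyco#>
SELECT ?entry ?sequence
WHERE { ?entry glyco:has_sequence ?sequence . }
LIMIT 10
"""


def build_request(endpoint: str, query: str) -> Request:
    """Build a SPARQL protocol GET request that asks for JSON results."""
    # The SPARQL protocol passes the query text in the "query" parameter.
    params = urlencode({"query": query})
    return Request(
        f"{endpoint}?{params}",
        headers={"Accept": "application/sparql-results+json"},
    )


if __name__ == "__main__":
    request = build_request(ENDPOINT, QUERY)
    print(request.full_url)
```

Passing the returned object to urllib.request.urlopen would submit the query, provided an endpoint is actually listening at the assumed address.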

Query 1

Because JCGGDB entries have no links to UniProt [21] entries, we tried to retrieve UniProt IDs from JCGGDB IDs using information from other databases. A JCGGDB entry has a link to a GlycomeDB entry, which contains the glycan structure in GlycoCT format [22]. A UniCarbKB entry has a link to its related UniProt entry and also contains a glycan structure in GlycoCT format. Therefore, we mapped JCGGDB IDs to UniCarbKB entries using GlycomeDB and were able to retrieve the UniProt IDs (stored in UniCarbKB) for each JCGGDB ID. An execution of this example query is illustrated in Figure 1, showing the resulting UniProt IDs which are related to JCGGDB IDs.

Figure 1 Query 1. A) SPARQL query 1, which retrieves UniProt accession numbers from JCGGDB IDs via GlycomeDB and UniCarbKB, together with a short example of the result set. B) Schematic workflow of the cross-database query.

Query 2

To test whether it would be possible to link lectin information with glycan structures, we used the PDB information [23] in the LfDB data. GlycomeDB provides references to PDB entries containing glycans, which have been extracted using pdb2linucs [24]. Because LfDB records PDB IDs as well, we could obtain the glycan structures in GlycoCT format for each PDB entry. The list of results includes covalently linked glycan structures (post-translational modifications) as well as glycan structures bound by the lectin. Figure 2 illustrates this query.

Figure 2 Query 2. A) SPARQL query 2 retrieving relevant glycan structure information from lectin data. B) Schematic workflow of the cross-database query. Note that LfDB does not provide any glycan binding information for this lectin (Galectin-3 in human). However, from the glycan structure information in the PDB data, we could obtain related glycan structures through this query.

Query 3

Carbohydrates or parts of carbohydrates are often recognized as epitopes with which antibodies, toxins, viruses or bacteria interact, so it was important for us to be able to use the GlycoEpitope database in a query. With the RDF version of GlycoEpitope, we could identify the carrier proteins of glycan epitopes by NCBI RefSeq identifiers using a single SPARQL query. In particular, the antibody information leads to the related epitopes, which in turn reference UniProt protein IDs. From there, NCBI RefSeq IDs could also be retrieved. Figure 3 illustrates this query, which resulted in 57 matches. In theory, it should be possible to obtain protein IDs from GlycoProtDB by retrieving the NCBI protein gi number from the RefSeq ID obtained in this query, which is then referenced by GlycoProtDB protein IDs as the core protein. In our tests, however, since GlycoEpitope mainly contains human protein information and GlycoProtDB has only mouse and C. elegans proteins, we were unable to obtain GlycoProtDB information in a single query. We are considering the possibility of including orthologue information in order to make this possible.
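The reasoning behind Query 1 can be sketched without a triple store: link tables are joined on a shared key, much as a SPARQL query joins triple patterns on a shared variable. In the Python sketch below all identifiers and GlycoCT fragments are invented placeholders, and matching GlycomeDB and UniCarbKB records by exact equality of the GlycoCT string is a simplifying assumption of this sketch, not a description of the actual query.

```python
# Placeholder link tables; every identifier below is invented.
JCGGDB_TO_GLYCOMEDB = {
    "JCGG-A": "GDB-1",
    "JCGG-B": "GDB-2",
}

GLYCOMEDB_TO_GLYCOCT = {
    "GDB-1": "RES\n1b:b-dglc-HEX-1:5",
    "GDB-2": "RES\n1b:b-dgal-HEX-1:5",
}

UNICARBKB_ENTRIES = [
    {"glycoct": "RES\n1b:b-dglc-HEX-1:5", "uniprot": "P00001"},
    {"glycoct": "RES\n1b:b-dglc-HEX-1:5", "uniprot": "P00002"},
    {"glycoct": "RES\n1b:a-dman-HEX-1:5", "uniprot": "P00003"},
]


def uniprot_ids_for(jcggdb_id: str) -> list:
    """Follow JCGGDB -> GlycomeDB -> GlycoCT -> UniCarbKB -> UniProt."""
    glycomedb_id = JCGGDB_TO_GLYCOMEDB.get(jcggdb_id)
    glycoct = GLYCOMEDB_TO_GLYCOCT.get(glycomedb_id)
    if glycoct is None:
        # No GlycomeDB link or no structure: nothing to join on.
        return []
    # Join on the shared structure string and collect distinct accessions.
    matches = {
        entry["uniprot"]
        for entry in UNICARBKB_ENTRIES
        if entry["glycoct"] == glycoct
    }
    return sorted(matches)


if __name__ == "__main__":
    for jcggdb_id in JCGGDB_TO_GLYCOMEDB:
        print(jcggdb_id, uniprot_ids_for(jcggdb_id))
```

An entry whose structure has no counterpart in the other resource simply yields no result, which mirrors the situation in Query 3, where the human and mouse data sets shared no proteins.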

Figure 3 Query 3. A) SPARQL query 3 retrieving NCBI RefSeq protein IDs of the carrier proteins of glycan epitopes that are recognized by antibodies as stored in the GlycoEpitope database. This example illustrates the ease by which glycoepitope data could be queried together with UniProt in a single query. B) Schematic workflow of the cross-database query.

Discussion and conclusion

In this report, we illustrate the utility of RDFizing glyco-databases in order to link glycan data from different glycomics resources with proteomics data. The developers of existing databases agreed upon using RDF as a straightforward approach to link relevant data with one another. This would in turn enable the creation of links with other -omics data sources. In particular, we have shown in this work that the availability of formalized RDF data of glycoscience resources has allowed not only the integrated query of multiple glyco-related databases, but also the integration with UniProt, which is a valuable resource of proteomics data. Although few genomic resources are currently on the Semantic Web, as the utility of this new technology spreads, we expect that other proteomics, metabolomics and even medical data will become available. Moreover, it is a simple matter of adding triples to existing data to link with new resources as they become available, illustrating the power of the Semantic Web.

In order to further add other pertinent glycomics data to the Semantic Web, two points should be kept in mind: 1) the consistent usage of predicates throughout the related data, and 2) the consistent usage of URIs. For 1), an ontology for glycomics data will be necessary; one is currently under development. For 2), we suggest the usage of identifiers.org when referring to external databases. This base URI is intended to be a persistent URI for any major data resource, such that if the original URI changes, identifiers.org will point to the updated resource. Thus users will not need to manage the update of outdated URIs.

Future work entails the development of a more formalized glyco-ontology in order to organize the semantics of the existing glyco-related data, as mentioned above. This can be most easily undertaken by first focusing on the RDF data at hand. As evident from queries 2 and 3, we were forced to use regular expression filters in order to obtain our target data. Thus, we are currently discussing the first version of this glyco-ontology and plan on implementing a more standardized version of our RDF data. These data will be made available through a public SPARQL endpoint in the near future such that federated queries can be performed. This will also make it possible for developers of other related databases to use our standard to most efficiently link their data with the glycomics world.
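To illustrate why consistent URIs matter, the Python sketch below contrasts the two situations mentioned above. When sources write the same external record in different forms, a regular expression is needed to recover the shared accession before records can be joined; once every source uses the same identifiers.org URI, plain equality suffices. The URI variants, the accession and the choice of lower-case normalization are invented for this sketch and are not taken from the actual RDF data.

```python
import re
from typing import Optional

# A PDB ID is four characters: one digit followed by three alphanumerics.
# The pattern looks for such an ID at the very end of a reference string.
PDB_ID_AT_END = re.compile(r"(?:^|[^A-Za-z0-9])([0-9][A-Za-z0-9]{3})/?$")


def pdb_accession(reference: str) -> Optional[str]:
    """Extract a PDB ID from a free-form reference, or None if absent."""
    match = PDB_ID_AT_END.search(reference)
    return match.group(1).lower() if match else None


def to_identifiers_org(reference: str) -> Optional[str]:
    """Rewrite a PDB reference as an identifiers.org URI."""
    accession = pdb_accession(reference)
    if accession is None:
        return None
    return f"http://identifiers.org/pdb/{accession}"


if __name__ == "__main__":
    # Two invented spellings of the same (placeholder) PDB entry.
    variant_a = "http://www.rcsb.org/pdb/explore.do?structureId=1ABC"
    variant_b = "http://identifiers.org/pdb/1abc"
    # Before normalization the strings differ; afterwards they are equal.
    print(variant_a == variant_b)
    print(to_identifiers_org(variant_a) == to_identifiers_org(variant_b))
```

Normalizing once at conversion time moves the pattern matching out of every query, which is the practical benefit of agreeing on a single URI form.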

Endnote

a Note that in this manuscript, we may use the terms “carbohydrate structure” and “glycan” or “glycan structure” interchangeably. Note also that terms starting with “glyco-” refer to glycans, which are composed of monosaccharides. For example, glycoproteins are glycosylated proteins, which are protein structures with at least one monosaccharide attached to one of their amino acids.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

HN oversees the JCGGDB project which promoted this research. KFK led the organization of the glyco-group at BioHackathon 2012. All authors discussed and created the Glycan RDF standard. The following authors converted their respective databases to RDF: MC: UniCarbKB; TL: MonosaccharideDB; SO: GlycoEpitope; RR: GlycomeDB; HS: GlycoProtDB and LfDB, with assistance from DS; PT: BCSDB. KFK, SK, HS, MC, TL, JB and RR wrote this paper. All authors read, revised and approved the final manuscript.


Acknowledgements

This work has been supported by National Bioscience Database Center (NBDC) of Japan Science and Technology Agency (JST), National Institute of Advanced Industrial Science and Technology (AIST) in Japan, and the Database Center for Life Science (DBCLS) in Japan. The developers recognize the invaluable contributions from the community and those efforts to curate and share structural and experimental data collections. MC acknowledges funding from the Australian National eResearch Collaboration Tools and Resources project (NeCTAR). PT acknowledges funding from Russian Foundation for Basic Research, grant 12-04-00324. RR is supported by NIH/NIGMS funding the National Center for Glycomics and Glycoproteomics (8P41GM103490).

Author details

1Department of Bioinformatics, Faculty of Engineering, Soka University, 1-236 Tangi-machi, Hachioji, Tokyo 192-8577, Japan. 2Swiss Institute of Bioinformatics, CMU 1, rue Michel Servet 1211, Geneva 4, Switzerland. 3Biomolecular Frontiers Research Centre, Macquarie University, Sydney, New South Wales, Australia. 4Database Center for Life Science, Research Organization of Information and Systems, 2-11-16 Yayoi, Bunkyo-ku, Tokyo 113-0032, Japan. 5Institute of Veterinary Physiology and Biochemistry, Justus-Liebig-University Giessen, Frankfurter Str. 100, 35392 Giessen, Germany. 6Laboratory of Glyco-organic Chemistry, The Noguchi Institute, 1-8-1 Kaga, Itabashi-ku, Tokyo 173-0003, Japan. 7Department of Bioinformatics, College of Life Sciences, Ritsumeikan University, 1-1-1 Nojihigashi, Kusatsu, Shiga 525-8577, Japan. 8Niigata University Graduate School of Medical and Dental Sciences, 1-757 Asahimachi-dori, Chuo-ku, Niigata 951-8510, Japan. 9Complex Carbohydrate Research Center, University of Georgia, Athens, Georgia 30602, USA. 10Research Center for Medical Glycoscience, National Institute of Advanced Industrial Science and Technology, Tsukuba Central-2, Umezono 1-1-1, Tsukuba 305-8568, Japan. 11NMR Laboratory, N.D. Zelinsky Institute of Organic Chemistry, Leninsky prospekt 47, 119991 Moscow, Russia.

Received: 9 May 2013 Accepted: 17 October 2013 Published: 26 November 2013

References

1. Committee on Assessing the Importance and Impact of Glycomics and Glycosciences, Board on Chemical Sciences and Technology, Board on Life Sciences, Division on Earth and Life Studies, National Research Council: Transforming Glycoscience: A Roadmap for the Future. Washington, D.C., USA: The National Academies Press; 2012.

2. Aoki-Kinoshita KF: Using databases and web resources for glycomics research. Mol Cell Proteomics 2013, 12:1036–1045.

3. Doubet S, Albersheim P: CarbBank. Glycobiology 1992, 2:505.

4. Lütteke T, Bohne-Lang A, Loss A, Goetz T, Frank M, von der Lieth CW: GLYCOSCIENCES.de: an Internet portal to support glycomics and glycobiology research. Glycobiology 2006, 16:71R–81R.

5. Hashimoto K, Goto S, Kawano S, Aoki-Kinoshita KF, Ueda N, Hamajima M, Kawasaki T, Kanehisa M: KEGG as a glycome informatics resource. Glycobiology 2006, 16:63R–70R.

6. Raman R, Venkataraman M, Ramakrishnan S, Lang W, Raguram S, Sasisekharan R: Advancing glycomics: implementation strategies at the consortium for functional glycomics. Glycobiology 2006, 16:82R–90R.

7. Toukach PV: Bacterial carbohydrate structure database 3: principles and realization. J Chem Inf Model 2011, 51:159–170.

8. Ranzinger R, Herget S, von der Lieth CW, Frank M: GlycomeDB-a unified database for carbohydrate structures. Nucleic Acids Res 2011, 39:D373–D376.

9. von der Lieth CW, Freire AA, Blank D, Campbell MP, Ceroni A, Damerell DR, Dell A, Dwek RA, Ernst B, Fogh R, Frank M, Geyer H, Geyer R, Harrison MJ, Henrick K, Herget S, Hull WE, Ionides J, Joshi HJ, Kamerling JP, Leeflang BR, Lütteke T, Lundborg M, Maass K, Merry A, Ranzinger R, Rosen J, Royle L, Rudd PM, Schloissnig S, et al: EUROCarbDB: An open-access platform for glycoinformatics. Glycobiology 2011, 21:493–502.

10. Lütteke T: MonosaccharideDB. http://www.monosaccharidedb.org/ (accessed August 18, 2013).

11. Campbell MP, Royle L, Radcliffe CM, Dwek RA, Rudd PM: GlycoBase and autoGU: tools for HPLC-based glycan analysis. Bioinformatics 2008, 24:1214–1216.

12. Campbell MP, Hayes CA, Struwe WB, Wilkins MR, Aoki-Kinoshita KF, Harvey DJ, Rudd PM, Kolarich D, Lisacek F, Karlsson NG, Packer NH: UniCarbKB: putting the pieces together for glycomics research. Proteomics 2011, 11:4117–4121.

13. Japan consortium for glycobiology and glycotechnology database. http://jcggdb.jp/index_en.html.

14. Packer NH, von der Lieth C-W, Aoki-Kinoshita KF, Lebrilla CB, Paulson JC, Raman R, Rudd P, Sasisekharan R, Taniguchi N, York WS: Frontiers in glycomics: bioinformatics and biomarkers in disease: an NIH white paper prepared from discussions by the focus groups at a workshop on the NIH campus, Bethesda MD (September 11–13, 2006). Proteomics 2008, 8:8–20.

15. Toukach P, Joshi H, Ranzinger R, Knirel Y, von der Lieth CW: Sharing of worldwide distributed carbohydrate-related digital resources: online connection of the bacterial carbohydrate structure data base and GLYCOSCIENCES.de. Nucleic Acids Res 2007, 35:D280–D286.

16. BioHackathon 2012. http://2012.biohackathon.org/.

17. GlycoEpitope. http://www.glyco.is.ritsumei.ac.jp/epitope2/.

18. Kaji H, Shikanai T, Sasaki-Sawa A, Wen H, Fujita M, Suzuki Y, Sugahara D, Sawaki H, Yamauchi Y, Shinkawa T, Taoka M, Takahashi N, Isobe T, Narimatsu H: Large-scale identification of N-glycosylated proteins of mouse tissues and construction of a glycoprotein database, GlycoProtDB. J Proteome Res 2012, 11:4553–4566.

19. Lectin Frontier DataBase. http://jcggdb.jp/rcmg/glycodb/LectinSearch.

20. Orri E, Mikhailov I: RDF Support in the Virtuoso DBMS. Conference on Social Semantic Web 2007, 113:59–68.

21. Consortium UP: Update on activities at the universal protein resource (UniProt) in 2013. Nucleic Acids Res 2013, 41:D43–D47.

22. Herget S, Ranzinger R, Maass K, Lieth CW: GlycoCT-a unifying sequence format for carbohydrates. Carbohydr Res 2008, 343:2162–2171.

23. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The protein data bank. Nucleic Acids Res 2000, 28:235–242.

24. Lütteke T, Frank M, von der Lieth CW: Data mining the protein data bank: automatic detection and assignment of carbohydrate structures. Carbohydr Res 2004, 339:1015–1020.

doi:10.1186/2041-1480-4-39
Cite this article as: Aoki-Kinoshita et al.: Introducing glycomics data into the Semantic Web. Journal of Biomedical Semantics 2013 4:39.
