Ontology work at the Royal Society of Chemistry

Ontology work at the Royal Society of Chemistry. Antony J. Williams, Colin Batchelor, Peter Corbett, Jon Steele and Valery Tkachenko. ACS Dallas, March 16th 2014

Royal Society of Chemistry

• You know us as a publisher and society but

• We are a host of chemistry databases

• We are a charity and community support

• We are a provider of grant-based services

• We are an innovator in cheminformatics

We have data to manage…

• Compounds
• Reactions
• Spectra
• Crystals
• Materials
• Assays
• Algorithms
• …

Properties - experimental

Physicochemical properties

LONG LIST: log P, log D (at pH 5.5, at pH 7.4), bioconcentration factor, KOC (at pH 5.5, at pH 7.4), index of refraction, polar surface area, molar refractivity, molar volume, polarizability, surface tension, density at STP, flash point at 1 atm, boiling point at 1 atm, enthalpy of vaporization at STP, vapour pressure at STP…

All are amenable to ontologies and should blend standards

• Compounds and properties are handled (InChIs are important)
• Reactions are covered (and RInChIs help)
• Spectra (JCAMP, AnIML, NetCDF, mzML)
• Crystals (CIFs)
• Materials (MatML)
• Assays (MIAME)
• Algorithms
• …

ChemSpider Reactions

ChemSpider Spectra

ChemSpider is 7 years old

• When ChemSpider was developed ontologies were not directly implemented

• Ontologies and the supporting technologies have matured and become more widely accepted over those seven years

• Some efforts have been made to include ontologies, for example layering on MeSH. We support many standards: InChI, RInChI, JCAMP, CIF

• The ChemSpider architecture is being rebuilt, and new standards and ontologies are being considered

Some available ontologies…

• RSC has built and opened in-house ontologies:

• Chemical methods (CHMO)
• Name reactions (RXNO)
• Molecular processes (MOP), largely auto-generated from the corresponding ChEBI classes

• We have contributed to external ontologies:

• Small molecules (ChEBI)
• Cheminformatics (CHEMINF)

Chemistry ontologies 1

ChEBI (molecules, families of molecules, parts of molecules; 32128 fully annotated classes) (http://www.ebi.ac.uk/chebi/)

perylene (CHEBI:29861)

a perylene (CHEBI:60201)

perylene skeleton (CHEBI:60200)

ChEBI Ontology

RSC Ontologies

Chemistry ontologies 2

Chemical Methods Ontology (http://rsc-cmo.googlecode.com)

2745 classes. Describes methods used to:
• collect data in chemical experiments, such as MS and NMR
• prepare and separate material for further analysis, such as sample ionisation, chromatography, and electrophoresis
• synthesise materials, such as continuous vapour deposition
• It also describes the instruments used in these experiments, such as mass spectrometers and chromatography columns, and their outputs
• Should be of value to chemical hazards and safety data

Chemistry ontologies 3

RSC Name Reaction Ontology (http://rxno.googlecode.com/)

421 classes
Examples: Diels–Alder cyclization

Chemistry ontologies 4

CHEMINF (http://code.google.com/p/semanticchemistry/)

638 classes. Describes cheminformatics methods. Not presently used in text mining (see Open PHACTS usage later).

doi:10.1371/journal.pone.0025513

Limits of ontologies

Chemical space is very big:

“The ‘small molecule universe’ (SMU), the set of all synthetically feasible organic molecules of 500 Daltons molecular weight or less, is estimated to contain over 10^60 structures, making exhaustive searches for structures of interest impractical.”

Virshup et al., J. Am. Chem. Soc., doi:10.1021/ja401184g

Why a named reaction ontology?

• Despite attempts to introduce systematic nomenclature for organic reactions, lots of chemists still prefer to attach human names.

A big challenge

• Classification is based on what the experimenter intends

• Build the ontology around intended product molecules rather than what might be by-products

• (Carbon dioxide, water, hydrolysed protecting groups, protons, etc.)

Defining the skeleton

Limits of reaction classification

• Much of RXNO is still classified by hand

• Example: we can’t just define a cyclization as a reaction where a cyclic compound is formed. The Friedel–Crafts acylation produces a cyclic compound but is not a cyclization!
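
A toy sketch of why the naive definition fails. This is not how RXNO is built (the slide says much of it is classified by hand); the ring counter is a deliberately crude assumption that only works for simple SMILES with single-digit ring closures and no bracket atoms, and the function names are hypothetical.

```python
def ring_count(smiles: str) -> int:
    """Count rings in a simple SMILES string.

    Assumes single-digit ring-closure labels and no bracket atoms,
    so every digit is one half of a ring-closure pair.
    """
    return sum(ch.isdigit() for ch in smiles) // 2


def naive_is_cyclization(reactants, products):
    """Naive rule: any cyclic product means 'cyclization'."""
    return any(ring_count(p) > 0 for p in products)


def forms_new_ring(reactants, products):
    """Slightly better rule: the products contain more rings than the reactants."""
    return sum(map(ring_count, products)) > sum(map(ring_count, reactants))


# Benzene + acetyl chloride -> acetophenone + HCl
friedel_crafts = (["c1ccccc1", "CC(=O)Cl"], ["CC(=O)c1ccccc1", "Cl"])
# Butadiene + ethene -> cyclohexene
diels_alder = (["C=CC=C", "C=C"], ["C1=CCCCC1"])

for name, rxn in [("Friedel-Crafts acylation", friedel_crafts), ("Diels-Alder", diels_alder)]:
    print(name, naive_is_cyclization(*rxn), forms_new_ring(*rxn))
```

The naive rule flags the Friedel–Crafts acylation because its product happens to contain a ring. Comparing ring counts fixes this one case, but it still knows nothing about what the experimenter intended, which is why hand curation remains necessary.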

RXNO in the wild

510 classes in the RXNO namespace… and RXNO is built into NextMove Software’s reaction identification tool.

RXNO: next steps

• More reactions!
• More cross-references!
• More example reactions!
• Links to graphical versions! (All drawn, just awaiting uploading.)
• More SMIRKS strings!

Using ontologies in text mining

• To provide a controlled vocabulary of terms found in text and a common identifier.

• Ideally this identifier is a resolvable HTTP URI (for example, http://purl.obolibrary.org/obo/CHEBI_36063 for a chemical compound); the same approach applies to methods terminology
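
A minimal sketch of the idea: look up known labels in text and return a resolvable URI for each hit. The synonym table is a tiny hypothetical sample using labels and IDs that appear elsewhere in this talk; the function names are illustrative and this is not the RSC text-mining pipeline.

```python
import re

OBO_BASE = "http://purl.obolibrary.org/obo/"

# Hypothetical synonym table; a real one would be generated from the ontologies.
SYNONYMS = {
    "ascorbic acid": "CHEBI:22652",
    "vitamin c": "CHEBI:21241",
    "glucose": "CHEBI:17234",
    "sonication": "CHMO:0001707",
}


def curie_to_uri(curie: str) -> str:
    """Turn a compact ID such as CHEBI:36063 into an OBO-style HTTP URI."""
    prefix, local_id = curie.split(":", 1)
    return f"{OBO_BASE}{prefix}_{local_id}"


def tag_terms(text: str) -> list[tuple[str, str]]:
    """Return (matched text, URI) pairs in order of appearance."""
    # Longest labels first so multi-word terms win over shorter overlaps.
    labels = sorted(SYNONYMS, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, labels)) + r")\b", re.IGNORECASE
    )
    return [
        (m.group(0), curie_to_uri(SYNONYMS[m.group(0).lower()]))
        for m in pattern.finditer(text)
    ]


print(tag_terms("Glucose was detected after sonication."))
```

Matching on exact labels is the simplest possible choice; it ignores plurals, abbreviations and ambiguity, all of which a production tagger would need to handle.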

Ontologies as synonym sets for text-mining

• We have text-mined the whole 21st century RSC archive with a myriad of ontologies. Results are on the publishing platform

• We have looked for correlations between molecules and ontology terms.

• Two examples follow…

Co-occurrences with ?

alcohols (CHEBI:30879)
solvents (CHEBI:46787)
coproporphyrins (CHEBI:23388)
3D DOSY-TOCSY (CHMO:0001950)
lipase activity (GO:0016298)
solvolysis (MOP:0000620)
wood (ENVO:00002040)
aliphatic alcohol (CHEBI:2571)
Raman circular dichroism spectroscopy (CHMO:0001160)
propoxy group (CHEBI:46881)
steam reforming (CHMO:0001450)
hydrogenation (MOP:0000589)
aqueous-phase reforming (CHMO:0001444)
sonication (CHMO:0001707)

Co-occurrences with ?

reducing agent (CHEBI:63247)
ascorbic acid (CHEBI:22652)
antioxidant (CHEBI:22586)
reduction (MOP:0000569)
electrode (CHMO:0002344)
ascorbate (CHEBI:22651)
modified residue (SO:0001089)
phosphate buffer (CHMO:0001734)
oxidation (MOP:0000568)
nafion polymer (CHEBI:61428)
vitamin C (CHEBI:21241)
antioxidant activity (GO:0016209)
atom-transfer radical polymerisation (MOP:0000684)
detection of glucose (GO:0051594)
reducing agent (CHEBI:63247)
glucose (CHEBI:17234)
graphene (CHEBI:36973)
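
Lists like the two above could be produced by counting which ontology terms share an article with a given term. The sketch below assumes each article has already been reduced to a set of ontology IDs; the three example articles are invented, and raw counts stand in for whatever correlation measure was actually used.

```python
from collections import Counter


def cooccurring_terms(documents, target):
    """Count how many documents mention each term alongside `target`."""
    counts = Counter()
    for terms in documents:
        terms = set(terms)
        if target in terms:
            counts.update(terms - {target})
    return counts


# Hypothetical annotations: one set of ontology IDs per article.
docs = [
    {"CHEBI:17234", "CHMO:0002344", "CHEBI:22652"},
    {"CHEBI:17234", "CHMO:0002344"},
    {"CHEBI:30879", "MOP:0000589"},
]

print(cooccurring_terms(docs, "CHEBI:17234").most_common())
```

Raw counts favour terms that are common everywhere, so a real analysis would probably normalise against each term's overall frequency.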

Projects and Ontologies

• 3-year Innovative Medicines Initiative project

• Integrating chemistry and biology data using semantic web technologies

• Open source code, open data and open standards

• Academics, Pharmas, Publishers…

• To put medicines in the pipeline…

The Open PHACTS community ecosystem

Our RDF schema

Two dozen calculated properties, >10^6 molecules

• CHEMINF ontology for cheminformatics
• QUDT for units and numeric values
• ChemSpider IDs for molecules

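[Figure: RDF graph for a log P calculation on benzene. Calculation has_input connection table; Calculation has_output calculated log P; calculated log P has_unit dimensionless, has_value 2.177, has standard uncertainty 0.234; an is_about edge points to benzene.]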
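
The graph in this slide can be sketched as plain subject-predicate-object triples. The node names are hypothetical stand-ins for real URIs, the predicate labels follow the slide, and attaching is_about to the connection table is an assumption, since the extracted text does not show which node that edge starts from.

```python
# Hypothetical node names; predicate labels follow the slide.
TRIPLES = [
    ("calculation", "has_input", "connection_table"),
    ("connection_table", "is_about", "benzene"),  # assumed subject for this edge
    ("calculation", "has_output", "calculated_logP"),
    ("calculated_logP", "has_unit", "dimensionless"),
    ("calculated_logP", "has_value", 2.177),
    ("calculated_logP", "has_standard_uncertainty", 0.234),
]


def objects(subject, predicate, triples=TRIPLES):
    """Return every object linked to `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def describe_output(calculation):
    """Follow has_output, then read the value and its uncertainty."""
    (result,) = objects(calculation, "has_output")
    (value,) = objects(result, "has_value")
    (sigma,) = objects(result, "has_standard_uncertainty")
    return f"{value} ± {sigma}"


print(describe_output("calculation"))
```

Keeping the value, unit and uncertainty as separate statements on the result node is what lets generic tools read any of the calculated properties the same way.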

RSC data in Open PHACTS

1. Molecule synonyms and identifiers
2. Linksets between ChEBI, ChEMBL, DrugBank and OPS identifiers
3. Molecule–molecule relations (“parent–child”) of interest for drug discovery
4. Calculated physicochemical properties for compounds (both molecular and macroscopic)

Synonyms and identifiers

Newly added to the CHEMINF ontology:

• Validated ChemSpider synonyms
• Unvalidated ChemSpider synonyms
• Validated database identifiers
• Unvalidated database identifiers
• InChI, InChIKey, SMILES
• Preferred ChemSpider name

Physicochemical properties

log P, log D (at pH 5.5, at pH 7.4), bioconcentration factor, KOC (at pH 5.5, at pH 7.4), index of refraction, polar surface area, molar refractivity, molar volume, polarizability, surface tension, density at STP, flash point at 1 atm, boiling point at 1 atm, enthalpy of vaporization at STP, vapour pressure at STP

It is actually more complicated…

[Figure: the full RDF graph for the benzene log P calculation, with each term labelled by its source vocabulary.
Nodes: benzene’s connection table (rdf:type CHEMINF connection table); OPS benzene; calculation process (rdf:type CHEMINF execution of ACD/Labs PhysChem software library version 12.01); calculation result (rdf:type CHEMINF calculated log P); QUDT dimensionless quantity; “2.17”^^xsd:float; “0.234”^^xsd:float.
Edges: IAO is about; OBI has specified input; OBI has specified output; QUDT has value; QUDT has unit; QUDT has standard uncertainty.]

What’s built on top of this?

Chemistry Data to manage…

• Compounds
• Reactions
• Spectra
• Crystals (in development)
• Materials
• Assays
• Algorithms
• …

Future Work

• Extending use of ontologies across all of our work on databases and as an underpinning to the Chemical Data Repository

• Adding ontologies to other grant-based projects such as PharmaSea

• Continued collaborations with University of Southampton on Labtrove for Chemistry

• RSC collaboration with Dr Stuart Chalk (UNF) on data standards and ontologies

• Working with CHAS on hazard/safety data

Thank you

• Email: [email protected]
• ORCID: 0000-0002-2668-4821
• Twitter: @ChemConnector
• Personal Blog: www.chemconnector.com
• SLIDES: www.slideshare.net/AntonyWilliams