Formalizing MedDRA to support semantic reasoning on adverse drug reaction terms



FORMALIZING MEDDRA TO SUPPORT SEMANTIC REASONING ON ADVERSE DRUG REACTION TERMS

Cédric Bousquet1,2, Éric Sadou1, Julien Souvignet1,2, Marie-Christine Jaulent1, Gunnar Declerck1,a

1 INSERM, U1142, LIMICS, F-75006, Paris, France; Sorbonne Universités, UPMC Univ Paris 06, UMR_S 1142, LIMICS, F-75006, Paris, France; Université Paris 13, Sorbonne Paris Cité, LIMICS, (UMR_S 1142), F-93430, Villetaneuse, France.

2 University of Saint Etienne, Department of Public Health and Medical Informatics, Saint-Etienne, France.

a Corresponding author, [email protected].

ABSTRACT

Although MedDRA has obvious advantages over previous terminologies for coding adverse drug reactions and discovering potential signals using data mining techniques, its terminological organization constrains users to search terms according to predefined categories. Adding formal definitions to MedDRA would allow retrieval of terms according to a case definition that may correspond to novel categories that are not currently available in the terminology. To achieve semantic reasoning with MedDRA, we have associated formal definitions to MedDRA terms in an OWL file named OntoADR that is the result of our first step for providing an “ontologized” version of MedDRA. MedDRA five-levels original hierarchy was converted into a subsumption tree and formal definitions of MedDRA terms were designed using several methods: mappings to SNOMED-CT, semi-automatic definition algorithms or a fully manual way. This article presents the main steps of OntoADR conception process, its structure and content, and discusses problems and limits raised by this attempt to “ontologize” MedDRA.

KEYWORDS. MedDRA, SNOMED-CT, ontology, terminology, semantic reasoning, adverse drug reaction.

1 INTRODUCTION

MedDRA¹ (Medical Dictionary for Drug Regulatory Activities) terminology is used to record and report adverse drug reactions (ADR) data for pre-marketing as well as post-marketing drug surveillance in most countries and is recommended by the ICH for the electronic transmission of individual case safety reports (ICSR) [1]. MedDRA includes most terms of common ADR dictionaries, e.g. WHO-ART (World Health Organization - Adverse Reactions Terminology), and terminologies such as the International Classification of Diseases (ICD-9) [2]. MedDRA generally allows a precise description of ADRs and related issues (e.g. investigations and surgical procedures carried out on the patient) [3,4]. MedDRA terms are organized in five levels: System Organ Class (SOC), High Level Group Terms (HLGT), High Level Terms (HLT), Preferred Terms (PT) and Low Level Terms (LLT). Each PT is linked to a primary SOC and can be linked to other secondary SOCs. LLTs are usually synonyms of PTs (including spelling or lexical variants) but can also be more precise terms. This structure and systematic hierarchy facilitate navigation and updating. MedDRA “multi-axiality” (i.e. the same PT can belong to several SOCs) allows terms to be grouped in different ways, depending on context needs [5,6], and Standardised MedDRA Queries (SMQs) are available to aid in case identification.

However, MedDRA also suffers from limitations and might be improved in several ways [7-11]. In our view, its main limitation comes from its standard terminological format, which restricts the possibility of accessing terms based on their semantics. MedDRA is a system halfway between first-generation systems (paper-based systems) and second-generation systems (compositional systems) [12]. MedDRA is available in electronic format, but unlike second-generation systems, it does not enable the creation of new concepts by composition of preexisting atomic concepts, and the meaning of its terms cannot be processed automatically by semantic tools. For example, the Gastric Ulcer PT is part of MedDRA, but neither of the atomic concepts necessary for an explicit representation of its meaning, Stomach and Ulcer, is included. The only semantic information available in MedDRA derives from the terms’ labels and to some extent from their hierarchical organization.

¹ MedDRA® is a registered trademark of the International Federation of Pharmaceutical Manufacturers and Associations.

If this lack of compositionality did not interfere with the objectives of MedDRA, there would be no need to worry. But is this really the case? The primary stated objective of MedDRA is to provide an internationally approved classification for efficient reporting and communication of ADR data between countries [13]. To that aim, MedDRA must ensure accurate and consistent term selection². However, the large number of terms in MedDRA makes this goal difficult to achieve. Several studies have observed that MedDRA users often experience difficulties in selecting appropriate terms for coding because the available terms are more specific or more general than the verbatim to be coded [15,16] (see [3,7,17] for similar issues). MedDRA provides more precise terms than its predecessors (WHO-ART or COSTART), but a precise and consistent selection of terms (i.e. homogeneous from one report to another) remains difficult due to the absence of univocal definitions and the impossibility of assisting the coding activity with querying tools able to process the meaning of terms.

We can hypothesize that converting MedDRA into a third-generation system (i.e. its “ontologization”) could help to overcome those difficulties: a formal representation of the meaning of MedDRA terms might improve coding accuracy and contribute to harmonizing coding strategies by making the use of semantic processing tools possible. Computational ontologies and associated semantic web techniques are known to favor interoperability in the medical data sharing process [18-22]. Contrary to traditional medical classifications where the meaning of concepts relies on the implicit knowledge of the user, ontologies provide an explicit and logical (thus univocal) description of the semantics, which reduces the risk of misinterpretation of the terms and ensures the possibility of reliable information sharing [18,20]. MedDRA ontologization could also benefit data mining techniques used for post-market drug surveillance. Studies of the impact of the MedDRA hierarchical organization on automated signal detection (i.e. measures of drug-reaction causal relatedness based on statistical comparison between observed and expected cases) have pointed out that taxonomic limitations decrease the sensitivity and specificity of the signals computed [7,23-25]. Conversely, the possibility of performing semantic reasoning on the meaning of MedDRA terms has been shown to improve the performance of signal detection algorithms [26,27].

We previously experimented with providing formal definitions to WHO-ART terms using an alignment with the SNOMED-CT clinical terminology [28]. We assume that SNOMED-CT is still the best candidate for providing formal definitions to MedDRA because SNOMED-CT terms are defined using a Description Logic (DL) formalism and because a fair number of alignments between MedDRA and SNOMED-CT are described in the UMLS (Unified Medical Language System) metathesaurus [29]³.

We report here our first step for providing an “ontologized” version of MedDRA: adding formal definitions to MedDRA terms using SNOMED-CT. We first describe our alignment techniques between MedDRA and SNOMED-CT terms and present the resulting formal definitions of MedDRA terms that are given in an OWL file named OntoADR. The primary objective of OntoADR is to define the PT level of MedDRA, and only secondarily the terms of upper levels. Focusing on the PTs was motivated by our intended use for grouping ADRs reported at the PT level because this level is recommended for analysis of pharmacovigilance data. Moreover, ADRs are described using PTs in the public version of the FDA (U.S. Food and Drug Administration) pharmacovigilance database that we use for signal detection. After presenting OntoADR, we discuss why formal definitions are not sufficient for an ontological version of MedDRA when keeping the original hierarchy. We finally highlight limits and problems related to this choice and propose additional work that would be necessary to achieve an improved and more advanced version of OntoADR.

Before the version of OntoADR presented in this paper, several preliminary versions were designed, the first one in 2003. We used them in several studies to group case reports with different methods of terminological reasoning, mainly subsumption and approximate matching. Our studies showed that grouping medically related conditions using terminological reasoning significantly improves signal generation performance [27]. More occurrences of drug-ADR associations could be identified with that method than by using the MedDRA hierarchy.

² As the ICH guide for MedDRA users explains, “unless users achieve consistency in how they assign terms to verbatim reports of symptoms, signs, diseases, etc., use of MedDRA cannot have the desired harmonizing effect in the exchange of coded data. […] Consistent term selection promotes medical accuracy for sharing MedDRA-coded data and facilitates a common understanding of shared data among academic, commercial and regulatory entities.” [14]

³ The UMLS is a semantic network developed by the NLM (U.S. National Library of Medicine) to link terms from more than a hundred controlled vocabularies and provide semantic definitions of terms [30]. Both SNOMED-CT and MedDRA are included.

2 METHODS

2.1 Providing MedDRA terms with computable formal definitions

Two main strategies are possible for providing MedDRA terms with formal definitions and thus achieving the advantages of third generation systems: 1) building an OWL-DL (Web Ontology Language - Description Logics) representation of MedDRA; 2) keeping the terminological format of MedDRA and mapping MedDRA terms with concepts from other existing ontologies in order to use their semantic representations and make possible indirect reasoning. The second strategy is adopted by the UMLS metathesaurus and the Bioportal Website [31], or different research works [11,32], where MedDRA terms are mapped to equivalent concepts (or at least assumed so) from several biomedical ontologies or terminologies, including SNOMED-CT. However, we believe this solution suffers from certain weaknesses:

(1) This strategy is limited because not every MedDRA term can be mapped to a unique SNOMED-CT concept having the same meaning (one-to-one mapping). For example, the MedDRA term Joint dislocation postoperative does not have a direct equivalent concept in SNOMED-CT and must be expressed through several SNOMED-CT concepts, e.g. Dislocation of joint and Postoperative complication. Decomposing the meaning of a MedDRA term in that way is different from stating a mere conceptual equivalence: in the present example, the mapping relations correspond to subsumption (IS-A) relations, or, to use SKOS (Simple Knowledge Organization System) mapping vocabulary, to SKOS:BROADMATCH relations. In other cases, domain relations are even necessary.

(2) This strategy does not permit semantic reasoning based on the relations that MedDRA terms have to each other, first of all hierarchical relations. The main added value of formalizing semantics using subsumption relations, namely the principle of property inheritance, is thus lost. The same applies to inferences based on formal properties of properties, e.g. on their transitive or inverse character. Reasoning can be performed on the concepts and conceptual relations from the external resource used to define MedDRA, but not on MedDRA itself.

(3) Finally, an indirect mapping-based formalization of MedDRA semantics is simply unnecessary if the purpose is to provide a formal definition usable for reasoning. Directly defining the MedDRA terms avoids having to use a supplementary resource in the semantic reasoning procedure.

For these reasons, we have opted for a direct formalization of MedDRA terms. We have built an OWL-DL version of MedDRA named OntoADR, where MedDRA terms are defined by a set of semantic properties. This strategy does not, however, prevent the use of mappings: as we will see, the formal definition of MedDRA terms can be partially achieved using mappings to concepts of existing biomedical ontologies.

To our knowledge, related work does not describe any attempt to build semantic representation of MedDRA where terms’ meaning is described with a logical formalism allowing semantic reasoning. Previous works have been published by our team but have not been completed (mostly because they were feasibility studies [23,26,27,33]): only some parts of MedDRA have been formalized, leaving most of the terms without semantic representation. One reason for this situation is the cost of such an operation: defining a single MedDRA term requires specific expert knowledge, and verifying a precise definition can be time-consuming. The operation has to be repeated for every term. MedDRA’s latest version (version 16.0, published on 1 March 2013) counts 93,460 terms in total, and 22,134 if we exclude low level terms (LLT). LLTs may be used for coding – they are recommended by the MSSO [14] – but PTs are the preferred level for data analysis and retrieval. According to E2B (R2) guideline for electronic transmission of ICSRs, PTs must also be specified in case reports [1]. Difficulties inherent in any attempt to ontologize terminological systems should also not be overlooked [34,35]. Terminologies’ design follows principles that can be very different from those of ontologies: the way items are organized is generally less systematic, and often closer to the way concepts are used by stakeholders in actual situations. In addition, concepts’ description in terminologies can use extremely complex language resources (such as exceptions, exclusions and references to implicit knowledge), and hold ambiguities, which make it difficult to translate them into a language such as OWL-DL, where any ambiguity is, in principle, prohibited.

2.2 The conception of OntoADR: main principles

2.2.1 Semi-automatizing the process to formalize MedDRA terms semantics

Three critical issues are generally acknowledged in the construction of ontologies: 1) identifying the relevant concepts of the domain of interest; 2) organizing the concepts hierarchically; and 3) defining the semantics of those concepts using a formal knowledge representation language such as a description logic (DL) formalism.

A major constraint is that the ontology used for semantic reasoning must cover the whole terminological field: all relevant terms used to encode data in the databases and documents processed by the semantic algorithm must be conceptualized [18,36]. In our case, the relevant domain concepts were already identified: they are present in MedDRA. Our main problem concerned the definition of concepts, i.e. providing MedDRA terms with formal representations usable by semantic reasoning algorithms.

In principle the definition process can, of course, be purely manual. However, the resources needed for such manual modeling can be huge, and developing an ontology from scratch consumes a lot of time. Manual definition was not an option given the limited time and human resources available to us. In the first version of OntoADR (which covered only a small part of MedDRA: 530 terms in total), the semantic definitions of MedDRA terms were produced entirely manually with the OilEd ontology editor [27]. This procedure gave us an estimate of the time necessary for such a task: we projected that it would have taken us several years to complete formal definitions for all ADRs, with no guarantee in terms of maintainability. To bypass this problem, we developed several semi-automatic methods to complete the formal representations of MedDRA concepts (they are described in detail in Section 3.2). When a mapping between MedDRA terms and SNOMED-CT concepts was available, we reused the semantic information within SNOMED-CT in order to build the formal definition of MedDRA terms. For some context-relevant MedDRA terms and according to our studies’ needs, when a mapping was not available, the formal definition was produced manually by knowledge engineers and pharmacovigilance experts.

2.2.2 SNOMED-CT approach to semantic formalization

SNOMED-CT is a broad clinical terminology (the 2011 version includes more than 311,000 concepts) which features ontological characteristics such as the systematic and logical mode of representation of semantics. SNOMED-CT is not developed using OWL-DL standard language. However, SNOMED-CT’s general structure and formalism can be quite easily converted into an ontological representation, using a subset of the EL++ description logic formalism, a lightweight version of OWL [35,37,38].

In particular: (1) the hierarchical organization of SNOMED-CT concepts is equivalent to a subsumption (IS-A) relation (modeling errors regarding subsumption have however been detected [35,39]), which means that concepts follow the logical law of inheritance: a concept inherits the definitional properties of the concept which subsumes it in the hierarchy, and “each and every instance of a child concept is also an instance of its respective parent concepts” [38]. (2) The properties used to define SNOMED-CT concepts are also organized hierarchically in a subsumption tree. (3) Each relation is defined by a precise domain and range [40]. (4) SNOMED-CT concepts are defined using several properties which are not explicitly combined with logical operators, as is typically the case in classical computational ontologies. But it can be assumed that properties are connected by a conjunctive relation (AND); disjunction and negation are not used. (5) Triples defining relational statements in SNOMED-CT are equivalent to an OWL description of class properties using an existential restriction. For instance, SNOMED-CT Nance-Horan syndrome is defined by the OCCURRENCE CONGENITAL statement, which can be converted into the OWL-DL expression: NANCE-HORAN SYNDROME SUBCLASS-OF ‘HAS-OCCURRENCE SOME CONGENITAL’.
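As an illustration of point (5), the following minimal sketch turns a SNOMED-CT-style triple into an existential restriction, first as a readable axiom string and then as an RDF/XML fragment. It is not the conversion code used for OntoADR: the function names and the identifiers `NanceHoranSyndrome`, `hasOccurrence` and `Congenital` are hypothetical and chosen only for readability.

```python
import xml.etree.ElementTree as ET

OWL = "http://www.w3.org/2002/07/owl#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"


def triple_to_axiom(concept: str, relation: str, value: str) -> str:
    """Render a (concept, relation, value) triple as a readable axiom:
    the concept is a subclass of an existential restriction on the relation."""
    return f"{concept} SubClassOf ({relation} some {value})"


def triple_to_rdfxml(concept_id: str, relation_id: str, value_id: str) -> ET.Element:
    """Build the equivalent owl:Class element carrying one
    rdfs:subClassOf / owl:Restriction / owl:someValuesFrom axiom."""
    cls = ET.Element(f"{{{OWL}}}Class", {f"{{{RDF}}}about": f"#{concept_id}"})
    sub = ET.SubElement(cls, f"{{{RDFS}}}subClassOf")
    restriction = ET.SubElement(sub, f"{{{OWL}}}Restriction")
    ET.SubElement(restriction, f"{{{OWL}}}onProperty",
                  {f"{{{RDF}}}resource": f"#{relation_id}"})
    ET.SubElement(restriction, f"{{{OWL}}}someValuesFrom",
                  {f"{{{RDF}}}resource": f"#{value_id}"})
    return cls


# Hypothetical identifiers for the Nance-Horan syndrome example in the text.
axiom = triple_to_axiom("NanceHoranSyndrome", "hasOccurrence", "Congenital")
element = triple_to_rdfxml("NanceHoranSyndrome", "hasOccurrence", "Congenital")
print(axiom)
print(ET.tostring(element, encoding="unicode"))
```

Because each triple becomes an independent subclass axiom, several triples on the same concept behave as a conjunction, which matches the assumption stated in point (4).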

2.2.3 Semantic relations used to define MedDRA concepts

In line with SNOMED-CT, OntoADR makes use of a lightweight ontological formalism (corresponding to EL++), exploiting only basic possibilities made available by OWL. Only existential restrictions (SOME) are used to express the semantics of MedDRA terms, and the properties used to build the definitions are combined only with the logical conjunction operator (AND). These formalization resources are sufficient to enable semantic reasoning and present the advantage of being computable in a reasonable time. One of the difficulties faced in any ontology design project is to find a balance between the expressiveness of the formal language used to describe the semantics of the domain concepts and computation time. Lightweight versions of OWL (OWL-Lite for OWL 1 and OWL-EL for OWL 2) are commonly used in medical informatics [41], because they offer sufficient expressiveness for the majority of concepts used in the medical field and are computable in reasonable time. As the W3C guide describing OWL 2 explains, the EL formalism “is particularly suitable for applications employing ontologies that define very large numbers of classes and/or properties, captures the expressive power used by many such ontologies, and for which ontology consistency, class expression subsumption, and instance checking can be decided in polynomial time.” [42] Choosing a language providing greater expressiveness, such as OWL Full, would have solved some formalization problems for some complex MedDRA terms, e.g. terms using exclusive or (XOR), negations (NOT), or possible (vs. necessary) conditions. But it would have raised computability problems. OntoADR is intended to support pharmacovigilance work in real time, including assistance in querying or grouping MedDRA terms to search for signals in pharmacovigilance databases. It was imperative that we select a modeling language providing reasonable performance.

The semantic properties used to define MedDRA concepts in OntoADR are built on relations used in the medical domain and selected from SNOMED-CT (25 in total). Examples of relations used to express the medical meaning of MedDRA concepts in OntoADR are: HAS-FINDING-SITE, which specifies the body site affected by a condition; HAS-ASSOCIATED-MORPHOLOGY, which describes the morphologic changes seen at the tissue or cellular level that are characteristic features of a disease; or HAS-OCCURRENCE, which refers to the specific period of life during which a condition first presents. A sample of those relations is presented in Table 1 (see section 3.3).

OntoADR is thus composed of two main branches: (1) MedDRA_Hierarchy, comprising the MedDRA concepts and reproducing the original MedDRA hierarchy; this first branch corresponds to the domain of the semantic relations used in OntoADR; (2) SNOMED-CT_Imported_Concepts, comprising concepts coming from SNOMED-CT and used to define the MedDRA concepts of the first branch via the semantic relations available; this second branch corresponds to the range of most of those relations (20 out of 25). The five other relations (ASSOCIATED-WITH, OCCURS-AFTER, DUE-TO, HAS-FOCUS, INTERPRETS) have both the second and the first branch as their range, which means that they can link two MedDRA concepts together.

For instance, the MedDRA PT concept Eyelid bleeding (which is mapped to the SNOMED-CT concept Hemorrhage of eyelid) is defined with the following properties: HAS-ASSOCIATED-MORPHOLOGY SOME 'HEMORRHAGE' AND HAS-FINDING-SITE SOME 'EYELID STRUCTURE'. 'Hemorrhage' and 'Eyelid structure' are SNOMED-CT concepts that have been imported into OntoADR and are thus placed in the SNOMED-CT_Imported_Concepts branch (see Figure 1).

Figure 1 Structure of OntoADR and range of relations in the example of the Eyelid bleeding MedDRA concept.

The original hierarchy of SNOMED-CT is also reproduced in the OntoADR SNOMED-CT_Imported_Concepts branch: for each SNOMED-CT concept used in the formal definition of a MedDRA concept, all ancestors in the SNOMED-CT hierarchy are reproduced (including multiple parents).
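To make the role of the imported ancestors concrete, the sketch below shows how definitions restricted to SOME and AND can be queried once the ancestors of each filler concept are available. It is a toy illustration, not the OntoADR reasoner: the `PARENTS` hierarchy is a hypothetical fragment that does not reproduce actual SNOMED-CT content, and only the Eyelid bleeding definition is taken from the text (the Gastric ulcer definition is an assumed example).

```python
# Hypothetical is-a fragment for imported concepts (child -> list of parents).
PARENTS = {
    "Eyelid structure": ["Eye region structure", "Skin structure"],
    "Eye region structure": ["Body structure"],
    "Skin structure": ["Body structure"],
    "Gastric structure": ["Body structure"],
    "Hemorrhage": ["Morphologic abnormality"],
    "Ulcer": ["Morphologic abnormality"],
}

# Each definition is a conjunction of (relation, filler) existential restrictions.
DEFINITIONS = {
    "Eyelid bleeding": {
        ("hasAssociatedMorphology", "Hemorrhage"),
        ("hasFindingSite", "Eyelid structure"),
    },
    "Gastric ulcer": {  # assumed definition, for illustration only
        ("hasAssociatedMorphology", "Ulcer"),
        ("hasFindingSite", "Gastric structure"),
    },
}


def ancestors(concept, parents=PARENTS):
    """Collect every transitive ancestor, following all parents of each node."""
    seen = set()
    stack = list(parents.get(concept, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def is_subsumed_by(concept, candidate):
    """True if the concept equals the candidate or has it as an ancestor."""
    return concept == candidate or candidate in ancestors(concept)


def retrieve(query):
    """Return the terms whose definition satisfies every (relation, filler)
    conjunct of the query, allowing more specific fillers in the definition."""
    hits = []
    for term, definition in DEFINITIONS.items():
        if all(
            any(rel == q_rel and is_subsumed_by(filler, q_filler)
                for rel, filler in definition)
            for q_rel, q_filler in query
        ):
            hits.append(term)
    return sorted(hits)


print(retrieve({("hasFindingSite", "Eye region structure")}))
print(retrieve({("hasAssociatedMorphology", "Morphologic abnormality")}))
```

The first query succeeds only because the ancestors of 'Eyelid structure' are present in the imported branch, which is the reason for reproducing them in full.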

3 RESULTS

3.1 Converting the MedDRA classification into a subsumption tree

OntoADR was built on the basis of MedDRA 13.0 (version published on 1 March 2010), which was the most recent version of MedDRA available in October 2010. MedDRA 13.0 comprises 26 SOCs, 335 HLGTs, 1,709 HLTs, 18,786 PTs and 68,258 LLTs, which gives a total of 89,114 terms.

The first step in the conception of OntoADR was the conversion of the MedDRA terminology hierarchy into a subsumption tree, i.e. a hierarchical structure where the concepts are related to each other via an IS-A relation (SUBCLASS-OF relations in OWL). This step corresponds to the building of the MedDRA_Hierarchy branch referred to above: only subsumption relations are created between concepts.

The conversion procedure into a subsumption tree is fairly simple: all MedDRA terms, from the PT level to the SOC level, are converted into OWL classes, and hierarchical relations from the PT level to the SOC level are converted into SUBCLASS-OF relations. LLTs are not converted into OWL classes but are integrated into the related PT concept as annotations: <LLTLabel> for current LLTs and <nonCurrentLLTLabel> for non-current LLTs. Only current LLTs are used for coding; non-current LLTs are not, because they are “vague, ambiguous, truncated, abbreviated, out-dated, or misspelled” [13]. Most non-current LLTs derive from terminologies incorporated into MedDRA and are retained to preserve historical data for retrieval and analysis. Except for the LLT level, the original MedDRA hierarchical organization is thus preserved in the subsumption skeleton of OntoADR: the MedDRA_Hierarchy branch reproduces the four MedDRA levels from SOCs to PTs.
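The conversion step can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual OntoADR build script: the `build_hierarchy` function, its input layout and the `http://example.org/ontoadr#` namespace are hypothetical, and the level assigned to each sample class is assumed. The identifiers and labels are those shown in Figure 3.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"
ONTOADR = "http://example.org/ontoadr#"  # hypothetical namespace


def build_hierarchy(terms, parent_links, llts):
    """terms: {class_id: (label, level)} for SOC, HLGT, HLT and PT terms.
    parent_links: [(child_id, parent_id)] taken from the MedDRA hierarchy.
    llts: [(pt_id, label, is_current)], stored as annotations on the PT."""
    root = ET.Element(f"{{{RDF}}}RDF")
    classes = {}
    # 1) Every term from PT up to SOC becomes an OWL class.
    for term_id, (label, level) in terms.items():
        cls = ET.SubElement(root, f"{{{OWL}}}Class",
                            {f"{{{RDF}}}about": f"#{term_id}"})
        ET.SubElement(cls, f"{{{RDFS}}}label").text = label
        ET.SubElement(cls, f"{{{ONTOADR}}}MedDRALevel").text = level
        classes[term_id] = cls
    # 2) Every hierarchical link becomes a subClassOf axiom.
    for child_id, parent_id in parent_links:
        sub = ET.SubElement(classes[child_id], f"{{{RDFS}}}subClassOf")
        ET.SubElement(sub, f"{{{OWL}}}Class",
                      {f"{{{RDF}}}about": f"#{parent_id}"})
    # 3) LLTs are not classes: they are attached to their PT as annotations.
    for pt_id, label, is_current in llts:
        tag = "LLTLabel" if is_current else "nonCurrentLLTLabel"
        ET.SubElement(classes[pt_id], f"{{{ONTOADR}}}{tag}").text = label
    return root


terms = {
    "00000773": ("Ear and labyrinth disorders", "SOC"),
    "00000862": ("Hearing disorders", "HLGT"),   # level assumed
    "00000861": ("Hearing losses", "HLT"),       # level assumed
    "00002429": ("Hearing impaired", "PT"),
}
parent_links = [("00002429", "00000861"),
                ("00000861", "00000862"),
                ("00000862", "00000773")]
llts = [("00002429", "Auditory disorder", True),
        ("00002429", "Other abnormal auditory perception", False)]

root = build_hierarchy(terms, parent_links, llts)
print(ET.tostring(root, encoding="unicode"))
```

Keeping LLTs as annotations means they remain searchable by label while staying outside the subsumption tree, which mirrors the choice described above.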

Figure 2 First step of OntoADR creation: the conversion of MedDRA terms to OWL classes and the creation of a subsumption tree based on the original MedDRA hierarchy. * Most MedDRA hierarchical relations are of the is-a type, but they can also be different, e.g. they can correspond to is-related-to, is-generally-a, is-sometimes-a, is-sign-of, or is-synonym-of relations. Those relations are partially determined by the categories, pragmatic imperatives and ways of thinking prevailing in domain actors’ practices (see section 4.1).

We might have chosen to break the original MedDRA hierarchy in order to build OntoADR. That would have solved the problems related to inappropriate property inheritance (due to the inconsistency of some subsumption relations in OntoADR) described below (see Section 4.1). However, building a new hierarchy from scratch to reorganize MedDRA (more than 20,000 terms without taking the LLTs into account) would have required a drastic conceptualization effort and human resources we did not have. Moreover, it appears that the majority of hierarchical relations in MedDRA correspond to actual subsumption relations, i.e., to put it in terms of class extensions, to inclusions of sets.

For instance, the following subsumption relations between MedDRA terms are totally valid and result from a simple conversion of the original MedDRA hierarchy relations:

Supraventricular tachyarrhythmia [PT]
    IS-A Supraventricular arrhythmias [HLT]
        IS-A Cardiac arrhythmias [HLGT]
            IS-A Cardiac disorders [SOC]

The choice of not converting LLTs into concepts (i.e. OWL classes) is motivated by the fact that most LLTs are pure synonyms or lexical variants of the PT they relate to [13]. For example, the LLTs Diabetes with renal manifestations and Diabetic renal disease are synonyms of the PT Diabetic nephropathy. LLTs may also be more granular terms [8,13]. For example, the LLT Type II diabetes mellitus with renal manifestations is one kind of Diabetic nephropathy. In the latter case, a conversion of the LLT into an OWL concept would have been legitimate. But this operation would have required distinguishing manually which LLTs are synonyms of the related PTs and which are more specific concepts (MedDRA unfortunately conflates the two cases)⁴. Considering that MedDRA 13.0 contains almost 70,000 LLTs, such an operation would have required a huge amount of work. This task should however be performed in the future to achieve a more advanced “ontologized” version of MedDRA.

Each MedDRA concept in OntoADR is identified by a URI: the HTTP address of the server used by OntoADR followed by a numerical sequence of 8 digits. MedDRA is not currently distributed as an ontology and we do not have the right to publish OntoADR without the MSSO’s agreement. Should the MSSO decide to distribute an ontologized version of MedDRA, it would have to define the URIs. The information recorded for each MedDRA concept can be seen in Figure 3. Note that all the annotation tags except the standard <rdfs:label> tag (available in OWL) have been specifically created for OntoADR.

Figure 3 OWL representation of Hearing impaired MedDRA term in OntoADR. Only an excerpt of the formal definition is displayed.

⁴ One also finds cases where LLTs correspond not to narrower but to broader categories than the PTs subsuming them. This is for instance the case of the LLT “Typhoid and paratyphoid fevers”, which is below the PT “Typhoid fever”, but corresponds to a broader category of disease. For instance, “Typhoid fever” is under “Typhoid and paratyphoid fevers” in the ICD-10 and SNOMED-CT hierarchies. Such cases raise difficulties when converting MedDRA hierarchical relations to subsumption relations, since in that case it is the PT that has an is-a relation to the LLT, not the inverse.

<owl:Class rdf:about="#00002429">
    <rdfs:label xml:lang="en">Hearing impaired</rdfs:label>
    <idMedDRA>10019245</idMedDRA>
    <MedDRALevel>PT</MedDRALevel>
    <idWhoart>1368004</idWhoart>
    <idSnomed>15188001</idSnomed>
    <umlsCui>C1384666</umlsCui>

    <LLTLabel xml:lang="en">Auditory disorder</LLTLabel>
    <LLTLabel xml:lang="en">Auditory disorder (NOS)</LLTLabel>
    <LLTLabel xml:lang="en">Abnormal auditory perception, unspecified</LLTLabel>
    <LLTLabel xml:lang="en">Auditory disorder NOS</LLTLabel>
    <LLTLabel xml:lang="en">Impairment of auditory discrimination</LLTLabel>
    <LLTLabel xml:lang="en">Hearing impairment aggravated</LLTLabel>

    <nonCurrentLLTLabel xml:lang="en">Other abnormal auditory perception</nonCurrentLLTLabel>
    <snomedPrefLabel xml:lang="en">Hearing loss</snomedPrefLabel>
    <snomedFSNLabel xml:lang="en">Hearing loss (disorder)</snomedFSNLabel>
    <snomedSynLabel xml:lang="en">Hypoacusis</snomedSynLabel>
    <snomedSynLabel xml:lang="en">Impaired hearing</snomedSynLabel>

    <rdfs:subClassOf>
        <owl:Class rdf:about="#00000861"/> <!-- URI for “Hearing losses” -->
    </rdfs:subClassOf>
    <rdfs:subClassOf>
        <owl:Restriction>
            <owl:onProperty rdf:resource="#hasFindingSite"/>
            <owl:someValuesFrom rdf:resource="#ImportedSnomedCTConcept5860"/> <!-- URI for “Structure of auditory system” -->
        </owl:Restriction>
    </rdfs:subClassOf>
    <rdfs:subClassOf>
        <owl:Restriction>
            <owl:onProperty rdf:resource="#hasForPrimarySOC"/>
            <owl:someValuesFrom rdf:resource="#00000773"/> <!-- URI for “Ear and labyrinth disorders” -->
        </owl:Restriction>
    </rdfs:subClassOf>
    <rdfs:subClassOf>
        <owl:Restriction>
            <owl:onProperty rdf:resource="#hasForSnomedctParent"/>
            <owl:someValuesFrom rdf:resource="#00000862"/> <!-- URI for “Hearing disorders” -->
        </owl:Restriction>
    </rdfs:subClassOf>
</owl:Class>

For each PT concept created, a property using the HAS-FOR-PRIMARY-SOC relation is also created, linking the PT concept to its MedDRA primary SOC concept. When a PT belongs to several SOCs, which means that it is present in several parts of the hierarchy, the PT concept is defined as subsumed by all corresponding parent classes: several SUBCLASS-OF relations are created with the different HLT concepts. But only one HAS-FOR-PRIMARY-SOC property is created, since by definition each PT can only have one primary SOC.
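The multi-parent rule above can be pictured with a minimal Python sketch. It is not OntoADR's actual data model: the class `PTConcept`, the helper `add_pt` and the labels "HLT A", "SOC X", etc. are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PTConcept:
    """Toy stand-in for a PT class (hypothetical, not OntoADR's data model)."""
    label: str
    primary_soc: str                               # the single HAS-FOR-PRIMARY-SOC filler
    hlt_parents: set = field(default_factory=set)  # one SUBCLASS-OF link per HLT


def add_pt(label, placements, primary_soc):
    """Create a PT from its (HLT, SOC) placements in the hierarchy.

    Every HLT becomes a parent class, but only one primary SOC is recorded.
    """
    socs = {soc for _hlt, soc in placements}
    if primary_soc not in socs:
        raise ValueError("primary SOC must be one of the SOCs the PT belongs to")
    return PTConcept(label, primary_soc, {hlt for hlt, _soc in placements})


# Placeholder labels only: a PT present in two branches of the hierarchy.
pt = add_pt(
    "Example PT",
    placements=[("HLT A", "SOC X"), ("HLT B", "SOC Y")],
    primary_soc="SOC X",
)
```

The PT ends up with two parent classes and a single primary SOC, which mirrors the asymmetry described in the paragraph.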

3.2 Formalizing the semantics of MedDRA concepts

A key step in the design of OntoADR was the formalization of the semantics5. As explained above, part of this work was achieved by reusing semantic information coming from SNOMED-CT. A first step was consequently to identify for each MedDRA term a mapping candidate in SNOMED-CT. Three main sources of mapping were used: (1) the UMLS Metathesaurus; (2) other available mapping resources, especially the mapping propositions from Nadkarni & Darer [11]; (3) manually defined mappings. All mapping propositions from sources (1) and (2) were validated by knowledge engineers and pharmacovigilance experts from our team. Manual validation included mapping selection when several SNOMED-CT candidates were available. In UMLS, terms from different vocabularies are linked together by association to a UMLS concept identified by a Concept Unique Identifier (CUI). A UMLS concept can be associated with several SNOMED-CT concepts which either correspond exactly to the UMLS concept or are more specific. For example, the MedDRA term Spondylitis is mapped to three SNOMED-CT concepts through the UMLS: Undifferentiated spondylitis, Inflammatory spondylopathy and Spondylitis. In such cases, only one of the proposed SNOMED-CT concepts was chosen to build the MedDRA term definition. The first element we take into account to make the selection is the label. Most of the time, string matching makes the selection obvious (in the previous example, Spondylitis was chosen). However, in other cases the comparison of the labels is not enough. We then take into account the medical relevance of the mapping (as evaluated by our expert) and/or the formal definitions of the candidates, favouring the more accurately defined SNOMED-CT concepts.
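The label-based first pass of the candidate selection can be sketched as follows. This is only an illustration under simple assumptions (case-insensitive exact matching); the function name `select_candidate` is hypothetical, and returning `None` stands for the cases that the text says are left to expert judgment.

```python
def select_candidate(meddra_label, candidate_labels):
    """Return the unique candidate whose label equals the MedDRA label
    (ignoring case and surrounding spaces), or None when label comparison
    alone cannot decide and expert review is needed."""
    target = meddra_label.strip().lower()
    exact = [c for c in candidate_labels if c.strip().lower() == target]
    return exact[0] if len(exact) == 1 else None


# Candidates proposed through the UMLS for the MedDRA term "Spondylitis".
candidates = [
    "Undifferentiated spondylitis",
    "Inflammatory spondylopathy",
    "Spondylitis",
]
chosen = select_candidate("Spondylitis", candidates)
```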

Once a mapping between a MedDRA concept CMed and a SNOMED-CT concept CSno is decided, the following changes are made to OntoADR (fully-automated procedure):

1) The semantic properties of CSno are imported into OntoADR to define CMed. For example, the MedDRA PT Acanthamoeba infection was mapped to the SNOMED-CT concept Infection by Acanthamoeba, which is defined with the two following properties in SNOMED-CT: CAUSATIVE AGENT ACANTHAMOEBA and PATHOLOGICAL PROCESS PARASITIC PROCESS. The MedDRA concept Acanthamoeba infection is thus defined with the following axiom in OntoADR: HAS-CAUSATIVE-AGENT SOME 'ACANTHAMOEBA' AND HAS-PATHOLOGICAL-PROCESS SOME 'PARASITIC PROCESS'.

2) If the SNOMED-CT concepts used by the previous semantic relations (in that case: Acanthamoeba and Parasitic process) have not been created yet in the SNOMED-CT_Imported_Concepts branch of OntoADR, they are automatically created, and their position in the SNOMED-CT original hierarchy is duplicated: which means that all the ancestor concepts (including multiple parents) of the considered concepts in SNOMED-CT hierarchy are also created, with their original subsumption relations. The following restriction is also applied in the procedure of SNOMED-CT properties import: some SNOMED-CT concepts have properties whose object is the top-concept subsuming the different SNOMED-CT concepts defining the range of the relation. This is typically the case with the SEVERITY, CLINICAL COURSE and EPISODICITY relations: a lot of concepts of disorders in SNOMED-CT have e.g. the property CLINICAL COURSE COURSES. Such a property provides no information on the meaning of the concept, except the fact that the considered disorder has a clinical course (but without telling which kind). To avoid overloading OntoADR, this kind of properties was not imported in the MedDRA concepts definitions in OntoADR.

3) The terminological information of CSno available in SNOMED-CT (preferred label, synonyms, etc.) is recorded in the annotations of CMed: <SNOMEDPREFLABEL>, <SNOMEDFSNLABEL>, <SNOMEDSYNLABEL> (one annotation is created for each synonym), as well as the Snomed-CT identifier of CSno <IDSNOMED>. When the mapping between CSno and CMed comes from UMLS, a <UMLSCUI> annotation is created to record the identifier of the corresponding UMLS concept (see Figure 3).

4) The SNOMED-CT parent(s) of CSno is/are created in the SNOMED-CT_Imported_Concepts branch of OntoADR following the procedure described above (see point 2). And a property HAS-FOR-SNOMEDCT-PARENT SOME ‘CSNO PARENT’ (several relations when CSno has several parents) is added in CMed definition. The HAS-FOR-SNOMEDCT-PARENT relation is potentially useful for terminological reasoning because it allows selecting MedDRA terms from hierarchical relations (subsumption) of SNOMED-CT concepts with which these terms have been mapped. For example, one can select all MedDRA concepts mapped to SNOMED-CT concepts that are children of the same parent.

5 The different methods we used to build MedDRA terms definitions in OntoADR and the issues related to their application will be the subject of a forthcoming article. Here, we describe only their main principle.
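Steps 1, 2 and 4 of the import procedure described above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the dictionaries are a toy excerpt, every name marked "(placeholder)" is invented, and the CLINICAL COURSE property is attached to the example concept only to exercise the filter on uninformative top-concept fillers.

```python
# Toy excerpt: child -> parents. Names marked "(placeholder)" are invented.
PARENTS = {
    "Acanthamoeba": ["Protozoa (placeholder)"],
    "Protozoa (placeholder)": ["Organism (placeholder)"],
    "Parasitic process": ["Pathological process (placeholder)"],
    "Infection by Acanthamoeba": ["Protozoan infection (placeholder)"],
}
PROPERTIES = {
    "Infection by Acanthamoeba": [
        ("CAUSATIVE AGENT", "Acanthamoeba"),
        ("PATHOLOGICAL PROCESS", "Parasitic process"),
        ("CLINICAL COURSE", "Courses"),  # assumption, added to show the filter
    ],
}
# Relations whose filler may be the uninformative top concept of their range.
RANGE_TOP = {"CLINICAL COURSE": "Courses"}


def ancestors(concept):
    """All ancestors of a concept, following every parent."""
    seen, stack = set(), list(PARENTS.get(concept, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def import_mapping(c_sno):
    """Return (axioms defining CMed, SNOMED-CT concepts to create)."""
    axioms, imported = [], set()
    for relation, filler in PROPERTIES.get(c_sno, []):
        if RANGE_TOP.get(relation) == filler:
            continue  # skip properties that carry no information
        axioms.append(("HAS-" + relation.replace(" ", "-"), filler))
        imported |= {filler} | ancestors(filler)
    for parent in PARENTS.get(c_sno, []):
        axioms.append(("HAS-FOR-SNOMEDCT-PARENT", parent))
        imported |= {parent} | ancestors(parent)
    return axioms, imported


axioms, imported = import_mapping("Infection by Acanthamoeba")
```

Each imported filler brings its whole ancestor chain with it, which is what allows later queries to reason over the SNOMED-CT hierarchy rather than over isolated concepts.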

Besides the use of SNOMED-CT mappings, two additional methods were used to formalize MedDRA terms semantics in OntoADR:

1) Purely manual completion of the formal definition of MedDRA concepts that could not be mapped through UMLS or other available mapping resources. For reasons of time and human resources, only 876 PTs without mappings were defined manually: we focused on 13 adverse drug events categories identified as main pharmacovigilance targets by Trifirò et al. [43], e.g. Cardiac valve fibrosis, Confusional state, Upper gastrointestinal bleeding, or Maculo-papular erythematous eruptions. This procedure ensured that for those sensitive pharmacovigilance topics the relevant MedDRA terms were defined in OntoADR and could consequently be manipulated by semantic reasoning algorithms.

2) Automatic enrichment methods adding definitional properties to MedDRA concepts on the basis of a syntactic analysis of their label. For example, if the string “pain” or “algia” is detected in the <LABEL> of the MedDRA concept, the property HAS-DEFINITIONAL-MANIFESTATION SOME PAIN is automatically added to its formal definition. Similarly, if the string “perforation” is detected, the property HAS-ASSOCIATED-MORPHOLOGY SOME PERFORATION is added. All generated properties were validated by an expert and illegitimate properties were deleted. This method was applied to 11 of the 25 SNOMED-CT relations used in OntoADR, with 82 different string searches in the <LABEL> of MedDRA terms. In total, 7691 new properties were created using this method.
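The label-based enrichment can be illustrated with a short Python sketch. Only the three string rules quoted above are included (the source mentions 82 searches without listing them), the function name `propose_properties` is hypothetical, and the labels used in the example are illustrative strings rather than verified MedDRA content.

```python
# (substring to search, relation, filler), limited to the rules quoted in the text.
RULES = [
    ("pain", "HAS-DEFINITIONAL-MANIFESTATION", "Pain"),
    ("algia", "HAS-DEFINITIONAL-MANIFESTATION", "Pain"),
    ("perforation", "HAS-ASSOCIATED-MORPHOLOGY", "Perforation"),
]


def propose_properties(label):
    """Propose definitional properties from substrings of a term label.

    The result is only a list of candidates: each proposal still has to be
    validated by an expert, as plain substring search yields false positives.
    """
    text = label.lower()
    proposals = []
    for needle, relation, filler in RULES:
        candidate = (relation, filler)
        if needle in text and candidate not in proposals:
            proposals.append(candidate)
    return proposals


for label in ["Neuralgia", "Intestinal perforation", "Exposure to paint"]:
    print(label, "->", propose_properties(label))
```

The last label is an invented string in which "pain" appears inside another word; it receives a pain-related proposal, which shows why the expert validation step mentioned above is needed.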

It is important to note that the first objective of OntoADR is to define the MedDRA PT level, and only secondarily the terms of upper levels (i.e. SOC, HLGT and HLT). Most higher-level terms have been defined semi-automatically based on the mapping with SNOMED-CT terms, but we did not focus our manual definition efforts on those MedDRA levels. Similarly, our goal was not to build an exhaustive formal definition of each PT concept, but only to formally represent the potentially useful semantic components for terminological reasoning in pharmacovigilance. However, nothing prevents such exhaustiveness, which can be planned and achieved in future developments of OntoADR.

3.3 Degree of semantic completion of OntoADR

OntoADR6 includes 34,994 concepts using a total of 150,491 asserted MedDRA definitional axioms (i.e. average of 4.5 definitional axioms per MedDRA concept). Most concepts (59%) come from MedDRA 13.0. The other concepts (41%) come from SNOMED-CT (July 2010 Release).

Thanks to UMLS or other mapping methods and manual definitions, 56% of MedDRA 13.0 terms (LLTs excluded) were defined using a direct mapping with a SNOMED-CT concept or a handmade definition. However, this figure is not fully representative because those terms account for a larger share of the terms commonly used in pharmacovigilance coding. In particular, they cover 97% of the MedDRA terms commonly used in the FDA AERS database (period 2004-2010, 11 million ADRs). In fact, many more MedDRA concepts in OntoADR have formal properties describing their meaning, thanks to: (1) the inheritance mechanism (concepts inherit the semantic properties defining the concepts subsuming them in the ontological hierarchy); (2) the process of automatic enrichment of concept properties based on the syntactic analysis of their labels.

Twenty-five semantic relations coming from SNOMED-CT are used in OntoADR to express the meaning of MedDRA terms (see Table 1). The most used are HAS-FINDING-SITE and HAS-ASSOCIATED-MORPHOLOGY, which are particularly useful when reasoning on ADR term semantics, for instance to query MedDRA terms expressing the same kind of disorders but distributed in different branches of the MedDRA hierarchy [23,44].

6 In this paper we present OntoADR v.0.1. The development of OntoADR is currently in progress. Updated versions will include more SNOMED-CT concepts and additional axioms of definition of MedDRA concepts.

RELATION | DESCRIPTION | Nb of uses in OntoADR
HAS-FINDING-SITE | Body site affected by a condition. | 12,226
HAS-ASSOCIATED-MORPHOLOGY | Morphologic changes seen at the tissue or cellular level that are characteristic features of a disease. | 8,551
INTERPRETS | Refers to the entity being evaluated or interpreted, when an evaluation, interpretation or “judgment” is intrinsic to the meaning of a concept. | 4,470
HAS-INTERPRETATION | Designates, together with the INTERPRETS relation, the judgment aspect being evaluated or interpreted for a concept. | 4,046
HAS-METHOD | Represents the action being performed to accomplish the procedure. | 3,339

Table 1 Sample of SNOMED-CT relations used to express the medical meaning of MedDRA concepts in OntoADR. Only the most used relations are represented. The number of times the relation is used is specified in the third column. The description of the relations is adapted from SNOMED-CT User Guide [40].

4 DISCUSSION

4.1 Can the MedDRA hierarchy be converted into a subsumption tree?

The direct conversion of the MedDRA hierarchy into a subsumption tree sometimes causes semantic inconsistencies. This is a problem because the applicability of the logic of inheritance, one of the main added values of ontological representation, depends on the correctness of subsumption relations.

It is well known that terminology ontologization reveals semantic inconsistencies [34,35,45-48]. Schultz et al. [34] have for instance demonstrated that modeling errors resulting from falsely interpreting existential restrictions are frequent in the OWL-DL representation of the NCI Thesaurus: they took a random sample of 354 axioms using the SOME-VALUES-FROM operator and asked two domain experts to rate their correctness. Their results show that “roughly half of these examples, and in consequence more than 76,000 axioms in the OWL-DL version, make incorrect assertions if interpreted according to description logics semantics”. Such an erroneous modelization renders “most logic-based reasoning unreliable”.

The same goes for ontological principles other than subsumption. Terminologies obey organizational principles that are less rigorous and more implicit than those of ontologies. In medical terminologies such as MedDRA or ICD, most high-level categories mirror domain actors’ practices (e.g. they follow the different medical specialties one can find in a standard hospital), and for that reason these terminologies do not apply homogeneous semantic principles when building the hierarchical organization of concepts. Most hierarchical relations correspond to standard generality-specificity taxonomic relations that are convertible into formal IS-A subsumption relations. But in other cases, the hierarchical organization reflects a simple relation of semantic proximity, or any other relation of interest to the domain actors’ practices.

The most obvious example of such flexibility in MedDRA (which becomes inconsistency when considered from an ontologist point of view) is the way symptom concepts are positioned in the MedDRA hierarchy. Groups of symptoms (HLGT or HLT) are placed under the general categories of disorders they are the symptom of (SOC or HLGT). This immediately raises problems when the MedDRA hierarchy is converted into a subsumption tree, because in no way can the relation BEING-A-SYMPTOM-OF be treated as an IS-A relation. A symptom of disease X is not a kind of disease X. The principle of heritability of semantic properties cannot be applied: the subsumed concept (the symptom) cannot be given the semantic properties of the subsuming concept (the disease indicated by this symptom). For instance, the HLGT Cardiac disorder signs and symptoms is under the SOC Cardiac disorders, but not all terms of this HLGT are cardiac disorders: Dyspnoea, Dizziness or Syncope (PT level terms) refer to conditions that may be caused by disorders affecting the heart or the cardiovascular system, and can be considered as symptoms of such troubles in the context of a diagnostic procedure, but strictly speaking they are not cardiac disorders. The subsumption relation is not valid: those PTs cannot inherit the semantic properties defining the Cardiac disorders SOC concept, for instance HAS-FINDING-SITE SOME HEART STRUCTURE.

To meet the formal requirements of ontologies, we needed to break the original MedDRA hierarchy. We created a concept called Signs and Symptoms at the root of the MedDRA_Hierarchy branch into which we moved each sign and symptom HLGT and HLT concept (together with the PTs they subsume). To keep track of the original hierarchical relation, we assigned to each moved concept an IS-SIGN-OR-SYMPTOM-OF property pointing to the associated SOC or HLGT.
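The relocation just described can be sketched on a toy hierarchy. The sketch is an illustration only: the function names are hypothetical, each concept is given a single parent for simplicity, and the intermediate HLT level between the HLGT and the PT Dyspnoea is omitted.

```python
SIGNS_ROOT = "Signs and Symptoms"


def relocate(parent_of, sign_symptom_groups):
    """Move sign and symptom groups under SIGNS_ROOT and keep a trace of
    their original parent as an IS-SIGN-OR-SYMPTOM-OF link."""
    new_parent_of = dict(parent_of)
    sign_or_symptom_of = {}
    for group in sign_symptom_groups:
        sign_or_symptom_of[group] = parent_of[group]
        new_parent_of[group] = SIGNS_ROOT
    return new_parent_of, sign_or_symptom_of


def ancestors(concept, parent_of):
    """Chain of ancestors of a concept in a single-parent toy hierarchy."""
    chain = []
    while concept in parent_of:
        concept = parent_of[concept]
        chain.append(concept)
    return chain


# Toy excerpt of the original hierarchy (child -> parent).
parent_of = {
    "Cardiac disorder signs and symptoms": "Cardiac disorders",
    "Dyspnoea": "Cardiac disorder signs and symptoms",
}
new_parent_of, sign_or_symptom_of = relocate(
    parent_of, ["Cardiac disorder signs and symptoms"]
)
```

After the move, Dyspnoea no longer has Cardiac disorders among its ancestors, so it cannot inherit properties such as a heart-related finding site, while the original link remains available through the recorded property.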

Table 1 (continued)

RELATION | DESCRIPTION | Nb of uses in OntoADR
(HAS-METHOD, description continued) | Does not include the surgical approach, equipment, or physical forces. |
HAS-PATHOLOGICAL-PROCESS | Provides information about the underlying pathological process for a disorder. | 2,092
HAS-CAUSATIVE-AGENT | Identifies the direct causative agent of a disease. Does not include vectors. | 1,962
HAS-OCCURRENCE | Refers to the specific period of life during which a condition first presents. | 1,633
HAS-DEFINITIONAL-MANIFESTATION | Links disorders to the manifestations (observations) that define them. | 1,275
HAS-SEVERITY | Used to specify the severity of a clinical finding. | 784
HAS-PROCEDURE-SITE | Describes the body site acted on or affected by a procedure. | 723
HAS-COMPONENT | Refers to what is being observed or measured by a procedure. | 699
HAS-CLINICAL-COURSE | Used to represent both the course and onset of a disease. | 564
HAS-DIRECT-SUBSTANCE | Describes the substance or pharmaceutical / biologic product on which the procedure's method directly acts. | 181

The same applies to MedDRA terms referring to complications. For example, the HLT Renal failure complications is positioned under Renal disorders (excl nephropathies) HLGT, while the complications associated with renal failure may not be kidney dysfunctions (but the consequences of such dysfunctions). For instance, the PT Low turnover osteopathy refers to a dysfunction of bone metabolism, not of the kidney.

Subsumption inconsistencies have also arisen with other types of MedDRA concepts, more isolated and therefore more difficult to detect. For example, the MedDRA PT Sudden death belongs to Death and sudden death HLT, but also to Ventricular arrhythmias and cardiac arrest HLT, and, through the latter, to Cardiac arrhythmias HLGT. The conversion of this hierarchical relation into an IS-A relation is problematic because Cardiac arrhythmias is defined in OntoADR (due to the mapping procedure described in section 3.2) with the HAS-FINDING-SITE SOME ‘HEART STRUCTURE’ property, which the Sudden death PT cannot legitimately inherit. Indeed, (1) the medical concept Sudden death may refer to types of death that are not of cardiac origin, even if heart failure is a condition of death from a clinical point of view. This is evidenced by the dual parenthood of Sudden death PT in the MedDRA hierarchy and the fact that it includes LLTs like Sudden death, cause unknown or Sudden death unexplained. Note that a PT Cardiac death already exists in the Ventricular arrhythmias and cardiac arrest HLT. And (2) even for strict cardiac related deaths, it seems semantically incorrect to attribute to the phenomenon of sudden death an anatomical location restricted to the heart, as death concerns the whole body or the person considered as a psychosocial entity (the whole person dies, not only the heart).

To solve this kind of inheritance problem, we needed to reach a compromise between the formal consistency requirement and the constraint of maintaining the original MedDRA hierarchy. When the conversion of MedDRA hierarchical relations into subsumption relations led to illegitimate property inheritance, we simply removed the properties at level n (the subsuming concept) which could not legitimately be transferred to level n-1 (the subsumed concept). This method, although questionable from a formal point of view (since one removes properties that are valid when considering the meaning of the ‘level n’ concept in isolation), ensures the consistency of semantic reasoning (all inherited properties are semantically valid). In addition, we used this method only in relatively isolated cases; the majority of MedDRA hierarchical relations are convertible into subsumption relations without semantic inconsistency.
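The effect of this compromise on inheritance can be sketched with the Sudden death example discussed above. The code is an illustration under stated assumptions: the function `inherited_properties` is hypothetical, the hierarchy is a toy excerpt, and whether this exact property was removed from Cardiac arrhythmias in OntoADR is not stated in the text.

```python
# Toy excerpt of the subsumption tree (child -> parents).
PARENTS = {
    "Sudden death": [
        "Death and sudden death",
        "Ventricular arrhythmias and cardiac arrest",
    ],
    "Ventricular arrhythmias and cardiac arrest": ["Cardiac arrhythmias"],
}

HEART = ("HAS-FINDING-SITE", "Heart structure")


def inherited_properties(concept, asserted):
    """Properties asserted on a concept plus those of all its ancestors."""
    props = set(asserted.get(concept, ()))
    for parent in PARENTS.get(concept, []):
        props |= inherited_properties(parent, asserted)
    return props


# Before the fix: the property sits on the subsuming concept (level n).
asserted_before = {"Cardiac arrhythmias": {HEART}}
# After the fix: the property is removed at level n (assumed for illustration).
asserted_after = {"Cardiac arrhythmias": set()}

before = inherited_properties("Sudden death", asserted_before)
after = inherited_properties("Sudden death", asserted_after)
```

Removing the property at level n keeps the hierarchy untouched but also deprives every other descendant of that property, which is the cost the paragraph acknowledges.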

It should be noted that the SKOS vocabulary formalism could have been used to formalize hierarchical relations between MedDRA terms. SKOS provides some of the advantages of formal semantics (e.g. it makes it possible to use transitivity to make inferences), but it is less strict than OWL in terms of formal constraints. Typically, the SKOS:BROADMATCH and SKOS:BROADER relations can be used to express a hierarchical relation (A IS-MORE-GENERAL-THAN B) without constraining its meaning to subsumption (B IS-A A). Such interpretive freedom of course has a cost: it is less powerful computationally. Whether SKOS or OWL should be used has to be decided case by case, depending on the level of semantic precision that is achievable (and desirable with regard to end users’ needs) when formalizing the content of the source terminology. Although some parts of the MedDRA hierarchy are not based on subsumption relations and would benefit from the SKOS formalism, we favoured the OWL option in order to simplify property inheritance and classification.

4.2 Why Snomed-CT?

One may ask whether it is justified to use SNOMED-CT, which aims at covering the whole medical field from psychiatry to cardiac surgery, to represent the circumscribed field of pharmacovigilance. In principle, pharmacovigilance is only concerned with clinical observations that may be causally related to drugs. Several arguments can be made to justify this choice:

1) The semantic relations needed to define concepts in the field of pharmacovigilance generally do not differ from those used in other medical fields.

2) One of SNOMED-CT’s objectives is to meet requirements of a reference terminology for healthcare. According to the IHTSDO (International Health Terminology Standards Development Organisation), the not-for-profit association owning and maintaining SNOMED-CT, it is “the most comprehensive, multilingual clinical terminology in the world [and] a vital component for safe and effective communication and reuse of meaningful health information” [28]. SNOMED-CT has the advantage of covering to a large extent, if not entirely, other standard medical terminologies such as ICD-10. Efforts are also currently made to harmonize SNOMED-CT and the upcoming ICD-11 [49].

3) SNOMED-CT is the most complete and most detailed terminology of medicine with an ontological foundation currently on the market [10]. It is also the most likely terminology to cover MedDRA. In the UMLS Metathesaurus, about 50% of MedDRA terms have one or more corresponding SNOMED-CT concepts [10,11,32], a degree of coverage that, to our knowledge, no other current medical ontology is able to match.

It should be further noted that the methodology we used to build OntoADR can be reused to provide terminologies other than MedDRA with formal representations of semantics. One major constraint, however, is that the hierarchical relations in the source terminology must be convertible to subsumption relations (see section 4.1). In principle, our method can also be deployed with formal semantic resources other than SNOMED-CT. The only prerequisite is that such resources must be sufficiently large and comprehensive to cover the needs of the terminology to be formalized. Considering the current offer in terms of semantic resources for the biomedical domain, SNOMED-CT nevertheless appears as the best choice.

4.3 Formalizing MedDRA term semantics, what for?

The main added value of formalizing MedDRA is to enable automatic selection of terms based on their semantics. OWL queries can be designed to select MedDRA terms on the basis of semantic criteria. For instance the query: HAS-ASSOCIATED-MORPHOLOGY SOME 'HEMORRHAGE' AND HAS-FINDING-SITE SOME 'UPPER GASTROINTESTINAL STRUCTURE' can be used to select MedDRA terms that are defined in OntoADR as hemorrhages of the upper parts of the gastrointestinal structure [44].
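The logic of such a query can be mimicked in a few lines of Python. This is a toy evaluation, not an OWL reasoner: the function names, the small is-a table and the term definitions are all assumptions made for illustration and do not reproduce OntoADR or SNOMED-CT content.

```python
# Toy is-a links between anatomical fillers (child -> parents); assumed values.
IS_A = {
    "Stomach structure": ["Upper gastrointestinal structure"],
    "Rectum structure": ["Lower gastrointestinal structure"],
}

# Toy definitions of three terms as sets of (relation, filler); assumed values.
DEFINITIONS = {
    "Gastric haemorrhage": {
        ("HAS-ASSOCIATED-MORPHOLOGY", "Hemorrhage"),
        ("HAS-FINDING-SITE", "Stomach structure"),
    },
    "Rectal haemorrhage": {
        ("HAS-ASSOCIATED-MORPHOLOGY", "Hemorrhage"),
        ("HAS-FINDING-SITE", "Rectum structure"),
    },
    "Gastric perforation": {
        ("HAS-ASSOCIATED-MORPHOLOGY", "Perforation"),
        ("HAS-FINDING-SITE", "Stomach structure"),
    },
}


def subsumed_by(concept, ancestor):
    """True if concept equals ancestor or reaches it through is-a links."""
    if concept == ancestor:
        return True
    return any(subsumed_by(parent, ancestor) for parent in IS_A.get(concept, []))


def select_terms(query):
    """Terms having, for every (relation, filler) criterion, a property with
    the same relation and a filler subsumed by the criterion's filler."""
    return sorted(
        term
        for term, props in DEFINITIONS.items()
        if all(
            any(rel == q_rel and subsumed_by(filler, q_filler) for rel, filler in props)
            for q_rel, q_filler in query
        )
    )


selected = select_terms([
    ("HAS-ASSOCIATED-MORPHOLOGY", "Hemorrhage"),
    ("HAS-FINDING-SITE", "Upper gastrointestinal structure"),
])
```

The term defined with a stomach finding site is retrieved even though the query names only the broader anatomical region, which is the benefit of importing the ancestors of each filler.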

Such functionality might support the coding of ADR cases by enabling health professionals to select the most appropriate terms in MedDRA. Currently, only syntactic queries (searching on the string of terms) are possible, for example with the MedDRA Browser Desktop published by the MSSO [2]. A semantic query tool could decrease substantially the time needed for terms selection and improve the precision of the terms used to describe ADRs.

The possibility of semantic reasoning on MedDRA may also play a decisive role in the knowledge extraction process. One potential limitation of most statistical approaches for signal detection is their confinement to purely quantitative computation: they do not take into account the semantic level of information present in case reports. They will for instance ignore the semantic proximity between ADR terms such as Epidermolysis bullosa, Stevens-Johnson syndrome or Bullous impetigo, which are different forms of bullous eruptions, and will not be able to process case reports where those ADRs are associated with the same drug as different occurrences of the same ADR type. In addition, the finer granularity of MedDRA terms compared to other terminologies implies fewer occurrences of a given drug-event association within the database, which further lowers the performance of automated signal detection algorithms [23]. To compensate for this lack of semantics in the processing of pharmacovigilance data, SMQs have been introduced in MedDRA to group together terms with similar meaning. SMQs are collections of PTs assembled from various HLTs that relate to a common clinical situation but are not necessarily hierarchically related and therefore cannot easily be identified on the basis of the original MedDRA hierarchy [2]. They have been developed to be used in information retrieval systems (1) to ensure that cases are not inadvertently missed when searching a database, and (2) to enhance signal detection [6], although their efficiency has not yet been demonstrated: [50] and [51] have only demonstrated limited value of SMQs compared with the performance of algorithms using single PTs.

However, SMQ development raises important difficulties. SMQs are currently created manually by MSSO experts, and once defined, they are not intended to be customized. The existing SMQs do not cover all issues related to drugs. Their granularity is not always appropriate: for a given ADR being investigated, SMQs can be too fine or too coarse. Because experts have (even slightly) different understandings of the medical condition targeted by the SMQ, the kind of terms and the rationale for their selection differ from one SMQ to another. This means that from one group of experts to another, for the same ADR topic, the list of selected MedDRA terms will be different. Because SMQs are built manually, they could also miss relevant MedDRA terms.

For those different reasons, the development of methods for automatic or semi-automatic selection of MedDRA terms on the basis of semantic information is highly desirable. An automation of the process of SMQs creation, even partial, could allow an important saving of time. That is precisely the kind of functionalities OntoADR could support.
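To make the granularity argument concrete, the sketch below computes a disproportionality measure on single terms and on a grouping of semantically close terms. The proportional reporting ratio (PRR) is used only as one common example of such a measure; the source does not say which measure is meant. The counts, the term names and the minimum case threshold are made-up toy values, not results.

```python
def prr(a, drug_total, c, other_total):
    """Proportional reporting ratio: share of the event among reports of the
    drug, divided by its share among reports of all other drugs."""
    return (a / drug_total) / (c / other_total)


# Toy counts per term: (reports with the drug, reports with other drugs).
COUNTS = {"Term A": (2, 10), "Term B": (2, 10), "Term C": (2, 10)}
DRUG_TOTAL, OTHER_TOTAL = 100, 10_000  # total reports for the drug / other drugs
MIN_CASES = 3  # hypothetical minimum number of cases to consider a signal


def evaluate(terms):
    """Return (cases, PRR, enough_cases) for one term or a group of terms.

    Summing counts assumes each report carries at most one term of the
    group; real data would require counting distinct reports instead.
    """
    a = sum(COUNTS[t][0] for t in terms)
    c = sum(COUNTS[t][1] for t in terms)
    return a, prr(a, DRUG_TOTAL, c, OTHER_TOTAL), a >= MIN_CASES


single = evaluate(["Term A"])
grouped = evaluate(["Term A", "Term B", "Term C"])
```

With these toy values, each single term stays below the case threshold while the grouping reaches it, which illustrates why semantic groupings may matter when associations are spread over many fine-grained terms.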

5 CONCLUSION

This article presents the conception process of OntoADR, an OWL-DL file using a sample of SNOMED-CT relations to provide MedDRA terms with formal definitions. This work demonstrates that it is possible, using the semi-automatic method described, to build an ontologized version of MedDRA that can be used for algorithmic reasoning on MedDRA terms’ formal definitions. The possibility to automate part of the development process is important because MedDRA has a large number of terms and is rapidly evolving (a new version is published every six months), and a fully manual design would be extremely difficult to achieve. The partial reuse of SNOMED-CT to build the formal definitions of MedDRA concepts also ensures a certain degree of semantic interoperability between those two coding vocabularies.

OntoADR development is ongoing. In order to achieve a more advanced ontologized version of MedDRA additional steps are required: curating OntoADR and identifying MedDRA hierarchical relations that cannot be converted into subsumption relations, adding formal definitions to those LLTs that express concepts more specific than the associated PT (rather than being synonyms of it), completing the definition of the SOCs, HLGTs, HLTs levels at the top, and defining the PTs that are not or only partially defined. Issues related to the update of OntoADR following MedDRA evolutions should also be addressed.

Further studies will evaluate how the semantic reasoning possibilities OntoADR supports can assist pharmacovigilance experts in the work of searching for ADRs, especially when using signal detection algorithms based on disproportionality assessment. However, first results are promising. Preliminary studies have demonstrated that OntoADR can efficiently support the realization of automatic MedDRA terms groupings using OWL queries [44], and that those query-based groupings allow similar signal detection performances as standard handmade MedDRA terms groupings [52, 53].

6 ACKNOWLEDGMENTS

The authors express their gratitude to Adrien Fanet for his technical contribution to the conception of OntoADR and to Anne Jamet for her medical expertise. The authors also wish to thank the reviewers of the Journal of Biomedical Informatics for their comments and suggestions following the first submission of this article. The research leading to these results was conducted as part of the PROTECT consortium (Pharmacoepidemiological Research on Outcomes of Therapeutics by a European ConsorTium, www.imi-protect.eu) which is a public-private partnership coordinated by the European Medicines Agency. The PROTECT project has received support from the Innovative Medicine Initiative Joint Undertaking (www.imi.europa.eu) under Grant Agreement n° 115004, resources of which are composed of financial contribution from the European Union’s Seventh Framework Program (FP7/2007-2013) and EFPIA companies’ in kind contribution.

7 REFERENCES

1. ICH guideline E2B (R2), Electronic transmission of individual case safety reports - Message specification (ICH ICSR DTD Version 2.1), Final Version 2.3, Document Revision February 1, 2001.

2. MedDRA MSSO website. http://www.meddramsso.com/

3. Brown EG. Methods and Pitfalls in Searching Drug Safety Databases Utilising the Medical Dictionary for Regulatory Activities (MedDRA). Drug Saf 2003;26(3):145-158.

4. Brown EG. Using MedDRA. Drug Saf 2004;27(8):591-602.

5. MedDRA Data Retrieval and Presentation: Points to Consider. ICH-Endorsed Guide for MedDRA Users on Data Output. Release 3.2, based on MedDRA Version 14.1. 1 October 2011.

6. Mozzicato P. Standardised MedDRA Queries: Their Role in Signal Detection. Drug Saf 2007;30(7):617-619.

7. Bousquet C, Lagier G, Lillo-Le Louët A, Le Beller C, Venot A, Jaulent MC. Appraisal of the MedDRA conceptual structure for describing and grouping adverse drug reactions. Drug Saf 2005;28(1):19-34.

8. Merrill G. The MedDRA Paradox. AMIA Annu Symp Proc. 2008 6:470-4.

9. Goldman SA. Adverse event reporting and standardized medical terminologies: strengths and limitations. Drug Inf J 2002;36(2):439-444.

10. Richesson RL, Fung KW, Krischer JP. Heterogeneous but “standard” coding systems for adverse events: Issues in achieving interoperability between apples and oranges. Contemp Clin Trials 2008;29(5):635.

11. Nadkarni PM, Darer JA. Determining correspondences between high-frequency MedDRA concepts and SNOMED: a case study. BMC Med Inform Decis Mak 2010;10:66.

12. Rossi Mori A, Consorti F, Galeazzi E. Standards to support development of terminological systems for healthcare telematics. Methods Inform Med 1998;37:551-63.

13. MedDRA Introductory Guide Version 14.0. MSSO-DI-6003-14.0.0. March 2011.

14. MedDRA Term Selection: Points to Consider. ICH-Endorsed Guide for MedDRA Users. Release 4.5, based on MedDRA Version 16.0. 1 April 2013.

15. Tonéatti C, Saïdi Y, Meiffrédy V, Tangre P, Harel M, Eliette V, Dormont J, Aboulker PJ. Experience using MedDRA for global events coding in HIV clinical trials. Contemp Clin Trials 2006;27(1):13-22.

16. Tremmel L, Scarpone L. Using MedDRA for adverse events in cancer trials: Experience, caveats, and advice. Drug Inf J 2001;35:845-852.

17. Schroll JB, Maund E, Gøtzsche PC. Challenges in Coding Adverse Events in Clinical Trials: A Systematic Review. PLoS ONE 2012;7(7):e41174.

18. Borgo S, Leitao P. The Role of Foundational Ontologies in Manufacturing domains. Lecture Notes in Computer Science 2004;3290:670-688.

19. Cruz IF, Huiyong X. The Role of Ontologies in Data Integration. Eng Intell Syst Elect 2005;13(4):245-252.

20. Masolo C, Borgo S, Gangemi A, Guarino N, Oltramari A. WonderWeb Deliverable D18 Ontology Library (final). Internal report, IST Project 2001-33052 WonderWeb: Ontology Infrastructure for the Semantic Web; 2003.

21. Musen M. Discussion of “Biomedical Ontologies: Toward Scientific Debate”, V. Maojo, J. Crespo, M. García-Remesal, D. de la Iglesia, D. Perez-Rey, C. Kulikowski. Methods Inform Med 2011;50(3):226-227.

22. Wache H, Vögele T, Visser U, Stuckenschmidt H, Schuster G, Neumann H, Hübner S. Ontology-based integration of information. A survey of existing approaches. In Stuckenschmidt H. (ed.), IJCAI-01 Workshop: Ontologies and Information Sharing 2001:108-117.

23. Bousquet C, Henegar C, Lillo-Le Louët A, Degoulet P, Jaulent MC. Implementation of automated signal generation in pharmacovigilance using a knowledge-based approach. Int J Med Inform 2005;74:563-571.

24. Brown EG. Effects of coding dictionary on signal generation: a consideration of use of MedDRA compared with WHO-ART. Drug Saf 2002;25(6):445-52.

25. Yokotsuka M, Aoyama M, Kubota K. The use of a medical dictionary for regulatory activities terminology (MedDRA) in prescription-event monitoring in Japan (J-PEM). Int J Med Inform. 2000;57(2-3):139-53.

26. Henegar C, Bousquet C, Lillo-Le Louët A, Degoulet P, Jaulent MC. Building an ontology of adverse drug reactions for automated signal generation in pharmacovigilance. Comput Biol Med 2006;36(7-8):748-67.

27. Henegar C, Bousquet C, Lillo-Le Louët A, Degoulet P, Jaulent MC. A knowledge-based approach for automated signal generation. Proc. Medinfo 2004:626-630.

28. SNOMED-CT Website. http://www.ihtsdo.org/SNOMED-CT/

29. Lindberg DAB, Humphreys BL, McCray AT. The Unified Medical Language System. Methods Inform Med 1993;32:281-91.

30. UMLS Website. http://www.nlm.nih.gov/research/umls/

31. Bioportal Website. http://bioportal.bioontology.org/

32. Bodenreider O. Using SNOMED-CT in combination with MedDRA for reporting signal detection and adverse drug reactions reporting. AMIA Annu Symp Proc. 2009 Nov 14;2009:45-9.

33. Alecu I, Bousquet C, Mougin F, Jaulent MC. Mapping of the WHO-ART terminology on SNOMED-CT to improve grouping of related adverse drug reactions. Stud Health Technol Inform 2006;124:833-8.

34. Schulz S, Schober S, Tudose I, Stenzhorn H. The Pitfalls of Thesaurus Ontologization. The Case of the NCI Thesaurus. AMIA Annu Symp Proc 2010:727-731.

35. Ingenerf J, Linder R. Assessing applicability of ontological principles to different types of biomedical vocabularies. Methods Inform Med 2009;48(5):459.

36. Cimino JJ. Desiderata for controlled medical vocabularies in the twenty-first century. Methods Inform Med 1998;37:394-403.

37. Rector A, Brandt S. Why Do It the Hard Way? The Case for an Expressive Description Logic for SNOMED. J Am Med Inform Assoc 2008;15(6):744-751.

38. Schulz S, Cornet R, Spackman K. Consolidating SNOMED CT's ontological commitment. Appl Ontol 2011;6(1):1-11.

39. Héja G, Surján G, Varga P. Ontological analysis of SNOMED-CT. BMC Med Inform Decis Mak 2008;8(1):S8.

40. SNOMED Clinical Terms® User Guide, 2010 July International Release.

41. EL description, W3C OWL working group. http://www.w3.org/2007/OWL/wiki/EL

42. OWL 2 Web Ontology Language Profiles (Second Edition), W3C Recommendation 11 December 2012. http://www.w3.org/TR/owl2-profiles/

43. Trifirò G, Pariente A, Coloma PM, Kors JA, Polimeni G, Miremont-Salamé G, Catania MA, Salvo F, David A, Moore N, Caputi AP, Sturkenboom M, Molokhia M, Hippisley-Cox J, Acedo CD, van der Lei J, Fourrier-Reglat A. EU-ADR group. Data mining on electronic health record databases for signal detection in pharmacovigilance: which events to monitor? Pharmacoepidemiol Drug Saf 2009;18(12):1176-84.

44. Declerck G, Bousquet C, Jaulent MC. Automatic generation of MedDRA terms groupings using an ontology. Stud Health Technol Inform 2012;180:73-7.

45. Golbreich C, Zhang S, Bodenreider O. The foundational model of anatomy in OWL: Experience and perspectives. Web Semantics: Science, Services and Agents on the World Wide Web 2006;4(3):181-195.

46. Aranguren ME, Bechhofer S, Lord P, Sattler U, Stevens R. Understanding and using the meaning of statements in a bio-ontology: recasting the Gene Ontology in OWL. BMC bioinformatics 2007;8(1):57.

47. Noy NF, de Coronado S, Solbrig H, Fragoso G, Hartel FW, Musen MA. Representing the NCI Thesaurus in OWL DL: Modeling tools help modeling languages. Appl Ontol 2008;3(3):173-190.

48. Möller M, Sintek M, Biedert R, Ernst P, Dengel A, Sonntag D. Representing the International Classification of Diseases Version 10 in OWL. Proc KEOD 2010, 50-59.

49. Chute C, Üstün B, Spackman K. Ontology-based convergence of medical terminologies: SNOMED CT and ICD-11. Proceedings of the eHealth2012. 2012 May 10-11; Vienna, Austria. OCG; 2012.

50. Pearson RK, Hauben M, Goldsmith DI, Gould AL, Madigan D, O'Hara DJ, Reisinger SJ, Hochberg AM. Influence of the MedDRA hierarchy on pharmacovigilance data mining results. Int J Med Inform 2009;78(12):97-103.

51. Hill R, Hopstadius J, Lerch M, Noren N. An attempt to expedite signal detection by grouping related adverse reaction terms. Workshop “Computational methods in pharmacovigilance”, 24th European Medical Informatics Conference (MIE 2012), 26-29 August 2012, Pisa (Italy). Drug Saf 2012;35(12):1194-5.

52. Souvignet J, Declerck G, Trombert B, Rodrigues JM, Jaulent MC, Bousquet C. Evaluation of automated term groupings for detecting anaphylactic shock signals with drugs. AMIA Annu Symp Proc. 2012;2012:882-90.

53. Souvignet J, Declerck G, Jaulent MC, Bousquet C. Evaluation of Automated Term Groupings for Detecting Upper Gastrointestinal Bleeding Signals for Drugs. Drug Saf 2012;35(12):1195-6.