An ontological modeling approach to cerebrovascular disease studies: The NEUROWEB case



Journal of Biomedical Informatics 43 (2010) 469–484

Contents lists available at ScienceDirect

Journal of Biomedical Informatics

journal homepage: www.elsevier.com/locate/yjbin

An ontological modeling approach to cerebrovascular disease studies: The NEUROWEB case q

Gianluca Colombo e,1, Daniele Merico a,1, Giorgio Boncoraglio f, Flavio De Paoli e, John Ellul b, Giuseppe Frisoni e, Zoltan Nagy d, Aad van der Lugt g, István Vassányi c, Marco Antoniotti e,*

a Terrence Donnelly Centre for Cellular and Biomolecular Research (CCBR)/Banting and Best Department of Medical Research, University of Toronto, 160 College Street, Toronto, Ontario, Canada M5S 3E1
b Department of Neurology, University of Patras, Greece
c Department of Information Systems, University of Pannonia, Veszprem, Egyetem u. 10, H-8200, Hungary
d Department of Neurology, Sommelweiss University, 1088 Budapest VIII, Balassa u. 6, Hungary
e Dipartimento di Informatica, Sistemistica e Comunicazione (DISCo), Università degli Studi di Milano Bicocca, U14 Viale Sarca 336, I-20126 Milan, Italy
f Istituto Neurologico Carlo Besta, Via Celoria 11, I-20133 Milan, Italy
g Department of Radiology, Erasmus MC, University Medical Center, ’s-Gravendijkwal 230, 3015 CE Rotterdam, The Netherlands

Article info

Article history:
Received 12 October 2008
Available online 13 January 2010

Keywords:
Biomedical ontologies
Data integration
Clinical phenotypes
Association studies

1532-0464/$ - see front matter © 2009 Elsevier Inc. All rights reserved.
doi:10.1016/j.jbi.2009.12.005

q This work has been supported by the EC FP6 project NEUROWEB Grant 518513 and partially by the EC Marie Curie IRG Grant 031140.

* Corresponding author.
E-mail addresses: [email protected], [email protected] (M. Antoniotti).

1 These authors contributed equally to this paper.

Abstract

The NEUROWEB project supports cerebrovascular researchers’ association studies, intended as the search for statistical correlations between a feature (e.g., a genotype) and a phenotype. In this project the phenotype refers to the patients’ pathological state, and thus it is formulated on the basis of the clinical data collected during the diagnostic activity. In order to enhance the statistical robustness of the association inquiries, the project involves four European Union clinical institutions. Each institution provides its proprietary repository, storing patients’ data. Although all sites comply with common diagnostic guidelines, they also adopt specific protocols, resulting in partially discrepant repository contents. Therefore, in order to effectively exploit NEUROWEB data for association studies, it is necessary to provide a framework for the phenotype formulation, grounded on the clinical repository content which explicitly addresses the inherent integration problem.

To that end, we developed an ontological model for cerebrovascular phenotypes, the NEUROWEB Reference Ontology, composed of three layers. The top-layer (Top Phenotypes) is an expert-based cerebrovascular disease taxonomy. The middle-layer deconstructs the Top Phenotypes into more elementary phenotypes (Low Phenotypes) and general-use medical concepts such as anatomical parts and topological concepts. The bottom-layer (Core Data Set, or CDS) comprises the clinical indicators required for cerebrovascular disorder diagnosis. Low Phenotypes are connected to the bottom-layer (CDS) by specifying what combination of CDS values is required for their existence. Finally, CDS elements are mapped to the local repositories of clinical data. The NEUROWEB system exploits the Reference Ontology to query the different repositories and to retrieve patients characterized by a common phenotype.

© 2009 Elsevier Inc. All rights reserved.

1. Background

The NEUROWEB project aims to support association studies in the field of cerebrovascular disease, by providing a framework for clinical data integration and exchange. In general, the purpose of association studies is to find statistical relationships between a (set of) feature(s) and a phenotype, which is a composite observable state of an individual organism. NEUROWEB phenotypes specifically correspond to the pathological state of the patients, formulated according to clinical data collected during the diagnostic activity. In addition, the project is specifically committed, although not restricted, to provide support for exploring genotype–phenotype relations [1,2].

In order to achieve statistical robustness in these studies, large patient cohorts are required. To accomplish this goal, the NEUROWEB project involves four clinical sites from different EU-member countries, which are recognized excellence centers in the field of cerebrovascular diseases, with a particular focus on ischemic stroke. Although all sites comply with international guidelines, they have developed specific competencies in different areas of stroke diagnosis, treatment and stroke research (such as imaging, biochemical assays, etc.). This situation of compliance with general requirements, though with specific skills and research commitments, is mirrored by the content of their clinical repositories, which were independently developed to store the patients’ profiles gathered during the diagnostic activity. The size of each data repository varies from 500 to 1500 patient records.

2 This is a very active research area; see, for example, [32].

The exploitation of NEUROWEB local repositories for association studies presents two major modeling aspects (cf. [3–5]):

(1) developing an IT infrastructure enabling the access to different technological platforms for data storage (inter-operability);

(2) resolving the semantic misalignments among locally-defined clinical indicators, yet preserving the methodological coherence and consistency of phenotype formulation.

Specifically, the misalignment among clinical indicators does not rest on a purely linguistic or terminological basis, as displayed by the following cases.

(1) The same type of exam can be performed using different diagnostic methodologies or technology (e.g., CTA scan vs. MRI scan), with different implications for diagnostic reliability.

(2) Different findings can be derived from the raw results of the same type of exam; as a consequence, the same clinical indicator may have different values in different sites (e.g., Brachiocephalic artery lesion, derived from CTA scan results).

(3) Different scale of granularity, intended as level of detail concerning the patient’s state assessment (e.g., ICA–CCA Stenosis Present/Absent vs. ICA–CCA Stenosis Left Present/Absent).

(4) Different criteria can be applied to assign a stroke classification label (e.g., Atherosclerotic Ischemic Stroke) by combining different clinical indicators, or by setting different criteria (e.g., ranges, thresholds, acceptable values, etc.) on the same indicators.

Many state-of-the-art solutions for data integration (for health care as well as other domains) revolve around the idea of database schema matching, i.e., establishing relations between elements from local databases [6,7]. Database schema matching can be supported by semantic models, specifying the meaning of concepts behind database elements. Such models can range in complexity from simple graphs to semantic networks and domain ontologies [8,9].

An important distinction can be drawn between systems relying on a single, global conceptual schema and those relying on local, independent conceptual schemas. The two solutions have specific advantages and disadvantages: the global schema solution requires a constant update of the semantic model whenever local databases change, whereas local schemas can be hard to reconcile unless they are already based on shared concepts or terminologies. Several systems already exploit terminological resources, “semantic mediators” or ontological knowledge models for data integration and retrieval; a few examples are [10–13]. Also, several Ontology-based (OB)-Systems supporting high-throughput processing of biological and clinical data have appeared [14–21].

These systems are traditionally aimed at gathering genotypic information associated to patients [21] through the adoption of bio-ontologies. The genotypic information gathered is then used either to guide data selection and knowledge discovery processes [18], or for biomedical data integration, extraction and mining [14,17,20]. A unified panorama of available bio-ontologies is offered by the National Center for Biomedical Ontology (NCBO) Bioportal [19].

With respect to existing OB-Systems supporting phenotype–genotype scientific inquiries, NEUROWEB primarily copes with the representation of the clinical rather than biological knowledge involved in association studies. Moreover, case-control studies have to be carried out [22] in order to strengthen an association hypothesis between a given polymorphism and phenotypic profiles of cerebrovascular diseases [23,24]. Such investigation requires both robust clinical knowledge modeling (i.e., a clinical phenotype ontology) and the definition of ad hoc system modules that guarantee methodological coherence in data mapping, data extraction and data collection (i.e., the IT facilities described in the following sections). Neither state-of-the-art systems nor state-of-the-art ontologies for clinical phenotype modeling (as discussed in [25]) that support clinically based testing of hypothesized phenotype–genotype associations were available at the time the NEUROWEB project began, nor, to the best of our knowledge, have any been developed in the meantime.

Given this background, the NEUROWEB project set out to provide its own form of integration. In the NEUROWEB project, semantic reconciliation goes a step further than straightforward database schema matching. In fact, the NEUROWEB system was designed to offer clinicians, and also biologists, the capability of performing collaborative research through access to a network of data repositories. Each repository could be queried either using unified, higher-level concepts referring to common-use cerebrovascular phenotypes, or using new, user-defined phenotypes, assembled from elementary phenotype units. To achieve this goal, we developed an ontological model [26–30] for cerebrovascular phenotypes, the NEUROWEB Reference Ontology, that accurately captures the core diagnostic/classificatory knowledge of clinicians. The development of the Reference Ontology required very intensive knowledge acquisition and knowledge structuring activities, leading to a progressive refinement of the semantic model, as described in Section 2. In the final model, top-level diagnostic classes are deconstructed into more elementary phenotypes and medical concepts. Every phenotype is associated to the specific combination of clinical indicators required for its occurrence. The content of local databases is mapped to the Reference Ontology locally; the current version relies on direct mapping and simple rules, but the future adoption of local ontologies, to capture local features in a more expressive way, is supported as well. Specifically, mapping local database elements to the ontology elements enables the reconciliation of granularity discrepancies between locally-defined clinical indicators that would be hard to manage using direct mappings between such elements. The adoption of a Reference Ontology grounded on expert knowledge ensures methodological consistency, which is an essential requirement for association studies.

A major decision to be taken was to choose between the development of a specific Reference Ontology and the adoption of an existing one [31]. The second solution would apparently offer clear advantages, by granting inter-operability with external resources, and by facilitating the involvement of new partners in case they are already complying with an existing ontology. However, there are crucial problems undermining this solution. Indeed, no publicly-available medical ontology is committed to the representation of clinical findings and phenotypes.2 The phenotype ontologies developed within the biological community are oriented to high-throughput genetic experiments in model organisms [29,33,34], and hence are not suitable for clinical applications. From a pragmatic perspective one may argue that even if an optimal solution is not available, the reuse of an existing general-purpose medical ontology may still offer significant advantages. As a case in point, we did consider SNOMED-CT as a possible candidate, but decided against it because of a number of actual shortcomings that would be encountered by adopting it (see also [35–37]).

• Considering the stroke type taxonomy used by NEUROWEB clinicians, most of the concepts are either missing from SNOMED-CT, or they are formulated in an unsuitable way. For instance, the definition of SNOMED-CT Atherosclerotic Occlusive Disease clearly implies an atherosclerotic etiology, but not the specific features of stroke, which are part of the NEUROWEB Ischemic Stroke definition. In addition, several clinical indicators required for stroke diagnosis and present in NEUROWEB clinical repositories are missing from SNOMED-CT (e.g., relevant scan lesion).

• SNOMED-CT offers qualitative scales for clinical findings but does not provide quantitative criteria to assign them (e.g., no stenosis percentage ranges are associated to the previous scale), nor does it resolve inter-dependencies among different indicators.

Similar shortcomings are displayed by other medical resources, proving that these problems are not specific to SNOMED-CT. For instance, the Disease Ontology [38] is a general-purpose classification of pathologies, describing biological samples within genetic data banks (developed for the NUgene project). The Disease Ontology includes concepts which have terminological correspondence to NEUROWEB phenotypes and clinical indicators (e.g., Stroke, Atherosclerosis, Subarachnoid hemorrhage, Cerebral embolism, Cerebral thrombosis, Occlusion and stenosis of carotid artery, etc.). However: (1) they do not follow the taxonomy adopted by the cerebrovascular clinicians who are part of the NEUROWEB network; (2) no criteria are provided to assign them on the basis of clinical data; (3) the concepts are organized by adopting only is-a, part-of, inverse-of, union-of and disjoint-from relations, thus lacking any specification of causality or diagnostic evidence. As a whole, there is a wealth of knowledge, relevant to the aims of the NEUROWEB system, which could not be properly conveyed by this ontology. Consequently, it was decided to develop a new Reference Ontology committed to accurately representing the diagnostic knowledge of the NEUROWEB community. Nonetheless, NEUROWEB concepts were mapped to external terminologies to support keyword-based searches in external resources. The value and validity of the diagnostic knowledge encoded by the NEUROWEB Reference Ontology is not restricted to the NEUROWEB Consortium, as the stroke type taxonomy adopted largely follows the TOAST classification guideline. This resource was developed by cerebrovascular experts, and is regarded as a reliable standard by the international cerebrovascular community [39–41].

2. The Development of the NEUROWEB Reference Ontology

The development of the NEUROWEB Reference Ontology proceeded iteratively over a number of phases. In the narrative of Section 2.1, we identify three major phases where design decisions were made and provide a rationale for the choices made.

2.1. Initial models

2.1.1. Basic level definition

The modeling activity started by identifying a minimal set of clinical indicators (Core Data Set, in brief CDS), mapped to the majority of the local repositories and of primary importance for stroke diagnosis. In a sense, the CDS is the equivalent of one of the numerous minimal information specifications now coordinated, e.g., in the MIBBI project [42]. The definition of the CDS was characterized by close collaboration with clinicians, and several cycles of refinement at round table meetings.

The most straightforward solution for clinical data integration consists in mapping the content of local repositories to the CDS. In this scenario, the user formulates queries on the CDS indicators, which are then translated into queries on the local databases. This solution is problematic, as there are often granularity discrepancies between local indicators, preventing a direct mapping to CDS elements (e.g., scan lesion = yes vs. scan lesion side = LEFT). In addition, some CDS indicators are not atomic, meaning that they implicitly refer to other entities of the CDS. For instance, relevant scan lesion does not refer to a single exam result; according to diagnostic knowledge, a lesion is relevant only in the presence of a co-axial scan lesion (i.e., the evidence of some brain tissue damage) and stenosis (i.e., the evidence of a partial occlusion in a brain-afferent artery). It follows that relevant scan lesion = yes requires a set of constraints on several atomic entities of the CDS, namely:

scan lesion = yes,
stenosis degree > 50%,
side of the scan lesion = RIGHT (LEFT), and
side of the stenosis = RIGHT (LEFT).
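The four constraints above can be read as a small derivation rule. The following is a minimal sketch in Python, not the NEUROWEB implementation: the record type `AtomicIndicators`, its field names, and the choice to treat missing values as "not relevant" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AtomicIndicators:
    """Hypothetical record holding the atomic CDS values of one patient."""
    scan_lesion: bool                  # evidence of brain tissue damage
    scan_lesion_side: Optional[str]    # "LEFT", "RIGHT", or None if unknown
    stenosis_degree: Optional[float]   # percent occlusion, None if not measured
    stenosis_side: Optional[str]       # "LEFT", "RIGHT", or None if unknown


def relevant_scan_lesion(p: AtomicIndicators) -> bool:
    """Derive the non-atomic indicator from the four atomic constraints."""
    if not p.scan_lesion:
        return False
    # Assumption: an unmeasured stenosis cannot support relevance.
    if p.stenosis_degree is None or p.stenosis_degree <= 50.0:
        return False
    if p.scan_lesion_side is None or p.stenosis_side is None:
        return False
    # Co-axiality: the lesion and the stenosis must lie on the same side.
    return p.scan_lesion_side == p.stenosis_side
```

Writing the rule once, over atomic values, is what lets every site be queried with the same criterion regardless of how a local clinician originally recorded the finding.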

Stroke classification labels (e.g., TOAST Atherosclerotic Ischemic Stroke Evident) are a specific type of non-atomic CDS indicators. They characterize the patient’s state in a comprehensive way, implicitly encompassing many other indicators. These classification labels could be used to directly retrieve patients for association studies. However, since they were manually assigned by clinicians, errors and methodological inconsistencies are possible. In particular, it is possible that different clinicians or different sites used different criteria for their assignment. For this reason, using the stroke classification labels would be a satisfactory solution under the mere perspective of inter-operability, but methodological coherence and consistency, which are crucial for rigorous association studies, may not be ensured.

In conclusion, the CDS alone is insufficient and a richer model is required, utilizing the CDS as the groundwork.

2.1.2. Two-layer solution

We next developed a model composed of two layers, where the top-layer is a taxonomy of phenotypes (Top Phenotypes), each connected to the CDS entities (the bottom-layer) via a definition formula. The formula is structured as a conjunction/disjunction of criteria on CDS indicators, expressed as equality/inequality bounds, or quantitative ranges to be satisfied. For instance: (Blood pressure > 50) AND (Anterior cerebral artery lesion = yes).

The model above was used for an extensive knowledge acquisition activity with clinicians. It offers the advantages of being simple to understand, and it includes entities already familiar to the clinicians. Using the CDS entities to formulate Top Phenotypes also provided valuable feedback for the refinement of the content and structure of the CDS.
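A definition formula of the kind used in the two-layer model (a conjunction/disjunction of criteria on CDS indicators) can be represented as a small expression tree. The sketch below is illustrative only: the class names `Criterion`, `And`, `Or`, the function `holds`, and the rule that a missing indicator fails its criterion are assumptions, not part of the NEUROWEB system.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple, Union

# Comparison operators allowed in a criterion (equality/inequality bounds).
OPS: Dict[str, Callable[[Any, Any], bool]] = {
    "=": operator.eq, "!=": operator.ne,
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
}


@dataclass(frozen=True)
class Criterion:
    indicator: str   # name of a CDS indicator
    op: str          # one of the keys of OPS
    value: Any       # bound or categorical value


@dataclass(frozen=True)
class And:
    terms: Tuple["Formula", ...]


@dataclass(frozen=True)
class Or:
    terms: Tuple["Formula", ...]


Formula = Union[Criterion, And, Or]


def holds(formula: Formula, record: Dict[str, Any]) -> bool:
    """Evaluate a definition formula against one patient's CDS values."""
    if isinstance(formula, Criterion):
        if formula.indicator not in record:
            return False  # assumption: a missing value fails the criterion
        return OPS[formula.op](record[formula.indicator], formula.value)
    if isinstance(formula, And):
        return all(holds(t, record) for t in formula.terms)
    return any(holds(t, record) for t in formula.terms)  # Or


# The example formula quoted in the text.
EXAMPLE = And((
    Criterion("Blood pressure", ">", 50),
    Criterion("Anterior cerebral artery lesion", "=", "yes"),
))
```

Keeping formulas as data rather than hard-coded conditions means sub-formulas can later be named and reused, which is the direction the three-layer model takes.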

2.1.3. Three-layer solution

The major weakness of the two-layer model is the absence of relations deconstructing the stroke types into more elementary phenotypes and medical concepts of general use (e.g., co-occurring pathologies such as diabetes and obesity, anatomical parts, topological concepts). This additional array of entities is necessary to support important functionalities:

[Figure 1 (diagram). Labels shown: Reference Ontology; Phenotype Ontology; Top Phenotypes (stroke taxonomy); Low Phenotypes (building blocks); Topology & Anatomy; Core Data-Set (clinical indicators); NEUROWEB to Local Mapping; Local Clinical Repositories; Ischemic Stroke; Indicator Value; Mapping Rule.]

Fig. 1. The diagram displays the overall architecture of the NEUROWEB Reference Ontology.


• Simplify the handling of the phenotype formulas; in fact, sub-units of the full definition formula can be associated to certain aspects of the pathological state, of general validity even in contexts other than the stroke specialty.3

• Establish a mapping to medical ontologies and terminologies (e.g., SNOMED-CT, MeSH [43,44]), in order to support document retrieval, and other searches on resources outside the NEUROWEB consortium.

• Support the integration of bioinformatics resources for genotype–phenotype association (e.g., the Human Gene Mutation Database, HGMD [45,46]) and gene function (most notably from the Gene Ontology [47]).

These needs can be met by adopting an ontological framework for phenotype formulation and introducing an additional layer (Low Phenotypes) to deconstruct the Top Phenotypes into more elementary concepts.

Hence, the resulting model is composed of three layers, and it will be fully described in Section 2.2.

2.2. The final ontology architecture

The NEUROWEB Reference Ontology is composed of three layers (Fig. 1).

• Top Phenotypes: taxonomy of stroke types.
• Low Phenotypes: anatomical parts, topological relations, diseases and other phenotypes.
• Core Data Set (CDS): unified clinical indicators.

The Top and Low Phenotypes are grouped into the Phenotype Ontology, which, alongside the CDS, is part of what we refer to as the NEUROWEB Reference Ontology. The CDS (bottom level) is the set of minimal-granularity clinical indicators required for stroke diagnosis. The Top Phenotype layer is a taxonomy of stroke types, connected to the Low Phenotype layer by relations such as causality and existence of diagnostic evidence; the deconstruction of Top Phenotypes into Low Phenotypes provides an explicit model for the inherent classification criteria underlying the Top Phenotype taxonomy. Finally, the Low Phenotypes are connected to the CDS entities and the corresponding combination of values required for the phenotype to occur.

3 For instance, the Top Phenotype Atherosclerotic Ischemic Stroke can be decomposed into two parts: the Ischemic Stroke (a cerebrovascular accident), and its durative etiological factor Atherosclerosis (a vessel disease). Identifying these two parts is useful, as the Ischemic Stroke unit is also a part of Cardioembolic Ischemic Stroke, another Top Phenotype.

Mapping from NEUROWEB Reference Ontology elements to local repositories occurs primarily at the CDS level. However, mapping to Low Phenotypes is also possible; this solution ensures the handling of the aforementioned granularity discrepancies between local repositories (cf. Section 2.1), as the phenotype ontology provides progressively more general concepts than the CDS. For more details on the mapping solution, please refer to Section 3.2.
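To illustrate how mapping to a more general concept can absorb a granularity discrepancy, consider the earlier example of one site recording scan lesion = yes and another recording scan lesion side = LEFT. The sketch below is hypothetical throughout: the site names, local column names, the "NONE" value, and the concept labels are invented for illustration and do not describe the actual mapping rules of Section 3.2.

```python
from typing import Callable, Dict, Optional

LocalRecord = Dict[str, str]
# A mapping rule returns True/False, or None when the local data cannot answer.
Rule = Callable[[LocalRecord], Optional[bool]]


def _site_a_lesion(rec: LocalRecord) -> Optional[bool]:
    # Hypothetical site A stores only a coarse yes/no flag.
    v = rec.get("scan_lesion")
    return None if v is None else v == "yes"


def _site_b_lesion(rec: LocalRecord) -> Optional[bool]:
    # Hypothetical site B stores the side; any side implies a lesion.
    v = rec.get("scan_lesion_side")
    return None if v is None else v in ("LEFT", "RIGHT")


def _site_b_left(rec: LocalRecord) -> Optional[bool]:
    v = rec.get("scan_lesion_side")
    return None if v is None else v == "LEFT"


# Per-site rules, keyed by the shared concept they can answer.
MAPPINGS: Dict[str, Dict[str, Rule]] = {
    "site_a": {"scan lesion present": _site_a_lesion},
    "site_b": {
        "scan lesion present": _site_b_lesion,
        "scan lesion on left side": _site_b_left,
    },
}


def evaluate(site: str, concept: str, rec: LocalRecord) -> Optional[bool]:
    """Answer a shared concept from a local record, or None if unmapped."""
    rule = MAPPINGS[site].get(concept)
    return None if rule is None else rule(rec)
```

Both sites can answer the general concept, while only the finer-grained site can answer the side-specific one; a query phrased at the general level therefore reaches more patients.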

2.2.1. The Core Data Set

In the final, three-layer model, the CDS was restricted to minimal-granularity clinical indicators. In fact, non-atomic concepts are handled within the Phenotype Ontology (e.g., stroke classifications are represented by Top Phenotypes). For this reason, mapping to local repositories is not restricted to the CDS layer.

The CDS entities are organized into categories and sub-categories, according to the different types of examinations (see Fig. 2). In general, the values of CDS entities can be quantitative (e.g., Age, Hemoglobin on admission g/dL), Boolean (e.g., Current use of alpha blockers: yes, no), or categorical (e.g., Cognitive Function: normal, mildly impaired, confused; Gender: male, female).
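The three kinds of CDS values can be captured with a small type per kind plus a validity check. This is a sketch under assumptions: the class names and the `is_valid` helper are hypothetical, and only the entity names, units and levels quoted in the text are taken from the source.

```python
from dataclasses import dataclass
from typing import Any, Tuple, Union


@dataclass(frozen=True)
class Quantitative:
    name: str
    unit: str = ""


@dataclass(frozen=True)
class Boolean:
    name: str


@dataclass(frozen=True)
class Categorical:
    name: str
    levels: Tuple[str, ...]


CDSEntity = Union[Quantitative, Boolean, Categorical]

# Example entities named in the text.
AGE = Quantitative("Age")
HEMOGLOBIN = Quantitative("Hemoglobin on admission", "g/dL")
ALPHA_BLOCKERS = Boolean("Current use of alpha blockers")
COGNITIVE = Categorical("Cognitive Function", ("normal", "mildly impaired", "confused"))
GENDER = Categorical("Gender", ("male", "female"))


def is_valid(entity: CDSEntity, value: Any) -> bool:
    """Check that a value is admissible for the given CDS entity."""
    if isinstance(entity, Quantitative):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if isinstance(entity, Boolean):
        return value in ("yes", "no")
    return value in entity.levels
```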

2.2.2. The Phenotype Ontology

The upper-layer of the Phenotype Ontology is populated by the Top Phenotypes. These entities represent the main classes of pathological states typically diagnosed by clinicians. They are inter-related by is-a relations, thus forming a taxonomy of stroke types.

The root of the taxonomy is Ischemic Stroke, whose children are Atherosclerotic Stroke, Cardioembolic Stroke and Lacunar Stroke; these stroke types constitute the three main etiological groups, as they are identified in the TOAST classification. Each of them is then divided into Evident, Probable and Possible. Deeper levels of phenotype specification (e.g., according to the anatomical location of the vascular lesion) lead to the addition of children to these stroke subtypes.
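The first three levels of the taxonomy described above can be written down as a child-to-parent map over is-a relations. In this sketch the concatenated node labels (e.g., "Lacunar Stroke Probable") and the helper names are assumptions chosen for illustration; the ontology's actual identifiers may differ.

```python
from typing import Dict, List, Optional

ROOT = "Ischemic Stroke"
ETIOLOGIES = ("Atherosclerotic Stroke", "Cardioembolic Stroke", "Lacunar Stroke")
CONFIDENCES = ("Evident", "Probable", "Possible")


def build_taxonomy() -> Dict[str, Optional[str]]:
    """Return a child -> parent map encoding the is-a relations."""
    parent: Dict[str, Optional[str]] = {ROOT: None}
    for etiology in ETIOLOGIES:
        parent[etiology] = ROOT
        for confidence in CONFIDENCES:
            # Hypothetical label pattern for the confidence-level subtypes.
            parent[f"{etiology} {confidence}"] = etiology
    return parent


def ancestors(node: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """List the superclasses of a node, nearest first."""
    chain: List[str] = []
    current = parent[node]
    while current is not None:
        chain.append(current)
        current = parent[current]
    return chain


def is_a(node: str, cls: str, parent: Dict[str, Optional[str]]) -> bool:
    """True if node equals cls or is one of its descendants."""
    return node == cls or cls in ancestors(node, parent)


TAXONOMY = build_taxonomy()
```

A query for a class such as Lacunar Stroke can then retrieve patients labeled with any of its subtypes by testing `is_a` on each candidate label.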

[Figure 2 (table with columns Main Categories, Sub-Categories and Example Entities). Main categories: Personal Identifying Data; Clinical Data; Brain Imaging Studies; Heart Studies; Vessel Studies; Laboratory Studies; Follow-up Information; Medication at Follow-up; Classification. Sub-categories shown include CT (Computed Tomography), MRI (Magnetic Resonance Imaging), ECG (Electrocardiogram), Holter, Transesophageal-echocardiogram, CTA (Computed Tomography Angiography), MRA (Magnetic Resonance Angiography), DSA (Digital Subtraction Angiography), Duplex, Routine Blood Tests, Coagulation Studies, Month 1, Month 3, Month 6 and Month 12 (for both Follow-up Information and Medication at Follow-up), TOAST - Classical, TOAST - Extended, ICD-9CM and OCSP. Example entities shown include Age, Gender, Facial Palsy, Cholesterol on Admission, Location of the Lesion, Rhythm on Admission, Paroxysmal Atrial Fibrillation, Right-left Shunt, Carotid Bifurcation Stenosis, Internal Carotid Artery Stenosis, Hemoglobin on Admission, Fibrinogen, Cognitive Function and Current Use of Alpha-blockers.]

Fig. 2. The table displays the categories and sub-categories used to organize the CDS entities. An example of CDS entity is provided for each partition. The classification layer is shaded, as these indicators are conveyed by Top Phenotypes, and were part of the CDS only in the initial models.


To define the entities and relations deconstructing the Top into Low Phenotypes, we followed the TOAST classification criteria.

• Etiology (Atherosclerotic, Cardioembolic, Lacunar Stroke).
• Confidence of the etiological assessment (Evident, Probable, Possible), depending on the strength of the diagnostic evidence for the most-probable etiology.

In addition, anatomy (i.e., the location of the lesion) is not used by the TOAST classification, but could be suitably used to extend it, and is consistently used in the different clinical communities. These criteria are explicitly recognized by the clinicians, and are generally used in clinical medicine.

It is also important to consider that diagnostic activity is always characterized by the acquisition of diagnostic evidence, enabling the reconstruction of the underlying patho-physiological processes and structures, even if they are not directly observed. In the specific case of ischemic stroke, there is a consistent partition between the evidence for the ischemic damage (typically brain imaging displaying the damaged tissue, and the cognitive/motor impairment of the patient) and the evidence for the cause of the occlusion causing ischemia; the latter is usually a persistent or progressive state of the patient’s organism, such as Atherosclerosis, whereas the former occurs after a chain of point-events (i.e., with a very compact time-span) eventually leading to a brain lesion (a trauma).

From a biological standpoint, however, it is important to represent not only the diagnostic evidences, but also the inferred patho-physiological processes. In our modeling case, ischemic stroke is caused by the rupture of an atherosclerotic plaque, triggering the coagulation cascade, the release of a clot particle into the bloodstream (embolization) and the obstruction of a brain artery, eventually resulting in a brain lesion in the region deprived of the blood supply. These processes can be captured at the biomolecular level and represented in the ontology.

Reflecting the above-mentioned criteria, the Top Phenotypes are first decomposed into Low Phenotypes (see Fig. 3) according to etiology. We used two different causal relations:

• Has-Cause-Durative,
• Has-Cause-PointEvent.

Has-Cause-Durative connects a Top Phenotype to its Durative Etiological Background (e.g., Atherosclerotic Disease), representing the long-term pathology responsible for the generation of the ischemic event. Has-Cause-PointEvent connects a Top Phenotype to its Traumatic Point Event, representing the cerebrovascular accident that occurred in the patient. In the current version of the ontology, this is always IschemicEvent or one of its specializations. However, cerebrovascular accidents other than Ischemic Stroke can be represented as well (e.g., Hemorrhagic Stroke).

This first group of Low Phenotypes reflects patho-physiological processes. The underlying biomolecular processes are connected via the Involves relation. For more details on how biomolecular processes are exploited by the system, please refer to Section 2.4.1.

Both Durative Etiological Background and Traumatic Point Event are then connected to their diagnostic evidences (Durative or Point-event, respectively), via the Has-Diagnostic-Evidence relation. Diagnostic evidences can be decomposed into more elementary diagnostic evidences using the same relation.

An additional dimension to be taken into account is the anatomical location implied by the phenotype. Anatomical parts cannot

Fig. 3. The diagram displays the main types of entities and relations composing the NEUROWEB Reference Ontology. The three layers of the Reference Ontology (Top Phenotypes, Low Phenotypes, CDS) correspond to the large shaded boxes in the background on the right side. Additional entity types are displayed on the left side. Colored arrows represent relations. The graphical pattern A → B1 → B2 represents the DL construct: A ≡ ∃relationX.(B1 ⊓ ∃relationY.B2).

Table 1. Description Logic code for the fragment of the NEUROWEB Reference Ontology defining Atherosclerotic Ischemic Stroke Evident (AISE).

(1) AISE ≡ ∃hasCausePointEvent.IschemicEvent
        ⊓ ∃hasCauseDurative.(AtheroscleroticDisease
        ⊓ ∃hasDiagnosticEvidence.SevereStenosis)

(2) IschemicEvent ≡ ∃hasDiagnosticEvidence.RelevantLesion

(3) RelevantLesion ≡ ∃hasDiagnosticEvidence.LeftRelevantLesion
        ⊔ ∃hasDiagnosticEvidence.RightRelevantLesion

(4) LeftRelevantLesion ≡ ∃hasDiagnosticEvidence.((ModerateLesion ⊔ SevereLesion) ⊓ ∃hasSide.Right)
        ⊓ ∃hasDiagnosticEvidence.(SevereStenosis ⊓ ∃hasSide.Right)

(5) ModerateLesion ≡ Lesion ⊓ ∃byMeansOf.(CT-From2.5to5CentimetersLesion
        ⊔ MRI-From2.5to5CentimetersLesion
        ⊔ PET-From2.5to5CentimetersLesion)

(6) CT-From2.5to5CentimetersLesion ≡ CT ⊓ ∃hasValue.2.5-5centimeters

(7) MRI-From2.5to5CentimetersLesion ≡ MRI ⊓ ∃hasValue.2.5-5centimeters

(8) PET-From2.5to5CentimetersLesion ≡ PET ⊓ ∃hasValue.2.5-5centimeters


be regarded as phenotypes (i.e., observable properties): they are rather physical entities bearing observable properties [29–31,33]. For this reason, we introduced new types of entities, Anatomical Parts and Topological Concepts. Diagnostic evidences are connected to Topological Concepts via the Has-Side relation, and to Anatomical Parts via the Has-Location relation.

Low Phenotypes of type Diagnostic Evidence are finally deconstructed into the exams required for their assessment. The Has-Value relation allows for the formulation of validity ranges that must be satisfied by a CDS indicator in order to elicit the occurrence of a certain phenotype; the resulting construct is connected to the pertinent Low Phenotype through the By-Means-Of relation.

As an example of integration feasibility, the Phenotype Ontology can be effectively mapped to other medical ontologies, in order to support queries on external resources. At the present stage of model development we provide the mapping between SNOMED-CT terms and Reference Ontology entities belonging to the Low Phenotypes and the Anatomical Parts modules. The mapping cannot be systematically provided for the Top Phenotypes and CDS entities, as SNOMED-CT does not offer satisfactory coverage, for the reasons already explained in Section 1.

2.3. An example of phenotype formulation

As an example, let us consider the ontological definition of the Top Phenotype Atherosclerotic Ischemic Stroke Evident (AISE); see Table 1 and Fig. 4.

The first axiom states (a) the existence of a relation from Atherosclerotic Ischemic Stroke Evident (AISE) to the Low Phenotype Ischemic Event via the Has-Cause-PointEvent relation, and (b) the existence of a relation from AISE to the Low Phenotype Atherosclerotic Disease via the Has-Cause-Durative relation. As an additional requirement, Atherosclerotic Disease must have Severe Stenosis as diagnostic evidence; this is a specific requirement of Atherosclerotic Ischemic Stroke when it is Evident. According to the second axiom, Ischemic Event requires the presence of the diagnostic evidence Relevant Lesion; Atherosclerotic Disease is not further decomposed, for the sake of brevity. Relevant Lesion is further decomposed into Left and Right Relevant Lesion, which consist of a Moderate or Severe Lesion connected to the topological indication of the side via the Has-Side relation. Moderate and Severe Lesion are eventually decomposed into CDS elements and appropriate diagnostic values by the By-Means-Of and Has-Value relations.
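The decomposition described above can be illustrated with a minimal Python sketch. It is not the NEUROWEB implementation: it flattens the axioms of Table 1 into a nested AND/OR expression over a single patient record and ignores the side (Has-Side) constraints and the Severe Lesion alternative. The indicator names MRI.Lesion.Size, PET.Lesion.Size and Duplex.Stenosis.Degree, and the value "Severe", are assumptions introduced for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Leaf:
    """A condition on one CDS indicator (By-Means-Of plus Has-Value)."""
    indicator: str
    accepts: Callable[[object], bool]


@dataclass(frozen=True)
class And:
    parts: Tuple[object, ...]


@dataclass(frozen=True)
class Or:
    parts: Tuple[object, ...]


def holds(expr, record: Dict[str, object]) -> bool:
    """Evaluate a nested AND/OR expression against one patient record."""
    if isinstance(expr, Leaf):
        # A missing indicator never satisfies a condition.
        return expr.indicator in record and expr.accepts(record[expr.indicator])
    if isinstance(expr, And):
        return all(holds(part, record) for part in expr.parts)
    if isinstance(expr, Or):
        return any(holds(part, record) for part in expr.parts)
    raise TypeError(f"unsupported expression: {expr!r}")


def size_2_5_to_5_cm(value) -> bool:
    """Validity range used by axioms (6)-(8): lesion size in centimeters."""
    return 2.5 <= value <= 5.0


# Axioms (5)-(8): a Moderate Lesion assessed by CT, MRI or PET.
moderate_lesion = Or((
    Leaf("CT.Lesion.Size", size_2_5_to_5_cm),
    Leaf("MRI.Lesion.Size", size_2_5_to_5_cm),   # assumed indicator name
    Leaf("PET.Lesion.Size", size_2_5_to_5_cm),   # assumed indicator name
))

# Severe Stenosis is not decomposed in Table 1; this leaf is hypothetical.
severe_stenosis = Leaf("Duplex.Stenosis.Degree", lambda v: v == "Severe")

# Simplified axiom (1): ischemic event evidence AND atherosclerotic evidence.
aise = And((moderate_lesion, severe_stenosis))

patient = {"CT.Lesion.Size": 3.0, "Duplex.Stenosis.Degree": "Severe"}
print(holds(aise, patient))
```

The same recursive evaluation extends to deeper trees, since every Low Phenotype is either a conjunction or a disjunction of more elementary evidences.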

2.4. Extensions of the NEUROWEB Reference Ontology

In the following we briefly discuss some extensions of the NEUROWEB Reference Ontology that are under development: a terminological extension to support text-search tools, a more sophisticated treatment of the temporal dimension, and, in more detail, the addition of a layer for the treatment of biomolecular processes.

The primary aim of the NEUROWEB Reference Ontology is to support the retrieval of patients characterized by a certain phenotype. Nonetheless, such a semantically articulate model can also be usefully exploited for text-mining purposes, thus supporting NEUROWEB literature searches. To effectively support the processing of texts, an ontology needs to be extended with a terminological basis, i.e., the set of written expressions corresponding to the ontology-encoded concepts [48]. A practicable solution would be to link each phenotype concept to the corresponding terms from publicly available terminological resources (e.g., SNOMED-CT, MeSH, UMLS [49], ICD-9CM [50], ICD-10 [51]), as was exemplified in Section 2.2 for SNOMED-CT. Specifically, the mapping to MeSH would be very effective in supporting Medline searches, as Medline is already indexed by MeSH terms. A case could also be made to incorporate synonym searches over WordNet [52], although such searches would mostly yield uncontextualized (in the biomedical sense) results. The terminological extension of the Reference Ontology is currently under way, and can already be used in its prototype stage.

Fig. 4. A visual diagram for the AISE DL formulation. Big boxes represent entities, arrows represent relations. Small gray boxes (transitions) are used to convey AND/OR logical connections: AND corresponds to relations using the same transition, OR corresponds to relations using independent transitions.

The formal representation and the computational treatment of the time dimension is a crucial topic for data management systems [53] and the ontological knowledge representation area [54], as well as for biomedical support systems [55] and the developing area of biomedical ontologies [56,57].

According to [55], two main research directions drive the treatment of the temporal dimension in the development of biomedical and clinical information systems: (1) temporal reasoning, in order to support inferential tasks, such as therapy planning and execution; (2) temporal data maintenance, pertaining to the storage and retrieval of clinical data having heterogeneous temporal dimensions.

As far as the representation and reasoning issues are concerned, the temporal dimension introduced by the diagnostic activity (i.e., modeling the temporal order of the diagnostic examinations, and the process of progressive hypothesis refinement) will not be addressed, as it would be outside the NEUROWEB scope (that is, supporting association studies via patient clustering). On the contrary, the temporal dimension relevant to the NEUROWEB system concerns the formulation of phenotypes on the basis of dynamic patterns, i.e., the varying behavior of clinical indicators over time. For instance, it could be valuable to define improving or worsening conditions. Such a modeling effort would require, on one hand, a systematic treatment of repeated clinical examinations over time; on the other hand, it would require an additional knowledge acquisition activity, in order to identify the criteria implicitly applied by the experts when they recognize different classes of dynamic patterns. Finally, the knowledge representation formalism could reasonably be the one proposed in [54], or an adaptation of the same.


2.4.1. The Biomolecular extension of the Reference Ontology

The NEUROWEB consortium is committed to integrating the clinical and genetic databases of the participating centres, referring to patients affected by cerebrovascular diseases who were studied with high-throughput Single Nucleotide Polymorphism (SNP) [58,59] genetic profiling. The data will be analyzed to identify statistically significant genotype–phenotype associations, fostering the discovery of new diagnostic markers, disease genes and potential therapeutic targets. Due to co-inheritance patterns [58], the analysis of SNP data typically poses a genomic mapping problem: given a certain polymorphism, which is significantly associated with the phenotype of interest, it is often hard to assess which gene mutation is actually responsible for the observed phenotype. One of the heuristics that can be used to address this problem is to combine the newer whole genome approach with the older candidate gene approach [2], selecting for genes characterized by functions compatible with the observed phenotype (e.g., cholesterol homeostasis in Atherosclerotic Ischemic Stroke) [60]. The NEUROWEB Reference Ontology was extended to biomolecular entities in order to support this goal. In addition, to provide mappings to external resources, the ontology includes Has-Reference relations connecting NEUROWEB processes to gene functions from Gene Ontology (GO), and pathways from resources such as KEGG [61–63]. Here follows a qualitative description of how the NEUROWEB system supports polymorphism mapping. The input consists of single SNP-phenotype associations.

• The SNP (identified by the NCBI dbSNP ID [59]) is mapped to its genetic locus.
• The NEUROWEB Genomic Engine retrieves all the genes mapping to the locus; the Gene Ontology (GO) functional annotation is retrieved for those genes.
• If any of the GO annotations are mapped to the NEUROWEB biomolecular processes, the ontology is navigated to identify one (or more) connections to Low Phenotypes via the Involved-In relation.
• The ontology is further navigated to search for connections between the previously identified Low Phenotypes and the phenotype of interest (i.e., the one associated with the polymorphism).
• The system outputs the genes and the related biomolecular processes that are associated with a valid path. Output genes represent good candidates for the refinement of the genotype–phenotype association. The corresponding output biomolecular processes offer a plausible functional explanation of the genotype–phenotype association (see footnote 4).

The implementation of this functionality in the NEUROWEB system is currently under completion.
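The navigation steps listed above can be sketched as a chain of lookups. The following Python fragment is only an illustration under stated assumptions, not the NEUROWEB Genomic Engine: the knowledge base is a set of in-memory dictionaries, and the identifiers "rs-example", "locus-1" and "GENE-X" are made up. The LDL Receptor example mirrors footnote 4.

```python
from typing import Dict, List, Tuple

# Each table maps a key to the list of items it is connected to.
KnowledgeBase = Dict[str, Dict[str, List[str]]]


def candidate_genes(snp: str, phenotype: str,
                    kb: KnowledgeBase) -> List[Tuple[str, str]]:
    """Return (gene, biomolecular process) pairs lying on a valid path
    from the SNP to the phenotype of interest."""
    hits = set()
    for locus in kb["snp_to_locus"].get(snp, []):
        for gene in kb["locus_to_genes"].get(locus, []):
            for go_term in kb["gene_to_go"].get(gene, []):
                # GO annotation -> NEUROWEB biomolecular process
                for process in kb["go_to_process"].get(go_term, []):
                    # process -> Low Phenotype
                    for low in kb["process_to_low"].get(process, []):
                        # Low Phenotype -> phenotype of interest
                        if phenotype in kb["low_to_top"].get(low, []):
                            hits.add((gene, process))
    return sorted(hits)


# Toy knowledge base; all identifiers are hypothetical placeholders.
toy_kb: KnowledgeBase = {
    "snp_to_locus": {"rs-example": ["locus-1"]},
    "locus_to_genes": {"locus-1": ["LDLR", "GENE-X"]},
    "gene_to_go": {
        "LDLR": ["cholesterol homeostasis"],
        "GENE-X": ["unrelated function"],
    },
    "go_to_process": {
        "cholesterol homeostasis": ["Cholesterol Metabolism and Homeostasis"],
    },
    "process_to_low": {
        "Cholesterol Metabolism and Homeostasis": ["Atherosclerotic Disease"],
    },
    "low_to_top": {
        "Atherosclerotic Disease": ["Atherosclerotic Ischemic Stroke Evident"],
    },
}

print(candidate_genes("rs-example",
                      "Atherosclerotic Ischemic Stroke Evident", toy_kb))
```

In this toy setting the second gene at the locus is discarded because its annotation is not connected to any NEUROWEB process, which is the filtering effect the candidate gene heuristic relies on.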

The biomolecular extension of the NEUROWEB Ontology can be further exploited for the integration of NEUROWEB genotype–phenotype associations with data from publicly available resources (e.g., the NIH Genetic Association Database [64]). This functionality is currently under study; so far we have identified two major discrepancies to be addressed.

• Different criteria are used to identify phenotypes and diseases.
• Phenotypes can refer to different levels of biological organization.

4 For instance, consider the association between a given SNP and the Top Phenotype Atherosclerotic Ischemic Stroke Evident. The SNP is mapped onto an LDL Receptor gene, annotated for Cholesterol Homeostasis according to Gene Ontology. GO Cholesterol Homeostasis is connected to NEUROWEB Cholesterol Metabolism and Homeostasis, and this process is connected to the Low Phenotype Atherosclerotic Disease, which is connected to Atherosclerotic Ischemic Stroke Evident via Has-Cause-Durative. It is thus possible to establish a valid path from Atherosclerotic Ischemic Stroke (and all its offspring) to the SNP of interest, by traversing Atherosclerotic Disease and Cholesterol Metabolism and Homeostasis.

The first problem occurs because, to the best of our knowledge, there is no common ontology for clinical phenotypes working as a common exchange language, and because different communities of clinical experts may have specific needs. Thanks to the decomposition of the Top Phenotypes into the more general Low Phenotypes, the NEUROWEB Reference Ontology is already structured to minimize this problem. The second problem typically occurs when phenotype annotation refers to cellular and molecular processes or components, whereas clinical phenotypes most often refer to the organ or system level (see footnote 5). A complete solution of this problem would require a systematic ontology of biological functions and parts, spanning all the organization levels, from molecules to organs and systems. However, a more limited yet satisfactory solution can be achieved by relating biomolecular processes to Low Phenotypes. Establishing such relations amounts to defining the molecular mechanisms of pathological manifestations (e.g., atherosclerosis, atrial fibrillation, etc.).

3. Methods

3.1. Description Logic Implementation of the NEUROWEB Reference Ontology

Description logics [65–67] are a family of logic-based knowledge representation formalisms designed to represent and reason about the knowledge of an application domain in a structured and well-understood way. The basic notions in description logics are atomic concepts and atomic roles (unary and binary predicates in the terminology of first-order logic, respectively). In order to distinguish the function of each concept in the relation (represented by a role), the individual object that corresponds to the second argument of the role, viewed as a binary predicate, is called role filler. For instance, ∃hasPart.Wheel is an expression which describes properties of cars having wheels, in which the individual objects belonging to the concept Wheel are fillers of the role hasPart. A specific description logic is mainly characterized by the constructors it provides to form complex concepts and roles from the atomic ones. The language we used to formalize the ontological clinical knowledge is SHOIN [65], which is an extension of the basic description logic. In order to develop the Reference Ontology computational model we have adopted the OWL DL version [68]. The editor adopted for generating the OWL files is Protégé, a well-known tool developed and distributed by Stanford University, where the Reference Ontology concepts are represented as T-Box (Terminological Box) entities.

3.2. Exploiting the Reference Ontology for clinical queries

The NEUROWEB phenotypes are defined in the Phenotype Ontology as axioms that cannot be directly exploited to query the local repositories, which are typically implemented with relational databases. To retrieve patient clusters on a phenotypic basis, such high-level definitions typically need to be translated and mapped into CDS entities and then into local database queries. Moreover, since a local database may include only some of the indicators represented by the CDS and possibly other elements, the actual mapping may occur at different levels of the Reference Ontology. For example, a specific local database may include information about a given phenotype (e.g., presence or absence of stenosis) that could be directly mapped to the corresponding NEUROWEB phenotype.

5 E.g., HGMD phenotype Apolipoprotein A1 deficiency is related to an increased susceptibility to thrombogenesis and embolization in presence of ulcerated atherosclerotic plaques, and thus should be related to the NEUROWEB Atherosclerotic Ischemic Stroke.

Fig. 5. User interface of the Ontology Mapper tool for a flat local database table. The user can select one or more local database fields (upper right) and assign them to a CDS label or a phenotype (upper left). The tool generates the syntax for the mapping to be saved as an SQL view.

To define the NEUROWEB-to-Local (N2L) mapping, local medical experts familiar with the actual meaning and, possibly, the coding of the local repository logical structure are asked to map the local repository elements to the Reference Ontology elements. In order to facilitate this task, a graphical interactive tool, named Ontology Mapper, has been developed as a Protégé plug-in. In most cases the mapping may occur at the CDS level, since the CDS structure should be quite similar to that of the clinical repositories. In some cases the mapping could be reduced to a linguistic translation of terms in the CDS into the local terminology; in other cases, several local data fields may be used to construct a single CDS element. However, to host new clinical partners it might be necessary to map the Reference Ontology concepts or even to develop local ontologies to express a more sophisticated relationship and exploit the full range of the NEUROWEB solution (see Fig. 5).

The resulting high-level and CDS-level mappings are stored in the same way at the clinical servers, as database views in our prototype implementation (cf. “Mapping to local entities” in Fig. 8). The CDS labels and phenotypes that appear in these views are called “supported”, while the others are “unsupported” by the clinical partner.

The clinical sites periodically communicate the list of their supported CDS labels and phenotypes to the center for storage and use when designing queries (cf. “Mapping to supported entities” in Fig. 8).

The support for phenotype-level mappings is useful to avoid “semantic gaps”, i.e., loss of information, in user queries. Such gaps occur when querying for a phenotype whose CDS constituents cannot be mapped to a local repository due to granularity discrepancies. Mapping at the phenotype level increases the flexibility of the system. However, the user is given the opportunity to set constraints on such flexibility. To that end, we use a system of weights and thresholds: the user can assign weights to specific components of the query (e.g., certain CDS elements); then each record retrieved from each data source is scored by summing the weights of the supported fields. If this score does not satisfy a user-defined threshold, that record will not be output to the user (see Fig. 6).
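The weight and threshold mechanism can be sketched in a few lines of Python. This is an illustrative reading rather than the actual NEUROWEB code: it assumes that a record "satisfies" the threshold when its score is greater than or equal to it, that the supported fields of a record are those of the data source it comes from, and the field names, weights and patient identifiers are invented for the example.

```python
from typing import Dict, List, Set, Tuple

Record = Tuple[str, str]  # (data source, patient identifier)


def record_score(weights: Dict[str, int], supported: Set[str]) -> int:
    """Sum the user-assigned weights of the query components that the
    originating data source supports."""
    return sum(w for name, w in weights.items() if name in supported)


def filter_records(records: List[Record],
                   supported_by_source: Dict[str, Set[str]],
                   weights: Dict[str, int],
                   threshold: int) -> List[Record]:
    """Keep only the records whose score reaches the threshold
    (assumed here to mean score >= threshold)."""
    kept = []
    for source, patient_id in records:
        score = record_score(weights, supported_by_source.get(source, set()))
        if score >= threshold:
            kept.append((source, patient_id))
    return kept


# Hypothetical query components and weights chosen by the user.
weights = {
    "Presence of Carotid Stenosis": 2,
    "CT.Lesion.Size": 3,
    "CT.Lesion.Side": 1,
}
supported_by_source = {
    "A": {"Presence of Carotid Stenosis"},                # score 2
    "B": {"Presence of Carotid Stenosis",
          "CT.Lesion.Size", "CT.Lesion.Side"},            # score 6
}
records = [("A", "patient-1"), ("B", "patient-2"), ("B", "patient-3")]

print(filter_records(records, supported_by_source, weights, threshold=4))
```

Lowering the threshold to 2 would also admit the records from source A, which is how the user trades completeness of the evidence for a larger cluster.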

The process of generating queries to access a local repository requires two tasks: (1) the elements of the local repository need to be mapped into the ones in the Reference Ontology and the CDS, and (2) the NEUROWEB phenotypes need to be transformed into queries expressed in terms of the Reference Ontology elements that map to the local repository.

To perform the two tasks, we have developed two components that are part of the NEUROWEB architecture, as illustrated in Fig. 8: (1) the Phenotype Converter, which relies on the mapping information provided by every participating repository to generate tailored SQL queries in the NEUROWEB terminology; (2) the NEUROWEB-to-Local (N2L) Mapper, which exploits the mapping information to generate the actual queries that will be used to access the local database.

The Phenotype Converter has been implemented as a centralized Java component to ensure logical consistency over the set of performed queries, exploiting the Jena programming interface [69] to navigate the ontology and extract class names and axioms. In practice, every request is decomposed in the same way according to the current definitions included in the Reference Ontology; whenever they change, only the centralized component needs to be changed, without notifying the involved sites. This solution improves the awareness of the users, who can be immediately notified about who is going to answer a specific query and how.

The conversion process essentially navigates the references from the top-level phenotype axioms to Low Phenotypes, and finally to conditions on CDS elements, which are the leaf nodes of the

Fig. 6. Two example mappings to local data fields in the PL-SQL language (simplified versions of the originals). The first mapping maps the local field stroke.ctl to the phenotype Presence_Of_Carotid_Stenosis if the field contains the specified string. The second mapping creates the CDS label F.01.04.09.01.00.00 (code for Documented recurrent stroke/TIA) using the local fields stroke.nihss6ho and stroke.nihss12ho.

Fig. 7. A visual diagram for the Presence of Carotid Occlusion as represented in the Reference Ontology. Big boxes represent entities, arrows represent relations. Small gray boxes (transitions) are used to convey AND/OR logical connections: AND corresponds to relations using the same transition, OR corresponds to relations using independent transitions.


phenotype tree. For this navigation, the has-cause, has-evidence and by-means-of relations are used. By combining the simple conditions with the operators and quantifiers of the axiom, the top-level phenotype can be represented as a nested AND/OR expression including the elements that are supported at a specific site.

However, before generating a CDS-level query, the Phenotype Converter always checks whether a phenotype-level mapping is available for all or part of a phenotype tree. If so, the high-level mapping is used in preference to the CDS-level mapping. This means that the name of the phenotype will appear in the SQL query instead of its CDS-level representation.

As an example, consider the fragment of the NEUROWEB Ontology in Fig. 7; it represents a simplified version of the Presence of Carotid Stenosis concept as defined by the NEUROWEB clinicians. In a patient, the presence of a stenosis in the carotid artery may be diagnosed if there is evidence of it in the segment called Internal/Common Carotid Artery (ICA–CCA) or in the Origin of the Common Carotid Artery. The ICA–CCA Stenosis can be diagnosed if there is evidence of it in the right or in the left Internal/Common Carotid Artery, and in particular if there is evidence of a severe stenosis or a complete occlusion of the vessel.

One of the two branches reaching the last level of the ontology states that the occlusion must be diagnosed by means of the CDS indicator Duplex Carotid Degree Of Stenosis in Left CCA–ICA, and that this indicator must have value (Occlusion). For every clinical site, the mapping can be done at the best fitting level: for example, if the clinical database “A” includes information only about the presence of a carotid stenosis in the patient, but without details on its position, the mapping can be done only at the phenotype level, but not at the CDS level. Thanks to the phenotype-level mapping, a generic query that asks for patients with carotid stenosis will receive reliable answers. For more specific queries, for example select all patients with a stenosis in the Internal Carotid Artery, the database “A” cannot provide answers due to a lack of information.

Fig. 8. The figure displays the information flow of a clinical query in the NEUROWEB system. The software modules are displayed as solid boxes, whereas the data items are displayed as cylinders. The dashed boxes group the components of a local site (only two sites were depicted for reasons of compactness). Thick arrows identify the information flow elicited by each user’s query.


The system has to build a targeted query for each database, depending on the level of detail at which it is mapped. The Phenotype Converter performs this task. In the previous example, a query select all patients with presence of carotid stenosis = Yes will easily receive answers from the database “A”, but it must also receive answers from a database “B” that supports mapping at the CDS level. The Phenotype Converter translates the previous query into select all patients with Duplex Carotid Degree Of Stenosis in Left CCA–ICA = Occlusion OR Duplex Carotid Degree Of Stenosis in Left CCA–ICA = Severe OR Duplex Carotid Degree Of Stenosis in Right CCA–ICA = Severe OR . . . OR Duplex Carotid Degree Of Stenosis in Origin Common = Severe, by following the Ontology axioms that state how each concept is defined in terms of CDS elements.

Of course, a database “C” can be mapped at any intermediate level. For example, if the distinction between the ICA–CCA and the Origin Common Carotid Artery were present in database “C”, the mapping would occur at that level, and the query would be translated into select all patients with ICA–CCA Stenosis = Yes OR Origin Common Carotid Artery Stenosis = Yes.
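The level-dependent translation for databases “A”, “B” and “C” can be sketched as a recursive walk over the phenotype tree. The Python fragment below is a simplified illustration, not the Java Phenotype Converter: it handles only OR-combined alternatives, the class and function names are hypothetical, and the tree contains only the conditions explicitly mentioned in the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Union


@dataclass
class CdsCondition:
    """Leaf node: a CDS indicator that must take a given value."""
    indicator: str
    value: str


@dataclass
class Phenotype:
    """Inner node: a phenotype defined as an OR of its alternatives."""
    name: str
    alternatives: List[Union["Phenotype", CdsCondition]] = field(
        default_factory=list)


def translate(node, supported: Set[str]) -> Optional[str]:
    """Build the WHERE condition for one site, preferring a
    phenotype-level mapping over the CDS-level expansion."""
    if isinstance(node, CdsCondition):
        if node.indicator in supported:
            return f"{node.indicator} = {node.value}"
        return None  # unsupported leaf: dropped from the query
    if node.name in supported:
        return f"{node.name} = Yes"
    parts = [translate(child, supported) for child in node.alternatives]
    parts = [p for p in parts if p is not None]
    return " OR ".join(parts) if parts else None


LEFT = "Duplex Carotid Degree Of Stenosis in Left CCA-ICA"
RIGHT = "Duplex Carotid Degree Of Stenosis in Right CCA-ICA"
ORIGIN = "Duplex Carotid Degree Of Stenosis in Origin Common"

carotid = Phenotype("Presence of Carotid Stenosis", [
    Phenotype("ICA-CCA Stenosis", [
        CdsCondition(LEFT, "Occlusion"), CdsCondition(LEFT, "Severe"),
        CdsCondition(RIGHT, "Occlusion"), CdsCondition(RIGHT, "Severe"),
    ]),
    Phenotype("Origin Common Carotid Artery Stenosis", [
        CdsCondition(ORIGIN, "Severe"),
    ]),
])

site_a = {"Presence of Carotid Stenosis"}       # phenotype-level mapping
site_b = {LEFT, RIGHT, ORIGIN}                  # CDS-level mapping
site_c = {"ICA-CCA Stenosis",
          "Origin Common Carotid Artery Stenosis"}  # intermediate level

for site in (site_a, site_b, site_c):
    print(translate(carotid, site))
```

Checking the node name before expanding its children is what gives the phenotype-level mapping priority; a full converter would also need AND nodes and parenthesized sub-expressions.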

The NEUROWEB system exploits well-established Web services technology to decouple the central components from the local sites in order to facilitate access by new clinical partners. Although each partner is free to choose the preferred technology to implement the local components, we have developed and made available a reference implementation to deliver a sample solution that exploits the popular open-source technologies Glassfish for the communication tasks [70] and Postgres [71] for implementing the database view.

Glassfish safely manages call wrappings to expose the local interface as a WSDL document. The Clinical Query Application runs as a web application on the Glassfish server, processes the incoming web service call, translates it into an SQL query, runs the query on the database view, and wraps and returns the resulting record set.

4. Conclusions

The purpose of the NEUROWEB Reference Ontology is to support the retrieval of patient clusters in a rich semantic context, coupled with strict requirements on the clarity and coherence of clustering criteria. In fact, the criteria adopted to categorize patients in computer-unaided practice are deeply rooted in the diagnostic knowledge of the cerebrovascular experts. In addition, the specific commitment to association studies is particularly demanding from a methodological standpoint: on one hand, it is effective to provide the different researchers with a set of common and explicit clustering criteria, endowed with shared semantics; on the other hand, it is necessary to grant coherence and consistency when different repositories are integrated into a unified, virtual data warehouse. The importance of such issues is the rationale driving the cerebrovascular community to establish standards for stroke classification, such as the TOAST [39–41]; in particular, the growing amount of available data requires the adoption of computer-aided practices, and thus a formal representation of classification systems.

The specific advantages provided by an ontological modeling approach with respect to the issues discussed are twofold. On one hand, the explicit representation of classification criteria, as provided by the layer of the Low Phenotypes, makes it possible to analytically express the meaning of the groupings organized in the Top Phenotype layer. On the other hand, the mapping to the local clinical repositories, mediated by the Core Data Set layer, enables an automated retrieval of patient clusters driven by the previously defined criteria, which spares the researcher from tedious and time-consuming query formulation activities, while preserving methodological coherence. Considering this latter issue, the main problem is that different repositories contain classification data, but there is no guarantee that they were assigned by applying coherent criteria.


Moving to a more general level, the NEUROWEB ontological framework provides a formalization of concepts explicitly or implicitly encoded in the TOAST system, a scope not addressed by the available medical ontologies. According to this perspective, the NEUROWEB modeling effort, in spite of its prototypical nature, covers a previously unaddressed area of medical knowledge representation, which can reasonably be expected to attract the interest of the broader cerebrovascular community.

An even greater challenge is posed by the analysis and validation of genotype–phenotype associations, which will be generated by high-throughput genotyping campaigns on NEUROWEB patients. Nevertheless, the present structure adopted for the ontological model proved to be already adequate for the extension to the biomolecular domain, providing an effective groundwork to establish relations between cerebrovascular pathologies and their underlying molecular mechanisms. As a whole, the Low Phenotype layer, by analytically deconstructing the stroke types into general concepts of the cerebrovascular domain, provides an effective groundwork both for (a) the phenotype redefinition, and (b) the representation of relations between cerebrovascular pathologies and their underlying molecular mechanisms.

Acknowledgments

The authors wish to thank all the clinical partners: Istituto Nazionale Neurologico Carlo Besta (INNCB, Milan, Italy), Orszagos Pszichiatriai es Neurologiai Intezet (AOK-OPNI, Budapest, Hungary), University of Patras (UOP, Patras, Greece), Erasmus Universitair Medisch Centrum Rotterdam (MI-EMC, Rotterdam, Holland). In particular, we wish to thank Dr. Stella Marousi from UOP, Dr. Csaba Ovary from AOK-OPNI, Dr. Philip Homburg from MI-EMC, and Dr. Eugenio Parati from INNCB, for their valuable contributions during the knowledge acquisition campaign and model refinement process. Finally, the authors wish to express the utmost gratitude to Prof. Giancarlo Mauri from the Dipartimento di Informatica Sistemistica e Comunicazione (DISCo) of the Università degli Studi di Milano Bicocca for his coordination efforts and leadership role in the NEUROWEB project.

Fig. A.1. Specification of Data sources.

Fig. A.2. The Phenotype Specification GUI.

Table A.1
Example A. The query compiled for Clinic A (with full CDS-level mapping).

    SELECT
        //patient ID, selected automatically
        "A.01.01.01.00.00.00",
        //the CDS codes for the 5 elements specified in step 5
        "A.01.01.02.00.00.00",
        "A.01.02.01.01.00.00",
        "A.01.02.09.02.01.00",
        "D.01.01.01.00.00.00",
        "A.01.01.07.00.00.00"
    FROM <local database view name at clinic A>
    WHERE
        "A.01.01.01.03.00.00" = 1 //meaning gender = female
        AND ( //definition of Atherosclerotic_Ischemic_Stroke
            ( //definition of Ischemic_Stroke,
              //involving 11 CDS-level constraints, NOT DETAILED HERE
              . . .
            )
            AND (
                //definition of Atherosclerotic_Disease,
                //involving 5 CDS-level constraints
                "A.01.02.02.10.01.00" = 2 //myocardial infarction
                OR "A.01.02.02.11.01.00" = 2 //angina pectoris
                OR "A.01.02.02.09.01.00" = 2 //family stroke/TIA
                OR "A.01.02.02.05.01.00" = 2 //arterial hypertension
                OR "A.01.02.02.07.01.00" = 2 //hypercholesterolemia
            )
        )

Table A.2
Example B. The query compiled for Clinic B (with a phenotype-level mapping).

    SELECT
        //patient ID, selected automatically
        "A.01.01.01.00.00.00",
        //the CDS codes for the 5 elements specified in step 5
        "A.01.01.02.00.00.00",
        "A.01.02.01.01.00.00",
        "A.01.02.09.02.01.00",
        "D.01.01.01.00.00.00",
        "A.01.01.07.00.00.00"
    FROM <local database view name at clinic B>
    WHERE
        "A.01.01.01.03.00.00" = 1 //meaning gender = female
        AND ( //definition of Atherosclerotic_Ischemic_Stroke
            ( //definition of Ischemic_Stroke,
              //involving 11 CDS-level constraints, NOT DETAILED HERE
              . . .
            )
            AND (
                //definition of Atherosclerotic_Disease,
                //involving a single phenotype-level mapping
                Atherosclerotic_Disease = 'yes'
            )
        )
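As a rough illustration of how a query shaped like Table A.2 could be executed against a local view, the following sketch uses an in-memory SQLite database. Everything beyond the CDS codes and the phenotype-level column is an assumption made for the example: the view name `clinic_b_view`, the toy rows, the gender code 2, and the use of SQLite itself. The Ischemic_Stroke constraints are left out because the table does not detail them, and only one of the five selected elements is retrieved to keep the sketch short.

```python
import sqlite3

# In-memory database standing in for a clinic's local view (hypothetical name).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE clinic_b_view (
        "A.01.01.01.00.00.00" TEXT,     -- patient ID
        "A.01.01.01.03.00.00" INTEGER,  -- gender, 1 = female (as in Table A.2)
        "A.01.01.02.00.00.00" TEXT,     -- date of birth
        Atherosclerotic_Disease TEXT    -- phenotype-level column
    )
    """
)

# Toy rows, invented purely for illustration (gender code 2 is an assumption).
conn.executemany(
    "INSERT INTO clinic_b_view VALUES (?, ?, ?, ?)",
    [
        ("P1", 1, "1950-01-01", "yes"),
        ("P2", 1, "1948-05-12", "no"),
        ("P3", 2, "1939-11-30", "yes"),
    ],
)

# CDS codes contain dots, so they are written as double-quoted identifiers.
# Standard SQL comments use "--" rather than the "//" shown in the table.
query = """
    SELECT "A.01.01.01.00.00.00", "A.01.01.02.00.00.00"
    FROM clinic_b_view
    WHERE "A.01.01.01.03.00.00" = 1          -- gender = female
      AND Atherosclerotic_Disease = 'yes'    -- phenotype-level mapping
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('P1', '1950-01-01')]
```

Only the female patient flagged with the phenotype-level value is returned, which is the filtering behavior that the WHERE part of Table A.2 describes.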

G. Colombo et al. / Journal of Biomedical Informatics 43 (2010) 469–484 481

Appendix A

A.1. Exploiting the Reference Ontology in a concrete test case

This section exemplifies the use of the NEUROWEB Reference Ontology for queries as well as the interactions between the user and the system interface. We will focus on the top phenotype Atherosclerotic_Ischemic_Stroke, which is defined as the intersection of Ischemic_Stroke and Atherosclerotic_Disease. Specifically, the presence of Atherosclerotic_Disease in a patient is defined in the Reference Ontology as a disjunction of five CDS-level constraints (CDS alphanumeric codes are displayed between parentheses).

• Presence of myocardial infarction = yes, within 4 weeks (A.01.02.02.10.01.00 = 2).

• Presence of angina pectoris = yes (A.01.02.02.11.01.00 = 2).

• Presence of family stroke/TIA = yes (A.01.02.02.09.01.00 = 2).

• Presence of arterial hypertension = yes (A.01.02.02.05.01.00 = 2).

• Presence of hypercholesterolemia = yes (A.01.02.02.07.01.00 = 2).

The structure of this phenotype is shown in the lower right part of Fig. A.2. We omit the definition of Ischemic Stroke only for simplicity.
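The disjunctive definition above can be read as a small boolean expression tree over CDS values. The following Python sketch is one possible encoding, not the NEUROWEB implementation: the class names `Constraint`, `AnyOf` and `AllOf`, the dictionary-based patient record, and the treatment of a missing CDS element as "not satisfied" are all assumptions made for illustration. The CDS codes, the value 2 for "yes", and the gender code 1 for "female" are taken from the appendix. Ischemic_Stroke is left out, since its 11 constraints are not detailed here.

```python
from dataclasses import dataclass
from typing import Mapping, Tuple, Union


@dataclass(frozen=True)
class Constraint:
    """Leaf node: a CDS element must hold a given coded value."""
    cds_code: str
    value: int

    def holds(self, record: Mapping[str, int]) -> bool:
        # Assumption: a CDS element missing from the record is not satisfied.
        return record.get(self.cds_code) == self.value


@dataclass(frozen=True)
class AnyOf:
    """Disjunction (OR) of child nodes."""
    children: Tuple["Node", ...]

    def holds(self, record: Mapping[str, int]) -> bool:
        return any(child.holds(record) for child in self.children)


@dataclass(frozen=True)
class AllOf:
    """Conjunction (AND) of child nodes."""
    children: Tuple["Node", ...]

    def holds(self, record: Mapping[str, int]) -> bool:
        return all(child.holds(record) for child in self.children)


Node = Union[Constraint, AnyOf, AllOf]

# Atherosclerotic_Disease: disjunction of the five CDS-level constraints.
ATHEROSCLEROTIC_DISEASE = AnyOf((
    Constraint("A.01.02.02.10.01.00", 2),  # myocardial infarction
    Constraint("A.01.02.02.11.01.00", 2),  # angina pectoris
    Constraint("A.01.02.02.09.01.00", 2),  # family stroke/TIA
    Constraint("A.01.02.02.05.01.00", 2),  # arterial hypertension
    Constraint("A.01.02.02.07.01.00", 2),  # hypercholesterolemia
))

# A user-defined phenotype: female AND Atherosclerotic_Disease.
FEMALE = Constraint("A.01.01.01.03.00.00", 1)
FEMALE_WITH_ATHEROSCLEROTIC_DISEASE = AllOf((FEMALE, ATHEROSCLEROTIC_DISEASE))

# Hypothetical patient record: female, with arterial hypertension.
patient = {"A.01.01.01.03.00.00": 1, "A.01.02.02.05.01.00": 2}
print(FEMALE_WITH_ATHEROSCLEROTIC_DISEASE.holds(patient))  # True
```

Keeping the leaves as plain (code, value) pairs mirrors the way the appendix states each constraint, so a change such as setting Presence of angina pectoris to "no" would only replace one leaf.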

Now we show a typical use of the NEUROWEB system, focusing on how the reference ontology is exploited for clinical, genomic, and literature searches.

(1) Define the mapping. Suppose that two NEUROWEB clinics (clinic A and clinic B) compile the NEUROWEB-to-Local mapping for a subset of the CDS elements. Clinic A has full CDS mapping, whereas clinic B can support mapping only at the phenotype level for Atherosclerotic disease. The list of supported CDS elements and phenotypes is sent and stored in the central database. The clinics also set up their local web services.

(2) Define the goal. A NEUROWEB user, with read access granted to the databases of clinic A and B, intends to study how atherosclerotic ischemic stroke in women might be associated with genetic polymorphisms.

(3) Select a data source. The user enters the NEUROWEB portal and selects clinic A and clinic B as data sources. If we consider the whole query as SELECT data elements of interest FROM clinical DBs of interest WHERE patient filter condition, then this step specifies the FROM part (Fig. A.1).

(4) Specify a phenotype. The user moves to the next tab of the GUI to specify a phenotype. She selects the phenotype Atherosclerotic ischemic stroke. The tree structure, with the CDS elements as leaf nodes, appears in the right pane (see Fig. A.2). The tree shows the logical relations (AND/OR) of its branches as well as the required data values at the CDS elements. Note that the top phenotype Atherosclerotic_Ischemic_Stroke is defined as Ischemic_Stroke AND Atherosclerotic_Disease. Fig. A.2 does not show the structure of the low phenotype Ischemic_Stroke. The user can browse and modify the phenotype tree at any level; for example, she can change the value of the CDS element Presence of angina pectoris from "yes" to "no". The GUI supports any combination of existing phenotypes and CDS elements to form a new phenotype, which the user can save as her own phenotype for future use. Now she wants to specify female patients, so she adds the CDS element "Gender" and sets its value to "Female", then connects it to the phenotype tree with an AND relation. This action is shown in Fig. A.2. The GUI reads the Reference Ontology to display the tree. The user-compiled phenotype will later be the source of the WHERE part of the query.

(5) Select the dataset elements. The user selects on the next tab the CDS elements or phenotypes that she wants to view for the patients matching the user-defined phenotype. This list will be the SELECT part of the query. If this list contains an item that is unsupported for a certain clinic, the returned values will be null. Suppose that the user selects:

• Date of birth (CDS code A.01.01.02.00.00.00).
• Date of admission (CDS code A.01.02.01.01.00.00).
• Discharge status (CDS code A.01.02.09.02.01.00).
• Time between onset and first scan (CDS code D.01.01.01.00.00.00).
• Marital status (CDS code A.01.01.07.00.00.00).

(6) Running the query. The user can set some additional parameters and then execute the query. The phenotype is sent to the Phenotype Converter that, using the mapping information, produces two different WHERE parts for the clinics A and B.

• For clinic A, the query will contain the AND/OR combination of the 16 CDS-level constraints that make up the phenotype, plus the constraint on gender. See Table A.1.

• For clinic B, the query will contain Atherosclerotic disease = 'yes' instead of the 5 CDS-level constraints that define Atherosclerotic disease in the Reference Ontology. This clause is combined with several other CDS constraints like Presence of angina pectoris = 'no' etc., coming from the other parts of the user-defined phenotype tree. See Table A.2.

The two queries, with the same SELECT parts, are sent to the clinical web services declared in the FROM part, and the matching records are returned (Fig. A.3).

Fig. A.3. The patient set returned as a response, alongside aggregated data (upper part) and individual patient records (lower part).

(7) Analyze the result set. The user can browse aggregate and detail information on the patients in tabular form. She can run data mining methods such as decision trees to select the dominant variables of the SELECT part, with respect to any variable chosen as class label. She can also save the complete query as well as the returned patient set to ensure consistency with future queries in longitudinal studies.

(8) Run a semantic query. The goal here is to find publications that link the user-defined phenotype to a chromosome. When started, the Semantic Engine searches a cached copy of public resources like MedLine to find publications relevant to the phenotype specified by the user. The engine extracts search terms from the phenotype tree, also allowing the user to edit them. Then it runs the search, using the Reference Ontology also to rewrite the term list in case of too few or too many hits. The user views the ranked and highlighted abstracts, and may refine the original phenotype accordingly (see Fig. A.4).

Fig. A.4. Semantic Query definition GUI.

(9) Run a genomic query. Suppose the locus 9p21 was suggested by the relevant publications in the previous step. Now, using the Genomic Engine, the user can search public genomic databases to find out which SNPs are associated with this locus, and then check which of these SNPs the patients in the returned record set actually carry, thus verifying the original hypothesis. This concludes the phenotype–genotype association study.

The above example demonstrated a phenotype–genotype study. The NEUROWEB tools, however, can also be used for other purposes, such as verifying a suggested treatment in the partners' databases for a certain phenotype or a single patient.
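The conversion described in step (6) can be sketched as a recursive walk over the phenotype tree that emits a WHERE clause per clinic. This is only a minimal sketch under stated assumptions, not the actual Phenotype Converter: the nested-tuple tree encoding, the function names `compile_where` and `build_query`, the set-based description of each clinic's mapping, and the view names are hypothetical. Ischemic_Stroke is again omitted because its constraints are not detailed in the appendix.

```python
from typing import Iterable, Set

# Tree encoding (an assumption of this sketch):
#   ("cds", code, value)              leaf CDS-level constraint
#   ("and" | "or", [children])        logical combination
#   ("phenotype", name, definition)   named phenotype with its CDS-level definition
ATHEROSCLEROTIC_DISEASE = ("phenotype", "Atherosclerotic_Disease", ("or", [
    ("cds", "A.01.02.02.10.01.00", 2),  # myocardial infarction
    ("cds", "A.01.02.02.11.01.00", 2),  # angina pectoris
    ("cds", "A.01.02.02.09.01.00", 2),  # family stroke/TIA
    ("cds", "A.01.02.02.05.01.00", 2),  # arterial hypertension
    ("cds", "A.01.02.02.07.01.00", 2),  # hypercholesterolemia
]))

USER_PHENOTYPE = ("and", [
    ("cds", "A.01.01.01.03.00.00", 1),  # gender = female
    ATHEROSCLEROTIC_DISEASE,
])


def compile_where(node, phenotype_level: Set[str]) -> str:
    """Render a phenotype tree as a WHERE clause for one clinic.

    `phenotype_level` names the phenotypes that the clinic maps directly,
    instead of through their CDS-level definition.
    """
    kind = node[0]
    if kind == "cds":
        _, code, value = node
        return f'"{code}" = {value}'
    if kind == "phenotype":
        _, name, definition = node
        if name in phenotype_level:
            return f"{name} = 'yes'"
        return compile_where(definition, phenotype_level)
    if kind in ("and", "or"):
        joiner = f" {kind.upper()} "
        parts = (compile_where(child, phenotype_level) for child in node[1])
        return "(" + joiner.join(parts) + ")"
    raise ValueError(f"unknown node kind: {kind!r}")


def build_query(select_codes: Iterable[str], view_name: str, where: str) -> str:
    """Assemble SELECT / FROM / WHERE; the SELECT part is shared by all clinics."""
    columns = ", ".join(f'"{code}"' for code in select_codes)
    return f"SELECT {columns} FROM {view_name} WHERE {where}"


SELECTED = ["A.01.01.01.00.00.00", "A.01.01.02.00.00.00"]  # patient ID, date of birth

where_a = compile_where(USER_PHENOTYPE, set())                        # full CDS mapping
where_b = compile_where(USER_PHENOTYPE, {"Atherosclerotic_Disease"})  # phenotype-level

query_a = build_query(SELECTED, "clinic_a_view", where_a)  # hypothetical view name
query_b = build_query(SELECTED, "clinic_b_view", where_b)  # hypothetical view name
print(where_b)
```

With this encoding, the two clinics differ only in the set passed as `phenotype_level`, which reflects the point made in step (6): the same user-defined phenotype yields different WHERE parts while the SELECT part stays identical.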

References

[1] Botstein D, Risch NJ. Discovering genotypes underlying human phenotypes:past successes for mendelian disease, future approaches for complex disease.Nat Genet 2003;33(Suppl.):228–37.

[2] Tabor HK, Risch NJ, Myers RM. Candidate-gene approaches for studying genetictraits: practical considerations. Nat Rev Genet 2002;3(5):391–7.

[3] Beneventano D, Bergamaschi S, Guerra F, Vincini M. Synthesizing an integratedontology. IEEE Internet Comput 2003;7(5):42–51.

[4] Beneventano D, Bergamaschi S, Lodi S, Sartori C. Consistency checking incomplex object database schemata with integrity constraints. IEEE TransKnowledge Data Eng 1998;10(4):576–98.

[5] Lenz R, Beyer M, Kuhn KA. Semantic integration in healthcare networks. Int JMed Inform 2006;76(2–3):201–7.

[6] Brazhnik O, Jones JF. Anatomy of data integration. J Biomed Inform2007;40(3):252–69.

[7] Rahm E, Bernstein PA. A survey of approaches to automatica schema matching.VLDB J 2001;10(4):334–50.

[8] Pérez-Rey D, Maojo V, Garcia-Remesal M, Alonso-Calvo R, Billhardt H, Martin-Sánchez F, et al. ONTOFUSION: ontology-based integration of genomic andclinical databases. Comput Biol Med 2006;36(7):712–30.

[9] Sun Y. Methods for automated concept mapping between medical databases. JBiomed Inform 2004;37(3):162–78.

[10] Alladin-eHealth Project. Document at <www.alladin-ehealth.org/Pub/Docs/D9.1.pdf>; 2007.

[11] Arens Y, Hsu C-N, Knoblock CA. Query processing in the SIMS informationmediator. In: Readings in agents. Morgan Kaufmann Publishers Inc.; 1998. p.82–90.

[12] Sun J, Miao H, Cao X. A domain formal ontology and the application in servicecomponent retrieval. In: Proceedings of the international conference onsoftware engineering advances (ICSEA), Tahiti, French Polynesia. IEEEComputer Society; 2006.

[13] Garcia-Remesal M, Maojo V, Billhardt H, Crespo J, Alonso-Calvo R, Perez-Rey D,et al. ARMEDA II: supporting genomic medicine through the integration ofmedical and genetic databases. In: Proceedings of the fourth IEEE symposiumon bioinformatics and bioengineering (BIBE’04). IEEE Computer Society; 2004.

[14] Köhler J, Philippi S, Lange M. Semeda: ontology based semantic integration ofbiological databases. Bioinformatics 2003;19(18):2420–7.

[15] Mougin F, Burgun A, Bodenreider O, Chabalier J, Loréal O, Beux P. Automaticmethods for integrating biomedical data sources in a mediator-based system.In: DILS ’08: Proceedings of the fifth international workshop on dataintegration in the life sciences. Heidelberg, Berlin: Springer-Verlag; 2008. p.61–76.

[16] Sahoo SS, Bodenreider O, Rutter JL, Skinner KJ, Sheth AP. An ontology-drivensemantic mashup of gene and biological pathway information: application tothe domain of nicotine dependence. J Biomed Inform 2008;41(5):752–65.

[17] Pérez-Rey D, Maojo V, García-Remesal M, Alonso-Calvo R, Billhardt H, Martin-Sánchez F, et al. ONTOFUSION: ontology-based integration of genomic andclinical databases. Comput Biol Med 2005.

[18] Coulet A, Smaı̈l-Tabbone M, Benlian P, Napoli A, Devignes M-D. Ontology-guided data preparation for discovering genotype–phenotype relationships.BMC Bioinform 2008;9(S-4).

[19] Rubin DL, Lewis SE, Mungall CJ, Misra S, Westerfield M, Ashburner M, et al.National Center for Biomedical Ontology: advancing biomedicine throughstructured organization of scientific knowledge. OMICS 2006;10(2):185–98.

[20] Coulet A, Smaı̈l-Tabbone M, Benlian P, Napoli A, Devignes M-D. SNP-converter:an ontology-based solution to reconcile heterogeneous SNP descriptions forpharmacogenomic studies. In: DILS; 2006. p. 82–93.

[21] Burgun A, Bodenreider O. Accessing and integrating data and knowledge forbiomedical research. Yearbook Med Inform 2008:91–101.

[22] So H-C, Chen RYL, Chen EYH, Cheung EFC, Li T, Sham PC. An association studyof rgs4 polymorphisms with clinical phenotypes of schizophrenia in a chinese

population. Am J Med Genet Part B, Neuropsychiatr Genet2008;147B(1):77–85.

[23] Lee C, Kong M. An interactive association of common sequence variants in theneuropeptide y gene with susceptibility to ischemic stroke. Stroke2007;38(10):2663–9.

[24] Boncoraglio GB, Bodini A, Brambilla C, Carriero MR, Ciusani E, Parati EA. Aneffect of the pai-1 4g/5g polymorphism on cholesterol levels may explainconflicting associations with myocardial infarction and stroke. Cerebrovasc Dis2006;22(2–3):191–5.

[25] Scheuermann RH, Ceuster W, Smith B. Towards an ontological treatment ofdisease and diagnosis, cf. <http://summit2009.amia.org/>; March 2009, inpress.

[26] Guarino N, Poli R, editors. Formal ontology in conceptual analysis andknowledge representation. Kluwer Academic Press; 1994.

[27] Bouquet P, Donà A, Serafini L, Zanobini S. ConTeXtualized local ontologyspecification via CTXML. In: MeaN-02 – AAAI workshop on meaningnegotiation, Edmonton, Alberta, Canada. AAAI; 2002.

[28] Van der Pet PE, Mars NJI. Bottom-up construction of ontologies. IEEE TransKnowledge Data Eng 1998;10(4):513–26.

[29] Bard JB, Rhee L, Seung Y. Ontology in biology: design, application and futurechallenges. Nat Rev Genet 2004;3:213–22.

[30] Fielding JM, Simon J, Ceusters W, Smith B. Ontological theory for ontologicalengineering: biomedical systems information integration. In: Proceedings ofthe ninth international conference on the principles of knowledgerepresentation and reasoning (KR2004), Whistler, BC, Canada; June 2004.

[31] Gomez-Perez A, Corcho O, Fernandez-Lopez M. Ontological engineering. NewYork, NY, USA: Springer-Verlag; 2004.

[32] Scheuermann R, Smith B. Signs, symptoms and findings: first steps toward anontology of clinical phenotypes. Web site at <www.bioontology.org/wiki/index.php/DallasWorkshop>. Dallas Phenotype Workshop; September 2008.

[33] Smith C, Goldsmith CA, Eppig J. The mammalian phenotype ontology as a toolfor annotating, analyzing and comparing phenotypic information. GenomeBiol 2004;6(1):R7.

[34] PATO – Phenotypic Quality Ontology. Web site at <www.bioontology.org/wiki/index.php/PATO:MainPage>.

[35] Bodenreider O, Smith B, Kumar A, Burgun A. Investigating subsumption inSNOMED-CT: an exploration into large description logic-based biomedicalterminologies. Artif Intell Med 2007;39(3):183–95.

[36] Linksvan der Kooij J, Goossen WT, Goossen-Baremans AT, de Jong-FintelmanM, van Beek L. Using SNOMED-CT codes for coding information in electronichealth records for stroke patients. Stud Health Technol Inform2006;124:815–23.

[37] Cimino JJ. Coding systems in health care. Methods Inform Med 1996;35(4–5):273–84 [Review paper].

[38] Disease Ontology (DisO) – The NUgene Project. Web site at<diseaseontology.sf.net>.

[39] Adams Jr HP, Bendixen BH, Kappelle JJ, Biller J, Love BB, Gordon DL, et al.Classification of subtype of acute ischemic stroke, definition for use in amulticenter clinical trial, TOAST. Trial of Org 10172 in Acute Stroke Treatment.Stroke 1993;24:35–41.

[40] Goldstein LB, Jones MR, Matchar DB, Edwards LJ, Hoff J, Chilukuri V, et al.Improving the reliability of stroke subgroup classification using the Trial ofORG 10172 in Acute Stroke Treatment (toast) criteria. Stroke 2001;32:1091–7.

[41] Ay H, Furie KL, Singhal A, Smith WS, Sorensen AG, Koroshetz WJ. An evidence-based causative classification system for Acute Ischemic Stroke. Ann Neurol2005;58:688–97.

[42] Taylor CF, Field D, Sansone S-A, Aerts J, Apweiler R, Ashburner M, et al.Promoting coherent minimum reporting guidelines for biological andbiomedical investigations: the MIBBI project. Nat Biotechnol2008;26(8):889–96.

[43] Donnelly K. SNOMED-CT: the advanced terminology and coding system foreHealth. Stud Health Technol Inform 2006;121:279–90.

[44] Lipscomb CE. Medical Subject Headings (MeSH). Bull Med Library Assoc2000;88(3).

[45] Stenson PD, Ball EV, Mort M, Phillips AD, Shiel JA, Thomas NS, et al. HumanGene Mutation Database (HGMD): 2003 update. Human Mutat2003;21(6):577–81.

[46] Stenson PD, Ball E, Howells K, Phillips A, Mort M, Cooper DN. Human GeneMutation Database: towards a comprehensive central mutation database. JMed Genet 2008;45(2):24–126.

[47] Lewis SE. Gene Ontology: looking backwards and forward. Genome Biol2004;6(103).

[48] Spasic I, Ananiadu S, McNaught J, Kumar A. Text mining and ontologies inbiomedicine: making sense of raw text. Brief Bioinform 2005;6(3):239–51.

[49] Chen Y, Perl Y, Geller J, Cimino JJ. Analysis of a study of the users, uses, andfuture agenda of the UMLS. J Maerican Med Inform Assoc 2007;14(2):221–31.

[50] ICD9-CM. Web site at <www.icd9cm.net>.[51] ICD10. Web site at <www.who.int/classifications/icd>.[52] Fellbaum C, editor. WordNet: an electronic lexical database. MIT Press; 1998.[53] Jensen CS, Snograss RT. Temporal data management. Knowledge Data Eng

1999;11(1):36–44.[54] Artale A, Parent C, Spaccapietra S. Modeling the evolution of objects in

temporal information systems. In: Kirchberg M, Hegner SJ, editors,Proceedings of the fourth international symposium of foundations of

484 G. Colombo et al. / Journal of Biomedical Informatics 43 (2010) 469–484

information and knowledge-based systems. Lecture notes in computerscience, vol. 3861; 2006. p. 22–42.

[55] Shahar Y. Timing is everything: temporal reasoning and temporal datamaintenance in medicine. In: Joint European conference on artificialintelligence in medicine and medical decision making (AIMDM’99). Lecturenotes in artificial intelligence, vol. 1620; 1999. p. 30–46.

[56] Rector AL, Rogers J. Ontological and practical issues in using a description logicto represent medical concept systems: experiences from GALEN. In: ReasoningWeb 2006. Lecture notes in computer science, vol. 4126. Springer-Verlag;2006. p. 197–231.

[57] Palma P, Llamas B, Gonzales A, Menarguez M. Acquisition and representationof causal and temporal knowledge in medical domains. In: Knowledge-basedintelligent information and engineering systems, seventh internationalconference, KES 2003, Lecture notes in computer science, vol. 2777; 2003. p.1284–90.

[58] International HapMap Consortium. A haplotype map of the human genome.Nature 2005;437(7063):1299–320.

[59] Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, et al. dbSNP:the NCBI database of genetic variation. Nucleic Acids Res 2001;29:308–11.

[60] Meschia JF. Clinically translated ischemic stroke genomics. Stroke 2004;35(11Suppl. 1):2735–9.

[61] Kaneisha M, Goto S. KEGG: Kyoto encyclopedia of genes and genomes. NucleicAcid Res 2000;28:27–30.

[62] Kaneisha M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima M, et al.From genomics to chemical genomics: new developments in KEGG. NucleicAcid Res 2006;34:D354–7.

[63] Kaneisha M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh M, et al. KEGG forlinking genomes to life and the environment. Nucleic Acid Res2008;36:D480–4.

[64] Becker K, Barnes KC, Bright TJ, Wang SA. The genetic association database. NatGenet 2004;36(5):431–2.

[65] Baader F, Calvanese D, McGuinness DL, Nardi D, Patel-Schneider PF, editors.The description logic handbook: theory, implementation and application. NewYork, NY, USA: Cambridge University Press; 2003.

[66] Sattler U. Description logics for the representation of aggregate objects. In:Proceedings of the 14th European conference on artificialintelligence. Amsterdam, Holland: IOS Press; 2000.

[67] Horrocks I, Sattler U, Tobies S. Practical reasoning for expressive descriptionlogics. In: Proceedings of the sixth international conference on logic andautomated programming (LPAR99), Tblisi, Georgia, 1999.

[68] W3C. OWL Web Ontology Language. Web site at <www.w3c.org/TR/owl-guide>, February 2004.

[69] Jena – A Semantic Web Framework for Java. Web site at <http://jena.sourceforge.net>.

[70] GlassFish. Web site at <http://glassfish.dev.java.net>.[71] Douglas K, Douglas S. PostgreSQL. Sams, 2nd ed.; 2005.