Structuring Medical Records with Apache Stanbol
-
Upload
khangminh22 -
Category
Documents
-
view
3 -
download
0
Transcript of Structuring Medical Records with Apache Stanbol
Structuring Medical Records with Apache Stanbol
Rafa Haro, Senior Software Engineer, Athento Antonio Pérez Morales, Senior Software Engineer, Ixxus
• Committer, PMC Member @ Apache Stanbol, Apache ManifoldCF
• Topics: Document Analysis, NLP, Machine Learning, Semantic Technologies, ECM
• Committer @ Apache Stanbol, Apache ManifoldCF
• Topics: ECM, Semantic Search, ETL, Machine Learning
Apache Stanbol provides a set of reusable components for semantic content management. It extends existing CMSs with a number of semantic services.
CMS
Traditional Semantic
Apache Stanbol Story
• Started within FP7 European Project IKS (Interactive Knowledge Stack. 2009 - 2012)
• IKS project brought together an Open Source Community for Defining and Building Platforms in the Semantic CMS Space
• Incubated in November 2010
• Successfully promoted within CMS and ECM industry through IKS Early Adopters Program
• Graduated to Top-Level Apache Project in October 2012
What is a Semantic CMS?
Traditional CMS
Atomic Unit: Document
Properties as meta-data (key-value schemas)
Keyword Search
Document Management Document Types
Document Workflow
Semantic CMS
Atomic Unit: Entity
Semantic meta-data (RDF)
Semantic Search
Knowledge Management Entity Management
Ontologies
Source: What Apache Stanbol Can Do for You?. Fabian Christ. ApacheCon Europe 2012
Key Points• Designed to bring Semantic Technologies to existing CMS
• Non-intrusive set of RESTful ‘Semantic’ Services
• Extremely Modular: Use only the modules you need
• Main Features: • Multilingual Content Enhancement: Structure Content through Semantic
Metadata
• Knowledge Bases Management
• Knowledge Models and Reasoning
• Semantic Indexing and Search
Stanbol Components• Stanbol components provide:
• RESTful API • Java APIs and OSGi services
• Stanbol components do NOT depend on each other • however they can be easily combined to
www.iks-project.eu
Page:
Apache Stanbol Service Layer
Apache StanbolComponent Layer
ApacheStanbol
Reasoners
ApacheStanbol
Enhancer
ApacheStanbol Rules
ApacheStanbol
Ontology Manager
ApacheStanbol
ContentHub
ApacheStanbol
EntityHub
ApacheStanbol
FactStoreStanbolEnhancement
Engines
VIE - User Interface LayerVIE VIE
Widgets
ApacheStanbol
CMS Adapter
Copyright IKS Consortium
6
Service-Oriented View
Stanbol Components (II)• Enhancer: Extracts Knowledge from unstructured parsed content
• EntityHub: Manage Domain Entities and Topics (Knowledge Bases)
• ContentHub: Semantic Indexing / Search over your - semantic enhanced - Content
• CMS Adapter: Sync. your CMS with Apache Stanbol (JCR/CMIS)
• Ontology Manager: Manage you formal Domain Knowledge
• Reasoners & Rules: Apply Domain Knowledge to improve / validate extracted Information. Refactor / refine knowledge to align it to public schemas such as schema.org
Built on Top of Apache….
• Apache Felix as OSGi environment
• Apache Sling launchers and OSGi Tools
• Apache Maven for building
• Apache Clerezza as RDF Framework
• Apache Jena as TripleStore
• Apache Solr for Knowledge Bases Management
• Apache Tika for converting input
• Apache OpenNLP for NLP Processing
Integration Scenarios
Source: What Apache Stanbol Can Do for You?. Fabian Christ. ApacheCon Europe 2012
• Stand-Alone Server (Stanbol Launchers)
• Web Application (Servlet-Container)
• Embedded within an OSGi environment
Project Current Status
Contributions (commits) to Trunk Since Incubation
Incubation (Nov 2010)
Apache Stanbol 0.9.0-incubating
(Aug 2012)
Graduation (October 2012)
IKS Project Ending (Dec 2012)
Apache Stanbol 0.12.0
(March 2014)
Apache Stanbol 1.0.0
(October 2016)
Project Current Status (II)
• 22 PMC Members (Last Addition Jul 2016) • 26 Committers (Last Addition May 2015)
• 3-5 active committers last 2 years • [email protected]: 228 subscribers
• Activity has been gradually decreasing • 3 major releases
Source: Apache Stanbol Committee Report Helper (https://reporter.apache.org/?stanbol)
Stanbol Enhancement Chains• Define how Content is processed by the Enhancer through an ExecutionPlan • Different Implementations:
• ListChain: in order sequential enhancement engines execution. Parallel Execution of engines not supported
• WeightedChain: ExecutionPlan is calculated using the engines order metadata. Parallel Execution of engines allowed
• API: • /enhancer: executes the default chain • /enhancer/chain/{chain-name}: executes a concrete named chain • /enhancer/engine/{engine-name}: executes a concrete named engine
Current Enhancement Engines• Preprocessing
• Tika Engine • content type detection • text extraction from several document formats • metadata extraction from several document formats
• Natural Language Processing • Language Detection (different implementations) • Sentence Detection (OpenNLP, SmartCN, REST) • Tokenizer (OpenNLP, SmartCN, REST) • POS Tagging (OpenNLP, REST) • Chunking (OpenNLP, REST) • NER (OpenNLP, OpenCalais, REST)
• Entity Linking • Named Entity Linking • EntityHub Linking Engine • FST (Lucene Finit State Transducer) Linking Engine • Entity Co-mention • Commercial Engines (OpenCalais, Zemanta, CELI…)
• Sentiment Analysis • Disambiguation
• DBPedia Spotlight • Solr MLT based
• PostProcessing: • Dereferencing
Stanbol EntityHub (II)• Manage Multiple Entity Sources (Knowledge Bases)
• Allows Fast Entity-Lookup using Apache Solr
• Referenced Site (Remote LD + Local Caches) Vs Managed Site (Entity CRUD Api over manually configured Sites)
• API: • Query for Entities (used by Entity Linking Engines)
• CRUD for Managed Sites • LDPath support for:
• Graph Path Retrieval (Used for dereferencing) • Schema Translation • Simple Reasoning
schema:name = rdfs:label[@en];
friend-names = foaf:knows/foaf:name
curl -X POST -d "name=lyon&limit=10" \ http://localhost:8080/entityhub/site/dbpedia/find
Use Case: Hexin Project - Structuring Medical Records
• R&D Project for Sergas (Galician Public Health Office) • Clinical Data Analysis Platform for supporting:
• Clinical Assistance • Epidemiology studies • Medical Research
• Big Data approach for analyzing both structured historical clinical data and unstructured medical records
• Medical Records are written in Spanish and Galician
Hexin: Architecture
Validation AnalysisPatient
Data Source
URX
ETL
BIG DATA (HDFS +
HIVE)
Event Detection Process
Cassandra
Reference Cases Detection Process
New Case
BIPatientId Date Structured Events Semantic Events Symptoms: • Cough • Unrest
Unrest Cough Fever>38
Rules
Hexin:Solution Design
• Structure Medical Records using Apache Stanbol Enhancer • Custom Ontology:
• Symptoms • Diseases • Diagnosis Tests • Family and Personal History
• Custom Enhancement Chain: • Language Detection > NLP > Entity Linking > Negation
Detection > Fact Extraction
Hexin: Ontology Indexing
• For supporting the Entity Linking process against Hexin Ontology, an EntityHub site must be created
• 2 options: • ManagedSite: full CRUD storage <-> DYNAMIC • ReferencedSite: READ-ONLY remote site + local index
• Stanbol EntityHub Indexing Tool: • RDF —> JenaTDB —> Solr Index
• Configure Custom Namespaces, Mappings and Properties • Generates an OSGi Bundle with the Yard and YardSite default
configurations • Copy the index to Stanbol /datafiles folder and install the bundle
using Apache Felix OSGi Web Console
hexin:*hexin:label > rdfs:label
NegexFact Extract.Hexin Linking
Hexin: Enhancement Chain
OpenNLP-ChunkerOpenNLP-POSOpenNLP-TokenOpenNLP-Sent.Lang. Detect.
Custom Hexin Engine. Implemented for the project
Entity Linking Engine. Available in Stanbol with a Custom Configuration for this use case
NLP Engines. Available in Stanbol. Default Configuration
Pre-Processing Engine. Available in Stanbol
Hexin: Custom Engines
@Component@Servicepublic class MyEngine implements EnhancementEngine {
@Activate public void activate(ComponentContext c) { // initialize, configure, ... }
public int canEnhance(ContentItem item) { if(...item matches our expectations...) { return ENHANCE_SYNCHRONOUS; } else { return CANNOT_ENHANCE; } }
public void computeEnhancements(ContentItem item) { // run the engine and add results to item’s // RDF graph based on the item’s InputStream }}
maven-bundle-plugin
adds OSGI metadata
Maven build
maven-scr-pluginadds services metadata
registered by OSGi
MyEngineService
MANIFEST.MF
OSGi metadata
OSGi bundle
Install in Stanbol no restart
needed
NLP at Apache Stanbol (II)• Browsable Map with Spans
• Spans sorted by Natural Order • Iterator based API that allows
concurrent Modifications • Annotations supported at Spans Level
• POS Annotation • PosTag
tag (e.g. NE) lexical category (e.g. Noun)
• Phrase Annotation (chunks) • PhraseTag
tag (e.g. NP) lexical-category (e.g. NounPhrase)
• Sentiment Annotation • SentimentTag:: Double
Stanbol is an Amazing Tool
Sentence
Chunk
Token
Span Types: • Token • Chunk • Sentence • Text Section • Analyzed Text
Hexin Custom Engine: Negex
• Context/Negex: Algorithm for Negation Detection • Based on Triggers-Terms + Regex
Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform. Oct 2001;34(5):301-310.
public abstract class AbstractNegexDetector implements NegexDetector {
@Overridepublic Set<IRI> detectNegations(String language, Graph metadata, AnalysedText at) throws NegexException{}
protected abstract boolean isNegated(String language, String concept, String sentence);
}
Hexin Custom Engine: Negex (II)• Triggers Types:
• Pre-condition Negation terms (e.g. absence of) • Pseudo Negation terms (e.g. no increase) • Pre-condition possibility phrase (e.g. rule him out) • Post-condition negation terms (e.g. unlikely) • Termination terms (e.g. but, however)
• Implementation available under Apache License 2.0 • Engine Implementation Challenges:
• Entity Annotations as Targets • AnalyzedText and EntityAnnotations relationships are currently obfuscated • GLUE CODE for locating Entity Annotations Spans by using START - END Text
Annotations properties • Once Entity Annotation sentence is located, is used as context along with the Entity
surface-form (mention) for applying the algorithm • Negation Returned as a Custom Property for the TextAnnotation (negated = True or False)
Hexin Custom Engine: Fact Extraction
“Paciente diabético desde los 5 años y con EPOC moderada grado 2 de la GOLD”
Hexin Custom Engine: Fact Extraction (II)
• In-Context Entity Fact Extraction • Facts returned as Entity RDF Metadata like the rest of Entity
Properties • Different Implementations of Context (all extracted from
AnalyzedText structure) • Sentence Context (default and usually enough) • Window of Text Context • Paragraph Context
• Rule Based Approach: • Regex over RAW Text or POS tags Sequence
• ENTITY reserved word -> OR expression for all ENTITY labels
Hexin Custom Engine: Fact Extraction (III)
• Supported Expressions: • diabetes|diabético|DM desde los N años • diabetes|diabético|DM a los N años • Debut diabetes|diabético|DM a los N años
Hexin Custom Engine: Fact Extraction (IV)
• POS based Rules: Diabetes diagnosed when he was 5 years old
NNS VB WRB PRP VBD CD NNS JJ ENTITY \s VB * VB[be] (CD) years old or simply
ENTITY \s VB * VB[be] (CD)