DBpedia InsideOut: An Introduction to a Major Hub for Linked Open Data
DBPEDIA INSIDEOUT: AN INTRODUCTION TO THE MAJOR HUB FOR LINKED OPEN DATA
Cristina Pattuelli, Pratt Institute
March 16, 2015
WHAT IT IS
DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web in the form of Linked Open Data.
Source: http://lod-cloud.net/
THE STATE OF THE LOD CLOUD 2014
2011: 295 DATASETS 2014: 570 DATASETS (+93%)
Source: blog.classora.com/2012/10/10/describiendo-el-conocimiento-en-un-formato-estandar-para-la-web-semantica-rdf/
Connected with other Linked Datasets by 50 million RDF links
Most widely used linking predicates: owl:sameAs, rdfs:seeAlso, foaf:knows
CENTRAL INTERLINKING HUB OF THE WEB OF DATA
“Which albums did Miles Davis record with female instrumentalists?”
“Which populated places in Australia are below sea level?”
“What did Andy Warhol and Thelonious Monk have in common?”
PEAN TO DBPEDIA
• Multi-domain
• Automatically evolving
• Community consensus driven
• Multilingual (>125 language editions)
• Accessible on the Web
“THINGS”
Each thing in the DBpedia dataset is identified by a URI of the form http://dbpedia.org/resource/Name. Name is derived from the URL of the source Wikipedia article, which has the form http://en.wikipedia.org/wiki/Name.
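The naming rule above can be sketched in a few lines of Python. This is only an illustration of the URI pattern, not DBpedia's extraction code; the function name dbpedia_uri_from_wikipedia_url is hypothetical, and the sketch assumes the English Wikipedia URL form shown above.

```python
from urllib.parse import urlparse

WIKI_PATH_PREFIX = "/wiki/"
DBPEDIA_RESOURCE_BASE = "http://dbpedia.org/resource/"


def dbpedia_uri_from_wikipedia_url(wikipedia_url):
    """Derive a DBpedia resource URI from a Wikipedia article URL.

    Hypothetical helper: it copies the article name found after /wiki/
    onto the DBpedia resource base, as the slide describes.
    """
    path = urlparse(wikipedia_url).path
    if not path.startswith(WIKI_PATH_PREFIX):
        raise ValueError("not a Wikipedia article URL: " + wikipedia_url)
    name = path[len(WIKI_PATH_PREFIX):]
    if not name:
        raise ValueError("URL has no article name: " + wikipedia_url)
    return DBPEDIA_RESOURCE_BASE + name


print(dbpedia_uri_from_wikipedia_url("http://en.wikipedia.org/wiki/Billie_Holiday"))
```

For the Billie Holiday article, this yields the same resource URI used in the triples that follow.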
GENERATING FACTS FOR THE ENTITY BILLIE HOLIDAY
Subject Predicate Object
Billie Holiday has name "Billie Holiday"
S <http://dbpedia.org/resource/Billie_Holiday>
P <http://xmlns.com/foaf/0.1/name>
O "Billie Holiday"
Billie Holiday has occupation Songwriter
S <http://dbpedia.org/resource/Billie_Holiday>
P <http://dbpedia.org/ontology/occupation>
O <http://dbpedia.org/resource/Songwriter>
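As a rough sketch, the two facts can be held as subject/predicate/object tuples and written out as N-Triples lines. The Triple class and to_ntriples function are hypothetical names for illustration. The sketch assumes that the dbpedia-owl prefix expands to http://dbpedia.org/ontology/ and that the object is the resource URI rather than its HTML page, and it escapes only backslashes and double quotes in literals.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str          # URI of the thing being described
    predicate: str        # URI of the property
    obj: str              # URI, or the text of a literal
    obj_is_literal: bool  # True when obj is a plain string value


def to_ntriples(triple):
    """Serialize one triple as a single N-Triples line (minimal escaping)."""
    if triple.obj_is_literal:
        escaped = triple.obj.replace("\\", "\\\\").replace('"', '\\"')
        obj = '"' + escaped + '"'
    else:
        obj = "<" + triple.obj + ">"
    return "<{}> <{}> {} .".format(triple.subject, triple.predicate, obj)


BILLIE = "http://dbpedia.org/resource/Billie_Holiday"
facts = [
    Triple(BILLIE, "http://xmlns.com/foaf/0.1/name", "Billie Holiday", True),
    Triple(BILLIE, "http://dbpedia.org/ontology/occupation",
           "http://dbpedia.org/resource/Songwriter", False),
]

for fact in facts:
    print(to_ntriples(fact))
```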
HARVESTING FACTS
Wikipedia articles consist mostly of free text, but also contain different types of structured information, such as infobox templates, categorization information, images, geo-coordinates, and links to external Web pages.
The core of DBpedia consists of an infobox extraction process. Infoboxes are templates contained in many Wikipedia articles. They are usually displayed in the top right corner of articles and contain factual information.
INFOBOX EXTRACTION
Raw Infobox Extraction: creates triples directly from the infobox data.
Mapping-based Infobox Extraction: maps infobox data against the DBpedia Ontology.
RAW INFOBOX EXTRACTION
• Generic, algorithm-based
• Retains the property names used in the infobox
• Properties are identified by the dbpprop prefix
MAPPING-BASED INFOBOX EXTRACTION
• Maps infobox data to the community-curated DBpedia Ontology
• Properties are identified by the dbpedia-owl prefix
RAW INFOBOX EXTRACTION
Pros: Complete coverage of all infobox attributes (not all infoboxes have been mapped yet).
Cons: Lower data quality (synonyms are not resolved, e.g., placeOfBirth/birthPlace; high error rate in determining the datatype of an attribute value).
MAPPING-BASED INFOBOX EXTRACTION
Pros: Data is cleaner (typing resources, merging name variants, assigning specific datatypes to the values).
Cons: Not full coverage.
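The trade-off between the two extraction modes can be illustrated with a toy example. This is a simplified sketch, not the DBpedia extraction framework: the infobox dictionaries, the MAPPINGS table, and both function names are hypothetical, and the namespaces assumed for the dbpprop and dbpedia-owl prefixes are http://dbpedia.org/property/ and http://dbpedia.org/ontology/.

```python
DBPPROP = "http://dbpedia.org/property/"
DBPEDIA_OWL = "http://dbpedia.org/ontology/"

# Illustrative infobox attributes from two hypothetical articles.
infobox_a = {"birthPlace": "Philadelphia", "yearsActive": "1930-1959"}
infobox_b = {"placeOfBirth": "Red Bank"}

# Hypothetical community-curated mapping: infobox attribute -> ontology property.
# "yearsActive" is deliberately left unmapped to show the coverage gap.
MAPPINGS = {"birthPlace": "birthPlace", "placeOfBirth": "birthPlace"}


def raw_extraction(infobox):
    """Keep every attribute under its original name (full coverage, synonyms kept)."""
    return [(DBPPROP + attr, value) for attr, value in infobox.items()]


def mapping_based_extraction(infobox):
    """Keep only mapped attributes, merged onto one ontology property."""
    return [
        (DBPEDIA_OWL + MAPPINGS[attr], value)
        for attr, value in infobox.items()
        if attr in MAPPINGS
    ]


for box in (infobox_a, infobox_b):
    print("raw:   ", raw_extraction(box))
    print("mapped:", mapping_based_extraction(box))
```

Raw extraction returns both synonyms as distinct properties, while the mapping-based pass merges them into a single birthPlace property and silently drops the unmapped attribute.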
DBpedia describes 4.58 million things, of which 4.22 million are classified in a consistent ontology.
THE DBPEDIA ONTOLOGY
• Cross-domain ontology
• Large thematic coverage
• Currently covers 685 classes, which form a subsumption hierarchy, and 2,795 different properties describing the classes (e.g., aircraftHelicopterAttack)
• Shallow (≤ 5 levels)
THE DBPEDIA ONTOLOGY
Because the DBpedia Ontology is built upon infobox templates, its semantic structure suffers from a lack of logical consistency and presents significant semantic gaps in the hierarchy.
The hierarchy is kept shallow for the sake of visualization and navigation. Example class: http://dbpedia.org/ontology/MusicalArtist
WIKIPEDIA CATEGORY SYSTEM
Wikipedia uses categories to group articles that share similar subjects. The categories are constantly evolving and currently number more than 740,000. DBpedia contains 80.9 million links to Wikipedia categories.
WIKIPEDIA CATEGORY SYSTEM
Most categories are assigned manually by Wikipedia contributors and can be found listed as links at the bottom of a Wikipedia article.
CATEGORIZING PEOPLE
At least four categories:
• the year the person was born
• the year they died
• their nationality
• their reason for being notable
CATEGORIZATION OF PEOPLE
First sentence of an article: Billie Holiday (born Eleanora Fagan; April 7, 1915 – July 17, 1959) was an American jazz singer and songwriter.
Year born: Category:1915 births
Year died: Category:1959 deaths
Nationality: Category:American people
Reason for notability / Occupation: Category:Musicians
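Because the birth and death categories follow a regular naming pattern, they can be read back as data. The sketch below assumes the "Category:YEAR births" and "Category:YEAR deaths" forms shown above; the function name life_dates_from_categories is hypothetical.

```python
import re

# Patterns for the two year-based people categories shown in the example.
BIRTHS = re.compile(r"^Category:(\d{1,4}) births$")
DEATHS = re.compile(r"^Category:(\d{1,4}) deaths$")


def life_dates_from_categories(categories):
    """Return the birth and death years found among a list of category names."""
    dates = {"born": None, "died": None}
    for category in categories:
        born = BIRTHS.match(category)
        died = DEATHS.match(category)
        if born:
            dates["born"] = int(born.group(1))
        elif died:
            dates["died"] = int(died.group(1))
    return dates


billie_categories = [
    "Category:1915 births",
    "Category:1959 deaths",
    "Category:American people",
    "Category:Musicians",
]
print(life_dates_from_categories(billie_categories))
```

Nationality and occupation categories have no comparably regular pattern, so this sketch leaves them untouched.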
WIKIPEDIA CATEGORY SYSTEM
Collaborative effort
Advantages: categories are continually updated to correspond with article content.
Disadvantages: lack of consistency in its hierarchical structure and “rather loose relatedness between articles” (Bizer et al., 2009). “Messy hierarchy.”
RE-CATEGORIZATION OF BILLIE HOLIDAY
(→External links: re-categorisation per Wikipedia:Categories for discussion/Log/2014 December 26, replaced: Category:American women composers → Category:American female composers)
(Robot - Moving category African-American female musicians to Category:African-American musicians per CFD at Wikipedia:Categories for discussion/Log/2013 January 10.)
WIKIPEDIA ONTOLOGY IN DBPEDIA
The hierarchical structure of the categories is represented in DBpedia by way of two different properties:
• dcterms:subject (relates an entity to a category)
• skos:broader (relates a child category to its parent category)
http://ensiwiki.ensimag.fr/images/f/fa/Dbpedia-relation-discovery-demo.pdf
The hierarchy of categories between “flower” and “cucumber”
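A walk up the category hierarchy through dcterms:subject and skos:broader can be sketched as a breadth-first traversal. The data below is invented for illustration and does not reproduce real DBpedia links; SUBJECTS, BROADER, and ancestor_categories are hypothetical names. The visited set matters because a loosely curated hierarchy may contain shared parents or cycles.

```python
from collections import deque

# Hypothetical dcterms:subject links: entity -> categories it belongs to.
SUBJECTS = {"Billie_Holiday": ["Category:American_jazz_singers"]}

# Hypothetical skos:broader links: child category -> parent categories.
BROADER = {
    "Category:American_jazz_singers": ["Category:Jazz_singers", "Category:American_singers"],
    "Category:Jazz_singers": ["Category:Jazz_musicians"],
    "Category:American_singers": ["Category:Singers"],
    "Category:Jazz_musicians": ["Category:Musicians"],
    "Category:Singers": ["Category:Musicians"],
}


def ancestor_categories(entity):
    """List every category reachable from an entity, nearest categories first."""
    queue = deque(SUBJECTS.get(entity, []))
    seen = set(queue)
    ordered = []
    while queue:
        category = queue.popleft()
        ordered.append(category)
        for parent in BROADER.get(category, []):
            if parent not in seen:  # guards against shared parents and cycles
                seen.add(parent)
                queue.append(parent)
    return ordered


for category in ancestor_categories("Billie_Holiday"):
    print(category)
```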
YAGO ONTOLOGY
A robust classification scheme with a deep hierarchical structure. Originally derived from the Wikipedia category system using the semantic lexicon WordNet.
• Over 350,000 classes; 100 relationships
• Provides DBpedia data with coherence and structural consistency
• A taxonomic backbone
QUERYING DBPEDIA FOR LINKED JAZZ
Jazz Name Vocabulary: a personal name vocabulary in the form of RDF statements, each pairing the artist’s name with a Uniform Resource Identifier (URI).
<http://dbpedia.org/resource/Billie_Holiday> <http://xmlns.com/foaf/0.1/name> "Billie Holiday"
QUERYING DBPEDIA FOR LINKED JAZZ
DBpedia was initially queried for literal triples with a foaf:name predicate that satisfied the following criteria:
1. The entity must have an rdf:type of dbpedia-owl:MusicalArtist.
2. The entity must have the dbpedia:genre property with the value dbpedia:Jazz.
+ rdfs:label → name of the resource
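The criteria above might translate into a SPARQL query along the following lines, held here as a Python string so it can be sent to any SPARQL client. This is a reconstruction under assumptions, not the query the project actually ran: it assumes the genre property is the ontology property dbpedia-owl:genre (the slide writes dbpedia:genre), that dbpedia: expands to http://dbpedia.org/resource/, and that rdfs:label is retrieved as an optional extra name.

```python
# Assumed prefix expansions; dbpedia-owl and dbpedia follow common DBpedia usage.
JAZZ_NAME_QUERY = """\
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dbpedia: <http://dbpedia.org/resource/>

SELECT DISTINCT ?artist ?name ?label WHERE {
  ?artist rdf:type dbpedia-owl:MusicalArtist ;   # criterion 1
          dbpedia-owl:genre dbpedia:Jazz ;       # criterion 2 (assumed property)
          foaf:name ?name .                      # the literal being harvested
  OPTIONAL { ?artist rdfs:label ?label . }       # the resource's own name
}
"""

print(JAZZ_NAME_QUERY)
```

The string can be pasted into DBpedia's public SPARQL endpoint at http://dbpedia.org/sparql; results will depend on the dataset version being served.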
QUERYING DBPEDIA FOR LINKED JAZZ
Prominent musicians whom we expected to find by querying for the dbpedia:Jazz genre were not returned.
Example: “Count Basie” fell under dbpedia:Swing_music, dbpedia:Big_band_music, and dbpedia:Piano_blues, but not under dbpedia:Jazz.
This required us to revise our query method by expanding it to include additional relevant music genres.
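The effect of widening the genre filter can be shown with a small in-memory example. Only Count Basie's three genres come from the text above; the Billie Holiday entry, the ARTIST_GENRES table, and the artists_matching function are hypothetical and stand in for the real query against DBpedia.

```python
# Toy genre assignments. Count Basie's genres are those named on the slide;
# the Billie Holiday entry is an assumption added for contrast.
ARTIST_GENRES = {
    "dbpedia:Count_Basie": {
        "dbpedia:Swing_music",
        "dbpedia:Big_band_music",
        "dbpedia:Piano_blues",
    },
    "dbpedia:Billie_Holiday": {"dbpedia:Jazz"},
}


def artists_matching(genres):
    """Return artists tagged with at least one of the given genres, sorted by URI."""
    return sorted(
        artist
        for artist, artist_genres in ARTIST_GENRES.items()
        if artist_genres & genres
    )


narrow = {"dbpedia:Jazz"}
expanded = narrow | {
    "dbpedia:Swing_music",
    "dbpedia:Big_band_music",
    "dbpedia:Piano_blues",
}

print("Jazz only:", artists_matching(narrow))
print("Expanded: ", artists_matching(expanded))
```

With the narrow filter Count Basie is missed; adding the related genres brings him into the result set.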
IN SUM
New type of knowledge representation environment:
- constant state of flux
- decentralized interplay of different descriptive and classification systems
- it challenges our tolerance threshold for data quality and our traditional notion of authority control