
Transcript of arXiv:2002.06039v1 [cs.DB] 14 Feb 2020

Benchmarking Knowledge Graphs on the Web

Michael Röder and Mohamed Ahmed Sherif and Muhammad Saleem and Felix Conrads and Axel-Cyrille Ngonga Ngomo

DICE group, Department of Computer Science, Paderborn University, Germany

Institute for Applied Informatics, Leipzig, Germany

Abstract

With the growing interest in making use of Knowledge Graphs for developing explainable artificial intelligence, there is an increasing need for a comparable and repeatable comparison of the performance of Knowledge Graph-based systems. History in computer science has shown that a main driver of scientific advances, and in fact a core element of the scientific method as a whole, is the provision of benchmarks to make progress measurable. This paper gives an overview of benchmarks used to evaluate systems that process Knowledge Graphs.

1 Introduction

With the growing number of systems using Knowledge Graphs (KGs), there is an increasing need for a comparable and repeatable evaluation of the performance of systems that create, enhance, maintain and give access to KGs. History in computer science has shown that a main driver of scientific advances, and in fact a core element of the scientific method as a whole, is the provision of benchmarks to make progress measurable. Benchmarks have several purposes: (1) they highlight weak and strong points of systems, (2) they stimulate technical progress and (3) they make technology viable. Benchmarks are an essential part of the scientific method as they allow tracking the advancements in an area over time and make competing systems comparable. The TPC benchmarks¹ demonstrate how benchmarks can influence an area of research. Even though databases were already quite established in the early 90s, the benchmarking efforts resulted in algorithmic improvements (ignoring advances in hardware) of 15% per year, which translated to an order of magnitude improvement in an

¹ http://www.tpc.org/information/benchmarks.asp

Figure 1: Price/performance trend lines for TPC-A and TPC-C. The 15-year trend lines track Moore's Law (100x per 10 years) (Gray and Levine, 2005).

already very established area (Gray and Levine, 2005). Figure 1 shows the reduction of the costs per transaction over time.

This paper gives an overview of benchmarks used to evaluate systems that process KGs. To improve the comparability of different benchmarks, we consider that each benchmark comprises the following components (a small illustrative sketch follows the list):

• The definition of the functionality that will be benchmarked.

• A set of tasks T. Each task t_j = (i_j, e_j) is a pair of input data (i_j) and expected output data (e_j).

• Background data B, which is typically used to initialize the system.

• One or more metrics, which for each task t_j receive the expected result e_j as well as the result r_j provided by the benchmarked system.
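To make this schema concrete, the following minimal Python sketch (an illustration of the schema described above, not code taken from any of the cited benchmarks) models a benchmark as a list of tasks, background data and a set of metrics:

from dataclasses import dataclass
from typing import Any, Callable, Dict, List

@dataclass
class Task:
    # A single benchmark task t_j = (i_j, e_j).
    input_data: Any       # i_j, e.g., a document or a SPARQL query
    expected_output: Any  # e_j, the gold-standard result

@dataclass
class Benchmark:
    # A benchmark: tasks, background data B, and one or more metrics.
    tasks: List[Task]
    background_data: Any                             # B, e.g., a KB or a set of types
    metrics: Dict[str, Callable[[Any, Any], float]]  # name -> metric(e_j, r_j)

    def evaluate(self, system: Callable[[Any], Any]) -> Dict[str, float]:
        # Run the system on every task and average each metric over all tasks.
        scores: Dict[str, List[float]] = {name: [] for name in self.metrics}
        for task in self.tasks:
            received = system(task.input_data)  # r_j
            for name, metric in self.metrics.items():
                scores[name].append(metric(task.expected_output, received))
        return {name: sum(values) / len(values) if values else 0.0
                for name, values in scores.items()}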

The remainder of this paper comprises two major parts. First, we present the existing benchmark approaches for KG-processing systems. To structure

Figure 2: The Linked Data lifecycle (Auer and Lehmann, 2010; Ngomo et al., 2014).

the large list of benchmarks, we follow the steps of the Linked Data lifecycle (Auer and Lehmann, 2010; Ngomo et al., 2014). Second, we present existing benchmarking frameworks and briefly summarize their features before concluding the paper.

2 Benchmarking the Linked Data lifecycle steps

In (Auer and Lehmann, 2010; Ngomo et al., 2014), the authors propose a lifecycle of Linked Data (LD) KGs comprising 8 steps. Figure 2 gives an overview of these steps. In this section, we go through the individual steps, briefly summarize the different actions they cover and describe how these actions can be benchmarked.

2.1 Extraction

A first step to enter the LD lifecycle is to extract information from unstructured or semi-structured representations and transform it into the RDF data model (Ngomo et al., 2014). This field of information extraction has been the subject of research for several decades. The Message Understanding Conference (MUC) introduced a systematic comparison of information extraction approaches in 1993 (Sundheim, 1993). Several other challenges followed, e.g., the Conference on Computational Natural Language Learning (CoNLL) published the CoNLL corpus and organized a shared task on named entity recognition (Tjong Kim Sang and De Meulder, 2003). In a similar way, the Automatic Content Extraction (ACE) challenge (Doddington et al., 2004) has been organized by NIST. Other challenges are the Workshop on Knowledge Base Population (TAC-KBP) hosted at the Text Analysis Conference (McNamee, 2009) and the Senseval challenge (Kilgarriff, 1998). In 2013, the Making Sense of Microposts workshop series started including entity recognition and linking challenge tasks (Cano Basave et al., 2013; Cano et al., 2014; Rizzo et al., 2015, 2016). In 2014, the Entity Recognition and Disambiguation (ERD) challenge took place (Carmel et al., 2014). The Open Knowledge Extraction challenge series started in 2015 (Nuzzolese et al., 2015, 2016; Speck et al., 2017, 2018).

2.1.1 Extraction Types

The extraction of knowledge from unstructured data comes with a wide range of different types. In (Cornolti et al., 2013), the authors propose a set of different extraction types that can be benchmarked. These definitions are further refined and extended in (Usbeck et al., 2015; Röder et al., 2018). A set of basic extraction functionalities for single entities can be distinguished as follows (Röder et al., 2018):

• ERec: For this type of extraction, the input to the system is a plain text document d. The expected output is the set of mentions M of all entities within the text (Röder et al., 2018). A set of entity types T is used as background data to define which types of entities should be marked in the texts.

• D2KB: The input for entity disambiguation (or entity linking) is a text d with already marked entity mentions µ ∈ M. The goal is to map this set of given entities to entities from a given Knowledge Base (KB) K or to NIL. The latter represents the set of entities that are not present in K (also called emerging entities (Hoffart et al., 2014)). Although K is not part of the task (and the benchmark datasets do not contain it), it can be seen as background data of a benchmark.

• Entity Typing (ET): Entity typing is similar to D2KB. Its goal is to map a set of given entity mentions to the type hierarchy of a KB.

• C2KB: Concept tagging (C2KB) aims to detect the entities relevant for a given document. Formally, the function takes a plain text as input and returns a subset of the KB.

Based on these basic types, more complex types have been defined (Röder et al., 2018):

• A2KB: This is a combination of the ERec and D2KB types and represents the classical named entity recognition and disambiguation. Thus, the input is a plain text document d and the system has to identify entity mentions µ and link them to K.

• RT2KB: This extraction type is the combination of entity recognition and typing, i.e., the goal is to identify entities in a given document set D and map them to the types of K.

For extracting more complex relations, the RE experiment type is defined in (Speck and Ngonga Ngomo, 2018; Speck et al., 2018). The input for relation extraction is a text d with marked entity mentions µ ∈ M that have been linked to a knowledge base K. Let u_j ∈ U be the URI of the j-th marking µ_j, with U ⊂ K ∪ NIL. The goal is to identify the relation the entities have within the text and return it as a triple (s, p, o). Note that the extracted triple does not have to be within the given knowledge base. To narrow the number of possible relations, a set of properties P is given as part of the background data. Table 1 summarizes the different extraction types.
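To illustrate how these extraction types instantiate the general task schema, the following hypothetical Python data (the document text, spans, URIs and type names are invented for illustration) shows an ERec task and an A2KB task:

# Hypothetical ERec task: the input is a plain text document d, the expected output
# is the set of entity mentions M (here as character spans), and the background data
# is the set of entity types T.
erec_task = {
    "input": "President Obama visits Paris.",
    "expected": [(0, 15), (23, 28)],  # spans of "President Obama" and "Paris"
    "background": {"types": ["Person", "Place"]},
}

# Hypothetical A2KB task: the system additionally has to link each mention to a KB
# resource (or to NIL for emerging entities); K and T form the background data.
a2kb_task = {
    "input": "President Obama visits Paris.",
    "expected": [
        ((0, 15), "http://dbpedia.org/resource/Barack_Obama"),
        ((23, 28), "http://dbpedia.org/resource/Paris"),
    ],
    "background": {"kb": "DBpedia", "types": ["Person", "Place"]},
}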

2.1.2 Matchings

One of the major challenges when benchmarking a system is to define how the system's response can be matched to the expected answer. This is a non-trivial task when either the expected answer or the system response are sets of elements. A matching defines the conditions that have to be fulfilled by a system response to be a correct result, i.e., to match an element of the expected answer.

Positional Matching. Some of the extraction types described above aim to identify the positions of entities within a given text. In the following example, a system recognizes two entities (es1 and es2) that have to be compared to the expected entities (ee1 and ee2) by the benchmarking framework.

           es1          es2
          |---|        |---|
President Obama visits Paris.
|-------------|        |---|
      ee1                ee2

While the system's marking es2 matches ee2 exactly, the comparison of es1 and ee1 is non-trivial. Two different strategies have been established to handle such cases (Cornolti et al., 2013; Usbeck et al., 2015; Röder et al., 2018). The strong annotation matching matches two markings if they have exactly the same positions. The weak annotation matching relaxes this condition: it matches two markings if they overlap with each other.
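A minimal sketch of the two strategies, assuming markings are represented as half-open character offset pairs (start, end) (this representation is an assumption made for illustration, not prescribed by the cited frameworks):

def strong_match(marking_a, marking_b):
    # Strong annotation matching: both markings have exactly the same positions.
    return marking_a == marking_b

def weak_match(marking_a, marking_b):
    # Weak annotation matching: the markings overlap in at least one character.
    (start_a, end_a), (start_b, end_b) = marking_a, marking_b
    return start_a < end_b and start_b < end_a

# Example from the text: the system marks only "Obama" (es1) while the gold
# standard marks "President Obama" (ee1); weak matching accepts this, strong does not.
es1, ee1 = (10, 15), (0, 15)
assert weak_match(es1, ee1) and not strong_match(es1, ee1)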

URI Matching. For the evaluation of the system performance with respect to several types of extraction, the matching of URIs is needed. Extraction systems can use URIs that differ from those used by the creators of a benchmark, e.g., because the extraction system returns Wikipedia article URLs instead of DBpedia URIs. Since different URIs can point to the same real-world entity, the matching of two URIs is not as trivial as the comparison of two strings. The authors of (Röder et al., 2018) propose a workflow for improving the fairness of URI matching:

1. URI set retrieval. Instead of representing a meaning as a single URI, it is represented as a set of URIs that point to the same real-world entity. This set can be expanded by crawling the Semantic Web graph using owl:sameAs links and redirects.

2. URI set classification. As defined above, the URIs returned by the extraction system either represent an entity that is in K or an emerging entity (NIL). Hence, after expanding the URI sets, they are classified into these two classes. A URI set S is classified as S ∈ C_KB if it contains at least one URI of K. Otherwise, it is classified as S ∈ C_EE.

3. URI set matching. Two URI sets S_1 and S_2 match if both are assigned to the class C_KB and the sets overlap, or if both sets are assigned to the class C_EE (see the sketch after this list):

((S_1 ∈ C_KB) ∧ (S_2 ∈ C_KB) ∧ (S_1 ∩ S_2 ≠ ∅)) ∨ ((S_1 ∈ C_EE) ∧ (S_2 ∈ C_EE))    (1)

A comparison of URIs of emerging entities is not suggested since most of these URIs are typically synthetically generated by the extraction system.
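A small Python sketch of this matching rule (Equation 1), assuming the URI sets have already been expanded and the KB K is given as a set of URIs:

def classify(uri_set, kb_uris):
    # A URI set belongs to C_KB if it contains at least one URI of K, otherwise to C_EE.
    return "KB" if uri_set & kb_uris else "EE"

def uri_sets_match(set_1, set_2, kb_uris):
    # Equation 1: both sets in C_KB and overlapping, or both sets in C_EE.
    class_1, class_2 = classify(set_1, kb_uris), classify(set_2, kb_uris)
    if class_1 == "KB" and class_2 == "KB":
        return bool(set_1 & set_2)
    # URIs of emerging entities are usually synthetic, so no overlap is required here.
    return class_1 == "EE" and class_2 == "EE"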

Table 1: Summary of the Extraction types.

Type     Input i_j    Output e_j                           Background data B
A2KB     d            {(µ, u) | µ ∈ M, u ∈ K ∪ NIL}        K, T
C2KB     d            U ⊂ K                                K
D2KB     (d, M)       {(µ, u) | µ ∈ M, u ∈ K ∪ NIL}        K
ERec     d            M                                    T
ET       (d, M)       {(µ, T) | µ ∈ M, T ⊂ K}              K
RE       (d, M, U)    {(s, p, o) | s, o ∈ U, p ∈ P}        K, P
RT2KB    d            {(µ, c) | c ∈ K}                     K, T

2.1.3 Key Performance Indicators

The performance of knowledge extraction systems is mainly measured using Precision, Recall and F1-measure. Since most benchmarking datasets comprise multiple tasks, Micro and Macro averages are used for summarizing the performance of the systems (Cornolti et al., 2013; Usbeck et al., 2015; Röder et al., 2018). A special case is the comparison of entity types. Since the types can be arranged in a hierarchy, the hierarchical F-measure can be used for this kind of experiment (Röder et al., 2018). For evaluating the systems' efficiency, the response time of the systems is a typical KPI (Usbeck et al., 2015; Röder et al., 2018). It should be noted that most benchmarking frameworks send the tasks sequentially to the extraction systems. In contrast to that, the authors of (Speck et al., 2017) introduced a stress test. During this test, tasks can be sent in parallel and the gaps between the tasks are reduced over time. For a comparison of the performance of the systems under pressure, a β metric is used, which combines the F-measure and the runtime of the system.
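The following sketch illustrates the difference between Micro and Macro averaging of the F1-measure, assuming the true positive, false positive and false negative counts per task have already been determined:

def f1_measure(tp, fp, fn):
    # Precision- and recall-based F1-measure; 0 if nothing was predicted or expected.
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0

def micro_macro_f1(task_counts):
    # task_counts: one (tp, fp, fn) tuple per benchmark task (document).
    # Micro: sum the counts over all tasks, then compute a single F1 value.
    micro = f1_measure(sum(tp for tp, _, _ in task_counts),
                       sum(fp for _, fp, _ in task_counts),
                       sum(fn for _, _, fn in task_counts))
    # Macro: compute F1 per task and average the per-task values.
    macro = sum(f1_measure(*counts) for counts in task_counts) / len(task_counts)
    return micro, macro

print(micro_macro_f1([(5, 1, 2), (0, 3, 1)]))  # two documents of different difficulty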

2.1.4 Datasets

Since the field of information extraction has been tackled for several years, a large number of datasets is available. Table 2 shows some example datasets. It is clear that the datasets vary a lot in their number of documents (i.e., the number of tasks), the length of single documents and the number of entities per document. All datasets listed in the table have been manually annotated. However, it is known that this type of dataset has two drawbacks. First, (Jha et al., 2017) showed that although the datasets are used as gold standards, they are not free of errors. Apart from wrongly positioned entity mentions, all of these datasets have the issue that the KBs used as reference to link the entities have evolved. This leads to annotations that use outdated URIs not present in the latest KB versions. In (Röder et al., 2018), the authors propose a mechanism to identify and handle these outdated URIs. Second, the size of the datasets is limited by the high costs of the manual annotation of documents. It can be seen from the table that if the number of documents is high, the datasets typically comprise short messages, and if the number of words per document is high, the number of documents in the dataset is low. Hence, most of the datasets cannot be used to benchmark the scalability of extraction systems. To this end, (Ngomo et al., 2018) proposed the automatic generation of annotated documents based on a given KB.

2.2 Storage & Querying

After a critical mass of RDF data has been extracted, the data has to be stored, indexed and made available for querying in an efficient way (Ngomo et al., 2014). Triplestores are data management systems for storing and querying RDF data. In this section, we discuss the different benchmarks used to assess the performance of triplestores. In particular, we highlight key features of triplestore benchmarks pertaining to the three main components of benchmarks, i.e., datasets, queries, and performance metrics. State-of-the-art triplestore benchmarks are analyzed and compared against these features. Most of the content in this section is adapted from (Saleem et al., 2019).

2.2.1 Triplestore Benchmark Design Features

In general, triplestore benchmarks comprise three main components: (1) a set of RDF datasets, (2) a set of SPARQL queries, and (3) a set of performance metrics. With respect to the benchmark schema defined in Section 1, each task of a SPARQL benchmark comprises a single SPARQL query as input, while the expected output data comprises the expected result of the query. The RDF

Table 2: Example benchmarks for knowledge extraction (Röder et al., 2018). Collections of datasets, e.g., for a single challenge, have been grouped together.

Corpus                                          Task       Topic       |Documents|   Entities/Doc.   Words/Doc.
ACE2004 (Ratinov et al., 2011)                  A2KB       news        57            5.37            373.9
AIDA/CoNLL (Hoffart et al., 2011)               A2KB       news        1393          25.07           189.7
AQUAINT (Milne and Witten, 2008)                A2KB       news        50            14.54           220.5
Derczynski IPM NEL (Derczynski et al., 2015)    A2KB       tweets      182           1.57            20.8
ERD2014 (Carmel et al., 2014)                   A2KB       queries     91            0.65            3.5
GERDAQ (Cornolti et al., 2016)                  A2KB       queries     992           1.72            3.6
IITB (Kulkarni et al., 2009)                    A2KB       mixed       103           109.22          639.7
KORE 50                                         A2KB       mixed       50            2.88            12.8
Microposts2013 (Cano Basave et al., 2013)       RT2KB      tweets      4265          1.11            18.8
Microposts2014 (Cano et al., 2014)              A2KB       tweets      3395          1.50            18.1
Microposts2015 (Rizzo et al., 2015)             A2KB       tweets      6025          1.36            16.5
Microposts2016 (Rizzo et al., 2016)             A2KB       tweets      9289          1.03            15.7
MSNBC (Cucerzan, 2007)                          A2KB       news        20            37.35           543.9
N3 Reuters-128 (Röder et al., 2014)             A2KB       news        128           4.85            123.8
N3 RSS-500 (Röder et al., 2014)                 A2KB       RSS-feeds   500           1.00            31.0
OKE 2015 Task 1 (Nuzzolese et al., 2015)        A2KB, ET   mixed       199           5.11            25.5
OKE 2016 Task 1 (Nuzzolese et al., 2016)        A2KB, ET   mixed       254           5.52            26.6
Ritter (Ritter et al., 2011)                    RT2KB      news        2394          0.62            19.4
Senseval 2 (Edmonds and Cotton, 2001)           ERec       mixed       242           9.86            21.3
Senseval 3 (Mihalcea et al., 2004)              ERec       mixed       352           5.70            14.7
Spotlight Corpus                                A2KB       news        58            5.69            28.6
UMBC (Finin et al., 2010)                       RT2KB      tweets      12973         0.97            17.2
WSDM2012/Meij (Blanco et al., 2015)             C2KB       tweets      502           1.87            14.4

dataset(s) are part of the background data of a benchmark against which the queries are executed. In the following, we present key features of each of these components that are important to consider in the development of triplestore benchmarks.

Datasets. Datasets used in triplestore benchmarks are either synthetic or selected from real-world RDF datasets (Saleem et al., 2015b). The use of real-world RDF datasets is often regarded as useful to perform evaluations in close-to-real-world settings (Morsey et al., 2011). Synthetic datasets are useful to test the scalability of systems based on datasets of varying sizes. Synthetic dataset generators are utilized to produce datasets of varying sizes that can often be optimized to reflect the characteristics of real-world datasets (Duan et al., 2011). Previous works (Duan et al., 2011; Qiao and Ozsoyoglu, 2015) highlighted two key measures for selecting such datasets for triplestore benchmarking: (1) Dataset Structuredness and (2) Relationship Specialty. The formal definitions of these metrics can be found in (Saleem et al., 2019). However, observations from the literature (see e.g., (Saleem et al., 2018; Duan et al., 2011)) suggest that other features such as a varying number of triples, number of resources, number of properties, number of objects, number of classes, diversity in literal values, average properties and instances per class, average indegrees and outdegrees as well as their distribution across resources should also be considered.

1. Dataset Structuredness: Duan et al. (Duan et al., 2011) combine many of the aforementioned dataset features into a single composite metric called dataset structuredness or coherence. This metric measures how well the classes (i.e., rdfs:Class) defined within a dataset are covered by the different instances of these classes within the dataset. The structuredness value for any given dataset lies in [0,1], where 0 stands for the lowest and 1 for the highest possible structuredness. They conclude that synthetic datasets are highly structured while real-world datasets have structuredness values ranging from low to high, covering the whole structuredness spectrum. The formal definition of this metric can be found in (Duan et al., 2011).

2. Relationship Specialty: In datasets, some attributes are more common and associated with many resources. In addition, some attributes are multi-valued, e.g., a person can have more than one cellphone number or professional skill. The number of occurrences of a predicate associated with each resource in the dataset provides useful information on the graph structure of an RDF dataset, and makes some resources distinguishable from others (Qiao and Ozsoyoglu, 2015). In real datasets, this kind of relationship specialty is commonplace. For example, several million people can like the same movie. Likewise, a research paper can be cited in several hundred other publications. Qiao et al. (Qiao and Ozsoyoglu, 2015) suggest that synthetic datasets are limited in how they reflect this relationship specialty. This is either due to the simulation of uniform relationship patterns for all resources, or a random relationship generation process.

The dataset structuredness and relationship specialty directly affect the result size, the number of intermediate results, and the selectivities of the triple patterns of a given SPARQL query. Therefore, they are important dataset design features to be considered during the generation of benchmarks (Duan et al., 2011; Saleem et al., 2015b; Qiao and Ozsoyoglu, 2015).
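The exact structuredness (coherence) formula is given in (Duan et al., 2011); the following simplified, unweighted Python sketch only illustrates the underlying coverage idea and assumes the dataset is given as plain dictionaries (one class per instance, for simplicity):

from collections import defaultdict

def simple_structuredness(instance_types, instance_properties):
    # instance_types: instance URI -> class URI (one class per instance, for simplicity)
    # instance_properties: instance URI -> set of properties used by that instance
    class_properties = defaultdict(set)
    class_instances = defaultdict(list)
    for instance, cls in instance_types.items():
        class_instances[cls].append(instance)
        class_properties[cls] |= instance_properties.get(instance, set())
    coverages = []
    for cls, instances in class_instances.items():
        props = class_properties[cls]
        if not props:
            continue
        # Fraction of the class's observed properties each instance actually sets,
        # averaged over the instances of the class.
        coverage = sum(len(instance_properties.get(i, set()) & props) / len(props)
                       for i in instances) / len(instances)
        coverages.append(coverage)
    # Average over all classes; 0 = lowest, 1 = highest possible structuredness.
    return sum(coverages) / len(coverages) if coverages else 0.0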

SPARQL Queries. The literature about SPARQL queries (Aluc et al., 2014; Gorlitz et al., 2012; Saleem et al., 2015b, 2018, 2017) suggests that a SPARQL querying benchmark should vary the queries with respect to various query characteristics such as the number of triple patterns, number of projection variables, result set sizes, query execution time, number of BGPs, number of join vertices, mean join vertex degree, mean triple pattern selectivities, BGP-restricted and join-restricted triple pattern selectivities, join vertex types, and highly used SPARQL clauses (e.g., LIMIT, OPTIONAL, ORDER BY, DISTINCT, UNION, FILTER, REGEX). All of these features have a direct impact on the runtime performance of triplestores. The formal definitions of many of the above-mentioned features can be found in (Saleem et al., 2019).
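As a small illustration of such query features, the following sketch derives naive, string-based feature counts from a single SPARQL SELECT query (a rough approximation for illustration only; analyses such as LSQ parse the query properly):

import re

CLAUSES = ["OPTIONAL", "FILTER", "UNION", "DISTINCT", "REGEX",
           "LIMIT", "ORDER BY", "GROUP BY"]

def naive_query_features(query):
    # Very rough feature counts for a single SPARQL SELECT query string.
    upper = query.upper()
    features = {clause: upper.count(clause) for clause in CLAUSES}
    # Projection variables: the variables listed between SELECT and WHERE.
    projection = re.search(r"SELECT\s+(DISTINCT\s+)?(.*?)\s+WHERE", upper, re.S)
    features["projection_vars"] = (len(re.findall(r"[?$]\w+", projection.group(2)))
                                   if projection else 0)
    # Triple patterns: approximated by counting the ' .' separators in the body.
    features["triple_patterns_approx"] = upper.count(" .")
    return features

example_query = """
SELECT DISTINCT ?person ?city WHERE {
  ?person <http://example.org/livesIn> ?city .
  OPTIONAL { ?city <http://example.org/population> ?pop . FILTER(?pop > 100000) }
} LIMIT 10
"""
print(naive_query_features(example_query))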

Performance Metrics. Based on the previous triplestore benchmarks and performance evaluations (Aluc et al., 2014; Bail et al., 2012; Bizer and Schultz, 2009; Demartini et al., 2011; Guo et al., 2005; Erling et al., 2015; Szarnyas et al., 2018b; Morsey et al., 2011; Saleem et al., 2015b; Schmidt et al., 2009; Szarnyas et al., 2018a; Wu et al., 2014; Conrads et al., 2017), the performance metrics for such comparisons can be categorized as:

• Query Processing Related: The performance metrics in this category are related to the query processing capabilities of the triplestores. The query execution time is the central performance metric in this category. However, reporting the execution time for individual queries might not be feasible due to the large number of queries in the given benchmark. To this end, Query Mixes per Hour (QMpH) and Queries per Second (QpS) are regarded as central performance measures to test the querying capabilities of the triplestores (Saleem et al., 2015b; Morsey et al., 2011; Bizer and Schultz, 2009); a small sketch of these two measures follows this list. In addition, the query processing overhead in terms of CPU and memory usage is important to measure during the query executions (Schmidt et al., 2009). This also includes the number of intermediate results, the number of disk/memory swaps, etc.

• Data Storage Related: Triplestores need to load the given RDF data and mostly create indexes before they are ready for query execution. In this regard, the data loading time, the storage space required, and the index size are important performance metrics in this category (Demartini et al., 2011; Wu et al., 2014; Schmidt et al., 2009; Bizer and Schultz, 2009).

• Result Set Related: Two systems can only be compared if they produce exactly the same results. Therefore, result set correctness and completeness are important metrics to be considered in the evaluation of triplestores (Bizer and Schultz, 2009; Schmidt et al., 2009; Saleem et al., 2015b; Wu et al., 2014).

• Parallelism with/without Updates: Some of the aforementioned triplestore performance evaluations (Conrads et al., 2017; Wu et al., 2014; Bizer and Schultz, 2009) also measured the parallel query processing capabilities of the triplestores by simulating workloads from multiple querying agents with and without dataset updates.
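A small sketch of how QpS and QMpH can be derived from measured runtimes (the runtime values are invented and given in seconds):

def queries_per_second(runtimes):
    # QpS for a single query: number of executions divided by their total runtime.
    return len(runtimes) / sum(runtimes)

def query_mixes_per_hour(mix_runtimes):
    # QMpH: how many complete query mixes (one run of every benchmark query)
    # fit into one hour, given the measured runtime of each executed mix.
    avg_mix_runtime = sum(mix_runtimes) / len(mix_runtimes)
    return 3600.0 / avg_mix_runtime

# Invented example: one query executed five times, and three full query mixes.
print(queries_per_second([0.12, 0.10, 0.11, 0.13, 0.09]))  # ~9.1 QpS
print(query_mixes_per_hour([42.0, 39.5, 40.5]))            # ~88.5 QMpH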

We analyzed state-of-the-art existing SPARQL triplestore benchmarks across all of the above-mentioned dataset and query features as well as the performance metrics. The results are presented in Section 2.2.3.

2.2.2 Triplestore Benchmarks

Triplestore benchmarks can be broadly divided into two main categories, namely synthetic and real-data benchmarks.

Synthetic Triplestore Benchmarks. Synthetic benchmarks make use of data (and/or query) generators to generate datasets and/or queries for benchmarking. Synthetic benchmarks are useful for testing the scalability of triplestores with varying dataset sizes and querying workloads. However, such benchmarks can fail to reflect the characteristics of real-world datasets or queries. The Train Benchmark (TrainBench) (Szarnyas et al., 2018a) uses a data generator that produces railway networks of increasing sizes and serializes them in different formats, including RDF. The Waterloo SPARQL Diversity Test Suite (WatDiv) (Aluc et al., 2014) provides a synthetic data generator that produces RDF data with a tunable structuredness value as well as a query generator. The queries are generated from different query templates. SP2Bench (Schmidt et al., 2009) mirrors vital characteristics (such as power-law distributions or Gaussian curves) of the data in the DBLP bibliographic database. The Berlin SPARQL Benchmark (BSBM) (Bizer and Schultz, 2009) uses query templates to generate any number of SPARQL queries for benchmarking, covering multiple use cases such as explore, update, and business intelligence. Bowlogna (Demartini et al., 2011) models a real-world setting derived from the Bologna process and offers mostly analytic queries reflecting data-intensive user needs. The LDBC Social Network Benchmark (SNB) defines two workloads. First, the Interactive workload (SNB-INT) measures the evaluation of graph patterns in a localized scope (e.g., in the neighborhood of a person), with the graph being continuously updated (Erling et al., 2015). Second, the Business Intelligence workload (SNB-BI) focuses on queries that mix complex graph pattern matching with aggregations, touching on a significant portion of the graph (Szarnyas et al., 2018b), without any updates. Note that these two workloads are regarded as two separate triplestore benchmarks based on the same dataset.

Triplestore Benchmarks Using Real Data. Real-data benchmarks make use of real-world datasets and queries from real user query logs for benchmarking. Real-data benchmarks are useful for testing triplestores more closely in real-world settings. However, such benchmarks may fail to test the scalability of triplestores with varying dataset sizes and querying workloads. FEASIBLE (Saleem et al., 2015b) is a cluster-based SPARQL benchmark generator, which is able to synthesize customizable benchmarks from the query logs of SPARQL endpoints. The DBpedia SPARQL Benchmark (DBPSB) (Morsey et al., 2011) is another cluster-based approach that generates benchmark queries from DBpedia query logs, but employs different clustering techniques than FEASIBLE. The FishMark (Bail et al., 2012) dataset is obtained from FishBase³ and provided in both RDF and SQL versions. The SPARQL queries were obtained from logs of the web-based FishBase application. BioBench (Wu et al., 2014) evaluates the performance of RDF triplestores with biological datasets and queries from five different real-world RDF datasets⁴, i.e., Cell, Allie, PDBJ, DDBJ, and UniProt. Due to the size of the datasets, we were only able to analyze the combined data and queries of the first three.

Table 3 summarizes the statistics of the selected datasets of the benchmarks. More advanced statistics will be presented in the next section. The table also shows the number of SPARQL queries of the datasets included in the corresponding benchmark or query log. It is important to mention that we only selected SPARQL SELECT queries for analysis. This is because we wanted to analyze the triplestore benchmarks for their query runtime performance and most of these benchmarks only contain SELECT queries (Saleem et al., 2015b). For the synthetic benchmarks that include data generators, we chose the datasets used in the evaluation of the original paper that were comparable in size to the datasets of other synthetic benchmarks. For template-based query generators such as WatDiv, DBPSB and SNB, we chose one query per available template. For FEASIBLE, we generated a benchmark of 50 queries from the DBpedia log file to be comparable with a well-known WatDiv benchmark that includes 20 basic testing query templates and 30 extensions for testing.⁵

³ FishBase: http://fishbase.org/search.php
⁴ BioBench: http://kiban.dbcls.jp/togordf/wiki/survey#data
⁵ The WatDiv query templates are available at http://dsg.uwaterloo.ca/watdiv/.

2.2.3 Analysis of the Triplestore Benchmarks

We present a detailed analysis of the datasets, queries, and performance metrics of the selected benchmarks and datasets according to the design features presented in Section 2.2.1.

Datasets. We present results pertaining to the dataset features of Section 2.2.1.

• Structuredness. Figure 3a shows the structuredness values of the selected benchmarks. Duan et al. (Duan et al., 2011) establish that synthetic benchmarks are highly structured while real-world datasets have low structuredness. This important dataset feature is well covered in recent synthetic benchmarks such as TrainBench (with a structuredness value of 0.23) and WatDiv, which lets the user generate a benchmark dataset of a desired structuredness value. However, Bowlogna (0.99), BSBM (0.94), and SNB (0.86) have relatively high structuredness values. Finally, on average, synthetic benchmarks are still more structured than real-data benchmarks (0.61 vs. 0.45).

• Relationship Specialty. According to (Qiao and Ozsoyoglu, 2015), relationship specialty in synthetic datasets is limited, i.e., the overall relationship specialty values of synthetic datasets are lower than those of similar real-world datasets. The dataset relationship specialty results presented in Figure 3b mostly confirm this behavior. On average, synthetic benchmarks have a smaller specialty score than real-world datasets. The relationship specialty values of Bowlogna (8.7), BSBM (2.2), and WatDiv (22.0) are on the lower side compared to real-world datasets. The highest specialty value (28282.8) is recorded in the DBpedia dataset.

An important issue is the correlation between the structuredness and the relationship specialty of the datasets. To this end, we computed Spearman's rank correlation between the structuredness and specialty values of all the selected benchmarks. The correlation of the two measures is −0.5, indicating a moderate inverse relationship. This means that the higher the structuredness, the lower the specialty value. This is because in highly structured datasets, data is generated according to a specific distribution without treating some predicates more particularly (in terms of occurrences) than others.
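The correlation itself can be recomputed from two aligned value lists, e.g., with SciPy; the values below are invented placeholders rather than the measurements behind Figure 3:

from scipy.stats import spearmanr

# Placeholder values: structuredness and relationship specialty per benchmark,
# aligned by benchmark (the real values are those reported in Figure 3).
structuredness = [0.99, 0.94, 0.86, 0.23, 0.61, 0.45]
specialty      = [8.7, 2.2, 30.0, 1200.0, 95.0, 400.0]

rho, p_value = spearmanr(structuredness, specialty)
print(f"Spearman's rank correlation: {rho:.2f} (p={p_value:.3f})")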

Table 3: High-level statistics of the data and queries used in existing triplestore benchmarks. Both SNB-BI and SNB-INT use the same dataset and are therefore listed as SNB for simplicity.

Benchmark                                           Subjects    Predicates   Objects     Triples   Queries
Synthetic:
Bowlogna (Demartini et al., 2011)                   2,151k      39           260k        12M       16
TrainB. (Szarnyas et al., 2018a)                    3,355k      16           3,357k      41M       11
BSBM (Bizer and Schultz, 2009)                      9,039k      40           14,966k     100M      20
SP2Bench (Schmidt et al., 2009)                     7,002k      5,718        19,347k     49M       14
WatDiv (Aluc et al., 2014)                          5,212k      86           9,753k      108M      50
SNB (Erling et al., 2015; Szarnyas et al., 2018b)   7,193k      40           17,544k     46M       21
Real:
FishMark (Bail et al., 2012)                        395k        878          1,148k      10M       22
BioBench (Wu et al., 2014)                          278,007k    299          232,041k    1,451M    39
FEASIBLE (Saleem et al., 2015b)                     18,425k     39,672       65,184k     232M      50
DBPSB (Morsey et al., 2011)                         18,425k     39,672       65,184k     232M      25

Figure 3: Analysis of the datasets of triplestore benchmarks and real-world data. Panel (a) shows the dataset structuredness (values in [0,1]) and panel (b) the dataset relationship specialty (log scale), each distinguishing synthetic and real-data benchmarks.

Queries. This section presents results pertaining to the query features discussed in Section 2.2.1. Figure 4 shows the box plot distributions of real-world datasets and benchmark queries across the query features defined in Section 2.2.1. The values inside the brackets, e.g., the 0.89 in "BioBench (0.89)", show the diversity score of the benchmark for the given query feature.

Starting from the number of projection variables (ref. Figure 4a), the DBPSB dataset has the lowest diversity score (0.32) and SP2Bench has the highest score (1.14). The diversity scores of DBPSB, SNB-BI, SNB-INT, WatDiv, BSBM, and Bowlogna are below the average value (0.63). The average diversity score of the number of join vertices (ref. Figure 4b) is 0.91 and hence the diversity scores of Bowlogna, FishMark, WatDiv, BSBM, SNB-INT, and TrainBench are below the average value. It is important to mention that the highest number of join vertices recorded in a query is 51 in the SNB-BI benchmark. The average diversity score of the number of triple patterns (ref. Figure 4c) is 0.83 and hence the diversity scores of the FishMark, Bowlogna, BSBM, and WatDiv benchmarks are below the average value. The average diversity score of the result sizes (ref. Figure 4d) is 2.29 and hence the diversity scores of SNB-BI, SNB-INT, BSBM, TrainBench, SP2Bench, and DBPSB are below the average value. The average diversity score of the join vertex degree (ref. Figure 4e) is 0.55 and hence the diversity scores of SNB-BI, SNB-INT, BSBM, Bowlogna, BioBench, and WatDiv are below the average value. The average diversity score of the triple pattern selectivity (ref. Figure 4f) is 2.06 and hence the diversity scores of SNB-BI, SNB-INT, TrainBench, DBPSB, FishMark, SP2Bench, and BSBM are below the average. The average diversity score of the join-restricted triple pattern selectivity (ref. Figure 4g) is 1.01 and hence the diversity scores of TrainBench, SNB-BI, SNB-INT, FishMark, Bowlogna, WatDiv, and BioBench are below the average value. The average diversity score of the BGP-restricted triple pattern selectivity (ref. Figure 4h) is 2.14 and hence the diversity scores of TrainBench, SNB-BI, SNB-INT, SP2Bench, Bowlogna, BioBench,

Figure 4: Analysis of queries used in triplestore benchmarks and for real-world datasets. Box plots with per-benchmark diversity scores (in brackets) are shown for: (a) no. of projection variables, (b) no. of join vertices, (c) no. of triple patterns, (d) result size, (e) join vertex degree, (f) triple pattern (TP) selectivity, (g) join-restricted TP selectivity, (h) BGP-restricted TP selectivity, (i) no. of BGPs, (j) no. of LSQ features, (k) runtimes (ms), and (l) the overall diversity score; synthetic benchmarks, real-data benchmarks and real-world datasets are distinguished.

DBPSB, and FEASIBLE are below the average value. The average diversity score of the number of BGPs (ref. Figure 4i) is 0.66 and hence the diversity scores of SNB-BI, SNB-INT, BSBM, SP2Bench, TrainBench, WatDiv, and Bowlogna are below the average value.

The Linked SPARQL Queries (LSQ) (Saleem et al., 2015a) representation stores additional SPARQL features, such as the use of DISTINCT, REGEX, BIND, VALUES, HAVING, GROUP BY, OFFSET, aggregate functions, SERVICE, OPTIONAL, UNION, property paths, etc. We count all of these SPARQL operators and functions and use this count as a single query dimension, the number of LSQ features. The average diversity score of the number of LSQ features (ref. Figure 4j) is 0.33, and hence only the diversity scores of SNB and WatDiv are below the average value. Finally, the average diversity score of the query runtimes is 2.04 (ref. Figure 4k), and hence the diversity scores of DBPSB, FishMark, TrainBench, WatDiv, and BioBench are below the average value.

In summary, FEASIBLE generates the most diverse benchmarks, followed by BioBench, FishMark, WatDiv, Bowlogna, SP2Bench, BSBM, DBPSB, SNB-BI, SNB-INT, and TrainBench.

Table 4 shows the percentage coverage of widely used (Saleem et al., 2015a) SPARQL clauses and join vertex types for each benchmark and real-world dataset. We highlighted cells for benchmarks that either completely miss or overuse certain SPARQL clauses and join vertex types. TrainBench and WatDiv queries mostly miss the important SPARQL clauses. All of FishMark's queries contain at least one "Star" join node. The distribution of other SPARQL clauses, such as subquery, BIND, aggregates, solution modifiers, property paths, and services, is provided in the LSQ versions of each of the benchmark queries, available from the project website.

Performance Metrics. This section presents results pertaining to the performance metrics discussed in Section 2.2.1. Table 5 shows the performance metrics used by the selected benchmarks to compare triplestores. The query runtime over the complete set of benchmark queries is the central performance metric and is used by all of the selected benchmarks. In addition, QpS and QMpH are commonly used in the query processing category. We found that, in general, the processing overhead generated by query executions does not receive much attention, as only SP2Bench measures this metric. In the "storage" category, the time taken to load the RDF graph into the triplestore is most common. The in-memory/HDD space required to store the dataset and the corresponding indexes did not get much attention. The result set correctness and completeness are important metrics to be considered when there is a large number of queries in the benchmark and composite metrics such as QpS and QMpH are used. We can see that many of the benchmarks do not explicitly check these two metrics. They mostly assume that the results are complete and correct. However, this might not be the case (Saleem et al., 2018). Additionally, only BSBM considers the evaluation of triplestores with simultaneous user requests with updates. However, benchmark execution frameworks such as Iguana (Conrads et al., 2017) can be used to measure the parallel query processing capabilities of triplestores in the presence of multiple querying and update agents. It is further

Table 4: Coverage of SPARQL clauses and join vertex types for each benchmark, in percentages. SPARQL clauses: DIST[INCT], FILT[ER], REG[EX], OPT[IONAL], UN[ION], LIM[IT], ORD[ER BY]. Join vertex types: Star, Path, Sink, Hyb[rid], N[o] J[oin]. Missing and overused features are highlighted.

                       Distributions of SPARQL Clauses                Distr. of Join Vertex Type
Benchmark     DIST   FILT   REG    OPT    UN     LIM    ORD      Star    Path   Sink   Hyb.   N.J.
Synthetic:
Bowlogna      6.2    37.5   6.2    0.0    0.0    6.2    6.2      93.7    37.5   62.5   25.0   6.2
TrainB.       0.0    45.4   0.0    0.0    0.0    0.0    0.0      81.8    27.2   72.7   45.4   18.1
BSBM          30.0   65.0   0.0    65.0   10.0   45.0   45.0     95.0    60.0   75.0   60.0   5.0
SP2Bench      42.8   57.1   0.0    21.4   14.2   7.1    14.2     78.5    35.7   50.0   28.5   14.2
WatDiv        0.0    0.0    0.0    0.0    0.0    0.0    0.0      28.0    64.0   26.0   20.0   0.0
SNB-BI        0.0    61.9   4.7    52.3   14.2   80.9   100.0    90.4    38.1   80.9   52.3   0.0
SNB-INT       0.0    47.3   0.0    31.5   15.7   63.15  78.9     94.7    42.1   94.7   84.2   0.0
Real:
FEASIBLE      56.0   58.0   22.0   28.0   40.0   42.0   32.0     58.0    18.0   36.0   16.0   30.0
FishMark      0.0    0.0    0.0    9.0    0.0    0.0    0.0      100.0   81.8   9.0    72.7   0.0
DBPSB         100.0  48.0   8.0    32.0   36.0   0.0    0.0      68.0    20.0   32.0   20.0   24.0
BioBench      28.2   25.6   15.3   7.6    7.6    20.5   10.2     71.7    53.8   43.5   38.4   15.3

described in Section 3.3.

2.2.4 Summary of Storage Benchmarks

We performed a comprehensive analysis of existing benchmarks by studying synthetic and real-world datasets as well as by employing SPARQL queries with multiple variations. Our evaluation results suggest the following:

1. The dataset structuredness problem is well covered in recent synthetic data generators (e.g., WatDiv, TrainBench). The low relationship specialty problem in synthetic datasets still exists in general and needs to be covered in future synthetic benchmark generation approaches.

2. The FEASIBLE framework employed on DBpedia generated the most diverse benchmark in our evaluation.

3. The SPARQL query features we selected have a weak correlation with query execution time, suggesting that the query runtime is a complex measure affected by multi-dimensional SPARQL query features. Still, the number of projection variables, join vertices, triple patterns, the result sizes, and the join vertex degree are the top five SPARQL features that most impact the overall query execution time.

4. Synthetic benchmarks often fail to contain important SPARQL clauses such as DISTINCT, FILTER, OPTIONAL, LIMIT and UNION.

5. The dataset structuredness has a direct correlation with the result sizes and execution times of queries and an indirect correlation with the dataset specialty.

2.3 Manual Revision & Authoring

Using SPARQL queries, the user can interact with the stored data directly, e.g., by correcting errors or adding additional information (Ngomo et al., 2014). This is the third step of the Linked Data lifecycle. However, since it mainly comprises user interaction, there are no automatic benchmarks for tools supporting this step. Hence, we will not look further into it.

2.4 Linking

With cleaned data originating from different sources, links between the different information assets need to be established (Ngomo et al., 2014). The generation of such links between knowledge bases is one of the key steps of the Linked Data publication process.⁶ A plethora of approaches has thus been devised to support this process (Nentwig et al., 2017).

The formal specification of Link Discovery (LD) adopted herein is akin to that proposed in (Ngomo, 2012b). Given two (not necessarily distinct) sets S resp. T of source resp. target resources as well as a relation R, the goal of LD is to find the set M = {(s,τ) ∈ S × T : R(s,τ)} of pairs (s,τ) ∈ S × T such that R(s,τ). In most cases, computing M is a non-trivial task. Hence, a large number of frameworks (e.g., SILK (Isele et al., 2011), LIMES (Ngomo, 2012b) and KnoFuss (Nikolov et al., 2007)) aim to approximate M by computing

⁶ http://www.w3.org/DesignIssues/LinkedData.html

Table 5: Metrics used in the selected benchmarks pertaining to query processing, data storage, result set, simultaneous multiple client requests, and dataset updates. QpS: Queries per Second, QMpH: Query Mixes per Hour, PO: Processing Overhead, LT: Load Time, SS: Storage Space, IS: Index Sizes, RCm: Result Set Completeness, RCr: Result Set Correctness, MC: Multiple Clients, DU: Dataset Updates.

                  Processing           Storage              Result Set     Additional
Benchmark      QpS    QMpH   PO     LT     SS     IS      RCm    RCr      MC     DU
Synthetic:
Bowlogna       ✗      ✗      ✗      ✓      ✗      ✓       ✗      ✗        ✗      ✗
TrainBench     ✗      ✗      ✗      ✓      ✗      ✗       ✓      ✓        ✗      ✓
BSBM           ✓      ✓      ✗      ✓      ✗      ✗       ✓      ✓        ✓      ✓
SP2Bench       ✗      ✗      ✓      ✓      ✓      ✗       ✓      ✓        ✗      ✗
WatDiv         ✗      ✗      ✗      ✗      ✗      ✗       ✗      ✗        ✗      ✗
SNB-BI         ✓      ✓      ✗      ✗      ✗      ✗       ✓      ✓        ✗      ✗
SNB-INT        ✓      ✓      ✗      ✗      ✗      ✗       ✓      ✓        ✗      ✓
Real:
FEASIBLE       ✓      ✓      ✗      ✗      ✗      ✗       ✓      ✓        ✗      ✗
FishMark       ✓      ✗      ✗      ✗      ✗      ✗       ✗      ✗        ✗      ✗
DBPSB          ✓      ✓      ✗      ✗      ✗      ✗       ✗      ✗        ✗      ✗
BioBench       ✗      ✗      ✗      ✓      ✓      ✗       ✓      ✗        ✓      ✗

the mapping M′ = {(s,τ) ∈ S × T : σ(s,τ) ≥ θ}, where σ is a similarity function and θ is a similarity threshold. For example, one can configure these frameworks to compare the dates of birth, family names and given names of persons across census records to determine whether they are duplicates. We call the equation which specifies M′ a link specification (short LS; also called linkage rule in the literature, see e.g., (Isele et al., 2011)). Note that the Link Discovery problem can be expressed equivalently using distances instead of similarities in the following manner: Given two sets S and T of instances, a (complex) distance measure δ and a distance threshold ϑ ∈ [0,∞[, determine M′ = {(s,τ) ∈ S × T : δ(s,τ) ≤ ϑ}.⁷

Under this so-called declarative paradigm, two entities s and τ are then considered to be linked via R if σ(s,τ) ≥ θ. Naïve algorithms require O(|S||T|) ∈ O(n²) computations to output M′. Given the large size of existing knowledge bases, time-efficient approaches able to reduce this runtime are hence a central component of efficient link discovery, as link specifications need to be computed in acceptable times. This efficient computation is in turn the proxy necessary for machine learning techniques to be used to optimize the choice of appropriate σ and θ and thus ensure that M′ approximates M well even when M is large (Nentwig et al., 2017).
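A naive O(|S||T|) sketch of computing M′ for a given similarity function σ and threshold θ, using a simple string similarity over labels as a stand-in for σ (an illustrative choice, not the measure of any particular framework):

from difflib import SequenceMatcher

def similarity(s, t):
    # Illustrative sigma: normalized string similarity of two resource labels.
    return SequenceMatcher(None, s.lower(), t.lower()).ratio()

def naive_link_discovery(sources, targets, sigma, theta):
    # Brute-force computation of M' = {(s, t) in S x T : sigma(s, t) >= theta}.
    return [(s, t) for s in sources for t in targets if sigma(s, t) >= theta]

# Invented census-record style example: given and family names of persons.
S = ["John Smith", "Anna Schmidt"]
T = ["Jon Smith", "Anna Schmitt", "Peter Mueller"]
print(naive_link_discovery(S, T, similarity, theta=0.8))  # links the two near-duplicate pairs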

Benchmarks for link discovery frameworks use

⁷ Note that a distance function δ can always be transformed into a normed similarity function σ by setting σ(x,y) = (1 + δ(x,y))⁻¹. Hence, the distance threshold ϑ can be transformed into a similarity threshold θ by means of the equation θ = (1 + ϑ)⁻¹.

tasks of the form t_i = ((S_i, T_i, R_i), M_i). Depending on the given datasets S_i and T_i, and the given relation R_i, a benchmark is able to make different demands on the link discovery frameworks. To this end, such frameworks are designed to accommodate a large number of link discovery approaches within a single extensible architecture to address two main challenges measured by the benchmark:

1. Time-efficiency: The mere size of existing knowledge bases (e.g., 30+ billion triples in LinkedGeoData (Stadler et al., 2012), 20+ billion triples in LinkedTCGA (Saleem et al., 2013)) makes efficient solutions indispensable to the use of link discovery frameworks in real application scenarios. LIMES, for example, addresses this challenge by providing time-efficient approaches based on the characteristics of metric spaces (Ngomo and Auer, 2011; Ngomo, 2012a), orthodromic spaces (Ngomo, 2013) and filter-based paradigms (Soru and Ngomo, 2013).

2. Accuracy: The second key performance indicator is the accuracy provided by link discovery frameworks. Efficient solutions are of little help if the results they generate are inaccurate. Hence, LIMES also accommodates dedicated machine-learning solutions that allow the generation of links between knowledge bases with a high accuracy. These solutions abide by paradigms such as batch and active learning (Ngomo et al., 2011; Ngomo and Lyko, 2012, 2013), unsupervised learning (Ngomo and Lyko, 2013) and even positive-only learning (Sherif et al., 2017).

The Ontology Alignment Evaluation Initiative (OAEI)⁸ has performed yearly contests since 2005 to comparatively evaluate current tools for ontology and instance matching. The original focus has been on ontology matching, but since 2009 instance matching has also been a regular evaluation track. Table 6 (from (Nentwig et al., 2017)) gives an overview of the OAEI instance matching tasks in five contests from 2010 to 2014. Most tasks have only been used in one year, while others like IIMB have been changed in different years. The majority of the linking tasks are based on synthetic datasets where values and the structural context of instances have been modified in a controlled way. Linking tasks cover different domains (life sciences, people, geography, etc.) and data sources (DBpedia, Freebase, GeoNames, NYTimes, etc.). Frequently, the benchmarks consist of several interlinking tasks to cover a certain spectrum of complexity. The number of instances is rather small in all tests, with 9,958 being the maximum size of a data source. The evaluation focus has been solely on effectiveness (e.g., F-measure) while runtime efficiency has not been measured. Almost all tasks focus on identifying equivalent instances (i.e., owl:sameAs links).

We briefly characterize the different OAEI tasks as follows:

• IIMB and Sandbox (SB). The IIMB benchmark has been part of the 2010, 2011 and 2012 contests and consists of 80 tasks using synthetically modified datasets derived from instances of 29 Freebase concepts. The number of instances varies from year to year but the instances have a very small size (e.g., at most 375 instances in 2012). The Sandbox (SB) benchmark from 2012 is very similar to IIMB but limited to 10 different tasks (Aguirre et al., 2012).

• PR (Persons/Restaurant). This benchmark is based on real person and restaurant instance data that is artificially modified by adding duplicates and variations of property values. With 500–600 instances for the restaurant data and even less in the person data source, the dataset is relatively small (Euzenat et al., 2010).

• DI-NYT (Data Interlinking - NYT). This 2011 benchmark includes seven tasks to link about

⁸ http://www.ontologymatching.org

10,000 instances from the NYT data source to DBpedia, Freebase and GeoNames instances. The perfect match result contains about 31,000 owl:sameAs links to be identified (Euzenat et al., 2011).

• RDFT. This 2013 benchmark is also of small size (430 instances) and uses several tests with differently modified DBpedia data. For the first time in the OAEI instance matching track, no reference mapping is provided for the actual evaluation task. Instead, training data with an appropriate reference mapping is given for each test case, thereby supporting frameworks relying on supervised learning (Grau et al., 2013).

• OAEI 2014. Two benchmark tasks had to be performed in 2014. The first one (id-rec) requires the identification of the same real-world book entities (owl:sameAs links). For this purpose, 1,330 book instances have to be matched with 2,649 synthetically modified instances in the target dataset. Data transformations include changes like the substitution of book titles and labels with keywords as well as language transformations. In the second task (sim-rec), the similarity of pairs of instances that do not reflect the same real-world entities is determined. This addresses common preprocessing tasks, e.g., to reduce the search space for LD. In 2014, the evaluation platform SEALS (García-Castro and Wrigley, 2011), which has been used for ontology matching in previous years, is used for instance matching, too (see Section 3.4).

2.5 Enrichment

Linked Data Enrichment is an important topic for all applications that rely on a large number of knowledge bases and necessitate a unified view on this data, e.g., Question Answering frameworks, Linked Education and all forms of semantic mashups. In recent work, several challenges and requirements regarding Linked Data consumption and integration have been pointed out (Millard et al., 2010). Several approaches and frameworks have been developed with the aim of addressing many of these challenges. For example, the R2R framework (Bizer and Schultz, 2010) addresses those by allowing mappings to be published across knowledge bases, defining mapping classes and property

Table 6: OAEI instance matching tasks over the years. "–" means not existing, "?" unclear from publication (Nentwig et al., 2017).

Year   Name      Format     Type        Domains                            LOD Sources                              Link Type    Max. # Resources   # Tasks
2010   DI        RDF        real        life sciences                      diseasome, drugbank, dailymed, sider     equality     5,000              4
       IIMB      OWL        artificial  cross-domain                       Freebase                                 equality     1,416              80
       PR        RDF, OWL   artificial  people, geography                  –                                        equality     864                3
2011   DI-NYT    RDF        real        people, geography, organizations   NYTimes, DBpedia, Freebase, Geonames     equality     9,958              7
       IIMB      OWL        artificial  cross-domain                       Freebase                                 equality     1,500              80
2012   SB        OWL        artificial  cross-domain                       Freebase                                 equality     375                10
       IIMB      OWL        artificial  cross-domain                       Freebase                                 equality     375                80
2013   RDFT      RDF        artificial  people                             DBpedia                                  equality     430                5
2014   id-rec    OWL        artificial  publications                       ?                                        equality     2,649              1
       sim-rec   OWL        artificial  publications                       ?                                        similarity   173                1

value transformations. While this framework supports a large number of transformations, it does not allow the automatic discovery of possible transformations. The Linked Data Integration Framework LDIF (Schultz et al., 2011), whose goal is to support the integration of RDF data, builds upon R2R mappings and technologies such as SILK (Isele et al., 2011) and LDSpider⁹. The concept behind the framework is to enable users to create periodic integration jobs via simple XML configurations. These configurations, however, have to be created manually. The same drawback holds for the Semantic Web Pipes¹⁰, which follows the idea of Yahoo Pipes¹¹ to enable the integration of data in formats such as RDF and XML. By using Semantic Web Pipes, users can efficiently create semantic mashups by using a number of operators (such as getRDF, getXML, etc.) and connecting these manually within a simple interface. It begins by detecting URIs that stand for the same real-world entity and either merging them into one or linking them via owl:sameAs. Fluid Operations' Information Workbench¹² allows users to search through, manipulate and integrate data for purposes such as business intelligence. The work presented in (Choudhury et al., 2009) describes a framework for the semantic enrichment, ranking and integration of web videos. (Abel et al., 2011) presents a semantic enrichment framework for Twitter posts. (Hasan et al., 2011) tackles the Linked Data enrichment problem for sensor data via an approach that sees enrichment as a process driven by situations of interest.

To the best of our knowledge, DEER (Sherif et al., 2015) is the first generic approach tailored towards learning enrichment pipelines of Linked Data given a set of atomic enrichment functions. DEER models RDF dataset enrichment workflows as ordered sequences of enrichment functions. These enrichment functions take exactly one RDF dataset as input and return an altered version of it by virtue of the addition and deletion of triples, as any enrichment operation on a given dataset can be represented as a set of additions and deletions.
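To make this pipeline model concrete, the following minimal Python sketch (illustrative only; all names are hypothetical and this is not DEER's actual API) represents an RDF dataset as a set of triples and an enrichment function as a callable that returns the triples to add and to delete; a pipeline is then the ordered application of such functions.

```python
from typing import Callable, List, Set, Tuple

Triple = Tuple[str, str, str]          # (subject, predicate, object)
Dataset = Set[Triple]
# An enrichment function maps a dataset to (additions, deletions).
EnrichmentFn = Callable[[Dataset], Tuple[Dataset, Dataset]]

def apply_pipeline(dataset: Dataset, pipeline: List[EnrichmentFn]) -> Dataset:
    """Apply an ordered sequence of enrichment functions to a dataset."""
    current = set(dataset)
    for fn in pipeline:
        additions, deletions = fn(current)
        current = (current - deletions) | additions
    return current

def conform_authority(source: str, target: str) -> EnrichmentFn:
    """Illustrative enrichment function, comparable in spirit to the authority
    conformation functions discussed below: rewrite subject authorities."""
    def fn(dataset: Dataset) -> Tuple[Dataset, Dataset]:
        affected = {(s, p, o) for (s, p, o) in dataset if s.startswith(source)}
        additions = {(s.replace(source, target, 1), p, o) for (s, p, o) in affected}
        return additions, affected
    return fn

if __name__ == "__main__":
    data = {("http://dbpedia.org/resource/Aspirin", "rdfs:label", "Aspirin")}
    pipeline = [conform_authority("http://dbpedia.org", "http://example.org")]
    print(apply_pipeline(data, pipeline))
```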

The benchmark proposed by (Sherif et al., 2015) aims to quantify how well the presented RDF data enrichment approach can automate the enrichment process.

9 http://code.google.com/p/ldspider/
10 http://pipes.deri.org/
11 http://pipes.yahoo.com/pipes/
12 http://www.fluidops.com/information-workbench/

The authors thus assumed being given manually created training examples and having to reconstruct a possible enrichment pipeline that generates the target concise bounded descriptions (CBDs13) of resources from the source CBDs. Therefore, they used three publicly available datasets for generating the benchmark datasets:

1. From the biomedical domain, the authors chose DrugBank14. They chose this dataset because it is linked with many other datasets15 from which the authors extracted enrichment data using their atomic enrichment functions. For the evaluation, they deployed a manual enrichment pipeline $M_{manual}$, where they enriched the drug data found in DrugBank using abstracts dereferenced from DBpedia. Then, they conformed both the DrugBank and the DBpedia source authority URIs to one unified URI. For DrugBank, they manually deployed two experimental pipelines:

• $M^1_{DrugBank} = (m_1, m_2)$, where $m_1$ is a dereferencing function that dereferences any dbpedia-owl:abstract from DBpedia and $m_2$ is an authority conformation function that conforms the DBpedia subject authority16 to the target subject authority of DrugBank17.

• $M^2_{DrugBank} = M^1_{DrugBank} ++ m_3$, where $m_3$ is an authority conformation function that conforms DrugBank's authority to the Example authority18.

13 https://www.w3.org/Submission/CBD/
14 DrugBank is the Linked Data version of the DrugBank database, which is a repository of almost 5000 FDA-approved small molecule and biotech drugs; for the RDF dump see http://wifo5-03.informatik.uni-mannheim.de/drugbank/drugbank_dump.nt.bz2
15 See http://datahub.io/dataset/fu-berlin-drugbank for the complete list of datasets linked with DrugBank.
16 http://dbpedia.org
17 http://wifo5-04.informatik.uni-mannheim.de/drugbank/resource/drugs
18 http://example.org
19 Jamendo contains a large collection of music-related information about artists and recordings; for the RDF dump see http://moustaki.org/resources/jamendo-rdf.tar.gz

2. From the music domain, the authors chose the Jamendo19 dataset. They selected this dataset as it contains a substantial amount of embedded information hidden in literal properties such as mo:biography. The goal of their enrichment process is to add a geospatial dimension to Jamendo, e.g., the location of a recording or the place of birth of a musician. To this end, the authors of (Sherif et al., 2015) deployed a manual enrichment pipeline, in which they enriched Jamendo's music data by adding additional geospatial data found by applying the named entity recognition (NER) of the NLP enrichment function to mo:biography. For Jamendo, they manually deployed one experimental pipeline:

• $M^1_{Jamendo} = \{m_4\}$, where $m_4$ is an enrichment function that finds locations in mo:biography using natural language processing (NLP) techniques.

3. From the multi-domain knowledge base DBpedia (Lehmann et al., 2015), the authors selected the class AdministrativeRegion as their third dataset. As DBpedia is a knowledge base with a complex ontology, they built a set of 5 pipelines of increasing complexity:

• $M^1_{DBpedia} = \{m_5\}$, where $m_5$ is an authority conformation function that conforms the DBpedia subject authority to the Example target subject authority.

• $M^2_{DBpedia} = m_6 ++ M^1_{DBpedia}$, where $m_6$ is a dereferencing function that dereferences any dbpedia-owl:ideology and ++ is the list append operator.

• $M^3_{DBpedia} = M^2_{DBpedia} ++ m_7$, where $m_7$ is an NLP function that finds all named entities in dbpedia-owl:abstract.

• $M^4_{DBpedia} = M^3_{DBpedia} ++ m_8$, where $m_8$ is a filter function that filters for abstracts.

• $M^5_{DBpedia} = M^3_{DBpedia} ++ m_9$, where $m_9$ is a predicate conformation function that conforms the source predicate dbpedia-owl:abstract to the target predicate dcterms:abstract.

Altogether, the authors manually generated a set of 8 pipelines, which they then applied against their respective datasets. The resulting pairs of CBDs were used to generate the positive examples, while the negative examples were randomly selected pairs of distinct CBDs. All generated pipelines are available at the project web site.20

20 https://github.com/GeoKnow/DEER/evaluations/pipeline_learner
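As an illustration of how such training examples could be assembled (a sketch under simplifying assumptions; the concrete selection procedure is the one described in (Sherif et al., 2015)), positive examples pair each source CBD with the CBD produced by the pipeline, while negative examples are randomly selected pairs of distinct CBDs:

```python
import random

def build_examples(source_cbds, enriched_cbds, num_negatives=10, seed=42):
    """source_cbds/enriched_cbds: dicts mapping a resource URI to its CBD
    (represented as a set of triples). Returns (positives, negatives)."""
    positives = [(source_cbds[r], enriched_cbds[r])
                 for r in source_cbds if r in enriched_cbds]
    rng = random.Random(seed)
    resources = list(source_cbds)
    negatives = []
    while len(negatives) < num_negatives and len(resources) > 1:
        a, b = rng.sample(resources, 2)          # two distinct resources
        negatives.append((source_cbds[a], enriched_cbds[b]))
    return positives, negatives
```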

2.6 Quality Analysis

Since Linked Data originates from different publishers, the information quality varies and different strategies exist to assess it (Ngomo et al., 2014). Since several previous steps extract and generate triples automatically, an important part of the quality analysis is the evaluation of single facts with respect to their veracity. These fact validation systems (also known as fact checking systems) take a triple $(s, p, o)$ as input $i_j$ and return a veracity score as output (Syed et al., 2019).

An extension of the fact validation task is temporal fact validation. A temporal fact validation system takes a point in time as additional input. It validates whether the given fact was true at the given point in time. However, to the best of our knowledge, (Gerber et al., 2015) is the only approach that tackles this task and FactBench is the only dataset available for benchmarking such a system.

2.6.1 Validation System Types

Most of the fact validation systems use reference knowledge to validate a given triple. There are two major types of knowledge which can be used—structured and unstructured knowledge (Syed et al., 2019).

Approaches relying on structured knowledge use a reference knowledge base to search for patterns or paths supporting or refuting the given fact. Other approaches transfer the knowledge base into an embedding space and calculate the similarity between the entities' positions in the space and the positions the entities should have if the fact is true (Syed et al., 2019). For all these approaches, it is necessary that the IRIs of the fact's subject, predicate and object are available in the knowledge base. Hence, the knowledge base the facts of a fact validation benchmark rely on is background knowledge a benchmarked fact validation system needs to have.
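As an illustration of the embedding-based idea, the sketch below computes a TransE-style plausibility score; this is only one of several possible scoring functions, and the vectors are toy values rather than trained embeddings. The closer the translated subject vector s + p lies to the object vector o, the more plausible the fact.

```python
import numpy as np

def transe_score(subject_vec, predicate_vec, object_vec):
    """TransE-style plausibility: negative distance between s + p and o.
    Higher scores indicate a more plausible fact."""
    return -np.linalg.norm(subject_vec + predicate_vec - object_vec)

# Toy embeddings for illustration only (real systems learn these vectors).
emb = {
    "dbr:Albert_Einstein": np.array([0.1, 0.9]),
    "dbo:birthPlace":      np.array([0.5, -0.2]),
    "dbr:Ulm":             np.array([0.6, 0.7]),
    "dbr:Paris":           np.array([-0.8, 0.1]),
}

print(transe_score(emb["dbr:Albert_Einstein"], emb["dbo:birthPlace"], emb["dbr:Ulm"]))
print(transe_score(emb["dbr:Albert_Einstein"], emb["dbo:birthPlace"], emb["dbr:Paris"]))
```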

The second group of approaches is based on unstructured—in most cases textual—knowledge (Gerber et al., 2015; Syed et al., 2019). These approaches generate search queries based on the given fact and search for documents providing evidence for the fact. To this end, it is necessary for such an approach to be aware of the subject's and object's labels. Additionally, a pattern describing how to formulate a fact with respect to the given predicate might be necessary.
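The sketch below illustrates this query generation step under simple assumptions; the verbalisation patterns and the label lookup are made up for illustration, whereas real systems typically learn such patterns from a corpus and retrieve labels from the knowledge base.

```python
# Hypothetical verbalisation patterns for predicates (illustrative only).
PATTERNS = {
    "dbo:birthPlace": "{subject} was born in {object}",
    "dbo:author":     "{object} wrote {subject}",
}

# Hypothetical label lookup; a real system would query the knowledge base.
LABELS = {
    "dbr:Albert_Einstein": "Albert Einstein",
    "dbr:Ulm": "Ulm",
}

def search_query(s, p, o):
    """Turn a triple into a textual search query using labels and a predicate pattern."""
    pattern = PATTERNS.get(p, "{subject} {object}")
    return pattern.format(subject=LABELS.get(s, s), object=LABELS.get(o, o))

print(search_query("dbr:Albert_Einstein", "dbo:birthPlace", "dbr:Ulm"))
# -> "Albert Einstein was born in Ulm"
```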

2.6.2 Datasets

Several datasets have been created in recent years. Table 7 gives an overview. FactBench (Gerber et al., 2015) is a collection of 7 datasets.21 The facts of the datasets have a time range attached which shows at which point in time the facts are true. The datasets share 1500 true facts selected from DBpedia and Freebase using 10 different properties.22 However, the datasets differ in the way the 1500 false facts are derived from true facts.

• Domain: The subject is replaced by a random resource.

• Range: The object is replaced by a random resource.

• Domain-Range: Subject and object are replaced as described above.

• Property: The predicate of the triple is replaced with a randomly chosen property.

• Random: A triple is generated randomly.

• Date: The time range of a fact is changed. If the range is a single point in time, it is randomly chosen from a Gaussian distribution which has its mean at the original point in time and a variance of 5 years. For time intervals, the start year and the duration are drawn from two Gaussian distributions.

• Mix: A mixture comprising one sixth of each of the sets above.

It has to be noted that all changes are done in a way that makes sure that the subject and object of the generated false triple follow the domain and range restrictions of the fact's predicate. In all cases, the generated fact must not be present in the knowledge base.
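A simplified sketch of the Domain/Range derivation strategies is given below (illustrative only; FactBench's actual generation additionally handles the attached time ranges): the subject and/or object is replaced by a random resource of a compatible type, and candidates that already exist in the knowledge base are rejected.

```python
import random

def derive_false_fact(true_fact, candidates_by_type, domain_type, range_type,
                      known_facts, strategy="domain", rng=random.Random(0)):
    """Replace subject and/or object by a random resource that respects the
    predicate's domain/range restriction; reject facts already in the KB."""
    s, p, o = true_fact
    for _ in range(100):                       # retry a bounded number of times
        new_s, new_o = s, o
        if strategy in ("domain", "domain-range"):
            new_s = rng.choice(candidates_by_type[domain_type])
        if strategy in ("range", "domain-range"):
            new_o = rng.choice(candidates_by_type[range_type])
        candidate = (new_s, p, new_o)
        if candidate != true_fact and candidate not in known_facts:
            return candidate
    return None

if __name__ == "__main__":
    kb = {("dbr:Albert_Einstein", "dbo:birthPlace", "dbr:Ulm")}
    candidates = {"dbo:Person": ["dbr:Marie_Curie", "dbr:Alan_Turing"],
                  "dbo:Place": ["dbr:Paris", "dbr:Berlin"]}
    print(derive_false_fact(("dbr:Albert_Einstein", "dbo:birthPlace", "dbr:Ulm"),
                            candidates, "dbo:Person", "dbo:Place", kb, "range"))
```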

In (Shi and Weninger, 2016), the authors propose several datasets based on triples from DBpedia. CapitalOf #1 comprises capitalOf relations between the 5 largest cities of each of the 50 US states and the respective state. If the correct city is not within the top 5 cities, an additional fact with the correct city is added. CapitalOf #2 comprises the same 50 correct triples. Additionally, for each capital four triples with a wrong

21https://github.com/DeFacto/FactBench22The dataset has 150 true and false facts for each of the

following properties: award, birthPlace, deathPlace, foun-dationPlace, leader, team, author, spouse, starring and sub-sidiary.

state are created. The US Civil War dataset comprises triples connecting military commanders to decisive battles of the US Civil War. The Company CEO dataset comprises 205 correct connections between CEOs and their company using the keyPerson property. 1025 incorrect triples are created by randomly selecting a CEO and a company from the correct facts. The NYT Bestseller dataset comprises 465 incorrect random pairs of authors and books as well as 93 true triples of the New York Times bestseller lists between 2010 and 2015. The US Vice-President dataset contains 47 correct and 227 incorrect pairs of vice-presidents and presidents of the United States. Additionally, the authors create two datasets based on SemMedDB. The Disease dataset contains triples marking an amino acid, peptide or protein as a cause of a disease or syndrome. It comprises 100 correct and 457 incorrect triples. The Cell dataset connects cell functions with the gene that causes them. It contains 99 correct and 435 incorrect statements.

In (Shiralkar et al., 2017), the authors propose several datasets. FLOTUS comprises US presidents and their spouses. The NBA-Team dataset contains triples assigning NBA players to teams. The Oscars dataset comprises triples related to movies. Additionally, the authors reuse triples from the Google Relation Extraction Corpora (GREC) containing triples about the birthplace, deathplace, education and institution of famous people. The ground truth for these triples is derived using crowdsourcing. The profession and nationality datasets are derived from the WSDM Cup 2017 Triple Scoring challenge. Since the challenge only provides true facts, false facts have been randomly drawn.

In (Syed et al., 2018), the authors claim that several approaches for fact validation do not take the predicate into account and that for most of the existing benchmarking datasets, checking the connection between the subject and object is already sufficient. For creating a more difficult dataset, the authors query a list of persons from DBpedia that have their birth- and deathplaces in two different countries. They use these triples as true facts and swap birth- and deathplace to create wrong facts.

2.6.3 Key Performance Indicators

The performance of a fact validation system can be measured regarding its effectiveness and efficiency. To measure the effectiveness, the area under the ROC curve is a common choice (Syed et al., 2019). Interpreting the validation as a binary classification task allows the usage of Precision, Recall and F1-measure.

Table 7: Summary of benchmark datasets for fact validation.

Dataset | True Facts | False Facts | Total | Temporal
Birthplace/Deathplace (Syed et al., 2018) | 206 | 206 | 412 | no
CapitalOf #1 (Shi and Weninger, 2016) | 50 | 209 | 259 | no
CapitalOf #2 (Shi and Weninger, 2016) | 50 | 200 | 250 | no
Cell (Shi and Weninger, 2016) | 99 | 435 | 534 | no
Company CEO (Shi and Weninger, 2016) | 205 | 1025 | 1230 | no
Disease (Shi and Weninger, 2016) | 100 | 457 | 557 | no
FactBench Date (Gerber et al., 2015) | 1,500 | 1,500 | 3,000 | yes
FactBench Domain (Gerber et al., 2015) | 1,500 | 1,500 | 3,000 | yes
FactBench Domain-Range (Gerber et al., 2015) | 1,500 | 1,500 | 3,000 | yes
FactBench Mix (Gerber et al., 2015) | 1,500 | 1,560 | 3,060 | yes
FactBench Property (Gerber et al., 2015) | 1,500 | 1,500 | 3,000 | yes
FactBench Random (Gerber et al., 2015) | 1,500 | 1,500 | 3,000 | yes
FactBench Range (Gerber et al., 2015) | 1,500 | 1,500 | 3,000 | yes
FLOTUS (Shiralkar et al., 2017) | 16 | 240 | 256 | no
GREC-Birthplace (Shiralkar et al., 2017) | 273 | 819 | 1,092 | no
GREC-Deathplace (Shiralkar et al., 2017) | 126 | 378 | 504 | no
GREC-Education (Shiralkar et al., 2017) | 466 | 1,395 | 1,861 | no
GREC-Institution (Shiralkar et al., 2017) | 1,546 | 4,638 | 6,184 | no
NBA-Team (Shiralkar et al., 2017) | 41 | 143 | 164 | no
NYT Bestseller (Shi and Weninger, 2016) | 93 | 465 | 558 | no
Oscars (Shiralkar et al., 2017) | 78 | 4,602 | 4,680 | no
US Civil War (Shi and Weninger, 2016) | 126 | 584 | 710 | no
US Vice-President (Shi and Weninger, 2016) | 47 | 227 | 274 | no
WSDM-Nationality (Shiralkar et al., 2017) | 50 | 150 | 200 | no
WSDM-Profession (Shiralkar et al., 2017) | 110 | 330 | 440 | no

However, this requires the search for a threshold for the minimal veracity value that can still be accepted as true (Syed et al., 2018). The efficiency is typically measured with respect to the runtime of a fact validation system (Syed et al., 2019).
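For example, given gold labels and the veracity scores returned by a system, the area under the ROC curve and threshold-dependent Precision, Recall and F1-measure could be computed as in the following sketch (using scikit-learn; the threshold of 0.5 and the toy scores are arbitrary choices):

```python
from sklearn.metrics import roc_auc_score, precision_recall_fscore_support

# Gold labels (1 = true fact, 0 = false fact) and veracity scores of a system.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.4, 0.2, 0.55, 0.7]

auc = roc_auc_score(y_true, scores)

threshold = 0.5  # minimal veracity value still accepted as true
y_pred = [1 if s >= threshold else 0 for s in scores]
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary")

print(f"AUC={auc:.2f} P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```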

2.7 Evolution & Repair

If problems within a KG are detected, repairing and managing the evolution of the data is necessary (Ngomo et al., 2014). Since repair tools mainly support a manual task and are therefore hard to benchmark automatically, we focus on the benchmarking of versioning systems. These systems work similarly to the SPARQL stores described in Section 2.2. However, instead of answering a query based on a single RDF graph, they allow querying data from a single version or across different versions of the graph (Fernandez et al., 2019).

With respect to the proposed benchmark schema, the benchmarks for versioning systems are very similar to the triple store benchmarks. A benchmark dataset comprises several versions of a knowledge base, a set of queries and a set of expected results for these queries. The knowledge base versions can be seen as background data that is provided at the beginning of the benchmarking. The input for a task is a single query and the expected output is the expected query result. The benchmarked versioning system has to answer the query based on the provided knowledge base versions.

2.7.1 Query Types

In the literature, different types of queries are defined which a versioning system needs to be able to handle. The authors in (Fernandez et al., 2019) define the following six query types (two of them are illustrated by the sketch after the list):

• Version materialisation queries retrieve the full version of a knowledge base for a given point in time.

• Single-version structured queries retrieve a subset of the data of a single version.

• Cross-version structured queries retrieve data across several versions.

• Delta materialisation queries retrieve the differences between two or more versions.

• Single-delta structured queries retrieve data from a single delta of two versions.

• Cross-delta structured queries retrieve data across several deltas.
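The following sketch illustrates the first and the fourth query type under the simplifying assumption (made purely for illustration; actual archiving systems use different storage strategies and interfaces) that every version is stored as a separate named graph, so that both queries can be expressed in plain SPARQL:

```python
def version_materialisation_query(version_graph: str) -> str:
    """Retrieve the full content of one version (stored as a named graph)."""
    return f"CONSTRUCT {{ ?s ?p ?o }} WHERE {{ GRAPH <{version_graph}> {{ ?s ?p ?o }} }}"

def delta_materialisation_query(graph_v1: str, graph_v2: str) -> str:
    """Retrieve triples added between two versions (deletions are symmetric)."""
    return (
        f"CONSTRUCT {{ ?s ?p ?o }} WHERE {{\n"
        f"  GRAPH <{graph_v2}> {{ ?s ?p ?o }}\n"
        f"  FILTER NOT EXISTS {{ GRAPH <{graph_v1}> {{ ?s ?p ?o }} }}\n"
        f"}}"
    )

print(version_materialisation_query("http://example.org/version/3"))
```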

The authors in (Papakonstantinou et al., 2016, 2017) create a more fine-grained structure providing 8 different types of queries.

2.7.2 Datasets

The BEAR benchmark dataset proposed in (Fernandez et al., 2019) comprises three parts—BEAR-A, BEAR-B, and BEAR-C. BEAR-A comprises 58 versions of a single real-world knowledge base taken from the Dynamic Linked Data Observatory.23 The set of queries comprises triple pattern queries and is manually created by the authors to cover all basic functionalities a versioning system has to provide. Additionally, it is ensured that the queries give different results for all versions while they are ε-stable, i.e., the cardinalities of the results across the versions never exceed (1 ± ε) times the mean cardinality.

The BEAR-B dataset relies on the DBpedia Live endpoint. The authors of (Fernandez et al., 2019) took the changesets from August to October 2015 reflecting changes in DBpedia triggered by edits in Wikipedia. The dataset comprises data of the 100 most volatile resources and the changes to them. The data is provided in three granularities—the instant application of changes leads to 21,046 different versions. Additionally, the authors applied hourly and daily summarizations, leading to 1,299 and 89 versions, respectively. The queries for this dataset are taken from the LSQ dataset (Saleem et al., 2015a).

The BEAR-C dataset relies on 32 snapshots of the European Open Data portal.24 Each snapshot comprises roughly 500 million triples describing datasets of the open data portal. The authors create 10 queries that should be too complex to be solved by existing versioning systems “in a straightforward and optimized way” (Fernandez et al., 2019).

EvoGen is a dataset generator proposed in (Meimaris and Papastefanatos, 2016). It relies on the LUBM generator (Guo et al., 2005) and generates additional versions based on its configuration. Depending on which delta the generator has used to create the different versions, the 14 LUBM queries are adapted to the generated data.

23 http://swse.deri.org/dyldo/
24 http://data.europa.eu/euodp/en/data/

The SPBv dataset generator proposed in (Papakonstantinou et al., 2017) is an extension of the SPB generator (Kotsev et al., 2016), creating RDF based on the BBC core ontology. It mimics the evolution of journalistic articles in the world of online journalism. While new articles are published, already existing articles are changed over time and different versions of articles are created.

2.7.3 Key Performance Indicators

The main key performance indicator used for versioning systems is the query runtime. Analysing these runtimes shows for which query type a certain versioning system has a good or bad performance. A second indicator is the space used by the versioning system to store the data (Fernandez et al., 2019; Papakonstantinou et al., 2017). In addition, (Papakonstantinou et al., 2017) proposes measuring the time a system needs to store the initial version of the knowledge base and the time it needs to apply the changes to create newer versions.

2.8 Search, Browsing & Exploration

The user has to be enabled to access the LD in a fast and user-friendly way (Ngomo et al., 2014). This need has led to the development of keyword search tools and question answering systems for the Web of Data. As a result of the interest in these systems, a large number of evaluation campaigns have been undertaken (Usbeck et al., 2019). For example, the TREC conference started a question answering track to provide domain-independent evaluations over unstructured corpora (Voorhees et al., 1999). Another series of challenges is the BioASQ series (Tsatsaronis et al., 2015), which seeks to evaluate QA systems on biomedical data. In contrast to previous series, systems must use RDF data in addition to textual data. For the NLQ shared task, a dataset has been released that is answerable purely by using DBpedia and SPARQL. Another series of challenges is the Question Answering over Linked Data (QALD) campaign (Unger et al., 2014, 2015, 2016). It includes different types of question answering benchmarks that are (1) purely based on RDF data, (2) based on RDF and textual data, (3) based on statistical data, (4) based on data from multiple KBs or (5) based on the music domain.

Alongside the main question answering task (QA), the authors of (Usbeck et al., 2019) define 5 sub-tasks that are used for a deeper analysis of the performance of question answering systems.

• QA: The classical question answering task uses a plain question $q_i$ as input and expects a set of answers $A_i$ as output. Since the answering of questions presumes certain knowledge, most question answering datasets define a knowledge base $K$ as background knowledge for the task. It should be noted that $A_i$ might contain URIs, labels or a literal (like a boolean value). Matching URIs and labels can be a particularly difficult task for a question answering benchmarking framework.

• C2KB: This sub-task aims to identify all relevant resources for the given question. It is explained in more detail in Section 2.1.

• P2KB: This sub-task is similar to C2KB but focusses on the identification of properties $P$ that are relevant for the given question.

• RE2KB: This sub-task evaluates triples that the question answering system extracted from the given search query. The expected answer comprises triples that are needed to build the SPARQL query for answering the question. Note that these triples can contain resources, literals or variables. Two triples are the same if their three elements—subject, predicate and object—are the same. If the triples comprise variables, they must be at the same position while the label of the variable is ignored.

• AT: This sub-task identifies the answer type α of the given question. The answer types $\mathbb{A}$ are defined by the benchmark dataset. In (Unger et al., 2016), the authors define 5 different answer types: date, number, string, boolean and resource, where a resource can be a single URI or a set of URIs.

• AIT2KB: This sub-task aims to measure whether the benchmarked question answering system is able to identify the correct type(s) of the expected resources. If the answer does not contain any resources, the set of types $T$ is expected to be empty. Similar to the ERec task in Section 2.1, a set of entity types $\mathbb{T}$ is used as background data for this task.

The tasks and their formal descriptions are summarized in Table 8. The key performance indicators used for the different tasks are the Macro and Micro variants of Precision, Recall and F1-measure. The efficiency of the systems is evaluated by measuring the runtime a system needs to answer a query (Usbeck et al., 2019).

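The following sketch shows how the Macro and Micro variants of these measures could be computed over a set of questions when every answer set is treated as a set of URIs or literals (an illustration of the definitions, not GERBIL QA's actual implementation):

```python
def prf(expected: set, returned: set):
    """Precision, Recall, F1 and true positives for one question."""
    tp = len(expected & returned)
    p = tp / len(returned) if returned else 0.0
    r = tp / len(expected) if expected else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1, tp

def evaluate(gold_answers, system_answers):
    """gold_answers/system_answers: lists of answer sets, one per question."""
    per_question = [prf(e, s) for e, s in zip(gold_answers, system_answers)]
    # Macro: average the per-question measures.
    macro_p = sum(x[0] for x in per_question) / len(per_question)
    macro_r = sum(x[1] for x in per_question) / len(per_question)
    macro_f1 = sum(x[2] for x in per_question) / len(per_question)
    # Micro: aggregate the counts over all questions before computing the measures.
    tp = sum(x[3] for x in per_question)
    ret = sum(len(s) for s in system_answers)
    exp = sum(len(e) for e in gold_answers)
    micro_p = tp / ret if ret else 0.0
    micro_r = tp / exp if exp else 0.0
    micro_f1 = (2 * micro_p * micro_r / (micro_p + micro_r)
                if micro_p + micro_r > 0 else 0.0)
    return (macro_p, macro_r, macro_f1), (micro_p, micro_r, micro_f1)

print(evaluate([{"dbr:Ulm"}, {"dbr:Paris", "dbr:Lyon"}],
               [{"dbr:Ulm"}, {"dbr:Paris"}]))
```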

Table 9 lists datasets based on DBpedia and their features. It is clear that most of the datasets originate from the QALD challenge and contain a small number of questions. This mainly results from the high costs of the manual curation. At the same time, evolving knowledge bases cause a similar problem of outdated datasets, as described in Section 2.1.4. Questions that have been answered on previous versions of a knowledge base might not be answerable on new versions and vice versa. Additionally, the URIs of the resources listed as answers might be different in new versions. A step towards automating the creation of questions is the LC-QuAD dataset (Trivedi et al., 2017). The tool for creating this dataset uses SPARQL templates to generate SPARQL queries and their corresponding questions. However, human annotators are still necessary to check the questions and manually repair faulty ones.

3 Benchmarking platforms

During recent years, several benchmarking platforms have been developed for a single step, a subset of steps or all steps of the LD lifecycle. In this section, we give a brief overview of some of these platforms.

3.1 BAT

The BAT-framework (Cornolti et al., 2013) is one of the first benchmarking frameworks developed for the extraction step of the LD lifecycle. Its aim is to facilitate the benchmarking of named entity recognition, named entity disambiguation and concept tagging approaches. BAT compares seven existing entity annotation approaches using Wikipedia as reference. Moreover, it defines six different task types. Three of these tasks—D2KB, C2KB and A2KB—are described in Section 2.1. The other three tasks are mainly extensions of A2KB and C2KB. Sa2KB is an extension of A2KB that accepts an additional confidence score, which is taken into account during the evaluation. Sc2KB is the same extension for the C2KB task. Rc2KB is very similar but expects a ranking of the assigned concepts instead of an explicit score. Additionally, the authors propose five different matchings for the six tasks, i.e., ways how the expected markings and the markings of the system are matched. The framework offers the calculation of six evaluation measures and provides adapters for five datasets.

3.2 GERBIL

GERBIL (Usbeck et al., 2015; Roder et al., 2018) is an effort of the knowledge extraction community to enhance the evaluation of knowledge extraction systems. Based on the central idea of the BAT framework (to create a single evaluation framework that eases the comparison and enables the repeatability of experiments), GERBIL goes beyond these concepts in several ways. Figure 5 shows the simplified architecture of GERBIL. By offering the benchmarking of knowledge extraction systems via web service API calls, as well as the addition of user-defined web service URLs and datasets, GERBIL enables users to add their own systems and datasets in an easy way. Additionally, the GERBIL community runs an instance of the framework that can be used for free.25 In the sense of the FAIR data principles (Wilkinson et al., 2016), GERBIL provides persistent, citable URLs for executed experiments supporting their repeatability. In its latest version, GERBIL comes with system adapters for more than 20 annotation systems, adapters for the datasets listed in Table 2 and supports the extraction tasks listed in Section 2.1.1.26 Additionally, GERBIL offers several measures beyond Precision, Recall and F1-measure for a fine-grained analysis of evaluation results. These additional measures only take emerging entities or entities from the given knowledge base into account. Another important feature is the identification of outdated URIs. If it is not possible to retrieve the new URI for an entity (e.g., because such a URI does not exist), it is handled like an emerging entity. In a similar way, GERBIL supports the retrieval of owl:sameAs links between entities returned by the system and entities expected by the dataset's gold standard. This enables the benchmarking of systems that do not use the knowledge base the benchmark dataset is relying on.
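A simplified sketch of such owl:sameAs-aware matching is shown below (illustrative only; GERBIL additionally resolves outdated URIs and treats unresolvable ones as emerging entities, as described above). Two entity URIs are considered to match if their sameAs closures overlap:

```python
def same_as_closure(uri: str, same_as: dict) -> set:
    """Compute the set of URIs reachable via owl:sameAs links, given as a dict
    mapping a URI to a set of equivalent URIs."""
    closure, frontier = {uri}, [uri]
    while frontier:
        current = frontier.pop()
        for other in same_as.get(current, set()):
            if other not in closure:
                closure.add(other)
                frontier.append(other)
    return closure

def uris_match(uri_a: str, uri_b: str, same_as: dict) -> bool:
    return bool(same_as_closure(uri_a, same_as) & same_as_closure(uri_b, same_as))

# Toy example: a Wikidata URI returned by a system matches the DBpedia gold URI.
links = {"http://www.wikidata.org/entity/Q5879":
             {"http://dbpedia.org/resource/Johann_Wolfgang_von_Goethe"}}
print(uris_match("http://www.wikidata.org/entity/Q5879",
                 "http://dbpedia.org/resource/Johann_Wolfgang_von_Goethe", links))
```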

The success of the GERBIL framework led to its deployment in other areas. GERBIL QA transfers the concept of a benchmarking platform based on web services into the area of question answering (Usbeck et al., 2019).27 It supports the evaluation of question answering systems for all tasks listed in Table 8. It uses the same retrieval mechanism for owl:sameAs links. In addition to that, it uses complex matching algorithms to enable the benchmarking of question answering systems that return labels of entities instead of the entity URIs.

25 http://w3id.org/gerbil
26 https://dice-research.org/GERBIL
27 http://w3id.org/gerbil/qa

Table 8: Summary of the question answering tasks.

Task | Input $i_j$ | Output $e_j$ | Background data $B$
QA | $q$ | $A \subset K$ | $K$
C2KB | $q$ | $U \subset K$ | $K$
P2KB | $q$ | $P \subset K$ | $K$
RE2KB | $q$ | $\{(s, p, o) \mid s, p, o \in K \cup V\}$ | $K$
AT | $q$ | $\alpha \in \mathbb{A}$ | $\mathbb{A}$
AIT2KB | $q$ | $T \subset \mathbb{T}$ | $\mathbb{T}$

Figure 5: Overview of GERBIL's abstract architecture (Usbeck et al., 2015). Interfaces to users and providers of datasets and annotators are marked in blue.

Table 9: Example datasets for benchmarking question answering systems and their features (Trivedi et al., 2017; Usbeck et al., 2019).

Dataset | #Questions | Knowledge Base
LC-QuAD | 5000 | DBpedia 2016-04
NLQ shared task 1 | 39 | DBpedia 2015-04
QALD1 Test dbpedia | 50 | DBpedia 3.6
QALD1 Train dbpedia | 50 | DBpedia 3.6
QALD1 Test musicbrainz | 50 | MusicBrainz (dump 2011)
QALD1 Train musicbrainz | 50 | MusicBrainz (dump 2011)
QALD2 Test dbpedia | 99 | DBpedia 3.7
QALD2 Train dbpedia | 100 | DBpedia 3.7
QALD3 Test dbpedia | 99 | DBpedia 3.8
QALD3 Train dbpedia | 100 | DBpedia 3.8
QALD3 Test esdbpedia | 50 | DBpedia 3.8 es
QALD3 Train esdbpedia | 50 | DBpedia 3.8 es
QALD4 Test Hybrid | 10 | DBpedia 3.9 + long abstracts
QALD4 Train Hybrid | 25 | DBpedia 3.9 + long abstracts
QALD4 Test Multilingual | 50 | DBpedia 3.9
QALD4 Train Multilingual | 200 | DBpedia 3.9
QALD5 Test Hybrid | 10 | DBpedia 2014 + long abstracts
QALD5 Train Hybrid | 40 | DBpedia 2014 + long abstracts
QALD5 Test Multilingual | 49 | DBpedia 2014
QALD5 Train Multilingual | 300 | DBpedia 2014
QALD6 Train Hybrid | 49 | DBpedia 2015-10 + long abstracts
QALD6 Train Multilingual | 333 | DBpedia 2015-10


GERBIL KBC is developed to support the evaluation of knowledge base curation systems—mainly fact validation systems.28 It has been used as the main evaluation tool for the Semantic Web Challenges 2017–2019 and uses Precision, Recall, F1-measure and ROC-AUC as metrics to evaluate the performance of fact validation approaches. Additionally, it supports the organisation of challenges by offering leaderboards including visualisations of the ROC curves of the best performing systems.

3.3 Iguana

Iguana (Conrads et al., 2017) is a SPARQL benchmark execution framework that tackles the problem of a fair, comparable and realistic SPARQL benchmark execution.29 Figure 6 shows the architecture of Iguana. It consists of a core that executes the benchmark queries against the benchmarked triplestores and a result processor that calculates metric results.

28 http://w3id.org/gerbil/kbc
29 https://github.com/dice-group/iguana


3.3.1 Components

Core. The core uses a given configuration and executes the benchmark accordingly against several provided triplestores. To tackle the problem of a realistic benchmark, Iguana implements a highly configurable stresstest, which consists of several user pools that in themselves comprise several threads representing simulated users. Each user pool consists of either SPARQL queries or SPARQL updates that are executed by its threads against the benchmarked triplestore. Iguana supports SPARQL 1.1 queries as well as SPARQL pattern queries as seen in the DBpedia SPARQL Benchmark (Morsey et al., 2011). Furthermore, it supports SPARQL update queries as well as updating triple files directly. Each of the parallel-running threads starts at a different query, iterating repeatedly over the given set of queries until either a certain time limit is reached or a certain amount of queries has been executed. For each query execution, the stresstest checks whether the query succeeded or failed, and measures the time the triplestore took to answer the query. This information is forwarded to the result processor.

Figure 6: Architecture of Iguana.

Result Processor. The result processor collects the execution results and calculates predefined metrics. In its latest version, Iguana supports the following metrics:

• Queries Per Second (QPS) counts how often a single query (or a set of queries) can be answered per second.

• Query Mixes Per Hour (QMPH) measures how often one query benchmark set can be answered by the triplestore per hour.

• Number of Queries Per Hour (NoQPH) counts how many queries can be answered by the triplestore per hour.

The calculated results will then be stored as RDF in a triplestore and can be queried directly using SPARQL.
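The following sketch shows how these metrics relate to the raw measurements recorded by the stresstest (an illustration of the definitions, not Iguana's actual implementation):

```python
def qps(runtimes_in_seconds):
    """Queries Per Second for a single query: executions divided by total time."""
    total = sum(runtimes_in_seconds)
    return len(runtimes_in_seconds) / total if total > 0 else 0.0

def noqph(succeeded_queries, benchmark_runtime_seconds):
    """Number of Queries Per Hour: successful executions scaled to one hour."""
    return succeeded_queries * 3600.0 / benchmark_runtime_seconds

def qmph(completed_query_mixes, benchmark_runtime_seconds):
    """Query Mixes Per Hour: completed runs through the query set scaled to one hour."""
    return completed_query_mixes * 3600.0 / benchmark_runtime_seconds

print(qps([0.05, 0.07, 0.06]), noqph(1200, 600), qmph(4, 600))
```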

3.3.2 Benchmark Workflow

The workflow to benchmark a triplestore using Iguana comprises the following seven steps:

1. Create benchmark queries

2. Start triplestore

3. Load dataset into triplestore

4. Create benchmark configuration

5. Execute configuration using Iguana

6. Iguana executes the benchmark tasks

7. Iguana provides the results

In case the user is not reusing an already existing benchmark dataset, queries have to be created. After that, the triplestore can be started and the dataset can be loaded into the store. A benchmark configuration has to be created containing information like (1) the address of the triplestore, (2) the queries that should be used, (3) the number of user pools and (4) the runtime of the test. Iguana will execute the stresstest as configured, flooding the triplestore with requests using the provided queries. After the benchmark finishes, the detailed benchmarking results are available as RDF.

3.4 SEALS

The SEALS platform (Wrigley et al., 2012) is one of the major results of the Semantic Evaluation At Large Scale (SEALS) project.30 It offers the benchmarking of LD systems in different areas like ontology reasoning and ontology matching. It enables the support of different benchmarks by offering a Web Service Business Process Execution Language (WSBPEL) (Alves et al., 2007) interface. This interface can be used to write scripts covering the entire lifecycle of a single evaluation. The platform aims to support evaluation campaigns and has been used in several campaigns during recent years (García-Castro and Wrigley, 2011).

3.5 HOBBIT

HOBBIT (Roder et al., 2019) is the first holistic Big Linked Data benchmarking platform. The platform is available as an open-source project and as an online platform.31

30 http://www.seals-project.eu/
31 https://github.com/hobbit-project/platform, https://master.project-hobbit.eu

Compared to the previously mentioned benchmarking platforms, the HOBBIT platform offers benchmarks for all steps of the LD lifecycle that can be benchmarked automatically. For example, it contains all benchmarks the previously described GERBIL platforms implement. Additionally, it ensures the comparability of benchmark results by executing the benchmarked systems in a controlled environment using the Docker container technology.32 The same technology supports the benchmarking of distributed systems and the execution of distributed benchmarks. The latter is necessary to be able to generate a sufficient amount of data to evaluate the scalability of the benchmarked system.

Figure 7 shows the components of the HOBBIT platform. The blue boxes represent the components of the platform. One of them is the graphical user interface allowing the interaction of the user with the platform. A user management component ensures that not all users have access to all information of the platform. This is necessary for the support of challenges where the challenge organiser has to set up the configuration of the challenge benchmark secretly. Otherwise, participants could adjust their solutions to the configuration. The storage contains all evaluation results as triples and offers a public read-only SPARQL interface. The analysis component offers the analysis of evaluation results over multiple experiments. This can lead to helpful insights regarding strengths and weaknesses of benchmarked systems. The resource monitoring enables the benchmark to take the amount of resources used by the benchmarked system (e.g., CPU usage) into account. The logging component gives benchmark and system developers access to the log messages of their components.

Although the HOBBIT platform supports nearly every benchmark implementation, the authors of (Roder et al., 2019) suggest a general structure for a benchmark which separates it into a controller that orchestrates the benchmark components, a set of data generators responsible for providing the necessary datasets, a set of task generators creating the tasks the benchmarked system has to fulfill, an evaluation storage that stores the expected answers and the responses generated by the benchmarked system, and an evaluation module that implements the metrics the benchmark uses. A detailed description of the components and the workflow is given in (Roder et al., 2019).

32 https://docker.com
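The suggested structure can be pictured with the skeleton below (a sketch only; in the actual platform these components run as separate Docker containers that communicate over a message bus, and the HOBBIT SDK API differs):

```python
class DataGenerator:
    def generate(self):
        """Provide the dataset(s) needed by the benchmark."""
        return [("dbr:Ulm", "rdfs:label", "Ulm")]             # illustrative data

class TaskGenerator:
    def tasks(self, data):
        """Create (input, expected output) pairs for the benchmarked system."""
        return [({"query": "SELECT ..."}, {"expected": "..."})]

class EvaluationStorage:
    def __init__(self):
        self.pairs = []                                        # (expected, received)
    def store(self, expected, received):
        self.pairs.append((expected, received))

class EvaluationModule:
    def evaluate(self, pairs):
        correct = sum(1 for expected, received in pairs if expected == received)
        return {"accuracy": correct / len(pairs) if pairs else 0.0}

class BenchmarkController:
    """Orchestrates data generation, task execution and evaluation."""
    def run(self, system):
        data = DataGenerator().generate()
        storage = EvaluationStorage()
        for task_input, expected in TaskGenerator().tasks(data):
            storage.store(expected, system(task_input))
        return EvaluationModule().evaluate(storage.pairs)

print(BenchmarkController().run(lambda task: {"expected": "..."}))
```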

4 Conclusion

In this paper, we presented an overview of benchmarking approaches for the different steps of the Linked Data lifecycle. These benchmarks are important for the usage and further development of KG-based systems since they enable a comparable evaluation of the systems' performance. Hence, for the development of explainable artificial intelligence based on KGs, these benchmarks will play a key role in identifying the best KG-based systems for interacting with the KG. In future work, these general benchmarks might be adapted towards special requirements that explainable artificial intelligence approaches might raise with respect to their underlying KG-based systems.

References

Fabian Abel, Qi Gao, Geert-Jan Houben, and Ke Tao. 2011. Semantic enrichment of twitter posts for user profile construction on the social web. In Proc. of ESWC, pages 375–389. Springer.

Jose-Luis Aguirre, Kai Eckert, Jerome Euzenat, AlfioFerrara, Willem Robert van Hage, Laura Hollink,Christian Meilicke, Andriy Nikolov, DominiqueRitze, Francois Scharffe, et al. 2012. Results of theOntology Alignment Evaluation Initiative 2012. InProceedings of the 7th International Workshop onOntology Matching, pages 73–115.

Gunes Aluc, Olaf Hartig, M. Tamer Ozsu, and Khuza-ima Daudjee. 2014. Diversified stress testing ofRDF data management systems. In ISWC, pages197–212.

Alexandre Alves, Assaf Arkin, Sid Askary, Charl-ton Barreto, Ben Bloch, Francisco Curbera, MarkFord, Yaron Goland, Alejandro Guızar, NeelakantanKartha, et al. 2007. Web Services Business ProcessExecution Language Version 2.0. Oasis standard,W3C.

Soren Auer and Jens Lehmann. 2010. Creatingknowledge out of interlinked data. Semant. web,1(1,2):97–104.

Samantha Bail, Sandra Alkiviadous, Bijan Parsia,David Workman, Mark van Harmelen, Rafael S.Goncalves, and Cristina Garilao. 2012. FishMark:A linked data application benchmark. In Proceed-ings of the Joint Workshop on Scalable and High-Performance Semantic Web Systems, volume 943,pages 1–15. CEUR-WS.org.

Christian Bizer and Andreas Schultz. 2009. The BerlinSPARQL benchmark. Int. J. Semantic Web Inf. Syst.,5(2):1–24.

Figure 7: The components of the HOBBIT platform (Roder et al., 2019).

Christian Bizer and Andreas Schultz. 2010. The r2rframework: Publishing and discovering mappingson the web. COLD, 665.

Roi Blanco, Giuseppe Ottaviano, and Edgar Meij. 2015.Fast and space-efficient entity linking in queries. InProceedings of the eighth ACM international confer-ence on Web search and data mining.

Amparo E. Cano, Giuseppe Rizzo, Andrea Varga, Matthew Rowe, Milan Stankovic, and Aba-Sah Dadzie. 2014. Making sense of microposts (#Microposts2014) named entity extraction & linking challenge. In Proceedings of the 4th Workshop on Making Sense of Microposts, volume 1141, pages 54–60. Co-located with the 23rd International World Wide Web Conference (WWW 2014), edited by Matthew Rowe, Milan Stankovic, and Aba-Sah Dadzie.

Amparo Elizabeth Cano Basave, Andrea Varga, Matthew Rowe, Milan Stankovic, and Aba-Sah Dadzie. 2013. Making sense of microposts (#MSM2013) concept extraction challenge. In Amparo E. Cano, Matthew Rowe, Milan Stankovic, and Aba-Sah Dadzie, editors, #MSM2013: Concept Extraction Challenge at Making Sense of Microposts 2013, pages 1–15. CEUR-WS.org.

David Carmel, Ming-Wei Chang, Evgeniy Gabrilovich,Bo-June Paul Hsu, and Kuansan Wang. 2014.Erd’14: Entity recognition and disambiguation chal-lenge. In 37th international ACM SIGIR conferenceon Research & development in information retrieval.

Smitashree Choudhury, John G Breslin, and Alexan-dre Passant. 2009. Enrichment and ranking of the

youtube tag space and integration with the linkeddata cloud. Springer.

Felix Conrads, Jens Lehmann, Muhammad Saleem,Mohamed Morsey, and Axel-Cyrille NgongaNgomo. 2017. Iguana: A generic framework forbenchmarking the read-write performance of triplestores. In Proceedings of the 16th InternationalSemantic Web Conference (ISWC). Springer.

Marco Cornolti, Paolo Ferragina, and MassimilianoCiaramita. 2013. A framework for benchmarkingentity-annotation systems. In 22nd World Wide WebConference.

Marco Cornolti, Paolo Ferragina, Massimiliano Cia-ramita, Stefan Rud, and Hinrich Schutze. 2016. Apiggyback system for joint entity mention detectionand linking in web queries. In Proceedings of the25th International Conference on World Wide Web,pages 567–578, Republic and Canton of Geneva,Switzerland. International World Wide Web Confer-ences Steering Committee.

Silviu Cucerzan. 2007. Large-scale named entity dis-ambiguation based on Wikipedia data. In Proceed-ings of the 2007 Joint Conference on EmpiricalMethods in Natural Language Processing and Com-putational Natural Language Learning (EMNLP-CoNLL), pages 708–716, Prague, Czech Republic.Association for Computational Linguistics.

Gianluca Demartini, Iliya Enchev, Marcin Wylot,Joel Gapany, and Philippe Cudre-Mauroux. 2011.BowlognaBench - benchmarking RDF analytics. InData-Driven Process Discovery and Analysis SIM-PDA, volume 116, pages 82–102. Springer.

Leon Derczynski, Diana Maynard, Giuseppe Rizzo,Marieke van Erp, Genevieve Gorrell, RaphaelTroncy, Johann Petrak, and Kalina Bontcheva. 2015.Analysis of named entity recognition and linking fortweets. Information Processing and Management,51(2):32–49.

George R Doddington, Alexis Mitchell, Mark A Przy-bocki, Lance A Ramshaw, Stephanie Strassel, andRalph M Weischedel. 2004. The automatic contentextraction (ACE) program-tasks, data, and evalua-tion. In LREC.

Songyun Duan, Anastasios Kementsietsidis, SrinivasKavitha, and Octavian Udrea. 2011. Apples and or-anges: A comparison of RDF benchmarks and realRDF datasets. In SIGMOD, pages 145–156. ACM.

Philip Edmonds and Scott Cotton. 2001. Senseval-2:Overview. In The Proceedings of the Second Inter-national Workshop on Evaluating Word Sense Dis-ambiguation Systems, pages 1–5, Stroudsburg, PA,USA. Association for Computational Linguistics.

Orri Erling, Alex Averbuch, Josep-Lluis Larriba-Pey,Hassan Chafi, Andrey Gubichev, Arnau Prat-Perez,Minh-Duc Pham, and Peter A. Boncz. 2015. TheLDBC Social Network Benchmark: Interactiveworkload. In SIGMOD, pages 619–630. ACM.

Jerome Euzenat, Alfio Ferrara, Willem Robert vanHage, Laura Hollink, Christian Meilicke, AndriyNikolov, Dominique Ritze, Francois Scharffe, PavelShvaiko, Heiner Stuckenschmidt, et al. 2011. Re-sults of the Ontology Alignment Evaluation Initia-tive 2011. In Proceedings of the 6th InternationalWorkshop on Ontology Matching, pages 85–113.

Jerome Euzenat, Alfio Ferrara, Christian Meilicke,Juan Pane, Francois Scharffe, Pavel Shvaiko, HeinerStuckenschmidt, Ondrej Svab-Zamazal, VojtechSvatek, and Cassia Trojahn dos Santos. 2010. Re-sults of the Ontology Alignment Evaluation Initia-tive 2010. In Proceedings of the 5th InternationalWorkshop on Ontology Matching (OM-2010), pages85–117.

Javier D. Fernandez, Jurgen Umbrich, Axel Polleres,and Magnus Knuth. 2019. Evaluating query andstorage strategies for RDF archives. Semantic Web,10:247–291.

Tim Finin, Will Murnane, Anand Karandikar, NicholasKeller, Justin Martineau, and Mark Dredze. 2010.Annotating named entities in twitter data withcrowdsourcing. In Proceedings of the NAACL HLT2010 Workshop on Creating Speech and LanguageData with Amazon’s Mechanical Turk, pages 80–88,Stroudsburg, PA, USA. Association for Computa-tional Linguistics.

Raul Garcıa-Castro and Stuart N. Wrigley. 2011.SEALS Methodology for Evaluation Campaigns.Technical report, Seventh Framework Programme.

Daniel Gerber, Diego Esteves, Jens Lehmann, LorenzBuhmann, Ricardo Usbeck, Axel-Cyrille NgongaNgomo, and Rene Speck. 2015. Defacto—temporaland multilingual deep fact validation. Journal ofWeb Semantics, 35:85–101. Machine Learning andData Mining for the Semantic Web (MLDMSW).

Olaf Gorlitz, Matthias Thimm, and Steffen Staab. 2012.SPLODGE: systematic generation of SPARQLbenchmark queries for linked open data. In ISWC,volume 7649, pages 116–132. Springer.

Bernardo Cuenca Grau, Zlatan Dragisic, Kai Eck-ert, Jerome Euzenat, Alfio Ferrara, Roger Granada,Valentina Ivanova, Ernesto Jimenez-Ruiz, An-dreas Oskar Kempf, Patrick Lambrix, et al. 2013.Results of the Ontology Alignment Evaluation Ini-tiative 2013. In Proceedings of the 8th InternationalWorkshop on Ontology Matching co-located with the12th International Semantic Web Conference, pages61–100.

Jim Gray and Charles Levine. 2005. Thousands ofDebitCredit Transactions-Per-Second: Easy and In-expensive. Technical report, Microsoft Research.

Yuanbo Guo, Zhengxiang Pan, and Jeff Heflin. 2005.LUBM: a benchmark for OWL knowledge base sys-tems. J. Web Sem., 3(2-3):158–182.

Souleiman Hasan, Edward Curry, Mauricio Banduk,and Sean O’Riain. 2011. Toward situation aware-ness for the semantic sensor web: Complex eventprocessing with dynamic linked data enrichment.Semantic Sensor Networks, page 60.

Johannes Hoffart, Yasemin Altun, and GerhardWeikum. 2014. Discovering emerging entities withambiguous names. In Proceedings of the 23rd In-ternational Conference on World Wide Web, pages385–396, New York, NY, USA. ACM.

Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bor-dino, Hagen Furstenau, Manfred Pinkal, Marc Span-iol, Bilyana Taneva, Stefan Thater, and GerhardWeikum. 2011. Robust disambiguation of named en-tities in text. In Proceedings of the 2011 Conferenceon Empirical Methods in Natural Language Process-ing, pages 782–792, Edinburgh, Scotland, UK. Asso-ciation for Computational Linguistics.

Robert Isele, Anja Jentzsch, and Christian Bizer. 2011.Efficient multidimensional blocking for link discov-ery without losing recall. In Proceedings of the 14thInternational Workshop on the Web and Databases2011.

Kunal Jha, Michael Roder, and Axel-Cyrille NgongaNgomo. 2017. All That Glitters is not Gold – Rule-Based Curation of Reference Datasets for NamedEntity Recognition and Entity Linking. In The Se-mantic Web. Latest Advances and New Domains:14th International Conference, ESWC 2017, Pro-ceedings. Springer International Publishing.

Adam Kilgarriff. 1998. Senseval: An exercise in eval-uating word sense disambiguation programs. 1stLREC.

Venelin Kotsev, Nikos Minadakis, Vassilis Papakon-stantinou, Orri Erling, Irini Fundulaki, and AtanasKiryakov. 2016. Benchmarking RDF Query En-gines: The LDBC Semantic Publishing Benchmark.In BLINK@ISWC.

Sayali Kulkarni, Amit Singh, Ganesh Ramakrishnan,and Soumen Chakrabarti. 2009. Collective annota-tion of wikipedia entities in web text. In Proceed-ings of the 15th ACM SIGKDD International Con-ference on Knowledge Discovery and Data Mining,pages 457–466, New York, NY, USA. Associationfor Computing Machinery.

Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch,Dimitris Kontokostas, Pablo N. Mendes, SebastianHellmann, Mohamed Morsey, Patrick van Kleef,Soren Auer, et al. 2015. DBpedia - a large-scale, multilingual knowledge base extracted fromwikipedia. Semantic Web Journal, 6(2):167–195.

Paul McNamee. 2009. Overview of the TAC 2009knowledge base population track.

Marios Meimaris and George Papastefanatos. 2016.The EvoGen Benchmark Suite for Evolving RDFData. In MEPDaW/LDQ@ ESWC, pages 20–35.

Rada Mihalcea, Timothy Chklovski, and Adam Kilgar-riff. 2004. The Senseval-3 English Lexical SampleTask. In Proceedings of Senseval-3: Third Interna-tional Workshop on the Evaluation of Systems for theSemantic Analysis of Text. Association for Computa-tional Linguistics (ACL).

Ian Millard, Hugh Glaser, Manuel Salvadores, andNigel Shadbolt. 2010. Consuming multiple linkeddata sources: Challenges and experiences. In COLDWorkshop.

David Milne and Ian H. Witten. 2008. Learning to linkwith wikipedia. In Proceedings of the 17th ACMConference on Information and Knowledge Manage-ment, pages 509–518, New York, NY, USA. Associ-ation for Computing Machinery.

Mohamed Morsey, Jens Lehmann, Soren Auer, andAxel-Cyrille Ngonga Ngomo. 2011. DBpediaSPARQL benchmark - performance assessment withreal queries on real data. In ISWC, volume 7031,pages 454–469. Springer.

Markus Nentwig, Michael Hartung, Axel-Cyrille Ngonga Ngomo, and Erhard Rahm. 2017.A survey of current link discovery frameworks.Semantic Web, 8(3):419–436.

Axel-Cyrille Ngonga Ngomo. 2012a. Link discoverywith guaranteed reduction ratio in affine spaces withminkowski measures. In The Semantic Web - ISWC2012 - 11th International Semantic Web Conference,Boston, MA, USA, November 11-15, 2012, Proceed-ings, Part I, pages 378–393.

Axel-Cyrille Ngonga Ngomo. 2012b. On link discov-ery using a hybrid approach. J. Data Semantics,1(4):203–217.

Axel-Cyrille Ngonga Ngomo. 2013. ORCHID -reduction-ratio-optimal computation of geo-spatialdistances for link discovery. In The Semantic Web- ISWC 2013 - 12th International Semantic Web

Conference, Sydney, NSW, Australia, October 21-25,2013, Proceedings, Part I, pages 395–410.

Axel-Cyrille Ngonga Ngomo and Soren Auer. 2011.LIMES - A time-efficient approach for large-scalelink discovery on the web of data. In IJCAI 2011,Proceedings of the 22nd International Joint Confer-ence on Artificial Intelligence, pages 2312–2317.

Axel-Cyrille Ngonga Ngomo, Soren Auer, JensLehmann, and Amrapali Zaveri. 2014. Introductionto Linked Data and Its Lifecycle on the Web, pages1–99. Springer International Publishing, Cham.

Axel-Cyrille Ngonga Ngomo, Jens Lehmann, SorenAuer, and Konrad Hoffner. 2011. RAVEN - activelearning of link specifications. In Proceedings of the6th International Workshop on Ontology Matching,pages 25–36.

Axel-Cyrille Ngonga Ngomo and Klaus Lyko. 2012.EAGLE: efficient active learning of link specifica-tions using genetic programming. In The Seman-tic Web: Research and Applications - 9th ExtendedSemantic Web Conference, ESWC 2012, Heraklion,Crete, Greece, May 27-31, 2012. Proceedings, pages149–163.

Axel-Cyrille Ngonga Ngomo and Klaus Lyko. 2013.Unsupervised learning of link specifications: deter-ministic vs. non-deterministic. In Proceedings ofthe 8th International Workshop on Ontology Match-ing co-located with the 12th International SemanticWeb Conference, pages 25–36.

Axel-Cyrille Ngonga Ngomo, Michael Roder, DiegoMoussallem, Ricardo Usbeck, and Rene Speck.2018. BENGAL: an automatic benchmark generatorfor entity recognition and linking. In Proceedings ofthe 11th International Conference on Natural Lan-guage Generation, pages 339–349. Association forComputational Linguistics.

Andriy Nikolov, Victoria S. Uren, and Enrico Motta.2007. Knofuss: a comprehensive architecture forknowledge fusion. In Proceedings of the 4th Inter-national Conference on Knowledge Capture, pages185–186.

Andrea Giovanni Nuzzolese, Anna Lisa Gentile,Valentina Presutti, Aldo Gangemi, Robert Meusel,and Heiko Paulheim. 2016. The second open knowl-edge extraction challenge. In Semantic Web Chal-lenges, pages 3–16, Cham. Springer InternationalPublishing.

Andrea-Giovanni Nuzzolese, AnnaLisa Gentile,Valentina Presutti, Aldo Gangemi, Darıo Garigliotti,and Roberto Navigli. 2015. Open knowledgeextraction challenge. In Semantic Web EvaluationChallenges, volume 548, pages 3–15. SpringerInternational Publishing.

Vassilis Papakonstantinou, Giorgos Flouris, Irini Fun-dulaki, Kostas Stefanidis, and Giannis Roussakis.2016. Versioning for Linked Data: Archiving Sys-tems and Benchmarks. In BLINK@ISWC.

Vassilis Papakonstantinou, Giorgos Flouris, Irini Fun-dulaki, Kostas Stefanidis, and Yannis Roussakis.2017. Spbv: Benchmarking linked data archivingsystems. In BLINK/NLIWoD3@ISWC.

Shi Qiao and Z. Meral Ozsoyoglu. 2015. RBench:Application-specific RDF benchmarking. In SIG-MOD, pages 1825–1838. ACM.

L. Ratinov, D. Roth, D. Downey, and M. Anderson.2011. Local and global algorithms for disambigua-tion to wikipedia. In ACL.

Alan Ritter, Sam Clark, Mausam, and Oren Etzioni.2011. Named entity recognition in tweets: An ex-perimental study. In EMNLP.

Giuseppe Rizzo, Amparo Cano Basave, Bianca Pereira,and Andrea VARGA. 2015. Making sense of micro-posts (#microposts2015) named entity recognition &linking challenge.

Giuseppe Rizzo, Mareike van Erp, Julien Plu, andRaphael Troncy. 2016. Making Sense of Microp-osts (#Microposts2016) Named Entity rEcognitionand Linking (NEEL) Challenge. pages 50–59.

Michael Roder, Denis Kuchelev, and Axel-Cyrille Ngonga Ngomo. 2019. Hobbit: A platformfor benchmarking big linked data. Data Science,PrePress(PrePress):1–21.

Michael Roder, Ricardo Usbeck, Sebastian Hellmann,Daniel Gerber, and Andreas Both. 2014. N3 - a col-lection of datasets for named entity recognition anddisambiguation in the nlp interchange format. In 9thLREC.

Michael Roder, Ricardo Usbeck, and Axel-Cyrille Ngonga Ngomo. 2018. GERBIL -benchmarking named entity recognition andlinking consistently. Semantic Web, 9(5):605–625.

Muhammad Saleem, Muhammad Intizar Ali, AidanHogan, Qaiser Mehmood, and Axel-Cyrille NgongaNgomo. 2015a. LSQ: the linked SPARQL queriesdataset. In ISWC, volume 9367, pages 261–269.Springer.

Muhammad Saleem, Ali Hasnain, and Axel-Cyrille Ngonga Ngomo. 2018. LargeRDFBench:A billion triples benchmark for SPARQL endpointfederation. J. Web Sem., 48:85–125.

Muhammad Saleem, Qaiser Mehmood, and Axel-Cyrille Ngonga Ngomo. 2015b. FEASIBLE:a Feature-based SPARQL Benchmark GenerationFramework. In International Semantic Web Confer-ence (ISWC).

Muhammad Saleem, Shanmukha S. Padmanabhuni,Axel-Cyrille Ngonga Ngomo, Jonas S. Almeida, Ste-fan Decker, and Helena F. Deus. 2013. Linkedcancer genome atlas database. In I-SEMANTICS2013 - 9th International Conference on SemanticSystems, ISEM ’13, Graz, Austria, September 4-6,2013, pages 129–134.

Muhammad Saleem, Claus Stadler, Qaiser Mehmood,Jens Lehmann, and Axel-Cyrille Ngonga Ngomo.2017. SQCFramework: SPARQL query contain-ment benchmark generation framework. In K-CAP,pages 28:1–28:8.

Muhammad Saleem, Gabor Szarnyas, Felix Conrads,Syed Ahmad Chan Bukhari, Qaiser Mehmood, andAxel-Cyrille Ngonga Ngomo. 2019. How represen-tative is a sparql benchmark? an analysis of rdftriplestore benchmarks? In The World Wide WebConference, pages 1623–1633. ACM.

Michael Schmidt, Thomas Hornung, Michael Meier,Christoph Pinkel, and Georg Lausen. 2009.SP2Bench: A SPARQL performance benchmark.In Semantic Web Information Management - AModel-Based Perspective, pages 371–393.

Andreas Schultz, Andrea Matteini, Robert Isele, Chris-tian Bizer, and Christian Becker. 2011. LDIF -linked data integration framework. In COLD.

Mohamed Ahmed Sherif, Axel-Cyrille NgongaNgomo, and Jens Lehmann. 2015. AutomatingRDF dataset transformation and enrichment. In12th Extended Semantic Web Conference, Portoroz,Slovenia, 31st May - 4th June 2015. Springer.

Mohamed Ahmed Sherif, Axel-Cyrille NgongaNgomo, and Jens Lehmann. 2017. WOMBAT -A Generalization Approach for Automatic LinkDiscovery. In 14th Extended Semantic Web Confer-ence, Portoroz, Slovenia, 28th May - 1st June 2017.Springer.

Baoxu Shi and Tim Weninger. 2016. Discriminativepredicate path mining for fact checking in knowl-edge graphs. Knowledge-Based Systems, 104:123–133.

Prashant Shiralkar, Alessandro Flammini, FilippoMenczer, and Giovanni Luca Ciampaglia. 2017.Finding streams in knowledge graphs to support factchecking. In 2017 IEEE International Conferenceon Data Mining (ICDM), pages 859–864. IEEE.

Tommaso Soru and Axel-Cyrille Ngonga Ngomo.2013. Rapid execution of weighted edit distances.In Proceedings of the 8th International Workshop onOntology Matching co-located with the 12th Interna-tional Semantic Web Conference, pages 1–12.

Rene Speck and Axel-Cyrille Ngomo Ngonga. 2018.On extracting relations using distributional seman-tics and a tree generalization. In Knowledge Engi-neering and Knowledge Management. Springer In-ternational Publishing.

Rene Speck, Michael Roder, Felix Conrads, Hyn-davi Rebba, Catherine Camilla Romiyo, GurudeviSalakki, Rutuja Suryawanshi, Danish Ahmed, NikitSrivastava, Mohit Mahajan, et al. 2018. Open knowl-edge extraction challenge 2018. In Semantic WebChallenges, pages 39–51, Cham. Springer Interna-tional Publishing.

Rene Speck, Michael Roder, Sergio Oramas, LuisEspinosa-Anke, and Axel-Cyrille Ngonga Ngomo.2017. Open knowledge extraction challenge 2017.In Semantic Web Challenges, pages 35–48, Cham.Springer International Publishing.

Claus Stadler, Jens Lehmann, Konrad Hoffner, andSoren Auer. 2012. Linkedgeodata: A core for a webof spatial open data. Semantic Web, 3(4):333–354.

Beth M. Sundheim. 1993. Tipster/MUC-5: Informa-tion extraction system evaluation. In Proceedings ofthe 5th Conference on Message Understanding.

Zafar Habeeb Syed, Michael Roder, and Axel-Cyrille Ngonga Ngomo. 2019. Unsupervised dis-covery of corroborative paths for fact validation. InThe Semantic Web – ISWC 2019, pages 630–646,Cham. Springer International Publishing.

Zafar Habeeb Syed, Michael Roder, and Axel-CyrilleNgonga Ngomo. 2018. Factcheck: Validating rdftriples using textual evidence. In Proceedings ofthe 27th ACM International Conference on Infor-mation and Knowledge Management, pages 1599–1602, New York, NY, USA. Association for Com-puting Machinery.

Gabor Szarnyas, Benedek Izso, Istvan Rath, andDaniel Varro. 2018a. The Train Benchmark: Cross-technology performance evaluation of continuousmodel queries. Softw. Syst. Model., 17(4):1365–1393.

Gabor Szarnyas, Arnau Prat-Perez, Alex Averbuch,Jozsef Marton, Marcus Paradies, Moritz Kaufmann,Orri Erling, Peter A. Boncz, Vlad Haprian, andJanos Benjamin Antal. 2018b. An early look at theLDBC social network benchmark’s business intelli-gence workload. In Proceedings of the 1st ACM SIG-MOD Joint International Workshop on Graph DataManagement Experiences & Systems (GRADES)and Network Data Analytics (NDA), pages 9:1–9:11.

Erik F. Tjong Kim Sang and Fien De Meulder.2003. Introduction to the CoNLL-2003 Shared Task:Language-Independent Named Entity Recognition.In Proceedings of CoNLL-2003.

Priyansh Trivedi, Gaurav Maheshwari, MohnishDubey, and Jens Lehmann. 2017. Lc-quad: A cor-pus for complex question answering over knowledgegraphs. In The Semantic Web – ISWC 2017, pages210–218, Cham. Springer International Publishing.

George Tsatsaronis, Georgios Balikas, ProdromosMalakasiotis, Ioannis Partalas, Matthias Zschunke,Michael R. Alvers, Dirk Weissenborn, AnastasiaKrithara, Sergios Petridis, Dimitris Polychronopou-los, et al. 2015. An overview of the BIOASQ large-scale biomedical semantic indexing and question an-swering competition. BMC Bioinformatics, 16:138.

Christina Unger, Corina Forascu, Vanessa Lopez, Axel-Cyrille Ngonga Ngomo, Elena Cabrio, Philipp Cimi-ano, and Sebastian Walter. 2014. Question answer-ing over linked data (QALD-4). In CLEF, pages1172–1180.

Christina Unger, Corina Forascu, Vanessa Lopez, Axel-Cyrille Ngonga Ngomo, Elena Cabrio, Philipp Cimi-ano, and Sebastian Walter. 2015. Question answer-ing over linked data (QALD-5). In CLEF.

Christina Unger, Axel-Cyrille Ngonga Ngomo, andElena Cabrio. 2016. 6th Open Challenge on Ques-tion Answering over Linked Data (QALD-6), pages171–177. Springer International Publishing, Cham.

Ricardo Usbeck, Michael Roder, Michael Hoffmann,Felix Conrads, Jonathan Huthmann, Axel-CyrilleNgonga Ngomo, Christian Demmler, and ChristinUnger. 2019. Benchmarking Question AnsweringSystems. Semantic Web, 10(2):293–304.

Ricardo Usbeck, Michael Roder, Axel-Cyrille NgongaNgomo, Ciro Baron, Andreas Both, MartinBrummer, Diego Ceccarelli, Marco Cornolti, DidierCherix, Bernd Eickmann, et al. 2015. GERBIL –general entity annotation benchmark framework. In24th WWW conference.

Ellen M Voorhees et al. 1999. The trec-8 question an-swering track report. In Trec, volume 99, pages 77–82.

Mark D Wilkinson, Michel Dumontier, IJsbrand JanAalbersberg, Gabrielle Appleton, Myles Axton,Arie Baak, Niklas Blomberg, Jan-Willem Boiten,Luiz Bonino da Silva Santos, Philip E Bourne, et al.2016. The fair guiding principles for scientific datamanagement and stewardship. Scientific data, 3.

Stuart N. Wrigley, Raul Garcıa-Castro, and LyndonNixon. 2012. Semantic evaluation at large scale(seals). In Proceedings of the 21st InternationalConference on World Wide Web, pages 299–302,New York, NY, USA. Association for ComputingMachinery.

Hongyan Wu, Toyofumi Fujiwara, Yasunori Ya-mamoto, Jerven T. Bolleman, and Atsuko Yam-aguchi. 2014. BioBenchmark Toyama 2012: Anevaluation of the performance of triple stores on bio-logical data. J. Biomedical Semantics, 5:32.