Repairing Broken RDF Links in the Web of Data


Mohammad Pourzaferani*
Department of Computer Engineering, University of Isfahan, Isfahan, Iran
E-mail: [email protected]
E-mail: [email protected]
*Corresponding author

Mohammad Ali Nematbakhsh
Department of Computer Engineering, University of Isfahan, Isfahan, Iran
E-mail: [email protected]

Abstract: In the Web of Data, linked datasets change over time. These changes include updates to the features and addresses of entities. Address changes in RDF entities cause their corresponding links to be broken, and broken links are one of the major obstacles the Web of Data faces. Most approaches to this problem attempt to fix broken links at the destination point of the link. These approaches have two major problems: a single point of failure, and reliance on the destination data source. In this paper, we introduce a method for fixing broken links that operates at the source point of the link and discovers the new address of the detached entity. To this end, we introduce two datasets, which we call “Superior” and “Inferior”. Through these datasets, our method creates an exclusive graph structure for each entity that needs to be observed over time. This graph is used to identify and discover the new address of the detached entity. Afterwards, the most similar entity, which is the candidate for the detached entity, is deduced and suggested by the algorithm. The proposed model is evaluated on the DBpedia dataset within the domain of “person” entities. The results show that most of the broken links that referred to a “person” entity in DBpedia were fixed correctly.

Keywords: Broken Links; Link Integrity; Linked Data; Web of Data.

Reference to this paper should be made as follows: Pourzaferani, M. and Nematbakhsh, M.A. (xxxx) ‘Reconstructing Broken RDF Links in the Web of Data Using Superior and Inferior Sets’, Int. J. of Web and Grid Services, Vol. x, No. x, pp.xxx–xxx.

Biographical notes: Mohammad Pourzaferani received the BS degree in computer engineering from Islamic Azad University, Najafabad branch, Iran in 2009. He is currently pursuing a Master of Science in Software Engineering at the University of Isfahan, Iran. His research interests include semantic web, intelligent systems and soft computing.

Copyright © 2012 Inderscience Enterprises Ltd.


Mohammad Ali Nematbakhsh is an associate professor of computer engineering at the University of Isfahan. He received his B.Sc. in Electrical Engineering from Louisiana Tech University in 1981 and his M.Sc. and PhD degrees in Electrical and Computer Engineering from the University of Arizona, USA, in 1983 and 1987, respectively. He worked for Micro Advanced Co. and Toshiba Corporation in the USA and Japan for many years before joining the University of Isfahan. He has published more than 100 research papers, three U.S. registered patents and a database book that is widely used in universities. His main research interests include semantic web, database systems and computer network software.

1 Introduction

The Web of Data is a new practical approach to achieving Semantic Web goals (Bizer et al., 2009). Technically, the Web of Data is a method for publishing data on the Web that relies on URIs as resource identifiers and on the HTTP protocol to retrieve resource descriptions (Auer et al., 2007). This technology employs the Resource Description Framework (RDF) data model for publishing structured data on the Web and produces a well-structured, machine-readable generation of the Web. Machines can therefore process and mine this information without complex natural language processing techniques, which in many situations return inappropriate results. These differences between the Web of Documents and the Web of Data have led to remarkable achievements for the Web of Data technology (Popitsch and Haslhofer, 2011).

Although there are various tools for publishing information according to the Linked Data publishing rules (Auer et al., 2009; Zhihong et al., 2011) and many tools for discovering semantic links between entities (Volz et al., 2009), there are no suitable tools for preserving these links (Popitsch and Haslhofer, 2011; Volz et al., 2009; Vesse et al., 2010).

One aspect of Linked Data datasets is that they change over time (Umbrich et al., 2010). New entities are added and old entities are removed; new links are set to other entities, and some links are eliminated when the target entity has been detached. Another type of change is a movement change, in which the address of the target entity changes and is therefore no longer dereferenceable. Such links are called broken links, and they play an important role in the data quality of the Web of Data (Popitsch and Haslhofer, 2011; Vesse et al., 2010). Therefore, finding an automated tool to fix broken links is a problem worth tackling.

Broken links appear in two situations:

• The destination of the link is not dereferenceable: the destination entity’s address has been deleted or changed, so these links are structurally broken.


• The definition of the entity has changed over time: the entity is still dereferenceable, but its meaning has changed in a way that inherently alters the original meaning intended by its producer.

In this paper, we consider links that are structurally broken. Recent research in this field reveals two common weaknesses: (i) broken links are checked and fixed at the destination point of the link, which requires data about the destination entity that has been detached; and (ii) some approaches create a central service to notify other datasets of changes, and this centrality causes a single point of failure.

The approach proposed in this paper attempts to detect and fix broken links at the source point; by distributing the approach across all datasets, the second drawback can also be avoided. To this end, we introduce two entity datasets: Superior and Inferior. Using these two datasets, the algorithm finds the missing entity and fixes the broken link.

The subsequent sections of this paper are organised as follows. In Section 2, a comprehensive overview of related work is presented. In Section 3, a real example from the DBpedia dataset is described to explain the broken link phenomenon. In Section 4, formal definitions of the different terms, including RDF, Superior entity and Inferior entity, are provided. In Section 5, the proposed model for fixing broken links is presented, alongside a comprehensive description of each module of the model. Section 6 describes the experiments that were carried out on a real dataset and presents a comparison with existing work. Finally, conclusions are drawn and directions for future research are outlined in Section 7.

2 Related work

Broken links have been discussed in other areas such as the Web and hypertext systems, where the topic is known as link integrity. The problem is divided into two sub-problems (Vesse et al., 2010; Ashman, 2000):

• Dangling link: this problem occurs when the destination point of the link is not accessible, usually because the entity was deleted or its address changed.

• Editing problem: although the destination entity is accessible, the semantics of the entity have changed such that it no longer means what the creator of the link intended.

In the Web of Documents, there are a number of approaches to ensuring link integrity. People are the end users of web pages; therefore, many sites do not deal with this problem and leave it to the end users, who can find alternative paths to their information using search engines. For machine actors, however, the problem is obviously much harder to deal with. Redirecting the problem to an application is therefore not suitable, because unrepaired links cause further problems in the program and decrease the quality of datasets.


The algorithm proposed by Vesse et al. (2010) is based on an expansion profile that is used to find data about a URI. The expansion profile is a Vocabulary of Interlinked Datasets (VoID) description, that is, a set of datasets and linksets defined by the user. In this approach, the algorithm finds equivalent URIs using the SameAs.org discovery endpoint (http://www.sameas.org) and the Sindice API (http://www.sindice.com). These services are not generally available for the Web of Data. Furthermore, if the observed dataset is not linked with other datasets, these services will not produce any results. In addition, the success of the algorithm depends on the end user, who must specify the expansion profile.

Auer et al. (2009) presented Triplify, a simple tool for publishing Linked Data from relational databases, which also publishes logs of the updates that occurred within a specific timespan. In this work, updates and deletions are logged but movement changes are not considered. In addition, the overview of the approach shows that broken links are fixed at the destination point of the link, which means that each dataset should publish its changes over time even if no entity refers to it.

Popitsch and Haslhofer (2011) proposed a method for fixing broken links that consists of two steps. In the first step, called “monitoring”, each newly created entity that is detected is added to an item index II, and deleted entities are added to a removed item index RII. The second step is housekeeping: the two indices are compared and the algorithm determines which entity in the II index is equivalent to an entry in the RII index. They propose a time-interval-based blocking method to reduce the number of comparisons.

Liu and Li (2011) utilised metadata to resolve broken links. Each entity has its own metadata, and in their approach entities are linked through metadata, so linked data becomes linked metadata. To fix broken links they introduced two concepts called “internal ID” and “external ID”. The internal ID is used to identify entities within the same data source, while the external ID is used to access the entity from outside. For every entity there is a mapping between external and internal IDs. When a movement event occurs, the algorithm searches over the metadata to find an appropriate alternative for the detached entity. If an alternative is found, the external ID is mapped to the new internal ID.

3 Case Study

Figure 1 shows an illustrative example of a person entity in DBpedia snapshot 3.6 (http://wiki.dbpedia.org/Downloads36) whose address has changed in DBpedia snapshot 3.7 (http://wiki.dbpedia.org/Downloads37). As depicted in this example, many changes happen to an entity after movement. These changes fall into three categories (many properties are omitted here for brevity):

• Deletion: some properties were removed. For instance, the battle “Iraq War” was removed.



• Modification: some properties are updated due to changes over time. As shown in Figure 1, the birth place of the entity was the resource “DBp:United_Kingdom” (where “DBp:” is a shortened form of “http://dbpedia.org/resource/”), but in the 3.7 snapshot it has been changed to “DBp:England” (see Figure 2). Furthermore, the predicate associated with the birth place changed from “birthPlace” to “placeOfBirth” following ontology changes.

• Addition: some information is added to the entity. For example, information such as his presence at the battle of “The Troubles” and the awards he received between the 3.6 and 3.7 publishing dates was added to the new version.

Figure 1 The sample entity in the 3.6 snapshot
[Figure omitted: RDF graph of DBpedia:Richard_Dannatt in snapshot 3.6, linking to resources such as Green_Howards, United_Kingdom, Iraq_War, Kosovo, Military_Cross and Category:Evangelical_Anglicans via predicates such as battle, battles, commands, birthPlace, awards and subject; additions, modifications and deletions are highlighted.]

4 Preliminaries

The proposed algorithm uses semantic links among entities to fix broken links. In the Web of Data, the RDF model encodes data in the form of subject, predicate, object triples. The subject of a triple is identified by an HTTP URI. The object can be either a URI or a string literal (Papavassiliou et al., 2009; Auer and Herre, 2007; Udrea et al., 2007). The predicate specifies how the subject and object are related and is also represented by a URI.

The RDF model creates a graph structure for each entity. This structure consists of two types of entities, which we call “Superior Entity” and “Inferior Entity”.

• Definition 4.1 (Superior Entity): an entity that has a semantic link referring to the observed entity. That is, every entity that has the observed entity as the object of one of its triples is included in the Superior dataset.



Figure 2 The sample entity in the 3.7 snapshot
[Figure omitted: RDF graph of DBpedia:Richard_Dannatt,_Baron_Dannatt in snapshot 3.7, linking to resources such as Green_Howards, HQ_Land_Forces, England, Broomfield,_Essex, Kosovo_War, General_(United_Kingdom), Constable_of_the_Tower and Category:Crossbench_life_peers via predicates such as battle, battles, commands, placeOfBirth, rank, unit, militaryUnit, occupation, laterwork and subject.]

Figure 3 Superior and Inferior entities
[Figure omitted: schematic showing Superior entities pointing to the observed entity and the observed entity pointing to Inferior entities.]

• Definition 4.2 (Inferior Entity): an entity that the observed entity has a semantic link to. Following the previous definition, every entity that appears in a triple with the observed entity as its subject is included in the Inferior dataset. Both sets are illustrated in the sketch below.
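
To make the two definitions concrete, the following minimal sketch (our own illustration, not the paper's implementation) extracts the Superior and Inferior sets of an observed entity from an in-memory RDF graph using the Apache Jena API; the DBpedia URI is simply the case-study entity used as an example.

```java
import org.apache.jena.rdf.model.*;

import java.util.HashSet;
import java.util.Set;

// Minimal sketch: extract the Superior and Inferior sets of one observed
// entity from an in-memory Jena model. The URI is only an example.
public class SuperiorInferiorExample {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        // In practice the model would be loaded from a DBpedia dump,
        // e.g. model.read("file:persondata_en.nt", "N-TRIPLES");
        Resource observed =
            model.createResource("http://dbpedia.org/resource/Richard_Dannatt");

        // Superior entities: subjects of triples whose object is the observed entity.
        Set<Resource> superiors = new HashSet<>();
        StmtIterator incoming = model.listStatements(null, null, observed);
        while (incoming.hasNext()) {
            superiors.add(incoming.next().getSubject());
        }

        // Inferior entities: URI resources the observed entity points at.
        Set<Resource> inferiors = new HashSet<>();
        StmtIterator outgoing = model.listStatements(observed, null, (RDFNode) null);
        while (outgoing.hasNext()) {
            RDFNode object = outgoing.next().getObject();
            if (object.isURIResource()) {
                inferiors.add(object.asResource());
            }
        }

        System.out.println("Superiors: " + superiors.size()
                + ", Inferiors: " + inferiors.size());
    }
}
```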

5 The Proposed Algorithm

The fundamental idea introduced in this paper is that entities mostly preserve their graph structure even after moving to another location. Despite the changes to the properties of the entity, its graph structure is not modified significantly. Therefore, the graph structure is used to find candidate entities that are similar to the detached entity.

The proposed approach has two basic goals:

• To introduce a graph structure for an entity that is used as a fingerprint to distinguish it from other entities.

• To reduce the search space, since searching for the detached entity in the entire dataset is not an appropriate approach.


Figure 4 The proposed algorithm architecture
[Figure omitted: architecture diagram showing the local and destination data sources, the Initial Repository, the Crawler Controller (which finds Superior and Inferior entities and the Superiors of Inferiors and vice versa), the Entity Repository with its Superior and Inferior datasets, the Analyzer module (durability and publicity statistics), the Query Maker (which sends queries) and the Ranking module (which produces the final candidates).]

The proposed algorithm consists of five basic modules (see Figure 4), which are briefly described as follows.

5.1 Entity Repository

The proposed approach consists of two basic phases. In the first phase, useful information is stored in the Entity Repository module, which is responsible for storing entities and their associated links. For each entity at the destination data source to which there is a link, the algorithm stores two types of information: the Superior entities and their predicates are stored in the Superior dataset, while the Inferior entities and their associated predicates are stored in the Inferior dataset. The second phase starts when the link is broken.

Changes in structure and content are a prominent aspect of data sources. Therefore, this module should remove obsolete entities from the repository and store the new addresses of moved entities. Statistics about the changes between the DBpedia 3.6 and 3.7 snapshots are shown in Table 1. In the second phase, the link is repaired using this information and the modules described below.
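
A minimal sketch of the kind of structure the Entity Repository could maintain is shown below: for every observed entity it keeps the (predicate, entity) pairs of its Superior and Inferior sets. All class and method names are ours, introduced only for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative Entity Repository: per observed entity, the set of
// (predicate, entity) pairs that form its Superior and Inferior sets.
public class EntityRepository {

    /** One (predicate, entity) edge of the stored graph structure. */
    public record Edge(String predicateUri, String entityUri) {}

    private final Map<String, Set<Edge>> superiors = new HashMap<>();
    private final Map<String, Set<Edge>> inferiors = new HashMap<>();

    public void addSuperior(String entityUri, String predicateUri, String superiorUri) {
        superiors.computeIfAbsent(entityUri, k -> new HashSet<>())
                 .add(new Edge(predicateUri, superiorUri));
    }

    public void addInferior(String entityUri, String predicateUri, String inferiorUri) {
        inferiors.computeIfAbsent(entityUri, k -> new HashSet<>())
                 .add(new Edge(predicateUri, inferiorUri));
    }

    public Set<Edge> superiorsOf(String entityUri) {
        return superiors.getOrDefault(entityUri, Set.of());
    }

    public Set<Edge> inferiorsOf(String entityUri) {
        return inferiors.getOrDefault(entityUri, Set.of());
    }
}
```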


Table 1 Statistics regarding the DBpedia dataset

Measure                                      Value
# of person entities in the 3.6 version      296595
# of entities whose address changed          4726
# of entities deleted                        16527
# of entities added in the 3.7 version       510636

5.2 Crawler Controller

This module starts its task with the initial entities in the local data source. First, for each link that refers to an outer dataset, the algorithm stores the destination entity in the initial repository. Afterwards, for each entity in the initial repository, the Crawler Controller finds the Superior and Inferior entities and stores them in the corresponding repository (see Figure 4). There is a relationship between this module and the Analyzer module: the Analyzer guides the Crawler Controller by feeding it appropriate queries.
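
One plausible reading of the seeding step is sketched below under our own assumptions: scan the local model for triples whose object lies outside the local namespace and record those destination entities in the initial repository. The namespace-prefix test and all names are illustrative.

```java
import org.apache.jena.rdf.model.*;

import java.util.HashSet;
import java.util.Set;

// Illustrative seeding of the initial repository: collect every external
// entity the local data source links to. "External" is approximated here
// by a simple namespace prefix test.
public class InitialRepositorySeeder {

    public static Set<String> seed(Model localModel, String localNamespace) {
        Set<String> initialRepository = new HashSet<>();
        StmtIterator it = localModel.listStatements();
        while (it.hasNext()) {
            Statement stmt = it.next();
            RDFNode object = stmt.getObject();
            // Keep only links whose destination lies in another dataset.
            if (object.isURIResource()
                    && !object.asResource().getURI().startsWith(localNamespace)) {
                initialRepository.add(object.asResource().getURI());
            }
        }
        return initialRepository;
    }
}
```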

5.3 Analyzer

The basic idea behind the proposed approach is as follows: the Crawler Controller module searches for the Superiors of each entity in the Inferior dataset, and vice versa. Subsequently, the Analyzer module retrieves the features of the resulting entities and finds the detached entity by comparing the similarity between the graph structures.

Because many entities are found in this manner, useless entities should be removed. For example, if the algorithm tries to find the new address of a person entity that was born in England, there are 7970 people in DBpedia snapshot 3.7 who were born in England, but only one of them is the target entity and the others are useless; this predicate (“DBpedia:Ontology/birthplace”) is therefore a common predicate shared by many entities. On the other hand, some predicates, such as “DBpedia:property/vicepresident”, are defined only for a few special people; for these predicates the search space is narrowed and the number of candidate entities decreases. This module selects the links that are most important for an entity, ranks the link types according to their importance, and passes the result to the Query Maker module to create an appropriate query. We also introduce a threshold to limit the reduction of comparisons in this module. As shown in Figure 5, increasing the threshold does not produce a significant change in performance, whereas setting the threshold to the minimum produces a major decrease in comparison computation (see Figure 6). Since we need as few comparisons as possible, we achieve this goal by setting the threshold to the minimum.

The main reason why raising the threshold did not affect precision is that a number of entities have similar graph structures; therefore, when we decrease the threshold we do not notice a significant change in the results (see Figure 5). On the other hand, when the threshold is increased the search space gets wider, so more comparisons have to be made.

Figure 5 Number of correctly fixed links for different thresholds
[Figure omitted: number of correctly fixed links (approximately 3170 to 3210) plotted against threshold values from 5 to 65.]

Figure 6 Number of comparisons for different thresholds
[Figure omitted: number of comparisons (approximately 50000 to 140000) plotted against threshold values from 5 to 65.]

For this purpose, the algorithm should reduce the number of features (predicates) to a sufficient minimum. Thus, predicates should be sorted according to their importance (Theodoridis and Koutroumbas, 2008; Manning et al., 2008). We therefore define two factors as follows:

• Durability: a score assigned to each predicate that shows how well it is maintained across different snapshots. As discussed in the preliminaries section, each predicate refers to a resource in the destination dataset. To calculate the score, we extract the predicates in the Inferior dataset and, for each one, calculate its durability between corresponding snapshots.

• Publicity: some resources and properties are more common among entities than others. In this module the publicity measure is computed; some excerpts are shown in Table 2 and Table 3. Through this measure the algorithm selects the most important predicates.

These values were measured for the different resources that play the role of a predicate or an object. These statistics are then employed to create an appropriate query for finding the new address of the detached entity.
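
A simple way to obtain publicity counts like those in Tables 2 and 3 is to count predicate occurrences in a snapshot, and durability can be approximated as the fraction of a predicate's triples in the old snapshot that survive unchanged into the new one. The sketch below reflects our reading of the two factors, not the authors' exact formulas.

```java
import org.apache.jena.rdf.model.*;

import java.util.HashMap;
import java.util.Map;

// Illustrative computation of the two ranking factors. "Publicity" is the
// raw occurrence count of a predicate in one snapshot; "durability" is
// approximated as the share of a predicate's triples in the old snapshot
// that still exist in the new snapshot. Both are assumptions of this sketch.
public class PredicateStatistics {

    public static Map<String, Long> publicity(Model snapshot) {
        Map<String, Long> counts = new HashMap<>();
        StmtIterator it = snapshot.listStatements();
        while (it.hasNext()) {
            String predicate = it.next().getPredicate().getURI();
            counts.merge(predicate, 1L, Long::sum);
        }
        return counts;
    }

    public static double durability(Model oldSnapshot, Model newSnapshot, Property predicate) {
        long total = 0;
        long surviving = 0;
        StmtIterator it = oldSnapshot.listStatements(null, predicate, (RDFNode) null);
        while (it.hasNext()) {
            Statement stmt = it.next();
            total++;
            if (newSnapshot.contains(stmt)) {
                surviving++;
            }
        }
        return total == 0 ? 0.0 : (double) surviving / total;
    }
}
```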

Table 2 Object publicity

Object resource                                                                 Value
http://dbpedia.org/resource/United_States                                       194370
http://dbpedia.org/resource/Poland                                              59315
http://dbpedia.org/resource/United_Kingdom                                      49308
http://dbpedia.org/resource/Category:Year_of_birth_missing_(living_people)      46596
http://dbpedia.org/resource/France                                              46325
http://dbpedia.org/resource/Pop_music                                           44665
http://dbpedia.org/resource/England                                             44097
http://dbpedia.org/resource/Rock_music                                          41749
http://dbpedia.org/resource/India                                               31984
http://dbpedia.org/resource/Alternative_rock                                    29055
http://dbpedia.org/resource/Canada                                              25369
http://dbpedia.org/resource/Actor                                               25181
http://dbpedia.org/resource/Germany                                             22511
http://dbpedia.org/resource/Midfielder                                          21924
http://dbpedia.org/resource/Australia                                           18361
http://dbpedia.org/resource/Jazz                                                16943
http://dbpedia.org/resource/Hard_rock                                           15580
http://dbpedia.org/resource/Hip_hop_music                                       15535
http://dbpedia.org/resource/World_War_II                                        15424

5.4 Query Maker

After analyzing the Superior and Inferior datasets, the statistics are given to the Query Maker module, which builds the appropriate query and supplies it to the Crawler Controller module. As discussed previously, the number of candidate entities should be decreased and entities that are no longer useful should be removed.

The algorithm tries to reduce the number of candidates to reach the minimum threshold, but this is not always possible due to a lack of information. Some entities have only a few features, and those features are shared by other entities; in such cases the number of candidate entities cannot be reduced. The jumps visible in Figures 7 and 8 illustrate this problem.


In this module, SPARQL (http://www.w3.org/TR/rdf-sparql-query/) is used as the RDF query language to create queries (Elbassuoni et al., 2010), and Jena, a Java framework for building Semantic Web applications, is used to implement the algorithm.
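
As an illustration of the kind of query the Query Maker could emit, the sketch below builds a SPARQL query that keeps only candidates sharing a set of discriminative (predicate, object) pairs and runs it over a Jena model. The query shape and all names are our guesses at what such a query might look like, not the queries the module actually generates.

```java
import org.apache.jena.query.*;
import org.apache.jena.rdf.model.Model;

import java.util.ArrayList;
import java.util.List;

// Sketch of candidate retrieval: given (predicate, object) pairs selected by
// the Analyzer, build a SPARQL query whose solutions are entities sharing all
// of those pairs, i.e. a narrowed candidate set for the detached entity.
public class CandidateQueryMaker {

    public static List<String> findCandidates(Model snapshot, List<String[]> predicateObjectPairs) {
        StringBuilder query = new StringBuilder("SELECT DISTINCT ?candidate WHERE {\n");
        for (String[] pair : predicateObjectPairs) {
            // pair[0] = predicate URI, pair[1] = object URI
            query.append("  ?candidate <").append(pair[0]).append("> <")
                 .append(pair[1]).append("> .\n");
        }
        query.append("}");

        List<String> candidates = new ArrayList<>();
        try (QueryExecution exec = QueryExecutionFactory.create(query.toString(), snapshot)) {
            ResultSet results = exec.execSelect();
            while (results.hasNext()) {
                candidates.add(results.next().getResource("candidate").getURI());
            }
        }
        return candidates;
    }
}
```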

Table 3 Predicate publicity

Predicate                                         Value
http://purl.org/dc/terms/subject                  3034592
http://dbpedia.org/ontology/country               358955
http://dbpedia.org/ontology/genre                 250984
http://dbpedia.org/ontology/birthPlace            230179
http://dbpedia.org/property/genre                 144107
http://dbpedia.org/ontology/team                  91476
http://dbpedia.org/ontology/occupation            89183
http://dbpedia.org/ontology/recordLabel           62521
http://dbpedia.org/ontology/deathPlace            49538
http://dbpedia.org/ontology/nationality           38455
http://dbpedia.org/ontology/hometown              36504
http://dbpedia.org/ontology/instrument            36468
http://dbpedia.org/property/clubs                 36201
http://dbpedia.org/ontology/battle                32855
http://dbpedia.org/property/occupation            29582
http://dbpedia.org/property/birthPlace            28782
http://dbpedia.org/ontology/award                 26054
http://dbpedia.org/ontology/party                 24898
http://dbpedia.org/property/position              24669
http://dbpedia.org/property/battles               18972

5.5 Ranking

In this module, the entity that is most similar to the target entity is found and passed as the final result. The similarity is calculated in two steps.

• The algorithm compares the old entity with the candidate entities on their object values, taking only the dereferenceable objects into account. Each candidate entity receives a score: the more similar it is to the old entity, the higher its score.

• Our investigation shows that only a few changes occur after an entity moves to its new address. Therefore, the algorithm also compares entities on their string values. There are many string comparison algorithms (Elmagarmid et al., 2007); in this paper, the Jaro-Winkler metric, a refinement of the Jaro distance metric, is used to calculate the similarity between two entities. A sketch of this combined scoring is given below.
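
One way the two-part score could be combined is sketched here: the overlap of dereferenceable object URIs plus the Jaro-Winkler similarity of literal values, computed with the JaroWinklerSimilarity class from Apache Commons Text. The equal weighting, the pairing of literals and the use of that library are assumptions of this sketch; the paper does not specify them.

```java
import org.apache.commons.text.similarity.JaroWinklerSimilarity;

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative ranking score for one candidate: share of shared object URIs
// plus the average Jaro-Winkler similarity of string (literal) values,
// weighted equally. Weights and inputs are assumptions of this sketch.
public class CandidateRanker {

    private static final JaroWinklerSimilarity JARO_WINKLER = new JaroWinklerSimilarity();

    public static double score(Set<String> oldObjectUris, Set<String> candidateObjectUris,
                               List<String> oldLiterals, List<String> candidateLiterals) {
        // Step 1: overlap of dereferenceable objects.
        Set<String> shared = new HashSet<>(oldObjectUris);
        shared.retainAll(candidateObjectUris);
        double objectScore = oldObjectUris.isEmpty()
                ? 0.0 : (double) shared.size() / oldObjectUris.size();

        // Step 2: string similarity of literal values (paired by position
        // purely for illustration).
        double literalScore = 0.0;
        int pairs = Math.min(oldLiterals.size(), candidateLiterals.size());
        for (int i = 0; i < pairs; i++) {
            literalScore += JARO_WINKLER.apply(oldLiterals.get(i), candidateLiterals.get(i));
        }
        if (pairs > 0) {
            literalScore /= pairs;
        }

        return 0.5 * objectScore + 0.5 * literalScore;
    }
}
```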

Figure 7 Number of candidates for each entity with threshold = 5
[Figure omitted: number of candidates (0 to about 450) plotted against the ID of the detached entity.]

Figure 8 Number of candidates for each entity with threshold = 65
[Figure omitted: number of candidates (0 to about 450) plotted against the ID of the detached entity.]

The pseudo-code of the proposed algorithm is given in Algorithm 1. It illustrates the procedure for finding the detached entity in the destination dataset. In the first step, the algorithm finds the Superiors and Inferiors of each entity in the Initial Repository (Lines 2-5); this repository includes all the entities to which a link from the local data source refers. In the second phase, for each link in the Superior dataset, the algorithm finds its Inferiors and stores them in a temporary dataset TInfofSup (Line 7). Following this, the candidate entity in this dataset that is most similar to the detached entity is scored and stored in STarget (Line 8). In addition, for each link in the Inferior dataset, the algorithm finds its Superiors and stores them in a temporary dataset TSupofInf (Line 11). As discussed above, the number of candidates should be reduced; therefore, until the threshold is reached, the algorithm selects the best candidates using the Durability and Publicity criteria (Lines 12-14). Then, for each candidate entity, the algorithm finds its Inferiors and stores them in TInfofCandidates (Line 15). Next, the candidate that is most similar to the detached entity is scored and stored in ITarget (Line 16). Finally, the entity with the highest score is chosen as the final result (Line 18).

input : RDF link set Lset, threshold
output: list of corrected broken links

 1  begin
 2      foreach entity vi ∈ Lset do
 3          RSup ← Superiors(vi)
 4          RInf ← Inferiors(vi)
 5      end
 6      foreach link Li ∈ RSup do
 7          TInfofSup ← Inferiors(Li)
 8          STarget ← MostSimilar(Rdetached, TInfofSup)
 9      end
10      foreach link Li ∈ RInf do
11          TSupofInf ← Superiors(Li)
12          while Candidate# > Threshold do
13              TCandidates ← BestCandidate(TSupofInf)
14          end
15          TInfofCandidates ← Inferiors(TCandidates)
16          ITarget ← MostSimilar(Rdetached, TInfofCandidates)
17      end
18      LCorrected ← Maximum(STarget, ITarget)
19  end

Algorithm 1: The proposed algorithm

6 Experimental Results

In order to evaluate the proposed approach, we used the DBpedia dataset. The DBpedia project derives its data corpus from the Wikipedia encyclopedia. It plays a significant role in the Web of Data, and many data publishers have begun to set data-level links to DBpedia resources, making DBpedia a central interlinking hub of the Web of Data (Auer et al., 2007; Bizer et al., 2009).

To evaluate the proposed approach, we concentrated on two cases: (a) comparing the proposed algorithm with related work, and (b) evaluating the algorithm in a larger experiment.

• The most successful study published thus far is DSNotify (Popitsch and Haslhofer, 2011). That study used the 3.2 and 3.3 DBpedia snapshots (subsets of all instances of type foaf:Person, downloadable from http://dbpedia.org/, filename: persondata_en.nt) for evaluation and tested 179 move events. We tested the proposed algorithm on these datasets; the precision, recall and F-measure rates (defined after this list) are depicted in Figures 10, 11 and 12, respectively, together with a comparison between our algorithm and DSNotify. In the figure legends, the prefix “dbpedia” is a shortened form of “http://dbpedia.org/ontology/” and “foaf” of “http://xmlns.com/foaf/0.1/”. DSNotify can be configured in multiple ways; its configuration parameters (threshold values, housekeeping frequency, the set of extracted features and their weights, etc.) were set to the values the authors used in (Popitsch and Haslhofer, 2011). In this evaluation, the simulation length was varied and the results were calculated. The results show that the DSNotify curves decline as the simulation length increases. As mentioned earlier, DSNotify has a housekeeping parameter that defines the period after which the algorithm compares entities to detect movement events; when the housekeeping period increases, more events have occurred and more entities have to be compared. Another disadvantage of DSNotify is that it depends on the coverage and entropy of the properties. The proposed algorithm, on the other hand, does not depend on time, so its results are stable for different simulation lengths. The only configurable parameter in our approach is the minimum threshold, and the results show that the accuracy is almost the same for different values (see Figure 9).

• To evaluate the algorithm on a larger scale, we used the last two snapshots of DBpedia. There were 296595 person entities in the DBpedia 3.6 snapshot and, as shown in Table 1, 4726 of them changed address. Some of these entities point to entities outside the person dataset in the 3.7 snapshot; after omitting them, 3505 entities remained. These entities were supplied to the proposed algorithm, and the result is shown in Figure 9.
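
For reference, and assuming the standard definitions (the paper does not restate them), with TP the number of correctly fixed links, FP the number of links fixed incorrectly, and FN the move events for which no repair was found:

    Precision = TP / (TP + FP)
    Recall    = TP / (TP + FN)
    F-measure = 2 * Precision * Recall / (Precision + Recall)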

Figure 9 Evaluation for different thresholds
[Figure omitted: precision, recall and F-measure (ranging from about 0.87 to 0.99) plotted for threshold values from 5 to 65.]



6.1 Discussion

The success of the algorithm depends on the intersection between the previous Superiors of an entity and its current Superiors, and likewise for the Inferiors. Although the entity URI changes over time, the intersection between the two snapshots shows that the entity preserves its structure and content (further information is given in Table 4).

Table 4 Specific statistics

Entity condition                                                            Value    Percentage
# of entities with a Superior in the 3.6 version                            1091     31%
# of entities with an Inferior in the 3.6 version                           3505     100%
# of entities with a Superior in the 3.7 version                            1945     56%
# of entities with an Inferior in the 3.7 version                           3499     100%
# of entities with at least one Superior in both the 3.6 and 3.7 versions   947      27%
# of entities with at least one Inferior in both the 3.6 and 3.7 versions   3505     100%

Furthermore, we used the DBpedia redirect dataset (downloadable from http://dbpedia.org/, filename: redirects_en.nt) to identify movements. This dataset records every redirection between entities over the whole DBpedia lifetime through a special link called “wikiPageRedirects”. These links were used as the reference for calculating the precision of the algorithm.
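
Under our reading of this evaluation, a repaired link is counted as correct when the redirect dataset maps the old URI to the URI proposed by the algorithm. A minimal Jena sketch of such a check is given below; the full property URI (DBpedia ontology namespace) and the file handling are assumptions, since the paper only names the "wikiPageRedirects" link.

```java
import org.apache.jena.rdf.model.*;

// Minimal sketch of the evaluation check: a fixed link is counted as correct
// if the redirect dataset contains a wikiPageRedirects triple from the old
// (detached) URI to the URI proposed by the algorithm.
public class RedirectChecker {

    // Assumed full URI of the wikiPageRedirects link (ontology namespace).
    private static final String WIKI_PAGE_REDIRECTS =
            "http://dbpedia.org/ontology/wikiPageRedirects";

    private final Model redirects;

    public RedirectChecker(String redirectsFileUrl) {
        this.redirects = ModelFactory.createDefaultModel();
        // e.g. redirectsFileUrl = "file:redirects_en.nt"
        this.redirects.read(redirectsFileUrl, "N-TRIPLES");
    }

    public boolean isCorrect(String oldUri, String proposedUri) {
        Resource oldEntity = redirects.createResource(oldUri);
        Property redirect = redirects.createProperty(WIKI_PAGE_REDIRECTS);
        Resource target = redirects.createResource(proposedUri);
        return redirects.contains(oldEntity, redirect, target);
    }
}
```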

Figure 10 Comparison between the proposed model and DSNotify on precision
[Figure omitted: precision (%) plotted against log10(average events/housekeeping cycle) for the proposed algorithm and the DSNotify configurations foaf:name, dsnotify:rdfhash and dbpedia:draftyear.]



Figure 11 Comparison between the proposed model and DSNotify on recall
[Figure omitted: recall (%) plotted against log10(average events/housekeeping cycle) for the proposed algorithm and the DSNotify configurations foaf:name, dsnotify:rdfhash, dbpedia:birthdate, dbpedia:birthplace, dbpedia:height and dbpedia:draftyear.]

Figure 12 Comparison between the proposed model and DSNotify on F-measure
[Figure omitted: F-measure (%) plotted against log10(average events/housekeeping cycle) for the proposed algorithm and the DSNotify configurations foaf:name, dbpedia:birthplace, dbpedia:height and dbpedia:draftyear.]

7 Discussion and Future Work

In this article we developed an algorithm for fixing broken RDF links in the Web of Data. To this end, we defined two datasets, which we call “Superior” and “Inferior”. In the first step, the outer entities that we want to remain dereferenceable over time are stored in the initial dataset. Afterwards, when their address changes, the algorithm tries to find an alternative address for those entities. Our evaluation results confirm the feasibility of using the Superior and Inferior datasets: more than 90 percent of the broken links were repaired correctly. In the future, we plan to evaluate the proposed algorithm on other datasets from different domains, and we will try to fix semantically broken links as well.


References

Bizer, C., Heath, T. and Berners-Lee, T. (2009) ‘Linked Data - The Story So Far’, IJSWIS, Vol. 5, No. 3, pp.1–22.

Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R. and Ives, Z. (2007) ‘DBpedia: a nucleus for a web of open data’, Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference, Busan, Korea, pp.722–735.

Popitsch, N. and Haslhofer, B. (2011) ‘DSNotify - a solution for event detection and link maintenance in dynamic datasets’, Web Semantics: Science, Services and Agents on the World Wide Web, Vol. 9, No. 3, pp.266–283.

Auer, S., Dietzold, S., Lehmann, J., Hellmann, S. and Aumueller, D. (2009) ‘Triplify: light-weight linked data publication from relational databases’, Proceedings of the 18th International Conference on World Wide Web, Madrid, Spain, pp.621–630.

Zhihong, S., Yufang, H. and Jianhui, L. (2011) ‘Publishing distributed files as Linked Data’, FSKD, pp.1694–1698.

Volz, J., Bizer, C., Gaedke, M. and Kobilarov, G. (2009) ‘Discovering and Maintaining Links on the Web of Data’, The Semantic Web – ISWC 2009, Chantilly, VA, USA, pp.650–665.

Vesse, R., Hall, W. and Carr, L. (2010) ‘Preserving Linked Data on the Semantic Web by the application of Link Integrity techniques from Hypermedia’, LDOW2010, Raleigh, North Carolina.

Umbrich, J., Hausenblas, M., Hogan, A., Polleres, A. and Decker, S. (2010) ‘Towards Dataset Dynamics: Change Frequency of Linked Open Data Sources’, LDOW2010, Raleigh, North Carolina.

Ashman, H. (2000) ‘Electronic document addressing: dealing with change’, ACM Comput. Surv., Vol. 32, No. 3, pp.201–212.

Liu, F. and Li, X. (2011) ‘Using Metadata to Maintain Link Integrity for Linked Data’, Proceedings of the 2011 International Conference on Internet of Things and 4th International Conference on Cyber, Physical and Social Computing, Vol. 14, pp.432–437.

Papavassiliou, V., Flouris, G., Fundulaki, I., Kotzinos, D. and Christophides, V. (2009) ‘On Detecting High-Level Changes in RDF/S KBs’, Proceedings of the 8th International Semantic Web Conference, Chantilly, VA, pp.473–488.

Auer, S. and Herre, H. (2007) ‘A versioning and evolution framework for RDF knowledge bases’, Proceedings of the 6th International Andrei Ershov Memorial Conference on Perspectives of Systems Informatics, Novosibirsk, Russia, pp.55–69.

Udrea, O., Pugliese, A. and Subrahmanian, V. S. (2007) ‘GRIN: a graph based RDF index’, Proceedings of the 22nd National Conference on Artificial Intelligence, Vancouver, British Columbia, Canada, pp.1465–1470.

Theodoridis, S. and Koutroumbas, K. (2008) ‘Pattern Recognition, Fourth Edition’, Academic Press.

Manning, C., Raghavan, P. and Schütze, H. (2008) ‘Introduction to Information Retrieval’, Cambridge University Press.

Elbassuoni, S., Ramanath, M., Schenkel, R. and Weikum, G. (2010) ‘Searching RDF Graphs with SPARQL and Keywords’, IEEE Data Eng. Bull., Vol. 33, No. 1, pp.16–24.

Elmagarmid, A. K., Ipeirotis, P. G. and Verykios, V. S. (2007) ‘Duplicate Record Detection: A Survey’, IEEE Trans. on Knowl. and Data Eng., Vol. 19, No. 1, pp.1–16.


Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R. and Hellmann, S. (2009) ‘DBpedia - A crystallization point for the Web of Data’, Web Semantics: Science, Services and Agents on the World Wide Web, Vol. 7, No. 3, pp.154–165.