A Novel Dataset for Quranic Words Identification and Authentication



Thabit Sabbah, Ali Selamat
Universiti Teknologi Malaysia (UTM), Malaysia

Abstract: The Quran is the holy book of all Muslims around the world. For hundreds of years it has been preserved from distortion in every possible way. The rapid growth and spread of digital media and internet usage has led many organizations and individuals to introduce websites, services, and applications that spread knowledge related to the Quran, including Quranic verses, translations, explanations (Tafseer), and other Quranic sciences in digital form. Some of these services are less authentic. The first step of authentication is the correct detection and identification of Quranic words within a text. In this paper, we introduce a novel dataset for Quranic word identification and authentication. The proposed dataset contains more than 93,000 samples, with 64 features extracted in numerical form for each sample. Samples are categorized into two labeled classes, “Quranic” and “non-Quranic”. Validation tests of our dataset show a high average accuracy.

Keywords: Quranic words detection, Quranic words dataset, Arabic words classification, Arabic diacritic words

I. INTRODUCTION

Quranic word identification can be defined as determining which words of a text belong to the Holy Quran and are written exactly as they are written in the Holy Quran (see footnote 1). Consecutive, well-ordered Quranic words form a verse. The Quran is commonly written with diacritics. Originally, diacritics were used in Arabic to distinguish the pronunciation of words and to represent vowel sounds [1]. In ordinary Arabic writing only a few diacritics are commonly used, as shown in Table I. Quranic text, however, uses many other diacritics and special symbols that give the reader additional reading guidance. For brevity, we refer to all of these diacritics and special symbols as diacritics.

TABLE I. COMMON DIACRITICS USED IN ARABIC WRITING

1 Based on the standard Quranic writing style known as Uthmani.
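The distinction drawn above between base letters and diacritics can be illustrated with a small sketch. It is not the authors' feature extraction method; it only shows one plausible way to separate diacritics from letters using Unicode canonical combining classes. The function names `split_diacritics` and `diacritic_ratio` are hypothetical, and the sketch assumes that a non-zero combining class is an adequate proxy for "diacritic". That assumption covers the common marks (fatha, damma, kasra, shadda, sukun) but misses some Quranic symbols that are not combining characters, such as the end-of-ayah sign.

```python
import unicodedata
from typing import List, Tuple


def split_diacritics(word: str) -> Tuple[str, List[str]]:
    """Separate a word into its base letters and its combining marks.

    Assumption: any character with a non-zero canonical combining
    class is treated as a diacritic.
    """
    base_letters = []
    marks = []
    for ch in word:
        if unicodedata.combining(ch) != 0:
            marks.append(ch)
        else:
            base_letters.append(ch)
    return "".join(base_letters), marks


def diacritic_ratio(word: str) -> float:
    """Number of diacritics per base letter (0.0 for an empty word)."""
    base, marks = split_diacritics(word)
    return len(marks) / len(base) if base else 0.0


if __name__ == "__main__":
    # "kataba" written with a fatha (U+064E) on each of its three letters.
    sample = "\u0643\u064e\u062a\u064e\u0628\u064e"
    base, marks = split_diacritics(sample)
    print(base, len(marks), diacritic_ratio(sample))
```

A fully diacritized word yields a ratio close to or above one mark per letter, while ordinary undiacritized Arabic text yields a ratio near zero. This is one simple numerical signal of the kind a word-level classifier could use.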

Muslims around the world use verses in their daily lives for many reasons, such as supporting decision making, deducing solutions to their social and religious problems, or analyzing various issues. Most Muslim authors quote verses as evidence to support their conclusions and their analysis of events [2]. Quotation is used not only in written publications but is also common in speeches, conversations, dialogs, and discussions. This scientific method is popular in Islamic societies in general and among Arabs in particular [3]. Under this method, quoted verses in written works are distinguished from the surrounding text in many ways, such as enclosing them in brackets or writing them in the standard Quranic style (known as Uthmani) with full diacritics. In oral works, Quranic verses are distinguished by reciting them according to the standard Quranic recitation rules, by adding the common phrases that mark the start and end of a quoted verse, or by many other vocal techniques. In Islamic multimedia works, many advanced graphics and sound-effect techniques are used to distinguish verses from other multimedia content.

However, in many less authentic works, especially online resources such as forums, social networks, blogs, and personal websites, users only occasionally follow the scientific method of quoting and citing Quranic verses. This makes it hard for a reader to distinguish the Quranic verses (words), and harder still for automatic text processing and natural language processing techniques, such as Information Retrieval Systems (IRS) and Knowledge Management Systems (KMS), to work efficiently on Arabic text in general and religious content in particular.

In this paper we propose a novel dataset for Quranic word identification and authentication. The proposed dataset contains 93,161 samples with 64 features for each sample. Samples are categorized into two categories, “Quranic” and “non-Quranic”. The “Quranic” samples were collected from a highly trusted resource, while the “non-Quranic” samples were collected from thousands of Arabic posts downloaded from one of the biggest Arabic religious forums on the web. In the rest of this paper, Section II highlights our motivation by focusing on previous work related to Quranic verse detection. Section III explains our framework in
