BAENPD: A Bilingual Plagiarism Detector - Journal of ...

BAENPD: A Bilingual Plagiarism DetectorMohammad Shamsul Arefin†, Yasuhiko Morimoto†, Mohammad Amir Sharif‡

†Graduate School of Engineering, Hiroshima University, Japan‡The Center for Advanced Computer Studies, University of Louisiana at Lafayette, USA

Email: [email protected], [email protected], [email protected]

Abstract— In this paper, for the first time, we present a bilin-gual plagiarism detection system, BAENPD that can detectplagiarism from electronic Bangla and English documents.It uses two different methods for detecting plagiarism. Thefirst method is based on the analysis of individual contents ofthe documents, whereas second technique performs severalstatistical analysis of the documents. The system has beenevaluated by real documents. We have found that our systemcan efficiently detect plagiarism between English and Bangladocuments as well as from the documents of same language.

Index Terms— Plagiarism, statistical analysis, query execu-tion, documents relevancy, root detection.

I. INTRODUCTION

Plagiarism is defined as the appropriation or imitationof the language, ideas and thoughts of another authorand representation of them as one’s original work. Itmeans using another author’s work without giving himproper credit. Though plagiarism can be found in almostevery field, it is a major problem in academic areas asplagiarism destroys individual’s creativity and originalityand defeats the purpose of education. For recognizingcreativity and originality of one’s work, it is necessaryto detect plagiarized contents efficiently.

The easy availability of information in present daysmakes it easier to plagiarize another author’s contentswithout proper citation or reference and this tendency isincreasing day-by-day. Although there are many commer-cial and non-commercial tools available for plagiarismdetection, most of them are uni-lingual in nature andnone of them can detect plagiarism in Bangla documents.However, at present in academic and non-academic areasmany works are carried out in Bangla and we need tobe aware of the originality of these works. An efficientplagiarism detector considering the information in Banglacan ensure the novelty of such works.

Considering the above facts, we have proposed twodifferent methods for detecting plagiarized contents be-tween English and Bangla documents as well as from thedocuments of the same language. Our first approach isbased on the overall contents of the documents and is wellapplicable in detecting plagiarism from documents of anysize and any domain. However, our second methodology

This paper is based on “Bilingual Plagiarism Detector,” by Moham-mad Shamsul Arefin, Yasuhiko Morimoto, and Mohammad Amir Sharif,which appeared in the Proceedings of the 14th International Conferenceon Computer and Information Technology (ICCIT), Dhaka, Bangladesh,December 22-24, 2011. This paper in an enhanced version with detailexplanations and additional concepts.

depends on the statistical information of the documentsand it performs better if the documents are from the samedomain and uniform in size.

II. MOTIVATION

Plagiarism from the contents of a language to anotherlanguage is a common problem nowadays. This type ofplagiarism often occurs when the target language has lessresource like Bangla language. Consider the representa-tion of short information about the St. Martin’s Island ofBangladesh in English and Bangla, as shown in Figure 1and Figure 2.

Let us first consider the information in Figure 1. FromFigure 1, we can clearly see that though the informationis in two different languages, the sentence structures andorder of sentences are almost same. Also, similar set ofkeywords i.e. keywords in a language and the correspond-ing translated words in another language, yields the samecontent, even though keywords are placed in differentorder for keeping the accurate structure of the sentencesin each language.

Then, looking at the information of Figure 2, weobserve that though the information in English is same asthe information in Figure 1, the information in Bangla isdifferent than the information in Figure 1. It is because thesentences in Bangla has been created from source infor-mation ( here from English) by reordering the sentencesand using different words than the words use in Figure 1.

However, if we look carefully in the information ofboth figures, we can find that in both cases informationin Bangla has been plagiarized from the same Englishdocument. The information in Bangla in Figure 1 hasbeen almost directly plagiarized whereas in Figure 2 theinformation has been plagiarized using some level ofintelligence. The second type of plagiarism is the mostcommon form of plagiarism as identification of such typeof plagiarism is often hard to detect. The second form ofplagiarism is also common in the documents of the samelanguage.

Considering the above facts, we have developed twoplagiarism detection techniques those can efficiently de-tect plagiarism from the documents of two different lan-guages English and Bangla as well as from the documentsof the same language domain.

The contributions of this paper can be summarized asfollows. First, we propose efficient techniques for iden-tifying stop words and synonyms. Second, we elaborate

JOURNAL OF COMPUTERS, VOL. 8, NO. 5, MAY 2013 1145

© 2013 ACADEMY PUBLISHERdoi:10.4304/jcp.8.5.1145-1156

�� -�� । �� !�� " # ��$�� %�&�' �� $�"��$��-�� (� !�� ) ��$�� *�$ �� %�� $�!��" ��। �+,� �� -"� .�" �� "/�� - �� !�" 0��। �� %�� " �.�� 2। �3�� %� �� 4 �� %�� $�� /� 35 !�� 6�� .�-"� ��। ��7 $�� 8 �� $�� "�� /�� 6�� !�� "�9। �� :�� - 6�9। ��7 $�� 8 �� 6;�-�<=�� /��। St. Martin's Island is a coral island in the northeastern part of the Bay of Bengal. It is about 9 km south from Teknaf of the Cox's Bazar district and about 8 km west of the northwest coast of Myanmar, at the mouth of the Naf River. Due to the huge availability of coconut, locally it is also known as Narical Gingira. It is the only coral island in Bangladesh. St. Martin's Island is a popular tourist spot in Bangladesh. Three shipping liners run daily trips between the island and main land of Bangladesh. There are few good residential hotels in St. Martin’s Island. There is also a government resort. The law and order situation of the island is good.

Figure 1. Example of direct plagiarism �� -�� । �� -�� !�� "�� # �$�� %�� । �� $�� '��(�� %(�� %!��# �� ) ��!�� $�*�+| �� - �� .�� /0� �� -�� 1��| �� 2 ��$� �� $�� 3� �� .�� 4�� 1�-�� । %�5 �� 6 �� 7�� 4�8। ��8�9� �� .�� :�� (�; %�� 3�� %��!� 4�8। �� $�� (�� 1�!� %�<। %�5 �� 6 �� 4=�-�>?�� 3��। St. Martin's Island is a coral island in the northeastern part of the Bay of Bengal. It is about 9 km south from Teknaf of the Cox's Bazar district and about 8 km west of the northwest coast of Myanmar, at the mouth of the Naf River. Due to the huge availability of coconut, locally it is also known as Narical Gingira. It is the only coral island in Bangladesh. St. Martin's Island is a popular tourist spot in Bangladesh. Three shipping liners run daily trips between the island and main land of Bangladesh. There are few good residential hotels in St. Martin’s Island. There is also a government resort. The law and order situation of the island is good.

Figure 2. Example of indirect plagiarism

the method for detecting root words that improves theefficiency of plagiarism detection. Third, we provide atechnique for detecting plagiarism based on synonyms andkeywords in the documents. Fourth, we perform statisticalanalysis of the documents and based on statistical results,we detect plagiarism in the documents. Fifth, we give thecomplexity analysis of the proposed methods. Finally, weconduct extensive experiments on a set of real documentsand show the efficiency and robustness of our proposals.

The remainder of the paper is organized as follows.Section III provides a brief review of related works onplagiarism detection. Section IV describes the preliminaryideas about different related topics. Section V detailsarchitecture and methodology of content based plagiarismdetection technique. In Sections VI, we present statisticalapproach of plagiarism detection. Section VII gives theexperimental results. Finally, we conclude the paper inSection VIII.

III. LITERATURE REVIEW

In this section, we are providing a brief overview of theinitiatives taken around the globe for detecting plagiarism.WriteCheck [1] plagiarism detector uses pattern recogni-tion to match the contents of submitted papers against

Internet resources and then against an in-house databaseof previously submitted papers. This technique is differentthan the text searches of popular search engines such asGoogle and Bing and produces less false positives thansearch technology designed for other purposes. EVE2 [2]is a windows based system that performs a reliable checkof similar contents from the Internet to track down pos-sible instances of plagiarism. It examines the essays andthen quickly makes a large number of complex searchesin the Internet to locate suspected sites. DOC Cop [3] isa plagiarism, cryptomnesia and collusion detection toolthat creates reports displaying the correlation and matchesbetween documents or a document and the Web. Plagium[4] is a fast and easy-to-use tool to check text againstpossible plagiarism or possible sources of origination.Plagium intelligently breaks up the input text into smallersnippets. These snippets are matched against web contentin an efficient manner. The matches scores are used todetermine the matching of documents with the input text.

Plagiarism Detector [5] is a software tool to effectivelydiscover, trace and prevent unauthorized copy-pasting ofany textual material taken from the world wide web. Itdoes not use any in-house database. Rather, it uses thedatabases of three well known search engines: Google,Yahoo, and AltaVista. Upon receiving a document, atfirst it splits the document into different phrases and eachphrase is sent to three different search engines. Next, theresult of every search engine is downloaded and resultsources are analysed. Finally, merges the analysed reportsand generates the originality report of the submitteddocument. CodeMatch [6] compares thousands of sourcecode files in multiple directories and subdirectories todetermine which files are the most highly correlated. Itfirst divides each source code into elements and thendetermines the correlation using several matching algo-rithms. This can be used to significantly speed up thework of finding source code plagiarism. CodeMatch isalso useful for finding open source code within pro-prietary code, determining common authorship of twodifferent programs, and discovering common, standardalgorithms within different programs. Pl@giarism [7] isa plagiarism detection tool developed at the Law Facultyof the University of Maastricht, Belgium. Pl@giarismperforms cross-comparison against each other in order todetect similarities among the documents in the sample.It determines similarities between pairs of documents bycomparing three word phrases in each. Main limitationof Pl@giarism is that it does not automatically checkagainst sources on the Internet. GPSD [8] is a screeningprogram that has been designed to help users becomemore sensitive to their own writing style. GPSD justprovides a rough estimate whether plagiarism has or hasnot occurred. GPSP [9] evaluates a student’s knowledge oftheir own writing by producing a test. Every fifth word ofa student’s paper is eliminated and replaced with blankswhich the student has to replace. Accuracy and speedin replacing the blanks is evaluated against a proprietarydatabase, and a probability score returned immediately.

1146 JOURNAL OF COMPUTERS, VOL. 8, NO. 5, MAY 2013

© 2013 ACADEMY PUBLISHER

It is useful for detecting plagiarism where the originalsource cannot be located through other sources such asInternet search engines and other plagiarism detectionservices. Its limitations include not being able to identifythe source of the suspect text and requirement for studentsto sit a test. WCopyfind [10] software is a delightful ex-ample of the power of the computer to help in addressingthe plagiarism problem. It makes text-string comparisonsand can be instructed to find sub-string matches of givenlength and similarity characteristics. Such fine tuningpermits the exclusion of obvious non-plagiarism casesdespite text-string matches. The main limitation of theabove works is that they do not consider the detection ofplagiarism when the content is plagiarized from a specificlanguage and use in a different language.

Due to the rapid growth of Internet, this fact becomes amajor concern among the research communities. MLPlag[11] is an approach of plagiarism detection in multilin-gual environment. This method is based on analysis ofword positions. It utilizes the EuroWordNet thesaurus thattransforms words into language independent form. Thisallows identifying documents plagiarized from sourceswritten in other languages. Special techniques, such assemantic-based word normalization, were incorporated torefine the method. In [12], the authors propose a methodfor cross-language plagiarism detection. In this work,the authors introduce three models for the assessmentof cross-language similarity and perform the comparisonamong these three models. Cedeno et al. [13] introducea model for detecting plagiarism across distant languagepairs based on machine translation and monolingual sim-ilarity analysis.

Though, there are many plagiarism detection tools incommercial and academic areas, there is no model thatcan check the level of plagiarism of information fromother language documents to Bangla documents and viceversa. Even there is no tool that can detect plagiarismamong Bangla documents. Considering these facts, in thispaper, we provide a framework for detecting plagiarismamong Bangla documents as well as plagiarism of in-formation from English documents to Bangla documentsand vice versa. Our developed framework can also detectplagiarism among English documents.

IV. PRELIMINARIES

A. Plagiarism Detection Systems and Related Issues

The term plagiarism is well known among the researchcommunity. It generally focuses on detecting plagiarismfrom structured documents and considers that there is alarge number of documents. The process of plagiarismdetection consists of locating relevant documents andplagiarised parts from the documents, on the basis ofquery documents.

The amount of information available in electronic formis growing exponentially, making it increasingly importantto find the plagiarised documents so that the tendencyof copying one’s information is reduced significantly.In general, a plagiarism detection tool can be used for

Plagiarism Detection Methods

Local Similarity Assessment

Global Similarity Assessment

Fingerprinting

Term Occurrence Analysis

Substring Matching

Bag of Words

Analysis

Citation-based Plagiarism Detection

Stylometry

Citation Pattern Analysis

Figure 3. A taxonomy of plagiarism detection methods [14]

detecting plagiarism in the newly submitted researchpapers. It can also be used at the universities for checkinguniqueness of the works of the students.

B. Characteristics of a Bilingual Plagiarism Detector

For developing any bilingual plagiarism detection sys-tem, generally we have to consider four main things.Firstly, the detection approach should be in such a waythat there is almost no performance degradation of thedetector based on the language of the documents. Sec-ondly, there must be an approach for efficient handlingof bilingual dictionaries. Thirdly, the document to bechecked for plagiarism (query document) need to beprocessed using similar approach as the approach usedfor training the detector. Formal way is that if we usealgorithm A for training the system, we need to usealgorithm A for processing the query document. Lastly,we need a very good framework for proper modelling bywhich we can find the similarity in the documents. Theseare the crucial issues in bilingual plagiarism detection.

After getting the relevance, the detector needs to sortthe documents in descending order based on the valueof the relevance. Then, the top document in the sortedlist is marked as most relevant with respect to the querydocument and so on.

C. Problem of Bilingual Plagiarism Detection

Finding plagiarism from documents of two differentlanguages is more difficult than identifying plagiarism inthe documents of same language. It is even more difficultif the languages are non-ideographic like English andBangla languages. This is because the structure of docu-ments in non-ideographic languages differs significantly.As for example, there is major difference in the structureof documents in Bangla and English. In Bangla, thereare many variations of same words. We can form manydifferent words by just adding some postfixes. But thesevaried words have similar meaning. On the other hand, inEnglish such variations are less in number.



User Interface

Documents Database Creator

Synonym Identifier

Keywords Extractor

Stoplist in Database

Database Initialization and Processing

Storage

Documents Databases

Synonym List in

Database

Keyword Information in Database

Stoplist Identifier

Morphological Analyzer

Bilingual Dictionary

Documents Input

Documents Separator

Query Response

Query Execution Query Input

Sorting of documents

Bilingual Translator

Calculation of doc

relevance

Parse Query

Figure 4. Architecture of content based bilingual plagiarism detection

Also, the use of synonyms in the plagiarized documentsand the difference in the sentence structures of the lan-guages makes the detection task harder.

Therefore, we need to develop efficient techniques tosolve above problems for appropriate plagiarism detec-tion.

D. Plagiarism Detection Methods

There are different methods for detecting plagiarism.Figure 3 gives a taxonomy of different plagiarism de-tection methods. The techniques are characterized by thetype of similarity assessment they apply. In global similar-ity assessments large parts of the text or the document asa whole is used for similarity measure. On the other hand,in local method confined text segments are considered asinput.

At present fingerprinting method is most widely usedapproach in plagiarism detection. The procedure formsrepresentative digests of documents by selecting a set ofmultiple sub-strings from them. The sets represent thefingerprints and their elements are called minutiae [15],[16]. For checking a suspicious document for plagiarismat first the fingerprint of the document is computed andthen perform query minutiae with a pre computed indexof fingerprints for all documents of a reference collection.

Substring matching approach [17], [18], [19] uses effi-cient string matching algorithms. Checking a suspiciousdocument using string matching approach requires thecomputation and storage of efficiently comparable rep-resentations for all reference documents. However, sub-string matching approach is computationally expensive.

So, it is not well applicable for checking large documentcollections.

Bag of words analysis [20], [21], [22] uses the conceptof information retrieval. In this case, documents arerepresented with one or more vectors. Cosine similaritymeasure or any other similarity functions can be used forthis task.

Citation-based plagiarism detection [23], [24], [25] isbased on citation and reference information instead ofthe text of the documents. This approach can effectivelyidentify similar patterns in the citation sequences of twoacademic works. In academic areas this method is helpfulfor checking plagiarism in academic documents.

Stylometry [26] uses statistical methods for identifyingindividual author’s unique writing style and is mainly usedfor authorship attribution. By constructing and comparingstylometric models for different text segments and pas-sages plagiarization can be detected.

In this paper, we have used two different methods forplagiarism detection. One method is based on the con-cept of information retrieval. Another method considersstatistical information of the documents.

V. BILINGUAL PLAGIARISM DETECTIONBASED ON OVERALL CONTENTS IN THE

DOCUMENTS

The system architecture for bilingual plagiarism detec-tion based on overall contents in the documents is givenin Figure 4. It comprises three main modules: databaseinitialization and processing module, storage module andquery execution module. The database initialization and



Database Table: Word

Keyword Doc_ID Freq

Database Table: Document

Doc_ID Title Size

Database Table: Occur

Keyword Doc_freq

Figure 5. Storage of term information

processing module sets up the database from an initialset of documents against which the query document willbe checked for plagiarism. The storage module managesthe storage of the text database and related informa-tion. Query execution module takes query document asinput and checks the relevancy of the document withthe documents in the database, sorts the documents indescending order according to their relevance percentagewith test documents and returns the result to the user.The single directional arrows represent the direction ofnext sub-module to be executed in a module and thedouble directional arrows represent the relationship of asub-module with sub-modules from which the sub-modulegets help for processing.

A. Database Initialization and Processing Module

Database initialization and processing module consistsof the sub-modules: documents input, documents sepa-rator, documents database creator, stoplist identifier, syn-onym identifier, keywords extractor, morphological ana-lyzer, and bilingual translator. The relationships amongthe sub-modules are shown in Figure 4.

1) Documents Input: Documents input sub-module takes the set of documents D =Db1, Db2, · · · , Dbn, De1, De2, · · · , Dem as input fromthe user. Here, Dbi, 1 ≤ i ≤ n is the list of documentsin one language and Dej , 1 ≤ j ≤ m is the list ofdocuments in another language. This set of documents isused as training documents. In our proposed method, weconsider that the characters in documents are unicodesupported.

2) Documents Separator: Documents separator sub-module separates the documents set D into two subsetsD1 = Db1, Db2, · · · , Dbn and D2 = De1, De2, · · · , Dem

based on the contents of documents. The separation isnecessary because the analyzing of documents mainlybased on the language of the document’s contents.

3) Documents Database Creator: Documents databasecreator module stores each of the document sets D1 =Db1, Db2, · · · , Dbn and D2 = De1, De2, · · · , Dem. These

Synonym_ID Sequence Word 101 1 eB 101 2 cy —̄K 101 3 Awfavb 102 1 Kjg 102 2 †jLbx

(a) Example in Bangla

Synonym_ID Sequence Word 101 1 Money 101 2 Funds 102 1 Document 102 2 Article 102 3 Manuscript

(b) Example in English

Figure 6. Synonym information

Algorithm 1 Document DBInput: Document setsRequire: Storing the documents in the file sys-tem

1: begin2: Create two file objects of the directory name where

the document files exists3: Create two tables with the field Doc ID, Title, Size

having the data type as number, BFILE and numberrespectively

4: while counter is not greater than directory length ofthe directory do

5: Insert the file name of directory [counter] intoTitle field of the corresponding document table.

6: Increment the value of counter corresponding tothe table by 1

7: end while8: end

sets of documents are used as training documents. Thedocument databases for both D1 = Db1, Db2, · · · , Dbn

and D2 = De1, De2, · · · , Dem are created with theinformation document ID, document title and size foreach of the documents. In our system, we use BFILEdata type to store the document database in a file system.The algorithm for creating and storing the documents indatabase is given in Algorithm 1. Algorithm 1 storesall the documents in the database as BFILE. The firstsql query creates paths from which the BFILE shouldkeep the reference. The second sql command creates twotables declaring document title as BFILE according to thesyntax.

4) Stoplist Identifier: This sub-module stores somemost frequently used terms in the database. These wordsare auxiliary terms and are used most frequently. Theseare called stopwords. We call the list of stopwords inour system as stoplist. Stoplist creation module stores thestopwords of each language in a separate database table



S.N Postfix 1 wU 2 wUi 3 Uv 4 Uvi 5 UzKz 6 UzKzi 7 Lvbv 8 Lvbvi 9 Lvwb

10 Lvwbi 11 MvQv 12 MvwQ 13 ¸‡jv 14 ¸jv 15 ¸wj 16 ¸‡jvi 17 ¸jvi 18 ¸wji 19 ‡`i 20 w`‡Mi 21 MY 22 Rb 23 `j 24 Pq 25 me 26 gnj 27 cUj 28 Rvj 29 `vg 30 cvj 31 mKj 32 Kzj 33 h~_ 34 gvjv 35 ivwR 36 ivwk 37 wbKi 38 wbPq

S.N Postfix 39 mg~n 40 Me© 41 eM© 42 Avejx 43 mgy`q 44 ¸”Q 45 MÖvg 46 e„›` 47 cyÄ 48 gÊjx 49 ‡kªYx 50 i 51 iv 52 †i 53 Gi 54 Giv 55 e 56 ev 57 ‡e 58 ‡eb 59 we 60 Z 61 ‡Z 62 ‡Zg 63 ‡Zb 64 Zvg 65 Zzg 66 wZm 67 Db 68 DK 69 Q 70 ‡Q 71 wQ 72 ‡Qb 73 wQm 74 wQj 75 wQwj 76 wQ‡j

S.N Postfix 77 wQ‡jb 78 wQjvg 79 wQjyg 80 wQ‡jg 81 j 82 ‡j 83 wj 84 jvg 85 jyg 86 ‡jg 87 ‡jb 88 G 89 Gb 90 GQ 91 G‡Q 92 GwQ 93 G‡Qb 94 GwQm 95 GwQj 96 GwQ‡j 97 GwQwj 98 GwQ‡jb 99 GwQ‡jg

100 GwQjyg 101 GwQjvg 102 B 103 BI 104 Bm 105 B‡e 106 Be 107 Bwe 108 B‡eb 109 Bj 110 B‡j 111 Bwj 112 B‡jb 113 Bjvg 114 B‡jg

S.N Postfix 115 Bjyg 116 BZ 117 B‡Z 118 BZvg 119 B‡Zb 120 BwZm 121 B‡ZQ 122 B‡Z‡Q 123 B‡ZwQ 124 B‡Z‡Qb 125 B‡ZwQm 126 B‡ZwQj 127 B‡ZwQjvg 128 B‡ZwQ‡j 129 B‡ZwQwj 130 B‡ZwQ‡jb 131 B‡ZwQ‡jg 132 B‡ZwQjyg 133 BqvQ 134 Bqv‡Q 135 BqvwQ 136 Bqv‡Qb 137 BqvwQm 138 BqvwQ‡j 139 BqvwQwj 140 BqvwQj 141 BqvwQ‡jb 142 BqvwQ‡jg 143 BqvwQjvg

Figure 7. List of postfixes in Bangla

as it requires faster checking to verify whether any wordis stopword or not. When a new stopword is required tobe added, it is just appended in the corresponding table.

5) Synonym Identifier: Synonym identifier sub-moduleidentifies the synonyms of words. In this paper, weused a special synonym handling mechanism to store thesynonyms. It is explained below.

The creation of synonym module stores the synonym ofwords corresponds to each language. The synonyms arestored in the database scanning the words of documentdatabase excluding stop words. The document database isscanned twice to create the synonym list. After the firstscan of the database initialization without the synonymeffect, the Occur table of the database as shown in Figure5 contains different words of the document database. Thistable also contains words with their synonyms. So, thistable is checked to find the group of words having samemeaning.

With the group of words, a synonym table for eachlanguage is created as shown in Figure 6. In this structure,the words having same meaning will have same synonymid that forms a group of synonyms. In each group ofsynonyms, a sequence number is maintained according tothe importance.

During the second scan of the database initialization thesynonym table is used to keep the effect of synonyms.This module also manages the storage of the terms insuch a way that the plagiarism detection system can rununiformly overcoming the synonym-handling problem.Storage module stores term information and necessaryrelated information in the database according to Figure

5. In Figure 5, keyword and Doc ID of the Word tablerepresents which keyword presents in which documentand Freq field shows how many times the keywordoccurs in that document. The Doc ID, Title and Sizefields of the Document table represent document ID ofdocuments, Title and number of terms present in thatdocument, respectively.

In Occur table the field Keyword and Doc freq is usedto represent in how many documents a keyword occurs.After stemming, the root is stored in the Word table ofthe database with the corresponding document ID. Allthe keywords of all documents are stored in the Wordtable. After the creation of Word table, the Occur tableis created from Word table. A record in Document tableis inserted after completion of scanning of each document.

6) Keywords Extractor: The keyword extractor sub-module extracts the root of every word using morpho-logical analyzer. The algorithm for keyword processortakes words as input and gives the root of the words asoutput. Finding the root of a word is called stemming.We have used a list of 143 postfixes for Bangla as shownin Figure 7 collected from Bangla grammar books and127 postfixes for English from [27]. Anyone can enrichthe list. The overall procedure of keywords extraction ingiven in Algorithm 2.

To stem the postfixes from the terms of the documentset, the morphological analyzer checks the terms againstthe postfix list of the corresponding language. Two im-portant issues in keywords extraction include the selectionof next possible alphabets of different postfixes to whichnext matching occurs and the identification of whetherthe current alphabet is the end of any postfix matching.It can be handled efficiently if we know the appropriatecharacter sequence in the postfixes. As in our system allthe terms are unicode supported, we consider the phoneticsequence of characters for their appearance in the terms.As for example, the sequence of characters in the postfixesof Figure 7 is given in Figure 8.

7) Addition of New Documents: For addition of anew document, first of all the maximum document IDis obtained from the Document table. The new ID isincremented by one for the new document. Then thedocument’s terms are scanned one by one until the end ofthe file is reached. For each term, it is checked whetherit is in stoplist or not. If it is in the stoplist, it is justskipped. If it is not in the stoplist, morphological analysisis done to find the root of the word.

The root word is checked in the synonym table. If itis in the table, then the equivalent term is checked in theWord table for that document whether it is present ornot. If it is not present then one record is inserted into theWord table with values equivalent synonym as keyword,the new document id as Doc ID and frequency as 1. If itis present with the new document id then the Freq valueof the term for that term will be incremented by 1. If theroot word is not found in the synonym table, it is insertedinto a temporary table with the same structure as the wordtable. In this way the scanning of the new document is



39

L v b v w i

7-10 M v Q v w 11-12 ` ‡ w i ‡ M i 19-20

‡ i v 53-54 55-59 w ‡ e v b w m

Z ‡ g v b y 60-66

Q w ‡ m � j b ‡ v w g b 69-80 j w ‡ y b g v 81-87

‡ j w ‡ y b w Q ‡ g b b

88-101 w ‡ e w b I m 102-108 j w ‡ y b w g v 109-115

j w ‡ y b w g v Q g y Z v ‡ b w m w 116-132

b w ‡ j Q w ‡ m v w q b g

g

133-143

y b K

67-68

` j 23 R b 22 m K j 31 c v j 30 j K y 32

g n j 26 c U j 27 P q 24 R v j 28 m e 25 ` v g 29

M b 21

x Y k & i ‡

49

Q M y P &

44

R c y T &

47

g M & i v

45 w b K i 37 g v j v

34 i v w R 35 v e j x

42 w b P q 38

` e „ b &

46

q m g ~ ̀

43 m g ~ n 39 i v w k

36

x j g b & W

48 M & & e 40 e i i M

41 i ‡ v 50-52

1-6

i

i i

U y y K w v 33 h ~ _

13-18

i

yyy M j v w ‡ v

Figure 8. Character sequence of postfixes in Bangla

completed. If no term is found in the temporary table afterthe complete scan of the file the Occur table is processedfrom Word table according to Algorithm 3.

If there are some records in the temporary table thenit is considered that the terms occurring in those recordshave no entity in the synonym table. As a result no effectof synonyms is considered for those terms. So synonymtable is updated for the terms in temporary word table.Algorithm 4 performs updating task of the synonymtable.

Algorithm 4 shows how the synonym is generated forthe newly added document. Now it is required to addthe terms from temporary word table to the Word tableconsidering the effect of synonyms. Procedure to add theterms from temporary word table to word table is givenAlgorithm 5.

After inserting the term information into Word table, the

Occur table is updated according to Algorithm 1. Thenthe Document table is updated for the new documentwith its size. In this way all the information regardingthe new document is stored in the database.

8) Bilingual Translator: In the method of bilingualplagiarism detection, the translation of keywords betweenthe documents is necessary. Algorithm 6 performs thetask of such translation. The algorithm first searches forthe existence of the keyword in the dictionary. If the itemexists in the dictionary the algorithm does not insert theitem in the table. Otherwise, the algorithm inserts the itemin the dictionary along with its translated values in otherlanguage. So, there is only one entry of a specific keywordin the dictionary.



Algorithm 2 KeywordsExtractionInput: List of terms T = t1, t2, · · · , tx and list ofpostfixes P = p1, p2, · · · , py in a languageOutput: Keywords

1: begin2: for each term ti, 1 ≤ i ≤ x do3: Check each character of ti from last with the last

character of pj , 1 ≤ j ≤ y4: if a match is found then5: Check next character of ti with the character at

same position of pj until a complete matchingwith pj is found

6: Discard the matched part from ti and returnremaining part as root ri

7: else8: Discard the scanning of the corresponding term

and return ti as keyword ri9: end if

10: end for11: end

Algorithm 3 Update occur tbl new document1: begin2: Select maximum document id from the word table3: Get all the terms from the word table where document

id = maximum document id4: for each selected term do5: if the term is present in the occur table then6: Increment the doc freq value of that term by 1

in the occur table7: else8: Insert the term into occur table with doc freq

value by 1.9: end if

10: end for11: end

B. Storage

Storage module stores information processed bydatabase initialization and processing module. Bilingualdictionary keeps the mapping of keywords in two differentlanguages.

C. Query Execution Module

Query execution module takes the document that isnecessary to check for plagiarism as input. The parsequery sub-module calls synonym identifier, stoplist cre-ator and keywords identifier sub-modules to generatethe corresponding information from the query document.Then vector space model [28] is created and based onvector space model relevancy is calculated using cosinedistance formula. Finally, retrieved relevant documents aresorted in descending order of their relevancies and returnto the user via query response sub-module. Here, the usercan restrict the system to return top n relevant documents.

Algorithm 4 Update synonym tbl new document1: begin2: Get all the terms from the temporary word table3: Mark all those terms as unmarked.4: for each unmarked term do5: Select maximum synonym id from the synonym

table6: Increment the selected id by 17: Insert the unmarked term into synonym table with

id 1 and sequence number 18: Scan all the other unmarked terms to find the

synonyms of the previous term9: if some terms are found then

10: for each such term do11: Insert the term into synonym table with the

same id and sequence number as sequencenumber +1

12: Mark all the terms as marked having thegenerated synonym id

13: end for14: end if15: end for16: end

Algorithm 5 Insert word tbl new document1: begin2: Select document id for the new document3: Get all the terms from the temporary word table4: for each selected term do5: Find the terms equivalent synonym from the syn-

onym table having sequence number 16: Check this equivalent synonym in word table with

same doc id7: if present then8: Increment the freq value of that term for the new

document by 19: else

10: Insert a record in the word table with the equiv-alent synonym having the new document id andfreq value as 1

11: end if12: end for13: end

In query execution module when the relevant docu-ments are listed they can be accessed from front end,because all the documents stored in the database are asBFILE. When a relevant document is needed to view, itscorresponding Doc ID is selected and the file correspondsto that Doc ID is displayed in a graphical user interface.

D. Complexity Analysis

The plagiarism detection system developed based ondocuments’ overall contents has two parts for the timeconcern, one for storing the necessary information inindex structure and another one is query time for anyparticular query. Here, we just give time complexity for



Algorithm 6 BilingualTranslationInput: Keywords in a language Require: Update of thedictionary if necessary

1: begin2: for each keyword do3: search the dictionary4: if the keyword is found in the dictionary then5: do nothing6: else7: add the keyword in the database table with its

translated value in other language8: end if9: end for

10: end

relevancy check. We do not give the complexity analysisfor storing the documents.

Let, n be the number of keywords in the query doc-uments, D be the number of documents in collection,Q be the time required to get necessary informationabout a keyword. So, time required for querying all thekeywords in a query document is O(nQ), time requiredfor calculating relevance of D number of documentsis O(D) , and time required for sorting the relevantdocuments is O(DlogD). So, total time complexity forsearching is O(nQ) +O(DlogD).

VI. BILINGUAL PLAGIARISM DETECTIONBASED ON DOCUMENTS’ STATISTICAL

INFORMATION

Our statistical method for bilingual plagiarism detectionmainly based on several stylometric features [29] of thedocuments. Stylometric features can measure differentaspects of writing style, and can be useful for detectingplagiarism from the documents of the same domain. Inour statistical approach, we have used four different stylo-metric features to generate statistical information of eachdocument. These are (i) number of sentences, (ii) averagesentence length, (iii) sentence type, and (iii) tense formuse each sentence. We have used the concept of [30] forgenerating above stylometric features of each document.Here, the system does not consider the sentences withless than four words for analysis. This is because fromthe observation it has been found that most sentences withless than four words do not contain interesting informationfor comparison.

When a query document comes, the system generatessimilar statistical information of the query document.

Then based on the statistical information of the docu-ments, following mathematical formula is used to calcu-late the relevancy between two documents.

Str − Ste

Str+

Ltr − Lte

Ltr+

Simtr − Simte

Simtr+

+Cmtr − Cmte

Cmtr+

Cptr − CpteCptr

+Prtr − Prte

Prtr+

+Patr − Pate

Patr+

Futr − Fute

Futr(1)

In above equation, te and tr stand for test and train-ing documents respectively. The meanings of remainingsymbols are as S: total number of sentences, L: averagesentence length, Sim: number of simple sentences, Cm:number of complex sentences, Cp: number of compoundsentences, Pr: number of sentences with present tense,Pa: number of sentences with past tense and Fu: numberof sentences with future tense.

From the above equation, it is observed that for anytwo documents calculated value closer to zero meanshigh level of plagiarism and far from zero indicates lowlevel of plagiarism. A value zero indicates full plagiarizeddocument.

Similar to our previous method, in this approach, wecan also restrict the system to return the documents belowan user defined threshold and any relevant document canbe viewed in a graphical user interface.

There are several limitations of this approach. Mainlimitation of this approach is that it just gives an ap-proximation of plagiarization but not actual plagiarizedinformation. Another limitation is that this approach ishighly domain specific and the documents size need tobe uniform. However, the main advantage of this methodis that it is independent of the language in use. Thismethod is well applicable at the university level to checkthe originality of the students’ assignments.

VII. EXPERIMENTAL RESULTS ANDDISCUSSION

This section discusses the experimental setup for thesimulation and the corresponding results.

A. Experimental Setup

The plagiarism detection system has been developed ona machine having an Intel(R) Core2 Duo, 2 GHz CPU,and 3 GB main memory, running on Microsoft WindowsXP operating system. The system was implemented inJbuilder-8 in the front and Oracle 10g DBMS in the back-end for storing the text database and related information.We have used total 110 documents for experimental pur-pose. These documents were collected from a departmentof a public university. Students were asked to submittheir report individually from a specific domain. Totalhundred and ten students were divided into two equalsize groups. Students of one group submitted their reportsin Bangla and another group submitted their reportsin English. Among 110 documents, we have used 50Bangla documents and 50 English documents as trainingdocuments. Remaining 10 documents were used for testpurpose. The documents are almost equal in size.



020004000600080001000012000

10 20 30 40 50Stemming time (ms) Number of documents scannedBanglaEnglish

Figure 9. Stemming time varying number of documents

B. Postfix Statistics and Stemming Performance

We analyzed the syntax of various words in Bangladocuments using 143 postfixes and English documentsagainst a list of 127 postfixes. In our experiment, we foundthat our data set requires stemming of around 62 differentkinds of postfixes in Bangla documents and around 33postfixes in English documents.

Then, we have analyzed the stemming performance.We performed stemming on the dataset initially with tendocuments and incrementally added ten word documentsin the system. We have performed stemming separately onEnglish and Bangla documents data set. Figure 9 showsthat the stemming time increases almost linearly with theincrease of document number and stemming of Bangladocuments require more time than English documents.

C. Performance Based on Morphological Analysis

Next, we measured the performance based morpho-logical analysis. We compared the result obtained byperforming morphological analysis with the result ob-tained without morphological analysis. Figure 10, showsthe result. We have used five Bangla and five Englishdocuments as test documents and the training corpusvaries from twenty to hundred documents with almostequal number of Bangla and English documents. Weset similarity percentage to thirty. We have performedquery with five different documents of each language onsame corpus and took the average number of retrieveddocuments from each training corpus. From Figure 10,it is observed that in both cases the relevancy is lesswhen no morphological analysis is performed. This poorperformance result is due to the words with postfixesresembles to different words and cannot contribute muchin relevance calculation.

D. Performance Analysis Based on Severity of Plagiarism

In this experiment, we have modified our original train-ing documents by replacing their contents with several

051015202530

20 40 60 80 100Retrieved documents Documents in corpusb) Test documents in EnglishWithout morphological analysisWith morphological analysis

05101520253020 40 60 80 100Retrieved documents Documents in corpusa) Test documents in Bangla

Without morphological analysisWith morphological analysisFigure 10. Comparative relevancy with morphological analysis and without morphological analysis

00.10.20.30.40.50.60.70.80.910.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5Precision / Recall Proportion of plagiarized textprecisionrecall

Figure 11. Performance analysis with increase level of plagiarism

plagiarized contents of different lengths. Then, we haveanalyzed the performance against the level of plagia-rization. Figure 11 shows the result. From Figure 11,we can find that our system has good detection rate ofplagiarism in terms of precision and recall with respectto the plagiarism severity.

E. Performance Based on Documents Statistical Informa-tion

In this experiment, we have gathered the statisticalinformation of each document. The information includenumber of sentences in each document, average lengthof sentences, number of sentences of each type, numberof sentences based on tense of verbs. Here, we considertype of sentences as simple, complex and compound and



-15-10-50510151 5 9 13 17 21 25 29 33 37 41 45 49Relevance Value Doc IDsa) Bangla training documents

Test doc in Bangla Test doc in English

-15-10-50510151 5 9 13 17 21 25 29 33 37 41 45 49Relevance Value Doc IDsb) English training documents

Test doc in Bangla Test doc in EnglishFigure 12. Statistical analysis of relevancies between training and test documents

tenses as present, past tense and future tense. No sub-classification of tenses is considered. We use total tentest documents. The contents of five documents are inBangla and the remaining five documents are in English.We perform plagiarization check with each of the ten testdocuments against each training document and then takethe average of relevance values of five Bangla documentsand five English documents separately. We used equation(1) for relevancy calculation.

Figure 12 shows an approximation of plagiarizationin test documents from the training documents. Figure12(a) shows the result when the training documents arein Bangla and Figure 12(b) shows the result when thecontents of the training documents are in English. In firstcase, we use training corpus of 50 bangla documents andin second case, we use 50 English documents as trainingcorpus.

In the result of Figure 12, a value closer to zero meanshigh level of plagiarism and far from zero indicates lowlevel of plagiarism. A value zero indicates full plagia-rised document. From Figure 12, it is also observed that

relevancy between training and test documents of samelanguage are higher than the relevancy of documentsin different languages. The change in the structure ofsentences and other changes in the equivalent documentsof two different languages is the main reason of suchresult.

VIII. CONCLUSION

Detection of plagiarism from documents spanningover different languages is a major concern in currentglobalized world. While existing commercial and non-commercial plagiarism detectors provide good supportfor checking plagiarised documents within same lan-guage domain, they suffer from detecting plagiarismfrom documents of different language domains. In thispaper, we have proposed two different approaches fordetecting plagiarism from documents of two differentlanguages. First method is based on removal of stopwords, extraction of keywords, checking for synonymsand bilingual translation. The second approach is basedon statistical information of the documents. We haveimplemented the systems for two languages: English andBangla and found that the system can efficiently detectplagiarism from documents of both languages. Thoughwe have considered Bangla and English languages forthe experimental purpose, the system can adapt otherlanguages with slight modification in the dictionary andstorage of documents structure.

In this work, we consider an in-house database ofdocuments and it is necessary to scan a document beforeit is considered in the system for plagiarism detection. Infuture, we hope to develop a system that can efficientlydetect plagiarism from the documents in the Internetwithout scanning them in the system.

REFERENCES

[1] WriteCheck: Available: http://www.writecheck.com/[2] EVE2: Available: http://www.canexus.com/eve/[3] DOC Cop. Available: http://www.doccop.com/[4] Plagium. Available: http://www.plagium.com/[5] Plagiarism Detector. Available: http://www.plagiarism-

detector.com/[6] CodeMatch. Available: http://www.safe-corp.biz/

products codematch.htm[7] Pl@giarism. Available: http://www.plagiarism.tk/[8] Glatt Plagiarism Self-Detection Program (GPSD). Avail-

able: http://www.plagiarism.com/self.detect.htm[9] Glatt Plagiarism Screening Program (GPSP). Available:

http://www.plagiarism.com/screen.id.htm[10] WCopyfind. Available: http://plagiarism.bloomfieldmedia.

com/z-wordpress/software/wcopyfind/[11] Z. Ceska, M. Toman, and K. Jezek, Multilingual plagiarism

detection , In Proc. of 13th International Conference onArtificial Intelligence: Methodology, Systems, and Appli-cations, Varna, Bulgaria, September 2008, pp. 83-92.

[12] M. Potthast, A. B. Cedeno, B. Stein, and P. Rosso, Cross-Language plagiarism detection, In the Journal of Lan-guage Resources and Evaluation, vol. 45, no. 1, January2010, pp. 45 - 62.

[13] A. B. Cedeno, P. Rosso, E. Agirre, and G. Labaka, Pla-giarism detection across distant language pairs, In Proc.of the 23rd International Conference on ComputationalLinguistics, Beijing, china, August 2010, pp. 37-45.



[14] Plagiarism detection. Available: http://en.wikipedia.org/wiki/ Plagiarism detection/

[15] T. C. Hoad and J. Zobe, Methods for identifying versionedand plagiarised documents, Journal of the American Soci-ety for Information Science and Technology, vol. 54, no.3, pp. 203-215.

[16] B. Stein, Fuzzy-fingerprints for text-based informationretrieval, In Proc. of 5th International Conference onKnowledge Management, 2005, pp. 572-579.

[17] K. Monostori, A. Zaslavsky, and H. Schmidt, Documentoverlap detection system for distributed digital libraries, InProc. of the Fifth ACM Conference on Digital Libraries,2000, pp. 226-227.

[18] B. S. Baker, On Finding duplication in strings and soft-ware, Technical Report, AT&T Bell Laboratories, MurrayHill, New Jersey.

[19] D. V. Khmelev and W. J. Teahan, A repetition basedmeasure for verification of text collections and for textcategorization, In Proc. of the 26th annual internationalACM SIGIR conference on Research and development ininformaion retrieval, 2003, pp. 104-110.

[20] A. Si, H. V. Leong, R. W. H. Law, CHECK: A documentplagiarism detection system, In Proc. of the ACM sympo-sium on Applied computing, 1997, pp. 70-77.

[21] H. Dreher, Automatic conceptual analysis for plagiarismdetection, In The Journal of Issues in Informing Scienceand Information Technology, vol. 4, 2007, pp. 601-614.

[22] M. Zechner, M. Muhr, R. Kern, and M. Granitzer, Exter-nal and intrinsic plagiarism detection using vector spacemodels, In Proc. of 3rd Workshop on Uncovering Pla-giarism, Authorship and Social Software Misuse and 1stInternational Competition on Plagiarism Detection, 2009,pp. 47-55.

[23] B. Gipp, and J. Beel, Citation based plagiarism detection- a new approach to identify plagiarized work languageindependently, In Proc. of the 21th ACM Conference onHyptertext and Hypermedia, June 2010, pp. 273-274.

[24] B. Gipp, N. Meuschke, and J. Beel, Comparative eval-uation of text- and citation-based plagiarism detectionapproaches using GuttenPlag, In Proc. of 11th ACM/IEEE-CS Joint Conference on Digital Libraries, June 2011,Ottawa, Canada, pp. 255-258.

[25] B. Gipp, and N. Meuschke, Citation pattern matchingalgorithms for citation-based plagiarism detection: greedycitation tiling, citation chunking and longest commoncitation sequence, In Proc. of the 11th ACM Symposiumon Document Engineering, 2011, pp. 249-258.

[26] D. I. Holmes, The evolution of stylometry in humanitiesscholarship, In the Journal of Literary and LinguisticComputing, vol. 13 no. 3, 1998, pp. 111-117.

[27] List of Suffixes. Available: http://www.examples-help.org.uk /definition-of-words/list-of-suffixes.htm/

[28] G. Salton, A. Wong, and C. S. Yang, A Vector space modelfor information retrieval, In Communications of the ACM,vol. 18, no. 11, November 1975, pp. 613-620.

[29] P. Juola, Authorship Attribution, In Foundations andTrends in Information Retrieval, vol. 1, no. 3, 2006, pp.233-334.

[30] S. Azad, and M. S. Arefin, Bangla documents analyzer, InProc. of International Conference on Electronics, Com-

puter and Communication, Rajshahi, Bangladesh, June2008, pp. 438-442.

Mohammad Shamsul Arefin received his B.Sc. En-gineering in Computer Science and Engineering fromKhulna University, Khulna, Bangladesh in 2002, andcompleted his M.Sc. Engineering in Computer Scienceand Engineering in 2008 from Bangladesh University ofEngineering and Technology (BUET), Bangladesh. Nowhe is a PhD candidate at Hiroshima University withsupport of the scholarship of MEXT, Japan.

He is a member of Institution of Engineers Bangladesh(IEB) and currently working as an Assistant Professor inthe Department of Computer Science and Engineering,Chittagong University of Engineering and Technology,Chittagong, Bangladesh. His research interest includesprivacy preserving data mining, multilingual data man-agement, semantic web, and object oriented system de-velopment.

Yasuhiko Morimoto is an Associate Professor at Hi-roshima University. He received B.E., M.E., and Ph.D.from Hiroshima University in 1989, 1991, and 2002,respectively. From 1991 to 2002, he had been with IBMTokyo Research Laboratory where he worked for datamining project and multimedia database project. Since2002, he has been with Hiroshima University. His currentresearch interests include data mining, machine learning,geographic information system, and privacy preservinginformation retrieval.

Mohammad Amir Sharif received his B.Sc. Engineeringin Computer Science and Engineering from Khulna Uni-versity, Khulna, Bangladesh in 2002, and completed hisM.Sc. Engineering in Computer Science and Engineeringin 2007 from Bangladesh University of Engineering andTechnology (BUET), Bangladesh. Now he is a PhDcandidate University of Louisiana at Lafayette, USA.

Currently he is working as an Assistant Professor inMawlana Bhashani Science and Technology University,Tangail, Bangladesh, in Information and CommunicationTechnology Department. His research interest includesinformation retrieval, distributed computing, multilingualdata management, and Intelligent systems.



BAENPD: A Bilingual Plagiarism Detector - Journal of ...

Documents

Transcript of BAENPD: A Bilingual Plagiarism Detector - Journal of ...