Identification of Hindi Words Used in Pornographic Unsolicited Bulk Emails


Jatinderkumar R Saini* and Apurva A Desai**

E-mail has become a fast and cheap means of online communication. The main threat to e-mail is Unsolicited Bulk E-mail (UBE), commonly known as spam e-mail. The current work aims at identification of Hindi words in pornographic UBE. The motives of the paper are manifold. This is an attempt to better understand the UBE and its interplay with the regional language in the perspective of international spamming. The problem has been addressed by employing tokenization technique and Unigram BOW model. The current paper reports the first results on the identification of 87 Hindi words from more than 1,850 pornographic UBE analyzed by us. To the best of our knowledge, this is the first attempt to identify Hindi words in the corpus of pornographic UBE.

Introduction

E-mail has become an efficient and popular communication mechanism as the number of Internet users has increased. It provides both a faster and a cheaper form of communication. But a large part of e-mail traffic consists of non-personal, non-time-critical and unsolicited information. This type of e-mail is called Unsolicited Bulk E-mail (UBE) and is commonly known by various other synonymous names like spam e-mail, bulk e-mail, junk e-mail, unimportant e-mail and Unsolicited Commercial E-mail (UCE). E-mail spam is a subset of spam that involves nearly identical messages sent to numerous recipients by e-mail. According to Saini and Desai (2009), there are various types of UBE, like those dealing with financial transactions, viruses, chain letters, pornography and fake offers for jobs, medicines, etc.

Pornography or porn is defined as the portrayal of explicit sexual subject matter (Wikipedia Pornography, 2010). This portrayal may exhibit itself in either textual form or graphical form. In the current work, we have attempted to identify the usage of Hindi terms in pornographic spam e-mails. The target of the current work is, hence, to focus on pornographic UBE in textual format. This way, we have attempted to understand the spamming behavior as well as the usage of [...] we have also attempted to study the usage of non-dictionary or slang Hindi words in pornographic UBE. We believe that in the context of international spamming, the scope of Hindi spamming is regional and limited, as compared to the bulk of pornographic UBE in English roaming over the Internet.

* Associate Professor and Head, Department of Computer Science, S P College of Engineering, Visnagar, Gujarat, India; and is the corresponding author. E-mail: [email protected]

** Professor and Head, Department of Computer Science, Veer Narmad South Gujarat University, Surat, Gujarat, India. E-mail: [email protected]

© 2011 IUP. All Rights Reserved.

Related Work

Based on the survey of related literature, we found that there has not been much work in the field of studying usage of Hindi language for electronic communication through e-mails. McAfee, a leading antivirus software developing company, in its report, believes that spam falls into various categories (McAfee Inc., 2008). Providing the breakup of the categories of spam e-mail, they have included language-based spam e-mails, such as 'Russian spam' and 'Chinese spam', in the list of spam e-mails presented by them.

Much of the usage of Hindi language in e-mails is meant to give it an informal touch. This also takes the form of slang usage of the language. This is especially true when the subject matter of the e-mail is pornographic. Wikipedia Internet-slang (2010) has defined Internet slang or Internet language as a type of slang that Internet users have popularized, and in many cases, have coined. Internet users employ slang mainly through services like e-mails, messengers and blogs. YourDictionary (2010) has defined slang as highly informal speech that is outside conventional or standard usage and consists both of coined words and phrases and of new or extended meanings attached to established terms. According to Goswami et al. (2009), slang is a non-dictionary word that has evolved with time due to its frequent and popular usage. Hence a good number of research instances are available to highlight the slang usage of the language.

The usage of slang in literature related to English language has been addressed extensively by researchers. We believe that the results are principally true for Hindi language also. According to an analysis by Krasny (2000), usage of slang words of the English language is constantly changing and slang is increasingly becoming a greater part of our shifting linguistic terrain. He has concluded that most new words come from slang. Thorne (2004), in his research article on slang, style-shifting and sociability, has remarked that along with other factors, e-mail is responsible for generation of new slang as well as for enormous proliferation of websites designed to celebrate and decode slang. Our belief is that similar to English language, Hindi is also changing and slang is becoming a key source of its evolution.

In the more specific fields of computer sciences, Kucukyilmaz et al. (2008), in their work on predicting user and message attributes in computer-mediated communication, have concluded that as chat conversations occur in a spontaneous environment, the use of slang words and misspellings is frequent. They have identified various slang words in their work, which is focused on chat mining. In another paper, Kucukyilmaz et al. (2006) have explored the analysis of slang words for gender prediction in chat data. We believe that chatting habituates a person in the usage of slang words, and so the same person, while communicating through e-mails, will also liberally make use of the same slang terms in e-mails. Towards this end, we have not only attempted to identify the Hindi words used in pornographic UBE but also identified slang Hindi words used in pornographic UBE.

The IUP Journal of Systems Management, Vol. IX, No. 2, 2011

Methodology

The main steps involved in the methodology for the current work could be delineated as depicted in Figure 1. The first step is to collect e-mails. The second step reduces the corpus created at the end of the first step by selecting only UBE. We restricted the corpus only to UBE because for non-spam e-mails, the probability of finding pornographic Hindi words is very low, nearly negligible. Secondly, the size of the corpus containing UBE and non-spam e-mails was biased by the dominating number of non-spam e-mails, which were of no relevance to the current work. This motivated us for corpus size reduction by removal of non-spam e-mails.

Further, we decided to again refine the corpus by removal of non-pornographic UBE. Our motive behind doing this was the prevention of dilution of corpus by other types of UBE. Moreover, our belief was that pornographic UBE contains more Hindi words, compared to any other category of UBE. Secondly, we believed that pornographic UBE also contains more slang usage of the language than any other category of UBE. Our belief is supported by the following two points:

1. Number of slang words is highest in domains like crime, sex, violence and drugs (Astriyani et al., 2007 and YourDictionary, 2010).

2. Number of UBE dealing with crime, violence and drugs is less than the number of UBE dealing with sex.

Here it is noteworthy that as with the research work of Astriyani et al. (2007), we have also considered the terms 'sex' and 'pornography' synonymously.

Also, it is noteworthy that the term 'drugs' in the present context refers to addiction-inducing things like cocaine, brown sugar, etc. This means that the usage of the term 'drugs' in the present context is not a reference to medicinal drugs like Viagra and Cialis which are advertised through UBE. Also, Saini and Desai (2009) in their research work have found that the number of UBE dealing with sex or pornography is very high compared to UBE dealing with crime, violence and drugs. Finally, we [...] language. As such, no UBE was found to be termed pure-Hindi UBE; instead, the mails contained [...] English language. Moreover, Chinese, Russian and other languages were discarded. Hence, at the end of the third step, we had a text corpus consisting of only pornographic UBE. The size of this corpus was 1,854 spam e-mails.

Figure 1: Diagrammatic Representation of Methodology [flowchart; legible box labels: Collect E-Mails, Select Pornographic UBE, Perform Tokenization, [...] and English (Dictionary and Non-Dictionary) Words, Retrieve Hindi Words]

The basic model of free text consists of documents which are sequences of basic units called tokens. In English language, the tokens are words (Zhang, 2006) and the act of breaking the text into tokens is called tokenization. It is notable to mention here that even the Hindi terms in the mails were worded using English alphabet. In order to make it easier for analysis and further processing, we performed tokenization on the corpus of pornographic UBE. This yielded us with a text corpus in the Bag of Words (BOW) form. In BOW representation of a text document, terms in the document are identified with the words in the document. Hence this representation is also called Set of Words (SoW) (Sebastiani, 2002). Tokenization was done in such a way that we could concentrate only on the unigrams.
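The tokenization step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the regular expression, the function names `tokenize` and `bag_of_words`, and the two sample strings are assumptions made for the example. Lowercasing reflects the case-insensitive treatment of the corpus, and only single words (unigrams) are kept.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase alphabetic unigram tokens (assumed rule)."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(documents):
    """Count every unigram across all documents (BOW form)."""
    bow = Counter()
    for doc in documents:
        bow.update(tokenize(doc))
    return bow

# Two made-up message fragments, not taken from the study's corpus.
docs = ["Aao desi bhabhi", "Desi masala, kya masala!"]
bow = bag_of_words(docs)

# Set of Words (SoW): only presence matters, counts are dropped.
sow = set(bow)
```

Because the Hindi terms in the mails were written in the English alphabet, a Latin-letter pattern is enough for this sketch; a corpus in Devanagari script would need a different token pattern.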

The BOW was further processed by removal of stop words from it. Sebastiani (2005) has defined Stop Words as topic-neutral words such as articles and prepositions, which are eliminated in a preprocessing phase. Bharati et al. (2002) have defined them as the few words which have high frequencies in all the categories, and hence are irrelevant for the classification exercise. This processing of the BOW resulted in a vector consisting of 4,908 word entries. Further, the English dictionary words were also removed from the BOW. Here, it is noteworthy to state that by removal of English dictionary words we intended to create a vector consisting only of words which are not present in the English dictionary. Moreover, the treatment of text corpus was not case-sensitive and the tokens resulting from tokenization were not subjected to reduction based on their root forms. This means that we did not employ the stemming and lemmatization techniques discussed by Zhang (2006).
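The two filtering steps above amount to set differences over the vocabulary. The sketch below assumes tiny, hypothetical word lists (`STOP_WORDS`, `ENGLISH_DICTIONARY`) and a hypothetical helper `refine`; the paper does not say which stop list or dictionary was actually used.

```python
# Hypothetical word lists for illustration only.
STOP_WORDS = {"the", "a", "of", "in", "to"}
ENGLISH_DICTIONARY = {"hot", "photos", "click", "here"}

def refine(tokens, stop_words=STOP_WORDS, dictionary=ENGLISH_DICTIONARY):
    """Return unique tokens that are neither stop words nor dictionary words.

    Matching is case-insensitive, and tokens are compared as-is:
    no stemming or lemmatization is applied.
    """
    vocabulary = {token.lower() for token in tokens}
    after_stop_words = vocabulary - stop_words
    return after_stop_words - dictionary

# Made-up token list, not taken from the study's corpus.
tokens = ["Click", "HERE", "the", "hot", "Desi", "photos", "of", "bhabhi", "clicks"]
remaining = refine(tokens)
```

In this toy run `clicks` survives even though `click` is in the word list, which is the practical consequence of skipping stemming: inflected or respelled forms stay in the non-dictionary vector and have to be handled in a later refinement.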


Results

The removal of stop words after tokenization of text corpus consisting of 1,854 pornographic UBE resulted in 4,908 words. The removal of English dictionary and non-slang words then resulted in a 202-rowed vector. This vector contained Hindi words as well as various English slang words. The vector was finally refined by removal of 115 English slang words, and the result was an 87-rowed vector consisting of only Hindi words. The 115 slang words are English words which are not in the English dictionary. The 87-rowed vector, obtained by applying the various processing steps described earlier, was ordered by sorting it on the words in lexicographic manner. We call this vector the 'vector of Hindi words' or the 'Hindi Words Vector (HWV)'. The list of HWV is presented in Table 1.
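The counts reported above can be cross-checked with simple arithmetic. In the sketch below the figure of 4,706 English dictionary words is taken from the Conclusion of the paper; the variable names and the four-word sample used to show the lexicographic ordering are assumptions for illustration.

```python
# Counts reported in the paper.
after_stop_words = 4908    # unique words left after stop-word removal
english_dictionary = 4706  # English dictionary words (from the Conclusion)
english_slang = 115        # English non-dictionary (slang) words

non_dictionary = after_stop_words - english_dictionary  # size of the 202-rowed vector
hindi_words = non_dictionary - english_slang            # size of the HWV

# Case-insensitive lexicographic ordering, shown on a small sample of HWV entries.
sample = ["Kya", "aao", "Desi", "Babu"]
ordered_sample = sorted(sample, key=str.lower)
```

The three categories add back up to the starting vocabulary: 4,706 + 115 + 87 = 4,908.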


Table 1: List of Identified Hindi Words in UBE Corpus

S. No.  Hindi Word     S. No.  Hindi Word     S. No.  Hindi Word     S. No.  Hindi Word
1.      Aao            23.     Hoon           45.     Kum            67.     Nazook
2.      Abey           24.     Hota           46.     Kya            68.     Neechey
3.      Adaab          25.     Inispector     47.     Kyaa           69.     Noor
4.      Arey           26.     Ise            48.     Kyun           70.     Pallu
5.      Babu           27.     Isi            49.     Laaga          71.     Pardhe
6.      Bagh           28.     Iski           50.     Maar           72.     [illegible]
7.      Bbi            29.     Jal            51.     Malayali       73.     Saath
8.      Bbs            30.     Jalsa          52.     Mallu          74.     Sab
9.      Bebo           31.     Jwab           53.     Masala         75.     Sabse
10.     Bhabhi         32.     Kaam           54.     Mein           76.     [illegible]
11.     Cheez          33.     Kahan          55.     Meine          77.     Saree
12.     Chunari        34.     Kaiku          56.     Meinen         78.     Sunehari
13.     Deedar         35.     Kal            57.     Mera           79.     T[...]
14.     Desi           36.     Kameez         58.     Meri           80.     T[...]
15.     Dupatta        37.     Kar            59.     Milega         81.     Tereku
16.     Firangi        38.     Karke          60.     Mujra          82.     [illegible]
17.     Gaand          39.     Karthey        61.     Naa            83.     Wahaan
18.     Gora           40.     Kaun           62.     Naam           84.     Woh
19.     Gujju          41.     Khola          63.     Nahin          85.     Yaad
20.     Haar           42.     Koi            64.     Nai            86.     Zabardasti
21.     Hai            43.     Kuch           65.     Nain           87.     Zara
22.     Hain           44.     Kudi           66.     Nakko

The HWV provided the key input for further analysis of Hindi words. As is evident from Table 1, even though we analyzed pornographic UBE, the Hindi words were not all specific to the pornographic domain. All the words in the HWV were Hindi words, but it presented a mixed collection of different kinds of Hindi words. This is so because it is difficult to exactly define a 'pornographic Hindi word' and a 'non-pornographic Hindi word'. Even if these are defined, a finite degree of ambiguity will always persist between the two categories of words. This is so because we believe that it is the usage of the word that modifies its behavior and places it in one category or the other. We argue that Hindi language just provides words and it is the context that makes them pornographic. Even though one could certainly argue for some Hindi words from Table 1 to be categorized as pornographic and others to be categorized as non-pornographic, most of the instances from HWV are pornographic because they are used here in that context. For example, kameez (i.e., shirt), salwar (i.e., trouser/pantaloon), zabardasti (i.e., forcefully) and khola (i.e., opened/stripped) are all non-pornographic words, but their usage sequence in pornographic context makes them dubious. Hence, in order to have a more generalized classification, we did not create three categories for pornographic, non-pornographic and words with ambiguity. We addressed the problem of ambiguity by taking into play the role of 'context'. Our view is supported by Bookrags (2010) where it is mentioned that slang refers to informal words and expressions, but it has more to do with what is appropriate in a particular context. Wikipedia Pornography (2010) also states that the very definition of what is or is not pornography is context-specific. They add that this definition has differed in different historical, cultural and national contexts.

Slang terms often originate with the purpose of saving keystrokes and behave like Internet shorthand (Wikipedia Internet-slang, 2010). From the HWV we also found that many slang words have been compressed or abbreviated. For instance, gujju and mallu are used instead of 'Gujarati' and 'Malayali', respectively. An important finding obtained from HWV, however, is that extra keystrokes have also been used for creation of slang words. For instance, bbi, meinen and inispector are used instead of bb (i.e., wife), meine (i.e., myself) and inspector, respectively.
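The shortening and lengthening noted above can be made concrete by comparing character counts. This is only an illustrative sketch: the pairs are the examples quoted in the text, while the helper `keystroke_change` and the use of plain string length as a proxy for keystrokes are assumptions.

```python
# (slang form, standard form) pairs quoted in the text.
PAIRS = [
    ("gujju", "gujarati"),
    ("mallu", "malayali"),
    ("bbi", "bb"),
    ("meinen", "meine"),
    ("inispector", "inspector"),
]

def keystroke_change(slang, standard):
    """Positive: the slang form adds keystrokes; negative: it saves them."""
    return len(slang) - len(standard)

changes = {slang: keystroke_change(slang, standard) for slang, standard in PAIRS}
shortened = sorted(word for word, delta in changes.items() if delta < 0)
lengthened = sorted(word for word, delta in changes.items() if delta > 0)
```

On these five pairs, `gujju` and `mallu` each save three characters, while `bbi`, `meinen` and `inispector` each add one.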

Finally, from the HWV, we can see that in addition to words created by shortening or lengthening the standard Hindi words, slang words have also been coined by the users. Words like kaiku and nakko, used for kyon (i.e., why) and nahin (i.e., no) respectively, are examples of this.

Conclusion

We analyzed 1,854 pornographic UBE messages, which after removal of stop words resulted in 4,908 unique words. Out of these, 4,706 were English dictionary words and 115 were English non-dictionary (i.e., slang) words. The processing of the original text corpus through tokenization, BOW and multiple refinements, hence, resulted in identification of 87 unique Hindi words from the pornographic domain.

We conclude that Hindi words may be used for portraying pornography depending on the context of usage because in a different framework the same word or SOW may not exhibit pornography at all. Hence, language just provides words and it is the context that makes them behave differently. It is also concluded that the Hindi words in pornographic UBE are created either by saving the keystrokes or by adding extra keystrokes to the usual words. Based on all the findings and results, we conclude that most of the Hindi words used in the e-mails have come from the language otherwise spoken by the Indian natives. Hence, we conclude that the number of Hindi words slanged through e-mails or electronic communication in general, is less than those slanged in English language. In other words, English language shows the trend wherein words evolve through the online communications and then are added to the offline general conversation, whereas the reverse thing happens with Hindi language. Here the words seem to first come up in conversation and then drip into the online communication.

The results, findings and inferences provided here are based on the data used for analysis. We do not argue in favor or otherwise of use of Hindi words for pornographic portrayal. We just report the results based on the identification of Hindi words in text corpus created from existing pornographic UBE. To the best of our knowledge and based on the review of related literature, we conclude that identification of Hindi words in pornographic UBE is a new concept. Our work is an insight into the study of context-based slang usage of Hindi language for messaging in general and spamming in the pornographic domain in particular. Our current findings reveal the influence of Hindi language on the online behavior of user. These could also be put to use for language-neutral text-based content filters for UBE.

References

1. Astriyani, Sutjiati R and Purwaningsih D E (2007), An Analysis of Slang Language Related to Sex in Eminem's Rap Songs' Lyrics, Repository of Gunadarma University, ISSN: 1987-4783, Jakarta.

2. Bharati A, Varanasi K, Kamisetty C et al. (2002), "A Document Space Model for Automated Text Classification Based on Frequency Distribution Across Categories", in Proceedings of International Conference on NLP-2002, Recent Advances in NLP, Vikas Publishing House, New Delhi.

3. Bookrags (2010), "Slang: The Primary English Curriculum", available at http://www.bookrags.com/Slang/tandf/slang-11-tf

4. Goswami S, Sarkar S and Rustagi M (2009), "Stylometric Analysis of Bloggers' Age and Gender", in Proceedings of the 3rd International AAAI Conference on Weblogs and Social Media (ICWSM - 2009), San Jose, California.

5. Krasny M (2000), "Analysis: Usage of Slang Words", article from Talk of the Nation (NPR), August 7, 2000, available at http://www.highbeam.com/doc/1P1-30383388.html

6. Kucukyilmaz T, Cambazoglu B B, Aykanat C and Can F (2006), "Chat Mining for Gender Prediction", Lecture Notes in Computer Science, Springer Berlin, Heidelberg, Vol. 4243/2006, pp. 274-283, ISSN: 0302-9743.

7. Kucukyilmaz T, Cambazoglu B B, Aykanat C and Can F (2008), "Chat Mining: Predicting User and Message Attributes in Computer-Mediated Communication", Information Processing and Management: An International Journal, Vol. 44, No. 4, pp. 1448-1466, ISSN: 0306-4573.

8. McAfee Inc. (2008), "Current Spam Categories", available at http://www.mcafee.com/us/threat-center/anti-spam/spam-categories.html, December 23, 2008.

9. Saini J R and Desai A A (2009), "Self Learning Taxonomical Classification System Using Vector Space Document Analysis Model for Web Text Mining in UBE", Ph.D. Thesis, accepted by Department of Computer Science, Veer Narmad South Gujarat University, Surat, Gujarat, India, September.

10. Sebastiani F (2002), "Machine Learning in Automated Text Categorization", ACM Computing Surveys, Vol. 34, No. 1, pp. 1-47, ISSN: 0360-0300.

11. Sebastiani F (2005), "Text Categorization", in Alessandro Zanasi (Ed.), Text Mining and Its Applications, pp. 109-129, WIT Press, Southampton, UK.

12. Thorne T (2004), "Slang, Style-Shifting and Sociability", Multicultural Perspectives on English Language and Literature, Tallinn/London, available at http://www.kcl.ac.uk/content/1/c6/03/08/16/Slang,%20Style-shifting%20and%20Sociability.doc

13. Wikipedia Internet-slang (2010), "Internet-slang", Wikimedia Foundation Inc., available at http://en.wikipedia.org/wiki/internet-slang

14. Wikipedia Pornography (2010), "Pornography", Wikimedia Foundation Inc., available at en.wikipedia.org/wiki/pornography

15. YourDictionary Website (2010), "The American Slang Dictionary", available at http://www.yourdictionary.com/dictionary-articles/American-Slang-Dictionary.html

16. Zhang T (2006), "Predictive Methods for Text Mining", Machine Learning Summer School - 2006, Taipei, available at videolectures.net/mlss06tw_zhang_pmtm

Reference # 08J-2011-05-04-01