Identification of Hindi Words Used in Pornographic Unsolicited Bulk E-Mails
Jatinderkumar R Saini* and Apurva A Desai**
E-mail has become a fast and cheap means of online communication. The main threat to e-mail is Unsolicited Bulk E-mail (UBE), commonly known as spam e-mail. The current work aims at identification of Hindi words in pornographic UBE. The motives of the paper are manifold. This is an attempt to better understand the UBE and its interplay with the regional language in the perspective of international spamming. The problem has been addressed by employing tokenization technique and Unigram BOW model. The current paper reports the first results on the identification of 87 Hindi words from more than 1,850 pornographic UBE analyzed by us. To the best of our knowledge, this is the first attempt to identify Hindi words in the corpus of pornographic UBE.
Introduction
E-mail has become an efficient and popular communication mechanism as the number of Internet users has increased. It has provided both faster and cheaper forms of communication mechanism. But a large part of e-mail traffic consists of non-personal, non-time critical and unsolicited information. This type of e-mail is called Unsolicited Bulk E-mail (UBE) and is commonly known by various other synonymous names like spam e-mail, bulk e-mail, junk e-mail, unimportant e-mail and Unsolicited Commercial E-mail (UCE). E-mail spam is a subset of spam that involves nearly identical messages sent to numerous recipients by e-mail. According to Saini and Desai (2009), there are various types of UBE, like those dealing with financial transactions, viruses, chain letters, pornography and fake offers for jobs, medicines, etc.
Pornography or porn is defined as the portrayal of explicit sexual subject matter (Wikipedia Pornography, 2010). This portrayal may exhibit itself in either textual form or graphical form. In the current work, we have attempted to identify the usage of Hindi terms in pornographic spam e-mails. The target of the current work is, hence, to focus on pornographic UBE in textual format. This way, we have attempted to understand the spamming behavior as well as the usage of [...], we have also attempted to study the usage of non-dictionary or slang Hindi words in pornographic UBE. We believe that in the context of international spamming, the scope of Hindi spamming is regional and limited, as compared to the bulk of pornographic UBE in English roaming over the Internet.
* Associate Professor and Head, Department of Computer Science, S P College of Engineering, Visnagar, Gujarat, India; and is the corresponding author. E-mail: [email protected]
** Professor and Head, Department of Computer Science, Veer Narmad South Gujarat University, Surat, Gujarat, India. E-mail: [email protected]
© 2011 IUP. All Rights Reserved.
Related Work
Based on the survey of related literature, we found that there has not been much work in the field of studying usage of Hindi language for electronic communication through e-mails. McAfee, a leading antivirus software developing company, in its report, believes that spam falls into various categories (McAfee Inc., 2008). Providing the breakup of the categories of spam e-mail, they have included language-based spam e-mails, such as 'Russian spam' and 'Chinese spam', in the list of spam e-mails presented by them.
Much of the usage of Hindi language in e-mails is meant to give it an informal touch. This also takes the form of slang usage of the language. This is truer when the subject matter of the e-mail is pornographic. Wikipedia Internet-slang (2010) has defined Internet slang or Internet language as a type of slang that Internet users have popularized, and in many cases, have coined. Internet users employ slang mainly through services like e-mails, messengers and blogs. YourDictionary (2010) has defined slang as highly informal speech that is outside conventional or standard usage and consists both of coined words and phrases and of new or extended meanings attached to established terms. According to Goswami et al. (2009), slang is a non-dictionary word that has evolved with time due to its frequent and popular usage. Hence a good number of research instances are available to highlight the slang usage of the language.
The usage of slang in literature related to English language has been addressed a lot by the researchers. We believe that the results are principally true for Hindi language also. According to an analysis by Krasny (2000), usage of slang words of the English language is constantly changing and slang is increasingly becoming a greater part of our shifting linguistic terrain. He has concluded that most new words come from slang. Thorne (2004), in his research article on slang, style-shifting and sociability, has remarked that along with other factors, e-mail is responsible for generation of new slangs as also for enormous proliferation of websites designed to celebrate and decode slang. Our belief is that similar to English language, Hindi is also changing and slang is becoming a key source of its evolution.
In the more specific fields of computer sciences, Kucukyilmaz et al. (2008) in their work on predicting user and message attributes in computer-mediated communication, have concluded that as chat conversations occur in a spontaneous
The IUP Journal of Systems Management, Vol. IX, No. 2, 2011
environment, the use of slang words and misspellings is frequent. They have identified various slang words in their work, which is focused on chat mining. In another paper, Kucukyilmaz et al. (2006) have explored the analysis of slang words for gender prediction in chat data. We believe that chatting habituates a person in the usage of slang words, and so the same person, while communicating through e-mails, will also liberally make use of the same slang terms in e-mails. Towards this end, we have not only attempted to identify the Hindi words used in pornographic UBE but also identified slang Hindi words used in pornographic UBE.
Methodology
The main steps involved in the methodology for the current work could be delineated as depicted in Figure 1. The first step is to collect e-mails. The second step reduces the corpus created at the end of the first step by selecting only UBE. We restricted the corpus only to UBE because for non-spam e-mails, the probability of finding pornographic Hindi words is very low, nearly negligible. Secondly, the size of the corpus containing UBE and non-spam e-mails was biased by the dominating number of non-spam e-mails, which were of no relevance to the current work. This motivated us for corpus size reduction by removal of non-spam e-mails.
Further, we decided to again refine the corpus by removal of non-pornographic UBE. Our motive behind doing this was the prevention of dilution of corpus by other types of UBE. Moreover, our belief was that pornographic UBE contains more Hindi words, compared to any other category of UBE. Secondly, we believed that pornographic UBE also contains more slang usage of the language than any other category of UBE. Our belief is supported by the following two points:
1. Number of slang words is highest in domains like crime, sex, violence and drugs (Astriyani et al., 2007 and YourDictionary, 2010).
2. Number of UBE dealing with crime, violence and drugs is less than the number of UBE dealing with sex.
Here it is noteworthy that as with the research work of Astriyani et al. (2007), we have also considered the terms 'sex' and 'pornography' synonymously.
Also, it is noteworthy that the term 'drugs' in the present context refers to addiction-inducing things like cocaine, brown sugar, etc. This means that the usage of the term 'drugs' in the present context is not a reference to medicinal drugs like Viagra and Cialis which are advertised through UBE. Also, Saini and Desai (2009) in their research work have found that the number of UBE dealing with sex or pornography is very high compared to UBE dealing with crime, violence and drugs.
Finally, we [...] language. As such, no UBE was found that could be termed pure-Hindi UBE; instead the [...]
Figure 1: Diagrammatic Representation of Methodology [flowchart; box labels legible in the source: Collect E-Mails; Select Pornographic UBE; Perform Tokenization; Remove ... and English (Dictionary and Non-Dictionary) Words; Retrieve Hindi Words]
English language. Moreover, Chinese, Russian and other languages were discarded. Hence, at the end of the third step, we had a text corpus consisting of only pornographic UBE. The size of this corpus was 1,854 spam e-mails.
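The three corpus-reduction steps described above can be sketched as successive filters. This is only an illustration: the paper does not state how the e-mails were labeled, so the `Email` record and its `is_ube`, `is_pornographic` and `language` fields are hypothetical, as is the sample data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Email:
    text: str
    is_ube: bool           # hypothetical label: True for spam (UBE)
    is_pornographic: bool  # hypothetical label: True for pornographic content
    language: str          # hypothetical label, e.g. "english", "chinese"


def refine_corpus(emails: List[Email]) -> List[Email]:
    """Apply the corpus-reduction steps in the order given in the text."""
    ube = [e for e in emails if e.is_ube]              # keep only UBE
    porn_ube = [e for e in ube if e.is_pornographic]   # keep only pornographic UBE
    # Discard mails in other languages such as Chinese and Russian. Hindi terms
    # were written in the English alphabet, so this sketch assumes such mails
    # carry the "english" label.
    return [e for e in porn_ube if e.language == "english"]


sample = [
    Email("meeting at noon", False, False, "english"),
    Email("cheap loans today", True, False, "english"),
    Email("adult offer desi", True, True, "english"),
    Email("adult offer", True, True, "chinese"),
]
refined = refine_corpus(sample)
print(len(refined))
```

Applying the filters in this order mirrors the text: each step only ever shrinks the corpus produced by the previous one.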
The basic model of free text consists of documents which are sequences of basic units called tokens. In English language, the tokens are words (Zhang, 2006) and the act of breaking the text into tokens is called tokenization. It is notable here that even the Hindi terms in the mails were worded using English alphabet. In order to make it easier for analysis and further processing, we performed tokenization on the corpus of pornographic UBE. This yielded a text corpus in the Bag of Words (BOW) form. In BOW representation of a text document, terms in the document are identified with the words in the document. Hence this representation is also called Set of Words (SoW) (Sebastiani, 2002). Tokenization was done in such a way that we could concentrate only on the unigrams.
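A minimal sketch of unigram tokenization and the resulting BOW and SoW views follows. The tokenizer used in the study is not described, so the regular expression (runs of Latin letters), the lower-casing and the two sample messages are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: a token is a run of Latin letters; the paper does not give its rule.
TOKEN_PATTERN = re.compile(r"[a-z]+")


def tokenize(text):
    """Split text into lower-cased unigram tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def bag_of_words(documents):
    """Count unigram occurrences over all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts


docs = ["Kya baat hai, call me now", "Call now! Desi kudi"]  # hypothetical messages
bow = bag_of_words(docs)  # word -> frequency
sow = set(bow)            # set-of-words view: presence only
print(sorted(sow))
```

Because the Hindi terms are written in the English alphabet, a single Latin-letter tokenizer covers both languages in this sketch.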
The BOW was further processed by removal of stop words from it. Sebastiani (2005) has defined Stop Words as topic neutral words such as articles and prepositions, which are eliminated in a preprocessing phase. Bharati et al. (2002) have defined them as the few words which have high frequencies in all the categories, and hence are irrelevant for the classification exercise. This processing of the BOW resulted in a vector consisting of 4,908 word entries. Further, the English dictionary words were also removed from the BOW. Here, it is noteworthy that by removal of English dictionary words we intended to create a vector consisting only of words which are not present in the English dictionary. Moreover, the treatment of text corpus was not case-sensitive and the tokens resulting from tokenization were not subjected to reduction based on their root forms. This means that we did not employ stemming and lemmatization techniques discussed by Zhang (2006).
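The two filtering passes can be illustrated with plain set operations. The stop-word list and English dictionary used in the study are not named, so the tiny `STOP_WORDS` and `ENGLISH_DICTIONARY` sets below are stand-ins, and the sample vocabulary is hypothetical. In keeping with the text, comparison is case-insensitive and no stemming or lemmatization is applied.

```python
# Illustrative stand-ins; the actual lists used in the study are not given.
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and"}
ENGLISH_DICTIONARY = {"call", "now", "shirt", "free"}


def remove_stop_words(vocabulary, stop_words=STOP_WORDS):
    """Lower-case every word, then drop the stop words."""
    return {w.lower() for w in vocabulary} - stop_words


def remove_dictionary_words(vocabulary, dictionary=ENGLISH_DICTIONARY):
    """Keep only words that are absent from the English dictionary."""
    return {w for w in vocabulary if w not in dictionary}


vocab = {"The", "call", "of", "kameez", "now", "gujju", "luv"}
after_stop = remove_stop_words(vocab)
non_dictionary = remove_dictionary_words(after_stop)
print(sorted(non_dictionary))
```

Note that an English slang word such as the hypothetical "luv" survives both passes, which is why a further removal of English slang is needed before only Hindi words remain.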
Results
The removal of stop words after tokenization of the text corpus consisting of 1,854 pornographic UBE resulted in 4,908 words. The removal of English dictionary (i.e., non-slang) words then resulted in a 202-rowed vector. This vector contained Hindi words as well as various English slang words. The vector was finally refined by removal of 115 English slang words, and the result was an 87-rowed vector consisting of only Hindi words. The 115 slang words are English words which are not in the English dictionary. The 87-rowed vector, obtained by applying the various processing steps described earlier, was ordered by sorting it on the words in lexicographic manner. We call this vector the 'vector of Hindi words' or the 'Hindi Words Vector (HWV)'. The list of HWV is presented in Table 1.
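The counts reported here can be cross-checked with simple arithmetic, and the lexicographic ordering can be illustrated on a few words. The figure of 4,706 English dictionary words is the one stated in the Conclusion; the variable names and the four-word sample are chosen only for this sketch, and case-insensitive sorting is an assumption.

```python
words_after_stop_word_removal = 4908
english_dictionary_words = 4706  # figure stated in the Conclusion
english_slang_words = 115

# 4,908 words minus the dictionary words leaves the non-dictionary vector.
non_dictionary_words = words_after_stop_word_removal - english_dictionary_words
# Removing the English slang words leaves only the Hindi words.
hindi_words = non_dictionary_words - english_slang_words

# Lexicographic ordering, shown on a small sample of words from the HWV.
sample = ["Kya", "Aao", "Desi", "Babu"]
hwv_sample = sorted(sample, key=str.lower)

print(non_dictionary_words, hindi_words, hwv_sample)
```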
Table 1: List of Identified Hindi Words in UBE Corpus
S. No.  Hindi Words   S. No.  Hindi Words   S. No.  Hindi Words   S. No.  Hindi Words
1.      Aao           23.     Hoon          45.     Kum           67.     Nazook
2.      Abey          24.     Hota          46.     Kya           68.     Neechey
3.      Adaab         25.     Inispector    47.     Kyaa          69.     Noor
4.      Arey          26.     Ise           48.     Kyun          70.     Pallu
5.      Babu          27.     Isi           49.     Laaga         71.     Pardhe
6.      Bagh          28.     Iski          50.     Maar          72.     [illegible]
7.      Bbi           29.     Jal           51.     Malayali      73.     Saath
8.      Bbs           30.     Jalsa         52.     Mallu         74.     Sab
9.      Bebo          31.     Jwab          53.     Masala        75.     Sabse
10.     Bhabhi        32.     Kaam          54.     Mein          76.     [illegible]
11.     Cheez         33.     Kahan         55.     Meine         77.     Saree
12.     Chunari       34.     Kaiku         56.     Meinen        78.     Sunehari
13.     Deedar        35.     Kal           57.     Mera          79.     T[illegible]
14.     Desi          36.     Kameez        58.     Meri          80.     T[illegible]
15.     Dupatta       37.     Kar           59.     Milega        81.     Tereku
16.     Firangi       38.     Karke         60.     Mujra         82.     [illegible]
17.     Gaand         39.     Karthey       61.     Naa           83.     Wahaan
18.     Gora          40.     Kaun          62.     Naam          84.     Woh
19.     Gujju         41.     Khola         63.     Nahin         85.     Yaad
20.     Haar          42.     Koi           64.     Nai           86.     Zabardasti
21.     Hai           43.     Kuch          65.     Nain          87.     Zara
22.     Hain          44.     Kudi          66.     Nakko
The HWV provided the key input for further analysis of Hindi words. As is evident from Table 1, even though we analyzed pornographic UBE, the Hindi words were not all specific to pornographic domain. All the words in the HWV were Hindi words, but it presented a mixed collection of different kinds of Hindi words. This is so because it is difficult to exactly define a 'pornographic Hindi word' and a 'non-pornographic Hindi word'. Even if these are defined, a finite degree of ambiguity will always persist between the two categories of words. This is so because, we believe that it is the usage of the word that modifies its behavior and makes it to be categorized in either of the categories. We argue that Hindi language just provides words and it is the context that makes them pornographic. Even though one could certainly argue for some Hindi words from Table 1 to be categorized as pornographic and others to be categorized as non-pornographic, most of the instances from HWV are pornographic because they are used here in that context. For example, kameez (i.e., shirt), salwar (i.e., trouser/pantaloon), zabardasti (i.e., forcefully) and khola (i.e., opened/stripped) are all non-pornographic words, but their usage sequence in pornographic context makes them dubious. Hence, in order to have a more generalized classification, we did not create three categories for pornographic, non-pornographic and words with ambiguity. We addressed the problem of ambiguity by taking into play the role of 'context'. Our view is supported by Bookrags (2010) where it is mentioned that slang refers to informal words and expressions, but it has more to do with what is appropriate in a particular context. Wikipedia Pornography (2010) also states that the very definition of what is or is not pornography is context-specific. They add that this definition has differed in different historical, cultural and national contexts.
Slang terms often originate with the purpose of saving keystrokes and behave like Internet shorthand (Wikipedia Internet-slang, 2010). From the HWV we also found that many slang words have been compressed or abbreviated. For instance, gujju and mallu are used instead of 'Gujarati' and 'Malayali', respectively. An important finding obtained from HWV, however, is that extra keystrokes have also been used for creation of slang words. For instance, bbi, meinen and inispector are used instead of bb (i.e., wife), meine (myself) and inspector, respectively.
Finally, from the HWV, we can see that in addition to words created by
shortening or lengthening the standard Hindi words, the slang words have also
been coined by the users. The words like kaiku and nakko for kyon (i.e., why)
and nahin (i.e., no) respectively are examples of this.
Conclusion
We analyzed 1,854 pornographic UBE messages, which after removal of stop words resulted in 4,908 unique words. Out of this, 4,706 were English dictionary words and 115 were English non-dictionary (i.e. slang) words. The processing of the
original text corpus through tokenization, BOW and multiple refinements, hence,
resulted in identification of 87 unique Hindi words from the pornographic domain.
We conclude that Hindi words may be used for portraying pornography depending on the context of usage because in a different framework the same word or SOW may not exhibit pornography at all. Hence, language just provides words and it is the context that makes them behave differently. It is also concluded that the Hindi words in pornographic UBE are created either by saving the keystrokes or by adding extra keystrokes to the usual words. Based on all the findings and results, we conclude that most of the Hindi words used in the e-mails have come from the language otherwise spoken by the Indian natives. Hence, we conclude that the number of Hindi words slanged through e-mails or electronic communication in general, is less than those slanged in English language. In other words, English language shows the trend wherein words evolve through the online communications and then are added to the offline general conversation, whereas the reverse thing happens with Hindi language. Here the words seem to first come up in conversation and then drip into the online communication.
The results, findings and inferences provided here are based on the data used for analysis. We do not argue in favor or otherwise of use of Hindi words for pornographic portrayal. We just report the results based on the identification of Hindi words in text corpus created from existing pornographic UBE. To the best of our knowledge and based on the review of related literature, we conclude that identification of Hindi words in pornographic UBE is a new concept. Our work is an insight into the study of context-based slang usage of Hindi language for messaging in general and spamming in the pornographic domain in particular. Our current findings reveal the influence of Hindi language on the online behavior of user. These could also be put to use for language-neutral text based content filters for UBE.
References
1. Astriyani, Sutjiati R and Purwaningsih D E (2007), An Analysis of Slang Language Related to Sex in Eminem's Rap Songs' Lyrics, Repository of Gunadarma University, ISSN: 1987-4783, Jakarta.
2. Bharati A, Varanasi K, Kamisetty C et al. (2002), "A Document Space Model for Automated Text Classification Based on Frequency Distribution Across Categories", in Proceedings of International Conference on NLP-2002, Recent Advances in NLP, Vikas Publishing House, New Delhi.
3. Bookrags (2010), "Slang: The Primary English Curriculum", available at http://www.bookrags.com/Slang/tandf/slang-11-tf
4. Goswami S, Sarkar S and Rustagi M (2009), "Stylometric Analysis of Bloggers' Age and Gender", in Proceedings of the 3rd International AAAI Conference on Weblogs and Social Media (ICWSM-2009), San Jose, California.
5. Krasny M (2000), "Analysis: Usage of Slang Words", article from Talk of the Nation (NPR), August 7, 2000, available at http://www.highbeam.com/doc/1P1-30383388.html
6. Kucukyilmaz T, Cambazoglu B B, Aykanat C and Can F (2006), "Chat Mining for Gender Prediction", Lecture Notes in Computer Science, Springer Berlin, Heidelberg, Vol. 4243/2006, pp. 274-283, ISSN: 0302-9743.
7. Kucukyilmaz T, Cambazoglu B B, Aykanat C and Can F (2008), "Chat Mining: Predicting User and Message Attributes in Computer-Mediated Communication", Information Processing and Management: An International Journal, Vol. 44, No. 4, pp. 1448-1466, ISSN: 0306-4573.
8. McAfee Inc. (2008), "Current Spam Categories", available at http://www.mcafee.com/us/threat-center/anti-spam/spam-categories.html, December 23, 2008.
9. Saini J R and Desai A A (2009), "Self Learning Taxonomical Classification System Using Vector Space Document Analysis Model for Web Text Mining in UBE", Ph.D. Thesis, accepted by Department of Computer Science, Veer Narmad South Gujarat University, Surat, Gujarat, India, September.
10. Sebastiani F (2002), "Machine Learning in Automated Text Categorization", ACM Computing Surveys, Vol. 32, No. 1, pp. 1-47, ISSN: 0360-0300.
11. Sebastiani F (2005), "Text Categorization", in Alessandro Zanasi (Ed.), Text Mining and Its Applications, pp. 109-129, WIT Press, Southampton, UK.
12. Thorne T (2004), "Slang, Style-Shifting and Sociability", Multicultural Perspectives on English Language and Literature, Tallinn/London, available at http://www.kcl.ac.uk/content/1/c6/03/08/16/Slang,%20Style-shifting%20and%20Sociability.doc
13. Wikipedia Internet-slang (2010), "Internet-slang", Wikimedia Foundation Inc., available at http://en.wikipedia.org/wiki/internet-slang
14. Wikipedia Pornography (2010), "Pornography", Wikimedia Foundation Inc., available at en.wikipedia.org/wiki/pornography
15. YourDictionary Website (2010), "The American Slang Dictionary", available at http://www.yourdictionary.com/dictionary-articles/American-Slang-Dictionary.html
16. Zhang T (2006), "Predictive Methods for Text Mining", Machine Learning Summer School - 2006, Taipei, available at videolectures.net/mlss06tw_zhang_pmtm
Reference # 08J-2011-05-04-01