REPRESENTING TEXT DOCUMENTS IN TRAINING DOCUMENT SPACES: A NOVEL MODEL FOR DOCUMENT REPRESENTATION


1 ASMAA MOUNTASSIR, 2 HOUDA BENBRAHIM, 3 ILHAM BERRADA

1,2,3 ALBIRONI Research Team, ENSIAS, Mohamed 5 University, Souissi, Rabat, Morocco

E-mail: [email protected], [email protected], [email protected]

ABSTRACT

In this paper, we propose a novel model for Document Representation in an attempt to address the problems of huge dimensionality and vector sparseness that are commonly faced in Text Classification tasks. The proposed model consists of representing text documents in the space of training documents at a first stage. Afterward, the generated vectors are projected into a new space where the number of dimensions corresponds to the number of categories. To evaluate the effectiveness of our model, we focus on a problem of binary classification. We conduct our experiments on Arabic and English data sets of Opinion Mining. We use as classifiers Support Vector Machines (SVM) and k-Nearest Neighbors (k-NN), which are known for their effectiveness in classical Text Classification tasks. We compare the performance of our model with that of the classical Vector Space Model (VSM) by considering three evaluative criteria, namely the dimensionality of the generated vectors, the time (for learning and testing) taken by the classifiers, and the classification results in terms of accuracy. Our experiments show that the effectiveness of our model (in comparison with the classical VSM) depends on the classifier used. Results yielded by k-NN when applying our model are better than or comparable to those obtained when applying the classical VSM. For SVM, results yielded when applying our model are, in general, slightly lower than those obtained when using VSM. However, the gain in terms of time and dimensionality reduction is promising since both are dramatically decreased by the application of our model.


Keywords: Document Representation, Text Classification, Opinion Mining, Machine Learning, Natural Language Processing.

1. INTRODUCTION

With the increasing amount of available text documents in digital form (either on the web or in databases), the need to automatically organize and classify these documents becomes more important and, at the same time, more challenging. Text Classification techniques are used in a wide range of domains, among them Categorization by Topic [27], Opinion Mining [1], Recommendation Systems [15], Question Answering [40] and Spam Detection [25].

Automated Text Classification (TC) is a supervised learning task that consists of assigning pre-defined category labels to new documents (called test documents) on the basis of the likelihood suggested by labeled documents (called training documents). A growing number of machine learning methods have been applied to this problem, including Naïve Bayes, Decision Trees, Support Vector Machines, and k-Nearest Neighbors [31].

Since text documents cannot be directly interpreted by such learning algorithms, we need to represent them by the use of the Vector Space Model (VSM) [30]. VSM consists of generating for each document its corresponding feature vector.

Given a feature set, and for a given document, the generated vector assigns to each feature its weight with respect to the document. A feature weight measures how important the feature is for the document. At this stage, two issues are to be addressed. The first issue is how to build the feature set. The second issue is how to weight these features.

One conventional way to construct the feature set is to consider documents as "bags-of-words", where the features are obtained by extracting the single words that appear in the documents [18][20]. One drawback of this model is that it ignores both the combination and the order of words within the documents. There have been attempts to overcome this problem by considering phrases [18][7], word senses [10] and multi-words [41]. Other works try to enrich the feature set by including other kinds of features such as parts of speech [20], semantic [12], syntactic [1], stylistic [1], and morphological [3] features.

Concerning the weighting issue, various weighting schemes can be found; the most used ones are presence-based and frequency-based weightings [17]. We can also find the popular TF-IDF weighting (aka


Term Frequency–Inverse Document Frequency) and its different variants, which are studied by Paltoglou & Thelwall [19].
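For reference, a common formulation of TF-IDF (one variant among those reviewed in [19]; the exact variant used in any given study may differ) weights a term $t$ in a document $d$ as

$$\mathrm{tf\text{-}idf}(t,d) = \mathrm{tf}(t,d) \times \log\frac{N}{\mathrm{df}(t)},$$

where $\mathrm{tf}(t,d)$ is the frequency of $t$ in $d$, $\mathrm{df}(t)$ is the number of documents containing $t$, and $N$ is the total number of documents in the collection.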

The major problem faced when applying VSM to text documents is the huge dimensionality of the features extracted from the training documents. Consequently, we may face the problem of vector sparseness. There have been attempts to overcome this problem by applying several techniques to reduce, in a selective manner, the dimensionality of the features. Among these techniques, we find some standard ones such as stemming [33], stop word removal [28] and term-frequency thresholding [16]. We also find more sophisticated techniques, called Feature Selection algorithms, including Chi-Square, Information Gain and Mutual Information [6][38]. These techniques seek to identify, among the native features, the most relevant ones and to use them to generate the feature vectors of text documents. Nevertheless, even with the application of such techniques, the dimensionality remains high, especially when we deal with large data sets. Such huge dimensionality makes some classifiers intractable. Moreover, it makes classification tasks very expensive in terms of both memory and time. Note that even if we proceed to an aggressive feature reduction, it is likely that we will face the problem of over-fitting. These issues are the main motivation behind the suggestion of a novel model to represent text documents differently.

The major goal of the present paper is to remedy, on the one hand, the problem related to the huge dimensionality of the vectors generated by VSM and, on the other hand, the problem of vector sparseness. We propose a novel two-stage model to represent text documents. The first step of this model consists of generating new vectors for documents by representing them in the space of training documents. The second step uses the norms of these generated vectors to project documents into a one-dimensional space. We evaluate our model on a problem of Opinion Mining [22], which corresponds to a problem of binary classification where the aim is to classify opinionated documents as either positive or negative. We conduct our experiments on four Arabic data sets and an English data set. These data sets are collections of comments or reviews written by ordinary internet users on the web. We use two standard classification algorithms known for their effectiveness in Text Classification, namely Support Vector Machines [35] and k-Nearest Neighbors [5]. We evaluate the proposed model by comparing its effectiveness with that of the


classical document representation based on VSM. The effectiveness is measured in terms of three criteria. The first criterion is related to the dimensionality of the generated vectors. The second one deals with the time taken by each classifier for learning and testing. The third criterion corresponds to the obtained classification results in terms of accuracy. We specify that we focus, in this study, on reducing the dimensionality and the time required for classification rather than on enhancing classification performance. In other words, we do not primarily seek to get better classification performance with regard to the use of the classical representation.

The remainder of this paper is organized as follows. In the second section, we describe the classical document preprocessing and representation. The third section gives more details about the proposed model. In the fourth section we present the data collections that we use. The fifth section describes our experiments as well as the obtained results. The last section concludes the paper and presents our future work.

2. CLASSICAL DOCUMENT PREPROCESSING AND REPRESENTATION

Before we can use Machine Learning techniques to classify a set of text documents, it is necessary to first preprocess these documents and then map them onto a vector space. The first step is called document preprocessing. The second step is known as document representation. We give below more details about these two steps.

2.1 Classical Document Preprocessing

Document preprocessing consists of cleaning and normalizing text documents to prepare them for the classification step. We present below some common tasks of the preprocessing phase. We illustrate them with Arabic examples rather than English ones, as the processing of Arabic is more complex than that of English.

Tokenization: This task transforms a text document into a sequence of tokens separated by white spaces. This transformation also includes the removal of punctuation marks, numbers and special characters. We can tokenize a given text into either single words or phrases depending on the adopted model. The most common model is the bag-of-words, which consists of splitting a given text into single words. We can also use the n-gram word model [32]. By definition, an n-gram word is a sequence of n words in a given document. If n is equal to 1, we talk about unigrams; if it is equal to 2, they are called bigrams; and so on (see the code sketch at the end of this subsection). For example,


in the sentence «لقد كان المسلسل رائعا» (which can be translated as "the series was wonderful"), the unigrams that we can extract are «لقد», «كان», «المسلسل» and «رائعا» (for the translated sentence, the unigrams are respectively "the", "series", "was" and "wonderful"). The bigrams that we can find are «لقد كان», «كان المسلسل» and «المسلسل رائعا» (for the translated sentence, the bigrams are respectively "the series", "series was", and "was wonderful").

Segmentation: This process is specific to the Arabic language. It consists of separating a word from its clitics (i.e. proclitics and enclitics) and the determiner Al [2]. Clitics are a kind of affixes that can be attached to Arabic words; they can be prepositions, conjunctions, future markers, etc. For example, the segmentation of the attached word «وسيصبح» (which is equivalent in English to the phrase "and it will become") gives «يصبح» (i.e. it becomes), while the segmentation of the attached word «كتبه» (which is equivalent to the English phrase "he wrote it") gives «كتب» (i.e. he wrote). The first example illustrates proclitic removal while the second one is a case of enclitic removal.

Stemming: Word stemming is a crude pseudo-linguistic process which removes suffixes to reduce words to their word stem [33]. For example, the words 'classifier', 'classified' and 'classifying' would all be reduced to the word stem 'classify'. Consequently, the dimensionality of the feature space can be reduced by mapping morphologically similar words onto their word stem. A widely applied stemming algorithm is the suffix stripper developed by Porter [24]. In the Arabic language, there are two different morphological analysis techniques, namely stemming and light-stemming. While stemming reduces the word to its stem, light-stemming removes common affixes from the word without reducing it to its stem [11]. If we apply stemming to the words «الكتاب» (the book), «المكتبة» (the library) and «المكتب» (the desk), we will obtain the same stem «كتب» (to write). Nevertheless, the application of light-stemming on the words «الكتب» (the books) and «مكتبان» (two desks) will give respectively

«كتاب» (a book) and «مكتب» (a desk). The main idea behind using light-stemming [9] is that many word variants do not have similar meanings or semantics although these word variants are generated from the same root.

Stop Word Removal: Typically, stop words refer to function words such as articles, prepositions, conjunctions, and pronouns, which provide structure in language rather than content [28]. Such words do not have an impact on category discrimination.

Term Frequency Thresholding: This process consists of eliminating words whose frequencies are either above a pre-specified upper threshold or below a pre-specified lower threshold. This process helps to enhance classification performance since terms that rarely appear in a document collection have little discriminative power and can be eliminated [34]. Likewise, high-frequency terms are assumed to be common and thus not to have discriminative power either. We specify that there is no theoretical rule to set the threshold; it is chosen empirically and hence its value depends on the experimental environment.
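As an illustration of the tokenization, frequency-thresholding and presence-based weighting steps described above, here is a minimal sketch. It is written in Python for readability only (the authors' experiments rely on Weka, not on this code); the regular expression, thresholds and toy documents are illustrative assumptions.

```python
import re
from collections import Counter

def tokenize(text, n=1):
    """Split a text into lowercase word tokens and return its n-grams.

    Punctuation, numbers and special characters are dropped, as in the
    preprocessing step above. n=1 gives unigrams, n=2 bigrams, and so on.
    """
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def build_feature_set(documents, min_freq=2, max_freq=None, n=1):
    """Keep only terms whose collection frequency lies within the thresholds."""
    counts = Counter(t for doc in documents for t in tokenize(doc, n))
    return sorted(t for t, c in counts.items()
                  if c >= min_freq and (max_freq is None or c <= max_freq))

def presence_vector(document, feature_set, n=1):
    """Binary (presence-based) weighting: 1 if the feature occurs in the document."""
    tokens = set(tokenize(document, n))
    return [1 if f in tokens else 0 for f in feature_set]

# Toy usage (illustrative documents, not taken from the paper's data sets).
docs = ["the series was wonderful", "the series was boring", "a wonderful series"]
features = build_feature_set(docs, min_freq=2)
vectors = [presence_vector(d, features) for d in docs]
```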

2.2 Classical Document Representation

The classical document representation consists of mapping each document onto a vector representation. These vectors are generated by the use of VSM. The terms (single words or phrases) retained after preprocessing are called features (or index terms). For a given document, its corresponding vector is obtained by computing the weight of each term with respect to that document.

Let $\{f_i\}$ be the feature set, and $n_i(d)$ be the weight of feature $f_i$ with regard to document d. Each document is mapped onto the vector $d := \{n_i(d)\}$. Basically, the dimensionality of the native feature space is very high; it can reach tens or hundreds of thousands of terms even for a moderate-sized data set [38]. This is a serious problem in classification tasks since it presents three major drawbacks. First, it makes some classifiers intractable because of their complexity [6]. Second, it affects classification effectiveness since most of the native features are noisy or do not carry information useful for classification [14]. Third, it makes the classification task very expensive. This is why it is essential to eliminate useless features and hence reduce the dimensionality. We refer to the task of Dimensionality Reduction,


which consists of several techniques. Besides the common methods (stemming, stop word removal and thresholding), there are several feature selection algorithms that compute the "goodness" of each feature in order to evaluate how significant it is. Thanks to these algorithms, we can perform aggressive dimensionality reduction without losing classification effectiveness [38].

3. NOVEL MODEL FOR DOCUMENT REPRESENTATION

As mentioned in the introduction, the main idea behind the proposed document representation model is to represent each text document by a vector in the space of training documents of each category. Consequently, each document has as many vectors as the number of categories. In the following, we give further details on the proposed model. For simplification, we consider a problem of binary-class text classification.

Our model consists first of splitting the training set D into two sub-sets D+ and D- which include respectively the positive and the negative training documents. Given the feature set that we extract as usual from the training documents, we aim to build for each feature the two vectors v+ and v- that represent it in D+ and D- respectively. Note that the dimensionality of v+ corresponds to the number of positive training documents (i.e. the cardinality of D+); the same holds for v-. For a feature f, its corresponding v+ is obtained as follows: the i-th component of v+ is the weight of f with regard to the i-th document of D+. This weight has the value 1 if f appears in that document, and 0 otherwise. The vector v- is obtained likewise. After building v+ and v- for all features, we proceed to build for each document its corresponding V+ and V-. For a document d, we obtain its V+ by summing the vectors v+ that correspond to all the features which appear in d. In other words, we consider just the features that occur in d, then we sum their associated v+ vectors. The resulting vector is called V+ and is associated to document d. We proceed likewise to associate the vector V- to document d. Note that this representation process is applied to both training and test documents.

This novel representation, which consists of generating for each document its corresponding V+ and V-, helps to overcome the problem of huge dimensionality since the number of training (positive and negative) documents is much lower than the number of features extracted from the training documents. Moreover, the generated vectors are considerably less sparse than the


vectors commonly generated in the feature space.

As we can notice, for a document d (either training or test), its corresponding V+ is used to represent d in the positive training document space: V+ shows how many features are shared between document d and each positive training document. Likewise, its corresponding V- is used to represent d in the negative training document space: V- shows how many features are shared between document d and each negative training document. We give below the algorithm of the proposed model.

Input:
    Number of positive training documents n+
    Number of negative training documents n-
    Number of test documents M
    Number of features N in the training documents
    Positive training set D+ = {d+_i} (i = 1..n+)
    Negative training set D- = {d-_i} (i = 1..n-)
    Test set T = {t_i} (i = 1..M)
    Feature set F = {f_i} (i = 1..N)

for each feature f ∈ F
    v+(f) = (w_f(d+_1), ..., w_f(d+_n+))    // w_f(d+_i) = 1 if f appears in d+_i, 0 otherwise
    v-(f) = (w_f(d-_1), ..., w_f(d-_n-))    // w_f(d-_i) = 1 if f appears in d-_i, 0 otherwise
end for

for each document d ∈ D
    V+(d) = Σ_{i=1..N} δ_i(d) · v+(f_i)
    V-(d) = Σ_{i=1..N} δ_i(d) · v-(f_i)     // δ_i(d) = 1 if f_i appears in d, 0 otherwise
end for

for each document d ∈ T
    V+(d) = Σ_{i=1..N} δ_i(d) · v+(f_i)
    V-(d) = Σ_{i=1..N} δ_i(d) · v-(f_i)     // δ_i(d) = 1 if f_i appears in d, 0 otherwise
end for
end algorithm
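Although the authors implement the representation inside Weka, the construction of V+ and V- reduces to simple matrix operations. The following Python fragment is only an illustration of the algorithm above; `doc_term_matrix`, `pos_index` and `neg_index` are assumed inputs (a binary document-term matrix over both training and test documents, and the row indices of the positive/negative training documents).

```python
import numpy as np

def training_space_vectors(doc_term_matrix, pos_index, neg_index):
    """Map each document onto the positive and negative training-document spaces.

    doc_term_matrix : (n_docs, n_features) binary presence matrix.
    pos_index, neg_index : row indices of the positive / negative training documents.

    Row d of V_pos counts, for every positive training document, how many
    features it shares with document d (and likewise for V_neg), which is
    exactly the V+ / V- construction described in the algorithm above.
    """
    X = np.asarray(doc_term_matrix)
    v_pos = X[pos_index]          # v+(f) for all features: rows are positive training docs
    v_neg = X[neg_index]          # v-(f) for all features: rows are negative training docs
    V_pos = X @ v_pos.T           # shape (n_docs, n+)
    V_neg = X @ v_neg.T           # shape (n_docs, n-)
    return V_pos, V_neg
```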

After that representation, we compute for each document (either training or test) the Lp norms of its V+ and V- following the formula:

$$\|X\|_p = \left(\sum_{i} |x_i|^p\right)^{1/p}$$

Note that the degree p of the computed norms can be considered as a parameter of the novel model. Its value is to be set empirically. We specify that, for a document d, the norm of its V+ can be viewed as a way to quantify how much it resembles the positive training documents. Likewise, the norm of its V- reflects to what degree document d is similar to the negative training documents.

To each document d, we associate as score the ratio of ||V+(d)|| and


||V-(d)||. The objective is to represent all documents (training and test) in a one-dimensional space by their associated scores. Afterward, we can apply classification algorithms such as Support Vector Machines and k-Nearest Neighbors together with the new representation model.
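A companion fragment (same assumptions as the previous sketch) computes the Lp norms and the one-dimensional score; p = 2 is the value the authors report as best in Section 5. The resulting scores are the single feature on which SVM or k-NN is then trained.

```python
import numpy as np

def score(V_pos, V_neg, p=2, eps=1e-12):
    """One-dimensional representation: ||V+(d)||_p / ||V-(d)||_p for every document."""
    norm_pos = np.linalg.norm(V_pos, ord=p, axis=1)
    norm_neg = np.linalg.norm(V_neg, ord=p, axis=1)
    return norm_pos / (norm_neg + eps)   # eps guards against an all-zero V-(d)
```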

4. DATA COLLECTIONS

4.1 Arabic Data Sets

We use three Arabic data sets built from Aljazeera's website forums (www.aljazeera.net) by Mountassir et al. [17][18]. We call these data sets DS1, DS2 and DS3. DS1 [18] consists of 468 movie-review comments about one historical series; it contains 184 positive documents vs. 284 negative ones. DS2 [17] is a collection of 1003 comments about 18 sport issues; it consists of 486 positive documents vs. 517 negative ones. DS3 [18] is a collection of 611 comments about one political issue; it consists of 149 positive documents and 462 negative documents. These three data sets were labeled manually by one annotator. Note that DS1 and DS3 are unbalanced data sets; this problem is tackled in Section 5.

We also use the Opinion Corpus for Arabic (OCA, available at http://sinai.ujaen.es/wiki/index.php/OCA_Corpus_%28English_version%29) built by Rushdi-Saleh et al. [26]. It consists of 500 movie reviews collected from several Arabic blog sites and web pages. The distribution of documents is 250 positive documents vs. 250 negative ones. This data set is labeled automatically on the basis of rating systems.

4.2 English Data Sets

We use as English data set the polarity dataset v2.0 (available at http://www.cs.cornell.edu/people/pabo/movie-review-data/) built by Pang and Lee [21]. It consists of 2000 movie reviews collected from the IMDb website (www.imdb.com). The distribution of documents is 1000 positive documents vs. 1000 negative ones. This data set is also labeled automatically on the basis of rating systems. In the following, we call this data set IMDB.

In Table 1, we summarize the language and the structure of the data sets used. We show for each data set the distribution of each class as well as the total number of documents.

Table 1: Structure of the studied data sets.

Language | Data Set | POS  | NEG  | Total
Arabic   | DS1      | 184  | 284  | 468
Arabic   | DS2      | 486  | 517  | 1003
Arabic   | DS3      | 149  | 462  | 611
Arabic   | OCA      | 250  | 250  | 500
English  | IMDB     | 1000 | 1000 | 2000



5. EXPERIMENTS

This section describes the experiments that we conduct to evaluate the performance of the proposed model. First, we present the experimental design by giving some details about the data set processing (sampling and balancing methods), document preprocessing, the parameter of our model, the classification algorithms and the significance test that we use. Afterward, we present and discuss the different results.

5.1 Experimental Design

5.1.1 Data Set Sampling

We resample our data sets by randomly splitting them into two sets, where 75% represents the training part and 25% represents the test part. This split is stratified with regard to both category and feature distribution. We repeat this split 25 times so as to generate 25 samples for each data set.
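A rough way to mimic this protocol, assuming scikit-learn and stratification by class only (stratifying by feature distribution, as the authors also do, would need additional logic), is:

```python
from sklearn.model_selection import StratifiedShuffleSplit

def resample(X, y, n_samples=25, test_size=0.25, seed=0):
    """Yield 25 stratified 75/25 train/test index splits, as in the protocol above."""
    splitter = StratifiedShuffleSplit(n_splits=n_samples, test_size=test_size,
                                      random_state=seed)
    for train_idx, test_idx in splitter.split(X, y):
        yield train_idx, test_idx

# The final result for a data set is the average of the per-sample accuracies.
```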

For each data set, we conduct the experiments on all its samples and we record the result obtained on each sample. The final result that we associate to the data set is obtained by averaging the results of its samples.

5.1.2 Data Set Balancing

As two of the studied data sets (DS1 and DS3) are unbalanced, and as we use Support Vector Machines (SVM) and k-Nearest Neighbors (k-NN) as classifiers, we have to equalize the class distribution of the unbalanced data sets. Indeed, Mountassir et al. [18] study SVM and k-NN in the context of unbalanced data sets; they conclude that these two classifiers are sensitive to class imbalance and hence require balancing the data sets to achieve the best results.

To balance DS1 and DS3, we use an under-sampling method proposed by Mountassir et al. [18] and called Remove by Clustering. This method consists of applying clustering to the majority-class documents in order to identify the center of each cluster. The goal is to keep, among the majority-class documents, just the identified centers, since they can represent the whole majority class.
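As an illustration only (the exact Remove by Clustering procedure is defined in [18]), an under-sampling of this kind could be sketched with k-means, keeping the document closest to each centroid:

```python
import numpy as np
from sklearn.cluster import KMeans

def remove_by_clustering(X_majority, n_keep, seed=0):
    """Under-sample the majority class: keep one representative document per cluster."""
    X_majority = np.asarray(X_majority)
    km = KMeans(n_clusters=n_keep, n_init=10, random_state=seed).fit(X_majority)
    kept = []
    for c in range(n_keep):
        members = np.where(km.labels_ == c)[0]
        # Keep the member closest to the cluster centroid as its representative.
        dists = np.linalg.norm(X_majority[members] - km.cluster_centers_[c], axis=1)
        kept.append(members[np.argmin(dists)])
    return sorted(kept)
```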

We point out that this balancing is applied only to the training set of each data set; the test sets are left as they are. Training sets need to be balanced in order not to bias the classifiers' learning, but test sets are not modified so as to test the classifiers on sets representing the reality [18].

5.1.3 Preprocessing

The preprocessing of our data sets consists of removing from the text documents all punctuation marks, special characters and numbers. As stemming process, we apply the stemmer of Khoja and Garside [11] to the Arabic documents, and the Porter stemmer (http://snowball.tartarus.org/algorithms/porter/stemmer.html)


[24] to the English documents. We use, as weighting scheme, binary weights, which are based on term presence. For the Arabic data sets, we remove from the feature space the terms that occur only once in the data set. For the English data set, we remove from the native features the terms whose frequencies are less than or equal to 6; recall that the English data set is much larger than the Arabic ones. We specify that these thresholds are chosen empirically so as to reduce the native dimensionality and, at the same time, to get better classification results. Indeed, since our documents are written by ordinary internet users, this thresholding may help us to clean the documents from, among other things, the typing errors made by these users. For all our data sets, we adopt the standard bag-of-words model. Finally, we note that we do not remove stop words.

5.1.4 Classification Algorithms

The present study focuses on one-label binary classification where each document is assigned one of the two categories, POSITIVE or NEGATIVE. We use the data mining package Weka [37] to perform our preprocessing and classification tasks. The proposed representation is implemented within this toolkit. We use as classification algorithms Support Vector Machines (SVM) and k-

Nearest Neighbors (k-NN). These algorithms have been shown to be effective for several Text Classification tasks [20][39]. For the SVM classifier, we employ a normalized polynomial kernel with Sequential Minimal Optimization (SMO) [23]. Concerning the k-NN classifier, we use a linear search. When applying the classical document representation, we use a cosine-based distance [29]; but when applying our model (i.e. when the dimensionality is very small), we use the Euclidean distance. Among the tested odd values of k (which range from 1 to 31 in the present study), we choose those that achieve the best results; the selected values vary depending on the data set.

For the parameter p (the degree of the norm) of the proposed model, we test several values and find that the value that yields the best classification results for all the data sets is 2.

5.1.5 Statistical Significance Test

To compare two algorithms, we use the paired t-test, which is a widely used statistical test within the Machine Learning community [4].

Assume that we have k samples for a given data set. Learning algorithms A and B are both trained on the training set, and the resulting classifiers are tested on the test set, of each sample i. Let


$p_i^A$ (respectively $p_i^B$) be the observed proportion of test documents correctly classified by algorithm A (respectively B) on sample i. If we assume that the k differences $p_i = p_i^A - p_i^B$ are drawn independently from a normal distribution, then we can apply Student's t-test by computing the statistic [8]:

$$t = \frac{\bar{p}\,\sqrt{k}}{\sqrt{\sum_{i=1}^{k}(p_i-\bar{p})^2 / (k-1)}}, \qquad \text{where } \bar{p} = \frac{1}{k}\sum_{i=1}^{k} p_i.$$

Under the null hypothesis (i.e. the two learning algorithms A and B have the same accuracy on the test set), this statistic has a t-distribution with k-1 degrees of freedom. For a two-sided test with a probability of incorrectly rejecting the null hypothesis of 0.05, the null hypothesis can be rejected if $|t| > t_{k-1,0.975}$. The value of $t_{k-1,0.975}$ is obtained from the t-table.

For our experiments, k = 25. So, according to the t-table, the null hypothesis is rejected when |t| > 2.064.
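For illustration, the test can be computed directly from the 25 paired per-sample accuracies; the accuracy arrays below are assumed inputs.

```python
import numpy as np

def paired_t(acc_a, acc_b, threshold=2.064):
    """Paired t-test on per-sample accuracies of algorithms A and B (k = 25 here)."""
    p = np.asarray(acc_a) - np.asarray(acc_b)           # differences p_i
    k = len(p)
    t = p.mean() * np.sqrt(k) / np.sqrt(p.var(ddof=1))  # statistic defined above
    return t, abs(t) > threshold                        # True: accuracies differ
```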

5.2 Results

As mentioned in the introduction, we evaluate our model by comparing its effectiveness with that of the classical document representation (which corresponds to the standard VSM)

and by considering three evaluative criteria. The first criterion concerns the dimensionality of the vectors generated by each representation method. In Table 2, we show, for each data set, the dimensionality of the vectors generated by the application of each representation method, namely the classical representation (which is based on representing documents in the feature space) and our representation (which is based on representing documents in the training document spaces). Note that the dimensionality in the classical representation corresponds to the size of the dictionary (i.e. the number of features), while the dimensionality in our representation corresponds to the number of training (positive and negative) documents.

Our goal, in considering this criterion, is to highlight the contribution of our representation to reducing the dimensionality of the generated vectors (even though these vectors are not actually used in classification, since they are projected into a one-dimensional space) without using any feature selection method. We also aim, with that table, to give an insight into the complexity related to each representation.

As we can see from Table 2, the dimensionality reduction is considerable: the dimensionality under the new representation ranges


from 6.3% (for OCA) to 20.8% (for IMDB) of the original dimensionality. So it is obvious that our representation is more effective than the classical one with respect to the first criterion, related to dimensionality.

Table 2: Dimensionality of the generated vectors by the application of each representation.

Data Set | Classical Representation | Novel Representation
DS1      | 2270 | 273
DS2      | 4417 | 750
DS3      | 3500 | 230
OCA      | 5978 | 376
IMDB     | 7203 | 1501

The second criterion that we consider to evaluate our model corresponds to the time taken by each classification algorithm for learning and testing. In Table 3, we show for each classifier (SVM and k-NN) the time (in seconds) that it takes following the application of each representation on each data set. Our goal is to compare the time required by each classification algorithm when using each of the two representations. We specify that we use, for our experiments, a machine with 1.83 GHz CPUs and 2 GB of memory.

Table 3: Time taken for learning and classification (in seconds) by each classifier under each representation.

Data Set | Classical SVM | Classical k-NN | Novel SVM | Novel k-NN
DS1      | 9    | 457    | 1.4 | 2.8
DS2      | 85   | 7325   | 0.5 | 12
DS3      | 12   | 846    | 0.4 | 2.9
OCA      | 80   | 4016   | 0.6 | 3.4
IMDB     | 3194 | 142023 | 1.5 | 58

It is clear from Table 3 that the application of our approach leads to a considerable time reduction for the two classifiers on all data sets. The rate of reduction is striking: on IMDB, the required time drops to roughly 0.04% of the original time for both SVM and k-NN. So, as a second conclusion, we can say that our representation model is highly effective in comparison with the classical method when considering the time taken by the classifiers for learning and testing.

The third and last criterion that we take into account while comparing our representation model with the classical one concerns the classification accuracy. In Table 4, we report for each classifier the classification result obtained, in terms of accuracy, following the application of each representation on each


data set. We also mention, for each result, the corresponding standard deviation, since the reported result (in each cell) represents the average of the results obtained on the 25 samples of each data set.

Table 4: Classification results in terms of accuracy for each classifier by the application of each representation.

Data Set | Classical SVM | Classical k-NN | Novel SVM   | Novel k-NN
DS1      | 71.8 ± 4.2    | 62.89 ± 5.5    | 70.71 ± 5   | 71.9 ± 3.8
DS2      | 71.30 ± 2.8   | 67.34 ± 2.3    | 66.72 ± 3.4 | 66.67 ± 3.3
DS3      | 65.1 ± 4.4    | 62.18 ± 5.2    | 63.38 ± 5.2 | 60.24 ± 6.1
OCA      | 90.25 ± 2.3   | 85.87 ± 3.1    | 88.13 ± 3.1 | 87.17 ± 2.7
IMDB     | 85.49 ± 1.2   | 78.6 ± 1.4     | 81.38 ± 1.7 | 81.36 ± 2

To compare the results reported in Table 4, we use the paired t-test. We report in Table 5 the decision of the comparison after computing the statistic t. Recall that we consider that two compared classifiers do not have the same accuracy when |t| > 2.064.

Table 5: Comparison of the algorithms using the t-test.

Data Set | Old SVM vs. New SVM | Old k-NN vs. New k-NN
DS1      | ~ | <
DS2      | > | ~
DS3      | ~ | >
OCA      | > | <
IMDB     | > | <

The decision that we report can be '>' (i.e. the first classifier is better than the second classifier), '<' (the first classifier is worse than the second classifier) or '~' (the first classifier is equivalent to the second classifier). We specify that, in Table 5, we denote by Old SVM the application of the classical representation together with SVM, while New SVM refers to the application of the new representation model together with SVM. The same holds for Old and New k-NN.

As we can see from Table 5, the application of our representation leads, in general, to a slight degradation of the performance of SVM. Indeed, Table 5 shows that


Old SVM outperforms New SVM in most cases (on DS2, OCA and IMDB); Old SVM and New SVM have the same accuracy on DS1 and DS3. By referring to Table 4, we can see that this degradation in SVM performance ranges from 2.12% to 4.58% in terms of accuracy. This is understandable since SVM is known for its insensitivity to feature dimensionality; in other words, the performance of SVM is not affected by the dimensionality of the vectors used. So, we can conclude that the application of our representation can lead to a slight loss of information when using SVM. Nevertheless, this is not the case with k-NN, since Table 5 shows that Old k-NN and New k-NN are typically competitive. Indeed, New k-NN outperforms Old k-NN on DS1, OCA and IMDB, while Old k-NN outperforms New k-NN only on DS3; the two classifiers have the same performance on DS2. So, we can say that we do not lose information by applying our representation when using k-NN. On the contrary, we can improve the performance of k-NN by up to 9.01% in accuracy. As a conclusion concerning classification accuracy, we can say that the effectiveness of our model depends on the classifier used: it is clearly effective if we use k-NN, and slightly less effective if we use SVM.

As a summary, we can say that our representation model is shown to be more effective than the classical representation in terms of dimensionality reduction and of the time taken by the classifiers for learning and testing. However, the effectiveness of our model in terms of classification accuracy depends on the classifier used. When we use SVM, we can lose some information by applying our method (in comparison with the classical representation); but when we use k-NN, we can improve the performance by applying the proposed representation model.

6. CONCLUSION AND FUTURE WORK

The goal of the study reported in this paper is to propose a novel model for Document Representation in an attempt to overcome the problems related to huge dimensionality and vector sparseness which are commonly faced in Text Classification problems. Instead of using vectors generated in the feature space, we propose to represent text documents in the training document spaces. We focus, in this study, on reducing the dimensionality and the time required for classification rather than on enhancing classification performance. We evaluate our model by comparing its effectiveness with that of the classical representation, which


corresponds to the Vector Space Model. We consider, for this comparison, three criteria, namely the dimensionality of the generated vectors, the time required by the classifiers for learning and testing, and the classification results in terms of accuracy. We consider a problem of binary classification. We use bilingual Opinion Mining data sets (four Arabic data sets and one English data set). We use as classification algorithms Support Vector Machines and k-Nearest Neighbors. We use the paired t-test to perform a statistically significant comparison between the classifiers. Our results show that the proposed model dramatically outperforms the classical representation in terms of dimensionality reduction and of the time taken by the classifiers. However, considering the criterion related to classification performance, the effectiveness of our model depends on the classification algorithm used. Indeed, when using Support Vector Machines, we record a slight degradation of performance with the application of our model; however, the performance of k-Nearest Neighbors can be significantly improved when we apply the proposed method.

We have several future directions. First, we aim to improve our model so as to avoid the eventual loss of information when using Support Vector Machines. We also aim to extend our approach to multi-class problems; we will conduct experiments on problems of Classification by Topic, where there are many categories. We also seek to propose a novel classification method that fits better with the new representation and that can outperform the tested Support Vector Machines and k-Nearest Neighbors. Finally, we look forward to modifying the proposed approach so as to overcome the problem of unbalanced data sets, which is commonly encountered in Text Classification problems.

REFERENCES:

[1] A. Abbasi, H. Chen, and A. Salem, "Sentiment Analysis in Multiple Languages: Feature Selection for Opinion Classification in Web Forums," ACM Trans. Inf. Syst., 26, 2008, pp. 1–34.

[2] M. Abdul-Mageed, M. T. Diab, and M. Korayem, "Subjectivity and Sentiment Analysis of Modern Standard Arabic," in Proc. ACL 2011, Portland, Oregon, USA, 2011.

[3] M. Abdul-Mageed, S. Kuebler, and M. Diab, "SAMAR: A System for Subjectivity and Sentiment Analysis of Social Media Arabic," Proceedings of the 3rd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA), ICC Jeju, Republic of Korea, 2012.

[4] R. R. Bouckaert, "Choosing between two learning algorithms based on calibrated tests," in ICML'03, 2003.

[5] B. V. Dasarathy, "Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques," McGraw-Hill Computer Science Series. Los Alamitos, California: IEEE Computer Society Press, 1991.


[6] M. Dash, and H. Liu, "Feature selection for classification," Intelligent Data Analysis 1(3), 1997.

[7] K. Dave, S. Lawrence, and D. M. Pennock, "Mining the peanut gallery: Opinion extraction and semantic classification of product reviews," in Proceedings of the 12th International Conference on the World Wide Web, 2003, pp. 519-528.

[8] T. G. Dietterich, "Approximate statistical tests for comparing supervised classification learning algorithms," Neural Computation, v. 10, n. 7, 1998, pp. 1895-1923.

[9] R. Duwairi, M. Al-Refai, and N. Khasawneh, "Stemming Versus Light Stemming as Feature Selection Techniques for Arabic Text Categorization," 4th Int. Conf. on Innovations in Information Technology IIT'07, 2007.

[10] A. Kehagias, V. Petridis, V. G. Kaburlasos, and P. Fragkou, "A comparison of word and sense-based text categorization using several classification algorithms," Journal of Intelligent Information Systems, Vol. 21, 2000, pp. 227-247.

[11] S. Khoja, and R. Garside, "Stemming Arabic text," Computer Science Department, Lancaster University, Lancaster, 1999.

[12] S.-M. Kim, and E. Hovy, "Determining the sentiment of opinions," in Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004), Geneva, Switzerland, 2004, pp. 1367–1373.

[13] A. C. König, and E. Brill, "Reducing the human overhead in text categorization," in Proceedings of the 12th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Philadelphia, Pennsylvania, USA, 2006, pp. 598–603.

[14] D. D. Lewis, "Representation and Learning in Information Retrieval," Ph.D. thesis, Department of Computer and Information Science, University of Massachusetts, USA, 1992.

[15] Y. Li, J. Nie, Y. Zhang, B. Wang, B. Yan, and F. Weng, "Contextual recommendation based on text mining," in Proceedings of the 23rd International Conference on Computational Linguistics: Posters, 2010, pp. 692–700.

[16] H. P. Luhn, "The Automatic Creation of Literature Abstracts," 1958, pp. 118–125.

[17] A. Mountassir, H. Benbrahim, and I. Berrada, "A cross-study of Sentiment Classification on Arabic corpora," in Research and Development in Intelligent Systems XXIX, M. Bramer and M. Petridis, Editors, Springer London, 2012, pp. 259-272.

[18] A. Mountassir, H. Benbrahim, and I. Berrada, "An empirical study to address the problem of Unbalanced Data Sets in Sentiment Classification," in Proc. of the IEEE International Conference on Systems, Man and Cybernetics (SMC'12), Seoul, Korea, 2012, pp. 3280-3285.

[19] G. Paltoglou, and M. Thelwall, "A study of Information Retrieval weighting schemes for sentiment analysis," in Proc. of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, 2010, pp. 1386-1395.

[20] B. Pang, L. Lee, and S. Vaithyanathan, "Thumbs up? Sentiment classification using machine learning techniques," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2002, pp. 79-86.

[21] B. Pang, and L. Lee, "A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts," in Proceedings of the 42nd Annual Meeting of the ACL (ACL'04), Barcelona, Spain, July 2004.

[22] B. Pang, and L. Lee, "Opinion Mining and Sentiment Analysis," Now Publishers Inc, 2008.

[23] J. Platt, "Fast training of SVMs using sequential minimal optimization," in B. Scholkopf, C. Burges, and A. Smola (Eds.), Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 1999.

[24] M. F. Porter, "An algorithm for suffix stripping," Program 14(3), 1980, pp. 130–137.

[25] M. Prilepok, T. Jezowicz, J. Platos, and V. Snasel, "Spam detection using compression and PSO," Computational Aspects of Social Networks (CASoN), 2012 Fourth International Conference on, 2012, pp. 263-270.

[26] M. Rushdi-Saleh, M. T. Martin-Valdivia, L. A. Urena-Lopez, and J. M. Perea-Ortega, "Bilingual Experiments


with an Arabic-English Corpus for Opinion Mining," in Proc. of Recent Advances in Natural Language Processing 2011, Hissar, Bulgaria.

[27] M. K. Saad, and W. Ashour, "Arabic Morphological Tools for Text Mining," 6th ArchEng Int. Symposiums, EEECS'10, the 6th Int. Symposium on Electrical and Electronics Engineering and Computer Science, European University of Lefke, Cyprus, 2010.

[28] M. Sahami, "Using Machine Learning to Improve Information Access," Ph.D. thesis, Department of Computer Science, Stanford University, 1998.

[29] G. Salton, and M. McGill, "Modern Information Retrieval," New York: McGraw-Hill, 1983.

[30] G. Salton, "Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer," Reading, Pennsylvania: Addison-Wesley, 1989.

[31] F. Sebastiani, "Machine learning in automated text categorization," ACM Comput. Surv., Volume 34, Number 1, 2002.

[32] C. Shannon, "A mathematical theory of communication," Bell System Technical Journal, 27, 1948.

[33] A. F. Smeaton, "Information retrieval: Still butting heads with natural language processing," in Information Extraction, International Summer School SCIE 1997, Springer-Verlag, 1997, pp. 115–139.

[34] C. J. Van Rijsbergen, "Information Retrieval," (2nd ed.). London: Butterworths, 1979.

[35] V. Vapnik, "The Nature of Statistical Learning," Springer-Verlag, 1995.

[36] C. Whitelaw, N. Garg, and S. Argamon, "Using appraisal groups for sentiment analysis," in Proceedings of the 14th ACM Conference on Information and Knowledge Management, 2005, pp. 625-631.

[37] I. H. Witten, and E. Frank, "Data Mining: Practical machine learning tools and techniques," 2nd Edition, Morgan Kaufmann, San Francisco, California, 2005.

[38] Y. Yang, and J. O. Pedersen, "A comparative study on feature selection in text categorization," in Proceedings of the Fourteenth International Conference, 1997.

[39] Y. Yang, "An evaluation of statistical approaches to text categorization," Inform. Retr. 1, 1–2, 1999.

[40] H. Yu, and V. Hatzivassiloglou, "Towards answering opinion questions: Separating facts from opinions and identifying the polarity of opinion sentences," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2003.

[41] W. Zhang, T. Yoshida, and X. Tang, "Text classification based on multi-words with support vector machine," Knowledge-Based Systems, Vol. 21, 2008, pp. 879-886.
