A Propositional Approach to Textual Case Indexing

Nirmalie Wiratunga¹, Rob Lothian¹, Sutanu Chakraborti¹ and Ivan Koychev²

¹ School of Computing, The Robert Gordon University
Aberdeen AB25 1HG, Scotland, UK
nw|rml|[email protected]

² Institute of Mathematics and Informatics, Bulgarian Academy of Science
Sofia 1113, Bulgaria
[email protected]

Appears in Proceedings of the 9th European Conference on Principles and Practice of KDD, pp. 380-391, Copyright © 2005 Springer (www.springer.de). All rights reserved.
Abstract. Problem solving with experiences that are recorded in text form requires a mapping from text to structured cases, so that case comparison can provide informed feedback for reasoning. One of the challenges is to acquire an indexing vocabulary to describe cases. We explore the use of machine learning and statistical techniques to automate aspects of this acquisition task. A propositional semantic indexing tool, PSI, which forms its indexing vocabulary from new features extracted as logical combinations of existing keywords, is presented. We propose that such logical combinations correspond more closely to natural concepts and are more transparent than linear combinations. Experiments show PSI-derived case representations to have superior retrieval performance to the original keyword-based representations. PSI also has comparable performance to Latent Semantic Indexing, a popular dimensionality reduction technique for text, which unlike PSI generates linear combinations of the original features.
1 Introduction
Discovery of new features is an important pre-processing step for textual data. This process is commonly referred to as feature extraction, to distinguish it from feature selection, where no new features are created [18]. Feature selection and feature extraction share the aim of forming better dimensions to represent the data. Historically, there has been more research work carried out in feature selection [9, 20, 16] than in extraction for text pre-processing applied to text retrieval and text classification tasks. However, combinations of features are better able to tackle the ambiguities in text (e.g. synonyms and polysemes) that often plague feature selection approaches. Typically, feature extraction approaches generate linear combinations of the original features. The strong focus on classification effectiveness alone has increasingly justified these approaches, even though their black-box nature is not ideal for user interaction. This argument applies even more strongly to combinations of features using algebraic or higher mathematical functions. When feature extraction is applied to tasks such as helpdesk systems, medical or law document management, email management or even Spam filtering, there is often a need for user interaction to guide retrieval or to support incremental query elaboration. The primary communication mode between system and user has the extracted features as vocabulary. Hence, these features should be transparent as well as providing good dimensions for classification.
The need for features that aid user interaction is particularly strong in the field of Case-Based Reasoning (CBR), where transparency is an important element during retrieval and reuse of solutions to similar, previously solved problems. This view is reinforced by research presented at a mixed-initiative CBR workshop [2]. The indexing vocabulary of a CBR system refers to the set of features that are used to describe past experiences to be represented as cases in the case base. Vocabulary acquisition is generally a demanding knowledge engineering task, even more so when experiences are captured in text form. Analysis of text typically begins by identifying keywords with which an indexing vocabulary is formed at the keyword level [14]. It is here that there is an obvious opportunity to apply feature extraction for index vocabulary acquisition, with a view to learning transparent and effective textual case representations.
The focus of this paper is extraction of features to automate acquisition of index vocabulary for knowledge reuse. Techniques presented in this paper are suited for applications where past experiences are captured in free text form and are pre-classified according to the types of problems they solve. We present a Propositional Semantic Indexing (PSI) tool, which extracts interpretable features that are logical combinations of keywords. We propose that such logical combinations correspond more closely to natural concepts and are more transparent than linear combinations. PSI employs boosting combined with rule mining to encourage learning of non-overlapping (or orthogonal) sets of propositional clauses. A similarity metric is introduced so that textual cases can be compared based on similarity between extracted logical clauses. Interpretability of these logical constructs creates new avenues for user interaction and naturally leads to the discovery of knowledge. PSI's feature extraction approach is compared with the popular dimensionality reduction technique Latent Semantic Indexing (LSI), which uses singular value decomposition to extract orthogonal features that are linear combinations of keywords [7]. Case representations that employ PSI's logical expressions are more comprehensible to domain experts and end-users compared to LSI's linear keyword combinations. Ideally we wish to achieve this expressiveness without significant loss in retrieval effectiveness.
We first establish our terminology for feature selection and extraction, before describing how PSI extracts features as logical combinations. We then describe LSI, highlighting the problem of interpretability with linear combinations. Finally we show that PSI's approach achieves comparable retrieval performance yet remains expressive.
2 Feature Selection and Extraction
Consider the hypothetical example in Figure 1 where the task is to weed out Spam from legitimate email related to AI. To assist with future message filtering these messages must be mapped onto a set of cases before they can be reused. We will refer to the set of
[Figure 1: a proof tree combining keywords (e.g. "intelligent", "algorithm", "grant", "application", "debt-free") into extracted features.]
Fig. 1. Features as logical keyword combinations.
all labelled documents (cases) as D. The keyword-vector representation is commonly used to represent a document d by considering the presence or absence of words [17]. Essentially the set of features W is the set of words (e.g. "conference", "intelligent"). Accordingly a document d is represented as a pair (x, c), where x = (x_1, ..., x_|W|) is a binary valued feature vector corresponding to the presence or absence of words in W; and c is d's class label.

Feature selection reduces |W| to a smaller feature subset size m [20]. Information Gain (IG) is often used for this purpose, where the m features with highest IG are retained and the new binary-valued feature vector x' is formed with the reduced word vocabulary set W', where W' ⊂ W and |W'| = m ≪ |W|. The new representation of document d with W' is a pair (x', c). Selection using IG is the baseline algorithm in this paper and is referred to as BASE. An obvious shortcoming of BASE is that it fails to ensure selection of non-redundant keywords. Ideally we want W' to contain features that are representative but also orthogonal. A more serious weakness is that BASE's one-to-one feature-word correspondence operates at a lexical level, ignoring underlying semantics.
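The BASE selection step can be sketched as follows. This is a minimal illustration, not the authors' implementation: documents are represented as keyword sets, and the top-m greedy cut-off and helper names are our assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    if not labels:
        return 0.0
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, word):
    """IG of splitting the collection on presence/absence of `word`;
    each document is represented as a set of keywords."""
    present = [l for d, l in zip(docs, labels) if word in d]
    absent = [l for d, l in zip(docs, labels) if word not in d]
    n = len(labels)
    remainder = (len(present) / n) * entropy(present) \
              + (len(absent) / n) * entropy(absent)
    return entropy(labels) - remainder

def base_select(docs, labels, m):
    """BASE: retain the m words with highest IG."""
    vocab = set().union(*docs)
    return sorted(vocab, key=lambda w: information_gain(docs, labels, w),
                  reverse=True)[:m]
```

Note how a word such as "grant" that occurs in both classes gets zero gain, illustrating why IG alone cannot capture context-dependent keywords.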
Figure 1 illustrates a proof tree showing how new features can be extracted to capture keyword relationships using propositional disjunctive normal form clauses (DNF clauses). When keyword relationships are modelled, ambiguities in text can be resolved to some extent. For instance "grant" and "application" capture semantics akin to legitimate messages, while the same keyword "application" in conjunction with "debt-free" suggests Spam messages.
Feature extraction, like selection, also reduces |W| to a smaller feature subset size m. However, unlike selected features, extracted features no longer correspond to presence or absence of single words. Therefore, with extracted features the new representation of document d is (x'', c), but W'' ⊄ W. When extracted features are logical combinations of keywords as in Figure 1, then a new feature f''_i ∈ W'' represents a propositional clause. For example the new feature f''_1 represents the clause: "intelligent" ∨ "algorithm" ∨ ("grant" ∧ "application").
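A DNF feature of this kind is straightforward to realise as a predicate over a document's keyword set. The list-of-conjunctions representation below is our sketch, not the paper's data structure:

```python
def dnf_feature(clause):
    """clause: a list of disjuncts, each disjunct a list of keywords
    forming a conjunction. Returns a binary feature over keyword sets."""
    def feature(doc):
        # the clause is satisfied if any conjunction is fully contained in doc
        return int(any(all(w in doc for w in conj) for conj in clause))
    return feature

# f''_1 = "intelligent" OR "algorithm" OR ("grant" AND "application")
f1 = dnf_feature([["intelligent"], ["algorithm"], ["grant", "application"]])
```

A document mentioning "grant" alone does not fire the feature; it needs "application" as well, which is exactly the context-sensitivity the proof tree encodes.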
3 Propositional Semantic Indexing (PSI)
PSI discovers and captures underlying semantics in the form of propositional clauses. PSI's approach is two-fold. Firstly, decision stumps are selected by IG and refined by association rule mining, which discovers sets of Horn clause rules. Secondly, a boosting process encourages selection of non-redundant stumps. The PSI feature extraction algorithm and the instantiation of extracted features appear at the end of this section after a description of the main steps.
3.1 Decision Stump Guided Extraction
A decision stump is a one-level decision tree [12]. In PSI, a stump is initially formed using a single keyword, which is selected to maximise IG. An example decision stump formed with "conference" in its decision node appears in Figure 2. It partitions documents into leaves, based on whether or not "conference" appears in them. For instance 70 documents contain the word "conference" and just 5 of these are Spam (i.e. +5). It is not uncommon for documents containing "conference" to still be semantically similar to those not containing it. So documents containing "workshop" without "conference" in the right leaf are still contextually similar to those containing "conference" in the left leaf. A generalised decision node has the desired effect of bringing such semantically related documents closer [19]. Generalisation refines the decision node formed with a single feature f_i, to an extracted feature f''_i, containing a propositional clause. Typically a refined node results in an improved split (see right stump in Figure 2).
[Figure 2: a decision stump splitting documents on "conference" (left) and, after node generalisation, a stump whose decision node contains the clause "conference" ∨ "intelligence" ∨ "workshop" ∨ ... with an improved split (right).]
Fig. 2. Node Generalisation.
A propositional clause is formed by adding disjuncts to an initial clause containing just the selected feature f_i. Each disjunct is a conjunction of one or more keyword co-occurrences with similar contextual meaning to that of f_i. An exhaustive search for disjuncts will invariably be impractical. Fortunately the search space can be pruned by using f_i as a handle over this space. Instead of generating and evaluating all disjuncts, we generate propositional Horn clause rules that conclude f_i given other logical keyword combinations.
[Figure 3: five association rules concluding "conference" (with bodies such as "intelligence", "workshop" and "register"), and the candidate DNF clauses grown from their bodies, each annotated with its change in IG.]

Fig. 3. Growing clauses from selected rules.
3.2 Growing Clauses from Rules
Examples of five association rules concluding in "conference" (i.e. f_i) appear in Figure 3. These rules are of the form H ← B, where the rule body B is a conjunction of keywords, and the rule head H is a single keyword. These conjunctions are keyword combinations that have been found to co-occur with the head keyword. Rule bodies are a good source of disjuncts with which to grow our DNF clause f''_i, which initially contains only the selected keyword "conference". However, an informed selection strategy is necessary to identify those disjuncts that are good descriptors of underlying semantics.

The contribution of each disjunct to clause growth is measured by comparing the IG of f''_i with and without the disjunct (rule body) included in the DNF clause. Disjuncts that fail to improve IG are filtered out by the gain-filter. Those remaining are passed on to the gen-filter, where any specialised form of a disjunct with a lower IG compared to any one of its generalised forms is filtered out. The DNF clauses in Figure 3 show how each rule is converted into a potential DNF clause (the difference in IG, used for filtering, appears in brackets). The final DNF clause derived once the filtering step is completed is: "conference" ∨ "intelligence" ∨ "workshop" ∨ "register". We use the Apriori [1] association rule learner to generate feature extraction rules that conclude a selected f_i. Apriori typically generates many rules, but the filters are able to identify useful rules.
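The gain-filter step described above can be sketched greedily as below. This is an illustrative reading under stated assumptions: the Apriori mining and the gen-filter are omitted, rule bodies are taken as given, and the acceptance test simply re-uses IG over the clause-as-feature:

```python
import math
from collections import Counter

def entropy(labels):
    if not labels:
        return 0.0
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def clause_ig(clause, docs, labels):
    """IG of the binary feature 'document satisfies the DNF clause'."""
    sat = [l for d, l in zip(docs, labels)
           if any(all(w in d for w in conj) for conj in clause)]
    unsat = [l for d, l in zip(docs, labels)
             if not any(all(w in d for w in conj) for conj in clause)]
    n = len(labels)
    return entropy(labels) - (len(sat) / n) * entropy(sat) \
                           - (len(unsat) / n) * entropy(unsat)

def grow_clause(seed, rule_bodies, docs, labels):
    """Gain-filter sketch: add a mined rule body as a new disjunct only if
    it increases the IG of the growing clause (gen-filter omitted)."""
    clause = [[seed]]
    best = clause_ig(clause, docs, labels)
    for body in rule_bodies:
        candidate = clause + [body]
        gain = clause_ig(candidate, docs, labels)
        if gain > best:
            clause, best = candidate, gain
    return clause
```

A disjunct that pulls in documents of the wrong class (e.g. "viagra" into a "conference" clause) lowers the clause's IG and is rejected.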
3.3 Feature Extraction with Boosting
PSI's iterative approach to feature extraction employs boosted decision stumps (see Figure 4). The number of features to be extracted is determined by the vocabulary size. The general idea of boosting is to iteratively generate several (weak) learners, with each learner biased by the training error in the previous iteration [10]. This bias is expressed by modifying weights associated with documents. When boosted stumps are used for feature selection the new document distribution discourages selection of a redundant feature given the previously selected feature [6]. Here, with extracted features, unlike with single keyword-based features, we need to discourage discovery of an overlapping clause given the previously discovered clause. We achieve this by updating document weights in PSI according to the error of the decision stump created with the new extracted feature, f''_i, instead of f_i.
Algorithm: PSI
  W'' = ∅; n = |D|; vocabulary size = m
  initialise document weights to 1/n
  Repeat
    f_i = feature with highest IG
    W = W \ f_i
    f''_i = GROWCLAUSE(f_i, W)
    W'' = W'' ∪ f''_i
    stump = CREATESTUMP(f''_i)
    err = error(stump)
    update document weights using err
  Until (|W''| = vocabulary size)
  Return W''

Fig. 4. Feature Extraction with PSI.
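The exact weight-update rule is not spelled out in Figure 4; the loop below sketches one plausible AdaBoost-style realisation under our own assumptions: each stump simply predicts the value of its binary feature, and correctly classified documents are downweighted so that the next round favours a non-overlapping feature.

```python
def boosted_feature_rounds(feature_fns, docs, labels, m):
    """Sketch of PSI's outer loop (Fig. 4). feature_fns maps names to
    binary predictors doc -> 0/1; labels are 0/1. On each round we pick
    the feature with lowest weighted error, then downweight the documents
    it classified correctly (AdaBoost-style update, our assumption)."""
    weights = [1.0 / len(docs)] * len(docs)
    chosen = []
    pool = dict(feature_fns)
    for _ in range(m):
        if not pool:
            break
        def weighted_error(fn):
            return sum(w for w, d, l in zip(weights, docs, labels)
                       if fn(d) != l)
        name = min(pool, key=lambda k: weighted_error(pool[k]))
        fn = pool.pop(name)
        err = max(weighted_error(fn), 1e-9)
        beta = err / max(1.0 - err, 1e-9)  # < 1 when err < 0.5
        # shrink the weight of documents this stump got right
        weights = [w * (beta if fn(d) == l else 1.0)
                   for w, d, l in zip(weights, docs, labels)]
        z = sum(weights)
        weights = [w / z for w in weights]
        chosen.append(name)
    return chosen
```

After the first feature is chosen, the documents it already covers carry less weight, so a second feature covering the remaining documents wins the next round.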
3.4 Feature Instantiation
Once PSI has extracted new features, textual cases are mapped to a new representation. For a new feature f''_i, let C_i = ∨_j c_ij be its propositional clause, where c_ij = ∧_k w_ijk is the j-th conjunction in this clause. The new representation of document d = (x'', c) is obtained by:

  x''_i = max_j [ gain_inc(c_ij) × infer(c_ij) ]

Here gain_inc returns the increase in gain achieved by c_ij when growing C_i. Whether or not c_ij can be inferred (satisfied) from a document's initial representation d = (x, c) (i.e. using all features in W) is determined by:

  infer(c_ij) = 1 if (∧_k w_ijk) = True; 0 otherwise

The PSI-derived representation enables case comparison at a semantic (or conceptual) level, because instantiated features now capture the degree to which each clause is satisfied by documents. In other words, satisfaction of the same disjunct will contribute more towards similarity than satisfaction of different disjuncts in the same clause.
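Reading x''_i as the maximum over satisfied disjuncts, instantiation can be sketched as below; the gain values attached to each conjunction are placeholders, not figures from the paper:

```python
def instantiate(clause_with_gains, doc):
    """clause_with_gains: list of (conjunction, gain_increase) pairs for
    one extracted feature f''_i; doc is a keyword set. The feature value
    is the largest gain increase among the disjuncts the document
    satisfies, so two cases matching the same disjunct look more alike
    than two cases matching different disjuncts of the same clause."""
    return max((gain for conj, gain in clause_with_gains
                if all(w in doc for w in conj)), default=0.0)
```

A document satisfying only "workshop" gets that disjunct's gain, while one also containing "conference" gets the larger value, giving the graded, non-binary feature values the text describes.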
4 Latent Semantic Indexing (LSI)
LSI is an established method of feature extraction and dimension reduction. The matrix whose columns are the document vectors x_1, ..., x_|D|, known as the term-document matrix, constitutes a vector space representation of the document collection. In LSI, the term-document matrix is subjected to singular value decomposition (SVD).
[Figure 5: a hypothetical term-document matrix over 8 keywords (including "intelligent", "algorithm", "grant", "application", "conference", "debt-free" and "viagra"), with the values of the first two extracted features for an incoming test case.]

Fig. 5. Feature extraction using LSI.
The SVD extracts an orthogonal basis for this space, consisting of new features that are linear combinations of the original features (keywords). Crucially, these new features are ranked according to their importance. It is assumed that the k highest-ranked features contain the true semantic structure of the document collection and the remaining features, which are considered to be noise, are discarded. Any value of k less than the rank of the term-document matrix can be used, but good values will be much smaller than this rank and, hence, much smaller than either |W| or |D|. These features form a new lower-dimensional representation which has frequently been found to improve performance in information retrieval and classification tasks [18, 22]. Full technical details can be found in the original paper [7].
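The truncation itself is a few lines with a linear algebra library; the tiny term-document matrix below is an illustrative toy, not the paper's data:

```python
import numpy as np

def lsi_project(term_doc, k):
    """Truncated SVD (LSI): term_doc is a |W| x |D| term-document matrix.
    Returns k-dimensional document coordinates and, for interpretation,
    the keyword loadings (linear-combination weights) of each new feature."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    doc_coords = np.diag(s[:k]) @ Vt[:k, :]   # k x |D| document representation
    return doc_coords, U[:, :k]               # loadings: |W| x k

# 4 keywords x 4 documents, binary occurrences (toy example)
A = np.array([[1, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1]], dtype=float)
coords, loadings = lsi_project(A, 2)
```

Each column of `loadings` mixes all keywords with real-valued weights, which is precisely why the resulting features are hard to interpret compared to PSI's clauses.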
Figure 5 shows a hypothetical example from the AI-Spam domain (cf. Figure 1). The first extracted feature is a combination of "intelligent", "algorithm", "grant" and "application". Any document containing most of these is likely to be legitimate, so high values of this feature indicate non-Spam. The second feature has positive weights for "application", "debt-free" and "viagra" and a negative weight for "grant". A high value for this feature is likely to indicate Spam. The new features are orthogonal, despite having two keywords in common. The first feature has positive weights for both "grant" and "application", whereas the second has a negative weight for "grant". This shows how the modifying effect of "grant" on "application" might manifest itself in an LSI-derived representation. With a high score for the first extracted feature and a low score for the second, the incoming test case is likely to be classified as legitimate email.
LSI extracted features are linear combinations of typically very large numbers of keywords. In practice this can be in the order of hundreds or thousands of keywords, unlike in our illustrative example involving just 8 keywords. Consequently, it is difficult to interpret these features in a meaningful way. In contrast, a feature extracted by PSI combines far fewer keywords and its logical description of underlying semantics is easier to interpret. A further difference is that, although both PSI and LSI exploit word-word co-occurrences to discover and preserve underlying semantics, PSI also draws on word-class co-occurrences while LSI does not naturally exploit this information.
5 Evaluation
The goodness of case representations derived by BASE, LSI and PSI in terms of retrieval performance is compared on a retrieve-only CBR system, where the weighted majority vote from the 3 best matching cases is re-used to classify the test case. A modified case similarity metric is used so that similarity due to absence of words (or of words in linear combinations or in clauses) is treated as less important compared to their presence [19].
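The retrieve-only classifier and the presence-biased similarity can be sketched as follows; the 0.1 discount on shared absences is an illustrative value of our choosing, not the weighting used in [19]:

```python
def similarity(x, y):
    """Presence-biased overlap over binary feature vectors: shared
    presences count fully, shared absences are discounted (0.1 is an
    illustrative value, an assumption of this sketch)."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    neither = sum(1 for a, b in zip(x, y) if not a and not b)
    return both + 0.1 * neither

def classify(query, cases, k=3):
    """Retrieve-only CBR: weighted majority vote over the k best matches.
    cases is a list of (feature_vector, class_label) pairs."""
    ranked = sorted(cases, key=lambda c: similarity(query, c[0]),
                    reverse=True)[:k]
    votes = {}
    for vec, label in ranked:
        votes[label] = votes.get(label, 0.0) + similarity(query, vec)
    return max(votes, key=votes.get)
```

Weighting the vote by similarity means a single very close case can outvote two weakly matching neighbours.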
Experiments were conducted on 6 datasets; 4 involving email routing tasks and 2 involving Spam filtering. Various groups from the 20Newsgroups corpus of 20 Usenet groups [13], with 1000 postings (of discussions, queries, comments etc.) per group, form the routing datasets: SCIENCE (4 science related groups); REC (4 recreation related groups); HW (2 hardware problem discussion groups, one on Mac, the other on PC); and RELPOL (2 groups, one concerning religion, the other politics in the middle-east). Of the 2 Spam filtering datasets: USREMAIL [8] contains 1000 personal emails of which 50% are Spam; and LINGSPAM [16] contains 2893 messages from a linguistics mailing list of which 27% are Spam.
Equal-sized disjoint train-test splits were formed. Each split contains 20% of the dataset and also preserves the class distribution of the original corpus. All text was pre-processed by removing stop words (common words) and punctuation. Remaining words were stemmed to form W, where |W| varies from approximately 1,000 in USREMAIL to 20,000 in LINGSPAM. Generally, with both routing and filtering tasks, the overall aim is to assign incoming messages into appropriate groups. Hence, test set accuracy was chosen as the primary measure of the effectiveness of the case representation as a facilitator of case comparison. For each test corpus and each method, the accuracy (averaged over 15 trials) was computed for representations with 20, 40, 60, 80, 100 and 120 features.
Paired t-tests were used to find improvements by LSI and PSI compared to BASE (one-tailed test) and differences between LSI and PSI (two-tailed test), both at the 95% significance level. Precision¹ is an important measure when comparing Spam filters, because it penalises error due to false positives (Legitimate → Spam). Hence, for the Spam filtering datasets, precision was tested as well as accuracy.
5.1 Results
Accuracy results in Figure 6 show that BASE performs poorly with only 20 features, but gets closer to the superior PSI when more features are added. PSI's performance is normally good with 20 features and is robust to the feature subset size compared to both BASE and LSI. LSI clearly performs better for smaller sizes. This motivated an investigation of LSI with fewer than 20 features. We found that 10-feature LSI consistently outperforms 20-feature LSI and is close to optimal. Consequently, 10-feature LSI was used for the significance testing, in order to give a more realistic comparison with the other methods.
Table 1 compares performance of BASE and PSI (both 20 features) and LSI (10 features). Where LSI or PSI are significantly better than BASE, the results are in bold.
¹ Precision = TP/(TP+FP), where TP is the number of true positives and FP the number of false positives.
[Figure 6: accuracy against feature subset size (20 to 120 features) for BASE, LSI and PSI on each of the six datasets.]
Fig. 6. Accuracy results for datasets.
Table 1. Summary of significance testing for feature subset size 20 (10 for LSI).

        Routing: Accuracy                      Filtering: Accuracy (Precision)
Algo.   REC     HW      RELPOL   SCIENCE       USREMAIL       LINGSPAM
BASE    71.7    73.0    88.7     48.1          85.7 (89.5)    94.2 (92.0)
LSI     *78.7   65.5    90.4     *71.8         93.9 (*96.8)   96.8 (89.0)
PSI     76.2    *80.1   91.2     59.9          94.1 (95.2)    95.8 (*92.1)

Where LSI and PSI are significantly different, the better result is starred. It can be seen that LSI is significantly better than BASE on 6 of 8 measures and than PSI on 3. PSI is better than BASE on 7 measures and better than LSI on 2. We conclude that the 20-dimensional representations extracted by PSI have comparable effectiveness to the 10-dimensional representations extracted by LSI. Generally, BASE needs a much larger indexing vocabulary to achieve comparable performance. PSI works well with a small vocabulary of features, which are more expressive than LSI's linear combinations.
5.2 Interpretability
Figure 7 provides a high-level view of sample features extracted by PSI in the form of logical combinations (for 3 of the datasets). It is interesting to compare the differences in extracted combinations (edges), the contribution of keywords (ovals) to different extracted features (boxes) and the number of keywords used to form conjunctions (usually not more than 3). We see a mass of interconnected nodes with the HW dataset, on which PSI's performance was far superior to that of LSI. Closer examination of this dataset shows that there are many keywords that are polysemous given the two classes. For instance "drive" is applicable both to Macs and PCs, but combined with "vlb" indicates PC while with "syquest" indicates Mac. Unlike HW, the multi-class SCIENCE dataset contains several disjoint graphs each relating to a class, suggesting that these concepts are easily separable. Accuracy results show LSI to be a clear winner on SCIENCE. This further supports our observation that LSI operates best only in the absence of class-specific polysemous relationships. Finally, features extracted from LINGSPAM in Figure 7 show
[Figure 7: graphs of the logical combinations extracted from the SCIENCE, LINGSPAM and HW datasets, with keywords as ovals, extracted features as boxes and conjunctions as edges.]
Fig. 7. Logical combinations extracted from datasets.
that the majority of new features are single keywords rather than logical combinations. This explains BASE's good performance on LINGSPAM.

An obvious advantage of interpretability is knowledge discovery. Consider the SCIENCE tree: here, without the extracted clauses indicating that "msg", "food" and "chinese" are linked through "diet", one would not understand the meaning in context of a term such as "msg". Such proof trees, which are automatically generated by PSI, highlight relations between keywords; this knowledge can further aid with glossary generation, often a demanding manual task (e.g. FALLQ [14]). Case authoring is a further task that can benefit from PSI generated trees. For example disjoint graphs involving "electric" and "encrypt" with many fewer keyword associations may suggest the need for case creation or discovery in that area. From a retrieval standpoint, PSI generated features can be exploited to facilitate query elaboration within incremental retrieval systems. The main benefit to the user would be the ability to tailor the expanded query by deactivating disjuncts to suit retrieval needs. Since clauses are granular and can be broken down into semantically rich constituents, retrieval systems can gather statistics of which clauses worked well in the past, based on user interaction; this is difficult over linear combinations of features mined by LSI.
6 Related Work
Feature extraction is an important area of research, particularly when dealing with textual data. In Textual-CBR (TCBR) research the SMILE system provides useful insight into how machine learning and statistical techniques can be employed to reason with legal documents [3]. As in PSI, single keywords in decision nodes are augmented with other keywords of similar context. Unlike PSI, these keywords are obtained by looking up a manually created domain-specific thesaurus. Although PSI grows clauses by analysing keyword co-occurrence patterns, they can just as easily be grown using existing domain-specific knowledge. A more recent interest in TCBR involves extraction of features in the form of predicates. The FACIT framework involving semi-automated index vocabulary acquisition addresses this challenge, but also highlights the need for reliance on deep syntactic parsing and the acquisition of a generative lexicon, which warrants significant manual intervention [11].
In text classification and text mining research, there is much evidence to show that analysing keyword relationships and modelling them as rules is a successful strategy for text retrieval. A good example is RIPPER [5], which adopts complex optimisation heuristics to learn propositional clauses for classification. A RIPPER rule is a Horn clause rule that concludes a class. In contrast, PSI's propositional clauses form features that can easily be exploited by CBR systems to enable case comparison at a semantic level. Such comparisons can also be facilitated with the FEATUREMINE [21] algorithm, which also employs association rule mining to create new features based on keyword co-occurrences. FEATUREMINE generates all possible pair-wise keyword co-occurrences, converting only those that pass a significance test into new features. What is unique about PSI's approach is that, firstly, the search for associations is guided by an initial feature selection step; secondly, associations remaining after an informed filtering step are used to grow clauses; and, crucially, boosting is employed to encourage growing of non-overlapping clauses. The main advantage of PSI's approach is that instead of textual case similarity based solely on instantiated feature value comparisons (as in FEATUREMINE), PSI's clauses enable more fine-grained similarity comparisons. Like PSI, WHIRL [4] also integrates rules, resulting in a more fine-grained similarity computation over text. However these rules are manually acquired.
The use of automated rule learning in an Information Extraction (IE) setting is demonstrated by TEXTRISE, where mined rules predict text for slots based on information extracted over other slots [15]. The vocabulary is thus limited to template slot fillers. In contrast PSI does not assume knowledge of case structures and is potentially more useful in unconstrained domains.
7 Conclusion
A novel contribution of this paper is the acquisition of an indexing vocabulary in the form of expressive clauses, and a case representation that captures the degree to which each clause is satisfied by documents. The propositional semantic indexing tool, PSI, introduced in the paper, enables text comparison at a semantic, instead of a lexical, level. Experiments show that PSI's retrieval performance is significantly better than that of retrieval over keyword-based representations. Comparison of PSI-derived representations with the popular LSI-derived representations generally shows comparable retrieval performance. However, in the presence of class-specific polysemous relationships PSI is the clear winner. These results are very encouraging, because, although features extracted by LSI are rich mathematical descriptors of the underlying semantics in the domain, unlike PSI they lack interpretability. We note that PSI's reliance on class knowledge inevitably restricts its range of applicability. Accordingly, future research will seek to develop an unsupervised version of PSI.
Acknowledgements
We thank Susan Craw for helpful discussions on this work.
References
1. R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast discovery of association rules. In Advances in KD and DM, pages 307-327, 1995. AAAI/MIT.
2. D. Aha, editor. Mixed-Initiatives Workshop at 6th ECCBR, 2002. Springer.
3. S. Bruninghaus and K. D. Ashley. Bootstrapping case base development with annotated case summaries. In Proc of the 2nd ICCBR, pages 59-73, 1999. Springer.
4. W. W. Cohen. Providing database-like access to the web using queries based on textual similarity. In Proc of the Int Conf on Management of Data, pages 558-560, 1998.
5. W. W. Cohen and Y. Singer. Context-sensitive learning methods for text categorisation. ACM Transactions in Information Systems, 17(2):141-173, 1999.
6. S. Das. Filters, wrappers and a boosting based hybrid for feature selection. In Proc of the 18th ICML, pages 74-81, 2001. Morgan Kaufmann.
7. S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391-407, 1990.
8. S. J. Delany and P. Cunningham. An analysis of case-base editing in a spam filtering system. In Proc of the 7th ECCBR, pages 128-141, 2004. Springer.
9. G. Forman and I. Cohen. Learning with little: Comparison of classifiers given little training. In Proc of the 8th European Conf on PKDD, pages 161-172, 2004.
10. Y. Freund and R. Schapire. Experiments with a new boosting algorithm. In Proc of the 13th ICML, pages 148-156, 1996.
11. K. M. Gupta and D. W. Aha. Towards acquiring case indexing taxonomies from text. In Proc of the 17th Int FLAIRS Conference, pages 307-315, 2004. AAAI Press.
12. W. Iba and P. Langley. Induction of one-level decision trees. In Proc of the 9th Int Workshop on Machine Learning, pages 233-240, 1992.
13. T. Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF. Technical report, Carnegie Mellon University, CMU-CS-96-118, 1996.
14. M. Lenz. Defining knowledge layers for textual CBR. In Proc of the 4th European Workshop on CBR, pages 298-309, 1998. Springer.
15. U. Y. Nahm and R. J. Mooney. Mining soft-matching rules from textual data. In Proc of the 17th IJCAI, pages 979-984, 2001.
16. G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C. Spyropoulos, and P. Stamatopoulos. A memory-based approach to anti-spam filtering for mailing lists. Information Retrieval, 6:49-73, 2003.
17. G. Salton and M. J. McGill. An Introduction to Modern IR. McGraw-Hill, 1983.
18. F. Sebastiani. ML in automated text categorisation. ACM Computing Surveys, 34:1-47, 2002.
19. N. Wiratunga, I. Koychev, and S. Massie. Feature selection and generalisation for textual retrieval. In Proc of the 7th ECCBR, pages 806-820, 2004. Springer.
20. Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorisation. In Proc of the 14th ICML, pages 412-420, 1997. Springer.
21. S. Zelikovitz. Mining for features to improve classification. In Proc of Machine Learning, Models, Technologies and Applications, 2003.
22. S. Zelikovitz and H. Hirsh. Using LSI for text classification in the presence of background text. In Proc of the 10th Int Conf on Information and KM, 2001.