A Propositional Approach to Textual Case Indexing


Nirmalie Wiratunga, Rob Lothian, Sutanu Chakraborti and Ivan Koychev

School of Computing, The Robert Gordon University, Aberdeen AB25 1HG, Scotland, UK
nw|rml|[email protected]

Institute of Mathematics and Informatics, Bulgarian Academy of Science, Sofia 1113, Bulgaria
[email protected]

Appears in Proceedings of the 9th European Conference on Principles and Practice of KDD, pp. 380–391, Copyright © 2005 Springer (www.springer.de). All rights reserved.

Abstract. Problem solving with experiences that are recorded in text form requires a mapping from text to structured cases, so that case comparison can provide informed feedback for reasoning. One of the challenges is to acquire an indexing vocabulary to describe cases. We explore the use of machine learning and statistical techniques to automate aspects of this acquisition task. A propositional semantic indexing tool, PSI, which forms its indexing vocabulary from new features extracted as logical combinations of existing keywords, is presented. We propose that such logical combinations correspond more closely to natural concepts and are more transparent than linear combinations. Experiments show PSI-derived case representations to have superior retrieval performance to the original keyword-based representations. PSI also has comparable performance to Latent Semantic Indexing, a popular dimensionality reduction technique for text, which unlike PSI generates linear combinations of the original features.

1 Introduction

Discovery of new features is an important pre-processing step for textual data. This process is commonly referred to as feature extraction, to distinguish it from feature selection, where no new features are created [18]. Feature selection and feature extraction share the aim of forming better dimensions to represent the data. Historically, there has been more research work carried out in feature selection [9, 20, 16] than in extraction for text pre-processing applied to text retrieval and text classification tasks. However, combinations of features are better able to tackle the ambiguities in text (e.g. synonyms and polysemes) that often plague feature selection approaches. Typically, feature extraction approaches generate linear combinations of the original features. The strong focus on classification effectiveness alone has increasingly justified these approaches, even though their black-box nature is not ideal for user interaction. This argument applies even more strongly to combinations of features using algebraic or higher mathematical functions. When feature extraction is applied to tasks such as helpdesk systems, medical or law document management, email management or even Spam filtering, there is often a need for user interaction to guide retrieval or to support incremental query elaboration. The primary communication mode between system and user has the extracted features as vocabulary. Hence, these features should be transparent as well as providing good dimensions for classification.

The need for features that aid user interaction is particularly strong in the field of Case-Based Reasoning (CBR), where transparency is an important element during retrieval and reuse of solutions to similar, previously solved problems. This view is reinforced by research presented at a mixed-initiative CBR workshop [2]. The indexing vocabulary of a CBR system refers to the set of features that are used to describe past experiences to be represented as cases in the case base. Vocabulary acquisition is generally a demanding knowledge engineering task, even more so when experiences are captured in text form. Analysis of text typically begins by identifying keywords with which an indexing vocabulary is formed at the keyword level [14]. It is here that there is an obvious opportunity to apply feature extraction for index vocabulary acquisition, with a view to learning transparent and effective textual case representations.

The focus of this paper is extraction of features to automate acquisition of index vocabulary for knowledge reuse. Techniques presented in this paper are suited for applications where past experiences are captured in free text form and are pre-classified according to the types of problems they solve. We present a Propositional Semantic Indexing (PSI) tool, which extracts interpretable features that are logical combinations of keywords. We propose that such logical combinations correspond more closely to natural concepts and are more transparent than linear combinations. PSI employs boosting combined with rule mining to encourage learning of non-overlapping (or orthogonal) sets of propositional clauses. A similarity metric is introduced so that textual cases can be compared based on similarity between extracted logical clauses. Interpretability of these logical constructs creates new avenues for user interaction and naturally leads to the discovery of knowledge. PSI's feature extraction approach is compared with the popular dimensionality reduction technique Latent Semantic Indexing (LSI), which uses singular value decomposition to extract orthogonal features that are linear combinations of keywords [7]. Case representations that employ PSI's logical expressions are more comprehensible to domain experts and end-users compared to LSI's linear keyword combinations. Ideally we wish to achieve this expressiveness without significant loss in retrieval effectiveness.

We first establish our terminology for feature selection and extraction, before describing how PSI extracts features as logical combinations. We then describe LSI, highlighting the problem of interpretability with linear combinations. Finally we show that PSI's approach achieves comparable retrieval performance yet remains expressive.

2 Feature Selection and Extraction

Consider the hypothetical example in Figure 1 where the task is to weed out Spam from legitimate email related to AI. To assist with future message filtering these messages must be mapped onto a set of cases before they can be reused. We will refer to the set of all labelled documents (cases) as D. The keyword-vector representation is commonly used to represent a document d by considering the presence or absence of words [17]. Essentially the set of features is the set of words W (e.g. “conference”, “intelligent”). Accordingly a document d is represented as a pair (x, y), where x = (x_1, ..., x_|W|) is a binary-valued feature vector corresponding to the presence or absence of words in W; and y is d's class label.

Fig. 1. Features as logical keyword combinations.

Feature selection reduces |W| to a smaller feature subset size m [20]. Information Gain (IG) is often used for this purpose, where the m features with highest IG are retained and the new binary-valued feature vector x' is formed with the reduced word vocabulary set W', where W' ⊂ W and |W'| = m ≪ |W|. The new representation of document d with W' is a pair (x', y). Selection using IG is the base-line algorithm in this paper and is referred to as BASE. An obvious shortcoming of BASE is that it fails to ensure selection of non-redundant keywords. Ideally we want W' to contain features that are representative but also orthogonal. A more serious weakness is that BASE's one-to-one feature-word correspondence operates at a lexical level, ignoring underlying semantics.
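To make the BASE step concrete, the following sketch (ours, not the authors' code) scores binary keyword features by IG and keeps the m best; it assumes documents have already been encoded as rows of a binary matrix X.

    import numpy as np

    def information_gain(x, y):
        """IG of a binary feature x (presence/absence) with respect to labels y."""
        def entropy(labels):
            if len(labels) == 0:
                return 0.0
            _, counts = np.unique(labels, return_counts=True)
            p = counts / counts.sum()
            return -np.sum(p * np.log2(p))
        h_prior = entropy(y)
        h_cond = (len(y[x == 1]) * entropy(y[x == 1]) +
                  len(y[x == 0]) * entropy(y[x == 0])) / len(y)
        return h_prior - h_cond

    def base_select(X, y, m):
        """BASE: indices of the m keywords with highest IG (X is |D| x |W|, binary)."""
        gains = np.array([information_gain(X[:, j], y) for j in range(X.shape[1])])
        return np.argsort(gains)[::-1][:m]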

Figure 1 illustrates a proof tree showing how new features can be extracted to capture keyword relationships using propositional disjunctive normal form clauses (DNF clauses). When keyword relationships are modelled, ambiguities in text can be resolved to some extent. For instance “grant” and “application” capture semantics akin to legitimate messages, while the same keyword “application” in conjunction with “debt-free” suggests Spam messages.

Feature extraction, like selection, also reduces |W| to a smaller feature subset size m. However, unlike selected features, extracted features no longer correspond to presence or absence of single words. Therefore, with extracted features the new representation of document d is (x'', y), but W'' ⊄ W. When extracted features are logical combinations of keywords as in Figure 1, then a new feature w''_i ∈ W'' represents a propositional clause. For example the new feature w''_1 represents the clause: “intelligent” ∨ “algorithm” ∨ (“grant” ∧ “application”).
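As an illustration of how such a clause acts as a feature, the fragment below evaluates w''_1 against a document's binary keyword vector; the keyword-to-index mapping is hypothetical.

    # Hypothetical positions of the four keywords in the binary vector x.
    VOCAB = {"intelligent": 0, "algorithm": 1, "grant": 2, "application": 3}

    def clause_w1(x):
        """ "intelligent" OR "algorithm" OR ("grant" AND "application") """
        return (x[VOCAB["intelligent"]] == 1 or
                x[VOCAB["algorithm"]] == 1 or
                (x[VOCAB["grant"]] == 1 and x[VOCAB["application"]] == 1))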

3 Propositional Semantic Indexing (PSI)

PSI discovers and captures underlying semantics in the form of propositional clauses. PSI's approach is two-fold. Firstly, decision stumps are selected by IG and refined by association rule mining, which discovers sets of Horn clause rules. Secondly, a boosting process encourages selection of non-redundant stumps. The PSI feature extraction algorithm and the instantiation of extracted features appear at the end of this section after a description of the main steps.

3.1 Decision Stump Guided Extraction

A decision stump is a one-level decision tree [12]. In PSI, a stump is initially formed using a single keyword, which is selected to maximise IG. An example decision stump formed with “conference” in its decision node appears in Figure 2. It partitions documents into leaves, based on whether or not “conference” appears in them. For instance 70 documents contain the word “conference” and just 5 of these are Spam (i.e. +5). It is not uncommon for documents containing “conference” to still be semantically similar to those not containing it. So documents containing “workshop” without “conference” in the right leaf are still contextually similar to those containing “conference” in the left leaf. A generalised decision node has the desired effect of bringing such semantically related documents closer [19]. Generalisation refines the decision node formed with a single feature w_i, to an extracted feature w''_i, containing a propositional clause. Typically a refined node results in an improved split (see right stump in Figure 2).

Fig. 2. Node Generalisation.

A propositional clause is formed by adding disjuncts to an initial clause containing just the selected feature w_i. Each disjunct is a conjunction of one or more keyword co-occurrences with similar contextual meaning to that of w_i. An exhaustive search for disjuncts will invariably be impractical. Fortunately the search space can be pruned by using w_i as a handle over this space. Instead of generating and evaluating all disjuncts, we generate propositional Horn clause rules that conclude w_i given other logical keyword combinations.

Fig. 3. Growing clauses from selected rules.

3.2 Growing Clauses from Rules

Examples of five association rules concluding in “conference” (i.e. w_i) appear in Figure 3. These rules are of the form H ← B, where the rule body B is a conjunction of keywords, and the rule head H is a single keyword. These conjunctions are keyword combinations that have been found to co-occur with the head keyword. Rule bodies are a good source of disjuncts with which to grow our DNF clause w''_i, which initially contains only the selected keyword “conference”. However, an informed selection strategy is necessary to identify those disjuncts that are good descriptors of underlying semantics.

The contribution of each disjunct to clause growth is measured by comparing the IG of w''_i with and without the disjunct (rule body) included in the DNF clause. Disjuncts that fail to improve IG are filtered out by the gain-filter. Those remaining are passed on to the gen-filter, where any specialised form of a disjunct with a lower IG compared to any one of its generalised forms is filtered out. The DNF clauses in Figure 3 show how each rule is converted into a potential DNF clause (the difference in IG, used for filtering, appears in brackets). The final DNF clause derived once the filtering step is completed is: “conference” ∨ “intelligence” ∨ “workshop” ∨ “register”. We use the Apriori [1] association rule learner to generate feature extraction rules that conclude a selected w_i. Apriori typically generates many rules, but the filters are able to identify useful rules.
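A sketch of how clause growing might be implemented, under stated assumptions: each disjunct is a frozenset of keywords read as a conjunction, rule bodies come from an external miner such as Apriori, the gen-filter is approximated by skipping bodies that merely specialise an already accepted disjunct, and information_gain is the routine sketched in Section 2.

    import numpy as np

    def satisfies(doc, clause, vocab):
        """True if a binary keyword vector satisfies the DNF clause
        (a list of frozensets, each read as a conjunction of keywords)."""
        return any(all(doc[vocab[kw]] == 1 for kw in d) for d in clause)

    def clause_gain(clause, X, y, vocab):
        feat = np.array([int(satisfies(doc, clause, vocab)) for doc in X])
        return information_gain(feat, y)

    def grow_clause(seed, rule_bodies, X, y, vocab):
        """Grow a DNF clause from the seed keyword using mined rule bodies."""
        clause = [frozenset([seed])]
        gain = clause_gain(clause, X, y, vocab)
        for body in sorted(map(frozenset, rule_bodies), key=len):
            # gen-filter (approximation): a body that contains an accepted
            # disjunct is a specialisation of it, so it cannot help
            if any(d <= body for d in clause):
                continue
            # gain-filter: keep the disjunct only if it improves IG
            new_gain = clause_gain(clause + [body], X, y, vocab)
            if new_gain > gain:
                clause.append(body)
                gain = new_gain
        return clause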

3.3 Feature Extraction with Boosting

PSI's iterative approach to feature extraction employs boosted decision stumps (see Figure 4). The number of features to be extracted is determined by the vocabulary size. The general idea of boosting is to iteratively generate several (weak) learners, with each learner biased by the training error in the previous iteration [10]. This bias is expressed by modifying weights associated with documents. When boosted stumps are used for feature selection, the new document distribution discourages selection of a redundant feature given the previously selected feature [6]. Here, with extracted features, unlike with single keyword-based features, we need to discourage discovery of an overlapping clause given the previously discovered clause. We achieve this by updating document weights in PSI according to the error of the decision stump created with the new extracted feature, w''_i, instead of w_i.

Algorithm: PSI

    W'' = {}; n = |D|; vocabulary size = m
    initialise document weights to 1/n
    Repeat
        w_i = feature with highest IG
        W = W \ w_i
        w''_i = GROWCLAUSE(w_i, W)
        W'' = W'' ∪ w''_i
        stump = CREATESTUMP(w''_i)
        err = error(stump)
        update document weights using err
    Until (|W''| = vocabulary size)
    Return W''

Fig. 4. Feature Extraction with PSI.
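A runnable approximation of Figure 4, reusing grow_clause and satisfies from above. The figure does not fix the weight-update formula, so an AdaBoost-style update [10] is assumed, and the seed keyword is chosen by a weighted variant of IG so that the updated weights actually influence the next selection; mine_rules stands in for the Apriori step.

    def weighted_ig(x, y, w):
        """Information gain with document weights standing in for counts."""
        def entropy(mask):
            tot = w[mask].sum()
            if tot == 0:
                return 0.0
            ps = [w[mask & (y == c)].sum() / tot for c in np.unique(y[mask])]
            return -sum(p * np.log2(p) for p in ps if p > 0)
        h = entropy(np.ones(len(y), dtype=bool))
        h_cond = sum(w[x == v].sum() * entropy(x == v) for v in (0, 1)) / w.sum()
        return h - h_cond

    def stump_predict(feat, y, w):
        """Each leaf of the one-level stump predicts its weight-majority class
        (binary classes 0/1 assumed for brevity)."""
        pred = np.empty_like(y)
        for v in (0, 1):
            leaf = feat == v
            if leaf.any():
                w_pos = w[leaf & (y == 1)].sum()
                w_neg = w[leaf & (y == 0)].sum()
                pred[leaf] = 1 if w_pos >= w_neg else 0
        return pred

    def psi_extract(X, y, vocab, mine_rules, m):
        """Extract m non-overlapping clauses with boosted stumps (cf. Figure 4)."""
        w = np.full(len(y), 1.0 / len(y))        # initialise document weights
        remaining = set(vocab)
        clauses = []
        while len(clauses) < m and remaining:
            seed = max(remaining,                # feature with highest (weighted) IG
                       key=lambda kw: weighted_ig(X[:, vocab[kw]], y, w))
            remaining.remove(seed)               # W = W \ w_i
            clause = grow_clause(seed, mine_rules(seed), X, y, vocab)
            clauses.append(clause)               # W'' = W'' ∪ w''_i
            feat = np.array([int(satisfies(d, clause, vocab)) for d in X])
            pred = stump_predict(feat, y, w)     # stump built on w''_i
            err = w[pred != y].sum()
            beta = max(err, 1e-9) / max(1.0 - err, 1e-9)
            w[pred == y] *= beta                 # down-weight well-handled documents
            w /= w.sum()
        return clauses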

3.4 Feature Instantiation

Once PSI has extracted new features, textual cases are mapped to a new representation. For a new feature w''_i, let C_i = ∨_j P_ij be its propositional clause, where P_ij = ∧_k w_ijk is the j-th conjunction in this clause. The new representation of document d = (x'', y) is obtained by:

    x''_i = max_j [ gain_inc(P_ij) × infer(P_ij) ]

where gain_inc returns the increase in gain achieved by P_ij when growing C_i. Whether or not P_ij can be inferred (satisfied) from a document's initial representation d = (x, y) (i.e. using all features in W) is determined by:

    infer(P_ij) = 1 if (∧_k w_ijk) = True, and 0 otherwise.

The PSI-derived representation enables case comparison at a semantic (or conceptual) level, because instantiated features now capture the degree to which each clause is satisfied by documents. In other words, satisfaction of the same disjunct will contribute more towards similarity than satisfaction of different disjuncts in the same clause.
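Following the max-over-disjuncts reading of the formula above, a single PSI feature can be instantiated as sketched below; the per-disjunct gain increases are assumed to have been recorded while the clause was grown.

    def instantiate(doc, clause, gain_inc, vocab):
        """Value of one PSI feature x''_i for a document d.

        clause   : list of frozensets, the disjuncts P_ij of C_i
        gain_inc : dict mapping each disjunct to its recorded gain increase"""
        satisfied = [d for d in clause
                     if all(doc[vocab[kw]] == 1 for kw in d)]  # infer(P_ij) = 1
        return max((gain_inc[d] for d in satisfied), default=0.0)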

4 Latent Semantic Indexing (LSI)

LSI is an established method of feature extraction and dimension reduction. The matrix whose columns are the document vectors x(1), ..., x(|D|), known as the term-document matrix, constitutes a vector space representation of the document collection. In LSI, the term-document matrix is subjected to singular value decomposition (SVD).

Fig. 5. Feature extraction using LSI.

The SVD extracts an orthogonal basis for this space, consisting of new features that are linear combinations of the original features (keywords). Crucially, these new features are ranked according to their importance. It is assumed that the k highest-ranked features contain the true semantic structure of the document collection and the remaining features, which are considered to be noise, are discarded. Any value of k less than the rank of the term-document matrix can be used, but good values will be much smaller than this rank and, hence, much smaller than either |W| or |D|. These features form a new lower-dimensional representation which has frequently been found to improve performance in information retrieval and classification tasks [18, 22]. Full technical details can be found in the original paper [7].
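For comparison with PSI, a minimal LSI sketch using NumPy; rows of the input are keywords and columns are documents, matching the term-document matrix described above.

    import numpy as np

    def lsi(term_doc, k):
        """Project documents onto the k highest-ranked SVD features."""
        U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
        # U[:, :k] holds the keyword loadings of the k extracted features;
        # each column of the result is a k-dimensional document vector.
        return np.diag(s[:k]) @ Vt[:k, :]

A new query vector q would typically be folded into the same space as np.diag(1.0 / s[:k]) @ U[:, :k].T @ q.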

Figure 5 shows a hypothetical example from the AI-Spam domain (cf. Figure 1). The first extracted feature is a combination of “intelligent”, “algorithm”, “grant” and “application”. Any document containing most of these is likely to be legitimate, so high values of this feature indicate non-Spam. The second feature has positive weights for “application”, “debt-free” and “viagra” and a negative weight for “grant”. A high value for this feature is likely to indicate Spam. The new features are orthogonal, despite having two keywords in common. The first feature has positive weights for both “grant” and “application”, whereas the second has a negative weight for “grant”. This shows how the modifying effect of “grant” on “application” might manifest itself in an LSI-derived representation. With a high score for the first extracted feature and a low score for the second, the incoming test case is likely to be classified as legitimate email.

LSI extracted features are linear combinations of typically very large numbers of keywords. In practice this can be in the order of hundreds or thousands of keywords, unlike in our illustrative example involving just 8 keywords. Consequently, it is difficult to interpret these features in a meaningful way. In contrast, a feature extracted by PSI combines far fewer keywords and its logical description of underlying semantics is easier to interpret. A further difference is that, although both PSI and LSI exploit word-word co-occurrences to discover and preserve underlying semantics, PSI also draws on word-class co-occurrences while LSI does not naturally exploit this information.

5 Evaluation

The goodness of case representations derived by BASE, LSI and PSI in terms of retrieval performance is compared on a retrieve-only CBR system, where the weighted majority vote from the 3 best matching cases is re-used to classify the test case. A modified case similarity metric is used so that similarity due to absence of words (or words in linear combinations or in clauses) is treated as less important compared to their presence [19].
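A sketch of this retrieve-only classifier under stated assumptions: binary feature vectors, shared absence discounted by a factor alpha (the exact scheme in [19] may differ), and votes weighted by similarity.

    import numpy as np
    from collections import defaultdict

    def asymmetric_sim(a, b, alpha=0.5):
        """Matching presence counts fully; matching absence only alpha."""
        both_present = np.sum((a == 1) & (b == 1))
        both_absent = np.sum((a == 0) & (b == 0))
        return (both_present + alpha * both_absent) / len(a)

    def classify(query, cases, labels, k=3):
        """Weighted majority vote over the k best matching cases."""
        sims = np.array([asymmetric_sim(query, c) for c in cases])
        votes = defaultdict(float)
        for i in np.argsort(sims)[::-1][:k]:
            votes[labels[i]] += sims[i]
        return max(votes, key=votes.get)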

Experiments were conducted on 6 datasets; 4 involving email routing tasks and 2 involving Spam filtering. Various groups from the 20Newsgroups corpus of 20 Usenet groups [13], with 1000 postings (of discussions, queries, comments etc.) per group, form the routing datasets: SCIENCE (4 science related groups); REC (4 recreation related groups); HW (2 hardware problem discussion groups, one on Mac, the other on PC); and RELPOL (2 groups, one concerning religion, the other politics in the middle-east). Of the 2 Spam filtering datasets: USREMAIL [8] contains 1000 personal emails of which 50% are Spam; and LINGSPAM [16] contains 2893 messages from a linguistics mailing list of which 27% are Spam.

Equal-sized disjoint train-test splits were formed. Each split contains 20% of the dataset and also preserves the class distribution of the original corpus. All text was pre-processed by removing stop words (common words) and punctuation. Remaining words were stemmed to form W, where |W| varies from approximately 1,000 in USREMAIL to 20,000 in LINGSPAM. Generally, with both routing and filtering tasks, the overall aim is to assign incoming messages into appropriate groups. Hence, test set accuracy was chosen as the primary measure of the effectiveness of the case representation as a facilitator of case comparison. For each test corpus and each method, the accuracy (averaged over 15 trials) was computed for representations with 20, 40, 60, 80, 100 and 120 features.

Paired t-tests were used to find improvements by LSI and PSI compared to BASE (one-tailed test) and differences between LSI and PSI (two-tailed test), both at the 95% significance level. Precision¹ is an important measure when comparing Spam filters, because it penalises error due to false positives (Legitimate → Spam). Hence, for the Spam filtering datasets, precision was tested as well as accuracy.

5.1 Results

Accuracy results in Figure 6 show that BASE performs poorly with only 20 features, but gets closer to the superior PSI when more features are added. PSI's performance is normally good with 20 features and is robust to the feature subset size compared to both BASE and LSI. LSI clearly performs better for smaller sizes. This motivated an investigation of LSI with fewer than 20 features. We found that 10-feature LSI consistently outperforms 20-feature LSI and is close to optimal. Consequently, 10-feature LSI was used for the significance testing, in order to give a more realistic comparison with the other methods.

Table 1 compares performance of BASE and PSI (both 20 features) and LSI (10 features). Where LSI or PSI are significantly better than BASE, the results appear in bold in the original table.

¹ Precision = TP/(TP+FP), where TP is the no. of true positives and FP is the no. of false positives.

Fig. 6. Accuracy results for datasets.

Table 1. Summary of significance testing for feature subset size 20 (10 for LSI).

             Routing: Accuracy                  Filtering: Accuracy (Precision)
    Algo.    REC     HW      RELPOL   SCIENCE   USREMAIL        LINGSPAM
    BASE     71.7    73.0    88.7     48.1      85.7 (89.5)     94.2 (92.0)
    LSI      *78.7   65.5    90.4     *71.8     93.9 (*96.8)    96.8 (89.0)
    PSI      76.2    *80.1   91.2     59.9      94.1 (95.2)     95.8 (*92.1)

Where LSI and PSI are significantly different, the better result is starred. It can be seen that LSI is significantly better than BASE on 6 of the 8 measures and better than PSI on 3. PSI is better than BASE on 7 measures and better than LSI on 2. We conclude that the 20-dimensional representations extracted by PSI have comparable effectiveness to the 10-dimensional representations extracted by LSI. Generally, BASE needs a much larger indexing vocabulary to achieve comparable performance. PSI works well with a small vocabulary of features, which are more expressive than LSI's linear combinations.

5.2 Interpretability

Figure 7 provides a high-level view of sample features extracted by PSI in the form of logical combinations (for 3 of the datasets). It is interesting to compare the differences in extracted combinations (edges), the contribution of keywords (ovals) to different extracted features (boxes) and the number of keywords used to form conjunctions (usually not more than 3). We see a mass of interconnected nodes with the HW dataset on which PSI's performance was far superior to that of LSI. Closer examination of this dataset shows that there are many keywords that are polysemous given the two classes. For instance “drive” is applicable both to Macs and PCs, but combined with “vlb” indicates PC while with “syquest” indicates Mac. Unlike HW, the multi-class SCIENCE dataset contains several disjoint graphs each relating to a class, suggesting that these concepts are easily separable. Accuracy results show LSI to be a clear winner on SCIENCE. This further supports our observation that LSI operates best only in the absence of class-specific polysemous relationships. Finally, features extracted from LINGSPAM in Figure 7 show that the majority of new features are single keywords rather than logical combinations. This explains BASE's good performance on LINGSPAM.

Fig. 7. Logical combinations extracted from datasets (SCIENCE, LINGSPAM and HW).

An obvious advantage of interpretability is knowledge discovery. Consider the SCIENCE tree: without the extracted clauses indicating that “msg”, “food” and “chinese” are linked through “diet”, one would not understand the meaning in context of a term such as “msg”. Such proof trees, which are automatically generated by PSI, highlight relations between keywords; this knowledge can further aid glossary generation, often a demanding manual task (e.g. FALLQ [14]). Case authoring is a further task that can benefit from PSI generated trees. For example disjoint graphs involving “electric” and “encrypt” with many fewer keyword associations may suggest the need for case creation or discovery in that area. From a retrieval standpoint, PSI generated features can be exploited to facilitate query elaboration within incremental retrieval systems. The main benefit to the user would be the ability to tailor the expanded query by deactivating disjuncts to suit retrieval needs. Since clauses are granular and can be broken down into semantically rich constituents, retrieval systems can gather statistics of which clauses worked well in the past, based on user interaction; this is difficult over linear combinations of features mined by LSI.

6 Related Work

Feature extraction is an important area of research, particularly when dealing with textual data. In Textual-CBR (TCBR) research the SMILE system provides useful insight into how machine learning and statistical techniques can be employed to reason with legal documents [3]. As in PSI, single keywords in decision nodes are augmented with other keywords of similar context. Unlike PSI, these keywords are obtained by looking up a manually created domain-specific thesaurus. Although PSI grows clauses by analysing keyword co-occurrence patterns, they can just as easily be grown using existing domain-specific knowledge. A more recent interest in TCBR involves extraction of features in the form of predicates. The FACIT framework involving semi-automated index vocabulary acquisition addresses this challenge, but also highlights the need for reliance on deep syntactic parsing and the acquisition of a generative lexicon, which warrants significant manual intervention [11].

In text classification and text mining research, there is much evidence to show that analysis of keyword relationships and modelling them as rules is a successful strategy for text retrieval. A good example is RIPPER [5], which adopts complex optimisation heuristics to learn propositional clauses for classification. A RIPPER rule is a Horn clause rule that concludes a class. In contrast, PSI's propositional clauses form features that can easily be exploited by CBR systems to enable case comparison at a semantic level. Such comparisons can also be facilitated with the FEATUREMINE [21] algorithm, which also employs association rule mining to create new features based on keyword co-occurrences. FEATUREMINE generates all possible pair-wise keyword co-occurrences, converting only those that pass a significance test into new features. What is unique about PSI's approach is that, firstly, search for associations is guided by an initial feature selection step; secondly, associations remaining after an informed filtering step are used to grow clauses; and, crucially, boosting is employed to encourage growing of non-overlapping clauses. The main advantage of PSI's approach is that instead of textual case similarity based solely on instantiated feature value comparisons (as in FEATUREMINE), PSI's clauses enable more fine-grained similarity comparisons. Like PSI, WHIRL [4] also integrates rules resulting in a more fine-grained similarity computation over text. However these rules are manually acquired.

The use of automated rule learning in an Information Extraction (IE) setting is demonstrated by TEXTRISE, where mined rules predict text for slots based on information extracted over other slots [15]. The vocabulary is thus limited to template slot fillers. In contrast PSI does not assume knowledge of case structures and is potentially more useful in unconstrained domains.

7 Conclusion

A novel contribution of this paper is the acquisition of an indexing vocabulary in the form of expressive clauses, and a case representation that captures the degree to which each clause is satisfied by documents. The propositional semantic indexing tool, PSI, introduced in the paper, enables text comparison at a semantic, instead of a lexical, level. Experiments show that PSI's retrieval performance is significantly better than that of retrieval over keyword-based representations. Comparison of PSI-derived representations with the popular LSI-derived representations generally shows comparable retrieval performance. However, in the presence of class-specific polysemous relationships PSI is the clear winner. These results are very encouraging because, although features extracted by LSI are rich mathematical descriptors of the underlying semantics in the domain, unlike PSI they lack interpretability. We note that PSI's reliance on class knowledge inevitably restricts its range of applicability. Accordingly, future research will seek to develop an unsupervised version of PSI.

Acknowledgements

We thank Susan Craw for helpful discussions on this work.

References

1. R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast discovery of association rules. In Advances in KD and DM, pages 307–327, 1995. AAAI/MIT.
2. D. Aha, editor. Mixed-Initiatives Workshop at 6th ECCBR, 2002. Springer.
3. S. Bruninghaus and K. D. Ashley. Bootstrapping case base development with annotated case summaries. In Proc of the 2nd ICCBR, pages 59–73, 1999. Springer.
4. W. W. Cohen. Providing database-like access to the web using queries based on textual similarity. In Proc of the Int Conf on Management of Data, pages 558–560, 1998.
5. W. W. Cohen and Y. Singer. Context-sensitive learning methods for text categorisation. ACM Transactions in Information Systems, 17(2):141–173, 1999.
6. S. Das. Filters, wrappers and a boosting based hybrid for feature selection. In Proc of the 18th ICML, pages 74–81, 2001. Morgan Kaufmann.
7. S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391–407, 1990.
8. S. J. Delany and P. Cunningham. An analysis of case-base editing in a spam filtering system. In Proc of the 7th ECCBR, pages 128–141, 2004. Springer.
9. G. Forman and I. Cohen. Learning with Little: Comparison of Classifiers Given Little Training. In Proc of the 8th European Conf on PKDD, pages 161–172, 2004.
10. Y. Freund and R. Schapire. Experiments with a new boosting algorithm. In Proc of the 13th ICML, pages 148–156, 1996.
11. K. M. Gupta and D. W. Aha. Towards acquiring case indexing taxonomies from text. In Proc of the 17th Int FLAIRS Conference, pages 307–315, 2004. AAAI Press.
12. W. Iba and P. Langley. Induction of one-level decision trees. In Proc of the 9th Int Workshop on Machine Learning, pages 233–240, 1992.
13. T. Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF. Technical report, Carnegie Mellon University CMU-CS-96-118, 1996.
14. M. Lenz. Defining knowledge layers for textual CBR. In Proc of the 4th European Workshop on CBR, pages 298–309, 1998. Springer.
15. U. Y. Nahm and R. J. Mooney. Mining soft-matching rules from textual data. In Proc of the 17th IJCAI, pages 979–984, 2001.
16. G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C. Spyropoulos, and P. Stamatopoulos. A memory-based approach to anti-spam filtering for mailing lists. Information Retrieval, 6:49–73, 2003.
17. G. Salton and M. J. McGill. An introduction to modern IR. McGraw-Hill, 1983.
18. F. Sebastiani. ML in automated text categorisation. ACM Computing Surveys, 34:1–47, 2002.
19. N. Wiratunga, I. Koychev, and S. Massie. Feature selection and generalisation for textual retrieval. In Proc of the 7th ECCBR, pages 806–820, 2004. Springer.
20. Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorisation. In Proc of the 14th ICML, pages 412–420, 1997. Springer.