DATA MINING
Course notes, KU Leuven
Ysaline de Wouters
2016-2017


Table of contents

CHAPTER 1: INTRODUCTION TO DATA MINING
1. BACKGROUND ON DATA MINING
1.1 TECHNOLOGICAL ADVANCES
1.2 DATA MINING VS. MACHINE LEARNING
2. DATA MINING TASKS
3. THE DATA MINING PROCESS
3.1 COMPONENTS OF A DATA MINING SYSTEM
4. STATISTICAL LIMITS ON DATA MINING
5. THINGS USEFUL TO KNOW
5.1 IMPORTANCE OF WORDS IN DOCUMENTS
5.2 HASH FUNCTIONS
5.3 INDEXES

CHAPTER 2: GET TO KNOW YOUR DATA
1. RETROSPECTIVE VS. PROSPECTIVE DATA
2. TYPES OF DATA
3. SIMPLE DESCRIPTIVE STATISTICS
4. VISUALIZATION

CHAPTER 3: INTRODUCTION TO PREDICTIVE MODELING IN DATA MINING
1. INDUCTIVE LEARNING
1.1 SCORING PROBLEMS
2. FEATURES
3. CALIBRATION CHECK
4. BAYES RULE & NAÏVE BAYES CLASSIFIERS
5. LOGISTIC REGRESSION, TRAINING TASK
6. CHALLENGE PROBLEM

CHAPTER 4: RECOMMENDING PRODUCTS
1. PROBLEM OVERVIEW
1.1 RECOMMENDATION TYPES
1.2 CHALLENGES
2. CONTENT-BASED FILTERING
3. COLLABORATIVE FILTERING
4. EVALUATION
5. CASE STUDY: NETFLIX CHALLENGE
5.1 MODELLING GLOBAL AND LOCAL BIASES
5.2 LATENT FACTOR MODELS
5.3 MODELLING TEMPORAL DYNAMICS
5.4 TRY LOTS AND LOTS OF DIFFERENT MODELS AND COMBINE THE OUTPUT

CHAPTER 5: MODEL ENSEMBLES
1. MOTIVATION AND OVERVIEW
2. SAMPLING-BASED APPROACHES
2.1 BAGGING: BOOTSTRAP AGGREGATING
2.2 ADABOOST (ADAPTIVE BOOSTING)
2.3 CROSS-VALIDATION
2.4 GRADIENT TREE BOOSTING
3. MANIPULATE FEATURES
4. ADD RANDOMNESS
5. STACKING
6. WHY DO ENSEMBLE METHODS WORK?
6.1 EXPECTED ERROR
6.2 BIAS/VARIANCE EXPLANATION
6.3 STATISTICAL EXPLANATION
6.4 REPRESENTATIONAL EXPLANATION
6.5 COMPUTATIONAL EXPLANATION
7. SOME APPLICATIONS

INTERLUDE: ACTIONABLE DATA MINING

INTERLUDE: LARGE SCALE DECISION TREE LEARNING
1. DECISION TREE OVERVIEW
2. RAINFOREST
3. BOAT

CHAPTER 6: ASSOCIATION RULE MINING
1. INTRODUCTION AND DEFINITIONS
2. NAÏVE ALGORITHM
3. APRIORI
4. PCY
5. LIMITING DISK I/O
5.1 SIMPLE ALGORITHM: SAMPLE
5.2 THE SON ALGORITHM
5.3 TOIVONEN'S ALGORITHM
6. FP-GROWTH
7. INCORPORATING CONSTRAINTS INTO MINING
8. PRESENTING RESULTS, OTHER METRICS

CHAPTER 7: MINING SEQUENCES
1. MOTIVATION
2. BACKGROUND AND DEFINITIONS
3. FREESPAN
4. PREFIXSPAN

CHAPTER 8: CLUSTERING
1. UNSUPERVISED LEARNING, CLUSTERING INTRO
2. HIERARCHICAL CLUSTERING
2.1 THE ALGORITHM
2.2 DISTANCE MEASURES
2.3 EVALUATION OF THE ALGORITHM
2.4 CLUSTER FEATURE VECTOR
2.5 CLUSTER FEATURE TREE
3. PARTITIONAL CLUSTERING
3.1 PARTITIONING ALGORITHMS
3.2 K-MEANS ALGORITHM
3.3 BRADLEY-FAYYAD-REINA (BFR)
4. MODEL-BASED CLUSTERING
5. APPLICATIONS
5.1 LOW QUALITY OF WEB SEARCHES
5.2 DOCUMENT CLUSTERING
5.3 BIOINFORMATICS

CHAPTER 9: USING UNLABELED DATA
1. INTRODUCTION
2. SEMI-SUPERVISED LEARNING
3. ACTIVE LEARNING
3.1 THE CONCEPT
3.2 HOW TO SELECT QUERIES?
3.3 QUERY-BY-COMMITTEE (QBC)
3.4 ALTERNATIVE QUERY TYPES
4. SUMMARY

CHAPTER 10: TIME SERIES
1. TIME SERIES INTRO
2. TIME SERIES CLASSIFICATION
3. SYMBOLIC REPRESENTATIONS
4. APPLICATIONS TO SPORTS
5. TO CONCLUDE


CHAPTER 1: INTRODUCTION TO DATA MINING

Data mining refers to the discovery of models for data. However, a model can be one of several things. Originally, "data mining" was a derogatory term referring to attempts to extract information that was not supported by the data. Today, it is seen as the construction of a statistical model, that is, an underlying distribution from which the visible data is drawn.
Some people regard data mining as synonymous with machine learning. There is no question that some data mining appropriately uses algorithms from machine learning. Machine learning practitioners use the data as a training set, to train an algorithm of one of the many types used by machine-learning practitioners, such as Bayes nets, support vector machines, decision trees, etc.
More recently, computer scientists have looked at data mining as an algorithmic problem. There are many different approaches to modelling data. We could for instance construct a statistical process whereby the data could have been generated.

• Summarizing the data succinctly and approximately. Ex: the PageRank idea. The entire complex structure of the Web is summarized by a single number for each page. This number is the probability that a random walker on the graph would be at that page at any given time. It reflects the importance of the pages.

• Extracting the most prominent features of the data and ignoring the rest. It represents the data by these examples.

  o Frequent itemsets: makes sense for data that consists of "baskets" of small sets of items. Look for small sets of items that appear together in many baskets; these frequent itemsets are the characterization of the data that we seek.

  o Similar items: find pairs of sets that have a relatively large fraction of their elements in common. Ex: treat customers at an online store like Amazon as the set of items they have bought. Amazon can look for similar customers and recommend something many of these customers have bought. This is called collaborative filtering.

1. Background on data mining

This course provides a broad survey of several important and well-known subfields. The main goal is to develop an overall sense of how to extract information from data in a systematic way.

• The How: gain insight into the workings of specific algorithms, i.e. how to extract information from data. We want to understand the typical questions underlying the how: how do the algorithms work, what do they try to do, how do they extract information?

• The Why: understand the "big picture" of data mining. Goal: understand the challenges in data mining.

Many definitions of data mining are possible:


• A phrase to put on your CV to get hired
• Non-trivial extraction of implicit, previously unknown and useful information from data
• A buzzword used to get money from funding agencies and venture capital firms
• (Semi-)automated exploration and analysis of large data sets to discover meaningful patterns

Data mining can be interpreted as the process of automatically identifying models and patterns from massive observational databases that are:

• Valid: hold on new data with some certainty
• Novel: non-obvious to the system
• Useful: it should be possible to act on the pattern. It should give us a look inside the problem, for instance to support decision making.
• Understandable: humans should be able to interpret the pattern. Often we work with domain experts, and a pattern that nobody can interpret is of little use.

Data mining can either be data driven or learning driven. You have data and use it to make inferences and strategies about facts, and to derive models and patterns.

The process of automatically identifying models and patterns from massive observational databases:
• Models and patterns can be decision trees, for instance. These are a kind of representation of the knowledge.
• Massive: we are talking about database systems and scalability.
• Observational → we observe data rather than design experiments.

There are three goals in data mining:
• Understand the data: representation, what is in the data, etc.
• Extract knowledge from the data.
• Make predictions about the future: what can happen tomorrow, next week or next year, etc.

1.1 Technological advances

25-30 years ago there was essentially no data mining; in recent years it has become extremely popular and successful. Many companies apply data mining, and the same goes for universities, sports teams, etc. It is also taught in academia. There are two reasons for this popularity.


• Technology has greatly improved.
• Databases and the web mean everyone has data; everyone collects data.

Storage is larger and cheaper
→ Moore's law for magnetic disk density: "capacity doubles every 18 months". Storage cost per byte is falling rapidly. Storage has become larger and cheaper over the years, and there is far more storage capacity on offer. Next, we also observe improvements in computing power: the supercomputer of 15 years ago is about as powerful as a modern desktop. Many large datasets are available today.

Online text sources
• MEDLINE has 19 million published articles.
• Wikipedia has a huge number of articles. People try to analyse Wikipedia.
• Web search engines can index billions of web pages every day, with hundreds of millions of site visitors per day.

Retail transaction data
• eBay, Amazon, Walmart: >100 million transactions per day
• Visa, Mastercard: similar or larger numbers

Scientific uses
• Data collected and stored at GB/hour
• Remote sensors on a satellite
• Microarrays generating gene expression data
• Scientific simulations
• Traditional techniques are infeasible for the raw data
• Data mining helps scientists to classify and segment data, form hypotheses, and find hidden patterns and correlations

Furthermore, data mining is commercially useful. Indeed, data analysis helps companies and stores to develop, to know how they should stock the shelves and what and how much they should sell. Many companies collect and store data:
• Search engines: click data, advertising, automating the search
• Stores: purchase records
• Banks: credit card transactions; they want to be able to catch fraud

Competition is strong and data mining can help:
• Marketing and advertising
• Search
• Inventory. Inventory remains expensive; you don't want to have too much stock, so you should be sure to optimize the inventory level.

Data mining crosses many different disciplines.
• Databases: where large datasets are stored
• Machine learning
• Statistics


• Visualization of the data
• Information retrieval: good retrieval to allow an efficient search of the data
• High-performance computing to ensure the analysis goes fast

The two most similar fields are machine learning and statistics. What is statistics? Statisticians think in terms of models. There is a population, and we take a sample of that population. The first thing you can do is write down a distribution describing the sample. Another thing is to make statistical inference: I have a sample, I formulate a hypothesis, I apply it to the model and then see if it fits the data. There is always a relationship between some target variable and explanatory variables.
In data mining there is usually no hypothesis up front. Statistics is more often used when we have prospective data, whereas data mining focuses on analysis of existing data. It is more algorithm based: data mining designs algorithms that work over the data.
Machine learning is interested in designing algorithms that improve their performance on a task with experience. Performance relates to some metric we want to optimize (accuracy, ROC curve, ...), a task refers to a certain problem, and experience is often data.

1.2 Data mining vs. machine learning

Data mining
• Focuses more on scalability.
• Is much more application focused.
• Is more often used in an industrial setting.

Machine learning
• More theoretical emphasis.
• Term more used in research/academia than data mining.

In terms of scaling up: as the picture shows, we have both a database stored on disk and a main memory. It is very slow to access data stored on disk, whereas main memory is relatively fast to access. Only a small subset of the data can be stored in main memory. Data mining therefore often tries to limit the number of times it has to go to the disk.


2. Data mining tasks

→ There are four tasks involved.

• EXPLORE: try to understand and get to know the data. To do this, we perform several tasks related to statistics, that is to say, define relevant and meaningful summary statistics of the data: number of variables, averages, means, modes, etc. You also try to identify flaws or problems in the data: skew, missing values, visually inspecting the data, classification problems, etc. Exploration is the first thing you do when you receive data.

→ What is wrong with this picture? The data exceeds the maximum range. Whatever the range is, it is exceeded, and this tends to be a problem. Another scale would be more appropriate.


Another example relates to the taxi data from 2014. Guesses for the decline? Uber could be a good reason: people using Uber instead of taxis. Again, we cannot assume a causal relationship here, but it is probably one reason for the decline.

• DESCRIPTIVE MODELING: you try to build models that describe and summarize the data. You can simulate how the data was generated, modelling the process that generated the data. Techniques:
  o Clustering
  o Density estimation / probabilistic models

• PREDICTIVE MODELING

We have some variables X and we want to predict the future value of another variable Y, also called the target variable.
  o Classification: Y is discrete
  o Regression: Y is continuous
  o Probability estimation: Prob(Y = y)

The goal is to explore the relationship between the set of variables X and the single variable Y. You try to approximate a function mapping configurations of X → Y. This is what many machine learning and statistics algorithms do. Often the focus is on accuracy, not comprehensibility: we mainly focus on getting the prediction for Y.
One big application is in politics: people running campaigns make a lot of use of data. Which candidate do you prefer? Who will vote? Who will give money? Who can be persuaded?
Another good application is web search. You have some query, and you want to know which documents are related to the query. How do people solve this? The first attempt was to manually curate directories. The problem is that this works for small data sets but does not scale. The next natural thing to do is to try to match words in the query to words in documents. This is rather challenging, since many documents contain the words of the query. Google's idea was to exploit the web structure: view links as votes, so more incoming links means a more important web page. The intuition behind this is that you can trust what other people say.

• DISCOVERING PATTERNS

The goal is to discover interesting "local" patterns in the data, not to characterize the data globally. The focus is on finding human-interpretable patterns that describe the data. Techniques:
  o Item-set mining
  o Pattern mining
  o Sequence mining

One example would be finding repeated DNA sequences. Another would be product recommendation: finding commonly co-purchased items. Again, this step can be illustrated with sports. The NBA logs all play-by-play information: which players are in the game, shot attempts, etc. The questions are: which line-ups work well? Offensive efficiency? Defensive efficiency?


Machine learning class
• ML has lots and lots of techniques for predictive models: naïve Bayes, decision trees, rule learning, ILP, ...
• Experimental methodology like cross-validation
• Reinforcement learning

Data mining class
• Predictive modelling motivated by applications
• Pattern mining (unsupervised learning)
• Scalability

3. The data mining process

First select data to get the target data. Look for outliers, missing values, etc. Then you do some sort of pre-processing, a key step where you apply transformations to your data, such as feature transformation. Then comes data mining proper, where you find models and patterns. Ideally you end up with knowledge.
In a data mining system, you want something that is computationally sound. You want it to be scalable, without excessive time and space complexity, and you want things to be parallelizable. Next, you want things to be statistically sound, finding meaningful patterns: do our results generalize to new data? You also want the system to be easy to use and its output comprehensible.

3.1 Components of a data mining system

• Representation is about how you actually represent the data. There are two aspects: the way data is represented and the way models are represented.
  o Data is represented using feature vectors, relational databases, free text, images, graphs, etc.
  o Models: decision trees, graphical models, rule sets, association rules, graph patterns, sequential patterns, etc.
• Evaluation can be either subjective or objective
  o Objective: accuracy, precision and recall, cost, utility, speed, etc.


  o Subjective: interesting, novel, actionable
• Search
  o Combinatorial optimization: greedy search
  o Convex optimization: gradient descent, with functions you want to minimize or maximize
  o Constrained search
• Data management
• User interface

To sum up, we live in an age where large amounts of data are commonplace. Data mining is hugely popular and hugely successful because it extracts useful information from this data. Data mining helps people solve problems. The information comes in many forms, like models, patterns, etc. Finally, data mining is challenging!

4. Statistical limits on data mining

→ A common data-mining problem involves discovering unusual events hidden within massive amounts of data.

• Total Information Awareness: the results you obtain all depend on how narrowly you define the activities that you look for.
• Bonferroni's principle: suppose you have a certain amount of data, and you look for events of a certain type within that data. You can expect events of this type to occur even if the data is completely random, and the number of occurrences of these events will grow as the size of the data grows. Calculate the expected number of occurrences of the events you are looking for, on the assumption that the data is random. If this number is significantly larger than the number of real instances you hope to find, then you must expect almost anything you find to be bogus.
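A minimal back-of-the-envelope illustration of Bonferroni's principle; the numbers below are invented purely for illustration, not taken from the course:

```python
# Hypothetical numbers: how many "suspicious" coincidences appear purely by
# chance, versus how many real ones we hope to find?
candidate_events = 10**9      # event combinations examined in the data
p_random_match = 1e-7         # chance a random combination looks "interesting"

expected_bogus = candidate_events * p_random_match   # = 100
expected_real = 10                                    # real events we hope to find

# 100 chance hits vs. 10 real ones: most "discoveries" would be noise,
# so by Bonferroni's principle the search criterion is too loose.
print(expected_bogus, expected_real)
```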

5. Things useful to know

5.1 Importance of words in documents

→ In many applications we will face the problem of categorizing documents (sequences of words) by their topic. Topics are typically identified by finding the special words that characterize documents about that topic.
Classification often starts by looking at documents and finding the significant words in those documents. Our intuition might be that the words appearing most frequently in a document are the most significant, but this is not true. In fact, the several hundred most common words in English ("the" or "and") are often removed from documents before any attempt to classify them. Actually, the indicators of the topic are relatively rare words. But not all rare words are equally useful as indicators.


TF.IDF = Term Frequency times Inverse Document Frequency. This is the formal measure of how concentrated into relatively few documents the occurrences of a given word are.
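The notes do not spell the measure out; one common formulation (as in Mining of Massive Datasets) is:

\[ \mathrm{TF}_{ij} = \frac{f_{ij}}{\max_k f_{kj}}, \qquad \mathrm{IDF}_i = \log_2\frac{N}{n_i}, \qquad \mathrm{TF.IDF}_{ij} = \mathrm{TF}_{ij}\times \mathrm{IDF}_i \]

where f_ij is the number of occurrences of word i in document j, N the total number of documents, and n_i the number of documents containing word i.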

5.2 Hash functions

A hash function h takes a hash-key value as an argument and produces a bucket number as a result. The bucket number is an integer, normally in the range 0 to B-1, where B is the number of buckets. Hash-keys can be of any type.
There is an intuitive property of hash functions: they "randomize" hash-keys. If hash-keys are drawn randomly from a reasonable population of possible hash-keys, then h will send approximately equal numbers of hash-keys to each of the B buckets.
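A minimal sketch of such a bucketing hash function in Python; the modulus trick is one common choice, not prescribed by the notes:

```python
def bucket(hash_key, num_buckets):
    """Map an arbitrary hashable key to a bucket number in 0..B-1."""
    return hash(hash_key) % num_buckets

# Roughly equal numbers of keys should land in each of the B buckets.
counts = [0] * 5
for key in range(10_000):
    counts[bucket(("user", key), 5)] += 1
print(counts)
```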

5.3 Indexes

An index is a data structure that makes it efficient to retrieve objects given the value of one or more elements of those objects. The most common situation is one where the objects are records, and the index is on one of the fields of that record. Given a value v for that field, the index lets us retrieve all the records with value v in that field.
Ex: a file of (name, address, phone) records, and an index on the phone field. Given a phone number, the index allows us to quickly find the record or records with that phone number.


CHAPTER 2: GET TO KNOW YOUR DATA

Analysing data requires:
• Collecting data
• Understanding data
• Putting data into a format such that analysis can be performed

Collecting the data is the most complicated, time consuming, and important part of the analysis: 50-75% of the time is spent on collecting the data. Most of the mistakes arise here, especially in the transformation phase.

1. Retrospective vs. prospective data

Correlation vs. causation
• Retrospective data is about correlation: A and B are related
• Prospective data is about causality: A causes B

There exist two types of data:
• Prospective data: collected as part of a controlled scientific experiment.
• Retrospective or observational data: collected passively or for some other reason.

→ The type of data you have influences the conclusions you can draw from it. With prospective data, you can establish things like causality. With retrospective data it is difficult to establish causality.

PROSPECTIVE DATA
Prospective data comes from the original, traditional, scientific experiment design. We have a hypothesis H we want to test, and we design an experiment, with controls, to test H. We collect data to try to determine the cause, before analysing the results and seeing if they confirm H.
Ex: clinical trials, gene knockout experiments, etc.
→ Very expensive and time consuming, since you need a lot of samples, must divide the data into groups, run trials, etc. Today, relying only on this seems completely outdated.

Data mining is largely about correlation:
• Data passively collected
• Data collected for some other reason
• Data is relatively cheap to collect

→ We can draw stronger conclusions from prospective data, but it is expensive and time consuming to collect.

OBSERVATIONAL DATA
Now we have huge observational data sets. Ex: web logs, customer transactions at retail stores, the human genome, etc.
It makes sense to leverage available data, since it may contain useful information and it is very cheap to collect.


The assumptions of experimental design are violated:
• How can we use such data to do science?
• Can we do model exploration, hypothesis testing?

One exception: the web. It tends to be very easy to experiment on the web, as it has a huge number of users. Amazon, for instance, has a huge number of users, which makes it easy to quickly and cheaply collect data. Amazon applies a recommendation system. Greg Linden's shopping-cart recommendations add items based on what is in the shopping cart. A marketing vice president claimed this might distract people away from checking out, which resulted in a prohibition to continue. He disobeyed, ran a test, and found that not having the recommendations was costing Amazon a lot of money. This illustrates that our intuitions are often wrong. Why should we debate if we can collect data?
Cultural reasons:
• Upton Sinclair: it is difficult to get a man to understand something when his salary depends upon his not understanding it.
• Corollary: do not trust the HiPPO, the Highest Paid Person's Opinion. By definition, these people are probably biased in their opinion.

We have two groups: the treatment group and the control group. We divide our population, usually 50-50, among these groups. The treatment group sees one variant of the website; the control group sees another variant. We then compare the outcomes between the two groups. We could for instance see how much money people spend in each group. This tests for causation, not correlation. If it is done correctly, there are only two things that can explain the change in outcome:
• A vs. B
• Random chance

Everything else affects both variants. Statisticians try to answer whether A or B is better with a formal hypothesis test. Note that this does not solve everything.


Hypothesis testing outline: we define two hypotheses, the null hypothesis and the alternative. H0 states that the original model is better, whereas the alternative Ha states that the variant is better.
We run an experiment to get data about the alternative hypothesis. We then compare how probable the observed outcome is compared to what is expected if the null hypothesis were true.
The p-value is needed to perform the testing. The p-value is the probability of the data, or something more extreme, under the null hypothesis. Usually we use p <= 0.05 to be confident that a difference is statistically significant. The probability of getting the outcome corresponds to the area under the curve.
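A minimal sketch of such a test for an A/B experiment, here a two-proportion z-test on conversion rates; the counts are made up purely for illustration:

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical outcomes: conversions out of visitors for each variant.
conv_a, n_a = 200, 10_000   # control
conv_b, n_b = 260, 10_000   # treatment

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))

z = (p_b - p_a) / se
p_value = norm.sf(abs(z))    # one-sided: is the variant better?
print(z, p_value)            # reject H0 if p_value <= 0.05
```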

Looking at hypothesis testing, what kinds of errors can we make?
• Type I error: rejecting a true null hypothesis, aka a false positive.
• Type II error: not rejecting a false null hypothesis, aka a false negative. The null hypothesis is actually false.

Some key things to think about: the Overall Evaluation Criterion (OEC), the metric we are measuring. It is important to think carefully about this: decide it up front and do not change it. Effect size: the difference in OEC between treatment and control.


→ We have to stay aware of day-of-week and seasonal effects. Beware too of newness effects. Assigning users to groups can be difficult, as we need to ensure that this is:
• Random
• Consistent across platforms
• Free of interactions between different experiments, no correlations
• Fast: people are intolerant to slowness of operations

2. Types of data

→ What can data look like?

• Feature vectors
• Transaction data occurs for instance when visiting and analysing data on Amazon's website. Implicitly you can use it as a sparse representation.
• Relational data: an extremely common sort of data representation, consisting of tables made out of rows and columns. There might be dependencies between tables, and dependencies between rows in a table. Data may be stored in one or several places.


• Semi-structured data: little boxes where we get structured information about someone.
• Graphs: molecules
• Sequential data: DNA. This is one type of ordered data.
• Time series data
• Spatial data: could be locational data, for instance
• Spatio-temporal data

3. Simple descriptive statistics

First get an overall sense of the data, for instance analysing the type of each variable: numerical data, text data. In other words, first look at obvious things like the number of variables, the number of data points, missing values, class skew for prediction problems, etc. Missing values are tricky! Class skew looks at the number of positive and negative instances in the sample, for instance. Next, dive into the individual variables.

• Discrete variables: for a certain variable you could for instance look at the number of values this variable can take, how often each value occurs, whether the values are ordered, and the mode (most common value).
• Continuous variables: look at the mean, sample variance, median, quartiles. This gives an idea about how the data is distributed.
  o Q1: value at position 0.25n
  o Q3: value at position 0.75n
  o Interquartile range: Q3 - Q1

When looking at averages, we are often interested in weighted averages. The mean assumes that each data point is of equal importance, but actually some data may be more important than other data: an estimate can be made more reliable, more representative or more recent by taking the weight of each individual data point into account. A small sketch of these summaries follows.
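A minimal sketch of these summary statistics with NumPy; the toy values and weights are invented:

```python
import numpy as np

x = np.array([2.0, 3.5, 3.5, 4.0, 7.0, 9.0, 12.0])   # toy continuous variable
w = np.array([1.0, 1.0, 2.0, 1.0, 3.0, 1.0, 1.0])    # hypothetical importance weights

mean = x.mean()
variance = x.var(ddof=1)                  # sample variance
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])       # quartiles
iqr = q3 - q1                             # interquartile range

weighted_mean = np.average(x, weights=w)  # weighted average

print(mean, variance, median, q1, q3, iqr, weighted_mean)
```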


Simpson's paradox, or the Yule-Simpson effect, is a paradox in probability and statistics in which a trend appears in different groups of data but disappears or reverses when these groups are combined. It is sometimes given the descriptive title reversal paradox or amalgamation paradox. This result is often encountered in social-science and medical-science statistics, and is particularly confounding when frequency data is unduly given causal interpretations. The paradoxical elements disappear when causal relations are brought into consideration.
The most famous example is the UC Berkeley admissions case. UC Berkeley got sued for gender bias in grad school admissions in the 70s.
• Men's acceptance rate: 44%
• Women's acceptance rate: 35%

It was observed that no department was biased against women; actually, most slightly favoured women. Women tended to apply to competitive programs with lots of applicants and low admit rates.

Describing networks
• Geodesic: shortest_path(n, m)
• Diameter: max(geodesic(n, m)) over actors n, m in the graph
• Density: number of existing edges / all possible edges
• Degree distribution: counting, for each node, how many edges it has
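A minimal sketch of these network summaries using the NetworkX library on a toy graph; the edges are invented:

```python
import networkx as nx
from collections import Counter

# Toy undirected social graph.
G = nx.Graph([("a", "b"), ("b", "c"), ("c", "d"), ("a", "c"), ("d", "e")])

geodesic_a_e = nx.shortest_path_length(G, "a", "e")   # geodesic distance
diameter = nx.diameter(G)                              # longest geodesic
density = nx.density(G)                                # existing / possible edges
degree_distribution = Counter(dict(G.degree()).values())

print(geodesic_a_e, diameter, density, degree_distribution)
```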

4. Visualization

This figure shows four datasets made of simple data. In each dataset, the x variable has a mean of 9 and Y has a mean of 7.5. Each has the same correlation(X, Y): 0.82. They also all have the same linear regression: y = 3 + 0.5x.

Looking at the data, there are many different ways to plot it and represent it visually.


Linear regression lines have been drawn. We observe that each graph is different. The two on the left are more or less the same. The bottom left is more or less a straight line, except for the outlier. On the right, the shape of the distribution is different: the graph on the top right has a curved shape, whereas the bottom right looks like a straight line.
Next, we can look at a box plot of the data. With continuous data, people often use this kind of plot.

Another common visualization is the bar chart. There are different positions, and each position has a count; we can get a good insight into what's going on. Histograms are another way to plot continuous variables: we divide the data into bins, usually of equal width, and then count the number of points in each bin. With histograms, the bin size is important! We can also look at the shape of the distribution.

To conclude: typically, you start the analysis by trying to understand the data. Is it observational or prospective? What is being measured? What does the data look like?


CHAPTER 3: INTRODUCTION TO PREDICTIVE MODELING IN DATA MINING

This chapter is close to machine learning; we will cover some algorithms and see how they come in handy when analysing data.

1. Inductive learning

1.1 Scoring problems

• Given: data S = {(x1, y1), ..., (xn, yn)}
• Learn: a function F: x -> y

→ If Y is discrete, we use classification; if it is continuous, we use regression.

Very classic examples of scoring problems are credit scoring, where you have to predict who is likely to default on a loan; the score is the probability that a person will repay. Another example relates to marketing. Marketing costs a lot of money: you have to mail people, call them, etc. Target marketing allows us to identify the people who are more likely to respond. Here, the score could be the probability of someone responding to the marketing campaign. Finally, churn prediction is another common application, where you rank customers who are likely to switch to another service or drop the service, and then provide incentives to those most likely to leave.

1.1.1 A regression approach

The goal of regression is to learn a function to approximate E[Y|X] (the expected value of Y given X) for each X.

→ We try to estimate the posterior class probabilities = a binary classifier.

Ex: what is the probability that a client responds to my advertising campaign?
The most common model for this is logistic regression. This is a discriminative model for learning the probability of Y given X. We do not care about modelling the individual values of X, i.e. the probability of X. It is a sigmoid applied to a linear function of the data.


2. Features

• Binary variables: for each word, we create a binary variable that is either 0 or 1.
• Discrete variables that take on k values: we use k-1 indicator variables.
• Real-valued variables: used just as is.
• Complex variables, e.g.:
  o Democrat or Independent
  o Word A = present AND word B = present
  o Real-valued transforms such as sin(x) or X1·X2

In logistic regression, we have two possibilities: Y = 0 and Y = 1.

Why pick logistic regression? In many tasks, calibration of the probability estimates is important: an estimated probability should behave like a probability. Ex: if the model says 75% probability of being in the positive class, then in the test set you would expect about 3/4 of the instances with that estimate to be truly positive.


However, not all classifiers are well calibrated. For instance, naïve Bayes is not, whereas logistic regression is well calibrated. If you look at the probabilities provided by naïve Bayes, you do not get good probabilities.

3. Calibration check

Perform the check on validation data: divide the predicted probability estimates into multiple bins. On the Y axis, plot for each bin the number of positive examples in the bin divided by the total number of examples in the bin. The X axis represents the predicted probability. A well-calibrated model remains close to the diagonal.
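A minimal sketch of such a calibration check; equal-width binning is one straightforward way to do it, and the variable names are ours:

```python
import numpy as np

def calibration_curve(y_true, y_prob, n_bins=10):
    """For each probability bin, return (mean predicted prob, observed fraction of positives)."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    points = []
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (y_prob >= lo) & (y_prob < hi)
        if mask.any():
            points.append((y_prob[mask].mean(), y_true[mask].mean()))
    return points

# Toy validation data: a well-calibrated model stays close to the diagonal.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.35, 0.55, 0.6, 0.4, 0.8, 0.85, 0.9])
print(calibration_curve(y_true, y_prob, n_bins=5))
```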

4. Bayes rule & naïve Bayes classifiers

Let's consider a supervised learning problem in which we wish to approximate an unknown target function f: X → Y, or equivalently P(Y|X). We assume Y is a Boolean-valued random variable, and X is a vector containing n Boolean attributes. Applying Bayes rule, we see that:
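The equation itself did not survive into these notes; the standard form of Bayes rule for this setting is:

\[ P(Y = y_k \mid X = x) \;=\; \frac{P(X = x \mid Y = y_k)\, P(Y = y_k)}{\sum_j P(X = x \mid Y = y_j)\, P(Y = y_j)} \]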

→ There is an intractable sample complexity for learning Bayesian classifiers in general; hence, we must look for ways to reduce this complexity. The naïve Bayes classifier does this by making a conditional independence assumption that dramatically reduces the number of parameters to be estimated when modelling P(X|Y).
The naïve Bayes classifier assumes all attributes describing X are conditionally independent given Y. This dramatically reduces the number of parameters that must be estimated to learn the classifier. Given three sets of random variables X, Y and Z, we say that X is conditionally independent of Y given Z if and only if the probability distribution governing X is independent of the value of Y given Z.

The naïve Bayes algorithm is a classification algorithm based on Bayes rule and a set of conditional independence assumptions. Given the goal of learning P(Y|X) where X = ⟨X1, ..., Xn⟩, the naïve Bayes algorithm makes the assumption that each Xi is conditionally independent of each of the other Xk's given Y, and also independent of each subset of the other Xk's given Y.


This assumption dramatically simplifies the representation of P(X|Y) and the problem of estimating it from the training data.
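Under this assumption the classifier (not written out in the notes) reduces to the familiar decision rule:

\[ y^{\ast} \;=\; \arg\max_{y_k} \; P(Y = y_k) \prod_{i=1}^{n} P(X_i \mid Y = y_k) \]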

5. Logistic regression, training task

Logistic regression is an approach to learning functions of the form f: X → Y, or P(Y|X), in the case where Y is discrete-valued and X = (X1, ..., Xn) is any vector containing discrete or continuous variables. Logistic regression assumes a parametric form for the distribution P(Y|X), then directly estimates its parameters from the training data. The parametric model assumed by logistic regression in the case where Y is Boolean is:
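The two equations were lost from the notes; one common parameterization (the sign convention varies between texts) is:

\[ P(Y=1\mid X) = \frac{1}{1+\exp\!\big(w_0 + \sum_{i=1}^{n} w_i X_i\big)}, \qquad P(Y=0\mid X) = \frac{\exp\!\big(w_0 + \sum_{i=1}^{n} w_i X_i\big)}{1+\exp\!\big(w_0 + \sum_{i=1}^{n} w_i X_i\big)} \]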

Both equations must sum to 1.

Given: data D = ((x1, y1), ..., (xn, yn)).
We want to learn the weights that maximize the conditional likelihood P(Y|X). In other words, the weights should push the predicted probability towards zero for negative examples and towards 1 for positive examples. The good news is that this function is concave, which means it is easy to optimize. However, there is no closed-form solution, which is the bad news, so the optimization is done iteratively, e.g. by gradient ascent (see the sketch below).
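A minimal sketch of that training loop, maximizing the conditional log-likelihood by gradient ascent; this is a plain NumPy illustration, not the course's reference implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """X: (n_examples, n_features), y: 0/1 labels. Returns learned weights."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend intercept term
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X @ w)               # P(Y=1 | x) for every example
        gradient = X.T @ (y - p)         # gradient of the log-likelihood
        w += lr * gradient / len(y)      # gradient ascent step
    return w

# Toy data: a single feature that is predictive of the label.
X = np.array([[0.1], [0.4], [0.5], [0.9], [1.2], [1.5]])
y = np.array([0, 0, 0, 1, 1, 1])
print(train_logistic_regression(X, y))
```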


→ Logistic regression is often referred to as a discriminative classifier because we can view the distribution P(Y|X) as directly discriminating the value of the target variable Y for any given instance X.

6. Challenge problem

Ex: say you try to predict flu outbreaks. Your model will help to predict where and when flu occurs. What data could we use? What features would we look at? What else would be important in the prediction? We could for example get data from a pharmacy, as people often go buy medicine when they feel sick. When we get sick, we also tweet about it, post on Facebook, or search the internet for more information; hence, social media data could be interesting too. However, social media may be biased. Nevertheless, looking at social media provides us with large amounts of data. Logistic regression could be a useful tool for this prediction.
Ex: influenza results in 250,000-500,000 deaths in the world each year. Quickly detecting outbreaks can reduce risk. Currently, the Centers for Disease Control (CDC) track this by aggregating counts of people with flu-like illnesses: you go to the doctor, who makes a guess and sends it to the CDC, which aggregates it. However, this tends to be slow and not timely enough, because data needs to be sent and then aggregated. The idea now is to exploit query logs. The goal is to predict the proportion of doctor visits that are flu-related, as calculated by the CDC. The data consists of historical Google search data. A possible model is logistic regression. Features: the queries most correlated to the target variable, to try to select the right features.

There are several guesses about the drop. Flu is correlated with the flu season, but other queries are correlated with the seasons too.

Oftentimes, there is a variable not used in the analysis that is correlated with both the target variable Y and the predictor variables X. In this example, it is the season. In causal studies, these are called confounding variables. This occurs often, watch out for it!


Another example is credit scoring, where a person applies for a loan. The application is either rejected or accepted. However, some people accepted for a loan may default. The goal is to predict, for those granted a loan, who will default on it. Traditionally loans were granted by experts, and there was scepticism that automation could be better than experts. Hence, it was first adopted in credit-card approvals; later it was broadly adopted in home loans, etc. Now it is widely accepted and used by almost all banks, credit-granting agencies, etc.
Customer-provided data: age, address, income, job, number of credit cards, savings, etc. We also get internal data about the customer: how long they have been with the bank, previous loans, etc. We also buy external customer-level data like credit reports and legal proceedings. Macro-level data comes in handy too: demographics (postcode, for instance).

→ This is challenging, as data might be incorrectly entered. People also deliberately give false information, entering a higher or lower income than the true one. Legal issues may occur as well: it is illegal to use race, colour, religion, national origin, sex, marital status, or age in the decision to grant credit.

The approach taken relies on logistic regression and decision trees. We want to make sure that the data is relevant and fits the customers; we need access to relevant variables for these customers. Defining labels remains a tricky problem.

Practical issues

• Cost/benefit analysis
  o Savings of employing the system: small gains in accuracy may be very valuable.
  o Costs of developing, testing, maintaining, checking legal requirements, etc.
• Other points:
  o Set the threshold: when should loans be granted?
  o Override: bank employees can ignore the system
  o When is a new model needed?
  o Continuous evaluation

Another place where logistic regression is used is in suggesting Facebook friends. How does Facebook come up with people we may know? The first insight is that most suggestions come from friends of friends. The problem is that there are lots of candidates: the average user has 130 friends, which gives about 17,000 friend-of-friend candidates. Hence, the trick is to define lots of features, exploiting the right data. You can think of things like the number of friends in common, whether a good friend of yours just friended this person, demographics (age, gender, etc.). A model is then built to predict whether a user will click on a given suggestion: what is the probability that the user clicks on it? For this, a logistic regression (plus a decision tree) is used to produce a ranking.
One final application of logistic regression is the prediction of CTR for advertising. When you use Google, you enter a keyword.


On the right you have ad slots; the other links are the algorithmic search results. Every time someone clicks on an ad, you get a certain amount of money.
Advertisers
• Bid x$ per keyword w
• Pay x$ every time someone clicks on the ad
• Give a maximum budget

→ Someone searches on a keyword w: which ad should the search engine display? The naïve solution is to just return the ad with the highest bid on w. Google's insight: expected profit = x$ * click-through rate (CTR). Display ads based on this product.
To sum up, we often want to assign a score to examples in prediction problems. Logistic regression is a workhorse in practice for this type of problem, and there are many relevant industry applications.


CHAPTER 4: RECOMMENDING PRODUCTS

→ How would you recommend products? What kind of data do we need? How do you collect it? We could for instance look at products that are usually sold together. We could also look at everything the customer has bought, and cluster customers. Another option could be to look at similar customer profiles.

There is an extensive class of web applications that involve predicting user responses to options. Such a facility is called a recommendation system.
Ex: offering news articles to online newspaper readers, based on a prediction of reader interest; offering customers of an online retailer suggestions about what they might like to buy, based on their past history of purchases or product searches.

1. Problem overview

When we think about the economics of the traditional retail market, space is a scarce and expensive commodity.
• Retailers: physical. You need space to rent. Physical delivery systems are characterized by a scarcity of resources.
• TV networks: time
• Movie theatres: space and time
• Brick-and-mortar stores have limited shelf space and can show the customer only a small fraction of all the choices that exist. In contrast, online stores can make anything that exists available to the customer.

→ People are not willing to travel far for products (buying, movies, etc.), so you need a local customer base. Implication: focus on popular products.

On the other hand, if you have an online economy, for instance on the web, storage space is cheap; it is much cheaper to store things on the web. Next, sites cater to everyone: you are not restricted to people near you, you can reach people all around the world. Implication: low cost and easy access, which means it is possible to offer more choice.

→ Problem: how can we find products given a huge number of choices? Solution: systems that can recommend products, particularly unusual or unpopular ones.

Ranking of products and sales:
• Physical stores: they can stock physical products, a limited offering.
• Mixed: Amazon, on the internet but they also have physical retailers.
• Online only: purely online. Ex: iTunes


The distinction between the physical and online worlds has been called the long tail phenomenon. Items are ordered on the horizontal axis according to their popularity. Physical institutions provide only the most popular items to the left of the vertical line, while the corresponding online institutions provide the entire range of items: the tail as well as the popular items.
Making recommendations. Ex: what do I want to watch tonight? Let's look online. What website could I use? Based on the viewed items, the system will recommend me a product.

An important concept in recommendation-system applications is the utility matrix. There are two classes of entities, which we shall refer to as users and items. Users have preferences for certain items, and these preferences must be teased out of the data. The data itself is represented as a utility matrix, giving for each user-item pair a value that represents what is known about the degree of preference of that user for that item.

Without a utility matrix, it is almost impossible to recommend items. However, acquiring data from which to build a utility matrix is often difficult. There are two general approaches to discovering the values users place on items:
• Ask users to rate items
• Make inferences from users' behaviour

→ The goal of a recommendation system is to predict the blanks in the utility matrix. It is not necessary to predict every blank, but only to discover some entries in each row that are likely to be high.
→ Find a large subset of those with the highest ratings.

1.1 Recommendation types

There exist loads of recommendation types.
• Editorial: these are lists of favourites. Ex: newspapers, travel magazines. News services have attempted to identify articles of interest to readers, based on the articles that they have read in the past.


This might be based on the similarity of important words in the documents, or on the articles read by people with similar reading tastes. You can also have lists of essential items.
• Aggregates
  o Top-10 lists
  o Most emailed articles
  o Most recent posts
• Personalized user recommendations: recommendations based on your preferences. Perhaps the most important use of recommendation systems is at online retailers.
  o Amazon
  o Movie sites (Netflix): Netflix offers its customers recommendations of movies they might like. These are based on ratings provided by users.

1.2 Challenges

There are 3 key challenges:
• How do we get user feedback?
• How do we predict an unknown rating? You want to be able to get this right.
• How do we evaluate predictions?

How could we obtain ratings? There are many ways. I want to predict what a specific user will like, based on his preferences.
• Explicit rating of products
• Bought items
• Items on "wish lists"
• Recently clicked product pages/links
• Length of time spent on a product page
• Printed links
• Etc.

Problem: predict ratings
• Given: information about a user's preferences, interests, likes/dislikes
• Predict: will a specific user like a given product, service, etc.?

Challenges:
• Sparse data: most users rate very few items
• Cold start: new items have no ratings

Main paradigms:


• Content-based filtering
• Collaborative filtering
• Hybrid: use both prior approaches

2. Content-based filtering

Idea: focus on properties of items. Similarity of items is determined by measuring the similarity in their properties. This is a machine learning problem, so we can just apply the basic standard techniques seen before. We can easily recommend items to a customer based on profiles of the past: we try to find items that are similar to what the customer has visited before, similar to previous items rated highly by the customer.

So we need to define a number of features and obtain data from the users. We then apply any machine learning algorithm to this data. Ex: for every movie, I construct features (actors, genre, director, ...) and then try to predict whether the user will like the movie. A key challenge is building the item profile. An item profile is a way to describe each item. Usually these are hand-crafted: you have to spend a lot of time building this profile, and for such tasks it is typically not done in an automated way. Ex: for a news article: words, title, author, etc.


Pros
• Only need data about one user; no need to have data about millions of users.
• Results in a more personalized approach (Ex: good if the user has unique taste).
• Can more easily recommend new/unpopular items.
• Can provide context for a recommendation (Ex: the path down a decision tree).
• Lots of research on classification algorithms: every time there is a new one, I can use it instead of the old one.

Cons
• Need to construct meaningful features that are predictive of the user's preferences (Ex: lots of hand-coding, guess & check).
• Never recommends items outside the user's content profile.
• Hard to build a profile for a new user: a sufficient number of examples is needed in order to recommend something.
• Ignores information about other users.

→ This is a supervised classification problem.

We not only need to create vectors describing items; we need to create vectors with the same components that describe the user's preferences; the utility matrix represents the connection between users and items. The best estimate we can make regarding which items the user likes is some aggregation of the profiles of those items.
A completely different approach to a recommendation system using item profiles and utility matrices is to treat the problem as one of machine learning. Regard the given data as a training set, and for each user build a classifier that predicts the rating of all items. The construction of a decision tree might be handy. Nevertheless, it requires the selection of a predicate for each interior node, and there are many ways of picking the best predicate. Once the predicate has been chosen, we divide the items into two groups: those that satisfy the predicate and those that do not.

3. Collaborative filtering

Idea: find users with similar tastes and recommend products they liked. The big insight is that we can do this just by looking at ratings, and use these ratings to make the predictions; there is no need to look at features. So, the big idea is to find other users whose ratings are similar to the current user's, and then propagate their (dis)likes to the current user.


We have a standard matrix where each row corresponds to a user, so we have loads of users. Each column is a product. The entries are the ratings. There are many users, many products, and most of the entries will be empty. Very sparse! Most of the entries are missing here.

So, we have our database and an active user, and we want to find similarities between the current user and all the other users in the database. For instance, Bob and Eve have the same ratings, and these ratings are different from Alice's, Chris's, etc. We then try to find out how strong the similarity is. The algorithm works in four steps (a small sketch of steps 1 and 3 follows the list).

• STEP 1: Measure the similarity between the user of interest and all other users. The question is: how can we compute similarity between two users? There are loads of similarity metrics. I have two users, I know what the first is interested in, and I should make recommendations for the second one.

A naïve approach treats each user's ratings as a vector and therefore treats missing ratings as zero. Treating a missing rating as zero is wrong, since a missing rating is clearly not a zero rating. The Pearson correlation is a possible solution.

The key thing to remember with the Pearson correlation is that it only considers items k rated by both users. This is just the standard correlation we could use.


If this correlation equals 1, there is a perfect linear relationship with both increasing in the same direction, whereas a correlation equal to -1 means that when one rating increases, the other one decreases.

• STEP 2 (optional): select a smaller subset consisting of the most similar users. You could use the whole database, but there would be loads of people with no correlation; that's why we ignore "far away" users and pick only the k nearest users. We could also decide to include all users above a predetermined weight threshold. Note: any k-NN strategy could be used here.

• STEP 3: Predict ratings as a weighted combination of the "nearest neighbours". Now we want to predict ratings. One big problem here is that people have different rating scales: one user might use 5 stars, while others may use only 3. We therefore first compute each user's average rating, and then combine the neighbours' mean-centred ratings, weighted by similarity (see the sketch below).
One potential problem here is that correlations based on very few co-rated items may be inaccurate. A simple solution is to adjust the Pearson correlation based on the number of co-rated items: take some threshold and shrink the correlation accordingly.

• STEP 4: Return the highest-rated items. We don't want to recommend items with a low predicted rating. Also, we don't want to overload the users by returning, for instance, thousands of possible products. We usually return the top 2, 5 or 10 items; it is then possible to give the user the option to view more items.
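A minimal sketch of steps 1 and 3: Pearson similarity over co-rated items, then a similarity-weighted, mean-centred prediction. The dictionary-of-dictionaries ratings and the absence of shrinkage are our simplifications:

```python
import numpy as np

ratings = {                      # toy utility "matrix": user -> {item: rating}
    "alice": {"m1": 5, "m2": 3, "m3": 4},
    "bob":   {"m1": 4, "m2": 2, "m4": 5},
    "eve":   {"m2": 1, "m3": 2, "m4": 4},
}

def pearson(u, v):
    """Pearson correlation computed only over items rated by both users."""
    common = set(ratings[u]) & set(ratings[v])
    if len(common) < 2:
        return 0.0
    ru = np.array([ratings[u][i] for i in common], dtype=float)
    rv = np.array([ratings[v][i] for i in common], dtype=float)
    ru -= ru.mean()
    rv -= rv.mean()
    denom = np.sqrt((ru ** 2).sum() * (rv ** 2).sum())
    return float(ru @ rv / denom) if denom else 0.0

def predict(user, item):
    """Mean-centred, similarity-weighted average of the neighbours' ratings."""
    user_mean = np.mean(list(ratings[user].values()))
    num = den = 0.0
    for other in ratings:
        if other != user and item in ratings[other]:
            sim = pearson(user, other)
            other_mean = np.mean(list(ratings[other].values()))
            num += sim * (ratings[other][item] - other_mean)
            den += abs(sim)
    return user_mean + num / den if den else user_mean

print(predict("alice", "m4"))   # predicted rating for an unseen item
```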

→ Could you employ the basic collaborative filtering algorithm between pairs of items instead of pairs of users? Would this work?


→ An example is provided in the slides. Given a user-movie pair, I want to predict the rating. The big practical issue is that we have large datasets: we could have millions of users and hundreds of thousands of items. The key issue is how to efficiently find similar items, since we want to compute pairwise similarity between users.
• Pairwise similarity for 1M users: roughly 5 days
• Pairwise similarity for 10M users: roughly 1 year!

The trick is to use locality-sensitive hashing.

Pros
• Simple and intuitive approach
• Works for any kind of item
  o No feature design
  o No feature selection
• Works between pairs of users
• Exploits information about other users/items

Cons
• Data sparsity: even with lots of data, it is hard to find users that rated the same items
• Cold start: needs enough users in the database
• First-rater problem: can't recommend unrated items
  o New products
  o Unique items
• Popularity bias: favours items that lots of people like (i.e., bad if you have unique taste)

4. Evaluation

→ How do we evaluate these algorithms? Here, we have a huge rating matrix. We hold out the most recent ratings from the matrix and then make predictions on these most recent ratings.

The two most typical metrics are the root-mean-square error (RMSE) and Spearman's rank correlation.
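For reference, the RMSE over the held-out test ratings is (standard definition, not spelled out in the notes):

\[ \mathrm{RMSE} \;=\; \sqrt{\frac{1}{|T|}\sum_{(u,i)\in T}\big(\hat{r}_{ui} - r_{ui}\big)^2} \]

where T is the set of held-out user-item pairs, r_ui the true rating and r̂_ui the predicted rating.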


Assume 0/1 data
• Coverage: number of items/users for which the system can make a prediction. You want to be able to make as many predictions as possible.
• Precision: accuracy of the predicted items
• Receiver operating characteristic (ROC) curve
  o False positive rate
  o True positive rate

→ Weaknesses of these metrics
• Focusing on accuracy misses important points
  o Prediction diversity
  o Prediction context
  o Order of predictions
• Only high ratings matter: RMSE might penalize a method that does well for high ratings and badly for others.

5. Case study: Netflix challenge

Netflix is a monthly subscription service providing movies and TV. The original model was a mail-order DVD service. People could buy digital rights to many movies. It was also possible to rank movies of interest and receive them in the mail based on the movies you wanted to see. Today, things have changed and we face a new model: streaming online content, capitalizing on "binge watching" and creating their own content.
In 2009, Netflix had:
• 100,000 movies
• 10 million customers
• mostly US-based

Currently it tends to be very international: it was first based in the US, now it is very international.
• The content depends on the region.


• In the US: ~5,000 movies and ~2,000 TV shows.
• 75 million subscribers

They wanted to have a recommender system. The idea was to create a competition for 1 million dollars. That's why, in 2006, they announced the Netflix Prize, a machine learning and data mining competition for movie rating prediction. $1 million was offered to whoever improved the accuracy of the existing system, called Cinematch. Proposed algorithms were tested on their ability to predict the ratings in a secret remainder of the larger dataset. A team came up with a final combination of 107 algorithms. To put these algorithms to use, Netflix had to work to overcome some limitations, for instance that they were built to handle 100 million ratings, instead of the more than 5 billion that they have, and that they were not built to adapt as members added more ratings.

Given: a training set of 100 million ratings.
Do: build a recommendation system that improves the root-mean-square error by 10% over Netflix's system.

• Provided 100 million ratings
  o Matrix is 99% sparse
  o 480,000 users
  o 17,700 movies
• Ratings include
  o User
  o Movie
  o Rating
  o Time-stamp

→ Why sponsor a challenge? Netflix only makes money if they keep customers. Cinematch, their internal system, was expensive to develop and made slow progress. Next, publicity is good and the prize would pay for itself.

→ What is there to lose?
• Negative publicity
  o Concerns about user privacy (i.e., user backlash to the data release)
  o Prize won too quickly
  o No one wins the prize
• Time and effort associated with running the competition (setting up servers)
• Results not useful in practice (e.g., too slow)


Evaluation
→ Why was the test set split? To avoid overfitting.

The competition began in October of 2006. There were many more teams than anticipated, eventually around 40,000. There was lots and lots of initial progress, and within weeks a 1% improvement over the baseline system was achieved.

There are four important big ideas:
• Modelling global and local biases
• Latent factor models
• Modelling temporal dynamics
• Try lots and lots of different models and combine the output

5.1 Modelling global and local biases

We have two components:
• User mean rating
• User-movie effect

The better baseline rating takes the mean rating and divides it into three components: the overall mean, to which we add a user bias and a movie bias (see the formula below).

• User bias: how much higher or lower this user rates compared to other users. Each user has a different rating scale, so what a number means depends on the individual user. Other things affect the user bias, like the values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts).


Ex: ask people to write down their cell phone number and then ask "how many countries are there in Africa?"; people will respond based on the previously given number (anchoring).

• Movie bias: how much higher or lower this movie's ratings are compared to other movies. Relevant factors include the (recent) popularity of movie i, selection bias, and the number of ratings a user gave on the same day ("frequency"). By looking at these we get indications and expectations.

→ Capturing the global effects.
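Written out (our notation; the notes only describe it in words), the baseline estimate for user u and movie i is:

\[ b_{ui} \;=\; \mu + b_u + b_i \]

where μ is the overall mean rating, b_u the user bias and b_i the movie bias; the predicted rating is then this baseline plus the local (neighbourhood) correction.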

→ Instead of using raw similarities, we replace them by learned weights.


We can actually train the weights to optimize the objective function, that is to say, minimize this objective!
• Take an initial guess for the weights
• Take the derivative of the function
• Update the values of the weights in the descent direction
• Repeat until convergence
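A minimal sketch of that loop for a generic differentiable objective; the quadratic objective below is only a stand-in for the real interpolation-weight objective:

```python
import numpy as np

def gradient_descent(grad, w0, lr=0.01, n_iters=2000):
    """Generic gradient descent: repeatedly step against the gradient."""
    w = np.array(w0, dtype=float)
    for _ in range(n_iters):
        w -= lr * grad(w)        # update weights in the descent direction
    return w

# Stand-in objective f(w) = ||A w - b||^2 with gradient 2 A^T (A w - b).
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])

def grad(w):
    return 2 * A.T @ (A @ w - b)

print(gradient_descent(grad, w0=[0.0, 0.0]))   # converges towards [0, 0.5]
```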


In 2007, the data appeared as the challenge task at KDD and there was an associated workshop. The winners of the progress prize were from BellKor, with an 8.4% improvement (Yehuda Koren, Bob Bell, Chris Volinsky, AT&T Research).

5.2 Latent factor models

Idea: factorize the rating matrix and decompose it into interesting, smaller parts. Topics (latent factors) capture shared hidden structure. You want the number of topics to be as small as possible.

→ Examples are provided in the slides.

Solve via an optimization problem.
• Challenges:
  o Sparse matrix with many missing entries
  o Slow to compute the gradient
  o Need to avoid overfitting
• Solutions:
  o Solve least squares over just the observed data
  o Stochastic gradient descent
  o Shrink towards the mean if there is little data (regularization)
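A minimal sketch of such a latent factor model trained with stochastic gradient descent on the observed entries only, with shrinkage towards zero as regularization; the hyperparameters and toy data are arbitrary:

```python
import numpy as np

# Toy observed ratings: (user, item, rating) triples; everything else is missing.
observed = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
n_users, n_items, n_factors = 3, 3, 2

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, n_factors))   # user factors
Q = rng.normal(scale=0.1, size=(n_items, n_factors))   # item factors

lr, reg = 0.05, 0.02
for _ in range(500):                     # SGD epochs over the observed entries
    for u, i, r in observed:
        err = r - P[u] @ Q[i]            # error on this single rating
        P[u] += lr * (err * Q[i] - reg * P[u])   # regularized SGD updates
        Q[i] += lr * (err * P[u] - reg * Q[i])

print(P @ Q.T)   # reconstructed (and filled-in) rating matrix
```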

In 2008, progress slowed down and many teams dropped out due to the time needed to compete. Many papers were published on the task. The leaders had a 9.4% improvement.


5.3 Modelling temporal dynamics

In 2004, there was a considerable jump in the ratings people assigned to movies. This could be due to improvements of Cinematch: Netflix improved its matching, leading to higher ratings? Another explanation could be that people became biased towards higher ratings: the meaning of a rating changed?

5.4 Try lots and lots of different models and combine the output

People were getting desperate: they tried lots of things, but still had not reached the 10% threshold.
Idea: the "kitchen sink approach". Build lots and lots of predictors:
• Classifiers
• Collaborative filters
• Ensembles
Then come up with clever ways to blend the results.
At the end of June 2009, the leading team submitted results that exceeded the 10% threshold. The competition entered the 30-day final period.


A new "ensemble" team was formed from a collaboration of others near the top of the leaderboard, and quickly beat the 10% threshold too. The race was on...

Take-away messages
• Knowing your data is crucial
• Must account for unpredictable user behaviour
• Discovering hidden structure can help
• When in doubt, combine lots of models

The competition served as publicity (and money) for the winning team and its members. Netflix received much attention and valuable new algorithms.
Data mining / machine learning:
• Excitement and publicity for the field
• Interesting new publications
• A valuable dataset

To sum up, recommender systems are an important and active area of research and use. Three paradigms:
• Content-based: based on designing features and applying classification/regression algorithms
• Collaborative: based on comparing users/items (aka nearest neighbour)
• Hybrid: blends both approaches
→ The Netflix challenge was interesting and successful.

One of the reasons their focus in the recommendation algorithms has changed is that Netflix as a whole has changed dramatically in the last few years. Netflix launched an instant streaming service in 2007. Streaming members are looking for something great to watch right now; they can sample a few videos before settling on one, they can consume several in one session, and statistics are available such as whether a video was watched fully or partially. Netflix adapted its personalization algorithms to this new scenario in such a way that now 75% of what people watch comes from some sort of recommendation. They reached this point by continuously optimizing the member experience, and have measured significant gains in member satisfaction whenever they improved the personalization for their members.
Over the years, Netflix discovered that there is tremendous value to their subscribers in incorporating recommendations to personalize as much of Netflix as possible. Personalization starts on their homepage, consisting of groups of videos arranged in horizontal rows. Ex: the Top 10 row: the best guess at the ten titles people are most likely to enjoy.


An important element is awareness: Netflix wants members to be aware of how the service is adapting to their tastes. This encourages members to give feedback that will result in better recommendations.
Next, recommendations are partly based on our friends. Knowing about your friends not only gives Netflix another signal to use in the personalization algorithms, but it also allows for different rows that rely mostly on the social circle to generate recommendations.
Similarity is another important source of personalization in their service. Similarity can be thought of in a very broad sense: it can be between movies or between members, and can be in multiple dimensions such as metadata, ratings, or viewing data.
In most of the previous contexts, be it in the Top 10 row, the genres, or the similars, ranking (the choice of what order to place the items in a row) is critical in providing an effective personalized experience. The goal of Netflix's ranking system is to find the best possible ordering of a set of items for a member, within a specific context, in real time. Their business objective is to maximize member satisfaction and month-to-month subscription retention, which correlates well with maximizing consumption of video content.


CHAPTER 5: MODEL ENSEMBLES

1. Motivation and overview

An ensemble of classifiers is a set of classifiers whose individual decisions are combined in some way to classify new examples.
One good learner produces one effective classifier: could learning many classifiers help?
The main discovery is that ensembles are often much more accurate than the individual classifiers that make them up. An ensemble can be more accurate than its component classifiers only if the individual classifiers disagree with one another.
Human ensembles are demonstrably better. Ex: how many jelly beans are in the jar? Individual estimates vs. the group average. Ex: Who Wants to Be a Millionaire: expert friend vs. audience vote.
→ If classifiers make INDEPENDENT mistakes, then h* (the combined hypothesis) is more accurate, because the probability that they all make a mistake is low.

Let’s say we assume independent errors (30%) and a majority vote. We have 21classifierswhichallhavethesameerrorrate.

Areaundercurvefor>=11wrongis0.026.

ð Orderofmagnitudeimprovement!
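As a quick check, a minimal sketch of that number (a toy calculation, not course code): with 21 independent classifiers that each err 30% of the time, the majority vote errs only when at least 11 of them are wrong.

    from math import comb

    n, p = 21, 0.30
    # Majority vote of 21 classifiers is wrong when at least 11 of them are wrong.
    p_majority_wrong = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(11, n + 1))
    print(round(p_majority_wrong, 3))  # ~0.026, versus 0.30 for a single classifier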

How to generate the base classifiers?
• Different learners?
• Bootstrap samples?
• Etc.


How to integrate/combine them?

• Average
• Weighted average
• Instance-specific decisions
• Etc.

Ensemble approaches

Sample the dataset: create replicates in some way, or reweight the examples in some way.

• Bagging
• Boosting

Manipulate features
• Input features: for instance, I can only show a subset of them, train one classifier first, then another, etc.
• Target features: create a new classifier by grouping together some labels.

Add randomness
• Data
• Algorithm

Stacking

2. Sampling-based approaches

This first method manipulates the training examples to generate multiple hypotheses. The learning algorithm is run several times, each time with a different subset of the training examples. This works especially well for unstable learning algorithms. The key idea here has to do with the stability of an algorithm. Unstable learner: minor variations in the training data result in major changes in the classifier output, i.e. lead to a different model.

• Unstable: decision trees, neural networks, rule learning algorithms
• Stable: linear regression, nearest neighbour, linear threshold algorithms, etc.

⇒ Subsampling is best for unstable learners

• Bagging
• Boosting
• Cross-validated committees

2.1 Bagging: Bootstrap Aggregating

Given: dataset S, integer T
For i = 1, ..., T:
    Si = bootstrap replicate of S (i.e., sample with replacement)
    hi = apply learning algorithm to Si


Classify a test instance using an unweighted vote.

Draw |data set| examples with replacement. Each sample contains about 63.2% of the original examples (plus duplicates).

Bagging is the most straightforward way of manipulating the training set. On each run, bagging presents the learning algorithm with a training set that consists of a sample of m training examples drawn randomly with replacement from the original training set of m items.
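A minimal sketch of this procedure (the learner interface learn(data) and the model's .predict(x) method are hypothetical placeholders, not course code):

    import random
    from collections import Counter

    def bagging(data, learn, T):
        """Train T classifiers, each on a bootstrap replicate (sample with replacement)."""
        models = []
        for _ in range(T):
            replicate = [random.choice(data) for _ in range(len(data))]  # ~63.2% unique
            models.append(learn(replicate))
        return models

    def bagged_classify(models, x):
        """Classify a test instance with an unweighted majority vote."""
        votes = Counter(m.predict(x) for m in models)
        return votes.most_common(1)[0][0]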

2.2 AdaBoost (Adaptive Boosting)

• Idea 1: Assign weights to examples, and iteratively change the weights.
  Decrease the weight of correctly labelled examples.
  Increase the weight of incorrectly labelled examples => focus attention on misclassified examples.
  The algorithm wants to decrease the focus on examples that are already well classified and make sure it gets the misclassified instances right.

• Idea 2: Assign a weight to each learned hypothesis based on how accurate it is. Use a weighted vote to label new examples.
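A minimal sketch of these two ideas for labels in {-1, +1} (weak_learn is a hypothetical weak-learner interface; the update shown is the standard AdaBoost rule, not code from the course):

    import math

    def adaboost(data, labels, weak_learn, T):
        n = len(data)
        w = [1.0 / n] * n                              # Idea 1: one weight per example
        hypotheses, alphas = [], []
        for _ in range(T):
            h = weak_learn(data, labels, w)            # weak learner sees the weights
            err = sum(wi for wi, x, y in zip(w, data, labels) if h(x) != y)
            alpha = 0.5 * math.log((1 - err) / err)    # Idea 2: weight of this hypothesis (0 < err < 0.5 assumed)
            # Up-weight mistakes, down-weight correct examples, then renormalize.
            w = [wi * math.exp(-alpha * y * h(x)) for wi, x, y in zip(w, data, labels)]
            total = sum(w)
            w = [wi / total for wi in w]
            hypotheses.append(h)
            alphas.append(alpha)
        return hypotheses, alphas

    def weighted_vote(hypotheses, alphas, x):
        return 1 if sum(a * h(x) for h, a in zip(hypotheses, alphas)) >= 0 else -1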


Assume that we are going to make one axis-parallel cut through the feature space.

We have 3 misclassified instances. What we want to do is assign a higher weight to these 3 misclassified examples. In brief, we up-weight the mistakes and we down-weight everything else.

⇒ How will the number of rounds affect generalization?

Expect:
• Training error to drop or reach 0
• Test error to increase when h* becomes too complex: "Occam's razor" (i.e. overfitting)
• Hard to know when to stop training

However, often the test error does not increase, even after 100 rounds. The test error continues to drop even after the training error is 0! Occam's razor ("simpler is better") appears not to apply!

The key idea here is called margins. The training error only measures whether classifications are right or wrong, but it should also consider the confidence of classifications.


Margins try to exploit the confidence. H* is a weighted majority vote of weak classifiers, each of which is better than 50-50. We can measure the confidence by the margin: take the weighted vote and calculate how much weight is on the positive case and how much on the negative case.
→ margin = (weighted vote for +) − (weighted vote for −)

It turns out that boosting really tries to build up a classifier with a high margin. Boosting increases the margin very aggressively since it concentrates on the hardest examples. If the margin is large, more weak learners agree, and hence more rounds do not necessarily imply that the final classifier is getting more complex.
Theorem: large margins yield a better bound on the generalization error. If all training examples have large margins, then we can approximate the final classifier by a much smaller classifier. It is similar to how polls can predict the outcome of a not-too-close election.

2.3 Cross-validation

Another training set sampling method is to construct the training sets by leaving out disjoint subsets of the training data.
Ex: the training set can be randomly divided into 10 disjoint subsets. Then 10 overlapping training sets can be constructed by dropping out a different one of these 10 subsets.

2.4 Gradient Tree Boosting

The base algorithm is old but very hyped now. We will focus on the least-squares regression case and ignore some of the mathematical details.
Gradient Boosting = Gradient Descent + Boosting. It fits an additive model (ensemble) in a greedy forward stage-wise manner. Each stage introduces a weak learner to address the shortcomings of the current model. Shortcomings are identified by gradients.


Intuitively, we start with a very simple hypothesis.

The model is just an additive function whose components are regression trees (decision trees). I start with the initial dataset and I want to learn a model; hence, I evaluate all possible decision trees. Idea: incrementally get closer to the true labels.
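A minimal sketch of this stage-wise procedure for squared loss (fit_tree is a hypothetical depth-limited regression-tree learner; the learning rate is an assumed choice, not the course's):

    def gradient_boost(X, y, fit_tree, rounds, learning_rate=0.1):
        # Stage 0: a constant model; the mean minimizes squared loss.
        f0 = sum(y) / len(y)
        predictions = [f0] * len(y)
        trees = []
        for _ in range(rounds):
            # For squared loss the negative gradient is simply the residual y - F(x).
            residuals = [yi - pi for yi, pi in zip(y, predictions)]
            tree = fit_tree(X, residuals)          # weak learner fits the shortcomings
            predictions = [pi + learning_rate * tree.predict(xi)
                           for pi, xi in zip(predictions, X)]
            trees.append(tree)
        return f0, trees

    def boosted_predict(f0, trees, x, learning_rate=0.1):
        return f0 + learning_rate * sum(t.predict(x) for t in trees)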

Recall: gradient descent.


⇒ Connection to gradient descent

Why is Gradient Boosting so powerful? It is a generalization of AdaBoost. AdaBoost designs a particular function and tries to optimize classification.


Gradient Boosting abstracts the algorithm away from the loss function and hence from the task. Thus, it can plug in any differentiable loss function and use the same algorithm:

• Other regression loss functions
• Classification
• Ranking

Several details were skipped:

• Regularization to avoid overfitting
  o Restrict the depth of trees by not considering all possible trees; enumerate trees only up to a certain depth.
  o Add a penalty term to the objective function J
• Setting the learning rate η
• Derivations and discussion of all loss functions

AdaBoost vs. Gradient Boosting
⇒ How are they similar and how are they different?

• Similarities

  o Stage-wise (first model, then second model, etc.) greedy learning of an additive model.

  o Focus on mispredicted examples. The model makes some mistakes and both techniques want to focus on these mistakes.

• Differences
  o Focus on mispredictions
    § AdaBoost: high-weight data points
    § Gradient Boosting: gradient of the loss function; focus on a little part of each example.
  o Generality
    § AdaBoost: just classification.
    § Gradient Boosting: any differentiable loss

3. Manipulate features

Idea: different learners see different subsets of features (of each training instance). Empirically it provided mixed results. The technique works best when the input features are highly redundant.


⇒ How to use this in a sampling context? The idea is to create more than log k models, to get some redundancy: "error-correcting codes".

Given: integer T
For i = 1 to T:

• Partition the labels into two disjoint sets
• Build a classifier to distinguish between these sets of examples

4. Add randomness

Neural networks:
• Different initial values
• Not really independent

Decision trees:
• Consider the top 20 attributes, choose one at random?
• Produce 200 classifiers
• To classify a new instance: vote

FOIL:
• Choose any test with FOIL gain within 80% of the top
• Good empirical performance

Random Forests

Random forests, or random decision forests, are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.

Increasing i:
• Increases correlation among individual trees (BAD)
• Also increases accuracy of individual trees (GOOD)

Use a tuning set to choose a good setting for i. Overall, random forests:

• Are very fast
• Deal with a large number of features
• Reduce overfitting substantially
• Work very well in practice

Random forests differ in only one way from the general scheme of bagging: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This process is sometimes called "feature bagging". The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the B trees, causing them to become correlated.
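As a minimal sketch, assuming scikit-learn is available (one possible configuration, not the course's own setup), the scheme above maps directly onto a standard random-forest learner:

    from sklearn.ensemble import RandomForestClassifier

    forest = RandomForestClassifier(
        n_estimators=200,     # "produce 200 classifiers"
        max_features="sqrt",  # random subset of features considered at each split
        bootstrap=True,       # each tree is trained on a bootstrap replicate
    )
    # forest.fit(X_train, y_train); forest.predict(X_test)  -> majority vote of the trees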

5. Stacking

Stacking is similar to boosting: you also apply several models to your original data. The difference, however, is that you don't have just an empirical formula for your weight function; rather, you introduce a meta-level and use another model/approach to estimate, from the inputs together with the outputs of every model, the weights — in other words, to determine which models perform well and which badly given these input data.
Idea: you want to learn whether each learner is good.
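A minimal sketch of the meta-level idea (base_learners and meta_learner are hypothetical objects with scikit-learn-style fit/predict methods; a real setup would use out-of-fold predictions as meta-features to avoid overfitting):

    def stack_fit(base_learners, meta_learner, X, y):
        for model in base_learners:
            model.fit(X, y)
        # Meta-level features: one prediction per base learner for every example.
        meta_X = [[m.predict([x])[0] for m in base_learners] for x in X]
        meta_learner.fit(meta_X, y)      # learns which base models to trust
        return base_learners, meta_learner

    def stack_predict(base_learners, meta_learner, x):
        meta_features = [[m.predict([x])[0] for m in base_learners]]
        return meta_learner.predict(meta_features)[0]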


6. Why do ensemble methods work?

⇒ There are four possible explanations for why ensembles are better than basic classifiers.

6.1 Expected error


• Bias: left picture.
• Variance: right picture; complicated hypothesis space.

Sources of bias

• Bias comes from the inability to represent certain decision boundaries. E.g., linear threshold units, decision trees.

• You make incorrect assumptions
  o E.g., failure of the independence assumption in naïve Bayes

• Classifiers that are "too global"
  o E.g., a single linear separator, a small decision tree

• If bias is high, the model is underfitting the data

Sources of variance
• Statistical sources. Classifiers are "too local" and can easily fit the data (e.g., nearest neighbour, large decision trees).
• Computational sources

  o Decisions made on small subsets of the data (e.g., decision tree splits near the leaves)

  o Randomization in the learning algorithm (e.g., neural nets with random initial weights)

  o Unstable learning algorithms (e.g., the decision boundary can change if one training example changes)

o High variance ⇒ the model is overfitting the data

6.2 Bias/variance explanation


Bias and variance are two components of error.

§ Bias: inability to represent the true concept. I have some function I want to predict, based on a hypothesis and model; the error I still make is called "bias". We can think about bias when we don't have a good hypothesis space.

§ Variance: fluctuations due to random variations in the data. Appears when the hypothesis class is too big.

⇒ Hence, we are faced with a trade-off!

§ A more expressive class of hypotheses generates higher variance.
§ A less expressive class generates higher bias.

I have to make a decision beforehand based on whether I prefer high bias or high variance.

⇒ Ensembles can address both bias and variance!

§ Bagging: if the bootstrap approximation were correct, it would reduce variance without changing bias. In practice, bagging can reduce both.

  • For high-bias classifiers, it can reduce bias
  • For high-variance classifiers, it can reduce variance

§ Boosting: attempts to let you work with a more expressive hypothesis class.

  • Early iterations: primarily reduce bias
  • Later iterations: primarily reduce variance (apparently)

6.3 Statistical explanation

To motivate this explanation, we can look at an example of flipping coins.


⇒ I can then compute the probability that I used each of the coins, based on these figures.

I see that there is a small chance that I flipped C1, but a bigger chance that it was C2 or C3.

A Bayesian estimate could be a better idea. We make the probability of seeing a head a weighted vote among all possibilities. We then get a weighted vote about the predictions.


⇒ How can the learning algorithm select among the set of (almost) equally good hypotheses? How can I pick the best hypothesis? I can for example think about the error rate.
From theory we know the Bayes optimal classifier, which makes a weighted majority vote among all possible hypotheses, weighted by their posterior probability. That is the best we can do: it is probably the best possible classifier, that is, it minimizes the error rate. But it is actually infeasible.

⇒ Ensemble learning approximates the Bayes optimal classifier

6.4 Representational explanation

Each learner works with a given hypothesis class (i.e., set of possible models). It is possible that the target function lies outside of the considered hypothesis class.

• Linear classifier: I need a function that separates positive and negative examples, but no single separating line exists.

• Decision tree: with only straight (axis-parallel) cuts, a decision tree cannot represent a diagonal line exactly.

An ensemble average may approximate the true target function with arbitrarily good accuracy.

• Curved boundary by averaging lines
• Diagonal boundary by averaging "staircases"

6.5 Computational explanation

Most learning algorithms search through a hypothesis space to find one "good" model. The hypothesis space could be a space of decision trees, regression lines, etc.


The most interesting hypothesis spaces are huge or infinite, so we have to do a heuristic search. The learning might get stuck in a local minimum. One strategy for avoiding local minima is to repeat the search many times with random restarts. This is essentially what bagging does. This method is really powerful.

7. Some applications

The best, or at least the most successful, application is web search.

It uses ensembles of decision trees, learned with gradient boosting. Think about the data: we have a query, here a radio station, and then different URLs (= training data). We also have a ranking among the lines. An ensemble of trees is then induced.

A score is used to rank the URLs: I want to make sure I pick the perfect one first, then a good one, etc. They want to predict a score and learn decision trees to perform the ranking. One thing that is important is to evaluate the trees. Another reason for using ensembles here is that this method tends to be very accurate.
Another area where ensembles are widely used is edge detection. Given an image, you want to detect edges, finding the outline of the different figures. What you can do is use an ensemble, for instance a random forest. We will have some probabilistic models, train decision trees, etc.
A third area is motion capture. I want to model a person. The gold standard is to have a good camera tracking system. You then have a random map and you try to get the location of the people. This is widely used in the sports domain to look at people's behaviour on the field. This method is really accurate; however, the lab setup is incredibly expensive and very time consuming.
Video games that try to recognize your gestures: you try to infer what the poses of people are. You get a depth image, then go to body parts before moving on to a 3D joint proposal. Decision trees are used throughout the procedure.


To sum up, a committee of experts is typically more effective than a single super genius.
Key issues

• Generating base models
• Integrating responses from base models: the predictions have to be combined or aggregated.

Popular ensemble techniques
• Manipulate training data: bagging and boosting
• Manipulate output values: error-correcting output coding

⇒ Bias/variance is really important in data mining.


INTERLUDE: ACTIONABLE DATA MINING

• How many of you go to the same grocery store every week?
• How many of you go to the same bar?
• How many of you drink one type of beer?
• What would make you change?
• Why do we work this way?

What we want to take away is that people typically have similar habits. Similar habits make things easier to analyse, since people enter a certain routine. Once you have certain habits, it becomes hard to change.
Task: based on customer information, you want to predict whether a customer is pregnant or not. Many questions then arise:

• What data is needed? Where would you get it? We need data about gender and age, plus purchase data. We also need data from search engines, looking at which products people are looking for. I should try to figure out what pregnant people are looking for.

• What techniques are applicable to this problem? This is a binary classification problem, where you may want to assign a score to pregnant people.

• What would you look for?
• How could you exploit your findings? The main goal of data mining is to act upon the data. So if I know someone is pregnant, I want to use this information to act in a certain manner. I may be tempted to recommend some products, for instance, or propose some ads, etc.

• Are there risks involved? If so, what are they? There are risks related to privacy, and then questions arise: is this rational or not? There is also a risk of false positives. Take for instance a person who searches on behalf of a friend but who might not be pregnant at all.


INTERLUDE: LARGE SCALE DECISION TREE LEARNING

1. Decision tree overview

• Internal nodes are choice nodes.

• Leaf nodes are prediction nodes. They represent the classification.

I have a bunch of training data. I then split, picking the best internal node. You first want to check whether all points are of the same class. Otherwise you have to run through all attributes and evaluate the splits on each of them. You then find the best one and partition the data you have.

⇒ When will this basic decision tree algorithm be inefficient? Think about very large datasets. Answer: if the training data does not fit in memory. Every time I want to do a split, I have to do a full scan of the data. So, I might keep a list of instances I want to look at. There are many examples I can ignore.

2. Rainforest

⇒ Can we improve? Yes! We just need aggregate information at each node in order to compute the split point. We can just store the aggregate information. The data structure is called the Attribute-Value, Class-label set or AVC-set. Counts for each class are aggregated and it stores the class label distribution for each attribute value. The AVC-group is the collection of AVC-sets for all possible attributes.
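A minimal sketch of building such AVC-sets in one scan (the record format and field names are made-up examples, not from the course):

    from collections import defaultdict

    def build_avc_group(records, attributes, label):
        """records: list of dicts. Returns {attribute: {value: {class: count}}}."""
        avc = {a: defaultdict(lambda: defaultdict(int)) for a in attributes}
        for r in records:
            for a in attributes:
                avc[a][r[a]][r[label]] += 1   # class-label distribution per attribute value
        return avc

    rows = [{"age": "<30", "income": "high", "buys": "no"},
            {"age": "<30", "income": "low",  "buys": "yes"},
            {"age": "30+", "income": "high", "buys": "yes"}]
    print(dict(build_avc_group(rows, ["age", "income"], "buys")["age"]["<30"]))
    # {'no': 1, 'yes': 1}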

RainForest algorithms

• RF-Write: always write out partitions to disk


  Pro: do not have to read the whole DB each time
  Con: writing to disk is expensive

• RF-Read: scan the whole database at each node level
  Pro: avoids expensive write operations
  Con: reads lots of irrelevant information

• RF-Hybrid: mixed strategy. Use RF-Read until the AVC-groups of the child nodes don't fit in memory. For each level where the AVC-groups don't fit in memory, partition the child nodes into sets M and N:

  • AVC-groups for n ∈ M all fit in memory
  • AVC-groups for n ∈ N are built on disk

  Process the nodes in memory, then fill memory from disk.

I have a database and I scan the data. I build the AVC-set for each of my possible nodes. I pick my root node.

I now get two nodes. For each node I have to consider the possible splits; I consider each attribute. So, I doubled the number of attributes I have to consider.

3. BOAT

Another way to think about scaling up is the BOAT algorithm. This stands for Bootstrapped Optimistic Algorithm for Tree construction. It is very scalable for learning decision trees. Indeed, it needs very few scans (perhaps 2) over the data, and it constructs multiple levels at once.


The algorithm runs in two phases:

• Sampling phase: look at as much of the dataset as we can. Take a sample D' ⊂ D from the training database such that D' fits in memory. Construct b bootstrap trees T1, ..., Tb by creating training samples D1, ..., Db obtained by sampling with replacement from D'. Perform bagging! I then end up with a large number of trees, which the algorithm will try to combine into a unified, single tree. So, it will try to create a summary tree.

• Create coarse splitting criteria: process the trees in a top-down fashion. At each node N, check whether the splitting criterion is identical for all trees. If not, delete N and its subtrees in all trees. If they agree, construct the coarse splitting criterion.

⇒ Coarse splitting criteria
  • Discrete: just the attribute values
  • Continuous: a confidence interval [L, U] such that the true split is likely in the interval. We have a range of possible values.

Both trees agree on the root node. On the left subtree they all agree. On the right subtree they agree, but then they disagree.

On the left tree we have > 20, on the other side > 25. So for age we don't know where the actual threshold will be. For the final tree, we compute an exact split.

• Cleaning phase

  For a discrete attribute the coarse splitting criterion is the exact splitting attribute, whereas for continuous data we only try splits within the confidence interval. Collect all examples that follow this path AND fall within the confidence interval, then compute the exact split based on these examples.


CHAPTER 6: ASSOCIATION RULE MINING

⇒ How to find patterns?

1. Introduction and definitions

Given: a set of transactions
Find: if-then rules that predict the occurrence of an item based on other items in the transaction.

The main motivation is finding regularities in data. What products were often purchased together? What kinds of DNA are sensitive to a new drug? Try to find co-occurrences. We try to find patterns in the data, patterns that predict a certain variable.
Foundation for many data mining tasks:

• Association
• Correlation

⇒ Algorithms do not require labelled data or for a user to specify a predefined target concept.

• An itemset is a collection of one or more items. Ex: {Bread, Milk}
• A k-itemset is an itemset that contains k items. Ex: 3-itemset {Bread, Milk, Diaper}

Association rules are if-then rules about the contents of baskets.
Given: a set of items I = {i1, i2, ..., im}
       a set of transactions D = {d1, d2, ..., dn}
An association rule: A → B


{i1, i2, ..., ik} → j means: "if a basket contains all of i1, ..., ik then it is likely to contain j."

When dealing with association rules, a simple question is: find sets of items that appear "frequently" in the baskets.
The support count for itemset I is the number of baskets containing all items in I. The support is the fraction of transactions that contain the itemset. Basically we are just counting how many times we see an itemset in the data. Given a support threshold s, frequent itemsets appear in at least s baskets.

support(A ⇒ C) = support({A} ∪ {C}) / |T|

The confidence of an association rule is the conditional probability of j given i1, ..., ik. This gives a measure of how accurate the rule is: given that I have observed A, what is the probability I also observe B?
confidence(A ⇒ B) = P(B | A) = sup({A, B}) / sup(A)
The confidence of the rule is the fraction of the baskets with all of I that also contain j.

Confidence alone can be useful, provided the support for the left side of the rule is fairly large. We usually want the confidence of the rule to be reasonably high, perhaps 50%, or else the rule has little practical effect.

Given an association rule I → j, Interest = Confidence(j | I) − Support(j).

• Interest = 0: I has no influence on j
• Interest > 0: I may cause the presence of j
• Interest < 0: I discourages the presence of j

If I has no influence on j, then we would expect that the fraction of baskets including I that contain j would be exactly the same as the fraction of all baskets that contain j. Such a rule has interest 0.
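A minimal sketch of these three measures on a handful of made-up baskets (the data is purely illustrative):

    baskets = [
        {"bread", "milk"},
        {"bread", "diaper", "beer"},
        {"milk", "diaper", "beer"},
        {"bread", "milk", "diaper", "beer"},
        {"bread", "milk", "coke"},
    ]

    def support(itemset):
        return sum(1 for b in baskets if itemset <= b) / len(baskets)

    A, B = {"diaper"}, {"beer"}
    confidence = support(A | B) / support(A)      # P(B | A)
    interest = confidence - support(B)            # how much A changes the chance of B
    print(support(A | B), confidence, interest)   # 0.6 1.0 0.4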

In data mining and association rule learning, lift is a measure of the performance of a targeting model at predicting or classifying cases as having an enhanced response (with respect to the population as a whole), measured against a random-choice targeting model. A targeting model is doing a good job if the response within the target is much better than the average for the population as a whole. Lift is simply the ratio of these values: target response divided by average response.
If some rule had a lift of 1, it would imply that the probability of occurrence of the antecedent and that of the consequent are independent of each other. When two events are independent of each other, no rule can be drawn involving those two events.


If the lift is > 1, it lets us know the degree to which those two occurrences are dependent on one another, and makes those rules potentially useful for predicting the consequent in future datasets.
Why could the lift measure be useful? By dividing confidence by support, we can "normalize" the data and compare the results.
The association rule mining task is simple.
Given: transaction data, support s, and confidence c.
Find: all association rules with support ≥ s and confidence ≥ c.
Task focus:

• General many-many mapping (association) between items and baskets
• Connections among "items", not "baskets"
• Focus on common events, not on rare events

There exist different types of associations

• Boolean associations

• Quantitative associations

Number of predicates captured

• Single attribute

• Multiple attributes

• Multi-relational

Single or multiple level

• Single level

• Multiple level


Association rules are often used in the retail sector.
• Baskets: sets of products someone bought in one trip to the store
• Items: products

Ex: given that many people buy beer and diapers together, run a sale on diapers and raise the price of beer. This is only useful if many people buy diapers and beer. What items should the store stock up on?

By finding frequent itemsets, a retailer can learn what is commonly bought together. Especially important are pairs or larger sets of items that occur much more frequently than would be expected were the items bought independently. Ex: diapers and beer. One would hardly expect these two items to be related, yet through data analysis one chain store discovered that people who buy diapers are unusually likely to buy beer.

Another application is plagiarism detection.

• Baskets = sentences
• Items = documents containing those sentences

Items that appear together too often could represent plagiarism. Notice that items do not have to be "in" baskets. Documents having a high count contain plagiarism.

Association rules are also often used in web page analytics.

• Baskets = web pages
• Items = words

Unusual words appearing together in a large number of documents, e.g., "Brad" and "Angelina", may indicate an interesting relationship.

2. Naïve Algorithm

Walmart sells 100,000 items and can store billions of baskets. The Web has billions of words and many billions of pages. We have access to lots and lots of data…
Naïve generate and test
Goal: find all association rules with support ≥ s and confidence ≥ c.
For each association rule X ⇒ Y, keep it if it meets the support and confidence thresholds.
However, the complexity is exponential in the number of items.

This naïve algorithm tends to be too slow!


Association rule mining goal
Insight: given the frequent itemsets,

• It is trivial to derive all association rules
• These will contain all rules with support s and confidence c

Hard part: finding the frequent itemsets.
Note: if {i1, i2, ..., ik} → j has high support and confidence, then both {i1, i2, ..., ik} and {i1, i2, ..., ik, j} will be "frequent".
Creating association rules
Given: support s, confidence c

• Step 1: Find all itemsets with support s
• Step 2: For each frequent itemset L, for each non-empty subset s of L, output the rule s → {L − s} if its confidence ≥ c.

Once I have the frequent itemsets, finding association rules is quite easy.

Typically, data is kept in flat files rather than in a database system. Data is stored on disk basket-by-basket. Use k nested loops to expand baskets into pairs, triples, etc. as you read the baskets. The true cost of mining disk-resident data is usually the number of disk I/Os.

• Read data in passes: all baskets are read in turn
• Cost: number of passes an algorithm takes

The bottleneck is main memory. For many frequent-itemset algorithms, main memory is the critical resource. As we read baskets, we need to count something, e.g., occurrences of pairs. The number of different things we can count is limited by main memory. Swapping counts in/out is a disaster (why?).
The hardest problem often turns out to be finding the frequent pairs: frequent pairs are common, frequent triples are rare. The probability of being frequent drops exponentially with size, while the number of sets grows more slowly with size. First, focus on pairs, then extend to larger sets.
Naïve algorithm
What is the naïve way to think about this? For each basket that I process, I generate all possible pairs and I process them.
Read the file once, counting in main memory the occurrences of each pair. From each basket of n items, generate its n(n−1)/2 pairs by two nested loops.
This fails if (#items)² exceeds main memory. Remember: #items can be 100K (Wal-Mart) or 10B (Web pages).
Two approaches:


1) Count all pairs, using a triangular matrix, covering all possible pairs. I can figure out the identifiers of the items.

   - Assign each item a number
   - Count {i, j} only if i < j
   - Keep pairs in the order: {1,2} ... {1,n}; {2,3} ... {n−1,n}
   - Pair {i, j} is at position: (i − 1)(n − i/2) + j − i

⇒ 4 bytes/pair to store all pairs

Assign each transaction a number and keep a count for each itemset.

We have a simple transactional database containing four transactions. Then I have my array where I generated all possible itemsets: I have to keep one entry for each possible pair. I scan my data, and for each transaction I generate all possible pairs.
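A minimal sketch of this counting scheme with the position formula above (items are numbered 1..n; the baskets are made up for illustration):

    n = 5                                    # number of items, toy value
    counts = [0] * (n * (n - 1) // 2)        # one count per unordered pair {i, j}

    def pair_index(i, j):
        """1-based items with i < j; 0-based index into the flat triangular array."""
        return int((i - 1) * (n - i / 2) + j - i) - 1

    for basket in [[1, 2, 3], [2, 3, 5], [1, 2, 5]]:
        items = sorted(set(basket))
        for a in range(len(items)):          # double loop generates every pair once
            for b in range(a + 1, len(items)):
                counts[pair_index(items[a], items[b])] += 1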

2) Table of triples [i, j, c]: the count of pair {i, j} is c. This method can be more memory-efficient than the previous approach. The total number of bytes used is about 12p, where p is the number of pairs that actually occur. This method requires us to store three integers, rather than one, for every pair that does appear in some basket. This approach beats the triangular matrix if at most 1/3 of the possible pairs actually occur. It requires extra space for the retrieval structure. The triples method does not require us to store anything if the count for a pair is zero.

So far, we focused on identifying frequent pairs. But there exist many more possible triples than pairs. How can we avoid generating all of them? The number of frequent itemsets of size two will be bigger than that of size three, but on the other hand, there are more possible triples than possible pairs.


3. Apriori

In practice, the most main memory is required for determining the frequent pairs. The number of items, while possibly very large, is rarely so large that we cannot count all the singleton sets in main memory at the same time.
In practice the support threshold is set high enough that it is only a rare set that is frequent. Monotonicity tells us that if there is a frequent triple, then there are three frequent pairs contained within it. Thus, we expect to find more frequent pairs than frequent triples, more frequent triples than frequent quadruples, and so on.
The A-Priori algorithm is designed to reduce the number of pairs that must be counted, at the expense of performing two passes over the data, rather than one. It is the job of the A-Priori algorithm and related algorithms to avoid counting many triples or larger sets. The main purpose here is a generate-and-test approach for discovering frequent itemsets. The Apriori algorithm is an iterative approach.

• Find all frequent itemsets of size k before finding frequent itemsets of size k+1

• One pass through the data for each frequent-itemset size.

If an itemset appears at least s times, so do all its subsets. The contrapositive for pairs states that if item i does not appear in s baskets, then no pair including i can appear in s baskets: it will never be frequent. As soon as I know that something is infrequent, there is no point in counting any itemset that is bigger.

The Apriori principle holds due to the following property of the support measure: ∀ X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

• Pass 1: Create two tables. The first table translates item names into integers from 1 to n. The other table is an array of counts. Read the baskets and count in main memory the occurrences of each item. It requires memory proportional to the number of items. Frequent items are those that appear ≥ s times. After the first pass we examine the counts in order to determine the frequent items. We then create a new numbering from 1 to m for just the frequent items. This table is an array indexed 1 to n, and the entry for i is either 0, if item i is not frequent, or a unique integer in the range 1 to m if item i is frequent.

• Pass 2: Read the baskets again and count in main memory only those pairs both of whose items were found in Pass 1 to be frequent. In a double loop, generate all pairs of frequent items in that basket. For each such pair, add one to its count in the data structure used to store counts. It requires memory proportional to the square of the number of frequent items, plus a list of the frequent items. Frequent itemsets are those that appear ≥ s times.

Verify whether each candidate satisfies the threshold.

⇒ If no frequent itemsets of a certain size are found, then monotonicity tells us there can be no larger frequent itemsets, so we can stop.

1) Join step: candidate set Ck+1 is generated by joining Lk with itself.

   Suppose the items in Lk are listed in an order. Join each element in Lk with itself. If l1, l2 ∈ Lk, they are joinable if:

   o The first k−1 items in l1 and l2 are the same
   o l1[1] = l2[1] AND l1[2] = l2[2] AND ... AND l1[k−1] = l2[k−1]

2) Prune step: any k-itemset that is not frequent cannot be a subset of a frequent (k+1)-itemset. For each candidate itemset in Ck+1, look at each subset of size k, i.e., drop one item from the candidate. If any of these subsets isn't frequent, discard this candidate.
   → Application of the Apriori principle
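A minimal sketch of the join and prune steps (a toy helper, not the course's implementation; itemsets are represented as frozensets):

    from itertools import combinations

    def apriori_gen(frequent_k):
        """frequent_k: set of frozensets, all of size k. Returns candidate (k+1)-itemsets."""
        k = len(next(iter(frequent_k)))
        sorted_sets = sorted(tuple(sorted(s)) for s in frequent_k)
        candidates = set()
        # Join step: combine two k-itemsets that share their first k-1 items.
        for a in sorted_sets:
            for b in sorted_sets:
                if a[:-1] == b[:-1] and a[-1] < b[-1]:
                    candidates.add(frozenset(a + (b[-1],)))
        # Prune step: drop candidates that have an infrequent k-subset.
        return {c for c in candidates
                if all(frozenset(sub) in frequent_k for sub in combinations(c, k))}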


The real trick is the pruning step. This is the key step.

The file of baskets is usually sufficiently large that it does not fit in main memory. Thus, a major cost of any algorithm is the time it takes to read the baskets from disk. Besides cost, another data-related issue is the use of main memory for itemset counting. All frequent-itemset algorithms require us to maintain many different counts as we make a pass through the data. We might for instance need to count the number of times that each pair of items occurs in baskets. We must be careful, as we cannot count anything that does not fit in main memory. Thus, each algorithm has a limit on how many items it can deal with.

4. PCY

In this section we consider the PCY algorithm, which takes advantage of the fact that in the first pass of A-Priori there is typically lots of main memory not needed for the counting of single items.
A related simple problem: check whether an object has been previously encountered. Filter strings: scan a larger file F of strings and output those that are in a set S (e.g., email addresses). Has someone visited the site yet? (A bit more complicated…)
⇒ Often, files will not fit in memory. This algorithm exploits the observation that there may be much unused space in main memory on the first pass. The PCY algorithm uses that space for an array of integers that generalizes the idea of a Bloom filter.


Think of the array as a hash table whose buckets hold integers rather than sets of keys or bits. Pairs of items are hashed to buckets of this hash table. As we examine a basket during the first pass, we not only add 1 to the count for each item in the basket, but we also generate all the pairs, using a double loop. We hash each pair and add 1 to the bucket into which that pair hashes.

At the end of the first pass, each bucket has a count, which is the sum of the counts of all the pairs that hash to that bucket. If the count of a bucket is at least as great as the support threshold s, it is called a frequent bucket.
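A minimal sketch of this first pass (the bucket count and hash choice are arbitrary toy values, not the course's):

    from collections import Counter

    NUM_BUCKETS = 1013          # toy value; in practice as large as spare memory allows

    def pcy_first_pass(baskets):
        item_counts = Counter()
        bucket_counts = [0] * NUM_BUCKETS
        for basket in baskets:
            items = sorted(set(basket))
            item_counts.update(items)           # normal A-Priori item counting
            for a in range(len(items)):         # double loop over all pairs in the basket
                for b in range(a + 1, len(items)):
                    bucket = hash((items[a], items[b])) % NUM_BUCKETS
                    bucket_counts[bucket] += 1  # count the bucket, not the pair itself
        return item_counts, bucket_counts

    # Pass 2 then counts a pair {i, j} only if both i and j are frequent items AND the
    # pair hashes to a frequent bucket (bucket count >= support threshold).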

• We can say nothing about the pairs that hash to a frequent bucket; they could all be frequent pairs, for all we know from the information available to us.

• If the count is less than s, we know no pair that hashes to this bucket can be frequent.

Next, define the set of candidate pairs for the next pass.

Analysis

• No false negatives: it catches all elements in F
• False positives are possible
• If b = n|S|, at most 1/n of the bit array is 1, so only 1/n of the elements not in S get through the filter
  o Note: it will be less than 1/n given collisions
  o Assuming hashing is uniform at random

During the first pass of the Apriori algorithm, most memory is idle.
Idea: use the free memory for a hash table

• Hash pairs of items that appear in a transaction: we need to generate these

• Keep just the count, not the pairs themselves
• We are interested in the presence of a pair AND whether it is present at least s (support) times.


Observations about buckets

• A bucket that a frequent pair hashes to meets the minimum support threshold. We cannot eliminate any member of this bucket.

• Even without any frequent pair, a bucket can be frequent. We cannot eliminate any member of this bucket.

• Best case: the count for a bucket is less than the minimum support. Eliminate all pairs hashed to this bucket, even if a pair consists of two frequent items.

⇒ An example is provided in the slides.

5. Limiting disk I/O

A-Priori, PCY, etc. take k passes to find frequent itemsets of size k. In other words, the previously discussed algorithms use one pass for each size of itemset we investigate. If the main memory is too small to hold the data and the space needed to count frequent itemsets of one size, there does not seem to be any way to avoid k passes to compute the exact collection of frequent itemsets.

• Good: clever pruning strategy
• Bad: still requires lots of passes over the data

In many applications it is not essential to discover every frequent itemset; it is quite sufficient to find most but not all of them. Making passes over the data is the limiting factor. Other techniques use 2 or fewer passes for all sizes:

• Simple algorithm
• SON (Savasere, Omiecinski, and Navathe)
• Toivonen


⇒ The question is: can we find all frequent itemsets in 2 passes or less?

5.1 Simple Algorithm: Sample

Idea: instead of using the entire file of baskets, we could pick a random subset of the baskets and pretend it is the entire dataset.
Procedure: I have a database on disk and some main memory, split into two parts. We randomly sample baskets and make sure these end up in memory. We then run Apriori on the in-memory sample. So we need to think about the support threshold, which should scale with the size of the sample: scale back the support count, e.g., use s/100 if the sample is 1/100 of the original DB size.
The safest way to pick the sample is to read the entire dataset and, for each basket, select that basket for the sample with some fixed probability p. In case our baskets already appear in random order in the file, we do not even have to read the entire file. Once we have selected our sample of the baskets, we use part of main memory to store these baskets. The balance of the main memory is used to execute one of the algorithms we have discussed, such as A-Priori, PCY, … The algorithm must run passes over the main-memory sample for each itemset size, until we find a size with no frequent items.
This misses sets that are frequent in the whole but not in the sample.

• Can use an even smaller support
• E.g., s/125 instead of s/100

Optional: verify guesses on the entire dataset.

This algorithm requires at most two passes: one pass to construct the sample and then one pass if you want to verify guesses. This can be very fast; it is also very simple. Of course, the algorithm will fail if whichever method we use cannot be run in the amount of main memory left after storing the sample.

⇒ Be careful: it cannot be relied upon either to produce all the itemsets that are frequent in the whole dataset, nor will it produce only itemsets that are frequent in the whole; an itemset that is frequent in the whole but not in the sample is a false negative, while an itemset that is frequent in the sample but not in the whole is a false positive.

⇒ Retain as frequent only those itemsets that were frequent in the sample and also frequent in the whole. This improvement will eliminate all false positives, but a false negative is never counted and therefore remains undiscovered.

5.2 The SON Algorithm
⇒ Savasere, Omiecinski, and Navathe (SON)

This improvement avoids both false negatives and false positives, at the cost of making two full passes.
Idea: divide the input file into chunks, treat each chunk as a sample, and run the simple, randomized algorithm on that chunk. Use ps as the threshold, if each chunk is fraction p of the whole file and s is the support threshold. Store on disk all the frequent itemsets found for each chunk. Once all chunks have been processed, take the union of all the itemsets that have been found frequent for one or more chunks. These are the candidate itemsets.
Monotonicity: every itemset that is frequent in the whole is frequent in at least one chunk, so we can be sure that all the truly frequent itemsets are among the candidates; i.e., there are no false negatives.

• 1st pass: read each chunk and process it
• 2nd pass: count all the candidate itemsets and select those that have support at least s as the frequent itemsets

5.3 Toivonen's Algorithm

Again, it uses the simple algorithm, but lowers the threshold for the sample. Ex: if the sample is 1% of the baskets, use s/125 instead of s/100. Goal: avoid missing truly frequent itemsets.
Toivonen's algorithm begins by selecting a small sample of the input dataset and finding from it the candidate frequent itemsets. It is essential that the threshold is set to something less than its proportional value. The smaller we make the threshold, the more main memory we need for computing all itemsets that are frequent in the sample, but the more likely we are to avoid the situation where the algorithm fails to provide an answer.
We then construct the negative border. This is the collection of itemsets that are not frequent in the sample, but all of whose immediate subsets are frequent in the sample.
Example: ABCD is in the negative border if and only if:

• It is not frequent in the sample, and
• ABC, ABD, ACD, BCD are all frequent in the sample.

A singleton A is in the negative border if and only if it is not frequent in the sample (the empty set is frequent unless the number of baskets is smaller than the support threshold).

To complete the algorithm, we make a pass through the entire dataset, counting all the itemsets that are frequent in the sample or are in the negative border. There are two possible outcomes:

• No member of the negative border is frequent in the whole dataset. In this case, the correct set of frequent itemsets is exactly those itemsets from the sample that were found to be frequent in the whole.

• Some member of the negative border is frequent in the whole. Then we cannot be sure that there are not some even larger sets, in neither the negative border nor the collection of frequent itemsets for the sample, that are also frequent in the whole. So, we can give no answer at this time and must repeat the algorithm with a new random sample.

Theorem: if there is an itemset that is frequent in the whole, but not frequent in the sample, then there is a member of the negative border for the sample that is frequent in the whole.
Proof: by contradiction. Suppose not; i.e.,

1) There is an itemset S frequent in the whole but not frequent in the sample, and

2) Nothing in the negative border is frequent in the whole.


Let T be a smallest subset of S that is not frequent in the sample. T is frequent in the whole (S is frequent + monotonicity). T is in the negative border, because otherwise it would not be the smallest: any removal from T results in something that is frequent in the sample.
⇒ Why are these variants interesting? They employ standard tools to improve efficiency: hashing and sampling. Hashing is a good technique, used in many algorithms to make things go fast. These are standard techniques that we will often encounter. They can be used in many other contexts, so it is good to know their strengths and weaknesses. Always think about how you can improve an algorithm!

6. FP-Growth

Quick reminder, where are we so far?
• A-Priori
  o Guaranteed to be complete
  o Slow due to repeated DB scans to generate candidate itemsets
  Although the Apriori heuristic achieves good performance gains by reducing the size of the candidate sets, it may suffer in situations with a large number of frequent patterns, long patterns, or quite low minimum support thresholds.

• Toivonen's Algorithm
  o Only requires two DB scans
  o Not guaranteed to be complete

⇒ Can we get fast and complete?

There is an approach called the frequent-pattern tree (FP-tree) that tries to do this. It mainly consists of two big ideas. It is an extended prefix-tree structure storing crucial, quantitative information about frequent patterns. Everything we have seen so far is based on the A-Priori idea: count everything of size 1 before moving to size 2, and so on. Studies have shown that FP-growth is about an order of magnitude faster than Apriori, especially when the dataset is dense and/or when the frequent patterns are long.

• Data compression: exploit redundancy to compactly but completely represent frequent items in the DB. Compress the database to something smaller that fits in main memory.

• Avoid repeated DB scans.


A compact data structure can be designed based on the following observations:

• Since only the frequent items will play a role in the frequent-pattern mining, it is necessary to perform one scan of the transaction database DB to identify the set of frequent items.

• If the set of frequent items of each transaction can be stored in some compact structure, it may be possible to avoid repeatedly scanning the original transaction database.

• If multiple transactions share a set of frequent items, it may be possible to merge the shared sets with the number of occurrences registered as a count. It is easy to check whether two sets are identical if the frequent items in all of the transactions are listed according to a fixed order.

• If two transactions share a common prefix, according to some sorted order of frequent items, the shared parts can be merged using one prefix structure as long as the count is registered properly. If the frequent items are stored in frequency-descending order, there are better chances that more prefix strings can be shared.

The FP-tree construction takes exactly two scans of the transaction database: the first scan collects the set of frequent items, and the second constructs the FP-tree.

To discover itemsets in the FP-tree, partition the data and count within each partition: aka divide-and-conquer. It avoids candidate generation.
Looking at the five transactions, we observe that the first one and the second one are exactly the same transaction. The transaction is repeated, so we could just store it once with a count. However, it is very unlikely to find many exact repeats. The question is, how to do better than this? When we look at the data, there are still repeated structures: we can see things like f appearing in all transactions. Question: how could we exploit this structure to compress? The goal is to compress the data very quickly and cheaply, ideally in one or two passes.

Instead of storing f many times, we store it once and keep track of the counts. Each node in the tree is just an item with a count. We can reconstruct the pattern by going down the tree. Question: is there a better way to order the data?
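A minimal sketch of the two-scan FP-tree construction described above (a toy Node class; a full FP-growth implementation would also maintain header-table links for the mining step):

    from collections import Counter

    class Node:
        def __init__(self, item):
            self.item, self.count, self.children = item, 0, {}

    def build_fp_tree(transactions, min_support):
        # Scan 1: find the frequent items and a frequency-descending order.
        freq = Counter(i for t in transactions for i in set(t))
        rank = {i: r for r, (i, c) in enumerate(freq.most_common()) if c >= min_support}
        # Scan 2: insert each transaction, keeping only frequent items in that order;
        # shared prefixes are merged and their counts incremented.
        root = Node(None)
        for t in transactions:
            node = root
            for item in sorted((i for i in set(t) if i in rank), key=rank.get):
                node = node.children.setdefault(item, Node(item))
                node.count += 1
        return root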


General idea: divide-and-conquer to recursively grow frequent pattern paths using the FP-tree.
Method:

• For each item, construct its conditional pattern base and then its conditional FP-tree

• Repeat the process on each newly created conditional FP-tree until
  o The resulting FP-tree is empty, or
  o It contains only one path

⇒ A detailed example is given in the slides.

⇒ Why is this a beneficial technique? It is a very compact representation of the database: we just increase the counts instead of keeping the itemsets several times in the database.

Why is FP-Growth fast?

• No candidate generation, no candidate test
• Uses a compact data structure
• Eliminates repeated database scans
• The basic operations are counting and FP-tree building

7. Incorporating constraints into mining

It is not realistic to find all patterns in a database autonomously.

• Too many patterns
• Patterns too unfocused


Data mining is interactive: users should provide advice and direction to algorithms. Usually we have a hypothesis we want to evaluate by looking at the data.
Constraint-based mining

• The user provides constraints on what to mine
• The system exploits the constraints for efficiency

Types of constraints

• Interest constraint. Ex: the rule meets minimum support, confidence, interest thresholds

• Data constraint. Ex: products sold together in Leuven, in March 2017

• Rule or pattern constraint.
  o Content: small sale (sum < 10) triggers large sale (sum > 200)
  o Form: meta-rule. P(x, y) ∧ Q(x, w) → takes(x, database systems)

Given constraints, a mining algorithm should be
• Sound: it only finds itemsets that satisfy the constraints
• Complete: it finds all itemsets that satisfy the constraints

A naïve solution finds all frequent itemsets, then tests them for constraint satisfaction. This is inefficient. The efficient approach is to analyse the constraint and push it into the mining process; then things will be much more efficient.
Anti-monotonicity: when an itemset S violates the constraint, so does any of its supersets.

Succinctness: if a constraint is succinct, you can enumerate all, and only, the itemsets that satisfy it.

Idea: without looking at the data, we can determine the itemsets that satisfy the constraint. Ex: min(S.Price) ≤ v is succinct.

8. Presenting results, other metrics

• Maximal itemsets: no immediate superset is frequent


• Closed itemsets: no immediate superset has the same count (> 0)

  o Stores what is frequent and the exact counts
  o Lossless encoding of the frequent itemsets

⇒ Why do we need these notions of maximal and closed itemsets?

• They are a way to compress the search space. What is the intuition if we only keep the closed itemsets? Or if we only keep the maximal ones, in which case do we lose information?

  o If we only keep the closed itemsets, we can regenerate all the frequent itemsets with their counts.

Mining maximal itemsets: MaxMiner

We want to try something that is as big as possible and frequent. Always add sub-branches.
Pruning techniques

• Local pruning (e.g., at node A)
  o If h(N) ∪ t(N) is frequent, prune the whole sub-tree (e.g., if ABCD is frequent, don't explore further)
  o If h(N) ∪ {i}, for some i ∈ t(N), is NOT frequent, remove i from the tail before expanding (e.g., if AC is not frequent, remove C)
• Global pruning: when a max pattern is found (e.g., ABCD), prune all nodes (e.g., B, C, D) where h(N) ∪ t(N) is a subset (e.g., a subset of ABCD)

⇒ An example is given in the slides


We have looked at two popular objective measures: support and confidence. There are other things we can look at, which are more subjective measures of interestingness. A rule (pattern) is interesting if it is

• Unexpected (surprising to the user)
• Actionable (the user can do something with it)

To sum up…
Characterization of algorithms

• Search techniques:
  o Breadth-first: Apriori and its variants
  o Depth-first: FP-Growth
  o Lookahead

• Pruning techniques
  o Subset infrequency
  o Hashing
  o Superset frequency

• Limiting disk accesses is the primary goal

Presenting rules
• Which itemsets
  o All itemsets
  o Closed itemsets
  o Maximal itemsets

• What metrics
  o Support, confidence, interest
  o Lift
  o Correlation

⇒ What is the difference between standard rule induction and association rules?

  o Rule induction: there is a fixed target; we are essentially building a classifier.

  o Association rules: no predefined target.

Association rules are an efficient way to mine interesting information in very large databases: we get probabilities. It does not require (but can exploit) user guidance for interesting patterns.


The A-Priori algorithm and its extensions allow the user to gather a good deal of information without too many passes through the data.

Food for thought: pattern explosion is a well-known problem in frequent itemset mining. High support thresholds typically result only in a few well-known patterns; but for low support thresholds, the number of frequent itemsets can easily be orders of magnitude larger than the number of transactions. Knowledge discovery in such humongous itemset collections is virtually impossible. What are the causes of pattern explosion? Can you think of a way to solve or at least alleviate this issue?

• The problem lies with the choice of support, because there is a trade-off.
  o Too low: lots and lots of itemsets
  o Too high: few itemsets which are well-known

• If there are statistical dependencies between some items, then you see them appearing everywhere together, which might not add additional information.
How can we deal with it? We could for instance use constraints. If we have some background information, we can exclude some itemsets.


CHAPTER 7: MINING SEQUENCES

• Understand how sequential pattern mining relates to association rule mining
• Introduce the basic terminology and concepts of sequential pattern mining
• A brief introduction to the different algorithms for sequential pattern discovery

1. Motivation

Idea: sequential pattern mining discovers frequent subsequences as patterns in a sequence database. It is an important data mining problem with broad applications.

• Association rule mining: find relationships between items that co-occur simultaneously. In association rule mining we were looking at associations and connections between items in a market basket, for instance.

• Sequential pattern analysis considers time (the order transactions occur in). In sequential pattern mining we take the order into account; the order matters here, and it can often be important in decision making.
Many different application domains are characterized by sequential data. Often the time or order of actions/events is relevant in decision making, and it is important to be able to capture those types of dependencies.
The sequential pattern mining task is quite similar to that of association rule mining.
Given: a sequence database and a minimum support threshold.
Find: all (maximal) frequent subsequences

2. Background and definitions


We previously worked on a transaction database, assuming a transaction ID. Here, we also have a timestamp and a customer identifier, so we need to build a sequence database based on the transactions of each customer.

Transaction Database

Sequence Database

The customer-sequence database is a time-ordered list of a customer's transactions. For instance, customer 1 bought a PC before paper. Customer 4 bought a PC before a printer and ink.

We can group all transactions by user.

A sequence is just an ordered list of itemsets. Items within the transactions remain unordered.

The length of a sequence is the number of itemsets in the sequence, i.e. the number of transactions in it.

A sequence S1 = (a1, a2, ..., an) is contained in another sequence S2 = (b1, b2, ..., bm) IF

• there exist integers i1 < i2 < ... < in such that
• a1 ⊆ bi1 AND a2 ⊆ bi2 AND ... AND an ⊆ bin

I can find transactions in S2 such that S1 occurs in them; we have to find a way to map the transactions of S1 onto transactions of S2.
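A minimal sketch of this containment test (sequences are represented here as lists of Python sets; a greedy left-to-right scan is enough):

    def contains(s2, s1):
        """True if sequence s1 is contained in sequence s2 (lists of itemsets)."""
        pos = 0
        for a in s1:
            while pos < len(s2) and not a <= s2[pos]:   # find an itemset of s2 covering a
                pos += 1
            if pos == len(s2):
                return False
            pos += 1        # the next itemset of s1 must match a strictly later one
        return True

    # Example: <(PC)(Ink)> is contained in <(PC)(Printer, Ink)(Paper)>
    print(contains([{"PC"}, {"Printer", "Ink"}, {"Paper"}], [{"PC"}, {"Ink"}]))  # True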


An item can occur at most once in an element of a sequence, but can occur multiple times in different elements of a sequence.

• A customer supports a sequence s if s is contained in the customer's sequence. The support is the fraction of customers who support the sequence.

• A sequence is frequent if it meets a minimum support threshold.

• A sequence s is maximal if it is not contained in any other sequence.

Ex:

Rmq: <(Printer)(Ink)> means that the printer and the ink are bought in two separate transactions. Notice that customer 4 buys both the printer and the ink, but in the same transaction, i.e. <(Printer, Ink)>, which is different from <(Printer)(Ink)>.

Remember the Apriori property: if a sequence is not frequent, then none of its super-sequences are frequent.

Sequential pattern mining faces some challenges. Databases contain a huge number of possible sequential patterns, because we are now also considering the order; hence, generating candidates tends to be quite challenging. Algorithms should find all patterns (if possible) that meet a minimum support threshold. They should also be efficient and scalable and incorporate user-provided constraints.


⇒ The big challenge resides in generating candidates.

There are 2 possible ways to extend a candidate:
• Extend the sequence to include another itemset (i.e., a separate transaction). I have a sequence of a certain length, for instance length 2, and I add a new transaction, leading to a sequence of length 3.

• Extend the number of items included in one itemset (i.e., model co-occurring items).

Assume that we have 6 frequent length-1 sequences. First: extend the length of the sequence. This yields 6 x 6 = 36 candidates.

Second: extend the size of an itemset. This yields 6 x 5 / 2 = 15 candidates. So instead of only 15 itemset candidates we have 15 + 36 candidates in total.

There are a number of applications that people look at.

§ Think about retail shopping. First, buy a computer; second, buy a printer within the next 3 months.

§ Other applications are in the medical domain. First distinguish symptoms, then diseases, and third treatments.

§ Financial markets: commonly occurring transaction orders.
§ Web data. Ex: clickstreams. What order do people visit pages in?
§ Telephone calling patterns: what order do people call others in?
§ DNA sequences: what structure is commonly repeated?

Quick check…

What is the support of:


§ (PC). Support is 2. We only care about the number of customers who bought a PC, here customers 3 and 4; we don't count the number of times a PC was bought.

§ (Ink, Paper). Support is 3.
§ (PC)(Ink). Support is 2.

Given: 10 things of size 1 are found to be frequent.
How many itemsets must be evaluated on the next search iteration? 45 (= 10 x 9 / 2)
How many sequences must be evaluated on the next search iteration? 45 (itemsets) + 100 = 145 (= 10 x 9 / 2 + 10²)

3. FreeSpan

⇒ Can we develop a method which may absorb the spirit of Apriori but avoid or substantially reduce the expensive candidate generation and test?

FreeSpan is basically the first thing we think of in sequential mining. The idea is to adapt FP-Growth to the sequential case.
Idea: use frequent items to recursively project sequence databases into a set of smaller projected databases and grow subsequence fragments in each projected database.
Overview: divide-and-conquer approach

§ Step 1: Count sequences to find the f-list
§ Step 2: Recurse

• Project the sequence database into a set of smaller databases
• Mine frequent patterns in each projected database.

The big challenge here is that order matters!
Step 1: Find length-1 sequential patterns and list them in support-descending order

⇒ f_list = a:4, b:4, c:4, d:3, e:3, f:3; g:1

Rmq: we don't count how often a pattern appears in the transactions; we count the number of customer sequences in which the length-1 sequence appears.

Here, we can prune g, as it only has a support of 1; hence, we don't care about g. We will use the other frequent sequences to partition the search space.
Step 2: Divide the search space. The complete set of sequence patterns can be partitioned into 6 disjoint subsets (move down the f_list).


The subsets of sequential patterns can be mined by constructing projected databases.

§ Finding sequential patterns containing only item a. By scanning the sequence database once, the only two sequential patterns containing only item a are (a) and (aa).

§ Finding sequential patterns containing item b but no item after b in the f_list. This can be achieved by constructing the (b)-projected database.

§ Finding the other subsets of sequential patterns. The other subsets can be found similarly, by constructing the corresponding projected databases and mining them recursively. These databases are constructed simultaneously during one scan of the original sequence database.

Let's look at the <b>-projected database.

Pros:
• Projections replace candidate generation

Cons:
• May project at any point in a sequence
• Projected sequences may not be shorter

In this case I didn't save that much work. My sequences are not that much shorter than the previous sequences; I just managed to remove one customer ID.

4. PrefixSpan


Idea: instead of projecting sequence databases by considering all the possible occurrences of frequent subsequences, the projection is based only on frequent prefixes, because any frequent subsequence can always be found by growing a frequent prefix.
PrefixSpan mines the complete set of patterns but greatly reduces the effort of candidate subsequence generation. Moreover, it reduces the size of projected databases and leads to efficient processing.
PrefixSpan outperforms both the A-Priori algorithm and the previously seen method, FreeSpan.
Pros:

o Maintains the projection idea, avoids candidate generation. We want to guarantee we do less work.

o Projection based on the sequence prefix: ensures that sequences get progressively shorter.

Definition: Prefix
Sequence β = <b1 b2 ... bm> is a prefix of sequence α = <a1 a2 ... an>, where m ≤ n, IFF

• bi = ai for i ≤ m−1
• bm ⊆ am
• All items in (am − bm) are alphabetically after those in bm

Definition: Projection

A subsequence α' of sequence α is a projection of α w.r.t. prefix β IFF
• β is a subsequence of α
• α' has prefix β
• There is no proper super-sequence α'' of α' such that α'' is a subsequence of α and has prefix β

Similarly, what we can do is look at the suffix: split the projection into two parts, the prefix (data) and the suffix.

• Let α' = <a1 a2 ... an> be the projection of α w.r.t. prefix β = <a1 a2 ... am−1 a'm> (m ≤ n)
• Sequence γ = <a''m am+1 ... an> is called the suffix of α w.r.t. prefix β, denoted as γ = α / β, where a''m = (am − a'm)


• Wealsodenoteα’=β⋅γ

ProjectedDB • Given,prefixβandsuffixγ• WecallγtheβprojectedDB

Whenmyprefixgetslongerandlonger,mysuffixgetssmallerandsmaller.

Thefirstoccurrenceofeoccursinatransactionwithtwoitems,sowedon’twanttolosethefactthatthereissomestructureandthateco-occurs.So,weaddan“_”,toindicatethatfshouldoccurwithanotheritem.

ThePrefixSpanAlgorithmperformsadivide-and-conquersearchtofindallsequentialpatterns.

• Step 1: Find all frequent length-1 sequences
• Step 2: Divide the search space based on length-1 prefixes and recurse:

PrefixSpan(α, l, S|α)

1. Scan S|α to find frequent items b such that:
   a. b can be assembled to the last element of α to form a sequential pattern; or
   b. <b> can be appended to α to form a sequential pattern
2. For each frequent item b, append it to α to form a sequential pattern α', and output α';
3. For each α', construct the α'-projected database S|α', and call PrefixSpan(α', l+1, S|α').

A detailed example is provided in the slides.
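To make the recursion concrete, below is a minimal Python sketch of the PrefixSpan idea, under the simplifying assumption that every element of a sequence is a single item (so only rule b of step 1 applies and a pattern grows by appending one item); the function name and the toy database are illustrative, not from the course.

```python
from collections import defaultdict

def prefixspan(db, min_sup):
    """Mine frequent sequential patterns from db (a list of item sequences,
    e.g. [['a','b','c'], ['a','c']]) with support >= min_sup."""
    patterns = []

    def mine(prefix, projected):
        # Count, for each item b, in how many projected sequences it occurs.
        support = defaultdict(int)
        for seq in projected:
            for item in set(seq):
                support[item] += 1
        for item, sup in support.items():
            if sup < min_sup:
                continue
            new_prefix = prefix + [item]
            patterns.append((new_prefix, sup))
            # Build the new_prefix-projected database: the suffix after the
            # first occurrence of `item` in each sequence.
            new_projected = []
            for seq in projected:
                if item in seq:
                    suffix = seq[seq.index(item) + 1:]
                    if suffix:
                        new_projected.append(suffix)
            mine(new_prefix, new_projected)

    mine([], db)
    return patterns

db = [['a', 'b', 'c'], ['a', 'c', 'b'], ['a', 'b'], ['b', 'c']]
print(prefixspan(db, min_sup=2))
```

Each recursive call works only on the (ever smaller) projected database of its prefix, which is exactly why no candidate generation is needed.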


Projected DBs allow us to avoid generating candidate sequences.

• No candidate sequence needs to be generated by PrefixSpan. It only grows longer sequential patterns from the shorter frequent ones. It does not generate nor test any candidate sequence that does not exist in a projected database.
• The projected databases keep shrinking: a projected database is smaller than the original one because only the postfix subsequences of a frequent prefix are projected into a projected database.
• The major cost of PrefixSpan is the construction of projected databases.

The costliest operation is building projected DBs. Every time I perform a new recursive step, I have to build a new database. If the number and/or size of the projected databases can be reduced, the performance of sequential pattern mining can be improved substantially. There are two ways to improve computational efficiency.

o Bi-level projections to reduce the number and sizes of projected DBs.

The first scan identifies frequent length-1 sequences. The second scan constructs a triangular matrix instead of projected databases and counts:

• Support(a, a) for each item
• For each pair (a, b) of frequent items count:
  o Support(<ab>)
  o Support(<ba>)
  o Support(<(ab)>)
• On the next level, construct projected DBs. We will look at very small parts of the data; look at the things we think we need.

For each frequent length-2 sequential pattern construct a projected DB. Then on the next scan build the matrix.


ð For more information, cf. the hand-written notes in the slides.

o Pseudo-projection reduces cost if the projected database fits in memory. We want to avoid copying. Insight: postfixes of sequences often appear repeatedly in recursive projected databases. Pseudo-projection avoids repeatedly physically copying the postfix.

• Efficient if the projected DB fits in memory
• Costly if the projected DB is disk-resident

Idea: use pointers to avoid copying. With each projection store:

• A pointer to the base sequence. It points to where I want to be in the sequence.
• An offset to where the postfix starts

Instead of building the data, I just maintain a pointer.

To conclude, many important domains are characterized by the presence of sequential data. Sequential pattern mining helps capture time dependencies between events. It is similar to frequent itemset mining but considers the order of elements.

ð Common algorithms are inspired by itemset mining.


CHAPTER 8: CLUSTERING

1. Unsupervised learning, clustering intro

Clustering is the process of examining a collection of "points" and grouping the points into "clusters" according to some distance measure. Goal: find groups of items. Produce clusters such that things that occur in the same group are highly similar, or are close together in distance. A good clustering method will produce clusters with

• High intra-class similarity
• Low inter-class similarity

When you are faced with the clustering task, many things can make sense. Ex: divide a set into groups. You can do the grouping based on gender, for instance. Clustering is a subjective notion; hence, it is difficult to give a precise definition of clustering quality. It is application-dependent and ultimately subjective. Thinking about how many clusters to use can also be a tricky question. How many clusters do I want?

Remember… Supervised learning:

• Data are a set of pairs <x, y>, where y = f(x)
• Goal: approximate f

Unsupervised learning: the data is just x! Goal: find structure in the data. Challenge: ground truth is often missing (no clear error function, like in supervised learning). One way to think about clustering is that it is an unsupervised learning technique. In unsupervised learning, you don't have a label; you just have attributes. The goal is to find structure in the data, based on the attributes.


In supervised learning we know the label and we want to optimize a certain metric: ROC, AUC, … Unsupervised learning tends to be more abstract. Why is clustering useful? It can be interesting to look at. You can use it to compress the data.

• Groups of data are often useful
• Compress or reduce the data (e.g., establish prototypes)
• Identify outliers (or novel examples)
• Pre-processing for supervised learning (e.g., feature construction)
• Visualization of the data

Given: D = {x1, x2, ..., xN}, where each xi is a d-dimensional feature vector (xi,1, xi,2, ..., xi,d)
Do: divide the N vectors into K groups such that the grouping is "optimal"

There are 3 key issues:

• Measure of distances: distance between examples, between clusters, between an example and a cluster.
• How will we represent clusters?
• Assigning items to clusters

Clustering methods use a distance (similarity) measure to assess the distance between

• A pair of instances
• A cluster and an instance
• A pair of clusters

Remember that the requirements for a function on pairs of points to be a distance measure are that

• Distances are always nonnegative, and only the distance between a point and itself is 0.
• Distance is symmetric; it doesn't matter in which order you consider the points when computing their distance.
• Distance measures obey the triangle inequality: the distance from x to y to z is never less than the distance going from x to z directly.

Given a distance value, we can convert it into a similarity value: sim(i, j) = 1 / [1 + dist(i, j)]. It is not always straightforward to go the other way. We'll describe our algorithms in terms of distances.

Edit distance: the number of required edits to transform one string into another string. edit_distance(mining, smiling) = 2
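As an aside, the edit distance itself is easy to compute with dynamic programming; the sketch below is a standard Levenshtein implementation (not part of the course material) that reproduces the mining/smiling example.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    n, m = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # delete a[i-1]
                           dp[i][j - 1] + 1,          # insert b[j-1]
                           dp[i - 1][j - 1] + cost)   # substitute (or match)
    return dp[n][m]

print(edit_distance("mining", "smiling"))  # 2
```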


We can divide clustering algorithms into two groups. There exist several types of cluster structures: hierarchical structures and flat structures.

• Hierarchical arrangement: create a hierarchy between objects.
• Flat arrangement: directly partition the data into different groups.

Regarding the cluster assignment, we distinguish between different techniques:

• Hard clustering: each item occurs in exactly one cluster
• Soft clustering: each item has a probability of belonging to a certain cluster
• Disjunctive vs. overlapping clusters: items may occur in multiple clusters

Clustering approaches

• Hierarchical: create a hierarchical decomposition of the set of objects using some criterion
• Partitioning: construct various partitions and then evaluate them by some criterion
• Model-based: hypothesize a model for each cluster and find the best fit of models to data
• Density-based: guided by connectivity and density functions

2. Hierarchical clustering

2.1 The algorithm

This algorithm starts with each point in its own cluster. Clusters are combined based on their "closeness". Combination stops when further combination leads to clusters that are undesirable for one of several reasons. Ex: we may stop when we have a predetermined number of clusters, or we may use a measure of compactness for clusters, and refuse to construct a cluster by combining two smaller clusters if the resulting cluster has points that are spread out over too large a region. Hierarchical clustering can be done top-down (divisive) or bottom-up (agglomerative).

• Top-down: subdivide objects. Partition the items.
• Bottom-up: every object starts in an individual cluster; cluster them all together into one aggregated cluster. Given a set of instances, I create one cluster for each instance. I want to find the nearest two clusters.

In either case, we maintain a matrix of distance (or similarity) scores for all pairs of

• Instances
• Clusters (formed so far)
• Instances and clusters


Dendrograms are used to represent hierarchical clustering. We have distance on the x-axis. Leaves represent the instances. The bar height indicates the degree of difference within the cluster.

We have to decide in advance on:

• How will clusters be represented?
• How will we choose which two clusters to merge?
• When will we stop combining clusters? There are several approaches we might use to stop the clustering process.
  o We could be told, or have a belief, about how many clusters there are in the data.
  o We could stop combining when at some point the best combination of existing clusters produces a cluster that is inadequate.
  o Continue clustering until there is only one cluster. However, it is meaningless to return a single cluster consisting of all the points.

2.2 Distance measures

The distance between two clusters can be determined in several ways:

• Single link: distance of the two most similar instances: dist(cu, cv) = min{dist(a, b) | a ∈ cu, b ∈ cv}. Take the minimum distance. Issue: chain phenomenon.


• Complete link: distance of the two least similar instances: dist(cu, cv) = max{dist(a, b) | a ∈ cu, b ∈ cv}. Take the maximum distance between two clusters. Usually when you do clustering you apply complete linkage.

• Average link: average distance between instances: dist(cu, cv) = avg{dist(a, b) | a ∈ cu, b ∈ cv}

If we merge cu and cv into cj, we can determine the distance to each other cluster ck:
• Single link: dist(cj, ck) = min{dist(cu, ck), dist(cv, ck)}
• Complete link: dist(cj, ck) = max{dist(cu, ck), dist(cv, ck)}
• Average link: dist(cj, ck) = (|cu| * dist(cu, ck) + |cv| * dist(cv, ck)) / (|cu| + |cv|)

Single link and chaining: see the figure in the slides.
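To illustrate how the linkage choice plugs into the algorithm, here is a small, deliberately naive Python sketch of bottom-up clustering on 2-D points; the O(n³) loop, the function name and the example points are illustrative only.

```python
import itertools

def agglomerative(points, k, linkage="complete"):
    """Naive bottom-up clustering of 2-D points into k clusters.
    linkage: 'single' (min), 'complete' (max) or 'average'."""
    def d(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    def cluster_dist(c1, c2):
        dists = [d(a, b) for a in c1 for b in c2]
        if linkage == "single":
            return min(dists)
        if linkage == "complete":
            return max(dists)
        return sum(dists) / len(dists)        # average link

    clusters = [[p] for p in points]          # start: one cluster per point
    while len(clusters) > k:
        # find the closest pair of clusters and merge them
        i, j = min(itertools.combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (25, 0)]
print(agglomerative(pts, k=3))
```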


Bottom line (single link):

• Simple, fast
• Often low quality

Complete link:

• Worst case O(n³)
• Fast algorithm: requires O(n²) space
• No chaining
• Bottom line: typically much faster than O(n³), often good quality

2.3 Evaluation of the algorithm

Weaknesses of agglomerative clustering methods:

• The basic algorithm is not very efficient. At each step, we must compute the distances between each pair of clusters, in order to find the best merger.
• They do not scale well: time complexity of at least O(n²), where n is the total number of objects
• They can never undo what was done previously
ð Integration of hierarchical with distance-based clustering:
• BIRCH: uses a CF-tree and incrementally adjusts the quality of sub-clusters
• CURE: selects well-scattered points from the cluster and then shrinks them towards the centre of the cluster by a specified fraction

BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies (Zhang, Ramakrishnan & Livny, 1996). It incrementally constructs a CF (Clustering Feature) tree. Parameters: max diameter, max children.

§ Phase 1: scan the DB to build an initial in-memory CF tree (each node: # points, sum, sum of squares)
§ Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree

It scales linearly: it finds a good clustering with a single scan. Weaknesses: handles only numeric data, sensitive to the order of the data records.


2.4 Cluster Feature Vector

Given: X1, ..., Xn, data points in a cluster, each with d dimensions. The cluster feature summarises them as (N, LS, SS): the number of points, the linear sum of the points, and the sum of squares of the points.

Note: CFs are additive! E.g., CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)

2d + 1 values represent any number of points (d = number of dimensions). The average in each dimension can be calculated as SUMi / N. The variance in dimension i can be computed as (SUMSQi / N) – (SUMi / N)². To get the standard deviation, take the square root. We can also compute the radius.
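A minimal sketch of such a cluster feature, assuming numpy (the class name is illustrative): it keeps only the 2d+1 numbers (N, LS, SS) yet can return the centroid and per-dimension variance, and merging two CFs is just adding their entries.

```python
import numpy as np

class ClusterFeature:
    """BIRCH-style cluster feature (N, LS, SS) for d-dimensional points."""
    def __init__(self, d):
        self.n = 0
        self.ls = np.zeros(d)   # per-dimension linear sum
        self.ss = np.zeros(d)   # per-dimension sum of squares

    def add(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.ls += x
        self.ss += x ** 2

    def merge(self, other):
        # CFs are additive: merging two sub-clusters is just adding the entries.
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n                              # SUM_i / N

    def variance(self):
        return self.ss / self.n - (self.ls / self.n) ** 2    # SUMSQ_i/N - (SUM_i/N)^2

cf = ClusterFeature(d=2)
for point in [(1.0, 2.0), (2.0, 2.5), (1.5, 3.0)]:
    cf.add(point)
print(cf.centroid(), cf.variance())
```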

2.5 Cluster Feature Tree


A CF-tree is a height-balanced tree with two parameters:

• Branching factor (non-leaf nodes: B, leaf nodes: L)
• Threshold T

Each non-leaf node has the form [CFi, childi]. Each leaf node has:

• A set of CFs
• Two pointers: prev and next

The radius of a sub-cluster under a leaf node cannot exceed the threshold T.

To construct the CF-tree, scan the dataset and insert the incoming data instances into the CF tree one by one. Each instance is inserted into the closest sub-cluster under a leaf node. If the insertion causes the sub-cluster diameter to exceed the threshold, then create a new sub-cluster. The new sub-cluster may cause its parent to exceed the branching factor. If so, split the leaf node:

§ Identify the pair of sub-clusters with the largest inter-cluster distance.
§ Divide the remaining sub-clusters by proximity to these two sub-clusters.

If this split causes the non-leaf node to exceed the branching factor, then split recursively. If the root node is split, then the height of the CF tree is increased by one.
ð An example is given in the slides.

Traditional algorithms


What would BIRCH do?

BIRCH assumes:

§ Clusters are normally distributed in each dimension
§ Axes are fixed: ellipses at an angle are not OK

Clustering Using Representatives (CURE). Cluster definition: a set of representative points. It enables clusters of differing shapes.
ð Requires a Euclidean space

The CURE algorithm does not assume anything about the shape of clusters; they need not be normally distributed, and can even have strange bends, S-shapes, or even rings. Instead of representing clusters by their centroid, it uses a collection of representative points, as the name implies.

Two-pass (hierarchical) clustering approach

• Pass 1: clustering of a subset of the data to pick "representative points".


  o Take a small sample of the data and cluster it in main memory.
  o Select a small set of points from each cluster to be representative points. These points should be chosen to be as far from one another as possible.
  o Move each of the representative points a fixed fraction of the distance between its location and the centroid of its cluster. 20% is a good fraction to choose.

• Pass 2: assign all points to clusters. Scan the entire dataset and assign each example e to the "closest" cluster.
  o A standard metric determines the closest
  o Done by finding the representative with the smallest distance to e.

3. Partitional clustering

3.1 Partitioning Algorithms

Method: construct a partition of a database D of n objects into a set of k clusters.


Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion. Global optimum: exhaustively enumerate all partitions. Heuristic methods: the k-means and k-medoids algorithms.

§ k-means (MacQueen, 1967): each cluster is represented by the center of the cluster
§ k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw, 1987): each cluster is represented by one of the objects in the cluster

Idea: divide instances into disjoint clusters. Flat vs. tree structure.

Issues:

§ How many clusters should there be?
§ How should clusters be represented?

We can generate a partitional clustering from a hierarchical clustering by "cutting" the tree at some level.

3.2 K-Means algorithm

The K-means algorithm is a commonly used algorithm. It is both easy to implement and quick to run. Assumptions:

§ Objects are n-dimensional vectors
§ There is a distance/similarity measure between these instances

Goal: partition the data in K disjoint subsets. Ideally the partition should reflect the structure of the data.


Initially choose k points that are likely to be in different clusters. Make these points the centroids of their clusters. For each remaining point p, find the centroid to which p is closest. Add p to the cluster of that centroid and then adjust the centroid of that cluster to account for p.
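For reference, here is a minimal numpy sketch of the iterative (Lloyd-style) k-means loop: assign every point to its nearest centroid, recompute each centroid as the mean of its cluster, and repeat until nothing moves. The random seeding is deliberately simplistic; the smarter initializations are discussed next.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: X is an (n, d) array; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Seed: pick k distinct data points at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [9.0, 0.0]])
print(kmeans(X, k=2))
```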

How to initialize the clusters? We want to pick points that have a good chance of lying in different clusters. There are 2 approaches:

• Pick points that are as far away from one another as possible. Good choice: pick the first point at random.
• Cluster a sample of the data, perhaps hierarchically, so there are k clusters. Pick a point from each cluster, perhaps the point closest to the centroid of the cluster.

An optional step at the end is to fix the centroids of the clusters and to reassign each point, including the k initial points, to the k clusters. The results depend on the seed selection. Some seeds can result in poor convergence or sub-optimal clusterings.

• Pick points randomly or from data instances
• Use the results of another method
• Many runs of k-means, each with random seeds

The centroid µj of cluster cj is computed as the mean of the examples assigned to it:
• |cj| is the number of examples assigned to cluster cj
• i ∈ cj, i.e., the examples that are assigned to cluster cj
• This is a vector: calculate µ along each dimension

ð An example is provided in the slides.

A tricky point of this algorithm is picking the right value of k. We may not know the correct value of k to use in a k-means clustering. However, if we can measure the quality of the clustering for various values of k, we can usually guess what the right value of k is. If we take a measure of appropriateness for clusters, such as average radius or diameter, that value will grow slowly, as long as the number of clusters we assume remains at or above the true number of clusters. As soon as we try to form fewer clusters than there really are, the measure will rise precipitously.

Time complexity

• Distance between two instances: O(d), where d is the dimensionality of the vectors
• Reassigning clusters: O(kn) distance computations, or O(kdn)
• Computing centroids: each instance vector gets added once to some centroid: O(dn)
• Assume these two steps are each done once for I iterations: O(Ikdn)


• Linear in all relevant factors; with a fixed number of iterations, more efficient than O(n²) HAC

3.3 Bradley-Fayyad-Reina (BFR)

The BFR algorithm is a variant of k-means that is designed to cluster data in a high-dimensional Euclidean space. Key assumption: clusters are normally distributed around a centroid in a Euclidean space. The standard deviations in different dimensions may vary, but the dimensions must be independent. Algorithm overview:

• Read points one main-memory-full at a time. The algorithm begins by selecting k points, using one of the methods we saw in the previous section. Then, the points of the data file are read in chunks. These might be chunks from a distributed file system, or a conventional file might be partitioned into chunks of the appropriate size. Each chunk must consist of few enough points that they can be processed in main memory. Also stored in main memory are summaries of the k clusters and some other data, so the entire memory is not available to store a chunk. Three classes of points:
  o The discard set: points close enough to a centroid to be represented statistically. For each cluster, the discard set is represented by:
    § The number of points, N
    § The vector SUM, whose ith component is the sum of the coordinates of the points in the ith dimension.
    § The vector SUMSQ: ith component = sum of squares of the coordinates in the ith dimension.
  o The compression set: groups of points that are close together but not close to any centroid. They are represented statistically, but not assigned to a cluster.
  o The retained set: isolated points

• Summarize most points by simple statistics


• To begin, from the initial load we select the initial k centroids by some sensible approach.

Initialization

• Take a small, random sample and cluster it optimally
• Take a sample; pick a random point, and then k-1 more points, each as far from the previously selected points as possible

Processing points

• Find points close to a cluster centroid. Add them to the cluster and the discard set.
• Perform in-memory clustering on the remaining points and the old RS. Clusters become CS, the others go to RS.
• Adjust cluster statistics based on the new points
• Consider merging pairs of CS
• On the final round, merge all CSs and RS points into the nearest cluster.

Two key questions

1) When is a point "close enough" to a cluster that we can add it to that cluster? We need a way to decide whether to put a new point into a cluster. BFR suggests two ways:
  o The Mahalanobis distance is less than a threshold. This is a normalized Euclidean distance. For point (x1, ..., xk) and centroid (c1, ..., ck):
    § Normalize in each dimension: yi = |xi - ci| / σi
    § Take the sum of the squares of the yi's
    § Take the square root
    § Accept a point for a cluster if its M.D. is < some threshold, e.g., 4 standard deviations

Regions of equal M.D.: see the figure in the slides.
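A small sketch of that test, assuming numpy and a diagonal covariance (independent dimensions, as BFR assumes); the function name and the toy cluster summary are illustrative.

```python
import numpy as np

def mahalanobis_diag(point, n, total, total_sq, threshold=4.0):
    """Normalized (diagonal-covariance) distance between a point and a cluster
    summarized by its BFR statistics N, SUM, SUMSQ.
    Returns (distance, accept) where accept is True if distance < threshold."""
    point = np.asarray(point, dtype=float)
    centroid = total / n
    variance = total_sq / n - centroid ** 2          # per-dimension variance
    sigma = np.sqrt(variance)
    y = np.abs(point - centroid) / sigma             # normalize each dimension
    dist = np.sqrt(np.sum(y ** 2))
    return dist, dist < threshold

# Cluster summary built from three 2-D points
pts = np.array([[1.0, 2.0], [2.0, 2.5], [1.5, 3.0]])
n, s, sq = len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0)
print(mahalanobis_diag([1.6, 2.4], n, s, sq))
```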

  o Low likelihood of the currently nearest centroid changing. Add p to a cluster if it not only has the centroid closest to p, but it is also very unlikely that, after all the points have been processed, some other cluster centroid will be found to be nearer to p.

2) When should two compressed sets be combined into one? Compute the variance of the combined sub-cluster. N, SUM and SUMSQ allow us to make that calculation. Combine if the variance is below some user-provided threshold.

Getting the k right


ð How to select k? Try different k, looking at the change in the average distance to the centroid as k increases. The average falls rapidly until the right k, then changes little.

• Too few: many long distances to centroid
• Just right: distances rather short
• Too many: little improvement in average distance


To wrap up

K-Means strengths:
• Efficient: O(Ikdn)
• Straightforward to implement
• Finds a local optimum; can use restarts or more advanced search to find a better solution

K-Means weaknesses:
• Must be able to define a mean
• Need to provide k
• Susceptible to noise and outliers
• Cannot represent non-convex clusters

4. Model-Based clustering

Basic idea: clustering as probability estimation. Generative model:
1) Probability of selecting a cluster
2) Probability of generating an object in the cluster


Representation: each cluster is represented by a probability distribution (density).
Given: data points, number of clusters, model form of each cluster
Find: maximum likelihood or MAP assignment of data points to clusters, and learn the parameters for each cluster.
Quality of clustering: likelihood of test objects

Key challenge

• If we knew which examples belong to each cluster, then we could set the model parameters
• If we knew the model parameters, then we could estimate the probability that a cluster generated an example

ð Chicken-and-egg problem

Missing information: cluster membership. Idea: hidden variables for cluster membership. Each instance xi has a set of hidden variables zi1, ..., zik. Intuition: zij = 1 if i belongs to cluster j and 0 otherwise.


In k-means, instances are assigned to exactly one cluster. EM clustering is a "soft" k-means where each example has a probability of belonging to a cluster. Guess the initial parameters for the model in each cluster and then iterate until convergence.

• E step: determine how likely it is that each cluster generated each instance
• M step: adjust the cluster parameters to maximize the likelihood.

E-Step

Recall that zij is a hidden variable which is 1 if Nj generated xi and 0 otherwise. In the E-step, we compute hij, the expected value of this hidden variable.

M-Step

Given the expected values hij, we re-estimate the means of the Gaussians and the cluster probabilities.

Note: "i" goes over examples.

EM clustering… To sum up:
• Will converge to a local maximum


• Sensitive to the initial means of the clusters
• Have to choose the number of clusters in advance
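As an illustration of the E and M steps above, here is a minimal numpy sketch of EM for a one-dimensional Gaussian mixture (the slides use the multivariate case; the function name and the generated data are illustrative).

```python
import numpy as np

def em_gaussian_mixture(x, k, n_iter=50, seed=0):
    """EM for a 1-D Gaussian mixture ('soft' k-means).
    x: 1-D array of observations. Returns (weights, means, variances)."""
    rng = np.random.default_rng(seed)
    means = rng.choice(x, size=k, replace=False).astype(float)
    variances = np.full(k, x.var())
    weights = np.full(k, 1.0 / k)

    for _ in range(n_iter):
        # E-step: h[i, j] = expected value of z_ij, i.e. P(cluster j | x_i)
        h = np.zeros((len(x), k))
        for j in range(k):
            h[:, j] = weights[j] * np.exp(-(x - means[j]) ** 2 / (2 * variances[j])) \
                      / np.sqrt(2 * np.pi * variances[j])
        h /= h.sum(axis=1, keepdims=True)

        # M-step: re-estimate means, variances and cluster probabilities
        nj = h.sum(axis=0)                     # "effective" number of points per cluster
        means = (h * x[:, None]).sum(axis=0) / nj
        variances = (h * (x[:, None] - means) ** 2).sum(axis=0) / nj
        weights = nj / len(x)
    return weights, means, variances

x = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(6, 1, 200)])
print(em_gaussian_mixture(x, k=2))
```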

Evaluating cluster results

Given random data without any "structure", clustering algorithms will still return clusters. The gold standard: do clusters correspond to natural categories? Do clusters correspond to categories we care about? (There are lots of ways to partition the world.) Approaches to cluster evaluation:

o External validation. Ex: do genes clustered together have some common function?
o Internal validation: how well does the clustering optimize intra-cluster similarity and inter-cluster dissimilarity?
o Relative validation: how does it compare to other clusterings? Ex: with a probabilistic method (such as EM) we can ask: how probable does held-aside data look as we vary the number of clusters?

5. Applications

5.1 Low Quality of Web Searches

• System perspective
  o Small coverage of the Web (<16%)
  o Dead links and out-of-date pages
  o Limited resources
• IR perspective (relevancy of doc ~ similarity to query)
  o Very short queries
  o Huge database
  o Novice users

5.2 Document clustering

• User receives many (200-500) documents from a web search engine
• Group documents in clusters by topic
• Present clusters as the interface

ð We need a way to compare queries and documents

• Vector space model
  o How to determine important words in a document?
  o How to determine the degree of importance of a term within a document and within the entire collection?
  o How to determine the degree of similarity between a document and the query?


  o In the case of the web, what is a collection, and what are the effects of links, formatting information, etc.?

Assume t distinct terms remain after pre-processing: the vocabulary. These "orthogonal" terms form a vector space. Dimension = t = |vocabulary|. Each term, i, in a document or query, j, is given a real-valued weight, wij. Both documents and queries are expressed as t-dimensional vectors: dj = (w1j, w2j, ..., wtj).

The vector space model represents a collection of n documents by a term-document matrix. Each entry: the "weight" of a term in the document.

More frequent terms in a document are more important, i.e. more indicative of the topic. fij = frequency of term i in document j. We may want to normalize the term frequency (tf) by dividing by the frequency of the most common term in the document: tfij = fij / maxi{fij}.

Terms that appear in many different documents are less indicative of the overall topic. dfi = document frequency of term i = number of documents containing term i. idfi = inverse document frequency of term i = log2(N / dfi) (N: number of documents). This is an indication of a term's discrimination power.

ð The log is used to dampen the effect relative to tf


A typical combined term-importance indicator is tf-idf weighting:

wij = tfij * idfi = tfij * log2(N / dfi)

A term occurring frequently in the document but rarely in the rest of the collection is given a high weight. Many other ways of determining term weights have been proposed. Experimentally, tf-idf works well.

Example: given a document containing terms with frequencies A(3), B(2), C(1). Assume the collection contains 10,000 documents and the document frequencies of these terms are A(50), B(1300), C(250). Then:
A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6
B: tf = 2/3; idf = log2(10000/1300) = 2.9; tf-idf = 2.0
C: tf = 1/3; idf = log2(10000/250) = 5.3; tf-idf = 1.8

A query vector is typically treated as a document and is also tf-idf weighted. An alternative is for the user to supply weights for the given query terms.
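A small sketch that reproduces the worked example above (plain Python; the function name is illustrative):

```python
import math

def tf_idf(doc_freqs, collection_dfs, n_docs):
    """Compute tf-idf weights for one document.
    doc_freqs: {term: raw frequency in the document}
    collection_dfs: {term: number of documents containing the term}
    n_docs: total number of documents in the collection."""
    max_f = max(doc_freqs.values())
    weights = {}
    for term, f in doc_freqs.items():
        tf = f / max_f                                   # normalized term frequency
        idf = math.log2(n_docs / collection_dfs[term])   # inverse document frequency
        weights[term] = tf * idf
    return weights

# The worked example above: A(3), B(2), C(1) in a collection of 10,000 documents
print(tf_idf({'A': 3, 'B': 2, 'C': 1},
             {'A': 50, 'B': 1300, 'C': 250},
             n_docs=10000))
```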

• Inner product
  o wij is the weight of term i in doc j
  o wiq is the weight of term i in the query
• Cosine similarity
  o Measures the cosine of the angle between two vectors
  o Inner product normalized by the vector lengths

Comparison
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + 1T3
Q = 0T1 + 0T2 + 2T3

Weighted inner product:
sim(D1, Q) = 2*0 + 3*0 + 5*2 = 10


sim(D2, Q) = 3*0 + 7*0 + 1*2 = 2

Cosine:
sim(D1, Q) = 10 / √((4+9+25)(0+0+4)) = 0.81
sim(D2, Q) = 2 / √((9+49+1)(0+0+4)) = 0.13
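The same comparison in a few lines of Python (illustrative only):

```python
import math

def inner_product(d, q):
    return sum(a * b for a, b in zip(d, q))

def cosine_similarity(d, q):
    # inner product normalized by the two vector lengths
    return inner_product(d, q) / math.sqrt(inner_product(d, d) * inner_product(q, q))

D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
print(inner_product(D1, Q), inner_product(D2, Q))            # 10 2
print(round(cosine_similarity(D1, Q), 2),
      round(cosine_similarity(D2, Q), 2))                    # 0.81 0.13
```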

ð D1 is 6 times better than D2 using cosine similarity, but only 5 times better using the inner product.

Comments on the Vector Space Model

• Simple, mathematically based approach
• Considers both local (tf) and global (idf) word occurrence frequencies
• Provides partial matching and ranked results
• Tends to work quite well in practice despite obvious weaknesses
• Allows efficient implementation for large document collections

Weaknesses of the Vector Space Model

• Missing semantic information. Ex: word sense
• Missing syntactic information (Ex: phrase structure, word order, proximity information)
• Assumption of term independence (Ex: ignores synonymy)

5.3 Bioinformatics

Molecular biology has become a data-rich science. The Nucleic Acids Research Molecular Biology Database Collection indexes 1330 data sources. This represents lots and lots of data, but there is a lack of understanding of biological systems. Computational methods are crucial for understanding the available data.

Genes

Genes are the basic units of heredity. A gene is a sequence of DNA bases that carries the information required for constructing a particular protein. Such a gene is said to encode a protein. The human genome comprises ~25,000 protein-coding genes.

Types of data

• Full sequences of genomes
• Single nucleotide polymorphisms


• Gene expression data
• Many more!

Microarrays allow us to measure gene expression. A microarray is a solid support on which pieces of DNA are arranged in a grid-like array. It measures RNA abundances by exploiting complementary hybridization. How active are various genes in different cell/tissue types? How does the activity level of various genes change under different conditions?

• Stages of a cell cycle
• Environmental conditions
• Disease states
• Knockout experiments

What genes seem to be regulated together?

Data points are genes

• Represented by expression levels across different samples (i.e., features = samples)
• Goal: categorize new genes

Data points are samples (e.g., patients)

• Represented by expression levels of different genes (i.e., features = genes)
• Goal: categorize new samples

Unsupervised learning task (1). Given: a set of microarray experiments under different conditions. Do: cluster the genes, where a gene is described by its expression levels in different experiments.
Unsupervised learning task (2). Given: a set of microarray experiments (samples) corresponding to different conditions or patients. Do: cluster the experiments. Ex:


• Cluster samples from mice subjected to a variety of toxic compounds (Thomas et al., 2001)
• Cluster samples from cancer patients, potentially to discover different subtypes of a cancer
• Cluster samples taken at different time points


CHAPTER 9: USING UNLABELED DATA

1. Introduction

When you do something like classification you have labelled data. The assumption is that someone is telling you what is labelled as positive or as negative. Where does labelled data come from? It comes from tasks people are willing to label.

• Netflix, Amazon, etc.
• Spam
• Medical diagnoses

Often we have to get people to label data

• Web ranking
• Document classification

Problem: labelling data is expensive because you have to pay someone to do it, and you need a lot of data. Next, it is also time-consuming: thinking about what the label should be, etc. Learning methods need labelled data.

• Lots of <x, f(x)> pairs
• Hard to get… who wants to label data?

But unlabelled data is usually plentiful… could we use this instead?

• Semi-supervised learning
• Active learning

Problem set-up. Motivation: it is hard to get labelled data. Given: we assume we have some labelled data and some unlabelled data.

• Labeled data: D = [<x, f(x)>]
• Unlabeled data: U = [<x, ??>]

ð Learn a hypothesis h that approximates f.

2. Semi-supervised learning

Self-training: use an ensemble technique such as bagging to build a set of classifiers on the labelled data. Classify each unlabelled training example with each classifier, making a prediction on each unlabelled example. If all classifiers agree on an example's label, add it to the training set with its predicted label. Next, iterate. For the first iteration you have little training data.
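A minimal sketch of that self-training loop, assuming scikit-learn is available for the bagged decision trees; the function name, the number of rounds and the unanimity rule follow the description above, but the details are illustrative.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

def self_train(X_lab, y_lab, X_unlab, n_rounds=5, n_estimators=10):
    """Repeatedly fit a bagged ensemble on the labelled data, then move
    unanimously-labelled unlabelled examples into the training set."""
    X_lab, y_lab, X_unlab = np.array(X_lab), np.array(y_lab), np.array(X_unlab)
    for _ in range(n_rounds):
        if len(X_unlab) == 0:
            break
        ensemble = BaggingClassifier(DecisionTreeClassifier(),
                                     n_estimators=n_estimators).fit(X_lab, y_lab)
        # Each committee member classifies every unlabelled example.
        votes = np.array([est.predict(X_unlab) for est in ensemble.estimators_])
        unanimous = (votes == votes[0]).all(axis=0)
        if not unanimous.any():
            break
        # Add unanimously labelled examples with their predicted label, then iterate.
        X_lab = np.vstack([X_lab, X_unlab[unanimous]])
        y_lab = np.concatenate([y_lab, votes[0][unanimous]])
        X_unlab = X_unlab[~unanimous]
    return X_lab, y_lab
```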


Another way is called co-training. Idea: you have a little labelled data and lots of unlabelled data. And each instance has two parts:

• x = [x1, x2]
• x1, x2 conditionally independent given f(x)

Each half can be used to classify the instance: ∃ f1, f2 such that f1(x1) ~ f2(x2) ~ f(x). Both f1, f2 are learnable: f1 ∈ H1, f2 ∈ H2, ∃ learning algorithms A1, A2. Ex: you can think of classifying documents on web pages, having two feature sets per web page. One: the words on the web page. Second: the words in the links on other internet pages pointing to it.

One can apply A1 to generate as much training data as one wants. If x1 is conditionally independent of x2 given f(x), then the errors in the labels produced by A1 will look like random noise to A2! Thus there is no limit to the quality of the hypothesis A2 can make.

Learning to classify web pages as course pages.

Two features:
• x1 = bag of words on a page
• x2 = bag of words from all anchors pointing to a page


Naïve Bayes classifiers, 12 labeled pages, 1039 unlabeled pages (results table in the slides).

One thing that is surprising is how much the error rates differ. With co-training, the error rate is halved.

3. Active learning

3.1 The concept

Active learning is a subfield of machine learning and, more generally, artificial intelligence. The key hypothesis is that if the learning algorithm is allowed to choose the data from which it learns (to be "curious", if you will), it will perform better with less training. Why is this a desirable property for learning algorithms to have? Consider that, for any supervised learning system to perform well, it must often be trained on hundreds (even thousands) of labelled instances.

Suppose you are the leader of an Earth convoy sent to colonize planet Mars.

Problem: there's a range of spiky-to-round fruit shapes on Mars. You need to learn the "threshold" of roundness where the fruits go from poisonous to safe. And… you need to determine this risking as few colonists' lives as possible. A way to think about doing this is a binary search, instead of testing all of the instances.

Key idea: the learner can choose its training data. You withhold some labels and choose which labels you want to obtain.


• On Mars: whether a fruit was poisonous/safe
• In general: the true label of some instance

Goal: reduce the training costs

• On Mars: the number of "lives at risk"
• In general: the number of "queries"

Active learning systems attempt to overcome the labeling bottleneck by asking queries in the form of unlabeled instances to be labeled by an oracle (e.g., a human annotator). In this way, the active learner aims to achieve high accuracy using as few labeled instances as possible, thereby minimizing the cost of obtaining labeled data. Active learning is well-motivated in many modern machine learning problems where data may be abundant but labels are scarce or expensive to obtain. How do we do this?

You can think about generating examples and asking for their labels. Another approach you can think about is data streaming.

Pool-Based Active Learning Cycle

There are several scenarios in which active learners may pose queries, and there are also several different query strategies that have been used to decide which instances are most informative. In this section, two illustrative examples in the pool-based active learning setting (in which queries are selected from a large pool of unlabeled instances U) are presented using an uncertainty sampling query strategy (which selects the instance in the pool about which the model is least certain how to label).

This figure illustrates the pool-based active learning cycle. A learner may begin with a small number of instances in the labeled training set L, request labels for one or more carefully selected instances, learn from the query results, and then leverage its new


knowledge to choose which instances to query next. Once a query has been made, there are usually no additional assumptions on the part of the learning algorithm. The new labeled instance is simply added to the labeled set L, and the learner proceeds from there in a standard supervised way.
ð I have a small label set, I learn a model. We use that model to make predictions on unlabelled data. In active learning we ask someone a question about what the true label is.

You show a plot with the number of queries on the x-axis and accuracy on the y-axis. Usually you get such a curve. With active learning the algorithm decides which examples it takes. And usually, for the same number of instances, you get a better accuracy than with passive learning. This is used a lot.

3.2 How to select queries?

We assume that we received a small sample of unlabeled examples and a set of labeled examples. The goal is to reduce the training cost, that is to say, the cost of getting labels. Key question: which examples do we want to use?
ð Let's try generalizing our binary search method using a probabilistic classifier. We apply something like logistic regression.


Query the examples the learner is most uncertain about:

• Closest to 0.5 probability
• Closest to the decision surface

Look at examples close to the decision boundary, since these cause the boundary to move. If we have positive and negative examples, look at examples that are closest to the line. We take one, and we ask someone to give it a label. Here, it is given a negative label. We retrain the classifier, giving a new decision boundary. We again pick an example close to the boundary, give it a label, and we do this again and again, etc.

The intuition here is that we should first observe the labels of points close to the decision boundary.
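A small sketch of this pool-based uncertainty-sampling loop, assuming scikit-learn's logistic regression; the oracle function stands in for the human annotator and is purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling_loop(X_lab, y_lab, X_pool, oracle, n_queries=10):
    """Pool-based active learning with uncertainty sampling.
    oracle(x) is a stand-in for the human annotator: it returns the true label of x."""
    X_lab, y_lab, X_pool = np.array(X_lab), np.array(y_lab), np.array(X_pool)
    for _ in range(n_queries):
        if len(X_pool) == 0:
            break
        model = LogisticRegression().fit(X_lab, y_lab)
        # Uncertainty = how close the predicted P(positive) is to 0.5.
        proba = model.predict_proba(X_pool)[:, 1]
        idx = np.argmin(np.abs(proba - 0.5))
        # Query the oracle for the most uncertain instance and add it to L.
        x, y = X_pool[idx], oracle(X_pool[idx])
        X_lab = np.vstack([X_lab, x])
        y_lab = np.append(y_lab, y)
        X_pool = np.delete(X_pool, idx, axis=0)
    return LogisticRegression().fit(X_lab, y_lab)
```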

3.3 Query-By-Committee (QBC)

Train a committee C = {q1, q2, ..., qC} of classifiers on the labeled data in L. Query instances in U for which the committee is in most disagreement. Try to find models that are consistent with the training data, reducing the model version space. Key idea: reduce the model version space. It expedites the search for a model during training.


We have 4 labeled instances: 3 triangles and a square. Train a model such that everything inside the rectangle is of class square and everything outside is of class triangle.

How to build a committee:
• "Sample" models from P(q | L) [Dagan & Engelson, ICML'95; McCallum & Nigam, ICML'98]
• Standard ensembles (e.g., bagging, boosting) [Abe & Mamitsuka, ICML'98]

How to measure disagreement:
• «XOR» the committee classifications
• View the vote distribution as probabilities, and use uncertainty measures (e.g., entropy)

3.4 Alternative Query Types

So far, we assumed queries are instances. Ex: for document classification the learner queries documents. Can the learner do better by asking different types of questions? Questions not about instances but about different features, for instance.

• Multiple-instance active learning
• Feature active learning: try to get feedback about which features are more useful than others. You can then get better results. In NLP tasks, we can often intuitively label features.
  o The feature word "puck" indicates the class hockey
  o The feature word "strike" indicates the class baseball

Tandem learning exploits this by asking both instance-label and feature-relevance queries. Ex: "is puck an important discriminative feature?"

4. Summary

Both try to attack the same problem: making the most of the unlabeled data U. Getting labels is expensive. Each attacks it from a different direction:

• Semi-supervised learning exploits what the model thinks it knows about the unlabeled data. It propagates information about what it knows, from the labeled data to the unlabeled data.
• Active learning explores the unknown aspects of the unlabeled data. Look at what is uncertain.


CHAPTER 10: TIME SERIES

1. Time Series Intro

A time series is a collection of measurements made sequentially in time. Usually people only want to measure one variable when looking at time series.

Time series face different challenges:

• Lots and lots of data: videos, for instance.
• Autocorrelation: the current value depends on previously observed values.
• Data is messy: you usually collect time series from sensors.

  o Different sampling rates and ranges (e.g., clipping in accelerometer data)
  o Noise
  o Missing values (e.g., sensor drops)
  o Etc.

Standard pre-processing: Z-normalization. Usually when working with data collected from sensors you want to do some sort of pre-processing.

Given: S = (s1, ..., sn)

Let each si be replaced by s'i = (si – µ) / σ, where µ is the mean and σ the standard deviation of S.
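In numpy this is a one-liner (illustrative sketch):

```python
import numpy as np

def z_normalize(s):
    """Z-normalize a time series: subtract the mean and divide by the standard deviation."""
    s = np.asarray(s, dtype=float)
    return (s - s.mean()) / s.std()

print(z_normalize([2.0, 4.0, 6.0, 8.0]))   # result has mean 0 and std 1
```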

2. Time Series Classification

One of the obvious tasks you can do with time series is classification. Given: a set of time series with known labels.

Goal: make predictions about future time series. You are given a new time series measurement and you want to predict the class of this instance.


Assumption: labels are assigned to the entire sequence. We do not necessarily know what value/set of values is responsible for the label. I don't know which measurements cause the time series to be labelled positive or negative.

Idea 1: Define features

The first obvious approach is to define features. So, given a time series, convert it to a feature vector representation. There are several ways to do so:

• Use statistics: min, max, etc.
• Transformations: discrete wavelet transform, etc.
• Entropy
• …

ð Apply standard supervised learners to the data to make a prediction.

Idea 2: Nearest Neighbours

Compare the new time series to my training set and measure the distance/similarity between them. Just predict the label based on the smallest distance.

ð How can we measure the distance between time series?

• Euclidean distance
Given: two time series of equal length, Q = (q1, q2, …, qn) and S = (s1, …, sn). Take the Euclidean distance by comparing each measurement in one time series to the corresponding measurement of the other time series.

ð Can we capture this alignment? One standard approach for doing this is called Dynamic Time Warping (DTW). Basic idea: allow a one-to-many mapping. Each measurement in one time series gets matched to one or more measurements in the other time series. It allows us to have some non-linearity in the data. Constraints on the mapping:


§ Must match the beginning and end of the sequences
§ Monotonicity: cannot go backwards in time
§ Continuity: no gaps
§ (optional) Warping window w: |i - j| ≤ w

Compute DTW with dynamic programming. There is no restriction that both sequences should have the same length. Again we have the same problem set-up, having two time series S and Q. Given: S = (s1, …, sn) and Q = (q1, …, qm). Do:

• Let D = an n x m matrix
• Let δ(si, qj) be the distance between si and qj
• D[i, j] = δ(si, qj) + min(D[i-1, j-1], D[i, j-1], D[i-1, j])

If w is given, then only compute D[i, j] if |i - j| ≤ w.

ð An example is given in the slides.
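A direct implementation of the recurrence above (plain Python with numpy; the optional warping window w is handled by simply skipping cells with |i - j| > w):

```python
import numpy as np

def dtw(s, q, w=None):
    """Dynamic Time Warping distance between sequences s and q.
    w: optional warping-window width (|i - j| <= w); None means no window."""
    n, m = len(s), len(q)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        lo, hi = (1, m) if w is None else (max(1, i - w), min(m, i + w))
        for j in range(lo, hi + 1):
            cost = abs(s[i - 1] - q[j - 1])          # delta(s_i, q_j)
            D[i, j] = cost + min(D[i - 1, j - 1],    # match both
                                 D[i, j - 1],        # q advances
                                 D[i - 1, j])        # s advances
    return D[n, m]

print(dtw([1, 2, 3, 4, 3], [1, 2, 2, 3, 4, 4, 3]))        # 0.0: perfect warped match
print(dtw([1, 2, 3, 4, 3], [1, 2, 2, 3, 4, 4, 3], w=2))   # same path fits in the window
```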

Trace back. We want to start from the right corner and trace back. We know that 5 will be matched to 7, so 10 can be matched to 7 and 8 will be matched to 7. 13 is matched to 12. Etc.

This gives a mapping between both sequences.

This is a good way to measure the distance between both sequences.

If we have a warping window of 1, we get the following matrix. Blue entries are the ones violating the constraint.


One thing to know is that DTW is not a metric… because:

o DTW violates the triangle inequality
o Metrics are usually preferred as many efficiency tricks exploit the triangle inequality
  § However, many "tricks" exist for DTW
  § It can be computed very quickly
o Also, most of the time DTW acts like a metric. Ex: >99% of randomly sampled triples will satisfy the triangle inequality for DTW.

Exploiting the Triangle Inequality

Setting the Warping Constraint. The warping constraint has a big influence on the results, and the "best" value is problem-dependent.

• Set via a tuning set
• Usually small values (e.g., < 20) are fine
• As the training set size gets larger, expect that less warping is needed. Why? The more examples I have, the less warping I expect to have to do.

Implementing DTW

• Dynamic programming DTW: N² time and space per pair compared
• Many efficiency tricks exist that scale this to very large data sets (i.e., linear time)
  o Early abandoning of the computation
  o Exploiting the warping constraint


  o Etc.

So far, we assumed a time series measures only one variable. However, multivariate series measure multiple variables: S = (s1, ..., sn) where each si = (si,1, ..., si,d). The question is, how can we deal with multi-dimensional variables? There exist two ways to extend DTW.

• Independent: sum over each dimension separately. Align the first dimensions of the two sequences, then the second dimension, the third one, etc. So we essentially assume the dimensions to be independent. DTW(S, Q) = DTW(S1, Q1) + ... + DTW(Sd, Qd)
• Dependent: a single matrix, and compute the score along all dimensions, e.g., δ(si, qj) is the d-dimensional Euclidean distance.

Selecting the independent or dependent metric is domain-dependent:

• Independent: acceleration on foot and hand
• Dependent: player location on a soccer field

Beware of the curse of dimensionality!

• DTW is probably only useful if d < 10 (or perhaps even smaller)
• If d > 10, you probably need to drop some dimensions.

Allowing for Gaps: Longest Common Subsequence (LCSS)

Potential issue with DTW: all points are matched, so outliers can distort the score. The longest common subsequence allows gaps, meaning that not all points have to be matched. We will ignore some noise in the data.

Computing LCSS: Dynamic Programming. Given: S = (s1, ..., sn) and Q = (q1, ..., qm).

Notes on LCSS

o This is a similarity, not a distance


o Symbolic data uses si = qj instead of δ(si, qj) < ε
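The recurrence itself is in the slides; a common formulation, sketched here under the assumption that two real-valued points match when they differ by less than a threshold ε, is LCSS[i, j] = 1 + LCSS[i-1, j-1] if δ(si, qj) < ε, and max(LCSS[i-1, j], LCSS[i, j-1]) otherwise.

```python
def lcss(s, q, eps=0.5):
    """Longest Common Subsequence similarity for real-valued series.
    Two points match when they differ by less than eps; unmatched points
    (gaps) are simply skipped, unlike in DTW."""
    n, m = len(s), len(q)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(s[i - 1] - q[j - 1]) < eps:       # the two points match
                L[i][j] = 1 + L[i - 1][j - 1]
            else:                                    # skip one point (a gap)
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n][m]

print(lcss([1.0, 2.0, 100.0, 3.0], [1.1, 2.1, 3.1]))   # 3: the outlier 100 is ignored
```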

Why is it interesting to study something like DTW?

• DTW is common: medicine, bioinformatics, gesture recognition, music analysis, etc.
• Many large-scale empirical evaluations show that nearest neighbour with DTW is hard to beat
  o DTW is very simple and easy to implement
  o Good advice: try simple things first
• Ideas such as dynamic programming, etc. translate to other problems.

3. Symbolic Representations

ð Why a symbolic representation?
• Enables new analyses by using symbolic approaches: plot them, for instance.
• Symbols tend to be easier for people to interpret than numbers
• Compress the data

SAX representation. Given: a time series S = (s1, ..., sn), a window size w, and an alphabet size a. Do: convert S to a symbolic sequence of length n/w which involves an alphabet of a symbols. So we will divide our sequence into parts.

How does this approach work? First we assume the data is normalized, then we apply the Piecewise Aggregate Approximation.
ð Z-normalize the time series

Apply the Piecewise Aggregate Approximation (PAA):
• Divide the series into n/w windows
• Compute the average value in each window and replace each observed value in the window with the average

Assign a symbol based on ranges of PAA values. Convert the PAA representation to a string:


• Exploit that the time series is z-normalized
• Divide the area under the standard normal into equal-size regions
• Assign one symbol to each region

Anything that falls into a particular region gets the corresponding symbol.
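A compact sketch of the whole pipeline (z-normalize, PAA, symbol assignment), assuming numpy and Python's statistics.NormalDist for the standard-normal breakpoints; the window size, alphabet size and example series are illustrative.

```python
import numpy as np
from statistics import NormalDist

def sax(series, w, a):
    """SAX: z-normalize, apply PAA with window size w, then map each PAA
    value to one of `a` symbols using equiprobable standard-normal regions."""
    s = np.asarray(series, dtype=float)
    s = (s - s.mean()) / s.std()                      # z-normalization
    # PAA: average value in each of the n/w windows
    paa = s[: (len(s) // w) * w].reshape(-1, w).mean(axis=1)
    # Breakpoints that cut the standard normal into `a` equal-probability regions
    breakpoints = [NormalDist().inv_cdf(i / a) for i in range(1, a)]
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    return "".join(alphabet[np.searchsorted(breakpoints, v)] for v in paa)

print(sax([1, 2, 3, 4, 6, 8, 7, 5, 3, 2, 1, 0], w=3, a=4))
```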

ð Why SAX?
• Converts the time series to a string: we can apply fancy algorithms (e.g., from bioinformatics)
• Makes visualization easier (borrowing ideas from bioinformatics)
• Most indexing schemes work best with symbols: we can use extensible hashing with SAX

4. Applications to sports

Three new types of data
• Event stream: events with time and location
• Athlete monitoring: GPS, accelerometer, etc.
• Optical tracking: X, Y locations of players

ð These are all time series

Two tasks
• Rating players: assign a rating to each action a player performs in a match
• Understanding strategy: discover patterns from player tracking data

STARSS. Given: an event stream with the type and location of all events (e.g., passes and shots). Do: assign a rating to each action by assigning a value to each action. Approach:


1. Split matches into phases
2. Rate phases
3. Distribute the phase rating over individual actions
4. Aggregate players' ratings over the season

Rating phases:
1. Find the k most similar phases (e.g., 100)
2. Of these, count how many result in a goal (e.g., 6)

ð Distribute the phase rating across its constituent actions. Actions at the end are more important: exponential decay.

Discovering Offensive Strategies in Football Matches. Given: an event stream with the type and location of all events (e.g., passes and shots), and the locations of all players and the ball (10 Hz sampling). Find: typical offensive strategies.

§ Film study is time-consuming
§ Automation can help speed this up
§ Computers are good at finding patterns in large data sets

Challenges:

• Relationships and how they change over time are important
  § Space
  § Interactions between players
• Order of events is important
• May want to generalize over the players involved
• The exact same sequence of events is unlikely to occur multiple times


Important steps

1. Data cleaning
  • Outliers and incorrect values
    o Valid field coordinates
    o Player and ball movements seem "possible"
  • Teams switch direction at half time: normalize the data such that a team always attacks the same goal.
  • Account for changes in the data (e.g., position switches, new players, etc.)

2. Event stream pre-processing

3. Clustering the data. Three benefits:
  • Teams employ multiple strategies
  • Generalize from a specific location
  • The subsequent step becomes more computationally efficient
  Divide phases into different groups such that the phases in a group are "similar".

4. Identifying important strategies: within each cluster, find frequently occurring subsequences. See the slides for more details.

5. To conclude

To conclude, time series occur in multiple domains: sports, healthcare, music, mathematics, etc. Sports provide many rich sources of data, from historical sources to sensors that provide detailed data. This relatively simple analysis has made a huge impact on several sports.


The 1-NN approach with DTW is a very strong approach for classifying time series.

§ DTW is a one-to-many mapping
§ It can be efficiently computed

We can convert a time series into a symbolic sequence. Key challenges include:

§ Volume of data
§ Capturing structure: spatial, temporal, etc.