DATA MINING
Course notes, KU Leuven
Ysaline de Wouters
2016-2017


Table of contents

CHAPTER 1: INTRODUCTION TO DATA MINING
1. BACKGROUND ON DATA MINING
1.1 TECHNOLOGICAL ADVANCES
1.2 DATA MINING VS. MACHINE LEARNING
2. DATA MINING TASKS
3. THE DATA MINING PROCESS
3.1 COMPONENTS OF A DATA MINING SYSTEM
4. STATISTICAL LIMITS ON DATA MINING
5. THINGS USEFUL TO KNOW
5.1 IMPORTANCE OF WORDS IN DOCUMENTS
5.2 HASH FUNCTIONS
5.3 INDEXES

CHAPTER 2: GET TO KNOW YOUR DATA
1. RETROSPECTIVE VS. PROSPECTIVE DATA
2. TYPES OF DATA
3. SIMPLE DESCRIPTIVE STATISTICS
4. VISUALIZATION

CHAPTER 3: INTRODUCTION TO PREDICTIVE MODELING IN DATA MINING
1. INDUCTIVE LEARNING
1.1 SCORING PROBLEMS
2. FEATURES
3. CALIBRATION CHECK
4. BAYES RULE & NAÏVE BAYES CLASSIFIERS
5. LOGISTIC REGRESSION, TRAINING TASK
6. CHALLENGE PROBLEM

CHAPTER 4: RECOMMENDING PRODUCTS
1. PROBLEM OVERVIEW
1.1 RECOMMENDATION TYPES
1.2 CHALLENGES
2. CONTENT-BASED FILTERING
3. COLLABORATIVE FILTERING
4. EVALUATION
5. CASE STUDY: NETFLIX CHALLENGE
5.1 MODELLING GLOBAL AND LOCAL BIASES
5.2 LATENT FACTOR MODELS
5.3 MODELLING TEMPORAL DYNAMICS
5.4 TRY LOTS AND LOTS OF DIFFERENT MODELS AND COMBINE THE OUTPUT

CHAPTER 5: MODEL ENSEMBLES
1. MOTIVATION AND OVERVIEW
2. SAMPLING-BASED APPROACHES
2.1 BAGGING: BOOTSTRAP AGGREGATING
2.2 ADABOOST (ADAPTIVE BOOSTING)
2.3 CROSS-VALIDATION
2.4 GRADIENT TREE BOOSTING
3. MANIPULATE FEATURES
4. ADD RANDOMNESS
5. STACKING
6. WHY DO ENSEMBLE METHODS WORK?
6.1 EXPECTED ERROR
6.2 BIAS/VARIANCE EXPLANATION
6.3 STATISTICAL EXPLANATION
6.4 REPRESENTATIONAL EXPLANATION
6.5 COMPUTATIONAL EXPLANATION
7. SOME APPLICATIONS

INTERLUDE: ACTIONABLE DATA MINING

INTERLUDE: LARGE SCALE DECISION TREE LEARNING
1. DECISION TREE OVERVIEW
2. RAINFOREST
3. BOAT

CHAPTER 6: ASSOCIATION RULE MINING
1. INTRODUCTION AND DEFINITIONS
2. NAÏVE ALGORITHM
3. APRIORI
4. PCY
5. LIMITING DISK I/O
5.1 SIMPLE ALGORITHM: SAMPLE
5.2 THE SON ALGORITHM
5.3 TOIVONEN'S ALGORITHM
6. FP-GROWTH
7. INCORPORATING CONSTRAINTS INTO MINING
8. PRESENTING RESULTS, OTHER METRICS

CHAPTER 7: MINING SEQUENCES
1. MOTIVATION
2. BACKGROUND AND DEFINITIONS
3. FREESPAN
4. PREFIXSPAN

CHAPTER 8: CLUSTERING
1. UNSUPERVISED LEARNING, CLUSTERING INTRO
2. HIERARCHICAL CLUSTERING
2.1 THE ALGORITHM
2.2 DISTANCE MEASURES
2.3 EVALUATION OF THE ALGORITHM
2.4 CLUSTER FEATURE VECTOR
2.5 CLUSTER FEATURE TREE
3. PARTITIONAL CLUSTERING
3.1 PARTITIONING ALGORITHMS
3.2 K-MEANS ALGORITHM
3.3 BRADLEY-FAYYAD-REINA (BFR)
4. MODEL-BASED CLUSTERING
5. APPLICATIONS
5.1 LOW QUALITY OF WEB SEARCHES
5.2 DOCUMENT CLUSTERING
5.3 BIOINFORMATICS

CHAPTER 9: USING UNLABELED DATA
1. INTRODUCTION
2. SEMI-SUPERVISED LEARNING
3. ACTIVE LEARNING
3.1 THE CONCEPT
3.2 HOW TO SELECT QUERIES?
3.3 QUERY-BY-COMMITTEE (QBC)
3.4 ALTERNATIVE QUERY TYPES
4. SUMMARY

CHAPTER 10: TIME SERIES
1. TIME SERIES INTRO
2. TIME SERIES CLASSIFICATION
3. SYMBOLIC REPRESENTATIONS
4. APPLICATIONS TO SPORTS
5. TO CONCLUDE


CHAPTER 1: INTRODUCTION TO DATA MINING

Data mining refers to the discovery of models for data. However, a model can be one of several things. Originally, "data mining" was a derogatory term referring to attempts to extract information that was not supported by the data. Today, it is seen as the construction of a statistical model, that is, an underlying distribution from which the visible data is drawn.
Some people regard data mining as synonymous with machine learning. There is no question that some data mining appropriately uses algorithms from machine learning. Machine learning practitioners use the data as a training set, to train an algorithm of one of the many types used by machine-learning practitioners, such as Bayes nets, support vector machines, decision trees, etc.
More recently, computer scientists have looked at data mining as an algorithmic problem. There are many different approaches to modelling data. We could for instance construct a statistical process whereby the data could have been generated.

• Summarizing the data succinctly and approximately. Ex: the PageRank idea. The entire complex structure of the Web is summarized by a single number for each page. This number is the probability that a random walker on the graph would be at that page at any given time. It reflects the importance of the pages.

• Extracting the most prominent features of the data and ignoring the rest. It represents the data by these examples.

  o Frequent itemsets: makes sense for data that consists of "baskets" of small sets of items. Look for small sets of items that appear together in many baskets; these frequent itemsets are the characterization of the data that we seek.

  o Similar items: find pairs of sets that have a relatively large fraction of their elements in common. Ex: treat customers at an online store like Amazon as the set of items they have bought. Amazon can look for similar customers and recommend something many of these customers have bought. This is called collaborative filtering.

1. Background on data mining

This course provides a broad survey of several important and well-known subfields. The main goal is to develop an overall sense of how to extract information from data in a systematic way.

• The How: gain insight into the workings of specific algorithms, i.e. how to extract information from data. We want to understand the typical questions underlying the how: how do the algorithms work, what do they try to do, how do they extract information?

• The Why: understand the "big picture" of data mining. Goal: understand the challenges in data mining.

Many definitions of data mining are possible:


• A phrase to put on your CV to get hired
• Non-trivial extraction of implicit, previously unknown and useful information from data
• A buzzword used to get money from funding agencies and venture capital firms
• (Semi-)automated exploration and analysis of large data sets to discover meaningful patterns

Data mining can be interpreted as the process of automatically identifying models and patterns from massive observational databases that are:

• Valid: hold on new data with some certainty
• Novel: non-obvious to the system
• Useful: it should be possible to act on the pattern. It should give us a look inside the problem, for instance to support decision making.
• Understandable: humans should be able to interpret the pattern. Often we work with domain experts, and a pattern that nobody can interpret is of little use.

Data mining can either be data driven or learning driven. You have data and use it to make inferences and strategies about facts, and to derive models and patterns.

The process of automatically identifying models and patterns from massive observational databases:
• Models and patterns can be decision trees, for instance. These are a kind of representation of the knowledge.
• Massive: we are talking about database systems and scalability.
• Observational → we observe data rather than design experiments.

There are three goals in data mining:
• Understand the data: representation, what is in the data, etc.
• Extract knowledge from the data.
• Make predictions about the future: what can happen tomorrow, next week or next year, etc.

1.1 Technological advances

25-30 years ago there was essentially no data mining; in recent years it has become extremely popular and successful. Many companies apply data mining, and the same goes for universities, sports teams, etc. It is also taught in academia. There are two reasons for this popularity.


• Technology has greatly improved.
• Databases and the web mean everyone has data; everyone collects data.

Storage is larger and cheaper
→ Moore's law for magnetic disk density: "capacity doubles every 18 months". Storage cost per byte is falling rapidly. Storage has become larger and cheaper over the years, and there is far more storage capacity on offer. Next, we also observe improvements in computing power: the supercomputer of 15 years ago is about as powerful as a modern desktop. Many large datasets are available today.

Online text sources
• MEDLINE has 19 million published articles.
• Wikipedia has a huge number of articles. People try to analyse Wikipedia.
• Web search engines can index billions of web pages every day, with hundreds of millions of site visitors per day.

Retail transaction data
• eBay, Amazon, Walmart: >100 million transactions per day
• Visa, Mastercard: similar or larger numbers

Scientific uses
• Data collected and stored at GB/hour
• Remote sensors on a satellite
• Microarrays generating gene expression data
• Scientific simulations
• Traditional techniques are infeasible for the raw data
• Data mining helps scientists to classify and segment data, form hypotheses, and find hidden patterns and correlations

Furthermore, data mining is commercially useful. Indeed, data analysis helps companies and stores to develop, to know how they should stock the shelves and what and how much they should sell. Many companies collect and store data:
• Search engines: click data, advertising, automating the search
• Stores: purchase records
• Banks: credit card transactions; they want to be able to catch fraud

Competition is strong and data mining can help:
• Marketing and advertising
• Search
• Inventory. Inventory remains expensive; you don't want to have too much stock, so you should be sure to optimize the inventory level.

Data mining crosses many different disciplines.
• Databases: where large datasets are stored
• Machine learning
• Statistics


• Visualization of the data
• Information retrieval: good retrieval to allow an efficient search of the data
• High-performance computing to ensure the analysis goes fast

The two most similar fields are machine learning and statistics. What is statistics? Statisticians think in terms of models. There is a population, and we take a sample of that population. The first thing you can do is write down a distribution describing the sample. Another thing is to make statistical inference: I have a sample, I formulate a hypothesis, I apply it to the model and then see if it fits the data. There is always a relationship between some target variable and explanatory variables.
In data mining there is usually no hypothesis up front. Statistics is more often used when we have prospective data, whereas data mining focuses on analysis of existing data. It is more algorithm based: data mining designs algorithms that work over the data.
Machine learning is interested in designing algorithms that improve their performance on a task with experience. Performance relates to some metric we want to optimize (accuracy, ROC curve, ...), a task refers to a certain problem, and experience is often data.

1.2 Data mining vs. machine learning

Data mining
• Focuses more on scalability.
• Is much more application focused.
• Is more often used in an industrial setting.

Machine learning
• More theoretical emphasis.
• Term more used in research/academia than data mining.

In terms of scaling up: as the picture shows, we have both a database stored on disk and a main memory. It is very slow to access data stored on disk, whereas main memory is relatively fast to access. Only a small subset of the data can be stored in main memory. Data mining therefore often tries to limit the number of times it has to go to the disk.


2. Data mining tasks

→ There are four tasks involved.

• EXPLORE: try to understand and get to know the data. To do this, we perform several tasks related to statistics, that is to say, define relevant and meaningful summary statistics of the data: number of variables, averages, means, modes, etc. You also try to identify flaws or problems in the data: skew, missing values, visually inspecting the data, classification problems, etc. Exploration is the first thing you do when you receive data.

→ What is wrong with this picture? The data exceeds the maximum range. Whatever the range is, it is exceeded, and this tends to be a problem. Another scale would be more appropriate.


Another example relates to the taxi data from 2014. Guesses for the decline? Uber could be a good reason: people using Uber instead of taxis. Again, we cannot assume a causal relationship here, but it is probably one reason for the decline.

• DESCRIPTIVE MODELING: you try to build models that describe and summarize the data. You can simulate how the data was generated, modelling the process that generated the data. Techniques:
  o Clustering
  o Density estimation / probabilistic models

• PREDICTIVE MODELING

We have some variables X and we want to predict the future value of another variable Y, also called the target variable.
  o Classification: Y is discrete
  o Regression: Y is continuous
  o Probability estimation: Prob(Y = y)

The goal is to explore the relationship between the set of variables X and the single variable Y. You try to approximate a function mapping configurations of X → Y. This is what many machine learning and statistics algorithms do. Often the focus is on accuracy, not comprehensibility: we mainly focus on getting the prediction for Y.
One big application is in politics: people running campaigns make a lot of use of data. Which candidate do you prefer? Who will vote? Who will give money? Who can be persuaded?
Another good application is web search. You have some query, and you want to know which documents are related to the query. How do people solve this? The first attempt was to manually curate directories. The problem is that this works for small data sets but does not scale. The next natural thing to do is to try to match words in the query to words in documents. This is rather challenging, since many documents contain the words of the query. Google's idea was to exploit the web structure: view links as votes, so more incoming links means a more important web page. The intuition behind this is that you can trust what other people say.

• DISCOVERING PATTERNS

The goal is to discover interesting "local" patterns in the data, not to characterize the data globally. The focus is on finding human-interpretable patterns that describe the data. Techniques:
  o Item-set mining
  o Pattern mining
  o Sequence mining

One example would be finding repeated DNA sequences. Another would be product recommendation: finding commonly co-purchased items. Again, this step can be illustrated with sports. The NBA logs all play-by-play information: which players are in the game, shot attempts, etc. The questions are: which line-ups work well? Offensive efficiency? Defensive efficiency?


Machine learning class
• ML has lots and lots of techniques for predictive models: naïve Bayes, decision trees, rule learning, ILP, ...
• Experimental methodology like cross-validation
• Reinforcement learning

Data mining class
• Predictive modelling motivated by applications
• Pattern mining (unsupervised learning)
• Scalability

3. The data mining process

First select data to get the target data. Look for outliers, missing values, etc. Then you do some sort of pre-processing, a key step where you apply transformations to your data, such as feature transformation. Then comes data mining proper, where you find models and patterns. Ideally you end up with knowledge.
In a data mining system, you want something that is computationally sound. You want it to be scalable, without excessive time and space complexity, and you want things to be parallelizable. Next, you want things to be statistically sound, finding meaningful patterns: do our results generalize to new data? You also want the system to be easy to use and its output comprehensible.

3.1 Components of a data mining system

• Representation is about how you actually represent the data. There are two aspects: the way data is represented and the way models are represented.
  o Data is represented using feature vectors, relational databases, free text, images, graphs, etc.
  o Models: decision trees, graphical models, rule sets, association rules, graph patterns, sequential patterns, etc.
• Evaluation can be either subjective or objective
  o Objective: accuracy, precision and recall, cost, utility, speed, etc.


  o Subjective: interesting, novel, actionable
• Search
  o Combinatorial optimization: greedy search
  o Convex optimization: gradient descent, with functions you want to minimize or maximize
  o Constrained search
• Data management
• User interface

To sum up, we live in an age where large amounts of data are commonplace. Data mining is hugely popular and hugely successful because it extracts useful information from this data. Data mining helps people solve problems. The information comes in many forms, like models, patterns, etc. Finally, data mining is challenging!

4. Statistical limits on data mining

→ A common data-mining problem involves discovering unusual events hidden within massive amounts of data.

• Total Information Awareness: the results you obtain all depend on how narrowly you define the activities that you look for.
• Bonferroni's principle: suppose you have a certain amount of data, and you look for events of a certain type within that data. You can expect events of this type to occur even if the data is completely random, and the number of occurrences of these events will grow as the size of the data grows. Calculate the expected number of occurrences of the events you are looking for, on the assumption that the data is random. If this number is significantly larger than the number of real instances you hope to find, then you must expect almost anything you find to be bogus.
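A minimal back-of-the-envelope illustration of Bonferroni's principle; the numbers below are invented purely for illustration, not taken from the course:

```python
# Hypothetical numbers: how many "suspicious" coincidences appear purely by
# chance, versus how many real ones we hope to find?
candidate_events = 10**9      # event combinations examined in the data
p_random_match = 1e-7         # chance a random combination looks "interesting"

expected_bogus = candidate_events * p_random_match   # = 100
expected_real = 10                                    # real events we hope to find

# 100 chance hits vs. 10 real ones: most "discoveries" would be noise,
# so by Bonferroni's principle the search criterion is too loose.
print(expected_bogus, expected_real)
```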

5. Things useful to know

5.1 Importance of words in documents

→ In many applications we will face the problem of categorizing documents (sequences of words) by their topic. Topics are typically identified by finding the special words that characterize documents about that topic.
Classification often starts by looking at documents and finding the significant words in those documents. Our intuition might be that the words appearing most frequently in a document are the most significant, but this is not true. In fact, the several hundred most common words in English ("the" or "and") are often removed from documents before any attempt to classify them. Actually, the indicators of the topic are relatively rare words. But not all rare words are equally useful as indicators.


TF.IDF = Term Frequency times Inverse Document Frequency. This is the formal measure of how concentrated into relatively few documents the occurrences of a given word are.
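The notes do not spell the measure out; one common formulation (as in Mining of Massive Datasets) is:

\[ \mathrm{TF}_{ij} = \frac{f_{ij}}{\max_k f_{kj}}, \qquad \mathrm{IDF}_i = \log_2\frac{N}{n_i}, \qquad \mathrm{TF.IDF}_{ij} = \mathrm{TF}_{ij}\times \mathrm{IDF}_i \]

where f_ij is the number of occurrences of word i in document j, N the total number of documents, and n_i the number of documents containing word i.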

5.2 Hash functions

A hash function h takes a hash-key value as an argument and produces a bucket number as a result. The bucket number is an integer, normally in the range 0 to B-1, where B is the number of buckets. Hash-keys can be of any type.
There is an intuitive property of hash functions: they "randomize" hash-keys. If hash-keys are drawn randomly from a reasonable population of possible hash-keys, then h will send approximately equal numbers of hash-keys to each of the B buckets.
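A minimal sketch of such a bucketing hash function in Python; the modulus trick is one common choice, not prescribed by the notes:

```python
def bucket(hash_key, num_buckets):
    """Map an arbitrary hashable key to a bucket number in 0..B-1."""
    return hash(hash_key) % num_buckets

# Roughly equal numbers of keys should land in each of the B buckets.
counts = [0] * 5
for key in range(10_000):
    counts[bucket(("user", key), 5)] += 1
print(counts)
```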

5.3 Indexes

An index is a data structure that makes it efficient to retrieve objects given the value of one or more elements of those objects. The most common situation is one where the objects are records, and the index is on one of the fields of that record. Given a value v for that field, the index lets us retrieve all the records with value v in that field.
Ex: a file of (name, address, phone) records, and an index on the phone field. Given a phone number, the index allows us to quickly find the record or records with that phone number.


CHAPTER 2: GET TO KNOW YOUR DATA

Analysing data requires:
• Collecting data
• Understanding data
• Putting data into a format such that analysis can be performed

Collecting the data is the most complicated, time consuming, and important part of the analysis: 50-75% of the time is spent on collecting the data. Most of the mistakes arise here, especially in the transformation phase.

1. Retrospective vs. prospective data

Correlation vs. causation
• Retrospective data is about correlation: A and B are related
• Prospective data is about causality: A causes B

There exist two types of data:
• Prospective data: collected as part of a controlled scientific experiment.
• Retrospective or observational data: collected passively or for some other reason.

→ The type of data you have influences the conclusions you can draw from it. With prospective data, you can establish things like causality. With retrospective data it is difficult to establish causality.

PROSPECTIVE DATA
Prospective data comes from the original, traditional, scientific experiment design. We have a hypothesis H we want to test, and we design an experiment, with controls, to test H. We collect data to try to determine the cause, before analysing the results and seeing if they confirm H.
Ex: clinical trials, gene knockout experiments, etc.
→ Very expensive and time consuming, since you need a lot of samples, must divide the data into groups, run trials, etc. Today, relying only on this seems completely outdated.

Data mining is largely about correlation:
• Data passively collected
• Data collected for some other reason
• Data is relatively cheap to collect

→ We can draw stronger conclusions from prospective data, but it is expensive and time consuming to collect.

OBSERVATIONAL DATA
Now we have huge observational data sets. Ex: web logs, customer transactions at retail stores, the human genome, etc.
It makes sense to leverage available data, since it may contain useful information and it is very cheap to collect.


The assumptions of experimental design are violated:
• How can we use such data to do science?
• Can we do model exploration, hypothesis testing?

One exception: the web. It tends to be very easy to experiment on the web, as it has a huge number of users. Amazon, for instance, has a huge number of users, which makes it easy to quickly and cheaply collect data. Amazon applies a recommendation system. Greg Linden's shopping-cart recommendations add items based on what is in the shopping cart. A marketing vice president claimed this might distract people away from checking out, which resulted in a prohibition to continue. He disobeyed, ran a test, and found that not having the recommendations was costing Amazon a lot of money. This illustrates that our intuitions are often wrong. Why should we debate if we can collect data?
Cultural reasons:
• Upton Sinclair: it is difficult to get a man to understand something when his salary depends upon his not understanding it.
• Corollary: do not trust the HiPPO, the Highest Paid Person's Opinion. By definition, these people are probably biased in their opinion.

We have two groups: the treatment group and the control group. We divide our population, usually 50-50, among these groups. The treatment group sees one variant of the website; the control group sees another variant. We then compare the outcomes between the two groups. We could for instance see how much money people spend in each group. This tests for causation, not correlation. If it is done correctly, there are only two things that can explain the change in outcome:
• A vs. B
• Random chance

Everything else affects both variants. Statisticians try to answer whether A or B is better with a formal hypothesis test. Note that this does not solve everything.


Hypothesis testing outline: we define two hypotheses, the null hypothesis and the alternative. H0 states that the original model is better, whereas the alternative Ha states that the variant is better.
We run an experiment to get data about the alternative hypothesis. We then compare how probable the observed outcome is compared to what is expected if the null hypothesis were true.
The p-value is needed to perform the testing. The p-value is the probability of the data, or something more extreme, under the null hypothesis. Usually we use p <= 0.05 to be confident that a difference is statistically significant. The probability of getting the outcome corresponds to the area under the curve.
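A minimal sketch of such a test for an A/B experiment, here a two-proportion z-test on conversion rates; the counts are made up purely for illustration:

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical outcomes: conversions out of visitors for each variant.
conv_a, n_a = 200, 10_000   # control
conv_b, n_b = 260, 10_000   # treatment

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))

z = (p_b - p_a) / se
p_value = norm.sf(abs(z))    # one-sided: is the variant better?
print(z, p_value)            # reject H0 if p_value <= 0.05
```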

Looking at hypothesis testing, what kinds of errors can we make?
• Type I error: rejecting a true null hypothesis, aka a false positive.
• Type II error: not rejecting a false null hypothesis, aka a false negative. The null hypothesis is actually false.

Some key things to think about: the Overall Evaluation Criterion (OEC), the metric we are measuring. It is important to think carefully about this: decide it up front and do not change it. Effect size: the difference in OEC between treatment and control.


→ We have to stay aware of day-of-week and seasonal effects. Beware too of newness effects. Assigning users to groups can be difficult, as we need to ensure that this is:
• Random
• Consistent across platforms
• Free of interactions between different experiments, no correlations
• Fast: people are intolerant to slowness of operations

2. Types of data

→ What can data look like?

• Feature vectors
• Transaction data occurs for instance when visiting and analysing data on Amazon's website. Implicitly you can use it as a sparse representation.
• Relational data: an extremely common sort of data representation, consisting of tables made out of rows and columns. There might be dependencies between tables, and dependencies between rows in a table. Data may be stored in one or several places.


• Semi-structured data: little boxes where we get structured information about someone.
• Graphs: molecules
• Sequential data: DNA. This is one type of ordered data.
• Time series data
• Spatial data: could be locational data, for instance
• Spatio-temporal data

3. Simple descriptive statistics

First get an overall sense of the data, for instance analysing the type of each variable: numerical data, text data. In other words, first look at obvious things like the number of variables, the number of data points, missing values, class skew for prediction problems, etc. Missing values are tricky! Class skew looks at the number of positive and negative instances in the sample, for instance. Next, dive into the individual variables.

• Discrete variables: for a certain variable you could for instance look at the number of values this variable can take, how often each value occurs, whether the values are ordered, and the mode (most common value).
• Continuous variables: look at the mean, sample variance, median, quartiles. This gives an idea about how the data is distributed.
  o Q1: value at position 0.25n
  o Q3: value at position 0.75n
  o Interquartile range: Q3 - Q1

When looking at averages, we are often interested in weighted averages. The mean assumes that each data point is of equal importance, but actually some data may be more important than other data: an estimate can be made more reliable, more representative or more recent by taking the weight of each individual data point into account. A small sketch of these summaries follows.
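A minimal sketch of these summary statistics with NumPy; the toy values and weights are invented:

```python
import numpy as np

x = np.array([2.0, 3.5, 3.5, 4.0, 7.0, 9.0, 12.0])   # toy continuous variable
w = np.array([1.0, 1.0, 2.0, 1.0, 3.0, 1.0, 1.0])    # hypothetical importance weights

mean = x.mean()
variance = x.var(ddof=1)                  # sample variance
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])       # quartiles
iqr = q3 - q1                             # interquartile range

weighted_mean = np.average(x, weights=w)  # weighted average

print(mean, variance, median, q1, q3, iqr, weighted_mean)
```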


Simpson's paradox, or the Yule-Simpson effect, is a paradox in probability and statistics in which a trend appears in different groups of data but disappears or reverses when these groups are combined. It is sometimes given the descriptive title reversal paradox or amalgamation paradox. This result is often encountered in social-science and medical-science statistics, and is particularly confounding when frequency data is unduly given causal interpretations. The paradoxical elements disappear when causal relations are brought into consideration.
The most famous example is the UC Berkeley admissions case. UC Berkeley got sued for gender bias in grad school admissions in the 70s.
• Men's acceptance rate: 44%
• Women's acceptance rate: 35%

It was observed that no department was biased against women; actually, most slightly favoured women. Women tended to apply to competitive programs with lots of applicants and low admit rates.

Describing networks
• Geodesic: shortest_path(n, m)
• Diameter: max(geodesic(n, m)) over actors n, m in the graph
• Density: number of existing edges / all possible edges
• Degree distribution: counting, for each node, how many edges it has
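A minimal sketch of these network summaries using the NetworkX library on a toy graph; the edges are invented:

```python
import networkx as nx
from collections import Counter

# Toy undirected social graph.
G = nx.Graph([("a", "b"), ("b", "c"), ("c", "d"), ("a", "c"), ("d", "e")])

geodesic_a_e = nx.shortest_path_length(G, "a", "e")   # geodesic distance
diameter = nx.diameter(G)                              # longest geodesic
density = nx.density(G)                                # existing / possible edges
degree_distribution = Counter(dict(G.degree()).values())

print(geodesic_a_e, diameter, density, degree_distribution)
```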

4. Visualization

This figure shows four datasets made of simple data. In each dataset, the x variable has a mean of 9 and Y has a mean of 7.5. Each has the same correlation(X, Y): 0.82. They also all have the same linear regression: y = 3 + 0.5x.

Looking at the data, there are many different ways to plot it and represent it visually.


Linear regression lines have been drawn. We observe that each graph is different. The two on the left are more or less the same. The bottom left is more or less a straight line, except for the outlier. On the right, the shape of the distribution is different: the graph on the top right has a curved shape, whereas the bottom right looks like a straight line.
Next, we can look at a box plot of the data. With continuous data, people often use this kind of plot.

Another common visualization is the bar chart. There are different positions, and each position has a count; we can get a good insight into what's going on. Histograms are another way to plot continuous variables: we divide the data into bins, usually of equal width, and then count the number of points in each bin. With histograms, the bin size is important! We can also look at the shape of the distribution.

To conclude: typically, you start the analysis by trying to understand the data. Is it observational or prospective? What is being measured? What does the data look like?


CHAPTER 3: INTRODUCTION TO PREDICTIVE MODELING IN DATA MINING

This chapter is close to machine learning; we will cover some algorithms and see how they come in handy when analysing data.

1. Inductive learning

1.1 Scoring problems

• Given: data S = {(x1, y1), ..., (xn, yn)}
• Learn: a function F: x -> y

→ If Y is discrete, we use classification; if it is continuous, we use regression.

Very classic examples of scoring problems are credit scoring, where you have to predict who is likely to default on a loan; the score is the probability that a person will repay. Another example relates to marketing. Marketing costs a lot of money: you have to mail people, call them, etc. Target marketing allows us to identify the people who are more likely to respond. Here, the score could be the probability of someone responding to the marketing campaign. Finally, churn prediction is another common application, where you rank customers who are likely to switch to another service or drop the service, and then provide incentives to those most likely to leave.

1.1.1 A regression approach

The goal of regression is to learn a function to approximate E[Y|X] (the expected value of Y given X) for each X.

→ We try to estimate the posterior class probabilities = a binary classifier.

Ex: what is the probability that a client responds to my advertising campaign?
The most common model for this is logistic regression. This is a discriminative model for learning the probability of Y given X. We do not care about modelling the individual values of X, i.e. the probability of X. It is a sigmoid applied to a linear function of the data.


2. Features

• Binary variables: for each word, we create a binary variable that is either 0 or 1.
• Discrete variables that take on k values: we use k-1 indicator variables.
• Real-valued variables: used just as is.
• Complex variables, e.g.:
  o Democrat or Independent
  o Word A = present AND word B = present
  o Real-valued transforms such as sin(x) or X1·X2

In logistic regression, we have two possibilities: Y = 0 and Y = 1.

Why pick logistic regression? In many tasks, calibration of the probability estimates is important: an estimated probability should behave like a probability. Ex: if the model says 75% probability of being in the positive class, then in the test set you would expect about 3/4 of the instances with that estimate to be truly positive.


However, not all classifiers are well calibrated. For instance, naïve Bayes is not, whereas logistic regression is well calibrated. If you look at the probabilities provided by naïve Bayes, you do not get good probabilities.

3. Calibration check

Perform the check on validation data: divide the predicted probability estimates into multiple bins. On the Y axis, plot for each bin the number of positive examples in the bin divided by the total number of examples in the bin. The X axis represents the predicted probability. A well-calibrated model remains close to the diagonal.
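A minimal sketch of such a calibration check; equal-width binning is one straightforward way to do it, and the variable names are ours:

```python
import numpy as np

def calibration_curve(y_true, y_prob, n_bins=10):
    """For each probability bin, return (mean predicted prob, observed fraction of positives)."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    points = []
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (y_prob >= lo) & (y_prob < hi)
        if mask.any():
            points.append((y_prob[mask].mean(), y_true[mask].mean()))
    return points

# Toy validation data: a well-calibrated model stays close to the diagonal.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.35, 0.55, 0.6, 0.4, 0.8, 0.85, 0.9])
print(calibration_curve(y_true, y_prob, n_bins=5))
```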

4. Bayes rule & naïve Bayes classifiers

Let's consider a supervised learning problem in which we wish to approximate an unknown target function f: X → Y, or equivalently P(Y|X). We assume Y is a Boolean-valued random variable, and X is a vector containing n Boolean attributes. Applying Bayes rule, we see that:
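The equation itself did not survive into these notes; the standard form of Bayes rule for this setting is:

\[ P(Y = y_k \mid X = x) \;=\; \frac{P(X = x \mid Y = y_k)\, P(Y = y_k)}{\sum_j P(X = x \mid Y = y_j)\, P(Y = y_j)} \]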

→ There is an intractable sample complexity for learning Bayesian classifiers in general; hence, we must look for ways to reduce this complexity. The naïve Bayes classifier does this by making a conditional independence assumption that dramatically reduces the number of parameters to be estimated when modelling P(X|Y).
The naïve Bayes classifier assumes all attributes describing X are conditionally independent given Y. This dramatically reduces the number of parameters that must be estimated to learn the classifier. Given three sets of random variables X, Y and Z, we say that X is conditionally independent of Y given Z if and only if the probability distribution governing X is independent of the value of Y given Z.

The naïve Bayes algorithm is a classification algorithm based on Bayes rule and a set of conditional independence assumptions. Given the goal of learning P(Y|X) where X = ⟨X1, ..., Xn⟩, the naïve Bayes algorithm makes the assumption that each Xi is conditionally independent of each of the other Xk's given Y, and also independent of each subset of the other Xk's given Y.


This assumption dramatically simplifies the representation of P(X|Y) and the problem of estimating it from the training data.
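Under this assumption the classifier (not written out in the notes) reduces to the familiar decision rule:

\[ y^{\ast} \;=\; \arg\max_{y_k} \; P(Y = y_k) \prod_{i=1}^{n} P(X_i \mid Y = y_k) \]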

5. Logistic regression, training task

Logistic regression is an approach to learning functions of the form f: X → Y, or P(Y|X), in the case where Y is discrete-valued and X = (X1, ..., Xn) is any vector containing discrete or continuous variables. Logistic regression assumes a parametric form for the distribution P(Y|X), then directly estimates its parameters from the training data. The parametric model assumed by logistic regression in the case where Y is Boolean is:
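The two equations were lost from the notes; one common parameterization (the sign convention varies between texts) is:

\[ P(Y=1\mid X) = \frac{1}{1+\exp\!\big(w_0 + \sum_{i=1}^{n} w_i X_i\big)}, \qquad P(Y=0\mid X) = \frac{\exp\!\big(w_0 + \sum_{i=1}^{n} w_i X_i\big)}{1+\exp\!\big(w_0 + \sum_{i=1}^{n} w_i X_i\big)} \]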

Both equations must sum to 1.

Given: data D = ((x1, y1), ..., (xn, yn)).
We want to learn the weights that maximize the conditional likelihood P(Y|X). In other words, the weights should push the predicted probability towards zero for negative examples and towards 1 for positive examples. The good news is that this function is concave, which means it is easy to optimize. However, there is no closed-form solution, which is the bad news, so the optimization is done iteratively, e.g. by gradient ascent (see the sketch below).
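A minimal sketch of that training loop, maximizing the conditional log-likelihood by gradient ascent; this is a plain NumPy illustration, not the course's reference implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """X: (n_examples, n_features), y: 0/1 labels. Returns learned weights."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend intercept term
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X @ w)               # P(Y=1 | x) for every example
        gradient = X.T @ (y - p)         # gradient of the log-likelihood
        w += lr * gradient / len(y)      # gradient ascent step
    return w

# Toy data: a single feature that is predictive of the label.
X = np.array([[0.1], [0.4], [0.5], [0.9], [1.2], [1.5]])
y = np.array([0, 0, 0, 1, 1, 1])
print(train_logistic_regression(X, y))
```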


→ Logistic regression is often referred to as a discriminative classifier because we can view the distribution P(Y|X) as directly discriminating the value of the target variable Y for any given instance X.

6. Challenge problem

Ex: say you try to predict flu outbreaks. Your model will help to predict where and when flu occurs. What data could we use? What features would we look at? What else would be important in the prediction? We could for example get data from a pharmacy, as people often go buy medicine when they feel sick. When we get sick, we also tweet about it, post on Facebook, or search the internet for more information; hence, social media data could be interesting too. However, social media may be biased. Nevertheless, looking at social media provides us with large amounts of data. Logistic regression could be a useful tool for this prediction.
Ex: influenza results in 250,000-500,000 deaths in the world each year. Quickly detecting outbreaks can reduce risk. Currently, the Centers for Disease Control (CDC) track this by aggregating counts of people with flu-like illnesses: you go to the doctor, who makes a guess and sends it to the CDC, which aggregates it. However, this tends to be slow and not timely enough, because data needs to be sent and then aggregated. The idea now is to exploit query logs. The goal is to predict the proportion of doctor visits that are flu-related, as calculated by the CDC. The data consists of historical Google search data. A possible model is logistic regression. Features: the queries most correlated to the target variable, to try to select the right features.

There are several guesses about the drop. Flu is correlated with the flu season, but other queries are correlated with the seasons too.

Oftentimes, there is a variable not used in the analysis that is correlated with both the target variable Y and the predictor variables X. In this example, it is the season. In causal studies, these are called confounding variables. This occurs often, watch out for it!


Another example is credit scoring, where a person applies for a loan. The application is either rejected or accepted. However, some people accepted for a loan may default. The goal is to predict, for those granted a loan, who will default on it. Traditionally loans were granted by experts, and there was scepticism that automation could be better than experts. Hence, it was first adopted in credit-card approvals; later it was broadly adopted in home loans, etc. Now it is widely accepted and used by almost all banks, credit-granting agencies, etc.
Customer-provided data: age, address, income, job, number of credit cards, savings, etc. We also get internal data about the customer: how long they have been with the bank, previous loans, etc. We also buy external customer-level data like credit reports and legal proceedings. Macro-level data comes in handy too: demographics (postcode, for instance).

→ This is challenging, as data might be incorrectly entered. People also deliberately give false information, entering a higher or lower income than the true one. Legal issues may occur as well: it is illegal to use race, colour, religion, national origin, sex, marital status, or age in the decision to grant credit.

The approach taken relies on logistic regression and decision trees. We want to make sure that the data is relevant and fits the customers; we need access to relevant variables for these customers. Defining labels remains a tricky problem.

Practical issues

• Cost/benefit analysis
  o Savings of employing the system: small gains in accuracy may be very valuable.
  o Costs of developing, testing, maintaining, checking legal requirements, etc.
• Other points:
  o Set the threshold: when should loans be granted?
  o Override: bank employees can ignore the system
  o When is a new model needed?
  o Continuous evaluation

Another place where logistic regression is used is in suggesting Facebook friends. How does Facebook come up with people we may know? The first insight is that most suggestions come from friends of friends. The problem is that there are lots of candidates: the average user has 130 friends, which gives about 17,000 friend-of-friend candidates. Hence, the trick is to define lots of features, exploiting the right data. You can think of things like the number of friends in common, whether a good friend of yours just friended this person, demographics (age, gender, etc.). A model is then built to predict whether a user will click on a given suggestion: what is the probability that the user clicks on it? For this, a logistic regression (plus a decision tree) is used to produce a ranking.
One final application of logistic regression is the prediction of CTR for advertising. When you use Google, you enter a keyword.


On the right you have ad slots; the other links are the algorithmic search results. Every time someone clicks on an ad, you get a certain amount of money.
Advertisers
• Bid x$ per keyword w
• Pay x$ every time someone clicks on the ad
• Give a maximum budget

→ Someone searches on a keyword w: which ad should the search engine display? The naïve solution is to just return the ad with the highest bid on w. Google's insight: expected profit = x$ * click-through rate (CTR). Display ads based on this product.
To sum up, we often want to assign a score to examples in prediction problems. Logistic regression is a workhorse in practice for this type of problem, and there are many relevant industry applications.


CHAPTER 4: RECOMMENDING PRODUCTS

→ How would you recommend products? What kind of data do we need? How do you collect it? We could for instance look at products that are usually sold together. We could also look at everything the customer has bought, and cluster customers. Another option could be to look at similar customer profiles.

There is an extensive class of web applications that involve predicting user responses to options. Such a facility is called a recommendation system.
Ex: offering news articles to online newspaper readers, based on a prediction of reader interest; offering customers of an online retailer suggestions about what they might like to buy, based on their past history of purchases or product searches.

1. Problem overview

When we think about the economics of the traditional retail market, space is a scarce and expensive commodity.
• Retailers: physical. You need space to rent. Physical delivery systems are characterized by a scarcity of resources.
• TV networks: time
• Movie theatres: space and time
• Brick-and-mortar stores have limited shelf space and can show the customer only a small fraction of all the choices that exist. In contrast, online stores can make anything that exists available to the customer.

→ People are not willing to travel far for products (buying, movies, etc.), so you need a local customer base. Implication: focus on popular products.

On the other hand, if you have an online economy, for instance on the web, storage space is cheap; it is much cheaper to store things on the web. Next, sites cater to everyone: you are not restricted to people near you, you can reach people all around the world. Implication: low cost and easy access, which means it is possible to offer more choice.

→ Problem: how can we find products given a huge number of choices? Solution: systems that can recommend products, particularly unusual or unpopular ones.

Ranking of products and sales:
• Physical stores: they can stock physical products, a limited offering.
• Mixed: Amazon, on the internet but they also have physical retailers.
• Online only: purely online. Ex: iTunes


The distinction between the physical and online worlds has been called the long tail phenomenon. Items are ordered on the horizontal axis according to their popularity. Physical institutions provide only the most popular items to the left of the vertical line, while the corresponding online institutions provide the entire range of items: the tail as well as the popular items.
Making recommendations. Ex: what do I want to watch tonight? Let's look online. What website could I use? Based on the viewed items, the system will recommend me a product.

An important concept in recommendation-system applications is the utility matrix. There are two classes of entities, which we shall refer to as users and items. Users have preferences for certain items, and these preferences must be teased out of the data. The data itself is represented as a utility matrix, giving for each user-item pair a value that represents what is known about the degree of preference of that user for that item.

Without a utility matrix, it is almost impossible to recommend items. However, acquiring data from which to build a utility matrix is often difficult. There are two general approaches to discovering the values users place on items:
• Ask users to rate items
• Make inferences from users' behaviour

→ The goal of a recommendation system is to predict the blanks in the utility matrix. It is not necessary to predict every blank, but only to discover some entries in each row that are likely to be high.
→ Find a large subset of those with the highest ratings.

1.1 Recommendation types

There exist loads of recommendation types.
• Editorial: these are lists of favourites. Ex: newspapers, travel magazines. News services have attempted to identify articles of interest to readers, based on the articles that they have read in the past.


This might be based on the similarity of important words in the documents, or on the articles read by people with similar reading tastes. You can also have lists of essential items.
• Aggregates
  o Top-10 lists
  o Most emailed articles
  o Most recent posts
• Personalized user recommendations: recommendations based on your preferences. Perhaps the most important use of recommendation systems is at online retailers.
  o Amazon
  o Movie sites (Netflix): Netflix offers its customers recommendations of movies they might like. These are based on ratings provided by users.

1.2 Challenges

There are 3 key challenges:
• How do we get user feedback?
• How do we predict an unknown rating? You want to be able to get this right.
• How do we evaluate predictions?

How could we obtain ratings? There are many ways. I want to predict what a specific user will like, based on his preferences.
• Explicit rating of products
• Bought items
• Items on "wish lists"
• Recently clicked product pages/links
• Length of time spent on a product page
• Printed links
• Etc.

Problem: predict ratings
• Given: information about a user's preferences, interests, likes/dislikes
• Predict: will a specific user like a given product, service, etc.?

Challenges:
• Sparse data: most users rate very few items
• Cold start: new items have no ratings

Main paradigms:


• Content-based filtering
• Collaborative filtering
• Hybrid: use both prior approaches

2. Content-based filtering

Idea: focus on properties of items. Similarity of items is determined by measuring the similarity in their properties. This is a machine learning problem, so we can just apply the basic standard techniques seen before. We can easily recommend items to a customer based on profiles of the past: we try to find items that are similar to what the customer has visited before, similar to previous items rated highly by the customer.

So we need to define a number of features and obtain data from the users. We then apply any machine learning algorithm to this data. Ex: for every movie, I construct features (actors, genre, director, ...) and then try to predict whether the user will like the movie. A key challenge is building the item profile. An item profile is a way to describe each item. Usually these are hand-crafted: you have to spend a lot of time building this profile, and for such tasks it is typically not done in an automated way. Ex: for a news article: words, title, author, etc.


Pros
• Only need data about one user; no need to have data about millions of users.
• Results in a more personalized approach (Ex: good if the user has unique taste).
• Can more easily recommend new/unpopular items.
• Can provide context for a recommendation (Ex: the path down a decision tree).
• Lots of research on classification algorithms: every time there is a new one, I can use it instead of the old one.

Cons
• Need to construct meaningful features that are predictive of the user's preferences (Ex: lots of hand-coding, guess & check).
• Never recommends items outside the user's content profile.
• Hard to build a profile for a new user: a sufficient number of examples is needed in order to recommend something.
• Ignores information about other users.

→ This is a supervised classification problem.

We not only need to create vectors describing items; we need to create vectors with the same components that describe the user's preferences; the utility matrix represents the connection between users and items. The best estimate we can make regarding which items the user likes is some aggregation of the profiles of those items.
A completely different approach to a recommendation system using item profiles and utility matrices is to treat the problem as one of machine learning. Regard the given data as a training set, and for each user build a classifier that predicts the rating of all items. The construction of a decision tree might be handy. Nevertheless, it requires the selection of a predicate for each interior node, and there are many ways of picking the best predicate. Once the predicate has been chosen, we divide the items into two groups: those that satisfy the predicate and those that do not.

3. Collaborative filtering

Idea: find users with similar tastes and recommend products they liked. The big insight is that we can do this just by looking at ratings, and use these ratings to make the predictions; there is no need to look at features. So, the big idea is to find other users whose ratings are similar to the current user's, and then propagate their (dis)likes to the current user.


We have a standard matrix where each row corresponds to a user, so we have loads of users. Each column is a product. The entries are the ratings. There are many users, many products, and most of the entries will be empty. Very sparse! Most of the entries are missing here.

So, we have our database and an active user, and we want to find similarities between the current user and all the other users in the database. For instance, Bob and Eve have the same ratings, and these ratings are different from Alice's, Chris's, etc. We then try to find out how strong the similarity is. The algorithm works in four steps (a small sketch of steps 1 and 3 follows the list).

• STEP 1: Measure the similarity between the user of interest and all other users. The question is: how can we compute similarity between two users? There are loads of similarity metrics. I have two users, I know what the first is interested in, and I should make recommendations for the second one.

A naïve approach treats each user's ratings as a vector and therefore treats missing ratings as zero. Treating a missing rating as zero is wrong, since a missing rating is clearly not a zero rating. The Pearson correlation is a possible solution.

The key thing to remember with the Pearson correlation is that it only considers items k rated by both users. This is just the standard correlation we could use.


If this correlation equals 1, there is a perfect linear relationship with both increasing in the same direction, whereas a correlation equal to -1 means that when one rating increases, the other one decreases.

• STEP 2 (optional): select a smaller subset consisting of the most similar users. You could use the whole database, but there would be loads of people with no correlation; that's why we ignore "far away" users and pick only the k nearest users. We could also decide to include all users above a predetermined weight threshold. Note: any k-NN strategy could be used here.

• STEP 3: Predict ratings as a weighted combination of the "nearest neighbours". Now we want to predict ratings. One big problem here is that people have different rating scales: one user might use 5 stars, while others may use only 3. We therefore first compute each user's average rating, and then combine the neighbours' mean-centred ratings, weighted by similarity (see the sketch below).
One potential problem here is that correlations based on very few co-rated items may be inaccurate. A simple solution is to adjust the Pearson correlation based on the number of co-rated items: take some threshold and shrink the correlation accordingly.

• STEP 4: Return the highest-rated items. We don't want to recommend items with a low predicted rating. Also, we don't want to overload the users by returning, for instance, thousands of possible products. We usually return the top 2, 5 or 10 items; it is then possible to give the user the option to view more items.
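A minimal sketch of steps 1 and 3: Pearson similarity over co-rated items, then a similarity-weighted, mean-centred prediction. The dictionary-of-dictionaries ratings and the absence of shrinkage are our simplifications:

```python
import numpy as np

ratings = {                      # toy utility "matrix": user -> {item: rating}
    "alice": {"m1": 5, "m2": 3, "m3": 4},
    "bob":   {"m1": 4, "m2": 2, "m4": 5},
    "eve":   {"m2": 1, "m3": 2, "m4": 4},
}

def pearson(u, v):
    """Pearson correlation computed only over items rated by both users."""
    common = set(ratings[u]) & set(ratings[v])
    if len(common) < 2:
        return 0.0
    ru = np.array([ratings[u][i] for i in common], dtype=float)
    rv = np.array([ratings[v][i] for i in common], dtype=float)
    ru -= ru.mean()
    rv -= rv.mean()
    denom = np.sqrt((ru ** 2).sum() * (rv ** 2).sum())
    return float(ru @ rv / denom) if denom else 0.0

def predict(user, item):
    """Mean-centred, similarity-weighted average of the neighbours' ratings."""
    user_mean = np.mean(list(ratings[user].values()))
    num = den = 0.0
    for other in ratings:
        if other != user and item in ratings[other]:
            sim = pearson(user, other)
            other_mean = np.mean(list(ratings[other].values()))
            num += sim * (ratings[other][item] - other_mean)
            den += abs(sim)
    return user_mean + num / den if den else user_mean

print(predict("alice", "m4"))   # predicted rating for an unseen item
```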

→ Could you employ the basic collaborative filtering algorithm between pairs of items instead of pairs of users? Would this work?


→ An example is provided in the slides. Given a user-movie pair, I want to predict the rating. The big practical issue is that we have large datasets: we could have millions of users and hundreds of thousands of items. The key issue is how to efficiently find similar items, since we want to compute pairwise similarity between users.
• Pairwise similarity for 1M users: roughly 5 days
• Pairwise similarity for 10M users: roughly 1 year!

The trick is to use locality-sensitive hashing.

Pros
• Simple and intuitive approach
• Works for any kind of item
  o No feature design
  o No feature selection
• Works between pairs of users
• Exploits information about other users/items

Cons
• Data sparsity: even with lots of data, it is hard to find users that rated the same items
• Cold start: needs enough users in the database
• First-rater problem: can't recommend unrated items
  o New products
  o Unique items
• Popularity bias: favours items that lots of people like (i.e., bad if you have unique taste)

4. Evaluation

→ How do we evaluate these algorithms? Here, we have a huge rating matrix. We hold out the most recent ratings from the matrix and then make predictions on these most recent ratings.

The two most typical metrics are the root-mean-square error (RMSE) and Spearman's rank correlation.
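For reference, the RMSE over the held-out test ratings is (standard definition, not spelled out in the notes):

\[ \mathrm{RMSE} \;=\; \sqrt{\frac{1}{|T|}\sum_{(u,i)\in T}\big(\hat{r}_{ui} - r_{ui}\big)^2} \]

where T is the set of held-out user-item pairs, r_ui the true rating and r̂_ui the predicted rating.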


Assume 0/1 data
• Coverage: number of items/users for which the system can make a prediction. You want to be able to make as many predictions as possible.
• Precision: accuracy of the predicted items
• Receiver operating characteristic (ROC) curve
  o False positive rate
  o True positive rate

→ Weaknesses of these metrics
• Focusing on accuracy misses important points
  o Prediction diversity
  o Prediction context
  o Order of predictions
• Only high ratings matter: RMSE might penalize a method that does well for high ratings and badly for others.

5. Case study: Netflix challenge

Netflix is a monthly subscription service providing movies and TV. The original model was a mail-order DVD service. People could buy digital rights to many movies. It was also possible to rank movies of interest and receive them in the mail based on the movies you wanted to see. Today, things have changed and we face a new model: streaming online content, capitalizing on "binge watching" and creating their own content.
In 2009, Netflix had:
• 100,000 movies
• 10 million customers
• mostly US-based

Currently it tends to be very international: it was first based in the US, now it is very international.
• The content depends on the region.


• In the US: ~5,000 movies and ~2,000 TV shows.
• 75 million subscribers

They wanted to have a recommender system. The idea was to create a competition for 1 million dollars. That's why, in 2006, they announced the Netflix Prize, a machine learning and data mining competition for movie rating prediction. $1 million was offered to whoever improved the accuracy of the existing system, called Cinematch. Proposed algorithms were tested on their ability to predict the ratings in a secret remainder of the larger dataset. A team came up with a final combination of 107 algorithms. To put these algorithms to use, Netflix had to work to overcome some limitations, for instance that they were built to handle 100 million ratings, instead of the more than 5 billion that they have, and that they were not built to adapt as members added more ratings.

Given: a training set of 100 million ratings.
Do: build a recommendation system that improves the root-mean-square error by 10% over Netflix's system.

• Provided 100 million ratings
  o Matrix is 99% sparse
  o 480,000 users
  o 17,700 movies
• Ratings include
  o User
  o Movie
  o Rating
  o Time-stamp

→ Why sponsor a challenge? Netflix only makes money if they keep customers. Cinematch, their internal system, was expensive to develop and made slow progress. Next, publicity is good and the prize would pay for itself.

→ What is there to lose?
• Negative publicity
  o Concerns about user privacy (i.e., user backlash to the data release)
  o Prize won too quickly
  o No one wins the prize
• Time and effort associated with running the competition (setting up servers)
• Results not useful in practice (e.g., too slow)


Evaluation
→ Why was the test set split? To avoid overfitting.

The competition began in October of 2006. There were many more teams than anticipated, eventually around 40,000. There was lots and lots of initial progress, and within weeks a 1% improvement over the baseline system was achieved.

There are four important big ideas:
• Modelling global and local biases
• Latent factor models
• Modelling temporal dynamics
• Try lots and lots of different models and combine the output

5.1 Modelling global and local biases

We have two components:
• User mean rating
• User-movie effect

The better baseline rating takes the mean rating and divides it into three components: the overall mean, to which we add a user bias and a movie bias (see the formula below).

• User bias: how much higher or lower this user rates compared to other users. Each user has a different rating scale, so what a number means depends on the individual user. Other things affect the user bias, like the values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts).


Ex: ask people to write down their cell phone number and then ask "how many countries are there in Africa?"; people will respond based on the previously given number (anchoring).

• Movie bias: how much higher or lower this movie's ratings are compared to other movies. Relevant factors include the (recent) popularity of movie i, selection bias, and the number of ratings a user gave on the same day ("frequency"). By looking at these we get indications and expectations.

→ Capturing the global effects.
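Written out (our notation; the notes only describe it in words), the baseline estimate for user u and movie i is:

\[ b_{ui} \;=\; \mu + b_u + b_i \]

where μ is the overall mean rating, b_u the user bias and b_i the movie bias; the predicted rating is then this baseline plus the local (neighbourhood) correction.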

→ Instead of using raw similarities, we replace them by learned weights.


We can actually train the weights to optimize the objective function, that is to say, minimize this objective!
• Take an initial guess for the weights
• Take the derivative of the function
• Update the values of the weights in the descent direction
• Repeat until convergence
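A minimal sketch of that loop for a generic differentiable objective; the quadratic objective below is only a stand-in for the real interpolation-weight objective:

```python
import numpy as np

def gradient_descent(grad, w0, lr=0.01, n_iters=2000):
    """Generic gradient descent: repeatedly step against the gradient."""
    w = np.array(w0, dtype=float)
    for _ in range(n_iters):
        w -= lr * grad(w)        # update weights in the descent direction
    return w

# Stand-in objective f(w) = ||A w - b||^2 with gradient 2 A^T (A w - b).
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])

def grad(w):
    return 2 * A.T @ (A @ w - b)

print(gradient_descent(grad, w0=[0.0, 0.0]))   # converges towards [0, 0.5]
```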


In 2007, the data appeared as the challenge task at KDD and there was an associated workshop. The winners of the progress prize were from BellKor, with an 8.4% improvement (Yehuda Koren, Bob Bell, Chris Volinsky, AT&T Research).

5.2 Latent factor models

Idea: factorize the rating matrix and decompose it into interesting, smaller parts. Topics (latent factors) capture shared hidden structure. You want the number of topics to be as small as possible.

→ Examples are provided in the slides.

Solve via an optimization problem.
• Challenges:
  o Sparse matrix with many missing entries
  o Slow to compute the gradient
  o Need to avoid overfitting
• Solutions:
  o Solve least squares over just the observed data
  o Stochastic gradient descent
  o Shrink towards the mean if there is little data (regularization)
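A minimal sketch of such a latent factor model trained with stochastic gradient descent on the observed entries only, with shrinkage towards zero as regularization; the hyperparameters and toy data are arbitrary:

```python
import numpy as np

# Toy observed ratings: (user, item, rating) triples; everything else is missing.
observed = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
n_users, n_items, n_factors = 3, 3, 2

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, n_factors))   # user factors
Q = rng.normal(scale=0.1, size=(n_items, n_factors))   # item factors

lr, reg = 0.05, 0.02
for _ in range(500):                     # SGD epochs over the observed entries
    for u, i, r in observed:
        err = r - P[u] @ Q[i]            # error on this single rating
        P[u] += lr * (err * Q[i] - reg * P[u])   # regularized SGD updates
        Q[i] += lr * (err * P[u] - reg * Q[i])

print(P @ Q.T)   # reconstructed (and filled-in) rating matrix
```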

In 2008, progress slowed down and many teams dropped out due to the time needed to compete. Many papers were published on the task. The leaders had a 9.4% improvement.


5.3 Modelling temporal dynamics

In 2004, there was a considerable jump in the ratings people assigned to movies. This could be due to improvements of Cinematch: Netflix improved its matching, leading to higher ratings? Another explanation could be that people became biased towards higher ratings: the meaning of a rating changed?

5.4 Try lots and lots of different models and combine the output

People were getting desperate: they tried lots of things, but still had not reached the 10% threshold.
Idea: the "kitchen sink approach". Build lots and lots of predictors:
• Classifiers
• Collaborative filters
• Ensembles
Then come up with clever ways to blend the results.
At the end of June 2009, the leading team submitted results that exceeded the 10% threshold. The competition entered the 30-day final period.


A new "ensemble" team was formed from a collaboration of others near the top of the leaderboard, and quickly beat the 10% threshold too. The race was on...

Take-away messages
• Knowing your data is crucial
• Must account for unpredictable user behaviour
• Discovering hidden structure can help
• When in doubt, combine lots of models

The competition served as publicity (and money) for the winning team and its members. Netflix received much attention and valuable new algorithms.
Data mining / machine learning:
• Excitement and publicity for the field
• Interesting new publications
• A valuable dataset

To sum up, recommender systems are an important and active area of research and use. Three paradigms:
• Content-based: based on designing features and applying classification/regression algorithms
• Collaborative: based on comparing users/items (aka nearest neighbour)
• Hybrid: blends both approaches
→ The Netflix challenge was interesting and successful.

One of the reasons their focus in the recommendation algorithms has changed is that Netflix as a whole has changed dramatically in the last few years. Netflix launched an instant streaming service in 2007. Streaming members are looking for something great to watch right now; they can sample a few videos before settling on one, they can consume several in one session, and statistics are available such as whether a video was watched fully or partially. Netflix adapted its personalization algorithms to this new scenario in such a way that now 75% of what people watch comes from some sort of recommendation. They reached this point by continuously optimizing the member experience, and have measured significant gains in member satisfaction whenever they improved the personalization for their members.
Over the years, Netflix discovered that there is tremendous value to their subscribers in incorporating recommendations to personalize as much of Netflix as possible. Personalization starts on their homepage, consisting of groups of videos arranged in horizontal rows. Ex: the Top 10 row: the best guess at the ten titles people are most likely to enjoy.


An important element is awareness: Netflix wants members to be aware of how the service is adapting to their tastes. This encourages members to give feedback that will result in better recommendations.
Next, recommendations are partly based on our friends. Knowing about your friends not only gives Netflix another signal to use in the personalization algorithms, but it also allows for different rows that rely mostly on the social circle to generate recommendations.
Similarity is another important source of personalization in their service. Similarity can be thought of in a very broad sense: it can be between movies or between members, and can be in multiple dimensions such as metadata, ratings, or viewing data.
In most of the previous contexts, be it in the Top 10 row, the genres, or the similars, ranking (the choice of what order to place the items in a row) is critical in providing an effective personalized experience. The goal of Netflix's ranking system is to find the best possible ordering of a set of items for a member, within a specific context, in real time. Their business objective is to maximize member satisfaction and month-to-month subscription retention, which correlates well with maximizing consumption of video content.


CHAPTER 5: MODEL ENSEMBLES

1. Motivation and overview

An ensemble of classifiers is a set of classifiers whose individual decisions are combined in some way to classify new examples.
One good learner produces one effective classifier: could learning many classifiers help?
The main discovery is that ensembles are often much more accurate than the individual classifiers that make them up. An ensemble can be more accurate than its component classifiers only if the individual classifiers disagree with one another.
Human ensembles are demonstrably better. Ex: how many jelly beans are in the jar? Individual estimates vs. the group average. Ex: Who Wants to Be a Millionaire: expert friend vs. audience vote.
→ If classifiers make INDEPENDENT mistakes, then h* (the combined hypothesis) is more accurate, because the probability that they all make a mistake is low.

Let’s say we assume independent errors (30%) and a majority vote. We have 21classifierswhichallhavethesameerrorrate.

Areaundercurvefor>=11wrongis0.026.

ð Orderofmagnitudeimprovement!
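As a quick check, a minimal sketch of that number (a toy calculation, not course code): with 21 independent classifiers that each err 30% of the time, the majority vote errs only when at least 11 of them are wrong.

    from math import comb

    n, p = 21, 0.30
    # Majority vote of 21 classifiers is wrong when at least 11 of them are wrong.
    p_majority_wrong = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(11, n + 1))
    print(round(p_majority_wrong, 3))  # ~0.026, versus 0.30 for a single classifier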

How to generate the base classifiers?
• Different learners?
• Bootstrap samples?
• Etc.


How to integrate/combine them?

• Average
• Weighted average
• Instance-specific decisions
• Etc.

Ensemble approaches

Sample the dataset: create replicates in some way, or reweight the examples in some way.

• Bagging
• Boosting

Manipulate features
• Input features: for instance, I can only show a subset of them, train one classifier first, then another, etc.
• Target features: create a new classifier by grouping together some labels.

Add randomness
• Data
• Algorithm

Stacking

2. Sampling-based approaches

This first method manipulates the training examples to generate multiple hypotheses. The learning algorithm is run several times, each time with a different subset of the training examples. This works especially well for unstable learning algorithms. The key idea here has to do with the stability of an algorithm. Unstable learner: minor variations in the training data result in major changes in the classifier output, i.e. lead to a different model.

• Unstable: decision trees, neural networks, rule learning algorithms
• Stable: linear regression, nearest neighbour, linear threshold algorithms, etc.

⇒ Subsampling is best for unstable learners

• Bagging
• Boosting
• Cross-validated committees

2.1 Bagging: Bootstrap Aggregating

Given: dataset S, integer T
For i = 1, ..., T:
    Si = bootstrap replicate of S (i.e., sample with replacement)
    hi = apply learning algorithm to Si


Classify a test instance using an unweighted vote.

Draw |data set| examples with replacement. Each sample contains about 63.2% of the original examples (plus duplicates).

Bagging is the most straightforward way of manipulating the training set. On each run, bagging presents the learning algorithm with a training set that consists of a sample of m training examples drawn randomly with replacement from the original training set of m items.
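A minimal sketch of this procedure (the learner interface learn(data) and the model's .predict(x) method are hypothetical placeholders, not course code):

    import random
    from collections import Counter

    def bagging(data, learn, T):
        """Train T classifiers, each on a bootstrap replicate (sample with replacement)."""
        models = []
        for _ in range(T):
            replicate = [random.choice(data) for _ in range(len(data))]  # ~63.2% unique
            models.append(learn(replicate))
        return models

    def bagged_classify(models, x):
        """Classify a test instance with an unweighted majority vote."""
        votes = Counter(m.predict(x) for m in models)
        return votes.most_common(1)[0][0]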

2.2 AdaBoost (Adaptive Boosting)

• Idea 1: Assign weights to examples, and iteratively change the weights.
  Decrease the weight of correctly labelled examples.
  Increase the weight of incorrectly labelled examples => focus attention on misclassified examples.
  The algorithm wants to decrease the focus on examples that are already well classified and make sure it gets the misclassified instances right.

• Idea 2: Assign a weight to each learned hypothesis based on how accurate it is. Use a weighted vote to label new examples.
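A minimal sketch of these two ideas for labels in {-1, +1} (weak_learn is a hypothetical weak-learner interface; the update shown is the standard AdaBoost rule, not code from the course):

    import math

    def adaboost(data, labels, weak_learn, T):
        n = len(data)
        w = [1.0 / n] * n                              # Idea 1: one weight per example
        hypotheses, alphas = [], []
        for _ in range(T):
            h = weak_learn(data, labels, w)            # weak learner sees the weights
            err = sum(wi for wi, x, y in zip(w, data, labels) if h(x) != y)
            alpha = 0.5 * math.log((1 - err) / err)    # Idea 2: weight of this hypothesis (0 < err < 0.5 assumed)
            # Up-weight mistakes, down-weight correct examples, then renormalize.
            w = [wi * math.exp(-alpha * y * h(x)) for wi, x, y in zip(w, data, labels)]
            total = sum(w)
            w = [wi / total for wi in w]
            hypotheses.append(h)
            alphas.append(alpha)
        return hypotheses, alphas

    def weighted_vote(hypotheses, alphas, x):
        return 1 if sum(a * h(x) for h, a in zip(hypotheses, alphas)) >= 0 else -1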


Assume that we are going to make one axis-parallel cut through the feature space.

We have 3 misclassified instances. What we want to do is assign a higher weight to these 3 misclassified examples. In brief, we up-weight the mistakes and we down-weight everything else.

⇒ How will the number of rounds affect generalization?

Expect:
• Training error to drop or reach 0
• Test error to increase when h* becomes too complex: "Occam's razor" (i.e. overfitting)
• Hard to know when to stop training

However, often the test error does not increase, even after 100 rounds. The test error continues to drop even after the training error is 0! Occam's razor ("simpler is better") appears not to apply!

The key idea here is called margins. The training error only measures whether classifications are right or wrong, but it should also consider the confidence of classifications.


Margins try to exploit the confidence. H* is a weighted majority vote of weak classifiers, each of which is better than 50-50. We can measure the confidence by the margin: take the weighted vote and calculate how much weight is on the positive case and how much on the negative case.
→ margin = (weighted vote for +) − (weighted vote for −)

It turns out that boosting really tries to build up a classifier with a high margin. Boosting increases the margin very aggressively since it concentrates on the hardest examples. If the margin is large, more weak learners agree, and hence more rounds do not necessarily imply that the final classifier is getting more complex.
Theorem: large margins yield a better bound on the generalization error. If all training examples have large margins, then we can approximate the final classifier by a much smaller classifier. It is similar to how polls can predict the outcome of a not-too-close election.

2.3 Cross-validation

Another training set sampling method is to construct the training sets by leaving out disjoint subsets of the training data.
Ex: the training set can be randomly divided into 10 disjoint subsets. Then 10 overlapping training sets can be constructed by dropping out a different one of these 10 subsets.

2.4 Gradient Tree Boosting

The base algorithm is old but very hyped now. We will focus on the least-squares regression case and ignore some of the mathematical details.
Gradient Boosting = Gradient Descent + Boosting. It fits an additive model (ensemble) in a greedy forward stage-wise manner. Each stage introduces a weak learner to address the shortcomings of the current model. Shortcomings are identified by gradients.


Intuitively, we start with a very simple hypothesis.

The model is just an additive function whose components are regression trees (decision trees). I start with the initial dataset and I want to learn a model; hence, I evaluate all possible decision trees. Idea: incrementally get closer to the true labels.
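A minimal sketch of this stage-wise procedure for squared loss (fit_tree is a hypothetical depth-limited regression-tree learner; the learning rate is an assumed choice, not the course's):

    def gradient_boost(X, y, fit_tree, rounds, learning_rate=0.1):
        # Stage 0: a constant model; the mean minimizes squared loss.
        f0 = sum(y) / len(y)
        predictions = [f0] * len(y)
        trees = []
        for _ in range(rounds):
            # For squared loss the negative gradient is simply the residual y - F(x).
            residuals = [yi - pi for yi, pi in zip(y, predictions)]
            tree = fit_tree(X, residuals)          # weak learner fits the shortcomings
            predictions = [pi + learning_rate * tree.predict(xi)
                           for pi, xi in zip(predictions, X)]
            trees.append(tree)
        return f0, trees

    def boosted_predict(f0, trees, x, learning_rate=0.1):
        return f0 + learning_rate * sum(t.predict(x) for t in trees)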

Recall: gradient descent.


⇒ Connection to gradient descent

Why is Gradient Boosting so powerful? It is a generalization of AdaBoost. AdaBoost designs a particular function and tries to optimize classification.


Gradient Boosting abstracts the algorithm away from the loss function and hence from the task. Thus, it can plug in any differentiable loss function and use the same algorithm:

• Other regression loss functions
• Classification
• Ranking

Several details were skipped:

• Regularization to avoid overfitting
  o Restrict the depth of trees by not considering all possible trees; enumerate trees only up to a certain depth.
  o Add a penalty term to the objective function J
• Setting the learning rate η
• Derivations and discussion of all loss functions

AdaBoost vs. Gradient Boosting
⇒ How are they similar and how are they different?

• Similarities

  o Stage-wise (first model, then second model, etc.) greedy learning of an additive model.

  o Focus on mispredicted examples. The model makes some mistakes and both techniques want to focus on these mistakes.

• Differences
  o Focus on mispredictions
    § AdaBoost: high-weight data points
    § Gradient Boosting: gradient of the loss function; focus on a little part of each example.
  o Generality
    § AdaBoost: just classification.
    § Gradient Boosting: any differentiable loss

3. Manipulate features

Idea: different learners see different subsets of features (of each training instance). Empirically it provided mixed results. The technique works best when the input features are highly redundant.


⇒ How to use this in a sampling context? The idea is to create more than log k models, to get some redundancy: "error-correcting codes".

Given: integer T
For i = 1 to T:

• Partition the labels into two disjoint sets
• Build a classifier to distinguish between these sets of examples

4. Add randomness

Neural networks:
• Different initial values
• Not really independent

Decision trees:
• Consider the top 20 attributes, choose one at random?
• Produce 200 classifiers
• To classify a new instance: vote

FOIL:
• Choose any test with FOIL gain within 80% of the top
• Good empirical performance

Random Forests

Random forests, or random decision forests, are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.

Increasing i:
• Increases correlation among individual trees (BAD)
• Also increases accuracy of individual trees (GOOD)

Use a tuning set to choose a good setting for i. Overall, random forests:

• Are very fast
• Deal with a large number of features
• Reduce overfitting substantially
• Work very well in practice

Random forests differ in only one way from the general scheme of bagging: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This process is sometimes called "feature bagging". The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the B trees, causing them to become correlated.
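As a minimal sketch, assuming scikit-learn is available (one possible configuration, not the course's own setup), the scheme above maps directly onto a standard random-forest learner:

    from sklearn.ensemble import RandomForestClassifier

    forest = RandomForestClassifier(
        n_estimators=200,     # "produce 200 classifiers"
        max_features="sqrt",  # random subset of features considered at each split
        bootstrap=True,       # each tree is trained on a bootstrap replicate
    )
    # forest.fit(X_train, y_train); forest.predict(X_test)  -> majority vote of the trees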

5. Stacking

Stacking is similar to boosting: you also apply several models to your original data. The difference, however, is that you don't have just an empirical formula for your weight function; rather, you introduce a meta-level and use another model/approach to estimate, from the inputs together with the outputs of every model, the weights — in other words, to determine which models perform well and which badly given these input data.
Idea: you want to learn whether each learner is good.
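A minimal sketch of the meta-level idea (base_learners and meta_learner are hypothetical objects with scikit-learn-style fit/predict methods; a real setup would use out-of-fold predictions as meta-features to avoid overfitting):

    def stack_fit(base_learners, meta_learner, X, y):
        for model in base_learners:
            model.fit(X, y)
        # Meta-level features: one prediction per base learner for every example.
        meta_X = [[m.predict([x])[0] for m in base_learners] for x in X]
        meta_learner.fit(meta_X, y)      # learns which base models to trust
        return base_learners, meta_learner

    def stack_predict(base_learners, meta_learner, x):
        meta_features = [[m.predict([x])[0] for m in base_learners]]
        return meta_learner.predict(meta_features)[0]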


6. Why do ensemble methods work?

⇒ There are four possible explanations for why ensembles are better than basic classifiers.

6.1 Expected error


• Bias: left picture.
• Variance: right picture; complicated hypothesis space.

Sources of bias

• Bias comes from the inability to represent certain decision boundaries. E.g., linear threshold units, decision trees.

• You make incorrect assumptions
  o E.g., failure of the independence assumption in naïve Bayes

• Classifiers that are "too global"
  o E.g., a single linear separator, a small decision tree

• If bias is high, the model is underfitting the data

Sources of variance
• Statistical sources. Classifiers are "too local" and can easily fit the data (e.g., nearest neighbour, large decision trees).
• Computational sources

  o Decisions made on small subsets of the data (e.g., decision tree splits near the leaves)

  o Randomization in the learning algorithm (e.g., neural nets with random initial weights)

  o Unstable learning algorithms (e.g., the decision boundary can change if one training example changes)

o High variance ⇒ the model is overfitting the data

6.2 Bias/variance explanation


Bias and variance are two components of error.

§ Bias: inability to represent the true concept. I have some function I want to predict, based on a hypothesis and model; the error I still make is called "bias". We can think about bias when we don't have a good hypothesis space.

§ Variance: fluctuations due to random variations in the data. Appears when the hypothesis class is too big.

⇒ Hence, we are faced with a trade-off!

§ A more expressive class of hypotheses generates higher variance.
§ A less expressive class generates higher bias.

I have to make a decision beforehand based on whether I prefer high bias or high variance.

⇒ Ensembles can address both bias and variance!

§ Bagging: if the bootstrap approximation were correct, it would reduce variance without changing bias. In practice, bagging can reduce both.

  • For high-bias classifiers, it can reduce bias
  • For high-variance classifiers, it can reduce variance

§ Boosting: attempts to let you work with a more expressive hypothesis class.

  • Early iterations: primarily reduce bias
  • Later iterations: primarily reduce variance (apparently)

6.3 Statistical explanation

To motivate this explanation, we can look at an example of flipping coins.


⇒ I can then compute the probability that I used each of the coins, based on these figures.

I see that there is a small chance that I flipped C1, but a bigger chance that it was C2 or C3.

A Bayesian estimate could be a better idea. We make the probability of seeing a head a weighted vote among all possibilities. We then get a weighted vote about the predictions.


⇒ How can the learning algorithm select among the set of (almost) equally good hypotheses? How can I pick the best hypothesis? I can for example think about the error rate.
From theory we know the Bayes optimal classifier, which makes a weighted majority vote among all possible hypotheses, weighted by their posterior probability. That is the best we can do: it is probably the best possible classifier, that is, it minimizes the error rate. But it is actually infeasible.

⇒ Ensemble learning approximates the Bayes optimal classifier

6.4 Representational explanation

Each learner works with a given hypothesis class (i.e., set of possible models). It is possible that the target function lies outside of the considered hypothesis class.

• Linear classifier: I need a function that separates positive and negative examples, but no single separating line exists.

• Decision tree: with only straight (axis-parallel) cuts, a decision tree cannot represent a diagonal line exactly.

An ensemble average may approximate the true target function with arbitrarily good accuracy.

• Curved boundary by averaging lines
• Diagonal boundary by averaging "staircases"

6.5 Computational explanation

Most learning algorithms search through a hypothesis space to find one "good" model. The hypothesis space could be a space of decision trees, regression lines, etc.


The most interesting hypothesis spaces are huge or infinite, so we have to do a heuristic search. The learning might get stuck in a local minimum. One strategy for avoiding local minima is to repeat the search many times with random restarts. This is essentially what bagging does. This method is really powerful.

7. Some applications

The best, or at least the most successful, application is web search.

It uses ensembles of decision trees, learned with gradient boosting. Think about the data: we have a query, here a radio station, and then different URLs (= training data). We also have a ranking among the lines. An ensemble of trees is then induced.

A score is used to rank the URLs: I want to make sure I pick the perfect one first, then a good one, etc. They want to predict a score and learn decision trees to perform the ranking. One thing that is important is to evaluate the trees. Another reason for using ensembles here is that this method tends to be very accurate.
Another area where ensembles are widely used is edge detection. Given an image, you want to detect edges, finding the outline of the different figures. What you can do is use an ensemble, for instance a random forest. We will have some probabilistic models, train decision trees, etc.
A third area is motion capture. I want to model a person. The gold standard is to have a good camera tracking system. You then have a random map and you try to get the location of the people. This is widely used in the sports domain to look at people's behaviour on the field. This method is really accurate; however, the lab setup is incredibly expensive and very time consuming.
Video games that try to recognize your gestures: you try to infer what the poses of people are. You get a depth image, then go to body parts before moving on to a 3D joint proposal. Decision trees are used throughout the procedure.


To sum up, a committee of experts is typically more effective than a single super genius.
Key issues

• Generating base models
• Integrating responses from base models: the predictions have to be combined or aggregated.

Popular ensemble techniques
• Manipulate training data: bagging and boosting
• Manipulate output values: error-correcting output coding

⇒ Bias/variance is really important in data mining.


INTERLUDE: ACTIONABLE DATA MINING

• How many of you go to the same grocery store every week?
• How many of you go to the same bar?
• How many of you drink one type of beer?
• What would make you change?
• Why do we work this way?

What we want to take away is that people typically have similar habits. Similar habits make things easier to analyse, since people enter a certain routine. Once you have certain habits, it becomes hard to change.
Task: based on customer information, you want to predict whether a customer is pregnant or not. Many questions then arise:

• What data is needed? Where would you get it? We need data about gender and age, plus purchase data. We also need data from search engines, looking at which products people are looking for. I should try to figure out what pregnant people are looking for.

• What techniques are applicable to this problem? This is a binary classification problem, where you may want to assign a score to pregnant people.

• What would you look for?
• How could you exploit your findings? The main goal of data mining is to act upon the data. So if I know someone is pregnant, I want to use this information to act in a certain manner. I may be tempted to recommend some products, for instance, or propose some ads, etc.

• Are there risks involved? If so, what are they? There are risks related to privacy, and then questions arise: is this rational or not? There is also a risk of false positives. Take for instance a person who searches on behalf of a friend but who might not be pregnant at all.


INTERLUDE: LARGE SCALE DECISION TREE LEARNING

1. Decision tree overview

• Internal nodes are choice nodes.

• Leaf nodes are prediction nodes. They represent the classification.

I have a bunch of training data. I then split, picking the best internal node. You first want to check whether all points are of the same class. Otherwise you have to run through all attributes and evaluate the splits on each of them. You then find the best one and partition the data you have.

⇒ When will this basic decision tree algorithm be inefficient? Think about very large datasets. Answer: if the training data does not fit in memory. Every time I want to do a split, I have to do a full scan of the data. So, I might keep a list of instances I want to look at. There are many examples I can ignore.

2. Rainforest

⇒ Can we improve? Yes! We just need aggregate information at each node in order to compute the split point. We can just store the aggregate information. The data structure is called the Attribute-Value, Class-label set or AVC-set. Counts for each class are aggregated and it stores the class label distribution for each attribute value. The AVC-group is the collection of AVC-sets for all possible attributes.
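A minimal sketch of building such AVC-sets in one scan (the record format and field names are made-up examples, not from the course):

    from collections import defaultdict

    def build_avc_group(records, attributes, label):
        """records: list of dicts. Returns {attribute: {value: {class: count}}}."""
        avc = {a: defaultdict(lambda: defaultdict(int)) for a in attributes}
        for r in records:
            for a in attributes:
                avc[a][r[a]][r[label]] += 1   # class-label distribution per attribute value
        return avc

    rows = [{"age": "<30", "income": "high", "buys": "no"},
            {"age": "<30", "income": "low",  "buys": "yes"},
            {"age": "30+", "income": "high", "buys": "yes"}]
    print(dict(build_avc_group(rows, ["age", "income"], "buys")["age"]["<30"]))
    # {'no': 1, 'yes': 1}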

RainForest algorithms

• RF-Write: always write out partitions to disk


  Pro: do not have to read the whole DB each time
  Con: writing to disk is expensive

• RF-Read: scan the whole database at each node level
  Pro: avoids expensive write operations
  Con: reads lots of irrelevant information

• RF-Hybrid: mixed strategy. Use RF-Read until the AVC-groups of the child nodes don't fit in memory. For each level where the AVC-groups don't fit in memory, partition the child nodes into sets M and N:

  • AVC-groups for n ∈ M all fit in memory
  • AVC-groups for n ∈ N are built on disk

  Process the nodes in memory, then fill memory from disk.

I have a database and I scan the data. I build the AVC-set for each of my possible nodes. I pick my root node.

I now get two nodes. For each node I have to consider the possible splits; I consider each attribute. So, I doubled the number of attributes I have to consider.

3. BOAT

Another way to think about scaling up is the BOAT algorithm. This stands for Bootstrapped Optimistic Algorithm for Tree construction. It is very scalable for learning decision trees. Indeed, it needs very few scans (perhaps 2) over the data, and it constructs multiple levels at once.


The algorithm runs in two phases:

• Sampling phase: look at as much of the dataset as we can. Take a sample D' ⊂ D from the training database such that D' fits in memory. Construct b bootstrap trees T1, ..., Tb by creating training samples D1, ..., Db obtained by sampling with replacement from D'. Perform bagging! I then end up with a large number of trees, which the algorithm will try to combine into a unified, single tree. So, it will try to create a summary tree.

• Create coarse splitting criteria: process the trees in a top-down fashion. At each node N, check whether the splitting criterion is identical for all trees. If not, delete N and its subtrees in all trees. If they agree, construct the coarse splitting criterion.

⇒ Coarse splitting criteria
  • Discrete: just the attribute values
  • Continuous: a confidence interval [L, U] such that the true split is likely in the interval. We have a range of possible values.

Both trees agree on the root node. On the left subtree they all agree. On the right subtree they agree, but then they disagree.

On the left tree we have > 20, on the other side > 25. So for age we don't know where the actual threshold will be. For the final tree, we compute an exact split.

• Cleaning phase

  For a discrete attribute the coarse splitting criterion is the exact splitting attribute, whereas for continuous data we only try splits within the confidence interval. Collect all examples that follow this path AND fall within the confidence interval, then compute the exact split based on these examples.


CHAPTER 6: ASSOCIATION RULE MINING

⇒ How to find patterns?

1. Introduction and definitions

Given: a set of transactions
Find: if-then rules that predict the occurrence of an item based on other items in the transaction.

The main motivation is finding regularities in data. What products were often purchased together? What kinds of DNA are sensitive to a new drug? Try to find co-occurrences. We try to find patterns in the data, patterns that predict a certain variable.
Foundation for many data mining tasks:

• Association
• Correlation

⇒ Algorithms do not require labelled data or for a user to specify a predefined target concept.

• An itemset is a collection of one or more items. Ex: {Bread, Milk}
• A k-itemset is an itemset that contains k items. Ex: 3-itemset {Bread, Milk, Diaper}

Association rules are if-then rules about the contents of baskets.
Given: a set of items I = {i1, i2, ..., im}
       a set of transactions D = {d1, d2, ..., dn}
An association rule: A → B


{i1, i2, ..., ik} → j means: "if a basket contains all of i1, ..., ik then it is likely to contain j."

When dealing with association rules, a simple question is: find sets of items that appear "frequently" in the baskets.
The support count for itemset I is the number of baskets containing all items in I. The support is the fraction of transactions that contain the itemset. Basically we are just counting how many times we see an itemset in the data. Given a support threshold s, frequent itemsets appear in at least s baskets.

support(A ⇒ C) = support({A} ∪ {C}) / |T|

The confidence of an association rule is the conditional probability of j given i1, ..., ik. This gives a measure of how accurate the rule is: given that I have observed A, what is the probability I also observe B?
confidence(A ⇒ B) = P(B | A) = sup({A, B}) / sup(A)
The confidence of the rule is the fraction of the baskets with all of I that also contain j.

Confidence alone can be useful, provided the support for the left side of the rule is fairly large. We usually want the confidence of the rule to be reasonably high, perhaps 50%, or else the rule has little practical effect.

Given an association rule I → j, Interest = Confidence(j | I) − Support(j).

• Interest = 0: I has no influence on j
• Interest > 0: I may cause the presence of j
• Interest < 0: I discourages the presence of j

If I has no influence on j, then we would expect that the fraction of baskets including I that contain j would be exactly the same as the fraction of all baskets that contain j. Such a rule has interest 0.
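A minimal sketch of these three measures on a handful of made-up baskets (the data is purely illustrative):

    baskets = [
        {"bread", "milk"},
        {"bread", "diaper", "beer"},
        {"milk", "diaper", "beer"},
        {"bread", "milk", "diaper", "beer"},
        {"bread", "milk", "coke"},
    ]

    def support(itemset):
        return sum(1 for b in baskets if itemset <= b) / len(baskets)

    A, B = {"diaper"}, {"beer"}
    confidence = support(A | B) / support(A)      # P(B | A)
    interest = confidence - support(B)            # how much A changes the chance of B
    print(support(A | B), confidence, interest)   # 0.6 1.0 0.4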

In data mining and association rule learning, lift is a measure of the performance of a targeting model at predicting or classifying cases as having an enhanced response (with respect to the population as a whole), measured against a random-choice targeting model. A targeting model is doing a good job if the response within the target is much better than the average for the population as a whole. Lift is simply the ratio of these values: target response divided by average response.
If some rule had a lift of 1, it would imply that the probability of occurrence of the antecedent and that of the consequent are independent of each other. When two events are independent of each other, no rule can be drawn involving those two events.


If the lift is > 1, it lets us know the degree to which those two occurrences are dependent on one another, and makes those rules potentially useful for predicting the consequent in future datasets.
Why could the lift measure be useful? By dividing confidence by support, we can "normalize" the data and compare the results.
The association rule mining task is simple.
Given: transaction data, support s, and confidence c.
Find: all association rules with support ≥ s and confidence ≥ c.
Task focus:

• General many-many mapping (association) between items and baskets
• Connections among "items", not "baskets"
• Focus on common events, not on rare events

There exist different types of associations

• Boolean associations

• Quantitative associations

Number of predicates captured

• Single attribute

• Multiple attributes

• Multi-relational

Single or multiple level

• Single level

• Multiple level


Association rules are often used in the retail sector.
• Baskets: sets of products someone bought in one trip to the store
• Items: products

Ex: given that many people buy beer and diapers together, run a sale on diapers and raise the price of beer. This is only useful if many people buy diapers and beer. What items should the store stock up on?

By finding frequent itemsets, a retailer can learn what is commonly bought together. Especially important are pairs or larger sets of items that occur much more frequently than would be expected were the items bought independently. Ex: diapers and beer. One would hardly expect these two items to be related, yet through data analysis one chain store discovered that people who buy diapers are unusually likely to buy beer.

Another application is plagiarism detection.

• Baskets = sentences
• Items = documents containing those sentences

Items that appear together too often could represent plagiarism. Notice that items do not have to be "in" baskets. Documents having a high count contain plagiarism.

Association rules are also often used in web page analytics.

• Baskets = web pages
• Items = words

Unusual words appearing together in a large number of documents, e.g., "Brad" and "Angelina", may indicate an interesting relationship.

2. Naïve Algorithm

Walmart sells 100,000 items and can store billions of baskets. The Web has billions of words and many billions of pages. We have access to lots and lots of data…
Naïve generate and test
Goal: find all association rules with support ≥ s and confidence ≥ c.
For each association rule X ⇒ Y, keep it if it meets the support and confidence thresholds.
However, the complexity is exponential in the number of items.

This naïve algorithm tends to be too slow!


Association rule mining goal
Insight: given the frequent itemsets,

• It is trivial to derive all association rules
• These will contain all rules with support s and confidence c

Hard part: finding the frequent itemsets.
Note: if {i1, i2, ..., ik} → j has high support and confidence, then both {i1, i2, ..., ik} and {i1, i2, ..., ik, j} will be "frequent".
Creating association rules
Given: support s, confidence c

• Step 1: Find all itemsets with support s
• Step 2: For each frequent itemset L, for each non-empty subset s of L, output the rule s → {L − s} if its confidence ≥ c.

Once I have the frequent itemsets, finding association rules is quite easy.

Typically, data is kept in flat files rather than in a database system. Data is stored on disk basket-by-basket. Use k nested loops to expand baskets into pairs, triples, etc. as you read the baskets. The true cost of mining disk-resident data is usually the number of disk I/Os.

• Read data in passes: all baskets are read in turn
• Cost: number of passes an algorithm takes

The bottleneck is main memory. For many frequent-itemset algorithms, main memory is the critical resource. As we read baskets, we need to count something, e.g., occurrences of pairs. The number of different things we can count is limited by main memory. Swapping counts in/out is a disaster (why?).
The hardest problem often turns out to be finding the frequent pairs: frequent pairs are common, frequent triples are rare. The probability of being frequent drops exponentially with size, while the number of sets grows more slowly with size. First, focus on pairs, then extend to larger sets.
Naïve algorithm
What is the naïve way to think about this? For each basket that I process, I generate all possible pairs and I process them.
Read the file once, counting in main memory the occurrences of each pair. From each basket of n items, generate its n(n−1)/2 pairs by two nested loops.
This fails if (#items)² exceeds main memory. Remember: #items can be 100K (Wal-Mart) or 10B (Web pages).
Two approaches:


1) Count all pairs, using a triangular matrix, covering all possible pairs. I can figure out the identifiers of the items.

   - Assign each item a number
   - Count {i, j} only if i < j
   - Keep pairs in the order: {1,2} ... {1,n}; {2,3} ... {n−1,n}
   - Pair {i, j} is at position: (i − 1)(n − i/2) + j − i

⇒ 4 bytes/pair to store all pairs

Assign each transaction a number and keep a count for each itemset.

We have a simple transactional database containing four transactions. Then I have my array where I generated all possible itemsets: I have to keep one entry for each possible pair. I scan my data, and for each transaction I generate all possible pairs.
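A minimal sketch of this counting scheme with the position formula above (items are numbered 1..n; the baskets are made up for illustration):

    n = 5                                    # number of items, toy value
    counts = [0] * (n * (n - 1) // 2)        # one count per unordered pair {i, j}

    def pair_index(i, j):
        """1-based items with i < j; 0-based index into the flat triangular array."""
        return int((i - 1) * (n - i / 2) + j - i) - 1

    for basket in [[1, 2, 3], [2, 3, 5], [1, 2, 5]]:
        items = sorted(set(basket))
        for a in range(len(items)):          # double loop generates every pair once
            for b in range(a + 1, len(items)):
                counts[pair_index(items[a], items[b])] += 1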

2) Table of triples [i, j, c]: the count of pair {i, j} is c. This method can be more memory-efficient than the previous approach. The total number of bytes used is about 12p, where p is the number of pairs that actually occur. This method requires us to store three integers, rather than one, for every pair that does appear in some basket. This approach beats the triangular matrix if at most 1/3 of the possible pairs actually occur. It requires extra space for the retrieval structure. The triples method does not require us to store anything if the count for a pair is zero.

So far, we focused on identifying frequent pairs. But there exist many more possible triples than pairs. How can we avoid generating all of them? The number of frequent itemsets of size two will be bigger than that of size three, but on the other hand, there are more possible triples than possible pairs.


3. Apriori

In practice, the most main memory is required for determining the frequent pairs. The number of items, while possibly very large, is rarely so large that we cannot count all the singleton sets in main memory at the same time.
In practice the support threshold is set high enough that it is only a rare set that is frequent. Monotonicity tells us that if there is a frequent triple, then there are three frequent pairs contained within it. Thus, we expect to find more frequent pairs than frequent triples, more frequent triples than frequent quadruples, and so on.
The A-Priori algorithm is designed to reduce the number of pairs that must be counted, at the expense of performing two passes over the data, rather than one. It is the job of the A-Priori algorithm and related algorithms to avoid counting many triples or larger sets. The main purpose here is a generate-and-test approach for discovering frequent itemsets. The Apriori algorithm is an iterative approach.

• Find all frequent itemsets of size k before finding frequent itemsets of size k+1

• One pass through the data for each frequent-itemset size.

If an itemset appears at least s times, so do all its subsets. The contrapositive for pairs states that if item i does not appear in s baskets, then no pair including i can appear in s baskets: it will never be frequent. As soon as I know that something is infrequent, there is no point in counting any itemset that is bigger.

The Apriori principle holds due to the following property of the support measure: ∀ X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

• Pass 1: Create two tables. The first table translates item names into integers from 1 to n. The other table is an array of counts. Read the baskets and count in main memory the occurrences of each item. It requires memory proportional to the number of items. Frequent items are those that appear ≥ s times. After the first pass we examine the counts in order to determine the frequent items. We then create a new numbering from 1 to m for just the frequent items. This table is an array indexed 1 to n, and the entry for i is either 0, if item i is not frequent, or a unique integer in the range 1 to m if item i is frequent.

• Pass 2: Read the baskets again and count in main memory only those pairs both of whose items were found in Pass 1 to be frequent. In a double loop, generate all pairs of frequent items in that basket. For each such pair, add one to its count in the data structure used to store counts. It requires memory proportional to the square of the number of frequent items, plus a list of the frequent items. Frequent itemsets are those that appear ≥ s times.

Verify whether each candidate satisfies the threshold.

⇒ If no frequent itemsets of a certain size are found, then monotonicity tells us there can be no larger frequent itemsets, so we can stop.

1) Join step: candidate set Ck+1 is generated by joining Lk with itself.

   Suppose the items in Lk are listed in an order. Join each element in Lk with itself. If l1, l2 ∈ Lk, they are joinable if:

   o The first k−1 items in l1 and l2 are the same
   o l1[1] = l2[1] AND l1[2] = l2[2] AND ... AND l1[k−1] = l2[k−1]

2) Prune step: any k-itemset that is not frequent cannot be a subset of a frequent (k+1)-itemset. For each candidate itemset in Ck+1, look at each subset of size k, i.e., drop one item from the candidate. If any of these subsets isn't frequent, discard this candidate.
   → Application of the Apriori principle
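A minimal sketch of the join and prune steps (a toy helper, not the course's implementation; itemsets are represented as frozensets):

    from itertools import combinations

    def apriori_gen(frequent_k):
        """frequent_k: set of frozensets, all of size k. Returns candidate (k+1)-itemsets."""
        k = len(next(iter(frequent_k)))
        sorted_sets = sorted(tuple(sorted(s)) for s in frequent_k)
        candidates = set()
        # Join step: combine two k-itemsets that share their first k-1 items.
        for a in sorted_sets:
            for b in sorted_sets:
                if a[:-1] == b[:-1] and a[-1] < b[-1]:
                    candidates.add(frozenset(a + (b[-1],)))
        # Prune step: drop candidates that have an infrequent k-subset.
        return {c for c in candidates
                if all(frozenset(sub) in frequent_k for sub in combinations(c, k))}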


The real trick is the pruning step. This is the key step.

The file of baskets is usually sufficiently large that it does not fit in main memory. Thus, a major cost of any algorithm is the time it takes to read the baskets from disk. Besides cost, another data-related issue is the use of main memory for itemset counting. All frequent-itemset algorithms require us to maintain many different counts as we make a pass through the data. We might for instance need to count the number of times that each pair of items occurs in baskets. We must be careful, as we cannot count anything that does not fit in main memory. Thus, each algorithm has a limit on how many items it can deal with.

4. PCY

In this section we consider the PCY algorithm, which takes advantage of the fact that in the first pass of A-Priori there is typically lots of main memory not needed for the counting of single items.
A related simple problem: check whether an object has been previously encountered. Filter strings: scan a larger file F of strings and output those that are in a set S (e.g., email addresses). Has someone visited the site yet? (A bit more complicated…)
⇒ Often, files will not fit in memory. This algorithm exploits the observation that there may be much unused space in main memory on the first pass. The PCY algorithm uses that space for an array of integers that generalizes the idea of a Bloom filter.


Think of the array as a hash table whose buckets hold integers rather than sets of keys or bits. Pairs of items are hashed to buckets of this hash table. As we examine a basket during the first pass, we not only add 1 to the count for each item in the basket, but we also generate all the pairs, using a double loop. We hash each pair and add 1 to the bucket into which that pair hashes.

At the end of the first pass, each bucket has a count, which is the sum of the counts of all the pairs that hash to that bucket. If the count of a bucket is at least as great as the support threshold s, it is called a frequent bucket.
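A minimal sketch of this first pass (the bucket count and hash choice are arbitrary toy values, not the course's):

    from collections import Counter

    NUM_BUCKETS = 1013          # toy value; in practice as large as spare memory allows

    def pcy_first_pass(baskets):
        item_counts = Counter()
        bucket_counts = [0] * NUM_BUCKETS
        for basket in baskets:
            items = sorted(set(basket))
            item_counts.update(items)           # normal A-Priori item counting
            for a in range(len(items)):         # double loop over all pairs in the basket
                for b in range(a + 1, len(items)):
                    bucket = hash((items[a], items[b])) % NUM_BUCKETS
                    bucket_counts[bucket] += 1  # count the bucket, not the pair itself
        return item_counts, bucket_counts

    # Pass 2 then counts a pair {i, j} only if both i and j are frequent items AND the
    # pair hashes to a frequent bucket (bucket count >= support threshold).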

• We can say nothing about the pairs that hash to a frequent bucket; they could all be frequent pairs, for all we know from the information available to us.

• If the count is less than s, we know no pair that hashes to this bucket can be frequent.

Next, define the set of candidate pairs for the next pass.

Analysis

• No false negatives: it catches all elements in F
• False positives are possible
• If b = n|S|, at most 1/n of the bit array is 1, so only 1/n of the elements not in S get through the filter
  o Note: it will be less than 1/n given collisions
  o Assuming hashing is uniform at random

During the first pass of the Apriori algorithm, most memory is idle.
Idea: use the free memory for a hash table

• Hash pairs of items that appear in a transaction: we need to generate these

• Keep just the count, not the pairs themselves
• We are interested in the presence of a pair AND whether it is present at least s (support) times.


Observations about buckets

• A bucket that a frequent pair hashes to meets the minimum support threshold. We cannot eliminate any member of this bucket.

• Even without any frequent pair, a bucket can be frequent. We cannot eliminate any member of this bucket.

• Best case: the count for a bucket is less than the minimum support. Eliminate all pairs hashed to this bucket, even if a pair consists of two frequent items.

⇒ An example is provided in the slides.

5. Limiting disk I/O

A-Priori, PCY, etc. take k passes to find frequent itemsets of size k. In other words, the previously discussed algorithms use one pass for each size of itemset we investigate. If the main memory is too small to hold the data and the space needed to count frequent itemsets of one size, there does not seem to be any way to avoid k passes to compute the exact collection of frequent itemsets.

• Good: clever pruning strategy
• Bad: still requires lots of passes over the data

In many applications it is not essential to discover every frequent itemset; it is quite sufficient to find most but not all of them. Making passes over the data is the limiting factor. Other techniques use 2 or fewer passes for all sizes:

• Simple algorithm
• SON (Savasere, Omiecinski, and Navathe)
• Toivonen


⇒ The question is: can we find all frequent itemsets in 2 passes or less?

5.1 Simple Algorithm: Sample

Idea: instead of using the entire file of baskets, we could pick a random subset of the baskets and pretend it is the entire dataset.
Procedure: I have a database on disk and some main memory, split into two parts. We randomly sample baskets and make sure these end up in memory. We then run Apriori on the in-memory sample. So we need to think about the support threshold, which should scale with the size of the sample: scale back the support count, e.g., use s/100 if the sample is 1/100 of the original DB size.
The safest way to pick the sample is to read the entire dataset and, for each basket, select that basket for the sample with some fixed probability p. In case our baskets already appear in random order in the file, we do not even have to read the entire file. Once we have selected our sample of the baskets, we use part of main memory to store these baskets. The balance of the main memory is used to execute one of the algorithms we have discussed, such as A-Priori, PCY, … The algorithm must run passes over the main-memory sample for each itemset size, until we find a size with no frequent items.
This misses sets that are frequent in the whole but not in the sample.

• Can use an even smaller support
• E.g., s/125 instead of s/100

Optional: verify guesses on the entire dataset.

This algorithm requires at most two passes: one pass to construct the sample and then one pass if you want to verify guesses. This can be very fast; it is also very simple. Of course, the algorithm will fail if whichever method we use cannot be run in the amount of main memory left after storing the sample.

⇒ Be careful: it cannot be relied upon either to produce all the itemsets that are frequent in the whole dataset, nor will it produce only itemsets that are frequent in the whole; an itemset that is frequent in the whole but not in the sample is a false negative, while an itemset that is frequent in the sample but not in the whole is a false positive.

⇒ Retain as frequent only those itemsets that were frequent in the sample and also frequent in the whole. This improvement will eliminate all false positives, but a false negative is never counted and therefore remains undiscovered.

5.2 The SON Algorithm
⇒ Savasere, Omiecinski, and Navathe (SON)

This improvement avoids both false negatives and false positives, at the cost of making two full passes.
Idea: divide the input file into chunks, treat each chunk as a sample, and run the simple, randomized algorithm on that chunk. Use ps as the threshold, if each chunk is fraction p of the whole file and s is the support threshold. Store on disk all the frequent itemsets found for each chunk. Once all chunks have been processed, take the union of all the itemsets that have been found frequent for one or more chunks. These are the candidate itemsets.
Monotonicity: every itemset that is frequent in the whole is frequent in at least one chunk, so we can be sure that all the truly frequent itemsets are among the candidates; i.e., there are no false negatives.

• 1st pass: read each chunk and process it
• 2nd pass: count all the candidate itemsets and select those that have support at least s as the frequent itemsets

5.3 Toivonen's Algorithm

Again, it uses the simple algorithm, but lowers the threshold for the sample. Ex: if the sample is 1% of the baskets, use s/125 instead of s/100. Goal: avoid missing truly frequent itemsets.
Toivonen's algorithm begins by selecting a small sample of the input dataset and finding from it the candidate frequent itemsets. It is essential that the threshold is set to something less than its proportional value. The smaller we make the threshold, the more main memory we need for computing all itemsets that are frequent in the sample, but the more likely we are to avoid the situation where the algorithm fails to provide an answer.
We then construct the negative border. This is the collection of itemsets that are not frequent in the sample, but all of whose immediate subsets are frequent in the sample.
Example: ABCD is in the negative border if and only if:

• It is not frequent in the sample, and
• ABC, ABD, ACD, BCD are all frequent in the sample.

A singleton A is in the negative border if and only if it is not frequent in the sample (the empty set is frequent unless the number of baskets is smaller than the support threshold).

To complete the algorithm, we make a pass through the entire dataset, counting all the itemsets that are frequent in the sample or are in the negative border. There are two possible outcomes:

• No member of the negative border is frequent in the whole dataset. In this case, the correct set of frequent itemsets is exactly those itemsets from the sample that were found to be frequent in the whole.

• Some member of the negative border is frequent in the whole. Then we cannot be sure that there are not some even larger sets, in neither the negative border nor the collection of frequent itemsets for the sample, that are also frequent in the whole. So, we can give no answer at this time and must repeat the algorithm with a new random sample.

Theorem: if there is an itemset that is frequent in the whole, but not frequent in the sample, then there is a member of the negative border for the sample that is frequent in the whole.
Proof: by contradiction. Suppose not; i.e.,

1) There is an itemset S frequent in the whole but not frequent in the sample, and

2) Nothing in the negative border is frequent in the whole.


Let T be a smallest subset of S that is not frequent in the sample. T is frequent in the whole (S is frequent + monotonicity). T is in the negative border, because otherwise it would not be the smallest: any removal from T results in something that is frequent in the sample.
⇒ Why are these variants interesting? They employ standard tools to improve efficiency: hashing and sampling. Hashing is a good technique, used in many algorithms to make things go fast. These are standard techniques that we will often encounter. They can be used in many other contexts, so it is good to know their strengths and weaknesses. Always think about how you can improve an algorithm!

6. FP-Growth

Quick reminder, where are we so far?
• A-Priori
  o Guaranteed to be complete
  o Slow due to repeated DB scans to generate candidate itemsets
  Although the Apriori heuristic achieves good performance gains by reducing the size of the candidate sets, it may suffer in situations with a large number of frequent patterns, long patterns, or quite low minimum support thresholds.

• Toivonen's Algorithm
  o Only requires two DB scans
  o Not guaranteed to be complete

⇒ Can we get fast and complete?

There is an approach called the frequent-pattern tree (FP-tree) that tries to do this. It mainly consists of two big ideas. It is an extended prefix-tree structure storing crucial, quantitative information about frequent patterns. Everything we have seen so far is based on the A-Priori idea: count everything of size 1 before moving to size 2, and so on. Studies have shown that FP-growth is about an order of magnitude faster than Apriori, especially when the dataset is dense and/or when the frequent patterns are long.

• Data compression: exploit redundancy to compactly but completely represent frequent items in the DB. Compress the database to something smaller that fits in main memory.

• Avoid repeated DB scans.


A compact data structure can be designed based on the following observations:

• Since only the frequent items will play a role in the frequent-pattern mining, it is necessary to perform one scan of the transaction database DB to identify the set of frequent items.

• If the set of frequent items of each transaction can be stored in some compact structure, it may be possible to avoid repeatedly scanning the original transaction database.

• If multiple transactions share a set of frequent items, it may be possible to merge the shared sets with the number of occurrences registered as a count. It is easy to check whether two sets are identical if the frequent items in all of the transactions are listed according to a fixed order.

• If two transactions share a common prefix, according to some sorted order of frequent items, the shared parts can be merged using one prefix structure as long as the count is registered properly. If the frequent items are stored in frequency-descending order, there are better chances that more prefix strings can be shared.

The FP-tree construction takes exactly two scans of the transaction database: the first scan collects the set of frequent items, and the second constructs the FP-tree.

To discover itemsets in the FP-tree, partition the data and count within each partition: aka divide-and-conquer. It avoids candidate generation.
Looking at the five transactions, we observe that the first one and the second one are exactly the same transaction. The transaction is repeated, so we could just store it once with a count. However, it is very unlikely to find many exact repeats. The question is, how to do better than this? When we look at the data, there are still repeated structures: we can see things like f appearing in all transactions. Question: how could we exploit this structure to compress? The goal is to compress the data very quickly and cheaply, ideally in one or two passes.

Instead of storing f many times, we store it once and keep track of the counts. Each node in the tree is just an item with a count. We can reconstruct the pattern by going down the tree. Question: is there a better way to order the data?
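A minimal sketch of the two-scan FP-tree construction described above (a toy Node class; a full FP-growth implementation would also maintain header-table links for the mining step):

    from collections import Counter

    class Node:
        def __init__(self, item):
            self.item, self.count, self.children = item, 0, {}

    def build_fp_tree(transactions, min_support):
        # Scan 1: find the frequent items and a frequency-descending order.
        freq = Counter(i for t in transactions for i in set(t))
        rank = {i: r for r, (i, c) in enumerate(freq.most_common()) if c >= min_support}
        # Scan 2: insert each transaction, keeping only frequent items in that order;
        # shared prefixes are merged and their counts incremented.
        root = Node(None)
        for t in transactions:
            node = root
            for item in sorted((i for i in set(t) if i in rank), key=rank.get):
                node = node.children.setdefault(item, Node(item))
                node.count += 1
        return root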


General idea: divide-and-conquer to recursively grow frequent pattern paths using the FP-tree.
Method:

• For each item, construct its conditional pattern base and then its conditional FP-tree

• Repeat the process on each newly created conditional FP-tree until
  o The resulting FP-tree is empty, or
  o It contains only one path

⇒ A detailed example is given in the slides.

⇒ Why is this a beneficial technique? It is a very compact representation of the database: we just increase the counts instead of keeping the itemsets several times in the database.

Why is FP-Growth fast?

• No candidate generation, no candidate test
• Uses a compact data structure
• Eliminates repeated database scans
• The basic operations are counting and FP-tree building

7. Incorporating constraints into mining

It is not realistic to find all patterns in a database autonomously.

• Too many patterns
• Patterns too unfocused


Data mining is interactive: users should provide advice and direction to algorithms. Usually we have a hypothesis we want to evaluate by looking at the data.
Constraint-based mining

• The user provides constraints on what to mine
• The system exploits the constraints for efficiency

Types of constraints

• Interest constraint. Ex: the rule meets minimum support, confidence, interest thresholds

• Data constraint. Ex: products sold together in Leuven, in March 2017

• Rule or pattern constraint.
  o Content: small sale (sum < 10) triggers large sale (sum > 200)
  o Form: meta-rule. P(x, y) ∧ Q(x, w) → takes(x, database systems)

Given constraints, a mining algorithm should be
• Sound: it only finds itemsets that satisfy the constraints
• Complete: it finds all itemsets that satisfy the constraints

A naïve solution finds all frequent itemsets, then tests them for constraint satisfaction. This is inefficient. The efficient approach is to analyse the constraint and push it into the mining process; then things will be much more efficient.
Anti-monotonicity: when an itemset S violates the constraint, so does any of its supersets.

Succinctness: if a constraint is succinct, you can enumerate all, and only, the itemsets that satisfy it.

Idea: without looking at the data, we can determine the itemsets that satisfy the constraint. Ex: min(S.Price) ≤ v is succinct.

8. Presenting results, other metrics

• Maximal itemsets: no immediate superset is frequent


• Closed itemsets: no immediate superset has the same count (> 0)

  o Stores what is frequent and the exact counts
  o Lossless encoding of the frequent itemsets

⇒ Why do we need these notions of maximal and closed itemsets?

• They are a way to compress the search space. What is the intuition if we only keep the closed itemsets? Or if we only keep the maximal ones, in which case do we lose information?

  o If we only keep the closed itemsets, we can regenerate all the frequent itemsets with their counts.

Mining maximal itemsets: MaxMiner

We want to try something that is as big as possible and frequent. Always add sub-branches.
Pruning techniques

• Local pruning (e.g., at node A)
  o If h(N) ∪ t(N) is frequent, prune the whole sub-tree (e.g., if ABCD is frequent, don't explore further)
  o If h(N) ∪ {i}, for some i ∈ t(N), is NOT frequent, remove i from the tail before expanding (e.g., if AC is not frequent, remove C)
• Global pruning: when a max pattern is found (e.g., ABCD), prune all nodes (e.g., B, C, D) where h(N) ∪ t(N) is a subset (e.g., a subset of ABCD)

⇒ An example is given in the slides


We have looked at two popular objective measures: support and confidence. There are other things we can look at, which are more subjective measures of interestingness. A rule (pattern) is interesting if it is

• Unexpected (surprising to the user)
• Actionable (the user can do something with it)

To sum up…
Characterization of algorithms

• Search techniques:
  o Breadth-first: Apriori and its variants
  o Depth-first: FP-Growth
  o Lookahead

• Pruning techniques
  o Subset infrequency
  o Hashing
  o Superset frequency

• Limiting disk accesses is the primary goal

Presenting rules
• Which itemsets
  o All itemsets
  o Closed itemsets
  o Maximal itemsets

• What metrics
  o Support, confidence, interest
  o Lift
  o Correlation

⇒ What is the difference between standard rule induction and association rules?

  o Rule induction: there is a fixed target; we are essentially building a classifier.

  o Association rules: no predefined target.

Association rules are an efficient way to mine interesting information in very large databases: we get probabilities. It does not require (but can exploit) user guidance for interesting patterns.


The A-Priori algorithm and its extensions allow the user to gather a good deal of information without too many passes through the data.

Food for thought: pattern explosion is a well-known problem in frequent itemset mining. High support thresholds typically result only in a few well-known patterns; but for low support thresholds, the number of frequent itemsets can easily be orders of magnitude larger than the number of transactions. Knowledge discovery in such humongous itemset collections is virtually impossible. What are the causes of pattern explosion? Can you think of a way to solve or at least alleviate this issue?

• The problem lies with the choice of support, because there is a trade-off.
  o Too low: lots and lots of itemsets
  o Too high: few itemsets which are well-known

• If there are statistical dependencies between some items, then you see them appearing everywhere together, which might not add additional information.
How can we deal with it? We could for instance use constraints. If we have some background information, we can exclude some itemsets.


CHAPTER 7: MINING SEQUENCES

• Understand how sequential pattern mining relates to association rule mining
• Introduce the basic terminology and concepts of sequential pattern mining
• A brief introduction to the different algorithms for sequential pattern discovery

1. Motivation

Idea: sequential pattern mining discovers frequent subsequences as patterns in a sequence database. It is an important data mining problem with broad applications.

• Association rule mining: find relationships between items that co-occur simultaneously. In association rule mining we were looking at associations and connections between items in a market basket, for instance.

• Sequential pattern analysis considers time (the order transactions occur in). In sequential pattern mining we take the order into account; the order matters here, and it can often be important in decision making.
Many different application domains are characterized by sequential data. Often the time or order of actions/events is relevant in decision making, and it is important to be able to capture those types of dependencies.
The sequential pattern mining task is quite similar to that of association rule mining.
Given: a sequence database and a minimum support threshold.
Find: all (maximal) frequent subsequences

2. Background and definitions


We previously worked on a transaction database, assuming a transaction ID. Here, we also have a timestamp and a customer identifier, so we need to build a sequence database based on the transactions of each customer.

Transaction Database

Sequence Database

The customer-sequence database is a time-ordered list of a customer's transactions. For instance, customer 1 bought a PC before paper. Customer 4 bought a PC before a printer and ink.

We can group all transactions by user.

A sequence is just an ordered list of itemsets. Items within the transactions remain unordered.

The length of a sequence is the number of itemsets in the sequence, i.e. the number of transactions in it.

A sequence S1 = (a1, a2, ..., an) is contained in another sequence S2 = (b1, b2, ..., bm) IF

• there exist integers i1 < i2 < ... < in such that
• a1 ⊆ bi1 AND a2 ⊆ bi2 AND ... AND an ⊆ bin

I can find transactions in S2 such that S1 occurs in them; we have to find a way to map the transactions of S1 onto transactions of S2.
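A minimal sketch of this containment test (sequences are represented here as lists of Python sets; a greedy left-to-right scan is enough):

    def contains(s2, s1):
        """True if sequence s1 is contained in sequence s2 (lists of itemsets)."""
        pos = 0
        for a in s1:
            while pos < len(s2) and not a <= s2[pos]:   # find an itemset of s2 covering a
                pos += 1
            if pos == len(s2):
                return False
            pos += 1        # the next itemset of s1 must match a strictly later one
        return True

    # Example: <(PC)(Ink)> is contained in <(PC)(Printer, Ink)(Paper)>
    print(contains([{"PC"}, {"Printer", "Ink"}, {"Paper"}], [{"PC"}, {"Ink"}]))  # True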


An item can occur at most once in an element of a sequence, but can occur multiple times in different elements of a sequence.

• A customer supports a sequence s if s is contained in the customer's sequence. The support is the fraction of customers who support the sequence.

• A sequence is frequent if it meets a minimum support threshold.

• A sequence s is maximal if it is not contained in any other sequence.

Ex:

Rmq: <(Printer)(Ink)> means that the printer and the ink are bought in two separate transactions. Notice that customer 4 buys both the printer and the ink, but in the same transaction, i.e. <(Printer, Ink)>, which is different from <(Printer)(Ink)>.

Remember the Apriori property: if a sequence is not frequent, then none of its super-sequences are frequent.

Sequential pattern mining faces some challenges. Databases contain a huge number of possible sequential patterns, because we are now also considering the order; hence, generating candidates tends to be quite challenging. Algorithms should find all patterns (if possible) that meet a minimum support threshold. They should also be efficient and scalable and incorporate user-provided constraints.


⇒ The big challenge resides in generating candidates.

There are 2 possible ways to extend a candidate:
• Extend the sequence to include another itemset (i.e., a separate transaction). I have a sequence of a certain length, for instance length 2, and I add a new transaction, leading to a sequence of length 3.

• Extend the number of items included in one itemset (i.e., model co-occurring items).

Assume that we have 6 frequent length-1 sequences. First: extend the length of the sequence. This yields 6 x 6 = 36 candidates.

Second: extend the size of an itemset. This yields 6 x 5 / 2 = 15 candidates. So instead of only 15 itemset candidates we have 15 + 36 candidates in total.

There are a number of applications that people look at.

§ Think about retail shopping. First, buy a computer; second, buy a printer within the next 3 months.

§ Other applications are in the medical domain. First distinguish symptoms, then diseases, and third treatments.

§ Financial markets: commonly occurring transaction orders.
§ Web data. Ex: clickstreams. What order do people visit pages in?
§ Telephone calling patterns: what order do people call others in?
§ DNA sequences: what structure is commonly repeated?

Quick check…

What is the support of:


§ (PC). Support is 2. We only care about the number of customers who bought a PC, here customers 3 and 4; we don't count the number of times a PC was bought.

§ (Ink, Paper). Support is 3.
§ (PC)(Ink). Support is 2.

Given: 10 things of size 1 are found to be frequent.
How many itemsets must be evaluated on the next search iteration? 45 (= 10 x 9 / 2)
How many sequences must be evaluated on the next search iteration? 45 (itemsets) + 100 = 145 (= 10 x 9 / 2 + 10²)

3. FreeSpan

⇒ Can we develop a method which may absorb the spirit of Apriori but avoid or substantially reduce the expensive candidate generation and test?

FreeSpan is basically the first thing we think of in sequential mining. The idea is to adapt FP-Growth to the sequential case.
Idea: use frequent items to recursively project sequence databases into a set of smaller projected databases and grow subsequence fragments in each projected database.
Overview: divide-and-conquer approach

§ Step 1: Count sequences to find the f-list
§ Step 2: Recurse

• Project the sequence database into a set of smaller databases
• Mine frequent patterns in each projected database.

The big challenge here is that order matters!
Step 1: Find length-1 sequential patterns and list them in support-descending order

⇒ f_list = a:4, b:4, c:4, d:3, e:3, f:3; g:1

Rmq: we don't count how often a pattern appears in the transactions; we count the number of customer sequences in which the length-1 sequence appears.

Here, we can prune g, as it only has a support of 1; hence, we don't care about g. We will use the other frequent sequences to partition the search space.
Step 2: Divide the search space. The complete set of sequence patterns can be partitioned into 6 disjoint subsets (move down the f_list).


The subsets of sequential patterns can be mined by constructing projected databases.

§ Finding sequential patterns containing only item a. By scanning the sequence database once, the only two sequential patterns containing only item a are (a) and (aa).

§ Finding sequential patterns containing item b but no item after b in the f_list. This can be achieved by constructing the (b)-projected database.

§ Finding the other subsets of sequential patterns. The other subsets can be found similarly, by constructing the corresponding projected databases and mining them recursively. These databases are constructed simultaneously during one scan of the original sequence database.

Let's look at the <b>-projected database.

Pros:
• Projections replace candidate generation

Cons:
• May project at any point in a sequence
• Projected sequences may not be shorter

In this case I didn't save that much work. My sequences are not that much shorter than the previous sequences; I just managed to remove one customer ID.

4. PrefixSpan


Idea: instead of projecting sequence databases by considering all the possible occurrences of frequent subsequences, the projection is based only on frequent prefixes, because any frequent subsequence can always be found by growing a frequent prefix.
PrefixSpan mines the complete set of patterns but greatly reduces the effort of candidate subsequence generation. Moreover, it reduces the size of projected databases and leads to efficient processing.
PrefixSpan outperforms both the A-Priori algorithm and the previously seen method, FreeSpan.
Pros:

o Maintains the projection idea, avoids candidate generation. We want to guarantee we do less work.

o Projection based on the sequence prefix: ensures that sequences get progressively shorter.

Definition: Prefix
Sequence β = <b1 b2 ... bm> is a prefix of sequence α = <a1 a2 ... an>, where m ≤ n, IFF

• bi = ai for i ≤ m−1
• bm ⊆ am
• All items in (am − bm) are alphabetically after those in bm

Definition: Projection

A subsequence α' of sequence α is a projection of α w.r.t. prefix β IFF
• β is a subsequence of α
• α' has prefix β
• There is no proper super-sequence α'' of α' such that α'' is a subsequence of α and has prefix β

Similarly, what we can do is look at the suffix: split the projection into two parts, the prefix (data) and the suffix.

• Let α' = <a1 a2 ... an> be the projection of α w.r.t. prefix β = <a1 a2 ... am−1 a'm> (m ≤ n)
• Sequence γ = <a''m am+1 ... an> is called the suffix of α w.r.t. prefix β, denoted as γ = α / β, where a''m = (am − a'm)


• Wealsodenoteα’=β⋅γ

ProjectedDB • Given,prefixβandsuffixγ• WecallγtheβprojectedDB

Whenmyprefixgetslongerandlonger,mysuffixgetssmallerandsmaller.

Thefirstoccurrenceofeoccursinatransactionwithtwoitems,sowedon’twanttolosethefactthatthereissomestructureandthateco-occurs.So,weaddan“_”,toindicatethatfshouldoccurwithanotheritem.

ThePrefixSpanAlgorithmperformsadivide-and-conquersearchtofindallsequentialpatterns.

• Step 1: Find all frequent length-1 sequences
• Step 2: Divide the search space based on length-1 prefixes and recurse:

PrefixSpan(α, l, S|α)

1. Scan S|α to find frequent items b such that:
   a. b can be assembled to the last element of α to form a sequential pattern; or
   b. <b> can be appended to α to form a sequential pattern
2. For each frequent item b, append it to α to form a sequential pattern α', and output α';
3. For each α', construct the α'-projected database S|α', and call PrefixSpan(α', l+1, S|α').

A detailed example is provided in the slides.
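To make the recursion concrete, below is a minimal Python sketch of the PrefixSpan idea, under the simplifying assumption that every element of a sequence is a single item (so only rule b of step 1 applies and a pattern grows by appending one item); the function name and the toy database are illustrative, not from the course.

```python
from collections import defaultdict

def prefixspan(db, min_sup):
    """Mine frequent sequential patterns from db (a list of item sequences,
    e.g. [['a','b','c'], ['a','c']]) with support >= min_sup."""
    patterns = []

    def mine(prefix, projected):
        # Count, for each item b, in how many projected sequences it occurs.
        support = defaultdict(int)
        for seq in projected:
            for item in set(seq):
                support[item] += 1
        for item, sup in support.items():
            if sup < min_sup:
                continue
            new_prefix = prefix + [item]
            patterns.append((new_prefix, sup))
            # Build the new_prefix-projected database: the suffix after the
            # first occurrence of `item` in each sequence.
            new_projected = []
            for seq in projected:
                if item in seq:
                    suffix = seq[seq.index(item) + 1:]
                    if suffix:
                        new_projected.append(suffix)
            mine(new_prefix, new_projected)

    mine([], db)
    return patterns

db = [['a', 'b', 'c'], ['a', 'c', 'b'], ['a', 'b'], ['b', 'c']]
print(prefixspan(db, min_sup=2))
```

Each recursive call works only on the (ever smaller) projected database of its prefix, which is exactly why no candidate generation is needed.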


Projected DBs allow us to avoid generating candidate sequences.

• No candidate sequence needs to be generated by PrefixSpan. It only grows longer sequential patterns from the shorter frequent ones. It does not generate nor test any candidate sequence that does not exist in a projected database.
• The projected databases keep shrinking: a projected database is smaller than the original one because only the postfix subsequences of a frequent prefix are projected into a projected database.
• The major cost of PrefixSpan is the construction of projected databases.

The costliest operation is building projected DBs. Every time I perform a new recursive step, I have to build a new database. If the number and/or size of the projected databases can be reduced, the performance of sequential pattern mining can be improved substantially. There are two ways to improve computational efficiency.

o Bi-level projections to reduce the number and sizes of projected DBs.

The first scan identifies frequent length-1 sequences. The second scan constructs a triangular matrix instead of projected databases and counts:

• Support(a, a) for each item
• For each pair (a, b) of frequent items count:
  o Support(<ab>)
  o Support(<ba>)
  o Support(<(ab)>)
• On the next level, construct projected DBs. We will look at very small parts of the data; look at the things we think we need.

For each frequent length-2 sequential pattern construct a projected DB. Then on the next scan build the matrix.


ð For more information, cf. the hand-written notes in the slides.

o Pseudo-projection reduces cost if the projected database fits in memory. We want to avoid copying. Insight: postfixes of sequences often appear repeatedly in recursive projected databases. Pseudo-projection avoids repeatedly physically copying the postfix.

• Efficient if the projected DB fits in memory
• Costly if the projected DB is disk-resident

Idea: use pointers to avoid copying. With each projection store:

• A pointer to the base sequence. It points to where I want to be in the sequence.
• An offset to where the postfix starts

Instead of building the data, I just maintain a pointer.

To conclude, many important domains are characterized by the presence of sequential data. Sequential pattern mining helps capture time dependencies between events. It is similar to frequent itemset mining but considers the order of elements.

ð Common algorithms are inspired by itemset mining.


CHAPTER 8: CLUSTERING

1. Unsupervised learning, clustering intro

Clustering is the process of examining a collection of "points" and grouping the points into "clusters" according to some distance measure. Goal: find groups of items. Produce clusters such that things that occur in the same group are highly similar, or are close together in distance. A good clustering method will produce clusters with

• High intra-class similarity
• Low inter-class similarity

When you are faced with the clustering task, many things can make sense. Ex: divide a set into groups. You can do the grouping based on gender, for instance. Clustering is a subjective notion; hence, it is difficult to give a precise definition of clustering quality. It is application-dependent and ultimately subjective. Thinking about how many clusters to use can also be a tricky question. How many clusters do I want?

Remember… Supervised learning:

• Data are a set of pairs <x, y>, where y = f(x)
• Goal: approximate f

Unsupervised learning: the data is just x! Goal: find structure in the data. Challenge: ground truth is often missing (no clear error function, like in supervised learning). One way to think about clustering is that it is an unsupervised learning technique. In unsupervised learning, you don't have a label; you just have attributes. The goal is to find structure in the data, based on the attributes.


In supervised learning we know the label and we want to optimize a certain metric: ROC, AUC, … Unsupervised learning tends to be more abstract. Why is clustering useful? It can be interesting to look at. You can use it to compress the data.

• Groups of data are often useful
• Compress or reduce the data (e.g., establish prototypes)
• Identify outliers (or novel examples)
• Pre-processing for supervised learning (e.g., feature construction)
• Visualization of the data

Given: D = {x1, x2, ..., xN}, where each xi is a d-dimensional feature vector (xi,1, xi,2, ..., xi,d)
Do: divide the N vectors into K groups such that the grouping is "optimal"

There are 3 key issues:

• Measure of distances: distance between examples, between clusters, between an example and a cluster.
• How will we represent clusters?
• Assigning items to clusters

Clustering methods use a distance (similarity) measure to assess the distance between

• A pair of instances
• A cluster and an instance
• A pair of clusters

Remember that the requirements for a function on pairs of points to be a distance measure are that

• Distances are always nonnegative, and only the distance between a point and itself is 0.
• Distance is symmetric; it doesn't matter in which order you consider the points when computing their distance.
• Distance measures obey the triangle inequality: the distance from x to y to z is never less than the distance going from x to z directly.

Given a distance value, we can convert it into a similarity value: sim(i, j) = 1 / [1 + dist(i, j)]. It is not always straightforward to go the other way. We'll describe our algorithms in terms of distances.

Edit distance: the number of required edits to transform one string into another string. edit_distance(mining, smiling) = 2
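As an aside, the edit distance itself is easy to compute with dynamic programming; the sketch below is a standard Levenshtein implementation (not part of the course material) that reproduces the mining/smiling example.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    n, m = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # delete a[i-1]
                           dp[i][j - 1] + 1,          # insert b[j-1]
                           dp[i - 1][j - 1] + cost)   # substitute (or match)
    return dp[n][m]

print(edit_distance("mining", "smiling"))  # 2
```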


We can divide clustering algorithms into two groups. There exist several types of cluster structures: hierarchical structures and flat structures.

• Hierarchical arrangement: create a hierarchy between objects.
• Flat arrangement: directly partition the data into different groups.

Regarding the cluster assignment, we distinguish between different techniques:

• Hard clustering: each item occurs in exactly one cluster
• Soft clustering: each item has a probability of belonging to a certain cluster
• Disjunctive vs. overlapping clusters: items may occur in multiple clusters

Clustering approaches

• Hierarchical: create a hierarchical decomposition of the set of objects using some criterion
• Partitioning: construct various partitions and then evaluate them by some criterion
• Model-based: hypothesize a model for each cluster and find the best fit of models to data
• Density-based: guided by connectivity and density functions

2. Hierarchical clustering

2.1 The algorithm

This algorithm starts with each point in its own cluster. Clusters are combined based on their "closeness". Combination stops when further combination leads to clusters that are undesirable for one of several reasons. Ex: we may stop when we have a predetermined number of clusters, or we may use a measure of compactness for clusters, and refuse to construct a cluster by combining two smaller clusters if the resulting cluster has points that are spread out over too large a region. Hierarchical clustering can be done top-down (divisive) or bottom-up (agglomerative).

• Top-down: subdivide objects. Partition the items.
• Bottom-up: every object starts in an individual cluster; cluster them all together into one aggregated cluster. Given a set of instances, I create one cluster for each instance. I want to find the nearest two clusters.

In either case, we maintain a matrix of distance (or similarity) scores for all pairs of

• Instances
• Clusters (formed so far)
• Instances and clusters


Dendrograms are used to represent hierarchical clustering. We have distance on the x-axis. Leaves represent the instances. The bar height indicates the degree of difference within the cluster.

We have to decide in advance on:

• How will clusters be represented?
• How will we choose which two clusters to merge?
• When will we stop combining clusters? There are several approaches we might use to stop the clustering process.
  o We could be told, or have a belief, about how many clusters there are in the data.
  o We could stop combining when at some point the best combination of existing clusters produces a cluster that is inadequate.
  o Continue clustering until there is only one cluster. However, it is meaningless to return a single cluster consisting of all the points.

2.2 Distance measures

The distance between two clusters can be determined in several ways:

• Single link: distance of the two most similar instances: dist(cu, cv) = min{dist(a, b) | a ∈ cu, b ∈ cv}. Take the minimum distance. Issue: chain phenomenon.


• Complete link: distance of the two least similar instances: dist(cu, cv) = max{dist(a, b) | a ∈ cu, b ∈ cv}. Take the maximum distance between two clusters. Usually when you do clustering you apply complete linkage.

• Average link: average distance between instances: dist(cu, cv) = avg{dist(a, b) | a ∈ cu, b ∈ cv}

If we merge cu and cv into cj, we can determine the distance to each other cluster ck:
• Single link: dist(cj, ck) = min{dist(cu, ck), dist(cv, ck)}
• Complete link: dist(cj, ck) = max{dist(cu, ck), dist(cv, ck)}
• Average link: dist(cj, ck) = (|cu| * dist(cu, ck) + |cv| * dist(cv, ck)) / (|cu| + |cv|)

Single link and chaining: see the figure in the slides.
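To illustrate how the linkage choice plugs into the algorithm, here is a small, deliberately naive Python sketch of bottom-up clustering on 2-D points; the O(n³) loop, the function name and the example points are illustrative only.

```python
import itertools

def agglomerative(points, k, linkage="complete"):
    """Naive bottom-up clustering of 2-D points into k clusters.
    linkage: 'single' (min), 'complete' (max) or 'average'."""
    def d(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    def cluster_dist(c1, c2):
        dists = [d(a, b) for a in c1 for b in c2]
        if linkage == "single":
            return min(dists)
        if linkage == "complete":
            return max(dists)
        return sum(dists) / len(dists)        # average link

    clusters = [[p] for p in points]          # start: one cluster per point
    while len(clusters) > k:
        # find the closest pair of clusters and merge them
        i, j = min(itertools.combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (25, 0)]
print(agglomerative(pts, k=3))
```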


Bottom line (single link):

• Simple, fast
• Often low quality

Complete link:

• Worst case O(n³)
• Fast algorithm: requires O(n²) space
• No chaining
• Bottom line: typically much faster than O(n³), often good quality

2.3 Evaluation of the algorithm

Weaknesses of agglomerative clustering methods:

• The basic algorithm is not very efficient. At each step, we must compute the distances between each pair of clusters, in order to find the best merger.
• They do not scale well: time complexity of at least O(n²), where n is the total number of objects
• They can never undo what was done previously
ð Integration of hierarchical with distance-based clustering:
• BIRCH: uses a CF-tree and incrementally adjusts the quality of sub-clusters
• CURE: selects well-scattered points from the cluster and then shrinks them towards the centre of the cluster by a specified fraction

BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies (Zhang, Ramakrishnan & Livny, 1996). It incrementally constructs a CF (Clustering Feature) tree. Parameters: max diameter, max children.

§ Phase 1: scan the DB to build an initial in-memory CF tree (each node: # points, sum, sum of squares)
§ Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree

It scales linearly: it finds a good clustering with a single scan. Weaknesses: handles only numeric data, sensitive to the order of the data records.


2.4 Cluster Feature Vector

Given: X1, ..., Xn, data points in a cluster, each with d dimensions. The cluster feature summarises them as (N, LS, SS): the number of points, the linear sum of the points, and the sum of squares of the points.

Note: CFs are additive! E.g., CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)

2d + 1 values represent any number of points (d = number of dimensions). The average in each dimension can be calculated as SUMi / N. The variance in dimension i can be computed as (SUMSQi / N) – (SUMi / N)². To get the standard deviation, take the square root. We can also compute the radius.
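A minimal sketch of such a cluster feature, assuming numpy (the class name is illustrative): it keeps only the 2d+1 numbers (N, LS, SS) yet can return the centroid and per-dimension variance, and merging two CFs is just adding their entries.

```python
import numpy as np

class ClusterFeature:
    """BIRCH-style cluster feature (N, LS, SS) for d-dimensional points."""
    def __init__(self, d):
        self.n = 0
        self.ls = np.zeros(d)   # per-dimension linear sum
        self.ss = np.zeros(d)   # per-dimension sum of squares

    def add(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.ls += x
        self.ss += x ** 2

    def merge(self, other):
        # CFs are additive: merging two sub-clusters is just adding the entries.
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n                              # SUM_i / N

    def variance(self):
        return self.ss / self.n - (self.ls / self.n) ** 2    # SUMSQ_i/N - (SUM_i/N)^2

cf = ClusterFeature(d=2)
for point in [(1.0, 2.0), (2.0, 2.5), (1.5, 3.0)]:
    cf.add(point)
print(cf.centroid(), cf.variance())
```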

2.5 Cluster Feature Tree


A CF-tree is a height-balanced tree with two parameters:

• Branching factor (non-leaf nodes: B, leaf nodes: L)
• Threshold T

Each non-leaf node has the form [CFi, childi]. Each leaf node has:

• A set of CFs
• Two pointers: prev and next

The radius of a sub-cluster under a leaf node cannot exceed the threshold T.

To construct the CF-tree, scan the dataset and insert the incoming data instances into the CF tree one by one. Each instance is inserted into the closest sub-cluster under a leaf node. If the insertion causes the sub-cluster diameter to exceed the threshold, then create a new sub-cluster. The new sub-cluster may cause its parent to exceed the branching factor. If so, split the leaf node:

§ Identify the pair of sub-clusters with the largest inter-cluster distance.
§ Divide the remaining sub-clusters by proximity to these two sub-clusters.

If this split causes the non-leaf node to exceed the branching factor, then split recursively. If the root node is split, then the height of the CF tree is increased by one.
ð An example is given in the slides.

Traditional algorithms


What would BIRCH do?

BIRCH assumes:

§ Clusters are normally distributed in each dimension
§ Axes are fixed: ellipses at an angle are not OK

Clustering Using Representatives (CURE). Cluster definition: a set of representative points. It enables clusters of differing shapes.
ð Requires a Euclidean space

The CURE algorithm does not assume anything about the shape of clusters; they need not be normally distributed, and can even have strange bends, S-shapes, or even rings. Instead of representing clusters by their centroid, it uses a collection of representative points, as the name implies.

Two-pass (hierarchical) clustering approach

• Pass 1: clustering of a subset of the data to pick "representative points".


  o Take a small sample of the data and cluster it in main memory.
  o Select a small set of points from each cluster to be representative points. These points should be chosen to be as far from one another as possible.
  o Move each of the representative points a fixed fraction of the distance between its location and the centroid of its cluster. 20% is a good fraction to choose.

• Pass 2: assign all points to clusters. Scan the entire dataset and assign each example e to the "closest" cluster.
  o A standard metric determines the closest
  o Done by finding the representative with the smallest distance to e.

3. Partitional clustering

3.1 Partitioning Algorithms

Method: construct a partition of a database D of n objects into a set of k clusters.


Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion. Global optimum: exhaustively enumerate all partitions. Heuristic methods: the k-means and k-medoids algorithms.

§ k-means (MacQueen, 1967): each cluster is represented by the center of the cluster
§ k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw, 1987): each cluster is represented by one of the objects in the cluster

Idea: divide instances into disjoint clusters. Flat vs. tree structure.

Issues:

§ How many clusters should there be?
§ How should clusters be represented?

We can generate a partitional clustering from a hierarchical clustering by "cutting" the tree at some level.

3.2 K-Means algorithm

The K-means algorithm is a commonly used algorithm. It is both easy to implement and quick to run. Assumptions:

§ Objects are n-dimensional vectors
§ There is a distance/similarity measure between these instances

Goal: partition the data in K disjoint subsets. Ideally the partition should reflect the structure of the data.


Initially choose k points that are likely to be in different clusters. Make these points the centroids of their clusters. For each remaining point p, find the centroid to which p is closest. Add p to the cluster of that centroid and then adjust the centroid of that cluster to account for p.
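For reference, here is a minimal numpy sketch of the iterative (Lloyd-style) k-means loop: assign every point to its nearest centroid, recompute each centroid as the mean of its cluster, and repeat until nothing moves. The random seeding is deliberately simplistic; the smarter initializations are discussed next.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: X is an (n, d) array; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Seed: pick k distinct data points at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [9.0, 0.0]])
print(kmeans(X, k=2))
```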

How to initialize the clusters? We want to pick points that have a good chance of lying in different clusters. There are 2 approaches:

• Pick points that are as far away from one another as possible. Good choice: pick the first point at random.
• Cluster a sample of the data, perhaps hierarchically, so there are k clusters. Pick a point from each cluster, perhaps the point closest to the centroid of the cluster.

An optional step at the end is to fix the centroids of the clusters and to reassign each point, including the k initial points, to the k clusters. The results depend on the seed selection. Some seeds can result in poor convergence or sub-optimal clusterings.

• Pick points randomly or from data instances
• Use the results of another method
• Many runs of k-means, each with random seeds

The centroid µj of cluster cj is computed as the mean of the examples assigned to it:
• |cj| is the number of examples assigned to cluster cj
• i ∈ cj, i.e., the examples that are assigned to cluster cj
• This is a vector: calculate µ along each dimension

ð An example is provided in the slides.

A tricky point of this algorithm is picking the right value of k. We may not know the correct value of k to use in a k-means clustering. However, if we can measure the quality of the clustering for various values of k, we can usually guess what the right value of k is. If we take a measure of appropriateness for clusters, such as average radius or diameter, that value will grow slowly, as long as the number of clusters we assume remains at or above the true number of clusters. As soon as we try to form fewer clusters than there really are, the measure will rise precipitously.

Time complexity

• Distance between two instances: O(d), where d is the dimensionality of the vectors
• Reassigning clusters: O(kn) distance computations, or O(kdn)
• Computing centroids: each instance vector gets added once to some centroid: O(dn)
• Assume these two steps are each done once for I iterations: O(Ikdn)


• Linear in all relevant factors; with a fixed number of iterations, more efficient than O(n²) HAC

3.3 Bradley-Fayyad-Reina (BFR)

The BFR algorithm is a variant of k-means that is designed to cluster data in a high-dimensional Euclidean space. Key assumption: clusters are normally distributed around a centroid in a Euclidean space. The standard deviations in different dimensions may vary, but the dimensions must be independent. Algorithm overview:

• Read points one main-memory-full at a time. The algorithm begins by selecting k points, using one of the methods we saw in the previous section. Then, the points of the data file are read in chunks. These might be chunks from a distributed file system, or a conventional file might be partitioned into chunks of the appropriate size. Each chunk must consist of few enough points that they can be processed in main memory. Also stored in main memory are summaries of the k clusters and some other data, so the entire memory is not available to store a chunk. Three classes of points:
  o The discard set: points close enough to a centroid to be represented statistically. For each cluster, the discard set is represented by:
    § The number of points, N
    § The vector SUM, whose ith component is the sum of the coordinates of the points in the ith dimension.
    § The vector SUMSQ: ith component = sum of squares of the coordinates in the ith dimension.
  o The compression set: groups of points that are close together but not close to any centroid. They are represented statistically, but not assigned to a cluster.
  o The retained set: isolated points

• Summarize most points by simple statistics


• To begin, from the initial load we select the initial k centroids by some sensible approach.

Initialization

• Take a small, random sample and cluster it optimally
• Take a sample; pick a random point, and then k-1 more points, each as far from the previously selected points as possible

Processing points

• Find points close to a cluster centroid. Add them to the cluster and the discard set.
• Perform in-memory clustering on the remaining points and the old RS. Clusters become CS, the others go to RS.
• Adjust cluster statistics based on the new points
• Consider merging pairs of CS
• On the final round, merge all CSs and RS points into the nearest cluster.

Two key questions

1) When is a point "close enough" to a cluster that we can add it to that cluster? We need a way to decide whether to put a new point into a cluster. BFR suggests two ways:
  o The Mahalanobis distance is less than a threshold. This is a normalized Euclidean distance. For point (x1, ..., xk) and centroid (c1, ..., ck):
    § Normalize in each dimension: yi = |xi - ci| / σi
    § Take the sum of the squares of the yi's
    § Take the square root
    § Accept a point for a cluster if its M.D. is < some threshold, e.g., 4 standard deviations

Regions of equal M.D.: see the figure in the slides.
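A small sketch of that test, assuming numpy and a diagonal covariance (independent dimensions, as BFR assumes); the function name and the toy cluster summary are illustrative.

```python
import numpy as np

def mahalanobis_diag(point, n, total, total_sq, threshold=4.0):
    """Normalized (diagonal-covariance) distance between a point and a cluster
    summarized by its BFR statistics N, SUM, SUMSQ.
    Returns (distance, accept) where accept is True if distance < threshold."""
    point = np.asarray(point, dtype=float)
    centroid = total / n
    variance = total_sq / n - centroid ** 2          # per-dimension variance
    sigma = np.sqrt(variance)
    y = np.abs(point - centroid) / sigma             # normalize each dimension
    dist = np.sqrt(np.sum(y ** 2))
    return dist, dist < threshold

# Cluster summary built from three 2-D points
pts = np.array([[1.0, 2.0], [2.0, 2.5], [1.5, 3.0]])
n, s, sq = len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0)
print(mahalanobis_diag([1.6, 2.4], n, s, sq))
```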

  o Low likelihood of the currently nearest centroid changing. Add p to a cluster if it not only has the centroid closest to p, but it is also very unlikely that, after all the points have been processed, some other cluster centroid will be found to be nearer to p.

2) When should two compressed sets be combined into one? Compute the variance of the combined sub-cluster. N, SUM and SUMSQ allow us to make that calculation. Combine if the variance is below some user-provided threshold.

Getting the k right


ð How to select k? Try different k, looking at the change in the average distance to the centroid as k increases. The average falls rapidly until the right k, then changes little.

• Too few: many long distances to centroid
• Just right: distances rather short
• Too many: little improvement in average distance


To wrap up

K-Means strengths:
• Efficient: O(Ikdn)
• Straightforward to implement
• Finds a local optimum; can use restarts or more advanced search to find a better solution

K-Means weaknesses:
• Must be able to define a mean
• Need to provide k
• Susceptible to noise and outliers
• Cannot represent non-convex clusters

4. Model-Based clustering

Basic idea: clustering as probability estimation. Generative model:
1) Probability of selecting a cluster
2) Probability of generating an object in the cluster


Representation: each cluster is represented by a probability distribution (density).
Given: data points, number of clusters, model form of each cluster
Find: maximum likelihood or MAP assignment of data points to clusters, and learn the parameters for each cluster.
Quality of clustering: likelihood of test objects

Key challenge

• If we knew which examples belong to each cluster, then we could set the model parameters
• If we knew the model parameters, then we could estimate the probability that a cluster generated an example

ð Chicken-and-egg problem

Missing information: cluster membership. Idea: hidden variables for cluster membership. Each instance xi has a set of hidden variables zi1, ..., zik. Intuition: zij = 1 if i belongs to cluster j and 0 otherwise.


In k-means, instances are assigned to exactly one cluster. EM clustering is a "soft" k-means where each example has a probability of belonging to a cluster. Guess the initial parameters for the model in each cluster and then iterate until convergence.

• E step: determine how likely it is that each cluster generated each instance
• M step: adjust the cluster parameters to maximize the likelihood.

E-Step

Recall that zij is a hidden variable which is 1 if Nj generated xi and 0 otherwise. In the E-step, we compute hij, the expected value of this hidden variable.

M-Step

Given the expected values hij, we re-estimate the means of the Gaussians and the cluster probabilities.

Note: "i" goes over examples.

EM clustering… To sum up:
• Will converge to a local maximum


• Sensitive to the initial means of the clusters
• Have to choose the number of clusters in advance
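As an illustration of the E and M steps above, here is a minimal numpy sketch of EM for a one-dimensional Gaussian mixture (the slides use the multivariate case; the function name and the generated data are illustrative).

```python
import numpy as np

def em_gaussian_mixture(x, k, n_iter=50, seed=0):
    """EM for a 1-D Gaussian mixture ('soft' k-means).
    x: 1-D array of observations. Returns (weights, means, variances)."""
    rng = np.random.default_rng(seed)
    means = rng.choice(x, size=k, replace=False).astype(float)
    variances = np.full(k, x.var())
    weights = np.full(k, 1.0 / k)

    for _ in range(n_iter):
        # E-step: h[i, j] = expected value of z_ij, i.e. P(cluster j | x_i)
        h = np.zeros((len(x), k))
        for j in range(k):
            h[:, j] = weights[j] * np.exp(-(x - means[j]) ** 2 / (2 * variances[j])) \
                      / np.sqrt(2 * np.pi * variances[j])
        h /= h.sum(axis=1, keepdims=True)

        # M-step: re-estimate means, variances and cluster probabilities
        nj = h.sum(axis=0)                     # "effective" number of points per cluster
        means = (h * x[:, None]).sum(axis=0) / nj
        variances = (h * (x[:, None] - means) ** 2).sum(axis=0) / nj
        weights = nj / len(x)
    return weights, means, variances

x = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(6, 1, 200)])
print(em_gaussian_mixture(x, k=2))
```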

Evaluating cluster results

Given random data without any "structure", clustering algorithms will still return clusters. The gold standard: do clusters correspond to natural categories? Do clusters correspond to categories we care about? (There are lots of ways to partition the world.) Approaches to cluster evaluation:

o External validation. Ex: do genes clustered together have some common function?
o Internal validation: how well does the clustering optimize intra-cluster similarity and inter-cluster dissimilarity?
o Relative validation: how does it compare to other clusterings? Ex: with a probabilistic method (such as EM) we can ask: how probable does held-aside data look as we vary the number of clusters?

5. Applications

5.1 Low Quality of Web Searches

• System perspective
  o Small coverage of the Web (<16%)
  o Dead links and out-of-date pages
  o Limited resources
• IR perspective (relevancy of doc ~ similarity to query)
  o Very short queries
  o Huge database
  o Novice users

5.2 Document clustering

• User receives many (200-500) documents from a web search engine
• Group documents in clusters by topic
• Present clusters as the interface

ð We need a way to compare queries and documents

• Vector space model
  o How to determine important words in a document?
  o How to determine the degree of importance of a term within a document and within the entire collection?
  o How to determine the degree of similarity between a document and the query?


  o In the case of the web, what is a collection, and what are the effects of links, formatting information, etc.?

Assume t distinct terms remain after pre-processing: the vocabulary. These "orthogonal" terms form a vector space. Dimension = t = |vocabulary|. Each term, i, in a document or query, j, is given a real-valued weight, wij. Both documents and queries are expressed as t-dimensional vectors: dj = (w1j, w2j, ..., wtj).

The vector space model represents a collection of n documents by a term-document matrix. Each entry: the "weight" of a term in the document.

More frequent terms in a document are more important, i.e. more indicative of the topic. fij = frequency of term i in document j. We may want to normalize the term frequency (tf) by dividing by the frequency of the most common term in the document: tfij = fij / maxi{fij}.

Terms that appear in many different documents are less indicative of the overall topic. dfi = document frequency of term i = number of documents containing term i. idfi = inverse document frequency of term i = log2(N / dfi) (N: number of documents). This is an indication of a term's discrimination power.

ð The log is used to dampen the effect relative to tf


A typical combined term-importance indicator is tf-idf weighting:

wij = tfij * idfi = tfij * log2(N / dfi)

A term occurring frequently in the document but rarely in the rest of the collection is given a high weight. Many other ways of determining term weights have been proposed. Experimentally, tf-idf works well.

Example: given a document containing terms with frequencies A(3), B(2), C(1). Assume the collection contains 10,000 documents and the document frequencies of these terms are A(50), B(1300), C(250). Then:
A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6
B: tf = 2/3; idf = log2(10000/1300) = 2.9; tf-idf = 2.0
C: tf = 1/3; idf = log2(10000/250) = 5.3; tf-idf = 1.8

A query vector is typically treated as a document and is also tf-idf weighted. An alternative is for the user to supply weights for the given query terms.
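A small sketch that reproduces the worked example above (plain Python; the function name is illustrative):

```python
import math

def tf_idf(doc_freqs, collection_dfs, n_docs):
    """Compute tf-idf weights for one document.
    doc_freqs: {term: raw frequency in the document}
    collection_dfs: {term: number of documents containing the term}
    n_docs: total number of documents in the collection."""
    max_f = max(doc_freqs.values())
    weights = {}
    for term, f in doc_freqs.items():
        tf = f / max_f                                   # normalized term frequency
        idf = math.log2(n_docs / collection_dfs[term])   # inverse document frequency
        weights[term] = tf * idf
    return weights

# The worked example above: A(3), B(2), C(1) in a collection of 10,000 documents
print(tf_idf({'A': 3, 'B': 2, 'C': 1},
             {'A': 50, 'B': 1300, 'C': 250},
             n_docs=10000))
```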

• Inner product
  o wij is the weight of term i in doc j
  o wiq is the weight of term i in the query
• Cosine similarity
  o Measures the cosine of the angle between two vectors
  o Inner product normalized by the vector lengths

Comparison
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + 1T3
Q = 0T1 + 0T2 + 2T3

Weighted inner product:
sim(D1, Q) = 2*0 + 3*0 + 5*2 = 10


sim(D2, Q) = 3*0 + 7*0 + 1*2 = 2

Cosine:
sim(D1, Q) = 10 / √((4+9+25)(0+0+4)) = 0.81
sim(D2, Q) = 2 / √((9+49+1)(0+0+4)) = 0.13
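The same comparison in a few lines of Python (illustrative only):

```python
import math

def inner_product(d, q):
    return sum(a * b for a, b in zip(d, q))

def cosine_similarity(d, q):
    # inner product normalized by the two vector lengths
    return inner_product(d, q) / math.sqrt(inner_product(d, d) * inner_product(q, q))

D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
print(inner_product(D1, Q), inner_product(D2, Q))            # 10 2
print(round(cosine_similarity(D1, Q), 2),
      round(cosine_similarity(D2, Q), 2))                    # 0.81 0.13
```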

ð D1 is 6 times better than D2 using cosine similarity, but only 5 times better using the inner product.

Comments on the Vector Space Model

• Simple, mathematically based approach
• Considers both local (tf) and global (idf) word occurrence frequencies
• Provides partial matching and ranked results
• Tends to work quite well in practice despite obvious weaknesses
• Allows efficient implementation for large document collections

Weaknesses of the Vector Space Model

• Missing semantic information. Ex: word sense
• Missing syntactic information (Ex: phrase structure, word order, proximity information)
• Assumption of term independence (Ex: ignores synonymy)

5.3 Bioinformatics

Molecular biology has become a data-rich science. The Nucleic Acids Research Molecular Biology Database Collection indexes 1330 data sources. This represents lots and lots of data, but there is a lack of understanding of biological systems. Computational methods are crucial for understanding the available data.

Genes

Genes are the basic units of heredity. A gene is a sequence of DNA bases that carries the information required for constructing a particular protein. Such a gene is said to encode a protein. The human genome comprises ~25,000 protein-coding genes.

Types of data

• Full sequences of genomes
• Single nucleotide polymorphisms


• Gene expression data
• Many more!

Microarrays allow us to measure gene expression. A microarray is a solid support on which pieces of DNA are arranged in a grid-like array. It measures RNA abundances by exploiting complementary hybridization. How active are various genes in different cell/tissue types? How does the activity level of various genes change under different conditions?

• Stages of a cell cycle
• Environmental conditions
• Disease states
• Knockout experiments

What genes seem to be regulated together?

Data points are genes

• Represented by expression levels across different samples (i.e., features = samples)
• Goal: categorize new genes

Data points are samples (e.g., patients)

• Represented by expression levels of different genes (i.e., features = genes)
• Goal: categorize new samples

Unsupervised learning task (1). Given: a set of microarray experiments under different conditions. Do: cluster the genes, where a gene is described by its expression levels in different experiments.
Unsupervised learning task (2). Given: a set of microarray experiments (samples) corresponding to different conditions or patients. Do: cluster the experiments. Ex:


• Cluster samples from mice subjected to a variety of toxic compounds (Thomas et al., 2001)
• Cluster samples from cancer patients, potentially to discover different subtypes of a cancer
• Cluster samples taken at different time points


CHAPTER 9: USING UNLABELED DATA

1. Introduction

When you do something like classification you have labelled data. The assumption is that someone is telling you what is labelled as positive or as negative. Where does labelled data come from? It comes from tasks people are willing to label.

• Netflix, Amazon, etc.
• Spam
• Medical diagnoses

Often we have to get people to label data

• Web ranking
• Document classification

Problem: labelling data is expensive because you have to pay someone to do it, and you need a lot of data. Next, it is also time-consuming: thinking about what the label should be, etc. Learning methods need labelled data.

• Lots of <x, f(x)> pairs
• Hard to get… who wants to label data?

But unlabelled data is usually plentiful… could we use this instead?

• Semi-supervised learning
• Active learning

Problem set-up. Motivation: it is hard to get labelled data. Given: we assume we have some labelled data and some unlabelled data.

• Labeled data: D = [<x, f(x)>]
• Unlabeled data: U = [<x, ??>]

ð Learn a hypothesis h that approximates f.

2. Semi-supervised learning

Self-training: use an ensemble technique such as bagging to build a set of classifiers on the labelled data. Classify each unlabelled training example with each classifier, making a prediction on each unlabelled example. If all classifiers agree on an example's label, add it to the training set with its predicted label. Next, iterate. For the first iteration you have little training data.
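A minimal sketch of that self-training loop, assuming scikit-learn is available for the bagged decision trees; the function name, the number of rounds and the unanimity rule follow the description above, but the details are illustrative.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

def self_train(X_lab, y_lab, X_unlab, n_rounds=5, n_estimators=10):
    """Repeatedly fit a bagged ensemble on the labelled data, then move
    unanimously-labelled unlabelled examples into the training set."""
    X_lab, y_lab, X_unlab = np.array(X_lab), np.array(y_lab), np.array(X_unlab)
    for _ in range(n_rounds):
        if len(X_unlab) == 0:
            break
        ensemble = BaggingClassifier(DecisionTreeClassifier(),
                                     n_estimators=n_estimators).fit(X_lab, y_lab)
        # Each committee member classifies every unlabelled example.
        votes = np.array([est.predict(X_unlab) for est in ensemble.estimators_])
        unanimous = (votes == votes[0]).all(axis=0)
        if not unanimous.any():
            break
        # Add unanimously labelled examples with their predicted label, then iterate.
        X_lab = np.vstack([X_lab, X_unlab[unanimous]])
        y_lab = np.concatenate([y_lab, votes[0][unanimous]])
        X_unlab = X_unlab[~unanimous]
    return X_lab, y_lab
```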


Another way is called co-training. Idea: you have a little labelled data and lots of unlabelled data. And each instance has two parts:

• x = [x1, x2]
• x1, x2 conditionally independent given f(x)

Each half can be used to classify the instance: ∃ f1, f2 such that f1(x1) ~ f2(x2) ~ f(x). Both f1, f2 are learnable: f1 ∈ H1, f2 ∈ H2, ∃ learning algorithms A1, A2. Ex: you can think of classifying documents on web pages, having two feature sets per web page. One: the words on the web page. Second: the words in the links on other internet pages pointing to it.

One can apply A1 to generate as much training data as one wants. If x1 is conditionally independent of x2 given f(x), then the errors in the labels produced by A1 will look like random noise to A2! Thus there is no limit to the quality of the hypothesis A2 can make.

Learning to classify web pages as course pages.

Two features:
• x1 = bag of words on a page
• x2 = bag of words from all anchors pointing to a page


Naïve Bayes classifiers, 12 labeled pages, 1039 unlabeled pages (results table in the slides).

One thing that is surprising is how much the error rates differ. With co-training, the error rate is halved.

3. Active learning

3.1 The concept

Active learning is a subfield of machine learning and, more generally, artificial intelligence. The key hypothesis is that if the learning algorithm is allowed to choose the data from which it learns (to be "curious", if you will), it will perform better with less training. Why is this a desirable property for learning algorithms to have? Consider that, for any supervised learning system to perform well, it must often be trained on hundreds (even thousands) of labelled instances.

Suppose you are the leader of an Earth convoy sent to colonize planet Mars.

Problem: there's a range of spiky-to-round fruit shapes on Mars. You need to learn the "threshold" of roundness where the fruits go from poisonous to safe. And… you need to determine this risking as few colonists' lives as possible. A way to think about doing this is a binary search, instead of testing all of the instances.

Key idea: the learner can choose its training data. You withhold some labels and choose which labels you want to obtain.


• On Mars: whether a fruit was poisonous/safe
• In general: the true label of some instance

Goal: reduce the training costs

• On Mars: the number of "lives at risk"
• In general: the number of "queries"

Active learning systems attempt to overcome the labeling bottleneck by asking queries in the form of unlabeled instances to be labeled by an oracle (e.g., a human annotator). In this way, the active learner aims to achieve high accuracy using as few labeled instances as possible, thereby minimizing the cost of obtaining labeled data. Active learning is well-motivated in many modern machine learning problems where data may be abundant but labels are scarce or expensive to obtain. How do we do this?

You can think about generating examples and asking for their labels. Another approach you can think about is data streaming.

Pool-Based Active Learning Cycle

There are several scenarios in which active learners may pose queries, and there are also several different query strategies that have been used to decide which instances are most informative. In this section, two illustrative examples in the pool-based active learning setting (in which queries are selected from a large pool of unlabeled instances U) are presented using an uncertainty sampling query strategy (which selects the instance in the pool about which the model is least certain how to label).

This figure illustrates the pool-based active learning cycle. A learner may begin with a small number of instances in the labeled training set L, request labels for one or more carefully selected instances, learn from the query results, and then leverage its new


knowledge to choose which instances to query next. Once a query has been made, there are usually no additional assumptions on the part of the learning algorithm. The new labeled instance is simply added to the labeled set L, and the learner proceeds from there in a standard supervised way.
ð I have a small label set, I learn a model. We use that model to make predictions on unlabelled data. In active learning we ask someone a question about what the true label is.

You show a plot with the number of queries on the x-axis and accuracy on the y-axis. Usually you get such a curve. With active learning the algorithm decides which examples it takes. And usually, for the same number of instances, you get a better accuracy than with passive learning. This is used a lot.

3.2 How to select queries?

We assume that we received a small sample of unlabeled examples and a set of labeled examples. The goal is to reduce the training cost, that is to say, the cost of getting labels. Key question: which examples do we want to use?
ð Let's try generalizing our binary search method using a probabilistic classifier. We apply something like logistic regression.


Query the examples the learner is most uncertain about:

• Closest to 0.5 probability
• Closest to the decision surface

Look at examples close to the decision boundary, since these cause the boundary to move. If we have positive and negative examples, look at examples that are closest to the line. We take one, and we ask someone to give it a label. Here, it is given a negative label. We retrain the classifier, giving a new decision boundary. We again pick an example close to the boundary, give it a label, and we do this again and again, etc.

The intuition here is that we should first observe the labels of points close to the decision boundary.
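A small sketch of this pool-based uncertainty-sampling loop, assuming scikit-learn's logistic regression; the oracle function stands in for the human annotator and is purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling_loop(X_lab, y_lab, X_pool, oracle, n_queries=10):
    """Pool-based active learning with uncertainty sampling.
    oracle(x) is a stand-in for the human annotator: it returns the true label of x."""
    X_lab, y_lab, X_pool = np.array(X_lab), np.array(y_lab), np.array(X_pool)
    for _ in range(n_queries):
        if len(X_pool) == 0:
            break
        model = LogisticRegression().fit(X_lab, y_lab)
        # Uncertainty = how close the predicted P(positive) is to 0.5.
        proba = model.predict_proba(X_pool)[:, 1]
        idx = np.argmin(np.abs(proba - 0.5))
        # Query the oracle for the most uncertain instance and add it to L.
        x, y = X_pool[idx], oracle(X_pool[idx])
        X_lab = np.vstack([X_lab, x])
        y_lab = np.append(y_lab, y)
        X_pool = np.delete(X_pool, idx, axis=0)
    return LogisticRegression().fit(X_lab, y_lab)
```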

3.3 Query-By-Committee (QBC)

Train a committee C = {q1, q2, ..., qC} of classifiers on the labeled data in L. Query instances in U for which the committee is in most disagreement. Try to find models that are consistent with the training data, reducing the model version space. Key idea: reduce the model version space. It expedites the search for a model during training.


We have 4 labeled instances: 3 triangles and a square. Train a model such that everything inside the rectangle is of class square and everything outside is of class triangle.

How to build a committee:
• "Sample" models from P(q | L) [Dagan & Engelson, ICML'95; McCallum & Nigam, ICML'98]
• Standard ensembles (e.g., bagging, boosting) [Abe & Mamitsuka, ICML'98]

How to measure disagreement:
• «XOR» the committee classifications
• View the vote distribution as probabilities, and use uncertainty measures (e.g., entropy)

3.4 Alternative Query Types

So far, we assumed queries are instances. Ex: for document classification the learner queries documents. Can the learner do better by asking different types of questions? Questions not about instances but about different features, for instance.

• Multiple-instance active learning
• Feature active learning: try to get feedback about which features are more useful than others. You can then get better results. In NLP tasks, we can often intuitively label features.
  o The feature word "puck" indicates the class hockey
  o The feature word "strike" indicates the class baseball

Tandem learning exploits this by asking both instance-label and feature-relevance queries. Ex: "is puck an important discriminative feature?"

4. Summary

Both try to attack the same problem: making the most of the unlabeled data U. Getting labels is expensive. Each attacks it from a different direction:

• Semi-supervised learning exploits what the model thinks it knows about the unlabeled data. It propagates information about what it knows, from the labeled data to the unlabeled data.
• Active learning explores the unknown aspects of the unlabeled data. Look at what is uncertain.


CHAPTER 10: TIME SERIES

1. Time Series Intro

A time series is a collection of measurements made sequentially in time. Usually people only want to measure one variable when looking at time series.

Time series face different challenges:

• Lots and lots of data: videos, for instance.
• Autocorrelation: the current value depends on previously observed values.
• Data is messy: you usually collect time series from sensors.

  o Different sampling rates and ranges (e.g., clipping in accelerometer data)
  o Noise
  o Missing values (e.g., sensor drops)
  o Etc.

Standard pre-processing: Z-normalization. Usually when working with data collected from sensors you want to do some sort of pre-processing.

Given: S = (s1, ..., sn)

Let each si be replaced by s'i = (si – µ) / σ, where µ is the mean and σ the standard deviation of S.
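In numpy this is a one-liner (illustrative sketch):

```python
import numpy as np

def z_normalize(s):
    """Z-normalize a time series: subtract the mean and divide by the standard deviation."""
    s = np.asarray(s, dtype=float)
    return (s - s.mean()) / s.std()

print(z_normalize([2.0, 4.0, 6.0, 8.0]))   # result has mean 0 and std 1
```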

2. Time Series Classification

One of the obvious tasks you can do with time series is classification. Given: a set of time series with known labels.

Goal: make predictions about future time series. You are given a new time series measurement and you want to predict the class of this instance.


Assumption: labels are assigned to the entire sequence. We do not necessarily know what value/set of values is responsible for the label. I don't know which measurements cause the time series to be labelled positive or negative.

Idea 1: Define features

The first obvious approach is to define features. So, given a time series, convert it to a feature vector representation. There are several ways to do so:

• Use statistics: min, max, etc.
• Transformations: discrete wavelet transform, etc.
• Entropy
• …

ð Apply standard supervised learners to the data to make a prediction.

Idea 2: Nearest Neighbours

Compare the new time series to my training set and measure the distance/similarity between them. Just predict the label based on the smallest distance.

ð How can we measure the distance between time series?

• Euclidean distance
Given: two time series of equal length, Q = (q1, q2, …, qn) and S = (s1, …, sn). Take the Euclidean distance by comparing each measurement in one time series to the corresponding measurement of the other time series.

ð Can we capture this alignment? One standard approach for doing this is called Dynamic Time Warping (DTW). Basic idea: allow a one-to-many mapping. Each measurement in one time series gets matched to one or more measurements in the other time series. It allows us to have some non-linearity in the data. Constraints on the mapping:


§ Must match the beginning and end of the sequences
§ Monotonicity: cannot go backwards in time
§ Continuity: no gaps
§ (optional) Warping window w: |i - j| ≤ w

Compute DTW with dynamic programming. There is no restriction that both sequences should have the same length. Again we have the same problem set-up, having two time series S and Q. Given: S = (s1, …, sn) and Q = (q1, …, qm). Do:

• Let D = an n x m matrix
• Let δ(si, qj) be the distance between si and qj
• D[i, j] = δ(si, qj) + min(D[i-1, j-1], D[i, j-1], D[i-1, j])

If w is given, then only compute D[i, j] if |i - j| ≤ w.

ð An example is given in the slides.
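A direct implementation of the recurrence above (plain Python with numpy; the optional warping window w is handled by simply skipping cells with |i - j| > w):

```python
import numpy as np

def dtw(s, q, w=None):
    """Dynamic Time Warping distance between sequences s and q.
    w: optional warping-window width (|i - j| <= w); None means no window."""
    n, m = len(s), len(q)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        lo, hi = (1, m) if w is None else (max(1, i - w), min(m, i + w))
        for j in range(lo, hi + 1):
            cost = abs(s[i - 1] - q[j - 1])          # delta(s_i, q_j)
            D[i, j] = cost + min(D[i - 1, j - 1],    # match both
                                 D[i, j - 1],        # q advances
                                 D[i - 1, j])        # s advances
    return D[n, m]

print(dtw([1, 2, 3, 4, 3], [1, 2, 2, 3, 4, 4, 3]))        # 0.0: perfect warped match
print(dtw([1, 2, 3, 4, 3], [1, 2, 2, 3, 4, 4, 3], w=2))   # same path fits in the window
```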

Trace back. We want to start from the right corner and trace back. We know that 5 will be matched to 7, so 10 can be matched to 7 and 8 will be matched to 7. 13 is matched to 12. Etc.

This gives a mapping between both sequences.

This is a good way to measure the distance between both sequences.

If we have a warping window of 1, we get the following matrix. Blue entries are the ones violating the constraint.


One thing to know is that DTW is not a metric… because:

o DTW violates the triangle inequality
o Metrics are usually preferred as many efficiency tricks exploit the triangle inequality
  § However, many "tricks" exist for DTW
  § It can be computed very quickly
o Also, most of the time DTW acts like a metric. Ex: >99% of randomly sampled triples will satisfy the triangle inequality for DTW.

Exploiting the Triangle Inequality

Setting the Warping Constraint. The warping constraint has a big influence on the results, and the "best" value is problem-dependent.

• Set via a tuning set
• Usually small values (e.g., < 20) are fine
• As the training set size gets larger, expect that less warping is needed. Why? The more examples I have, the less warping I expect to have to do.

Implementing DTW

• Dynamic programming DTW: N² time and space per pair compared
• Many efficiency tricks exist that scale this to very large data sets (i.e., linear time)
  o Early abandoning of the computation
  o Exploiting the warping constraint


  o Etc.

So far, we assumed a time series measures only one variable. However, multivariate series measure multiple variables: S = (s1, ..., sn) where each si = (si,1, ..., si,d). The question is, how can we deal with multi-dimensional variables? There exist two ways to extend DTW.

• Independent: sum over each dimension separately. Align the first dimensions of the two sequences, then the second dimension, the third one, etc. So we essentially assume the dimensions to be independent. DTW(S, Q) = DTW(S1, Q1) + ... + DTW(Sd, Qd)
• Dependent: a single matrix, and compute the score along all dimensions, e.g., δ(si, qj) is the d-dimensional Euclidean distance.

Selecting the independent or dependent metric is domain-dependent:

• Independent: acceleration on foot and hand
• Dependent: player location on a soccer field

Beware of the curse of dimensionality!

• DTW is probably only useful if d < 10 (or perhaps even smaller)
• If d > 10, you probably need to drop some dimensions.

Allowing for Gaps: Longest Common Subsequence (LCSS)

Potential issue with DTW: all points are matched, so outliers can distort the score. The longest common subsequence allows gaps, meaning that not all points have to be matched. We will ignore some noise in the data.

Computing LCSS: Dynamic Programming. Given: S = (s1, ..., sn) and Q = (q1, ..., qm).

Notes on LCSS

o This is a similarity, not a distance


o Symbolic data uses si = qj instead of δ(si, qj) < ε
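The recurrence itself is in the slides; a common formulation, sketched here under the assumption that two real-valued points match when they differ by less than a threshold ε, is LCSS[i, j] = 1 + LCSS[i-1, j-1] if δ(si, qj) < ε, and max(LCSS[i-1, j], LCSS[i, j-1]) otherwise.

```python
def lcss(s, q, eps=0.5):
    """Longest Common Subsequence similarity for real-valued series.
    Two points match when they differ by less than eps; unmatched points
    (gaps) are simply skipped, unlike in DTW."""
    n, m = len(s), len(q)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(s[i - 1] - q[j - 1]) < eps:       # the two points match
                L[i][j] = 1 + L[i - 1][j - 1]
            else:                                    # skip one point (a gap)
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n][m]

print(lcss([1.0, 2.0, 100.0, 3.0], [1.1, 2.1, 3.1]))   # 3: the outlier 100 is ignored
```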

Why is it interesting to study something like DTW?

• DTW is common: medicine, bioinformatics, gesture recognition, music analysis, etc.
• Many large-scale empirical evaluations show that nearest neighbour with DTW is hard to beat
  o DTW is very simple and easy to implement
  o Good advice: try simple things first
• Ideas such as dynamic programming, etc. translate to other problems.

3. Symbolic Representations

ð Why a symbolic representation?
• Enables new analyses by using symbolic approaches: plot them, for instance.
• Symbols tend to be easier for people to interpret than numbers
• Compress the data

SAX representation. Given: a time series S = (s1, ..., sn), a window size w, and an alphabet size a. Do: convert S to a symbolic sequence of length n/w which involves an alphabet of a symbols. So we will divide our sequence into parts.

How does this approach work? First we assume the data is normalized, then we apply the Piecewise Aggregate Approximation.
ð Z-normalize the time series

Apply the Piecewise Aggregate Approximation (PAA):
• Divide the series into n/w windows
• Compute the average value in each window and replace each observed value in the window with the average

Assign a symbol based on ranges of PAA values. Convert the PAA representation to a string:


• Exploit that the time series is z-normalized
• Divide the area under the standard normal into equal-size regions
• Assign one symbol to each region

Anything that falls into a particular region gets the corresponding symbol.
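A compact sketch of the whole pipeline (z-normalize, PAA, symbol assignment), assuming numpy and Python's statistics.NormalDist for the standard-normal breakpoints; the window size, alphabet size and example series are illustrative.

```python
import numpy as np
from statistics import NormalDist

def sax(series, w, a):
    """SAX: z-normalize, apply PAA with window size w, then map each PAA
    value to one of `a` symbols using equiprobable standard-normal regions."""
    s = np.asarray(series, dtype=float)
    s = (s - s.mean()) / s.std()                      # z-normalization
    # PAA: average value in each of the n/w windows
    paa = s[: (len(s) // w) * w].reshape(-1, w).mean(axis=1)
    # Breakpoints that cut the standard normal into `a` equal-probability regions
    breakpoints = [NormalDist().inv_cdf(i / a) for i in range(1, a)]
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    return "".join(alphabet[np.searchsorted(breakpoints, v)] for v in paa)

print(sax([1, 2, 3, 4, 6, 8, 7, 5, 3, 2, 1, 0], w=3, a=4))
```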

ð Why SAX?
• Converts the time series to a string: we can apply fancy algorithms (e.g., from bioinformatics)
• Makes visualization easier (borrowing ideas from bioinformatics)
• Most indexing schemes work best with symbols: we can use extensible hashing with SAX

4. Applications to sports

Three new types of data
• Event stream: events with time and location
• Athlete monitoring: GPS, accelerometer, etc.
• Optical tracking: X, Y locations of players

ð These are all time series

Two tasks
• Rating players: assign a rating to each action a player performs in a match
• Understanding strategy: discover patterns from player tracking data

STARSS. Given: an event stream with the type and location of all events (e.g., passes and shots). Do: assign a rating to each action by assigning a value to each action. Approach:


1. Split matches into phases
2. Rate phases
3. Distribute the phase rating over individual actions
4. Aggregate players' ratings over the season

Rating phases:
1. Find the k most similar phases (e.g., 100)
2. Of these, count how many result in a goal (e.g., 6)

ð Distribute the phase rating across its constituent actions. Actions at the end are more important: exponential decay.

Discovering Offensive Strategies in Football Matches. Given: an event stream with the type and location of all events (e.g., passes and shots), and the locations of all players and the ball (10 Hz sampling). Find: typical offensive strategies.

§ Film study is time-consuming
§ Automation can help speed this up
§ Computers are good at finding patterns in large data sets

Challenges:

• Relationships and how they change over time are important
  § Space
  § Interactions between players
• Order of events is important
• May want to generalize over the players involved
• The exact same sequence of events is unlikely to occur multiple times


Important steps

1. Data cleaning
  • Outliers and incorrect values
    o Valid field coordinates
    o Player and ball movements seem "possible"
  • Teams switch direction at half time: normalize the data such that a team always attacks the same goal.
  • Account for changes in the data (e.g., position switches, new players, etc.)

2. Event stream pre-processing

3. Clustering the data. Three benefits:
  • Teams employ multiple strategies
  • Generalize from a specific location
  • The subsequent step becomes more computationally efficient
  Divide phases into different groups such that the phases in a group are "similar".

4. Identifying important strategies: within each cluster, find frequently occurring subsequences. See the slides for more details.

5. To conclude

To conclude, time series occur in multiple domains: sports, healthcare, music, mathematics, etc. Sports provide many rich sources of data, from historical sources to sensors that provide detailed data. This relatively simple analysis has made a huge impact on several sports.


The 1-NN approach with DTW is a very strong approach for classifying time series.

§ DTW is a one-to-many mapping
§ It can be efficiently computed

We can convert a time series into a symbolic sequence. Key challenges include:

§ Volume of data
§ Capturing structure: spatial, temporal, etc.