Introduction to Machine Learning

Data Mining MTAT.03.183 Introduction to Machine Learning Jaak Vilo 2017 Spring

Transcript of Introduction to Machine Learning


Acknowledgement

• Konstantin Tretyakov – 2009 course materials
• Lecture Notes for E. Alpaydın, Introduction to Machine Learning, 2010
• Machine Learning course (spring) – by Sven Laur et al. – 2017–2018 => Data Mining in fall, Machine Learning in spring
• Meelis Kull

Coming up next
• “Machine learning” – terminology, foundations, general framework.
• Supervised machine learning – basic ideas, algorithms & toy examples.
• Statistical challenges – p-values, significance, consistency, stability.
• State-of-the-art techniques – SVM, kernel methods, graphical models, latent variable models, boosting, bagging, LASSO, on-line learning, deep learning, reinforcement learning, …

Learning Associations
• Basket analysis: P(Y | X) is the probability that somebody who buys X also buys Y, where X and Y are products/services.

Example: P(chips | beer) = 0.7
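As a minimal sketch of how such a conditional probability can be estimated by counting (the transaction list below is made up for illustration, not course data):

```python
# Estimate P(Y | X) from a toy list of market baskets (hypothetical data).
baskets = [
    {"beer", "chips"}, {"beer", "chips", "salsa"}, {"beer"},
    {"beer", "chips"}, {"chips", "salsa"}, {"beer", "diapers"},
]

def conditional_prob(baskets, x, y):
    """P(y in basket | x in basket), estimated by counting co-occurrences."""
    with_x = [b for b in baskets if x in b]
    if not with_x:
        return 0.0
    return sum(1 for b in with_x if y in b) / len(with_x)

print(conditional_prob(baskets, "beer", "chips"))  # 3 of 5 beer baskets contain chips -> 0.6
```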

Classification
• Example: credit scoring
• Differentiating between low-risk and high-risk customers from their income and savings

Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
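The discriminant above can be written directly as code; a sketch with hypothetical thresholds (the slide does not give values for θ1 and θ2):

```python
# The slide's credit-scoring rule as a function; threshold values are made up.
THETA1 = 30_000  # income threshold (hypothetical)
THETA2 = 10_000  # savings threshold (hypothetical)

def credit_risk(income: float, savings: float) -> str:
    if income > THETA1 and savings > THETA2:
        return "low-risk"
    return "high-risk"

print(credit_risk(45_000, 12_000))  # low-risk
print(credit_risk(45_000, 2_000))   # high-risk
```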

Classification: Applications
• A.k.a. pattern recognition
• Face recognition: pose, lighting, occlusion (glasses, beard), make-up, hairstyle
• Character recognition: different handwriting styles
• Speech recognition: temporal dependency
• Medical diagnosis: from symptoms to illnesses
• Biometrics: recognition/authentication using physical and/or behavioral characteristics: face, iris, signature, etc.
• …

Face Recognition

Training examples of a person

Test images

ORL dataset, AT&T Laboratories, Cambridge UK

Regression
• Example: price of a used car
• x: car attributes, y: price

y = g(x | θ), where g(·) is the model and θ its parameters

y = wx + w0
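A sketch of fitting y = wx + w0 by least squares with NumPy (the car data below is invented for illustration):

```python
import numpy as np

# Toy used-car data: x = age in years, y = price (hypothetical numbers).
x = np.array([1, 2, 3, 5, 8])
y = np.array([18_000, 15_500, 13_200, 9_800, 5_100])

w, w0 = np.polyfit(x, y, deg=1)  # least-squares fit of y = w*x + w0
print(f"price ≈ {w:.0f} * age + {w0:.0f}")
print("predicted price of a 4-year-old car:", w * 4 + w0)
```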

Regression Applications
• Navigating a car: angle of the steering wheel
• Kinematics of a robot arm:

α1 = g1(x, y), α2 = g2(x, y), where (x, y) is the target position of the arm

• Response surface design


Multivariate linear regression

all the same, but instead of one feature, x is a k-dimensional vector

xi = (xi1, xi2, …, xik)

the model is the linear combination of all features:

y = β0 + β1·x1 + β2·x2 + … + βk·xk

via the matrix representation:

y = Xβ, where y = (y1, …, yn)ᵀ, β = (β0, β1, …, βk)ᵀ, and X is the n × (k+1) design matrix whose i-th row is (1, xi1, …, xik)
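A sketch of solving the multivariate model on synthetic data; np.linalg.lstsq computes the least-squares estimate of β in y = Xβ:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3
# Design matrix: a column of ones (intercept) plus k feature columns.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta_true = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ beta_true + 0.1 * rng.normal(size=n)  # synthetic targets with noise

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # should be close to [1.0, 2.0, -0.5, 0.3]
```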

Objectives of regression
• Fit the data with a “function”
– Be able to interpret the function and its parameters
• “Simpler” models are better; regularisation
• Optimisation of “best fit” – heuristics
• Bias, variance, logistic regression, …

Supervised Learning: Uses
• Prediction of future cases: use the rule to predict the output for future inputs
• Knowledge extraction: the rule is easy to understand
• Compression: the rule is simpler than the data it explains
• Outlier detection: exceptions that are not covered by the rule, e.g., fraud

Supervised learning

Observation                Outcome
Summer of 2003 was cold    Winter of 2003 was warm
Summer of 2004 was cold    Winter of 2004 was cold
Summer of 2005 was cold    Winter of 2005 was cold
Summer of 2006 was hot     Winter of 2006 was warm
Summer of 2007 was cold    Winter of 2007 was cold
Summer of 2008 was warm    Winter of 2008 was warm
Summer of 2009 was cold    Winter of 2009 was cold
Summer of 2010 was hot     Winter of 2010 was cold
Summer of 2011 was warm    Winter of 2011 will be?

Supervised learning

Observation                    Outcome
Study=hard,  Professor=L       I get a C
Study=slack, Professor=J       I get an A
Study=hard,  Professor=L       I get an A
Study=slack, Professor=L       I get a D
Study=slack, Professor=J       I get an A
Study=slack, Professor=J       I get an A
Study=hard,  Professor=J       I get an A
Study=slack, Professor=J       I get a B
?                              I get an A

Supervised learning

Day  Observation                           Outcome
Mon  I was not using magnetic bracelet™    In the evening I had a headache
Tue  I was using magnetic bracelet™        In the evening I had less headache
Wed  I was using magnetic bracelet™        In the evening no headache!
Thu  I was using magnetic bracelet™        The headache is gone!!
Fri  I was not using magnetic bracelet™    No headache!!

Magnetic bracelet™ cures headache

Science

Supervised learning

Collect data/observations → come up with theories → better theory → … = Science ☺

Supervised learning
• Formally, …

Regression

Classification

Re-visit the 2×2 tables

               Predicted = Yes                            Predicted = No
Actual = Yes   True positives (TP)                        False negatives (FN) (Type II, β-error)
Actual = No    False positives (FP) (Type I, α-error)     True negatives (TN)
               Positives                                  Negatives

Classification summary

               Predicted = Yes                            Predicted = No
Actual = Yes   True positives (TP)                        False negatives (FN) (Type II, β-error)
Actual = No    False positives (FP) (Type I, α-error)     True negatives (TN)
               Positives                                  Negatives

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-measure = harmonic_mean(Precision, Recall)
Accuracy = (TP + TN) / (TP + FP + FN + TN)
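A sketch of these four measures as code, computed straight from the 2×2 table counts:

```python
def classification_metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # a.k.a. TP rate, sensitivity
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f_measure, accuracy

# Example counts for one class of the Iris demo shown later
# (versicolor: 49 TP, 2 FP, 1 FN, 98 TN):
print(classification_metrics(tp=49, fp=2, fn=1, tn=98))
# precision ≈ 0.961, recall = 0.98 — matching the Weka output below
```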

The “Dumb User” Perspective

Weka, RapidMiner, MS SSAS, Clementine, SPSS, R, scikit-learn, …

Validation

Classification demo: Iris dataset
• 150 measurements, 4 attributes, 3 classes
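The demo can be reproduced in scikit-learn; a sketch (the slides show a Weka run, so the exact numbers here may differ slightly):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

X, y = load_iris(return_X_y=True)           # 150 instances, 4 attributes, 3 classes
clf = DecisionTreeClassifier(random_state=0)
pred = cross_val_predict(clf, X, y, cv=10)  # 10-fold cross-validated predictions
print(confusion_matrix(y, pred))            # rows = actual class, columns = predicted
```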

Validation

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 49  1 |  b = Iris-versicolor
  0  2 48 |  c = Iris-virginica

Correctly Classified Instances      147       98%
Incorrectly Classified Instances      3        2%
Kappa statistic                     0.97
Mean absolute error                 0.0233
Root mean squared error             0.108
Relative absolute error             5.2482 %
Root relative squared error        22.9089 %
Total Number of Instances           150

Validation

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 49  1 |  b = Iris-versicolor
  0  2 48 |  c = Iris-virginica

Class       setosa   versic.   virg.   Avg
TP Rate     1        0.98      0.96    0.98
FP Rate     0        0.02      0.01    0.01
Precision   1        0.961     0.98    0.98
Recall      1        0.98      0.96    0.98
F-Measure   1        0.97      0.97    0.98
ROC Area    1        0.99      0.99    0.99

Validation

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 49  1 |  b = Iris-versicolor
  0  2 48 |  c = Iris-virginica

            setosa   versic.
TP Rate     1        0.98     = TP / positive examples
FP Rate     0        0.02     = FP / negative examples
Precision   1        0.961    = TP / positives
Recall      1        0.98     = TP / positive examples
F-Measure   1        0.97     = 2·P·R / (P + R)
ROC Area    1        0.99     ~ Pr(s(false) < s(true))


https://en.wikipedia.org/wiki/Receiver_operating_characteristic


ROC
• Your predictions sometimes have a natural “order” – from your “best bet” to gradually “weaker” predictions
• How many you call depends on you
• At each “point” your recall may increase, or you may incur a “false positive”
• A model is “better” if it outperforms others across a wide range of “cutoffs”

ROC curve – Receiver Operating Characteristic; AUC – Area Under the Curve
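A sketch of computing a ROC curve and its AUC with scikit-learn; the dataset and classifier are arbitrary choices for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]       # predictions ordered from "best bet" down
fpr, tpr, cutoffs = roc_curve(y_te, scores)  # one (FPR, TPR) point per cutoff
print("AUC:", roc_auc_score(y_te, scores))
```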

“Training” classifiers on data
• Thus, a “good” classifier is one with good Accuracy/Precision/Recall
– Or a better ROC curve, a higher ROC AUC score
• Hence, machine learning boils down to finding a function that optimizes these parameters for given data
• Yet, there’s a catch…

“Training” classifiers on data
• We want our algorithm to perform well on “unseen” data!
– This makes algorithms and theory way more complicated.
– This makes validation somewhat more complicated.

Proper validation
• You may not test your algorithm on the same data that you used to train it!

Proper validation :: Holdout

Split → Training set + Testing set → Validation
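A holdout sketch in scikit-learn (the dataset and split size are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Hold out a third of the data; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```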

Proper validation
• What are the “sufficient” sizes for the test/training sets, and why?
• What if the data is scarce? (a sketch follows below)
– Cross-validation
– K-fold cross-validation
– Leave-one-out cross-validation
– Bootstrap .632+
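A sketch of k-fold and leave-one-out cross-validation (the classifier choice is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
clf = GaussianNB()
print("10-fold CV accuracy:   ", cross_val_score(clf, X, y, cv=10).mean())
print("leave-one-out accuracy:", cross_val_score(clf, X, y, cv=LeaveOneOut()).mean())
```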

The Netflix Prize Data
• Netflix released three datasets
– 480,189 users (anonymous)
– 17,770 movies
– ratings on an integer scale of 1 to 5
• Training set: 99,072,112 <user, movie> pairs with ratings
• Probe set: 1,408,395 <user, movie> pairs with ratings
• Qualifying set: 2,817,131 <user, movie> pairs with no ratings

Jeff Howbert

Why the Netflix Prize Was Hard
• Massive dataset
• Very sparse – the matrix is only 1.2% occupied
• Extreme variation in the number of ratings per user
• Statistical properties of the qualifying and probe sets differ from the training set

[Table: a sparse user × movie ratings matrix – rows user1 … user480189, columns movie1 … movie17770, with most cells empty]

Jeff Howbert

Model Building and Submission Process

[Diagram] The training set (99,072,112 ratings, known) and the probe set (1,408,395 ratings, known) are used to build, tune and validate the MODEL. The model then makes predictions on the qualifying set (ratings unknown), which consists of a quiz set (1,408,342 pairs; RMSE shown on the public leaderboard) and a test set (1,408,789 pairs; RMSE kept secret for final scoring).

Jeff Howbert

Intermediate summary
• Supervised learning = predicting f(x) well.
• For classification, “well” = high accuracy/precision/recall on unseen data.
• To achieve that, most training algorithms will try to optimize their accuracy/precision/recall on the training data.
• We can then validate how good they are on test data.

Next
• Three examples of approaches
– Ad-hoc
• Decision tree induction
– Probabilistic modeling
• Naïve Bayes classifier
– Objective function optimization
• Linear least squares regression

Decision Tree Induction :: ID3
• Iterative Dichotomizer 3
– A simple yet popular decision tree induction algorithm
– Builds a decision tree top-down, starting at the root

Ross Quinlan

ID3

ID3 :: First split

Which split is the most informative?

Entropy
• Entropy measures the “informativeness” of a probability distribution: H(p) = −Σ pᵢ log₂ pᵢ
• A split is informative if it reduces entropy.

Information gain of a split
• Before the split: p(no) = 5/14, p(yes) = 9/14, H(p) = 0.94
• After the split on outlook, the branches have entropies H = 0.97, H = 0, H = 0.97; their weighted average = 0.69

Information gain = 0.94 − 0.69 = 0.25
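A sketch that reproduces these numbers; the branch counts below are the standard play-tennis split on outlook (sunny 2 yes / 3 no, overcast 4 / 0, rain 3 / 2):

```python
from math import log2

def entropy(counts):
    """Entropy H(p) = -sum p_i log2 p_i of a class-count distribution, in bits."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

before = entropy([9, 5])             # 9 yes / 5 no -> ~0.94
branches = [[2, 3], [4, 0], [3, 2]]  # outlook = sunny / overcast / rain
after = sum(sum(b) / 14 * entropy(b) for b in branches)  # weighted average -> ~0.69
print("information gain:", round(before - after, 2))     # 0.25
```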

Entropy, Gini, Misclassification

ID3
1. Start with a single node
2. Find the attribute with the largest information gain
3. Split the node according to this attribute
4. Repeat recursively on the subnodes

C4.5
• C4.5 is an extension of ID3
– Supports continuous attributes
– Supports missing values
– Supports pruning
• There is also a C5.0
– A commercial version with additional bells & whistles

Decision trees – discussion
• The goods:
– Easy & efficient
– Interpretable and pretty
• The bads:
– Rather ad-hoc; optimal trees are NP-hard to train
– Can overfit unless properly pruned
– Not the best model for all classification tasks
• Improvements:
– Random Forest – multiple trees, majority voting

Why Does My Method Overfit?
• In domains with noise or uncertainty the system may try to decrease the training error by completely fitting all the training examples

Slide from: Pavan J Joshi

Overfitting
• The longer you train, the worse it becomes…
• Always test against data that was not used in training

Pruning
• Delete subtrees if they do not improve predictions
• Top-down
• Bottom-up
• …
• MDL – the Minimum Description Length principle: minimise the length of “Model + Exceptions”

Why do we trust in one tree?
• The tree differs depending on the data used
• Which attributes are the most important?
• Randomly select a subset of the data, randomly sub-select a subset of the attributes…

Random Forest
• Each tree is trained on a random subset of the data
• Each tree is trained on a random subset of the attributes
• Every tree provides some level of accuracy. Majority voting!
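A Random Forest sketch in scikit-learn; max_features="sqrt" gives the per-split random attribute subset (the hyperparameter values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
# 100 trees: each sees a bootstrap sample of the rows and a random
# subset of the attributes at every split; classes are decided by majority vote.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
print("10-fold CV accuracy:", cross_val_score(rf, X, y, cv=10).mean())
print("attribute importances:", rf.fit(X, y).feature_importances_)
```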

Data (2 Gaussians), Random Forest, logistic regression


Naïve Bayes Classifier

The Tennis Dataset

Shall we play tennis today?

P(Yes) = 9/14 = 0.64
P(No) = 5/14 = 0.36
⇒ Yes

It’s windy today. Tennis, anyone?
• Probabilistic model:

It’s windy today. Tennis, anyone?
• Probabilistic model:

P(Weak) = 8/14          P(Strong) = 6/14
P(Yes | Weak) = 6/8     P(Yes | Strong) = 3/6
P(No | Weak) = 2/8      P(No | Strong) = 3/6

More attributes
• Probabilistic model:

P(High, Weak) = 4/14          P(High, Strong) = 3/14
P(Yes | High, Weak) = 2/4     P(Yes | High, Strong) = 1/3
P(No | High, Weak) = 2/4      P(No | High, Strong) = 2/3

The Bayes Classifier
In general:
1. Estimate from data: P(Class | X1, X2, X3, …)
2. For a given instance (X1, X2, X3, …), predict the class whose conditional probability is greater:
P(C1 | X1, X2, X3, …) > P(C2 | X1, X2, X3, …) ⇒ predict C1

The Bayes Classifier
In general:
1. P(C1 | X) > P(C2 | X) ⇒ predict C1
2. P(X | C1)·P(C1)/P(X) > P(X | C2)·P(C2)/P(X) ⇒ predict C1
3. P(X | C1)·P(C1) > P(X | C2)·P(C2) ⇒ predict C1
4. P(X | C1)/P(X | C2) > P(C2)/P(C1) ⇒ predict C1

The Bayes Classifier
If the true underlying distribution is known, the Bayes classifier is optimal (i.e. it achieves the minimal error probability).

Problem
• We need an exponential amount of data

P(High, Weak) = 4/14          P(High, Strong) = 3/14
P(Yes | High, Weak) = 2/4     P(Yes | High, Strong) = 1/3
P(No | High, Weak) = 2/4      P(No | High, Strong) = 2/3

The Naïve Bayes Classifier
To scale beyond 2–3 attributes, use a trick: assume that the attributes are independent within each class:

P(X1, X2, X3 | Class) = P(X1 | Class) · P(X2 | Class) · P(X3 | Class) · …

The Naïve Bayes Classifier
P(X | C1) / P(X | C2) > P(C2) / P(C1) ⇒ predict C1

becomes

∏i P(Xi | C1) / P(Xi | C2) > P(C2) / P(C1) ⇒ predict C1

Naïve Bayes Classifier
1. Training:
For each value v of each attribute i, compute the per-class frequencies P(Xi = v | C1) and P(Xi = v | C2), plus the priors P(C1) and P(C2).
2. Classification:
For a given instance (x1, x2, …), compute the likelihood ratio ∏i P(Xi = xi | C1) / P(Xi = xi | C2).
If it exceeds P(C2) / P(C1), classify as C1, else C2.

Naïve Bayes Classifier
• Extendable to continuous attributes via kernel density estimation.
• The goods:
– Easy to implement, efficient
– Won’t overfit, interpretable
– Works better than you would expect (e.g. spam filtering)
• The bads:
– “Naïve”, linear
– Usually won’t work well for too many classes
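A hand-rolled sketch of the classifier on the standard play-tennis data, reduced to the two attributes used on the slides (humidity, wind); the data listing matches the counts quoted earlier (e.g. P(High, Weak) = 4/14):

```python
# (humidity, wind, play) rows of the standard play-tennis dataset.
data = [
    ("High", "Weak", "No"),      ("High", "Strong", "No"),    ("High", "Weak", "Yes"),
    ("High", "Weak", "Yes"),     ("Normal", "Weak", "Yes"),   ("Normal", "Strong", "No"),
    ("Normal", "Strong", "Yes"), ("High", "Weak", "No"),      ("Normal", "Weak", "Yes"),
    ("Normal", "Weak", "Yes"),   ("Normal", "Strong", "Yes"), ("High", "Strong", "Yes"),
    ("Normal", "Weak", "Yes"),   ("High", "Strong", "No"),
]

def naive_bayes_scores(x):
    """P(c) * prod_i P(x_i | c) for each class c, under the independence assumption."""
    scores = {}
    for c in ("Yes", "No"):
        rows = [r for r in data if r[2] == c]
        p = len(rows) / len(data)      # prior P(c)
        for i, v in enumerate(x):      # product of P(x_i | c)
            p *= sum(1 for r in rows if r[i] == v) / len(rows)
        scores[c] = p
    return scores

print(naive_bayes_scores(("High", "Strong")))  # the larger score wins -> "No"
```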


Supervised learning
• Four examples of approaches
– Ad-hoc
• Decision tree induction
– Probabilistic modeling
• Naïve Bayes classifier
– Objective function optimization
• Linear least squares regression
– Instance-based methods
• K-nearest neighbors

Linear regression
• Consider data points of the type (xi, yi)
• Let us search for a function of the form f(x) = wx
that would have the least possible sum of squared errors: E(w) = Σi (yi − w·xi)²

Linear regression
• The set of functions f(x) = wx for various w.
• The function having the least error

Linear regression
• We need to solve: min over w of E(w), where E(w) = Σi (yi − w·xi)²
• Setting the derivative to 0:
dE/dw = −2 Σi xi(yi − w·xi) = 0
we get
w = (Σi xi·yi) / (Σi xi²)
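A numeric sketch of this closed form (the data is invented, roughly y = 2x):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.9])

w = np.sum(x * y) / np.sum(x ** 2)  # closed-form minimizer of sum_i (y_i - w x_i)^2
print("w =", w)                     # close to 2
print("residual sum of squares:", np.sum((y - w * x) ** 2))
```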

Linear regression
• The idea naturally generalizes to more complex cases (e.g. multivariate regression)
• The goods:
– Simple, easy
– Interpretable, popular, extendable to many variants
– Won’t overfit
• The bads:
– Linear, i.e. works for a limited set of datasets.
– Nearly never perfect.

Median Error Regression (quantile)
• Minimize the mean error
• More robust: minimize the median error
• Requires a different optimisation
– Differential Evolution
– Genetic Programming
– …

K-Nearest Neighbors
• One-nearest neighbor

K-Nearest Neighbors
• Training:
– Store and index all training instances
• Classification:
– Given an instance to classify, find the k instances from the training sample nearest to it.
– Predict the majority class among these k instances
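A k-NN sketch in scikit-learn (k = 5 is an arbitrary illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# "Training" just stores (and indexes) the instances; prediction finds the
# k nearest stored instances and takes a majority vote among them.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print("test accuracy:", knn.score(X_te, y_te))
```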

K-Nearest Neighbors
• The goods:
– Trivial concept
– Easy to implement
– Asymptotically optimal
– Allows various kinds of data
• The bads:
– Difficult to implement efficiently
– Not interpretable (no model)
– On smaller datasets loses to other methods

Supervised learning
• Ad-hoc: decision trees, forests; rule induction, ILP; fuzzy reasoning; …
• Probabilistic models: Naïve Bayes; graphical models; regression models; density estimation; …
• Objective optimization: regression models; kernel methods, SVM, RBF; neural networks; …
• Instance-based: K-NN, LOWESS; kernel densities; SVM, RBF; …
• Ensemble learners: arcing, boosting, bagging, dagging, voting, stacking

Perceptron

Neural Network

Linear classification

Linear classifiers

H3 (green) doesn’t separate the two classes. H1 (blue) does, with a small margin, and H2 (red) with the maximum margin.

Linear classification
• Learning a linear classifier from data:
– Minimizing the training error
• NP-complete
– Minimizing the sum of error squares
• Suboptimal, yet can be easy and fun: e.g. the perceptron.
– Maximizing the margin
• Doable and well-founded by statistical learning theory

The margin

Maximum-margin hyperplane and margins for an SVM trained with samples from two classes. Samples on the margin are called the support vectors.

The margin
• For any point x, its distance to the hyperplane: |⟨w, x⟩ + b| / ‖w‖
• Assuming all points are classified correctly: yi (⟨w, xi⟩ + b) ≥ 1 for all i
• The margin is then: 2 / ‖w‖

Maximal margin training
• Finding (w, b) such that the margin is maximal can be shown to be equivalent to:
minimize ½‖w‖²
subject to yi (⟨w, xi⟩ + b) ≥ 1 for all i
• This is doable using efficient optimization algorithms.

Soft margin training
• If the data is not linearly separable:
minimize ½‖w‖² + C Σi ξi
subject to yi (⟨w, xi⟩ + b) ≥ 1 − ξi, ξi ≥ 0
• It is called the Support Vector Machine (SVM).
• It can be generalized to regression tasks.

Soft margin training
• In a more general form:
minimize Σi L(yi, f(xi)) + λ‖w‖²
where L(y, f(x)) = max(0, 1 − y·f(x)) is the hinge loss.
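A sketch relating a trained soft-margin SVM to the hinge loss from the slide (note that scikit-learn's LinearSVC optimizes a squared variant by default; the data is synthetic):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Two slightly overlapping Gaussian blobs; C controls the softness of the margin.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)
svm = LinearSVC(C=1.0).fit(X, y)

# Hinge loss of the learned linear function f(x) = <w, x> + b:
f = X @ svm.coef_.ravel() + svm.intercept_[0]
margins = np.where(y == 1, 1, -1) * f  # y_i * f(x_i) with labels in {-1, +1}
print("mean hinge loss:", np.maximum(0, 1 - margins).mean())
```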

Regularization
• There are many algorithms which essentially look as follows:
For given data D, find a model M which minimizes Error(M, D) + Complexity(M)
• An SVM is a linear model which minimizes hinge loss + l2-norm penalty

Going nonlinear
• But a linear classifier is so linear!

Solution: a nonlinear map
• Instead of classifying points x, we’ll classify their images φ(x) in a higher-dimensional space.

Linear separation in a non-linear world
http://news360.com/digestarticle/t-2ryjS4d0io86BLq8d-qA

Kernel methods (not density!)
• For nearly any linear classifier: f(x) = sign(⟨w, x⟩ + b)
• Trained on a dataset (x1, y1), …, (xn, yn)
• The resulting vector w can be represented as: w = Σi αi xi
• Which means: f(x) = sign(Σi αi ⟨xi, x⟩ + b)

Kernel methods (not density!)
• Replace the inner product ⟨xi, x⟩ with a function K(xi, x): the function K is called a kernel; it measures similarity between objects.
• The explicit computation of φ(x) is unnecessary.
• You can use any type of data.
• Your method is nonlinear.
• Any linear method can be kernelized.
• Kernels can be combined.
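A sketch of the kernel trick in practice: on data that is not linearly separable in the input space, an RBF-kernel SVM should clearly beat a linear one:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original coordinates.
X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)

for kernel in ("linear", "rbf"):
    score = cross_val_score(SVC(kernel=kernel), X, y, cv=5).mean()
    print(kernel, round(score, 3))  # the RBF kernel should score much higher
```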


Summary
• SVM:
– A maximal-margin linear classifier.
– A linear model which minimizes hinge loss + l2-norm penalty.
– Kernelizable.
• Kernel methods:
– An easy and elegant way of “plugging in”
• nonlinearity
• different data types

ML Gallery
• http://home.comcast.net/~tom.fawcett/public_html/ML-gallery/pages/index.html
– Now disappeared from the net…
• A collection of example datasets and an illustration of different ML algorithms’ performance on them

scikit-learn
http://scikit-learn.org/

Must-have software
• WEKA
• MLDemos
• R
• …

http://playground.tensorflow.org/


In previous episodes


Approaches to data analysis
• The general principle is the same, though:
1. Define a set of patterns of interest
2. Define a measure of goodness for the patterns
3. Find the best pattern in the data

The “No Free Lunch” Theorem

Learning purely from data is, in general, impossible

X  Y  Output
0  0  False
0  1  True
1  0  True
1  1  ?

Statistical Learning Theory
• What is learning and how to analyze it?
– There are various ways to answer this question. We’ll just consider the most popular one.
• Let D be the distribution of the data.
• We observe an i.i.d. sample: (x1, y1), …, (xn, yn) ~ D
• We produce a classifier: g(x)

Inductive bias

The “No Free Lunch” Theorem

Learning purely from data is, in general, impossible
• Is it good or bad?
• What should we do to enable learning?

The “No Free Lunch” Theorem

Learning purely from data is, in general, impossible
• Is it good or bad?
– Good for cryptographers, bad for data miners
• What should we do to enable learning?
– Introduce assumptions about the data (“inductive bias”):
1. How does the existing data relate to future data?
2. What is the system we are learning?

The “No Free Lunch” Theorem

Learning purely from data is, in general, impossible

Rule 1: Generalization will only come through an understanding of similarity!

Rule 2: Data can only partially substitute knowledge about the system!

Statistical Learning Theory
• What is learning and how to analyze it?
– There are various ways to answer this question. We’ll just consider the most popular one.

[Figures] The distribution of the data (x – coordinates, y – color); a sample S and the trained classifier g; the training error; the generalization error; the expected generalization error.

Statistical Learning Theory
• Questions:
– What is the generalization error of our classifier?
• How to estimate it?
• How to find a classifier with a low generalization error?
– What is the expected generalization error of our method?
• How to estimate it?
• What methods have a low expected generalization error?
• What methods are asymptotically optimal (consistent)?
– When is learning computationally tractable?

Statistical Learning Theory
• Some answers: for linear classifiers
– Finding a linear classifier with a small training error is a good idea
(however, finding such a classifier is NP-complete, hence an alternative method must be used)

Statistical Learning Theory
• Some answers: in general
– Small training error ⇒ small generalization error.
• But only if you search a limited space of classifiers.
• The more data you have, the larger the space of classifiers you can afford.

Overfitting
• Why a limited space?
– Suppose your hypothesis space is just one classifier:
f(x) = if [x > 3] then 1 else 0
– You pick the first five training instances:
(1 → 0), (2 → 0), (4 → 1), (6 → 1), (−1 → 0)
– How surprised are you? How can you interpret it?
– What if you had 100,000 classifiers to start with and one of them matched? Would you be surprised?

Large hypothesis space ⇒ small training error becomes a matter of chance ⇒ overfitting

Bias-variance dilemma
• So what if the data is scarce?
– No free lunch – the bias-variance tradeoff:
– The only way out is to introduce a strong yet “correct” bias (or, well, to get more data).

[Diagram] An error scale: the gap from the optimal error to the optimal error in your hypothesis class is the bias; the gap from there up to your classifier’s error is the variance (= “overfitting”).


Summary
• Learning can be approached formally
• Learning is feasible in many cases
– But you pay with data or prior knowledge
• You have to be careful with complex models
– Beware of overfitting
– If data is scarce, use simple models: they are not optimal, but at least you can fit them from the data!

Using complex models with scarce data is like throwing data away.

Next
• “Machine learning” – terminology, foundations, general framework.
• Supervised machine learning – basic ideas, algorithms & toy examples.
• Statistical challenges – learning theory, bias-variance, consistency, …
• State-of-the-art techniques – SVM, kernel methods, graphical models, latent variable models, boosting, bagging, LASSO, on-line learning, deep learning, reinforcement learning, …

Summary summary
• “Machine learning” – terminology, foundations, general framework.
• Supervised machine learning – basic ideas, algorithms & toy examples.
• Statistical challenges – learning theory, bias-variance, consistency, …
• State-of-the-art techniques – SVM, kernel methods.

Machine learning is important

Questions?