Introduction to Machine Learning
Transcript of Introduction to Machine Learning
Acknowledgement
• Konstantin Tretyakov – 2009 course materials
• Lecture Notes for E. Alpaydın, 2010, Introduction to Machine Learning
• Machine Learning course (spring) – by Sven Laur et al. – 2017-2018 => Data Mining in fall, ML in spring
• Meelis Kull
Jaak Vilo and other authors. UT: Data Mining 2009
Coming up next
• "Machine learning" – terminology, foundations, general framework.
• Supervised machine learning – basic ideas, algorithms & toy examples.
• Statistical challenges – p-values, significance, consistency, stability.
• State-of-the-art techniques – SVM, kernel methods, graphical models, latent variable models, boosting, bagging, LASSO, on-line learning, deep learning, reinforcement learning, …
5.11.2009 Machine learning :: Introduction
Learning Associations
• Basket analysis: P(Y | X) = probability that somebody who buys X also buys Y, where X and Y are products/services.
Example: P(chips | beer) = 0.7
Lecture Notes for E. Alpaydın, 2010, Introduction to Machine Learning 2e © The MIT Press (V1.0)
Classification
• Example: credit scoring
• Differentiating between low-risk and high-risk customers from their income and savings
Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
Classification: Applications
• Aka pattern recognition
• Face recognition: pose, lighting, occlusion (glasses, beard), make-up, hairstyle
• Character recognition: different handwriting styles
• Speech recognition: temporal dependency
• Medical diagnosis: from symptoms to illnesses
• Biometrics: recognition/authentication using physical and/or behavioral characteristics: face, iris, signature, etc.
• ...
Face Recognition
Training examples of a person
Test images
ORL dataset, AT&T Laboratories, Cambridge UK
Regression
• Example: price of a used car
• x: car attributes; y: price
y = g(x | θ); g(): model, θ: parameters
y = w·x + w0
Regression Applications
• Navigating a car: angle of the steering
• Kinematics of a robot arm: α1 = g1(x, y), α2 = g2(x, y)
(figure: a two-joint arm with angles α1 and α2 reaching the point (x, y))
• Response surface design
Multivariate linear regression
all the same, but instead of one feature, x is a k-dimensional vector
xi = (xi1, xi2, .., xik)
the model is the linear combination of all features:
y = β0 + β1·x1 + β2·x2 + … + βk·xk

via the matrix representation:

y = Xβ,  where X is the n × (k+1) matrix whose i-th row is (1, xi1, …, xik), and β = (β0, β1, …, βk)ᵀ
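Concretely, the least-squares β solves the normal equations (XᵀX)β = Xᵀy. A minimal sketch in plain Python (function names and data are hypothetical, chosen for illustration; in practice a library routine such as numpy.linalg.lstsq would be used):

```python
# Multivariate linear regression via the normal equations (X^T X) beta = X^T y.

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                     # back substitution
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_linear(rows, y):
    """Fit y = b0 + b1*x1 + ... + bk*xk by least squares."""
    X = [[1.0] + list(r) for r in rows]                # prepend intercept column
    n, p = len(X), len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)] for a in range(p)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    return solve(XtX, Xty)

# Data generated exactly from y = 1 + 2*x1 + 3*x2, so the fit recovers it.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 3)]
y = [1 + 2 * a + 3 * b for a, b in rows]
beta = fit_linear(rows, y)
```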
Objectives of regression
• Fit the data with a "function" – be able to interpret the function and its parameters
• "Simpler" models are better; regularisation
• Optimisation of "best fit" – heuristics
• Bias, variance, logistic regression, …
Supervised Learning: Uses
• Prediction of future cases: use the rule to predict the output for future inputs
• Knowledge extraction: the rule is easy to understand
• Compression: the rule is simpler than the data it explains
• Outlier detection: exceptions that are not covered by the rule, e.g., fraud
Supervised learning

Observation              | Outcome
Summer of 2003 was cold  | Winter of 2003 was warm
Summer of 2004 was cold  | Winter of 2004 was cold
Summer of 2005 was cold  | Winter of 2005 was cold
Summer of 2006 was hot   | Winter of 2006 was warm
Summer of 2007 was cold  | Winter of 2007 was cold
Summer of 2008 was warm  | Winter of 2008 was warm
Summer of 2009 was cold  | Winter of 2009 was cold
Summer of 2010 was hot   | Winter of 2010 was cold
Summer of 2011 was warm  | Winter of 2011 will be?
Supervised learning

Observation                 | Outcome
Study=hard,  Professor=L    | I get a C
Study=slack, Professor=J    | I get an A
Study=hard,  Professor=L    | I get an A
Study=slack, Professor=L    | I get a D
Study=slack, Professor=J    | I get an A
Study=slack, Professor=J    | I get an A
Study=hard,  Professor=J    | I get an A
Study=slack, Professor=J    | I get a B
?                           | I get an A
Supervised learning

Day | Observation                         | Outcome
Mon | I was not using magnetic bracelet™  | In the evening I had a headache
Tue | I was using magnetic bracelet™      | In the evening I had less headache
Wed | I was using magnetic bracelet™      | In the evening no headache!
Thu | I was using magnetic bracelet™      | The headache is gone!!
Fri | I was not using magnetic bracelet™  | No headache!!
Magnetic bracelet™ cures headache
Supervised learning
(figure: the learning loop – collect data/observations → come up with theories → a better theory → collect more data/observations → …)
Re-visit the 2x2 tables

              | Predicted = Yes                         | Predicted = No
Actual = Yes  | True positives (TP)                     | False negatives (FN) (Type II, β-error)
Actual = No   | False positives (FP) (Type I, α-error)  | True negatives (TN)
              | Positives                               | Negatives
Classification summary

              | Predicted = Yes                         | Predicted = No
Actual = Yes  | True positives (TP)                     | False negatives (FN) (Type II, β-error)
Actual = No   | False positives (FP) (Type I, α-error)  | True negatives (TN)
              | Positives                               | Negatives

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-measure = harmonic_mean(Precision, Recall)
Accuracy = (TP + TN) / (TP + FP + FN + TN)
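A minimal sketch of the four metrics, computed from hypothetical counts for the 2x2 table (TP, FN, FP, TN are made-up numbers for illustration):

```python
# The four classification metrics from the 2x2 confusion table.
TP, FN, FP, TN = 40, 10, 5, 45

precision = TP / (TP + FP)            # of all "Yes" predictions, how many were right
recall    = TP / (TP + FN)            # of all actual "Yes", how many we found
f_measure = 2 * precision * recall / (precision + recall)   # harmonic mean
accuracy  = (TP + TN) / (TP + FN + FP + TN)
```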
The "Dumb User" Perspective
Weka, RapidMiner, MS SSAS, Clementine, SPSS, R, scikit-learn, …
Classification demo: Iris dataset
• 150 measurements, 4 attributes, 3 classes
Validation

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 49  1 |  b = Iris-versicolor
  0  2 48 |  c = Iris-virginica

Correctly Classified Instances    147     98%
Incorrectly Classified Instances    3      2%
Kappa statistic                     0.97
Mean absolute error                 0.0233
Root mean squared error             0.108
Relative absolute error             5.2482 %
Root relative squared error        22.9089 %
Total Number of Instances         150
Validation

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 49  1 |  b = Iris-versicolor
  0  2 48 |  c = Iris-virginica

Class      setosa  versic.  virg.  Avg
TP Rate    1       0.98     0.96   0.98
FP Rate    0       0.02     0.01   0.01
Precision  1       0.961    0.98   0.98
Recall     1       0.98     0.96   0.98
F-Measure  1       0.97     0.97   0.98
ROC Area   1       0.99     0.99   0.99
Validation – metric definitions

TP Rate   = TP / positive examples
FP Rate   = FP / negative examples
Precision = TP / positives called
Recall    = TP / positive examples
F-Measure = 2·P·R / (P + R)
ROC Area  ~ Pr(s(false) < s(true))
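The per-class numbers above can be recomputed by hand from the confusion matrix alone; a short sketch (rows = actual class, columns = predicted; class order: setosa, versicolor, virginica):

```python
# The Weka confusion matrix for the Iris demo, recomputed by hand.
conf = [[50, 0, 0],
        [0, 49, 1],
        [0, 2, 48]]

n = sum(sum(row) for row in conf)
accuracy = sum(conf[i][i] for i in range(3)) / n       # 147 / 150

def precision(c):
    """TP over everything predicted as class c (a column sum)."""
    return conf[c][c] / sum(conf[r][c] for r in range(3))

def recall(c):
    """TP over all actual examples of class c (a row sum)."""
    return conf[c][c] / sum(conf[c])
```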
https://en.wikipedia.org/wiki/Receiver_operating_characteristic
ROC
• Your predictions sometimes have a natural "order" – from your "best bet" to gradually "weaker" predictions.
• How many you call depends on you.
• At each "point", your recall may increase or you may add a "false positive".
• A model is "better" if it outperforms others across a wide range of "cutoffs".
ROC curve – Receiver Operating Characteristic; AUC – Area Under Curve
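Using the definition above (ROC Area ≈ Pr(s(false) < s(true))), AUC can be estimated directly by comparing every positive–negative score pair; the scores below are hypothetical:

```python
# AUC as the probability that a random positive outscores a random negative
# (ties count as one half).
pos_scores = [0.9, 0.8, 0.4]   # classifier scores for true positives
neg_scores = [0.7, 0.3, 0.2]   # classifier scores for true negatives

pairs = [(p, q) for p in pos_scores for q in neg_scores]
auc = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p, q in pairs) / len(pairs)
```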
"Training" classifiers on data
• Thus, a "good" classifier is one with good Accuracy/Precision/Recall – or a better ROC curve, a higher ROC AUC score.
• Hence, machine learning boils down to finding a function that optimizes these parameters for given data.
• Yet, there's a catch…
"Training" classifiers on data
• We want our algorithm to perform well on "unseen" data!
– This makes algorithms and theory way more complicated.
– This makes validation somewhat more complicated.
Proper validation
• You may not test your algorithm on the same data that you used to train it!
Proper validation :: Holdout
(figure: split the data into a training set and a testing set; train on the former, validate on the latter)
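A minimal holdout-split sketch (the function name, fraction, and seed are hypothetical choices for illustration):

```python
import random

def holdout_split(data, test_fraction=0.3, seed=0):
    """Shuffle, then cut off a test fraction at the end."""
    shuffled = data[:]                      # don't mutate the caller's list
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]   # (training set, testing set)

train, test = holdout_split(list(range(100)))
```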
Proper validation
• What are the "sufficient" sizes for the test/training sets, and why?
• What if the data is scarce?
– Cross-validation
– K-fold cross-validation
– Leave-one-out cross-validation
– Bootstrap .632+
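K-fold cross-validation can be sketched as follows: split the indices into k disjoint folds, and let each fold serve once as the test set while the rest form the training data (function name and sizes are hypothetical):

```python
import random

def k_fold_indices(n, k, seed=0):
    """Partition 0..n-1 into k disjoint folds after shuffling."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]    # round-robin over shuffled indices

folds = k_fold_indices(n=10, k=5)
# Each (train, test) pair: everything outside the fold vs. the fold itself.
splits = [(sorted(set(range(10)) - set(f)), f) for f in folds]
```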
The Netflix Prize Data
• Netflix released three datasets:
– 480,189 users (anonymous)
– 17,770 movies
– ratings on an integer scale from 1 to 5
• Training set: 99,072,112 <user, movie> pairs with ratings
• Probe set: 1,408,395 <user, movie> pairs with ratings
• Qualifying set of 2,817,131 <user, movie> pairs with no ratings
Jeff Howbert
Why the Netflix Prize Was Hard
• Massive dataset
• Very sparse – matrix only 1.2% occupied
• Extreme variation in the number of ratings per user
• Statistical properties of qualifying and probe sets different from the training set
(figure: a fragment of the user × movie ratings matrix – columns movie1 … movie17770, rows user1 … user480189 – almost entirely empty, with scattered integer ratings 1-5)
Model Building and Submission Process
• Training set (99,072,112 ratings) and probe set (1,408,395 ratings): ratings known; used to build the MODEL, tune it, and validate.
• Qualifying set (ratings unknown) = quiz set (1,408,342 pairs) + test set (1,408,789 pairs).
– Predictions on the quiz set produce the RMSE shown on the public leaderboard.
– RMSE on the test set is kept secret for final scoring.
Intermediate summary
• Supervised learning = predicting f(x) well.
• For classification, "well" = high accuracy/precision/recall on unseen data.
• To achieve that, most training algorithms will try to optimize their accuracy/precision/recall on training data.
• We can then validate how good they are on test data.
Next
• Three examples of approaches
– Ad-hoc: decision tree induction
– Probabilistic modeling: Naïve Bayes classifier
– Objective function optimization: linear least squares regression
Decision Tree Induction :: ID3
• Iterative Dichotomizer 3
– A simple yet popular decision tree induction algorithm
– Builds a decision tree top-down, starting at the root.
(photo: Ross Quinlan)
Entropy
• Entropy measures the "informativeness" of a probability distribution.
• A split is informative if it reduces entropy.
Information gain of a split
• Before split:
– p_no = 5/14, p_yes = 9/14, H(p) = 0.94
• After split on outlook:
– branch entropies H = 0.97, H = 0, H = 0.97; weighted average = 0.69
Information gain = 0.94 - 0.69 = 0.25
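The slide's numbers can be reproduced for the classic play-tennis table (9 yes / 5 no overall; splitting on outlook gives branches with 2/3, 4/0, and 3/2 yes/no counts – the standard counts for this example):

```python
from math import log2

def entropy(counts):
    """Shannon entropy of a distribution given as raw class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

h_before = entropy([9, 5])                   # 9 yes, 5 no before the split
branches = [[2, 3], [4, 0], [3, 2]]          # yes/no counts per outlook branch
h_after = sum(sum(c) / 14 * entropy(c) for c in branches)  # weighted average
gain = h_before - h_after                    # information gain of the split
```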
ID3
1. Start with a single node
2. Find the attribute with the largest information gain
3. Split the node according to this attribute
4. Repeat recursively on subnodes

C4.5
• C4.5 is an extension of ID3
– Supports continuous attributes
– Supports missing values
– Supports pruning
• There is also a C5.0
– A commercial version with additional bells & whistles
Decision trees – discussion
• The goods:
– Easy & efficient
– Interpretable and pretty
• The bads:
– Rather ad-hoc – optimal trees are NP-hard to train
– Can overfit unless properly pruned
– Not the best model for all classification tasks
• Improvements:
– Random Forest – multiple trees, majority voting
Why Does my Method Overfit?
• In domains with noise or uncertainty, the system may try to decrease the training error by completely fitting all the training examples
Slide from: Pavan J Joshi
Overfitting
• The longer you train, the worse it becomes…
• Always test against data that was not used in training
Pruning
• Delete subtrees if they do not improve predictions
• Top-down
• Bottom-up
• …
• MDL – Minimum Description Length principle: minimise the length of "Model + Exceptions"
Why do we trust in one tree?
• The tree is different depending on the data used
• Which attributes are most important?
• Randomly select a subset of the data, randomly sub-select a subset of the attributes…
Random Forest
• Each tree is trained on a random subset of the data
• Each tree is trained on a random sub-set of the attributes
• Every tree provides some level of accuracy. Majority voting!
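The bagging-plus-voting idea can be sketched in miniature. A real random forest grows full decision trees over random attribute subsets; the hypothetical sketch below uses one-threshold "stumps" on bootstrap samples of a toy 1-D dataset, just to show bootstrap resampling and majority voting:

```python
import random
from collections import Counter

# Toy 1-D data: the true class is 1 iff x > 5.
data = [(x, int(x > 5)) for x in range(11)]

def train_stump(sample):
    """Pick the single threshold that best fits a bootstrap sample."""
    best_t, best_acc = 0, -1.0
    for t in range(11):
        acc = sum((x > t) == bool(y) for x, y in sample) / len(sample)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def train_forest(data, n_trees=15, seed=0):
    """Each 'tree' sees its own bootstrap sample of the data."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(data) for _ in data]) for _ in range(n_trees)]

def predict(trees, x):
    """Majority vote over the ensemble."""
    votes = Counter(int(x > t) for t in trees)
    return votes.most_common(1)[0][0]

trees = train_forest(data)
```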
Shall we play tennis today?
• Probabilistic model:
P(Yes) = 9/14 = 0.64, P(No) = 5/14 = 0.36
⇒ Yes
12.11.2009 Machine learning :: Introduction :: Part II
It's windy today. Tennis, anyone?
• Probabilistic model:
P(Weak) = 8/14, P(Strong) = 6/14
P(Yes | Weak) = 6/8, P(No | Weak) = 2/8
P(Yes | Strong) = 3/6, P(No | Strong) = 3/6
More attributes
• Probabilistic model:
P(High, Weak) = 4/14; P(Yes | High, Weak) = 2/4; P(No | High, Weak) = 2/4
P(High, Strong) = 3/14; P(Yes | High, Strong) = 1/3; P(No | High, Strong) = 2/3
…
The Bayes Classifier
In general:
1. Estimate from data: P(Class | X1, X2, X3, …)
2. For a given instance (X1, X2, X3, …), predict the class whose conditional probability is greater:
P(C1 | X1, X2, X3, …) > P(C2 | X1, X2, X3, …) ⇒ predict C1
The Bayes Classifier
In general:
1. P(C1 | X) > P(C2 | X) ⇒ predict C1
2. P(X | C1)·P(C1)/P(X) > P(X | C2)·P(C2)/P(X) ⇒ predict C1
3. P(X | C1)·P(C1) > P(X | C2)·P(C2) ⇒ predict C1
4. P(X | C1)/P(X | C2) > P(C2)/P(C1) ⇒ predict C1

The Bayes Classifier
If the true underlying distribution is known, the Bayes classifier is optimal (i.e. it achieves the minimal error probability).
Problem
• We need an exponential amount of data to estimate P(Class | X1, X2, X3, …) directly:
P(High, Weak) = 4/14; P(Yes | High, Weak) = 2/4; P(No | High, Weak) = 2/4
P(High, Strong) = 3/14; P(Yes | High, Strong) = 1/3; P(No | High, Strong) = 2/3
…

The Naïve Bayes Classifier
To scale beyond 2-3 attributes, use a trick: assume that the attributes of each class are independent:
P(X1, X2, X3 | Class) = P(X1 | Class)·P(X2 | Class)·P(X3 | Class)·…

The Naïve Bayes Classifier
P(X | C1)/P(X | C2) > P(C2)/P(C1) ⇒ predict C1
becomes
(P(X1 | C1)·P(X2 | C1)·…) / (P(X1 | C2)·P(X2 | C2)·…) > P(C2)/P(C1) ⇒ predict C1
Naïve Bayes Classifier
1. Training: for each value v of each attribute i, compute P(Xi = v | C1) and P(Xi = v | C2) from the data.
2. Classification: for a given instance (x1, x2, …), compute the likelihood ratio
L = (P(X1 = x1 | C1)·P(X2 = x2 | C1)·…) / (P(X1 = x1 | C2)·P(X2 = x2 | C2)·…)
If L > P(C2)/P(C1), classify as C1, else C2.
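The training and classification steps above can be sketched end-to-end. The (humidity, wind) → play rows below are a hypothetical fragment in the spirit of the tennis table, and the sketch uses raw frequencies with no smoothing:

```python
from collections import Counter, defaultdict

# Hypothetical training data: (humidity, wind) -> play.
data = [
    (("High", "Weak"), "Yes"), (("High", "Weak"), "No"),
    (("High", "Strong"), "No"), (("Normal", "Weak"), "Yes"),
    (("Normal", "Strong"), "Yes"), (("Normal", "Weak"), "Yes"),
]

def train(examples):
    """Count class priors and per-attribute conditional frequencies."""
    class_counts = Counter(c for _, c in examples)
    cond = defaultdict(Counter)          # (attribute index, class) -> value counts
    for xs, c in examples:
        for i, v in enumerate(xs):
            cond[(i, c)][v] += 1
    return class_counts, cond

def predict(model, xs):
    """Pick the class maximizing P(C) * prod_i P(Xi = xi | C)."""
    class_counts, cond = model
    n = sum(class_counts.values())
    def score(c):
        p = class_counts[c] / n          # prior P(C)
        for i, v in enumerate(xs):       # naive product of P(Xi = v | C)
            p *= cond[(i, c)][v] / class_counts[c]
        return p
    return max(class_counts, key=score)

model = train(data)
```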
Naïve Bayes Classifier
• Extendable to continuous attributes via kernel density estimation.
• The goods:
– Easy to implement, efficient
– Won't overfit, interpretable
– Works better than you would expect (e.g. spam filtering)
• The bads:
– "Naïve", linear
– Usually won't work well for too many classes
Supervised learning
• Four examples of approaches
– Ad-hoc: decision tree induction
– Probabilistic modeling: Naïve Bayes classifier
– Objective function optimization: linear least squares regression
– Instance-based methods: k-nearest neighbors
Linear regression
• Consider data points of the type (xi, yi).
• Let us search for a function of the form f(x) = w·x (for various w) that would have the least possible sum of error squares:
E(w) = Σi (yi − w·xi)²
• The function having the least error is found by setting the derivative to 0:
dE/dw = −2·Σi xi·(yi − w·xi) = 0
we get
w = (Σi xi·yi) / (Σi xi²)
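The closed-form solution can be checked on a small hypothetical dataset (the numbers below roughly follow y = 2x):

```python
# w = sum(x_i * y_i) / sum(x_i^2) minimizes E(w) = sum((y_i - w*x_i)^2).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
sse = sum((y - w * x) ** 2 for x, y in zip(xs, ys))   # sum of squared errors at w
```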
Linear regression
• The idea naturally generalizes to more complex cases (e.g. multivariate regression)
• The goods:
– Simple, easy
– Interpretable, popular, extendable to many variants
– Won't overfit
• The bads:
– Linear, i.e. works for a limited set of datasets.
– Nearly never perfect.
Median Error Regression (quantile)
• Minimize mean error
• More robust: minimize median error
• Requires different optimisation
– Differential Evolution
– Genetic Programming
– …
K-Nearest Neighbors
• Training:
– Store and index all training instances
• Classification:
– Given an instance to classify, find the k instances from the training sample nearest to it.
– Predict the majority class among these k instances.
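The classification step is short enough to sketch directly (1-D points with absolute distance; the training sample is hypothetical):

```python
from collections import Counter

def knn_predict(train_set, x, k=3):
    """Majority class among the k training points nearest to x."""
    nearest = sorted(train_set, key=lambda p: abs(p[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical 1-D training sample with two well-separated classes.
train_set = [(1.0, "a"), (1.5, "a"), (2.0, "a"), (8.0, "b"), (8.5, "b"), (9.0, "b")]
```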
K-Nearest Neighbors
• The goods:
– Trivial concept
– Easy to implement
– Asymptotically optimal
– Allows various kinds of data
• The bads:
– Difficult to implement efficiently
– Not interpretable (no model)
– On smaller datasets loses to other methods
Supervised learning: a map of approaches
• Ad-hoc: decision trees, forests; rule induction, ILP; fuzzy reasoning; …
• Probabilistic models: Naïve Bayes; graphical models; regression models; density estimation; …
• Objective optimization: regression models; kernel methods, SVM, RBF; neural networks; …
• Instance-based: k-NN, LOWESS; kernel densities; SVM, RBF; …
Ensemble learners: arcing, boosting, bagging, dagging, voting, stacking

Linear classifiers
H3 (green) doesn't separate the two classes. H1 (blue) does, with a small margin, and H2 (red) with the maximum margin.
Linear classification
• Learning a linear classifier from data:
– Minimizing training error: NP-complete
– Minimizing the sum of error squares: suboptimal, yet can be easy and fun, e.g. the perceptron.
– Maximizing the margin: doable and well-founded by statistical learning theory
19.11.2009 Machine learning :: Introduction :: Part III
Maximum-margin hyperplane and margins for an SVM trained with samples from two classes. Samples on the margin are called the support vectors.
The margin
• For any point x, its distance to the hyperplane is |⟨w, x⟩ + b| / ‖w‖.
• Assuming all points are classified correctly: yi·(⟨w, xi⟩ + b) ≥ 1 for all i.
• The margin is then 2 / ‖w‖.

Maximal margin training
• Find (w, b) such that the margin is maximal; this can be shown to be equivalent with:
minimize ½‖w‖² subject to yi·(⟨w, xi⟩ + b) ≥ 1 for all i.
• This is doable using efficient optimization algorithms.

Soft margin training
• If the data is not linearly separable:
minimize ½‖w‖² + C·Σi ξi subject to yi·(⟨w, xi⟩ + b) ≥ 1 − ξi, ξi ≥ 0.
• It is called the Support Vector Machine (SVM).
• It can be generalized to regression tasks.

Soft margin training
• In more general form:
minimize Σi L(yi, ⟨w, xi⟩ + b) + λ‖w‖², where L(y, f) = max(0, 1 − y·f) is the hinge loss.
Regularization
• There are many algorithms which essentially look as follows:
For given data D, find a model M which minimizes Error(M, D) + Complexity(M)
• An SVM is a linear model which minimizes hinge loss + l2-norm penalty
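The Error(M, D) + Complexity(M) pattern can be made concrete for a 1-D linear model f(x) = w·x + b with hinge loss and an l2 penalty; the data points and the λ value below are hypothetical:

```python
# Hinge loss + l2 penalty: the SVM-style objective for a 1-D linear model.
def hinge(y, fx):
    """max(0, 1 - y*f(x)): zero once the point is beyond the margin."""
    return max(0.0, 1.0 - y * fx)

def objective(w, b, data, lam=0.1):
    loss = sum(hinge(y, w * x + b) for x, y in data)   # Error(M, D)
    return loss + lam * w * w                          # + Complexity(M)

# Hypothetical linearly separable points (x, y) with labels y in {-1, +1}.
data = [(-2.0, -1), (-1.0, -1), (1.0, 1), (2.0, 1)]
```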
Going nonlinear
• But a linear classifier is so linear!

Solution: a nonlinear map
• Instead of classifying points x, we'll classify mapped points φ(x) in a higher-dimensional space.

Linear separation in a non-linear world
http://news360.com/digestarticle/t-2ryjS4d0io86BLq8d-qA
Kernel methods (not density!)
• For nearly any linear classifier: f(x) = ⟨w, x⟩ + b
• Trained on a dataset (x1, y1), …, (xn, yn),
• the resulting vector w can be represented as a linear combination of the training points: w = Σi αi·xi.
• Which means: f(x) = Σi αi·⟨xi, x⟩ + b = Σi αi·K(xi, x) + b.

Kernel methods (not density!)
• The function K is called a kernel; it measures similarity between objects.
• The explicit computation of φ(x) is unnecessary.
• You can use any type of data.
• Your method is nonlinear.
• Any linear method can be kernelized.
• Kernels can be combined.
Summary
• SVM:
– A maximal margin linear classifier.
– A linear model which minimizes hinge loss + l2-norm penalty.
– Kernelizable.
• Kernel methods:
– An easy and elegant way of "plugging in" nonlinearity and different data types.
ML Gallery
• http://home.comcast.net/~tom.fawcett/public_html/ML-gallery/pages/index.html
– Now disappeared from the net…
• A collection of example datasets and an illustration of different ML algorithms' performance on them
In previous episodes
Approaches to data analysis
• The general principle is the same, though:
1. Define a set of patterns of interest
2. Define a measure of goodness for the patterns
3. Find the best pattern in the data
The "No Free Lunch" Theorem
Learning purely from data is, in general, impossible

X | Y | Output
0 | 0 | False
0 | 1 | True
1 | 0 | True
1 | 1 | ?
Statistical Learning Theory
• What is learning and how to analyze it?
– There are various ways to answer this question. We'll just consider the most popular one.
• Let there be a fixed (unknown) distribution of data.
• We observe an i.i.d. sample S drawn from it.
• We produce a classifier g.
Inductive bias
The "No Free Lunch" Theorem
Learning purely from data is, in general, impossible
• Is it good or bad?
– Good for cryptographers, bad for data miners
• What should we do to enable learning?
– Introduce assumptions about data ("inductive bias"):
1. How does existing data relate to the future data?
2. What is the system we are learning?
Rule 1: Generalization will only come through understanding of similarity!
Rule 2: Data can only partially substitute knowledge about the system!
Statistical Learning Theory
(figure: the distribution of data; x – coords, y – color)
(figure: a sample S and the trained classifier g)
(figure: the generalization error of g)
(figure: the expected generalization error of the method)
Statistical Learning Theory
• Questions:
– What is the generalization error of our classifier?
• How to estimate it?
• How to find a classifier with low generalization error?
– What is the expected generalization error of our method?
• How to estimate it?
• What methods have low expected generalization error?
• What methods are asymptotically optimal (consistent)?
– When is learning computationally tractable?
Statistical Learning Theory
• Some answers: for linear classifiers
– Finding a linear classifier with a small training error is a good idea
(however, finding such a classifier is NP-complete, hence an alternative method must be used)

Statistical Learning Theory
• Some answers: in general
– Small training error ⇒ small generalization error.
• But only if you search a limited space of classifiers.
• The more data you have, the larger space of classifiers you can afford.
Overfitting
• Why limited space?
– Suppose your hypothesis space is just one classifier:
f(x) = if [x > 3] then 1 else 0
– You pick the first five training instances:
(1 → 0), (2 → 0), (4 → 1), (6 → 1), (-1 → 0)
– How surprised are you? How can you interpret it?
– What if you had 100000 classifiers to start with and one of them matched? Would you be surprised?

Large hypothesis space ⇒ small training error becomes a matter of chance ⇒ overfitting
Bias-variance dilemma
• So what if the data is scarce?
– No free lunch – bias-variance tradeoff:
(figure: the gap between the optimal error and the optimal error in your hypothesis class is the bias; the gap between the latter and your classifier's error is the variance, i.e. "overfitting")
– The only way out is to introduce a strong yet "correct" bias (or, well, to get more data).
Summary
• Learning can be approached formally
• Learning is feasible in many cases
– But you pay with data or prior knowledge
• You have to be careful with complex models
– Beware overfitting
– If data is scarce, use simple models: they are not optimal, but at least you can fit them from the data!

Using complex models with scarce data is like throwing data away.
Next
• "Machine learning" – terminology, foundations, general framework.
• Supervised machine learning – basic ideas, algorithms & toy examples.
• Statistical challenges – learning theory, bias-variance, consistency…
• State-of-the-art techniques – SVM, kernel methods, graphical models, latent variable models, boosting, bagging, LASSO, on-line learning, deep learning, reinforcement learning, …
Summary summary
• "Machine learning" – terminology, foundations, general framework.
• Supervised machine learning – basic ideas, algorithms & toy examples.
• Statistical challenges – learning theory, bias-variance, consistency, …
• State-of-the-art techniques – SVM, kernel methods.