
Transcript of arXiv:1803.08823v3 [physics.comp-ph] 27 May 2019

A high-bias, low-variance introduction to Machine Learning for physicists

Pankaj Mehta, Ching-Hao Wang, Alexandre G. R. Day, and Clint Richardson

Department of Physics, Boston University, Boston, MA 02215, USA∗

Marin Bukov

Department of Physics, University of California, Berkeley, CA 94720, USA†

Charles K. Fisher

Unlearn.AI, San Francisco, CA 94108

David J. Schwab

Initiative for the Theoretical Sciences, The Graduate Center, City University of New York, 365 Fifth Ave., New York, NY 10016

(Dated: May 29, 2019)

Machine Learning (ML) is one of the most exciting and dynamic areas of modern research and application. The purpose of this review is to provide an introduction to the core concepts and tools of machine learning in a manner easily understood and intuitive to physicists. The review begins by covering fundamental concepts in ML and modern statistics such as the bias-variance tradeoff, overfitting, regularization, generalization, and gradient descent before moving on to more advanced topics in both supervised and unsupervised learning. Topics covered in the review include ensemble models, deep learning and neural networks, clustering and data visualization, energy-based models (including MaxEnt models and Restricted Boltzmann Machines), and variational methods. Throughout, we emphasize the many natural connections between ML and statistical physics. A notable aspect of the review is the use of Python Jupyter notebooks to introduce modern ML/statistical packages to readers using physics-inspired datasets (the Ising Model and Monte-Carlo simulations of supersymmetric decays of proton-proton collisions). We conclude with an extended outlook discussing possible uses of machine learning for furthering our understanding of the physical world as well as open problems in ML where physicists may be able to contribute.

∗ [email protected]
† [email protected]

CONTENTS

I. Introduction
   A. What is Machine Learning?
   B. Why study Machine Learning?
   C. Scope and structure of the review

II. Why is Machine Learning difficult?
   A. Setting up a problem in ML and data science
   B. Polynomial Regression

III. Basics of Statistical Learning Theory
   A. Three simple schematics that summarize the basic intuitions from Statistical Learning Theory
   B. Bias-Variance Decomposition

IV. Gradient Descent and its Generalizations
   A. Gradient Descent and Newton's method
   B. Limitations of the simplest gradient descent algorithm
   C. Stochastic Gradient Descent (SGD) with mini-batches
   D. Adding Momentum
   E. Methods that use the second moment of the gradient
   F. Comparison of various methods
   G. Gradient descent in practice: practical tips

V. Overview of Bayesian Inference
   A. Bayes Rule
   B. Bayesian Decisions
   C. Hyperparameters

VI. Linear Regression


   A. Least-square regression
   B. Ridge-Regression
   C. LASSO and Sparse Regression
   D. Using Linear Regression to Learn the Ising Hamiltonian
   E. Convexity of regularizer
   F. Bayesian formulation of linear regression
   G. Recap and a general perspective on regularizers

VII. Logistic Regression
   A. The cross-entropy as a cost function for logistic regression
   B. Minimizing the cross entropy
   C. Examples of binary classification
      1. Identifying the phases of the 2D Ising model
      2. SUSY
   D. Softmax Regression
   E. An Example of SoftMax Classification: MNIST Digit Classification

VIII. Combining Models
   A. Revisiting the Bias-Variance Tradeoff for Ensembles
      1. Bias-Variance Decomposition for Ensembles
      2. Summarizing the Theory and Intuitions behind Ensembles
   B. Bagging
   C. Boosting
   D. Random Forests
   E. Gradient Boosted Trees and XGBoost
   F. Applications to the Ising model and Supersymmetry Datasets

IX. An Introduction to Feed-Forward Deep Neural Networks (DNNs)
   A. Neural Network Basics
      1. The basic building block: neurons
      2. Layering neurons to build deep networks: network architecture
   B. Training deep networks
   C. The Backpropagation algorithm
      1. Deriving and implementing the backpropagation equations
      2. Computing gradients in deep networks: what can go wrong with backprop?
   D. Regularizing neural networks and other practical considerations
      1. Implicit regularization using SGD: initialization, hyper-parameter tuning, and Early Stopping
      2. Dropout
      3. Batch Normalization
   E. Deep neural networks in practice: examples
      1. Deep learning packages
      2. Approaching the learning problem
      3. SUSY dataset
      4. Phases of the 2D Ising model

X. Convolutional Neural Networks (CNNs)
   A. The structure of convolutional neural networks
   B. Example: CNNs for the 2D Ising model
   C. Pre-trained CNNs and transfer learning

XI. High-level Concepts in Deep Neural Networks
   A. Organizing deep learning workflows using the bias-variance tradeoff
   B. Why neural networks are so successful: three high-level perspectives on neural networks
      1. Neural networks as representation learning
      2. Neural networks can exploit large amounts of data
      3. Neural networks scale up well computationally
   C. Limitations of supervised learning with deep networks

XII. Dimensional Reduction and Data Visualization
   A. Some of the challenges of high-dimensional data
   B. Principal component analysis (PCA)
   C. Multidimensional scaling
   D. t-SNE

XIII. Clustering
   A. Practical clustering methods
      1. K-means
      2. Hierarchical clustering: Agglomerative methods
      3. Density-based (DB) clustering
   B. Clustering and Latent Variables via the Gaussian Mixture Models
   C. Clustering in high dimensions

XIV. Variational Methods and Mean-Field Theory (MFT)
   A. Variational mean-field theory for the Ising model
   B. Expectation Maximization (EM)

XV. Energy Based Models: Maximum Entropy (MaxEnt) Principle, Generative models, and Boltzmann Learning
   A. An overview of energy-based generative models
   B. Maximum entropy models: the simplest energy-based generative models
      1. MaxEnt models in statistical mechanics
      2. From statistical mechanics to machine learning
      3. Generalized Ising Models from MaxEnt
   C. Cost functions for training energy-based models
      1. Maximum likelihood
      2. Regularization
   D. Computing gradients
   E. Summary of the training procedure

XVI. Deep Generative Models: Hidden Variables and Restricted Boltzmann Machines (RBMs)
   A. Why hidden (latent) variables?
   B. Restricted Boltzmann Machines (RBMs)
   C. Training RBMs
      1. Gibbs sampling and contrastive divergence (CD)
      2. Practical Considerations
   D. Deep Boltzmann Machine
   E. Generative models in practice: examples
      1. MNIST
      2. Example: 2D Ising Model
   F. Generative models in physics

XVII. Variational AutoEncoders (VAEs) and Generative Adversarial Networks (GANs)
   A. The limitations of maximizing Likelihood
   B. Generative models and adversarial learning
   C. Variational Autoencoders (VAEs)
      1. VAEs as variational models
      2. Training via the reparametrization trick
      3. Connection to the information bottleneck
   D. VAE with Gaussian latent variables and Gaussian encoder
      1. Implementing the Gaussian VAE
      2. VAEs for the MNIST dataset
      3. VAEs for the 2D Ising model

XVIII. Outlook
   A. Research at the intersection of physics and ML
   B. Topics not covered in review
   C. Rebranding Machine Learning as "Artificial Intelligence"
   D. Social Implications of Machine Learning

XIX. Acknowledgments

A. Overview of the Datasets used in the Review
   1. Ising dataset
   2. SUSY dataset
   3. MNIST Dataset

References

I. INTRODUCTION

Machine Learning (ML), data science, and statistics are fields that describe how to learn from, and make predictions about, data. The availability of big datasets is a hallmark of modern science, including physics, where data analysis has become an important component of diverse areas, such as experimental particle physics, observational astronomy and cosmology, condensed matter physics, biophysics, and quantum computing. Moreover, ML and data science are playing increasingly important roles in many aspects of modern technology, ranging from biotechnology to the engineering of self-driving cars and smart devices. Therefore, having a thorough grasp of the concepts and tools used in ML is an important skill that is increasingly relevant in the physical sciences.

The purpose of this review is to serve as an introduction to foundational and state-of-the-art techniques in ML and data science for physicists. The review seeks to find a middle ground between a short overview and a full-length textbook. While there exist many wonderful ML textbooks (Abu-Mostafa et al., 2012; Bishop, 2006; Friedman et al., 2001; Murphy, 2012), they are lengthy and use specialized language that is often unfamiliar to physicists. This review builds upon the considerable knowledge most physicists already possess in statistical physics in order to introduce many of the major ideas and techniques used in modern ML. We take a physics-inspired pedagogical approach, emphasizing simple examples (e.g., regression and clustering), before delving into more advanced topics. The intention of this review and the accompanying Jupyter notebooks (available at https://physics.bu.edu/~pankajm/MLnotebooks.html) is to give the reader the requisite background knowledge to follow and apply these techniques to their own areas of interest.

While this review is written with a physics background in mind, we aim for it to be useful to anyone with some background in statistical physics, and it is suitable for both graduate students and researchers as well as advanced undergraduates. The review is based on an advanced topics graduate course taught at Boston University in Fall of 2016. As such, it assumes a level of familiarity with several topics found in graduate physics curricula (partition functions, statistical mechanics) and a fluency in mathematical techniques such as linear algebra, multivariate calculus, variational methods, probability theory, and Monte-Carlo methods. It also assumes a familiarity with basic computer programming and algorithmic design.

A. What is Machine Learning?

Most physicists learn the basics of classical statistics early on in undergraduate laboratory courses. Classical statistics is primarily concerned with how to use data to estimate the value of an unknown quantity. For instance, estimating the speed of light using measurements obtained with an interferometer is one such example that relies heavily on techniques from statistics.

Machine Learning is a subfield of artificial intelligence with the goal of developing algorithms capable of learning from data automatically. In particular, an artificially intelligent agent needs to be able to recognize objects in its surroundings and predict the behavior of its environment in order to make informed choices. Therefore, techniques in ML tend to be more focused on prediction rather than estimation. For example, how do we use data from the interferometry experiment to predict what interference pattern would be observed under a different experimental setup? In addition, methods from ML tend to be applied to more complex high-dimensional problems than those typically encountered in a classical statistics course.

Despite these differences, estimation and prediction problems can be cast into a common conceptual framework. In both cases, we choose some observable quantity x of the system we are studying (e.g., an interference pattern) that is related to some parameters θ (e.g., the speed of light) of a model p(x|θ) that describes the probability of observing x given θ. Now, we perform an experiment to obtain a dataset X and use these data to fit the model. Typically, "fitting" the model involves finding θ̂ that provides the best explanation for the data. In the case when "fitting" refers to the method of least squares, the estimated parameters maximize the probability of observing the data (i.e., θ̂ = arg max_θ p(X|θ)). Estimation problems are concerned with the accuracy of θ̂, whereas prediction problems are concerned with the ability of the model to predict new observations (i.e., the accuracy of p(x|θ̂)). Although the goals of estimation and prediction are related, they often lead to different approaches. As this review is aimed as an introduction to the concepts of ML, we will focus on prediction problems and refer the reader to one of many excellent textbooks on classical statistics for more information on estimation (Lehmann and Casella, 2006; Lehmann and Romano, 2006; Wasserman, 2013; Witte and Witte, 2013).
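To make the distinction concrete, the following minimal sketch (the linear model and all numbers here are illustrative assumptions, not taken from the interferometry example) estimates θ by least squares – which, for Gaussian noise, maximizes p(X|θ) – and then uses the fitted model to predict new observations:

```python
# Estimation vs. prediction for an assumed toy model y = theta * x + noise.
import numpy as np

rng = np.random.default_rng(0)
theta_true = 3.0                                   # illustrative "true" parameter
x = rng.uniform(0, 1, size=50)
y = theta_true * x + rng.normal(0, 0.1, size=50)

# Estimation: for Gaussian noise, maximizing p(X|theta) reduces to least squares.
theta_hat = np.sum(x * y) / np.sum(x**2)

# Prediction: apply the fitted model to inputs from a different "setup".
x_new = rng.uniform(1, 2, size=10)
y_pred = theta_hat * x_new
print(theta_hat, y_pred[:3])
```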


B. Why study Machine Learning?

The last three decades have seen an unprecedented increase in our ability to generate and analyze large datasets. This "big data" revolution has been spurred by an exponential increase in computing power and memory commonly known as Moore's law. Computations that were unthinkable a few decades ago can now be routinely performed on laptops. Specialized computing machines (such as GPU-based machines) are continuing this trend towards cheap, large-scale computation, suggesting that the "big data" revolution is here to stay.

This increase in our computational ability has been accompanied by new techniques for analyzing and learning from large datasets. These techniques draw heavily from ideas in statistics, computational neuroscience, computer science, and physics. Similar to physics, modern ML places a premium on empirical results and intuition over the more formal treatments common in statistics, computer science, and mathematics. This is not to say that proofs are not important or undesirable. Rather, many of the advances of the last two decades – especially in fields like deep learning – do not have formal justifications (much like there still exists no mathematically well-defined concept of the Feynman path-integral in d > 1).

Physicists are uniquely situated to benefit from and contribute to ML. Many of the core concepts and techniques used in ML – such as Monte-Carlo methods, simulated annealing, variational methods – have their origins in physics. Moreover, "energy-based models" inspired by statistical physics are the backbone of many deep learning methods. For these reasons, there is much in modern ML that will be familiar to physicists.

Physicists and astronomers have also been at the forefront of using "big data". For example, experiments such as CMS and ATLAS at the LHC generate petabytes of data per year. In astronomy, projects such as the Sloan Digital Sky Survey (SDSS) routinely analyze and release hundreds of terabytes of data measuring the properties of nearly a billion stars and galaxies. Researchers in these fields are increasingly incorporating recent advances in ML and data science, and this trend is likely to accelerate in the future.

Besides applications to physics, part of the goal of this review is to serve as an introductory resource for those looking to transition to more industry-oriented projects. Physicists have already made many important contributions to modern big data applications in an industrial setting (Metz, 2017). Data scientists and ML engineers in industry use concepts and tools developed for ML to gain insight from large datasets. A familiarity with ML is a prerequisite for many of the most exciting employment opportunities in the field, and we hope this review will serve as a useful introduction to ML for physicists beyond an academic setting.

C. Scope and structure of the review

Any review on ML must simultaneously accomplish two related but distinct goals. First, it must convey the rich theoretical foundations underlying modern ML. This task is made especially difficult because ML is very broad and interdisciplinary, drawing on ideas and intuitions from many fields including statistics, computational neuroscience, and physics. Unfortunately, this means making choices about what theoretical ideas to include in the review. This review emphasizes connections with statistical physics, physics-inspired Bayesian inference, and computational neuroscience models. Thus, certain ideas (e.g., gradient descent, expectation maximization, variational methods, and deep learning and neural networks) are covered extensively, while other important ideas are given less attention or even omitted entirely (e.g., statistical learning, support vector machines, kernel methods, Gaussian processes). Second, any ML review must give the reader the practical know-how to start using the tools and concepts of ML for practical problems. To accomplish this, we have written a series of Jupyter notebooks to accompany this review. These Python notebooks introduce the nuts-and-bolts of how to use, code, and implement the methods introduced in the main text. Luckily, there are numerous great ML software packages available in Python (scikit-learn, tensorflow, Pytorch, Keras) and we have made extensive use of them. We have also made use of a new package, Paysage, for energy-based generative models which has been co-developed by one of the authors (CKF) and maintained by Unlearn.AI (a company affiliated with two of the authors: CKF and PM). The purpose of the notebooks is to both familiarize physicists with these resources and to serve as a starting point for experimenting and playing with ideas.

ML can be divided into three broad categories: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning concerns learning from labeled data (for example, a collection of pictures labeled as containing a cat or not containing a cat). Common supervised learning tasks include classification and regression. Unsupervised learning is concerned with finding patterns and structure in unlabeled data. Examples of unsupervised learning include clustering, dimensionality reduction, and generative modeling. Finally, in reinforcement learning an agent learns by interacting with an environment and changing its behavior to maximize its reward. For example, a robot can be trained to navigate in a complex environment by assigning a high reward to actions that help the robot reach a desired destination. We refer the interested reader to the classic book by Sutton and Barto, Reinforcement Learning: an Introduction (Sutton and Barto, 1998). While useful, the distinction between the three types of ML is sometimes fuzzy and fluid, and many applications often combine them in novel and interesting ways. For example, the recent success of Google DeepMind in developing ML algorithms that excel at tasks such as playing Go and video games employs deep reinforcement learning, combining reinforcement learning with supervised learning methods based on deep neural networks.

Here, we limit our focus to supervised and unsupervised learning. The literature on reinforcement learning is extensive and uses ideas and concepts that, to a large degree, are distinct from supervised and unsupervised learning tasks. For this reason, to ensure cohesiveness and limit the length of this review, we have chosen not to discuss reinforcement learning. However, this omission should not be mistaken for a value judgement on the utility of reinforcement learning for solving physical problems. For example, some of the authors have used inspiration from reinforcement learning to tackle difficult problems in quantum control (Bukov, 2018; Bukov et al., 2018).

In writing this review, we have tried to adopt a style that reflects what we consider to be the best of the physics tradition. Physicists understand the importance of well-chosen examples for furthering our understanding. It is hard to imagine a graduate course in statistical physics without the Ising model. Each new concept that is introduced in statistical physics (mean-field theory, transfer matrix techniques, high- and low-temperature expansions, the renormalization group, etc.) is applied to the Ising model. This allows for the progressive building of intuition and ultimately a coherent picture of statistical physics. We have tried to replicate this pedagogical approach in this review by focusing on a few well-chosen techniques – linear and logistic regression in the case of supervised learning and clustering in the case of unsupervised learning – to introduce the major theoretical concepts.

In this same spirit, we have chosen three interesting datasets with which to illustrate the various algorithms discussed here. (i) The SUSY data set consists of 5,000,000 Monte-Carlo samples of proton-proton collisions decaying to either signal or background processes, which are both parametrized with 18 features. The signal process is the production of electrically-charged supersymmetric particles, which decay to W bosons and an electrically-neutral supersymmetric particle, invisible to the detector, while the background processes are various decays involving only Standard Model particles (Baldi et al., 2014). (ii) The Ising data set consists of 10^4 states of the 2D Ising model on a 40 × 40 square lattice, obtained using Monte-Carlo (MC) sampling at a few fixed temperatures T. (iii) The MNIST dataset comprises 70,000 handwritten digits, each of which comes in a square image, divided into a 28 × 28 pixel grid. The first two datasets were chosen to reflect the various subdisciplines of physics (high-energy experiment, condensed matter) where we foresee techniques from ML becoming an increasingly important tool for research. The MNIST dataset, on the other hand, introduces the flavor of present-day ML problems. By re-analyzing the same datasets with multiple techniques, we hope readers will be able to get a sense of the various, inevitable tradeoffs involved in choosing how to analyze data. Certain techniques work better when data is limited while others may be better suited to large data sets with many features. A short description of these datasets is given in the Appendix.

This review draws generously on many wonderful textbooks on ML and we encourage the reader to consult them for further information. They include Abu Mostafa's masterful Learning from Data, which introduces the basic concepts of statistical learning theory (Abu-Mostafa et al., 2012); the more advanced but equally good The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman (Friedman et al., 2001); Michael Nielsen's indispensable Neural Networks and Deep Learning, which serves as a wonderful introduction to neural networks and deep learning (Nielsen, 2015); and David MacKay's outstanding Information Theory, Inference, and Learning Algorithms, which introduced Bayesian inference and information theory to a whole generation of physicists (MacKay, 2003). More comprehensive (and much longer) books on modern ML techniques include Christopher Bishop's classic Pattern Recognition and Machine Learning (Bishop, 2006) and the more recently published Machine Learning: A Probabilistic Perspective by Kevin Murphy (Murphy, 2012). Finally, one of the great successes of modern ML is deep learning, and some of the pioneers of this field have written a textbook for students and researchers entitled Deep Learning (Goodfellow et al., 2016). In addition to these textbooks, we have consulted numerous research papers, reviews, and web resources. Whenever possible, we have tried to point the reader to key papers and other references that we have found useful in preparing this review. However, we are neither capable of nor have we made any effort to make a comprehensive review of the literature.

The review is organized as follows. We begin by introducing polynomial regression as a simple example that highlights many of the core ideas of ML. The next few chapters introduce the language and major concepts needed to make these ideas more precise, including tools from statistical learning theory such as overfitting, the bias-variance tradeoff, regularization, and the basics of Bayesian inference. The next chapter builds on these examples to discuss stochastic gradient descent and its generalizations. We then apply these concepts to linear and logistic regression, followed by a detour to discuss how we can combine multiple statistical techniques to improve supervised learning, introducing bagging, boosting, random forests, and XGBoost. These ideas, though fairly technical, lie at the root of many of the advances in ML over the last decade. The review continues with a thorough discussion of supervised deep learning and neural networks, as well as convolutional nets. We then turn our focus to unsupervised learning. We start with data visualization and dimensionality reduction before proceeding to a detailed treatment of clustering. Our discussion of clustering naturally leads to an examination of variational methods and their close relationship with mean-field theory. The review continues with a discussion of deep unsupervised learning, focusing on energy-based models, such as Restricted Boltzmann Machines (RBMs) and Deep Boltzmann Machines (DBMs). Then we discuss two new and extremely popular modeling frameworks for unsupervised learning, generative adversarial networks (GANs) and variational autoencoders (VAEs). We conclude the review with an outlook and discussion of promising research directions at the intersection of physics and ML.

II. WHY IS MACHINE LEARNING DIFFICULT?

A. Setting up a problem in ML and data science

Many problems in ML and data science start with the same ingredients. The first ingredient is the dataset D = (X, y), where X is a matrix of independent variables and y is a vector of dependent variables. The second is the model f(x; θ), which is a function f : x → y of the parameters θ. That is, f is a function used to predict an output from a vector of input variables. The final ingredient is the cost function C(y, f(X; θ)) that allows us to judge how well the model performs on the observations y. The model is fit by finding the value of θ that minimizes the cost function. For example, one commonly used cost function is the squared error. Minimizing the squared error cost function is known as the method of least squares, and is typically appropriate for experiments with Gaussian measurement errors.

ML researchers and data scientists follow a standard recipe to obtain models that are useful for prediction problems. We will see why this is necessary in the following sections, but it is useful to present the recipe up front to provide context. The first step in the analysis is to randomly divide the dataset D into two mutually exclusive groups D_train and D_test called the training and test sets. The fact that this must be the first step should be heavily emphasized – performing some analysis (such as using the data to select important variables) before partitioning the data is a common pitfall that can lead to incorrect conclusions. Typically, the majority of the data are partitioned into the training set (e.g., 90%) with the remainder going into the test set. The model is fit by minimizing the cost function using only the data in the training set: θ̂ = arg min_θ C(y_train, f(X_train; θ)). Finally, the performance of the model is evaluated by computing the cost function using the test set, C(y_test, f(X_test; θ̂)). The value of the cost function for the best fit model on the training set is called the in-sample error E_in = C(y_train, f(X_train; θ̂)) and the value of the cost function on the test set is called the out-of-sample error E_out = C(y_test, f(X_test; θ̂)).

One of the most important observations we can make is that the out-of-sample error is almost always greater than the in-sample error, i.e. E_out ≥ E_in. We explore this point further in Sec. VI and its accompanying notebook. Splitting the data into mutually exclusive training and test sets provides an unbiased estimate for the predictive performance of the model – this is known as cross-validation in the ML and statistics literature. In many applications of classical statistics, we start with a mathematical model that we assume to be true (e.g., we may assume that Hooke's law is true if we are observing a mass-spring system) and our goal is to estimate the value of some unknown model parameters (e.g., we do not know the value of the spring stiffness). Problems in ML, by contrast, typically involve inference about complex systems where we do not know the exact form of the mathematical model that describes the system. Therefore, it is not uncommon for ML researchers to have multiple candidate models that need to be compared. This comparison is usually done using E_out; the model that minimizes this out-of-sample error is chosen as the best model (i.e. model selection). Note that once we select the best model on the basis of its performance on E_out, the real-world performance of the winning model should be expected to be slightly worse because the test data was now used in the fitting procedure.
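In code, the standard recipe reads as follows. This is a sketch using scikit-learn on a synthetic dataset; the 90/10 split and the names E_in and E_out follow the text, while everything else is illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(1000, 3))                    # independent variables
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, size=1000)

# Step 1: split BEFORE any analysis (90% train / 10% test, as in the text).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

# Step 2: fit by minimizing the cost on the training set only.
model = LinearRegression().fit(X_train, y_train)

# Step 3: compute in-sample and out-of-sample errors.
E_in = mean_squared_error(y_train, model.predict(X_train))
E_out = mean_squared_error(y_test, model.predict(X_test))
print(E_in, E_out)    # typically E_out >= E_in
```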

B. Polynomial Regression

In the previous section, we mentioned that multiple candidate models are typically compared using the out-of-sample error E_out. It may be at first surprising that the model that has the lowest out-of-sample error E_out usually does not have the lowest in-sample error E_in. Therefore, if our goal is to obtain a model that is useful for prediction, we may not want to choose the model that provides the best explanation for the current observations. At first glance, the observation that the model providing the best explanation for the current dataset probably will not provide the best explanation for future datasets is very counter-intuitive.

Moreover, the discrepancy between E_in and E_out becomes more and more important as the complexity of our data, and the models we use to make predictions, grows. As the number of parameters in the model increases, we are forced to work in high-dimensional spaces. The "curse of dimensionality" ensures that many phenomena that are absent or rare in low-dimensional spaces become generic. For example, the nature of distance changes in high dimensions, as evidenced in the derivation of the Maxwell distribution in statistical physics, where the fact that all the volume of a d-dimensional sphere of radius r is contained in a small spherical shell around r is exploited.


FIG. 1 Fitting versus predicting for noiseless data. N_train = 10 points in the range x ∈ [0, 1] were generated from a linear model (top) or tenth-order polynomial (bottom). This data was fit using three model classes: linear models (red), all polynomials of order 3 (yellow), and all polynomials of order 10 (green), and used to make predictions on N_test = 20 new data points with x_test ∈ [0, 1.2] (shown on right). Notice that in the absence of noise (σ = 0), given enough data points, fitting and predicting are identical.

Almost all critical points of a function (i.e., the points where all derivatives vanish) are saddles rather than maxima or minima (an observation first made in physics in the context of the p-spin spherical spin glass). For all these reasons, it turns out that for complicated models studied in ML, predicting and fitting are very different things (Bickel et al., 2006).
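The claim about d-dimensional spheres is easy to check numerically: since the volume of a d-dimensional ball scales as r^d, the fraction of its volume in a thin outer shell [r(1 − ε), r] is 1 − (1 − ε)^d, which rapidly approaches one as d grows:

```python
# Fraction of a d-ball's volume in a thin outer shell [r(1 - eps), r].
# V_d(r) ~ r^d implies the shell fraction is 1 - (1 - eps)^d -> 1 as d grows.
eps = 0.01
for d in (3, 30, 300, 3000):
    print(d, 1 - (1 - eps) ** d)
```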

To develop some intuition about why we need to pay close attention to out-of-sample performance, we will consider a simple one-dimensional problem – polynomial regression. Our task is a simple one: fitting data with polynomials of different order. We will explore how our ability to predict depends on the number of data points we have, the "noise" in the data generation process, and our prior knowledge about the system. The goal is to build intuition about why prediction is difficult in preparation for introducing general strategies that overcome these difficulties.

Before reading the rest of the section, we strongly encourage the reader to read Notebook 1 and complete the accompanying exercises.

Consider a probabilistic process that assigns a label y_i to an observation x_i. The data are generated by drawing samples from the equation

y_i = f(x_i) + η_i,   (1)

where f(x_i) is some fixed (but possibly unknown) function, and η_i is a Gaussian, uncorrelated noise variable, such that

⟨η_i⟩ = 0,
⟨η_i η_j⟩ = δ_ij σ².


FIG. 2 Fitting versus predicting for noisy data. N_train = 100 noisy data points (σ = 1) in the range x ∈ [0, 1] were generated from a linear model (top) or tenth-order polynomial (bottom). This data was fit using three model classes: linear models (red), all polynomials of order 3 (yellow), and all polynomials of order 10 (green), and used to make predictions on N_test = 20 new data points with x_test ∈ [0, 1.2] (shown on right). Notice that even when the data was generated using a tenth order polynomial, the linear and third order polynomials give better out-of-sample predictions, especially beyond the x range over which the model was trained.

We will refer to f(x_i) as the function used to generate the data, and σ as the noise strength. The larger σ is, the noisier the data; σ = 0 corresponds to the noiseless case.

To make predictions, we will consider a family of functions f_α(x; θ_α) that depend on some parameters θ_α. These functions represent the model class that we are using to model the data and make predictions. Note that we choose the model class without knowing the function f(x). The f_α(x; θ_α) encode the features we choose to represent the data. In the case of polynomial regression we will consider three different model classes: (i) all polynomials of order 1, which we denote by f_1(x; θ_1), (ii) all polynomials up to order 3, which we denote by f_3(x; θ_3), and (iii) all polynomials of order 10, f_10(x; θ_10). Notice that these three model classes contain different numbers of parameters. Whereas f_1(x; θ_1) has only two parameters (the coefficients of the zeroth and first order terms in the polynomial), f_3(x; θ_3) and f_10(x; θ_10) have four and eleven parameters, respectively. This reflects the fact that these three models have different model complexities. If we think of each term in the polynomial as a "feature" in our model, then increasing the order of the polynomial we fit increases the number of features. Using a more complex model class may give us better predictive power, but only if we have a large enough sample size to accurately learn the model parameters associated with these extra features from the training dataset.

To learn the parameters θ_α, we will train our models on a training dataset and then test the effectiveness of the model on a different dataset, the test dataset. Since we are interested only in gaining intuition, we will simply plot the fitted polynomials and compare the predictions of our fits for the test data with the true values. As we will see below, the models that give the best fit to existing data do not necessarily make the best predictions even for a simple task like polynomial regression.

To illustrate these ideas, we encourage the reader to experiment with the accompanying notebook to generate data using a linear function f(x) = 2x and a tenth order polynomial f(x) = 2x − 10x^5 + 15x^10 and ask how the size of the training dataset N_train and the noise strength σ affect the ability to make predictions. Obviously, more data and less noise lead to better predictions. To train the models (linear, third-order, tenth-order), we uniformly sampled the interval x ∈ [0, 1] and constructed N_train training examples using (1). We then fit the models on these training samples using standard least-squares regression. To visualize the performance of the three models, we plot the predictions using the best fit parameters for a test set where x are drawn uniformly from the interval x ∈ [0, 1.2]. Notice that the test interval is slightly larger than the training interval.
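A bare-bones version of this experiment can be written with numpy alone. The sketch below is not the accompanying notebook itself, but it follows the same procedure; N_train and sigma are the knobs to vary:

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    return 2 * x - 10 * x**5 + 15 * x**10      # tenth-order polynomial from the text

N_train, sigma = 100, 1.0                      # experiment with these
x_train = rng.uniform(0, 1, size=N_train)
y_train = f(x_train) + sigma * rng.normal(size=N_train)
x_test = rng.uniform(0, 1.2, size=20)          # test interval slightly larger

for order in (1, 3, 10):
    coeffs = np.polyfit(x_train, y_train, order)                       # least-squares fit
    E_in = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    E_out = np.mean((np.polyval(coeffs, x_test) - f(x_test)) ** 2)     # vs. noiseless truth
    print(order, E_in, E_out)
```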

Figure 1 shows the results of this procedure for the noiseless case, σ = 0. Even using a small training set with N_train = 10 examples, we find that the model class that generated the data also provides the best fit and the most accurate out-of-sample predictions. That is, the linear model performs the best for data generated from a linear polynomial (the third and tenth order polynomials perform similarly), and the tenth order model performs the best for data generated from a tenth order polynomial. While this may be expected, the results are quite different for larger noise strengths.

Figure 2 shows the results of the same procedure for noisy data, σ = 1, and a larger training set, N_train = 100. As in the noiseless case, the tenth order model provides the best fit to the data (i.e., the lowest E_in). In contrast, the tenth order model now makes the worst out-of-sample predictions (i.e., the highest E_out). Remarkably, this is true even if the data were generated using a tenth order polynomial.

At small sample sizes, noise can create fluctuations in the data that look like genuine patterns. Simple models (like a linear function) cannot represent complicated patterns in the data, so they are forced to ignore the fluctuations and to focus on the larger trends. Complex models with many parameters, such as the tenth order polynomial in our example, can capture both the global trends and noise-generated patterns at the same time. In this case, the model can be tricked into thinking that the noise encodes real information. This problem is called "overfitting" and leads to a steep drop-off in predictive performance.

We can guard against overfitting in two ways: we can use less expressive models with fewer parameters, or we can collect more data so that the likelihood that the noise appears patterned decreases. Indeed, when we increase the size of the training data set by two orders of magnitude to N_train = 10^4 (see Figure 3), the tenth order polynomial clearly gives both the best fits and the most predictive power over the entire training range x ∈ [0, 1], and even slightly beyond to approximately x ≈ 1.05. This is our first experience with what is known as the bias-variance tradeoff, c.f. Sec. III.B. When the amount of training data is limited as it is when N_train = 100, one can often get better predictive performance by using a less expressive model (e.g., a lower order polynomial) rather than the more complex model (e.g., the tenth-order polynomial). The simpler model has more "bias" but is less dependent on the particular realization of the training dataset, i.e. less "variance". Finally, we note that even with ten thousand data points, the model's performance quickly degrades beyond the original training data range. This demonstrates the difficulty of predicting beyond the training data we mentioned earlier.

This simple example highlights why ML is so difficult and holds some universal lessons that we will encounter repeatedly in this review:

• Fitting is not predicting. Fitting existing data well is fundamentally different from making predictions about new data.

• Using a complex model can result in overfitting. Increasing a model's complexity (i.e., the number of fitting parameters) will usually yield better results on the training data. However, when the training data size is small and the data are noisy, this results in overfitting and can substantially degrade the predictive performance of the model.

• For complex datasets and small training sets, simple models can be better at prediction than complex models due to the bias-variance tradeoff. It takes less data to train a simple model than a complex one. Therefore, even though the correct model is guaranteed to have better predictive performance for an infinite amount of training data (less bias), the training errors stemming from finite-size sampling (variance) can cause simpler models to outperform the more complex model when sampling is limited.

• It is difficult to generalize beyond the situations encountered in the training data set.


FIG. 3 Fitting versus predicting for noisy data. N_train = 10^4 noisy data points (σ = 1) in the range x ∈ [0, 1] were generated from a tenth-order polynomial. This data was fit using three model classes: linear models (red), all polynomials of order 3 (yellow), and all polynomials of order 10 (green), and used to make predictions on N_test = 100 new data points with x_test ∈ [0, 1.2] (shown on right). The tenth order polynomial gives good predictions but the model's predictive power quickly degrades beyond the training data range.

III. BASICS OF STATISTICAL LEARNING THEORY

In this section, we briefly summarize and discuss the sense in which learning is possible, with a focus on supervised learning. We begin with an unknown function y = f(x) and fix a hypothesis set H consisting of all functions we are willing to consider, defined also on the domain of f. This set may be uncountably infinite (e.g., if there are real-valued parameters to fit). The choice of which functions to include in H usually depends on our intuition about the problem of interest. The function f(x) produces a set of pairs (x_i, y_i), i = 1 . . . N, which serve as the observable data. Our goal is to select a function from the hypothesis set h ∈ H that approximates f(x) as best as possible, namely, we would like to find h ∈ H such that h ≈ f in some strict mathematical sense which we specify below. If this is possible, we say that we learned f(x). But if the function f(x) can, in principle, take any value on unobserved inputs, how is it possible to learn in any meaningful sense?

The answer is that learning is possible in the restricted sense that the fitted model will probably perform approximately as well on new data as it did on the training data. Once an appropriate error function E is chosen for the problem under consideration (e.g., sum of squared errors in linear regression), we can define two distinct performance measures of interest: the in-sample error, E_in, and the out-of-sample or generalization error, E_out. Recall from Sec. II that both metrics are required due to the distinction between fitting and predicting.

This raises a natural question: Can we say something general about the relationship between E_in and E_out? Surprisingly, the answer is 'Yes'. We can in fact say quite a bit. This is the domain of statistical learning theory, and we give a brief overview of the main results in this section. Our goal is to briefly introduce some of the major ideas from statistical learning theory because of the important role they have played in shaping how we think about machine learning. However, this is a highly technical and theoretical field, so we will just skim over some introductory topics. A more thorough introduction to statistical learning theory can be found in the introductory textbook by Abu Mostafa (Abu-Mostafa et al., 2012).

A. Three simple schematics that summarize the basic intuitions from Statistical Learning Theory

The basic intuitions of statistical learning can be summarized in three simple schematics. The first schematic, shown in Figure 4, shows the typical out-of-sample error, E_out, and in-sample error, E_in, as a function of the amount of training data. In making this graph, we have assumed that the true data is drawn from a sufficiently complicated distribution, so that we cannot exactly learn the function f(x). Hence, after a quick initial drop (not shown in figure), the in-sample error will increase with the number of data points, because our models are not powerful enough to learn the true function we are seeking to approximate. In contrast, the out-of-sample error will decrease with the number of data points.


FIG. 4 Schematic of typical in-sample and out-of-sample error as a function of training set size. The typical in-sample or training error, E_in, out-of-sample or generalization error, E_out, bias, variance, and difference of errors as a function of the number of training data points. The schematic assumes that the number of data points is large (in particular, the schematic does not show the initial drop in E_in for small amounts of data), and that our model cannot exactly fit the true function f(x).

As the number of data points gets large, the sampling noise decreases and the training data set becomes more representative of the true distribution from which the data is drawn. For this reason, in the infinite data limit, the in-sample and out-of-sample errors must approach the same value, which is called the "bias" of our model.

The bias represents the best our model could do if we had an infinite amount of training data to beat down sampling noise. The bias is a property of the kind of functions, or model class, we are using to approximate f(x). In general, the more complex the model class we use, the smaller the bias. However, we do not generally have an infinite amount of data. For this reason, to get the best predictive power it is better to minimize the out-of-sample error, E_out, rather than the bias. As shown in Figure 4, E_out can be naturally decomposed into a bias, which measures how well we can hypothetically do in the infinite data limit, and a variance, which measures the typical errors introduced in training our model due to sampling noise from having a finite training set.

The final quantity shown in Figure 4 is the difference between the generalization and training error. It measures how well our in-sample error reflects the out-of-sample error, and measures how much worse we would do on a new data set compared to our training data. For this reason, the difference between these errors is precisely the quantity that measures the difference between fitting and predicting.


FIG. 5 Bias-Variance tradeoff and model complexity. This schematic shows the typical out-of-sample error E_out as a function of the model complexity for a training dataset of fixed size. Notice how the bias always decreases with model complexity, but the variance, i.e., fluctuations in performance due to finite size sampling effects, increases with model complexity. Thus, optimal performance is achieved at intermediate levels of model complexity.

Models with a large difference between the in-sample and out-of-sample errors are said to "overfit" the data. One of the lessons of statistical learning theory is that it is not enough to simply minimize the training error, because the out-of-sample error can still be large. As we will see in our discussion of regression in Sec. VI, this insight naturally leads to the idea of "regularization".

The second schematic, shown in Figure 5, shows the out-of-sample, or test, error E_out as a function of "model complexity". Model complexity is a very subtle idea and defining it precisely is one of the great achievements of statistical learning theory. In many cases, model complexity is related to the number of parameters we are using to approximate the true function f(x)¹. In the example of polynomial regression discussed above, higher-order polynomials are more complex than the linear model. If we consider a training dataset of a fixed size, E_out will be a non-monotonic function of the model complexity, and is generally minimized for models with intermediate complexity. The underlying reason for this is that, even though using a more complicated model always reduces the bias, at some point the model becomes too complex for the amount of training data and the generalization error becomes large due to high variance. Thus, to minimize E_out and maximize our predictive power, it may be more suitable to use a more biased model with small variance than a less-biased model with large variance. This important concept is commonly called the bias-variance tradeoff and gets at the heart of why machine learning is difficult.

¹ There are, of course, exceptions. One neat example in the context of one-dimensional regression is given in (Friedman et al., 2001), Figure 7.5.


FIG. 6 Bias-Variance tradeoff. Another useful depiction of the bias-variance tradeoff is to think about how E_out varies as we consider different training data sets of a fixed size. A more complex model (green) will exhibit larger fluctuations (variance) due to finite size sampling effects than the simpler model (black). However, the average over all the trained models (bias) is closer to the true model for the more complex model. [Figure labels: "True model"; "Low variance, high-bias model"; "High variance, low-bias model".]

Another way to visualize the bias-variance tradeoff is shown in Figure 6. In this figure, we imagine training a complex model (shown in green) and a simpler model (shown in black) many times on different training sets of a fixed size N. Due to the sampling noise from having finite size data sets, the learned models will differ for each choice of training sets. In general, more complex models need a larger amount of training data. For this reason, the fluctuations in the learned models (variance) will be much larger for the more complex model than the simpler model. However, if we consider the asymptotic performance as we increase the size of the training set (the bias), it is clear that the complex model will eventually perform better than the simpler model. Thus, depending on the amount of training data, it may be more favorable to use a less complex, high-bias model to make predictions.

B. Bias-Variance Decomposition

In this section, we dig further into the central principle that underlies much of machine learning: the bias-variance tradeoff. We will discuss the bias-variance tradeoff in the context of continuous predictions such as regression. However, many of the intuitions and ideas discussed here also carry over to classification tasks. Consider a dataset D = (X, y) consisting of the N pairs of independent and dependent variables. Let us assume that the true data is generated from a noisy model

y = f(x) + ε,   (2)

where ε is normally distributed with mean zero and standard deviation σ_ε.

Assume that we have a statistical procedure (e.g., least-squares regression) for forming a predictor f̂(x; θ̂) that gives the prediction of our model for a new data point x. This estimator is chosen by minimizing a cost function which we take to be the squared error

C(y, f̂(X; θ)) = Σ_i (y_i − f̂(x_i; θ))².   (3)

Therefore, the estimates for the parameters,

θ̂_D = arg min_θ C(y, f̂(X; θ)),   (4)

are a function of the dataset, D. We would obtain a different error C(y_j, f̂(X_j; θ̂_{D_j})) for each dataset D_j = (y_j, X_j) in a universe of possible datasets obtained by drawing N samples from the true data distribution. We denote an expectation value over all of these datasets as E_D.

We would also like to average over different instances of the "noise" ε and we denote the expectation value over the noise by E_ε. Thus, we can decompose the expected generalization error as

E_{D,ε}[C(y, f̂(X; θ̂_D))] = E_{D,ε}[ Σ_i (y_i − f̂(x_i; θ̂_D))² ]
  = E_{D,ε}[ Σ_i (y_i − f(x_i) + f(x_i) − f̂(x_i; θ̂_D))² ]
  = Σ_i { E_ε[(y_i − f(x_i))²] + E_{D,ε}[(f(x_i) − f̂(x_i; θ̂_D))²] + 2 E_ε[y_i − f(x_i)] E_D[f(x_i) − f̂(x_i; θ̂_D)] }
  = Σ_i ( σ_ε² + E_D[(f(x_i) − f̂(x_i; θ̂_D))²] ),   (5)

where in the last line we used the fact that our noise has zero mean and variance σ_ε², and the sum over i applies to all terms. It is also helpful to further decompose the second term as follows:

E_D[(f(x_i) − f̂(x_i; θ̂_D))²]
  = E_D[(f(x_i) − E_D[f̂(x_i; θ̂_D)] + E_D[f̂(x_i; θ̂_D)] − f̂(x_i; θ̂_D))²]
  = (f(x_i) − E_D[f̂(x_i; θ̂_D)])² + E_D[(f̂(x_i; θ̂_D) − E_D[f̂(x_i; θ̂_D)])²]
    + 2 (f(x_i) − E_D[f̂(x_i; θ̂_D)]) E_D[E_D[f̂(x_i; θ̂_D)] − f̂(x_i; θ̂_D)]
  = (f(x_i) − E_D[f̂(x_i; θ̂_D)])² + E_D[(f̂(x_i; θ̂_D) − E_D[f̂(x_i; θ̂_D)])²],   (6)

where the cross term vanishes because E_D[E_D[f̂(x_i; θ̂_D)] − f̂(x_i; θ̂_D)] = 0.

The first term is called the bias,

Bias² = Σ_i (f(x_i) − E_D[f̂(x_i; θ̂_D)])²,   (7)

and measures the deviation of the expectation value of our estimator (i.e., the asymptotic value of our estimator in the infinite data limit) from the true value. The second term is called the variance,

Var = Σ_i E_D[(f̂(x_i; θ̂_D) − E_D[f̂(x_i; θ̂_D)])²],   (8)

and measures how much our estimator fluctuates due to finite-sample effects. Combining these expressions, we see that the expected out-of-sample error, E_out := E_{D,ε}[C(y, f̂(X; θ̂_D))], can be decomposed as

E_out = Bias² + Var + Noise,   (9)

with Noise = Σ_i σ_ε².

The bias-variance tradeoff summarizes the fundamental tension in machine learning, particularly supervised learning, between the complexity of a model and the amount of training data needed to train it. Since data is often limited, in practice it is often useful to use a less-complex model with higher bias – a model whose asymptotic performance is worse than another model – because it is easier to train and less sensitive to sampling noise arising from having a finite-sized training dataset (smaller variance). This is the basic intuition behind the schematics in Figs. 4, 5, and 6.
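The decomposition in Eq. (9) can be checked numerically. The sketch below simulates the universe of datasets by repeatedly drawing noisy samples from an assumed true function (a sine curve, chosen purely for illustration), fits a polynomial predictor to each draw, and compares a direct estimate of E_out against Bias² + Var + Noise:

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    return np.sin(2 * np.pi * x)               # assumed true function (illustrative)

x = np.linspace(0, 1, 50)                      # fixed design points x_i
sigma_eps, order, n_datasets = 0.3, 3, 2000

# Fit one predictor per simulated dataset D_j.
preds = np.empty((n_datasets, x.size))
for k in range(n_datasets):
    y = f(x) + rng.normal(0, sigma_eps, size=x.size)
    preds[k] = np.polyval(np.polyfit(x, y, order), x)

bias2 = np.sum((f(x) - preds.mean(axis=0)) ** 2)   # Eq. (7)
var = np.sum(preds.var(axis=0))                    # Eq. (8)
noise = x.size * sigma_eps**2                      # sum_i sigma_eps^2

# Direct estimate of E_out: average squared error against fresh noisy targets.
y_new = f(x) + rng.normal(0, sigma_eps, size=(n_datasets, x.size))
E_out = np.mean(np.sum((y_new - preds) ** 2, axis=1))
print(E_out, bias2 + var + noise)                  # the two should nearly agree
```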

IV. GRADIENT DESCENT AND ITS GENERALIZATIONS

Almost every problem in ML and data science starts with the same ingredients: a dataset X, a model g(θ), which is a function of the parameters θ, and a cost function C(X, g(θ)) that allows us to judge how well the model g(θ) explains the observations X. The model is fit by finding the values of θ that minimize the cost function.

In this section, we discuss one of the most powerful and widely used classes of methods for performing this minimization – gradient descent and its generalizations. The basic idea behind these methods is straightforward: iteratively adjust the parameters θ in the direction where the gradient of the cost function is large and negative. In this way, the training procedure ensures the parameters flow towards a local minimum of the cost function. However, in practice gradient descent is full of surprises, and a series of ingenious tricks have been developed by the optimization and machine learning communities to improve the performance of these algorithms.

The underlying reason why training a machine learn-ing algorithm is difficult is that the cost functions wewish to optimize are usually complicated, rugged, non-convex functions in a high-dimensional space with manylocal minima. To make things even more difficult, wealmost never have access to the true function we wishto minimize: instead, we must estimate this function di-rectly from data. In modern applications, both the sizeof the dataset and the number of parameters we wish tofit is often enormous (millions of parameters and exam-ples). The goal of this chapter is to explain how gradientdescent methods can be used to train machine learningalgorithms even in these difficult settings.

This chapter seeks to both introduce commonly used methods and give intuition for why they work. We also include some practical tips for improving the performance of stochastic gradient descent (Bottou, 2012; LeCun et al., 1998b). To help the reader gain more intuition about gradient descent and its variants, we have developed a Jupyter notebook that allows the reader to visualize how these algorithms perform on two-dimensional surfaces. The reader is encouraged to experiment with the accompanying notebook whenever a new method is introduced (especially to explore how changing hyper-parameters can affect performance). The reader may also wish to consult useful reviews that cover these topics (Ruder, 2016) and this blog http://ruder.io/optimizing-gradient-descent/.

A. Gradient Descent and Newton’s method

We begin by introducing a simple first-order gradient descent method and comparing and contrasting it with another algorithm, Newton's method. Newton's method is intimately related to many algorithms (conjugate gradient, quasi-Newton methods) commonly used in physics for optimization problems. Denote the function we wish to minimize by E(θ).

In the context of machine learning, E(θ) is just the cost function E(θ) = C(X, g(θ)). As we shall see for linear and logistic regression in Secs. VI, VII, this energy function can almost always be written as a sum over n data points,

E(θ) = ∑_{i=1}^n e_i(x_i, θ).  (10)

FIG. 7 Gradient descent exhibits three qualitatively different regimes as a function of the learning rate. Result of gradient descent on the surface z = x² + y² − 1 for learning rates η = 0.1, 0.5, 1.01. Notice that the trajectory converges to the global minimum in multiple steps for small learning rates (η = 0.1). Increasing the learning rate further (η = 0.5) causes the trajectory to oscillate around the global minimum before converging. For even larger learning rates (η = 1.01) the trajectory diverges from the minimum. See corresponding notebook for details.

For example, for linear regression e_i is just the mean square-error for data point i; for logistic regression, it is the cross-entropy. To make an analogy with physical systems, we will often refer to this function as the "energy".

In the simplest gradient descent (GD) algorithm, we update the parameters as follows. Initialize the parameters to some value θ_0 and iteratively update the parameters according to the equation

v_t = η_t ∇_θ E(θ_t),
θ_{t+1} = θ_t − v_t,  (11)

where ∇_θ E(θ) is the gradient of E(θ) w.r.t. θ and we have introduced a learning rate, η_t, that controls how big a step we should take in the direction of the gradient at time step t. It is clear that for a sufficiently small choice of the learning rate η_t this method will converge to a local minimum (in all directions) of the cost function. However, choosing a small η_t comes at a huge computational cost. The smaller η_t, the more steps we have to take to reach the local minimum. In contrast, if η_t is too large, we can overshoot the minimum and the algorithm becomes unstable (it either oscillates or even moves away from the minimum). This is shown in Figure 7. In practice, one usually specifies a "schedule" that decreases η_t at long times. Common schedules include power law and exponential decay in time.
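For concreteness, here is a minimal sketch of the update rule in Eq. (11) with a constant learning rate (our own illustration, not the accompanying notebook's code; the quadratic surface matches the one in Fig. 7, whose gradient is unaffected by the constant shift):

```python
import numpy as np

def gradient_descent(grad, theta0, eta=0.1, n_steps=100):
    """Plain GD, Eq. (11): theta_{t+1} = theta_t - eta * grad(theta_t)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - eta * grad(theta)
    return theta

# Surface z = x^2 + y^2 - 1 from Fig. 7; its gradient is 2*theta.
theta_min = gradient_descent(lambda th: 2 * th, theta0=[4.0, 3.0], eta=0.1)
print(theta_min)  # converges to the global minimum at (0, 0) for eta < 1
```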

To better understand this behavior and highlight some of the shortcomings of GD, it is useful to contrast GD with Newton's method, which is the inspiration for many widely employed optimization methods. In Newton's method, we choose the step v for the parameters in such a way as to minimize a second-order Taylor expansion of the energy function

E(θ + v) ≈ E(θ) + ∇_θ E(θ)·v + ½ vᵀ H(θ) v,

where H(θ) is the Hessian matrix of second derivatives. Differentiating this equation with respect to v and noting that for the optimal value v_opt we expect ∇_θ E(θ + v_opt) = 0 yields the following equation

0 = ∇_θ E(θ) + H(θ) v_opt.  (12)

Rearranging this expression results in the desired update rules for Newton's method

v_t = H⁻¹(θ_t) ∇_θ E(θ_t),  (13)
θ_{t+1} = θ_t − v_t.  (14)

Since we have no guarantee that the Hessian is well conditioned, in almost all applications of Newton's method, one replaces the inverse of the Hessian H⁻¹(θ_t) by some suitably regularized pseudo-inverse such as [H(θ_t) + εI]⁻¹ with ε a small parameter (Battiti, 1992).

For the purposes of machine learning, Newton's method is not practical for two interrelated reasons. First, calculating a Hessian is an extremely expensive numerical computation. Second, even if we employ first-order approximation methods to approximate the Hessian (commonly called quasi-Newton methods), we must store and invert a matrix with n² entries, where n is the number of parameters. For models with millions of parameters such as those commonly employed in the neural network literature, this is close to impossible with present-day computational power. Despite these practical shortcomings, Newton's method gives many important intuitions about how to modify GD algorithms to improve their performance. Notice that, unlike in GD where the learning rate is the same for all parameters, Newton's method automatically "adapts" the learning rate of different parameters depending on the Hessian matrix. Since the Hessian encodes the curvature of the surface we are trying to find the minimum of – more specifically, the singular values of the Hessian are inversely proportional to the squares of the local curvatures of the surface – Newton's method automatically adjusts the step size so that one takes larger steps in flat directions with small curvature and smaller steps in steep directions with large curvature.

FIG. 8 Effect of learning rate on convergence. For a one-dimensional quadratic potential, one can show that there exist four different qualitative behaviors for gradient descent (GD) as a function of the learning rate η, depending on the relationship between η and η_opt = [∂²_θ E(θ)]⁻¹. (a) For η < η_opt, GD converges to the minimum. (b) For η = η_opt, GD converges in a single step. (c) For η_opt < η < 2η_opt, GD oscillates around the minimum and eventually converges. (d) For η > 2η_opt, GD moves away from the minimum. This figure is adapted from (LeCun et al., 1998b).

Our derivation of Newton's method also allows us to develop intuition about the role of the learning rate in GD. Let us first consider the special case of using GD to find the minimum of a quadratic energy function of a single parameter θ (LeCun et al., 1998b). Given the current value of our parameter θ, we can ask what is the optimal choice of the learning rate η_opt, where η_opt is defined as the value of η that allows us to reach the minimum of the quadratic energy function in a single step (see Figure 8). To find η_opt, we expand the energy function to second order around the current value

E(θ + v) = E(θ) + ∂_θ E(θ) v + ½ ∂²_θ E(θ) v².  (15)

Differentiating with respect to v and setting θ_min = θ − v yields

θ_min = θ − [∂²_θ E(θ)]⁻¹ ∂_θ E(θ).  (16)

Comparing with (11) gives

η_opt = [∂²_θ E(θ)]⁻¹.  (17)

One can show that there are four qualitatively different regimes possible (see Fig. 8) (LeCun et al., 1998b). If η < η_opt, then GD will take multiple small steps to reach the bottom of the potential. For η = η_opt, GD reaches the bottom of the potential in a single step. If η_opt < η < 2η_opt, then the GD algorithm will oscillate across both sides of the potential before eventually converging to the minimum. However, when η > 2η_opt, the algorithm actually diverges!

It is straightforward to generalize this to the multidimensional case. The natural multidimensional generalization of the second derivative is the Hessian H(θ). We can always perform a singular value decomposition (i.e. a rotation by an orthogonal matrix for quadratic minima where the Hessian is symmetric, see Sec. VI.B for a brief introduction to SVD) and consider the singular values λ of the Hessian. If we use a single learning rate for all parameters, in analogy with (17), convergence requires that

η < 2/λ_max,  (18)

where λ_max is the largest singular value of the Hessian. If the minimum eigenvalue λ_min differs significantly from the largest value λ_max, then convergence in the λ_min-direction will be extremely slow! One can actually show that the convergence time scales with the condition number κ = λ_max/λ_min (LeCun et al., 1998b).

B. Limitations of the simplest gradient descent algorithm

The last section hints at some of the major shortcomings of the simple GD algorithm described in (11). Before proceeding, we briefly summarize these limitations and discuss general strategies for modifying GD to overcome these deficiencies.

• GD finds local minima of the cost function. Since the GD algorithm is deterministic, if it converges, it will converge to a local minimum of our energy function. Because in ML we are often dealing with extremely rugged landscapes with many local minima, this can lead to poor performance. A similar problem is encountered in physics. To overcome this, physicists often use methods like simulated annealing that introduce a fictitious "temperature" which is eventually taken to zero. The "temperature" term introduces stochasticity in the form of thermal fluctuations that allow the algorithm to thermally tunnel over energy barriers. This suggests that, in the context of ML, we should modify GD to include stochasticity.

• Gradients are computationally expensive to calculate for large datasets. In many cases in statistics and ML, the energy function is a sum of terms, with one term for each data point. For example, in linear regression, E ∝ ∑_{i=1}^n (y_i − wᵀ·x_i)²; for logistic regression, the square error is replaced by the cross entropy, see Secs. VI, VII. Thus, to calculate the gradient we have to sum over all n data points. Doing this at every GD step becomes extremely computationally expensive. An ingenious solution to this, discussed below, is to calculate the gradients using small subsets of the data called "minibatches". This has the added benefit of introducing stochasticity into our algorithm.


• GD is very sensitive to choices of the learning rates. As discussed above, GD is extremely sensitive to the choice of learning rates. If the learning rate is very small, the training process takes an extremely long time. For larger learning rates, GD can diverge and give poor results. Furthermore, depending on what the local landscape looks like, we have to modify the learning rates to ensure convergence. Ideally, we would "adaptively" choose the learning rates to match the landscape.

• GD treats all directions in parameter space uniformly. Another major drawback of GD is that, unlike Newton's method, the learning rate for GD is the same in all directions in parameter space. For this reason, the maximum learning rate is set by the behavior of the steepest direction, and this can significantly slow down training. Ideally, we would like to take large steps in flat directions and small steps in steep directions. Since we are exploring rugged landscapes where curvatures change, this requires us to keep track of not only the gradient but also the second derivatives of the energy function (note that, as discussed above, the ideal scenario would be to calculate the Hessian but this proves to be too computationally expensive).

• GD is sensitive to initial conditions. One consequence of the local nature of GD is that initial conditions matter. Depending on where one starts, one will end up at a different local minimum. Therefore, it is very important to think about how one initializes the training process. This is true for GD as well as more complicated variants of GD introduced below.

• GD can take exponential time to escape saddle points, even with random initialization. As we mentioned, GD is extremely sensitive to the initial condition since it determines the particular local minimum GD would eventually reach. However, even with a good initialization scheme, through the randomness (to be introduced later), GD can still take exponential time to escape saddle points, which are prevalent in high-dimensional spaces, even for non-pathological objective functions (Du et al., 2017). Indeed, there are modified GD methods developed recently to accelerate the escape. The details of these boosted methods are beyond the scope of this review, and we refer avid readers to (Jin et al., 2017) for details.

In the next few subsections, we will introduce variants of GD that address many of these shortcomings. These generalized gradient descent methods form the backbone of much of modern deep learning and neural networks, see Sec. IX. For this reason, the reader is encouraged to really experiment with different methods in landscapes of varying complexity using the accompanying notebook.

C. Stochastic Gradient Descent (SGD) with mini-batches

One of the most widely-applied variants of the gradient descent algorithm is stochastic gradient descent (SGD) (Bottou, 2012; Williams and Hinton, 1986). As the name suggests, unlike ordinary GD, the algorithm is stochastic. Stochasticity is incorporated by approximating the gradient on a subset of the data called a minibatch². The size of the minibatches is almost always much smaller than the total number of data points n, with typical minibatch sizes ranging from ten to a few hundred data points. If there are n points in total, and the mini-batch size is M, there will be n/M minibatches. Let us denote these minibatches by B_k where k = 1, . . . , n/M. Thus, in SGD, at each gradient descent step we approximate the gradient using a single minibatch B_k,

∇_θ E(θ) = ∑_{i=1}^n ∇_θ e_i(x_i, θ) ⟶ ∑_{i∈B_k} ∇_θ e_i(x_i, θ).  (19)

We then cycle over all k = 1, . . . , n/M minibatches one at a time, and use the mini-batch approximation to the gradient to update the parameters θ at every step k. A full iteration over all n data points – in other words using all n/M minibatches – is called an epoch. For notational convenience, we will denote the mini-batch approximation to the gradient by

∇_θ E^{MB}(θ) = ∑_{i∈B_k} ∇_θ e_i(x_i, θ).  (20)

With this notation, we can rewrite the SGD algorithm as

v_t = η_t ∇_θ E^{MB}(θ),
θ_{t+1} = θ_t − v_t.  (21)

Thus, in SGD, we replace the actual gradient over the full data at each gradient descent step by an approximation to the gradient computed using a minibatch. This has two important benefits. First, it introduces stochasticity and decreases the chance that our fitting algorithm gets stuck in isolated local minima. Second, it significantly speeds up the calculation as one does not have to use all n data points to approximate the gradient. Empirical and theoretical work suggests that SGD has additional benefits. Chief among these is that introducing stochasticity is thought to act as a natural regularizer that prevents overfitting in deep, isolated minima (Bishop, 1995a; Keskar et al., 2016).

² Traditionally, SGD was reserved for the case where you train on a single example – in other words minibatches of size 1. However, we will use SGD to mean any approximation to the gradient on a subset of the data.
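A minimal sketch of Eqs. (19)-(21) (our own illustration, not the accompanying notebook's code; the linear-regression cost and all hyperparameters are arbitrary choices):

```python
import numpy as np

def sgd(grad_batch, theta0, X, y, eta=0.01, M=32, n_epochs=10, seed=0):
    """SGD over minibatches B_k, Eq. (21): one parameter update per minibatch."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    n = len(y)
    for _ in range(n_epochs):                 # one epoch = one pass over the data
        idx = rng.permutation(n)              # shuffle before forming minibatches
        for k in range(0, n, M):
            batch = idx[k:k + M]
            theta = theta - eta * grad_batch(theta, X[batch], y[batch])  # Eqs. (19)-(20)
    return theta

# Example: linear regression, e_i = (y_i - theta.x_i)^2, so the minibatch
# gradient is -2 X_B^T (y_B - X_B theta).
grad_batch = lambda th, Xb, yb: -2 * Xb.T @ (yb - Xb @ th)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=200)
print(sgd(grad_batch, np.zeros(3), X, y))  # approaches theta_true
```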


D. Adding Momentum

In practice, SGD is almost always used with a "momentum" or inertia term that serves as a memory of the direction we are moving in parameter space. This is typically implemented as follows

v_t = γ v_{t−1} + η_t ∇_θ E(θ_t),
θ_{t+1} = θ_t − v_t,  (22)

where we have introduced a momentum parameter γ, with 0 ≤ γ ≤ 1, and for brevity we dropped the explicit notation to indicate the gradient is to be taken over a different mini-batch at each step. We call this algorithm gradient descent with momentum (GDM). From these equations, it is clear that v_t is a running average of recently encountered gradients and (1 − γ)⁻¹ sets the characteristic time scale for the memory used in the averaging procedure. Consistent with this, when γ = 0, this just reduces down to ordinary SGD as described in Eq. (21). An equivalent way of writing the updates is

∆θ_{t+1} = γ ∆θ_t − η_t ∇_θ E(θ_t),  (23)

where we have defined ∆θ_t = θ_t − θ_{t−1}. In what should be a familiar scenario to many physicists, momentum-based methods were first introduced in old, largely forgotten (until recently) Soviet papers (Nesterov, 1983; Polyak, 1964).

Before proceeding further, let us try to get more intuition from these equations. It is helpful to consider a simple physical analogy with a particle of mass m moving in a viscous medium with viscous damping coefficient µ and potential E(w) (Qian, 1999). If we denote the particle's position by w, then its motion is described by

m d²w/dt² + µ dw/dt = −∇_w E(w).  (24)

We can discretize this equation in the usual way to get

m (w_{t+∆t} − 2w_t + w_{t−∆t})/(∆t)² + µ (w_{t+∆t} − w_t)/∆t = −∇_w E(w).  (25)

Rearranging this equation, we can rewrite this as

∆w_{t+∆t} = −[(∆t)²/(m + µ∆t)] ∇_w E(w) + [m/(m + µ∆t)] ∆w_t.  (26)

Notice that this equation is identical to Eq. (23) if we identify the position of the particle, w, with the parameters θ. This allows us to identify the momentum parameter and learning rate with the mass of the particle and the viscous damping as:

γ = m/(m + µ∆t),  η = (∆t)²/(m + µ∆t).  (27)

Thus, as the name suggests, the momentum parameter is proportional to the mass of the particle and effectively provides inertia. Furthermore, in the large viscosity/small learning rate limit, our memory time scales as (1 − γ)⁻¹ ≈ m/(µ∆t).

Why is momentum useful? SGD momentum helps the gradient descent algorithm gain speed in directions with persistent but small gradients even in the presence of stochasticity, while suppressing oscillations in high-curvature directions. This becomes especially important in situations where the landscape is shallow and flat in some directions and narrow and steep in others. It has been argued that first-order methods (with appropriate initial conditions) can perform comparably to more expensive second-order methods, especially in the context of complex deep learning models (Sutskever et al., 2013). Empirical studies suggest that the benefits of including momentum are especially pronounced in complex models in the initial "transient phase" of training, rather than during a subsequent fine-tuning of a coarse minimum. The reason for this is that, in this transient phase, correlations in the gradient persist across many gradient descent steps, accentuating the role of inertia and memory.

These beneficial properties of momentum can sometimes become even more pronounced by using a slight modification of the classical momentum algorithm called Nesterov Accelerated Gradient (NAG) (Nesterov, 1983; Sutskever et al., 2013). In the NAG algorithm, rather than calculating the gradient at the current parameters, ∇_θ E(θ_t), one calculates the gradient at the expected value of the parameters given our current momentum, ∇_θ E(θ_t + γv_{t−1}). This yields the NAG update rule

v_t = γ v_{t−1} + η_t ∇_θ E(θ_t + γv_{t−1}),
θ_{t+1} = θ_t − v_t.  (28)

One of the major advantages of NAG is that it allows for the use of a larger learning rate than GDM for the same choice of γ.
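A minimal sketch of the GDM update, Eq. (22), with an optional Nesterov look-ahead (our own illustration; note that with the sign convention θ_{t+1} = θ_t − v_t used here, the anticipated next position is θ_t − γv_{t−1}, which is where this common implementation evaluates the gradient; the test function and hyperparameters are arbitrary):

```python
import numpy as np

def gdm(grad, theta0, eta=0.01, gamma=0.9, n_steps=200, nesterov=False):
    """Gradient descent with momentum, Eq. (22); Nesterov look-ahead if requested."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(n_steps):
        # With theta <- theta - v, the look-ahead point is theta - gamma*v.
        g = grad(theta - gamma * v) if nesterov else grad(theta)
        v = gamma * v + eta * g
        theta = theta - v
    return theta

quad_grad = lambda th: np.array([2.0, 40.0]) * th  # anisotropic quadratic bowl
print(gdm(quad_grad, [1.0, 1.0]))                  # GDM
print(gdm(quad_grad, [1.0, 1.0], nesterov=True))   # NAG: converges faster here
```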

E. Methods that use the second moment of the gradient

In stochastic gradient descent, with and without momentum, we still have to specify a "schedule" for tuning the learning rate η_t as a function of time. As discussed in the context of Newton's method, this presents a number of dilemmas. The learning rate is limited by the steepest direction, which can change depending on the current position in the landscape. To circumvent this problem, ideally our algorithm would keep track of curvature and take large steps in shallow, flat directions and small steps in steep, narrow directions. Second-order methods accomplish this by calculating or approximating the Hessian and normalizing the learning rate by the curvature. However, this is very computationally expensive for models with an extremely large number of parameters. Ideally, we would like to be able to adaptively change the step size to match the landscape without paying the steep computational price of calculating or approximating Hessians.

Recently, a number of methods have been introduced that accomplish this by tracking not only the gradient, but also the second moment of the gradient. These methods include AdaGrad (Duchi et al., 2011), AdaDelta (Zeiler, 2012), RMSprop (Tieleman and Hinton, 2012), and ADAM (Kingma and Ba, 2014). Here, we discuss the last two as representatives of this class of algorithms.

In RMSprop, in addition to keeping a running average of the first moment of the gradient, we also keep track of the second moment denoted by s_t = E[g_t²]. The update rule for RMSprop is given by

g_t = ∇_θ E(θ),  (29)
s_t = β s_{t−1} + (1 − β) g_t²,
θ_{t+1} = θ_t − η_t g_t/√(s_t + ε),

where β controls the averaging time of the second moment and is typically taken to be about β = 0.9, η_t is a learning rate typically chosen to be 10⁻³, and ε ∼ 10⁻⁸ is a small regularization constant to prevent divergences. Multiplication and division by vectors is understood as an element-wise operation. It is clear from this formula that the learning rate is reduced in directions where the gradient is consistently large. This greatly speeds up the convergence by allowing us to use a larger learning rate for flat directions.
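A minimal sketch of the RMSprop update, Eq. (29) (our own illustration; the defaults follow the typical values quoted above):

```python
import numpy as np

def rmsprop(grad, theta0, eta=1e-3, beta=0.9, eps=1e-8, n_steps=5000):
    """RMSprop, Eq. (29): normalize each step by a running average of g^2."""
    theta = np.asarray(theta0, dtype=float)
    s = np.zeros_like(theta)
    for _ in range(n_steps):
        g = grad(theta)
        s = beta * s + (1 - beta) * g**2      # second-moment running average
        theta = theta - eta * g / np.sqrt(s + eps)
    return theta

quad_grad = lambda th: np.array([2.0, 40.0]) * th
print(rmsprop(quad_grad, [1.0, 1.0]))  # steep and flat directions shrink at similar rates
```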

A related algorithm is the ADAM optimizer. In ADAM, we keep a running average of both the first and second moment of the gradient and use this information to adaptively change the learning rate for different parameters. In addition to keeping a running average of the first and second moments of the gradient (i.e. m_t = E[g_t] and s_t = E[g_t²], respectively), ADAM performs an additional bias correction to account for the fact that we are estimating the first two moments of the gradient using a running average (denoted by the hats in the update rule below). The update rule for ADAM is given by (where multiplication and division are once again understood to be element-wise operations)

g_t = ∇_θ E(θ),  (30)
m_t = β₁ m_{t−1} + (1 − β₁) g_t,
s_t = β₂ s_{t−1} + (1 − β₂) g_t²,
m̂_t = m_t/(1 − (β₁)^t),
ŝ_t = s_t/(1 − (β₂)^t),
θ_{t+1} = θ_t − η_t m̂_t/(√ŝ_t + ε),  (31)

where β₁ and β₂ set the memory lifetime of the first and second moment and are typically taken to be 0.9 and 0.99 respectively, and (β_j)^t denotes β_j to the power t. The parameters η and ε have the same role as in RMSprop.

Like in RMSprop, the effective step size of a parameter depends on the magnitude of its gradient squared. To understand this better, let us rewrite this expression in terms of the variance σ_t² = ŝ_t − (m̂_t)². Consider a single parameter θ_t. The update rule for this parameter is given by

∆θ_{t+1} = −η_t m̂_t/(√(σ_t² + m̂_t²) + ε).  (32)

We now examine different limiting cases of this expression. Assume that our gradient estimates are consistent so that the variance is small. In this case our update rule tends to ∆θ_{t+1} → −η_t (here we have assumed that m̂_t ≫ ε). This is equivalent to cutting off large persistent gradients at 1 and limiting the maximum step size in steep directions. On the other hand, imagine that the gradient is widely fluctuating between gradient descent steps. In this case σ_t² ≫ m̂_t² so that our update becomes ∆θ_{t+1} → −η_t m̂_t/σ_t. In other words, we adapt our learning rate so that it is proportional to the signal-to-noise ratio (i.e. the mean in units of the standard deviation). From a physics standpoint, this is extremely desirable: the standard deviation serves as a natural adaptive scale for deciding whether a gradient is large or small. Thus, ADAM has the beneficial effects of (i) adapting our step size so that we cut off large gradient directions (and hence prevent oscillations and divergences), and (ii) measuring gradients in terms of a natural length scale, the standard deviation σ_t. The discussion above also explains empirical observations showing that the performance of both ADAM and RMSprop is drastically reduced if the square root is omitted in the update rule. It is also worth noting that recent studies have shown adaptive methods like RMSprop, ADAM, and AdaGrad to generalize worse than SGD in classification tasks, though they achieve smaller training error. Such discussion is beyond the scope of this review so we refer readers to (Wilson et al., 2017) for more details.

F. Comparison of various methods

To better understand these methods, it is helpful to visualize the performance of the five methods discussed above – gradient descent (GD), gradient descent with momentum (GDM), NAG, ADAM, and RMSprop. To do so, we will use Beale's function:

f(x, y) = (1.5 − x + xy)² + (2.25 − x + xy²)² + (2.625 − x + xy³)².  (33)

FIG. 9 Comparison of GD and its generalizations for Beale's function. Trajectories from gradient descent (GD; black line), gradient descent with momentum (GDM; magenta line), NAG (cyan-dashed line), RMSprop (blue dash-dot line), and ADAM (red line) for N_steps = 10⁴. The learning rate for GD, GDM, NAG is η = 10⁻⁶ and η = 10⁻³ for ADAM and RMSprop. β = 0.9 for RMSprop, β₁ = 0.9 and β₂ = 0.99 for ADAM, and ε = 10⁻⁸ for both methods. Please see corresponding notebook for details.

This function has a global minimum at (x, y) = (3, 0.5) and an interesting structure that can be seen in Fig. 9. The figure shows the results of using all five methods

for N_steps = 10⁴ steps for three different initial conditions. In the figure, the learning rates for GD, GDM, and NAG are set to η = 10⁻⁶ whereas RMSprop and ADAM have a learning rate of η = 10⁻³. The learning rates for RMSprop and ADAM can be set significantly higher than for the other methods due to their adaptive step sizes. For this reason, ADAM and RMSprop tend to be much quicker at navigating the landscape than simple momentum-based methods (see Fig. 9). Notice that in some cases (e.g. the initial condition (−1, 4)), the trajectories do not find the global minimum but instead follow the deep, narrow ravine that occurs along y = 1. This kind of landscape structure is generic in high-dimensional spaces where saddle points proliferate. Once again, the adaptive step size and momentum of ADAM and RMSprop allow these methods to traverse the landscape faster than the simpler first-order methods. The reader is encouraged to consult the corresponding Jupyter notebook and experiment with changing initial conditions, the cost function surface being minimized, and hyper-parameters to gain more intuition about all these methods.
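A self-contained sketch of plain GD on Beale's function, Eq. (33), using the small learning rate quoted in the caption of Fig. 9 (our own illustration, not the notebook's code; the initial condition is an arbitrary choice, and with η = 10⁻⁶ plain GD barely moves over 10⁴ steps, which is precisely the figure's point):

```python
import numpy as np

def beale_grad(p):
    """Analytic gradient of Beale's function, Eq. (33)."""
    x, y = p
    terms = [(1.5, 1), (2.25, 2), (2.625, 3)]
    gx = sum(2 * (c - x + x * y**k) * (y**k - 1) for c, k in terms)
    gy = sum(2 * (c - x + x * y**k) * (k * x * y**(k - 1)) for c, k in terms)
    return np.array([gx, gy])

# Plain GD with the learning rate from the caption of Fig. 9.
theta = np.array([1.0, 1.0])   # hypothetical initial condition
for _ in range(10**4):
    theta = theta - 1e-6 * beale_grad(theta)
print(theta)                   # compare with the global minimum at (3, 0.5)
```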

G. Gradient descent in practice: practical tips

We conclude this chapter by compiling some practical tips from experts for getting the best performance from gradient descent based algorithms, especially in the context of deep neural networks discussed later in the review, see Secs. IX and XVI.B. This section draws heavily on best practices laid out in (Bottou, 2012; LeCun et al., 1998b; Tieleman and Hinton, 2012).

• Randomize the data when making mini-batches. It is always important to randomly shuffle the data when forming mini-batches. Otherwise, the gradient descent method can fit spurious correlations resulting from the order in which data is presented.

• Transform your inputs. As we discussed above, learning becomes difficult when our landscape has a mixture of steep and flat directions. One simple trick for minimizing these situations is to standardize the data by subtracting the mean and normalizing the variance of input variables. Whenever possible, also decorrelate the inputs (see the sketch after this list). To understand why this is helpful, consider the case of linear regression. It is easy to show that for the squared error cost function, the Hessian of the energy matrix is just the correlation matrix between the inputs. Thus, by standardizing the inputs, we are ensuring that the landscape looks homogeneous in all directions in parameter space. Since most deep networks can be viewed as linear transformations followed by a non-linearity at each layer, we expect this intuition to hold beyond the linear case.

• Monitor the out-of-sample performance. Always monitor the performance of your model on a validation set (a small portion of the training data that is held out of the training process to serve as a proxy for the test set – see Sec. XI for more on validation sets). If the validation error starts increasing, then the model is beginning to overfit. Terminate the learning process. This early stopping significantly improves performance in many settings.

• Adaptive optimization methods do not always have good generalization. As we mentioned, recent studies have shown that adaptive methods such as ADAM, RMSprop, and AdaGrad tend to have poor generalization compared to SGD or SGD with momentum, particularly in the high-dimensional limit (i.e. the number of parameters exceeds the number of data points) (Wilson et al., 2017). Although it is not clear at this stage why sophisticated methods, such as ADAM, RMSprop, and AdaGrad, perform so well in training deep neural networks such as generative adversarial networks (GANs) (Goodfellow et al., 2014) [see Sec. XVII], simpler procedures like properly-tuned plain SGD may work equally well or better in some applications.
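A minimal sketch of the standardization step from the second tip above (our own illustration; full decorrelation would additionally rotate the inputs into the eigenbasis of their covariance matrix):

```python
import numpy as np

def standardize(X):
    """Subtract the mean and normalize the variance of each input feature."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=[1.0, 10.0], size=(100, 2))  # very different scales
Xs = standardize(X)
print(Xs.mean(axis=0), Xs.std(axis=0))  # approximately [0, 0] and [1, 1]
```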

V. OVERVIEW OF BAYESIAN INFERENCE

Statistical modeling usually revolves around estimation or prediction (Jaynes, 1996). Bayesian methods are based on the fairly simple premise that probability can be used as a mathematical framework for describing uncertainty. This is not that different in spirit from the main idea of statistical mechanics in physics, where we use probability to describe the behavior of large systems where we cannot know the positions and momenta of all the particles even if the system itself is fully deterministic (at least classically). In practice, Bayesian inference provides a set of principles and procedures for learning from data and for describing uncertainty. In this section, we give a gentle introduction to Bayesian inference, with special emphasis on its logic (i.e. Bayesian reasoning) and provide a connection to ML discussed in Secs. II and III. For a technical account of Bayesian inference in general, we refer readers to (Barber, 2012; Gelman et al., 2014).

A. Bayes Rule

To solve a problem using Bayesian methods, we have to specify two functions: the likelihood function p(X|θ), which describes the probability of observing a dataset X for a given value of the unknown parameters θ, and the prior distribution p(θ), which describes any knowledge we have about the parameters before we collect the data. Note that the likelihood should be considered as a function of the parameters θ with the data X held fixed. The prior distribution and the likelihood function are used to compute the posterior distribution p(θ|X) via Bayes' rule:

p(θ|X) = p(X|θ) p(θ) / ∫ dθ′ p(X|θ′) p(θ′).  (34)

The posterior distribution describes our knowledge about the unknown parameter θ after observing the data X. In many cases, it will not be possible to analytically compute the normalizing constant in the denominator of the posterior distribution, i.e. p(X) = ∫ dθ p(X|θ) p(θ), and Markov Chain Monte Carlo (MCMC) methods are needed to draw random samples from p(θ|X).

The likelihood function p(X|θ) is a common feature of both classical statistics and Bayesian inference, and is determined by the model and the measurement noise. Many common statistical procedures such as least-squares fitting can be cast as Maximum Likelihood Estimation (MLE). In MLE, one chooses the parameters θ that maximize the likelihood (or equivalently the log-likelihood, since log is a monotonic function) of the observed data:

θ̂ = arg max_θ log p(X|θ).  (35)

In other words, in MLE we choose the parameters that maximize the probability of seeing the observed data given our generative model. MLE is an important concept in both frequentist and Bayesian statistics.

The prior distribution, by contrast, is uniquely Bayesian. There are two general classes of priors: if we do not have any specialized knowledge about θ before we look at the data, then we would like to select an uninformative prior that reflects our ignorance; otherwise we should select an informative prior that accurately reflects the knowledge we have about θ. This review will focus on informative priors that are commonly used for ML applications. However, there is a large literature on uninformative priors, including reparameterization invariant priors, that would be of interest to physicists and we refer the interested reader to (Berger and Bernardo, 1992; Gelman et al., 2014; Jaynes, 1996; Jeffreys, 1946; Mattingly et al., 2018).

Using an informative prior tends to decrease the variance of the posterior distribution while, potentially, increasing its bias. This is beneficial if the decrease in variance is larger than the increase in bias. In high-dimensional problems, it is reasonable to assume that many of the parameters will not be strongly relevant. Therefore, many of the parameters of the model will be zero or close to zero. We can express this belief using two commonly used priors: the Gaussian prior p(θ|λ) = ∏_j √(λ/2π) e^{−λθ_j²} is used to express the assumption that many of the parameters will be small, and the Laplace prior p(θ|λ) = ∏_j (λ/2) e^{−λ|θ_j|} is used to express the assumption that many of the parameters will be zero. We'll come back to this point later in Sec. VI.F.

B. Bayesian Decisions

The above section presents the tools for computing the posterior distribution p(θ|X), which uses probability as a framework for expressing our knowledge about the parameters θ. In most cases, however, we need to summarize our knowledge and pick a single "best" value for the parameters. In principle, the specific value of the parameters should be chosen to maximize a utility function. In practice, however, we usually use one of two choices: the posterior mean 〈θ〉 = ∫ dθ θ p(θ|X), or the posterior mode θ_MAP = arg max_θ p(θ|X). Often, 〈θ〉 is called the Bayes estimate and θ_MAP is called the maximum-a-posteriori or MAP estimate. While the Bayes estimate minimizes the mean-squared error, the MAP estimate is often used instead because it is easier to compute.

C. Hyperparameters

The Gaussian and Laplace prior distributions, used to express the assumption that many of the model parameters will be small or zero, both have an extra parameter λ. This hyperparameter or nuisance variable has to be chosen somehow. One standard Bayesian approach is to define another prior distribution for λ – usually using an uninformative prior – and to average the posterior distribution over all choices of λ. This is called a hierarchical prior. Computing averages, however, often requires long Markov Chain Monte Carlo simulations that are computationally intensive. Therefore, it is simpler if we can find a good value of λ using an optimization procedure instead. We will discuss how this is done in practice when discussing linear regression in Sec. VI.

VI. LINEAR REGRESSION

In Section II, we performed our first numerical ML experiments by fitting datasets generated by polynomials in the presence of different levels of additive noise. We used the fitted parameters to make predictions on 'unseen' observations, allowing us to gauge the performance of our model on new data. These experiments highlighted the fundamental tension common to all ML models between how well we fit the training dataset and predictions on new data. The optimal choice of predictor depended on, among many other things, the functions used to fit the data and the underlying noise level. In Section III, we formalized this by introducing the notion of model complexity and the bias-variance decomposition, and discussed the statistical meaning of learning. In this section, we take a closer look at these ideas in the simple setting of linear regression.

As in Section II, fitting a given set of samples (y_i, x_i) means relating the independent variables x_i to their responses y_i. For example, suppose we want to see how the voltage V across two sides of a metal slab changes in response to the applied electric current I. Normally we would first make a bunch of measurements labeled by i and plot them on a two-dimensional scatterplot, (V_i, I_i). The next step is to assume, either from an oracle or from theoretical reasoning, some models that might explain the measurements, and to measure their performance. Mathematically, this amounts to finding some function f such that V_i = f(I_i; w), where w is some parameter (e.g. the electrical resistance R of the metal slab in the case of Ohm's law). We then try to minimize the errors made in explaining the given set of measurements based on our model f by tuning the parameter w. To do so, we need to first define the error function (formally called the loss function) that characterizes the deviation of our prediction from the actual response.

Before formulating the problem, let us set up the notation. Suppose we are given a dataset with n samples D = {(y_i, x^{(i)})}_{i=1}^n, where x^{(i)} is the i-th observation vector while y_i is its corresponding (scalar) response. We assume that every sample has p features, namely, x^{(i)} ∈ R^p. Let f be the true function/model that generated these samples via y_i = f(x^{(i)}; w_true) + ε_i, where w_true ∈ R^p is a parameter vector and ε_i is some i.i.d. white noise with zero mean and finite variance. Conventionally, we cast all samples into an n × p matrix, X ∈ R^{n×p}, called the design matrix, with the rows X_{i,:} = x^{(i)} ∈ R^p, i = 1, · · · , n, being observations and the columns X_{:,j} ∈ R^n, j = 1, · · · , p, being measured features. Bear in mind that this function f is never known to us explicitly, though in practice we usually presume its functional form. For example, in linear regression, we assume y_i = f(x^{(i)}; w_true) + ε_i = w_trueᵀ x^{(i)} + ε_i for some unknown but fixed w_true ∈ R^p.

We want to find a function g with parameters w fit to the data D that can best approximate f. When this is done, meaning we have found a ŵ such that g(x; ŵ) yields our best estimate of f, we can use this g to make predictions about the response y_0 for a new data point x_0, as we did in Section II.

It will be helpful for our discussion of linear regression to define one last piece of notation. For any real number p ≥ 1, we define the L_p norm of a vector x = (x_1, · · · , x_d) ∈ R^d to be

||x||_p = (|x_1|^p + · · · + |x_d|^p)^{1/p}.  (36)

A. Least-square regression

Ordinary least squares linear regression (OLS) is defined as the minimization of the L_2 norm of the difference between the response y_i and the predictor g(x^{(i)}; w) = wᵀx^{(i)}:

min_{w∈R^p} ||Xw − y||₂² = min_{w∈R^p} ∑_{i=1}^n (wᵀx^{(i)} − y_i)².  (37)

In other words, we are looking to find the w which minimizes the L_2 error. Geometrically speaking, the predictor function g(x^{(i)}; w) = wᵀx^{(i)} defines a hyperplane in R^p. Minimizing the least squares error is therefore equivalent to minimizing the sum of all projections (i.e. residuals) of all points x^{(i)} onto this hyperplane (see Fig. 10). Formally, we denote the solution to this problem as w_LS:

w_LS = arg min_{w∈R^p} ||Xw − y||₂²,  (38)

which, after straightforward differentiation, leads to

w_LS = (XᵀX)⁻¹ Xᵀy.  (39)

Note that we have assumed that XᵀX is invertible, which is often the case when n ≫ p. Formally speaking, if rank(X) = p, namely, the predictors X_{:,1}, . . . , X_{:,p} (i.e. columns of X) are linearly independent, then w_LS is unique. In the case of rank(X) < p, which happens when p > n, XᵀX is singular, implying there are infinitely many solutions to the least squares problem, Eq. (38). In this case, one can easily show that if w_0 is a solution, w_0 + η is also a solution for any η which satisfies Xη = 0 (i.e. η ∈ null(X)). Having determined the least squares solution, we can calculate ŷ, the best fit of our data X, as ŷ = X w_LS = P_X y, where P_X = X(XᵀX)⁻¹Xᵀ, c.f. Eq. (37). Geometrically, P_X is the projection matrix which acts on y and projects it onto the column space of X, which is spanned by the predictors X_{:,1}, · · · , X_{:,p} (see FIG. 11). Notice that we found the optimal solution w_LS in one shot, without doing any sort of iterative optimization like that discussed in Section IV.

FIG. 10 Geometric interpretation of least squares regression. The regression function g defines a hyperplane in R^p (green solid line, here we have p = 2) while the residual of data point x^{(i)} (hollow circles) is its projection onto this hyperplane (bar-ended dashed line).

In Section III we explained that the difference between learning and fitting lies in the prediction on "unseen" data. It is therefore necessary to examine the out-of-sample error. For a more refined argument on the role of out-of-sample errors in linear regression, we encourage the reader to do the exercises in the corresponding Jupyter notebooks. The upshot is, following our definition of E_in and E_out in Section III, the average in-sample and out-of-sample error can be shown to be

E_in = σ²(1 − p/n),  (40)
E_out = σ²(1 + p/n),  (41)

provided we obtain the least squares solution w_LS from i.i.d. samples X and y generated through y = X w_true + ε.³ Therefore, we can calculate the average generalization error explicitly:

|E_in − E_out| = 2σ² p/n.  (42)

This imparts an important message: if we have p ≫ n (i.e. high-dimensional data), the generalization error is extremely large, meaning the model is not learning. Even when we have p ≈ n, we might still not learn well due to the intrinsic noise σ². One way to ameliorate this is, as we shall see in the following few sections, to use regularization. We will mainly focus on two forms of regularization: the first one employs an L_2 penalty and is called Ridge regression, while the second uses an L_1 penalty and is called LASSO.

³ This requires that ε is a noise vector whose elements are i.i.d. of zero mean and variance σ², and is independent of the samples X.

FIG. 11 The projection matrix P_X projects the response vector y onto the column space spanned by the columns of X, span(X_{:,1}, · · · , X_{:,p}) (purple area), thus forming a fitted vector ŷ. The residuals in Eq. (37) are illustrated by the red vector y − ŷ.

B. Ridge-Regression

In this section, we study the effect of adding to the least squares loss function a regularizer defined as the L_2 norm of the parameter vector we wish to optimize over. In other words, we want to solve the following penalized regression problem called Ridge regression:

w_Ridge(λ) = arg min_{w∈R^p} (||Xw − y||₂² + λ||w||₂²).  (43)

This problem is equivalent to the following constrained optimization problem

w_Ridge(t) = arg min_{w∈R^p: ||w||₂²≤t} ||Xw − y||₂².  (44)

This means that for any t ≥ 0 and solution w_Ridge in Eq. (44), there exists a value λ ≥ 0 such that w_Ridge solves Eq. (43), and vice versa⁴. With this equivalence, it is obvious that by adding a regularization term, λ||w||₂², to our least squares loss function, we are effectively constraining the magnitude of the parameter vector learned from the data.

To see this, let us solve Eq. (43) explicitly. Differentiating w.r.t. w, we obtain

w_Ridge(λ) = (XᵀX + λI_{p×p})⁻¹ Xᵀy.  (45)

⁴ Note that the equivalence between the penalized and the constrained (regularized) form of least squares optimization does not always hold. It holds for Ridge and LASSO (introduced later), but not for best subset selection, which is defined by choosing an L_0 norm: λ||w||_0. In this case, for every λ > 0 and any w_BS that solves the penalized form of best subset selection, there is a value t ≥ 0 such that w_BS also solves the constrained form of best subset selection, but the converse is not true.


In fact, when X is orthogonal, one can simplify this expression further:

w_Ridge(λ) = w_LS/(1 + λ), for orthogonal X,  (46)

where w_LS is the least squares solution given by Eq. (39). This implies that the ridge estimate is merely the least squares estimate scaled by a factor (1 + λ)⁻¹.

Can we derive a similar relation between the fitted vector ŷ = X w_Ridge and the prediction made by least squares linear regression? To answer this, let us do a singular value decomposition (SVD) on X. Recall that the SVD of an n × p matrix X has the form

X = U D Vᵀ,  (47)

where U ∈ R^{n×p} and V ∈ R^{p×p} are orthogonal matrices such that the columns of U span the column space of X while the columns of V span the row space of X. D ∈ R^{p×p} = diag(d_1, d_2, · · · , d_p) is a diagonal matrix with entries d_1 ≥ d_2 ≥ · · · ≥ d_p ≥ 0 called the singular values of X. Note that X is singular if there is at least one d_j = 0. By writing X in terms of its SVD, one can recast the Ridge estimator Eq. (45) as

w_Ridge = V (D² + λI)⁻¹ D Uᵀy,  (48)

which implies that the Ridge predictor satisfies

ŷ_Ridge = X w_Ridge
 = U D (D² + λI)⁻¹ D Uᵀy
 = ∑_{j=1}^p U_{:,j} [d_j²/(d_j² + λ)] U_{:,j}ᵀ y  (49)
 ≤ U Uᵀy  (50)
 = X w_LS ≡ ŷ_LS,  (51)

where U_{:,j} are the columns of U. Note that in the inequality step we assumed λ ≥ 0 and used the SVD to simplify Eq. (39). By comparing Eq. (49) with Eq. (51), it is clear that in order to compute the fitted vector ŷ, both Ridge and least squares linear regression have to project y onto the column space of X. The only difference is that Ridge regression further shrinks each basis component j by a factor d_j²/(d_j² + λ). We encourage the reader to do the exercises in Notebook 3 to develop further intuition about how Ridge regression works.

C. LASSO and Sparse Regression

In this section, we study the effects of adding an L_1 regularization penalty, conventionally called LASSO, which stands for "least absolute shrinkage and selection operator". Concretely, LASSO in the penalized form is defined by the following regularized regression problem:

w_LASSO(λ) = arg min_{w∈R^p} ||Xw − y||₂² + λ||w||₁.  (52)

As in Ridge regression, there is another formulation for LASSO based on constrained optimization, namely,

w_LASSO(t) = arg min_{w∈R^p: ||w||₁≤t} ||Xw − y||₂².  (53)

The equivalence interpretation is the same as in Ridge regression, namely, for any t ≥ 0 and solution w_LASSO in Eq. (53), there is a value λ ≥ 0 such that w_LASSO solves Eq. (52), and vice versa. However, to get the analytic solution of LASSO, we cannot simply take the gradient of Eq. (52) with respect to w, since the L_1-regularizer is not everywhere differentiable, in particular at any point where w_j = 0 (see Fig. 13). Nonetheless, LASSO is a convex problem. Therefore, we can invoke the so-called "subgradient optimality condition" (Boyd and Vandenberghe, 2004; Rockafellar, 2015) in optimization theory to obtain the solution. To keep the notation simple, we only show the solution assuming X is orthogonal:

w_j^LASSO(λ) = sign(w_j^LS)(|w_j^LS| − λ)₊, for orthogonal X,  (54)

where (x)₊ denotes the positive part of x and w_j^LS is the j-th component of the least squares solution. In Fig. 12, we compare the Ridge solution Eq. (46) with the LASSO solution Eq. (54). As we mentioned above, the Ridge solution is the least squares solution scaled by a factor of (1 + λ)⁻¹. Here LASSO does something conventionally called "soft-thresholding" (see Fig. 12). We encourage interested readers to work out the exercises in Notebook 3 to explore what this function does.
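A minimal sketch of the soft-thresholding operator in Eq. (54), contrasted with the uniform Ridge shrinkage of Eq. (46) (our own illustration):

```python
import numpy as np

def soft_threshold(w_ls, lam):
    """LASSO solution for orthogonal X, Eq. (54): shrink toward zero, clip at zero."""
    return np.sign(w_ls) * np.maximum(np.abs(w_ls) - lam, 0.0)

w_ls = np.array([3.0, -0.5, 0.2, -2.0])
print(soft_threshold(w_ls, lam=1.0))   # [ 2.  -0.   0.  -1. ] -- exact zeros appear
# Ridge for orthogonal X, Eq. (46): uniform shrinkage, no exact zeros.
print(w_ls / (1.0 + 1.0))              # [ 1.5  -0.25  0.1  -1. ]
```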

FIG. 12 [Adapted from (Friedman et al., 2001)] Comparing LASSO and Ridge regression. The black 45 degree line is the unconstrained estimate for reference. The estimators are shown by red dashed lines. For LASSO, this corresponds to the soft-thresholding function Eq. (54), while for Ridge regression the solution is given by Eq. (46).

How different are the solutions found using LASSO and Ridge regression? In general, LASSO tends to give sparse solutions, meaning many components of w_LASSO are zero. An intuitive justification for this result is provided in Fig. 13. In short, to solve a constrained optimization problem with a fixed regularization strength t ≥ 0, for example, Eq. (44) and Eq. (53), one first carves out the "feasible region" specified by the regularizer in the w_1, · · · , w_d space. This means that a solution w_0 is legitimate only if it falls in this region. Then one proceeds by plotting the contours of the least squares regressors in an increasing manner until a contour touches the feasible region. The point where this occurs is the solution to our optimization problem (see Fig. 13 for illustration). Loosely speaking, since the L_1 regularizer of LASSO has sharp protrusions (i.e. vertices) along the axes, and because the regressor contours are in the shape of ovals (it is quadratic in w), their intersection tends to occur at a vertex of the feasibility region, implying the solution vector will be sparse.

FIG. 13 [Adapted from (Friedman et al., 2001)] Illustration of LASSO (left) and Ridge regression (right). The blue concentric ovals are the contours of the regression function while the red shaded regions represent the constraint functions: (left) |w_1| + |w_2| ≤ t and (right) w_1² + w_2² ≤ t. Intuitively, since the constraint function of LASSO has more protrusions, the ovals tend to intersect the constraint at a vertex, as shown on the left. Since the vertices correspond to parameter vectors w with only one non-vanishing component, LASSO tends to give sparse solutions.

In Notebook 3, we analyze a Diabetes dataset using both LASSO and Ridge regression to predict the diabetes outcome one year forward (Efron et al., 2004). In Figs. 14, 15, we show the performance of both methods and the solutions w_LASSO(λ), w_Ridge(λ) explicitly. More details of this dataset and our regression implementation can be found in Notebook 3.

D. Using Linear Regression to Learn the Ising Hamiltonian

To gain deeper intuition about what kind of physics problems linear regression allows us to tackle, consider the following problem of learning the Hamiltonian for the Ising model. Imagine you are given an ensemble of random spin configurations, and assigned to each state its energy, generated from the 1D Ising model:

H = −J ∑_{j=1}^L S_j S_{j+1},  (55)

where J is the nearest-neighbor spin interaction, and S_j ∈ {±1} is a spin variable. Let's assume the data was generated with J = 1. You are handed the data set D = ({S_j}_{j=1}^L, E_j) without knowledge of what the numbers E_j mean, and the configuration {S_j}_{j=1}^L can be interpreted in many ways: the outcome of coin tosses, black-and-white pixels of an image, the binary representation of integers, etc. Your goal is to learn a model that predicts E_j from the spin configurations.

FIG. 14 Performance of LASSO and Ridge regression on the diabetes dataset measured by the R² coefficient of determination. The best possible performance is R² = 1. See Notebook 3.

FIG. 15 The regularization parameter λ affects the weights (features) we learn in both Ridge regression (left) and LASSO regression (right) on the Diabetes dataset. Curves with different colors correspond to different w_i's (features). Notice that LASSO, unlike Ridge, sets feature weights to zero, leading to sparsity. See Notebook 3.

Without any prior knowledge about the origin of the data set, physics intuition may suggest to look for a spin model with pairwise interactions between every pair of variables. That is, we choose the following model class:

H_model[S^i] = −∑_{j=1}^L ∑_{k=1}^L J_{j,k} S_j^i S_k^i,  (56)

The goal is to determine the interaction matrix J_{j,k} by applying linear regression on the data set D. This is a well-defined problem, since the unknown J_{j,k} enters linearly into the definition of the Hamiltonian. To this end, we cast the above ansatz into the more familiar linear-regression form:

H_model[S^i] = X^i · J.  (57)

The vectors X^i represent all two-body interactions {S_j^i S_k^i}_{j,k=1}^L, and the index i runs over the samples in the dataset. To make the analogy complete, we can also represent the dot product by a single index p = {j, k}, i.e. X^i · J = X_p^i J_p. Note that the regression model does not include the minus sign. In the following, we apply ordinary least squares, Ridge, and LASSO regression to the problem, and compare their performance.
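A minimal sketch of how such a dataset and the corresponding design matrix of two-body terms can be constructed (our own illustration; the accompanying Notebook 4 may implement this differently):

```python
import numpy as np

rng = np.random.default_rng(0)
L, n_samples = 40, 1000

# Random spin configurations S_j in {-1, +1} and their 1D Ising energies,
# Eq. (55) with J = 1 and periodic boundary conditions.
states = rng.choice([-1, 1], size=(n_samples, L))
energies = -np.sum(states * np.roll(states, -1, axis=1), axis=1)

# Design matrix of all two-body terms S_j S_k, flattened to p = {j, k}, Eq. (57).
X = np.einsum('ij,ik->ijk', states, states).reshape(n_samples, L * L)

# Least-squares fit of the L x L interaction matrix J_{j,k} (no minus sign here).
J = np.linalg.lstsq(X, energies, rcond=None)[0].reshape(L, L)
print(J[0, 1], J[1, 0])  # nearest-neighbor couplings, cf. FIG. 17
```

Since the columns for (j, k) and (k, j) are identical, the minimum-norm least-squares solution splits the coupling symmetrically, which is the 'gauge' issue discussed below.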

FIG. 16 Performance of OLS, Ridge and LASSO regression on the Ising model as measured by the R² coefficient of determination. Optimal performance is R² = 1. See Notebook 4.

Figure 16 shows the R² of the three regression models,

R² = 1 − [∑_{i=1}^n |y_i^true − y_i^pred|²] / [∑_{i=1}^n |y_i^true − (1/n)∑_{i=1}^n y_i^pred|²].  (58)

Let us make a few remarks: (i) the regularization parameter λ affects the Ridge and LASSO regressions at scales separated by a few orders of magnitude. Notice that this is different for the data considered in the diabetes dataset, cf. Fig. 14. Therefore, it is considered good practice to always check the performance for the given model and data as a function of λ. (ii) While the OLS and Ridge regression test curves are monotonic, the LASSO test curve is not – suggesting an optimal LASSO regularization parameter is λ ≈ 10⁻². At this sweet spot, the Ising interaction weights J contain only nearest-neighbor terms (as did the model the data was generated from).

Choosing whether to use Ridge or LASSO regression in this case turns out to be similar to fixing gauge degrees of freedom. Recall that the uniform nearest-neighbor interaction strength J_{j,k} = J which we used to generate the data was set to unity, J = 1. Moreover, J_{j,k} was NOT defined to be symmetric (we only used the J_{j,j+1} but never the J_{j,j−1} elements). Figure 17 shows the matrix representation of the learned weights J_{j,k}. Interestingly, OLS and Ridge regression learn nearly symmetric weights J ≈ −0.5. This is not surprising, since it amounts to taking into account both the J_{j,j+1} and the J_{j,j−1} terms, and the weights are distributed symmetrically between them. LASSO, on the other hand, tends to break this symmetry (see the matrix element plots for λ = 0.01)⁵. Thus, we see how different regularization schemes can lead to learning equivalent models but in different 'gauges'. Any information we have about the symmetry of the unknown model that generated the data should be reflected in the definition of the model and the choice of regularization. In addition to the diabetes dataset in Notebook 3, we encourage the reader to work out Notebook 4 in which linear regression is applied to the one-dimensional Ising model.

E. Convexity of regularizer

In the previous section, we mentioned that the analytical solution of LASSO can be found by invoking its convexity. In this section, we provide a gentle introduction to convexity theory and highlight a few properties which can help us understand the differences between LASSO and Ridge regression. First, recall that a set C ⊆ R^n is called convex if for any x, y ∈ C and t ∈ [0, 1],

tx + (1 − t)y ∈ C.  (59)

In other words, every line segment joining x, y lies entirely in C. A function f : R^n → R is called convex if its domain, dom(f), is a convex set, and for any x, y ∈ dom(f) and t ∈ [0, 1] we have

f(tx + (1 − t)y) ≤ t f(x) + (1 − t) f(y).  (60)

That is, the function lies on or below the line segment joining its evaluations at x and y. This function f is called strictly convex if this inequality holds strictly for x ≠ y and t ∈ (0, 1). Now, it turns out that for convex functions, any local minimizer is a global minimizer. Algorithmically, this means that in the optimization procedure, as long as we are "going down the hill" and agree to stop when we reach a minimum, then we have hit the global minimum. In addition to this, there is an abundance of rich theory regarding convex duality and optimality, which allows us to understand the solutions even before solving the problem itself. We refer interested readers to (Boyd and Vandenberghe, 2004; Rockafellar, 2015).

⁵ Look closer, and you will see that LASSO actually splits the weights rather equally for the periodic boundary condition element at the edges of the anti-diagonal.

FIG. 17 Learned interaction matrix J_{ij} for the Ising model ansatz in Eq. (56) for ordinary least squares (OLS) regression (left), Ridge regression (middle) and LASSO (right) at different regularization strengths λ. OLS is λ-independent but is shown for comparison throughout. See Notebook 4.

Now let us examine the two regularizers we introduced earlier. A close inspection reveals that LASSO and Ridge regression are both convex problems but only Ridge regression is a strictly convex problem (assuming λ > 0). From convexity theory, this means that we always have a unique solution for Ridge but not necessarily for LASSO. In fact, it was recently shown that under mild conditions, such as demanding general position for the columns of X, the LASSO solution is indeed unique (Tibshirani et al., 2013). Apart from this theoretical characterization, (Zou and Hastie, 2005) introduced the notion of the Elastic Net to retain the desirable properties of both LASSO and Ridge regression, which is now one of the standard tools for regression analysis and machine learning. We refer the reader to explore this in Notebook 2.

F. Bayesian formulation of linear regression

In Section V, we gave an overview of Bayesian inference and phrased it in the context of learning and uncertainty quantification. In this section we formulate least squares regression from a Bayesian point of view. We shall see that regularization in learning will emerge naturally as part of the Bayesian inference procedure.

From the setup of linear regression, the data D used to fit the regression model is generated through y = x^T w + ε. We often assume that ε is Gaussian noise with mean zero and variance σ². To connect linear regression to the Bayesian framework, we often write the model as

p(y|x, θ) = N(y | μ(x), σ²(x)). (61)

In other words, our regression model is defined by a conditional probability that depends not only on the data x but on some model parameters θ. For example, if the mean is a linear function of x given by μ = x^T w, and the variance is fixed, σ²(x) = σ², then θ = (w, σ²).

In statistics, many problems rely on the estimation of some parameters of interest. For example, suppose we are given the height data of 20 junior students from a regional high school, but what we are interested in is the average height of all high school juniors in the whole county. It is conceivable that the data we are given are not representative of the student population as a whole. It is therefore necessary to devise a systematic way to perform reliable estimation. Here we present maximum likelihood estimation (MLE), and show that the MLE for θ is the one that minimizes the mean squared error (MSE) used in OLS, see Sec. VI.A.

MLE is defined by maximizing the log-likelihood w.r.t. the parameters θ:

\hat{\theta} \equiv \arg\max_{\theta} \log p(D|\theta). (62)

Using the assumption that samples are i.i.d., we can write the log-likelihood as

l(\theta) \equiv \log p(D|\theta) = \sum_{i=1}^{n} \log p(y_i | \mathbf{x}^{(i)}, \theta). (63)

Note that the conditional dependence of the response variable y_i on the independent variable x^{(i)} in the likelihood function is made explicit, since in regression the observed value y_i is predicted from x^{(i)} using a model that is assumed to be a probability distribution depending on unknown parameters θ. This distribution, once endowed with θ, is the likelihood function we discussed in Sec. V. Note that this is consistent with the formal statistical treatment of regression, where the goal is to estimate the conditional expectation of the dependent variable given the value of the independent variable (sometimes called the covariate) (Wasserman, 2013). We stress that this notation does not imply that x^{(i)} is unknown; it is still part of the observed data!

Using Eq. (61), we get

l(\theta) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y_i - \mathbf{w}^T \mathbf{x}^{(i)} \right)^2 - \frac{n}{2} \log\left(2\pi\sigma^2\right) = -\frac{1}{2\sigma^2} \lVert X\mathbf{w} - \mathbf{y} \rVert_2^2 + \text{const}. (64)

By comparing Eq. (38) and Eq. (64), it is clear that performing least squares is the same as maximizing the log-likelihood of this model.
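This equivalence is easy to verify numerically. The sketch below (an illustration of ours, not part of the notebooks) maximizes the Gaussian log-likelihood of Eq. (64) with a generic optimizer and compares the result to the closed-form least-squares solution; the value assumed for σ² is arbitrary, since it does not affect the location of the maximum.

```python
# Sketch: the w maximizing the Gaussian log-likelihood (Eq. 64)
# coincides with the least-squares solution.
import numpy as np
from scipy.optimize import minimize

rng = np.random.RandomState(1)
n, p = 50, 3
X = rng.randn(n, p)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.randn(n)

def neg_log_likelihood(w, sigma2=0.09):
    # -l(w) up to a w-independent constant
    return np.sum((y - X @ w) ** 2) / (2 * sigma2)

w_mle = minimize(neg_log_likelihood, np.zeros(p)).x
w_ols = np.linalg.solve(X.T @ X, X.T @ y)    # closed-form least squares
print(np.allclose(w_mle, w_ols, atol=1e-4))  # True
```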

What about adding regularization? In Section V, we introduced the maximum a posteriori probability (MAP) estimate. Here we show that it actually corresponds to regularized linear regression, where the choice of prior determines the type of regularization. Recall Bayes' rule,

p(θ|D) ∝ p(D|θ)p(θ). (65)

Now, instead of maximizing the log-likelihood l(θ) = log p(D|θ), let us maximize the log posterior, log p(θ|D). Invoking Eq. (65), the MAP estimator becomes

\theta_{\text{MAP}} \equiv \arg\max_{\theta} \left[ \log p(D|\theta) + \log p(\theta) \right]. (66)

In Sec. V.C, we discussed that a common choice for the prior is a Gaussian distribution. Consider a Gaussian prior 6 with zero mean and variance τ², namely p(w) = ∏_j N(w_j | 0, τ²).

6 Indeed, a Gaussian prior is the conjugate prior that gives a Gaussian posterior. For a given likelihood, conjugacy guarantees the preservation of the prior distribution at the posterior level. For example, for a Gaussian (Geometric) likelihood with a Gaussian (Beta) prior, the posterior distribution is still a Gaussian (Beta) distribution.

Then, we can recast the MAP estimator into

\theta_{\text{MAP}} \equiv \arg\max_{\theta} \left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}^{(i)})^2 - \frac{1}{2\tau^2} \sum_{j} w_j^2 \right)
= \arg\max_{\theta} \left[ -\frac{1}{2\sigma^2} \lVert X\mathbf{w} - \mathbf{y} \rVert_2^2 - \frac{1}{2\tau^2} \lVert \mathbf{w} \rVert_2^2 \right]. (67)

Note that we dropped constant terms that do not depend on the maximization parameters θ. The equivalence between MAP estimation with a Gaussian prior and Ridge regression is established by comparing Eq. (67) and Eq. (44) with λ ≡ σ²/τ². We relegate the analogous derivation for LASSO to an exercise in Notebook 3.
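The identification λ ≡ σ²/τ² can also be checked numerically. The sketch below (our own illustration; the values of σ² and τ² are arbitrary assumptions) computes the MAP estimator in closed form, w = (X^T X + λI)^{-1} X^T y, and compares it to scikit-learn's Ridge solution.

```python
# Sketch: MAP with a Gaussian prior (variance tau^2) equals Ridge
# regression with lambda = sigma^2 / tau^2 (Eq. 67).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(2)
n, p = 50, 4
X = rng.randn(n, p)
y = X @ rng.randn(p) + 0.2 * rng.randn(n)

sigma2, tau2 = 0.04, 0.5          # assumed noise and prior variances
lam = sigma2 / tau2
# closed-form MAP / Ridge solution: (X^T X + lam*I)^{-1} X^T y
w_map = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

w_ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
print(np.allclose(w_map, w_ridge))  # True
```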

G. Recap and a general perspective on regularizers

In this section, we explored least squares linear regression with and without regularization. We motivated the need for regularization due to poor generalization, in particular in the "high-dimensional limit" (p ≫ n). Instead of showing the average in-sample and out-of-sample errors for the regularized problem explicitly, we conducted numerical experiments in Notebook 3 on the diabetes dataset and showed that regularization typically leads to better generalization. Due to the equivalence between the constrained and penalized forms of regularized regression (true for LASSO and Ridge, but not in general, e.g. for L0 penalization), we can regard the regularized regression problem as an un-regularized problem on a constrained set of parameters. Since the size of the allowed parameter space (e.g. w ∈ R^p when un-regularized vs. w ∈ C ⊂ R^p when regularized) is roughly a proxy for model complexity, solving the regularized problem is in effect solving the un-regularized problem with a smaller model complexity class. This implies that we are less likely to overfit.

We also showed the connection between using a regularization function and the use of priors in Bayesian inference. This connection can be used to develop more intuition about why regularization implies we are less likely to overfit the data. Let's say you are a young physics student taking a laboratory class where the goal of the experiment is to measure the behavior of several different pendula and use that to predict the formula (i.e. model) that determines the period of oscillation. In your investigation you would probably record many things (hopefully including the length and mass!) in an effort to give yourself the best possible chance of determining the unknown relationship, perhaps writing down the temperature of the room, any air currents, whether the table was vibrating, etc. What you have done is create a high-dimensional dataset for yourself. However, you actually possess an even higher-dimensional dataset than you would probably admit to yourself. For example, you are probably aware of the time of day, that it is a Wednesday, that your friend Alice is in attendance, that your friend Bob is absent with a cold, the country in which you are doing the experiment, and the planet you are on, but you almost assuredly haven't written these down in your notebook. Why not? The reason is that you entered the classroom with strongly held prior beliefs that none of those things affect the physics which takes place in that room. Even for the things you did write down, in an effort to be a careful scientist, you probably hold some doubt as to their importance to your result; what is serving you here is the intuition that probably only a few things matter in the physics of pendula. Hence, again, you are approaching the experiment with prior beliefs about how many features you will need to pay attention to in order to predict what will happen when you swing an unknown pendulum. This example might seem a bit contrived, but the point is that we live in a high-dimensional world of information, and while we have good intuition about what to write down in our notebook for well-known problems, in the field of ML we often cannot say with any confidence a priori what the small list of things to write down will be. But we can at least use regularization to help us enforce that the list not be too long, so that we don't end up predicting that the period of a pendulum depends on Bob having a cold on Wednesdays.

Of course, in both LASSO and Ridge regression there is a parameter λ involved. In practice, this hyperparameter is usually predetermined, which means that it is not part of the regression process. As we saw in Fig. 15, our learning performance and solution depend strongly on λ, so it is vital to choose it properly. As we discussed in Sec. V.C, one approach is to assume an uninformative prior on the hyperparameters, p(λ), and average the posterior over all choices of λ following this distribution. However, this comes with a large computational cost. Therefore, it is simpler to choose the regularization parameter through some optimization procedure.
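One common such optimization procedure is cross-validation. As a minimal sketch (our own illustration on synthetic data; the grid of λ values is an arbitrary assumption), scikit-learn's LassoCV scans a grid of regularization strengths and keeps the one with the best average out-of-sample error:

```python
# Sketch: choosing the regularization strength lambda by cross-validation
# rather than averaging over a prior p(lambda).
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.RandomState(3)
X = rng.randn(200, 10)
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(200)

# lambda is called "alpha" in scikit-learn
model = LassoCV(alphas=np.logspace(-4, 1, 30), cv=5).fit(X, y)
print("selected lambda:", model.alpha_)
```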

We'd like to emphasize that linear regression can be applied to model non-linear relationships between input and response. This can be done by replacing the input x with some non-linear function φ(x). Note that doing so preserves linearity as a function of the parameters w, since the model is defined by the inner product φ(x)^T w. This method is known as basis function expansion (Bishop, 2006; Murphy, 2012).
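A minimal sketch of basis function expansion (our own illustration; the sine target and polynomial degree are arbitrary assumptions) is the following, where the feature map φ(x) = (1, x, x², ..., x⁵) is built with scikit-learn and the fit remains an ordinary linear regression in w:

```python
# Sketch: basis function expansion. The model is non-linear in x but
# still linear in the parameters w, so linear regression applies.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(4)
x = np.sort(rng.uniform(-1, 1, 100))[:, None]
y = np.sin(3 * x).ravel() + 0.1 * rng.randn(100)

model = make_pipeline(PolynomialFeatures(degree=5), LinearRegression())
model.fit(x, y)
print("training R^2:", model.score(x, y))
```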

Recent years have also seen a surge of interest in understanding generalized linear regression models from a statistical physics perspective. Much of this research has focused on understanding high-dimensional linear regression and compressed sensing (Donoho, 2006) (see (Advani et al., 2013; Zdeborová and Krzakala, 2016) for accessible reviews for physicists). On a technical level, this research imports and extends the machinery of spin glass physics (replica method, cavity method, and message passing) to analyze high-dimensional linear models


FIG. 18 Pictorial representation of four data categories labeled by the integers 0 through 3 (above), or by one-hot vectors with binary inputs (below).

(Advani and Ganguli, 2016; Fisher and Mehta, 2015a,b; Krzakala et al., 2014, 2012a,b; Ramezanali et al., 2015; Zdeborová and Krzakala, 2016). This is a rich area of activity at the intersection of physics, computer science, information theory, and machine learning, and interested readers are encouraged to consult the literature for further information (see also (Mezard and Montanari, 2009)).

VII. LOGISTIC REGRESSION

So far we have focused on learning from datasets for which there is a "continuous" output. For example, in linear regression we were concerned with learning the coefficients of a polynomial to predict the response of a continuous variable y_i on unseen data based on its independent variables x_i. However, a wide variety of problems, such as classification, are concerned with outcomes taking the form of discrete variables (i.e. categories). For example, we may want to detect if there is a cat or a dog in an image. Or, given a spin configuration of, say, the 2D Ising model, we would like to identify its phase (e.g. ordered/disordered). In this section, we introduce logistic regression, which deals with binary, dichotomous outcomes (e.g. True or False, Success or Failure, etc.). We encourage the reader to use the opportunity to build their intuition about the inner workings of logistic regression, as this will prove valuable later on in the study of modern supervised Deep Learning models (see Sec. IX).

This section is structured as follows: first, we define logistic regression and derive its corresponding cost function (the cross entropy) using a Bayesian approach, and discuss its minimization. Then, we generalize logistic regression to the case of multiple categories, which is called SoftMax regression. We demonstrate how to apply logistic regression using three different problems: (i) classifying phases of the 2D Ising model, (ii) learning features in the SUSY dataset, and (iii) MNIST handwritten digit classification.

Throughout this section, we consider the case where the dependent variables y_i ∈ Z are discrete and only take values from m = 0, . . . , M − 1 (which enumerate the M classes), see Fig. 18. The goal is to predict the

FIG. 19 Classifying data in the simplest case of only twocategories, labeled “noise” and “signal” (or “cats” and “dogs”),is the subject of Logistic Regression.

output classes from the design matrix X ∈ R^{n×p} made of n samples, each of which bears p features. The primary goal is to identify the classes to which new unseen samples belong.

Before delving into the details of logistic regression, it is helpful to consider a slightly simpler classifier: a linear classifier that categorizes examples using a weighted linear combination of the features and an additive offset:

si = xTi w + b0 ≡ xTi w, (68)

where we use the short-hand notation x_i = (1, x_i) and w = (b_0, w). This function takes values on the entire real axis. In the case of logistic regression, however, the labels y_i are discrete variables. One simple way to get a discrete output is to use a sign function that maps the output of a linear regressor to {0, 1}: σ(s_i) = sign(s_i) = 1 if s_i ≥ 0 and 0 otherwise. Indeed, this is commonly known as the "perceptron" in the machine learning literature.

A. The cross-entropy as a cost function for logistic regression

The perceptron is an example of a "hard classifier": each datapoint is assigned to a category (i.e. y_i = 0 or y_i = 1). Even though the perceptron is an extremely simple model, it is preferable in many cases (e.g. when dealing with noisy data) to have a "soft" classifier that outputs the probability of a given category. For example, given x_i, the classifier returns the probability of being in category m. One such function is the logistic (or sigmoid) function:

\sigma(s) = \frac{1}{1 + e^{-s}}. (69)

Note that 1 − σ(s) = σ(−s), which will be useful shortly. In many cases, it is favorable to work with a "soft" classifier.

Logistic regression is the canonical example of a soft classifier. In logistic regression, the probability that a


data point x_i belongs to a category y_i ∈ {0, 1} is given by

P(y_i = 1 | \mathbf{x}_i, \theta) = \frac{1}{1 + e^{-\mathbf{x}_i^T \theta}},
P(y_i = 0 | \mathbf{x}_i, \theta) = 1 - P(y_i = 1 | \mathbf{x}_i, \theta), (70)

where θ = w are the weights we wish to learn from the data. To gain some intuition for these equations, consider a collection of non-interacting two-state systems coupled to a thermal bath (e.g. a collection of atoms that can be in two states). Furthermore, denote the state of system i by a binary variable: y_i ∈ {0, 1}. From elementary statistical mechanics, we know that if the two states have energies ε_0 and ε_1, the probability of finding the system in state y_i is:

P(y_i = 1) = \frac{e^{-\beta\epsilon_0}}{e^{-\beta\epsilon_0} + e^{-\beta\epsilon_1}} = \frac{1}{1 + e^{-\beta\Delta\epsilon}},
P(y_i = 0) = 1 - P(y_i = 1). (71)

Notice that in these expressions, as is often the case in physics, only energy differences are observable. If the difference in energies between the two states is given by Δε = x_i^T w, we recover the expressions for logistic regression. We shall use this mapping between partition functions and classification to generalize the logistic regressor to SoftMax regression in Sec. VII.D. Notice that in terms of the logistic function, we can write

P(y_i = 1) = σ(x_i^T w) = 1 − P(y_i = 0). (72)

We now define the cost function for logistic regression using Maximum Likelihood Estimation (MLE). Recall that in MLE we choose parameters to maximize the probability of seeing the observed data. Consider a dataset D = {(y_i, x_i)} with binary labels y_i ∈ {0, 1} from which the data points are drawn independently. The likelihood of observing the data under our model is just:

P(D|\mathbf{w}) = \prod_{i=1}^{n} \left[ \sigma(\mathbf{x}_i^T \mathbf{w}) \right]^{y_i} \left[ 1 - \sigma(\mathbf{x}_i^T \mathbf{w}) \right]^{1 - y_i}, (73)

from which we can readily compute the log-likelihood:

l(\mathbf{w}) = \sum_{i=1}^{n} y_i \log \sigma(\mathbf{x}_i^T \mathbf{w}) + (1 - y_i) \log\left[ 1 - \sigma(\mathbf{x}_i^T \mathbf{w}) \right]. (74)

The maximum likelihood estimator is defined as the set of parameters that maximize the log-likelihood:

\hat{\mathbf{w}} = \arg\max_{\mathbf{w}} \sum_{i=1}^{n} y_i \log \sigma(\mathbf{x}_i^T \mathbf{w}) + (1 - y_i) \log\left[ 1 - \sigma(\mathbf{x}_i^T \mathbf{w}) \right]. (75)

Since the cost (error) function is just the negative log-likelihood, for logistic regression we find

C(\mathbf{w}) = -l(\mathbf{w}) = \sum_{i=1}^{n} -y_i \log \sigma(\mathbf{x}_i^T \mathbf{w}) - (1 - y_i) \log\left[ 1 - \sigma(\mathbf{x}_i^T \mathbf{w}) \right]. (76)

The right-hand side in Eq. (76) is known in statistics asthe cross entropy.

Having specified the cost function for logistic regression, we note that, just as in linear regression, in practice we usually supplement the cross-entropy with additional regularization terms, usually L1 and L2 regularization (see Sec. VI for a discussion of these regularizers).
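For concreteness, here is a minimal NumPy sketch (our own, not taken from the notebooks) of the sigmoid of Eq. (69) and the cross-entropy cost of Eq. (76); the clipping constant eps is an implementation detail we assume to avoid evaluating log(0):

```python
# Sketch: sigmoid (Eq. 69) and cross-entropy cost (Eq. 76) in NumPy.
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def cross_entropy(w, X, y, eps=1e-12):
    p = sigmoid(X @ w)
    p = np.clip(p, eps, 1 - eps)  # avoid log(0) for saturated predictions
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.RandomState(5)
X = rng.randn(20, 3)
y = (X[:, 0] > 0).astype(float)
print(cross_entropy(np.zeros(3), X, y))  # equals 20*log(2) at w = 0
```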

B. Minimizing the cross entropy

The cross entropy is a convex function of the weights w and, therefore, any local minimizer is a global minimizer. Minimizing this cost function leads to the following equation:

0 = \nabla C(\mathbf{w}) = \sum_{i=1}^{n} \left[ \sigma(\mathbf{x}_i^T \mathbf{w}) - y_i \right] \mathbf{x}_i, (77)

where we made use of the logistic function identity ∂_s σ(s) = σ(s)[1 − σ(s)]. Equation (77) defines a transcendental equation for w, the solution of which, unlike linear regression, cannot be written in closed form. For this reason, one must use numerical methods such as those introduced in Sec. IV to solve this optimization problem.
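As a minimal sketch of such a numerical solution (our own construction; the step size and iteration count are arbitrary assumptions), plain gradient descent can be applied directly to the gradient appearing in Eq. (77):

```python
# Sketch: solving Eq. (77) by plain gradient descent, using the
# gradient  sum_i [sigma(x_i^T w) - y_i] x_i  derived in the text.
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.RandomState(6)
X = rng.randn(200, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # linearly separable labels

w = np.zeros(2)
eta = 0.1 / len(y)                         # learning rate (assumed)
for _ in range(2000):
    grad = X.T @ (sigmoid(X @ w) - y)      # Eq. (77)
    w -= eta * grad
print("accuracy:", np.mean((sigmoid(X @ w) > 0.5) == y))
```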

C. Examples of binary classification

Let us now show how to use logistic regression in practice. In this section, we showcase two pedagogical examples in which we train a logistic regressor to classify binary data. Each example comes with a corresponding Jupyter notebook, see https://physics.bu.edu/~pankajm/MLnotebooks.html.

1. Identifying the phases of the 2D Ising model

The goal of this example is to show how one can employ logistic regression to classify the states of the 2D Ising model according to their phase of matter.

The Hamiltonian for the classical Ising model is given by

H = -J \sum_{\langle ij \rangle} S_i S_j, \qquad S_j \in \{\pm 1\}, (78)

where the lattice site indices i, j run over all nearest neighbors of a 2D square lattice, and J is an interaction


FIG. 20 Examples of typical states of the 2D Ising model for three different temperatures in the ordered phase (T/J = 0.75, left), the critical region (T/J = 2.25, middle) and the disordered phase (T/J = 4.0, right). The linear system dimension is L = 40 sites.

energy scale. We adopt periodic boundary conditions. Onsager proved that this model undergoes a phase transition in the thermodynamic limit from an ordered ferromagnet with all spins aligned to a disordered phase at the critical temperature T_c/J = 2/\log(1 + \sqrt{2}) \approx 2.26. For any finite system size, this critical point is smeared out to a critical region around T_c.

An interesting question to ask is whether one can train a statistical classifier to distinguish between the two phases of the Ising model. If successful, this can be used to locate the position of the critical point in more complicated models where an exact analytical solution has so far remained elusive (Morningstar and Melko, 2017; Zhang et al., 2017b). In other words, given an Ising state, we would like to classify whether it belongs to the ordered or the disordered phase, without any additional information other than the spin configuration itself. This categorical machine learning problem is well suited for logistic regression, and will thus consist of recognizing whether a given state is ordered by looking at its bit configuration. Notice that, for the purposes of applying logistic regression, the 2D spin state of the Ising model will be flattened out to a 1D array, so it will not be possible to learn information about the structure of the contiguous ordered 2D domains [see Fig. 20]. Such information can be incorporated using deep convolutional neural networks, see Section IX.

To this end, we consider the 2D Ising model on a 40×40 square lattice, and use Monte-Carlo (MC) sampling to prepare 10^4 states at every fixed temperature T out of a pre-defined set. We furthermore assign a label to each state according to its phase: 0 if the state is disordered, and 1 if it is ordered.

It is well-known that near the critical temperature T_c, the ferromagnetic correlation length diverges, which leads to, among other things, critical slowing down of the MC algorithm. Perhaps identifying the phases is also harder in the critical region. With this in mind, consider the following three types of states: ordered (T/J < 2.0), near-critical (2.0 ≤ T/J ≤ 2.5) and disordered (T/J > 2.5). We use both ordered and disordered states to train the logistic regressor and, once the supervised training procedure is complete, we evaluate the performance of our classification model on unseen ordered, disordered, and near-critical states.

Here, we deploy the liblinear routine (the default for Scikit's logistic regression) and stochastic gradient descent (SGD, see Sec. IV for details) to optimize the logistic regression cost function with L2 regularization. We define the accuracy of the classifier as the percentage of correctly classified data points. Comparing the accuracy on the training and test data, we can study the degree of overfitting. The first thing to notice in Fig. 21 is the small degree of overfitting, as suggested by the training (blue) and test (red) accuracy curves being very close to each other. Interestingly, the liblinear minimizer outperforms SGD on the training and test data, but not on the near-critical data for certain values of the regularization strength λ. Moreover, similar to the linear regression examples, we find that there exists a sweet spot for the SGD regularization strength λ that results in optimal performance of the logistic regressor, at about λ ∼ 10^{-1}. We might expect that the difficulty of the phase recognition problem depends on the temperature of the queried sample. Looking at the states in the near-critical region, cf. Fig. 20, it is no longer easy for a trained human eye to distinguish between the ferromagnetic and the disordered phase close to T_c. Therefore, it is interesting to also compare the training and test accuracies to the accuracy of the near-critical state predictions. (Recall that



FIG. 21 Accuracy as a function of the regularization parameter λ in classifying the phases of the 2D Ising model on the training (blue), test (red), and critical (green) data. The solid and dashed lines compare the 'liblinear' and 'SGD' solvers, respectively.

the model is not trained on near-critical states.) Indeed, the liblinear accuracy is about 7% smaller for the critical states (green curves) compared to the test data (red line).

Finally, it is important to note that all of Scikit's logistic regression solvers have in-built regularizers. We did not emphasize the role of the regularizers in this section, but they are crucial in order to prevent overfitting. We encourage the interested reader to play with the different regularization types and numerical solvers in Notebook 6 and compare model performances.
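As a compressed sketch of this workflow (our own construction: the Monte-Carlo data of Notebook 6 is replaced here by synthetic spin configurations with a planted ordered/disordered structure, purely to show the scikit-learn API):

```python
# Sketch: logistic regression on synthetic "Ising-like" configurations.
# Ordered states are mostly-aligned spins; disordered states are random.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(7)
L2 = 40 * 40                                # flattened 40x40 lattice
n = 2000
ordered = np.sign(rng.randn(n // 2, 1)) * np.where(
    rng.rand(n // 2, L2) < 0.9, 1, -1)      # ~90% aligned spins
disordered = rng.choice([-1, 1], size=(n // 2, L2))
X = np.vstack([ordered, disordered]).astype(float)
y = np.r_[np.ones(n // 2), np.zeros(n // 2)]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(C=1.0, solver="liblinear")  # C plays the role of 1/lambda
clf.fit(X_tr, y_tr)
print("train acc:", clf.score(X_tr, y_tr), "test acc:", clf.score(X_te, y_te))
```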

2. SUSY

In high energy physics experiments, such as the ATLAS and CMS detectors at the CERN LHC, one major hope is the discovery of new particles. To accomplish this task, physicists attempt to sift through events and classify them as either a signal of some new physical process or particle, or as a background event from already understood Standard Model processes. Unfortunately, we don't know for sure what underlying physical process occurred (the only information we have access to are the final state particles). However, we can attempt to define parts of phase space that will have a high percentage of signal events. Typically this is done by using a series of simple requirements on the kinematic quantities of the final state particles, for example having one or more leptons with large amounts of momentum that is transverse to the beam line (p_T). Instead, here we will use logistic regression in an attempt to find the relative probability that an event is from a signal or a background process. Rather than using the kinematic quantities of

final state particles directly, we will use the output of our logistic regression to define a part of phase space that is enriched in signal events (see Notebook 5).

The dataset we are using comes from the UC Irvine ML repository and has been produced using Monte Carlo simulations to contain events with two leptons (electrons or muons) (Baldi et al., 2014). Each event has the value of 18 kinematic variables ("features"). The first 8 features are direct measurements of final state particles, in this case the p_T, pseudo-rapidity η, and azimuthal angle φ of two leptons in the event and the amount of missing transverse momentum (MET) together with its azimuthal angle. The last ten features are higher-order functions of the first 8 features; these features were derived by physicists to help discriminate between the two classes. These high-level features can be thought of as the physicists' attempt to use non-linear functions to classify signal and background events, having been developed with formidable theoretical effort. Here, we will use only logistic regression to attempt to classify events as either signal (that is, coming from a SUSY process) or background (events from some already observed Standard Model process). Later on in the review, in Sec. IX, we shall revisit the same problem with the tools of Deep Learning.

As stated before, we never know the true underlying process, and hence the goal in these types of analyses is to find regions enriched in signal events. If we find an excess of events above what is expected, we can have confidence that they are coming from the type of signal we are searching for. Therefore, the two metrics of import are the efficiency of signal selection and the background rejection achieved (also called detection/rejection rates, and similar to recall/precision). Oftentimes, rather than thinking about just a single working point, performance is characterized by Receiver Operating Characteristic curves (ROC curves). These ROC curves plot signal efficiency versus background rejection at various thresholds of some discriminating variable. Here that variable will be the output signal probability of our logistic regression. Figure 22 shows examples of these outputs for true signal events (left) and background events (right) using L2 regularization with a regularization parameter of 10^{-5}.
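As a minimal sketch of how such a ROC curve is built from classifier outputs (our own illustration; the Beta-distributed scores are synthetic stand-ins for the logistic regression probabilities):

```python
# Sketch: building a ROC curve from classifier output probabilities.
import numpy as np
from sklearn.metrics import roc_curve, auc

rng = np.random.RandomState(8)
# synthetic scores: signal skews toward high probability, background low
y_true = np.r_[np.ones(500), np.zeros(500)]
scores = np.r_[rng.beta(5, 2, 500), rng.beta(2, 5, 500)]

fpr, tpr, thresholds = roc_curve(y_true, scores)
print("AUC:", auc(fpr, tpr))
# signal efficiency = tpr; background rejection = 1 - fpr
```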

Notice that while the majority of signal events receive high probabilities of passing our discriminator and the majority of background events receive low probabilities, some signal events look background-like, and some background events look signal-like to our discriminator. This is further reason to characterize the performance of our selection in terms of ROC curves. Figure 23 shows examples of these curves using L2 regularization for many different regularization parameters using two different ML Python packages, either TensorFlow (top) or Sci-Kit Learn (bottom), when using the full set of 18 input variables.


FIG. 22 The probability of an event being classified as a signal event for true signal events (left, blue) and background events (right, red).

Notice that there is minimal overfitting, in part because we trained on such a large dataset (4.5 million events). More important, however, is the underlying data we are working with: each input variable is an important feature.

FIG. 23 ROC curves for a variety of regularization parameters with L2 regularization using TensorFlow (top) or Sci-Kit Learn (bottom).

While Fig. 23 shows nice discrimination power between signal and background events, the adoption of ML techniques adds complication to any analysis. Given that

we've already come up with a set of discriminating variables, including higher-order ones derived from theories about SUSY particles, it's worth reflecting on whether there is utility to the increased sophistication of ML. To show why we would want to use such a technique, recall that, even to the learning algorithm, some signal events and background events look similar. We can illustrate this directly by looking at a plot comparing the p_T spectrum of the two highest-p_T leptons (often referred to as the leading and sub-leading leptons) for both signal and background events. Figure 24 shows these two distributions, and one can see that while some signal events are easily distinguished, many live in the same part of phase space as the background. This effect can also be seen in Fig. 22, where some signal events look like background events and vice-versa.

FIG. 24 Comparison of leading vs. sub-leading lepton p_T for signal (blue) and background events (red). Recall that these variables have been scaled to have a mean of one.

One could then ask how much discrimination power is obtained by simply putting different requirements on the input variables rather than using ML techniques. In order to compare this strategy (often referred to as cut-based in the field of HEP) to our regression results, different ROC curves have been made for each of the following cases: logistic regression with just the simple kinematic variables, logistic regression with the full set of variables, and simply putting a requirement on the leading lepton p_T. Figure 25 shows that there is a clear performance benefit from using logistic regression. Note also that in the cut-based approach we have only used one variable, whereas we could have put requirements on all of them. While putting more requirements would indeed increase background rejection, it would also decrease signal efficiency. Hence, the cut-based approach will never yield as strong discrimination as the logistic regression we have performed. One other interesting point about these results is that the higher-order variables noticeably help the ML techniques. In later sections, we will return to this point to see if more sophisticated techniques can provide further improvement.

FIG. 25 A comparison of discrimination power from using logistic regression with only simple kinematic variables (green), logistic regression using both simple and higher-order kinematic variables (purple), and a cut-based approach that varies the requirements on the leading lepton p_T.

D. Softmax Regression

So far we have focused only on binary classification, in which the labels are dichotomous variables. Here we generalize logistic regression to multi-class classification. One approach is to treat the label as a vector y_i ∈ Z_2^M, namely a binary string of length M with only one component of y_i being 1 and the rest zero. For example, y_i = (1, 0, · · · , 0) means that the sample x_i belongs to class 1 7, cf. Fig. 18. Following the notation in Sec. VII.A,

7 For an alternative mathematical description of the categories, which labels the classes by integers, see http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression.


FIG. 26 An example of an input datapoint from the MNIST dataset. Each datapoint is a 28 × 28-pixel image of a handwritten digit, with its corresponding label belonging to one of the 10 digits. Each pixel contains a greyscale value represented by an integer between 0 and 255.

the probability of xi being in class m′ is given by

P(y_{im'} = 1 | \mathbf{x}_i, \{\mathbf{w}_k\}_{k=0}^{M-1}) = \frac{e^{-\mathbf{x}_i^T \mathbf{w}_{m'}}}{\sum_{m=0}^{M-1} e^{-\mathbf{x}_i^T \mathbf{w}_m}}, (79)

where y_{im'} ≡ [y_i]_{m'} refers to the m'-th component of the vector y_i. This is known as the SoftMax function. Therefore, the likelihood of this M-class classifier is simply (cf. Sec. VII.A):

P(D|\{\mathbf{w}_k\}_{k=0}^{M-1}) = \prod_{i=1}^{n} \prod_{m=0}^{M-1} \left[ P(y_{im} = 1 | \mathbf{x}_i, \mathbf{w}_m) \right]^{y_{im}} \left[ 1 - P(y_{im} = 1 | \mathbf{x}_i, \mathbf{w}_m) \right]^{1 - y_{im}}, (80)

from which we can define the cost function in a similar fashion:

C(\mathbf{w}) = -\sum_{i=1}^{n} \sum_{m=0}^{M-1} y_{im} \log P(y_{im} = 1 | \mathbf{x}_i, \mathbf{w}_m) + (1 - y_{im}) \log\left( 1 - P(y_{im} = 1 | \mathbf{x}_i, \mathbf{w}_m) \right). (81)

As expected, for M = 1, we recover the cross entropy forlogistic regression, cf. Eq. (76).
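A minimal NumPy sketch of the SoftMax probabilities in Eq. (79) follows (our own illustration; the minus signs in the exponents match the text's convention, and shifting by the maximum exponent is a standard trick we add for numerical stability):

```python
# Sketch: the SoftMax probabilities of Eq. (79) in NumPy.
import numpy as np

def softmax(X, W):
    """Rows of X are samples; columns of W are the per-class weights w_m."""
    S = -X @ W                           # scores -x_i^T w_m, shape (n, M)
    S -= S.max(axis=1, keepdims=True)    # numerical stability
    expS = np.exp(S)
    return expS / expS.sum(axis=1, keepdims=True)

rng = np.random.RandomState(9)
P = softmax(rng.randn(5, 4), rng.randn(4, 3))
print(P.sum(axis=1))                     # each row sums to one
```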

E. An Example of SoftMax Classification: MNIST Digit Classification

A paradigmatic example of SoftMax regression is to classify handwritten digits from the MNIST dataset.


Yann LeCun and collaborators first collected and processed 70000 handwritten digits, each of which is laid out on a 28 × 28-pixel grid. Every pixel assumes one of 256 grayscale values, interpolating between white and black. A representative input sample is shown in Fig. 26.

Since there are 10 categories for the digits 0 through 9, this corresponds to SoftMax regression with M = 10. We encourage readers to experiment with Notebook 7 to explore SoftMax regression applied to MNIST. We include in Fig. 27 the learned weights w_k, where k corresponds to the class labels (i.e. digits). We shall come back to SoftMax regression in Sec. IX.
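As a lightweight sketch of the same task (our own illustration: scikit-learn's small 8×8 digits dataset stands in for full MNIST, which Notebook 7 uses):

```python
# Sketch: SoftMax (multinomial logistic) regression on the small
# 8x8 digits dataset, a stand-in for the full 28x28 MNIST data.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)           # 10 classes, M = 10
X_tr, X_te, y_tr, y_te = train_test_split(X / 16.0, y, random_state=0)

clf = LogisticRegression(max_iter=1000)        # softmax over 10 classes
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
# clf.coef_ has shape (10, 64): one weight vector w_k per digit class
```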

VIII. COMBINING MODELS

One of the most powerful and widely-applied ideas in modern machine learning is the use of ensemble methods that combine predictions from multiple, often weak, statistical models to improve predictive performance (Dietterich et al., 2000). Ensemble methods, such as random forests (Breiman, 2001; Geurts et al., 2006; Ho, 1998) and boosted gradient trees such as XGBoost (Chen and Guestrin, 2016; Friedman, 2001), undergird many of the winning entries in data science competitions such as Kaggle, especially on structured datasets 8. Even in the context of neural networks, see Sec. IX, it is common to combine predictions from multiple neural networks to increase performance on tough image classification tasks (He et al., 2015; Ioffe and Szegedy, 2015).

In this section, we give an overview of ensemble methods and provide rules of thumb for when and why they work. On one hand, the idea of training multiple models and then using a weighted sum of the predictions of all these models is very natural. After all, the idea of the "wisdom of the crowds" can be traced back, at least, to the writings of Aristotle in Politics. On the other hand, one can also imagine that the ensemble predictions can be much worse than the predictions from each of the individual models that constitute the ensemble, especially when pooling reinforces weak but correlated deficiencies in each of the individual predictors. Thus, it is important to understand when we expect ensemble methods to work.

In order to do this, we will revisit the bias-variance tradeoff, discussed in Sec. III, and generalize it to consider an ensemble of classifiers. We will show that the key to determining when ensemble methods work is the degree of correlation between the models in the ensemble (Louppe, 2014). Armed with this intuition, we will introduce some of the most widely-used and powerful ensemble methods, including bagging (Breiman, 1996), boosting

8 Neural networks generally perform better than ensemble methods on unstructured data, images, and audio.

(Freund et al., 1999; Freund and Schapire, 1995; Schapire and Freund, 2012), random forests (Breiman, 2001), and gradient boosted trees such as XGBoost (Chen and Guestrin, 2016).

A. Revisiting the Bias-Variance Tradeoff for Ensembles

The bias-variance tradeoff summarizes the fundamental tension in machine learning between the complexity of a model and the amount of training data needed to fit it (see Sec. III). Since data is often limited, in practice it is frequently useful to use a less complex model with higher bias – a model whose asymptotic performance is worse than another model's – because it is easier to train and less sensitive to sampling noise arising from having a finite-sized training dataset (i.e. it has smaller variance). Here, we will revisit the bias-variance tradeoff in the context of ensembles, drawing upon the beautiful discussion in Ref. (Louppe, 2014).

A key property that will emerge from this analysis is the correlation between the models that constitute the ensemble. The degree of correlation between models 9 is important for two distinct reasons. First, holding the ensemble size fixed, averaging the predictions of correlated models reduces the variance less than averaging uncorrelated models. Second, in some cases, correlations between models within an ensemble can result in an increase in bias, offsetting any potential reduction in variance gained from ensemble averaging. We will discuss this in the context of bagging below. One of the most dramatic examples of increased bias from correlations is the catastrophic predictive failure of almost all derivative models used by Wall Street during the 2008 financial crisis.

1. Bias-Variance Decomposition for Ensembles

We will discuss the bias-variance tradeoff in the context of continuous predictions such as regression. However, many of the intuitions and ideas discussed here also carry over to classification tasks. Before discussing ensembles, let us briefly review the bias-variance tradeoff in the context of a single model. Consider a dataset consisting of data X_L = {(y_j, x_j), j = 1 . . . N}. Let us assume that the true data is generated from a noisy model

y = f(x) + ε, (82)

where ε is normally distributed with mean zero and standard deviation σ_ε.

9 For example, the correlation coefficient between the predictions made by two randomized models based on the same training set but with different random seeds; see Sec. VIII.A.1 for a precise definition.


FIG. 27 Visualization of the weights w_j after training a SoftMax Regression model on the MNIST dataset (see Notebook 7); one panel per digit class, 0 through 9. We emphasize that SoftMax Regression does not have explicit 2D spatial knowledge; the model learns from data points flattened out in a one-dimensional array.

Assume that we have a statistical procedure (e.g. least-squares regression) for forming a predictor g_L(x) that gives the prediction of our model for a new data point x, given that we trained the model using a dataset L. This estimator is chosen by minimizing a cost function which, for the sake of concreteness, we take to be the squared error

C(X, g(\mathbf{x})) = \sum_i (y_i - g_L(\mathbf{x}_i))^2. (83)

The dataset L is drawn from some underlying distribution that describes the data. If we imagine drawing many datasets L_j of the same size as L from this distribution, we know that the corresponding estimators g_{L_j}(x) will differ from each other due to stochastic effects arising from sampling noise. For this reason, we can view our estimator g_L(x) as a random variable and define an expectation value E_L in the usual way. Note that the subscript denotes that the expectation is taken over L. In practice, E_L is computed by drawing infinitely many different datasets L_j of the same size, fitting the corresponding estimator, and then averaging the results. We will also average over different instances of the "noise" ε. The expectation value over the noise will be denoted by E_ε.

As discussed in Sec. III, we can decompose the expected generalization error as

E_{L,\epsilon}[C(X, g(\mathbf{x}))] = \text{Bias}^2 + \text{Var} + \text{Noise}, (84)

where the bias,

\text{Bias}^2 = \sum_i (f(\mathbf{x}_i) - E_L[g_L(\mathbf{x}_i)])^2, (85)

measures the deviation of the expectation value of our estimator (i.e. the asymptotic value of our estimator in the limit of infinite data) from the true value. The variance,

\text{Var} = \sum_i E_L[(g_L(\mathbf{x}_i) - E_L[g_L(\mathbf{x}_i)])^2], (86)

measures how much our estimator fluctuates due to finite-sample effects. The noise term,

\text{Noise} = \sum_i \sigma_{\epsilon_i}^2, (87)

is the part of the error due to intrinsic noise in the data generation process that no statistical estimator can overcome.

Let us now generalize this to ensembles of estimators. Given a dataset X_L and hyper-parameters θ that parameterize members of our ensemble, we will consider a procedure that deterministically generates a model g_L(x_i, θ) given X_L and θ. We assume that θ includes some random parameters that introduce stochasticity into our ensemble (e.g. an initial condition for stochastic gradient descent, or a random subset of features or data points used for training). Concretely, given a dataset L, one has a learning algorithm A that generates a model


A(θ, L) based on a deterministic procedure which introduces stochasticity through θ in its execution on the dataset L. We will be concerned with the expected prediction error of the aggregate ensemble predictor

g^A_L(\mathbf{x}_i, \theta) = \frac{1}{M} \sum_{m=1}^{M} g_L(\mathbf{x}_i, \theta_m). (88)

For future reference, let us define the mean, variance, covariance (i.e. the connected correlation function in the language of physics), and the normalized correlation coefficient of a single randomized model g_L(x, θ_m):

E_{L,\theta_m}[g_L(\mathbf{x}, \theta_m)] = \mu_{L,\theta_m}(\mathbf{x})
E_{L,\theta_m}[g_L(\mathbf{x}, \theta_m)^2] - E_{L,\theta_m}[g_L(\mathbf{x}, \theta_m)]^2 = \sigma^2_{L,\theta_m}(\mathbf{x})
E_{L,\theta_m}[g_L(\mathbf{x}, \theta_m) g_L(\mathbf{x}, \theta_{m'})] - E_{L,\theta_m}[g_L(\mathbf{x}, \theta_m)]^2 = C_{L,\theta_m,\theta_{m'}}(\mathbf{x})
\rho(\mathbf{x}) = \frac{C_{L,\theta_m,\theta_{m'}}(\mathbf{x})}{\sigma^2_{L,\theta}}. (89)

Note that the expectation E_{L,θ_m}[·] is computed over the joint distribution of L and θ_m. Also, by definition, we assume m ≠ m' in C_{L,θ_m,θ_{m'}}.

We can now ask about the expected generalization (out-of-sample) error for the ensemble,

E_{L,\epsilon,\theta}\left[ C(X, g^A_L(\mathbf{x})) \right] = E_{L,\epsilon,\theta}\left[ \sum_i (y_i - g^A_L(\mathbf{x}_i, \theta))^2 \right]. (90)

As in the single estimator case, we decompose the error into a noise term, a bias term, and a variance term. To see this, note that

E_{L,\epsilon,\theta}[C(X, g^A_L(\mathbf{x}))] = E_{L,\epsilon,\theta}\left[ \sum_i (y_i - f(\mathbf{x}_i) + f(\mathbf{x}_i) - g^A_L(\mathbf{x}_i, \theta))^2 \right]
= \sum_i E_{L,\epsilon,\theta}\left[ (y_i - f(\mathbf{x}_i))^2 + (f(\mathbf{x}_i) - g^A_L(\mathbf{x}_i, \theta))^2 + 2(y_i - f(\mathbf{x}_i))(f(\mathbf{x}_i) - g^A_L(\mathbf{x}_i, \theta)) \right]
= \sum_i \sigma^2_{\epsilon_i} + \sum_i E_{L,\theta}\left[ (f(\mathbf{x}_i) - g^A_L(\mathbf{x}_i, \theta))^2 \right], (91)

where in the last line we have used the fact that E_ε[y_i] = f(x_i) to eliminate the last term. We can further decompose the second term as

E_{L,\theta}[(f(\mathbf{x}_i) - g^A_L(\mathbf{x}_i, \theta))^2] = E_{L,\theta}[(f(\mathbf{x}_i) - E_{L,\theta}[g^A_L(\mathbf{x}_i, \theta)] + E_{L,\theta}[g^A_L(\mathbf{x}_i, \theta)] - g^A_L(\mathbf{x}_i, \theta))^2]
= E_{L,\theta}[(f(\mathbf{x}_i) - E_{L,\theta}[g^A_L(\mathbf{x}_i, \theta)])^2] + E_{L,\theta}[(E_{L,\theta}[g^A_L(\mathbf{x}_i, \theta)] - g^A_L(\mathbf{x}_i, \theta))^2]
+ 2 E_{L,\theta}[(E_{L,\theta}[g^A_L(\mathbf{x}_i, \theta)] - g^A_L(\mathbf{x}_i, \theta))(f(\mathbf{x}_i) - E_{L,\theta}[g^A_L(\mathbf{x}_i, \theta)])]
= (f(\mathbf{x}_i) - E_{L,\theta}[g^A_L(\mathbf{x}_i, \theta)])^2 + E_{L,\theta}[(g^A_L(\mathbf{x}_i, \theta) - E_{L,\theta}[g^A_L(\mathbf{x}_i, \theta)])^2]
\equiv \text{Bias}^2(\mathbf{x}_i) + \text{Var}(\mathbf{x}_i), (92)

where we have defined the bias of an aggregate predictor as

\text{Bias}^2(\mathbf{x}) \equiv (f(\mathbf{x}) - E_{L,\theta}[g^A_L(\mathbf{x}, \theta)])^2 (93)

and the variance as

\text{Var}(\mathbf{x}) \equiv E_{L,\theta}[(g^A_L(\mathbf{x}, \theta) - E_{L,\theta}[g^A_L(\mathbf{x}, \theta)])^2]. (94)

So far the calculation for ensembles is almost identical to that of a single estimator. However, since the aggregate estimator is a sum of estimators, its variance implicitly depends on the correlations between the individual estimators in the ensemble. Using the definition of the aggregate estimator Eq. (88) and the definitions in Eq. (89), we see that

\text{Var}(\mathbf{x}) = E_{L,\theta}[(g^A_L(\mathbf{x}, \theta) - E_{L,\theta}[g^A_L(\mathbf{x}, \theta)])^2]
= \frac{1}{M^2} \sum_{m,m'} E_{L,\theta}[g_L(\mathbf{x}, \theta_m) g_L(\mathbf{x}, \theta_{m'})] - [\mu_{L,\theta}(\mathbf{x})]^2
= \rho(\mathbf{x})\,\sigma^2_{L,\theta} + \frac{1 - \rho(\mathbf{x})}{M}\,\sigma^2_{L,\theta}. (95)


FIG. 28 Why combine models? On the left ("Aggregating different linear hypotheses") we show that by combining simple linear hypotheses (grey lines) one can achieve better and more flexible classifications (dark line), which is in stark contrast to the case in which one only uses a single perceptron hypothesis, as shown on the right ("Linear perceptron hypothesis").

This last formula is the key to understanding the power of random ensembles. Notice that by using large ensembles (M → ∞), we can significantly reduce the variance, and for completely random ensembles where the models are uncorrelated (ρ(x) = 0), we can maximally suppress the variance! Thus, using the aggregate predictor beats down fluctuations due to finite-sample effects. The key, as the formula indicates, is to decorrelate the models as much as possible while still using a very large ensemble. One may worry that this comes at the expense of a very large bias. This turns out not to be the case. When models in the ensemble are completely random, the bias of the aggregate predictor is just the expected bias of a single model,

\text{Bias}^2(\mathbf{x}) = (f(\mathbf{x}) - E_{L,\theta}[g^A_L(\mathbf{x}, \theta)])^2
= \left( f(\mathbf{x}) - \frac{1}{M} \sum_{m=1}^{M} E_{L,\theta}[g_L(\mathbf{x}, \theta_m)] \right)^2 (96)
= (f(\mathbf{x}) - \mu_{L,\theta})^2. (97)

Thus, for a random ensemble one can always add more models without increasing the bias. This observation lies behind the immense power of random forest methods discussed below. For other methods, such as bagging, we will see that the bootstrapping procedure actually does increase the bias. But in many cases, this increase in bias is negligible compared to the reduction in variance.
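Equation (95) can be checked directly by simulation. In the sketch below (our own construction; the jointly Gaussian "predictions" with pairwise correlation ρ are an idealized stand-in for the outputs of M correlated models), the empirical variance of the ensemble average matches the formula:

```python
# Sketch: numerical check of Eq. (95) for an ensemble of M correlated
# "predictions" with pairwise correlation rho and variance sigma^2.
import numpy as np

rng = np.random.RandomState(10)
M, rho, sigma2, trials = 20, 0.3, 1.0, 100000

# covariance matrix: sigma^2 on the diagonal, rho*sigma^2 off-diagonal
cov = sigma2 * (rho * np.ones((M, M)) + (1 - rho) * np.eye(M))
preds = rng.multivariate_normal(np.zeros(M), cov, size=trials)
ensemble = preds.mean(axis=1)              # aggregate predictor, Eq. (88)

print("empirical:", ensemble.var())
print("Eq. (95): ", rho * sigma2 + (1 - rho) * sigma2 / M)
```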

2. Summarizing the Theory and Intuitions behind Ensembles

Before discussing specific methods, let us briefly summarize why ensembles have proven so successful in many ML applications. Dietterich (Dietterich et al., 2000) identifies three distinct shortcomings that are fixed by ensemble methods: statistical, computational, and representational. These are explained in the following discussion from Ref. (Louppe, 2014):

The first reason is statistical. When the learning set is too small, a learning algorithm can typically find several models in the hypothesis space H that all give the same performance on the training data. Provided their predictions are uncorrelated, averaging several models reduces the risk of choosing the wrong hypothesis. The second reason is computational. Many learning algorithms rely on some greedy assumption or local search that may get stuck in local optima. As such, an ensemble made of individual models built from many different starting points may provide a better approximation of the true unknown function than any of the single models. Finally, the third reason is representational. In most cases, for a learning set of finite size, the true function cannot be represented by any of the candidate models in H. By combining several models in an ensemble, it may be possible to expand the space of representable functions and to better model the true function.

The increase in representational power of ensembles can be simply visualized. For example, the classification task shown in Fig. 28 reveals that it is more advantageous to combine a group of simple hypotheses (vertical or horizontal lines) than to utilize a single arbitrary linear classifier. This of course comes at the price of introducing more parameters to our learning procedure. But if the problem itself can never be learned through a simple hypothesis, then there is no reason to avoid applying a more complex model. Since ensemble methods reduce the variance and are often easier to train than a single complex model, they are a powerful way of increasing representational power (also called expressivity in the ML literature).

Our analysis also gives several intuitions for how we should construct ensembles. First, we should try to randomize ensemble construction as much as possible to reduce the correlations between predictors in the ensemble. This ensures that our variance will be reduced while minimizing any increase in bias due to correlated errors. Second, ensembles will work best for procedures where the error of the predictor is dominated by the variance and not the bias. Thus, these methods are especially well suited for unstable procedures whose results are sensitive to small changes in the training dataset.

Finally, we note that although the discussion above was derived in the context of continuous predictors such as regression, the basic intuition behind using ensembles applies equally well to classification tasks. Using an ensemble allows one to reduce the variance by averaging the results of many independent classifiers. As with regression, this procedure works best for unstable predictors


for which errors are dominated by variance due to finite sampling rather than bias.

B. Bagging

BAGGing, or Bootstrap AGGregation, first introduced by Leo Breiman, is one of the most widely employed and simplest ensemble-inspired methods (Breiman, 1996). Imagine we have a very large dataset L that we could partition into M smaller datasets which we label {L_1, . . . , L_M}. If each partition is sufficiently large to learn a predictor, we can create an ensemble aggregate predictor composed of predictors trained on each subset of the data. For continuous predictors like regression, this is just the average of all the individual predictors:

g^A_L(\mathbf{x}) = \frac{1}{M} \sum_{i=1}^{M} g_{L_i}(\mathbf{x}). (98)

For classification tasks where each predictor predicts a class label j ∈ {1, . . . , J}, this is just a majority vote of all the predictors,

g^A_L(\mathbf{x}) = \arg\max_j \sum_{i=1}^{M} I[g_{L_i}(\mathbf{x}) = j], (99)

where I[g_{L_i}(x) = j] is an indicator function that is equal to one if g_{L_i}(x) = j and zero otherwise. From the theoretical discussion above, we know that this can significantly reduce the variance without increasing the bias.

While simple and intuitive, this form of aggregation clearly works only when we have enough data in each partitioned set L_i. To see this, one can consider the extreme limit where L_i contains exactly one point. In this case, the base hypothesis g_{L_i}(x) (e.g. a linear regressor) becomes extremely poor and the procedure above fails. One way to circumvent this shortcoming is to resort to empirical bootstrapping, a resampling technique in statistics introduced by Efron (Efron, 1979) (see the accompanying box and Fig. 29). The idea of empirical bootstrapping is to use sampling with replacement to create new "bootstrapped" datasets L^{BS}_1, . . . , L^{BS}_M from our original dataset L. These bootstrapped datasets share many points, but due to the sampling with replacement, are all somewhat different from each other. In the bagging procedure, we create an aggregate estimator by replacing the M independent datasets by the M bootstrapped estimators:

g^{BS}_L(\mathbf{x}) = \frac{1}{M} \sum_{i=1}^{M} g_{L^{BS}_i}(\mathbf{x}) (100)

and

g^{BS}_L(\mathbf{x}) = \arg\max_j \sum_{i=1}^{M} I[g_{L^{BS}_i}(\mathbf{x}) = j]. (101)

This bootstrapping procedure allows us to construct an approximate ensemble and thus reduce the variance. For unstable predictors, this can significantly improve the predictive performance. The price we pay for using bootstrapped training datasets, as opposed to really partitioning the dataset, is an increase in the bias of our bagged estimators. To see this, note that as the number of datasets M goes to infinity, the expectation with respect to the bootstrapped samples converges to the empirical distribution describing the training dataset, p_L(x) (e.g. a delta function at each datapoint in L), which in general is different from the true generative distribution for the data, p(x).

In Fig. 30 we demonstrate bagging with a perceptron (linear classifier) as the base classifier that constitutes the elements of the ensemble. It is clear that, although each individual classifier in the ensemble performs poorly at classification, bagging these estimators yields reasonably good predictive performance. This raises questions such as why bagging works and how many bootstrap samples are needed. As mentioned in the discussion above, bagging is effective for "unstable" learning algorithms where small changes in the training set result in large changes in predictions (Breiman, 1996). When the procedure is unstable, the prediction error is dominated by the variance and one can exploit the aggregation component of bagging to reduce the prediction error. In contrast, for a stable procedure the accuracy is limited by the bias introduced by using bootstrapped datasets. This means that there is an instability-to-stability transition point beyond which bagging stops improving our prediction.
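As a minimal sketch of the bagging procedure of Eqs. (100)-(101) (our own illustration: an unregularized decision tree is used as the unstable base classifier instead of the perceptron of Fig. 30, and the dataset is synthetic):

```python
# Sketch: bagging an unstable base classifier (a deep decision tree)
# with scikit-learn's BaggingClassifier (bootstrap-and-aggregate).
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)

single = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                           random_state=0)
print("single tree:", cross_val_score(single, X, y, cv=5).mean())
print("bagged     :", cross_val_score(bagged, X, y, cv=5).mean())
```

Since the deep tree's error is dominated by variance, the bagged ensemble typically shows a clear improvement in cross-validated accuracy over the single tree.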


FIG. 29 Shown here is the procedure of empirical bootstrapping. The goal is to assess the accuracy of a statistical quantity of interest, which in the main text is illustrated by the sample median M_n(D). We start from a given dataset D and bootstrap B size-n datasets D^{*(1)}, · · · , D^{*(B)}, called the bootstrap samples. Then we compute the statistical quantity of interest on these bootstrap samples to get the medians M_n^{*(k)}, for k = 1, · · · , B. These are then used to evaluate the accuracy of M_n(D) (see also the box on Bootstrapping in the main text). It can be shown that in the n → ∞ limit the distribution of M_n^{*(k)} is a Gaussian centered around M_n(D), with a variance σ², defined by Eq. (102), that scales as 1/n.

Brief Introduction to Bootstrapping

Suppose we are given a finite set of n data points D = {X_1, · · · , X_n} as training samples, and our job is to construct measures of confidence for our sample estimates (e.g. the confidence interval or mean-squared error of a sample median estimator). To do so, one first samples n points with replacement from D to get a new set D^{*(1)} = {X_1^{*(1)}, · · · , X_n^{*(1)}}, called a bootstrap sample, which possibly contains repeated elements. Then we repeat the same procedure to get in total B such sets: D^{*(1)}, · · · , D^{*(B)}. The next step is to use these B bootstrap sets to get the bootstrap estimate of the quantity of interest. For example, let M_n^{*(k)} = Median(D^{*(k)}) be the sample median of the bootstrap data D^{*(k)}. Then we can construct the variance of the distribution of bootstrap medians as:

\text{Var}_B(M_n) = \frac{1}{B - 1} \sum_{k=1}^{B} \left( M_n^{*(k)} - \bar{M}_n^* \right)^2, (102)

where

\bar{M}_n^* = \frac{1}{B} \sum_{k=1}^{B} M_n^{*(k)} (103)

is the mean of the medians of all bootstrap samples. Specifically, Bickel and Freedman (Bickel and Freedman, 1981) and Singh (Singh, 1981) showed that in the n → ∞ limit, the distribution of the bootstrap estimate will be a Gaussian centered around M_n(D) = Median(X_1, · · · , X_n) with standard deviation proportional to 1/√n. This means that the bootstrap distribution M_n^* − M_n approximates fairly well the sampling distribution M_n − M from which we obtain the training data D. Note that M is the median of the true distribution from which D is generated. In other words, if we plot the histogram of {M_n^{*(k)}}_{k=1}^{B}, we will see that in the large-n limit it is well fitted by a Gaussian which peaks sharply at M_n(D) and has vanishing variance, as given by Eq. (102) (see Fig. 29).

C. Boosting

Another powerful and widely used ensemble method is Boosting. In bagging, the contribution of all predictors is weighted equally in the bagged (aggregate) predictor. However, in principle, there are myriad ways to combine different predictors. In some problems one might prefer to use an autocratic approach that emphasizes the best predictors, while in others it might be better to opt for



FIG. 30 Bagging applied to the perceptron learning algorithm (PLA). Training data size n = 500; number of bootstrap datasets B = 25, each containing 50 points. Color corresponds to class, while the marker indicates how the points are labelled: a cross for the true label and a circle for the label obtained by bagging. Each gray dashed line indicates the prediction made based on one bootstrap set, while the dark dashed black line is the average of these.

more 'democratic' ways, as is done in bagging. In all cases, the idea is to build a strong predictor by combining many weaker classifiers.

In boosting, an ensemble of weak classifiers {g_k(x)} is combined into an aggregate, boosted classifier. However, unlike bagging, each classifier is associated with a weight α_k that indicates how much it contributes to the aggregate classifier,

g_A(\mathbf{x}) = \sum_{k=1}^{M} \alpha_k g_k(\mathbf{x}), (104)

where \sum_k \alpha_k = 1. For the reasons outlined above, boosting, like all ensemble methods, works best when we combine simple, high-variance classifiers into a more complex whole.

Here, we focus on "adaptive boosting", or AdaBoost, first proposed by Freund and Schapire in the mid 1990s (Freund et al., 1999; Freund and Schapire, 1995; Schapire and Freund, 2012). The basic idea behind AdaBoost is to form the aggregate classifier in an iterative process. Importantly, at each iteration we reweight the error function to "highlight" the data points where the aggregate classifier performs poorly (so that in the next round the procedure puts more emphasis on getting those right). In this way, we can successively ensure that our classifier has good performance over the whole dataset.

We now discuss the AdaBoost procedure in greater detail. Suppose that we are given a dataset L = {(x_i, y_i), i = 1, · · · , N} where x_i ∈ X and y_i ∈ Y = {+1, −1}. Our objective is to find an optimal hypothesis/classifier g : X → Y to classify the data. Let H = {g : X → Y} be the family of classifiers available in our ensemble. In the AdaBoost setting, we are concerned with classifiers that perform somewhat better than "tossing a fair coin": each classifier in the family H predicts y_i correctly at least half of the time.

We construct the boosted classifier as follows:

• Initialize w_{t=1}(x_n) = 1/N, n = 1, · · · , N.

• For t = 1, · · · , T (desired termination step), do:

1. Select a hypothesis g_t ∈ H that minimizes the weighted error

\epsilon_t = \sum_{i=1}^{N} w_t(\mathbf{x}_i) \mathbb{1}(g_t(\mathbf{x}_i) \neq y_i) (105)

2. Let \alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t} and update the weight of each datapoint x_n by

w_{t+1}(\mathbf{x}_n) \leftarrow w_t(\mathbf{x}_n) \frac{\exp[-\alpha_t y_n g_t(\mathbf{x}_n)]}{Z_t},

where Z_t = \sum_{n=1}^{N} w_t(\mathbf{x}_n) e^{-\alpha_t y_n g_t(\mathbf{x}_n)} ensures all weights add up to unity.

• Output g_A(\mathbf{x}) = \text{sign}\left( \sum_{t=1}^{T} \alpha_t g_t(\mathbf{x}) \right)

There are many theoretical and empirical studies of the performance of AdaBoost, but they are beyond the scope of this review. We refer interested readers to the extensive literature on boosting (Freund et al., 1999).
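A minimal sketch of AdaBoost in practice (our own illustration, using scikit-learn's AdaBoostClassifier with decision "stumps" as the weak hypotheses g_t; the exact weighting variant used internally depends on the library version, and the dataset here is synthetic):

```python
# Sketch: AdaBoost with depth-1 decision trees ("stumps") as the
# weak classifiers g_t, via scikit-learn.
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # weak "stump" hypotheses
    n_estimators=100)
print("AdaBoost accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```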

D. Random Forests


FIG. 31 Example of a decision tree. For an input observation x, its label y is predicted by traversing the tree from the root all the way down to a leaf, following the branches it satisfies.

We now briefly review one of the most widely used and versatile algorithms in data science and machine learning, Random Forests (RF). Random Forests is an ensemble


method widely deployed for complex classification tasks. A random forest is composed of a family of (randomized) tree-based classifiers called decision trees (discussed below). Decision trees are high-variance, weak classifiers that can be easily randomized and, as such, are ideally suited for ensemble-based methods. Below, we give a brief, high-level introduction to these ideas.

A decision tree uses a series of questions to hierarchically partition the data. Each branch of the decision tree consists of a question that splits the data into smaller subsets (e.g. is some feature larger than a given number? See Fig. 31), with the leaves (end points) of the tree corresponding to the ultimate partitions of the data. When using decision trees for classification, the goal is to construct trees such that the partitions are informative about the class label (see Fig. 31). It is clear that more complex decision trees lead to finer partitions that give improved performance on the training set. However, this generally leads to over-fitting^10, limiting the out-of-sample performance. For this reason, in practice almost all decision trees use some form of regularization (e.g. a maximum depth for the tree) to control complexity and reduce overfitting. Decision trees also have extremely high variance, and are often extremely sensitive to many details of the training data. This is not surprising since decision trees are learned by partitioning the training data. Therefore, individual decision trees are weak classifiers. However, these same properties make them ideal for incorporation in an ensemble method.

In order to create an ensemble of decision trees, we must introduce a randomization procedure. As discussed above, the power of ensembles to reduce variance only manifests when randomness reduces correlations between the classifiers within the ensemble. Randomness is usually introduced into random forests in one of three distinct ways. The first is to use bagging and simply “bag” the decision trees by training each decision tree on a different bootstrapped dataset (Breiman, 2001). Strictly speaking, this procedure does not constitute a random forest but rather a bagged decision tree. The second procedure is to use only a different random subset of the features at each split in the tree. This “feature bagging” is the distinguishing characteristic of random forests (Breiman, 2001; Ho, 1998). Using feature bagging reduces correlations between decision trees that can arise when only a few features are strongly predictive of the class label. Finally, extremized random forests (ERFs) combine ordinary and feature bagging with an extreme randomization procedure where splitting is done randomly instead of using optimality criteria (see Refs. (Geurts et al., 2006; Louppe, 2014) for details). Even

10 One extreme limit is an n-node tree, with n being the number of data points in the given dataset.

FIG. 32 Classifying the Iris dataset with aggregation models, following the scikit-learn tutorial. This dataset seeks to classify iris flowers into three types (labeled in red, blue, or yellow) based on a measurement of four features: sepal length, sepal width, petal length, and petal width. To visualize the decision surface, we trained classifiers using only two of the four potential features (e.g. sepal length and sepal width). Each row corresponds to a different subset of two features, and the columns to a Decision Tree with 10-fold CV (first column), a Random Forest with 30 trees and 10-fold CV (second column), and AdaBoost with 30 base hypotheses (third column). The learned decision surface is highlighted by color shades. See the corresponding tutorial for more details (Pedregosa et al., 2011).

though this reduces the predictive power of each individual decision tree, it still often improves the predictive power of the ensemble because it dramatically reduces correlations between members and prevents overfitting.

Examples of the kinds of decision surfaces found by decision trees, random forests, and AdaBoost are shown in Fig. 32. We invite the reader to check out the corresponding scikit-learn tutorial for more details of how these are implemented in Python (Pedregosa et al., 2011).
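For readers who would like a minimal starting point before opening the tutorial, the sketch below (our own; the ensemble sizes mirror the columns of Fig. 32) scores the three model classes on the full four-feature Iris data with 10-fold cross-validation:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # A single high-variance decision tree versus ensembles of 30 members.
    models = [
        ("decision tree", DecisionTreeClassifier()),
        ("random forest", RandomForestClassifier(n_estimators=30)),
        ("AdaBoost", AdaBoostClassifier(n_estimators=30)),
    ]
    for name, model in models:
        scores = cross_val_score(model, X, y, cv=10)  # 10-fold CV
        print(name, round(scores.mean(), 3))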

There are many different types of decision trees and training procedures. A full discussion of decision trees (and random forests) lies beyond the scope of this review, and we refer readers to the extensive literature on these topics (Lim et al., 2000; Loh, 2011; Louppe, 2014). Recently, decision trees were applied in high-energy physics to learn non-Higgsable gauge groups (Wang and Zhang, 2018).

E. Gradient Boosted Trees and XGBoost

Before we turn to applications of these techniques, we briefly discuss one final class of ensemble methods that has become increasingly popular in the last few years: Gradient-Boosted Trees (Chen and Guestrin, 2016; Friedman, 2001). The basic idea of gradient-boosted trees is to use intuition from boosting and gradient descent (in particular Newton’s method, see Sec. IV) to construct


ensembles of decision trees. Like in boosting, the ensembles are created by iteratively adding new decision trees to the ensemble. In gradient-boosted trees, one critical component is a cost function that measures the performance of our ensemble. At each step, we compute the gradient of the cost function with respect to the predicted value of the ensemble and add trees that move us in the direction of the gradient. Of course, this requires a clever way of mapping gradients to decision trees. We give a brief overview of how this is done within XGBoost (Extreme Gradient Boosting), which has recently been applied to classify and rank transcription factor binding in DNA sequences (Li et al., 2018). Below, we follow closely the XGBoost tutorial.

Our starting point is a clever parametrization of decision trees. Here, we use notation in which the decision tree makes continuous predictions (regression trees), though this can also easily be generalized to classification tasks. We parametrize a decision tree j, denoted as g_j(x), with T leaves by two quantities: a function q(x) that maps each data point to one of the leaves of the tree, q : x ∈ R^d → {1, 2, . . . , T}, and a weight vector w ∈ R^T that assigns a predicted value to each leaf. In other words, the j-th decision tree’s prediction for the datapoint x_i is simply g_j(x_i) = w_{q(x_i)}.

In addition to a parametrization of decision trees, we also have to specify a cost function which measures predictions. The prediction of our ensemble for a datapoint (y_i, x_i) is given by

ŷ_i = g_A(x_i) = Σ_{j=1}^{M} g_j(x_i),  g_j ∈ F,   (106)

where g_j(x_i) is the prediction of the j-th decision tree on datapoint x_i, M is the number of members of the ensemble, and F = {g(x) = w_{q(x)}} is the space of trees. As discussed in the context of random forests above, without regularization decision trees tend to overfit the data by dividing it into smaller and smaller partitions. For this reason, our cost function is generally composed of two terms: a term that measures the goodness of predictions on each datapoint, l_i(y_i, ŷ_i), which is assumed to be differentiable and convex, and, for each tree in the ensemble, a regularization term Ω(g_j) that does not depend on the data:

C(X, g_A) = Σ_{i=1}^{N} l(y_i, ŷ_i) + Σ_{j=1}^{M} Ω(g_j),   (107)

where the index i runs over data points and the index j runs over decision trees in our ensemble. In XGBoost, the regularization function is chosen to be

Ω(g) = γT + (λ/2) ||w||_2^2,   (108)

with γ and λ regularization parameters that must be chosen appropriately. Notice that this regularization penalizes both large weights on the leaves (similar to L2-regularization) and having large partitions with many leaves.

As in boosting, we form the ensemble iteratively. For this reason, we define a family of predictors ŷ_i^(t) as

ŷ_i^(t) = Σ_{j=1}^{t} g_j(x_i) = ŷ_i^(t−1) + g_t(x_i).   (109)

Note that by definition ŷ_i^(M) = g_A(x_i). The central idea is that for large t, each decision tree is a small perturbation to the predictor (of order 1/T), and hence we can perform a Taylor expansion of our loss function to second order:

C_t = Σ_{i=1}^{N} l(y_i, ŷ_i^(t−1) + g_t(x_i)) + Ω(g_t) ≈ C_{t−1} + ∆C_t,   (110)

with

∆C_t = Σ_{i=1}^{N} [ a_i g_t(x_i) + ½ b_i g_t(x_i)^2 ] + Ω(g_t),   (111)

where

a_i = ∂_{ŷ_i^(t−1)} l(y_i, ŷ_i^(t−1)),   (112)
b_i = ∂²_{ŷ_i^(t−1)} l(y_i, ŷ_i^(t−1)).   (113)

We then choose the t-th decision tree g_t to minimize ∆C_t. This is almost identical to how we derived the Newton method update in the section on gradient descent, see Sec. IV.

We can actually derive an expression for the parameters of g_t that minimize ∆C_t analytically. To simplify notation, it is useful to define the set of points x_i that get mapped to leaf j, I_j = {i : q_t(x_i) = j}, and the functions B_j = Σ_{i∈I_j} b_i and A_j = Σ_{i∈I_j} a_i. Notice that in terms of these quantities, we can write

∆C_t = Σ_{j=1}^{T} [ A_j w_j + ½ (B_j + λ) w_j^2 ] + γT,   (114)

where we have made the t-dependence of all parameters implicit. Note that λ comes from the regularization term Ω(g_t) through Eq. (108). To find the optimal w_j, just as in Newton’s method, we take the gradient of the above expression with respect to w_j and set it equal to zero, to get

w_j^opt = − A_j / (B_j + λ).   (115)

Plugging this expression into ∆C_t gives

∆C_t^opt = −½ Σ_{j=1}^{T} A_j^2 / (B_j + λ) + γT.   (116)


It is clear that ∆C_t^opt measures the in-sample performance of g_t, and we should find the decision tree that minimizes this value. In principle, one could enumerate all possible trees over the data and find the tree that minimizes ∆C_t^opt. However, in practice this is impossible. Instead, an approximate greedy algorithm is run that optimizes one level of the tree at a time by trying to find optimal splits of the data. This leads to a tree that is a good local minimum of ∆C_t^opt, which is then added to the ensemble. We emphasize that this is only a very high-level sketch of how the algorithm works. In practice, additional regularization such as shrinkage (Friedman, 2002) and feature subsampling (Breiman, 2001; Friedman et al., 2003) is also used. In addition, there are many numerical and technical tricks used for the approximate algorithm and for finding splits of the data that give good decision trees (Chen and Guestrin, 2016).
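To connect the derivation above to practice, here is a minimal sketch (ours, on synthetic data; the SUSY application itself appears below and in Notebook 10) using the scikit-learn interface of the xgboost package. The parameters reg_lambda and gamma correspond to λ and γ in Eq. (108), and learning_rate implements shrinkage; all values shown are illustrative, not tuned:

    import xgboost as xgb
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=5000, n_features=18, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = xgb.XGBClassifier(
        n_estimators=200,    # M: trees added one at a time
        max_depth=4,         # bounds the number of leaves T per tree
        learning_rate=0.1,   # shrinkage applied to each newly added tree
        reg_lambda=1.0,      # lambda in Eq. (108)
        gamma=0.0,           # gamma in Eq. (108): cost per additional leaf
    )
    model.fit(X_train, y_train)
    print("test accuracy:", model.score(X_test, y_test))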

F. Applications to the Ising model and Supersymmetry Datasets

We now illustrate some of these ideas using two examples drawn from physics: (i) classifying the phases of the spin configurations of the 2D Ising model above and below the critical temperature using random forests, and (ii) classifying Monte-Carlo simulations of collision events in the SUSY dataset as supersymmetric or standard using an XGBoost implementation of gradient-boosted trees. Both examples were analyzed in Sec. VII.C using logistic regression. Here we show that on the Ising dataset the RFs perform significantly better than logistic regression models, whereas gradient-boosted trees seem to yield an accuracy of about 80%, comparable to published results. The two accompanying Jupyter notebooks discuss practical details of implementing these examples, and readers are encouraged to experiment with the notebooks.

The Ising dataset used for classification by RFs here is identical to that used to study logistic regression in Sec. VII.C. We assign a label to each state according to its phase: 0 if the state is disordered, and 1 if it is ordered. We divide the dataset into three categories according to the temperature at which samples are drawn: ordered (T/J < 2.0), near-critical (2.0 ≤ T/J ≤ 2.5), and disordered (T/J > 2.5) (see Figure 20). We use the ordered and disordered states to train a random forest and evaluate our learned model on a test set of unseen ordered and disordered states (test sets). We also ask how well our RF can predict the phase of samples drawn in the critical region (i.e. predict whether the temperature of a critical sample is above or below the critical temperature). Since our model is never trained on samples in the critical region, prediction in this region is a test of the algorithm’s ability to generalize to new regions in phase space.

The results of fits using RFs to predict phases are


FIG. 33 Using Random Forests (RFs) to classify Ising phases. (Top) Accuracy of RFs for classifying the phase of samples from the Ising model for the training set (blue), test set (red), and critical region (green), using coarse decision trees with a few leaves (triangles) and fine decision trees with many leaves (filled circles). RFs were trained on samples from the ordered and disordered phases but not on samples from the critical region. (Bottom) The time it takes to train RFs scales linearly with the number of estimators in the ensemble. In the upper panel, note that the train case (blue) overlaps with the test case (red). Here ‘coarse’ and ‘fine’ refer to trees with 2 and 10,000 leaves, respectively. For implementation details, see Jupyter notebook 9.

shown in Figure 33. We used two types of RF classifiers, one where the ensemble consists of coarse decision trees with a few leaves and another with finer decision trees with many leaves (see the corresponding notebook). RFs have extremely high accuracy on the training and test sets (over 99%) for both coarse and fine trees. However, notice that the RF consisting of coarse trees performs extremely poorly on samples from the critical region, whereas the RF with fine trees classifies critical samples with an accuracy of nearly 85%. Interestingly, and unlike with logistic regression, this performance in the critical region requires almost no parameter tuning. This is because, as discussed above, RFs are largely immune to overfitting problems even as the number of estimators in the ensemble becomes large. Increasing the number of estimators in the ensemble does increase performance, but at a large cost in computational time (Fig. 33, bottom).

In the second application of ensemble methods to



FIG. 34 Feature importance scores in the SUSY dataset from applying XGBoost to 100,000 samples. See Notebook 10 for more details.

physics-related datasets, we used the XGBoost implementation of gradient-boosted trees to classify Monte-Carlo collisions from the SUSY dataset. With default parameters using a small subset of the data (100,000 out of the full 5,000,000 samples), we were able to achieve a classification accuracy of about 79%, which could be improved to nearly 80% after some fine-tuning (see the accompanying notebook). This is comparable to published results (Baldi et al., 2014) and to those obtained using logistic regression in earlier chapters. One nice feature of ensemble methods such as XGBoost is that they automatically allow us to calculate feature scores (F-scores) that rank the importance of various features for classification. The higher the F-score, the more important the feature for classification. Figure 34 shows the feature scores from our XGBoost algorithm for the production of electrically-charged supersymmetric particles (χ±), which decay to W bosons and an electrically neutral supersymmetric particle χ0 that is invisible to the detector. The features are a mix of eight directly measurable quantities from the detector, as well as ten hand-crafted features chosen using physics knowledge. Consistent with the physics of these supersymmetric decays in the lepton channel, we find that the most informative features for classification are the missing transverse energy along the vector defined by the charged leptons (axial MET) and the missing energy magnitude due to χ0.

IX. AN INTRODUCTION TO FEED-FORWARD DEEP NEURAL NETWORKS (DNNS)

Over the last decade, neural networks have emerged as one of the most powerful and widely used supervised learning techniques. Deep Neural Networks (DNNs) have a long history (Bishop, 1995b; Schmidhuber, 2015), but re-emerged to prominence after a rebranding as “Deep

Learning” in the mid 2000s (Hinton et al., 2006; Hinton and Salakhutdinov, 2006). DNNs truly caught the attention of the wider machine learning community and industry in 2012 when Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton used a GPU-based DNN model (AlexNet) to lower the error rate on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) by an incredible twelve percent, from 28% to 16% (Krizhevsky et al., 2012). Just three years later, a machine learning group from Microsoft achieved an error of 3.57% using an ultra-deep residual neural network (ResNet) with 152 layers (He et al., 2016)! Since then, DNNs have become the workhorse technique for many image- and speech-recognition-based machine learning tasks. The large-scale industrial deployment of DNNs has given rise to a number of high-level libraries and packages (Caffe, Keras, Pytorch, TensorFlow, etc.) that make it easy to quickly code and deploy DNNs.

Conceptually, it is helpful to divide neural networks into four categories: (i) general purpose neural networks for supervised learning, (ii) neural networks designed specifically for image processing, the most prominent example of this class being Convolutional Neural Networks (CNNs), (iii) neural networks for sequential data such as Recurrent Neural Networks (RNNs), and (iv) neural networks for unsupervised learning such as Deep Boltzmann Machines. Here, we will limit our discussion to the first two categories (unsupervised learning is discussed later in the review). Though increasingly important for many applications such as audio and speech recognition, for the sake of brevity we omit a discussion of sequential data and RNNs from this review. For an introduction to RNNs and LSTM networks, see Chris Olah’s blog, https://colah.github.io/posts/2015-08-Understanding-LSTMs/, Chapter 13 of (Bishop, 2006), as well as the introduction to RNNs in Chapter 10 of (Goodfellow et al., 2016) for sequential data.

Due to the number of recent books on deep learning (see for example Michael Nielsen’s introductory online book (Nielsen, 2015) and the more advanced (Goodfellow et al., 2016)), the goal of this section is to give a high-level introduction to the basic ideas behind DNNs, and to provide some practical knowledge for coding simple neural nets for supervised learning tasks (see the accompanying Notebooks). This section assumes the reader is familiar with the basic concepts introduced in earlier sections on logistic and linear regression. Throughout, we strive to provide intuition behind the inner workings of DNNs, as well as to highlight limitations of present-day algorithms.

The influx of corporate and industrial interest has rapidly transformed the field in the last few years. This massive influx of money and researchers has given rise to new dogmas and best practices that change rapidly. As with most intellectual fields experiencing rapid expansion, many commonly accepted heuristics may turn out


not to be as powerful as thought (Wilson et al., 2017), and widely held beliefs not as universal as once imagined (Lee et al., 2017; Zhang et al., 2016). This is especially true in modern neural networks, where results are largely empirical and heuristic and lack the firm footing of many earlier machine learning methods. For this reason, in this review we have chosen to emphasize tried and true fundamentals, while pointing out what, from our current vantage point, seem like promising new techniques. The field is rapidly evolving, and readers are urged to read papers and to implement these algorithms themselves in order to gain a deep appreciation for the incredible power of modern neural networks, especially in the context of image, speech, and natural language processing, as well as the limitations of the current methods.

In physics, DNNs and CNNs have already found numerous applications. In statistical physics, they have been applied to detect phase transitions in 2D Ising (Tanaka and Tomiya, 2017a) and Potts (Li et al., 2017) models, lattice gauge theories (Wetzel and Scherzer, 2017), and different phases of polymers (Wei et al., 2017). It has also been shown that deep neural networks can be used to learn free-energy landscapes (Sidky and Whitmer, 2017). At the same time, methods from statistical physics have been applied to the field of deep learning to study the thermodynamic efficiency of learning rules (Goldt and Seifert, 2017), to explore the hypothesis space that DNNs span, to make analogies between training DNNs and spin glasses (Baity-Jesi et al., 2018; Baldassi et al., 2017), and to characterize phase transitions with respect to network topology in terms of errors (Li and Saad, 2017). In relativistic hydrodynamics, deep learning has been shown to capture features of non-linear evolution and has the potential to accelerate numerical simulations (Huang et al., 2018), while in mechanics CNNs have been used to predict eigenvalues of photonic crystals (Finol et al., 2018). Deep CNNs were employed in lensing reconstruction of the cosmic microwave background (Caldeira et al., 2018). Recently, DNNs have been used to improve the efficiency of Monte-Carlo algorithms (Shen et al., 2018).

Deep learning has also found interesting applications in quantum physics. Various quantum phase transitions (Arai et al., 2017; Broecker et al., 2017; Iakovlev et al., 2018; van Nieuwenburg et al., 2017b; Suchsland and Wessel, 2018) can be detected and studied using DNNs and CNNs, including the transverse-field Ising model (Ohtsuki and Ohtsuki, 2017), topological phases (Yoshioka et al., 2017; Zhang et al., 2017a,b), and non-invasive topological quality control (Caio et al., 2019). DNNs found applications even in non-equilibrium many-body localization (van Nieuwenburg et al., 2017a,b; Schindler et al., 2017; Venderley et al., 2017), and the characterization of photoexcited quantum states (Shinjo et al., 2019). Experimentally, DNNs were recently employed in cold atoms to identify critical points (Rem

et al., 2018). Representing quantum states as DNNs (Gao et al., 2017; Gao and Duan, 2017; Levine et al., 2017; Saito and Kato, 2017) and quantum state tomography (Torlai et al., 2018) are among some of the impressive achievements that reveal the potential of deep learning to facilitate the study of quantum systems. Machine learning techniques involving neural networks were also used to study quantum and fault-tolerant error correction (Baireuther et al., 2017; Breuckmann and Ni, 2017; Chamberland and Ronagh, 2018; Davaasuren et al., 2018; Krastanov and Jiang, 2017; Maskara et al., 2018), to estimate rates of coherent and incoherent quantum processes (Greplova et al., 2017), to obtain spectra of 1/f-noise in spin-qubit devices (Zhang and Wang, 2018), and for the recognition of state and charge configurations and auto-tuning in quantum dots (Kalantre et al., 2017). In quantum information theory, it has been shown that one can perform gate decompositions with the help of neural nets (Swaddle et al., 2017). In lattice quantum chromodynamics, DNNs have been used to learn action parameters in regions of parameter space where principal component analysis fails (Shanahan et al., 2018). CNNs were applied to data from a high-energy experiment to identify particle interactions in sampling calorimeters used commonly in neutrino physics (Aurisano et al., 2016). Last but not least, DNNs have also found a place in the study of quantum control (Yang et al., 2017), and in scattering theory to learn the s-wave scattering length (Wu et al., 2018) of potentials.

A. Neural Network Basics

Neural networks (also called neural nets) are neural-inspired nonlinear models for supervised learning. As we will see, neural nets can be viewed as natural, more powerful extensions of supervised learning methods such as linear and logistic regression and soft-max methods.

1. The basic building block: neurons

The basic unit of a neural net is a stylized “neuron” i that takes a vector of d input features x = (x_1, x_2, . . . , x_d) and produces a scalar output a_i(x). A neural network consists of many such neurons stacked into layers, with the output of one layer serving as the input for the next (see Figure 35). The first layer in the neural net is called the input layer, the middle layers are often called “hidden layers”, and the final layer is called the output layer.

The exact function a_i varies depending on the type of non-linearity used in the neural network. However, in essentially all cases a_i can be decomposed into a linear operation that weights the relative importance of the various inputs, and a non-linear transformation σ_i(z), which



FIG. 35 Basic architecture of neural networks. (A) The basic components of a neural network are stylized neurons consisting of a linear transformation that weights the importance of various inputs, followed by a non-linear activation function. (B) Neurons are arranged into layers, with the output of one layer serving as the input to the next layer.

is usually the same for all neurons. The linear transformation in almost all neural networks takes the form of a dot product with a set of neuron-specific weights w^(i) = (w_1^(i), w_2^(i), . . . , w_d^(i)), followed by re-centering with a neuron-specific bias b^(i):

z^(i) = w^(i) · x + b^(i) = x^T · w^(i),   (117)

where x = (1, x) and w^(i) = (b^(i), w^(i)). In terms of z^(i) and the non-linear function σ_i(z), we can write the full input-output function as

a_i(x) = σ_i(z^(i)),   (118)

see Figure 35.

Historically in the neural network literature, common choices of nonlinearities included step functions (perceptrons), sigmoids (i.e. Fermi functions), and the hyperbolic tangent. More recently, it has become more common to use rectified linear units (ReLUs), leaky rectified linear units (leaky ReLUs), and exponential linear units (ELUs) (see Figure 36). Different choices of non-linearities lead to different computational and training properties for neurons. The underlying reason for this is that we train neural nets using gradient-descent-based methods, see Sec. IV, that require us to take derivatives of the neural input-output function with respect to the weights w^(i) and the bias b^(i).

Notice that the derivatives of the aforementioned non-linearities σ(z) have very different properties. The derivative of the perceptron is zero everywhere except where the input is zero. This discontinuous behavior makes it impossible to train perceptrons using gradient descent. For this reason, until recently the most popular choice of non-linearity was the tanh function or a sigmoid/Fermi function. However, this choice of non-linearity has a major drawback. When the input weights become large, as they often do in training, the activation function saturates and the derivative of the output with respect to the weights tends to zero, since ∂σ/∂z → 0 for z ≫ 1. Such “vanishing gradients” are a feature of any saturating activation function (top row of Fig. 36), making it harder to train deep networks. In contrast, for a non-saturating activation function such as ReLUs or ELUs, the gradients stay finite even for large inputs.
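The saturation effect is easy to check numerically. The short sketch below (our own illustration) evaluates the derivatives of a sigmoid and of a ReLU at increasingly large inputs:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def d_sigmoid(z):
        # sigma'(z) = sigma(z)(1 - sigma(z)) vanishes for large |z|
        s = sigmoid(z)
        return s * (1.0 - s)

    def d_relu(z):
        # derivative of max(0, z): stays equal to 1 for all z > 0
        return (z > 0).astype(float)

    z = np.array([0.0, 2.0, 5.0, 10.0])
    print(d_sigmoid(z))  # ~ 0.25, 0.10, 0.0066, 4.5e-05 -> vanishing gradient
    print(d_relu(z))     # 0., 1., 1., 1. -> gradient stays finite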

2. Layering neurons to build deep networks: network architecture.

The basic idea of all neural networks is to layer neurons in a hierarchical fashion, the general structure of which is known as the network architecture (see Fig. 35). In the simplest feed-forward networks, each neuron in the input layer takes the inputs x and produces an output a_i(x) that depends on its current weights, see Eq. (118). The outputs of the input layer are then treated as the inputs to the next hidden layer. This is usually repeated several times until one reaches the top or output layer. The output layer is almost always a simple classifier of the form discussed in earlier sections: a logistic regression or soft-max function in the case of categorical data (i.e. discrete labels), or a linear regression layer in the case of continuous outputs. Thus, the whole neural network can be thought of as a complicated nonlinear transformation of the inputs x into an output y that depends on the weights and biases of all the neurons in the input, hidden, and output layers.

The use of hidden layers greatly expands the representational power of a neural net when compared with a simple soft-max or linear regression network. Perhaps the most formal expression of the increased representational power of neural networks (also called their expressivity) is the universal approximation theorem, which states that a neural network with a single hidden layer can approximate any continuous, multi-input/multi-output function with arbitrary accuracy. The reader is strongly urged to read the beautiful graphical proof of the theorem in Chapter 4 of Nielsen’s free online book (Nielsen, 2015). The basic idea behind the proof is that hidden neurons allow neural networks to generate step functions with arbitrary offsets and heights. These can then be added together to approximate arbitrary functions. The proof also makes clear that the more complicated a function, the more hidden units (and free parameters) are needed to approximate it. Hence, the applicability of the approximation theorem to practical situations should not be overemphasized. In condensed matter physics, a good analogy are matrix product states, which can approxi-


[Fig. 36 panels, left to right: Perceptron, Sigmoid, Tanh (top row); ReLU, Leaky ReLU, ELU (bottom row), each plotted over the input range −5 to 5.]

FIG. 36 Possible non-linear activation functions for neurons. In modern DNNs, it has become common to use non-linear functions that do not saturate for large inputs (bottom row) rather than saturating functions (top row).

mate any quantum many-body state to an arbitrary accuracy, provided the bond dimension can be increased arbitrarily – a severe requirement not met in any useful practical implementation of the theory.

Modern neural networks generally contain multiple hidden layers (hence the “deep” in deep learning). There are many ideas about why such deep architectures are favorable for learning. Increasing the number of layers increases the number of parameters and hence the representational power of neural networks. Indeed, recent numerical experiments suggest that as long as the number of parameters is larger than the number of data points, certain classes of neural networks can fit arbitrarily labeled random noise samples (Zhang et al., 2016). This suggests that large neural networks of the kind used in practice can express highly complex functions. Adding hidden layers is also thought to allow neural nets to learn more complex features from the data. Work on convolutional networks suggests that the first few layers of a neural network learn simple, “low-level” features that are then combined into higher-level, more abstract features in the deeper layers. Other works suggest that it is computationally and algorithmically easier to train deep networks rather than shallow, wider nets, though this is still an area of major controversy and active research (Mhaskar et al., 2016).

Choosing the exact network architecture for a neural network remains an art that requires extensive numerical experimentation and intuition, and is oftentimes problem-specific. Both the number of hidden layers and the number of neurons in each layer can affect the performance of a neural network. There seems to be no single recipe for the right architecture for a neural net that works best. However, a general rule of thumb that seems to be emerging is that the number of parameters in the neural net should be large enough to prevent underfitting (see the theoretical discussion in (Advani and Saxe, 2017)).

Empirically, the best architecture for a problem depends on the task, the amount and type of data that is available, and the computational resources at one’s disposal. Certain architectures are easier to train, while others might be better at capturing complicated dependencies in the data and learning relevant input features. Finally, there have been numerous works that move beyond the simple deep, feed-forward neural network architectures discussed here. For example, modern neural networks for image segmentation often incorporate “skip connections” that skip layers of the neural network (He et al., 2016). This allows information to directly propagate to a hidden or output layer, bypassing intermediate layers and often improving performance.

B. Training deep networks

In the previous section, we introduced the basic architecture for neural networks. Here we discuss how to efficiently train large neural networks. Luckily, the basic procedure for training neural nets is the same as that used for training simpler supervised learning algorithms, such as logistic and linear regression: construct a cost/loss function and then use gradient descent to minimize the cost function and find the optimal weights and biases. Neural networks differ from these simpler supervised pro-


cedures in that they generally contain multiple hidden layers that make taking the gradient computationally more difficult. We will return to this in Sec. IX.C, which discusses the “backpropagation” algorithm for computing gradients.

Like all supervised learning procedures, the first thing one must do to train a neural network is to specify a loss function. Given a data point (x_i, y_i), x_i ∈ R^{d+1}, the neural network makes a prediction ŷ_i(w), where w are the parameters of the neural network. Recall that in most cases, the top output layer of our neural net is either a continuous predictor or a classifier that makes discrete (categorical) predictions. Depending on whether one wants to make continuous or categorical predictions, one must utilize a different kind of loss function.

For continuous data, the loss functions that are commonly used to train neural networks are identical to those used in linear regression, and include the mean squared error

E(w) = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i(w))^2,   (119)

where n is the number of data points, and the mean-absolute error (i.e. L1 norm)

E(w) = (1/n) Σ_i |y_i − ŷ_i(w)|.   (120)

The full cost function often includes additional terms that implement regularization (e.g. L1 or L2 regularizers), see Sec. VI.

For categorical data, the most commonly used loss function is the cross-entropy (Eq. (76) and Eq. (81)), since the output layer is often taken to be a logistic classifier for binary data with two types of labels, or a soft-max classifier if there are more than two types of labels. The cross-entropy was already discussed extensively in earlier sections on logistic regression and soft-max classifiers, see Sec. VII. Recall that for classification of binary data, the output of the top layer of the neural network is the probability ŷ_i(w) = p(y_i = 1|x_i; w) that data point i is predicted to be in category 1. The cross-entropy between the true labels y_i ∈ {0, 1} and the predictions is given by

E(w) = − Σ_{i=1}^{n} { y_i log ŷ_i(w) + (1 − y_i) log [1 − ŷ_i(w)] }.

More generally, for categorical data, y can take on M values so that y ∈ {0, 1, . . . , M − 1}. For each datapoint i, define a vector y_{im}, called a ‘one-hot’ vector, such that

y_{im} = 1 if y_i = m, and 0 otherwise.   (121)

We can also define the probability that the neural network assigns a datapoint to category m: ŷ_{im}(w) = p(y_i = m|x_i; w). Then, the categorical cross-entropy is defined as

E(w) = − Σ_{i=1}^{n} Σ_{m=0}^{M−1} { y_{im} log ŷ_{im}(w) + (1 − y_{im}) log [1 − ŷ_{im}(w)] }.   (122)

As in linear and logistic regression, this loss function is often supplemented by additional terms that implement regularization.

Having defined an architecture and a cost function, we must now train the model. Similar to other supervised learning methods, we make use of gradient-descent-based methods to optimize the cost function. Recall that the basic idea of gradient descent is to update the parameters w to move in the direction of the gradient of the cost function ∇_w E(w). In Sec. IV, we discussed numerous optimizers that implement variations of stochastic gradient descent (SGD, Nesterov, RMSProp, Adam, etc.). Most modern neural network packages, such as Keras, allow the user to specify which of these optimizers they would like to use in order to train the neural network. Depending on the architecture, data, and computational resources, different optimizers may work better on the problem, though vanilla SGD is a good first choice.

Finally, we note that unlike in linear and logistic regression, calculating the gradients for a neural network requires a specialized algorithm, called Backpropagation (often abbreviated backprop), which forms the heart of any neural network training procedure. Backpropagation has been discovered multiple times independently but was popularized for modern neural networks in 1985 (Rumelhart and Zipser, 1985). We will turn to the backpropagation algorithm in the next section. Before reading it, the reader is strongly encouraged to play with Notebook 11 in order to gain some intuition about how to build a DNN in practice using the high-level Keras Python package. Notebook 11 discusses a simple example where we build a feed-forward deep neural network for classifying hand-written digits from the MNIST dataset. Figures 37 and 38 show the accuracy and the loss as a function of the training episodes.
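To give a flavor of what Notebook 11 contains, a minimal Keras version of such a network might look as follows. This is our own sketch: the hidden-layer sizes match the (100, 400, 400, 50) architecture quoted in Figs. 37 and 38, but the choice of ReLU activations and the plain SGD optimizer are illustrative assumptions, not the notebook’s exact settings:

    from tensorflow import keras

    # Load MNIST, flatten the 28x28 images, and rescale pixels to [0, 1].
    (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
    x_train = x_train.reshape(-1, 784) / 255.0
    x_test = x_test.reshape(-1, 784) / 255.0

    # Four hidden layers followed by a soft-max output over the 10 digits.
    model = keras.Sequential(
        [keras.layers.Dense(n, activation="relu") for n in (100, 400, 400, 50)]
        + [keras.layers.Dense(10, activation="softmax")]
    )
    model.compile(optimizer="sgd",
                  loss="sparse_categorical_crossentropy",  # cross-entropy cost
                  metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test))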

C. The Backpropagation algorithm

In the last section, we saw how to deploy a high-level package, Keras, to design and train a simple neural network. This training procedure requires us to be able to calculate the derivative of the cost function with respect to all the parameters of the neural network (the weights and biases of all the neurons in the input, hidden, and output layers). A brute-force calculation is out of the question, since it requires us to calculate as many gradients as there are parameters at each step of the gradient descent. The backpropagation algorithm (Rumelhart and Zipser,



FIG. 37 Model accuracy of a DNN trained on the MNIST problem as a function of the training epochs (see Notebook 11). Besides the input and output layers, the DNN has four layers of size (100, 400, 400, 50) with different nonlinearities σ(z).


FIG. 38 Model loss of a DNN trained on the MNIST problem as a function of the training epochs (see Notebook 11). Besides the input and output layers, the DNN has four layers of size (100, 400, 400, 50) with different nonlinearities σ(z).

1985) is a clever procedure that exploits the layered structure of neural networks to more efficiently compute gradients (for a more detailed discussion with Python code examples, see Chapter 2 of (Nielsen, 2015)).

1. Deriving and implementing the backpropagation equations

At its core, backpropagation is simply the ordinary chain rule for partial differentiation, and can be summarized using four equations. In order to see this, we must first establish some useful notation. We will assume that there are L layers in our network, with l = 1, . . . , L indexing the layer. Denote by w^l_{jk} the weight for the connection from the k-th neuron in layer l − 1 to the j-th neuron in layer l. We denote the bias of this neuron by b^l_j. By construction, in a feed-forward neural network the activation a^l_j of the j-th neuron in the l-th layer can be related to the activities of the neurons in the layer l − 1 by the equation

a^l_j = σ( Σ_k w^l_{jk} a^{l−1}_k + b^l_j ) = σ(z^l_j),   (123)

where we have defined the linear weighted sum

z^l_j = Σ_k w^l_{jk} a^{l−1}_k + b^l_j.   (124)

By definition, the cost function E depends directly on the activities of the output layer a^L_j. It of course also indirectly depends on all the activities of neurons in lower layers in the neural network through iteration of Eq. (123). Let us define the error ∆^L_j of the j-th neuron in the L-th layer as the change in the cost function with respect to the weighted input z^L_j:

∆^L_j = ∂E/∂z^L_j.   (125)

This definition is the first of the four backpropagation equations.

We can analogously define the error of neuron j in layer l, ∆^l_j, as the change in the cost function w.r.t. the weighted input z^l_j:

∆^l_j = ∂E/∂z^l_j = (∂E/∂a^l_j) σ′(z^l_j),   (I)

where σ′(x) denotes the derivative of the non-linearity σ(·) with respect to its input, evaluated at x. Notice that the error ∆^l_j can also be interpreted as the partial derivative of the cost function with respect to the bias b^l_j, since

∆^l_j = ∂E/∂z^l_j = (∂E/∂b^l_j)(∂b^l_j/∂z^l_j) = ∂E/∂b^l_j,   (II)

where in the last step we have used the fact that ∂b^l_j/∂z^l_j = 1, cf. Eq. (124). This is the second of the four backpropagation equations.

We now derive the final two backpropagation equations using the chain rule. Since the error depends on neurons in layer l only through the activation of neurons in the subsequent layer l + 1, we can use the chain rule to write

∆^l_j = ∂E/∂z^l_j = Σ_k (∂E/∂z^{l+1}_k)(∂z^{l+1}_k/∂z^l_j)
      = Σ_k ∆^{l+1}_k (∂z^{l+1}_k/∂z^l_j)
      = ( Σ_k ∆^{l+1}_k w^{l+1}_{kj} ) σ′(z^l_j).   (III)


This is the third backpropagation equation. The final equation can be derived by differentiating the cost function with respect to the weight w^l_{jk}:

∂E/∂w^l_{jk} = (∂E/∂z^l_j)(∂z^l_j/∂w^l_{jk}) = ∆^l_j a^{l−1}_k.   (IV)

Together, Eqs. (I), (II), (III), and (IV) define the four backpropagation equations relating the gradients of the activations of various neurons a^l_j, the weighted inputs z^l_j = Σ_k w^l_{jk} a^{l−1}_k + b^l_j, and the errors ∆^l_j. These equations can be combined into a simple, computationally efficient algorithm to calculate the gradient with respect to all parameters (Nielsen, 2015).

The Backpropagation Algorithm

1. Activation at input layer: calculate the activations a^1_j of all the neurons in the input layer.

2. Feedforward: starting with the first layer, exploit the feed-forward architecture through Eq. (123) to compute z^l and a^l for each subsequent layer.

3. Error at top layer: calculate the error of the top layer using Eq. (I). This requires knowing the expressions for the derivatives of both the cost function E(w) = E(a^L) and the activation function σ(z).

4. “Backpropagate” the error: use Eq. (III) to propagate the error backwards and calculate ∆^l_j for all layers.

5. Calculate gradient: use Eqs. (II) and (IV) to calculate ∂E/∂b^l_j and ∂E/∂w^l_{jk}.

We can now see where the name backpropagation comes from. The algorithm consists of a forward pass from the bottom layer to the top layer, where one calculates the weighted inputs and activations of all the neurons. One then backpropagates the error starting with the top layer down to the input layer and uses these errors to calculate the desired gradients. This description makes clear the incredible utility and computational efficiency of the backpropagation algorithm. We can calculate all the derivatives using a single “forward” and “backward” pass of the neural network. This computational efficiency is crucial since we must calculate the gradient with respect to all parameters of the neural net at each step of gradient descent. These basic ideas also underlie almost all modern automatic differentiation packages such as Autograd (Pytorch).
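To make the four equations concrete, the following bare-bones sketch (our own, using the notation above) performs one forward and one backward pass through a small fully-connected network with sigmoid activations and a quadratic cost E = ½||a^L − y||² on a single data point:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigma_prime(z):
        s = sigmoid(z)
        return s * (1.0 - s)

    # A tiny network: 3 inputs -> 4 hidden -> 2 outputs.
    rng = np.random.default_rng(0)
    sizes = [3, 4, 2]
    weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
    biases = [rng.normal(size=m) for m in sizes[1:]]
    x, y = rng.normal(size=3), np.array([0.0, 1.0])

    # Steps 1-2: feedforward, storing z^l and a^l for every layer.
    a, activations, zs = x, [x], []
    for w, b in zip(weights, biases):
        z = w @ a + b            # Eq. (124)
        zs.append(z)
        a = sigmoid(z)           # Eq. (123)
        activations.append(a)

    # Step 3: error at the top layer, Eq. (I), with dE/da^L = a^L - y.
    delta = (activations[-1] - y) * sigma_prime(zs[-1])

    # Steps 4-5: backpropagate the error and collect all gradients.
    grads_w, grads_b = [], []
    for l in reversed(range(len(weights))):
        grads_b.insert(0, delta)                            # Eq. (II)
        grads_w.insert(0, np.outer(delta, activations[l]))  # Eq. (IV)
        if l > 0:
            delta = (weights[l].T @ delta) * sigma_prime(zs[l - 1])  # Eq. (III)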

2. Computing gradients in deep networks: what can go wrong with backprop?

Armed with backpropagation and gradient descent, it seems like it should be straightforward to train any neural

network. However, until fairly recently it was widely believed that training deep networks was an extremely difficult task. One reason for this was that, even with backpropagation, gradient descent on large networks is extremely computationally expensive. However, the great advances in computational hardware (and the widespread use of GPUs) have made this a much less vexing problem than even a decade ago. It is hard to overstate the impact these advances in computing have had on the practical utility of neural networks.

On a more technical and mathematical note, another problem that occurs in deep networks, which transmit information through many layers, is that gradients can vanish or explode. This is, appropriately, known as the problem of vanishing or exploding gradients. This problem is especially pronounced in neural networks that try to capture long-range dependencies, such as Recurrent Neural Networks for sequential data. We can illustrate this problem by considering a simple network with one neuron in each layer. We further assume that all weights are equal, and denote them by w. The behavior of the backpropagation equations for such a network can be inferred from repeatedly using Eq. (III):

∆^1_j = ∆^L_j ∏_{j=0}^{L−1} w σ′(z_j) = ∆^L_j w^L ∏_{j=0}^{L−1} σ′(z_j),   (126)

where ∆^L_j is the error in the L-th (topmost) layer, and w^L is the weight to the power L. Let us now also assume that the magnitude σ′(z_j) is fairly constant and that we can approximate σ′(z_j) ≈ σ′_0. In this case, notice that for large L, the error ∆^1_j has very different behavior depending on the value of wσ′_0. If wσ′_0 > 1, the errors and the gradient blow up. On the other hand, if wσ′_0 < 1, the errors and gradients vanish. Only when the weights satisfy wσ′_0 ≈ 1 and the neurons are not saturated will the gradient stay well behaved for deep networks.
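This is straightforward to verify numerically. The toy sketch below (our own) multiplies the per-layer factors for a 50-layer chain, taking σ′_0 = 1/4, the maximal derivative of a sigmoid:

    L, sigma0_prime = 50, 0.25   # sigma'(0) = 1/4 for a sigmoid
    for w in [1.0, 4.0, 8.0]:
        # error at the bottom layer scales as (w * sigma'_0)^L, cf. Eq. (126)
        print(w, (w * sigma0_prime) ** L)
    # w*sigma'_0 < 1: vanishes; = 1: stays O(1); > 1: explodes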

This basic behavior holds true even in more complicated networks. Rather than considering a single weight, we can ask about the eigenvalues (or singular values) of the weight matrices w^l_{jk}. In order for the gradients to be finite for deep networks, we need these eigenvalues to stay near unity even after many gradient descent steps. In modern feedforward and ReLU neural networks, this is achieved by initializing the weights for the gradient descent in clever ways and using non-linearities that do not saturate, such as ReLUs (recall that for saturating functions, σ′ → 0, which will cause the gradient to vanish). Proper initialization and regularization schemes such as gradient clipping (cutting off gradients with very large values) and batch normalization also help mitigate the vanishing and exploding gradient problem.


D. Regularizing neural networks and other practical considerations

DNNs, like all supervised learning algorithms, must navigate the bias-variance tradeoff. Regularization techniques play an important role in ensuring that DNNs generalize well to new data. The last five years have seen a wealth of new specialized regularization techniques for DNNs beyond the simple L1 and L2 penalties discussed in the context of linear and logistic regression, see Secs. VI and VII. These new techniques include Dropout and Batch Normalization. In addition to these specialized regularization techniques, large DNNs seem especially well-suited to the implicit regularization that already takes place in Stochastic Gradient Descent (SGD) (Wilson et al., 2017), cf. Sec. IV. The implicit stochasticity and local nature of SGD often prevent overfitting of spurious correlations in the training data, especially when combined with techniques such as Early Stopping. In this section, we give a brief overview of these regularization techniques.

1. Implicit regularization using SGD: initialization, hyper-parameter tuning, and Early Stopping

The most commonly employed and effective optimizer for training neural networks is SGD (see Sec. IV for other alternatives). SGD acts as an implicit regularizer by introducing stochasticity (from the use of mini-batches) that prevents overfitting. In order to achieve good performance, it is important that the weight initialization is chosen randomly, in order to break any leftover symmetries. One common choice is drawing the weights from a Gaussian centered around zero with some variance that scales inversely with the number of inputs to the neuron (He et al., 2015; Sutskever et al., 2013). Since SGD is a local procedure, as networks get deeper, choosing a good weight initialization becomes increasingly important to ensure that the gradients are well behaved. Choosing an initialization with a variance that is too large or too small will cause gradients to vanish and the network to train poorly – even a factor of 2 can make a huge difference in practice (He et al., 2015). For this reason, it is important to experiment with different variances.

The second important thing is to appropriately choose the learning rate or step-size by searching over five logarithmic grid points (Wilson et al., 2017). If the best performance occurs at the edge of the grid, repeat this procedure until the optimal learning rate is in the middle of the grid parameters. Finally, it is common to center or whiten the input data (just as we did for linear and logistic regression).

Another important form of regularization that is often employed in practice is Early Stopping. The idea of Early Stopping is to divide the training data into two portions: the dataset we train on, and a smaller validation set that serves as a proxy for out-of-sample performance on the test set. As we train the model, we plot both the training error and the validation error. We expect the training error to continuously decrease during training. However, the validation error will eventually increase due to overfitting. The basic idea of Early Stopping is to halt the training procedure when the validation error starts to rise. This Early Stopping procedure ensures that we stop the training and avoid fitting sample-specific features in the data. Early Stopping is a widely used, essential tool in the deep learning regularization toolbox.
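In the high-level packages, Early Stopping is typically a one-liner. A minimal Keras sketch (assuming a compiled model and training arrays x_train, y_train as in the MNIST example above):

    from tensorflow import keras

    # Halt training once the validation loss stops improving, and restore
    # the weights from the best epoch seen so far.
    early_stop = keras.callbacks.EarlyStopping(
        monitor="val_loss",
        patience=5,                 # tolerate 5 stagnant epochs before halting
        restore_best_weights=True,
    )
    model.fit(x_train, y_train, validation_split=0.2,
              epochs=100, callbacks=[early_stop])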

2. Dropout

Another important regularization scheme that has been widely adopted in the neural networks literature is Dropout (Srivastava et al., 2014). The basic idea of Dropout is to prevent overfitting by reducing spurious correlations between neurons within the network by introducing a randomization procedure similar to that underlying ensemble models such as bagging. Recall that the basic idea behind ensemble methods is to train an ensemble of models that are created using a randomization procedure which ensures that the members of the ensemble are uncorrelated, see Sec. VIII. This reduces the variance of statistical predictions without creating too much additional bias.

In the context of neural networks, it is extremely costly to train an ensemble of networks, both from the point of view of the amount of data needed, as well as the computational resources and parameter tuning required. Dropout circumvents these problems by randomly dropping out neurons (along with their connections) from the neural network during each step of the training (see Figure 39). Typically, for each mini-batch in the gradient descent step, a neuron is dropped from the neural network with a probability p. The gradient descent step is then performed only on the weights of the “thinned” network of individual predictors.

Since during training a neuron is present only part of the time, predictions at test time are made by reweighting the weights by the retention probability: w_test = p w_train, with p here denoting the probability that a neuron is kept. The learned weights can be viewed as an “average” weight over all possible thinned neural networks. This averaging of weights is similar in spirit to the bagging procedure discussed in the context of ensemble models, see Sec. VIII.

3. Batch Normalization

Batch Normalization is a regularization scheme that has been quickly adopted by the neural network community since its introduction in 2015 (Ioffe and Szegedy,



FIG. 39 Dropout. During the training procedure, neurons are randomly “dropped out” of the neural network with some probability p, giving rise to a thinned network. This prevents overfitting by reducing correlations among neurons and reducing the variance in a manner similar in spirit to ensemble methods.

2015). The basic inspiration behind Batch Normalization is the long-known observation that training in neural networks works best when the inputs are centered around zero with respect to the bias. The reason for this is that it prevents neurons from saturating and gradients from vanishing in deep nets. In the absence of such centering, changes in parameters in lower layers can give rise to saturation effects in higher layers and vanishing gradients. The idea of Batch Normalization is to introduce additional new “BatchNorm” layers that standardize the inputs by the mean and variance of the mini-batch.

Consider a layer l with d neurons whose inputs are (z^l_1, . . . , z^l_d). We standardize each dimension so that

z^l_k −→ ẑ^l_k = (z^l_k − E[z^l_k]) / √(Var[z^l_k]),   (127)

where the mean and variance are taken over all samples in the mini-batch. One problem with this procedure is that it may change the representational power of the neural network. For example, for tanh non-linearities, it may force the network to live purely in the linear regime around z = 0. Since non-linearities are crucial to the representational power of DNNs, this could dramatically alter the power of the DNN.

For this reason, one introduces two new parameters γ^l_k and β^l_k for each neuron that can additionally shift and scale the normalized input

ẑ^l_k −→ z̃^l_k = γ^l_k ẑ^l_k + β^l_k.   (128)

One can think of Eqs. (127) and (128) as adding new extra layers z̃^l_k in the deep net architecture. Hence, the new parameters γ^l_k and β^l_k can be learned just like the weights and biases using backpropagation (since this is just an extra layer for the chain rule). We initialize the neural network so that at the beginning of training the inputs are being standardized. Backpropagation then adjusts γ and β during training.

In practice, Batch Normalization considerably improves the learning speed by preventing gradients from vanishing. However, it also seems to serve as a powerful regularizer, for reasons that are not fully understood. One plausible explanation is that in Batch Normalization, the gradient for a sample depends not only on the sample itself but also on all the properties of the mini-batch. Since a single sample can occur in different mini-batches, this introduces additional randomness into the training procedure, which seems to help regularize training.
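Both Dropout and Batch Normalization are single layers in the high-level packages. A minimal Keras sketch (our own; the layer sizes and dropout rate are arbitrary illustrative choices):

    from tensorflow import keras

    model = keras.Sequential([
        keras.layers.Dense(400, activation="relu"),
        keras.layers.BatchNormalization(),  # Eqs. (127)-(128) on the mini-batch
        keras.layers.Dropout(0.5),          # drop neurons with probability 0.5
        keras.layers.Dense(10, activation="softmax"),
    ])

Note that Keras implements “inverted” dropout, rescaling the retained activations during training, so no manual reweighting of the weights is needed at test time.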

E. Deep neural networks in practice: examples

Now that we have gained sufficient high-level background knowledge about deep neural nets, let us discuss how to use them in practice.

1. Deep learning packages

In Notebook 11, we demonstrated that the numerical implementation of DNNs is greatly facilitated by open source Python packages, such as Keras, TensorFlow, Pytorch, and others. The complexity and learning curves for these packages differ, depending on the user’s level of familiarity with Python. The reader should keep in mind that there are DNN packages written in other languages, such as Caffe, which uses C++, but we do not discuss them in this review for brevity.

Keras is a high-level framework which does not require any knowledge about the inner workings of the underlying deep learning algorithms. Coding DNNs in Keras is particularly simple, see Notebook 11, and allows one to quickly grasp the big picture behind the theoretical concepts which we introduced above. However, for advanced applications, which may require more direct control over the operations in between the layers, Keras’ high-level design may turn out to be insufficient.

If one opens up the Keras black box, one will find that it wraps the functionality of another package – TensorFlow^11. Over the last few years, TensorFlow, which is supported by Google, has been gaining popularity and has become the preferred library for deep learning. It is frequently used in Kaggle competitions, university classes, and industry. In TensorFlow one constructs data-flow

11 While Keras can also be used with a Theano backend, we do not discuss this here since Theano support has been discontinued.


graphs, the nodes of which represent mathematical operations, while the edges encode multidimensional tensors (data arrays). A deep neural net can then be thought of as a graph with a particular architecture and connectivity. One needs to understand this concept well before one can truly unleash TensorFlow’s full potential. The learning curve can sometimes be rather steep for TensorFlow beginners, and it requires a certain degree of perseverance and devoted time to internalize the underlying ideas.

There are, however, many other open source packages which allow for control over the inter- and intra-layer operations, without the need to introduce computational graphs. Such an example is Pytorch, which offers libraries for automatic differentiation of tensors at GPU speed. As we discussed above, manipulating neural nets boils down to fast array multiplication and contraction operations and, therefore, the torch.nn library often does the job of providing enough access and controllability to manipulate the linear algebra operations underlying deep neural nets.

For the benefit of the reader, we have prepared Jupyter notebooks for DNNs using all three packages for the deep learning problems we discuss below. We invite the reader to carefully examine the differences in the code which should help them decide on which package they prefer to use.

2. Approaching the learning problem

Let us now analyze a typical procedure for using neural networks to solve supervised learning problems. As can be seen already from the code snippets in Notebook 11, constructing a deep neural network to solve ML problems is a multiple-stage process. Generally, one can identify a set of key steps (a minimal code sketch following these steps is given after the list):

1. Collect and pre-process the data.

2. Define the model and its architecture.

3. Choose the cost function and the optimizer.

4. Train the model.

5. Evaluate and study the model performance on the test data.

6. Use the validation data to adjust the hyperparameters (and, if necessary, network architecture) to optimize performance for the specific dataset.
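
To make these steps concrete, the following is a minimal sketch of this workflow in Keras; the data, layer sizes, and hyperparameters are illustrative placeholders rather than the settings used in the notebooks.

    # Minimal sketch of Steps 1-5; all sizes are illustrative placeholders.
    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

    # Step 1: collect and pre-process the data (random placeholders here).
    X_train, y_train = np.random.rand(1000, 20), np.random.randint(2, size=1000)
    X_test, y_test = np.random.rand(200, 20), np.random.randint(2, size=200)

    # Step 2: define the model and its architecture.
    model = Sequential([Dense(100, activation='relu', input_shape=(20,)),
                        Dense(1, activation='sigmoid')])

    # Step 3: choose the cost function and the optimizer.
    model.compile(loss='binary_crossentropy', optimizer='sgd',
                  metrics=['accuracy'])

    # Step 4: train the model, reserving 20% of the training set for validation.
    model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)

    # Step 5: evaluate the performance on the test data. Step 6 then iterates
    # on the hyperparameters and architecture using the validation error.
    loss, accuracy = model.evaluate(X_test, y_test)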

At this point a few remarks are in order. While we treat Step 1 above as consisting mainly of loading and reshaping a dataset prepared ahead of time, we emphasize that obtaining a sufficient amount of data is a typical challenge in many applications. Oftentimes insufficient data serves as a major bottleneck on the ultimate performance of DNNs. In such cases one can consider data augmentation, i.e. distorting data samples from the existing dataset in some way to enhance the size of the dataset. Obviously, if one knows how to do this, one already has partial information about the important features in the data.

One of the first questions we are typically faced with is how to determine the sizes of the training and test data sets. The MNIST dataset, which has 10 classification categories, uses 80% of the available data for training and 20% for testing. On the other hand, the ImageNet data which has 1000 categories is split 50%-50%. As a rule of thumb, the more classification categories there are in the task, the closer the sizes of the training and test datasets should be in order to prevent overfitting. Once the size of the training set is fixed, it is common to reserve 20% of it for validation, which is used to fine-tune the hyperparameters of the model.
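
As a minimal sketch of such a splitting, assuming scikit-learn's train_test_split and the illustrative fractions quoted above:

    # Illustrative 80/20 train/test split, then 20% of training data for validation.
    import numpy as np
    from sklearn.model_selection import train_test_split

    X, y = np.random.rand(1000, 10), np.random.randint(2, size=1000)  # toy data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train,
                                                      test_size=0.2)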

Also related to data preprocessing is the standardization of the dataset. It has been found empirically that if the original values of the data differ by orders of magnitude, training can be slowed down or impeded. This can be traced back to the vanishing and exploding gradient problem in backprop discussed in Sec. IX.C. To avoid such unwanted effects, one often resorts to two tricks: (i) all data should be mean-centered, i.e. from every data point we subtract the mean of the entire dataset, and (ii) rescale the data, for which there are two ways: if the data is approximately normally distributed, one can rescale by the standard deviation. Otherwise, it is typically rescaled by the maximum absolute value so the rescaled data lies within the interval [−1, 1]. Rescaling ensures that the weights of the DNN are of a similar order of magnitude (notice the similarity of this idea to Batch Normalization, cf. Sec. IX.D.3).
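
A minimal numpy sketch of these two tricks on toy data (in a production pipeline one would typically reach for a tool such as scikit-learn's StandardScaler instead):

    import numpy as np

    X = 100.0 * np.random.randn(500, 4)      # toy data with a large spread
    X = X - X.mean(axis=0)                   # trick (i): mean-centering
    X_gauss = X / X.std(axis=0)              # trick (ii) if data is roughly normal
    X_maxabs = X / np.abs(X).max(axis=0)     # trick (ii) otherwise: lies in [-1, 1]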

The next issue is how to choose the right hyperparameters to start training the model. According to Bengio, the optimal learning rate is often an order of magnitude lower than the smallest learning rate that blows up the loss (Bengio, 2012). One should also keep in mind that, depending on how ambitious of a problem one is dealing with, training the model can take a considerable amount of time. This can severely slow down any progress on improving the model in Step 6. Therefore, it is usually a good idea to play with a small enough fraction of the training data to get a rough feeling about the correct hyperparameter regimes, the usefulness of the DNN architecture, and to debug the code. The size of this small 'play set' should be such that training on it can be done fast and in real time to allow one to quickly adjust the hyperparameters. A typical strategy of exploring the hyperparameter landscape is to use grid searches.
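
A sketch of such a log-spaced grid search over the learning rate; here train_and_score is a hypothetical stand-in for training a small model on the 'play set' and returning its validation accuracy.

    import numpy as np

    def train_and_score(lr, X, y):
        """Hypothetical: train a small model at learning rate lr on the play
        set and return its validation accuracy (placeholder value here)."""
        return np.random.rand()

    X_play, y_play = np.random.rand(100, 10), np.random.randint(2, size=100)
    for lr in 10.0 ** np.arange(-5.0, 0.0):      # 1e-5, 1e-4, ..., 1e-1
        print(f"lr = {lr:g}: accuracy = {train_and_score(lr, X_play, y_play):.3f}")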

Whereas it is always possible to view Steps 1-5 as generic and independent of the particular problem we are trying to solve, it is only when these steps are put together in Step 6 that the real benefit of deep learning is revealed, compared to less sophisticated methods such as regression or bagging, see Secs. VI, VII, and VIII. The optimal choice of network architecture, cost function, and optimizer is determined by the properties of the training and test datasets, which are usually revealed when we try to improve the model.

FIG. 40 Grid search results for the test set accuracy of the DNN for the SUSY problem as a function of the learning rate and the size of the dataset. The data used includes all high- and low-level features.

While there is no "one-size-fits-them-all" recipe to approach ML problems, we believe that the above list gives a good overview and can be a useful guideline to the layman. Furthermore, as it becomes clear, this 'recipe' can be applied to generic supervised learning tasks, not just DNNs. We refer the reader to Sec. XI for more useful hints and tips on how to use the validation data during the training process.

3. SUSY dataset

As a first example from physics, we discuss a DNN approach to the SUSY dataset already introduced in the context of logistic regression in Sec. VII.C.2, and Bagging in Sec. VIII.F. For a detailed description of the SUSY dataset and the corresponding classification problem, we refer the reader to Sec. VII.C.2. There is an interest in using deep learning methods to automate the discovery of collision features from data. Benchmark results using Bayesian Decision Trees from a standard physics package, and five-layer neural networks using Dropout were presented in the original paper (Baldi et al., 2014); they demonstrate the ability of deep learning to bypass the need of using hand-crafted high-level features. Our goal here is to study systematically the accuracy of a DNN classifier as a function of the learning rate and the dataset size.

Unlike the MNIST example where we used Keras, here we use the opportunity to introduce the Pytorch package, see the corresponding notebook. We leave the discussion of the code-specific details for the accompanying notebook.

To classify the SUSY collision events, we construct a DNN with two dense hidden layers of 200 and 100 neurons, respectively. We use ReLU activation between the input and the hidden layers, and a softmax output layer. We apply dropout regularization on the weights of the DNN. Similar to MNIST, we use the cross-entropy as a cost function and minimize it using SGD with batches of size 10% of the training data size. We train the DNN for 10 epochs.
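
A minimal Pytorch sketch of such an architecture is given below, taking the 18 low- plus high-level SUSY features as input; the dropout probability is an illustrative choice, and note that Pytorch's CrossEntropyLoss applies the (log-)softmax internally, so the network itself outputs raw scores.

    import torch
    import torch.nn as nn

    # Two dense hidden layers of 200 and 100 ReLU neurons with dropout;
    # the softmax over the two output scores is applied inside the loss.
    model = nn.Sequential(
        nn.Linear(18, 200), nn.ReLU(), nn.Dropout(p=0.5),
        nn.Linear(200, 100), nn.ReLU(), nn.Dropout(p=0.5),
        nn.Linear(100, 2),
    )
    criterion = nn.CrossEntropyLoss()                    # cross-entropy cost
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)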

Figure 40 shows the accuracy of our DNN on the test data as a function of the learning rate and the size of the dataset. It is considered good practice to start with a logarithmic scale to search through the hyperparameters, to get an overall idea for the order of magnitude of the optimal values. In this example, the performance peaks at the largest size of the dataset and a learning rate of 0.1, and is of the order of 80%. Since the optimal performance is obtained at the edge of the grid, we encourage the reader to extend the grid size to beat our result. For comparison, in the original study (Baldi et al., 2014), the authors achieved ≈ 89% by using the entire dataset with 5,000,000 points and a more sophisticated network architecture, trained using GPUs.

4. Phases of the 2D Ising model

As a second example from physics, we discuss a DNN approach to the Ising dataset introduced in Sec. VII.C.1. We study the problem of classifying the states of the 2D Ising model with a DNN (Tanaka and Tomiya, 2017a), focusing on the model performance as a function of both the number of hidden neurons and the learning rate. The discussion is accompanied by a notebook written in TensorFlow. As in the previous example, the interested reader can find the discussion of the code-specific details in the notebook.

To classify whether a given spin configuration is in the ordered or disordered phase, we construct a minimalistic model for a DNN with a single hidden layer containing a number of hidden neurons. The network architecture thus includes a ReLU-activated input layer, the hidden layer, and the softmax output layer. We pick the categorical cross-entropy as a cost function and minimize it using SGD with mini-batches of size 100. We train the DNN for 100 epochs.
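
A minimal sketch of this architecture in TensorFlow's high-level Keras API; the hidden-layer size and learning rate shown are just one point of the grid searched below, and the 1600 inputs correspond to a flattened 40 × 40 spin configuration.

    import tensorflow as tf

    n_hidden, lr = 10, 0.1                       # one point of the grid search
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(n_hidden, activation='relu', input_shape=(1600,)),
        tf.keras.layers.Dense(2, activation='softmax'),  # ordered vs. disordered
    ])
    model.compile(loss='categorical_crossentropy',
                  optimizer=tf.keras.optimizers.SGD(learning_rate=lr),
                  metrics=['accuracy'])
    # model.fit(X_train, Y_train, batch_size=100, epochs=100) with one-hot labels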

Figure 41 shows the outcome of a grid search over a log-spaced learning rate and the number of neurons in the hidden layer. We see that about 10 neurons are enough at a learning rate of 0.1 to get to a very high accuracy on the test set. However, if we aim at capturing the physics close to criticality, clearly more neurons are required to reliably learn the more complex correlations in the Ising configurations.

FIG. 41 Grid search results for the test set accuracy (top) and the critical set accuracy (bottom) of the DNN for the Ising classification problem as a function of the learning rate and the number of hidden neurons.

X. CONVOLUTIONAL NEURAL NETWORKS (CNNS)

One of the core lessons of physics is that we should exploit symmetries and invariances when analyzing physical systems. Properties such as locality and translational invariance are often built directly into the physical laws. Our statistical physics models often directly incorporate everything we know about the physical system being analyzed. For example, we know that in many cases it is sufficient to consider only local couplings in our Hamiltonians, or work directly in momentum space if the system is translationally invariant. This basic idea, tailoring our analysis to exploit additional structure, is a key feature of modern physical theories from general relativity, through gauge theories, to critical phenomena.

Like physical systems, many datasets and supervised learning tasks also possess additional symmetries and structure. For instance, consider a supervised learning task where we want to label images from some dataset as being pictures of cats or not. Our statistical procedure must first learn features associated with cats. Because a cat is a physical object, we know that these features are likely to be local (groups of neighboring pixels in the two-dimensional image corresponding to whiskers, tails, eyes, etc). We also know that the cat can be anywhere in the image. Thus, it does not really matter where in the picture these features occur (though relative positions of features likely do matter). This is a manifestation of translational invariance that is built into our supervised learning task. This example makes clear that, like many physical systems, many ML tasks (especially in the context of computer vision and image processing) also possess additional structure, such as locality and translation invariance.

The all-to-all coupled neural networks in the previous section fail to exploit this additional structure. For example, consider the image of the digit 'four' from the MNIST dataset shown in Fig. 26. In the all-to-all coupled neural networks used there, the 28 × 28 image was considered a one-dimensional vector of size $28^2 = 784$. This clearly throws away lots of the spatial information contained in the image. Not surprisingly, the neural networks community realized these problems and designed a class of neural network architectures, convolutional neural networks or CNNs, that take advantage of this additional structure (locality and translational invariance) (LeCun et al., 1995). Furthermore, what is especially interesting from a physics perspective is the recent finding that these CNN architectures are intimately related to models such as tensor networks (Stoudenmire, 2018; Stoudenmire and Schwab, 2016) and, in particular, MERA-like architectures that are commonly used in physical models for quantum condensed matter systems (Levine et al., 2017).

A. The structure of convolutional neural networks

A convolutional neural network is a translationally invariant neural network that respects locality of the input data. CNNs are the backbone of many modern deep learning applications and here we just give a high-level overview of CNNs that should allow the reader to delve directly into the specialized literature. The reader is also strongly encouraged to consult the excellent, succinct notes for the Stanford CS231n Convolutional Neural Networks class developed by Andrej Karpathy and Fei-Fei Li (https://cs231n.github.io/). We have drawn heavily on the pedagogical style of these notes in crafting this section.

FIG. 42 Architecture of a Convolutional Neural Network (CNN). The neurons in a CNN are arranged in three dimensions: height (H), width (W), and depth (D). For the input layer, the depth corresponds to the number of channels (in this case 3 for RGB images). Neurons in the convolutional layers calculate the convolution of the image with a local spatial filter (e.g. 3 × 3 pixel grid, times 3 channels for first layer) at a given location in the spatial (W, H)-plane. The depth D of the convolutional layer corresponds to the number of filters used in the convolutional layer. Neurons at the same depth correspond to the same filter. Neurons in the convolutional layer mix inputs at different depths but preserve the spatial location. Pooling layers perform a spatial coarse graining (pooling step) at each depth to give a smaller height and width while preserving the depth. The convolutional and pooling layers are followed by a fully connected layer and classifier (not shown).

There are two kinds of basic layers that make up a CNN: a convolution layer that computes the convolution of the input with a bank of filters (as a mathematical operation, see this practical guide to image kernels: http://setosa.io/ev/image-kernels/), and pooling layers that coarse-grain the input while maintaining locality and spatial structure, see Fig. 42. For two-dimensional data, a layer l is characterized by three numbers: height $H_l$, width $W_l$, and depth $D_l$¹². The height and width correspond to the sizes of the two-dimensional spatial $(W_l, H_l)$-plane (in neurons), and the depth $D_l$ (marked by the different colors in Fig. 42) – to the number of filters in that layer. All neurons corresponding to a particular filter have the same parameters (i.e. shared weights and bias).

12 The depth $D_l$ is often called "number of channels", to distinguish it from the depth of the neural network itself, i.e. the total number of layers (which can be convolutional, pooling or fully-connected), cf. Fig. 42.

In general, we will be concerned with local spatial filters (often called a receptive field in analogy with neuroscience) that take as inputs a small spatial patch of the previous layer at all depths. For instance, a square filter of size F is a three-dimensional array of size $F \times F \times D_{l-1}$. The convolution consists of running this filter over all locations in the spatial plane. To demonstrate how this works in practice, let us consider the simple example consisting of a one-dimensional input of depth 1, shown in Fig. 43. In this case, a filter of size $F \times 1 \times 1$ can be specified by a vector of weights w of length F. The stride, S, encodes by how many neurons we translate the filter when performing the convolution. In addition, it is common to pad the input with P zeros (see Fig. 43). For an input of width W, the number of neurons (outputs) in the layer is given by $(W - F + 2P)/S + 1$. We invite the reader to check out this visualization of the convolution procedure, https://github.com/vdumoulin/conv_arithmetic/blob/master/README.md, for a square input of unit depth. After computing the filter, the output is passed through a non-linearity, a ReLU in Fig. 43. In practice, one often inserts a BatchNorm layer before the non-linearity, cf. Sec. IX.D.3.
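
The following numpy sketch implements such a one-dimensional convolution directly; the filter weights, bias, and input are illustrative values in the spirit of Fig. 43, not the exact numbers used there.

    import numpy as np

    def conv1d(x, w, bias, stride=1, pad=0):
        """Slide filter w over input x with the given stride and zero-padding,
        add the bias, and apply a ReLU. The number of outputs is
        (W - F + 2P)/S + 1."""
        x = np.pad(x, pad)                       # pad with P zeros on each side
        F = len(w)
        n_out = (len(x) - F) // stride + 1
        out = np.array([w @ x[i * stride: i * stride + F] + bias
                        for i in range(n_out)])
        return np.maximum(out, 0.0)              # ReLU non-linearity

    x = np.array([1.0, 3.0, -1.0, 2.0, 2.0])     # width W = 5 (illustrative)
    print(conv1d(x, w=np.array([1.0, -1.0, 1.0]), bias=-2.0, stride=2, pad=1))
    # the output layer has (5 - 3 + 2)/2 + 1 = 3 neurons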

These convolutional layers are interspersed with pooling layers that coarse-grain spatial information by performing a subsampling at each depth. One common pooling operation is the max pool. In a max pool, the spatial dimensions are coarse-grained by replacing a small region (say 2 × 2 neurons) by a single neuron whose output is the maximum value of the output in the region. In physics, this pooling step is very similar to the decimation step of RG (Iso et al., 2018; Koch-Janusz and Ringel, 2017; Lin et al., 2017; Mehta and Schwab, 2014). This generally reduces the dimension of outputs. For example, if the region we pool over is 2 × 2, then both the height and the width of the output layer will be halved. Generally, pooling operations do not reduce the depth of the convolutional layers because pooling is performed separately at each depth. A simple example of a max-pooling operation is shown in Fig. 44. There are some studies suggesting that pooling might be unnecessary (Springenberg et al., 2014), but pooling layers remain a staple of most CNNs.
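
A minimal numpy sketch of a 2 × 2 max pool at a single depth:

    import numpy as np

    def max_pool_2x2(x):
        """Coarse-grain an (H, W) array by taking the maximum over each
        2 x 2 block; height and width are halved."""
        H, W = x.shape
        return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

    x = np.arange(16.0).reshape(4, 4)
    print(max_pool_2x2(x))                       # 2 x 2 array of block maxima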

In a CNN, the convolution and max-pool layers are generally followed by an all-to-all connected layer and a high-level classifier such as a soft-max. This allows us to train CNNs as usual using the backprop algorithm, cf. Sec. IX.C. From a backprop perspective, CNNs are almost identical to fully connected neural network architectures except with tied parameters.

Apart from introducing additional structure, such as translational invariance and locality, this convolutional structure also has important practical and computational benefits. All neurons at a given layer represent the same filter, and hence can all be described by a single set of weights and biases. This reduces the number of free parameters by a factor of $H \times W$ at each layer. For example, for a layer with $D = 10^2$ and $H = W = 10^2$, this gives a reduction in parameters of nearly $10^6$! This allows for the training of much larger models than would otherwise be possible with fully connected layers. We are familiar with similar phenomena in physics: e.g. in translationally invariant systems we can parametrize all eigenmodes by specifying only their momentum (wave number) and functional form (sin, cos, etc.), while without translation invariance much more information is required.

B. Example: CNNs for the 2D Ising model

The inclusion of spatial structure in CNNs is an important feature that can be exploited when designing neural networks for studying physical systems. In the accompanying notebook, we used Pytorch to implement a simple CNN composed of a single convolutional layer followed by a soft-max layer. Every input data point (i.e. Ising configuration) is shaped as a two-dimensional array. We varied the output depth (i.e. the number of output channels) of the convolutional layer from unity – a single set of weights and one bias – to an output depth of 50 distinct weights and biases. The CNN was then trained using SGD for five epochs using a training set consisting of samples from far in the paramagnetic and ordered phases. The results are shown in Fig. 45. The CNN achieved a 100% accuracy on the test set for all architectures, even for a CNN with depth one. We also checked the performance of the CNN on samples drawn from the near-critical region for temperatures T slightly above and below the critical temperature Tc. The CNN performed admirably even on these critical samples with an accuracy of between 80% and 90%. As is the case with all ML and neural networks, the performance on parts of the data that are missing from the training set is considerably worse than on test data that is similar to the training data. This highlights the importance of properly constructing an accurate training dataset and the considerable obstacles of generalizing to novel situations. We encourage the interested reader to explore the corresponding notebook and design better CNN architectures with improved generalization performance on the near-critical set.
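
A minimal Pytorch sketch of such a network, assuming a 1 × 40 × 40 input and the filter and pooling sizes quoted in the caption of Fig. 45 (the exact head used in the notebook may differ):

    import torch.nn as nn

    depth = 10                                   # output channels, varied 1-50
    model = nn.Sequential(
        nn.Conv2d(1, depth, kernel_size=2),      # 1 x 40 x 40 -> depth x 39 x 39
        nn.MaxPool2d(kernel_size=2),             # coarse-grain to depth x 19 x 19
        nn.Flatten(),
        nn.Linear(depth * 19 * 19, 2),           # soft-max classifier; the softmax
    )                                            # is applied inside CrossEntropyLoss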

The reader may wish to check out the second part of the MNIST notebook for a discussion of CNNs applied to digit recognition using the high-level Keras package.

FIG. 43 Two examples to illustrate a one-dimensional convolutional layer with ReLU nonlinearity. Convolutional layer for a spatial filter of size F for a one-dimensional input of width W with stride S and padding P followed by a ReLU non-linearity.

Regarding the SUSY dataset, we stress that the absence of spatial locality in the collision features renders applying CNNs to that problem inadequate.

C. Pre-trained CNNs and transfer learning

The immense success of CNNs for image recognition has resulted in the training of huge networks on enormous datasets, often by large industrial research teams from Google, Microsoft, Amazon, etc. Many of these models are known by name: AlexNet, GoogLeNet, ResNet, InceptionNet, VGGNet, etc. Most researchers and practitioners do not have the resources, data, or time to train networks on this scale. Luckily, the trained models have been released and are now available in standard packages such as the Torch Vision library in Pytorch or the Caffe framework. These models can be used directly as a basis for fine-tuning in different supervised image recognition tasks through a process called transfer learning.

FIG. 44 Illustration of Max Pooling. Illustration of max-pooling over a 2 × 2 region. Notice that pooling is done at each depth (vertical axis) separately. The number of outputs is halved along each dimension due to this coarse-graining.

The basic idea behind transfer learning is that the filters (receptive fields) learned by the convolution layers of these networks should be informative for most image recognition based tasks, not just the ones they were originally trained for. In other words, we expect that, since images reflect the natural world, the filters learned by these CNNs should transfer over to new tasks with only slight modifications and fine-tuning. In practice, this turns out to be true for many tasks one might be interested in.

There are three distinct ways one can take a pre-trained CNN and repurpose it for a new task. The following discussion draws heavily on the notes from the course CS231n mentioned in the introduction to this section.

FIG. 45 Single-layer convolutional network for classifying phases in the Ising model. Accuracy on test set and critical samples for a convolutional neural network with a single layer of varying depth with filters of size 2, max-pool layer with receptive field of size 2, followed by soft-max classifier. Notice that the test accuracy is 100% even for a CNN of depth one with a single set of weights. Accuracy on the near-critical dataset is significantly below that for the test set.

• Use CNN as fixed feature detector at top layer. If the new dataset we want to train on is small and similar to the original dataset, we can simply use the CNN as a fixed feature detector and retrain our classifier. In other words, we remove the classifier (soft-max) layer at the top of the CNN and replace it with a new classifier (linear support vector machine (SVM) or soft-max) relevant to our supervised learning problem. In this procedure, the CNN serves as a fixed map from images to relevant features (the outputs of the top fully-connected layer right before the original classifier). This procedure prevents overfitting on small, similar datasets and is often a useful starting point for transfer learning.

• Use CNN as fixed feature detector at intermediate layer. If the dataset is small and quite different from the dataset used to train the original network, the features at the top level might not be suitable for our dataset. In this case, one may want to instead use features in the middle of the CNN to train our new classifier. These features are thought to be less fine-tuned and more universal (e.g. edge detectors). This is motivated by the idea that CNNs learn increasingly complex features the deeper one goes in the network (see discussion on representational learning in next section).

• Fine-tune the CNN. If the dataset is large, in addition to replacing and retraining the classifier in the top layer, we can also fine-tune the weights of the original CNN using backpropagation. One may choose to freeze some of the weights in the CNN during the procedure or retrain all of them simultaneously.

All these procedures can be carried out easily by using packages such as Caffe or the Torch Vision library in PyTorch. PyTorch provides a nice python notebook that serves as tutorial on transfer learning. The reader is strongly urged to read the Pytorch tutorials carefully if interested in this topic.
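
As an illustration, the first and third procedures might look as follows with the Torch Vision library; ResNet-18 and the ten output classes are illustrative choices, not a prescription.

    import torch.nn as nn
    from torchvision import models

    # Load a CNN pre-trained on ImageNet.
    model = models.resnet18(pretrained=True)

    # Fixed feature detector: freeze all the pre-trained weights...
    for param in model.parameters():
        param.requires_grad = False

    # ...and replace the top classifier with a fresh one for the new task.
    model.fc = nn.Linear(model.fc.in_features, 10)

    # Fine-tuning instead leaves requires_grad = True for some or all layers
    # and retrains the whole network with a small learning rate.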

XI. HIGH-LEVEL CONCEPTS IN DEEP NEURAL NETWORKS

In the previous sections, we introduced deep neural networks and discussed how we can use these networks to perform supervised learning. Here, we take a step back and discuss some high-level questions about the practice and performance of neural networks. The first part of this section presents a deep learning workflow inspired by the bias-variance tradeoff. This workflow is especially relevant to industrial applications where one is often trying to employ neural networks to solve a particular problem. In the second part of this section, we shift gears and ask the question, why have neural networks been so successful? We provide three different high-level explanations that reflect current dogmas. Finally, we end the section by discussing the limitations of supervised learning methods and current neural network architectures.

A. Organizing deep learning workflows using the bias-variance tradeoff

Imagine that you are given some data and asked to design a neural network for learning how to perform a supervised learning task. What are the best practices for organizing a systematic workflow that allows us to efficiently do this? Here, we present a simple deep learning workflow inspired by thinking about the bias-variance tradeoff (see Figure 46). This section draws heavily on Andrew Ng's tutorial at the Deep Learning School (available online at https://www.youtube.com/watch?v=F1ka6a13S9I) which readers are strongly encouraged to watch.

The first thing we would like to do is divide the data into three parts. A training set, a validation or dev (development) set, and a test set. The test set is the data on which we want to make predictions. The dev set is a subset of the training data we use to check how well we are doing out-of-sample, after training the model on the training dataset. We use the validation error as a proxy for the test error in order to make tweaks to our model. It is crucial that we do not use any of the test data to train the algorithm. This is a cardinal sin in ML. We thus suggest the following workflow:

Estimate optimal error rate (Bayes rate).—The first thing one should establish is the difficulty of the task and the best performance one can expect to achieve. No algorithm can do better than the "signal" in the dataset. For example, it is likely much easier to classify objects in high-resolution images than in very blurry, low-resolution images. Thus, one needs to establish a proxy or baseline for the optimal performance that can be expected from any algorithm. In the context of Bayesian statistics, this is often called the Bayes rate. Since we do not know this a priori, we must get an estimate of this. For many tasks such as speech or object recognition, we can approximate this by the performance of humans on the task. For a more specialized task, we would like to ask how well experts, trained at the task, perform. This expert performance then serves as a proxy for our Bayes rate.

Minimize underfitting (bias) on training dataset.—After we have established the Bayes rate, we want to make sure that we are using a sufficiently complex model to avoid underfitting on the training dataset. In practice, this means comparing the training error rate to the Bayes rate. Since the training error does not care about generalization (variance), our model should approach the Bayes rate on the training set. If it does not, the bias of the DNN model is too large and one should try training the model longer and/or using a larger model. Finally, if none of these techniques work, it is likely that the model architecture is not well suited to the dataset, and one should modify the neural architecture in some way to better reflect the underlying structure of the data (symmetries, locality, etc.).

Make sure you are not overfitting.—Next, we run our algorithm on the validation or dev set. If the error is similar to the training error rate and Bayes rate, we are done. If it is not, then we are overfitting the training data. Possible solutions include regularization and, importantly, collecting more data. Finally, if none of these work, one likely has to change the DNN architecture.

If the validation and test sets are drawn from the same distributions, then good performance on the validation set should lead to similarly good performance on the test set. (Of course performance will typically be slightly worse on the test set because the hyperparameters were fit to the validation set.) However, sometimes the training data and test data differ in subtle ways because, for example, they are collected using slightly different methods, or because it is cheaper to collect data in one way versus another. In this case, there can be a mismatch between the training and test data. This can lead to the neural network overfitting these small differences between the test and training sets, and a poor performance on the test set despite having a good performance on the validation set. To rectify this, Andrew Ng suggests making two validation or dev sets, one constructed from the training data and one constructed from the test data. The difference between the performance of the algorithm on these two validation sets quantifies the train-test mismatch. This can serve as another important diagnostic when using DNNs for supervised learning.

FIG. 46 Organizing a workflow for Deep Learning. Schematic illustrating a deep learning workflow inspired by navigating the bias-variance tradeoff (Figure based on Andrew Ng's talk at the 2016 Deep Learning School available at https://www.youtube.com/watch?v=F1ka6a13S9I.) In this diagram, we have assumed that there is no mismatch between the distributions the training and test sets are drawn from.

B. Why neural networks are so successful: three high-level perspectives on neural networks

Having discussed the basics of neural networks, we conclude by giving three complementary perspectives on the success of DNNs and Deep Learning. This high-level discussion reflects various dogmas and intuitions about the success of DNNs and is in no way definitive or conclusive. As the reader was already warned in the introduction to DNNs, the field is rapidly expanding and many of these perspectives may turn out to be only partially true or even false. Nonetheless, we include them here as a guidepost for readers.

1. Neural networks as representation learning

One important and powerful aspect of the deep learning paradigm is the ability to learn relevant features of the data with relatively little domain knowledge, i.e. with minimal hand-crafting. Often, the power of deep learning stems from its ability to act like a black box that can take in a large stream of data and find good features that capture properties of the data we are interested in. This ability to learn good representations with very little hand-tuning is one of the most attractive properties of DNNs. Many of the other supervised learning algorithms discussed here (regression-based models, ensemble methods such as random forests or gradient-boosted trees) perform comparably or even better than neural networks when using hand-crafted features with small-to-intermediate sized datasets.

The hierarchical structure of deep learning models is thought to be crucial to their ability to represent complex, abstract features. For example, consider the use of CNNs for image classification tasks. The analysis of CNNs suggests that the lower levels of the neural networks learn elementary features, such as edge detectors, which are then combined into higher levels of the networks into more abstract, higher-level features (e.g. the famous example of a neuron that "learned to respond to cats") (Le, 2013). More recently, it has been shown that CNNs can be thought of as performing tensor decompositions on the data similar to those commonly used in numerical methods in modern quantum condensed matter (Cohen et al., 2016).

One of the interesting consequences of this line of thinking is the idea that one can train a CNN on one large dataset and the features it learns should also be useful for other supervised tasks. This results in the ability to learn important and salient features directly from the data and then transfer this knowledge to a new task. Indeed, this ability to learn important, higher-level, coarse-grained features is reminiscent of ideas like the renormalization group (RG) in physics where the RG flows separate out relevant and irrelevant directions, and certain unsupervised deep learning architectures have a natural interpretation in terms of variational RG schemes (Mehta and Schwab, 2014).

2. Neural networks can exploit large amounts of data

With the advent of smartphones and the internet, there has been an explosion in the amount of data being generated. This data-rich environment favors supervised learning methods that can fully exploit this rich data world. One important reason for the success of DNNs is that they are able to exploit the additional signal in large datasets for difficult supervised learning tasks. Fundamentally, modern DNNs are unique in that they contain millions of parameters, yet can still be trained on existing hardware. The complexity of DNNs (in terms of parameters) combined with their simple architecture (layer-wise connections) hit a sweet spot between expressivity (ability to represent very complicated functions) and trainability (ability to learn millions of parameters).

FIG. 47 Large neural networks can exploit the vast amount of data now available. Schematic of how neural network performance depends on amount of available data (Figure based on Andrew Ng's talk at the 2016 Deep Learning School available at https://www.youtube.com/watch?v=F1ka6a13S9I.)

Indeed, the ability of large DNNs to exploit huge datasets is thought to differ from many other commonly employed supervised learning methods such as Support Vector Machines (SVMs). Figure 47 shows a schematic depicting the expected performance of DNNs of different sizes with the number of data samples and compares them to supervised learning algorithms such as SVMs or ensemble methods. When the amount of data is small, DNNs offer no substantial benefit over these other methods and often perform worse. However, large DNNs seem to be able to exploit additional data in a way other methods cannot. The fact that one does not have to hand engineer features makes the DNN even more well suited for handling large datasets. Recent theoretical results suggest that as long as a DNN is large enough, it should generalize well and not overfit (Advani and Saxe, 2017). In the data-rich world we live in (at least in the context of images, videos, and natural language), this is a recipe for success. In other areas where data is more limited, deep learning architectures have (at least so far) been less successful.

3. Neural networks scale up well computationally

A final feature that is thought to underlie the success of modern neural networks is that they can harness the immense increase in computational capability that has occurred over the last few decades. The architecture of neural networks naturally lends itself to parallelization and the exploitation of fast but specialized processors such as graphical processing units (GPUs). Google and NVIDIA set on a course to develop TPUs (tensor processing units) which will be specifically designed for the mathematical operations underlying deep learning architectures. The layered architecture of neural networks also makes it easy to use modern techniques such as automatic differentiation that make it easy to quickly deploy them. Algorithms such as stochastic gradient descent and the use of mini-batches make it easy to parallelize code and train much larger DNNs than was thought possible fifteen years ago. Furthermore, many of these computational gains are quickly incorporated into modern packages with industrial resources. This makes it easy to perform numerical experiments on large datasets, leading to further engineering gains.

C. Limitations of supervised learning with deep networks

Like all statistical methods, supervised learning using neural networks has important limitations. This is especially important when one seeks to apply these methods to physics problems. Like all tools, DNNs are not a universal solution. Often, the same or better performance on a task can be achieved by using a few hand-engineered features (or even a collection of random features). This is especially important for hard physics problems where data (or Monte-Carlo samples) may be hard to come by.

Here we list some of the important limitations of supervised neural network based models.

• Need labeled data.—Like all supervised learning methods, DNNs for supervised learning require labeled data. Often, labeled data is harder to acquire than unlabeled data (e.g. one must pay for human experts to label images).

• Supervised neural networks are extremely data intensive.—DNNs are data hungry. They perform best when data is plentiful. This is doubly so for supervised methods where the data must also be labeled. The utility of DNNs is extremely limited if data is hard to acquire or the datasets are small (hundreds to a few thousand samples). In this case, the performance of other methods that utilize hand-engineered features can exceed that of DNNs.


• Homogeneous data.—Almost all DNNs deal with homogeneous data of one type. It is very hard to design architectures that mix and match data types (i.e. some continuous variables, some discrete variables, some time series). In applications beyond images, video, and language, this is often what is required. In contrast, ensemble models like random forests or gradient-boosted trees have no difficulty handling mixed data types.

• Many physics problems are not about prediction.—In physics, we are often not interested in solving prediction tasks such as classification. Instead, we want to learn something about the underlying distribution that generates the data. In this case, it is often difficult to cast these ideas in a supervised learning setting. While the problems are related, it's possible to make good predictions with a "wrong" model. The model might or might not be useful for understanding the physics.

Some of these remarks are particular to DNNs, others are shared by all supervised learning methods. This motivates the use of unsupervised methods which in part circumnavigate these problems.

XII. DIMENSIONAL REDUCTION AND DATA VISUALIZATION

Unsupervised learning is concerned with discovering structure in unlabeled data. In this section, we will begin our foray into unsupervised learning by way of data visualization. Data visualization methods are important for modeling as they can be used to identify correlated or redundant features along with irrelevant features (noise) from raw or processed data. Conceivably, being able to identify and capture such characteristics in a dataset can help in designing better predictive models. For data involving a relatively small number of features, studying pair-wise correlations (i.e. pairwise scatter plots of all features) may suffice in performing a complete analysis. This rapidly becomes impractical for datasets involving a large number of measured features (such as images). Thus, in practice, we often have to perform dimensional reduction, namely, project or embed the data onto a lower dimensional space, which we refer to as the latent space. As we will discuss, part of the complication of dimensional reduction lies in the fact that low-dimensional representations of high-dimensional data necessarily incur a loss of information. Below, we introduce some common linear and nonlinear methods for performing dimensional reduction with applications in data visualization of high-dimensional data.

A. Some of the challenges of high-dimensional data

Before we begin exploring some specific dimensional reduction techniques, it is useful to highlight some of the generic difficulties encountered when dealing with high-dimensional data.

a. High-dimensional data lives near the edge of sample space. Geometry in high-dimensional space can be counterintuitive. One example that is pertinent to machine learning is the following. Consider data distributed uniformly at random in a D-dimensional hypercube $\mathcal{C} = [-e/2, e/2]^D$, where e is the edge length. Consider also a D-dimensional hypersphere $\mathcal{S}$ of radius e/2 centered at the origin and contained within $\mathcal{C}$. The probability that a data point x drawn uniformly at random in $\mathcal{C}$ is contained within $\mathcal{S}$ is well approximated by the ratio of the volume of $\mathcal{S}$ to that of $\mathcal{C}$: $p(\|x\|_2 < e/2) \sim (1/2)^D$. Thus, as the dimension of the feature space D increases, p goes to zero exponentially fast. In other words, most of the data will concentrate outside the hypersphere, in the corners of the hypercube. In physics, this basic observation underlies many properties of ideal gases such as the Maxwell distribution and the equipartition theorem (see Chapter 3 of (Sethna, 2006) for instance).
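
This concentration is easy to verify numerically; a minimal sketch with e = 1 (so the hypercube is $[-1/2, 1/2]^D$):

    import numpy as np

    # Fraction of points drawn uniformly in the hypercube that fall inside the
    # inscribed hypersphere of radius 1/2; it vanishes rapidly as D grows.
    rng = np.random.default_rng(seed=0)
    for D in [2, 5, 10, 20]:
        x = rng.uniform(-0.5, 0.5, size=(100_000, D))
        print(D, np.mean(np.linalg.norm(x, axis=1) < 0.5))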

b. Real-world data vs. uniform distribution. Fortunately, real-world data is not random or uniformly distributed! In fact, real data usually lives in a much lower dimensional space than the original space in which the features are being measured. This is sometimes referred to as the "blessing of non-uniformity" (in opposition to the curse of dimensionality). Data will typically be locally smooth, meaning that a local variation of the data will not incur a change in the target variable (Bishop, 2006). This idea is central to statistical physics and field theories, where properties of systems with an astronomical number of degrees of freedom can be well characterized by low-dimensional "order parameters" or effective degrees of freedom. Another instantiation of this idea is manifest in the description of the bulk properties of a gas of weakly interacting particles, which can be simply described by the thermodynamic variables (temperature, pressure, etc.) that enter the equation of state rather than the enormous number of dynamical variables (i.e. position and momentum) of each particle in the gas.

c. Intrinsic dimensionality and the crowding problem. A recurrent objective of dimensional reduction techniques is to preserve the relative pairwise distances (or defined similarities) between data points from the original space to the latent space. This is a natural requirement, since we would like for nearby data points (as measured in the original space) to remain close-by after the corresponding mapping to the latent space.

FIG. 48 The "Swiss roll". Data distributed in a three-dimensional space (a) that can effectively be described on a two-dimensional surface (b). A common goal of dimensional reduction techniques is to preserve ordination in the data: points that are close-by in the original space are also near-by in the mapped (latent) space. This is true of the mapping (a) to (b) as can be seen by inspecting the color gradient.

Consider the example of the "Swiss roll" presented in FIG. 48a. There, the relevant structure of the data corresponds to nearby points with similar colors and is encoded in the "unrolled" data in the latent space, see FIG. 48b. Clearly, in this example a two-dimensional space is sufficient to capture almost the entirety of the information in the data. A concept which stems from signal processing that is relevant to our current exposition is that of the intrinsic dimensionality of the data. Qualitatively, it refers to the minimum number of dimensions required to capture the signal in the data. In the case of the Swiss roll, it is 2 since the Swiss roll can effectively be parametrized using only two parameters, i.e. $X \in (x_1 \sin(x_1), x_1 \cos(x_1), x_2)$. The minimum number of parameters required for such a parametrization is the intrinsic dimensionality of the data (Bennett, 1969). Attempting to represent data in a space of dimensionality lower than its intrinsic dimensionality can lead to a "crowding" problem (Maaten and Hinton, 2008) (see schematic, FIG. 49). In short, because we are attempting to satisfy too many constraints (e.g. preserve all relative distances of the original space), this results in a trivial solution for the latent space where all mapped data points collapse to the center of the map.

To alleviate this, one needs to weaken the constraints imposed on the visualization scheme. Powerful methods such as t-distributed stochastic neighbor embedding (Maaten and Hinton, 2008) (in short, t-SNE, see section XII.D) and uniform manifold approximation and projection (UMAP) (McInnes et al., 2018) have been devised to circumvent this issue in various ways.

FIG. 49 Illustration of the crowding problem. (Left) A two-dimensional dataset X consisting of 3 equidistant points. (Right) Mapping X to a one-dimensional space while trying to preserve relative distances leads to a collapse of the mapped data points.

B. Principal component analysis (PCA)

A ubiquitous method for dimensional reduction, data visualization and analysis is Principal Component Analysis (PCA). The goal of PCA is to perform an orthogonal transformation of the data in order to find high-variance directions. PCA is inspired by the observation that in many cases, the relevant information in a signal is contained in the directions with largest¹³ variance (see FIG. 50). Directions with small variance are ascribed to "noise" and can potentially be removed or ignored.

FIG. 50 PCA seeks to find the set of orthogonal directions with largest variance. This can be seen as "fitting" an ellipse to the data with the major axis corresponding to the first principal component (direction of largest variance). PCA assumes that directions with large variance correspond to the true signal in the data while directions with low variance correspond to noise.

13 This assumes that the features are measured and compared using the same units.


FIG. 51 (a) The first 2 principal components of the Ising dataset with temperature indicated by the coloring. PCA was performed on a joined dataset of 1000 samples taken at each temperature T = 0.25, 0.5, ..., 4.0. Almost all the variance is explained in the first component which corresponds to the magnetization order parameter (linear combination of the features with weights all roughly equal). The paramagnetic phase corresponds to the middle cluster and the left and right clusters correspond to the symmetry-related ferromagnetic phases. (b) Log of the spectrum of the covariance matrix versus rank ordering. Only one dimension has high variance.

Surprisingly, such PCA-based projections often capture a lot of the large scale structure of many datasets. For example, Figure 51 shows the projection of samples drawn from the 2D Ising model at various temperatures on the first two principal components. Despite living in a 1600 dimensional space (the samples are 40 × 40 spin configurations), a single principal component (i.e. a single direction in this 1600 dimensional space) can capture 50% of the variability contained in our samples. In fact, one can verify that this direction weights all 1600 spins nearly equally and thus corresponds to the magnetization order parameter. Thus, even without any prior physical knowledge, one can extract relevant order parameters using a simple PCA-based projection. Recently, a correspondence between PCA and Renormalization Group flows across the phase transition in the 2D Ising model (Foreman et al., 2017) and in a more general setting (Bradde and Bialek, 2017) has been proposed. In statistical physics, PCA has also found application in detecting phase transitions (Wetzel, 2017), e.g. in the XY model on frustrated triangular and union jack lattices (Wang and Zhai, 2017). PCA was also used to classify dislocation patterns in crystals (Papanikolaou et al., 2017; Wang and Zhai, 2018), and to find correlations in the shear flow of athermal amorphous solids (Ruscher and Rottler, 2018). PCA is widely employed in biological physics when working with high-dimensional data. Physics has also inspired PCA-based algorithms to infer relevant features in unlabelled data (Bény, 2018).

Concretely, consider N data points, $x_1, \ldots, x_N$ that live in a p-dimensional feature space $\mathbb{R}^p$. Without loss of generality, we assume that the empirical mean $\bar{x} = N^{-1}\sum_i x_i$ of these data points is zero¹⁴. Denote the $N \times p$ design matrix as $X = [x_1, x_2, \ldots, x_N]^T$ whose rows are the data points and columns correspond to different features. The $p \times p$ (symmetric) covariance matrix is therefore

$$\Sigma(X) = \frac{1}{N-1} X^T X. \qquad (129)$$

Notice that the j-th diagonal entry of $\Sigma(X)$ corresponds to the variance of the j-th feature and $\Sigma(X)_{ij}$ measures the covariance (i.e. connected correlation in the language of physics) between feature i and feature j.

We are interested in finding a new basis for the data that emphasizes highly variable directions while reducing redundancy between basis vectors. In particular, we will look for a linear transformation that reduces the covariance between different features. To do so, we first perform singular value decomposition (SVD) on the design matrix X, namely, $X = USV^T$, where S is a diagonal matrix of singular values $s_i$, the orthogonal matrix U contains (as its columns) the left singular vectors of X, and similarly V contains (as its columns) the right singular vectors of X. With this, one can rewrite the covariance matrix as

$$\Sigma(X) = \frac{1}{N-1} V S U^T U S V^T = V \left( \frac{S^2}{N-1} \right) V^T \equiv V \Lambda V^T, \qquad (130)$$

where $\Lambda$ is a diagonal matrix with eigenvalues $\lambda_i$ in decreasing order along the diagonal (i.e. the eigendecomposition). It is clear that the right singular vectors of X (i.e. the columns of V) are principal directions of $\Sigma(X)$, and the singular values of X are related to the eigenvalues of the covariance matrix $\Sigma(X)$ via $\lambda_i = s_i^2/(N-1)$. To reduce the dimensionality of data from p to $p' < p$, we first construct the $p \times p'$ projection matrix $V_{p'}$ by selecting the singular components with the $p'$ largest singular values. The projection of the data from p to a $p'$ dimensional space is simply $Y = X V_{p'}$. The same idea is central to matrix-product-state-like techniques used to compress the number of components in quantum wavefunctions in studies of low-dimensional many-body lattice systems.

14 We can always center around the mean: $x_i \leftarrow x_i - \bar{x}$.

The singular vector with the largest singular value (i.e. the largest variance) is referred to as the first principal component; the singular vector with the second largest singular value as the second principal component, and so on. An important quantity is the ratio $\lambda_i / \sum_{i=1}^{p} \lambda_i$, which is referred to as the percentage of the explained variance contained in a principal component (see FIG. 51.b).

It is common in data visualization to present the data projected on the first few principal components. This is valid as long as a large part of the variance is explained in those components. Low values of explained variance may imply that the intrinsic dimensionality of the data is high or simply that it cannot be captured by a linear representation. For a detailed introduction to PCA, see the tutorials by Shlens (Shlens, 2014) and Bishop (Bishop, 2006).
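
A minimal scikit-learn sketch of this procedure on toy data (for the Ising analysis of FIG. 51 see the accompanying notebook):

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.randn(1000, 50)         # toy design matrix, N = 1000, p = 50
    pca = PCA(n_components=2)             # keep the first two principal components
    Y = pca.fit_transform(X)              # projected data, shape (1000, 2)
    print(pca.explained_variance_ratio_)  # lambda_i / sum_i lambda_i per component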

C. Multidimensional scaling

Multidimensional scaling (MDS) is a non-linear dimensional reduction technique which preserves the pairwise distance or dissimilarity $d_{ij}$ between data points (Cox and Cox, 2000). Moving forward, we use the terms "distance" and "dissimilarity" interchangeably. There are two types of MDS: metric and non-metric. In metric MDS, the distance is computed under a pre-defined metric and the latent coordinates Y are obtained by minimizing the difference between the distance measured in the original space ($d_{ij}(X)$) and that in the latent space ($d_{ij}(Y)$):

$$Y = \arg\min_{Y} \sum_{i<j} w_{ij} |d_{ij}(X) - d_{ij}(Y)|, \qquad (131)$$

where $w_{ij} \geq 0$ are weight values. The weight matrix $w_{ij}$ is a set of free parameters that specify the level of confidence (or precision) in the value of $d_{ij}(X)$. If the Euclidean metric is used, MDS gives the same result as PCA and is usually referred to as classical scaling (Torgerson, 1958). Thus MDS is often considered as a generalization of PCA. In non-metric MDS, $d_{ij}$ can be any distance matrix. The objective function is then to preserve the ordination in the data, i.e. if $d_{12}(X) < d_{13}(X)$ in the original space, then in the latent space we should have $d_{12}(Y) < d_{13}(Y)$.

Both MDS and PCA can be implemented using standard Python packages such as Scikit. MDS algorithms typically have a scaling of $O(N^3)$, where N corresponds to the number of data points, and are thus very limited in their application to large datasets. However, sample-based methods have been introduced to reduce this scaling to $O(N \log N)$ (Yang et al., 2006). In the case of PCA, a complete decomposition has a scaling of $O(Np^2 + p^3)$, where p is the number of features. Note that the first term $Np^2$ is due to the computation of the covariance matrix Eq. (129) while the second, $p^3$, stems from eigenvalue decomposition. Note that PCA can be improved to bear complexity $O(Np^2 + p)$ if only the first few principal components are desired (using iterative approaches). PCA and MDS are often among the first data visualization techniques one resorts to.

D. t-SNE

It is often desirable to preserve local structures in high-dimensional datasets. However, when dealing with datasets having clusters delimited by complicated surfaces or datasets with a large number of clusters, preserving local structures becomes difficult using linear techniques such as PCA. Many non-linear techniques such as non-classical MDS (Cox and Cox, 2000), self-organizing maps (Kohonen, 1998), Isomap (Tenenbaum et al., 2000) and Locally Linear Embedding (Roweis and Saul, 2000) have been proposed to address this class of problems. These techniques are generally good at preserving local structures in the data but typically fail to capture structures at the larger scale such as the clusters in which the data is organized (Maaten and Hinton, 2008).

Recently, t-stochastic neighbor embedding (t-SNE) has emerged as one of the go-to methods for visualizing high-dimensional data. It has been shown to offer insightful visualization for many benchmark high-dimensional datasets (Maaten and Hinton, 2008). t-SNE is a non-parametric¹⁵ method that constructs non-linear embeddings. Each high-dimensional training point is mapped to low-dimensional embedding coordinates, which are optimized in a way to preserve the local structure in the data.

When used appropriately, t-SNE is a powerful tech-nique for unraveling the hidden structure of high-dimensional datasets while at the same time preserv-ing locality. In physics, t-SNE has recently been usedto reduce the dimensionality and classify spin configu-rations, generated with the help of Monte Carlo sim-ulations, for the Ising (Carrasquilla and Melko, 2017)and Fermi-Hubbard models at finite temperatures (Ch’nget al., 2017). It was also applied to study clustering tran-sitions in glass-like problems in the context of quantum

15 It does not explicitly parametrize the feature extraction required to compute the embedding coordinates. Thus it cannot be applied to find the coordinates of new data points.


control (Day et al., 2019).

The idea of stochastic neighbor embedding is to associate a probability distribution to the neighborhood of each data point (note x ∈ Rp, where p is the number of features):

p_{i|j} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2)},   (132)

where pi|j can be interpreted as the likelihood that xj is xi’s neighbor (thus we take pi|i = 0). The σi are free bandwidth parameters that are usually determined by fixing the local entropy H(pi) of each data point:

H(p_i) \equiv -\sum_j p_{j|i} \log_2 p_{j|i}.   (133)

The local entropy is then set to equal a constant across all data points, Σ = 2^{H(p_i)}, where Σ is called the perplexity. The perplexity constraint determines σi ∀ i and implies that points in regions of high density will have smaller σi.
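To illustrate how the perplexity constraint fixes the bandwidths, here is a minimal sketch of a bisection search for σi; the function names, bracketing interval and tolerances are ours, not taken from any particular library:

import numpy as np

def local_entropy(d2, sigma):
    """Entropy H(p_i) (in bits) of the conditional distribution p_{j|i}
    for one point, given squared distances d2 to all other points."""
    p = np.exp(-d2 / (2.0 * sigma**2))
    p /= p.sum()
    p = np.maximum(p, 1e-12)            # avoid log(0)
    return -(p * np.log2(p)).sum()

def find_sigma(d2, perplexity, tol=1e-5, max_iter=50):
    """Bisection search for sigma_i such that 2**H(p_i) = perplexity."""
    target = np.log2(perplexity)
    lo, hi = 1e-10, 1e10
    for _ in range(max_iter):
        sigma = 0.5 * (lo + hi)
        H = local_entropy(d2, sigma)
        if abs(H - target) < tol:
            break
        if H > target:                  # entropy too high -> shrink sigma
            hi = sigma
        else:
            lo = sigma
    return sigma

Since the entropy increases monotonically with σ, the bisection converges; points in dense regions need a smaller σ to reach the same perplexity, as stated above.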

Using Gaussian likelihoods in pi|j implies that only points that are nearby xi contribute to its probability distribution. While this ensures that the similarity for nearby points is well represented, this can be a problem for points that are far away from xi (i.e. outliers): they have exponentially vanishing contributions to the distribution, which in turn means that their embedding coordinates are ambiguous (Maaten and Hinton, 2008). One way around this is to define a symmetrized distribution pij ≡ (pi|j + pj|i)/(2N). This guarantees that ∑_j pij > 1/(2N) for all data points xi, resulting in each data point xi making a significant contribution to the cost function to be defined below.

t-SNE constructs a similar probability distribution qij in a low-dimensional latent space (with coordinates Y = {yi}, yi ∈ R^{p'}, where p' < p is the dimension of the latent space):

q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}{\sum_{k \neq i} (1 + \lVert y_i - y_k \rVert^2)^{-1}}.   (134)

The crucial point to note is that qij is chosen to be a long-tail distribution. This preserves short-distance information (relative neighborhoods) while strongly repelling two points that are far apart in the original space (see FIG. 52). In order to find the latent space coordinates yi, t-SNE minimizes the Kullback-Leibler divergence between qij and pij:

C(Y) = D_{KL}(p||q) \equiv \sum_{ij} p_{ij} \log\left( \frac{p_{ij}}{q_{ij}} \right).   (135)

This minimization is done via gradient descent (see Section IV). We can gain further insight into what the embedding cost function C is capturing by computing the gradient of (135) with respect to yi explicitly:

\partial_{y_i} C = \sum_{j \neq i} 4 p_{ij} q_{ij} Z_i (y_i - y_j) - \sum_{j \neq i} 4 q_{ij}^2 Z_i (y_i - y_j)
                 = F_{\mathrm{attractive},i} - F_{\mathrm{repulsive},i},   (136)

where Zi = 1/(∑_{k≠i}(1 + ||yk − yi||²)^{−1}). We have separated the gradient of point yi into an attractive term Fattractive and a repulsive term Frepulsive. Notice that Fattractive,i induces a significant attractive force only between points that are nearby point i in the original space since it involves the pij term. Finding the embedding coordinates yi is thus equivalent to finding the equilibrium configuration of particles interacting through the forces in (136).

FIG. 52 Illustration of the t-SNE embedding. The xi points correspond to the original high-dimensional points while the yi points are the corresponding low-dimensional map points produced by t-SNE. Here we consider two points, x1, x2, that are respectively “close” and “far” from x0. The high-dimensional Gaussian (short-tail) distribution p(x) of x0’s neighbors is shown in blue. The low-dimensional Cauchy (fat-tail) distribution q(y) of x0’s neighbors is shown in red. The map points yi are obtained by minimizing the difference |q(y) − p(xi)| (similar to minimizing the KL divergence). We see that point x1 is mapped to a short distance |y1 − y0|. In contrast, far-away points such as x2 are mapped to relatively large distances |y2 − y0|.

Below, we list some important properties that one should bear in mind when analyzing t-SNE plots.

• t-SNE can rotate data. The KL divergence is invariant under rotations in the latent space, since it only depends on the distances between points. For this reason, t-SNE plots that are rotations of each other should be considered equivalent.

• t-SNE results are stochastic. In applying gradient descent the solution will depend on the initial seed. Thus, the map obtained may vary depending on the seed used and different t-SNE runs will give slightly different results.


• t-SNE generally preserves short-distance information. As a rule of thumb, one should expect that nearby points on the t-SNE map are also close by in the original space, i.e. t-SNE tends to preserve ordination (but not actual distances). For a pictorial explanation of this, we refer the reader to Figure 52.

• Scales are deformed in t-SNE. Since a scale-free distribution is used in the latent space, one should not put too much emphasis on the meaning of the size of any clusters observed in the latent space.

• t-SNE is computationally intensive. Finally, a direct implementation of t-SNE has an algorithmic complexity of O(N²) which is only applicable to small to medium datasets. Improved scaling of the form O(N log N) can be achieved at the cost of approximating Eq. (135) by using the Barnes-Hut method (Van Der Maaten, 2014) for N-body simulations (Barnes and Hut, 1986). More recently, extremely efficient t-SNE implementations making use of fast Fourier transforms for the kernel summations in (136) have been made available at https://github.com/KlugerLab/FIt-SNE (Linderman et al., 2017).

As an illustration, in Figure 53 we applied t-SNE to a Gaussian mixture model consisting of thirty Gaussians, whose means are uniformly distributed in forty-dimensional space. We compared the results to a random two-dimensional projection and PCA. It is clear that unlike more naïve dimensional reduction techniques, both PCA and t-SNE can identify the presence of well-formed clusters. The t-SNE visualization cleanly separates all the clusters while certain clusters blend together in the PCA plot. This is a direct consequence of the fact that t-SNE keeps nearby points close together while repelling points that are far apart.
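A minimal sketch of this kind of comparison with Scikit-learn is shown below; the mixture parameters mirror those quoted in the caption of FIG. 53, but the synthetic data and seeds are otherwise placeholders:

# Minimal sketch in the spirit of FIG. 53: compare a random projection,
# PCA and t-SNE on a synthetic Gaussian mixture.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, labels = make_blobs(n_samples=3000, n_features=40, centers=30,
                       center_box=(-10, 10), random_state=0)

rng = np.random.RandomState(0)
Y_rand = X @ rng.randn(40, 2)                    # naive random 2D projection
Y_pca = PCA(n_components=2).fit_transform(X)     # first two principal components
Y_tsne = TSNE(n_components=2, perplexity=60,
              random_state=0).fit_transform(X)   # default 1000 iterations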

Figure 54 shows t-SNE and PCA plots for the MNIST dataset of ten handwritten numerical digits (0-9). It is clear that the non-linear nature of t-SNE makes it much better at capturing and visualizing the complicated correlations between digits, compared to PCA.

XIII. CLUSTERING

In this section, we continue our discussion of unsupervised learning methods. Unsupervised learning is concerned with discovering structure in unlabeled data (for instance learning local structures for data visualization, see Section XII). The lack of labels makes unsupervised learning much more difficult and subtle than its supervised counterpart. What is somewhat surprising is that even without labels it is still possible to uncover and exploit the hidden structure in the data. Perhaps the simplest example of unsupervised learning is clustering. The

FIG. 53 Different visualizations of a Gaussian mixture formed of K = 30 mixtures in a D = 40 dimensional space. The Gaussians have the same covariance but have means drawn uniformly at random in the space [−10, 10]^40. (a) Plot of the first two coordinates. The labels of the different Gaussians are indicated by the different colors. Note that in a realistic setting, label information is of course not available, thus making it very hard to distinguish the different clusters. (b) Random projection of the data onto a 2-dimensional space. (c) Projection onto the first 2 principal components. Only a small fraction of the variance is explained by those components (the ratio is indicated along the axes). (d) t-SNE embedding (perplexity = 60, # iterations = 1000) in a 2-dimensional latent space. t-SNE correctly captures the local structure of the data.

FIG. 54 Visualization of the MNIST handwritten digits training dataset (here N = 60000). (a) First two principal components. (b) t-SNE applied with a perplexity of 30, a Barnes-Hut angle of 0.5 and 1000 gradient descent iterations. In order to reduce the noise and speed up computation, PCA was first applied to the dataset to project it down to 40 dimensions. We used an open-source implementation to produce the results (Linderman et al., 2017), see https://github.com/KlugerLab/FIt-SNE.


aim of clustering is to group unlabelled data into clusters according to some similarity or distance measure. Informally, a cluster is thought of as a set of points sharing some pattern or structure.

Clustering finds many applications throughout data mining (Larsen and Aone, 1999), data compression and signal processing (Gersho and Gray, 2012; MacKay, 2003). Clustering can be used to identify coarse features or high-level structures in an unlabelled dataset. The technique also finds many applications in the physical sciences, ranging from detecting celestial emission sources in astronomical surveys (Sander et al., 1998) to inferring groups of genes and proteins with similar functions in biology (Eisen et al., 1998), and building entanglement classifiers (Lu et al., 2017). Clustering is perhaps the simplest way to look for hidden structure in a dataset and for this reason is among the most widely employed data analysis and machine learning techniques.

The field of clustering is vast and there exists a flurry of clustering methods suited for different purposes. Some common considerations one has to take into account when choosing a particular method are the distribution of the clusters (overlapping/noisy clusters vs. well-separated clusters), the geometry of the data (flat vs. non-flat), the cluster size distribution (multiple sizes vs. uniform sizes), the dimensionality of the data (low vs. high dimensional) and the computational efficiency of the desired method (small vs. large dataset).

We begin section XIII.A with a focus on popular practical clustering methods such as K-means clustering, hierarchical clustering and density clustering. Our goal is to highlight the strengths, weaknesses and differences between these techniques, while laying out some of the theoretical framework required for clustering analysis. There exist many more clustering methods beyond those discussed in this section16. The methods we discuss were chosen for their pedagogical value and/or their applicability to problems in physics.

In section XIII.B we discuss Gaussian mixture models and the formulation of clustering through latent variable models. This section introduces many of the methods we will encounter when discussing other unsupervised learning methods later in the review. Finally, in section XIII.C we discuss the problem of clustering in high-dimensional data and possible ways to tackle this difficult problem. The reader is also urged to experiment with various clustering methods using Notebook 15.

16 Our complementary Python notebook introduces some of these other methods.

A. Practical clustering methods

Throughout this section we focus on the Euclidean distance as a similarity measure. Other measures may be better suited for specific problems. We refer the enthusiastic reader to (Rokach and Maimon, 2005) for a more in-depth discussion of the different possible similarity measures.

1. K-means

We begin our discussion with K-means clustering since this method is simple to implement and understand, and covers the core concepts of clustering. Consider a set of N unlabelled observations {xn}_{n=1}^N where xn ∈ Rp and where p is the number of features. Also consider a set of K cluster centers called the cluster means: {µk}_{k=1}^K, with µk ∈ Rp, which we’ll compute “empirically” in the clustering procedure. The cluster means can be thought of as the representatives of each cluster, to which data points are assigned (see FIG. 55). K-means clustering can be formulated as follows: given a fixed integer K, find the cluster means {µ} and the data point assignments in order to minimize the following objective function:

C(\{x, \mu\}) = \sum_{k=1}^{K} \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)^2,   (137)

where rnk ∈ {0, 1} is a binary variable called the assignment. The assignment rnk is 1 if xn is assigned to cluster k and 0 otherwise. Notice that ∑_k rnk = 1 ∀ n and ∑_n rnk ≡ Nk, where Nk is the number of points assigned to cluster k. The minimization of this objective function can be understood as trying to find the best cluster means such that the variance within each cluster is minimized. In physical terms, C is equivalent to the sum of the moments of inertia of every cluster. Indeed, as we will see below, the cluster means µk correspond to the centers of mass of their respective clusters.

K-means algorithm. The K-means algorithm alternates between two steps:

1. Expectation: Given a set of assignments rnk, minimize C with respect to µk. Taking a simple derivative and setting it to zero yields the following update rule:

\mu_k = \frac{1}{N_k} \sum_n r_{nk} x_n.   (138)

2. Maximization: Given a set of cluster means µk, find the assignments rnk which minimize C.


Clearly, this is achieved by assigning each data point to its nearest cluster mean:

r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_{k'} (x_n - \mu_{k'})^2 \\ 0 & \text{otherwise} \end{cases}   (139)

K-means clustering consists in alternating between these two steps until some convergence criterion is met. Practically, the algorithm should terminate when the change in the objective function from one iteration to another becomes smaller than a pre-specified threshold. A simple example of K-means is presented in FIG. 55.
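For concreteness, here is a bare-bones NumPy sketch of the two-step iteration of Eqs. (138) and (139); it is meant to be read alongside the algorithm above rather than used in place of an optimized library implementation:

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Bare-bones K-means: alternate the update rules of Eqs. (138)-(139).
    X: (N, p) data array; returns cluster means and assignments."""
    rng = np.random.RandomState(seed)
    mu = X[rng.choice(len(X), K, replace=False)]   # initialize means at random points
    for _ in range(n_iter):
        # Assignment step, Eq. (139): each point goes to its nearest cluster mean
        d2 = ((X[:, None, :] - mu[None, :, :])**2).sum(axis=2)  # (N, K) squared distances
        r = d2.argmin(axis=1)
        # Mean update, Eq. (138): each mean becomes its cluster's center of mass
        new_mu = np.array([X[r == k].mean(axis=0) if (r == k).any() else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                # converged
            break
        mu = new_mu
    return mu, r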

FIG. 55 K-means with K = 3 applied to an artificial two-dimensional dataset. The cluster means at each iteration are indicated by cyan star markers. t indicates the iteration number and C the value of the objective function. (a) The algorithm is initialized by randomly partitioning the space into 3 sectors to generate an initial assignment. (b)-(c) For well separated clusters, the algorithm converges rapidly to the true clusters. (d) The objective function as a function of the iteration. C converges after t = 18 iterations for this choice of random seed (for center initialization).

A nice property of the K-means algorithm is that it is guaranteed to converge. To see this, one can verify explicitly (by taking second-order derivatives) that the expectation step always decreases C. This is also true for the assignment step. Thus, since C is bounded from below, the two-step iteration of K-means always converges to a local minimum of C. Since C is generally a non-convex function, in practice one usually needs to run the algorithm with different random cluster center initializations and post-select the best local minimum. A simple implementation of K-means has an average computational complexity which scales linearly in the size of the dataset (more specifically, the complexity is O(KN) per iteration) and is thus scalable to very large datasets.

As we will see in section XIII.B, K-means is a hard-assignment limit of the Gaussian mixture model where all cluster variances are assumed to be the same. This highlights a common drawback of K-means: if the true clusters have very different variances (spreads), K-means can lead to spurious results since the underlying assumption is that the latent model has uniform variances.

2. Hierarchical clustering: Agglomerative methods

Agglomerative clustering is a bottom-up approach that starts from small initial clusters which are then progressively merged to form larger clusters. The merging process generates a hierarchy of clusters that can be visualized in the form of a dendrogram (see FIG. 56). This hierarchy can be useful for analyzing the relation between clusters and the subcomponents of individual clusters. Agglomerative methods are usually specified by defining a distance measure between clusters17. We denote the distance between clusters X and Y by d(X, Y) ∈ R. Different choices of distance result in different clustering algorithms. At each step, the two clusters that are the closest with respect to the distance measure are merged until a single cluster is left.

Agglomerative clustering algorithm. Agglomerative clustering algorithms can thus be summarized as follows:

1. Initialize each point to its own cluster.

2. Given a set of K clusters X1, X2, · · · , XK, merge clusters until one cluster is left (K = 1):

(a) Find the closest pair of clusters (Xi, Xj): (i, j) = arg min_(i′,j′) d(Xi′, Xj′)

(b) Merge the pair. Update: K ← K − 1

Here we list a few of the most popular distances used in agglomerative methods, often called linkage methods in the clustering literature.

1. Single-linkage: the distance between clusters i and j is defined as the minimum distance between two elements of the different clusters

d(X_i, X_j) = \min_{x_i \in X_i, x_j \in X_j} \lVert x_i - x_j \rVert_2.   (140)

2. Complete linkage: the distance between clusters i and j is defined as the maximum distance between two elements of the different clusters.

d(X_i, X_j) = \max_{x_i \in X_i, x_j \in X_j} \lVert x_i - x_j \rVert_2   (141)

17 Note that this measure need not be a metric.


3. Average linkage: the average distance between points of different clusters

d(X_i, X_j) = \frac{1}{|X_i| \cdot |X_j|} \sum_{x_i \in X_i, x_j \in X_j} \lVert x_i - x_j \rVert_2   (142)

4. Ward’s linkage: This distance measure is analogous to the K-means method as it seeks to minimize the total inertia. The distance measure is the “error squared” before and after merging, which simplifies to:

d(X_i, X_j) = \frac{|X_i||X_j|}{|X_i \cup X_j|} (\mu_i - \mu_j)^2,   (143)

where µj is the center of cluster j.

A common drawback of hierarchical methods is that they do not scale well: at every step, a distance matrix between all clusters must be updated/computed. Efficient implementations achieve a typical computational complexity of O(N²) (Müllner, 2011), making the method suitable for small to medium-size datasets. A simple but major speed-up for the method is to initialize the clusters with K-means using a large K (but still a small fraction of N) and then proceed with hierarchical clustering. This has the advantage of preserving the large-scale structure of the hierarchy while making use of the linear scaling of K-means. In this way, hierarchical clustering may be applied to very large datasets.
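A minimal sketch of agglomerative clustering using SciPy is given below; the linkage method, the synthetic data and the cut-off value are placeholders to be adapted to the problem at hand:

# Minimal sketch: agglomerative clustering and dendrogram with SciPy.
# The linkage method is one of 'single', 'complete', 'average', 'ward',
# corresponding to Eqs. (140)-(143).
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.RandomState(0)
X = rng.randn(50, 2)                   # placeholder (N, p) data

Z = linkage(X, method='ward')          # hierarchical merge tree
labels = fcluster(Z, t=5.0, criterion='distance')  # cut the dendrogram at d(X,Y) = 5
dendrogram(Z)                          # plot as in FIG. 56(b)
plt.show()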

3. Density-based (DB) clustering

Density clustering makes the intuitive assumption that clusters are defined by regions of space with higher density of data points. Data points that constitute noise or that are outliers are expected to form regions of low density. Density clustering has the advantage of being able to consider clusters of multiple shapes and sizes while identifying outliers. The method is also suitable for large-scale applications.

The core assumption of DB clustering is that a relative local density estimation of the data is possible. In other words, it is possible to order points according to their densities. Density estimates are usually accurate for low-dimensional data but become unreliable for high-dimensional data due to large sampling noise. Here, for brevity, we confine our discussion to one of the most widely used density clustering algorithms, DBSCAN. We have also had great success with another recently introduced variant of DB clustering (Rodriguez and Laio, 2014) that is similar in spirit, which the reader is urged to consult. One of the authors (A. D.) has also created a Python package, https://pypi.org/project/fdc/, which makes use of accurate density estimates via kernel methods combined with agglomerative clustering to produce fast and accurate density clustering (see also the GitHub repository).

FIG. 56 Hierarchical clustering example with single linkage. (a) The data points are successively grouped as denoted by the colored dotted lines. (b) Dendrogram representation of the hierarchical decomposition. Each node of the tree represents a cluster. One has to specify a scale cut-off for the distance measure d(X, Y) (corresponding to a horizontal cut in the dendrogram) in order to obtain a set of clusters.

DBSCAN algorithm. Here we describe the most prominent DB clustering algorithm: DBSCAN, or density-based spatial clustering of applications with noise (Ester et al., 1996). Consider once again a set of N data points X ≡ {xn}_{n=1}^N.

We start by defining the ε-neighborhood of point xn as follows:

N_\varepsilon(x_n) = \{x \in X \,|\, d(x, x_n) < \varepsilon\}.   (144)

Nε(xn) is the set of data points that are at a distance smaller than ε from xn. As before, we consider d(·, ·) to be the Euclidean metric (which yields spherical neighborhoods, see Figure 57), but other metrics may be better suited depending on the specific data. Nε(xn) can be seen as a crude estimate of the local density. xn is considered to be a core point if at least minPts points are in its ε-neighborhood. minPts is a free parameter of the algorithm that sets the scale of the size of the smallest cluster one should expect. Finally, a point xi is said to be density-reachable if it is in the ε-neighborhood of a core point. From these


definitions, the algorithm can be simply formulated (see also Figure 57):

→ Until all points in X have been visited; do

− Pick a point xi that has not been visited
− Mark xi as a visited point
− If xi is a core point; then

· Find the set C of all points that are density-reachable from xi.
· C now forms a cluster. Mark all points within that cluster as being visited.

→ Return the cluster assignments C1, · · · , Ck, with k the number of clusters. Points that have not been assigned to a cluster are considered noise or outliers.

Note that DBSCAN does not require the user to specify the number of clusters but only ε and minPts. While it is common to heuristically fix these parameters, methods such as cross-validation can be used for their determination. Finally, we note that DBSCAN is very efficient, with readily available implementations having a computational cost of O(N log N).
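As an illustration, here is a minimal sketch of DBSCAN applied to two non-convex clusters using Scikit-learn; the values of eps (ε) and min_samples (minPts) below are placeholders that must be tuned to the data:

# Minimal sketch: DBSCAN with Scikit-learn on two non-convex clusters.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=1000, noise=0.05, random_state=0)  # two half-moons
labels = DBSCAN(eps=0.15, min_samples=10).fit_predict(X)
# Points labeled -1 are outliers/noise; the others carry their cluster index.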

B. Clustering and Latent Variables via the Gaussian Mixture Models

In the previous section, we introduced several practical methods for clustering. In this section, we will approach clustering from a more abstract vantage point, and in the process, introduce many of the core ideas underlying unsupervised learning. A central concept in many unsupervised learning techniques is the idea of a latent or hidden variable. Even though latent variables are not directly observable, they still influence the visible structure of the data. For example, in the context of clustering we can think of the cluster identity of each datapoint (i.e. which cluster a datapoint belongs to) as a latent variable. And even though we cannot see the cluster label explicitly, we know that points in the same cluster tend to be closer together. The latent variables in our data (cluster identity) are a way of representing and abstracting the correlations between datapoints.

In this language, we can think of clustering as an algorithm to learn the most probable value of a latent variable (cluster identity) associated with each datapoint. Calculating this latent variable requires additional assumptions about the structure of our dataset. Like all unsupervised learning algorithms, in clustering we must make an assumption about the underlying probability distribution from which the data was generated. Our model for how the data is generated is called the generative model. In clustering, we assume that data points are assigned to a cluster, with each cluster characterized by some

FIG. 57 (a) Illustration of the DBSCAN algorithm with minPts = 4. Two ε-neighborhoods are represented as dashed circles of radius ε. Red points are the core points and blue points are density-reachable points that are not core points. Outliers are colored in gray. (b) Application of DBSCAN (minPts = 40) to a noisy dataset with two non-convex clusters. The density profile is shown for clarity. Outliers are indicated by black crosses.

cluster-specific probability distribution (e.g. a Gaussian with some mean and variance that characterizes the cluster). We then specify a procedure for finding the value of the latent variable. This is often done by choosing the values of the latent variable that minimize some cost function.

One common choice for a class of cost functions for many unsupervised learning problems is Maximum Likelihood Estimation (MLE), see Secs. V and VI. In MLE, we choose the values of the latent variables that maximize the likelihood of the observed data under our generative model (i.e. maximize the probability of getting the observed dataset under our generative model). Such MLE equations often give rise to the kind of Expectation Maximization (EM) equations that we first encountered in the last section in the context of K-means clustering.

Gaussian Mixture Models (GMM) are a generative model often used in the context of clustering. In a GMM, points are drawn from one of K Gaussians, each with its own mean µk and covariance matrix Σk,

\mathcal{N}(x|\mu, \Sigma) \sim \exp\left[ -\frac{1}{2} (x - \mu) \Sigma^{-1} (x - \mu)^T \right].   (145)

Let us denote the probability that a point is drawn from mixture k by πk. Then, the probability of generating a point x in a GMM is given by

p(x|\{\mu_k, \Sigma_k, \pi_k\}) = \sum_{k=1}^{K} \mathcal{N}(x|\mu_k, \Sigma_k)\, \pi_k.   (146)

Given a dataset X = {x1, · · · , xN}, we can write the likelihood of the dataset as

p(X|\{\mu_k, \Sigma_k, \pi_k\}) = \prod_{i=1}^{N} p(x_i|\{\mu_k, \Sigma_k, \pi_k\})   (147)

For future reference, let us denote the set of parameters (of K Gaussians in the model) {µk, Σk, πk} by θ.

To see how we can use GMM and MLE to perform clustering, we introduce discrete binary K-dimensional latent variables z for each data point x whose k-th component is 1 if point x was generated from the k-th Gaussian and zero otherwise (these are often called “one-hot variables”). For instance, if we were considering a Gaussian mixture with K = 3, we would have three possible values for z ≡ (z1, z2, z3): (1, 0, 0), (0, 1, 0) and (0, 0, 1). We cannot directly observe the variable z. It is a latent variable that encodes the cluster identity of point x. Let us also denote all the N latent variables corresponding to a dataset X by Z.

Viewing the GMM as a generative model, we can write the probability p(x|z) of observing a data point x given z as

p(x|z; \{\mu_k, \Sigma_k\}) = \prod_{k=1}^{K} \mathcal{N}(x|\mu_k, \Sigma_k)^{z_k}   (148)

as well as the probability of observing a given value of the latent variable

p(z|\{\pi_k\}) = \prod_{k=1}^{K} \pi_k^{z_k}.   (149)

Using Bayes’ rule, we can write the joint probability of a clustering assignment z and a data point x given the GMM parameters as

p(x, z; \theta) = p(x|z; \{\mu_k, \Sigma_k\})\, p(z|\{\pi_k\}).   (150)

We can also use Bayes’ rule to rearrange this expression to give the conditional probability of the data point x being in the k-th cluster, γ(zk), given the model parameters θ, as

\gamma(z_k) \equiv p(z_k = 1|x; \theta) = \frac{\pi_k \mathcal{N}(x|\mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x|\mu_j, \Sigma_j)}.   (151)

The γ(zk) are often referred to as the “responsibility” that mixture k takes for explaining x. Just like in our discussion of soft-max classifiers, this can be made into a “hard assignment” by assigning each point to the cluster with the largest probability: arg max_k γ(zk) over the responsibilities.

The complication is of course that we do not know the parameters θ of the underlying GMM but instead must also learn them from the dataset X. As discussed above, ideally we could do this by choosing the parameters that maximize the likelihood (or equivalently the log-likelihood) of the data

\hat{\theta} = \arg\max_\theta \log p(X|\theta)   (152)

where θ = {µk, Σk, πk}. Once we know the MLEs θ̂, we could use Eq. (151) to calculate the optimal hard cluster assignment arg max_k γ̂(zk) where γ̂(zk) = p(zk = 1|x; θ̂).

In practice, due to the complexity of Eq. (147), it is almost impossible to find the global maximum of the likelihood function. Instead, we must settle for a local maximum. One approach to finding a local maximum of the likelihood is to use a method like stochastic gradient descent on the negative log-likelihood, cf. Sec. IV. Here, we introduce an alternative, powerful approach for finding local maxima of the likelihood in latent variable models using an iterative procedure called Expectation Maximization (EM). Given an initial guess for the parameters θ(0), the EM algorithm iteratively generates new estimates for the parameters θ(1), θ(2), . . .. Importantly, the likelihood is guaranteed to be non-decreasing under these iterations and hence EM converges to a local maximum of the likelihood (Neal and Hinton, 1998).

The central observation underlying EM is that it is often much easier to calculate the conditional likelihoods of the latent variables p(t)(Z) = p(Z|X; θ(t)) given some choice of parameters, and the maximum of the expected log-likelihood given an assignment of the latent variables: θ(t+1) = arg max_θ E_{p(Z|X;θ(t))}[log p(X, Z; θ)]. To get an intuition for this latter quantity, notice that we can write


\mathbb{E}_{p^{(t)}}[\log p(X, Z; \theta)] = \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_{ik}^{(t)} \left[ \log \mathcal{N}(x_i|\mu_k, \Sigma_k) + \log \pi_k \right],   (153)

where we have used the shorthand γ_{ik}^{(t)} = p(z_{ik}|X; θ(t)), with z_{ik} the k-th component of z_i. Taking the derivative of this equation with respect to µk, Σk, and πk (subject to the constraint ∑_k πk = 1) and setting it to zero yields the intuitive equations

\mu_k^{(t+1)} = \frac{\sum_{i=1}^{N} \gamma_{ik}^{(t)} x_i}{\sum_i \gamma_{ik}^{(t)}}

\Sigma_k^{(t+1)} = \frac{\sum_{i=1}^{N} \gamma_{ik}^{(t)} (x_i - \mu_k)(x_i - \mu_k)^T}{\sum_i \gamma_{ik}^{(t)}}

\pi_k^{(t+1)} = \frac{1}{N} \sum_i \gamma_{ik}^{(t)}   (154)

These are just the usual estimates for the mean and variance, with each data point weighted according to our current best guess for the probability that it belongs to cluster k. We can then use our new estimate θ(t+1) to calculate the responsibilities γ_{ik}^{(t+1)} and repeat the process. This is essentially the K-means algorithm discussed in the first section.

This discussion of the Gaussian mixture model introduces several concepts that we will return to repeatedly in the context of unsupervised learning. First, it is often useful to think of the visible correlations between features in the data as resulting from hidden or latent variables. Second, we will often posit a generative model that encodes the structure we think exists in the data and then find parameters that maximize the likelihood of the observed data. Third, often we will not be able to directly estimate the MLE, and will have to instead look for a computationally efficient way to find a local maximum of the likelihood.

C. Clustering in high dimensions

Clustering data in high dimensions can be very challenging. One major problem that is aggravated in high dimensions is the generic accumulation of noise due to random measurement error for each feature. This in turn leads to increased errors for pairwise similarity and distance measures and thus tends to “blur” distances between data points (Domingos, 2012; Kriegel et al., 2009; Zimek et al., 2012). Many clustering algorithms rely on the explicit use of a similarity measure or distance metric that weighs all features equally. For this reason, one must be careful when using an off-the-shelf method in high dimensions.

In order to perform clustering on high-dimensional data, it is often useful to denoise the data before proceeding

FIG. 58 (a) Application of Gaussian mixture modelling to the Ising dataset. The normalized histogram corresponds to the first principal component distribution of the dataset (or equivalently the magnetization in this case). The 1D data is fitted with a K = 3-component Gaussian mixture. The likelihood of the fitted Gaussian mixture is represented in red and is obtained via the expectation-maximization algorithm. (b) The Gaussian mixture model can be used to compute posterior probabilities (responsibilities), i.e. the probability of being in one of the phases. Note that the point where γ(1) = γ(2) = γ(3) can be interpreted as the critical point. Indeed, the crossing occurs at T ≈ 2.26.

with a standard clustering method such as K-means (Kriegel et al., 2009). Figure 54 illustrates an application of denoising to high-dimensional data. PCA (section XII.B) was used to denoise the MNIST dataset by projecting the 784 original dimensions onto the 40 dimensions with the largest principal components. The resulting features were then used to construct a Euclidean distance matrix which was used by t-SNE to compute


the two-dimensional embedding that is presented. Using t-SNE directly on the original data leads to a “blurring” of the clusters (the reader is encouraged to test this themselves).

However, simple feature selection or feature denoising (using PCA for instance) can sometimes be insufficient for learning clusters due to the presence of large variations in the signal and noise of the features that are relevant for identifying the underlying clusters (Kriegel et al., 2009). Recent promising work suggests that one way to overcome these limitations is to learn the latent space and the cluster labels at the same time (Xie et al., 2016).

Finally, we end the clustering section with a short discussion of clustering validation, which can be particularly difficult for high-dimensional data. Often clustering validation, i.e. verifying whether the obtained labels are “valid”, is done by direct visual inspection. That is, the data is represented in a low-dimensional space and the cluster labels obtained are visually inspected to make sure that different labels organize into distinct “blobs”. For high-dimensional data, this is done by performing dimensional reduction (section XII). However, this can lead to the appearance of spurious clusters since dimensional reduction inevitably loses information about the original data. Thus, these methods should be used with care when trying to validate clusters [see (Wattenberg et al., 2016) for an interactive discussion on how t-SNE can sometimes be misleading and how to effectively use it].

A lot of work has been done to devise ways of validating clusters based on various metrics and measures (Kriegel et al., 2009). Perhaps one of the most intuitive ways of defining a good clustering is by measuring how well clusters generalize. Clustering methods based on leveraging powerful classifiers to measure the generalization errors of the clusters have been developed by some of the authors (Day and Mehta, 2018), see https://pypi.org/project/hal-x/. We believe this represents an especially promising research direction in high-dimensional clustering. Finally, we emphasize that this discussion is far from exhaustive and we refer the reader to (Rokach and Maimon, 2005), Chapter 15, for an in-depth survey of the various validation techniques.

XIV. VARIATIONAL METHODS AND MEAN-FIELD THEORY (MFT)

A common thread in many unsupervised learning tasks is accurately representing the underlying probability distribution from which a dataset is drawn. Unsupervised learning of high-dimensional, complex distributions presents a new set of technical and computational challenges that are different from those we encountered in a supervised learning setting. When dealing with complicated probability distributions, it is often much easier to learn the relative weights of different states or data points (ratios of probabilities) than absolute probabilities. In physics, this is the familiar statement that the weights of a Boltzmann distribution are much easier to calculate than the partition function. The relative probability of two configurations, x1 and x2, is given by the ratio of their Boltzmann weights

\frac{p(x_1)}{p(x_2)} = e^{-\beta(E(x_1) - E(x_2))},   (155)

where, as is usual in statistical mechanics, β is the inverse temperature and E(x; θ) is the energy of state x given some parameters (couplings) θ. However, calculating the absolute weight of a configuration requires knowledge of the partition function

Z_p = \mathrm{Tr}_x\, e^{-\beta E(x)},   (156)

(where the trace is taken over all possible configurations x) since

p(x) = \frac{e^{-\beta E(x)}}{Z_p}.   (157)

In general, calculating the partition function Zp is analytically and computationally intractable.

For example, for the Ising model with N binary spins, the trace involves calculating a sum over 2^N terms, which is a difficult task for most energy functions. For this reason, physicists (and machine learning scientists) have developed various numerical and computational methods for evaluating such partition functions. One approach is to use Monte-Carlo based methods to draw samples from the underlying distribution (this can be done knowing only the relative probabilities) and then use these samples to numerically estimate the partition function. This is the philosophy behind powerful methods such as Markov Chain Monte Carlo (MCMC) (Andrieu et al., 2003) and annealed importance sampling (Neal and Hinton, 1998) which are widely used in both the statistical physics and machine learning communities. An alternative approach – which we focus on here – is to approximate the probability distribution p(x) and the partition function using a “variational distribution” q(x; θq) whose partition function we can calculate exactly. The variational parameters θq are chosen to make the variational distribution as close to the true distribution as possible (how this is done is the focus of much of this section).
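The exponential cost of the trace is easy to see in code. The following minimal sketch evaluates Zp by brute force for a small Ising chain (the couplings, field and temperature are placeholder values); adding a single spin doubles the cost:

# Minimal sketch: brute-force evaluation of the partition function Eq. (156)
# for a tiny nearest-neighbor Ising chain, illustrating the 2^N cost of the trace.
import itertools
import numpy as np

N, beta, J, h = 10, 1.0, 1.0, 0.0    # small chain; 2**10 = 1024 configurations

def energy(s):
    """Nearest-neighbor Ising energy on an open chain with uniform J and h."""
    s = np.asarray(s)
    return -J * (s[:-1] * s[1:]).sum() - h * s.sum()

Z = sum(np.exp(-beta * energy(s))
        for s in itertools.product([-1, 1], repeat=N))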

One of the most widely applied examples of a variational method in statistical physics is Mean-Field Theory (MFT). MFT can be naturally understood as a procedure for approximating the true distribution of the system by a factorized distribution. The deep connection between MFT and variational methods is discussed below. These variational MFT methods have been extended to understand more complicated spin models (also called graphical models in the ML literature) and form the basis of a


powerful set of techniques that go under the name of Belief Propagation and Survey Propagation (MacKay, 2003; Wainwright et al., 2008; Yedidia et al., 2003).

Variational methods are also widely used in ML to approximate complex probabilistic models. For example, below we show how the Expectation Maximization (EM) procedure, which we discussed in the context of Gaussian Mixture Models for clustering, is actually a general method that can be derived for any latent (hidden) variable model using a variational procedure (Neal and Hinton, 1998). This section serves as an introduction to this powerful class of variational techniques. For readers interested in an in-depth discussion of variational inference for probabilistic graphical models, we recommend the great treatise written by Michael I. Jordan and others (Jordan et al., 1999), the more physics-oriented discussion in (Yedidia, 2001; Yedidia et al., 2003), as well as David MacKay’s outstanding book (MacKay, 2003).

A. Variational mean-field theory for the Ising model

Ising models are a major paradigm in statistical physics. Historically introduced to study magnetism, it was quickly realized that their predictive power applies to a variety of interacting many-particle systems. Ising models are now understood to serve as minimal models for complex phenomena such as certain classes of phase transitions. In the Ising model, degrees of freedom called spins assume discrete, binary values, e.g. si = ±1. Each spin variable si lives on a lattice (or, in general, a graph), the sites of which are labeled by i = 1, 2, . . . , N. Despite the extreme simplicity relative to real-world systems, Ising models exhibit a high level of intrinsic complexity, and the degrees of freedom can become correlated in sophisticated ways. Often, spins interact spatially locally, and respond to externally applied magnetic fields.

A spin configuration s specifies the values si of the spins at every lattice site. We can assign an “energy” to every such configuration

E(s, J) = -\frac{1}{2} \sum_{i,j} J_{ij} s_i s_j - \sum_i h_i s_i,   (158)

where hi is a local magnetic field applied to the spin si, and Jij is the interaction strength between the spins si and sj. In textbook examples, the coupling parameters J = (J, h) are typically uniform or, in studies of disordered systems, (Jij, hi) are drawn from some probability distribution (i.e. quenched disorder).

The probability of finding the system in a given spin configuration at temperature β⁻¹ is given by

p(s|\beta, J) = \frac{1}{Z_p(J)} e^{-\beta E(s,J)}, \qquad Z_p(\beta, J) = \sum_{\{s_i = \pm 1\}} e^{-\beta E(s,J)},   (159)

with ∑_{s_i=±1} denoting the sum over all possible configurations of the spin variables. We write Zp to emphasize that this is the partition function corresponding to the probability distribution p(s|β, J), which will become important later. For a fixed number of lattice sites N, there are 2^N possible configurations, a number that grows exponentially with the system size. Therefore, it is not in general feasible to evaluate the partition function Zp(β, J) in closed form. This represents a major practical obstacle for extracting predictions from physical theories since the partition function is directly related to the free-energy through the expression

\beta F_p(J) = -\log Z_p(\beta, J) = \beta \langle E(s, J) \rangle_p - H_p,   (160)

with

H_p = -\sum_{\{s_i = \pm 1\}} p(s|\beta, J) \log p(s|\beta, J)   (161)

the entropy of the probability distribution p(s|β, J).

Even though the true probability distribution p(s|β, J) may be a very complicated object, we can still make progress by approximating p(s|β, J) by a variational distribution q(s, θ) which captures the essential features of interest, with θ some parameters that define our variational ansatz. The name variational distribution comes from the fact that we are going to vary the parameters θ to make q(s, θ) as close to p(s|β, J) as possible. The functional form of q(s, θ) is based on an “educated guess”, which oftentimes comes from our intuition about the problem. We can also define a variational free-energy

\beta F_q(\theta, J) = \beta \langle E(s, J) \rangle_q - H_q,   (162)

where ⟨E(s, J)⟩q is the expectation value of the energy E(s, J) with respect to the distribution q(s, θ), and Hq is the entropy of q(s, θ).

Before proceeding further, it is helpful to introduce a new quantity: the Kullback-Leibler divergence (KL-divergence or relative entropy) between two distributions p(x) and q(x). The KL-divergence measures the dissimilarity between the two distributions and is given by

D_{KL}(q\|p) = \mathrm{Tr}_x\, q(x) \log \frac{q(x)}{p(x)},   (163)

which is the expectation w.r.t. q of the logarithmic difference between the two distributions p and q. The trace Tr_x denotes a sum over all possible configurations x. Two important properties of the KL-divergence are (i) positivity: DKL(p‖q) ≥ 0 with equality if and only if p = q (in the sense of probability distributions), and (ii) DKL(p‖q) ≠ DKL(q‖p), that is, the KL-divergence is not symmetric in its arguments.
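Both properties are easy to verify numerically; the following minimal sketch computes the KL-divergence for two toy discrete distributions of our choosing:

# Minimal sketch: KL-divergence between two discrete distributions,
# illustrating positivity and asymmetry (the toy distributions are ours).
import numpy as np

def kl(q, p):
    """D_KL(q||p) = sum_x q(x) log(q(x)/p(x)) for discrete distributions."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    return np.sum(q * np.log(q / p))

q = np.array([0.6, 0.3, 0.1])
p = np.array([0.4, 0.4, 0.2])
print(kl(q, p), kl(p, q))   # both positive, and generally unequal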

Variational mean-field theory is a systematic way for constructing such an approximate distribution q(s, θ). The main idea is to choose parameters that minimize the difference between the variational free-energy Fq(J, θ) and the true free-energy Fp(J, β). We will show in Section XIV.B below that the difference between these two free-energies is actually the KL-divergence:

F_q(J, \theta) = F_p(J, \beta) + D_{KL}(q\|p).   (164)

This equality, when combined with the non-negativity of the KL-divergence, has important consequences. First, it shows that the variational free-energy is always larger than the true free-energy, Fq(J, θ) ≥ Fp(J), with equality if and only if q = p (this inequality is found in many physics textbooks and is known as the Gibbs inequality). Second, finding the best variational free-energy is equivalent to minimizing the KL divergence DKL(q‖p).

Armed with these observations, let us now derive a MFT of the Ising model using variational methods. In the simplest MFT of the Ising model, the variational distribution is chosen so that all spins are independent:

q(s, \theta) = \frac{1}{Z_q} \exp\left( \sum_i \theta_i s_i \right) = \prod_i \frac{e^{\theta_i s_i}}{2 \cosh \theta_i}.   (165)

In other words, we have chosen a distribution q which factorizes on every lattice site. An important property of this functional form is that we can analytically find a closed-form expression for the variational partition function Zq. This simplicity also comes at a cost: ignoring correlations between spins. These correlations become less and less important in higher dimensions and the MFT ansatz becomes more accurate.

To evaluate the variational free-energy, we make use of Eq. (162). First, we need the entropy Hq of the distribution q. Since q factorizes over the lattice sites, the entropy separates into a sum of one-body terms

H_q(\theta) = -\sum_{\{s_i = \pm 1\}} q(s, \theta) \log q(s, \theta) = -\sum_i \left[ q_i \log q_i + (1 - q_i) \log(1 - q_i) \right],   (166)

where qi = e^{θi}/(2 cosh θi) is the probability that spin si is in the +1 state. Next, we need to evaluate the average of the Ising energy E(s, J) with respect to the variational distribution q. Although the energy contains bilinear terms, we can still evaluate this average easily, because the spins are independent (uncorrelated) in the q distribution. The mean value of spin si in the q distribution, also known as the on-site magnetization, is given by

m_i = \langle s_i \rangle_q = \sum_{s_i = \pm 1} s_i \frac{e^{\theta_i s_i}}{2 \cosh \theta_i} = \tanh(\theta_i).   (167)

Since the spins are independent, we have

\langle E(s, J) \rangle_q = -\frac{1}{2} \sum_{i,j} J_{ij} m_i m_j - \sum_i h_i m_i.   (168)

The total variational free-energy is

\beta F_q(J, \theta) = \beta \langle E(s, J) \rangle_q - H_q,

and minimizing with respect to the variational parameters θ, we obtain

\partial_{\theta_i} \beta F_q(J, \theta) = 2 \frac{dq_i}{d\theta_i} \left[ -\beta \left( \sum_j J_{ij} m_j + h_i \right) + \theta_i \right].   (169)

Setting this equation to zero, we arrive at

\theta_i = \beta \left[ \sum_j J_{ij} m_j(\theta_j) + h_i \right].   (170)

For the special case of a uniform field hi = h and uniform nearest-neighbor couplings Jij = J, by symmetry the variational parameters for all the spins are identical, with θi = θ for all i. Then, the mean-field equations reduce to their familiar textbook form (Sethna, 2006), m = tanh(θ) and θ = β(zJm(θ) + h), where z is the coordination number of the lattice (i.e. the number of nearest neighbors).

Equations (167) and (170) form a closed system, known as the mean-field equations for the Ising model. To find a solution to these equations, one method is to iterate through and update each θi, one at a time, in an asynchronous fashion. One can see the emerging relationship of this approach to solving the MFT equations to the Expectation Maximization (EM) procedure first introduced in the context of the K-means algorithm in Sec. XIII.A. To make this explicit, let us spell out the iterative procedure to find the solutions to Eq. (170). We start by initializing our variational parameters to some θ(0) and repeat the following two steps until convergence:

1. Expectation: Given a set of assignments at iteration t, θ(t), calculate the corresponding magnetizations m(t) using Eq. (167)

2. Maximization: Given a set of magnetizations m(t), find new assignments θ(t+1) which minimize the variational free energy Fq. From Eq. (170) this is just

\theta_i^{(t+1)} = \beta \left[ \sum_j J_{ij} m_j^{(t)} + h_i \right].   (171)

From these equations, it is clear that we can think of the MFT of the Ising model as an EM-like procedure similar to the one we used for K-means clustering and Gaussian Mixture Models in Sec. XIII.
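A minimal NumPy sketch of this asynchronous iteration of Eqs. (167) and (170) is given below; the couplings J, fields h and sweep count are placeholders to be supplied by the user:

import numpy as np

def mft_ising(J, h, beta, n_sweeps=200, seed=0):
    """Iterate theta_i = beta * (sum_j J_ij m_j + h_i), m_i = tanh(theta_i).
    J: (N, N) coupling matrix; h: (N,) local fields."""
    N = len(h)
    rng = np.random.RandomState(seed)
    theta = rng.randn(N)                     # random initial variational parameters
    for _ in range(n_sweeps):
        for i in rng.permutation(N):         # asynchronous (one spin at a time) update
            m = np.tanh(theta)               # current magnetizations, Eq. (167)
            theta[i] = beta * (J[i] @ m + h[i])  # self-consistency, Eq. (170)
    return np.tanh(theta)                    # converged magnetizations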

As is well known in statistical physics, even though MFT is not exact, it can often yield qualitatively and even quantitatively precise predictions (especially in high dimensions). The discrepancy between the true physics and MFT predictions stems from the fact that the variational distribution q we chose cannot capture correlations between the spins. For instance, it predicts the wrong value for the critical temperature for the two-dimensional Ising model. It even erroneously predicts the existence of a phase transition in one dimension at a non-zero temperature. We refer the interested reader to standard textbooks on statistical physics for a detailed analysis of the applicability of MFT to the Ising model. However, we emphasize that the failure of any particular variational ansatz does not compromise the usefulness of the approach. In some cases, one can consider changing the variational ansatz to improve the predictive properties of the corresponding variational MFT (Yedidia, 2001; Yedidia et al., 2003). The take-home message is that variational MFT is a powerful tool but one that must be applied and interpreted with care.

B. Expectation Maximization (EM)

Ideas along the lines of variational MFT have been independently developed in statistics and imported into machine learning to perform maximum likelihood (ML) estimates. In this section, we explicitly derive the Expectation Maximization (EM) algorithm and demonstrate further its close relation to variational MFT (Neal and Hinton, 1998). We will focus on latent variable models where some of the variables are hidden and cannot be directly observed. This often makes maximum likelihood estimation difficult to implement. EM gets around this difficulty by using an iterative two-step procedure, closely related to variational free-energy based approximation schemes in statistical physics.

To set the stage for the following discussion, let x be the set of visible variables we can directly observe and z be the set of latent or hidden variables that we cannot directly observe. Denote the underlying probability distribution from which x and z are drawn by p(z, x|θ), with θ representing all relevant parameters. Given a dataset x, we wish to find the maximum likelihood estimate of the parameters θ that maximizes the probability of the observed data.

As in variational MFT, we view θ as variational parameters chosen to maximize the log-likelihood L(θ) = ⟨log p(x|θ)⟩_{Px}, where the expectation is taken with respect to the marginal distribution of x. Algorithmically, this can be done by iterating the variational parameters θ(t) in a series of steps (t = 1, 2, . . . ) starting from some arbitrary initial value θ(0):

1. Expectation step (E step): Given the known values of the observed variable x and the current estimate of the parameters θ(t−1), find the probability distribution of the latent variable z:

q_{t-1}(z) = p(z|\theta^{(t-1)}, x)   (172)

2. Maximization step (M step): Re-estimate the parameters θ(t) to be those with maximum likelihood, assuming q_{t−1}(z) found in the previous step is the true distribution of the hidden variable z:

\theta^{(t)} = \arg\max_\theta \langle \log p(z, x|\theta) \rangle_{q_{t-1}}   (173)

It was shown (Dempster et al., 1977) that each EM iteration increases the true log-likelihood L(θ), or at worst leaves it unchanged. In most models, this iteration procedure converges to a local maximum of L(θ).

FIG. 59 Convergence of the EM algorithm. Starting from θ(t), the E-step (blue) establishes −Fq(θ(t)), which is always a lower bound of −Fp := ⟨log p(x|θ)⟩_{Px} (green). The M-step (red) is then applied to update the parameter, yielding θ(t+1). The updated parameter θ(t+1) is then used to construct −Fq(θ(t+1)) in the subsequent E-step. The M-step is performed again to update the parameter, etc.

To see how EM is actually performed and how it relates to variational MFT, we make use of the KL-divergence between two distributions introduced in the last section. Recall that our goal is to maximize the log-likelihood L(θ). With data z missing, we surely cannot just maximize L(θ) directly since the parameters θ might couple both z and x. EM circumvents this by optimizing another objective function, Fq(θ), constructed based on estimates of the hidden variable distribution q(z|x). Indeed, the function optimized is none other than the variational free energy we encountered in the previous section:

F_q(\theta) := -\langle \log p(z, x|\theta) \rangle_{q, P_x} - \langle H_q \rangle_{P_x},   (174)

where Hq is the Shannon entropy (defined in Eq. (161)) of q(z|x). One can define the true free-energy Fp(θ) as the negative log-likelihood of the observed data:

-F_p(\theta) = L(\theta) = \langle \log p(x|\theta) \rangle_{P_x}.   (175)

In the language of statistical physics, Fp(θ) is the true free-energy while Fq(θ) is the variational free-energy we would like to minimize (see Table I). Note that we have chosen to employ a physics sign convention here of defining the free-energy as minus the log of the partition function. In the ML literature, this minus sign is often omitted (Neal and Hinton, 1998) and this can lead to some confusion. Our goal is to choose θ so that our variational free-energy Fq(θ) is as close to the true free-energy Fp(θ) as possible. The difference between these free-energies can be written as

F_q(\theta) - F_p(\theta) = \langle f_q(x, \theta) - f_p(x, \theta) \rangle_{P_x},   (176)

where

f_q(x, \theta) - f_p(x, \theta)
  = \log p(x|\theta) - \sum_z q(z|x) \log p(z, x|\theta) + \sum_z q(z|x) \log q(z|x)
  = \sum_z q(z|x) \log p(x|\theta) - \sum_z q(z|x) \log p(z, x|\theta) + \sum_z q(z|x) \log q(z|x)
  = -\sum_z q(z|x) \log \frac{p(z, x|\theta)}{p(x|\theta)} + \sum_z q(z|x) \log q(z|x)
  = \sum_z q(z|x) \log \frac{q(z|x)}{p(z|x, \theta)}
  = D_{KL}(q(z|x) \| p(z|x, \theta)) \geq 0

where we have used Bayes’ theorem p(z|x, θ) = p(z, x|θ)/p(x|θ). Since the KL-divergence is always positive, this shows that the variational free-energy Fq is always an upper bound of the true free-energy Fp. In physics, this result is known as Gibbs’ inequality.

From Eq. (174) and the fact that the entropy term in Eq. (174) does not depend on θ, we can immediately see that the maximization step (M-step) in Eq. (173) is equivalent to minimizing the variational free-energy Fq(θ). Surprisingly, the expectation step (E-step) can also be viewed as the optimization of this variational free-energy. Concretely, one can show that the distribution of hidden variables z given the observed variable x and the current estimate of the parameters θ, Eq. (172), is the unique probability q(z) that minimizes Fq(θ) (now seen as a functional of q). This can be proved by taking the functional derivative of Eq. (174), plus a Lagrange multiplier that encodes ∑_z q(z) = 1, with respect to q(z).

Summing things up, we can re-write EM in the following form (Neal and Hinton, 1998):

1. Expectation step: Construct the approximating probability distribution of the unobserved z given the values of the observed variable x and the parameter estimate θ(t−1):

q_{t-1}(z) = \arg\min_q F_q(\theta^{(t-1)})   (177)

2. Maximization step: Fix q, update the variational parameters:

\theta^{(t)} = \arg\max_\theta -F_{q_{t-1}}(\theta).   (178)

To recapitulate, EM implements ML estimation even with missing or hidden variables through optimizing a lower bound of the true log-likelihood. In statistical physics, this is reminiscent of minimizing a variational free-energy, which is an upper bound of the true free-energy due to the Gibbs inequality. In Fig. 59, we show pictorially how EM works. The E-step can be seen as representing the unobserved variable z by a probability distribution q(z). This probability is used to construct an alternative objective function −Fq(θ), which is then maximized with respect to θ in the M-step. By construction, maximizing the negative variational free-energy is equivalent to doing ML estimation on the joint data (i.e. both observed and unobserved). The name “M-step” is intuitive since the parameters θ are found by maximizing −Fq(θ). The name “E-step” comes from the fact that one usually doesn’t need to construct the probability of the missing data explicitly, but rather need only compute the “expected” sufficient statistics over these data, cf. Fig. 59.

On the practical side, EM has been demonstrated to be extremely useful in parameter estimation, particularly in hidden Markov models and Bayesian networks (see, for example, (Barber, 2012; Wainwright et al., 2008)). Some of the authors have used EM in biophysics, to design algorithms which establish the equivalence of niche theory and the Minimum Environmental Perturbation Principle (Marsland III et al., 2019). One of the striking advantages of EM is that it is conceptually simple and easy to implement (see Notebook 16). In many cases, implementation of EM is guaranteed to increase the likelihood monotonically, which could be a perk during debugging. For readers interested in an overview of applications of EM, we recommend (Do and Batzoglou, 2008).

Finally, for advanced readers familiar with the physics of disordered systems, we note that it is possible to construct a one-to-one dictionary between EM for latent variable models and the MFT of spin systems with quenched disorder. In a disordered spin system, the Ising couplings J are commonly taken to be quenched random variables drawn from some underlying probability distribution. In the EM procedure, the quenched disorder is provided by the observed data points x which are drawn from some underlying probability distribution that characterizes the data. The spins s are like the hidden or latent variables z. Similar analogues can be found for all the variational MFT quantities (see Table I). This striking correspondence offers a glimpse into the deep connection between statistical mechanics and unsupervised latent variable models – a connection that we will repeatedly exploit to gain more intuition for the energy-based unsupervised models considered in the next few


statistical physics                        Variational EM
spins/d.o.f.: s                            hidden/latent variables z
couplings/quenched disorder: J             data observations: x
Boltzmann factor: e^{−βE(s,J)}             complete probability: p(x, z|θ)
partition function: Z(J)                   marginal likelihood: p(x|θ)
energy: βE(s, J)                           negative log-complete data likelihood: − log p(x, z|θ, m)
free energy: βFp(J|β)                      negative log-marginal likelihood: − log p(x|m)
variational distribution: q(s)             variational distribution: q(z|x)
variational free-energy: Fq(J, θ)          variational free-energy: Fq(θ)

TABLE I Analogy between quantities in statistical physics and variational EM.

chapters.

XV. ENERGY BASED MODELS: MAXIMUM ENTROPY (MAXENT) PRINCIPLE, GENERATIVE MODELS, AND BOLTZMANN LEARNING

Most of the models discussed in the previous sections (e.g. linear and logistic regression, ensemble models, and supervised neural networks) are discriminative – they are designed to perceive differences between groups or categories of data. For example, recognizing differences between images of cats and images of dogs allows a discriminative model to label an image as “cat” or “dog”. Discriminative models form the core techniques of most supervised learning methods. However, discriminative methods have several limitations. First, like all supervised learning methods, they require labeled data. Second, there are tasks that discriminative approaches simply cannot accomplish, such as drawing new examples from an unknown probability distribution. A model that can learn to represent and sample from a probability distribution is called generative. For example, a generative model for images would learn to draw new examples of cats and dogs given a dataset of images of cats and dogs. Similarly, given samples generated from one phase of an Ising model, we may want to generate new samples from that phase. Such tasks are clearly beyond the scope of discriminative models like the ensemble models and DNNs discussed so far in the review. Instead, we must turn to a new class of machine learning methods.

The goal of this section is to introduce the reader to energy-based generative models. As we will see, energy-based models are closely related to the kinds of models commonly encountered in statistical physics. We will draw upon many techniques that have their origin in statistical mechanics (e.g. Monte-Carlo methods). The section starts with a brief overview of generative models, highlighting the similarities and differences with the supervised learning methods encountered in earlier sections. Next, we introduce perhaps the simplest kind of generative models – Maximum Entropy (MaxEnt) models. MaxEnt models have no latent (or hidden) variables, making them ideal for introducing the key concepts and tools that underlie energy-based generative models. We then present an extended discussion of how to train energy-based models. Much of this discussion will also be applicable to more complicated energy-based models such as Restricted Boltzmann Machines (RBMs) and the deep models discussed in the next section.

A. An overview of energy-based generative models

Generative models are a machine learning technique that allows us to learn how to generate new examples similar to those found in a training dataset. The core idea of most generative models is to learn a parametric model for the probability distribution from which the data were drawn. Once we have learned a model, we can generate new examples by sampling from the learned generative model (see Fig. 60). As in statistical physics, this sampling is often done using Markov Chain Monte Carlo (MCMC) methods. A review of MCMC methods is beyond the scope of this discussion: for a concise and beautiful introduction to MCMC-inspired methods that bridges both statistical physics and ML, the reader is encouraged to consult Chapters 29-32 of David MacKay’s book (MacKay, 2003) as well as the review by Michael I. Jordan and collaborators (Andrieu et al., 2003).

The added complexity of learning models directly from samples introduces many of the same fundamental tensions we encountered when discussing discriminative models. The ability to generate new examples requires models to be able to “generalize” beyond the examples they have been trained on, that is, to generate new samples that are not samples of the training set. The models must be expressive enough to capture the complex correlations present in the underlying data distribution, but the amount of data we have is finite, which can give rise to overfitting.

In practice, most generative models that are used in machine learning are flexible enough that, with a sufficient number of parameters, they can approximate any probability distribution. For this reason, there are three axes on which we can differentiate classes of generative models:

• The first axis is how easy the model is to train – both in terms of computational time and the complexity of writing code for the algorithm.

• The second axis is how well the model generalizes from the training set to the test set.


• The third axis is which characteristics of the data distribution the model is capable of capturing, and which it focuses on capturing.

All generative models must balance these competing requirements, and generative models differ in the tradeoffs they choose. Simpler models capture less structure about the underlying distributions but are often easier to train. More complicated models can capture this structure but may overfit to the training data.

One of the fundamental reasons that energy-based models have been less widely employed than their discriminative counterparts is that the training procedure for these models differs significantly from those for supervised neural network models. Though both employ gradient-descent-based procedures for minimizing a cost function (one common choice for generative models is the negative log-likelihood function), energy-based models do not use backpropagation (see Sec. IX.C) and automatic differentiation for computing gradients. Rather, one must turn to ideas inspired by MCMC-based methods in physics and statistics that sometimes go under the name “Boltzmann Learning” (discussed below). As a result, training energy-based models requires additional tools that are not immediately available in packages such as PyTorch and TensorFlow.

The open-source package – Paysage – that is built on top of PyTorch bridges this gap by providing the toolset for training energy-based models (Paysage is maintained by Unlearn.AI – a company affiliated with two of the authors (CKF and PM)). Paysage makes it easy to quickly code and deploy energy-based models such as Restricted Boltzmann Machines (RBMs) and Stacked RBMs – a “deep” unsupervised model. The package includes unpublished training methods that significantly improve the training performance, can be applied with various datatypes, and can be employed on GPUs. We make use of this package extensively in the next two sections and the accompanying Python notebooks. For example, Fig. 60 (and the accompanying Notebook 17) show how the Paysage package can be used to quickly code and train a variety of energy-based models on the MNIST handwritten digit dataset.

Finally, we note that generative models at their most basic level are complex parametrizations of the probability distribution the data is drawn from. For this reason, generative models can do much more than just generate new examples. They can be used to perform a multitude of other tasks that require sampling from a complex probability distribution, including “de-noising”, filling in missing data, and even discrimination (Hinton, 2012). The versatility of generative models is one of the major appeals of these unsupervised learning methods.

FIG. 60 Examples of handwritten digits (“reconstructions”) generated using various energy-based models using the powerful Paysage package for unsupervised learning. Examples from top to bottom are: the original MNIST database, an RBM with Gaussian units which is equivalent to a Hopfield Model, a Restricted Boltzmann Machine (RBM), an RBM with an L1 penalty for regularization, and a Deep Boltzmann Machine (DBM) with 3 layers. All models have 200 hidden units. See Sec. XVI and the corresponding notebook for details.

B. Maximum entropy models: the simplest energy-based generative models

Maximum Entropy (MaxEnt) models are one of the simplest classes of energy-based generative models. MaxEnt models have their origin in a series of beautiful papers by Jaynes that reformulated statistical mechanics in information theoretic terms (Jaynes, 1957a,b). Recently, the flood of new, large-scale datasets has resulted in a resurgence of interest in MaxEnt models in many fields including physics (especially biological physics), computational neuroscience, and ecology (Elith et al., 2011; Schneidman et al., 2006; Weigt et al., 2009). MaxEnt models are often presented as the class of generative models that make the least assumptions about the underlying data. However, as we have tried to emphasize throughout the review, all ML and statistical models require assumptions, and MaxEnt models are no different. Overlooking this can sometimes lead to misleading conclusions, and it is important to be cognizant of these implicit assumptions (Aitchison et al., 2016; Schwab et al., 2014).

1. MaxEnt models in statistical mechanics

MaxEnt models were introduced by E. T. Jaynes in a two-part paper in 1957 entitled “Information theory and statistical mechanics” (Jaynes, 1957a,b). In these incredible papers, Jaynes showed that it was possible to re-derive the Boltzmann distribution (and the idea of generalized ensembles) entirely from information theoretic arguments. Quoting from the abstract, Jaynes considered “statistical mechanics as a form of statistical inference rather than as a physical theory” (portending the close connection between statistical physics and machine learning). Jaynes showed that the Boltzmann distribution could be viewed as resulting from a statistical inference procedure for learning probability distributions describing physical systems where one only has partial information about the system (usually the average energy).

The key quantity in MaxEnt models is the information theoretic, or Shannon, entropy, a concept introduced by Shannon in his landmark treatise on information theory (Shannon, 1949). The Shannon entropy quantifies the statistical uncertainty one has about the value of a random variable x drawn from a probability distribution p(x). The Shannon entropy of the distribution is defined as

S_p = -\mathrm{Tr}_x\, p(x) \log p(x) \qquad (179)

where the trace is a sum/integral over all possible values a variable can take. Jaynes showed that the Boltzmann distribution follows from the Principle of Maximum Entropy: a physical system should be described by the probability distribution with the largest entropy subject to certain constraints (often provided by measuring the average value of conserved, extensive quantities such as the energy, particle number, etc.). The principle uniquely specifies a procedure for parametrizing the functional form of the probability distribution. Once we have specified and learned this form we can, of course, generate new examples by sampling this distribution.

Let us illustrate how this works in more detail. Suppose that we have chosen a set of functions f_i(x) whose average value we want to fix to some observed values 〈f_i〉_obs. The Principle of Maximum Entropy states that we should choose the distribution p(x) with the largest uncertainty (i.e. largest Shannon entropy S_p), subject to the constraints that the model averages match the observed averages:

\langle f_i \rangle_\text{model} = \int dx\, f_i(x)\, p(x) = \langle f_i \rangle_\text{obs}. \qquad (180)

We can formulate the Principle of Maximum Entropy as an optimization problem using the method of Lagrange multipliers by minimizing:

\mathcal{L}[p] = -S_p + \sum_i \lambda_i \left( \langle f_i \rangle_\text{obs} - \int dx\, f_i(x)\, p(x) \right) + \gamma \left( 1 - \int dx\, p(x) \right),

where the first set of constraints enforces the requirement for the averages and the last constraint enforces the normalization that the trace over the probability distribution equals one. We can solve for p(x) by taking the functional derivative and setting it to zero:

0 = \frac{\delta \mathcal{L}}{\delta p} = (\log p(x) + 1) - \sum_i \lambda_i f_i(x) - \gamma.

The general form of the maximum entropy distribution is then given by

p(x) = \frac{1}{Z} e^{\sum_i \lambda_i f_i(x)}, \qquad (181)

where Z(\{\lambda_i\}) = \int dx\, e^{\sum_i \lambda_i f_i(x)} is the partition function.

The maximum entropy distribution is clearly just the usual Boltzmann distribution with energy E(x) = -\sum_i \lambda_i f_i(x). The values of the Lagrange multipliers are chosen to match the observed averages for the set of functions f_i(x) whose average value is being fixed:

\langle f_i \rangle_\text{model} = \int dx\, p(x) f_i(x) = \frac{\partial \log Z}{\partial \lambda_i} = \langle f_i \rangle_\text{obs}. \qquad (182)

In other words, the parameters of the distribution can be chosen such that

\partial_{\lambda_i} \log Z = \langle f_i \rangle_\text{data}. \qquad (183)
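As a simple worked example, consider a single binary variable x ∈ {−1, +1} with one constrained function f_1(x) = x. Then Z(\lambda_1) = \sum_{x=\pm 1} e^{\lambda_1 x} = 2\cosh\lambda_1, and Eq. (183) reads \partial_{\lambda_1}\log Z = \tanh\lambda_1 = \langle x \rangle_\text{data}, so the unique MaxEnt solution is \lambda_1 = \tanh^{-1}\langle x \rangle_\text{data}.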

To gain more intuition for the MaxEnt distribution, it is helpful to relate the Lagrange multipliers to the familiar thermodynamic quantities we use to describe physical systems (Jaynes, 1957a). Our x denotes the microscopic state of the system, i.e. the MaxEnt distribution is a probability distribution over microscopic states. However, in thermodynamics we only have access to average quantities. If we know only the average energy 〈E(x)〉_obs, the MaxEnt procedure tells us to maximize the entropy subject to the average energy constraint. This yields

p(x) = \frac{1}{Z} e^{-\beta E(x)}, \qquad (184)

where we have identified the Lagrange multiplier conjugate to the energy with the negative inverse temperature, \lambda_1 = -\beta, with \beta = 1/k_B T. Now, suppose we also constrain the particle number 〈N(x)〉_obs. Then, an almost identical calculation yields a MaxEnt distribution of the functional form

p(x) = \frac{1}{Z} e^{-\beta (E(x) - \mu N(x))}, \qquad (185)

where we have rewritten our Lagrange multipliers in the familiar thermodynamic notation \lambda_1 = -\beta and \lambda_2 = \beta\mu. Since this is just the Boltzmann distribution, we can also relate the partition function in our MaxEnt model to the thermodynamic free-energy via F = -\beta^{-1} \log Z. The choice of which quantities to constrain is equivalent to working in different thermodynamic ensembles.

2. From statistical mechanics to machine learning

The MaxEnt idea also provides a general procedure for learning a generative model from data. The key difference between MaxEnt models in (theoretical) physics and ML is that in ML we have no direct access to observed values 〈f_i〉_obs. Instead, these averages must be directly estimated from data (samples). To denote this difference, we will write empirical averages calculated from data as 〈f_i〉_data. We can think of MaxEnt as a statistical inference procedure simply by replacing 〈f_i〉_obs by 〈f_i〉_data above.

This subtle change has important implications for training MaxEnt models. First, since we do not know these averages exactly, but must estimate them from the data, our training procedures must be careful not to overfit to the observations (our samples might not be reflective of the true values of these statistics). Second, the averages of certain functions f_i are easier to estimate from limited data than others. This is often an important consideration when formulating which MaxEnt model to fit to the data. Finally, we note that unlike in physics, where conservation laws often suggest the functions f_i whose averages we hold fixed, ML offers no comparable guide for how to choose the f_i we care about. For these reasons, choosing the f_i is often far from straightforward. As a final point, we note that here we have presented a physics-based perspective for justifying the MaxEnt procedure. We mention in passing that MaxEnt in ML is also closely related to ideas from Bayesian inference (Jaynes, 1968, 2003), and this latter point of view is more common in discussions of MaxEnt in the statistics and ML literature.

3. Generalized Ising Models from MaxEnt

The form of a MaxEnt model is completely specified once we choose the averages f_i we wish to constrain. One common choice often used in MaxEnt modeling is to constrain the first two moments of a distribution. When our random variables x are continuous, the corresponding MaxEnt distribution is a multi-dimensional Gaussian. If the x are binary (discrete), then the corresponding MaxEnt distribution is a generalized Ising (Potts) model with all-to-all couplings.

To see this, consider a random variable x with first and second moments 〈x_i〉_data and 〈x_i x_j〉_data, respectively. According to the Principle of Maximum Entropy, we should choose to model this variable using a Boltzmann distribution with constraints on the first and second moments. Let a_i be the Lagrange multiplier associated with 〈x_i〉_data and J_ij/2 be the Lagrange multiplier associated with 〈x_i x_j〉_data. Using Eq. (182), it is easy to verify that the energy function

E(\mathbf{x}) = -\sum_i a_i x_i - \frac{1}{2} \sum_{ij} J_{ij} x_i x_j \qquad (186)

satisfies the above constraints.

Partition functions for maximum entropy models are often intractable to compute. Therefore, it is helpful to consider two special cases where x has different support (different kinds of data). First, consider the case that the random variables x ∈ R^n are real numbers. In this case we can compute the partition function directly:

Z = \int d\mathbf{x}\, e^{\mathbf{a}^T \mathbf{x} + \frac{1}{2} \mathbf{x}^T J \mathbf{x}} = \sqrt{(2\pi)^n \det J^{-1}}\; e^{-\frac{1}{2} \mathbf{a}^T J^{-1} \mathbf{a}}. \qquad (187)

The resulting probability density function is,

p(\mathbf{x}) = Z^{-1} e^{-E(\mathbf{x})}
= \frac{1}{\sqrt{(2\pi)^n \det J^{-1}}}\, e^{\frac{1}{2} \mathbf{a}^T J^{-1} \mathbf{a} + \mathbf{a}^T \mathbf{x} + \frac{1}{2} \mathbf{x}^T J \mathbf{x}}
= \frac{1}{\sqrt{(2\pi)^n \det \Sigma}}\, e^{-\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})}, \qquad (188)

where \boldsymbol{\mu} = -J^{-1}\mathbf{a} and \Sigma = -J^{-1}. This, of course, is the normalized, multi-dimensional Gaussian distribution.

Second, consider the case that the random variable x is binary with x_i ∈ {−1, +1}. The energy function takes the same form as Eq. (186), but the partition function can no longer be computed in closed form. This model is known as the Ising model in the physics literature, and is often called a Markov Random Field in the machine learning literature. It is well known to physicists that calculating the partition function for the Ising Model is intractable. For this reason, the best we can do is estimate it using numerical techniques such as MCMC methods or approximate methods like variational MFT methods, see Sec. XIV. Finally, we note that in ML it is common to use binary variables which take on values in x_i ∈ {0, 1} rather than ±1. This convention difference can be a source of confusion when translating between the ML and physics literatures and when using ML packages for physics problems.

C. Cost functions for training energy-based models

The MaxEnt procedure gives us a way of parametrizing an energy-based generative model. For any energy-based generative model, the energy function E(x, {θ_i}) depends on some parameters θ_i – couplings in the language of statistical physics – that must be inferred directly from the data. For example, for the MaxEnt models the θ_i are just the Lagrange multipliers λ_i introduced in the last section. The goal of the training procedure is to use the available training data to fit these parameters.

Like in many other ML techniques, we will fit these couplings by minimizing a cost function using stochastic gradient descent (cf. Sec. IV). Such a procedure naturally separates into two parts: choosing an appropriate cost function, and calculating the gradient of the cost function with respect to the model parameters. Formulating a cost function for generative models is a little bit trickier than for supervised, discriminative models. The objective of discriminative models is straightforward – predict the label from the features. However, what we mean by a “good” generative model is much harder to define using a cost function. We would like the model to generate examples similar to those we find in the training dataset. However, we would also like the model to be able to generalize – we do not want the model to reproduce “spurious details” that are particular to the training dataset. Unlike for discriminative models, there is no straightforward idea like cross-validation on the data labels that neatly addresses this issue. For this reason, formulating cost functions for generative models is subtle and represents an important and interesting open area of research.

Calculating the gradients of energy-based models also turns out to be different than for discriminative models, such as deep neural networks. Rather than relying on automatic differentiation techniques and backpropagation (see Sec. IX.C), calculating the gradient requires drawing on intuitions from MCMC-based methods. Below, we provide an in-depth discussion of Boltzmann learning for energy-based generative models, focusing on MaxEnt models. We put the emphasis on training procedures that generalize to more complicated generative models with latent variables such as the RBMs discussed in the next section. Therefore, we largely ignore the incredibly rich physics-based literature on fitting Ising-like MaxEnt models (see the recent reviews (Baldassi et al., 2018; Nguyen et al., 2017) and references therein).

1. Maximum likelihood

By far the most common approach used for training a generative model is to maximize the log-likelihood of the training data set. Recall that the log-likelihood characterizes the log-probability of generating the observed data using our generative model. By choosing the negative log-likelihood as the cost function, the learning procedure tries to find parameters that maximize the probability of the data. This cost function is intuitive and has been the work-horse of most generative modeling. However, we note that the Maximum Likelihood Estimation (MLE) procedure has some important limitations that we will return to in Sec. XVII.

In what follows, we employ a general notation that is applicable to all energy-based models, not just the MaxEnt models introduced above. The reason for this is that much of this discussion does not rely on the specific form of the energy function but only on the fact that our generative model takes a Boltzmann form. We denote the generative model by the probability distribution p_θ(x) and its corresponding log-partition function by log Z({θ_i}). In MLE, the parameters of the model are fit by maximizing the log-likelihood:

L(\{\theta_i\}) = \langle \log p_\theta(\mathbf{x}) \rangle_\text{data} = -\langle E(\mathbf{x}; \{\theta_i\}) \rangle_\text{data} - \log Z(\{\theta_i\}), \qquad (189)

where we have set β = 1. In writing this expression we made use of two facts: (i) our generative distribution is of the Boltzmann form, and (ii) the partition function does not depend on the data:

\langle \log Z(\{\theta_i\}) \rangle_\text{data} = \log Z(\{\theta_i\}). \qquad (190)

2. Regularization

Just as for discriminative models like linear and logistic regression, it is common to supplement the log-likelihood with additional regularization terms (see Secs. VI and VII). Instead of minimizing the negative log-likelihood, one minimizes a cost function of the form

-L(\{\theta_i\}) + E_\text{reg}(\{\theta_i\}), \qquad (191)

where E_reg({θ_i}) is an additional regularization term that prevents overfitting. From a Bayesian perspective, this new term can be viewed as encoding a (negative) log-prior on model parameters, so that one performs a maximum-a-posteriori (MAP) estimate instead of MLE (see the corresponding discussion in Sec. VI).

As we saw by studying linear regression, different forms of regularization give rise to different kinds of properties. A common choice for the regularization function is the sum of the L1 or L2 norms of the parameters,

E_\text{reg}(\{\theta_i\}) = \Lambda \sum_i |\theta_i|^\alpha, \quad \alpha = 1, 2, \qquad (192)

with Λ controlling the regularization strength. For Λ = 0, there is no regularization and we are simply performing MLE. In contrast, a choice of large Λ will force many parameters to be close to or exactly zero. Just as in regression, an L1 penalty enforces sparsity, with many of the θ_i set to zero, and L2 regularization shrinks the size of the parameters towards zero.
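In code, the contribution of Eq. (192) to the gradient of the cost is a one-liner (a minimal sketch; the names are ours):

```python
import numpy as np

def reg_gradient(theta, strength, alpha=1):
    """Gradient of E_reg = strength * sum_i |theta_i|^alpha from Eq. (192)."""
    if alpha == 1:
        return strength * np.sign(theta)  # L1: drives parameters to exactly zero
    return 2.0 * strength * theta         # alpha == 2, L2: shrinks parameters
```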

One challenge of generative models is that it is often difficult to choose the regularization strength Λ. Recall that, for linear and logistic regression, Λ is chosen to maximize the out-of-sample performance on a validation dataset (i.e. cross-validation). However, for generative models our data are usually unlabeled. Therefore, choosing a regularization strength is more subtle, and there exists no universal procedure for choosing Λ. One common strategy is to divide the data into a training set and a validation set and monitor a summary statistic such as the log-likelihood, energy distance (Székely, 2003), or variational free-energy of the generative model on the training and validation sets (the variational free-energy was discussed extensively in Sec. XIV) (Hinton, 2012). If the gap between the training and validation datasets starts growing, one is probably overfitting the model even if the log-likelihood of the training dataset is still increasing. This also gives a procedure for “early stopping” – a regularization procedure we introduced in the context of discriminative models. In practice, when using such regularizers it is important to try many different values of Λ and then use a proxy statistic for overfitting to evaluate the optimal choice of Λ.

D. Computing gradients

We still need to specify a procedure for minimizing the cost function. One powerful and common choice that is widely employed when training energy-based models is stochastic gradient descent (SGD) (see Sec. IV). Performing MLE using SGD requires calculating the gradient of the log-likelihood Eq. (189) with respect to the parameters θ_i. To simplify notation and gain intuition, it is helpful to define “operators” O_i(x), conjugate to the parameters θ_i:

O_i(\mathbf{x}) = \frac{\partial E(\mathbf{x}; \{\theta_i\})}{\partial \theta_i}. \qquad (193)

Since the partition function is just the cumulant generating function for the Boltzmann distribution, we know that the usual statistical mechanics relationships between expectation values and derivatives of the log-partition function hold:

\langle O_i(\mathbf{x}) \rangle_\text{model} = \mathrm{Tr}_\mathbf{x}\, p_\theta(\mathbf{x}) O_i(\mathbf{x}) = -\frac{\partial \log Z(\{\theta_i\})}{\partial \theta_i}. \qquad (194)

In terms of the operators O_i(x), the gradient of Eq. (189) takes the form (Ackley et al., 1987)

-\frac{\partial L(\{\theta_i\})}{\partial \theta_i} = \left\langle \frac{\partial E(\mathbf{x}; \{\theta_i\})}{\partial \theta_i} \right\rangle_\text{data} + \frac{\partial \log Z(\{\theta_i\})}{\partial \theta_i} = \langle O_i(\mathbf{x}) \rangle_\text{data} - \langle O_i(\mathbf{x}) \rangle_\text{model}. \qquad (195)

These equations have a simple and beautiful interpretation. The gradient of the log-likelihood with respect to a model parameter is a difference of moments – one calculated directly from the data and one calculated from our model using the current model parameters. The data-dependent term is known as the positive phase of the gradient and the model-dependent term is known as the negative phase of the gradient. This derivation also gives an intuitive explanation for likelihood-based training procedures. The gradient acts on the model to lower the energy of configurations that are near observed data points while raising the energy of configurations that are far from observed data points. Finally, we note that all information about the data only enters the training procedure through the expectations 〈O_i(x)〉_data, and our generative model is blind to information beyond what is contained in these expectations.

To use SGD, we must still calculate the expectation values that appear in Eq. (195). The positive phase of the gradient – the expectation values with respect to the data – can be easily calculated using samples from the training dataset. However, the negative phase – the expectation values with respect to the model – is generally much more difficult to compute. We will see that in almost all cases, we will have to resort to either numerical or approximate methods. The fundamental reason for this is that it is impossible to calculate the partition function exactly for most interesting models in both physics and ML.

There are exceptional cases in which we can calculate expectation values analytically. When this happens, the generative model is said to have a Tractable Likelihood. One example of a generative model with a Tractable Likelihood is the Gaussian MaxEnt model for real-valued data discussed in Eq. (188). The parameters/Lagrange multipliers for this model are the local fields a and the pairwise coupling matrix J. In this case, the usual manipulations involving Gaussian integrals allow us to exactly find the parameters µ = −J^{−1}a and Σ = −J^{−1}, yielding the familiar expressions µ = 〈x〉_data and Σ = 〈(x − 〈x〉_data)(x − 〈x〉_data)^T〉_data. These are the standard estimates for the sample mean and covariance matrix. Converting back to the Lagrange multipliers yields

J = -\langle (\mathbf{x} - \langle \mathbf{x} \rangle_\text{data})(\mathbf{x} - \langle \mathbf{x} \rangle_\text{data})^T \rangle^{-1}_\text{data}. \qquad (196)
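Because these expressions reduce to sample moments, fitting the Gaussian MaxEnt model requires no iterative training at all. A minimal numpy sketch (the variable names are ours):

```python
import numpy as np

def fit_gaussian_maxent(X):
    """Exact MLE for the Gaussian MaxEnt model of Eq. (188).
    X has shape (n_samples, n_features); returns (a, J, mu, Sigma)."""
    mu = X.mean(axis=0)                 # mu = <x>_data
    Sigma = np.cov(X, rowvar=False)     # sample covariance matrix
    J = -np.linalg.inv(Sigma)           # Eq. (196): J = -Sigma^{-1}
    a = -J @ mu                         # from mu = -J^{-1} a
    return a, J, mu, Sigma
```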

Returning to the generic case where most energy-based models have intractable likelihoods, we must estimate expectation values numerically. One way to do this is to draw samples S_model = {x′_i} from the model p_θ(x) and evaluate arbitrary expectation values using these samples:

\langle f(\mathbf{x}) \rangle_\text{model} = \int d\mathbf{x}\, p_\theta(\mathbf{x}) f(\mathbf{x}) \approx \sum_{\mathbf{x}'_i \in S_\text{model}} f(\mathbf{x}'_i). \qquad (197)

The samples from the model x′_i ∈ S_model are often referred to as fantasy particles in the ML literature and can be generated using simple MCMC algorithms such as Metropolis-Hastings, which are covered in most modern statistical physics classes. However, if the reader is unfamiliar with MCMC methods or wants a quick refresher, we recommend the concise and beautiful discussion of MCMC methods from both the physics and ML point-of-view in Chapters 29-32 of David MacKay’s masterful book (MacKay, 2003).
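For the binary (Ising-like) MaxEnt model of Eq. (186), a bare-bones single-spin-flip Metropolis sampler for fantasy particles might look as follows (a sketch under our own naming conventions, not code from the notebooks):

```python
import numpy as np

def metropolis_fantasy(a, J, n_steps=10_000, rng=None):
    """Draw one fantasy particle x ~ p(x) = exp(-E(x))/Z with the
    energy E(x) of Eq. (186) and x_i in {-1, +1}."""
    rng = rng or np.random.default_rng()
    n = len(a)
    x = rng.choice([-1, 1], size=n)
    for _ in range(n_steps):
        i = rng.integers(n)
        # Energy change from flipping spin i (J assumed symmetric)
        dE = 2 * x[i] * (a[i] + J[i] @ x - J[i, i] * x[i])
        if dE <= 0 or rng.random() < np.exp(-dE):
            x[i] = -x[i]
    return x
```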

Finally, we note that once we have the fantasy particles from the model, we can also easily calculate the gradient of any expectation value 〈f(x)〉_model using what is commonly called the “log-derivative trick” in ML (Fu, 2006; Kleijnen and Rubinstein, 1996):

\partial_{\theta_i} \langle f(\mathbf{x}) \rangle_\text{model} = \int d\mathbf{x}\, \frac{\partial p_\theta(\mathbf{x})}{\partial \theta_i} f(\mathbf{x})
= \left\langle \frac{\partial \log p_\theta(\mathbf{x})}{\partial \theta_i} f(\mathbf{x}) \right\rangle_\text{model}
= \langle O_i(\mathbf{x}) f(\mathbf{x}) \rangle_\text{model}
\approx \sum_{\mathbf{x}'_j \in S_\text{model}} O_i(\mathbf{x}'_j) f(\mathbf{x}'_j). \qquad (198)

This expression allows us to take gradients of more complex cost functions beyond the MLE procedure discussed here.

E. Summary of the training procedure

We now summarize the discussion above and present a general procedure for training an energy-based model using SGD on the cost function (see Sec. IV). Our goal is to fit the parameters of a model p_θ(x) = Z^{-1} e^{-E(x, {θ_i})}. Training the model involves the following steps (a schematic implementation is sketched after the list):

1. Read a minibatch of data, x.

2. Generate fantasy particles x′ ∼ p_θ using an MCMC algorithm (e.g., Metropolis-Hastings).

3. Compute the gradient of the log-likelihood using these samples and Eq. (195), where the averages are taken over the minibatch of data and the fantasy particles from the model, respectively.

4. Use the gradient as input to one of the gradient-based optimizers discussed in Sec. IV.
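The following schematic loop implements these four steps for the Ising-like MaxEnt model, reusing the illustrative metropolis_fantasy sampler sketched above (constant factors from Eq. (186) are absorbed into the learning rate; this is a sketch, not the Paysage implementation):

```python
import numpy as np

def train_maxent(X, n_epochs=50, lr=0.01, batch_size=64, rng=None):
    """Schematic Boltzmann learning for Eq. (186) via the gradient Eq. (195).
    X holds training configurations with entries in {-1, +1}."""
    rng = rng or np.random.default_rng()
    n = X.shape[1]
    a, J = np.zeros(n), np.zeros((n, n))
    for epoch in range(n_epochs):
        for _ in range(len(X) // batch_size):
            # 1. Read a minibatch of data
            batch = X[rng.choice(len(X), batch_size)]
            # 2. Generate fantasy particles with MCMC
            fantasy = np.array([metropolis_fantasy(a, J, rng=rng)
                                for _ in range(batch_size)])
            # 3. Gradient = <O>_data - <O>_model, Eq. (195)
            grad_a = batch.mean(0) - fantasy.mean(0)
            grad_J = (batch.T @ batch - fantasy.T @ fantasy) / batch_size
            # 4. Simple gradient ascent on the log-likelihood
            a += lr * grad_a
            J += lr * grad_J
    return a, J
```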

In practice, it is helpful to supplement this basic procedure with some tricks that help training. As with discriminative neural networks, it is important to initialize the parameters properly and print summary statistics during the training procedure on the training and validation sets to prevent overfitting. These and many other “cheap tricks” have been nicely summarized in a short note from the Hinton group (Hinton, 2012).

A major computational and practical limitation of these methods is that it is often hard to draw samples from generative models. MCMC methods often have long mixing times (the time one has to run the Markov chain to get uncorrelated samples) and this can result in biased sampling. Luckily, we often do not need to know the gradients exactly for training ML models (recall that noisy gradient estimates often help the convergence of gradient descent algorithms), and we can significantly reduce the computational expense by running MCMC for a reasonable time window. We will exploit this observation extensively in the next section when we discuss how to train more complex energy-based models with hidden variables.

XVI. DEEP GENERATIVE MODELS: HIDDEN VARIABLES AND RESTRICTED BOLTZMANN MACHINES (RBMS)

The last section introduced many of the core ideas behind energy-based generative models. Here, we extend this discussion to energy-based models that include latent or hidden variables.

Including latent variables in generative models greatly enhances their expressive power – allowing the model to represent sophisticated correlations between visible features without sacrificing trainability. By having multiple layers of latent variables, we can even construct powerful deep generative models that possess many of the same desirable properties as deep, discriminative neural networks.

We begin with a discussion that tries to provide a simple intuition for why latent variables are such a powerful tool for generative models. Next, we introduce a powerful class of latent variable models called Restricted Boltzmann Machines (RBMs) and discuss techniques for training these models. After that, we introduce Deep Boltzmann Machines (DBMs), which have multiple layers of latent variables. We then introduce the new Paysage package for training energy-based models and demonstrate how to use it on the MNIST dataset and samples from the Ising model. We conclude by discussing recent physics literature related to energy-based generative models.

A. Why hidden (latent) variables?

Latent or hidden variables are a powerful yet elegant way to encode sophisticated correlations between observable features. The underlying reason for this is that marginalizing over a subset of variables – “integrating out” degrees of freedom in the language of physics – induces complex interactions between the remaining variables. The idea that integrating out variables can lead to complex correlations is a familiar component of many physical theories. For example, when considering free electrons living on a lattice, integrating out phonons gives rise to higher-order electron-electron interactions (e.g. superconducting or magnetic correlations). More generally, in the Wilsonian renormalization group paradigm, all effective field theories can be thought of as arising from integrating out high-energy degrees of freedom (Wilson and Kogut, 1974).

Generative models with latent variables run this logic in reverse – they encode complex interactions between visible variables by introducing additional, hidden variables that interact with the visible degrees of freedom in a simple manner, yet still reproduce the complex correlations between the visible degrees of freedom in the data once marginalized over (integrated out). This allows us to encode complex higher-order interactions between the visible variables using simpler interactions at the cost of introducing new latent variables/degrees of freedom. This trick is also widely exploited in physics (e.g. in the Hubbard-Stratonovich transformation (Hubbard, 1959; Stratonovich, 1957) or the introduction of ghost fields in gauge theory (Faddeev and Popov, 1967)).

To make these ideas more concrete, let us revisit the pairwise Ising model introduced in the discussion of MaxEnt models, see Eq. (186). The model is described by a Boltzmann distribution with energy

E(\mathbf{v}) = -\sum_i a_i v_i - \frac{1}{2} \sum_{ij} v_i J_{ij} v_j, \qquad (199)

where J_ij is a symmetric coupling matrix that encodes the pairwise constraints and the a_i enforce the single-variable constraints.

Our goal is to replace the complicated interactions between the visible variables v_i encoded by J_ij by interactions with a new set of latent variables h_µ. In order to do this, it is helpful to rewrite the coupling matrix in a slightly different form. Using SVD, we can always express the coupling matrix in the form J_{ij} = \sum_{\mu=1}^{N} W_{i\mu} W_{j\mu}, where the \{W_{i\mu}\}_i are appropriately normalized singular vectors. In terms of W_{iµ}, the energy takes the form

E_\text{Hop}(\mathbf{v}) = -\sum_i a_i v_i - \frac{1}{2} \sum_{ij\mu} v_i W_{i\mu} W_{j\mu} v_j. \qquad (200)

We note that in the special case when both v_i ∈ {−1, +1} and W_{iµ} ∈ {−1, +1} are binary variables, a model with this form of the energy function is known as the Hopfield model (Amit et al., 1985; Hopfield, 1982). The Hopfield model has played an extremely important role in statistical physics, computational neuroscience, and machine learning, and a full discussion of its properties is well beyond the scope of this review [see (Amit, 1992) for a beautiful discussion that combines all these perspectives]. Therefore, here we refer to all energy functions of the form Eq. (200) as (generalized) Hopfield models, even for the case when the W_{iµ} are continuous variables.

We now “decouple” the visible variables v_i by introducing a set of normally distributed, continuous latent variables h_µ (in condensed matter language, we perform a Hubbard-Stratonovich transformation). Using the usual identity for Gaussian integrals, we can rewrite the Boltzmann distribution for the generalized Hopfield model as

p(\mathbf{v}) = \frac{e^{\sum_i a_i v_i + \frac{1}{2}\sum_{ij\mu} v_i W_{i\mu} W_{j\mu} v_j}}{Z}
= \frac{e^{\sum_i a_i v_i} \prod_\mu \int dh_\mu\, e^{-\frac{1}{2} h_\mu^2 + \sum_i v_i W_{i\mu} h_\mu}}{Z}
= \frac{\int d\mathbf{h}\, e^{-E(\mathbf{v},\mathbf{h})}}{Z}, \qquad (201)


FIG. 61 A Restricted Boltzmann Machine (RBM) consists of visible units v_i and hidden units h_µ that interact with each other through interactions of the form W_{iµ} v_i h_µ. Importantly, there are no interactions between visible units themselves or hidden units themselves.

where E(v, h) is a joint energy functional of both the latent and visible variables of the form

E(\mathbf{v},\mathbf{h}) = -\sum_i a_i v_i + \frac{1}{2} \sum_\mu h_\mu^2 - \sum_{i\mu} v_i W_{i\mu} h_\mu. \qquad (202)

We can also use the energy function E(v, h) to define a new energy-based model p(v, h) on both the latent and visible variables:

p(\mathbf{v},\mathbf{h}) = \frac{e^{-E(\mathbf{v},\mathbf{h})}}{Z'}. \qquad (203)

Marginalizing over the latent variables of course gives us back the generalized Hopfield model (Barra et al., 2012):

p(\mathbf{v}) = \int d\mathbf{h}\, p(\mathbf{v},\mathbf{h}) = \frac{e^{-E_\text{Hop}(\mathbf{v})}}{Z}. \qquad (204)

Notice that E(v, h) contains no direct interactions between visible degrees of freedom (or between hidden degrees of freedom). Instead, the complex correlations between the v_i are encoded in the interaction between the visible v_i and latent variables h_µ. It turns out that the model presented here is a special case of a more general class of powerful energy-based models called Restricted Boltzmann Machines (RBMs).

B. Restricted Boltzmann Machines (RBMs)

A Restricted Boltzmann Machine (RBM) is an energy-based model with both visible and hidden units where the visible and hidden units interact with each other but do not interact among themselves. The energy function of an RBM takes the general functional form

E(\mathbf{v},\mathbf{h}) = -\sum_i a_i(v_i) - \sum_\mu b_\mu(h_\mu) - \sum_{i\mu} W_{i\mu} v_i h_\mu, \qquad (205)

where a_i(·) and b_µ(·) are functions that we are free to choose. The most common choice is:

a_i(v_i) = \begin{cases} a_i v_i, & \text{if } v_i \in \{0,1\} \text{ is binary} \\ \frac{v_i^2}{2\sigma_i^2}, & \text{if } v_i \in \mathbb{R} \text{ is continuous}, \end{cases}


and

b_\mu(h_\mu) = \begin{cases} b_\mu h_\mu, & \text{if } h_\mu \in \{0,1\} \text{ is binary} \\ \frac{h_\mu^2}{2\sigma_\mu^2}, & \text{if } h_\mu \in \mathbb{R} \text{ is continuous}. \end{cases}

For this choice of a_i(·) and b_µ(·), layers consisting of discrete binary units are often called Bernoulli layers, and layers consisting of continuous variables are often called Gaussian layers. The basic bipartite structure of an RBM – i.e., a visible and hidden layer that interact with each other but not among themselves – is often depicted using a graph of the form shown in Fig. 61.
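For concreteness, the Bernoulli-Bernoulli version of Eq. (205) takes only a few lines of numpy (a sketch with our own variable names, not the Paysage API):

```python
import numpy as np

def rbm_energy(v, h, a, b, W):
    """Bernoulli-Bernoulli RBM energy, Eq. (205), with a_i(v_i) = a_i v_i and
    b_mu(h_mu) = b_mu h_mu; W has shape (n_visible, n_hidden)."""
    return -(a @ v) - (b @ h) - v @ W @ h
```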

An RBM can have different properties depending on whether the hidden and visible layers are taken to be Bernoulli or Gaussian. The most common choice is to have both the visible and hidden units be Bernoulli. This is what is typically meant by an RBM. However, other combinations are also possible and used in the ML literature. When all the units are continuous, the RBM reduces to a multi-dimensional Gaussian with a very particular correlation structure. When the hidden units are continuous and the visible units are discrete, the RBM is equivalent to a generalized Hopfield model (see the discussion above). When the visible units are continuous and the hidden units are discrete, the RBM is often called a Gaussian-Bernoulli Restricted Boltzmann Machine (Dahl et al., 2010; Hinton and Salakhutdinov, 2006). It is even possible to perform multi-modal learning with a mixture of continuous and discrete variables. For all these architectures, the important point is that all interactions occur only between the visible and hidden units and there are no interactions between units within the hidden or visible layers, see Fig. 61. This is analogous to Quantum Electrodynamics, where a free fermion and a free photon interact with one another but not among themselves.

Specifying a generative model with this bipartite interaction structure has two major advantages: (i) it enables capturing both pairwise and higher-order correlations between the visible units and (ii) it makes it easier to sample from the model using an MCMC method known as block Gibbs sampling, which in turn makes the model easier to train.

Before discussing training, it is worth better understanding the kind of correlations that can be captured using an RBM. To do so, we can marginalize over the hidden units and ask about the resulting distribution over just the visible units:

p(\mathbf{v}) = \int d\mathbf{h}\, p(\mathbf{v},\mathbf{h}) = \int d\mathbf{h}\, \frac{e^{-E(\mathbf{v},\mathbf{h})}}{Z}, \qquad (206)

where the integral should be replaced by a trace in all expressions for discrete units.

We can also define a marginal energy using the expression

p(\mathbf{v}) = \frac{e^{-E(\mathbf{v})}}{Z}. \qquad (207)

Combining these equations,

E(\mathbf{v}) = -\log \int d\mathbf{h}\, e^{-E(\mathbf{v},\mathbf{h})}
= -\sum_i a_i(v_i) - \sum_\mu \log \int dh_\mu\, e^{b_\mu(h_\mu) + \sum_i v_i W_{i\mu} h_\mu}.

To understand what correlations are captured by p(v), it is helpful to introduce the distribution

q_\mu(h_\mu) = \frac{e^{b_\mu(h_\mu)}}{Z} \qquad (208)

of hidden units h_µ, ignoring the interactions between v and h, and the cumulant generating function

K_\mu(t) = \log \int dh_\mu\, q_\mu(h_\mu)\, e^{t h_\mu} = \sum_n \kappa_\mu^{(n)} \frac{t^n}{n!}. \qquad (209)

K_\mu(t) is defined such that the nth cumulant is \kappa_\mu^{(n)} = \partial_t^n K_\mu|_{t=0}.

The cumulant generating function appears in the marginal free-energy of the visible units, which can be rewritten (up to a constant term) as:

E(\mathbf{v}) = -\sum_i a_i(v_i) - \sum_\mu K_\mu\!\left(\sum_i W_{i\mu} v_i\right)
= -\sum_i a_i(v_i) - \sum_\mu \sum_n \kappa_\mu^{(n)} \frac{\left(\sum_i W_{i\mu} v_i\right)^n}{n!}
= -\sum_i a_i(v_i) - \sum_i \left(\sum_\mu \kappa_\mu^{(1)} W_{i\mu}\right) v_i - \frac{1}{2} \sum_{ij} \left(\sum_\mu \kappa_\mu^{(2)} W_{i\mu} W_{j\mu}\right) v_i v_j + \dots \qquad (210)

We see that the marginal energy includes all orders of interactions between the visible units, with the n-th order cumulants of q_µ(h_µ) weighting the n-th order interactions between the visible units. In the case of the Hopfield model we discussed previously, q_µ(h_µ) is a standard Gaussian distribution where the mean is \kappa_\mu^{(1)} = 0, the variance is \kappa_\mu^{(2)} = 1, and all higher-order cumulants are zero. Plugging these cumulants into Eq. (210) recovers Eq. (202).

These calculations make clear the underlying reason for the incredible representational power of RBMs with a Bernoulli hidden layer. Each hidden unit can encode interactions of arbitrarily high order. By combining many different hidden units, we can encode very complex interactions at all orders. Moreover, we can learn which orders of correlations/interactions are important directly from the data instead of having to specify them ahead of time, as we did in the MaxEnt models. This highlights the power of generative models with even the simplest interactions between visible and latent variables to encode, learn, and represent complex correlations present in the data.


C. Training RBMs

RBMs are a special class of energy-based generative models, which can be trained using the Maximum Likelihood Estimation (MLE) procedure described in detail in Sec. XV. To briefly recap: first, we must choose a cost function – for MLE this is just the negative log-likelihood, with or without an additional regularization term to prevent overfitting. We then minimize this cost function using one of the Stochastic Gradient Descent (SGD) methods described in Sec. IV.

The gradient itself can be calculated using Eq. (195). For example, for the Bernoulli-Bernoulli RBM in Eq. (205) we have

\frac{\partial L(\{W_{i\mu}, a_i, b_\mu\})}{\partial W_{i\mu}} = \langle v_i h_\mu \rangle_\text{data} - \langle v_i h_\mu \rangle_\text{model}
\frac{\partial L(\{W_{i\mu}, a_i, b_\mu\})}{\partial a_i} = \langle v_i \rangle_\text{data} - \langle v_i \rangle_\text{model}
\frac{\partial L(\{W_{i\mu}, a_i, b_\mu\})}{\partial b_\mu} = \langle h_\mu \rangle_\text{data} - \langle h_\mu \rangle_\text{model}, \qquad (211)

where the positive expectation with respect to the data is understood to mean sampling from the model while clamping the visible units to their observed values in the data. As before, calculating the negative phase of the gradient (i.e. the expectation value with respect to the model) requires that we draw samples from the model. Luckily, the bipartite form of the interactions in RBMs was specifically chosen with this in mind.

1. Gibbs sampling and contrastive divergence (CD)

The bipartite interaction structure of an RBM makes it possible to calculate expectation values using a Markov Chain Monte Carlo (MCMC) method known as Gibbs sampling. The key reason for this is that since there are no interactions of visible units with themselves or hidden units with themselves, the visible and hidden units of an RBM are conditionally independent:

p(\mathbf{v}|\mathbf{h}) = \prod_i p(v_i|\mathbf{h})
p(\mathbf{h}|\mathbf{v}) = \prod_\mu p(h_\mu|\mathbf{v}), \qquad (212)

with

p(v_i = 1|\mathbf{h}) = \sigma\left(a_i + \sum_\mu W_{i\mu} h_\mu\right)
p(h_\mu = 1|\mathbf{v}) = \sigma\left(b_\mu + \sum_i W_{i\mu} v_i\right), \qquad (213)

and where σ(z) = 1/(1 + e^{−z}) is the sigmoid function. Using these expressions it is easy to compute expectation values with respect to the data.

FIG. 62 (Top: Alternating Gibbs Sampling) To draw fantasy particles (samples from the model) we can perform alternating (block) Gibbs sampling between the visible and hidden layers, starting with a sample from the data, using the conditional distributions p(h|v) and p(v|h). The “time” t corresponds to the time in the Markov chain for the Monte Carlo and measures the number of passes between the visible and hidden states. (Middle: Contrastive Divergence, CD-n) In CD, we approximately sample the model by terminating the Gibbs sampling after n steps, starting from the data. (Bottom: Persistent Contrastive Divergence, PCD-n) In PCD, instead of restarting the Gibbs sampler from the data, we initialize the sampler with the fantasy particles calculated from the model at the last SGD step.

The input to gradient descent is a minibatch of observed data. For each sample in the minibatch, we simply clamp the visible units to the observed values and apply Eq. (213) using the probability for the hidden variables. We then average over all samples in the minibatch to calculate expectation values with respect to the data. To calculate expectation values with respect to the model, we use (block) Gibbs sampling. The idea behind (block) Gibbs sampling is to iteratively sample from the conditional distributions h_{t+1} ∼ p(h|v_t) and v_{t+1} ∼ p(v|h_{t+1}) (see Figure 62, top). Since the units are conditionally independent, each step of this iteration can be performed by simply drawing random numbers. The samples are guaranteed to converge to the equilibrium distribution of the model in the limit that t → ∞. At the end of the Gibbs sampling procedure, one ends up with a minibatch of samples (fantasy particles).
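A minimal sketch of one alternating (block) Gibbs update for a Bernoulli-Bernoulli RBM, implementing Eq. (213) (illustrative names, not the Paysage API):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_step(v, a, b, W, rng):
    """One block Gibbs update: h ~ p(h|v), then v' ~ p(v|h), per Eq. (213)."""
    p_h = sigmoid(b + v @ W)                          # p(h_mu = 1 | v)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(a + W @ h)                          # p(v_i = 1 | h)
    v_new = (rng.random(p_v.shape) < p_v).astype(float)
    return v_new, h
```

Because the conditional distributions factorize, each half-step is just a matrix-vector product followed by independent coin flips.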

One drawback of Gibbs sampling is that it may take many back-and-forth iterations to draw an independent sample. For this reason, the Hinton group introduced an approximate Gibbs sampling technique called Contrastive Divergence (CD) (Hinton, 2002; Hinton et al., 2006). In CD-n, we just perform n iterations of (block) Gibbs sampling, with n often taken to be as small as 1 (see Figure 62)! The price for this truncation is, of course, that we are not drawing samples from the true model distribution. But for our purpose – using the expectations to estimate the gradient for SGD – CD-n has proven to work reasonably well. As long as the approximate gradients are reasonably correlated with the true gradient, SGD will move in a reasonable direction. CD-n of course does come at a price. Truncating the Gibbs sampler prevents sampling far away from the starting point, which for CD-n are the data points in the minibatch. Therefore, our generative model will be much more accurate around regions of feature space close to our training data. Thus, as is often the case in ML, CD-n sacrifices the ability to generalize to some extent in order to make the model easier to train.

Some of these undesirable features can be tempered by using a slightly different variant of CD called Persistent Contrastive Divergence (PCD) (Tieleman and Hinton, 2009). In PCD, rather than restarting the Gibbs sampler from the data at each gradient descent step, we start the Gibbs sampling at the fantasy particles from the last gradient descent step (see Fig. 62). Since the parameters change slowly compared to the Gibbs sampling, samples that are high probability at one step of the SGD are also likely to be high probability at the next step. This ensures that PCD does not introduce large errors in the estimation of the gradients. The advantage of using fantasy particles to initialize the Gibbs sampler is to allow PCD to explore parts of the feature space that are much further from the training dataset than one could reach with ordinary CD.
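In code, the only difference between CD-n and PCD-n is where the Gibbs chain starts; a schematic, reusing the gibbs_step helper sketched above:

```python
def negative_phase(v_data, chain, a, b, W, rng, n=1, persistent=True):
    """Fantasy particles for the negative phase: CD-n restarts the chain at
    the data, PCD-n continues the persistent chain from the last SGD step."""
    v = chain if persistent else v_data.copy()
    for _ in range(n):
        v, h = gibbs_step(v, a, b, W, rng)
    return v  # for PCD, keep this as the persistent chain for the next step
```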

We note that, in applications using RBMs as a variational ansatz for quantum states, Gibbs sampling is not necessarily the best option for training, and in practice parallel tempering or other Metropolis schemes can outperform Gibbs sampling. In fact, Gibbs sampling is not even feasible with the complex-valued weights required for quantum wavefunctions, whereas Metropolis schemes might be feasible (Carleo, 2018).

2. Practical Considerations

The previous section gave an overview of how to train RBMs. However, there are many “tricks of the trade” that are missing from this discussion. Luckily, a succinct summary of these has been compiled by Geoff Hinton and published as a note that readers interested in training RBMs are urged to consult (Hinton, 2012).

For completeness, we briefly list some of the important points here:

• Initialization.—The model must be initialized. Hinton suggests taking the weights W_{iµ} from a Gaussian with mean zero and standard deviation σ = 0.01 (Hinton, 2012). An alternative initialization scheme proposed by Glorot and Bengio instead chooses the standard deviation to scale with the size of the layers: σ = 2/√(N_v + N_h), where N_v and N_h are the numbers of visible and hidden units, respectively (Glorot and Bengio, 2010). The bias of the hidden units is initialized to zero, while the bias of the visible units is typically taken to be inversely proportional to the mean activation, a_i = 〈v_i〉^{-1}_data.

FIG. 63 Deep Boltzmann Machines (DBMs) contain multiple hidden layers. To train deep networks, we first perform layerwise training, in which each pair of adjacent layers is treated as an RBM (Layerwise Pretraining). This can be followed by fine-tuning of the full DBM using gradient descent and persistent contrastive divergence (PCD).

• Regularization.—One can of course use an L1 or L2 penalty, typically only on the weight parameters, not the biases. Alternatively, Dropout has been shown to decrease overfitting when training with CD and PCD, and results in more interpretable learned features.

• Learning Rates.—Typically, it is helpful to reduce the learning rate in later stages of training.

• Updates for CD and PCD.—There are several computational tricks one can use for speeding up the alternating updates in CD and PCD, see Section 3 in (Hinton, 2012).

D. Deep Boltzmann Machine

In this section, we introduce Deep Boltzmann Machines (DBMs). Unlike RBMs, DBMs possess multiple hidden layers and were the first models rebranded as “deep learning” (Hinton et al., 2006; Hinton and Salakhutdinov, 2006).^18 Many of the advantages that are thought to stem from having deep layers were already discussed in Sec. XI in the context of discriminative DNNs. Here, we revisit many of the same themes with emphasis on energy-based models.

^18 Technically, these were Deep Belief Networks, where only the top layer was undirected.

An RBM is composed of two layers of neurons that are connected via an undirected graph, see Fig. 61. As a result, it is possible to perform sampling v ∼ p(v|h) and inference h ∼ p(h|v) with the same model. As with the Hopfield model, we can view each of the hidden units as representative of a pattern, or feature, that could be present in the data.^19 The inference step involves assigning a probability to each of these features that expresses the degree to which each feature is present in a given data sample. In an RBM, hidden units do not influence each other during the inference step, i.e. hidden units are conditionally independent given the visible units. There are a number of reasons why this is unsatisfactory. One reason is the desire for sparse distributed representations, where each observed visible vector will strongly activate a few (i.e. more than one but only a very small fraction) of the hidden units. In the brain, this is thought to be achieved by inhibitory lateral connections between neurons. However, adding lateral intra-layer connections between the hidden units makes the distribution difficult to sample from, so we need to come up with another way of creating connections between the hidden units.

With the Hopfield model, we saw that pairwise linear connections between neurons can be mediated through another layer. Therefore, a simple way to allow for effective connections between the hidden units is to add another layer of hidden units. Rather than just having two layers, one visible and one hidden, we can add additional layers of latent variables to account for the correlations between hidden units. Ideally, as one adds more and more layers, one might hope that the correlations between hidden variables become smaller and smaller deeper into the network. This basic logic is reminiscent of renormalization procedures that seek to decorrelate layers at each step (Li and Wang, 2018; Mehta and Schwab, 2014; Vidal, 2007). The price of adding additional layers is that the models become harder to train.

Training DBMs is more subtle than training RBMs, due to the difficulty of propagating information from visible to hidden units. However, Hinton and collaborators realized that some of these problems could be alleviated via a layerwise procedure. Rather than attempting to train the whole DBM at once, we can think of the DBM as a stack of RBMs (see Fig. 63). One first trains the bottom two layers of the DBM – treating it as if it is a stand-alone RBM. Once this bottom RBM is trained, we can generate “samples” from the hidden layer and use these samples as an input to the next RBM (consisting of the first and second hidden layers – purple hexagons and green squares in Fig. 63). This procedure can then be repeated to pretrain all layers of the DBM.

^19 In general, one should instead think of activity patterns of hidden units as representing features in the data.

FIG. 64 Fantasy particles (samples) generated using the indicated model trained on the MNIST dataset. Samples were generated by running (alternating) layerwise Gibbs sampling for 100 steps. This allows the final sample to be very far away from the starting point in our feature space. Notice that the generated samples look much less like handwritten reconstructions than those in Fig. 60, which uses a single max-probability iteration of the Gibbs sampler, indicating that training is much less effective when exploring regions of probability space far away from the training data. In Sec. XVII, we will argue that this is likely a generic feature of likelihood-based training.


This pretraining initializes the weights so that SGD can be used effectively when the network is trained in a supervised fashion. In particular, the pretraining helps the gradients to stay well behaved rather than vanish or blow up – a problem that we discussed extensively in the earlier sections on DNNs. It is worth noting that once pretrained, we can use the usual Boltzmann learning rules in Eq. (195) to fine-tune the weights and improve the performance of the DBM (Hinton et al., 2006; Hinton and Salakhutdinov, 2006). As we demonstrate in the next section, the Paysage package presented here can be used to both construct and train DBMs using such a pretraining procedure.
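Schematically, the layerwise procedure amounts to the following loop, where train_rbm and sample_hidden are hypothetical placeholders for the RBM training and inference routines discussed above:

```python
def pretrain_dbm(data, hidden_sizes, train_rbm, sample_hidden):
    """Greedy layerwise pretraining: train each pair of adjacent layers as a
    stand-alone RBM, then feed its hidden-layer samples to the next RBM."""
    rbms, x = [], data
    for n_hidden in hidden_sizes:
        rbm = train_rbm(x, n_hidden)   # treat the two layers as an RBM
        rbms.append(rbm)
        x = sample_hidden(rbm, x)      # "samples" from the hidden layer
    return rbms                        # initializes the DBM for fine-tuning
```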

E. Generative models in practice: examples

1. MNIST

First, we apply the open source package Paysage (French for landscape) for training unsupervised energy-based models on the MNIST dataset.

In Notebook 17, we explicitly demonstrate how to build and train four different kinds of models: (i) a “Hopfield” type RBM with Gaussian hidden units and Bernoulli (binary) visible units, (ii) a conventional RBM where both the visible and hidden units are Bernoulli, (iii) a conventional RBM with an additional L1 penalty that enforces sparsity, and (iv) a Deep Boltzmann Machine (DBM) with three Bernoulli layers, each with an L1 penalty. We refer the reader to the Notebook for the details of the code. In the following, we show and briefly discuss the results.

After training the models, we compute reconstructions and fantasy particles from the validation data set. Recall that a reconstruction v′ of a given data point x is computed in two steps: (i) we fix the visible layer v = x to be the data, and use MCMC sampling to find the state of the hidden layer h which maximizes the probability distribution p(h|v); (ii) fixing the same obtained state h, we find the reconstruction v′ of the original data point which maximizes the probability p(v′|h). In the case of a DBM, the forward pass continues until we reach the last of the hidden layers, and the backward pass goes in reverse. Figure 60 shows the results.
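For a single-layer model, a deterministic (max-probability) version of this two-step reconstruction reduces to one forward-backward pass (a sketch using the sigmoid helper defined earlier, not the Paysage call signature):

```python
def reconstruct(v_data, a, b, W):
    """Deterministic reconstruction v -> h -> v' for a Bernoulli-Bernoulli RBM:
    take the most probable hidden state, then the mean of p(v'|h)."""
    h = (sigmoid(b + v_data @ W) > 0.5).astype(float)  # argmax_h p(h|v)
    return sigmoid(a + W @ h)                          # reconstruction v'
```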

We also used MCMC to draw samples from the learned probability distributions, the so-called fantasy particles. To this end, we performed layer-wise Gibbs sampling for a fixed number of equilibration steps. The result is shown in Figure 64.

Finally, one can use generative models to reduce the noise in images (de-noising). We randomly flipped a fraction of the black and white bits in the validation data, and used the models defined above to reconstruct (de-noise) the digit images. Figure 65 shows the results.

The full Paysage code used to generate Figs. 60, 64 and 65 is available in Notebook 17. The package was developed by one of the authors (CKF) along with his colleagues at Unlearn.AI and makes it easy to build, train, and deploy energy-based generative models with different architectures. Paysage’s documentation is available on GitHub at https://github.com/drckf/paysage/tree/master/docs.

2. Example: 2D Ising Model

We can also analyze the 2D Ising data set. In previous sections, we used our knowledge of the critical point at T_c/J ≈ 2.26 (see Onsager’s solution) to label the spin configurations and study the problem of classifying the states according to their phase of matter. However, in more complicated models, where the precise position of T_c is not known, one cannot label the states with such accuracy, if at all.

As we explained, generative models can be used to learn a variational approximation for the probability distribution that generated the data points. By using only the 2D spin configurations, we now attempt to train a Bernoulli RBM, the fantasy particles of which are thermal Ising configurations. Unlike in previous studies of the Ising dataset, here we perform the analysis at a fixed temperature T. We can then apply our model at three different values T = 1.75, 2.25, 2.75 in the ordered, near-critical and disordered regions, respectively.

FIG. 65 Images from MNIST were randomly corrupted by adding noise. These noisy images were used as inputs to the visible layer of the generative model. The denoised images are obtained by a single “deterministic” (max probability) iteration v → h → v′.

We define a Deep Boltzmann machine with two hidden layers of Nhidden and Nhidden/10 units, respectively, and apply L1 regularization to all weights. As in the MNIST problem above, we use layer-wise pretraining, and deploy Persistent Contrastive Divergence to train the DBM using ADAM.

One of the lessons from this problem is that this task is computationally intensive, see Notebook 17. The training time on present-day laptops easily exceeds that of previous studies from this review. Thus, we encourage the interested reader to try GPU-based training and study the resulting speed-up.

Figures 66, 67 and 68 show the results of the numerical experiment at T/J = 1.75, 2.25, 2.75, respectively, for a DBM with Nhidden = 800. Looking at the reconstructions and the fantasy particles, we see that our DBM works well in the disordered and critical regions. However, the chosen layer architecture is not optimal for T = 1.75 in the ordered phase, presumably due to effects related to symmetry breaking.

F. Generative models in physics

Generative models have been studied and used extensively in the context of physics. For instance, in Biophysics, dynamic Boltzmann distributions have been used as effective models in chemical kinetics (Ernst et al., 2018). In Statistical Physics, they were used to identify criticality in the Ising model (Morningstar and Melko, 2017). In parallel, tools from Statistical Physics have been applied to analyze the learning ability of RBMs (Decelle et al., 2018; Huang, 2017b), characterizing the sparsity of the weights, the effective temperature, the non-linearities in the activation functions of hidden units, and the adaptation of fields maintaining the activity in the visible layer (Tubiana and Monasson, 2017). Spin glass theory motivated a deterministic framework for the training, evaluation, and use of RBMs (Tramel et al., 2017); it was demonstrated that the training process in RBMs itself exhibits phase transitions (Barra et al., 2016, 2017); learning in RBMs was studied in the context of equilibrium (Cossu et al., 2018; Funai and Giataganas, 2018) and nonequilibrium (Salazar, 2017) thermodynamics, and spectral dynamics (Decelle et al., 2017); mean-field theory found application in analyzing DBMs (Huang, 2017a). Another interesting direction of research is the use of generative models to improve Monte Carlo algorithms (Cristoforetti et al., 2017; Nagai et al., 2017; Tanaka and Tomiya, 2017b; Wang, 2017). Ideas from quantum mechanics have been put forward to speed up certain parts of the learning algorithms for Helmholtz machines (Benedetti et al., 2016, 2017).

FIG. 66 MC samples, their reconstructions and fantasy particles generated by a Deep Boltzmann Machine in the ordered phase of the 2D Ising data set at T/J = 1.75. We used two hidden layers of 1000 and 100 units, respectively.

FIG. 67 MC samples, their reconstructions and fantasy particles generated by a Deep Boltzmann Machine in the critical regime of the 2D Ising data set at T/J = 2.25. We used two hidden layers of 1000 and 100 units, respectively.

At the same time, generative models have applications in the study of quantum systems too. Most notably, RBM-inspired variational ansatzes were used to learn both complex-valued wavefunctions and the real-valued probability distribution associated with the absolute square of a quantum state (Carleo et al., 2018; Carleo and Troyer, 2017; Freitas et al., 2018; Nomura et al., 2017; Torlai et al., 2018), including quantum state tomography (Carrasquilla et al., 2018; Torlai et al., 2018; Torlai and Melko, 2017); in this context, RBMs are sometimes called Born machines (Cheng et al., 2017). Further applications include the detection of order in low-energy product states (Rao et al., 2017), and learning Einstein-Podolsky-Rosen correlations on an RBM (Weinstein, 2017). Inspired by the success of tensor networks in physics, the latter have been used as a basis for both generative and discriminative learning (Huggins et al., 2019): RBMs (Chen et al., 2018) were used to extract the spatial geometry from entanglement (You et al., 2017), and generative models based on matrix product states have been developed (Han et al., 2017). Last but not least, quantum entanglement was studied using RBM-encoded states (Deng et al., 2017), and tensor-product based generative models have been used to understand MNIST and other ML datasets (Stoudenmire and Schwab, 2016).

FIG. 68 MC samples, their reconstructions and fantasy particles generated by a Deep Boltzmann Machine in the disordered phase of the 2D Ising data set at T/J = 2.75. We used two hidden layers of 1000 and 100 units, respectively.

XVII. VARIATIONAL AUTOENCODERS (VAES) AND GENERATIVE ADVERSARIAL NETWORKS (GANS)

In the previous two sections, we considered energy-based generative models. Here, we extend our discussion to two new generative model frameworks that have gained wide appeal in the last few years: generative adversarial networks (GANs) (Goodfellow, 2016; Goodfellow et al., 2014; Radford et al., 2015) and variational autoencoders (VAEs) (Kingma and Welling, 2013). Unlike energy-based models, both these generative modeling frameworks are based on differentiable neural networks and consequently can be trained using backpropagation-based methods. VAEs, in particular, can be easily implemented and trained using high-level packages such as Keras, making them an easy-to-deploy generative framework. These models also differ from the energy-based models in that they do not directly seek to maximize likelihood. GANs, for example, employ a novel cost function based on adversarial learning (a concept we motivate and explain below). Finally, we note that VAEs and GANs are already starting to make their way into physics (Heimel et al., 2018; Liu et al., 2017; Rocchetto et al., 2018; Wetzel, 2017) and astronomy (Ravanbakhsh et al., 2017), and methods from physics may prove useful for furthering our understanding of these methods (Alemi and Abbara, 2017). More generally, GANs have found important applications in many artistic and image manipulation tasks (see references in (Goodfellow, 2016)).

The section is organized as follows. We start by motivating adversarial learning by discussing the limitations of maximum likelihood based approaches. We then give a high-level introduction to the main idea behind generative adversarial networks and discuss how they overcome some of these limitations, simultaneously highlighting both the power of GANs and some of the difficulties. We then show how VAEs integrate the variational methods introduced in Sec. XIV with deep, differentiable neural networks to build more powerful generative models that move beyond Expectation Maximization (EM). We then briefly discuss VAEs from an information theoretic perspective, before discussing practical tips for implementing and training VAEs. We conclude by using VAEs on examples using the Ising and MNIST datasets (see also Notebooks 19 and 20).

A. The limitations of maximizing Likelihood

The Kullback-Leibler (KL) divergence plays a central role in many generative models. Developing an intuition about KL divergences is one of the keys to understanding why adversarial learning has proved to be such a powerful method for generative modeling. Here, we revisit the KL divergence with an eye towards understanding GANs and motivate adversarial learning. The KL divergence measures the similarity between two probability distributions p(x) and q(x). Strictly speaking, the KL divergence is not a metric because it is not symmetric and does not satisfy the triangle inequality.

Given two distributions, there are two distinct KL divergences we can construct:

DKL(p||q) = ∫ dx p(x) log [p(x)/q(x)],  (214)

DKL(q||p) = ∫ dx q(x) log [q(x)/p(x)].  (215)

A related quantity, called the Jensen-Shannon divergence,

DJS(p, q) = (1/2) [ DKL(p || (p+q)/2) + DKL(q || (p+q)/2) ],

does satisfy all of the properties of a squared metric (i.e., the square root of the Jensen-Shannon divergence is a metric).



FIG. 69 KL divergences between the data distribution pdata and the model pθ. Data is drawn from a bimodal Gaussian distribution with unit variances peaked at ±∆ with ∆ = 2.0, and the model pθ(x) is a Gaussian with mean zero and the same variance as pdata(x). (Top) pdata and pθ for ∆ = 2. (Bottom) DKL(pdata||pθ) (Data-Model) and DKL(pθ||pdata) (Model-Data) as a function of ∆. Notice that DKL(pdata||pθ) is insensitive to placing weight in the model distribution in regions where pdata ≈ 0 whereas DKL(pθ||pdata) punishes this harshly.

An important property of the KL divergence that we will make use of repeatedly is its positivity: DKL(p||q) ≥ 0, with equality if and only if p(x) = q(x) almost everywhere.

In generative models in ML, the two distributions we are usually concerned with are the model distribution pθ(x) and the data distribution pdata(x). We of course would like these models to be as similar as possible. However, as we discuss below, there are many subtleties about how we measure similarities that can have large consequences for the behavior of training procedures. Maximizing the log-likelihood of the data under the model is the same as minimizing the KL divergence between the data distribution and the model distribution DKL(pdata||pθ). To see this, we can rewrite the KL divergence as:


FIG. 70 KL divergences between the data distribution pdata and the model pθ. Data is drawn from a Gaussian mixture of the form pdata = 0.25N(−∆) + 0.25N(∆) + 0.5N(0), where N(a) is a normal distribution with unit variance centered at x = a. pθ(x) is a Gaussian with σ² = 2. (Top) pdata and pθ for ∆ = 5. (Middle) pdata and pθ for ∆ = 1. (Bottom) DKL(pdata||pθ) [Data-Model] and DKL(pθ||pdata) [Model-Data] as a function of ∆. Notice that DKL(pθ||pdata) is insensitive to setting pθ ≈ 0 even when pdata ≠ 0, whereas DKL(pdata||pθ) punishes this harshly.


DKL(pdata||pθ) = ∫ dx pdata(x) log pdata(x) − ∫ dx pdata(x) log pθ(x)
= −S[pdata] − 〈log pθ(x)〉data.  (216)

Rearranging this equation, we have

〈log pθ(x)〉data = −S[pdata] − DKL(pdata||pθ).  (217)

The equivalence follows from the positivity of the KL divergence and the fact that the entropy of the data distribution is constant. In contrast, the original formulation of GANs minimizes an upper bound on the Jensen-Shannon divergence between the model distribution pθ(x) and the data distribution pdata(x) (Goodfellow et al., 2014).

This difference in objectives underlies the difference in behavior between GANs and likelihood-based generative models. To see this, we can compare the behavior of the two KL divergences DKL(pdata||pθ) and DKL(pθ||pdata). As is evident from Fig. 69 and Fig. 70, though both of these KL divergences measure similarities between the two distributions, they are sensitive to very different things. DKL(pθ||pdata) is insensitive to setting pθ ≈ 0 even when pdata ≠ 0 whereas DKL(pdata||pθ) punishes this harshly. In contrast, DKL(pdata||pθ) is insensitive to placing weight in the model distribution in regions where pdata ≈ 0 whereas DKL(pθ||pdata) punishes this harshly. In other words, DKL(pdata||pθ) prefers models that have a high probability in regions with lots of training data points whereas DKL(pθ||pdata) punishes models for putting high probability where there is no data.
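The asymmetry is easy to verify numerically. The following sketch is in the spirit of Fig. 69, with a bimodal data distribution and a variance-matched Gaussian model chosen by us for illustration:

import numpy as np
from scipy.stats import norm

x = np.linspace(-15, 15, 20001)
dx = x[1] - x[0]
delta = 2.0
# bimodal "data" distribution and a unimodal "model" with matched variance
p_data = 0.5 * (norm.pdf(x, -delta, 1) + norm.pdf(x, delta, 1))
p_model = norm.pdf(x, 0.0, np.sqrt(1.0 + delta**2))

def kl(p, q):
    # D_KL(p||q) on a grid; where p ~ 0 the integrand p*log(p/q) -> 0
    mask = p > 1e-12
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) * dx

print(kl(p_data, p_model))  # Data-Model
print(kl(p_model, p_data))  # Model-Data: punishes mass between the data peaks

Varying delta in this sketch reproduces the qualitative behavior shown in the bottom panel of Fig. 69.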

In the context of the above discussion, this suggests that the way likelihood-based methods are most likely to fail is by improperly “filling in” any low-probability density regions between peaks in the data distribution. In contrast, at least in principle, the Jensen-Shannon divergence which underlies GANs is sensitive both to placing weight where there is data, since it has information about DKL(pdata||pθ), and to not placing weight where no data has been observed (i.e. in low-probability density regions), since it has information about DKL(pθ||pdata).

In practice, DKL(pdata||pθ) can be calculated easily directly from the data using sampling. On the other hand, DKL(pθ||pdata) is impossible to compute since we do not know pdata(x). In particular, this integral cannot be calculated using sampling since we cannot evaluate pdata(x) at the locations of the fantasy particles. The idea of adversarial learning is to circumvent this difficulty by using an adversarial learning procedure. Recall that DKL(pθ||pdata) is large when the model artificially over-weighs low-density regions near real peaks (see Fig. 69). Adversarial learning accomplishes this same task by teaching a discriminator network to distinguish between real data points and samples generated from the model.


FIG. 71 A GAN consists of two differentiable functions (usually represented as deep neural networks): a generator function G(z; θG) that takes as input a z sampled from some prior on the latent space and outputs a point x. The generator function (neural network) has parameters θG. The discriminator function D(x; θD) discriminates between x from the data and samples from the model: x = G(z; θG). The two networks are trained by “playing a game” where the discriminator is trained to distinguish between synthetic and real examples while the generator is trained to try to fool the discriminator. Importantly, the cost function for the discriminator depends on the generator parameters and vice versa.

By punishing the model for generating points that can be easily discriminated from the data, adversarial learning decreases the weight of regions in the model space that are far away from data points – regions that inevitably arise when maximizing likelihood. This core intuition implicitly underlies many adversarial training algorithms (though it has been recently suggested that this may not be the entire story (Goodfellow, 2016)).

B. Generative models and adversarial learning

Here, we give a brief high-level overview of the basic idea behind GANs. The mathematics and theory of GANs draw deeply from concepts in Game Theory, such as Nash Equilibrium, that are foreign to most physicists. For this reason, a comprehensive discussion of GANs is beyond the scope of the review. Readers interested in learning more are directed to the comprehensive tutorial by Goodfellow (Goodfellow, 2016). GANs are also notorious for being hard to train. For this reason, readers wishing to play with GANs should also consider the very nice practical discussion entitled “How to train a GAN” (affectionately labeled “ganhacks”) available at https://github.com/soumith/ganhacks.

The central idea of GANs is to construct two differentiable neural networks (see Fig. 71). The first neural network, usually a (de)convolutional network based on the DCGAN architecture (Radford et al., 2015), approximates a generator function G(z; θG) that takes as input a z sampled from some prior on the latent space, and outputs an x from the model. The second network approximates a discriminator function D(x; θD) that is designed to distinguish between x from the data and samples generated by the model: x = G(z; θG). The scalar D(x) represents the probability that x came from the data rather than the model pθG. We train D to distinguish actual data points from synthetic examples, and the generative network to fool the discriminative network.

To define the cost function for training, it is useful to define the functional

V(D,G) = Ex∼pdata [log D(x)] + Ez∼pprior [log(1 − D(G(z)))].  (218)

In the version of GANs most amenable to theoretical analysis – though not the version usually implemented in practice – we take the cost functions for the discriminator and generator to be C(G) = −C(D) = (1/2) V(D,G). This choice of cost functions corresponds to what is called a zero-sum game. Since the discriminator is maximized, we can write a cost function for the generator as

C(G) = max_D V(G,D).  (219)

It turns out that this cost function is related to the Jensen-Shannon divergence in a simple manner (Goodfellow, 2016; Goodfellow et al., 2014):

C(G) = −log 4 + 2 DJS(pdata, pθG).  (220)

This brings us back full circle to the discussion in the last section on KL divergences.
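To make the adversarial game concrete, here is a minimal Keras sketch of Eq. (218) on a one-dimensional toy dataset. The network sizes, optimizers, and data are our illustrative assumptions (not code from the accompanying notebooks), and, as is common in practice, the generator update uses the non-saturating heuristic of maximizing log D(G(z)) rather than the strict zero-sum cost analyzed above.

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

latent_dim = 2
# Generator G(z; theta_G): maps latent z to data space (here 1D)
G = keras.Sequential([layers.Dense(16, activation="relu", input_shape=(latent_dim,)),
                      layers.Dense(1)])
# Discriminator D(x; theta_D): probability that x came from the data
D = keras.Sequential([layers.Dense(16, activation="relu", input_shape=(1,)),
                      layers.Dense(1, activation="sigmoid")])
g_opt, d_opt = keras.optimizers.Adam(1e-3), keras.optimizers.Adam(1e-3)
bce = keras.losses.BinaryCrossentropy()

def train_step(x_real):
    n = len(x_real)
    z = tf.random.normal((n, latent_dim))
    # Discriminator step: push D(x_real) -> 1 and D(G(z)) -> 0
    with tf.GradientTape() as tape:
        d_loss = bce(tf.ones((n, 1)), D(x_real)) + bce(tf.zeros((n, 1)), D(G(z)))
    d_opt.apply_gradients(zip(tape.gradient(d_loss, D.trainable_variables),
                              D.trainable_variables))
    # Generator step: try to fool the discriminator, D(G(z)) -> 1
    with tf.GradientTape() as tape:
        g_loss = bce(tf.ones((n, 1)), D(G(z)))
    g_opt.apply_gradients(zip(tape.gradient(g_loss, G.trainable_variables),
                              G.trainable_variables))

# toy target: data drawn from N(3, 0.5)
for _ in range(1000):
    train_step(np.random.normal(3.0, 0.5, size=(128, 1)).astype("float32"))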

C. Variational Autoencoders (VAEs)

We now turn our attention to another class of powerful latent-variable generative models called Variational Autoencoders (VAEs). VAEs exploit the variational/mean-field theory ideas presented in Sec. XIV to build complex generative models using deep neural networks (DNNs). The central idea behind VAEs is to represent the map from latent variables to observable variables using a DNN.


FIG. 72 VAEs learn a joint distribution pθ(x, z) between latent variables z with prior distribution p(z) and data x. The conditional distribution pθ(x|z) can be thought of as a stochastic “decoder” that maps latent variables to new examples. The stochastic “encoder” qφ(z|x) approximates the true but intractable pθ(z|x) – much like mean-field theories in statistical physics approximate true distributions with analytically tractable approximations. Figure based on Kingma's Ph.D. dissertation, Chapter 2 (Kingma et al., 2017).

The use of latent variables is a common theme in many of the generative models we have encountered in unsupervised learning tasks, from Gaussian Mixture Models (see Sec. XIII) to Restricted Boltzmann Machines. However, in VAEs this mapping, p(x|z, θ), is much less restrictive and much more complicated since it takes the form of a DNN. This added complexity means we cannot use techniques such as Expectation Maximization to train the model and instead must rely on methods based on backpropagation.

1. VAEs as variational models

We start by discussing VAEs from a variational perspective. We will make extensive use of the concepts introduced in Sec. XIV and the reader is strongly encouraged to refresh their memory of this section before proceeding. A VAE is a latent-variable model pθ(x, z) with latent variables z and observed variables x. The latent variables are drawn from some pre-specified prior distribution p(z). In practice, p(z) is almost always taken to be a multivariate Gaussian. The conditional distribution pθ(x|z) maps points in the latent space to new examples (see Fig. 72). This is often called a “stochastic decoder” and defines the generative model for the data. The reverse mapping that gives the posterior over the latent variables, pθ(z|x), is often called the “stochastic encoder”.

A central challenge in latent variable modeling is to infer the posterior distribution of the latent variables given a sample from the data. This can in principle be done via Bayes' rule: pθ(z|x) = p(z)pθ(x|z)/pθ(x). For some models, we can calculate this analytically, in which case we can use techniques like Expectation Maximization (EM) (see Sec. XIV). However, in general this is intractable since the denominator requires computing a sum over all configurations of the latent variables, pθ(x) = ∫ pθ(x, z) dz = ∫ pθ(x|z)p(z) dz (i.e. a partition function in the language of physics), which is often intractable for large models. In VAEs, where pθ(x|z) is modeled using a DNN, this is impossible.

A first attempt to address the issue of computing p(x) could be through importance sampling (Neal, 2001). That is, we choose a proposal distribution q(z|x) which is easy to sample from, and rewrite the sum as an expectation with respect to this distribution:

pθ(x) = ∫ pθ(x|z) [p(z)/qφ(z|x)] qφ(z|x) dz.  (221)

Thus, by sampling from qφ(z|x) we can get a Monte Carlo estimate of p(x). However, this requires generating samples and thus our estimates will be noisy. If our proposal distribution is poor, the variance in the estimate can be very high.
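A toy numerical check of Eq. (221), with all distributions chosen by us for illustration: take p(z) = N(0, 1) and pθ(x|z) = N(x; z, 1), so that pθ(x) is a Gaussian with mean 0 and variance 2 exactly, and the quality of the estimate can be verified.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def estimate_px(x, n=100_000):
    mu_q, sig_q = x / 2.0, 0.8  # a hypothetical proposal q(z|x)
    z = rng.normal(mu_q, sig_q, n)  # z ~ q(z|x)
    w = norm.pdf(z, 0, 1) / norm.pdf(z, mu_q, sig_q)  # weights p(z)/q(z|x)
    return np.mean(norm.pdf(x, z, 1) * w)  # E_q[ p(x|z) p(z)/q(z|x) ]

print(estimate_px(1.0), norm.pdf(1.0, 0, np.sqrt(2)))  # estimate vs. exact

A poor proposal (e.g. sig_q much narrower than the posterior width) makes the weights w highly variable and the estimate correspondingly noisy.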

An alternative approach that avoids these sampling issues is to use the variational approach discussed in Sec. XIV. We know from Eq. (162) that we can write the log-likelihood as

log p(x) = DKL(qφ(z|x)‖pθ(z|x)) − Fqφ(x),  (222)

where the variational free energy is defined as

−Fqφ(x) ≡ Eqφ(z|x)[log pθ(x|z)] − DKL(qφ(z|x)|p(z)).  (223)

In writing this term, we have used Bayes rule and Eq. (174). Since the KL divergence is strictly positive, the (negative) variational free energy is a lower bound on the log-likelihood. For this reason, in the VAE literature, it is often called the Evidence Lower BOund or ELBO.

Equation (223) has a beautiful interpretation. The first term in this equation can be viewed as a “reconstruction error”, where we start with data x, encode it into the latent representation using our approximate posterior qφ(z|x), and then evaluate the log probability of the original data given the inferred latents. For binary variables, this is just the cross-entropy which we first encountered when studying logistic regression, cf. Sec. VII.

The second term acts as a regularizer and encourages the posterior distributions to be close to p(z). By maximizing the ELBO, we minimize the KL divergence between the approximate and true posterior. By choosing a tractable qφ(z|x), we make this feasible (see Fig. 72).

2. Training via the reparametrization trick

VAEs train models by minimizing the variational free energy (maximizing the ELBO). Training a VAE is somewhat complicated because we must simultaneously learn two sets of parameters: the parameters θ that define our generative model pθ(x, z) as well as the variational parameters φ in qφ(z|x). The basic approach will be the same as for all DNN models: we will use gradient descent with the variational free energy as the objective (cost) function. For a dataset L, we can write our cost function as

Cθ,φ(L) = Σ_{x∈L} −Fqφ(x).  (224)

Taking the gradient with respect to θ is easy since only the first term in Eq. (223) depends on θ:

∇θ Cθ,φ(x) = Eqφ(z|x)[∇θ log pθ(x|z)] ∼ ∇θ log pθ(x|z),  (225)

where in the second line we have replaced the expectation value with a single Monte-Carlo sample z drawn from qφ(z|x) (see Fig. 73). When pθ(x|z) is approximated by a neural network, this can be calculated using backpropagation with the reconstruction error as the objective function.

On the other hand, calculating the gradient with respect to the parameters φ is more complicated since φ also appears in the expectation value Eqφ(z|x). Ideally, we would like to use backpropagation to calculate this as well. It turns out that this can be done by a simple change of variables that often goes under the name the “reparameterization trick” (Kingma and Welling, 2013; Rezende et al., 2014). The basic idea is to change variables so that φ no longer appears in the distribution we are taking an expectation value with respect to. To do this, we express the random variable z ∼ qφ(z|x) as some differentiable and invertible transformation of another random variable ε:

z = g(ε, φ, x),  (226)

where the distribution of ε is independent of x and φ. Then, we can replace expectation values over qφ(z|x) by expectation values over pε:

Eqφ(z|x)[f(z)] = Epε[f(z)].  (227)

Evaluating the derivative then becomes quite straightforward since

∇φ Eqφ(z|x)[f(z)] ∼ Epε[∇φ f(z)].  (228)
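For the Gaussian encoder used below, the transformation g is particularly simple: z = µ + σ ⊙ ε with ε ∼ N(0, I). A two-line NumPy sketch:

import numpy as np

def reparameterize(mu, log_var, rng=np.random.default_rng()):
    eps = rng.standard_normal(mu.shape)  # eps ~ N(0, I), independent of x and phi
    return mu + np.exp(0.5 * log_var) * eps  # gradients flow through mu, log_var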



FIG. 73 Schematic explaining the computational flow of VAEs. Figure based on Kingma's Ph.D. dissertation, Chapter 2 (Kingma et al., 2017).

Of course, when we do this we still need to be able to calculate the Jacobian of this change of variables,

dφ(x, φ) = Det |∂z/∂ε|,  (229)

since

log qφ(z|x) = log p(ε) − log dφ(x, φ).  (230)

Since we can calculate gradients, we can now use backpropagation on the full ELBO objective function (we return to this below when we discuss concrete architectures and implementations of VAEs).

One of the problems that commonly occurs when training VAEs by performing a stochastic optimization of the ELBO (variational free energy) is that it often gets stuck in undesirable local minima, especially at the beginning of the training procedure (Bowman et al., 2015; Kingma et al., 2017; Sønderby et al., 2016). The underlying reason for this is that the ELBO objective function can be improved in two qualitatively different ways, corresponding to each of the two terms in Eq. (223): by minimizing the reconstruction error or by making the posterior distribution qφ(z|x) close to p(z) (of course, the goal is to do both!). For complex datasets, at the beginning of training when the reconstruction error is extremely poor, the model often quickly learns to make q(z|x) ≈ p(z) and gets stuck in this local minimum. For this reason, in practice it is found that it makes sense to modify the ELBO objective to use an optimization schedule of the form

Eqφ(z|x)[log pθ(x|z)] − β DKL(qφ(z|x)|p(z)),  (231)

where β is slowly annealed from 0 to 1 (Bowman et al., 2015; Sønderby et al., 2016). An alternative regularization is the “method of free bits”: modifying the ELBO objective function to ensure that on average qφ(z|x) has at least λ natural units of information about p(z) (see Kingma's Ph.D. thesis (Kingma et al., 2017) for details).
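In code, such a warm-up can be as simple as the following sketch (the linear ramp and the warm-up length are illustrative choices):

def beta_schedule(epoch, n_warmup=10):
    # anneal beta linearly from 0 to 1 over the first n_warmup epochs
    return min(1.0, epoch / n_warmup)

# per-epoch usage: loss = reconstruction_error + beta_schedule(epoch) * kl_term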

These observations hint at a more general connection between VAEs and information theory that we turn to in the next section.

3. Connection to the information bottleneck

There is a fundamental connection between the variational autoencoder objective and the information bottleneck (IB) for lossy compression (Tishby et al., 2000). The information bottleneck imagines we have input data x that is correlated with another variable of interest, y, and we are given access to the joint distribution, p(x, y). Our task is to take x as input and compress it in such a way as to retain as much information as possible about the relevance variable, y. To do this, Tishby et al. propose to maximize the objective function

LIB = I(y; z) − βI(x; z)  (232)

over a stochastic encoding distribution q(z|x), where z is our compression of the input, β is a tradeoff parameter that sets the relative preference of compression and accuracy, and I(y; z) is the mutual information between y and z. Note that we choose a slightly different but equivalent form of the objective relative to Tishby et al. This objective is only known to have a closed-form solution when x and y are jointly Gaussian (Chechik et al., 2005). Otherwise, the optimization can be performed through a Blahut-Arimoto type iterative update scheme (Arimoto, 1972; Blahut, 1972). However, this is only guaranteed to converge to a local optimum. A significant difficulty in implementing IB is that it requires knowledge of the joint distribution p(x, y) and that we must be able to compute the mutual information, a notoriously difficult quantity to estimate from samples. Hence, IB has in recent years been utilized less than it might otherwise.


To address these problems, variational approximations to the IB objective function have been developed (Alemi et al., 2016; Chalk et al., 2016). These approximations, when applied to a particular choice of p(x, y), give the same objective as the variational autoencoder. Here we follow the exposition of Alemi et al. (Alemi et al., 2016). To see this, consider a dataset of N points, xi. We set x = i and y = xi in the IB objective, similar to (Slonim et al., 2005; Strouse and Schwab, 2017). We choose p(i) = 1/N and p(x|i) = δ(x − xi). That is, we would like to find a compression of the data that preserves information about data point location while reducing information about data point identity.

Imagine that we are unable to directly work with the decoder p(x|z). The first approximation replaces the exact decoder inside the logarithm with an approximation, q(x|z). Due to the positivity of the KL divergence, namely,

DKL(p(x|z)||q(x|z)) ≥ 0 ⇒ ∫ dx p(x|z) log p(x|z) ≥ ∫ dx p(x|z) log q(x|z),  (233)

we have

I(x; z) = ∫ dx dz p(x)p(z|x) log [p(x|z)/p(x)]
≥ ∫ dx dz p(x)p(z|x) log q(x|z) + Hp(x)
≥ ∫ dx dz p(x)p(z|x) log q(x|z),  (234)

where Hp(x) ≥ 0 is the Shannon entropy of x. This quantity can be estimated from data samples (i, xi) after drawing from p(z|i) = p(z|xi). Similarly, we can replace the prior distribution of the encoding, p(z) = ∫ dx p(x)q(z|x), which is typically intractable, with a tractable q(z) to get

I(i; z) ≤ (1/N) Σ_i ∫ dz p(z|xi) log [p(z|xi)/q(z)].  (235)

Putting these two bounds Eqs. (234) and (235) together, and noting that x = i and y = xi, we get an upper bound for the IB objective that takes the same form as the VAE objective Eq. (231) we saw earlier:

LIB = I(x; z) − βI(y; z)
≤ ∫ dx p(x) Ep(z|x)[log q(x|z)]  (236)
− β (1/N) Σ_i DKL(p(z|xi)|q(z)).  (237)

Note that in Eq. (236) we have a conditional distribution of x given z, but not their joint distribution, inside the expectation, which was the case in Eq. (231). This is because we dropped the entropy term pertaining to x, which is irrelevant in the optimization procedure.


FIG. 74 Computational graph for a VAE with Gaussian hidden units (i.e. p(z) are standard normal variables N(0, 1)) and a Gaussian variational encoder whose posterior takes the form qφ(z|x) = N(µ(x), σ²(x)).

In fact, this objective has been explored and is called a β-VAE (Higgins et al., 2016). It is interesting to note that in the case of IB, the variational approximations are with respect to the decoder and prior, whereas in the VAE, the variational approximations are with respect to the encoder.

D. VAE with Gaussian latent variables and Gaussian encoder

Our discussion of VAEs thus far has been quite abstract. In this section, we discuss one of the most widely employed VAE architectures: a VAE with factorized Gaussian posteriors, qφ(z|x) = N(z, µ(x), diag(σ²(x))), and standard normal latent variables, p(z) = N(0, I). The training and implementation simplifies greatly here because we can analytically work out the term DKL(qφ(z|x)|p(z)).

1. Implementing the Gaussian VAE

We now show how we can combine analytic expressions for the KL divergence with backpropagation to efficiently implement a Gaussian VAE. We start by first deriving analytic expressions for DKL(qφ(z|x)|p(z)) in terms of the means µ(x) and variances σ²(x). This is just a simple exercise in Gaussian integrals.


FIG. 75 Embedding of the MNIST dataset into a two-dimensional latent space using a VAE with two latent dimensions (see Notebook 19 and main text for details). Data points are colored by their identity [0-9].

For notational convenience, we drop the x-dependence of the means µ(x), variances σ²(x), and qφ(x). A straightforward calculation gives

∫ dz qφ(z) log p(z) = ∫ dz N(z, µ(x), diag(σ²(x))) log N(0, I)
= −(J/2) log 2π − (1/2) Σ_{j=1}^{J} (µj² + σj²),  (238)

where J is the dimension of the latent space. An almost identical calculation yields

∫ dz qφ(z) log qφ(z) = −(J/2) log 2π − (1/2) Σ_{j=1}^{J} (1 + log σj²).  (239)

Combining these equations gives

−DKL(qφ(z|x)|p(z)) = (1/2) Σ_{j=1}^{J} (1 + log σj²(x) − µj²(x) − σj²(x)).  (240)

This analytic expression allows us to implement the Gaussian VAE in a straightforward way using neural networks. The computational graph for this implementation is shown in Fig. 74. Notice that since the parameters are all compositions of differentiable functions, we can use standard backpropagation algorithms to train VAEs.
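A minimal Keras sketch of this computational graph is given below. It follows the architecture of Fig. 74 and the MLP setup described in the next section, but the layer sizes and optimizer are illustrative choices, not necessarily those of Notebook 19.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

original_dim, hidden_dim, latent_dim = 784, 256, 2

# Encoder: x -> (mu, log sigma^2), the parameters of q_phi(z|x)
x_in = keras.Input(shape=(original_dim,))
h = layers.Dense(hidden_dim, activation="relu")(x_in)
z_mean = layers.Dense(latent_dim)(h)
z_log_var = layers.Dense(latent_dim)(h)

# Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)
def sample_z(args):
    mu, log_var = args
    eps = tf.random.normal(tf.shape(mu))
    return mu + tf.exp(0.5 * log_var) * eps

z = layers.Lambda(sample_z)([z_mean, z_log_var])

# Decoder: z -> Bernoulli probabilities for each pixel
h_dec = layers.Dense(hidden_dim, activation="relu")(z)
x_out = layers.Dense(original_dim, activation="sigmoid")(h_dec)

vae = keras.Model(x_in, x_out)

# Negative ELBO: reconstruction cross-entropy plus the analytic KL of Eq. (240)
recon = original_dim * keras.losses.binary_crossentropy(x_in, x_out)
kl = -0.5 * tf.reduce_sum(1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1)
vae.add_loss(tf.reduce_mean(recon + kl))
vae.compile(optimizer="rmsprop")
# training: vae.fit(x_train, epochs=50, batch_size=128) with x_train scaled to [0, 1]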

2. VAEs for the MNIST dataset

In Notebook 19, we have implemented a VAE using Keras and trained it on the MNIST dataset. The basic architecture is the one described above. All figures were generated with a VAE that has a latent space of dimension 2. The architecture of both the encoder and decoder is a Multilayer Perceptron (MLP) – a neural network with a single hidden layer. For this example, we take the dimension of the hidden layer for both neural networks to be 256. We trained the VAE using the RMSprop optimizer for 50 epochs.

We can visualize the embedding in the latent space by plotting z for the test set and coloring the points by digit identity [0-9] (see Figure 75). Notice that in general, digits that are similar end up being closer to each other in the latent space. However, this is not always the case (see bright green points for example). This is a general feature of these low-dimensional embeddings and we saw a similar phenomenon when we examined t-SNE in Section XII.

The real advantage that VAEs offer over embeddings such as t-SNE is that they are generative models. Given a set of examples, we can generate new examples – or fantasy particles as they are commonly called in ML – by sampling the latent space z and then using the decoder to map these latent variables to new examples. The results of this procedure are shown in Figure 76. In the top figure, we sample the latent space uniformly in a 5×5 grid. Notice that this results in extremely similar examples through much of the latent space. The underlying reason for this is that uniform sampling does not respect the underlying Gaussian structure of the latent space z. In the bottom figure, we instead sample uniformly in probability, using the inverse Cumulative Distribution Function (CDF) of the Gaussian to map these samples back to the latent space. We see that the diversity of the generated examples is much higher for this sampling procedure.
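The two sampling schemes are easy to state in code. A short sketch, where the grid size and the probability range are illustrative choices, and the trained decoder that would map each z to an image is not constructed here:

import numpy as np
from scipy.stats import norm

n = 5
grid_uniform = np.linspace(-3, 3, n)  # uniform grid directly in latent space
grid_cdf = norm.ppf(np.linspace(0.05, 0.95, n))  # uniform in probability
# grid_cdf concentrates points where p(z) has most of its mass; feeding the
# pairs (z1, z2) from this grid to the decoder yields the more diverse panel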

This example is indicative of a more general problem: once we have learned a generative model, how should we sample latent spaces (White, 2016)? This is especially important in high-dimensional spaces where direct visualization is not possible. Often certain directions in the latent space can have different meanings. A particularly striking visual illustration is the “smile vector” that interpolates between smiling and frowning faces (White, 2016).

3. VAEs for the 2D Ising model

In Notebook 20, we used an almost identical architecture (though coded in a slightly different way) to train a VAE on the Ising dataset discussed throughout the review. The only differences between the two VAEs are that the visible layer of the Ising VAE now has 1600 units (our samples are 40 × 40, instead of the 28 × 28 MNIST images) and we have changed the standard deviation of the Gaussian of the latent variables p(z) from σ = 1 to σ = 0.2.

We once again visualize the embedding learned by the VAE by plotting z and coloring the points by the temperature at which the sample was drawn (see Figure 77, top).


FIG. 76 (Top) Fantasy particles generated by uniform sampling of the latent space z. (Bottom) Fantasy particles generated by uniform sampling of probability p(z) mapped to latent space using the inverse Cumulative Distribution Function (CDF) of the Gaussian.

Notice that the latent space has learned a lot of the physics of the Ising model. For example, the first VAE dimension is just the magnetization (Fig. 77, bottom). This is not surprising since we saw in Section XII that the first principal component of a PCA also corresponded to the magnetization.

FIG. 77 (Top) Embedding of the Ising dataset into a two-dimensional latent space using a VAE with two latent dimensions (see Notebook 20 and main text for details). Data points are colored by the temperature each sample was drawn at. (Bottom) Correlation between the latent dimensions and the magnetization for each sample. Notice that the first principal component corresponds to the magnetization.

We now ask how well the VAE can generate new examples (see Fig. 78). We see that the examples look quite different from real Ising configurations – they lack the large-scale patchiness seen in the critical region. They mostly turn out to be unstructured speckles that reflect only the average probability that a pixel is on in a region. This is not surprising since our VAE has no spatial structure, has only two latent dimensions, and the cost function does not know about “correlations between spins”: there is very little information about correlations in the binary cross-entropy which we use to measure reconstruction errors. The reader is encouraged to play with the corresponding notebook and generate examples as we change the latent dimension and/or choose modified architectures such as decoders based on CNNs instead of MLPs.

This example also shows how much easier it is to discriminate between labeled data than it is to learn how to generate new examples from an unlabeled dataset. This is true in all spheres of machine learning. This is also one of the reasons that generative models are at the cutting edge of modern Machine Learning research, and there are likely to be a barrage of new techniques for generative modeling in the next few years.

FIG. 78 Fantasy particles for the Ising model generated by uniform sampling of probability p(z) mapped to latent space using the inverse Cumulative Distribution Function (CDF) of the Gaussian.

XVIII. OUTLOOK

In this review, we have attempted to give the reader the intellectual and practical tools to engage with Machine Learning (ML), data science, and parts of modern statistics. We have tried to emphasize that ML differs from ordinary statistics in that the goal is to predict rather than to fit. For this reason, all the techniques discussed here have to navigate important tensions that lie at the heart of ML. The most prominent instantiation of these inherent tradeoffs is the bias-variance tradeoff, which is perhaps the only universal principle in ML. Identifying how these tradeoffs manifest in a particular algorithm is the key to constructing and training powerful ML methods.

The immense progress in computing power and the corresponding availability of large datasets ensure that ML will be an important part of the physics toolkit. In the future, we expect ML to be a core competency of physicists, much like linear algebra, group theory, and differential equations. We hope that this review will play some small part toward this aspirational goal.

We wrote this review to provide a relatively concise introduction to ML using ideas and language familiar to physicists (though the review ended up being almost twice the planned length). In writing the review, we have tried to accomplish two somewhat disparate tasks. First, we have tried to highlight more abstract and theoretical considerations to show the unity of ML and statistical learning. Many ML techniques can be understood by starting with some key concepts from statistical learning (MLE, bias-variance tradeoff, regularization) and combining them with core concepts familiar from statistical physics (Monte-Carlo, gradient descent, variational methods and MFT). Despite the high-level similarities between all the methods presented here, the way that these concepts manifest in any given technique is often quite clever, and understanding these “hacks” is the key to understanding why some ML techniques turn out to be so powerful and others not so much. ML, in this sense, is as much an art as a science. Second, we have tried to give the reader the practical know-how to start using the tools and concepts from ML for immediately solving problems. We believe the accompanying Python notebooks and the emphasis on coding in Python have accomplished this task.

A. Research at the intersection of physics and ML

We hope the review catalyzes more research at the intersection of physics and machine learning. Here we briefly highlight a few promising research directions. We note that this list is far from comprehensive.

• Applying ML to solve physics problems. One theme that has recurred throughout the review is that ML is most effective in settings with well defined objectives and lots of data. For this reason, we expect ML to become a core competency of data-rich fields such as high-energy experiments and astronomy. However, ML may also prove to be useful for helping further our physical understanding through data-driven approaches to other branches of physics that may not be immediately obvious, such as quantum physics (Dunjko and Briegel, 2017). For example, recent works have used ideas from ML to investigate disparate topics such as non-local correlations (Canabarro et al., 2018), disordered materials and glasses (Schoenholz, 2017), electronic structure calculations (Grisafi et al., 2017), numerical analysis of ferromagnetic resonances in thin films (Tomczak and Puszkarski, 2018), designing and analyzing quantum materials by integrating ML with existing techniques such as Dynamical Mean Field Theory (DMFT) (Arsenault et al., 2014), the study of inflation (Rudelius, 2018), and even experimental learning of quantum states by using ML to aid in quantum tomography (Rocchetto et al., 2017). For a comprehensive review of ML methods in seismology, see (Kong et al., 2018).

• Machine Learning on quantum computers. Another interesting area of research that is likely to grow is asking if and how quantum computers can help improve state-of-the-art ML algorithms (Arunachalam and de Wolf, 2017; Benedetti et al., 2016, 2017; Bromley and Rebentrost, 2018; Ciliberto et al., 2017; Daskin, 2018; Innocenti et al., 2018; Mitarai et al., 2018; Perdomo-Ortiz et al., 2017; Rebentrost et al., 2017; Schuld et al., 2017; Schuld and Killoran, 2018; Schuld et al., 2015). Concrete examples that seek to extend some of the basic ideas and methods we introduced in this review to the quantum computing realm include: algorithms for quantum-assisted gradient descent (Kerenidis and Prakash, 2017; Rebentrost et al., 2016), classification (Schuld and Petruccione, 2017), and Ridge regression (Yu et al., 2017). Interest in this field will undoubtedly grow once reliable quantum computers become available (see also this recent review (Dunjko and Briegel, 2017)).

• Monte-Carlo Methods. An interesting area that has seen a renewed interest with Bayesian modeling is the development of new Monte-Carlo methods for sampling complex probability distributions. Some of the workhorses of modern Machine Learning – Annealed Importance Sampling (AIS) (Neal, 2001) and Hamiltonian or Hybrid Monte-Carlo (HMC) (Neal et al., 2011) – are intimately related to physics. As pointed out by Radford Neal, AIS is just the Jarzynski inequality (Jarzynski, 1997) as a Monte-Carlo method, and HMC was developed by physicists and exploits Hamiltonian dynamics to improve proposal distributions.

• Statistical physics style theory of Deep Learning. Many techniques in ML have origins in statistical physics. Yet, a physics-style theory of Deep Learning remains elusive. A key question is to ask when and why these models manage to generalize well. Physicists are only beginning to ask these questions (Advani and Saxe, 2017; Mehta and Schwab, 2014; Saxe et al., 2013; Shwartz-Ziv and Tishby, 2017). But right now, it is fair to say that the insights remain scattered and a cohesive theoretical understanding is lacking.

• Biological physics and ML. Biological physics is generating ever more datasets in fields ranging from neuroscience to evolution and immunology. It is likely that ML will be an important part of the biophysics toolkit in the future. Many of the authors of this review were inspired to engage with ML for this reason.

• Using ideas from physics to develop new ML algorithms. Many of the core ideas of ML, from Monte-Carlo techniques to variational methods, have their origin in physics. There has been a tremendous amount of recent work developing tools to understand physical systems that may be of potential use to ML. For example, in quantum condensed matter, techniques such as DMRG, MERA, etc. have enriched both our practical and conceptual understandings (Stoudenmire and White, 2012; Vidal, 2007; White, 1992). It will be interesting to figure out how and if these numerical methods can be translated from a physics to a ML setting. There are tantalizing hints that this is likely to be a fruitful direction (Han et al., 2017; Stoudenmire, 2018; Stoudenmire and Schwab, 2016).

B. Topics not covered in review

Despite the considerable length of the review, we have had to make many omissions for the sake of brevity. It is our hope and belief that after reading this review the reader will have the conceptual and practical knowledge to quickly learn about these other topics. Among the most prominent topics missing from this review are:

• Temporal/Sequential Data. We have not covered techniques for dealing with temporal or sequential data. Here, too, there are many connections with statistical physics. A powerful class of models for sequential data called Hidden Markov Models (Rabiner, 1989), which utilize dynamical programming techniques, have natural statistical physics interpretations in terms of transfer matrices (see (Mehta et al., 2011) for an explicit example of this). Recently, Recurrent Neural Networks (RNNs) have become an important and powerful tool for dealing with sequence data (Goodfellow et al., 2016). RNNs generalize many of the ideas discussed in the DNN section to deal with temporal data.

• Reinforcement Learning. Many of the most exciting developments in the last five years have come from combining ideas from reinforcement learning with deep neural networks (Mnih et al., 2015; Sutton and Barto, 1998). RL traces its origins to behaviourist psychology, when it was conceived as a way to explain and study reward-based decision making. RL was put on solid mathematical grounds in the 50's by Richard Bellman and collaborators, and has by now become an inseparable part of robotics and artificial intelligence. RL is a field of Machine Learning in which an agent learns how to master performing a specific task through an interaction with its environment. Depending on the reward it receives, the agent chooses to take an action affecting the environment, which in turn determines the value of the next received reward, and so on. The long-term goal of the agent is to maximize the cumulative expected return, thus improving its performance in the longer run. Shadowed by more traditional optimal control algorithms, Reinforcement Learning has only recently taken off in physics (Albarran-Arriagada et al., 2018; August and Hernández-Lobato, 2018; Bukov, 2018; Bukov et al., 2018; Cárdenas-López et al., 2017; Chen et al., 2014; Chen and Xue, 2019; Dunjko et al., 2017; Fösel et al., 2018; Lamata, 2017; Melnikov et al., 2017; Neukart et al., 2017; Niu et al., 2018; Ramezanpour, 2017; Reddy et al., 2016b; Sriarunothai et al., 2017; Sweke et al., 2018; Zhang et al., 2018). Of particular interest are biophysics-inspired works that seek to use RL to understand navigation and sensing in turbulent environments (Colabrese et al., 2017; Masson et al., 2009; Reddy et al., 2016a; Vergassola et al., 2007).

• Support Vector Machines (SVMs) and Kernel Methods. SVMs and kernel methods are a powerful set of techniques that work well when the amount of training data is limited (Burges, 1998). The mathematics and theory of SVMs are very different from statistical physics, and for this reason we chose not to include them here. However, SVMs and kernel methods have played an extremely important role in ML and are worth understanding.

C. Rebranding Machine Learning as “Artificial Intelligence”

The immense scientific progress in ML has also been accompanied by a massive public relations effort centered around Silicon Valley. Starting with the success of ImageNet (the most prominent early use of GPUs for training large models) and the widespread adoption of Deep Learning based techniques by the Silicon Valley companies, there has been a deliberate move to rebrand modern ML as “artificial intelligence” or AI (see graphs in (Katz, 2017)). Recently, computer scientist Michael I. Jordan (who is famously known for his formalization of variational inference, Bayesian networks, and the expectation-maximization algorithm in machine learning research) cautioned that “This confluence of ideas and technology trends has been rebranded as ‘AI’ over the past few years. This rebranding is worthy of some scrutiny” (Jordan, 2018).

AI, by design, is an ambiguous term that mixes aspirations with reality. It also conflates the statistical ideas that form the basis of modern ML with the more commonplace notions about what humans and behavioral scientists mean by intelligence (see (Lake et al., 2017) for an enlightening and important modern discussion of this distinction from a quantitative cognitive science point of view, as well as (Dreyfus, 1965) for a surprisingly relevant philosophy-based critique from 1965).

Almost all the techniques discussed here rely on optimizing a pre-specified objective function on a given dataset. Yet, we know that for large, complex models, changing the data distribution or the goal can lead to an immediate degradation of performance. Deep networks generalize poorly to even a slightly different context (the infamous Validation-Test set mismatch). This inability to abstract and generalize is a common criticism lobbied against branding modern ML techniques as AI (Lake et al., 2017). For all these reasons, we have chosen to use the term Machine Learning rather than artificial intelligence throughout the review.

This is far from the first time we have seen the use of the term artificial intelligence and the grandiose promises that it implies. In fact, the early 1950's and 1960's as well as the early 1980's saw similar AI bubbles (see this interesting summary by Luke Muehlhauser for Open Philanthropy (Muehlhauser, 2016)). These AI bubbles have been followed by what have been dubbed “AI Winters” (McDermott et al., 1985).

The “Singularity” may not be coming, but the advances in computing and the availability of large data sets likely ensure that the kind of statistical learning frameworks discussed here are here to stay. Rather than a general artificial intelligence, the kind of techniques presented here seem to be best suited for three important tasks: (a) automating prediction from lots of labeled examples in a narrowly-defined setting, (b) learning how to parameterize and capture the correlations of complex probability distributions, and (c) finding policies for tasks with well-defined goals and clear rules. We hope that this review has given the reader enough conceptual tools to start forming their own opinions about reality and hype when it comes to modern ML research. As Michael I. Jordan puts it, “...if the acronym ‘AI’ continues to be used as placeholder nomenclature going forward, let's be aware of the very real limitations of this placeholder. Let's broaden our scope, tone down the hype and recognize the serious challenges ahead” (Jordan, 2018).

D. Social Implications of Machine Learning

The last decade has also seen a systematic increase in the use and deployment of Machine Learning techniques into new areas of life and society. Some of the readers of this review may currently be (or eventually be) employed in industrial settings that seek to harness ML for practical purposes. However, caution is in order when applying ML. Without foresight and accountability, the scale and scope of modern ML algorithms can lead to large-scale unaccountable and undemocratic outcomes that can reinforce or even worsen existing inequality and inequities. Mathematician and data scientist turned social commentator Cathy O'Neil has dubbed the indiscriminate use of these Big Data techniques “Weapons of Math Destruction” (O'Neil, 2017).

When ML is used in a social context, abstract statistical relationships have real social consequences. False positives can mean the difference between life and death (for example in the context of “signature drone strikes”) (Mehta, 2015). ML algorithms, like all techniques, have important limitations and should be employed with great caution. It is our hope that ML practitioners keep this in mind when working in social settings.

All algorithms involve inherent tradeoffs in fairness, a point formalized by computer scientist Jon Kleinberg and collaborators in a very interesting recent paper (Kleinberg et al., 2016). It is far from clear how to make algorithms fair for all people involved. This is even more true with methods like Deep Learning that are hard to interpret. All ML algorithms have implicit assumptions and choices, reflected in everything from the datasets we use to the kind of functions we choose to optimize. It is important to remember that there is no “view from nowhere” (Adam, 2006; Katz, 2017) – all ML algorithms reflect a point of view and a set of assumptions about the world we live in. For this reason, we hope that ML practitioners and data scientists will take the time to consider the social consequences of their actions. For example, a Hippocratic Oath for data scientists is now being considered (Simonite, 2018). Doing no harm seems like a good start for making sure that we harness ML for the benefit of all members of society.

XIX. ACKNOWLEDGMENTS

PM and DJS would like to thank Anirvan Sengupta, Justin Kinney, and Ilya Nemenman for useful conversations during the ACP working group. The authors are also grateful to all readers who provided valuable feedback on this manuscript while it was under peer review. We encourage readers to help keep the Notebooks which accompany the review up-to-date, by contributing to them on Github at https://github.com/drckf/mlreview_notebooks. PM, CHW, and AD were supported by Simon's Foundation in the form of a Simons Investigator in the MMLS and NIH MIRA program grant: 1R35GM119461. MB acknowledges support from the Emergent Phenomena in Quantum Systems initiative of the Gordon and Betty Moore Foundation, the ERC synergy grant UQUAM, and the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Quantum Algorithm Teams Program. DJS was supported as a Simons Investigator in the MMLS and by NIH K25 grant GM098875-02. PM and DJS would like to thank the NSF grant: PHYD1066293 for supporting the Aspen Center for Physics (ACP) for facilitating discussions leading to this work. This research was supported in part by the National Science Foundation under Grant No. NSF PHY-1748958. The authors are pleased to acknowledge that the computational work reported on in this paper was performed on the Shared Computing Cluster which is administered by Boston University's Research Computing Services.

Appendix A: Overview of the Datasets used in the Review

1. Ising dataset

The Ising dataset we use throughout the review was generated using the standard Metropolis algorithm to generate a Markov Chain. The full dataset consists of 16 × 10000 samples of 40 × 40 spin configurations (i.e. the design matrix has 160000 samples and 1600 features) drawn at temperatures 0.25, 0.5, ..., 4.0. The samples are drawn from the Boltzmann distribution of the two-dimensional ferromagnetic Ising model on a 40 × 40 square lattice with periodic boundary conditions.

2. SUSY dataset

The SUSY dataset was generated by Baldi et al. (Baldi et al., 2014) to explore the efficacy of using Deep Learning for classifying collision events. The dataset is downloadable from the UCI Machine Learning Repository, a wonderful resource for interesting datasets. Here we quote directly from the paper:

The data has been produced using Monte Carlo simulations and contains events with two leptons (electrons or muons). In high energy physics experiments, such as the ATLAS and CMS detectors at the CERN LHC, one major hope is the discovery of new particles. To accomplish this task, physicists attempt to sift through data events and classify them as either a signal of some new physics process or particle, or instead a background event from understood Standard Model processes. Unfortunately we will never know for sure what underlying physical process happened (the only information to which we have access are the final state particles). However, we can attempt to define parts of phase space that will have a high percentage of signal events. Typically this is done by using a series of simple requirements on the kinematic quantities of the final state particles, for example having one or more leptons with large amounts of momentum that is transverse to the beam line (pT). Here instead we will use logistic regression in order to attempt to find out the relative probability that an event is from a signal or a background event; rather than using the kinematic quantities of final state particles directly, we will use the output of our logistic regression to define a part of phase space that is enriched in signal events. The dataset we are using has the value of 18 kinematic variables ("features") of the event. The first 8 features are direct measurements of final state particles, in this case the pT, pseudo-rapidity, and azimuthal angle of two leptons in the event and the amount of missing transverse momentum (MET) together with its azimuthal angle. The last ten features are functions of the first 8 features; these are high-level features derived by physicists to help discriminate between the two classes. You can think of them as physicists' attempts to use non-linear functions to classify signal and background events, developed with a lot of deep thinking on the part of physicists. There is, however, an interest in using deep learning methods to obviate the need for physicists to manually develop such features. Benchmark results using Bayesian Decision Trees from a standard physics package and 5-layer neural networks with the dropout algorithm are presented in the original paper to compare the ability of deep learning to bypass the need for such high-level features. We will also explore this topic in later notebooks. The dataset consists of 5 million events; the first 4,500,000 will be used for training the model and the last 500,000 examples will be used as a test set.
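As an illustration of how such a workflow might look in Python, the following minimal sketch loads the dataset and fits a logistic regression, assuming the file SUSY.csv.gz downloaded from the UCI repository has the layout described above (first column the signal/background label, then the 18 kinematic features, no header row); the column names are our own:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Assumed UCI layout: label first, then the 18 kinematic features, no header
cols = ["label"] + ["feature_{}".format(i) for i in range(1, 19)]
df = pd.read_csv("SUSY.csv.gz", header=None, names=cols)

# Split used above: first 4,500,000 events for training, last 500,000 for testing
X_train, y_train = df.iloc[:4_500_000, 1:], df.iloc[:4_500_000, 0]
X_test, y_test = df.iloc[-500_000:, 1:], df.iloc[-500_000:, 0]

# A simple baseline classifier, as discussed in the quoted passage
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```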

3. MNIST dataset

The MNIST dataset is one of the simplest and most widely used machine learning datasets. It consists of hand-written images of the numerical characters 0–9, split into a training set of 60,000 examples and a test set of 10,000 examples (LeCun et al., 1998a). Information about the MNIST database and its historical importance can be found at Yann LeCun's website: http://yann.lecun.com/exdb/mnist/. A brief description from the website:

The original black and white (bilevel) images from NIST were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. The images were centered in a 28x28 image by computing the center of mass of the pixels, and translating the image so as to position this point at the center of the 28x28 field.

MNIST is included by default in many modern ML packages.
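For convenience, here is a minimal sketch of one common way to load MNIST in Python, using scikit-learn's fetch_openml interface to the mnist_784 dataset hosted on OpenML (one of several packages that bundle MNIST), together with the canonical 60,000/10,000 train/test split:

```python
from sklearn.datasets import fetch_openml

# Fetch MNIST from OpenML: 70,000 grayscale 28x28 digit images,
# each flattened into a 784-dimensional feature vector
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)

# Canonical split: first 60,000 examples for training, last 10,000 for testing
X_train, y_train = X[:60000] / 255.0, y[:60000]
X_test, y_test = X[60000:] / 255.0, y[60000:]
```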

REFERENCES

Abu-Mostafa, Yaser S, Malik Magdon-Ismail, and Hsuan-Tien Lin (2012), Learning from data, Vol. 4 (AMLBook, New York, NY, USA).

Ackley, David H, Geoffrey E Hinton, and Terrence J Sejnowski (1987), “A learning algorithm for boltzmann machines,” in Readings in Computer Vision (Elsevier) pp. 522–533.

Adam, Alison (2006), Artificial knowing: Gender and the thinking machine (Routledge).

Advani, Madhu, and Surya Ganguli (2016), “Statistical mechanics of optimal convex inference in high dimensions,” Physical Review X 6 (3), 031034.

Advani, Madhu, Subhaneil Lahiri, and Surya Ganguli (2013), “Statistical mechanics of complex neural systems and high dimensional data,” Journal of Statistical Mechanics: Theory and Experiment 2013 (03), P03014.

Advani, Madhu S, and Andrew M Saxe (2017), “High-dimensional dynamics of generalization error in neural networks,” arXiv preprint arXiv:1710.03667.

Aitchison, Laurence, Nicola Corradi, and Peter E Latham (2016), “Zipf's law arises naturally when there are underlying, unobserved variables,” PLoS computational biology 12 (12), e1005110.

Albarran-Arriagada, F, J. C. Retamal, E. Solano, and L. Lamata (2018), “Measurement-based adaptation protocol with quantum reinforcement learning,” arXiv:1803.05340.

Alemi, Alexander A, Ian Fischer, Joshua V Dillon, and Kevin Murphy (2016), “Deep variational information bottleneck,” arXiv preprint arXiv:1612.00410.

Alemi, Alireza, and Alia Abbara (2017), “Exponential capacity in an autoencoder neural network with a hidden layer,” arXiv preprint arXiv:1705.07441.

Amit, Daniel J (1992), Modeling brain function: The world of attractor neural networks (Cambridge university press).

Amit, Daniel J, Hanoch Gutfreund, and Haim Sompolinsky (1985), “Spin-glass models of neural networks,” Physical Review A 32 (2), 1007.

Andrieu, Christophe, Nando De Freitas, Arnaud Doucet, and Michael I Jordan (2003), “An introduction to mcmc for machine learning,” Machine learning 50 (1-2), 5–43.

Arai, Shunta, Masayuki Ohzeki, and Kazuyuki Tanaka (2017), “Deep neural network detects quantum phase transition,” arXiv preprint arXiv:1712.00371.

Arimoto, Suguru (1972), “An algorithm for computing the capacity of arbitrary discrete memoryless channels,” IEEE Transactions on Information Theory 18 (1), 14–20.

Arsenault, Louis-François, Alejandro Lopez-Bezanilla, O Anatole von Lilienfeld, and Andrew J Millis (2014), “Machine learning for many-body physics: the case of the anderson impurity model,” Physical Review B 90 (15), 155136.

Arunachalam, Srinivasan, and Ronald de Wolf (2017), “A survey of quantum learning theory,” arXiv preprint arXiv:1701.06806.

August, Moritz, and José Miguel Hernández-Lobato (2018), “Taking gradients through experiments: Lstms and memory proximal policy optimization for black-box quantum control,” arXiv:1802.04063.

Aurisano, A, A Radovic, D Rocco, A Himmel, MD Messier, E Niner, G Pawloski, F Psihas, A Sousa, and P Vahle (2016), “A convolutional neural network neutrino event classifier,” Journal of Instrumentation 11 (09), P09001.

Baireuther, P, TE O'Brien, B Tarasinski, and CWJ Beenakker (2017), “Machine-learning-assisted correction of correlated qubit errors in a topological code,” arXiv preprint arXiv:1705.07855.

Baity-Jesi, M, L. Sagun, M. Geiger, S. Spigler, G. Ben Arous, C. Cammarota, Y. LeCun, M. Wyart, and G. Biroli (2018), “Comparing dynamics: Deep neural networks versus glassy systems.”

Baldassi, Carlo, Federica Gerace, Hilbert J Kappen, Carlo Lucibello, Luca Saglietti, Enzo Tartaglione, and Riccardo Zecchina (2017), “On the role of synaptic stochasticity in training low-precision neural networks,” arXiv preprint arXiv:1710.09825.

Baldassi, Carlo, Federica Gerace, Luca Saglietti, and Riccardo Zecchina (2018), “From inverse problems to learning: a statistical mechanics approach,” in Journal of Physics: Conference Series, Vol. 955 (IOP Publishing) p. 012001.

Baldi, Pierre, Peter Sadowski, and Daniel Whiteson (2014), “Searching for exotic particles in high-energy physics with deep learning,” Nature communications 5, 4308.

Barber, David (2012), Bayesian reasoning and machine learning (Cambridge University Press).

Barnes, Josh, and Piet Hut (1986), “A hierarchical o(n log n) force-calculation algorithm,” Nature 324 (6096), 446.

Barra, Adriano, Alberto Bernacchia, Enrica Santucci, and Pierluigi Contucci (2012), “On the equivalence of hopfield networks and boltzmann machines,” Neural Networks 34, 1–9.

Barra, Adriano, Giuseppe Genovese, Peter Sollich, and Daniele Tantari (2016), “Phase transitions in restricted boltzmann machines with generic priors,” arXiv preprint arXiv:1612.03132.

Barra, Adriano, Giuseppe Genovese, Daniele Tantari, and Peter Sollich (2017), “Phase diagram of restricted boltzmann machines and generalised hopfield networks with arbitrary priors,” arXiv preprint arXiv:1702.05882.

Battiti, Roberto (1992), “First- and second-order methods for learning: between steepest descent and newton's method,” Neural computation 4 (2), 141–166.

Benedetti, Marcello, John Realpe-Gómez, Rupak Biswas, and Alejandro Perdomo-Ortiz (2016), “Quantum-assisted learning of graphical models with arbitrary pairwise connectivity,” arXiv preprint arXiv:1609.02542.

Benedetti, Marcello, John Realpe-Gómez, and Alejandro Perdomo-Ortiz (2017), “Quantum-assisted helmholtz machines: A quantum-classical deep learning framework for industrial datasets in near-term devices,” arXiv preprint arXiv:1708.09784.

Bengio, Yoshua (2012), “Practical recommendations for gradient-based training of deep architectures,” in Neural networks: Tricks of the trade (Springer) pp. 437–478.

Bennett, Robert (1969), “The intrinsic dimensionality of signal collections,” IEEE Transactions on Information Theory 15 (5), 517–525.

Bény, Cédric (2018), “Inferring relevant features: from qft to pca,” arXiv preprint arXiv:1802.05756.

Berger, James O, and José M Bernardo (1992), “On the development of the reference prior method,” Bayesian statistics 4, 35–60.

Bickel, Peter J, and David A Freedman (1981), “Some asymptotic theory for the bootstrap,” The Annals of Statistics, 1196–1217.

Bickel, Peter J, Bo Li, Alexandre B Tsybakov, Sara A van de Geer, Bin Yu, Teófilo Valdés, Carlos Rivero, Jianqing Fan, and Aad van der Vaart (2006), “Regularization in statistics,” Test 15 (2), 271–344.

Bishop, C M (2006), Pattern recognition and machine learning (Springer).

Bishop, Chris M (1995a), “Training with noise is equivalent to tikhonov regularization,” Neural computation 7 (1), 108–116.

Bishop, Christopher M (1995b), Neural networks for pattern recognition (Oxford university press).

Blahut, Richard (1972), “Computation of channel capacity and rate-distortion functions,” IEEE transactions on Information Theory 18 (4), 460–473.

Bottou, Léon (2012), “Stochastic gradient descent tricks,” in Neural networks: Tricks of the trade (Springer) pp. 421–436.

Bowman, Samuel R, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio (2015), “Generating sentences from a continuous space,” arXiv preprint arXiv:1511.06349.

Boyd, Stephen, and Lieven Vandenberghe (2004), Convex optimization (Cambridge university press).

Bradde, Serena, and William Bialek (2017), “Pca meets rg,” Journal of Statistical Physics 167 (3-4), 462–475.

Breiman, Leo (1996), “Bagging predictors,” Machine learning 24 (2), 123–140.

Breiman, Leo (2001), “Random forests,” Machine learning 45 (1), 5–32.

Breuckmann, Nikolas P, and Xiaotong Ni (2017), “Scalable neural network decoders for higher dimensional quantum codes,” arXiv preprint arXiv:1710.09489.

Broecker, Peter, Fakher F Assaad, and Simon Trebst (2017), “Quantum phase recognition via unsupervised machine learning,” arXiv preprint arXiv:1707.00663.

Bromley, Thomas R, and Patrick Rebentrost (2018), “Batched quantum state exponentiation and quantum hebbian learning,” arXiv:1803.07039.

Bukov, Marin (2018), “Reinforcement learning for autonomous preparation of floquet-engineered states: Inverting the quantum kapitza oscillator,” Phys. Rev. B 98, 224305.

Bukov, Marin, Alexandre G. R. Day, Dries Sels, Phillip Weinberg, Anatoli Polkovnikov, and Pankaj Mehta (2018), “Reinforcement learning in different phases of quantum control,” Phys. Rev. X 8, 031086.

Burges, Christopher JC (1998), “A tutorial on support vector machines for pattern recognition,” Data mining and knowledge discovery 2 (2), 121–167.

Caio, Marcello D, Marco Caccin, Paul Baireuther, Timo Hyart, and Michel Fruchart (2019), “Machine learning assisted measurement of local topological invariants,” arXiv preprint arXiv:1901.03346.

Caldeira, J, WLK Wu, B Nord, C Avestruz, S Trivedi, and KT Story (2018), “Deepcmb: Lensing reconstruction of the cosmic microwave background with deep neural networks,” arXiv preprint arXiv:1810.01483.

Canabarro, Askery, Samuraí Brito, and Rafael Chaves (2018), “Machine learning non-local correlations,” arXiv preprint arXiv:1808.07069.

Cárdenas-López, FA, L Lamata, JC Retamal, and E Solano (2017), “Generalized quantum reinforcement learning with quantum technologies,” arXiv preprint arXiv:1709.07848.

Carleo, Giuseppe (2018), Private Communication.

Carleo, Giuseppe, Yusuke Nomura, and Masatoshi Imada (2018), “Constructing exact representations of quantum many-body systems with deep neural networks,” arXiv preprint arXiv:1802.09558.

Carleo, Giuseppe, and Matthias Troyer (2017), “Solving the quantum many-body problem with artificial neural networks,” Science 355 (6325), 602–606.

Carrasquilla, Juan, and Roger G Melko (2017), “Machine learning phases of matter,” Nature Physics 13 (5), 431.

Carrasquilla, Juan, Giacomo Torlai, Roger G Melko, and Leandro Aolita (2018), “Reconstructing quantum states with generative models,” arXiv preprint arXiv:1810.10584.

Chalk, Matthew, Olivier Marre, and Gasper Tkacik (2016), “Relevant sparse codes with variational information bottleneck,” in Advances in Neural Information Processing Systems, pp. 1957–1965.

Chamberland, Christopher, and Pooya Ronagh (2018), “Deep neural decoders for near term fault-tolerant experiments,” arXiv preprint arXiv:1802.06441.

Chechik, Gal, Amir Globerson, Naftali Tishby, and Yair Weiss (2005), “Information bottleneck for gaussian variables,” Journal of machine learning research 6 (Jan), 165–188.

Chen, Chunlin, Daoyi Dong, Han-Xiong Li, Jian Chu, and Tzyh-Jong Tarn (2014), “Fidelity-based probabilistic q-learning for control of quantum systems,” IEEE transactions on neural networks and learning systems 25 (5), 920–933.

Chen, Jing, Song Cheng, Haidong Xie, Lei Wang, and Tao Xiang (2018), “Equivalence of restricted boltzmann machines and tensor network states,” Phys. Rev. B 97, 085104.

Chen, Jun-Jie, and Ming Xue (2019), “Manipulation of spin dynamics by deep reinforcement learning agent,” arXiv preprint arXiv:1901.08748.

Chen, Tianqi, and Carlos Guestrin (2016), “Xgboost: A scalable tree boosting system,” in Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining (ACM) pp. 785–794.

Cheng, Song, Jing Chen, and Lei Wang (2017), “Information perspective to probabilistic modeling: Boltzmann machines versus born machines,” arXiv preprint arXiv:1712.04144.

Ch'ng, Kelvin, Nick Vazquez, and Ehsan Khatami (2017), “Unsupervised machine learning account of magnetic transitions in the hubbard model,” arXiv preprint arXiv:1708.03350.

Ciliberto, Carlo, Mark Herbster, Alessandro Davide Ialongo, Massimiliano Pontil, Andrea Rocchetto, Simone Severini, and Leonard Wossnig (2017), “Quantum machine learning: a classical perspective.”

Cohen, Nadav, Or Sharir, and Amnon Shashua (2016), “On the expressive power of deep learning: A tensor analysis,” in Conference on Learning Theory, pp. 698–728.

Colabrese, Simona, Kristian Gustavsson, Antonio Celani, and Luca Biferale (2017), “Flow navigation by smart microswimmers via reinforcement learning,” Physical review letters 118 (15), 158004.

Cossu, Guido, Luigi Del Debbio, Tommaso Giani, Ava Khamseh, and Michael Wilson (2018), “Machine learning determination of dynamical parameters: The ising model case,” arXiv preprint arXiv:1810.11503.

Cox, Trevor F, and Michael AA Cox (2000), Multidimensional scaling (CRC press).

Cristoforetti, Marco, Giuseppe Jurman, Andrea I Nardelli, and Cesare Furlanello (2017), “Towards meaningful physics from generative models,” arXiv preprint arXiv:1705.09524.

Dahl, George, Abdel-rahman Mohamed, Geoffrey E Hinton, et al. (2010), “Phone recognition with the mean-covariance restricted boltzmann machine,” in Advances in neural information processing systems, pp. 469–477.

Daskin, Ammar (2018), “A quantum implementation model for artificial neural networks,” Quanta, 7–18.

Davaasuren, Amarsanaa, Yasunari Suzuki, Keisuke Fujii, and Masato Koashi (2018), “General framework for constructing fast and near-optimal machine-learning-based decoder of the topological stabilizer codes,” arXiv preprint arXiv:1801.04377.

Day, Alexandre GR, Marin Bukov, Phillip Weinberg, Pankaj Mehta, and Dries Sels (2019), “Glassy phase of optimal quantum control,” Physical Review Letters 122 (2), 020601.

Day, Alexandre GR, and Pankaj Mehta (2018), “Validated agglomerative clustering,” in preparation.

Decelle, Aurélien, Giancarlo Fissore, and Cyril Furtlehner (2017), “Spectral learning of restricted boltzmann machines,” arXiv preprint arXiv:1708.02917.

Decelle, Aurélien, Giancarlo Fissore, and Cyril Furtlehner (2018), “Thermodynamics of restricted boltzmann machines and related learning dynamics,” arXiv preprint arXiv:1803.01960.

Dempster, Arthur P, Nan M Laird, and Donald B Rubin (1977), “Maximum likelihood from incomplete data via the em algorithm,” Journal of the royal statistical society. Series B (methodological), 1–38.

Deng, Dong-Ling, Xiaopeng Li, and S Das Sarma (2017), “Quantum entanglement in neural network states,” Physical Review X 7 (2), 021021.

Dietterich, Thomas G, et al. (2000), “Ensemble methods in machine learning,” Multiple classifier systems 1857, 1–15.

Do, Chuong B, and Serafim Batzoglou (2008), “What is the expectation maximization algorithm?” Nature biotechnology 26 (8), 897–899.

Domingos, Pedro (2012), “A few useful things to know about machine learning,” Communications of the ACM 55 (10), 78–87.

Donoho, David L (2006), “Compressed sensing,” IEEE Transactions on information theory 52 (4), 1289–1306.

Dreyfus, Hubert L (1965), “Alchemy and artificial intelligence.”

Du, Simon S, Chi Jin, Jason D Lee, Michael I Jordan, Aarti Singh, and Barnabas Poczos (2017), “Gradient descent can take exponential time to escape saddle points,” in Advances in Neural Information Processing Systems, pp. 1067–1077.

Duchi, John, Elad Hazan, and Yoram Singer (2011), “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research 12 (Jul), 2121–2159.

Dunjko, Vedran, and Hans J Briegel (2017), “Machine learning and artificial intelligence in the quantum domain,” arXiv preprint arXiv:1709.02779.

Dunjko, Vedran, Yi-Kai Liu, Xingyao Wu, and Jacob M Taylor (2017), “Super-polynomial and exponential improvements for quantum-enhanced reinforcement learning,” arXiv preprint arXiv:1710.11160.

Efron, B (1979), “Bootstrap methods: another look at the jackknife,” The Annals of Statistics 7, 1–26.

Efron, Bradley, Trevor Hastie, Iain Johnstone, Robert Tibshirani, et al. (2004), “Least angle regression,” The Annals of statistics 32 (2), 407–499.

Eisen, Michael B, Paul T Spellman, Patrick O Brown, and David Botstein (1998), “Cluster analysis and display of genome-wide expression patterns,” Proceedings of the National Academy of Sciences 95 (25), 14863–14868.

Elith, Jane, Steven J Phillips, Trevor Hastie, Miroslav Dudík, Yung En Chee, and Colin J Yates (2011), “A statistical explanation of maxent for ecologists,” Diversity and distributions 17 (1), 43–57.

Ernst, Oliver K, Thomas Bartol, Terrence Sejnowski, and Eric Mjolsness (2018), “Learning dynamic boltzmann distributions as reduced models of spatial chemical kinetics,” arXiv preprint arXiv:1803.01063.

Ester, Martin, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. (1996), “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Kdd, Vol. 96, pp. 226–231.

Faddeev, Ludvig D, and Victor N Popov (1967), “Feynman diagrams for the yang-mills field,” Physics Letters B 25 (1), 29–30.

Finol, David, Yan Lu, Vijay Mahadevan, and Ankit Srivastava (2018), “Deep convolutional neural networks for eigenvalue problems in mechanics,” arXiv preprint arXiv:1801.05733.

Fisher, Charles K, and Pankaj Mehta (2015a), “Bayesian feature selection for high-dimensional linear regression via the ising approximation with applications to genomics,” Bioinformatics 31 (11), 1754–1761.

Fisher, Charles K, and Pankaj Mehta (2015b), “Bayesian feature selection with strongly regularizing priors maps to the ising model,” Neural computation 27 (11), 2411–2422.

Foreman, Sam, Joel Giedt, Yannick Meurice, and Judah Unmuth-Yockey (2017), “Rg inspired machine learning for lattice field theory,” arXiv preprint arXiv:1710.02079.

Fösel, Thomas, Petru Tighineanu, Talitha Weiss, and Florian Marquardt (2018), “Reinforcement learning with neural networks for quantum feedback,” arXiv:1802.05267.

Freitas, Nahuel, Giovanna Morigi, and Vedran Dunjko (2018), “Neural network operations and susuki-trotter evolution of neural network states,” arXiv preprint arXiv:1803.02118.

Freund, Yoav, Robert Schapire, and Naoki Abe (1999), “A short introduction to boosting,” Journal-Japanese Society For Artificial Intelligence 14 (771-780), 1612.

Freund, Yoav, and Robert E Schapire (1995), “A decision-theoretic generalization of on-line learning and an application to boosting,” in European conference on computational learning theory (Springer) pp. 23–37.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani (2001), The elements of statistical learning, Vol. 1 (Springer series in statistics New York).

Friedman, Jerome H (2001), “Greedy function approximation: a gradient boosting machine,” Annals of statistics, 1189–1232.

Friedman, Jerome H (2002), “Stochastic gradient boosting,” Computational Statistics & Data Analysis 38 (4), 367–378.

Friedman, Jerome H, Bogdan E Popescu, et al. (2003), “Importance sampled learning ensembles,” Journal of Machine Learning Research 94305.

Fu, Michael C (2006), “Gradient estimation,” Handbooks in operations research and management science 13, 575–616.

Funai, Shotaro Shiba, and Dimitrios Giataganas (2018), “Thermodynamics and feature extraction by machine learning,” arXiv preprint arXiv:1810.08179.

Gao, Jun, Lu-Feng Qiao, Zhi-Qiang Jiao, Yue-Chi Ma, Cheng-Qiu Hu, Ruo-Jing Ren, Ai-Lin Yang, Hao Tang, Man-Hong Yung, and Xian-Min Jin (2017), “Experimental machine learning of quantum states with partial information,” arXiv preprint arXiv:1712.00456.

Gao, Xun, and Lu-Ming Duan (2017), “Efficient representation of quantum many-body states with deep neural networks,” arXiv preprint arXiv:1701.05039.

Gelman, Andrew, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B Rubin (2014), Bayesian data analysis, Vol. 2 (CRC press Boca Raton, FL).

Gersho, Allen, and Robert M Gray (2012), Vector quantization and signal compression, Vol. 159 (Springer Science & Business Media).

Geurts, Pierre, Damien Ernst, and Louis Wehenkel (2006), “Extremely randomized trees,” Machine learning 63 (1), 3–42.

Glorot, Xavier, and Yoshua Bengio (2010), “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256.

Goldt, Sebastian, and Udo Seifert (2017), “Thermodynamic efficiency of learning a rule in neural networks,” arXiv preprint arXiv:1706.09713.

Goodfellow, Ian (2016), “Nips 2016 tutorial: Generative adversarial networks,” arXiv preprint arXiv:1701.00160.

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville (2016), Deep Learning (MIT Press), http://www.deeplearningbook.org.

Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio (2014), “Generative adversarial nets,” in Advances in neural information processing systems, pp. 2672–2680.

Greplova, Eliska, Christian Kraglund Andersen, and Klaus Mølmer (2017), “Quantum parameter estimation with a neural network,” arXiv preprint arXiv:1711.05238.

Grisafi, Andrea, David M Wilkins, Gábor Csányi, and Michele Ceriotti (2017), “Symmetry-adapted machine-learning for tensorial properties of atomistic systems,” arXiv preprint arXiv:1709.06757.

Han, Zhao-Yu, Jun Wang, Heng Fan, Lei Wang, and Pan Zhang (2017), “Unsupervised generative modeling using matrix product states,” arXiv preprint arXiv:1709.01662.

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun (2015), “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, pp. 1026–1034.

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun (2016), “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.

Heimel, Theo, Gregor Kasieczka, Tilman Plehn, and Jennifer M Thompson (2018), “Qcd or what?” arXiv preprint arXiv:1808.08979.

Higgins, Irina, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner (2016), “beta-vae: Learning basic visual concepts with a constrained variational framework.”

Hinton, Geoffrey E (2002), “Training products of experts by minimizing contrastive divergence,” Neural computation 14 (8), 1771–1800.

Hinton, Geoffrey E (2012), “A practical guide to training restricted boltzmann machines,” in Neural networks: Tricks of the trade (Springer) pp. 599–619.

Hinton, Geoffrey E, Simon Osindero, and Yee-Whye Teh (2006), “A fast learning algorithm for deep belief nets,” Neural computation 18 (7), 1527–1554.

Hinton, Geoffrey E, and Ruslan R Salakhutdinov (2006), “Reducing the dimensionality of data with neural networks,” Science 313 (5786), 504–507.

Ho, Tin Kam (1998), “The random subspace method for constructing decision forests,” IEEE transactions on pattern analysis and machine intelligence 20 (8), 832–844.

Hopfield, John J (1982), “Neural networks and physical systems with emergent collective computational abilities,” Proceedings of the national academy of sciences 79 (8), 2554–2558.

Huang, Haiping (2017a), “Mean-field theory of input dimensionality reduction in unsupervised deep neural networks,” arXiv preprint arXiv:1710.01467.

Huang, Haiping (2017b), “Statistical mechanics of unsupervised feature learning in a restricted boltzmann machine with binary synapses,” Journal of Statistical Mechanics: Theory and Experiment 2017 (5), 053302.

Huang, Hengfeng, Bowen Xiao, Huixin Xiong, Zeming Wu, Yadong Mu, and Huichao Song (2018), “Applications of deep learning to relativistic hydrodynamics,” arXiv preprint arXiv:1801.03334.

Hubbard, J (1959), “Calculation of partition functions,” Physical Review Letters 3 (2), 77.

Huggins, William, Piyush Patil, Bradley Mitchell, K Birgitta Whaley, and E Miles Stoudenmire (2019), “Towards quantum machine learning with tensor networks,” Quantum Science and Technology 4 (2), 024001.

Iakovlev, I A, O. M. Sotnikov, and V. V. Mazurenko (2018), “Supervised learning magnetic skyrmion phases,” arXiv:1803.06682.

Innocenti, Luca, Leonardo Banchi, Alessandro Ferraro, Sougato Bose, and Mauro Paternostro (2018), “Supervised learning of time-independent hamiltonians for gate design,” arXiv:1803.07119.

Ioffe, Sergey, and Christian Szegedy (2015), “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, pp. 448–456.

Iso, Satoshi, Shotaro Shiba, and Sumito Yokoo (2018), “Scale-invariant feature extraction of neural network and renormalization group flow,” arXiv preprint arXiv:1801.07172.

Jarzynski, Christopher (1997), “Nonequilibrium equality for free energy differences,” Physical Review Letters 78 (14), 2690.

Jaynes, Edwin T (1957a), “Information theory and statistical mechanics,” Physical review 106 (4), 620.

Jaynes, Edwin T (1957b), “Information theory and statistical mechanics. ii,” Physical review 108 (2), 171.

Jaynes, Edwin T (1968), “Prior probabilities,” IEEE Transactions on systems science and cybernetics 4 (3), 227–241.

Jaynes, Edwin T (1996), Probability theory: the logic of science (Washington University St. Louis, MO).

Jaynes, Edwin T (2003), Probability theory: the logic of science (Cambridge university press).

Jeffreys, Harold (1946), “An invariant form for the prior probability in estimation problems,” Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences, 453–461.

Jin, Chi, Praneeth Netrapalli, and Michael I Jordan (2017), “Accelerated gradient descent escapes saddle points faster than gradient descent,” arXiv preprint arXiv:1711.10456.

Jordan, Michael (2018), “Artificial intelligence: The revolution hasn't happened yet,” Medium.

Jordan, Michael I, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul (1999), “An introduction to variational methods for graphical models,” Machine learning 37 (2), 183–233.

Kalantre, Sandesh S, Justyna P Zwolak, Stephen Ragole, Xingyao Wu, Neil M Zimmerman, MD Stewart, and Jacob M Taylor (2017), “Machine learning techniques for state recognition and auto-tuning in quantum dots,” arXiv preprint arXiv:1712.04914.

Katz, Yarden (2017), “Manufacturing an artificial intelligence revolution,” SSRN Preprint.

Kerenidis, Iordanis, and Anupam Prakash (2017), “Quantum gradient descent for linear systems and least squares,” arXiv preprint arXiv:1704.04992.

Keskar, Nitish Shirish, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang (2016), “On large-batch training for deep learning: Generalization gap and sharp minima,” arXiv preprint arXiv:1609.04836.

Kingma, Diederik P, and Jimmy Ba (2014), “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980.

Kingma, Diederik P, and Max Welling (2013), “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114.

Kingma, DP, et al. (2017), “Variational inference & deep learning,” PhD thesis, ISBN 978-94-6299-745-5.

Kleijnen, Jack PC, and Reuven Y Rubinstein (1996), “Optimization and sensitivity analysis of computer simulation models by the score function method,” European Journal of Operational Research 88 (3), 413–427.

Kleinberg, Jon, Sendhil Mullainathan, and Manish Raghavan (2016), “Inherent trade-offs in the fair determination of risk scores,” arXiv preprint arXiv:1609.05807.

Koch-Janusz, Maciej, and Zohar Ringel (2017), “Mutual information, neural networks and the renormalization group,” arXiv preprint arXiv:1704.06279.

Kohonen, Teuvo (1998), “The self-organizing map,” Neurocomputing 21 (1-3), 1–6.

Kong, Qingkai, Daniel T. Trugman, Zachary E. Ross, Michael J. Bianco, Brendan J. Meade, and Peter Gerstoft (2018), “Machine learning in seismology: Turning data into insights,” Seismological Research Letters, doi:10.1785/0220180259.

Krastanov, Stefan, and Liang Jiang (2017), “Deep neural network probabilistic decoder for stabilizer codes,” arXiv preprint arXiv:1705.09334.

Kriegel, Hans-Peter, Peer Kröger, and Arthur Zimek (2009), “Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering,” ACM Transactions on Knowledge Discovery from Data (TKDD) 3 (1), 1.

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton (2012), “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, pp. 1097–1105.

Krzakala, Florent, Andre Manoel, Eric W Tramel, and Lenka Zdeborová (2014), “Variational free energies for compressed sensing,” in Information Theory (ISIT), 2014 IEEE International Symposium on (IEEE) pp. 1499–1503.

Krzakala, Florent, Marc Mézard, François Sausset, YF Sun, and Lenka Zdeborová (2012a), “Statistical-physics-based reconstruction in compressed sensing,” Physical Review X 2 (2), 021005.

Krzakala, Florent, Marc Mézard, Francois Sausset, Yifan Sun, and Lenka Zdeborová (2012b), “Probabilistic reconstruction in compressed sensing: algorithms, phase diagrams, and threshold achieving matrices,” Journal of Statistical Mechanics: Theory and Experiment 2012 (08), P08009.

Lake, Brenden M, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman (2017), “Building machines that learn and think like people,” Behavioral and Brain Sciences 40.

Lamata, Lucas (2017), “Basic protocols in quantum reinforcement learning with superconducting circuits,” Scientific Reports 7.

Larsen, Bjornar, and Chinatsu Aone (1999), “Fast and effective text mining using linear-time document clustering,” in Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining (ACM) pp. 16–22.

Le, Quoc V (2013), “Building high-level features using large scale unsupervised learning,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on (IEEE) pp. 8595–8598.

LeCun, Yann, Yoshua Bengio, et al. (1995), “Convolutional networks for images, speech, and time series,” The handbook of brain theory and neural networks 3361 (10), 1995.

LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner (1998a), “Gradient-based learning applied to document recognition,” Proceedings of the IEEE 86 (11), 2278–2324.

LeCun, Yann, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller (1998b), “Efficient backprop,” in Neural networks: Tricks of the trade (Springer) pp. 9–50.

Lee, Jason D, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I Jordan, and Benjamin Recht (2017), “First-order methods almost always avoid saddle points,” arXiv preprint arXiv:1710.07406.

Lehmann, Erich L, and George Casella (2006), Theory of point estimation (Springer Science & Business Media).

Lehmann, Erich L, and Joseph P Romano (2006), Testing statistical hypotheses (Springer Science & Business Media).

Levine, Yoav, David Yakira, Nadav Cohen, and Amnon Shashua (2017), “Deep learning and quantum entanglement: Fundamental connections with implications to network design,” arXiv preprint arXiv:1704.01552.

Li, Bo, and David Saad (2017), “Exploring the function space of deep-learning machines,” arXiv preprint arXiv:1708.01422.

Li, Chian-De, Deng-Ruei Tan, and Fu-Jiun Jiang (2017), “Applications of neural networks to the studies of phase transitions of two-dimensional potts models,” arXiv preprint arXiv:1703.02369.

Li, Richard Y, Rosa Di Felice, Remo Rohs, and Daniel A Lidar (2018), “Quantum annealing versus classical machine learning applied to a simplified computational biology problem,” npj Quantum Information 4 (1), 14.

Li, Shuo-Hui, and Lei Wang (2018), “Neural network renormalization group,” arXiv preprint arXiv:1802.02840.

Lim, Tjen-Sien, Wei-Yin Loh, and Yu-Shan Shih (2000), “A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms,” Machine learning 40 (3), 203–228.

Lin, Henry W, Max Tegmark, and David Rolnick (2017), “Why does deep and cheap learning work so well?” Journal of Statistical Physics 168 (6), 1223–1247.

Linderman, G C, M. Rachh, J. G. Hoskins, S. Steinerberger, and Y. Kluger (2017), “Efficient Algorithms for t-distributed Stochastic Neighborhood Embedding,” arXiv e-prints, arXiv:1712.09005 [cs.LG].

Liu, Zhaocheng, Sean P Rodrigues, and Wenshan Cai (2017), “Simulating the ising model with a deep convolutional generative adversarial network,” arXiv preprint arXiv:1710.04987.

Loh, Wei-Yin (2011), “Classification and regression trees,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1 (1), 14–23.

Louppe, Gilles (2014), “Understanding random forests: From theory to practice,” arXiv preprint arXiv:1407.7502.

Lu, Sirui, Shilin Huang, Keren Li, Jun Li, Jianxin Chen, Dawei Lu, Zhengfeng Ji, Yi Shen, Duanlu Zhou, and Bei Zeng (2017), “A separability-entanglement classifier via machine learning,” arXiv preprint arXiv:1705.01523.

Maaten, Laurens van der, and Geoffrey Hinton (2008), “Visualizing data using t-sne,” Journal of machine learning research 9 (Nov), 2579–2605.

MacKay, David JC (2003), Information theory, inference and learning algorithms (Cambridge university press).

Marsland III, Robert, Wenping Cui, and Pankaj Mehta (2019), “The Minimum Environmental Perturbation Principle: A New Perspective on Niche Theory,” arXiv preprint arXiv:1901.09673.

Maskara, Nishad, Aleksander Kubica, and Tomas Jochym-O'Connor (2018), “Advantages of versatile neural-network decoding for topological codes,” arXiv preprint arXiv:1802.08680.

Masson, JB, M Bailly Bechet, and Massimo Vergassola (2009), “Chasing information to search in random environments,” Journal of Physics A: Mathematical and Theoretical 42 (43), 434009.

Mattingly, Henry H, Mark K Transtrum, Michael C Abbott, and Benjamin B Machta (2018), “Maximizing the information learned from finite data selects a simple model,” Proceedings of the National Academy of Sciences 115 (8), 1760–1765.

McDermott, Drew, M Mitchell Waldrop, B Chandrasekaran, John McDermott, and Roger Schank (1985), “The dark ages of ai: a panel discussion at aaai-84,” AI Magazine 6 (3), 122.

McInnes, Leland, John Healy, and James Melville (2018), “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction,” arXiv preprint arXiv:1802.03426 [stat.ML].

Mehta, Pankaj (2015), “Big data's radical potential,” Jacobin, https://www.jacobinmag.com/2015/03/big-data-drones-privacy-workers.

Mehta, Pankaj, and David J Schwab (2014), “An exact mapping between the variational renormalization group and deep learning,” arXiv preprint arXiv:1410.3831.

Mehta, Pankaj, David J Schwab, and Anirvan M Sengupta (2011), “Statistical mechanics of transcription-factor binding site discovery using hidden markov models,” Journal of statistical physics 142 (6), 1187–1205.

Melnikov, Alexey A, Hendrik Poulsen Nautrup, Mario Krenn, Vedran Dunjko, Markus Tiersch, Anton Zeilinger, and Hans J Briegel (2017), “Active learning machine learns to create new quantum experiments,” arXiv preprint arXiv:1706.00868.

Metz, Cade (2017), “Move over, coders-physicists will soon rule silicon valley,” Wired.

Mezard, Marc, and Andrea Montanari (2009), Information, physics, and computation (Oxford University Press).

Mhaskar, Hrushikesh, Qianli Liao, and Tomaso Poggio (2016), “Learning functions: when is deep better than shallow,” arXiv preprint arXiv:1603.00988.

Mitarai, Kosuke, Makoto Negoro, Masahiro Kitagawa, and Keisuke Fujii (2018), “Quantum circuit learning,” arXiv preprint arXiv:1803.00745.

Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. (2015), “Human-level control through deep reinforcement learning,” Nature 518 (7540), 529.

Morningstar, Alan, and Roger G Melko (2017), “Deep learning the ising model near criticality,” arXiv preprint arXiv:1708.04622.

Muehlhauser, Luke (2016), “What should we learn from past ai forecasts?” Open Philanthropy Website.

Müllner, Daniel (2011), “Modern hierarchical, agglomerative clustering algorithms,” arXiv preprint arXiv:1109.2378.

Murphy, Kevin (2012), Machine Learning: A Probabilistic Perspective (MIT press).

Nagai, Yuki, Huitao Shen, Yang Qi, Junwei Liu, and Liang Fu (2017), “Self-learning monte carlo method: Continuous-time algorithm,” arXiv preprint arXiv:1705.06724.

Neal, Radford M (2001), “Annealed importance sampling,” Statistics and computing 11 (2), 125–139.

Neal, Radford M, and Geoffrey E Hinton (1998), “A view of the em algorithm that justifies incremental, sparse, and other variants,” in Learning in graphical models (Springer) pp. 355–368.

Neal, Radford M, et al. (2011), “Mcmc using hamiltonian dynamics,” Handbook of Markov Chain Monte Carlo 2 (11).

Nesterov, Yurii (1983), “A method of solving a convex programming problem with convergence rate o(1/k^2),” in Soviet Mathematics Doklady, Vol. 27, pp. 372–376.

Neukart, Florian, David Von Dollen, Christian Seidel, and Gabriele Compostella (2017), “Quantum-enhanced reinforcement learning for finite-episode games with discrete state spaces,” arXiv preprint arXiv:1708.09354.

Nguyen, H Chau, Riccardo Zecchina, and Johannes Berg (2017), “Inverse statistical problems: from the inverse ising problem to data science,” Advances in Physics 66 (3), 197–261.

Nielsen, Michael A (2015), Neural networks and deep learning (Determination Press).

van Nieuwenburg, Evert, Eyal Bairey, and Gil Refael (2017a), “Learning phase transitions from dynamics,” arXiv preprint arXiv:1712.00450.

van Nieuwenburg, Evert PL, Ye-Hua Liu, and Sebastian D Huber (2017b), “Learning phase transitions by confusion,” Nature Physics 13 (5), 435.

Niu, Murphy Yuezhen, Sergio Boixo, Vadim Smelyanskiy, and Hartmut Neven (2018), “Universal quantum control through deep reinforcement learning,” arXiv preprint arXiv:1803.01857.

Nomura, Yusuke, Andrew Darmawan, Youhei Yamaji, and Masatoshi Imada (2017), “Restricted-boltzmann-machine learning for solving strongly correlated quantum systems,” arXiv preprint arXiv:1709.06475.

Ohtsuki, Tomi, and Tomoki Ohtsuki (2017), “Deep learning the quantum phase transitions in random electron systems: Applications to three dimensions,” Journal of the Physical Society of Japan 86 (4), 044708.

O'Neil, Cathy (2017), Weapons of math destruction: How big data increases inequality and threatens democracy (Broadway Books).

Papanikolaou, Stefanos, Michail Tzimas, Hengxu Song, Andrew CE Reid, and Stephen A Langer (2017), “Learning crystal plasticity using digital image correlation: Examples from discrete dislocation dynamics,” arXiv preprint arXiv:1709.08225.

Pedregosa, F, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011), “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research 12, 2825–2830.

Perdomo-Ortiz, Alejandro, Marcello Benedetti, John Realpe-Gómez, and Rupak Biswas (2017), “Opportunities and challenges for quantum-assisted machine learning in near-term quantum computers,” arXiv preprint arXiv:1708.09757.

Polyak, Boris T (1964), “Some methods of speeding up the convergence of iteration methods,” USSR Computational Mathematics and Mathematical Physics 4 (5), 1–17.

Qian, Ning (1999), “On the momentum term in gradient descent learning algorithms,” Neural networks 12 (1), 145–151.

Rabiner, Lawrence R (1989), “A tutorial on hidden markov models and selected applications in speech recognition,” Proceedings of the IEEE 77 (2), 257–286.

Radford, Alec, Luke Metz, and Soumith Chintala (2015), “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434.

Ramezanali, Mohammad, Partha P Mitra, and Anirvan M Sengupta (2015), “Critical behavior and universality classes for an algorithmic phase transition in sparse reconstruction,” arXiv preprint arXiv:1509.08995.

Ramezanpour, A (2017), “Optimization by a quantum reinforcement algorithm,” Phys. Rev. A 96, 052307.

Rao, Wen-Jia, Zhenyu Li, Qiong Zhu, Mingxing Luo, and Xin Wan (2017), “Identifying product order with restricted boltzmann machines,” arXiv preprint arXiv:1709.02597.

Ravanbakhsh, Siamak, Francois Lanusse, Rachel Mandelbaum, Jeff G Schneider, and Barnabas Poczos (2017), “Enabling dark energy science with deep generative models of galaxy images,” in AAAI, pp. 1488–1494.

Rebentrost, Patrick, Thomas R. Bromley, Christian Weedbrook, and Seth Lloyd (2017), “A quantum hopfield neural network,” arXiv:1710.03599.

Rebentrost, Patrick, Maria Schuld, Francesco Petruccione, and Seth Lloyd (2016), “Quantum gradient descent and newton's method for constrained polynomial optimization,” arXiv preprint arXiv:1612.01789.

Reddy, Gautam, Antonio Celani, Terrence J Sejnowski, and Massimo Vergassola (2016a), “Learning to soar in turbulent environments,” Proceedings of the National Academy of Sciences 113 (33), E4877–E4884.

Reddy, Gautam, Antonio Celani, and Massimo Vergassola (2016b), “Infomax strategies for an optimal balance between exploration and exploitation,” Journal of Statistical Physics 163 (6), 1454–1476.

Rem, Benno S, Niklas Käming, Matthias Tarnowski, Luca Asteria, Nick Fläschner, Christoph Becker, Klaus Sengstock, and Christof Weitenberg (2018), “Identifying quantum phase transitions using artificial neural networks on experimental data,” arXiv preprint arXiv:1809.05519.

Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra (2014), “Stochastic backpropagation and approximate inference in deep generative models,” arXiv preprint arXiv:1401.4082.

Rocchetto, Andrea, Scott Aaronson, Simone Severini, Gonzalo Carvacho, Davide Poderini, Iris Agresti, Marco Bentivegna, and Fabio Sciarrino (2017), “Experimental learning of quantum states,” arXiv preprint arXiv:1712.00127.

Rocchetto, Andrea, Edward Grant, Sergii Strelchuk, Giuseppe Carleo, and Simone Severini (2018), “Learning hard quantum distributions with variational autoencoders,” npj Quantum Information 4 (1), 28.

Rockafellar, Ralph Tyrell (2015), Convex analysis (Princeton university press).

Rodriguez, Alex, and Alessandro Laio (2014), “Clustering by fast search and find of density peaks,” Science 344 (6191), 1492–1496.

Rokach, Lior, and Oded Maimon (2005), “Clustering methods,” in Data mining and knowledge discovery handbook (Springer) pp. 321–352.

Roweis, Sam T, and Lawrence K Saul (2000), “Nonlinear dimensionality reduction by locally linear embedding,” Science 290 (5500), 2323–2326.

Rudelius, Tom (2018), “Learning to inflate,” arXiv preprint arXiv:1810.05159.

Ruder, Sebastian (2016), “An overview of gradient descent optimization algorithms,” arXiv preprint arXiv:1609.04747.

Rumelhart, David E, and David Zipser (1985), “Feature discovery by competitive learning,” Cognitive science 9 (1), 75–112.

Ruscher, Céline, and Jörg Rottler (2018), “Correlations in the shear flow of athermal amorphous solids: A principal component analysis,” arXiv preprint arXiv:1809.06487.

Saito, Hiroki, and Masaya Kato (2017), “Machine learning technique to find quantum many-body ground states of bosons on a lattice,” arXiv preprint arXiv:1709.05468.

Salazar, Domingos SP (2017), “Nonequilibrium thermodynamics of restricted boltzmann machines,” arXiv preprint arXiv:1704.08724.

Sander, Jörg, Martin Ester, Hans-Peter Kriegel, and Xiaowei Xu (1998), “Density-based clustering in spatial databases: The algorithm gdbscan and its applications,” Data mining and knowledge discovery 2 (2), 169–194.

Saxe, Andrew M, James L McClelland, and Surya Ganguli (2013), “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks,” arXiv preprint arXiv:1312.6120.

Schapire, Robert E, and Yoav Freund (2012), Boosting: Foundations and algorithms (MIT press).

Schindler, Frank, Nicolas Regnault, and Titus Neupert (2017), “Probing many-body localization with neural networks,” Phys. Rev. B 95, 245134.

Schmidhuber, Jürgen (2015), “Deep learning in neural networks: An overview,” Neural networks 61, 85–117.

Schneidman, Elad, Michael J Berry II, Ronen Segev, and William Bialek (2006), “Weak pairwise correlations imply strongly correlated network states in a neural population,” Nature 440 (7087), 1007.

Schoenholz, Samuel S (2017), “Combining machine learning and physics to understand glassy systems,” arXiv preprint arXiv:1709.08015.

Schuld, Maria, Mark Fingerhuth, and Francesco Petruccione (2017), “Implementing a distance-based classifier with a quantum interference circuit,” arXiv preprint arXiv:1703.10793.

Schuld, Maria, and Nathan Killoran (2018), “Quantum machine learning in feature hilbert spaces,” arXiv:1803.07128.

Schuld, Maria, and Francesco Petruccione (2017), “Quantum ensembles of quantum classifiers,” arXiv preprint arXiv:1704.02146.

Schuld, Maria, Ilya Sinayskiy, and Francesco Petruccione (2015), “An introduction to quantum machine learning,” Contemporary Physics 56 (2), 172–185.

Schwab, David J, Ilya Nemenman, and Pankaj Mehta (2014), Physical review letters 113 (6), 068102.

Sethna, James (2006), Statistical mechanics: entropy, order parameters, and complexity, Vol. 14 (Oxford University Press).

Shanahan, Phiala E, Daniel Trewartha, and William Detmold (2018), “Machine learning action parameters in lattice quantum chromodynamics,” arXiv preprint arXiv:1801.05784.

Shannon, Claude E (1949), “Communication theory of secrecy systems,” Bell Labs Technical Journal 28 (4), 656–715.

Shen, Huitao, Junwei Liu, and Liang Fu (2018), “Self-learning monte carlo with deep neural networks,” arXiv preprint arXiv:1801.01127.

Shinjo, Kazuya, Shigetoshi Sota, Seiji Yunoki, and Takami Tohyama (2019), “Characterization of photoexcited states in the half-filled one-dimensional extended hubbard model assisted by machine learning,” arXiv preprint arXiv:1901.07900.

Shlens, Jonathon (2014), “A tutorial on principal component analysis,” arXiv preprint arXiv:1404.1100.

Shwartz-Ziv, Ravid, and Naftali Tishby (2017), “Opening the black box of deep neural networks via information,” arXiv preprint arXiv:1703.00810.

Sidky, Hythem, and Jonathan K Whitmer (2017), “Learning free energy landscapes using artificial neural networks,” arXiv preprint arXiv:1712.02840.

Simonite, Tom (2018), “Should data scientists adhere to a hippocratic oath?” Wired.

Singh, Kesar (1981), “On the asymptotic accuracy of efron's bootstrap,” The Annals of Statistics, 1187–1195.

Slonim, Noam, Gurinder S Atwal, Gasper Tkacik, and William Bialek (2005), “Estimating mutual information and multi-information in large networks,” arXiv preprint cs/0502017.

Sønderby, Casper Kaae, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther (2016), “Ladder variational autoencoders,” in Advances in neural information processing systems, pp. 3738–3746.

Springenberg, Jost Tobias, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller (2014), “Striving for simplicity: The all convolutional net,” arXiv preprint arXiv:1412.6806.

Sriarunothai, Theeraphot, Sabine Wölk, Gouri Shankar Giri, Nicolai Fries, Vedran Dunjko, Hans J Briegel, and Christof Wunderlich (2017), “Speeding-up the decision making of a learning agent using an ion trap quantum processor,” arXiv preprint arXiv:1709.01366.

Srivastava, Nitish, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov (2014), “Dropout: a simple way to prevent neural networks from overfitting,” Journal of machine learning research 15 (1), 1929–1958.

Stoudenmire, E Miles (2018), “Learning relevant features of data with multi-scale tensor networks,” arXiv preprint arXiv:1801.00315.

Stoudenmire, E Miles, and David J Schwab (2016), “Supervised learning with tensor networks,” in Advances in Neural Information Processing Systems, pp. 4799–4807.

Stoudenmire, EM, and Steven R White (2012), “Studying two-dimensional systems with the density matrix renormalization group,” Annu. Rev. Condens. Matter Phys. 3 (1), 111–128.

Stratonovich, RL (1957), “On a method of calculating quantum distribution functions,” in Soviet Physics Doklady, Vol. 2, p. 416.

Strouse, DJ, and David J Schwab (2017), “The deterministic information bottleneck,” Neural computation 29 (6), 1611–1630.

Suchsland, Philippe, and Stefan Wessel (2018), “Parameter diagnostics of phases and phase transition learning by neural networks,” arXiv preprint arXiv:1802.09876.

Sutskever, Ilya, James Martens, George Dahl, and Geoffrey Hinton (2013), “On the importance of initialization and momentum in deep learning,” in International conference on machine learning, pp. 1139–1147.

Sutton, Richard S, and Andrew G Barto (1998), Reinforcement learning: An introduction, Vol. 1 (MIT press Cambridge).

Swaddle, Michael, Lyle Noakes, Liam Salter, Harry Smallbone, and Jingbo Wang (2017), “Generating 3 qubit quantum circuits with neural networks,” arXiv preprint arXiv:1703.10743.

Sweke, Ryan, Markus S Kesselring, Evert PL van Nieuwenburg, and Jens Eisert (2018), “Reinforcement learning decoders for fault-tolerant quantum computation,” arXiv preprint arXiv:1810.07207.

Székely, GJ (2003), “E-statistics: The energy of statistical samples,” Bowling Green State University, Department of Mathematics and Statistics Technical Report 3 (05), 1–18.

Tanaka, Akinori, and Akio Tomiya (2017a), “Detection of phase transition via convolutional neural networks,” Journal of the Physical Society of Japan 86 (6), 063001.

Tanaka, Akinori, and Akio Tomiya (2017b), “Towards reduction of autocorrelation in hmc by machine learning,” arXiv preprint arXiv:1712.03893.

Tenenbaum, Joshua B, Vin De Silva, and John C Langford (2000), “A global geometric framework for nonlinear dimensionality reduction,” Science 290 (5500), 2319–2323.

Tibshirani, Ryan J, et al. (2013), “The lasso problem and uniqueness,” Electronic Journal of Statistics 7, 1456–1490.

Tieleman, Tijmen, and Geoffrey Hinton (2009), “Using fast weights to improve persistent contrastive divergence,” in Proceedings of the 26th Annual International Conference on Machine Learning (ACM) pp. 1033–1040.

Tieleman, Tijmen, and Geoffrey Hinton (2012), “Lecture 6.5 – rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural networks for machine learning 4 (2), 26–31.

Tishby, Naftali, Fernando C Pereira, and William Bialek (2000), “The information bottleneck method,” arXiv preprint physics/0004057.

Tomczak, P, and H. Puszkarski (2018), “Ferromagnetic resonance in thin films studied via cross-validation of numerical solutions of the smit-beljers equation: Application to (ga,mn)as,” Phys. Rev. B 98, 144415.

Torgerson, Warren S (1958), Theory and methods of scaling.

Torlai, Giacomo, Guglielmo Mazzola, Juan Carrasquilla, Matthias Troyer, Roger Melko, and Giuseppe Carleo (2018), “Neural-network quantum state tomography,” Nature Physics 14 (5), 447.

Torlai, Giacomo, and Roger G. Melko (2017), “Neural decoder for topological codes,” Phys. Rev. Lett. 119, 030501.

Tramel, Eric W, Marylou Gabrié, Andre Manoel, Francesco Caltagirone, and Florent Krzakala (2017), “A deterministic and generalized framework for unsupervised learning with restricted boltzmann machines,” arXiv preprint arXiv:1702.03260.

Tubiana, Jérôme, and Rémi Monasson (2017), “Emergence of compositional representations in restricted boltzmann machines,” Physical Review Letters 118 (13), 138301.

Van Der Maaten, Laurens (2014), “Accelerating t-sne using tree-based algorithms,” The Journal of Machine Learning Research 15 (1), 3221–3245.

Venderley, Jordan, Vedika Khemani, and Eun-Ah Kim (2017), “Machine learning out-of-equilibrium phases of matter,” arXiv preprint arXiv:1711.00020.

Vergassola, Massimo, Emmanuel Villermaux, and Boris I Shraiman (2007), Nature 445 (7126), 406.

Vidal, Guifre (2007), “Entanglement renormalization,” Physical review letters 99 (22), 220405.

Wainwright, Martin J, Michael I Jordan, et al. (2008), “Graphical models, exponential families, and variational inference,” Foundations and Trends® in Machine Learning 1 (1–2), 1–305.

Wang, Ce, and Hui Zhai (2017), “Unsupervised learning studies of frustrated classical spin models i: Principle component analysis,” arXiv preprint arXiv:1706.07977.

Wang, Ce, and Hui Zhai (2018), “Machine learning of frustrated classical spin models. ii. kernel principal component analysis,” arXiv preprint arXiv:1803.01205.

Wang, Lei (2017), “Can boltzmann machines discover cluster updates?” arXiv preprint arXiv:1702.08586.

Wang, Yi-Nan, and Zhibai Zhang (2018), “Learning non-higgsable gauge groups in 4d f-theory,” arXiv preprint arXiv:1804.07296.

Wasserman, Larry (2013), All of statistics: a concise course in statistical inference (Springer Science & Business Media).

Wattenberg, Martin, Fernanda Viégas, and Ian Johnson (2016), “How to use t-sne effectively,” Distill 1 (10), e2.

Wei, Qianshi, Roger G Melko, and Jeff ZY Chen (2017), “Identifying polymer states by machine learning,” Physical Review E 95 (3), 032504.

Weigt, Martin, Robert A White, Hendrik Szurmant, James A Hoch, and Terence Hwa (2009), “Identification of direct residue contacts in protein–protein interaction by message passing,” Proceedings of the National Academy of Sciences 106 (1), 67–72.

Weinstein, Steven (2017), “Learning the einstein-podolsky-rosen correlations on a restricted boltzmann machine,” arXiv preprint arXiv:1707.03114.

Wetzel, Sebastian Johann (2017), “Unsupervised learning of phase transitions: from principle component analysis to variational autoencoders,” arXiv preprint arXiv:1703.02435.

Wetzel, Sebastian Johann, and Manuel Scherzer (2017), “Machine learning of explicit order parameters: From the ising model to SU(2) lattice gauge theory,” arXiv preprint arXiv:1705.05582.

White, Steven R (1992), “Density matrix formulation for quantum renormalization groups,” Physical review letters 69 (19), 2863.

White, Tom (2016), “Sampling generative networks: Notes on a few effective techniques,” arXiv preprint arXiv:1609.04468.

Williams, DRGHR, and Geoffrey Hinton (1986), “Learning representations by back-propagating errors,” Nature 323 (6088), 533–538.

Wilson, Ashia C, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht (2017), “The marginal value of adaptive gradient methods in machine learning,” arXiv preprint arXiv:1705.08292.

Wilson, Kenneth G, and John Kogut (1974), “The renormalization group and the epsilon expansion,” Physics Reports 12 (2), 75–199.

Witte, RS, and J.S. Witte (2013), Statistics (Wiley).

Wu, Yadong, Pengfei Zhang, Huitao Shen, and Hui Zhai (2018), “Visualizing neural network developing perturbation theory,” arXiv preprint arXiv:1802.03930.

Xie, Junyuan, Ross Girshick, and Ali Farhadi (2016), “Unsupervised deep embedding for clustering analysis,” in International conference on machine learning, pp. 478–487.

Yang, Tynia, Jinze Liu, Leonard McMillan, and Wei Wang (2006), “A fast approximation to multidimensional scaling,” in IEEE workshop on Computation Intensive Methods for Computer Vision.

Yang, Xu-Chen, Man-Hong Yung, and Xin Wang (2017), “Neural network designed pulse sequences for robust control of singlet-triplet qubits,” arXiv preprint arXiv:1708.00238.

Yedidia, Jonathan (2001), “An idiosyncratic journey beyond mean field theory,” Advanced mean field methods: Theory and practice, 21–36.

Yedidia, Jonathan S, William T Freeman, and Yair Weiss (2003), “Understanding belief propagation and its generalizations,” Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

Yoshioka, Nobuyuki, Yutaka Akagi, and Hosho Katsura (2017), “Learning disordered topological phases by statistical recovery of symmetry,” arXiv preprint arXiv:1709.05790.

You, Yi-Zhuang, Zhao Yang, and Xiao-Liang Qi (2017), “Machine learning spatial geometry from entanglement features,” arXiv preprint arXiv:1709.01223.

Yu, Chao-Hua, Fei Gao, and Qiao-Yan Wen (2017), “Quantum algorithms for ridge regression,” arXiv preprint arXiv:1707.09524.

Zdeborová, Lenka, and Florent Krzakala (2016), “Statistical physics of inference: Thresholds and algorithms,” Advances in Physics 65 (5), 453–552.

Zeiler, Matthew D (2012), “Adadelta: an adaptive learning rate method,” arXiv preprint arXiv:1212.5701.

Zhang, Chengxian, and Xin Wang (2018), “Spin-qubit noise spectroscopy from randomized benchmarking by supervised learning,” arXiv preprint arXiv:1810.07914.

Zhang, Chiyuan, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals (2016), “Understanding deep learning requires rethinking generalization,” arXiv preprint arXiv:1611.03530.

Zhang, Pengfei, Huitao Shen, and Hui Zhai (2017a), “Machine learning topological invariants with neural networks,” arXiv preprint arXiv:1708.09401.

Zhang, Xiao-Ming, Zi-Wei Cui, Xin Wang, and Man-Hong Yung (2018), “Automatic spin-chain learning to explore the quantum speed limit,” arXiv preprint arXiv:1802.09248.

Zhang, Yi, Roger G Melko, and Eun-Ah Kim (2017b), “Machine learning Z2 quantum spin liquids with quasi-particle statistics,” arXiv preprint arXiv:1705.01947.

Zimek, Arthur, Erich Schubert, and Hans-Peter Kriegel (2012), “A survey on unsupervised outlier detection in high-dimensional numerical data,” Statistical Analysis and Data Mining: The ASA Data Science Journal 5 (5), 363–387.

Zou, Hui, and Trevor Hastie (2005), “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2), 301–320.