
UNIFYING KERNEL METHODS AND NEURAL NETWORKS AND MODULARIZING DEEP LEARNING

By

SHIYU DUAN

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT

OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2020

© 2020 Shiyu Duan

To my family

ACKNOWLEDGMENTS

When I started my first term of graduate school as a master’s student in electrical and

computer engineering at the University of Florida (UF), I knew little about machine learning and

had an undergraduate GPA of 2.99/4. My advisor, Dr. Jose Príncipe, discovered me in his

classroom, miraculously saw the potential in me, and took me into his lab despite the many

stupid questions I pestered him with during and after classes. He gave me patience, trust, and

the freedom to grow — things that not a lot of Ph.D. students are lucky enough to get from

their advisors. He always asked the right questions that pointed me in the right direction. And

more importantly, he also asked the tough questions that pushed me forward. I did not even

believe in myself in the beginning, but my advisor made this possible. My gratitude toward him

is beyond words.

I am grateful to all that I’ve met along the way. But there are a few that I am particularly

thankful to. Dr. Shujian Yu has been a collaborator and a role model. His contributions made

this dissertation possible, and he greatly affected how I approach research through setting a

positive example himself. Dr. Eder Santana provided much guidance when I started in the

lab. And his passion for building wonderful things sparked my own. Spencer Chang proofread

this dissertation, for which I am very grateful. Dr. Luis Gonzalo Sanchez Giraldo, Dr. Robert

Jenssen, and Dr. Pingping Zhu's time in the lab did not overlap with mine. But I was lucky

enough to know them in person and I’ve constantly looked up to them as role models. They’ve

also deeply inspired me through the excellence of their works. Outside of my lab, I have had

the pleasure to work with some extraordinary researchers during internships, including Dr.

Huaijin Chen, Dr. Jun Jiang, Dr. Jinwei Gu, Dr. Hao Pan, and Xiaohua Yang. The projects

we worked on are not directly related to this dissertation, but their research methodologies,

philosophies, and passions all have had a tremendous impact on me.

There are a few faculty members from UF that have directly helped with or indirectly

inspired my work. Dr. Yunmei Chen and Dr. Murali Rao have provided continuous guidance

and support for my research. Their brilliance and humility have helped me become a better


researcher as well as a better person. I’ve also met some of the best teachers in my life here at

UF, who introduced me, in their own ingenious ways, to the magnificence of mathematics and

statistics, the two fields that I have found the most inspiration from and am deeply fascinated

by. They are Dr. Scott McCullough, Dr. Kshitij Khare, Dr. Paul Robinson, Dr. Malay Ghosh,

Dr. Brett Presnell, Dr. William Hager, and Dr. James Hobert. Last but definitely not the

least, I would like to thank my committee members, Dr. Alina Zare, Dr. Kshitij Khare, Dr.

Sean Meyn, and Dr. Yunmei Chen, for the many helpful discussions and key insights that made

this work possible.

I think pursuing a Ph.D. is a privilege and a somewhat selfish thing to do, especially for

those of us who chose to do it somewhere far away from home and from the ones we love.

Despite being at an age where most of our peers have to take responsibility for various other

concrete things in life, we isolate ourselves in a vacuum environment, laser-focused on our

research and thoroughly enjoying the good times as well as the bad times with little regard

for what happens outside of our little bubble. A lot of us glorify this action as advancing the

knowledge of the entire human race, something that is for the greater good and something

that will eventually benefit us and our family in a tangible way. While I definitely hope that

this is true, I cannot help but feel apologetic toward my loved ones for taking four years of my

companionship away from them and for avoiding for so long responsibilities that should have

been mine. I thank you all for your understanding and always being immensely supportive of

anything that I chose to do. This dissertation is your work as much as it is mine.


TABLE OF CONTENTS

page

ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

CHAPTER

1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

1.1 Kernel Networks: Connectionist Models Based On Kernel Machines . . . . . 13
1.2 Neural Networks Are Kernel Networks . . . . . 14
1.3 Modularizing Deep Architecture Training With Provable Optimality . . . . . 16
1.4 Contributions . . . . . 19
1.5 Dissertation Structure . . . . . 19

2 RELATED WORK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.1 Kernel Method-Based Connectionist Models . . . . . 21
2.2 Connections Between Deep Learning and Kernel Method . . . . . 23

2.2.1 Exact Equivalences . . . . . 23
2.2.2 Equivalences in Infinite Widths and/or in Expectation . . . . . 24

2.3 Modular Learning of Deep Architectures . . . . . . . . . . . . . . . . . . . . 25

3 MATHEMATICAL PRELIMINARIES . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.1 Notations . . . . . 28
3.2 Kernel Method in Machine Learning . . . . . 28

3.2.1 A Primer on Kernel Method . . . . . 29
3.2.2 The “Kernel Trick” . . . . . 30

3.2.2.1 Kernel machines: linear models on nonlinear features . . . . . 31
3.2.2.2 Kernel functions as similarity measures . . . . . 33

4 KERNEL NETWORKS: DEEP ARCHITECTURES POWERED BY KERNEL MACHINES 34

4.1 A Recipe for Building Kernel Networks . . . . . 34
4.2 Why Kernel Networks? . . . . . 36
4.3 Robustness in Choice of Kernel . . . . . 36
4.4 An Example: Kernel MLP . . . . . 37
4.5 Experiments: Comparing KNs with Classical Kernel Machines . . . . . 40

5 NEURAL NETWORKS ARE KERNEL NETWORKS . . . . . . . . . . . . . . . . . 43

5.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43


5.2 Revealing the Disguised Kernel Machines in Neural Networks . . . . . 43
5.2.1 Fully-Connected Neural Networks . . . . . 44
5.2.2 Convolutional Neural Networks . . . . . 47
5.2.3 Recurrent Neural Networks . . . . . 48
5.2.4 Modules: Combinations of Sets of Layers . . . . . 49
5.2.5 Add-Ons . . . . . 50

5.2.5.1 Batch normalization . . . . . 50
5.2.5.2 Pooling and padding layers . . . . . 50
5.2.5.3 Residual connection . . . . . 51

5.3 Strength in Numbers: Universality Through Tractable Kernels . . . . . 52
5.4 Neural Operator Design Is a Way to Encode Prior Knowledge Into Kernel Machine . . . . . 52

6 A PROVABLY OPTIMAL MODULAR LEARNING FRAMEWORK . . . . . . . . . 54

6.1 The Modular Learning Methodology . . . . . 56
6.1.1 The Setting, Goal, and Idea . . . . . 56
6.1.2 The Main Theoretical Result . . . . . 57
6.1.3 Applicability of the Main Result . . . . . 59

6.1.3.1 Network architecture . . . . . 59
6.1.3.2 Objective function . . . . . 60

6.1.4 From Theory to Algorithm . . . . . 61
6.1.4.1 Geometric interpretation of learning dynamics . . . . . 64
6.1.4.2 Accelerating the approximated kernel network layers . . . . . 64

6.2 A Method for Module Reusability and Task Transferability Estimation . . . . . 65

7 MODULAR LEARNING: EXPERIMENTS . . . . . . . . . . . . . . . . . . . . . . . 70

7.1 Sanity Checks . . . . . 71
7.1.1 Sanity Check: Modular Training Results in Identical Learning Dynamics As End-to-End . . . . . 71
7.1.2 Sanity Check: Proxy Objectives Align Well With Accuracy . . . . . 72

7.2 Modular Learning: Simple Network Backbones With Classical Kernels . . . . . 74
7.2.1 Fully Layer-Wise kMLPs . . . . . 74
7.2.2 The LeNet-5 . . . . . 80

7.3 Modular Learning: State-of-the-Art Network Backbones With NN-Inspired Kernels . . . . . 82
7.3.1 Accuracy on MNIST and CIFAR-10 . . . . . 82
7.3.2 Label Efficiency of Modular Deep Learning . . . . . 83
7.3.3 Transferability Estimation With Proxy Objective . . . . . 87
7.3.4 Architecture Selection With Proxy Objective . . . . . 88

8 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

APPENDIX

A PROOF OF PROPOSITION 4.2 & 4.3 . . . . . . . . . . . . . . . . . . . . . . . . 95


B PROOF OF THEOREM 6.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

C ADDITIONAL TRANSFERABILITY ESTIMATION PLOTS . . . . . . . . . . . . . 109

REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120


LIST OF TABLES

Table page

4-1 Kernel networks vs. classical kernel machines as well as kernel machines boosted with multiple kernel learning techniques. (part 1) . . . . . 42

4-2 Kernel networks vs. classical kernel machines as well as kernel machines boosted with multiple kernel learning techniques. (part 2) . . . . . 42

6-1 Comparisons between transferability estimations methods. . . . . . . . . . . . . . . 69

7-1 Testing modular training and acceleration method on kMLPs with classical kernels (Gaussian) for MNIST. (part 1) . . . . . 77

7-2 Testing modular training and acceleration method on kMLPs with classical kernels (Gaussian) for MNIST. (part 2) . . . . . 77

7-3 Comparing layer-wise kMLPs with classical kernels (Gaussian) against other deep architectures. (part 1) . . . . . 78

7-4 Comparing layer-wise kMLPs with classical kernels (Gaussian) against other deep architectures. (part 2) . . . . . 79

7-5 Testing modular learning on a simple LeNet-5 with classical kernels (Gaussian). . . . 81

7-6 Modular learning on LeNet-5 with NN-inspired kernels for MNIST. . . . . . . . . . 83

7-7 Modular learning on ResNets with NN-inspired kernels for CIFAR-10. . . . . . . . . 84


LIST OF FIGURES

Figure page

5-1 Revealing the hidden kernel machines in fully-connected neural networks. . . . . . . 44

5-2 Revealing the hidden kernel machines in convolutional neural networks. . . . . . . . 46

6-1 Illustrating the proposed modular training framework. . . . . . . . . . . . . . . . . 63

7-1 Learning dynamics of modular and end-to-end training agree with each other. . . . . 73

7-2 Overall accuracy is positively correlated with proxy value, validating the optimality of our modular learning method. . . . . . 74

7-3 Data examples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

7-4 Geometrically interpreting the modular learning dynamics in a two-hidden-layer kMLP. 79

7-5 The representations learned by our modular learning method are more disentangled. . . . . . 81

7-6 Label efficiency of our modular learning method. . . . . . . . . . . . . . . . . . . . 85

7-7 Transferability estimation results. . . . . . 88

7-8 Architecture selection results: Setting 1 . . . . . . . . . . . . . . . . . . . . . . . . 91

7-9 Architecture selection results: Setting 2 . . . . . . . . . . . . . . . . . . . . . . . . 92

C-1 Additional transferability estimation results. . . . . . . . . . . . . . . . . . . . . . . 110

C-2 Additional transferability estimation results. . . . . . . . . . . . . . . . . . . . . . . 111


Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

UNIFYING KERNEL METHODS AND NEURAL NETWORKS AND MODULARIZING DEEP LEARNING

By

Shiyu Duan

December 2020

Chair: Jose C. Príncipe
Major: Electrical and Computer Engineering

We study three important problems at the intersection of kernel methods and deep

learning.

Compared to deep neural networks (NNs), classical kernel machines lack a connectionist

nature and therefore cannot learn hierarchical, distributed representations. This has been

considered the key reason behind the suboptimal performance of kernel machines in

cutting-edge machine learning applications. The first problem we study is therefore how to

combine classical kernel machines with connectionism, creating new model families that are at

the same time as performant as NNs and as analyzable as kernel machines.

Understanding the connections between kernel methods in machine learning and deep

NNs in order to discover novel theoretical insights as well as powerful algorithms has been a

long-sought goal. Existing works in this regard established connections that rely on nontrivial

assumptions such as infinite network widths. Thus, the second problem we study is how to

create links between kernel methods and deep NNs that are direct and work for practical

network architectures without the need for unrealistic assumptions. This sheds new light on the

study of kernel methods as well as deep learning.

For a long time now, deep learning has been tied to end-to-end optimization. As a result,

practitioners cannot resort to divide-and-conquer strategies when developing deep learning

pipelines. This significantly complicates the process and rules out the adaptation of many

established best practices for fast up-scaling in engineering, e.g., regression testing, module reuse,


and so on. Therefore, the third problem we study is how to reliably train deep architectures

in a completely modular fashion by borrowing theoretical tools from kernel methods. This will

enable modular deep learning workflows, which, as we have argued, have significant practical

implications for deep learning engineering.


CHAPTER 1
INTRODUCTION

This work begins by extending kernel machines to connectionist models. These new

deep architectures, dubbed “kernel networks” (KNs), are analogs to neural networks (NNs)

powered by the more mathematically tractable kernel machines instead of artificial neurons.

We then proceed to show that by taking an alternative view on NN architectures, there exists

a single abstraction that captures both NNs and KNs. Specifically, we show that NNs can be

interpreted as KNs. Finally, based on these constructions, we end by presenting a theoretical

framework that modularizes the training of deep architectures. This framework is provably

optimal in many common situations, yet much more agile than the existing end-to-end solution

for training.

1.1 Kernel Networks: Connectionist Models Based On Kernel Machines

Connectionist models in machine learning are those that attempt to carry out computations

in a way that vaguely resembles a model of the human brain (Buckner & Garson, 2019).

These artificial neural networks (ANNs) are essentially sets of artificial neurons connected

in somewhat arbitrary ways. The most popular artificial neuron model can be described as a

function: fn(x) = ϕn(wn⊤x + bn), with x in some Euclidean space, wn, bn some trainable “weights”, and ϕn : R → R a nonlinear mapping (Rosenblatt, 1957). These base functions can

then be composed or concatenated, forming the modern ANNs.

Kernel machines, i.e., functions of the form fk(x) = ⟨wk,ϕk(x)⟩H + bk, with H being

an inner product space over the real line, x in some Euclidean space Rd, ϕk : Rd → H a

nonlinear mapping, wk, bk the trainable weights, have long been considered one of the most

representative “non-connectionist” models. They are among the most popular instantiations of

the broader family of methods in machine learning dubbed “kernel methods”, that is, methods

that use a positive definite kernel function k and/or an identity called the “kernel trick”: For

certain kernel functions k : X × X → R and ϕ : X → H, H an inner product space, one

may establish k(u,v) = ⟨ϕ(u),ϕ(v)⟩H ,∀u,v (Shalev-Shwartz & Ben-David, 2014). Other


members of the kernel methods family include, for example, Gaussian processes (Williams &

Rasmussen, 2006).

Despite their solid theoretical foundation and wild popularity in the early 2000s thanks

to the then highly performant support vector machines (SVMs) (Vapnik, 2000b), kernel

machines have been largely eclipsed by deep NNs (DNNs) in today’s machine learning

landscape especially in domains where large-scale training data is available. Many attribute the

underwhelming performance of kernel machines compared to DNNs to the fact that the former

cannot learn hierarchical, distributed representations as connectionist models would (Hinton,

2007).

We present a recipe that extends the kernel machines to form connectionist models.

The idea is that one can build connectionist models out of kernel machines in the same

way that one builds them out of artificial neurons since they can be abstracted as functions

with identical domains and co-domains. We call these connectionist models built from

kernel machines kernel networks (KNs). In fact, it is easy to see that for any NN, there is an

equivalent KN sharing exactly the same architecture in the sense that, although the base units are

different, the patterns of connections among these units are identical.

For the field of kernel methods in machine learning, this work expands the existing family

of models. On the other hand, for the field of deep learning (DL), our kernel machine-based

networks are as performant as, but more analyzable than, the existing neural networks, thanks to

the mathematical tractability of the kernel machine.

1.2 Neural Networks Are Kernel Networks

The effort to understand the connections between kernel methods and DL dates back to at least the mid-1990s (Neal, 1995). Recently, this topic has gained renewed interest

due to several key observations (Lee et al., 2017; Jacot et al., 2018; Shankar et al., 2020; Cho

& Saul, 2009; Arora et al., 2019b,a; Li et al., 2019). Namely, it has been established that

feedforward DNNs can be equated to kernel methods in certain situations. The established

connections, however, require highly nontrivial assumptions: The equivalence between a


particular kernel method, such as Gaussian process or kernel machine, and a family of NNs

only exists in the limit of the NN layer widths tending to infinity and having been trained with

a simple gradient descent scheme for infinitely long and/or in expectation of random NNs.

In the first case, these networks cannot possibly be implemented and actually underperform

their finitely-wide counterparts trained with a finite amount of time. In the latter case, the

networks are not even fully trainable (sometimes referred to as weakly-trained (Arora et al.,

2019a)). Moreover, these works sometimes propose kernels inspired by NNs and instantiate

kernel methods with these kernels. However, these algorithms, like their counterparts using

traditional kernels, typically have prohibitively high computational complexity (super-quadratic

in sample size, to be exact (Arora et al., 2019b)).

Contrasting existing works, we establish a strong connection between fully-trainable,

finitely-wide NNs and the KNs mentioned in Sec. 1.1. Specifically, we show that NNs are KNs,

without any limiting assumption.

The idea is that, as opposed to the common perception, where the elementwise

nonlinearity is considered to be the last component of an NN layer or module (combination of

layers), we view it as the first component of the immediate downstream node(s)1 . This way,

each node can be identified as a kernel machine, as defined in Sec. 1.1, with the kernel defined

by the NN nonlinearity.
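To make this regrouping concrete, below is a minimal NumPy sketch (ours, not the formal construction developed in Chapter 5; the tanh nonlinearity and all layer sizes are arbitrary illustrative choices). It checks that attaching the nonlinearity to the downstream layer leaves the computation unchanged, while letting each output unit be read as an inner product against a feature map given by that nonlinearity.

```python
import numpy as np

# Toy check: regrouping the elementwise nonlinearity with the *downstream*
# layer leaves a two-layer computation unchanged, while each output unit can
# now be read as <w, phi(z)> + b with the feature map phi given by the
# nonlinearity. tanh and all sizes are illustrative choices.

rng = np.random.default_rng(0)
phi = np.tanh

W1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)   # first layer
W2, b2 = rng.normal(size=(4, 16)), rng.normal(size=4)    # second layer
x = rng.normal(size=8)

# Conventional grouping: the nonlinearity closes the first layer.
y_standard = W2 @ phi(W1 @ x + b1) + b2

# Regrouped view: the j-th unit of the second layer acts on the
# pre-activation z = W1 x + b1 and applies phi first.
z = W1 @ x + b1
y_regrouped = np.array([W2[j] @ phi(z) + b2[j] for j in range(4)])

assert np.allclose(y_standard, y_regrouped)
```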

Our construction is advantageous compared to the existing ones mainly in the following

regards. First, we establish exact equivalence on a model level that is agnostic to training,

whereas many existing works assume simple but infinitely-long training on the NNs (which is

not necessarily ideal for performance). Second, we consider fully-trainable and finitely-wide

NNs, which are much more practical than the weakly-trainable, infinitely-wide counterparts

considered by existing works. Further, the proposed construction works for NNs of all

1 We use the word “node” to refer to a base unit in a network with its parametric form unspecified. It can be a neuron or a kernel machine.


types with minimal adjustment, contrasting existing works where only feedforward NNs are

considered and significant modifications have to be made when extending from fully-connected

models to convolutional ones. Finally, our NN-equivalent KNs run in linear time instead of

super-quadratic, as existing NN-inspired kernel methods do.

1.3 Modularizing Deep Architecture Training With Provable Optimality

While the resurgence of DL (Krizhevsky et al., 2012) has enabled countless powerful yet

conceptually simple predictive models in various machine learning applications, its end-to-end

nature is forcing practitioners to abandon one of the most useful concepts in engineering:

modularization. When building an NN, large or small, the user is constrained to designing

and optimizing the entire model as a whole instead of taking a modular approach as in

other disciplines of engineering, namely dividing it into components, configuring each of the

components, and wiring them together to form the model.

The current end-to-end approach to DL has tremendously increased the complexity

of building a state-of-the-art model. Indeed, when implementing and training a DNN, it is

extremely difficult to debug unsatisfying performance without tearing down the entire model

and retraining from scratch. Tracing the source of the problem to one or several particular

layers and fixing it directly from there is virtually impossible. This also means that when

designing a new model, the user has to navigate the hyperparameter space consisting of all

hyperparameters of all trainable components. For any reasonably-sized model, this translates

to hundreds of hyperparameters to be tuned simultaneously, making it practically impossible to

find the optimal combination. In fact, a typical DL work nowadays would start off from one of

the few iconic model designs and simply follow most of the original hyperparameter selections

even though there likely exists a better backbone and set of hyperparameters for the particular

task being tackled in this said work. Moreover, part or parts of a trained model cannot be

easily reused across tasks, which means that days of hyperparameter tuning and training would

be wasted if one wants to deploy the same model on a different dataset. Transfer learning

mitigates the issue, but gives no rigorous performance guarantee. Overall, the design and


training process of a state-of-the-art DL model has become so elusive that some are calling it

the modern-day alchemy (Synced, 2018).

Why are we not modularizing DL? Specifically, how can we train a feedforward multi-layer

network in a modular, sequential fashion? In other words, we would like to proceed from the

input layer to the output, greedily train a stack of layers as a module, freeze it afterwards, then

repeat with downstream layers without fine-tuning the trained modules. The difficulty with this

approach in supervised learning is that there is no explicit supervision for the latent modules.

Indeed, such supervision is only present in the output layer and can only be propagated to the

hidden modules via gradient information that “flows through” all modules, forcing one to train

the entire model as a whole.

In this work, we propose a novel modular training approach. Using a two-module network

F2 ∘ F1 as an example, where F1 is an input module and F2 an output module, suppose we are

given a loss function L and a set of training data S. We first identify the set of optimal input

modules F⋆1 to be the input modules of all minimizers of L. Further, we distinguish between

the set of trainable parameters of F2, denoted θ2 (e.g., the layer weights), and the non-trainable ones, denoted ω2 (e.g., the type of nonlinearity used).

The key idea is that if we could find a proxy hidden objective L1 that is a function of only F1,

ω2, and S with the property

argmin_{F1} L1(F1, ω2, S) ⊂ F⋆1, (1-1)

then we would be able to use this loss as the explicit supervision for F1 and decouple the

training of the two modules: We may first train F1 to minimize L1 and freeze it afterwards at,

say, F′1, then train F2 ∘ F′1 to minimize L. Due to the construction of L1, the resulting solution

we get from wiring together the two trained modules would be as good as if we had trained

them simultaneously to minimize L.

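As a rough illustration of this decoupling, the following PyTorch-style sketch (ours, schematic only, and not the dissertation's algorithm) trains the input module against a stand-in proxy objective and only afterwards trains the output module on top of the frozen input module with the task loss. The names proxy_loss, task_loss, and loader are placeholders; the actual proxy objective is the one constructed in Chapter 6.

```python
import torch

# Schematic two-stage modular training as described above. `proxy_loss`
# stands in for the proxy objective L1 (defined in Chapter 6), `task_loss`
# for the original loss L; both names, and `loader`, are placeholders.

def train_modular(F1, F2, proxy_loss, task_loss, loader, epochs=10):
    # Stage 1: train the input module F1 against the proxy objective alone.
    opt1 = torch.optim.Adam(F1.parameters())
    for _ in range(epochs):
        for x, y in loader:
            opt1.zero_grad()
            proxy_loss(F1(x), y).backward()   # no signal from F2 is needed
            opt1.step()

    # Freeze F1 at its solution F1'.
    for p in F1.parameters():
        p.requires_grad_(False)

    # Stage 2: train the output module F2 on top of the frozen F1' with the
    # task loss L, e.g., cross-entropy in classification.
    opt2 = torch.optim.Adam(F2.parameters())
    for _ in range(epochs):
        for x, y in loader:
            opt2.zero_grad()
            task_loss(F2(F1(x)), y).backward()
            opt2.step()
    return F1, F2
```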


The main result of this part of the dissertation is that if F2 admits a kernel machine-like

representation, then in classification and for the commonly-used loss functions, such a proxy L1

can be found and is simple to use.

An overview can be given using a two-player game as an analogy. Player 1, i.e., F1,

transforms S to a new representation S ′, whereas player 2, i.e., F2, seeks to achieve optimal

performance in the given task using S ′. The objective of player 1 is to find a transformation

that maximizes player 2’s performance. The conventional view is that player 1 needs full

information on player 2, that is, both ω2 and θ2, to be able to produce the optimal solution

in this set-up. This work demonstrates that under mild conditions on player 2, player 1

can achieve the optimal solution by having access to (1) partial information on player 2

(specifically, only ω2), and (2) pairwise summary of examples in S ′. In other words, pairwise

information on S ′, typically overlooked in the existing end-to-end scheme of deep learning, is

sufficient to compensate for any missing information on θ2.

To showcase one of the main benefits of modularization — module reuse with confidence

— we demonstrate that one can easily and reliably quantify the reusability of a pre-trained

module on a new target task with our proxy objective function, providing a fast yet effective

solution to an important practical issue in transfer learning. Moreover, this method can be

extended to measure task transferability, a central problem in transfer learning, continual/lifelong

learning, and multi-task learning (Tran et al., 2019). Unlike many existing methods,

our approach requires no training. Moreover, it is task-agnostic, flexible, and completely

data-driven. Nevertheless, in our experiments, it accurately described the task space structure

on binary classification tasks derived from CIFAR-10 using only a small amount of labeled data.

As another example demonstrating the practical benefits of modular workflows, we show that

accurate network architecture search can be performed in polynomial time (linear in depth)

using components from our modular learning framework, contrasting how a naive approach

would take exponential time.


Our modular learning framework utilizes labels in a way that drastically differs from

and is more efficient than how labels are typically used in the existing end-to-end paradigm.

Specifically, training of the latent modules requires only pairwise label information on pairs of

examples in the form of whether or not they belong to the same class. The full label of each

individual example is not needed. Neither does the algorithm need to know the relationship

among all examples simultaneously. In contrast, backpropagation requires full label information

on all examples. We then empirically show that the output module, which indeed requires

full supervision for training, is highly label-efficient, achieving state-of-the-art accuracy on

benchmarking datasets such as CIFAR-10 with as few as a single randomly-selected labeled

example from each class. Overall, our modular training requires a different, weaker form

of supervision than the existing end-to-end method yet still produces models that are as

performant. This indicates that the existing form of supervision used in backpropagation (full

labels on individual data examples), which drives nearly all fully supervised and semi-supervised

learning paradigms, is not efficient enough. This observation potentially enables less expensive

label acquisition pipelines and more efficient un/semi-supervised learning algorithms.

1.4 Contributions

This dissertation makes the following contributions:

1. We detail a recipe for building connectionist models with kernel machines, and call these

models “kernel networks”.

2. We show that neural networks can in fact be viewed as kernel networks, thus providing a

unified perspective on kernel method and deep learning.

3. We propose a theoretical framework for modularizing the training of deep architectures

with provable optimality, which can serve as the foundation for future deep learning

workflows with enhanced analyzability, reusability, and interpretability.

1.5 Dissertation Structure

The rest of this dissertation is organized as follows. In Chapter 2, we review related

work in the literature. Chapter 3 contains the mathematical preliminaries necessary for our


constructions. We then proceed to present our recipe for building KNs in detail in Chapter 4.

Next, we show that NNs are in fact special cases of KNs in Chapter 5. Finally, the modular

learning framework is described in Chapter 6. Experiments for modular learning are presented

in Chapter 7. Chapter 8 concludes the main text.


CHAPTER 2
RELATED WORK

2.1 Kernel Method-Based Connectionist Models

Arguably, the four most widely-adopted members of the kernel method family in machine

learning are SVM (Vapnik, 2000b) in classification, Gaussian process (Williams & Rasmussen,

2006) in regression, kernel adaptive filter (KAF) (Liu et al., 2011) in temporal filtering, and

RBF network (Broomhead & Lowe, 1988) as a general-purpose function approximator1 .

SVM and KAF can both be viewed as groups of carefully-designed training algorithms

combined with underlying models that are kernel machines as defined in this work (Vapnik,

2000b; Liu et al., 2011). The kernel trick was used here to enable linear inference in

potentially nonlinear feature spaces, boosting the capacity of the algorithms without losing

the mathematical tractability. The classical training algorithm of SVM seeks to find the

separating hyperplane in a feature space (induced by the kernel used) with the largest margin,

which has been shown to produce the hyperplane with the minimum capacity, guaranteeing

best generalization among all separating hyperplanes (Vapnik, 2000b). The hyperplane solution

can be obtained via constrained optimization methods. For KAF, one of the most popular

members in the family, the kernel least-mean-square (KLMS) algorithm (Liu et al., 2008),

inherits its training algorithm from the famed least-mean-squares filter, which can be viewed as

an online version of the Wiener optimal solution. KLMS works in an online set-up and updates

the filter weights with a closed-form update that can be shown to converge to the optimal

solution (in the mean-squared-error sense) given stationary signal.

RBF networks can be understood as special cases of kernel machines using radial basis

functions as kernels expanded using the kernel trick. They can be used as general-purpose

function approximators, just as any kernel machine. And they have been shown to possess the

1 We are aware that these methods can be configured for other purposes, e.g., that Gaussian process can be extended to classification. Here, however, we focus on their most popular usages in the literature.


universal approximation capability when the number of centers used is unbounded (Park &

Sandberg, 1991). The kernel trick is used to enable linear inference on nonlinear features, as in

the case of SVM and KAF.

Gaussian process is used for regression and achieves predictions on unknown test points

by sampling from a posterior normal distribution modeled using training data (Williams &

Rasmussen, 2006). The kernel is used as a similarity measure to construct the covariance

matrix of the distribution.

Note that none of these methods are connectionist models per se.

Kernel machines have been extended to connectionist models. Perhaps one of the earliest

attempts in this direction is (Zhuang et al., 2011), where individual kernel machines are

concatenated and composed to form architectures similar to a two-layer multi-layer perceptron

(MLP). This work focused on multiple kernel learning (MKL). As a further generalization,

(Zhang et al., 2017) proposed special cases of KNs that are equivalent to MLPs and CNNs.

These works share the same idea as ours, where connectionist models with kernel machines

as the base building blocks are proposed. Among works of this kind, ours enjoys the greatest

generality in the sense that we present a generic recipe that works for any network architecture.

Apart from efforts in extending kernel machines to connectionist models, there are other

attempts combining connectionism with other members of the kernel method family. (Suykens,

2017) created restricted Boltzmann machines (RBM)-like representations for kernel machines.

The resulting restricted kernel machines (RKMs) are then composed to build deep RKMs. For

Gaussian process, (Wilson et al., 2016) proposed to learn the covariance kernel matrix with

NN in an attempt to make the kernel “adaptive”. This idea also underlies the now standard

approach of learning features with NN for an SVM to classify, which was discussed in detail

by, e.g., (Huang & LeCun, 2006; Tang, 2013). This approach can be viewed as building

neural-classical kernel hybrid networks. (Mairal et al., 2014) proposed to learn hierarchical

representations by learning to approximate kernel feature maps on training data in an attempt

to capture features that are invariant to irrelevant variations in images.


2.2 Connections Between Deep Learning and Kernel Method

The links between deep learning and the kernel method have long been known. Some works

establish connections via exactly matching one architecture to the other, to which, evidently, all

works in Sec. 2.1 belong since they attempt to propose kernel method-based models that are

deep architectures themselves. Others establish links between deep learning and kernel method

from a probabilistic perspective by, for example, studying large-sample behavior in the limit of

infinite layer widths and/or in expectation over random network parameters.

Arguably the most practical results produced by existing works on connecting NNs with

KMs are the simpler training schemes to obtain useful NNs (Neal, 1995; Lee et al., 2017; Arora

et al., 2019a). The paradigm is usually that certain kernels are identified to be equivalent to

NNs in the infinite widths limit and/or in expectation. Then these kernels are plugged into

models that typically do not require iterative training, such as kernel regression (Arora et al.,

2019a) or Gaussian process (Lee et al., 2017). The performance of the resulting KMs reflects

that of unrealistic NNs (usually in the sense that they are only partially trainable or infinitely

wide) and is empirically much inferior at least on vision benchmarking datasets when compared

to the NNs commonly used in practice.

2.2.1 Exact Equivalences

The exact equivalence between kernel machines and certain shallow NNs has been

established. (Vapnik, 2000a) defined kernels that mimic single-hidden-layer MLPs. The

resulting KMs bear the same mathematical formulations as the corresponding NNs with

the constraint that the input layer weights of these NNs are fixed. (Suykens & Vandewalle,

1999) modified these kernels to allow the kernel machines to be interpreted as fully-trainable

single-hidden-layer MLPs. Their construction can be viewed as a special case of ours. They

did not, however, point out the connections between MLPs and kernel machines. Instead, their

work focused on an alternative approach to train shallow MLPs in classification. Specifically,

the input and output layers were trained alternately, with the former learning to minimize the


VC dimension (Vapnik & Chervonenkis, 1971) of the latter while the latter learned to classify. An optimality guarantee for the training was hinted at.

2.2.2 Equivalences in Infinite Widths and/or in Expectation

That single-hidden-layer MLPs are Gaussian processes in the infinite width limit and in

expectation over a random input layer has been known at least since (Neal, 1995). (Lee et al.,

2017) generalized the classic result to deeper MLPs. (Cho & Saul, 2009) defined a family

of “arc-cosine” kernels to imitate the computations performed by infinitely-wide networks

in expectation. (Shankar et al., 2020) proposed kernels that are equivalent to expectations

of finite-width random networks. Note that the above works equate Gaussian process to

NNs that are not fully-trainable. Indeed, equivalence was only established in expectation of

random network weights, limiting the capability of the resulting models in practice. (Hermans

& Schrauwen, 2012) used the kernel method to expand the echo state networks to essentially

infinite-sized recurrent neural networks. The resulting network can then be viewed as a

recursive kernel that can be used in SVMs.

Recent works succeeded in establishing stronger connections, ones that link fully-trainable

NNs with Gaussian process. (Jacot et al., 2018) studied the learning dynamics and generalization

of NNs in the infinite widths limit and proved that gradient descent (and also some more

general formulations of training) is equivalent to the so-called “kernel gradient descent” with

respect to a fixed neural tangent kernel (NTK). The special case of least-squares regression

was described in full detail, in which the evolution of the network function during training was

explicitly characterized and related to properties of the NTK. (Arora et al., 2019a) presented

exact computations of some kernels, using which the kernel regression models can be shown

to be the limit (in widths and training time) of fully-trainable, infinitely-wide fully-connected

networks trained with gradient descent. The authors then presented kernels corresponding to

CNNs without proving convergence of the network functions to the kernel-based representations.

Despite the full trainability and elegant theoretical construction, the resulting models are often

outperformed by the corresponding NNs on competitive benchmarking datasets and suffer from


high computational complexity. This underwhelming performance from kernel methods inspired

by infinitely-wide networks is further confirmed by some recent work (Lee et al., 2020), limiting

the practical value of these models compared to, e.g., standard, finitely-wide CNNs.

2.3 Modular Learning of Deep Architectures

Many existing works in machine learning can be analyzed from the perspective of

modularization. An old example is the mixture of experts (Jacobs et al., 1991; Jordan &

Jacobs, 1994), which uses a gating function to enforce each expert in a committee of networks

to solve a distinct group of training cases. For every input data point, multiple expert networks

compete to take on a given supervised learning task. Instead of winner-take-all, all expert

networks may work together but the winner expert plays a more important role than the

others (Chen, 2015). Another recent example is the generative adversarial networks (GANs)

(Goodfellow et al., 2014). Typical GANs have two adversarial networks that are essentially

decoupled in functionality and can be viewed as two modules. One of the two networks,

dubbed a generator, attempts to synthesize contents that are “realistic” according to some

criterion specified by the user through the choice of “real” examples and the objective function.

The other network, a discriminator, tries to distinguish the synthesized content from the

given, real ones. The two networks, however, are typically trained jointly despite their distinct

roles. (Watanabe et al., 2018) proposed an a posteriori method that analyzes a trained

network as modules in order to extract useful information. The network was trained end-to-end

beforehand. These works do not focus on fully modularizing the training of deep architectures,

contrasting ours.

Among works on improving or substituting backpropagation (Rumelhart et al., 1986)

in learning a deep architecture, most aim at improving the classical method, working as

add-ons. The most notable ones are perhaps the unsupervised greedy pre-training methods

in, e.g., (Hinton et al., 2006) and (Bengio et al., 2007). (Erdogmus et al., 2005) proposed an

initialization scheme for backpropagation that can be interpreted as propagating the output

target to the latent layers. (Lee et al., 2015a) used auxiliary classifiers to aid the training of


latent layers. These classifiers operate on latent activations and induce loss values that are

minimized during training alongside the main objective. (Raghu et al., 2017) tried to quantify

the quality of hidden representations toward learning more interpretable deep architectures, but

the proposed quality measure was not directly used in optimization. All these methods still rely

on end-to-end backpropagation for learning the underlying network.

As for works that attempt to fully modularize training, on the other hand, (Fahlman &

Lebiere, 1990) pioneered the idea of fully greedy learning of NNs. In their work, each new node

is added to maximize the correlation between its output and the residual error. This can also

be viewed from an ensemble method perspective, similar to, e.g., how new learners are added

into an ensemble in boosting algorithms (Freund et al., 1999). (Xu & Principe, 1999) proposed

to train MLPs layer-by-layer by maximizing mutual information. The idea was to consider an

MLP as a communication channel and the objective was to transmit as much information as

possible about the desired target at each layer. Then each layer, a stage in this communication

channel, was trained so that the mutual information between its output and the desired signal

was maximized. They did not provide an optimality guarantee for this approach, however.

Another way to remove the need for global backpropagation, thus enabling fully-modularized

training, is to locally approximate supervision rather than basing it on flowing gradient through

all layers. (Bengio, 2014; Lee et al., 2015b) locally approximates a “target” for each layer with

the target of layer i being the target at the output sent through the approximate inverse of

all layers in between layer i and the output. This inverse is approximated with autoencoders.

There is no guarantee, however, that any layer is always or even sometimes invertible during

training. And approximating the inverse can easily introduce large errors. (Jaderberg et al.,

2017) approximates gradient locally at each layer or each node. The gradient information

is again approximated by individual networks. This removes the need for each layer to

“wait” for other layers to pass over the necessary information during training, opening up

possibilities for highly-parallel, much accelerated NN training. (Carreira-Perpinan & Wang,

2014) reformulates the NN optimization problem by explicitly writing out the latent activation


vectors as optimization variables and solves it by alternately optimizing the latent auxiliary

optimization variables and the network layers. No gradient computation across layers is

necessary. (Balduzzi et al., 2015) factorizes the error signal in backpropagation to form local

approximations, removing the need to pass gradient across layers and modularizing training.

Some authors pursue the goal of modularizing deep architectures with different

approaches. In (Zhou & Feng, 2017), the connectionist model analog of decision trees is

proposed, the training of which does not need end-to-end backpropagation. (Lowe et al., 2019)

proposed to learn the hidden layers with unsupervised contrastive learning, decoupling their

training from that of the output layer. In terms of performance, however, the authors only

demonstrated results on fully or partly self-supervised tasks instead of fully supervised ones.


CHAPTER 3
MATHEMATICAL PRELIMINARIES

3.1 Notations

Throughout, we use bold capital letters for matrices and tensors, bold lower-case letters

for vectors, and unbold lower-case letters for scalars. (v)i denotes the ith component of vector

v. And W(j) denotes the jth column of matrix W unless noted otherwise. For a 3D tensor

X, X[:, :, c] denotes the cth matrix indexing along the third dimension from the left (or the cth

channel). We use ⟨·, ·⟩H to denote the inner product in an inner product space H. And the

subscript shall be omitted if doing so causes no confusion. For functions, we use bold letters

to denote vector/matrix/tensor-valued functions, and unbold lower-case letters are reserved

specifically for scalar-valued ones. Function compositions shall be denoted with the usual . In

a network, we call a composition of an arbitrary number of layers as a module for convenience.

We call models that are linear in their trainable weights (not necessarily their inputs) linear

models.

3.2 Kernel Method in Machine Learning

In this work, we consider a kernel to be a bivariate, symmetric, continuous function over

the real numbers defined on some Euclidean space:

k : Rd × Rd → R. (3-1)

It is worth noting that more general definitions exist. For example, a kernel might be defined

on any nonempty set in general and might map into the complex numbers instead of only the

reals. We only consider this restricted definition as it suffices for our purposes.

A kernel is said to be positive semidefinite if for any finite sequence u1, ...,un and any real

sequence c1, ..., cn, we have

∑_{i=1}^{n} ∑_{j=1}^{n} cicjk(ui,uj) ≥ 0. (3-2)

A Hilbert space is a complete inner product space. We consider only Hilbert spaces over

the reals, i.e., those whose inner products are defined over the real numbers.


3.2.1 A Primer on Kernel Method

There is a two-way connection between certain Hilbert spaces and positive semidefinite

kernels (Scholkopf & Smola, 2001).

Let H be a Hilbert space of real functions on an Euclidean space X. Let ϕ : X → H be

an injective mapping. Then if the evaluation functional over H is continuous everywhere in H,

we can define a unique kernel for H with

k(u,v) = ⟨ϕ(u),ϕ(v)⟩H , ∀u,v. (3-3)

These Hilbert spaces are called reproducing kernel Hilbert spaces (RKHSs).

We now detail this construction and show that k is indeed a valid kernel. Specifically, the

evaluation functional Lu is defined as

Lu : H → R : f ↦ f(u), ∀f ∈ H. (3-4)

If H is such that this functional is continuous in H for all u ∈ X, then Riesz representation

theorem states that for all u ∈ X, there is a unique uH ∈ H such that

Lu(f) = ⟨f,uH⟩H ,∀f ∈ H. (3-5)

We can then define an injective mapping ϕ : X → H : u ↦ uH. Now, it is easy to see that

uH(v) = ⟨ϕ(u),ϕ(v)⟩H , ∀u,v ∈ X. (3-6)

Defining our kernel via

k(u,v) := uH(v), (3-7)

it is easy to see that it is positive semidefinite and symmetric. Indeed, continuity and symmetry

follow immediately from the definition. To see that this kernel is positive semidefinite, note that

∑_{i=1}^{n} ∑_{j=1}^{n} cicjk(ui,uj) = ⟨∑_{i=1}^{n} ciϕ(ui), ∑_{j=1}^{n} cjϕ(uj)⟩_H = ∥∑_{i=1}^{n} ciϕ(ui)∥²_H ≥ 0, (3-8)

where the norm is the one induced by the inner product.
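The argument above can also be checked numerically. The short sketch below (ours; the Gaussian kernel, sample size, and dimension are arbitrary choices) builds a Gram matrix on random points and confirms that its spectrum is nonnegative, as Eq. 3-2 and Eq. 3-8 require.

```python
import numpy as np

# Numerical illustration: the Gram matrix of a positive semidefinite kernel
# evaluated on any finite set of points has no negative eigenvalues.

def gaussian_kernel(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / sigma ** 2)

rng = np.random.default_rng(0)
points = rng.normal(size=(50, 3))                       # 50 points in R^3
gram = np.array([[gaussian_kernel(u, v) for v in points] for u in points])

eigvals = np.linalg.eigvalsh(gram)                      # symmetric -> real spectrum
print(eigvals.min())                                     # >= 0 up to round-off
```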


The connection between certain Hilbert spaces and positive semidefinite kernels can be

established in the other direction as well. Per Moore-Aronszajn Theorem, for every symmetric,

continuous, positive semidefinite bivariate function k that maps into R, one can find a unique

RKHS H such that

k(u,v) = ⟨ϕ(u),ϕ(v)⟩H ,∀u,v, (3-9)

where the mapping ϕ is defined through k as ϕ(u) := k(u, ·) (Aronszajn, 1950).

3.2.2 The “Kernel Trick”

While the kernel theory is profound and has found use in mathematics, statistics, etc., its

single most useful property for the machine learning community is perhaps that certain kernels

k represent inner products between features under certain (potentially nonlinear) feature maps.

Indeed, the machine learning community typically considers an input Euclidean space X and a

feature space H with a feature map ϕ : X → H, and since for certain k,ϕ, H, we have

k(u,v) = ⟨ϕ(u),ϕ(v)⟩H ,∀u,v ∈ X, (3-10)

any inner product in the feature space can be conveniently computed by evaluating k without

explicitly knowing what ϕ is. This identity is sometimes referred to as the “kernel trick”

(Shalev-Shwartz & Ben-David, 2014).

Thanks to the kernel trick, one can represent many useful geometric quantities with kernel

values. For example, one can evaluate distance between feature vectors with only kernel values

as follows.

∥ϕ(u)− ϕ(v)∥²H = k(u,u) + k(v,v)− 2k(u,v). (3-11)

Examples of such k that are popular with the machine learning community include:

1. linear kernel: k(u,v) = u⊤v;

2. Gaussian kernel: k(u,v) = e^(−∥u−v∥²/σ²), where σ is a hyperparameter;

3. polynomial kernel: k(u,v) = (u⊤v + c)^d, where c ∈ R, d ∈ N are hyperparameters.

The ϕ of a Gaussian kernel can be shown to be an infinite series in ℓ2 (Scholkopf & Smola,

2001).
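As a quick illustration (ours, not part of the original text), the three kernels above and the distance identity in Eq. 3-11 can be evaluated directly, without ever constructing ϕ:

```python
import numpy as np

# The three example kernels and the feature-space distance identity of
# Eq. 3-11, computed purely from kernel evaluations.

def linear_kernel(u, v):
    return u @ v

def gaussian_kernel(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / sigma ** 2)

def polynomial_kernel(u, v, c=1.0, d=3):
    return (u @ v + c) ** d

def feature_distance_sq(k, u, v):
    # ||phi(u) - phi(v)||_H^2 expressed through kernel values only
    return k(u, u) + k(v, v) - 2 * k(u, v)

u, v = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for k in (linear_kernel, gaussian_kernel, polynomial_kernel):
    print(k.__name__, k(u, v), feature_distance_sq(k, u, v))
```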


Two popular usages of kernel method based on this identity are discussed below.

3.2.2.1 Kernel machines: linear models on nonlinear features

A classic usage of the kernel trick is to boost the capacity of linear algorithms without

compromising their mathematical tractability. One example is the kernel machine. Kernel

machines can be considered as linear models in feature spaces, the mappings into which

are potentially nonlinear (Shalev-Shwartz & Ben-David, 2014). Consider a feature map

ϕ : Rd → H, where H is an RKHS; a kernel machine is then a linear model in this feature space H:

f(x) = ⟨w,ϕ(x)⟩H + b,w ∈ H, b ∈ R, (3-12)

where w, b are its trainable weights and bias, respectively. It is easy to see that the function

is still linear in the trainable weights, yet it is capable of representing mappings that are

potentially nonlinear in its input.

One can use the kernel trick to implement highly nontrivial feature maps, thus realizing

highly complicated functions in the input space. For feature maps that are not implementable,

e.g., when ϕ(x) is an infinite series as in the case of the Gaussian kernel, one can approximate

the kernel machine with a set of “centers” x1, . . . ,xn as follows:

f(x) ≈ ⟨w,ϕ(x)⟩H + b, w ∈ span{ϕ(x1), . . . ,ϕ(xn)}, b ∈ R, (3-13)

Assuming a kernel k can be used for the kernel trick, the right hand side is equal to

∑_{i=1}^{n} αik(xi,x) + b, αi, b ∈ R. (3-14)

Now, the learnable parameters of this model become the αi’s and b. This implicit implementation

without evaluating ϕ directly is typically how RBF networks are presented.
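A minimal sketch of this center-based form (Eq. 3-14) is given below, assuming a Gaussian kernel and randomly chosen centers and coefficients; in practice the coefficients would come from a training procedure, e.g., an SVM or kernel ridge regression solver.

```python
import numpy as np

# Approximated kernel machine f(x) = sum_i alpha_i k(x_i, x) + b over a set
# of centers. Gaussian kernel and random alphas are illustrative placeholders.

def gaussian_kernel(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / sigma ** 2)

class KernelMachine:
    def __init__(self, centers, alphas, bias, kernel=gaussian_kernel):
        self.centers, self.alphas, self.bias, self.kernel = centers, alphas, bias, kernel

    def __call__(self, x):
        # O(n d) per query: one kernel evaluation against each of the n centers
        return sum(a * self.kernel(c, x) for a, c in zip(self.alphas, self.centers)) + self.bias

rng = np.random.default_rng(0)
centers = rng.normal(size=(100, 5))          # e.g., the training inputs
alphas = rng.normal(size=100)                # would come from training
f = KernelMachine(centers, alphas, bias=0.1)
print(f(rng.normal(size=5)))
```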

Note that for kernels whose feature maps are implementable, e.g., the polynomial kernel,

the approximation is not necessary and one may implement the kernel machines in exact forms,

i.e., by implementing ϕ directly. On the other hand, when approximation is indeed necessary, one

may take the training set to be the set of centers.


This approximation changes the computational complexity of evaluating the kernel

machine on a single example from O(h), where h is the dimension of the feature space H

and can be infinite for certain kernels, to O(nd), where d is the dimension of the input space.

In practice, the runtime over a full sample is quadratic in the sample size since n is typically on the

same order as the sample size. In particular, since one usually uses the entire training set as

the centers, the complexity of running the kernel machine over the training set is quadratic,

contrasting the linear complexity of other popular models such as NNs. This severely limits the

practicality of kernel machines on today’s machine learning datasets, which usually have sample

sizes on the order of tens of thousands, rendering super-quadratic complexity unacceptable.

There exist acceleration methods that reduce the complexity via further approximation in this

case (e.g., (Rahimi & Recht, 2008)), yet the compromise in performance can be nonnegligible

in practice especially when the input space dimension is large.
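For concreteness, here is a sketch of one such acceleration, the random Fourier features of (Rahimi & Recht, 2008), specialized to the Gaussian kernel k(u,v) = e^(−∥u−v∥²/σ²) used above; the feature dimension D and the bandwidth σ are illustrative choices, and the mapping only approximates the exact kernel.

```python
import numpy as np

# Random Fourier features for the Gaussian kernel exp(-||u - v||^2 / sigma^2):
# z(u)^T z(v) approximates k(u, v), so a linear model on z runs in O(D) per example.

def random_fourier_features(X, D=500, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Spectral density of this kernel parameterization: w ~ N(0, (2 / sigma^2) I)
    W = rng.normal(scale=np.sqrt(2.0) / sigma, size=(d, D))
    b = rng.uniform(0.0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))
Z = random_fourier_features(X)
approx = Z @ Z.T                                           # approximate Gram matrix
exact = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1))     # sigma = 1.0
print(np.abs(approx - exact).max())                        # small for large D
```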

There are essentially two key results guaranteeing the expressiveness of kernel machines

even under their approximated representations in Eq. 3-14. Many kernels, including the

Gaussian, induce kernel machines that are universal function approximators, meaning essentially

that they can approximate arbitrary function with arbitrary precision under mild assumptions

(Micchelli et al., 2006). These results typically require that one has the freedom to sample

potentially an arbitrarily large set of centers. Another important result, dubbed the representer

theorem (Scholkopf et al., 2001), states that for minimizing an objective function

ℓ({(xi, yi, f(xi))}_{i=1}^{n}) + g(∥f∥H) (3-15)

over all f that admits a representation

∑_{i=1}^{∞} αik(zi,x) + b, αi, b ∈ R, (3-16)

the optimal solution takes the form

∑_{i=1}^{n} αik(xi,x) + b, αi, b ∈ R, (3-17)


where g can be any strictly increasing real function and the norm ∥ · ∥H is the canonical one

induced by the inner product in H. In effect, this theorem states that it suffices to use the

training set as centers. And without further assumptions, reducing the number of centers

breaks the optimality guarantee and usually worsens performance in practice.

SVM (Vapnik, 2000b) and many kernel adaptive filters such as KLMS (Liu et al., 2008)

are essentially kernel machines optimized with a specialized algorithm and/or for a specific

objective function.

Some other classical machine learning algorithms that use the kernel trick to turn linear

algorithms into nonlinear ones include kernel principal component analysis (Scholkopf et al.,

1998). But we do not delve into the details of these methods since they are less related to our

work.

3.2.2.2 Kernel functions as similarity measures

Since positive semidefinite kernels can be associated with inner product values, they can

be used to quantify similarity between examples. Further, since they induce Gram matrices

that are themselves positive semidefinite, these kernels enable concise ways to compute the

covariance matrices of Gaussian processes (Williams & Rasmussen, 2006), which is how the

kernel method is mainly used in the Gaussian process literature. Choosing different kernels can

be a way to inject prior knowledge about a given task into the learning process, reflecting that

a specific notion of similarity is preferred for the given task.


CHAPTER 4
KERNEL NETWORKS: DEEP ARCHITECTURES POWERED BY KERNEL MACHINES

In this chapter, we present the details of our proposed kernel networks — an extension of

the classical kernel machines to connectionist models. We first describe a simple, generic

recipe for building KNs. The idea is that for any given NN architecture, one can swap

artificial neurons with kernel machines since they have the same "I/O". One may then follow this procedure and end up with either a full KN, where all neurons become kernel machines, or a neural-kernel hybrid network, where only some of the neurons are substituted by kernel machines. We then discuss characteristics of KNs that make them interesting compared to NNs. A concrete example is presented afterwards, where we describe in detail the KN equivalent of the MLP. Its model complexity is also analyzed.

4.1 A Recipe for Building Kernel Networks

Given a set of artificial neurons {f_i : X_i → R}_{i=1}^s, with each X_i a Euclidean space and each neuron defined as a set of mappings admitting the representation

f_i : x ↦ ϕ_i(w_i^⊤ x + b_i),  w_i ∈ X_i, b_i ∈ R,   (4-1)

with ϕ_i some real-valued function, the procedure of building a network from these base neurons can be abstracted as a functional F : {f_i}_{i=1}^s ↦ f, where f is itself a set of mappings defined on some Euclidean space and mapping into another Euclidean space. The set is taken over all the w_i, b_i, i = 1, ..., s. Note that F is defined on a set of sets of real-valued functions.

This functional F performs only two bivariate operations between pairs of elements in its input (pairs of sets of real functions): exhaustively composing (∘) or concatenating ([·, ·]) pairs of functions ("exhaustively" in the sense that these operations are performed on each and every pair of functions from the two operand sets), where the concatenation of two functions (h_i, h_j) is defined to return a vector-valued function whose first coordinate is h_i and whose second is h_j, both operating on the same input.

As an example, given three neurons f_1, f_2, f_3 : R^2 → R with weights (w_1, b_1), (w_2, b_2), (w_3, b_3), respectively, one can build a two-layer MLP, with two neurons on the first layer and one neuron on the second, using a network-building functional F defined as

F({f_1, f_2, f_3}) = f_3 ∘ [f_1, f_2].   (4-2)

The resulting NN is defined on R^2 and maps into R with trainable weights w_1, w_2, w_3, b_1, b_2, b_3. Note that, in general, an element of the input may appear more than once in order to make recurrent connection(s) possible.

We now present a recipe for building KNs. First of all, it is easy to see that any given NN f_NN is fully characterized by a set of base neurons {f_i : X_i → R}_{i=1}^s and a network-building functional F as

f_NN := F({f_i}_{i=1}^s).   (4-3)

Define a set of kernel machines {g_i : X_i → R}_{i=1}^s, with each being a set of mappings admitting the form

g_i : x ↦ ⟨ϕ_i(x), w_i⟩_{H_i} + b_i,  w_i ∈ H_i, b_i ∈ R,
k_i(u, v) = ⟨ϕ_i(u), ϕ_i(v)⟩_{H_i},  ∀u, v,   (4-4)

where H_i is an RKHS feature space with kernel k_i and ϕ_i a feature map. Now, for this NN f_NN, one can build a KN with the exact same connectivity by applying the same network-building functional to the g_i's:

g_KN := F({g_i}_{i=1}^s).   (4-5)

Evidently, one may also apply the same F to

{h_i | h_i = f_i, ∀i ∈ I; h_j = g_j, ∀j ∈ J}, where I, J form a partition of {1, ..., s},   (4-6)

and obtain a neural-kernel hybrid model.
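The recipe can be made concrete with a small sketch (illustrative only, not the dissertation's code): the network-building functional of Eq. 4-2 only composes and concatenates its units, so it is indifferent to whether the units are neurons or kernel machines. The choice of Gaussian kernel, the centers, and the random weights are assumptions for the example; in a trained KN, the second-layer centers would normally be images of training points under the first layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def neuron(dim, phi=np.tanh):
    """Artificial neuron: x -> phi(w.x + b)."""
    w, b = rng.normal(size=dim), 0.0
    return lambda x: phi(w @ x + b)

def kernel_machine(dim, centers, sigma=1.0):
    """Kernel machine: x -> sum_p alpha_p k(c_p, x) + b with a Gaussian kernel."""
    alpha, b = rng.normal(size=len(centers)), 0.0
    k = lambda u, v: np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))
    return lambda x: sum(a * k(c, x) for a, c in zip(alpha, centers)) + b

def F(units):
    """Network-building functional of Eq. 4-2: F({f1, f2, f3}) = f3 o [f1, f2]."""
    f1, f2, f3 = units
    return lambda x: f3(np.array([f1(x), f2(x)]))

centers = [rng.normal(size=2) for _ in range(5)]   # arbitrary centers in R^2
x = rng.normal(size=2)
nn_model = F([neuron(2), neuron(2), neuron(2)])                                    # plain NN
kn_model = F([kernel_machine(2, centers), kernel_machine(2, centers),
              kernel_machine(2, centers)])                                         # full KN
hybrid   = F([neuron(2), kernel_machine(2, centers), kernel_machine(2, centers)])  # hybrid
print(nn_model(x), kn_model(x), hybrid(x))
```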


4.2 Why Kernel Networks?

Despite their strong performance on the most challenging machine learning problems, NNs are notoriously difficult to analyze, both because each neuron is itself a nonlinear model and because the connectivity can be largely arbitrary.

Kernel machines, in comparison, are much more mathematically tractable since they are simply linear models, for which profound theory has been developed. Further, their linearity, together with the fact that they operate in feature spaces equipped with useful constructions such as an inner product, allows users to conceptualize learning in terms of geometrical concepts, making the model much more intuitive to understand and interpret. However, largely due to the lack of flexibility in architecture, which in turn is caused by their non-connectionist nature, kernel machines have been unable to learn powerful representations as NNs do (Bengio et al., 2013). The result is underwhelming performance in cutting-edge machine learning applications.

KN, as a new family of connectionist models, is a step towards models that combine

the best of both worlds. Indeed, KN shares the same strong expressive power as NN since a

kernel machine is a universal function approximator under mild conditions (Park & Sandberg,

1991; Micchelli et al., 2006). KN is also flexible thanks to its connectionist nature, and similar

to NN, domain knowledge can be injected into the model via architecture design. On the

other hand, in KN, each node is a linear model, of which we have deep understanding. While,

admittedly, the mathematical tractability of KN is still not ideal given the arbitrariness in

connectivity among nodes, we now can at least conceptualize and interpret learning more

comfortably locally at each node. We hope this can serve as a step toward more interpretable

yet performant deep learning systems.

4.3 Robustness in Choice of Kernel

A criticism of kernel methods in general is that their performance typically relies strongly on the choice of kernel and the related hyperparameters. This issue is somewhat automatically mitigated in KN, as has been noted in, for example, (Huang & LeCun, 2006).


The reason behind KN's robustness to the choice of kernel or kernel hyperparameters is that it automatically performs kernel learning alongside learning to perform the given task. To see this, note that even though the network is built from generic kernels, each kernel on a non-input layer admits the form k(F_i(·), F_j(·)), where F_i, F_j are upstream modules in the network. The fact that these modules are learnable makes this kernel adaptive, mitigating to some extent any limitation caused by using a fixed, generic kernel k. With training, F_i and F_j tune this adaptive kernel according to the task at hand. And it is always a valid kernel if the

generic kernel k is. Other works, such as (Wilson et al., 2016), essentially also build on this

idea to learn adaptive kernels with nonparametric module(s) and mitigate the limitation from

using fixed, generic kernels.

4.4 An Example: Kernel MLP

To provide a concrete example for KN, we now define the KN equivalent of an l-layer

MLP.

Recall that an MLP with l layers is defined as the set of functions

F_{MLP,l} := { F_l ∘ ··· ∘ F_1 | F_i : R^{d_{i−1}} → R^{d_i} : u ↦ (f_{i,1}(u), ..., f_{i,d_i}(u))^⊤,
               f_{i,j}(u) = ϕ_{i,j}(w_{i,j}^⊤ u + b_{i,j}),  w_{i,j} ∈ R^{d_{i−1}}, b_{i,j} ∈ R },   (4-7)

where ϕ_{i,j} is usually user-specified and is the same for all j within a given layer i.

Now, the KN equivalent of such a model, which we shall refer to as kernel MLP (kMLP),

is defined as the set of functions:

F_{kMLP,l} := { F_l ∘ ··· ∘ F_1 | F_i : R^{d_{i−1}} → R^{d_i} : u ↦ (f_{i,1}(u), ..., f_{i,d_i}(u))^⊤,
                f_{i,j}(u) = ⟨w_{i,j}, ϕ_{i,j}(u)⟩_{H_{i,j}} + b_{i,j},  w_{i,j} ∈ H_{i,j}, b_{i,j} ∈ R },   (4-8)

where ϕ_{i,j} is a feature map into an RKHS feature space H_{i,j} with kernel k_{i,j}. Evidently, we also have

k_{i,j}(u, v) = ⟨ϕ_{i,j}(u), ϕ_{i,j}(v)⟩_{H_{i,j}},  ∀u, v.   (4-9)

For kernels with intractable feature maps, this kMLP can be approximated using the kernel trick. Specifically, suppose we choose {x_t}_{t=1}^m, x_t ∈ R^{d_0} for all t = 1, ..., m, to be our set of centers. Denoting F_i ∘ ··· ∘ F_1(x_t) as x_t^{(i)} for i ≥ 1 and defining x_t^{(0)} = x_t for consistency, we have

F_{kMLP,l} := { F_l ∘ ··· ∘ F_1 | F_i : R^{d_{i−1}} → R^{d_i} : u ↦ (f_{i,1}(u), ..., f_{i,d_i}(u))^⊤,
                f_{i,j}(u) = ∑_{p=1}^m α_{i,j,p} k_{i,j}(x_p^{(i−1)}, u) + b_{i,j},  α_{i,j,p}, b_{i,j} ∈ R }.   (4-10)

Clearly, the main difference among these models is the definition of the base unit f_{i,j}.
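As an illustration only (not the dissertation's released code), the following PyTorch-style sketch implements one layer of the approximated kMLP in Eq. 4-10 with a Gaussian kernel; the layer widths, kernel widths, and the use of the training inputs as centers are assumptions made for the example.

```python
import torch

class GaussianKernelLayer(torch.nn.Module):
    """One kMLP layer: f_{i,j}(u) = sum_p alpha_{i,j,p} k(c_p, u) + b_{i,j}."""
    def __init__(self, centers, out_dim, sigma=1.0):
        super().__init__()
        self.register_buffer("centers", centers)   # (m, d_in): transformed training inputs
        self.alpha = torch.nn.Parameter(torch.randn(centers.shape[0], out_dim) * 0.01)
        self.bias = torch.nn.Parameter(torch.zeros(out_dim))
        self.sigma = sigma

    def forward(self, u):
        sq = torch.cdist(u, self.centers) ** 2      # (batch, m) squared distances
        K = torch.exp(-sq / (2 * self.sigma ** 2))  # kernel evaluations against the centers
        return K @ self.alpha + self.bias           # (batch, out_dim)

# A two-layer kMLP: the second layer's centers are the first layer's images of the centers,
# i.e., x_p^{(1)} in Eq. 4-10 (recomputed periodically in practice as layer 1 trains).
X = torch.randn(100, 5)                             # training inputs double as centers
layer1 = GaussianKernelLayer(X, out_dim=8, sigma=2.0)
with torch.no_grad():
    centers2 = layer1(X)
layer2 = GaussianKernelLayer(centers2, out_dim=1, sigma=1.0)
out = layer2(layer1(X))                             # shape (100, 1)
```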

Model Complexity. Model complexity, at a high level, quantifies how "expressive" a model can be. It is an integral part of bounds on the generalization performance of a specific model architecture (Shalev-Shwartz & Ben-David, 2014). For this reason, estimating model complexity is a central topic in statistical learning theory, and it provides much insight into both the theoretical understanding of machine learning models and practical issues such as architecture design.

In this section, we give a bound on the model complexity of FkMLP,l in terms of the

well-known complexity measure, Gaussian complexity (Bartlett & Mendelson, 2002). In

particular, this bound quantifies the relationship between the depth and width of the model

and its expressive power. We first review the definition of Gaussian complexity.

Definition 1 (Gaussian complexity (Bartlett & Mendelson, 2002)). Let X_1, ..., X_n be i.i.d. random elements defined on a metric space X. Let F be a set of functions mapping from X into R. Define

Ĝ_n(F) = E[ sup_{F∈F} (1/n) ∑_{i=1}^n Z_i F(X_i) | X_1, ..., X_n ],   (4-11)

where Z_1, ..., Z_n are independent standard normal random variables. The Gaussian complexity of F is defined as

G_n(F) = E Ĝ_n(F).   (4-12)
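To build intuition, here is a small Monte Carlo sketch (illustrative, not from the text) that estimates the empirical Gaussian complexity of a finite function class by averaging sup_F (1/n) ∑_i Z_i F(X_i) over draws of the Gaussian noise; the particular classes of sign functions are arbitrary examples.

```python
import numpy as np

def empirical_gaussian_complexity(function_values, num_draws=2000, rng=None):
    """function_values: array of shape (num_functions, n), the values F(X_i)
    for each F in a finite class, evaluated on a fixed sample X_1, ..., X_n."""
    rng = np.random.default_rng(rng)
    estimates = []
    for _ in range(num_draws):
        Z = rng.standard_normal(function_values.shape[1])   # Z_1, ..., Z_n
        correlations = function_values @ Z / function_values.shape[1]
        estimates.append(correlations.max())                # sup over the class
    return float(np.mean(estimates))

# Example: a richer (here, simply larger) class correlates better with noise.
X = np.random.randn(200, 3)
small_class = np.array([np.sign(X @ w) for w in np.random.randn(5, 3)])
large_class = np.array([np.sign(X @ w) for w in np.random.randn(500, 3)])
print(empirical_gaussian_complexity(small_class, rng=0),
      empirical_gaussian_complexity(large_class, rng=0))
```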

Intuitively, Gaussian complexity quantifies “expressiveness” of a model through how

well its output sequence can correlate with a normally distributed noise sequence (Bartlett &

Mendelson, 2002). It has been widely adopted as one of the most popular complexity measures, along with Rademacher complexity (Bartlett & Mendelson, 2002) and the VC dimension (Vapnik, 2000b; Mohri et al., 2018). It is well-known that Gaussian complexity

and Rademacher complexity are closely related, and generalization bounds stated in terms of

one can often be easily reformulated in terms of the other with similar tightness guarantees

(Bartlett & Mendelson, 2002).

For the following Propositions and the Lemma based on which they are proved (in

Appendix A), we impose the following assumptions:

1. ∀i, k_{i,1} = k_{i,2} = ··· = k_{i,d_i} = k_i;

2. ∀u ∈ R^{d_{i−1}}, k_i(u, ·) is L_{i,u}-Lipschitz with respect to the Euclidean metric on R^{d_{i−1}}, and sup_{u∈R^{d_{i−1}}} L_{i,u} = L_i < ∞.

These two Propositions are proved in Appendix A.

Proposition 4.1 (Gaussian complexity of kMLP, 1-norm). Let F_{kMLP,l} be defined as in Section 4.4 with d_l = 1. Define Ω_1 to be the set formed by all possible f_{1,j}. Denote α_{i,j} = (α_{i,j,1}, ..., α_{i,j,m})^⊤. Assuming ‖α_{i,j}‖_1 ≤ A_i, ∀j, for some A_i, for all i = 2, ..., l, we have

G_n(F_{kMLP,l}) ≤ 2 d_1 ∏_{i=2}^l A_i L_i d_i G_n(Ω_1).   (4-13)

Proposition 4.2 (Gaussian complexity of kMLP, 2-norm). Let F_{kMLP,l} be defined as in Section 4.4 with d_l = 1. Define Ω_1 to be the set formed by all possible f_{1,j}. Denote α_{i,j} = (α_{i,j,1}, ..., α_{i,j,m})^⊤. Assuming ‖α_{i,j}‖_2 ≤ A_i, ∀j, for some A_i, for all i = 2, ..., l, we have

G_n(F_{kMLP,l}) ≤ 2 d_1 m^{(l−1)/2} ∏_{i=2}^l A_i L_i d_i G_n(Ω_1).   (4-14)

From these Propositions, we see that the model complexity of a kMLP grows in the

network depth and width in a similar way as that of MLP (Sun et al., 2015). In particular,


kMLP’s expressive power increases linearly in the width of any given layer and exponentially in

the depth of the network.

Further, we have the following result completing these earlier ones.

Proposition 4.3 (Gaussian complexity of a single kernel machine (Bartlett & Mendelson, 2002)). Assume that for all f_{1,j}, we have

∑_{p=1}^m ∑_{q=1}^m α_{1,j,p} α_{1,j,q} k(x_p, x_q) ≤ A_1²   (4-15)

for some A_1 ≥ 0. Let Ω_1 be the set of all such f_{1,j}'s. Then

G_n(Ω_1) ≤ (2 A_1 / m) √( ∑_{p=1}^m k(x_p, x_p) ).   (4-16)

This Proposition describes how the choice of centers affects the overall model complexity and can be directly plugged into the earlier ones to complete the bounds therein.

4.5 Experiments: Comparing KNs with Classical Kernel Machines

In this section, we compare KNs against classical kernel machines on some popular benchmarking datasets to showcase how combining connectionism with the kernel method improves the performance of the latter. Comparisons with NNs are deferred to Chapter 7.

We compare a single-hidden-layer kMLP using simple, generic kernels against the classical SVM and against SVMs enhanced by multiple kernel learning (MKL) algorithms that used significantly more kernels. The goal is to demonstrate the competence of kMLP and, in particular, its ability to perform well without excessive kernel parameterization thanks to connectionism.

The standard SVM and seven other SVMs enhanced by popular MKL methods were

compared (Zhuang et al., 2011), including the classical convex MKL (Lanckriet et al.,

2004) with kernels learned using the extended level method proposed in (Xu et al., 2009)

(MKLLEVEL); MKL with Lp norm regularization over kernel weights (Kloft et al., 2011)

(LpMKL), for which the cutting plane algorithm with second order Taylor approximation of Lp

was adopted; Generalized MKL in (Varma & Babu, 2009) (GMKL), for which the target kernel class was the Hadamard product of single Gaussian kernels defined on each dimension; Infinite Kernel Learning in (Gehler & Nowozin, 2008) (IKL) with MKLLEVEL as the embedded optimizer for kernel weights; the 2-layer Multilayer Kernel Machine in (Cho & Saul, 2009) (MKM); and 2-Layer MKL (2LMKL) and Infinite 2-Layer MKL in (Zhuang et al., 2011) (2LMKLINF).

Eleven binary classification datasets that have been widely used in MKL literature were

split evenly for training and test and were all normalized to zero mean and unit variance prior

to training. Twenty runs with identical settings but random weight initializations were repeated

for each model. For each repetition, a new training-test split was selected randomly.

For kMLP, all results were achieved using a greedily-trained (this training algorithm is

described in Chapter 6), one-hidden-layer model with the number of kernel machines ranging

from 3 to 10 on the first layer for different data sets. The second layer was a single kernel

machine. All kernel machines within one layer used the same Gaussian kernel, and the two

kernels on the two layers differed only in kernel width σ. All hyperparameters were chosen via

5-fold cross-validation.

As for the other models compared, for each data set, SVM used a Gaussian kernel. For

the MKL algorithms, the base kernels contained Gaussian kernels with 10 different widths

on all features and on each single feature and polynomial kernels of degree 1 to 3 on all

features and on each single feature. For 2LMKLINF, one Gaussian kernel was added to the base

kernels at each iteration. Each base kernel matrix was normalized to unit trace. For LpMKL, p was selected from {2, 3, 4}. For MKM, the degree parameter was chosen from {0, 1, 2}. All hyperparameters were selected via 5-fold cross-validation. These baseline results were obtained from (Zhuang et al., 2011).

From Table 4-1, kMLP compares favorably with the other models, which validates our claim that kMLP can be more expressive than traditional single kernel machines and that, because it learns its own kernels nonparametrically, it can work well even without excessive kernel parameterization, all thanks to its connectionist nature. Performance differences among models can be small for some data sets, which is expected since these datasets are all rather small in size and not too challenging. Nevertheless, it is worth noting that only two Gaussian

Table 4-1. Kernel networks vs. classical kernel machines: Average test error (%) and standard deviation (%) from 20 runs. Results with overlapping 95% confidence intervals (not shown) are considered equally good. Best results are marked in bold. The average ranks (calculated using average test error) are provided in the bottom row. When computing confidence intervals, due to the limited sizes of the data sets, we pooled the twenty random samples.

            Size/Dim.   SVM          MKLLEVEL     LpMKL        GMKL         IKL          MKM
Breast      683/10      3.2 ± 1.0    3.5 ± 0.8    3.8 ± 0.7    3.0 ± 1.0    3.5 ± 0.7    2.9 ± 1.0
Diabetes    768/8       23.3 ± 1.8   24.2 ± 2.5   27.4 ± 2.5   33.6 ± 2.5   24.0 ± 3.0   24.2 ± 2.5
Australian  690/14      15.4 ± 1.4   15.0 ± 1.5   15.5 ± 1.6   20.0 ± 2.3   14.6 ± 1.2   14.7 ± 0.9
Iono        351/33      7.2 ± 2.0    8.3 ± 1.9    7.4 ± 1.4    7.3 ± 1.8    6.3 ± 1.0    8.3 ± 2.7
Ringnorm    400/20      1.5 ± 0.7    1.9 ± 0.8    3.3 ± 1.0    2.5 ± 1.0    1.5 ± 0.7    2.3 ± 1.0
Heart       270/13      17.9 ± 3.0   17.0 ± 2.9   23.3 ± 3.8   23.0 ± 3.6   16.7 ± 2.1   17.6 ± 2.5
Thyroid     140/5       6.1 ± 2.9    7.1 ± 2.9    6.9 ± 2.2    5.4 ± 2.1    5.2 ± 2.0    7.4 ± 3.0
Liver       345/6       29.5 ± 4.1   37.7 ± 4.5   30.6 ± 2.9   36.4 ± 2.6   40.0 ± 2.9   29.9 ± 3.6
German      1000/24     24.8 ± 1.9   28.6 ± 2.8   25.7 ± 1.4   29.6 ± 1.6   30.0 ± 1.5   24.3 ± 2.3
Waveform    400/21      11.0 ± 1.8   11.8 ± 1.6   11.1 ± 2.0   11.8 ± 1.8   10.3 ± 2.3   10.0 ± 1.6
Banana      400/2       10.3 ± 1.5   9.8 ± 2.0    12.5 ± 2.6   16.6 ± 2.7   9.8 ± 1.8    19.5 ± 5.3
Rank        -           4.2          6.3          7.0          6.9          4.3          5.4

Table 4-2. Table 4-1, continued.

            2LMKL        2LMKLINF     kMLP-1
Breast      3.0 ± 1.0    3.1 ± 0.7    2.4 ± 0.7
Diabetes    23.4 ± 1.6   23.4 ± 1.9   23.2 ± 1.9
Australian  14.5 ± 1.6   14.3 ± 1.6   13.8 ± 1.7
Iono        7.7 ± 1.5    5.6 ± 0.9    5.0 ± 1.4
Ringnorm    2.1 ± 0.8    1.5 ± 0.8    1.5 ± 0.6
Heart       16.9 ± 2.5   16.4 ± 2.1   15.5 ± 2.7
Thyroid     6.6 ± 3.1    5.2 ± 2.2    3.8 ± 2.1
Liver       34.0 ± 3.4   37.3 ± 3.1   28.9 ± 2.9
German      25.2 ± 1.8   25.8 ± 2.0   24.0 ± 1.8
Waveform    11.3 ± 1.9   9.6 ± 1.6    10.3 ± 1.9
Banana      13.2 ± 2.1   9.8 ± 1.6    11.5 ± 1.9
Rank        5.0          2.8          1.6

kernels were used for kMLP, whereas all other models except the SVM used significantly more kernels. When compared with the classic SVM, kMLP is better on almost all datasets, although sometimes not by a statistically significant margin. One thing to note is that the SVM was trained with a more sophisticated and perhaps better optimization algorithm, whereas kMLP was trained with simple gradient descent. The constrained optimization approach used by the SVM can be adopted to train the output layer of kMLP, which could potentially further improve its performance.


CHAPTER 5
NEURAL NETWORKS ARE KERNEL NETWORKS

We describe an alternative view on NNs that allows them to be formally equated to

instantiations of KNs. This establishes a strong connection between finitely-wide, fully-trainable

NNs and kernel method. To begin with, we introduce a set of notations exclusively used in

this chapter to simplify discussions. We then formally introduce our construction, starting from

fully-connected networks and then extending to convolutional ones and more. Only minimal

adjustments are needed when extending from the fully-connected models to more complicated

ones.

5.1 Notations

In this chapter, we shall propose an alternative view on NN layers, making possible

an interpretation of these modules as KN layers. Therefore, it simplifies the presentation

if we represent layers or modules under our view with symbols different from those under the conventional view. For this purpose, we use the letter G_i (or g_{i,j}) with a numeric subscript i ∈ N \ {0} to refer to the ith network layer or module (or node) under the conventional view, and the letter F_i (or f_{i,j}) to refer to the ith layer or module (node) of the same model under our view. Whether the involved component is a layer or potentially a module will be clear from context if such a distinction needs to be made.

5.2 Revealing the Disguised Kernel Machines in Neural Networks

To show that NNs can be alternatively interpreted as KNs, we proceed by showing that

the base neurons can be interpreted as kernel machines. The idea can be described as follows.

An artificial neuron and a kernel machine differ in the relative order of their inner product

operation and nonlinearity. Based on this observation, instead of considering the nonlinearity

to be the ending nonlinearity of a given node gi,j, which would make the unit a neuron, we

consider it to be a part of the beginning nonlinearity of some immediate downstream node

gi+1,p. After potentially repeating this process to remove the ending nonlinearity of gi+1,p and

Figure 5-1. Viewing the layers from a new perspective, we identify the kernel machines "hidden" in neural networks and subsequently show that neural networks are in fact kernel networks (c.f. Chapter 4) in disguise. Specifically, by absorbing the ending nonlinearity of a node into some immediate downstream node, the base units become linear models in feature spaces. These linear models can be shown to be kernel machines. If there is a trailing nonlinearity after the output layer, we can absorb it into the loss function instead of considering it as a part of the network, making our proposed view universally applicable to all neural networks. Best viewed in color.

denoting this new node fi+1,p, we can show that fi+1,p admits the representation of a kernel

machine. This is illustrated in Fig. 5-1.

5.2.1 Fully-Connected Neural Networks

In this section, we describe our method for fully-connected NNs in full detail. To simplify

the presentation, we use a one-hidden-layer MLP as an example. Nevertheless, the idea scales

easily to deeper models.

Note that we assume the output layer has a single neuron. When the output layer has

multiple neurons, one can apply the same analysis to each of these output neurons individually.

Finally, we assume the output layer to be linear without loss of generality since if there is a

trailing nonlinearity at the output, it can be viewed as a part of the loss function instead of a

part of the network.

Consider an input vector x ∈ R^{d_0}. A one-hidden-layer MLP h = g_2 ∘ G_1 is given as

G_1(x) = ϕ(W_1^⊤ x) ∈ R^{d_1};   (5-1)

h(x) = g_2(G_1(x)) = w_2^⊤ G_1(x),   (5-2)

where ϕ is an elementwise nonlinearity such as ReLU (Nair & Hinton, 2010) with ϕ(v) := (ϕ((v)_1), ..., ϕ((v)_{d_1}))^⊤ for any v ∈ R^{d_1} and some ϕ : R → R, G_1 is the input layer, g_2 the output layer, h(x) the final model output, and W_1 and w_2 are the learnable weights of the NN model.

We can re-group the base units into layers differently. Indeed, without changing the

overall input-output mapping h, we can redefine the layers as follows.

h(x) = ⟨w_2, ϕ(W_1^⊤ x)⟩_{R^{d_1}} = ⟨w_2, ϕ(F_1(x))⟩_{R^{d_1}} = f_2(F_1(x)),   (5-3)

where F_1(x) = (⟨W_1^{(1)}, x⟩_{R^{d_0}}, ..., ⟨W_1^{(d_1)}, x⟩_{R^{d_0}})^⊤ and ⟨·, ·⟩_{R^k} is the canonical inner product of R^k for k = d_0, d_1, i.e., the dot product. In other words, we treat F_1 as the new input layer and absorb the nonlinearity ϕ into the old output layer g_2, forming the new output layer f_2.

Now, it is easy to see that f2 and all nodes on F1, denoted f1,j, are kernel machines.

Indeed, each f_{1,j} can be written as f_{1,j}(·) = ⟨W_1^{(j)}, ψ(·)⟩_{R^{d_0}} with ψ being the identity mapping on R^{d_0}, which is a kernel machine on R^{d_0} with the identity kernel, i.e., k_1(·, ·) := ⟨·, ·⟩_{R^{d_0}}. On the other hand, f_2 can be written as f_2(·) = ⟨w_2, ϕ(·)⟩_{R^{d_1}}, which is a kernel machine on R^{d_1} with kernel k_2(·, ·) := ⟨ϕ(·), ϕ(·)⟩_{R^{d_1}}.

Note that the kernel k_2 (together with F_1) can also be considered as a kernel on R^{d_0}: k_2′(F_1(·), F_1(·)) := ⟨ϕ(F_1(·)), ϕ(F_1(·))⟩_{R^{d_1}}.
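The regrouping above is purely a change of bookkeeping. A short NumPy check (illustrative only, with arbitrary random weights and ReLU as the assumed nonlinearity) confirms that the conventional view g_2 ∘ G_1 and the kernel-machine view f_2 ∘ F_1 compute the same output.

```python
import numpy as np

rng = np.random.default_rng(0)
d0, d1 = 4, 6
W1, w2 = rng.normal(size=(d0, d1)), rng.normal(size=d1)
relu = lambda v: np.maximum(v, 0.0)
x = rng.normal(size=d0)

# Conventional view: h(x) = g2(G1(x)) with G1(x) = relu(W1^T x).
h_conventional = w2 @ relu(W1.T @ x)

# Our view: F1 is the (linear) input layer, f2 absorbs the nonlinearity,
# i.e., f2(z) = <w2, relu(z)>, a kernel machine with k2(u, v) = <relu(u), relu(v)>.
F1 = lambda x: W1.T @ x
f2 = lambda z: w2 @ relu(z)
h_kernel_view = f2(F1(x))

assert np.allclose(h_conventional, h_kernel_view)
```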

All kernels involved in this construction are positive semidefinite (see the Lemma below).

In particular, the fact that k2 is positive semidefinite enables using convex optimization

techniques for training the output layer under many loss function formulations (Scholkopf &

Smola, 2001), assuming the input layer has been trained (a greedy training scheme will be discussed later).

Figure 5-2. A convolutional layer (illustrated is a single-filter instantiation), similar to the fully-connected case before, can be considered as a special case of kernel network layers. Each color corresponds to a kernel machine. Elements in black are shared across kernel machines. The main difference of a convolutional layer compared to a fully-connected one, from a kernel network standpoint, is that all kernel machines on a given channel of the former share weights but use distinct kernels. In comparison, all kernel machines on a fully-connected neural network layer share kernels but not weights. Best viewed in color.

Lemma 1. For any ϕ : X → R^p for some p ∈ N \ {0}, k(·, ·) := ⟨ϕ(·), ϕ(·)⟩_{R^p} is a positive semidefinite kernel on X.

Proof. It suffices to show that for any c_i ∈ R, x_i ∈ X, i = 1, ..., n, we have

∑_{i=1}^n ∑_{j=1}^n c_i c_j k(x_i, x_j) ≥ 0.   (5-4)

To see that this is indeed true, we have

∑_{i=1}^n ∑_{j=1}^n c_i c_j k(x_i, x_j) = c^⊤ K c = c^⊤ P^⊤ P c = ‖Pc‖² ≥ 0,   (5-5)

where c = (c_1, ..., c_n)^⊤, K is the matrix whose ijth entry is k(x_i, x_j), and P is the matrix whose ith column is ϕ(x_i).
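As a quick numerical illustration of the Lemma (not part of the original text), the Gram matrix K = PᵀP built from any finite feature map is symmetric with nonnegative eigenvalues; the choice of ReLU as the feature map and the random data below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda v: np.maximum(v, 0.0)

# Any finite-dimensional feature map phi yields a PSD kernel k(x, y) = <phi(x), phi(y)>.
phi = relu                                   # here: the elementwise ReLU, as in k2 above
X = rng.normal(size=(50, 8))                 # 50 arbitrary points
P = np.stack([phi(x) for x in X], axis=1)    # columns are phi(x_i)
K = P.T @ P                                  # Gram matrix, K_ij = k(x_i, x_j)

eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)               # True: eigenvalues are (numerically) nonnegative
```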


5.2.2 Convolutional Neural Networks

We show that convolutional layers, similar to the fully-connected ones, can also be interpreted as instantiations of kernel network layers. Although we use 2D convolution layers as an example in this discussion, the same idea can be easily extended to more complex models.

Assume we are given a network consisting solely of 2D convolution layers (without padding) with the output layer being linear. Then, similar to the fully-connected case, we first absorb the ending elementwise nonlinearity ϕ_i of each layer into the immediate downstream one. Now, supposing the activation tensor of layer i is X^{(i)} ∈ R^{H_i × W_i × C_i}, each node on layer i+1 is a kernel machine, with some nodes sharing weights with each other. There are C_{i+1} sets of weight-sharing nodes, with each set having exactly H_{i+1} × W_{i+1} elements.

We index each set of weight-sharing kernel machines with the double index p ∈ P, q ∈ Q. Then each kernel machine within a weight-sharing set has feature map ψ_{i+1,p,q}(X) := r_{i+1,p,q} ∘ ϕ_i(X), where r_{i+1,p,q} : R^{H_i × W_i × C_i} → R^{hwC_i} : Z ↦ [s_{i+1,p,q}(Z[:, :, 1]), ..., s_{i+1,p,q}(Z[:, :, C_i])]. Here, s_{i+1,p,q} denotes an operator that returns a vectorized receptive field of size h, w centered at a specific location (depending on p, q and parameters of the convolution including, e.g., stride) upon receiving a matrix (the exact formulation of this operator depends on convolution parameters such as dilation), and [·, ..., ·] denotes vector concatenation.

Layer i + 1 is formed by concatenating these individual kernel machines (in three

dimensions, forming a tensor). Specifically, all kernel machines sharing weights (and only these

kernel machines) are grouped into a matrix (a channel) with order determined by the double

index p, q. The channels are further concatenated to form the entire layer i+ 1.

This is illustrated in Fig. 5-2.
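A minimal PyTorch sketch (an illustration under the assumptions above, not the dissertation's code) makes the weight sharing explicit: unfolding the input collects each receptive field, i.e., the per-location feature ψ, and one shared weight vector per output channel is dotted with every one of them, which reproduces the standard convolution. Shapes and the ReLU nonlinearity are arbitrary choices for the example.

```python
import torch
import torch.nn.functional as F

N, C_in, H, W = 1, 3, 8, 8
C_out, kh, kw = 4, 3, 3
x = torch.randn(N, C_in, H, W)
weight = torch.randn(C_out, C_in, kh, kw)   # shared across all spatial locations
phi = torch.relu                            # nonlinearity absorbed from the upstream layer

# Each column of `fields` is one vectorized receptive field of phi(x):
# the feature map psi_{p,q} evaluated at that location.
fields = F.unfold(phi(x), kernel_size=(kh, kw))           # (N, C_in*kh*kw, L)

# One kernel machine per output channel and location: <w_c, psi_{p,q}(x)>.
out = weight.view(C_out, -1) @ fields                     # (N, C_out, L)
out = out.view(N, C_out, H - kh + 1, W - kw + 1)

# Same result as the usual convolution applied after the nonlinearity.
assert torch.allclose(out, F.conv2d(phi(x), weight), atol=1e-5)
```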

From a kernel network standpoint, convolutional layers differ from fully-connected ones mainly in the following two aspects. First, in the convolutional case, some kernel machines within a layer share weights, whereas fully-connected layers do not impose such a constraint. Second, the kernel machines within each weight-sharing set on a given convolutional layer use distinct kernels. In comparison, all nodes on a fully-connected layer share the same kernel.

These observations allow for a new perspective in interpreting convolutional networks. Indeed, the fact that each convolutional layer is essentially a collection of kernel machines using distinct kernels but sharing weights suggests that it can be viewed as an instantiation of the multiple kernel learning framework that has been studied in the kernel method literature for decades (Gonen & Alpaydın, 2011), although the composition here is different because it is an embedding of functions.

In MKL, the goal is to learn a composite kernel from a pool of base kernels in the hope that this new kernel is better-suited for the task at hand. Given a pool of base kernels k_1, ..., k_m, a popular choice of composite kernel model is k = ∑_{i=1}^m η_i k_i, where the η_i's are free parameters. Convolutional networks can be viewed as a new deep MKL scheme, where the importance of each kernel in any given layer is learned implicitly through the training of the network.

5.2.3 Recurrent Neural Networks

Recurrent NNs (RNNs) can also be formulated as kernel networks. Due to the degree of arbitrariness involved in RNNs, we demonstrate the process on a specific architecture, although the same idea works for other models too. As before, we assume without loss of generality that the output layer is linear.

Given an input sequence {x_t}_{t=1}^T, x_t ∈ R^{d_0}, consider the following RNN instantiation

(sometimes referred to as the Jordan network (Jordan, 1997)):

h_t = ϕ(W_h^⊤ [x_t, h_{t−1}] + b_h),
y_t = w_y^⊤ h_t + b_y,   (5-6)

where [·, ·] denotes vector concatenation, W_h, w_y, b_h, b_y are the learnable weights, and ϕ is a given elementwise nonlinearity. Letting W_{h,t} denote the weight matrix of the hidden layer at time t, and defining similar notation for the other weights and biases, we can rewrite this formulation as

G_{1,t}(x_t) = ϕ(W_{h,t}^⊤ [x_t, G_{1,t−1}(x_{t−1})] + b_{h,t}),
u_t(x_t) = g_{2,t}(G_{1,t}(x_t)) = w_{y,t}^⊤ G_{1,t}(x_t) + b_{y,t}.   (5-7)

To see that this can be re-written into a KN, first, we re-group layers such that each new node consists of a nonlinearity followed by an inner product with weights and a summation with a bias. Then the output node becomes a kernel machine f_{2,t}(·) = ⟨w_{y,t}, ϕ(·)⟩_{R^{d_1}} + b_{y,t}, where d_1 is the width of the hidden layer. The kernel is defined through k_2(·, ·) = ⟨ϕ(·), ϕ(·)⟩_{R^{d_1}}. Each node on the hidden layer F_{1,t}, denoted f_{1,j,t}, is a kernel machine f_{1,j,t}(·) = ⟨W_{h,t}^{(j)}, ψ_t(·)⟩_{R^{d_0+d_1}} + b_{h,t}, where ψ_t is defined as ψ_t(·) = [·, ϕ(F_{1,t−1}(x_{t−1}))] and F_{1,t−1}(x_{t−1}) is the output of F_{1,t−1} on x_{t−1}, which can be further expanded backwards in time. The kernel of this kernel machine is defined through k_{1,t}(·, ·) = ⟨ψ_t(·), ψ_t(·)⟩_{R^{d_0+d_1}}.

The recurrency of this network manifests itself through the definition of ψ_t, which is itself recurrent. Also, it is easy to see that the feature map ψ_t (and hence the kernel k_{1,t}) has memory and is responsible for storing the internal state of the model through time.

5.2.4 Modules: Combinations of Sets of Layers

A module, in our context, is a set of layers connected in some fashion, forming an overall mapping denoted as F : R^{d_0} → R^{d_t}. An example of a module is m layers composed to form a feedforward module: F := G_m ∘ ··· ∘ G_1.

Given any module F, let the output layer be F_m and the other layers be F′, so F = F_m ∘ F′. As we have shown, as long as the network consists of layers that fall into the three categories above (fully-connected, convolutional, and recurrent), we can always assume that F_m admits the representation where each node f_{m,j} is a kernel machine f_{m,j}(·) = ⟨ϕ_{m,j}(·), w_{m,j}⟩_{R^{d_{m−1}}} + b_{m,j}. Then the entire module F can be written as a single KN layer, with each node being a kernel machine defined as

h_{m,j}(·) = ⟨ψ_{m,j}(·), w_{m,j}⟩_{R^{d_{m−1}}} + b_{m,j};
ψ_{m,j}(·) = ϕ_{m,j} ∘ F′(·).   (5-8)

Clearly, this kernel machine has a trainable feature map ψm,j.

5.2.5 Add-Ons

There are many popular operators in NNs that are typically used alongside one of the

three main layer types instead of independently. And they can also be reformulated as KN

layers.

5.2.5.1 Batch normalization

Suppose we are given a layer F_1 : R^{d_0} → R^{d_1} that is of one of the types we have covered earlier (fully-connected, convolutional, or recurrent), and that a batch normalization operator (Ioffe & Szegedy, 2015) is applied to its input. Suppose F_1 has already been formulated as a KN layer, i.e., each node is a kernel machine with f_{1,j}(·) = ⟨w_{1,j}, ϕ_{1,j}(·)⟩_{R^{d_0}} + b_{1,j}. Denoting the batch normalization operator as ϕ_BN : R^{d_0} → R^{d_0}, the entire layer (batch normalization together with F_1) can be written as a new KN layer with each node being a kernel machine defined as

f_{BN,1,j}(·) = ⟨w_{1,j}, ϕ_{BN,1,j}(·)⟩_{R^{d_0}} + b_{1,j};
ϕ_{BN,1,j} := ϕ_{1,j} ∘ ϕ_BN;
ϕ_BN(·) = ((· − µ) / σ) ∗ γ + β,   (5-9)

where µ, σ are the sample mean and sample standard deviation vector, respectively, and γ, β are learnable parameters of ϕ_BN. In other words, ϕ_BN both has an internal state and is trainable. Therefore, the kernel machine f_{BN,1,j} has both memory and a learnable feature map.

5.2.5.2 Pooling and padding layers

Pooling layers are typically used to reduce feature map dimension and/or filter out irrelevant information. A pooling layer is similar to a convolutional layer. Let the input tensor be X ∈ R^{H_i × W_i × C_i}. A pooling layer can be considered as a set of kernel machines concatenated in three dimensions. We index the kernel machines with the triple index p ∈ P, q ∈ Q, t ∈ {1, ..., C_i}, where the exact elements in P, Q depend on, e.g., the stride of the pooling operation. Then each kernel machine has feature map ψ_{i+1,p,q,t} : R^{H_i × W_i × C_i} → R : Z ↦ pool ∘ s_{i+1,p,q,t}(Z[:, :, t]), where s_{i+1,p,q,t} denotes an operator that returns a vectorized receptive field of size h, w centered at a specific location (depending on p, q) upon receiving a matrix, and pool is an operator that performs the pooling operation (e.g., vector mean for the average pool, max for the max pool). The weight of this kernel machine is fixed to 1.

Padding layers perform padding operations around the input tensor to provide the user with more control over the output shape. By itself, a padding layer can be represented as a kernel feature map ϕ : R^{H×W×C} → R^{H′×W′×C} that returns a new tensor that is the input tensor with extra padded values. The specific padded values depend on the type of padding operator involved. This kernel feature map can be composed with potentially some other feature map(s) and absorbed into a downstream layer's kernel machine.

5.2.5.3 Residual connection

Let F be a module.¹ One may assume without loss of generality that this module is defined on a vector. As we have seen, this module can be re-written into a KN layer with each node formulated as a kernel machine f_j(·) = ⟨ϕ_j(·), w_j⟩_{R^p} + b_j. A residual block with this module as the backbone can be written as (He et al., 2016):

F_res(x) = F(x) + x.   (5-10)

This residual block can be formulated into a KN layer with each node being a kernel machine defined as

f_{res,j}(x) = ⟨ϕ_{res,j}(x), [w_j, 1]⟩_{R^{p+1}} + b_j;
ϕ_{res,j}(x) = [ϕ_j(x), (x)_j],   (5-11)

where [·, ·] denotes vector concatenation.

¹ Residual connection is typically used within a module of multiple layers instead of a single layer, which is why we assume a module instead of a layer here.


5.3 Strength in Numbers: Universality Through Tractable Kernels

Note that all kernels involved in our earlier constructions equating NNs to KNs have

feature maps mapping from Rp to Rq for some finite p, q, where p, q are determined by the

layer architecture of the NNs. This suggests that these models, when viewed as KNs, use only

kernels with tractable feature maps. In other words, the kernel machines involved do not need

to be approximated with the kernel trick: One can directly implement the explicit linear model

representation f(x) = ⟨w,ϕ(x)⟩ + b instead. Then the computational complexity of running

the network on a dataset is linear in the size of the dataset, contrasting the super-quadratic

growth of typical kernel methods.

The implication of this observation is profound. First, note that to build a universal kernel machine, one has to use a kernel that has an intractable feature map, such as the Gaussian, since simpler kernels do not possess such a universality property (Micchelli et al., 2006). This directly causes the kernel machine to have super-quadratic runtime due to the necessary use of the kernel trick approximation. On the other hand, our NN-equivalent KNs are universal

approximators since the NNs are. However, these KNs use simple, generic kernels with

tractable feature maps throughout their architectures. Together, these observations show

that connectionism removes the need for a complicated kernel in order for the resulting kernel

method to be universal, serving as another strong argument in favor of combining kernel

method with connectionism, as we have done in proposing KNs. To put this in some more

intuitive but less rigorous terms, going “deep” allows kernel method to be expressive without

using an intractable kernel.

5.4 Neural Operator Design Is a Way to Encode Prior Knowledge Into Kernel Machines

There exists a large body of work on designing kernels for kernel methods with prior

knowledge on the learning task encoded into the design (Scholkopf & Smola, 2001). A good

example would be the string kernels (Scholkopf & Smola, 2001), which are kernel functions

defined on finite symbol sequences (strings) in order to handle data of that nature and have


been proven highly useful in learning tasks where sequence data are processed, such as text

classification.

On the other hand, the amount of work on designing specialized modules and components for NNs has grown rapidly over the past decade. Similar to kernel design, these works seek to inject some prior belief about the task at hand into the model in the hope that learning can be accelerated or simplified. A representative work in this regard is the ResNet (He et al., 2016), where the authors introduced the residual connection, i.e., summing module input directly to its output, to reflect the notion that deep models should be at least as good as their shallow counterparts since the extra layers can simply learn to approximate the identity mapping and recover the shallow networks. This has been shown to tremendously ease the training of deep architectures by mitigating the vanishing gradient problem.

As we have shown in this chapter, these specialized neural operators can be reformulated

into kernel machines with specialized kernels and/or constraints on weights. These kernels and

constraints evidently reflect the same prior belief represented by the original neural operator.

In other words, neural operator design can also be considered from a kernel design perspective.

This provides a unified view on the two important yet hitherto separated research topics.


CHAPTER 6
A PROVABLY OPTIMAL MODULAR LEARNING FRAMEWORK

In this chapter, we study the modular learning of deep architectures, a problem that is of

interest to theorists and practitioners alike.

In any scalable engineering workflow, modularization is among the core components.

Indeed, to build pretty much anything at scale, the rule of thumb is to “divide and conquer”,

i.e., divide the overall pipeline into components with clearly-defined functionalities, and then

design, implement, and fine-tune each component individually before gradually connecting

them back together to form the pipeline. Such a philosophy is advantageous in many aspects. First, breaking the overall task into subtasks significantly reduces the scale of each task space, making it much more tractable to engineer a solution. Moreover, such a practice enables clearly-defined unit tests that further enable precisely pinpointing sources of error. A modular system is also more understandable and conveys more information to people beyond the designer herself. Last but not least, each functional module can potentially be reused for new tasks, drastically reducing the time and effort needed to pivot a pipeline to a different task.

Contrasting the rest of the engineering world, deep learning models are typically not modularized. In fact, they cannot be reliably modularized because their training has to be end-to-end. Therefore, deep learning engineers often have to give up on leveraging the powerful concept of modularization, and deep learning workflows have been notoriously vague, labor-consuming, and at times even painful. As an example illustrating the ramifications of the absence of modularization, one of the major bottlenecks in scaling up a deep learning pipeline is the lack of reliable testing procedures like those used in regular software engineering. Indeed, there exist mature continuous integration test procedures based on unit tests for regular software to ensure, e.g., the backward compatibility of any software update. But a group of deep learning engineers working on the same model backbone can almost never guarantee that a newly-added network head does not worsen the performance of other heads on existing tasks. And, when performance is in fact worsened, there is no reliable procedure for debugging.

To enable modular deep learning workflows, we develop a modular learning framework

for classification that trains each module in the network individually without the need for

end-to-end fine-tuning. The key contribution is that we prove this training procedure, albeit

modular, still finds the overall best solution as an end-to-end approach would. The optimality

of this framework holds for a large family of objective functions for NNs and KNs alike, using

our earlier observation that NNs can be interpreted as KNs.

Specifically, focusing on the two-module case, we prove that the training of input and

output modules can be decoupled without compromise in performance by leveraging pairwise

kernel evaluations on training examples from distinct classes, where the kernel is defined by the

output module’s nonlinearity. It suffices to have the input module optimize a proxy objective

function that does not involve the trainable parameters of the output one, removing the need

for error backpropagation between modules. The idea can be easily generalized to enable

modular training with more than two modules by analyzing one pair of modules at a time.

This is, to the best of our knowledge, the first purely modular yet provably optimal

training scheme for NNs, opening up the possibility of modularized deep learning with improved

user-friendliness, interpretability, maintainability, and reusability.

Besides enabling modularized pipelines, our proposed training method utilizes labels

more efficiently than the existing end-to-end backpropagation. To be specific, the training

of the latent modules only requires relative labels on pairs of data, that is, whether each

pair of examples is from the same class or not. The exact class of each data example

is not needed. This is evidently a weaker form of supervision than the full labels used in

backpropagation. The training of the output module in our framework indeed requires full

labels just as backpropagation. However, we empirically show that, given that the latent

modules have been well-trained, the output module is extremely label-efficient, needing as few

as a single full label per class to achieve the same accuracy as end-to-end backpropagation.


Overall, our modular training requires a different but more efficient form of supervision than

the existing end-to-end backpropagation and can potentially enable less costly procedures for

acquiring labeled data and more powerful un/semi-supervised learning algorithms.

In the following, we first present our proposed learning method in detail. Then, we quantitatively demonstrate the superior label efficiency of our proposed training method,

validating our claim. To showcase one of the major benefits of modular workflows — module

reuse with confidence — we propose a simple method using components from our modular

learning framework (specifically, our proxy objective function) that enables efficient yet reliable

module reusability estimation, i.e., quantifying the competence of a pre-trained module on

a new target task. Specifically, this method allows the user to effectively determine among

numerous network bodies pre-trained on different datasets which one is the most suitable for

the task at hand at practically no computational cost. Moreover, this method can be extended

to measure task transferability, a central problem in transfer learning, continual/lifelong

learning, and multi-task learning (Tran et al., 2019). Unlike many existing methods, our

approach requires no training and is task-agnostic, flexible, and completely data-driven.

6.1 The Modular Learning Methodology

In this section, we present our method for modular training with two modules. A

generalization to cases with more modules can be achieved by analyzing pairs of modules

at a time with our results here. The set-up and goal are first described, followed by a sketch of

our idea. Then the main theoretical results are provided, proving the optimality of our proposed

method.

6.1.1 The Setting, Goal, and Idea

Suppose we have a deep feedforward model consisting of two modules, F = F_2 ∘ F_1, and an objective function L(F, S), where S is a training set S = {(x_i, y_i)}_{i=1}^n. Note that both F_2 and F_1 can be compositions of an arbitrary number of layers.

The goal of our proposed modular learning framework can be described as follows.


1. The training algorithm works by first learning F_1 without touching F_2, freezing it afterwards at, say, F_1′, and then learning F_2 (without fine-tuning F_1). Suppose the output module converges at F_2′.

2. F_2′ ∘ F_1′ ∈ argmin_F L(F, S).

To achieve this goal, in particular part 2, an important observation is that, for a given S, if we define

F_1^⋆ := {F_1 : ∃ F_2 s.t. F_2 ∘ F_1 ∈ argmin_F L(F, S)},   (6-1)

then the goal during the training of F_1 can be to find an F_1′ that is in F_1^⋆. With that done, one can simply train F_2 to minimize L(F_2 ∘ F_1′, S), and the resulting minimizer F_2′ will satisfy F_2′ ∘ F_1′ ∈ argmin_F L(F, S). And if we can characterize F_1^⋆ independently of the trainable parameters of F_2, the training of F_1 will not involve training F_2.

Therefore, the key missing component in this framework is an F_2-free characterization of F_1^⋆. More concretely, denote the trainable parameters of F_2 as θ_2 and the nontrainable ones, e.g., layer width, as ω_2, and assume that we train F_1 by maximizing a proxy objective function L_1. We need to find a proxy objective function L_1 that is only a function of F_1, ω_2, S and such that

argmax_{F_1} L_1(F_1, ω_2, S) ∈ F_1^⋆.   (6-2)

We now show that this is possible for a large family of L under mild architectural assumptions on F_2.

6.1.2 The Main Theoretical Result

Some assumptions are imposed before we discuss the main result. To simplify the

presentation, we discuss only binary classification in this section. The result easily extends to

classification with more classes. With this binary classification assumption, we may further

assume that F2 is scalar-valued. In other words, we may write f2 instead. Further, we assume

that f2 admits the form f2(·) = ⟨w,ϕ(·)⟩ + b with kernel k(·, ·) = ⟨ϕ(·),ϕ(·)⟩. This

assumption is actually satisfied by a large family of architectures, per our previous discussion

that many NN components can be rewritten into this form. And we may assume without loss


of generality that there is no further nonlinearity at the output of f2 since if there is any, we

could absorb it into the formulation of the objective function instead.

The following Theorem states that an F_1^⋆ can be characterized solely using pairwise evaluations of k on pairs of training data from distinct classes. This Theorem is proved in Appendix B.

Theorem 6.1. Let S = {(x_i, y_i)}_{i=1}^n, x_i ∈ R^{d_0}, y_i ∈ {+, −}, ∀i, be given and consider F_1 : R^{d_0} → R^{d_1}, f_2 : R^{d_1} → R : z ↦ ⟨w, ϕ(z)⟩ + b, where w, b are free parameters and ϕ is a given mapping into a real inner product space with ‖ϕ(u)‖ = α for all u ∈ R^{d_1} and some fixed α > 0.¹ Let I_+ be the set of i's in {1, ..., n} such that y_i = +, and let I_− be the set of j's in {1, ..., n} such that y_j = −. Suppose the objective function L admits the following form:

L(f_2 ∘ F_1, S) = (1/n) ∑_{i∈I_+} ℓ_+(f_2 ∘ F_1(x_i)) + (1/n) ∑_{j∈I_−} ℓ_−(f_2 ∘ F_1(x_j)) + λ g(‖w‖),   (6-3)

where λ ≥ 0 and g, ℓ_+, ℓ_− are all real-valued with g, ℓ_− nondecreasing and ℓ_+ nonincreasing.

For an F_1^⋆, let f_2^⋆ be in argmin_{f_2} L(f_2 ∘ F_1^⋆, S). If, ∀i ∈ I_+, j ∈ I_−, F_1^⋆ satisfies

‖ϕ(F_1^⋆(x_i)) − ϕ(F_1^⋆(x_j))‖ ≥ ‖ϕ(s) − ϕ(t)‖,  ∀s, t ∈ R^{d_1},   (6-4)

then

f_2^⋆ ∘ F_1^⋆ ∈ argmin_{f_2 ∘ F_1} L(f_2 ∘ F_1, S).   (6-5)

¹ Throughout, we consider the natural norm induced by the inner product, i.e., ‖t‖² := ⟨t, t⟩, ∀t.

Remark: Defining the kernel

k(F_1(u), F_1(v)) = ⟨ϕ(F_1(u)), ϕ(F_1(v))⟩,   (6-6)

Eq. 6-4 is equivalent to

k(F_1^⋆(x_i), F_1^⋆(x_j)) ≤ k(s, t),  ∀s, t ∈ R^{d_1}, ∀i ∈ I_+, j ∈ I_−.   (6-7)

Further, if the infimum of k(u, v) is attained in R^{d_1} × R^{d_1} and equals β, then Eq. 6-4 is equivalent to

k(F_1^⋆(x_i), F_1^⋆(x_j)) = β,  ∀i ∈ I_+, j ∈ I_−.   (6-8)

Interpreting a two-module classifier f2 F1 as F1 learning a new representation of the

given data on which f2 will carry out the classification, the intuition behind our result can

be explained as follows. Given some data, its “optimal” representation for a linear model to

classify should be the one where examples from distinct classes are located as far from each

other as possible. Thus, the optimal F1 should be fully characterized as the module that

produces this optimal representation. And since our f2 is a linear model in an RKHS feature

space and because distance in an RKHS can be expressed via evaluations of its reproducing

kernel using the kernel trick, the optimal F1 can then be fully described with only pairwise

kernel evaluations over the training data, i.e., k (F1(xi),F1(xj)) for xi,xj being training

examples from different classes.

6.1.3 Applicability of the Main Result

We now show that the assumptions imposed by this result are in fact satisfied in many

popular classification set-ups.

6.1.3.1 Network architecture

In terms of the network architecture, the Theorem essentially assumes that the node on the output layer admits a kernel machine representation. This is evidently satisfied by KNs, for which all nodes are kernel machines. On the other hand, we have shown in Chapter 5 that many NN architectures also admit this representation, indicating that this assumption is satisfied by NNs, neural-classical kernel hybrid networks, and KNs alike.

We now provide details on one way to see how most popular feedforward NN backbones satisfy the architectural assumption in the Theorem. There are other ways to fit these backbones into a model formulation that works with the Theorem, but they may not immediately yield an implementable modular training algorithm. For example, as we have discussed in Chapter 5, some NN layers can be abstracted as kernel machines but with trainable ϕ. These abstractions may satisfy the assumptions of the Theorem, but won't work

with the modular training algorithm proposed in the next section, which requires ϕ to be a

fixed function.

Most feedforward NN structures, including the ResNet (He et al., 2016) and the VGG (Simonyan & Zisserman, 2014), admit the following representation:

f = G_2 ∘ G_1,   (6-9)

where G_2 is the output linear layer, g_{2,j}(·) = ⟨w_{2,j}, ·⟩ + b_{2,j}, and G_1 can be a composition of arbitrary layers ending with an elementwise nonlinearity ϕ. When the model uses a nonlinearity on top of the output layer, one may absorb this nonlinearity into the objective function such that the model would still assume the aforementioned representation. That this model satisfies the condition on network architecture in the Theorem becomes evident after re-grouping the modules. Specifically, write G_1 = ϕ ∘ F_1 for some F_1, and, considering the binary classification case with the model having a single output node, we can rewrite the model as

f = f_2 ∘ F_1,
f_2(·) = ⟨w_2, ϕ(·)⟩ + b_2.   (6-10)

This formulation of the same model satisfies the requirement of the Theorem.

Note that the assumption that ∥ϕ(u)∥ is fixed for all u may require that one normalizes

the activation vector/matrix/tensor in practice.

6.1.3.2 Objective function

Theorem 6.1 works for objective functions of a specific form defined in Eq. 6-3. This

formulation is general enough to include many popular objective functions including softmax

+ cross-entropy, any monotonic nonlinearity + mean squared error, and hinge loss. Indeed,


the empirical risks of these objective functions can be decomposed into two terms with one

nondecreasing and the other nonincreasing such that the condition required by Theorem 6.1 is

satisfied. The details are provided below.

• Softmax + cross-entropy (two-class version):

ℓ(f, S) = (1/n) ∑_{i=1}^n [ −1_{i∈I_+} ln(σ(f(x_i))) − 1_{i∈I_−} ln(1 − σ(f(x_i))) ]   (6-11)
        = (1/n) ∑_{i∈I_+} ln(e^{−f(x_i)} + 1) + (1/n) ∑_{j∈I_−} ln(e^{f(x_j)} + 1),   (6-12)

where σ is the softmax nonlinearity.

• tanh + mean squared error (this decomposition works for any monotonic nonlinearity with the value of y_i adjusted for the range of the nonlinearity):

ℓ(f, S) = (1/n) ∑_{i=1}^n (y_i − δ(f(x_i)))²   (6-13)
        = (1/n) ∑_{i∈I_+} (1 − δ(f(x_i)))² + (1/n) ∑_{j∈I_−} (1 + δ(f(x_j)))²,   (6-14)

where δ is the hyperbolic tangent nonlinearity.

• Hinge loss:

ℓ(f, S) = (1/n) ∑_{i=1}^n max(0, 1 − y_i f(x_i))   (6-15)
        = (1/n) ∑_{i∈I_+} max(0, 1 − f(x_i)) + (1/n) ∑_{j∈I_−} max(0, 1 + f(x_j)).   (6-16)

6.1.4 From Theory to Algorithm

The theoretical result can be used as the foundation of an actionable training algorithm.

As we have discussed early on in this Chapter, the missing piece linking our Theorem and

a concrete algorithm is a proxy objective function for training the hidden module. We now

present in detail such an algorithm for classification with potentially more than two classes.

Assume without loss of generality that the model architecture is F = F_2 ∘ F_1, with each node on the output layer given as f_{2,j}(·) = ⟨w_{2,j}, ϕ(·)⟩ + b_{2,j}. If ϕ does not satisfy ‖ϕ(u)‖ = α for all u and some α > 0, normalize it by, e.g., dividing (elementwise) by its norm, such that this condition is satisfied. Let an objective function L be given and suppose it (or its two-class analog) satisfies the requirement of Theorem 6.1. Define the kernel k(F_1(·), F_1(·)) = ⟨ϕ(F_1(·)), ϕ(F_1(·))⟩. Determine β := min k based on ϕ. Some examples include β = 0 for ReLU and sigmoid and β = −1 for tanh.

Given a batch of training data {(x_i, y_i)}_{i=1}^n, let N (for negative) denote all pairs of indices (i, j) such that y_i ≠ y_j, and let P (for positive) denote all pairs of indices (i, j), i ≠ j, with y_i = y_j. Train F_1 to maximize one of the following proxy objective functions.

• Alignment (negative only) (AL-NEO):

L_1(F_1) = β ∑_{(i,j)∈N} k(F_1(x_i), F_1(x_j)) / ( |β| |N|^{1/2} √( ∑_{(i,j)∈N} (k(F_1(x_i), F_1(x_j)))² ) );   (6-17)

• Contrastive (negative only) (CTS-NEO):

L_1(F_1) = − (1/|N|) ∑_{(i,j)∈N} exp(k(F_1(x_i), F_1(x_j)));   (6-18)

• Negative Mean Squared Error (negative only) (NMSE-NEO):

L_1(F_1) = − (1/|N|) ∑_{(i,j)∈N} (k(F_1(x_i), F_1(x_j)) − β)².   (6-19)

All of the above proxy objectives can be shown to learn an F1 that satisfies the optimality

condition required by Theorem 6.1.
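For concreteness, the following PyTorch sketch (an illustration, not the dissertation's released code) computes the NMSE-NEO proxy of Eq. 6-19 for a batch, assuming tanh as the output module's nonlinearity, features normalized to unit norm so that β = −1, and thus k(u, v) = ⟨ϕ(u)/‖ϕ(u)‖, ϕ(v)/‖ϕ(v)‖⟩; the hidden-module architecture and optimizer are arbitrary.

```python
import torch
import torch.nn.functional as F

def nmse_neo_proxy(h, y, beta=-1.0):
    """NMSE-NEO proxy (Eq. 6-19) to be *maximized* by the hidden module.
    h: hidden-module outputs F1(x), shape (batch, d1); y: integer class labels.
    Assumes the batch contains at least one pair of examples from distinct classes."""
    phi = F.normalize(torch.tanh(h), dim=1)    # phi(F1(x)), normalized so ||phi|| = 1
    K = phi @ phi.T                            # pairwise kernel values k(F1(xi), F1(xj))
    neg = (y[:, None] != y[None, :])           # N: pairs from distinct classes
    return -((K[neg] - beta) ** 2).mean()

# Usage: gradient ascent on the proxy (here via minimizing its negation).
F1 = torch.nn.Sequential(torch.nn.Linear(10, 64), torch.nn.ReLU(), torch.nn.Linear(64, 16))
opt = torch.optim.Adam(F1.parameters(), lr=1e-3)
x, y = torch.randn(32, 10), torch.randint(0, 3, (32,))
loss = -nmse_neo_proxy(F1(x), y)
loss.backward()
opt.step()
```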

Note that in cases where some of these proxies are undefined (for example, when β = 0), we may train F_1 to maximize the following alternative proxy objectives instead, assuming α := sup k is known, where we define k^⋆_{ij} to be α if (i, j) ∈ P or if i = j, and β otherwise. Compared to the previous proxies, these are applicable to all k's and impose a stronger constraint on learning. Specifically, in addition to controlling the inter-class pairs, these proxies force intra-class pairs to share representations.

Figure 6-1. The proposed modular training framework (two-module case) consists of two training stages. Suppose we are given a two-module model F_2 ∘ F_1 and an overall classification objective function L(F_2 ∘ F_1), e.g., cross-entropy with weight regularization. In Stage 1, the input module F_1 is trained to maximize a proxy hidden objective as defined in Sec. 6.1.4, which requires only pairs of examples from distinct classes (no explicit targets). The input module is then frozen at, say, F_1′. In Stage 2, the output module is trained to minimize L(F_2 ∘ F_1′) using the full labels. Note that the training of the input module does not involve training the output module, and vice versa. In other words, the training process is fully modular.

• Alignment (AL): The kernel alignment (Cristianini et al., 2002) between the kernel matrix formed by k on the given data and the target matrix formed by the k^⋆_{ij}. This is also closely related to the Cauchy-Schwarz divergence in information theoretic learning (Principe, 2010).

• Upper Triangle Alignment (UTAL): Same as AL, except that only the upper triangles minus the main diagonals of the two matrices are considered. This can be considered a refined version of the raw alignment.

• Contrastive (CTS):

L_1(F_1) = ∑_{(i,j)∈P} exp(k(F_1(x_i), F_1(x_j))) / ∑_{(i,j)∈N∪P} exp(k(F_1(x_i), F_1(x_j)));   (6-20)

• Negative Mean Squared Error (NMSE):

L_1(F_1) = − (1/n²) ∑_{(i,j)} (k(F_1(x_i), F_1(x_j)) − k^⋆_{ij})².   (6-21)

Empirically, we found that the alignment and squared-error proxies typically produced better results than the contrastive ones, potentially because the exponential term involved in the contrastive objectives narrows the range of usable learning rates.

Now suppose F_1 has been trained and frozen at F_1′; we simply train F_2 to minimize the overall objective function L(F_2 ∘ F_1′).

This training algorithm is illustrated in Fig. 6-1.
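The two-stage procedure can be sketched end to end as follows (an illustration under stated assumptions, not the dissertation's code): stage 1 maximizes the NMSE proxy of Eq. 6-21 with targets k^⋆_{ij}, assuming tanh with unit-norm normalization so that α = 1 and β = −1; stage 2 fits the output module with cross-entropy on full labels. The synthetic data loader, architectures, and optimizers are placeholders.

```python
import torch
import torch.nn.functional as F

def nmse_proxy(h, y, alpha=1.0, beta=-1.0):
    """NMSE proxy (Eq. 6-21): match pairwise kernel values to the targets k*_{ij},
    i.e., alpha for same-class pairs (and i = j) and beta for different-class pairs."""
    phi = F.normalize(torch.tanh(h), dim=1)          # ||phi|| = 1, so sup k = 1, inf k = -1
    K = phi @ phi.T
    K_star = torch.where(y[:, None] == y[None, :],
                         torch.full_like(K, alpha), torch.full_like(K, beta))
    return -((K - K_star) ** 2).mean()

# Placeholder data and modules.
loader = [(torch.randn(32, 10), torch.randint(0, 3, (32,))) for _ in range(10)]
F1 = torch.nn.Sequential(torch.nn.Linear(10, 64), torch.nn.ReLU(), torch.nn.Linear(64, 16))
f2 = torch.nn.Linear(16, 3)                          # output module acting on phi(F1(x))

# Stage 1: train the input module alone on the proxy (pairwise supervision only).
opt1 = torch.optim.Adam(F1.parameters(), lr=1e-3)
for x, y in loader:
    opt1.zero_grad()
    (-nmse_proxy(F1(x), y)).backward()
    opt1.step()
for p in F1.parameters():                            # freeze at F1'
    p.requires_grad_(False)

# Stage 2: train only the output module on the overall objective (full labels).
opt2 = torch.optim.Adam(f2.parameters(), lr=1e-3)
for x, y in loader:
    opt2.zero_grad()
    logits = f2(F.normalize(torch.tanh(F1(x)), dim=1))
    F.cross_entropy(logits, y).backward()
    opt2.step()
```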


6.1.4.1 Geometric interpretation of learning dynamics

The sufficient conditions described by Theorem 6.1 can be interpreted geometrically:

Under an F1 satisfying these conditions, images of examples from distinct classes are as distant

as possible in the RKHS induced by k. Intuitively, such a representation is the “easiest” for the

classification task, and our theorem justifies this intuition rigorously.

Therefore, the learning dynamics of this training algorithm on the latent module can

be given a straightforward geometric interpretation: It trains the latent module to push

apart examples from different classes in a feature space. When using the second set of

proxy objectives, the algorithm also squeezes together those examples within the same class.

Eventually, the output module works as a classifier on the final hidden representation in the

feature space.

6.1.4.2 Accelerating the approximated kernel network layers

When using the kernel trick to approximate some kernel machines in the network,

there is a natural method to accelerate the approximated kernel machines on non-input

layers and reduce the otherwise super-quadratic runtime: When using the second set of

proxy objectives, the hidden targets are sparse in the sense that for an ideal F1, we have

ϕ(F1(x_p)) = ϕ(F1(x_q)) for those x_p, x_q with y_p = y_q and ϕ(F1(x_p)) ≠ ϕ(F1(x_q)) otherwise. Since for approximated kernel machines we usually approximate w_{2,j} using \sum_{p=1}^{m} \alpha_{2,j,p}\, ϕ(F1(x_p)) with some examples from the training set as the centers, retaining only one example from each class results in exactly the same hypothesis class for the model f_{2,j}, because (in the two-class case) \{\sum_{p=1}^{m} \alpha_{2,j,p}\, ϕ(F1(x_p)) : \alpha_{2,j,p} \in \mathbb{R}\} = \{\alpha_+ ϕ(F1(x_+)) + \alpha_- ϕ(F1(x_-)) : \alpha_+, \alpha_- \in \mathbb{R}\} for arbitrary x_+, x_- in the training set.

Thus, after training a given hidden module, depending on how well its objective function

has been maximized, one may discard some of the centers for kernel machines of the next

module to speed up the training of that module without sacrificing performance. This

trick also has a regularization effect on the kernel machines since the number of trainable


parameters of an approximated kernel machine grows linearly in the number of its centers and

fewer centers would therefore result in fewer parameters.
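A minimal sketch of this acceleration trick, under the assumption that the previous module has been trained well enough for a single retained center per class to suffice (helper name and defaults are illustrative):

```python
import numpy as np

def prune_centers(X_centers, y_centers, per_class=1, seed=0):
    """Keep only `per_class` centers per class before training the next module."""
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(y_centers):
        idx = np.flatnonzero(y_centers == c)
        keep.extend(rng.choice(idx, size=min(per_class, idx.size), replace=False))
    keep = np.sort(np.array(keep))
    return X_centers[keep], y_centers[keep]
```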

6.2 A Method for Module Reusability and Task Transferability Estimation

Among the many benefits of modular workflows, one that is of particular interest to

practitioners is improved module reusability. For example, in software engineering, standalone

components can be reused across the system, saving developers from re-implementations,

thanks to how the functionality of each component can be clearly defined without overlap or

significant dependence between modules. This degree of reliable module reusability, although

common in most practices of engineering, is sorely missed in deep learning workflows.

In this section, we demonstrate how our proposed modular training enables modular

workflows and subsequently a simple yet highly reliable module reusability estimation approach.

Formally, we consider the following problem: Assume that one is given a set of network bodies

pre-trained on some source tasks, i.e., tasks that we have already solved. Then consider a

target task, i.e., a task to be solved. If one were to pick a pre-trained network body, freeze it,

and then train a new network head on top of it to solve the target task, which pre-trained body

should one pick in order to maximize performance on the target task? We name this problem

module reusability estimation.

This practical problem is essentially equivalent to the theoretical problem of describing

task space structure among tasks. Specifically, given a set of tasks, we would like to quantitatively describe the relationships among them. One popular notion of a task-pair relationship is task transferability, that is, how helpful the features useful for one source task are to a target task. This theoretical issue is central to many important research

domains including transfer learning, continual/lifelong learning, meta learning, and multi-task

learning (Zamir et al., 2018; Achille et al., 2019; Tran et al., 2019; Nguyen et al., 2020).

We propose a solution to module reusability estimation based on components from our

modular learning framework. Let a target task be characterized by the objective function

L(F2 ∘ F1, S), where L is an objective function satisfying the requirements in Theorem 6.1 and


S is some training data. And suppose we are given a set of pre-trained F1’s. This formulation

is standard in the task transferability literature (Tran et al., 2019). Our modular learning

framework suggests that we can measure the goodness of these pre-trained F1’s for this target

task by measuring L1(F1, S), where L1 can be any proxy objective defined in Sec. 6.1.4. This

is based on the fact that the F1 that maximizes the proxy objective L1(F1, S) constitutes

the input module of a minimizer of L(F2 ∘ F1, S), per Theorem 6.1. Therefore, one can simply rank the pre-trained F1's in terms of how well they maximize L1(F1, S) and select the maximizer among the given candidates.
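A sketch of this reusability estimation procedure is given below; it assumes a dictionary of frozen candidate bodies, any proxy from Sec. 6.1.4 operating on a kernel matrix, and, purely for illustration, the linear kernel on unit-normalized features (all names are ours):

```python
import torch
import torch.nn.functional as F

def rank_pretrained_bodies(bodies, proxy, loader, device="cpu"):
    """Rank frozen pre-trained bodies by the proxy value they attain on (a subset of)
    the target-task data; no training is involved."""
    scores = {}
    with torch.no_grad():
        for name, body in bodies.items():
            body = body.to(device).eval()
            vals = []
            for x, y in loader:
                z = F.normalize(body(x.to(device)), dim=1)  # unit-norm features
                K = z @ z.t()                               # linear kernel, as an example
                vals.append(proxy(K, y.to(device)).item())
            scores[name] = sum(vals) / len(vals)
    best = max(scores, key=scores.get)
    return best, scores
```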

How does this serve as a solution to the theoretical problem of describing task space

structure? Since a well-trained network body must encode key information for the source

task, this procedure quantifies how helpful knowledge from one source task is to a target task through the lens of the optimal model performance attainable by fully training a network head on the target task, thus describing the task space structure among the given tasks in terms of

transferability.

The main advantage of our proposed solution is that this procedure requires no training.

Indeed, one only needs to run the given modules on (potentially a subset of) S. Empirically, we found that a small subset of the target task training data is usually enough to give a quality estimate. Moreover, this method is task-agnostic in the sense that it works regardless

of the tasks (datasets) at hand, without the need for modification.

In terms of the optimality of our transferability measure, if we define the true transferability

of a particular F1 to be \min_{F_2} \mathbb{E}_{(x,y)} L(F_2 \circ F_1, (x, y)) (Tran et al., 2019), then \min_{F_2} L(F_2 \circ F_1, S) is a bound on the true transferability minus a complexity measure on the model class

from which we choose our model (Bartlett & Mendelson, 2002).

Comparisons of our method against some notable related work are provided in Sec. 6.2

and summarized in Table 6-1, from which it is clear that our method is among the fastest and

most flexible.

Related Work: Model Reusability and Task Transferability


For practitioners in particular, a central issue in current deep learning practice is that highly

performant models trained with significant resources might be immediately rendered useless

on a fresh dataset. Therefore, it is highly relevant to design reliable metrics with which one

can estimate the reusability of a model or certain modules within a model, thus shortening the

development cycle of deep learning pipelines.

From a theoretical standpoint, the issue of model reusability is essentially an issue of

quantitatively describing the task landscape among a few tasks of interest in terms of their

pairwise relationships, i.e., estimating to what extent representations or features learned from

one task might aid the learning of another. Also known as task transferability estimation,

this topic is central to many important, active research domains including transfer learning,

multi-task learning, meta learning, continual/lifelong learning, etc (Zamir et al., 2018; Achille

et al., 2019; Tran et al., 2019; Nguyen et al., 2020).

The task where the representations are learned is usually referred to as the source task.

From a model reusability perspective, this is the task where the model of interest was trained

on. On the other hand, the task for which we are interested in solving with the learned

representations is called the target task. Again, for model reusability, this is the task to which

we’d like to apply our pre-trained model.

Task transferability measures based on information-theoretical concepts have been

proposed. Early task-relatedness measures include the F -relatedness (Ben-David & Schuller,

2003) and the A-distance (Ben-David et al., 2007). And more recently, H-divergence (Ganin

et al., 2016) and Wasserstein distance (Li et al., 2018) are gaining increasing attention.

Specifically, (Liu et al., 2017) applied H-divergence in natural language processing for

text classification, and (Janati et al., 2018) used Wasserstein distance to estimate the

similarity of linear parameters instead of the data generation distributions. Along this

line of research, some more recent methods include H-score (Bao et al., 2019) and the

Bregman-correntropy conditional divergence (Yu et al., 2020), the latter of which used the


correntropy functional (Liu et al., 2007) and the Bregman matrix divergence (Kulis et al.,

2009) to quantify divergence between mappings.

Another direction in which some task transferability works follow is to extract relevant

knowledge about tasks from trained models. Compared to the earlier information-theoretic

measure-based methods, these model-based ones, as the name suggests, rely on the availability

of performant trained models. Below are some notable works following this line of research.

They are more related to our approach.

Assuming that identical data was used between two tasks, (Tran et al., 2019) estimated

task transferability via the negative conditional entropy between label sequences. No actual

trained model was used in the estimation, but the existence of an optimal trained source

model was assumed. Moreover, the source and the target task were both assumed to be

characterized, besides the data, by the cross-entropy loss. This assumption on identical training

data between tasks was removed by (Nguyen et al., 2020), which also assumes

that the cross-entropy loss is the loss used by both the source and the target task.

Compared to the earlier two methods, Taskonomy (Zamir et al., 2018) and Task2Vec (Achille

et al., 2019) more heavily rely on the performance of trained models to measure task

relatedness. And they both require actually training models on target tasks. Specifically,

in Taskonomy, transferability between a given task pair is quantified through the relative

transfer performance to a third target task using either as the source. Task2Vec, on the other

hand, exhaustively tunes a “probe” network on all target tasks and extracts task-characterizing

information from the learnable parameters of this network.

All of the existing model-based methods, including our own, make limiting assumption(s)

on the set-up. The main assumption of our method, required by our modular training

optimality guarantee, is that the output module (classifier) that will be tuned on top of

the transferred component admits a particular form as described earlier. Specifically, this

module is assumed to be a linear model in a (potentially nonlinear) feature space. This is


satisfied, of course, by a single-layer NN/KN, which is the commonly-assumed classifier head in

model reusability or task transferability estimation set-ups.

We summarize the properties of these related methods in Table 6-1.

Table 6-1. Comparisons with similar methods for task transferability estimation. For the training-free methods, we also provide the test runtime in terms of n, the size of the target task dataset used for the estimation. Note that the extra trainings constitute the main overhead of the methods that require actual training, making comparisons of their test runtime against the training-free methods meaningless. Details are provided in Sec. 6.2.

Method                           Training  Test Runtime  Main Assumption
TASK2VEC (Achille et al., 2019)  Yes                     a reference model
Taskonomy (Zamir et al., 2018)   Yes                     a third reference task
NCE (Tran et al., 2019)          No        O(n^3)        tasks share input
LEEP (Nguyen et al., 2020)       No        O(n^2)        cross-entropy loss
Ours                             No        O(n^2)        form of classifier


CHAPTER 7
MODULAR LEARNING: EXPERIMENTS

In this chapter, we present experimental results validating the efficacy of our proposed

architectures and, more importantly, the modular learning method. To begin with, we perform

two sanity checks showing that modular training results in identical learning dynamics as

end-to-end and also that our proposed proxy objectives align well with overall classification

accuracy. This confirms that our approach is as effective as end-to-end in terms of learning to

minimize the overall objective despite its modular nature.

We then evaluate our learning algorithm on simple network backbones. Two main sets

of comparisons are performed. First, we compare KNs using only classical kernels (Gaussian)

with NNs (which can be viewed as KNs using NN-inspired kernels) to showcase how connectionism enables classical kernel methods to achieve strong performance. Second, we compare MLPs

with end-to-end training and other greedily-pretrained architectures with our modular learning

algorithm applied to KNs with classical kernels. This is mainly to verify the effectiveness of our

modular algorithm. These experiments are performed on relatively less challenging datasets as

they mainly serve the purpose of providing some initial insights into the KN architecture as well

as the training algorithm.

In the next section, we evaluate our modular learning method on more complex NN

backbones and more challenging datasets. These are our main results. First, we present test

results on MNIST (LeCun et al., 1998) and CIFAR-10 (Krizhevsky et al., 2009). On both

MNIST and CIFAR-10, our modular approach compares favorably against end-to-end training

in terms of accuracy. We then discuss the label efficiency of our method. Recall that the

training of the latent module involves only pairs of examples from distinct classes with no

need for knowing the actual classes. This is a weaker form of supervision than knowing exactly

which class each example belongs to as required by backpropagation. Indeed, without caching

information over samples, sampling strict subsets (mini-batches) from the training set and

knowing only this pairwise relationship on pairs of examples in each subset is insufficient


for recovering the full labels of the training set. And we empirically show that the output

module, which requires full supervision in training, is highly label-efficient, achieving 94.88%

accuracy on CIFAR-10 with 10 randomly selected labeled examples (one from each class) using

a ResNet-18 (He et al., 2016) backbone (94.93% when using all 50000 labels). The fact that

our modular approach runs almost only on pairwise labels (weak supervision) yet still produces

performant models shows that our approach can leverage supervision more efficiently than

traditional end-to-end backpropagation using full labels on individual examples. Then, we

show that our proposed task transferability estimation method accurately describes the task

space structure on 15 binary classification tasks derived from CIFAR-10 using only a small

amount of labeled data, demonstrating a practical advantage of modular workflows. Finally,

we demonstrate that modular workflows enable fast network architecture search as another

example on their practical benefits. Using a proxy objective function from our modular learning

framework, the search for optimal network depth and widths can be done with high accuracy

in polynomial time (linear in network depth), in contrast to a naive approach that would take exponential runtime.

7.1 Sanity Checks

We first perform two sanity checks to verify that our learning method can indeed

effectively learn a deep architecture without between-module backpropagation.

7.1.1 Sanity Check: Modular Training Results in Identical Learning Dynamics As End-to-End

In this section, we attempt to answer the important question: Do end-to-end and the

proposed modular training, when restricted to the architecturally identical hidden modules,

drive the underlying modules to functions that are identical in terms of minimizing the overall

loss function? Clearly, a positive answer would verify empirically the optimality of the modular

approach.

We now test this hypothesis with toy data. The set-up is as follows. We generate 1000

32-dimensional random inputs and assign them random labels from 10 classes. The underlying network consists of two modules. The input module is a (32 → 512) fully-connected (fc) layer

followed by ReLU nonlinearity and another (512 → 2) fc layer. The output module is a

(2 → 10) fc layer. The two modules are linked by a tanh nonlinearity (output normalized to

unit vector). The overall loss is the cross-entropy loss. And for modular training, the proxy

objective is CTS-NEO. We visualize the activations from the tanh nonlinearity as an indicator

of the behavior of the hidden layers under training.
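For concreteness, a minimal PyTorch sketch of this toy two-module setup is given below (the module names are ours); Stage 1 would train input_module alone to maximize CTS-NEO on pairwise labels, and Stage 2 would freeze it and train output_module with cross-entropy:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Input module: fc (32 -> 512), ReLU, fc (512 -> 2). Output module: fc (2 -> 10).
input_module = nn.Sequential(nn.Linear(32, 512), nn.ReLU(), nn.Linear(512, 2))
output_module = nn.Linear(2, 10)

def link(h):
    # tanh nonlinearity between the modules, output normalized to a unit vector
    return F.normalize(torch.tanh(h), dim=1)

# Stage 1: maximize the CTS-NEO proxy w.r.t. input_module only.
# Stage 2: freeze input_module and minimize cross-entropy on
#          logits = output_module(link(input_module(x))).
```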

From Fig. 7-1, we see that the underlying input modules are indeed driven to the same

functions (when restricted to the training data and in terms of minimizing the overall loss) by

both modular and end-to-end in the limit of training time going to infinity. This confirms the

optimality of our proposed method (albeit in a simplified set-up) and allows us to modularize

learning with confidence.

7.1.2 Sanity Check: Proxy Objectives Align Well With Accuracy

We now extend the verification of optimality from the simplified set-up in the previous

section to a more practical setting. Recall that we have characterized our proxy objective

as a function of the input module whose maximizers constitute parts of the overall objective minimizers, and we have proposed to train the input module to maximize the proxy objective. Ideally, however, a proxy objective for the input module should satisfy the following: the overall accuracy is a function of solely the proxy value and the output module, and, as a function of the proxy value, the overall accuracy is strictly increasing. This type of result is reminiscent of the

“ideal bounds” for representation learning sought for in, e.g., (Arora et al., 2019c). We now

demonstrate empirically that the proxies we proposed enjoy this property.

The set-up is as follows. We train a LeNet-5 as two modules on MNIST with (conv1 →

tanh → max pool → conv2 → tanh → max pool → fc1 → tanh → fc2) as the input

module, and (tanh → fc3) as the output one. The input module is trained for a number of different epochs so as to achieve different proxy values. For each such number of epochs, we freeze the input module and train the output module to minimize the overall cross-entropy until it converges.

Mathematically, suppose we freeze the input module at F1′; denoting the proxy as L1 and the overall accuracy as A, we visualize max_{F2} A(F2 ∘ F1′) vs. L1(F1′), where F2 is the output module.

Figure 7-1. Using toy data, we visualize the learning dynamics of our modular training approach and that of backpropagation by plotting the output representations from the input module (panels show the modular representations at epochs 200, 5100, and 12600, and the end-to-end representations at epochs 100, 6100, and 11300). We observe that the two methods result in input modules that are identical functions when restricted to the random training data, confirming that one may greedily optimize the input module using a proxy objective and obtain the same outcome as an end-to-end approach. e2e stands for end-to-end training, mdlr for modular training. Classes are color-coded.

Fig. 7-2 shows that the overall accuracy, as a function of the proxy value, is indeed

approximately increasing. Further, this positive correlation becomes near perfect in the high-accuracy regime, which agrees with our theoretical results. To summarize, we have extended our

theoretical guarantees and empirically verified that maximizing the proposed proxy objectives

effectively learns input modules that are optimal in terms of maximizing the overall accuracy,

rendering end-to-end training unnecessary.

[Figure 7-2 plot: overall accuracy (%) on MNIST vs. normalized proxy objective value, shown for the nmse_neo, utal_neo, and cts_neo proxies.]

Figure 7-2. The overall accuracy, as a function of the proxy objective, is increasing. The positive correlation becomes near perfect in the high-performance regime, validating our theoretical results. Overall, this justifies the optimality of training the input module to maximize the proxy objective. Note that the values of the illustrated proxies were normalized to [0, 1] so that they could be properly visualized in the same plot.

7.2 Modular Learning: Simple Network Backbones With Classical Kernels

These experiments were conducted with simple network backbones using classical kernels

(Gaussian) to provide initial insights into our KN architecture and the modular learning

approach. In particular, we show that connectionism helps classical kernel machines achieve

performance comparable to NNs, and that our modular learning, without bells and whistles,

compares favorably with end-to-end and end-to-end enhanced with various techniques.

7.2.1 Fully Layer-Wise kMLPs

We first demonstrate the competence of KNs and the effectiveness of the modular

learning method using simple networks (kMLPs) with classical kernels (Gaussian). The kMLPs

were trained fully layer-wise, using an extension of our two-module algorithm proposed earlier.

Here, the extension of this two-module method to networks with more than two modules was


obtained by analyzing pairs of modules at a time. Adam (Kingma & Ba, 2014) was used as

the underlying optimization algorithm. We then compare kMLPs trained with end-to-end

backpropagation against the same kMLPs trained with our modular algorithm to show the

effectiveness of the latter. We also compare kMLPs learned layer-wise with other popular deep

architectures including MLPs, deep belief networks (DBNs) (Hinton & Salakhutdinov, 2006)

and stacked autoencoders (SAEs) (Vincent et al., 2010), with the last two trained using a

combination of unsupervised greedy pre-training and standard backpropagation (Hinton et al.,

2006; Bengio et al., 2007). We also visualize the learning dynamics of modular kMLPs and

show that it is intuitive and simple to interpret.

In terms of the datasets used, rectangles, rectangles-image and convex are binary

classification datasets, mnist (50k test) and mnist (50k test) rotated are variants of MNIST.

mnist (50k test) contains 10000 training images, 2000 validation images, and 50000 test

images taken from the standard MNIST. mnist (50k test) rotated is the same as mnist (50k test) except that the digits have been randomly rotated. fashion-mnist is the Fashion-MNIST dataset (Xiao et al., 2017). These datasets all contain 28 × 28 grayscale images. In rectangles and rectangles-image, the model needs to learn whether the height of the rectangle is greater than the width, and in convex, whether the white region is convex. Examples from these datasets are shown in Fig. 7-3. For detailed descriptions of these datasets, see (Larochelle et al., 2007).

The experimental set-up for the modular kMLPs is as follows. kMLP-1 corresponds to a

one-hidden-layer kMLP with the first layer consisting of 15 to 150 kernel machines using the

same Gaussian kernel and the second layer being a single or ten (depending on the number of

classes) kernel machines using another Gaussian kernel. In training, no preprocessing was used for any model. To ensure that the comparisons with other models are fair, we

used the regularized (two-norm regularization on weights) cross-entropy loss as the objective

function for the output layer of all models. The hidden proxy objective used for modular

kMLPs was AL, NMSE, and NMSE with the squared error substituted by absolute error. On

convex, a two-hidden-layer kMLP achieved a test error rate of 19.36%, 18.53% and 21.70%


using AL, NMSE, and NMSE with absolute error as the hidden objectives, respectively. As

a baseline, our best two-hidden-layer MLP achieved an error rate of 23.28% on this dataset.

For all results that follow, we present the better result among the three. Hyperparameters

were selected using the validation set. The validation set was then used in final training only

for early-stopping based on validation error. For the standard MNIST and Fashion-MNIST,

the last 5000 training examples were held out as the validation set. kMLP-1FAST is the same kMLP, accelerated by randomly choosing a subset of the training set as centers

for the second layer after the first had been trained. The kMLP-2 and kMLP-2FAST are

the two-hidden-layer kMLPs, the second hidden layers of which contained 15 to 150 kernel

machines. Settings of all the kMLPs trained with backpropagation can be found in (Zhang

et al., 2017). Note that because it is extremely time/memory-consuming to train kMLP-2 with

backpropagation without any acceleration method, to make training possible, we could only

randomly use 10000 examples from the entire training set of 55000 examples as centers for the

kMLP-2 (e2e) from Table 7-1.

Figure 7-3. From left to right: examples from rectangles, rectangles-image, convex, mnist (50k test), and mnist (50k test) rotated.

We now test the layer-wise learning algorithm against end-to-end backpropagation using

the standard MNIST dataset (LeCun et al., 1998). Results from several MLPs were added

as benchmarks. These models were trained with Adam or RMSProp (Tieleman & Hinton,

2012) and extra training techniques such as dropout (Srivastava et al., 2014) and batch

normalization (BN) (Ioffe & Szegedy, 2015) were applied to boost performance. kMLPs

accelerated using the proposed method (kMLPFAST) were also tested, for which we randomly

discarded some centers of each non-input layer before its training. Two popular acceleration

methods for kernel machines were compared, including using a parametric representation (kMLPPARAM), i.e., for each node in a kMLP, f(·) = \sum_{p=1}^{m} \alpha_p k(·, \mathbf{w}_p) with \alpha_p, \mathbf{w}_p learnable and m a hyperparameter, and using random Fourier features (kMLPRFF) (Rahimi & Recht, 2008).
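For reference, a sketch of a single kMLP node under the parametric representation is given below; a Gaussian kernel is assumed, and the class name and defaults are illustrative rather than the exact implementation used in the experiments:

```python
import torch
import torch.nn as nn

class GaussianKernelNode(nn.Module):
    """f(x) = sum_p alpha_p * exp(-gamma * ||x - w_p||^2), with alpha_p and w_p learnable."""
    def __init__(self, in_dim, num_centers, gamma=1.0):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_centers, in_dim))
        self.alpha = nn.Parameter(torch.zeros(num_centers))
        self.gamma = gamma

    def forward(self, x):                        # x: (batch, in_dim)
        d2 = torch.cdist(x, self.centers) ** 2   # squared distances to the centers
        return torch.exp(-self.gamma * d2) @ self.alpha
```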

Table 7-1. Testing the proposed layer-wise algorithm and acceleration method on MNIST with kMLP backbone. The numbers following the model names indicate the number of hidden layers used. mdlr stands for modular training, whereas e2e stands for end-to-end. All other models were trained with end-to-end backpropagation. For kMLPFAST, we also include in parentheses the ratio between the number of training examples randomly chosen as centers for the kernel machines on the layer and the size of the training set. Apart from kMLP-2 (e2e), the backpropagation kMLP results are from (Zhang et al., 2017). The entries correspond to test errors (%) and 95% confidence intervals (%). Results with overlapping confidence intervals are considered equally good. Best results are marked in bold.

Model                     Test error (%)
MLP-1 (RMSProp+BN)        2.05 ± 0.28
MLP-1 (RMSProp+dropout)   1.77 ± 0.26
MLP-2 (RMSProp+BN)        1.58 ± 0.24
MLP-2 (RMSProp+dropout)   1.67 ± 0.25
kMLP-1PARAM (e2e)         1.88 ± 0.27
kMLP-1FAST (mdlr)         1.75 ± 0.26 (0.54)
kMLP-2 (e2e)              3.66 ± 0.37
kMLP-2 (mdlr)             1.56 ± 0.24

Table 7-2. Table 7-1, continued.

Model                     Test error (%)
kMLP-1 (e2e)              3.44 ± 0.36
kMLP-1 (mdlr)             1.77 ± 0.26
kMLP-1RFF (e2e)           2.01 ± 0.28
kMLP-2RFF (e2e)           1.92 ± 0.27
kMLP-2PARAM (e2e)         2.45 ± 0.30
kMLP-2FAST (mdlr)         1.47 ± 0.24 (1/0.19)

Results in Table 7-1 validate the effectiveness of the KN architecture and our modular

training algorithm. For both the single-hidden-layer and the two-hidden-layer kMLPs, the

layer-wise algorithm consistently outperformed backpropagation. The modular method is also

much faster than end-to-end. In fact, it is practically impossible to use backpropagation to

train kMLP with more than two hidden layers without any acceleration method due to the

computational complexity involved. Moreover, it is worth noting that the proposed acceleration

trick is clearly very effective despite its simplicity and even produced models outperforming the

original ones, which may be due to its regularization effect. This shows that kMLP together


with the greedy learning scheme can be of practical interest even when dealing with the

massive data sets in today’s machine learning.

Table 7-3. Comparing kMLPs (trained fully layer-wise) with MLPs and other popular deep architectures trained with backpropagation and backpropagation enhanced by unsupervised greedy pre-training. The MLP-1 (SGD), DBN and SAE results are from (Larochelle et al., 2007). Note that in order to be consistent with (Larochelle et al., 2007), the MNIST results below were obtained using a train/test split (10k/50k) more challenging than what is commonly used in the literature. For kMLPFAST, we also include in parentheses the ratio between the number of training examples randomly chosen as centers for the kernel machines on the layer and the size of the training set. The entries correspond to test errors (%) and 95% confidence intervals (%). Results with overlapping confidence intervals are considered equally good. Best results are marked in bold.

Model                     rectangles              rectangles-image          convex
MLP-1 (SGD)               7.16 ± 0.23             33.20 ± 0.41              32.25 ± 0.41
MLP-1 (Adam)              5.37 ± 0.20             28.82 ± 0.40              30.07 ± 0.40
MLP-1 (RMSProp+BN)        5.37 ± 0.20             23.81 ± 0.37              28.60 ± 0.40
MLP-1 (RMSProp+dropout)   5.50 ± 0.20             23.67 ± 0.37              36.28 ± 0.42
MLP-2 (SGD)               5.05 ± 0.19             22.77 ± 0.37              25.93 ± 0.38
MLP-2 (Adam)              4.36 ± 0.18             25.69 ± 0.38              25.68 ± 0.38
MLP-2 (RMSProp+BN)        4.22 ± 0.18             23.12 ± 0.37              23.28 ± 0.37
MLP-2 (RMSProp+dropout)   4.75 ± 0.19             23.24 ± 0.37              34.73 ± 0.42
DBN-1                     4.71 ± 0.19             23.69 ± 0.37              19.92 ± 0.35
DBN-3                     2.60 ± 0.14             22.50 ± 0.37              18.63 ± 0.34
SAE-3                     2.41 ± 0.13             24.05 ± 0.37              18.41 ± 0.34
kMLP-1                    2.24 ± 0.13             23.29 ± 0.37              19.15 ± 0.34
kMLP-1FAST                2.36 ± 0.13 (0.05)      23.86 ± 0.37 (0.01)       20.34 ± 0.35 (0.17)
kMLP-2                    2.24 ± 0.13             23.30 ± 0.37              18.53 ± 0.34
kMLP-2FAST                2.21 ± 0.13 (0.3/0.3)   23.24 ± 0.37 (0.01/0.3)   19.32 ± 0.35 (0.005/0.03)

From Table 7-3, we see that the performance of kMLP is on par with some of the most

popular and most mature deep architectures. In particular, the greedily-trained kMLPs with

Gaussian kernel compared favorably with their direct NN equivalents, i.e., the MLPs, even

though neither batch normalization nor dropout was used for the former. These results further

validate our earlier theoretical results on the modular algorithm, showing that it indeed has the

potential to be a substitute for end-to-end backpropagation.


Table 7-4. Table 7-3, continued.

Model                     MNIST (50k test)        MNIST (50k test) rotated   Fashion-MNIST
MLP-1 (SGD)               4.69 ± 0.19             18.11 ± 0.34               15.47 ± 0.71
MLP-1 (Adam)              4.71 ± 0.19             18.64 ± 0.34               12.98 ± 0.66
MLP-1 (RMSProp+BN)        4.57 ± 0.18             18.75 ± 0.34               14.55 ± 0.69
MLP-1 (RMSProp+dropout)   4.31 ± 0.18             14.96 ± 0.31               12.86 ± 0.66
MLP-2 (SGD)               5.17 ± 0.19             18.08 ± 0.34               12.94 ± 0.66
MLP-2 (Adam)              4.42 ± 0.18             17.22 ± 0.33               11.48 ± 0.62
MLP-2 (RMSProp+BN)        3.57 ± 0.16             13.73 ± 0.30               11.51 ± 0.63
MLP-2 (RMSProp+dropout)   3.95 ± 0.17             13.57 ± 0.30               11.05 ± 0.61
DBN-1                     3.94 ± 0.17             14.69 ± 0.31               N/A
DBN-3                     3.11 ± 0.15             10.30 ± 0.27               N/A
SAE-3                     3.46 ± 0.16             10.30 ± 0.27               N/A
kMLP-1                    3.10 ± 0.15             11.09 ± 0.28               11.72 ± 0.63
kMLP-1FAST                2.95 ± 0.15 (0.1)       12.61 ± 0.29 (0.1)         11.45 ± 0.62 (0.28)
kMLP-2                    3.16 ± 0.15             10.53 ± 0.27               11.23 ± 0.62
kMLP-2FAST                3.18 ± 0.15 (0.3/0.3)   10.94 ± 0.27 (0.1/0.7)     10.85 ± 0.61 (1/0.28)

Figure 7-4. Visualizing the learning dynamics in a two-hidden-layer kMLP. (a) Examples from the test set. (b) Kernel matrix of the first hidden layer (epoch 25). (c) Kernel matrix of the second hidden layer (epoch 0). (d) Kernel matrix of the second hidden layer (epoch 15). Each entry in the kernel matrices corresponds to the inner product between the learned representations of two examples in the RKHS. The labels are given on the two axes. The examples used to produce this figure are provided in Fig. 7-4a in the order of the labels plotted. The darker the entry, the more distant the learned representations are in the RKHS.

In Fig. 7-4, we visualize the learning dynamics within a two-hidden-layer kMLP learned

layer-wise. Since by construction of the Gaussian kernel, the image vectors are all of unit

norm in the RKHS, we can visualize the distance between two vectors by visualizing the value

of their inner product. In Fig. 7-4d, we can see that while the image vectors are distributed

randomly prior to training (see Fig. 7-4c), there is a clear pattern in their distribution after

training that reflects the dynamics of training: The layer-wise algorithm squeezes examples from the same class closer together while pushing examples from different classes farther apart when training the latent layers. And it is easy to see that such a representation would

be simple to classify for the output layer. Fig. 7-4b and 7-4d suggest that this greedy,

layer-wise algorithm still learns “deep” representations: The higher-level representations

are more distinctive for different digits than the lower-level ones. Moreover, since learning

becomes increasingly simple for the upper layers as the representations become more and more

well-behaved, these layers are usually easy to set up and converge very fast during training.

7.2.2 The LeNet-5

We created and trained with our modular method a neural-classical kernel hybrid version

of the classic LeNet-5 (LeCun et al., 1998) and compare it with the original trained end-to-end.

The hidden representations learned from the two models are visualized. We show that the

hidden representations learned by the modular model are much more discriminative than those

from the original.

Specifically, we substitute the output layer of the classic LeNet-5 (LeCun et al., 1998)

architecture with kernel machines using Gaussian kernels, call this the kLeNet-5, and train it

using our modular algorithm with all the layers but the output layer as one module and the

output layer as the other. Since this experiment serves the purpose of providing more

insights into our learning algorithm instead of demonstrating state-of-the-art performance

(this is deferred to the next section), we use the original LeNet-5 without increasing the size

of any layer or the number of layers. ReLU (Glorot et al., 2011) and max pooling were used

as activations and pooling layers, respectively. Adam was used as the underlying optimizer for

both end-to-end and modular. The networks were trained and tested on the unpreprocessed

MNIST, Fashion-MNIST and CIFAR-10.

In Table 7-5, the results suggest that the hybrid model, trained with our modular method,

is on par with the NN counterpart trained end-to-end. We emphasize that the modular

framework does not help the network learn superior solutions compared to the traditional

end-to-end method. In that regard, it offers the same optimality guarantee as that provided


Table 7-5. Substituting the output layer of LeNet-5 with classical kernel machines using Gaussian kernels (kLeNet-5). This neural-classical kernel hybrid network was trained in a modular fashion as two modules. The entries correspond to test errors (%) and 95% confidence intervals (%). Results with overlapping confidence intervals are considered equally good. Best results are marked in bold.

Model       MNIST          Fashion-MNIST   CIFAR-10
LeNet-5     0.76 ± 0.17    9.34 ± 0.57     36.42 ± 0.94
kLeNet-5    0.75 ± 0.17    8.67 ± 0.55     35.87 ± 0.94

Figure 7-5. Visualizing the data representation of the MNIST test set in the last hidden layer of kLeNet-5 (left) and LeNet-5 (right). Each color corresponds to a digit. Representations learned by kLeNet-5 are more discriminative for different digits.

by end-to-end backpropagation. The value of the modular learning framework is that it enables modular workflows, as we have argued earlier.

Fig. 7-5 provides more insight into the differences between kLeNet-5 and LeNet-5, in which

we plotted the activations of the last hidden layer of the two models after PCA dimension

reduction using the MNIST test set. In particular, we see that the representations in the

last hidden layer of kLeNet-5 are much more discriminative for different digits than those

in the corresponding layer of LeNet-5. Note that since the two models differed only in their

output layers, this observation suggests that the modular training algorithm together with

classical kernels such as Gaussian turns deep architectures into more efficient representation

learners, which may prove useful for computer vision tasks that build on convolutional features

(Gatys et al., 2015; Gardner et al., 2015). Further, it is conceivable that greedily-trained, neural-classical kernel hybrid models are more robust against perturbations on the pixels of the

test images than the original NN trained end-to-end. This feature may help the model defend

against adversarial examples, leading to more secure deep learning systems.

7.3 Modular Learning: State-of-the-Art Network Backbones With NN-Inspired Kernels

These experiments were performed to demonstrate the efficacy of our modular learning on

state-of-the-art network backbones. The backbones we used in this section are all NNs, and

we did not modify their architectures by introducing classical kernels as we did in the previous

section.

7.3.1 Accuracy on MNIST and CIFAR-10

We now present the main results on MNIST and CIFAR-10 demonstrating the effectiveness

of our modular learning method. To facilitate fair comparisons, end-to-end and modular

training operate on the same backbone network. For all results, we used stochastic gradient

descent as the optimizer with batch size 128. For each module in the modular method as well

as the end-to-end baseline, we trained with annealing learning rates (0.1, 0.01, 0.001, each for

200 epochs). The momentum was set to 0.9 throughout. For data preprocessing, we used

simple mean subtraction followed by division by standard deviation. On CIFAR-10, we used the

standard data augmentation pipeline from (He et al., 2016), i.e., random flipping and clipping

with the exact parameters specified therein. In modular training, the models were trained as

two modules, with the output layer alone as the output module and everything else as the

input module as specified in Sec. 6.1.4. We normalized the activation vector right before the

output layer of each model to a unit vector such that the equal-norm condition required by

Theorem 6.1 is satisfied. We did not observe a significant performance difference after this

normalization.
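For concreteness, the sketch below shows one way to realize this two-module split on a torchvision ResNet-18, with the penultimate features normalized to unit norm; the attribute layout is that of torchvision's implementation, and the snippet is only meant to illustrate the split described above:

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

net = resnet18(num_classes=10)
backbone = nn.Sequential(*list(net.children())[:-1], nn.Flatten())  # everything but the final fc
classifier = net.fc                                                  # output module: a linear layer

def input_module(x):
    # unit-normalize the penultimate features, per the equal-norm condition of Theorem 6.1
    return F.normalize(backbone(x), dim=1)
```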

From Table 7-6 and 7-7, we see that on three different backbone networks and both

MNIST and CIFAR-10, our modular learning compares favorably against end-to-end. Since

under two-module modular training, the ResNets can be viewed as kernel machines with


adaptive kernels (c.f. Sec. 5.2.4), we also compared performance with other NN-inspired kernel

methods in the literature. In Table 7-7, the ResNet kernel machines clearly outperformed other

kernel methods by a considerable margin.

Table 7-6. Accuracy on MNIST. e2e for end-to-end. mdlr for modular as two modules. Obj. Fn. specifies the overall objective function (and the proxy objective for the input module, if applicable) used. XE stands for cross-entropy. Definitions of the proxy objectives can be found in Sec. 6.1.4. Using LeNet-5 as the network backbone on MNIST, our modular approach compares favorably against end-to-end training.

Model             Training   Obj. Fn.          Acc. (%)
LeNet-5 (ReLU)    e2e        XE                99.33
LeNet-5 (tanh)    e2e        XE                99.32
LeNet-5 (ReLU)    mdlr       AL/XE             99.35
LeNet-5 (ReLU)    mdlr       UTAL/XE           99.42
LeNet-5 (ReLU)    mdlr       MSE/XE            99.36
LeNet-5 (tanh)    mdlr       AL(-NEO)/XE       99.11 (99.19)
LeNet-5 (tanh)    mdlr       UTAL(-NEO)/XE     99.21 (99.11)
LeNet-5 (tanh)    mdlr       MSE(-NEO)/XE      99.27 (99.23)
LeNet-5 (tanh)    mdlr       CTS(-NEO)/XE      99.16 (99.16)

7.3.2 Label Efficiency of Modular Deep Learning

In our proposed modular learning framework, the training of the latent modules requires

only implicit labels (weak supervision): Only pairs of training examples from distinct classes

are needed. And their specific class identities are not. On the other hand, training the output

module requires full labels. But we hypothesize that the number of full labels needed should be

much smaller than that needed for training the entire model with end-to-end backpropagation,

thus rendering the overall training pipeline more label efficient that an end-to-end pipeline.

Our reasoning is explained as follows. Intuitively, the input module in a deep architecture

can be understood as learning a new representation of the given data with which the output

module’s classification task is simplified. A simple way to quantify the difficulty of a learning

task is through its sample complexity, which, when put in simple terms, refers to the number of

labeled examples needed for a given model and a training algorithm to achieve a certain level

of test accuracy (Shalev-Shwartz & Ben-David, 2014).


Table 7-7. Accuracy on CIFAR-10. Modular training yielded favorable results compared to end-to-end. To the best of our knowledge, there is no other purely modular training method that matches backpropagation with a competitive network backbone on CIFAR-10 (Jaderberg et al., 2017; Lee et al., 2015b; Carreira-Perpinan & Wang, 2014; Lowe et al., 2019; Duan et al., 2019). The ResNets trained with our modular approach can be viewed as kernel machines and outperformed other existing NN-inspired kernel methods. ∗ means the method used more sophisticated data preprocessing than the ResNets. Note that all of the baseline kernel methods have quadratic runtime and it is nontrivial to incorporate data augmentation into their pipelines.

Model                            Data Aug.     Training   Obj. Fn.   Acc. (%)
ResNet-18                        flip & clip   e2e        XE         94.91
ResNet-152                       flip & clip   e2e        XE         95.87
ResNet-18                        flip & clip   mdlr       AL/XE      94.93
ResNet-152                       flip & clip   mdlr       AL/XE      95.73
CKN (Mairal et al., 2014)        none                                82.18
CNTK (Li et al., 2019)           flip                                81.40
CNTK∗ (Li et al., 2019)          flip                                88.36
CNN-GP (Li et al., 2019)         flip                                82.20
CNN-GP∗ (Li et al., 2019)        flip                                88.92
NKWT∗ (Shankar et al., 2020)     flip                                89.80

In a modular training setting, one can decouple the training of input and output modules

and it then makes sense to discuss the sample complexity of the two modules individually. The output module should require fewer labeled examples to train since its task has been simplified once the input module has been well-trained. We now observe that this is indeed the

case.

We trained ResNet-18 on CIFAR-10 with end-to-end and our modular approach. The

input module was trained with the full training set in the modular method, but again, this only

requires implicit pairwise labels. We now compare the need for fully-labeled data between the

two training methods.
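For reference, the class-balanced selection of fully-labeled examples used for the output module can be sketched as follows (a hypothetical helper, not the exact code used for Fig. 7-6):

```python
import numpy as np

def balanced_label_subset(y, per_class=1, seed=0):
    """Pick a class-balanced set of indices whose full labels will be used to train
    the output module (e.g., one example per class, as in the mdlr (balanced) run)."""
    rng = np.random.default_rng(seed)
    idx = [rng.choice(np.flatnonzero(y == c), size=per_class, replace=False)
           for c in np.unique(y)]
    return np.concatenate(idx)
```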

[Figure 7-6 plots: accuracy (%) on CIFAR-10 vs. fully-labeled sample size, for modular (mdlr), class-balanced modular (mdlr (balanced)), and end-to-end (e2e) training, with a zoomed-in panel for the small-sample regime.]

Figure 7-6. With the output module trained with only 30 randomly chosen fully-labeled examples, the modular model still achieved 94.88% accuracy on CIFAR-10 (the same model achieved 94.93% when using all 50000 labeled examples). When the training data has balanced classes (mdlr (balanced)), modular training only required 10 randomly chosen examples to achieve 94.88% accuracy, that is, a single example per class. In contrast, the end-to-end model achieved 94.91%/21.22% when trained with 50000/50 labeled examples, respectively. Note that the input module is trained with weak supervision in modular training: the module only sees whether each pair of examples are from the same class or not. Overall, this indicates that our learning approach utilizes supervision more efficiently than traditional end-to-end backpropagation with full labels on individual examples.

With the full training set of 50000 labeled examples, the modular and end-to-end model

achieved 94.93% and 94.91% test accuracy, respectively. From Fig. 7-6, we see that while the

end-to-end model struggled in the label-scarce regime, achieving barely over 20% accuracy

with 50 fully-labeled examples, the modular model consistently achieved strong performance,

achieving 94.88% accuracy with 30 randomly chosen fully-labeled examples for training. In

fact, if we ensure that there is at least one example from each class in the training data,

the modular approach needed as few as 10 randomly chosen examples to achieve 94.88%

accuracy, that is, a single randomly chosen example per class. The fact that the modular

model underperformed at 10 and 20 labels is likely because some classes were missing in the training data that we randomly selected. Specifically, 4 classes were missing in

the 10-example training data, and 2 in the 20-example data.

These observations suggest that our modular training method for classification can almost

completely rely on implicit pairwise labels and still produce highly performant models, which

suggests new paradigms for obtaining labeled data that can potentially be less costly than the

existing ones. This also indicates that the current form of strong supervision, i.e., full labels on

individual data examples, relied on by existing end-to-end training, is not efficient enough. In

particular, we have shown that by leveraging pairwise information in the RKHS, implicit labels

in the form of whether pairs of examples are from the same class are just as informative. This

may have further implications for un/semi-supervised learning.

Connections With Un/Semi-Supervised Learning. Whenever the training batch size

is less than the size of the entire training set, we are using strictly less supervision than using

full labels. This is because if one does not cache information over batches, it is impossible to

recover the labels for the entire training set if one is only given pairwise relationships on each

batch. The fact that our learning method still produces equally performant models suggests

that we are using supervision more efficiently.

Existing semi-supervised methods also aim at utilizing supervision more efficiently. They

still rely on end-to-end training and use full labels, but achieve enhanced efficiency mostly

through data augmentation. While our results are not directly comparable to those from

existing methods (these methods do not work in the same setting as ours: They use fewer

full labels while we use mostly implicit labels), as a reference point, the state-of-the-art

semi-supervised algorithm achieves 88.61% accuracy on CIFAR-10 with 40 labels with a

stronger backbone network (Wide ResNet-28-2) and a significantly more complicated training

pipeline (Sohn et al., 2020).

Therefore, we think the insights gained from our modular learning suggest that (1) the

current form of supervision, i.e., full labels on individual data examples, although pervasively

used by supervised and semi-supervised learning paradigms, is far from efficient and (2) by


leveraging novel forms of labels and training methods, simpler yet stronger un/semi-supervised

learning paradigms exist.

7.3.3 Transferability Estimation With Proxy Objective

To empirically verify the effectiveness of our transferability estimation method, we created

a set of tasks by selecting 6 classes from CIFAR-10 and grouping each pair into a single binary

classification task. The classes selected are: cat, automobile, dog, horse, truck, deer. Each new

task is named by concatenating the first two letters from the classes involved, e.g., cado refers

to the task of distinguishing cats from dogs.

We trained one ResNet-18 using the proposed modular method for each source task,

using AL as the proxy objective. For each model, the output linear layer plus the nonlinearity

preceding it is the output module, and everything else belongs to the input module. All

networks achieved on average 98.7% test accuracy on the task they were trained on,¹

suggesting that each frozen input module possesses information essential to the source task.

Therefore, the question of quantifying input module reusability is essentially the same as

describing the transferability between each pair of tasks.

The true transferability between a source and a target task in the task space can be

quantified by the test performance of the optimal output module trained for the target task

on top of a frozen input module from a source task (Tran et al., 2019). Our estimation of

transferability is based purely on proxy objective: A frozen input module achieving higher proxy

objective value on a target task means that the source and target are more transferable. This

estimation is performed using a randomly selected subset of the target task training data.

In Fig. 7-7, we visualize the task space structure with trdo being the source and the target, respectively (one panel each). Each task is assigned a distinct color and an angle in polar coordinates. A task that is more transferable with respect to trdo is plotted closer to the origin. We see from the figures that our method accurately described the task space structure despite its simplicity, using only 10% of the target task training data.

¹ Highest accuracy was 99.95%, achieved on audo. Lowest was 93.05%, on cado.

[Figure 7-7 plots: true transferability (source: trdo) and estimated transferability using 1000 examples (target: trdo), over the 15 binary tasks auca, audo, autr, cado, deau, deca, dedo, deho, detr, hoau, hoca, hodo, hotr, trca, trdo.]

Figure 7-7. Each task is assigned a distinct color and an angle in polar coordinates. Transferability with respect to trdo is illustrated using the distance from the origin, with a smaller distance indicating a higher transferability. At practically no computational overhead, our proposed method correctly described the task space structure using only 10% randomly selected training data from the target task. Plots with the other 14 tasks being the source and target are provided in Appendix C.

The discovered transferability relationships also align

well with common sense. For example, trdo is more transferable to, e.g., detr, since they

both contain the truck class. trdo also transfers well with auca because features useful for

distinguishing trucks from dogs are likely also helpful for classifying automobiles from cats.

7.3.4 Architecture Selection With Proxy Objective

As another example demonstrating the practical advantages of modular workflows, we

show in this section that fast yet reliable network architecture selection can be performed with

components from our modular learning framework, namely a proxy objective.

While a high-level decision on network type can be made relatively easily for a learning

task, for example, CNNs for vision tasks and LSTMs (Hochreiter & Schmidhuber, 1997) for

language tasks, choosing an optimal network architecture by deciding on the network depths,

layer widths, and so on, is extremely time-consuming in general. For example, given that we


would like to use an MLP for a certain task, how many hidden layers should that MLP have and

how many nodes should there be on each hidden layer? One reliable solution is to exhaustively

train and test all possible architectures in a reasonable range of depths and widths. But even

if one assumes that the optimal learning hyperparameters such as step size can be efficiently

determined for each training session, this would still be a daunting task. For example, assume

that a single training session takes O(t) time,² then if we were to search all architectures with depth less than or equal to d and with the number of widths to be searched for each layer less than or equal to w, the total time complexity would be O((wt)^d), an unacceptably large

number for any reasonable d.

In contrast, our modular learning enables a much more efficient search procedure with

polynomial time (linear in depth, to be exact). Indeed, if we were to greedily search over the

candidate widths at each depth by choosing the width that maximized the hidden proxy, then

the entire process is O(wtd) since we fix all upstream layers during the search at any particular

depth. The question, however, is whether this greedy search procedure is as accurate as the exhaustive one with exponential complexity above.

We now validate that our fast greedy search scheme is reliable with experiments. The

set-up is as follows. We trained MLPs on MNIST. Two architecture search spaces were

considered.

1. In the first search space, the network hidden layer number was fixed at 2 and for each

hidden layer, we searched over 13 different widths.³ This is to give a fine-grained illustration of how well our strategy works for determining optimal layer widths. We keep the network shallow so that we can perform this fine-grained width search using

the exhaustive baseline strategy in a reasonable time.

² This can be refined by considering, e.g., that bigger networks typically take longer to train.

³ They are 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096.


2. In the second search space, we varied the number of hidden layers in {1, 2, 3, 4, 5}, and each hidden layer width was allowed to vary in {4, 16, 64}. This setting is to show how our strategy can accurately determine a full network architecture (depth and widths), although the search over each layer width is not as fine-grained as in the first setting due to the

prohibitively high computational complexity of exhaustive search when the network is

deep.

In each setting, we exhaustively trained all possible network architectures with end-to-end

backpropagation to obtain the ground truth competence (test accuracy) of each architecture.

Then in the first setting, the true competence of each layer width is the maximum competence

among all architectures with the corresponding layer having this specific width. In the second

setting, the true competence of each network depth is the maximum competence among

all architectures with this specific depth. Each such competence estimate requires runtime exponential in network depth.

For our greedy fast search strategy, each hidden layer was searched greedily among

candidate widths. All upstream layers were held fixed at the optimal widths obtained via earlier

search sessions. These upstream layers were still trained together with the layer being searched

to give the module enough model capacity. In both settings, the estimated competence of a

layer width or network depth is how well a specific architecture maximized a proxy objective.

We used UTAL as the proxy objective throughout the experiments. After a search in a

particular layer was complete, we fixed this layer at the optimal width and continued to the

next layer. This process started from the input layer and proceeded to the output. Each

competence estimate using our method can be obtained in linear time (in network depth).
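The greedy procedure can be summarized by the following sketch, where train_and_score is assumed to train a body with the given hidden widths on the proxy objective (e.g., UTAL) and return the attained proxy value (the helper name is ours):

```python
def greedy_width_search(depth, candidate_widths, train_and_score):
    """Greedy, layer-by-layer architecture search: O(w*t*d) instead of O((w*t)^d)."""
    chosen = []
    for _ in range(depth):
        best_w, best_score = None, float("-inf")
        for w in candidate_widths:
            score = train_and_score(chosen + [w])   # upstream widths fixed at `chosen`
            if score > best_score:
                best_w, best_score = w, score
        chosen.append(best_w)                       # fix this layer and move downstream
    return chosen
```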

From Fig. 7-8 and 7-9, we can see that in both search settings, the predicted competence

using our greedy search scheme aligns well with the actual competence (test accuracy). The

advantage of the greedy scheme is its speed. Indeed, obtaining each predicted competence

took only linear time in network depth. But obtaining each true competence took exponential

time in depth. Even in setting 1, where the network was shallow, the exhaustive search took 12× longer since there were 12 candidate widths.

[Figure 7-8 plots: relative competence and relative accuracy vs. hidden layer 1 width and hidden layer 2 width (log scale), for the MNIST network body: predicted competence vs. actual accuracy.]

Figure 7-8. Network architecture search setting 1: Fine-grained search for optimal layer widths. For each layer width, "accuracy" (ground truth competence) was obtained as the optimal test accuracy among all models with the corresponding layer having this particular width. On the other hand, the estimated "competence" was obtained with our greedy strategy, where we first searched the first hidden layer and chose the most competent width as the one that maximized the proxy, and then kept it fixed to search the second hidden layer in the same manner. The "accuracy" and "competence" values were normalized to [0, 1] (the higher the better) so that they could be plotted on the same scale, hence the name "relative". Our greedy strategy can find near-optimal widths using 12× less time than the exhaustive search used to find the ground truth.

To conclude, modular workflows enable fast and reliable architecture selection schemes, which potentially have practical implications for research fields such as neural architecture search and for other deep learning applications in both academia and industry. The same idea of course extends to searching for optimal network nonlinearities and so on.


[Figure 7-9 appears here: a single panel plotting relative competence and relative accuracy (vertical axis, 0.0 to 1.0) against the number of hidden layers (1 to 5); panel title: "MNIST Network Body: Predicted Competence vs. Actual Acc."]

Figure 7-9. Network architecture search setting 2: search for optimal network depth. At each network depth, the "accuracy" (ground truth competence) was obtained as the optimal test accuracy among all architectures with this particular depth. Three widths were searched for each layer. For each depth, the search time is exponential in network depth. The estimated "competence" values were obtained through a greedy procedure, where we searched and fixed the width of each layer sequentially, with each layer fixed at the width that maximized the proxy. The competence of a specific depth was how well the optimal architecture with this depth maximized the proxy, which can be determined in linear time in network depth. Our greedy strategy concluded that the optimal depth is 4 for this task, which agrees with the ground truth. However, the greedy process took merely a couple of hours, whereas obtaining the ground truth via the exhaustive search took days on a Tesla V100 GPU.


CHAPTER 8
CONCLUSIONS

This work is situated at the intersection of deep learning and kernel methods, using theoretical tools from kernel methods to enrich our understanding of deep learning approaches. We set out to address three open questions, all of which are of both theoretical and practical importance.

We first combined connectionism, the "secret sauce" behind the success of deep learning architectures, with classical kernel methods and proposed a family of deep models built on kernel machines. These so-called kernel networks are as expressive as neural networks, but at the same time are more mathematically tractable thanks to the linear nature of kernel machines. We compared kernel networks against both classical kernel machines and equivalent neural networks and demonstrated strong performance.
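As a schematic illustration only (a two-layer sketch written in the generic form that also appears in Eqs. A-7 and A-8 of Appendix A, with the kernels $k_1, k_2$ and coefficients $\alpha_{i,\nu}, b_i$ introduced here for the sketch rather than taken from the main construction), each node of a kernel network is itself a kernel machine, and the nodes are wired together the way the units of a multilayer perceptron are:
\[
\mathbf{F}_1(\mathbf{u}) = \big(f_{1,1}(\mathbf{u}), \dots, f_{1,d_1}(\mathbf{u})\big)^\top,
\qquad
f_{1,i}(\mathbf{u}) = \sum_{\nu=1}^{m} \alpha_{i,\nu}\, k_1(\mathbf{x}_\nu, \mathbf{u}) + b_i,
\]
\[
f_2 \circ \mathbf{F}_1(\mathbf{u}) = \sum_{\nu=1}^{m} \alpha_\nu\, k_2\big(\mathbf{F}_1(\mathbf{x}_\nu), \mathbf{F}_1(\mathbf{u})\big) + b.
\]
Each node is linear in its reproducing kernel Hilbert space even though the overall map is deep and nonlinear, which is the sense in which these models combine connectionism with classical kernel methods.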

We then revealed that kernel networks, as a model abstraction, subsume neural networks. This conclusion was established by taking a new perspective on neural network layers. In particular, we absorbed the ending nonlinearity of each node into the beginning nonlinearity of its immediate downstream nodes, enabling a novel kernel-machine interpretation of these neural network nodes. This newly established connection between deep learning and kernel methods is among the simplest and strongest in the literature, requiring neither infinite network width nor random weights, as many existing works do.
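As a schematic rendering of this change of perspective (the notation here is introduced only for the sketch), consider one node in some layer of a standard network and read the composition so that the node's own output nonlinearity is handed to the next layer. What remains of the node is
\[
f(\mathbf{x}) = \langle \mathbf{w}, \boldsymbol{\phi}(\mathbf{x}) \rangle + b,
\qquad
\boldsymbol{\phi}(\mathbf{x}) := \sigma\big(\text{pre-activations of the upstream layer evaluated at } \mathbf{x}\big),
\]
that is, a linear model in the feature space defined by $\boldsymbol{\phi}$, i.e., a kernel machine with kernel $k(\mathbf{x}, \mathbf{x}') = \langle \boldsymbol{\phi}(\mathbf{x}), \boldsymbol{\phi}(\mathbf{x}') \rangle$. The nonlinearity $\sigma$ that used to end the upstream nodes now begins this one.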

Finally, we addressed the important yet understudied problem of modular training of deep

architectures. In this regard, we put forward a framework that allows a deep architecture to be

trained in a purely module-by-module fashion. We proved its optimality for classification with

a wide range of objective functions under mild assumptions on the network architecture.

Our modular approach learns an overall solution that is as strong as the one learned by end-to-end backpropagation. At the same time, it makes it possible for users to reliably modularize their workflows, enhancing the analyzability, module reusability, and user friendliness of deep learning. We demonstrated strong performance from our modular learning


on state-of-the-art network backbones. We also showed that our training approach relies almost solely on implicit, pairwise labels, making it more efficient than traditional end-to-end training in terms of sample complexity. As a demonstration of how a modular workflow can significantly simplify deep learning applications, we proposed a simple module reusability and task transferability estimation method based on components from our modular learning framework and showed strong experimental results.
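For concreteness, the implicit pairwise labels referred to above can be derived from ordinary class labels. The following is a minimal sketch (the exact encoding used by the framework may differ):

import itertools

def pairwise_labels(class_labels):
    """Derive implicit pairwise labels from per-example class labels.

    For every pair (i, j), the pairwise label is 1 if the two examples share a
    class and 0 otherwise; module training only needs this weaker signal.
    """
    pairs = {}
    for i, j in itertools.combinations(range(len(class_labels)), 2):
        pairs[(i, j)] = int(class_labels[i] == class_labels[j])
    return pairs

# Example: 4 examples from 2 classes yield 6 pairwise labels.
print(pairwise_labels([0, 0, 1, 1]))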

Overall, we believe that this work has merely scratched the surface of the many topics it touched on. Some of these topics, such as modular learning, have significant practical implications and yet remain heavily understudied. We hope that our work can serve as a proof of concept of the value of these research topics and draw more attention to them from the community. We also hope that our initial attempts at solving these problems can inspire more and better solutions in the future.


APPENDIX A
PROOF OF PROPOSITION 4.2 & 4.3

Lemma 2. Suppose $f_1 \in \mathcal{F}_1, \dots, f_d \in \mathcal{F}_d$ are elements from sets of real-valued functions defined on $\mathbb{R}^p$ for some $p \geq 1$, and $\mathcal{F} \subset \mathcal{F}_1 \times \cdots \times \mathcal{F}_d$ is a subset of their direct sum. For $f \in \mathcal{F}$, define
\[
\omega \circ f : \mathbb{R}^p \times \cdots \times \mathbb{R}^p \times \mathbb{R}^q \to \mathbb{R} : (\mathbf{x}_1, \dots, \mathbf{x}_\gamma, \mathbf{y}) \mapsto \omega\big(f_1(\mathbf{x}_1), \dots, f_d(\mathbf{x}_1), f_1(\mathbf{x}_2), \dots, f_d(\mathbf{x}_\gamma), \mathbf{y}\big),
\]
where $\mathbf{x}_1, \dots, \mathbf{x}_\gamma \in \mathbb{R}^p$, $\mathbf{y} \in \mathbb{R}^q$, and $\omega : \mathbb{R}^{\gamma \cdot d} \times \mathbb{R}^q \to \mathbb{R}$ is bounded and $L$-Lipschitz for each $\mathbf{y} \in \mathbb{R}^q$ with respect to the Euclidean metric on $\mathbb{R}^{\gamma \cdot d}$. Let $\omega \circ \mathcal{F} = \{\omega \circ f : f \in \mathcal{F}\}$.

Define
\[
G_{n,j}(\mathcal{F}_i) = \mathbb{E}\left[\sup_{f \in \mathcal{F}_i} \frac{1}{n} \sum_{t=1}^{n} Z_t f(\mathbf{X}_{t,j})\right], \quad i = 1, \dots, d, \; j = 1, \dots, \gamma, \tag{A-1}
\]
where $\mathbf{X}_{t,j}, t = 1, \dots, n$, are i.i.d. random vectors defined on $\mathbb{R}^p$. We have
\[
G_n(\omega \circ \mathcal{F}) \leq 2L \sum_{i=1}^{d} \sum_{j=1}^{\gamma} G_{n,j}(\mathcal{F}_i). \tag{A-2}
\]

In particular, if for all $j$, the $\mathbf{X}_{t,j}$ on which the Gaussian complexities of the $\mathcal{F}_i$ are evaluated are sets of i.i.d. random vectors with the same distribution, we have $G_{n,1}(\mathcal{F}_i) = \cdots = G_{n,\gamma}(\mathcal{F}_i) =: G_n(\mathcal{F}_i)$ for all $i$, and Eq. A-2 becomes
\[
G_n(\omega \circ \mathcal{F}) \leq 2\gamma L \sum_{i=1}^{d} G_n(\mathcal{F}_i). \tag{A-3}
\]

This lemma is a generalization of a result on the Gaussian complexity of Lipschitz functions on $\mathbb{R}^p$ from (Bartlett & Mendelson, 2002).
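As a numerical companion to Eq. A-1, the sketch below estimates the Gaussian complexity of a finite function class by Monte Carlo, using the same $\frac{1}{n}$ normalization as above. The function class and data here are placeholders introduced for the sketch, not objects from this dissertation.

import numpy as np

def gaussian_complexity_mc(functions, X, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_Z[ sup_f (1/n) * sum_t Z_t f(X_t) ]
    for a finite class of real-valued functions, matching Eq. (A-1)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Evaluate every function on every sample: shape (|F|, n).
    values = np.stack([np.array([f(x) for x in X]) for f in functions])
    sups = np.empty(n_draws)
    for d in range(n_draws):
        Z = rng.standard_normal(n)          # i.i.d. standard normal weights
        sups[d] = (values @ Z).max() / n    # sup over the (finite) class
    return sups.mean()

# Example: linear functions u -> <w, u> for a few fixed weight vectors.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
ws = [rng.standard_normal(3) for _ in range(10)]
fs = [lambda u, w=w: float(w @ u) for w in ws]
print(gaussian_complexity_mc(fs, X))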

Proof. We prove the case where $\gamma = 2$; the extension to the general case is simple.

Let $\mathcal{F}$ be indexed by $\mathcal{A}$. Without loss of generality, assume $|\mathcal{A}| < \infty$. Define
\[
T_\alpha = \sum_{t=1}^{n} \omega\big(f_{\alpha,1}(\mathbf{X}_t), \dots, f_{\alpha,d}(\mathbf{X}'_t), \mathbf{Y}_t\big) Z_t, \qquad
V_\alpha = L \sum_{t=1}^{n} \sum_{i=1}^{d} \big(f_{\alpha,i}(\mathbf{X}_t) Z_{t,i} + f_{\alpha,i}(\mathbf{X}'_t) Z_{n+t,i}\big), \tag{A-4}
\]
where $\alpha \in \mathcal{A}$, $\{\mathbf{X}_t\}_{t=1}^{n}, \{\mathbf{X}'_t\}_{t=1}^{n}$ are i.i.d. random samples on $\mathbb{R}^p \times \mathbb{R}^p$, $\{\mathbf{Y}_t\}_{t=1}^{n}$ is an i.i.d. random sample on $\mathbb{R}^q$, and $Z_1, \dots, Z_n, Z_{1,1}, \dots, Z_{2n,d}$ are i.i.d. standard normal random variables. The $\mathbf{X}_t$'s, $\mathbf{X}'_t$'s, and $\mathbf{Y}_t$'s are mutually independent.

Let arbitrary $\alpha, \beta \in \mathcal{A}$ be given, and define $\|T_\alpha - T_\beta\|_2^2 := \mathbb{E}_{Z_1,\dots,Z_n}(T_\alpha - T_\beta)^2$ and $\|V_\alpha - V_\beta\|_2^2 := \mathbb{E}_{Z_1,\dots,Z_n}(V_\alpha - V_\beta)^2$. We have
\[
\begin{aligned}
\|T_\alpha - T_\beta\|_2^2
&= \sum_{t=1}^{n} \big(\omega(f_{\alpha,1}(\mathbf{X}_t), \dots, f_{\alpha,d}(\mathbf{X}'_t), \mathbf{Y}_t) - \omega(f_{\beta,1}(\mathbf{X}_t), \dots, f_{\beta,d}(\mathbf{X}'_t), \mathbf{Y}_t)\big)^2 \\
&\leq L^2 \sum_{t=1}^{n} \sum_{i=1}^{d} \Big(\big(f_{\alpha,i}(\mathbf{X}_t) - f_{\beta,i}(\mathbf{X}_t)\big)^2 + \big(f_{\alpha,i}(\mathbf{X}'_t) - f_{\beta,i}(\mathbf{X}'_t)\big)^2\Big) \\
&= \|V_\alpha - V_\beta\|_2^2.
\end{aligned} \tag{A-5}
\]

By Slepian's Lemma (Pisier, 1999),
\[
n\, G_n(\omega \circ \mathcal{F}) = \mathbb{E}_{Z_1,\dots,Z_n} \sup_{\alpha \in \mathcal{A}} T_\alpha
\leq 2\, \mathbb{E}_{Z_{1,1},\dots,Z_{2n,d}} \sup_{\alpha \in \mathcal{A}} V_\alpha
\leq 2nL \sum_{i=1}^{d} \big(G_n(\mathcal{F}_i) + G'_n(\mathcal{F}_i)\big), \tag{A-6}
\]
where $G_n$ is computed on $\mathbf{X}_1, \dots, \mathbf{X}_n$ and $G'_n$ is computed on $\mathbf{X}'_1, \dots, \mathbf{X}'_n$.

Taking the expectation over $\mathbf{X}_1, \dots, \mathbf{X}_n, \mathbf{X}'_1, \dots, \mathbf{X}'_n, \mathbf{Y}_1, \dots, \mathbf{Y}_n$ proves the result.

Lemma 3. Given a kernel $k : \mathbb{R}^{d_1} \times \mathbb{R}^{d_1} \to \mathbb{R}$, let
\[
\mathcal{F}_1 = \Big\{(f_1, \dots, f_{d_1}) : \mathbf{u} \mapsto \big(f_1(\mathbf{u}), \dots, f_{d_1}(\mathbf{u})\big)^\top,\; \mathbf{u} \in \mathbb{R}^{d_0} \;\Big|\; f_1, \dots, f_{d_1} \in \Omega\Big\}, \tag{A-7}
\]
where $\Omega$ is a given set of real-valued functions on $\mathbb{R}^{d_0}$. Also define
\[
\mathcal{F} = \Big\{h : \mathbf{u} \mapsto \sum_{\nu=1}^{m} \alpha_\nu\, k\big(\mathbf{F}(\mathbf{x}_\nu), \mathbf{F}(\mathbf{u})\big) + b \;\Big|\; \|\boldsymbol{\alpha}\|_1 \leq A,\; b \in \mathbb{R},\; \mathbf{F} \in \mathcal{F}_1\Big\}, \tag{A-8}
\]
where $\boldsymbol{\alpha} := (\alpha_1, \dots, \alpha_m)^\top$ and $\mathbf{x}_1, \dots, \mathbf{x}_m$ is a given set of examples. We have
\[
G_n(\mathcal{F}) \leq 2ALd_1 G_n(\Omega). \tag{A-9}
\]


Proof. First, note that the bias $b$ does not change $G_n(\mathcal{F})$.
\[
G_n(\mathcal{F}) = \mathbb{E} \sup_{\boldsymbol{\alpha}, \mathbf{F}} \frac{1}{n} \sum_{t=1}^{n} \sum_{\nu=1}^{m} \alpha_\nu k\big(\mathbf{F}(\mathbf{x}_\nu), \mathbf{F}(\mathbf{u}_t)\big) Z_t
\leq \mathbb{E} \sup_{\boldsymbol{\alpha}, \mathbf{F}, \mathbf{y}_\nu \in \mathbb{R}^{d_1}} \frac{1}{n} \sum_{t=1}^{n} \sum_{\nu=1}^{m} \alpha_\nu k\big(\mathbf{y}_\nu, \mathbf{F}(\mathbf{u}_t)\big) Z_t. \tag{A-10}
\]
Suppose the supremum over the $\mathbf{y}_\nu$'s is attained at $\mathbf{Y}_1, \dots, \mathbf{Y}_m$, which are random vectors since they are functions of the $Z_t$'s.

Write
\[
g_\nu \circ \mathbf{F}(\mathbf{u}) = k\big(\mathbf{F}(\mathbf{u}), \mathbf{Y}_\nu\big), \qquad
\omega \circ \mathbf{F}(\mathbf{u}, \mathbf{Y}) = \sum_{\nu=1}^{m} \alpha_\nu\, g_\nu \circ \mathbf{F}(\mathbf{u}) = \sum_{\nu=1}^{m} \alpha_\nu\, k\big(\mathbf{F}(\mathbf{u}), \mathbf{Y}_\nu\big), \tag{A-11}
\]
where $\mathbf{Y}$ is the concatenation of all $\mathbf{Y}_\nu$'s.

Then we have
\[
G_n(\mathcal{F}) \leq \mathbb{E} \sup_{\boldsymbol{\alpha}, \mathbf{F}} \frac{1}{n} \sum_{t=1}^{n} \sum_{\nu=1}^{m} \alpha_\nu k\big(\mathbf{Y}_\nu, \mathbf{F}(\mathbf{u}_t)\big) Z_t
= \mathbb{E} \sup_{\boldsymbol{\alpha}, \mathbf{F}} \frac{1}{n} \sum_{t=1}^{n} \omega \circ \mathbf{F}(\mathbf{u}_t, \mathbf{Y}) Z_t
= G_n(\omega \circ \mathcal{F}_1). \tag{A-12}
\]

We now prove a Lipschitz property for $\omega$. For any $\boldsymbol{\xi}_1, \boldsymbol{\xi}_2 \in \mathbb{R}^{d_1}$, we have
\[
\begin{aligned}
|\omega(\boldsymbol{\xi}_1, \mathbf{Y}) - \omega(\boldsymbol{\xi}_2, \mathbf{Y})|
&= \Big|\sum_{\nu=1}^{m} \alpha_\nu \big(g_\nu(\boldsymbol{\xi}_1) - g_\nu(\boldsymbol{\xi}_2)\big)\Big|
\leq \sum_{\nu=1}^{m} |\alpha_\nu|\, |g_\nu(\boldsymbol{\xi}_1) - g_\nu(\boldsymbol{\xi}_2)| \\
&\leq A \max_\nu |g_\nu(\boldsymbol{\xi}_1) - g_\nu(\boldsymbol{\xi}_2)|
= A \max_\nu |k(\boldsymbol{\xi}_1, \mathbf{Y}_\nu) - k(\boldsymbol{\xi}_2, \mathbf{Y}_\nu)| \\
&\leq A \max_\nu L_{\mathbf{Y}_\nu} \|\boldsymbol{\xi}_1 - \boldsymbol{\xi}_2\|_2
\leq AL \|\boldsymbol{\xi}_1 - \boldsymbol{\xi}_2\|_2.
\end{aligned} \tag{A-13}
\]

Therefore, $\omega \circ \mathbf{F}(\mathbf{u})$, as a function of $\mathbf{F}(\mathbf{u})$, is Lipschitz with respect to the Euclidean metric on $\mathbb{R}^{d_1}$ with Lipschitz constant at most $AL$. It is easy to check that $\omega$ is bounded.

Now, the result follows from Lemma 2 with $\gamma = 1$.

Proof of Proposition 4.1. The result follows from recursively applying Lemma 3 to all layer

pairs.

Lemma 4. Given a kernel $k : \mathbb{R}^{d_1} \times \mathbb{R}^{d_1} \to \mathbb{R}$, let
\[
\mathcal{F}_1 = \Big\{(f_1, \dots, f_{d_1}) : \mathbf{u} \mapsto \big(f_1(\mathbf{u}), \dots, f_{d_1}(\mathbf{u})\big)^\top,\; \mathbf{u} \in \mathbb{R}^{d_0} \;\Big|\; f_1, \dots, f_{d_1} \in \Omega\Big\}, \tag{A-14}
\]
where $\Omega$ is a given set of real-valued functions on $\mathbb{R}^{d_0}$. Also define
\[
\mathcal{F} = \Big\{h : \mathbf{u} \mapsto \sum_{\nu=1}^{m} \alpha_\nu\, k\big(\mathbf{F}(\mathbf{x}_\nu), \mathbf{F}(\mathbf{u})\big) + b \;\Big|\; \|\boldsymbol{\alpha}\|_2 \leq A,\; b \in \mathbb{R},\; \mathbf{F} \in \mathcal{F}_1\Big\}, \tag{A-15}
\]
where $\boldsymbol{\alpha} := (\alpha_1, \dots, \alpha_m)^\top$ and $\mathbf{x}_1, \dots, \mathbf{x}_m$ is a given set of examples. We have
\[
G_n(\mathcal{F}) \leq 2 m^{\frac{1}{2}} A L d_1 G_n(\Omega). \tag{A-16}
\]

Proof. The proof is the same as that of Lemma 3 except for Eq. A-13 (and subsequently the bound on the Lipschitz constant of $\omega$), which becomes
\[
\begin{aligned}
|\omega(\boldsymbol{\xi}_1, \mathbf{Y}) - \omega(\boldsymbol{\xi}_2, \mathbf{Y})|
&= \Big|\sum_{\nu=1}^{m} \alpha_\nu \big(g_\nu(\boldsymbol{\xi}_1) - g_\nu(\boldsymbol{\xi}_2)\big)\Big|
\leq \|\boldsymbol{\alpha}\|_2 \sqrt{\sum_{\nu=1}^{m} \big(g_\nu(\boldsymbol{\xi}_1) - g_\nu(\boldsymbol{\xi}_2)\big)^2} \\
&\leq m^{\frac{1}{2}} A \max_\nu |g_\nu(\boldsymbol{\xi}_1) - g_\nu(\boldsymbol{\xi}_2)|
= m^{\frac{1}{2}} A \max_\nu |k(\boldsymbol{\xi}_1, \mathbf{Y}_\nu) - k(\boldsymbol{\xi}_2, \mathbf{Y}_\nu)| \\
&\leq m^{\frac{1}{2}} A \max_\nu L_{\mathbf{Y}_\nu} \|\boldsymbol{\xi}_1 - \boldsymbol{\xi}_2\|_2
\leq m^{\frac{1}{2}} A L \|\boldsymbol{\xi}_1 - \boldsymbol{\xi}_2\|_2.
\end{aligned} \tag{A-17}
\]


Proof of Proposition 4.2. The result follows from recursively applying Lemma 4 to all layer

pairs.


APPENDIX B
PROOF OF THEOREM 6.1

Lemma 5. Suppose we are given an inner product space $H$, a unit vector $\mathbf{e} \in H$, and four other vectors $\mathbf{v}_+, \mathbf{v}_-, \mathbf{v}^\star_+$, and $\mathbf{v}^\star_-$ with $\|\mathbf{v}_+\|_H = \|\mathbf{v}_-\|_H = \|\mathbf{v}^\star_+\|_H = \|\mathbf{v}^\star_-\|_H > 0$, where the norm is the canonical norm induced by the inner product, i.e., $\|\mathbf{u}\|_H^2 := \langle \mathbf{u}, \mathbf{u} \rangle_H, \forall \mathbf{u} \in H$. Assume
\[
\|\mathbf{v}_+ - \mathbf{v}_-\|_H \leq \|\mathbf{v}^\star_+ - \mathbf{v}^\star_-\|_H. \tag{B-1}
\]
Then there exists a unit vector $\mathbf{e}^\star \in H$ such that
\[
\langle \mathbf{e}, \mathbf{v}_+ \rangle_H \leq \langle \mathbf{e}^\star, \mathbf{v}^\star_+ \rangle_H; \tag{B-2}
\]
\[
\langle \mathbf{e}, \mathbf{v}_- \rangle_H \geq \langle \mathbf{e}^\star, \mathbf{v}^\star_- \rangle_H. \tag{B-3}
\]

Proof. Throughout, we omit the subscript $H$ on inner products and norms for brevity; this causes no ambiguity since no other inner product or norm is involved in this proof.

If $\mathbf{v}^\star_+ = \mathbf{v}^\star_-$, then $\|\mathbf{v}^\star_+ - \mathbf{v}^\star_-\| = 0$, which would imply
\[
\|\mathbf{v}_+ - \mathbf{v}_-\| = 0 \iff \mathbf{v}_+ = \mathbf{v}_-. \tag{B-4}
\]
Then we may choose $\mathbf{e}^\star$ such that $\langle \mathbf{e}^\star, \mathbf{v}^\star_+ \rangle = \langle \mathbf{e}, \mathbf{v}_+ \rangle$ and the result holds trivially.

On the other hand, if $\mathbf{v}^\star_+ = -\mathbf{v}^\star_-$, then $\mathbf{e}^\star = \mathbf{v}^\star_+ / \|\mathbf{v}^\star_+\|$ would be a valid choice of $\mathbf{e}^\star$. Indeed, by Cauchy-Schwarz,
\[
\langle \mathbf{e}^\star, \mathbf{v}^\star_+ \rangle = \|\mathbf{v}^\star_+\| \geq \langle \mathbf{e}, \mathbf{v}_+ \rangle; \tag{B-5}
\]
\[
\langle \mathbf{e}^\star, \mathbf{v}^\star_- \rangle = -\|\mathbf{v}^\star_-\| \leq \langle \mathbf{e}, \mathbf{v}_- \rangle. \tag{B-6}
\]
Therefore, we may assume that $\mathbf{v}^\star_+ \neq \pm\mathbf{v}^\star_-$.

For two vectors $\mathbf{a}, \mathbf{b}$, we define the "angle" between them, denoted $\theta \in [0, \pi]$, via $\cos\theta := \langle \mathbf{a}, \mathbf{b} \rangle / (\|\mathbf{a}\|\|\mathbf{b}\|)$. This angle is well-defined since $\cos$ is injective on $[0, \pi]$.

Since
\[
\langle \mathbf{a}, \mathbf{b} \rangle = -\tfrac{1}{2}\|\mathbf{a} - \mathbf{b}\|^2 + \tfrac{1}{2}\|\mathbf{a}\|^2 + \tfrac{1}{2}\|\mathbf{b}\|^2, \quad \forall \mathbf{a}, \mathbf{b}, \tag{B-7}
\]
$\|\mathbf{v}_+\| = \|\mathbf{v}_-\| = \|\mathbf{v}^\star_+\| = \|\mathbf{v}^\star_-\| > 0$ and $\|\mathbf{v}_+ - \mathbf{v}_-\| \leq \|\mathbf{v}^\star_+ - \mathbf{v}^\star_-\|$ together imply $\langle \mathbf{v}_+, \mathbf{v}_- \rangle / (\|\mathbf{v}_+\|\|\mathbf{v}_-\|) \geq \langle \mathbf{v}^\star_+, \mathbf{v}^\star_- \rangle / (\|\mathbf{v}^\star_+\|\|\mathbf{v}^\star_-\|)$, which then implies $\theta \leq \theta^\star$ since $\cos$ is strictly decreasing on $[0, \pi]$, where $\theta$ is the angle between $\mathbf{v}_+, \mathbf{v}_-$ and $\theta^\star$ is the angle between $\mathbf{v}^\star_+, \mathbf{v}^\star_-$.

Let $\gamma_+$ be the angle between $\mathbf{e}, \mathbf{v}_+$ and $\gamma_-$ that between $\mathbf{e}, \mathbf{v}_-$. Note that $\gamma_- - \gamma_+ \in [0, \pi]$. Define
\[
p := \langle \mathbf{e}, \mathbf{v}_+ \rangle = \|\mathbf{v}_+\| \cos\gamma_+; \tag{B-8}
\]
\[
n := \langle \mathbf{e}, \mathbf{v}_- \rangle = \|\mathbf{v}_-\| \cos\gamma_-. \tag{B-9}
\]

Now, suppose that we have shown
\[
\gamma_- - \gamma_+ \leq \theta; \tag{B-10}
\]
\[
\exists\, \mathbf{e}^\star \text{ s.t. } \gamma^\star_- - \gamma^\star_+ = \theta^\star \text{ and } \gamma^\star_- = \gamma_-, \tag{B-11}
\]
where $\gamma^\star_+$ is the angle between $\mathbf{e}^\star, \mathbf{v}^\star_+$ and $\gamma^\star_-$ that between $\mathbf{e}^\star, \mathbf{v}^\star_-$. Define
\[
p^\star := \langle \mathbf{e}^\star, \mathbf{v}^\star_+ \rangle = \|\mathbf{v}^\star_+\| \cos\gamma^\star_+; \tag{B-12}
\]
\[
n^\star := \langle \mathbf{e}^\star, \mathbf{v}^\star_- \rangle = \|\mathbf{v}^\star_-\| \cos\gamma^\star_-. \tag{B-13}
\]
Then using $\|\mathbf{v}_-\| = \|\mathbf{v}^\star_-\| > 0$ and the earlier result that $\theta \leq \theta^\star$, we would have
\[
n = n^\star; \tag{B-14}
\]
\[
\gamma_+ \geq \gamma^\star_+, \tag{B-15}
\]
which, together with the fact that $\cos$ is strictly decreasing on $[0, \pi]$ and the assumption that $\|\mathbf{v}_+\| = \|\mathbf{v}^\star_+\| > 0$, implies
\[
n = n^\star; \tag{B-16}
\]
\[
p \leq p^\star, \tag{B-17}
\]
proving the result.

To prove Eq. B-10, it suffices to show $\cos(\gamma_- - \gamma_+) \geq \cos\theta$ since $\cos$ is decreasing on $[0, \pi]$, to which both $\gamma_- - \gamma_+$ and $\theta$ belong. To this end, we have
\[
\cos(\gamma_- - \gamma_+) = \cos\gamma_- \cos\gamma_+ + \sin\gamma_- \sin\gamma_+ \tag{B-18}
\]
\[
= \frac{pn}{\|\mathbf{v}_+\|\|\mathbf{v}_-\|} + \sqrt{\frac{(\|\mathbf{v}_+\|^2 - p^2)(\|\mathbf{v}_-\|^2 - n^2)}{\|\mathbf{v}_+\|^2\|\mathbf{v}_-\|^2}}. \tag{B-19}
\]
Since $\|\mathbf{v}_+\|^2 - p^2 = \|\mathbf{v}_+\|^2 + p^2 - 2p^2 = \|\mathbf{v}_+\|^2 + p^2 - 2p\langle \mathbf{e}, \mathbf{v}_+ \rangle = \|\mathbf{v}_+ - p\mathbf{e}\|^2$ and, similarly, $\|\mathbf{v}_-\|^2 - n^2 = \|\mathbf{v}_- - n\mathbf{e}\|^2$,
\[
\begin{aligned}
\cos(\gamma_- - \gamma_+) &= \frac{pn + \|\mathbf{v}_+ - p\mathbf{e}\|\,\|\mathbf{v}_- - n\mathbf{e}\|}{\|\mathbf{v}_+\|\|\mathbf{v}_-\|} & \text{(B-20)} \\
&\geq \frac{pn + \langle \mathbf{v}_+ - p\mathbf{e}, \mathbf{v}_- - n\mathbf{e} \rangle}{\|\mathbf{v}_+\|\|\mathbf{v}_-\|} & \text{(B-21)} \\
&= \frac{\langle \mathbf{v}_+, \mathbf{v}_- \rangle}{\|\mathbf{v}_+\|\|\mathbf{v}_-\|} & \text{(B-22)} \\
&= \cos\theta, & \text{(B-23)}
\end{aligned}
\]
where the inequality is due to Cauchy-Schwarz.

To prove B-11, it suffices to show that there exists $\mathbf{e}^\star$ such that

A. one of $\mathbf{v}^\star_+ - p^\star\mathbf{e}^\star$, $\mathbf{v}^\star_- - n^\star\mathbf{e}^\star$ is a scalar multiple of the other;

B. $\cos\gamma^\star_- = \cos\gamma_-$.

Indeed, A implies $\|\mathbf{v}^\star_+ - p^\star\mathbf{e}^\star\|\,\|\mathbf{v}^\star_- - n^\star\mathbf{e}^\star\| = \langle \mathbf{v}^\star_+ - p^\star\mathbf{e}^\star, \mathbf{v}^\star_- - n^\star\mathbf{e}^\star \rangle$, which, together with similar arguments as we used to prove Eq. B-10, implies $\gamma^\star_- - \gamma^\star_+ = \theta^\star$. Also note that B is equivalent to $n = n^\star$.

We prove existence constructively. Specifically, we set $\mathbf{e}^\star = a\mathbf{v}^\star_+ + b\mathbf{v}^\star_-$, $a, b \in \mathbb{R}$, and find $a, b$ such that $\mathbf{e}^\star$ satisfies A and B simultaneously.

Let $s = \langle \mathbf{v}^\star_+, \mathbf{v}^\star_- \rangle$ and $r = \|\mathbf{v}^\star_+\|^2 = \|\mathbf{v}^\star_-\|^2$. Note that we immediately have
\[
r > 0; \quad |s| \leq r. \tag{B-24}
\]
The assumption $\mathbf{v}^\star_+ \neq \pm\mathbf{v}^\star_-$ implies $|s| < r$.

Since $\mathbf{e}^\star$ is a unit vector, we have the following constraint on $a, b$:
\[
a^2 r + b^2 r + 2abs = 1. \tag{B-25}
\]
And some simple algebra yields
\[
p^\star = ar + bs; \tag{B-26}
\]
\[
n^\star = as + br. \tag{B-27}
\]

To identify those $a, b$ such that A holds, we first rewrite $\mathbf{v}^\star_+ - p^\star\mathbf{e}^\star$ and $\mathbf{v}^\star_- - n^\star\mathbf{e}^\star$ as follows:
\[
\mathbf{v}^\star_+ - p^\star\mathbf{e}^\star = \mathbf{v}^\star_+ - (ar + bs)(a\mathbf{v}^\star_+ + b\mathbf{v}^\star_-) \tag{B-28}
\]
\[
= (1 - a^2 r - abs)\mathbf{v}^\star_+ + (-abr - b^2 s)\mathbf{v}^\star_-. \tag{B-29}
\]
Similarly,
\[
\mathbf{v}^\star_- - n^\star\mathbf{e}^\star = (-a^2 s - abr)\mathbf{v}^\star_+ + (1 - abs - b^2 r)\mathbf{v}^\star_-. \tag{B-30}
\]
Define
\[
w_{1,+} := 1 - a^2 r - abs; \tag{B-31}
\]
\[
w_{1,-} := -abr - b^2 s; \tag{B-32}
\]
\[
w_{2,+} := -a^2 s - abr; \tag{B-33}
\]
\[
w_{2,-} := 1 - abs - b^2 r. \tag{B-34}
\]
Then we have
\[
\mathbf{v}^\star_+ - p^\star\mathbf{e}^\star = w_{1,+}\mathbf{v}^\star_+ + w_{1,-}\mathbf{v}^\star_-; \tag{B-35}
\]
\[
\mathbf{v}^\star_- - n^\star\mathbf{e}^\star = w_{2,+}\mathbf{v}^\star_+ + w_{2,-}\mathbf{v}^\star_-. \tag{B-36}
\]
Assuming none of $w_{1,+}, w_{1,-}, w_{2,+}, w_{2,-}$ is $0$, A is equivalent to
\[
w_{1,+} w_{2,-} = w_{1,-} w_{2,+}. \tag{B-37}
\]
To check that this is always true, we have
\[
w_{1,+} w_{2,-} \tag{B-38}
\]
\[
= a^3 b r s + a b^3 r s + a^2 b^2 s^2 + a^2 b^2 r^2 - 2abs - r(a^2 + b^2) + 1 \tag{B-39}
\]
\[
= a^3 b r s + a b^3 r s + a^2 b^2 s^2 + a^2 b^2 r^2 \tag{B-40}
\]
because of Eq. B-25. And
\[
w_{1,-} w_{2,+} = a^3 b r s + a b^3 r s + a^2 b^2 s^2 + a^2 b^2 r^2. \tag{B-41}
\]
Therefore, A is always true.

If at least one of $w_{1,+}, w_{1,-}, w_{2,+}, w_{2,-}$ is $0$, we have the following mutually exclusive cases:

i. one of the four coefficients is $0$ while the other three are not; ✗

ii. $w_{1,+} = w_{1,-} = 0$, the others are not $0$; ✓

iii. $w_{2,+} = w_{2,-} = 0$, the others are not $0$; ✓

iv. $w_{1,+} = w_{2,-} = 0$, the others are not $0$; ✗

v. $w_{1,-} = w_{2,+} = 0$, the others are not $0$; ✗

vi. $w_{1,+} = w_{2,+} = 0$, the others are not $0$; ✓

vii. $w_{1,-} = w_{2,-} = 0$, the others are not $0$; ✓

viii. three of the four coefficients are $0$ while one is not; ✓

ix. $w_{1,+} = w_{1,-} = w_{2,+} = w_{2,-} = 0$, ✓

where the cases marked with ✗ are the ones where A cannot be true (so our choice of $a, b$ cannot fall into these cases) and the ones marked with ✓ are the ones where A is true. Note that A cannot be true in cases iv and v because if A were true, it would imply that $\mathbf{v}^\star_+$ and $\mathbf{v}^\star_-$ are linearly dependent. However, since $\|\mathbf{v}^\star_+\| = \|\mathbf{v}^\star_-\|$, we would then have either $\mathbf{v}^\star_+ = \mathbf{v}^\star_-$ or $\mathbf{v}^\star_+ = -\mathbf{v}^\star_-$, both of which have been excluded from the discussion at the beginning of this proof.

Therefore, A is satisfied by any $a, b$ that satisfy Eq. B-25 but none of cases i, iv, and v.

We now turn to the search for $a, b$ such that B holds. B is equivalent to
\[
as + br = n \iff b = \frac{1}{r}(n - as). \tag{B-42}
\]
Therefore, finding $a, b$ such that A and B hold simultaneously amounts to finding $a, b$ such that
\[
b = \frac{1}{r}(n - as); \tag{B-43}
\]
\[
a^2 r + b^2 r + 2abs = 1; \tag{B-44}
\]
\[
\text{none of cases i, iv, v is true.} \tag{B-45}
\]
Now, substituting $b = (n - as)/r$ into Eq. B-44 and solving for $a$, we have
\[
a^2 = \frac{r - n^2}{r^2 - s^2}, \tag{B-46}
\]
and we choose
\[
a = \sqrt{\frac{r - n^2}{r^2 - s^2}}. \tag{B-47}
\]
This root is real since $n = \langle \mathbf{e}, \mathbf{v}_- \rangle$ and therefore $n^2 = r\cos^2\gamma_- \leq r$.

To verify that
\[
a = \sqrt{\frac{r - n^2}{r^2 - s^2}}, \qquad b = \frac{1}{r}(n - as) \tag{B-48}
\]
satisfies B-45, first note that since this solution of $a, b$ comes from solving $n = n^\star$ and Eq. B-44, we have
\[
w_{1,+} = 1 - a(bs + ar) = b(br + as) = bn; \tag{B-49}
\]
\[
w_{1,-} \propto b(bs + ar); \tag{B-50}
\]
\[
w_{2,+} \propto a(as + br) = an; \tag{B-51}
\]
\[
w_{2,-} = 1 - b(as + br) = 1 - bn. \tag{B-52}
\]
We now analyze each one of cases i, iv, and v individually and show that our choice of $a, b$ does not fall into any of them, proving that this particular $a, b$ satisfies Eq. B-43, Eq. B-44, and condition B-45.

Case i: If $w_{1,+} = 0$, then either $b = 0$ or $n = 0$, resulting in $w_{1,-} \propto b(bs + ar) = 0$ or $w_{2,+} \propto an = 0$, i.e., there must be at least two coefficients equal to $0$.

If $w_{1,-} = 0$, then either $b = 0$, in which case $w_{1,+} = 0$, or $bs + ar = 0$. In the latter case, we would then use Eq. B-44 and have
\[
abs + a^2 r = 0 \implies b^2 r + abs = 1 \implies b(br + as) = bn = 1 \implies w_{2,-} = 1 - bn = 0. \tag{B-53}
\]
Again, we would have at least two coefficients equal to $0$.

If $w_{2,+} = 0$, then either $n = 0$, which would result in $w_{1,+} = bn = 0$, or $a = 0$. Assuming $a = 0$, Eq. B-48 would imply $n = \pm\sqrt{r}$ and $b = n/r$. Either way, we would have $b = 1/n$ and therefore $w_{2,-} = 1 - bn = 0$.

Finally, if $w_{2,-} = 0$, then, first assuming $n \neq 0$, we would have $b = 1/n$. Then Eq. B-48 would give $1/n = (n - as)/r$. Solving for $a$, we have $a = (n^2 - r)/(ns)$. $r \geq n^2$ would imply that $a \leq 0$. On the other hand, Eq. B-48 also implies $a \geq 0$. Hence, $a$ must be $0$, in which case $w_{2,+} \propto an = 0$. If $n = 0$, we would have $w_{2,+} = 0$ as well.

Case iv: It is easy to see that this case is impossible since $w_{1,+} + w_{2,-} = 1$.

Case v: If $w_{1,-} = w_{2,+} = 0$, then we have already shown in the analysis of case i that either $w_{1,+}$ or $w_{2,-}$ must be $0$, that is, at least three coefficients would be $0$.

In summary, we have found an $\mathbf{e}^\star$ such that A and B hold simultaneously, proving B-11.


Proof of Theorem 6.1. The result amounts to proving
\[
L(f^\star_2 \circ \mathbf{F}^\star_1, S) \leq L(f_2 \circ \mathbf{F}_1, S), \quad \forall f_2, \mathbf{F}_1. \tag{B-54}
\]
Define $S_+$ to be the set of all $\mathbf{x}_i$ such that $i \in I_+$ and $S_-$ the set of all $\mathbf{x}_j$ such that $j \in I_-$. Let $\kappa = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}_{i \in I_+}$. We have
\[
\begin{aligned}
L(f_2 \circ \mathbf{F}^\star_1, S)
&\leq \kappa\, \ell_+\big(f_2 \circ \mathbf{F}^\star_1(\mathbf{x}^\star_+)\big) + (1 - \kappa)\, \ell_-\big(f_2 \circ \mathbf{F}^\star_1(\mathbf{x}^\star_-)\big) + \lambda g(\|\mathbf{w}\|) & \text{(B-55)} \\
&= \kappa\, \ell_+\Big(\Big\langle \tfrac{\mathbf{w}}{\|\mathbf{w}\|}, \boldsymbol{\phi}^\star_+ \Big\rangle \|\mathbf{w}\| + b\Big)
+ (1 - \kappa)\, \ell_-\Big(\Big\langle \tfrac{\mathbf{w}}{\|\mathbf{w}\|}, \boldsymbol{\phi}^\star_- \Big\rangle \|\mathbf{w}\| + b\Big) + \lambda g(\|\mathbf{w}\|) & \text{(B-56)}
\end{aligned}
\]
for some $\mathbf{x}^\star_+, \mathbf{x}^\star_-$ from $S_+, S_-$, respectively, where $\boldsymbol{\phi}^\star_+ := \boldsymbol{\phi}\big(\mathbf{F}^\star_1(\mathbf{x}^\star_+)\big)$ and $\boldsymbol{\phi}^\star_- := \boldsymbol{\phi}\big(\mathbf{F}^\star_1(\mathbf{x}^\star_-)\big)$.

For any $f'_2 \circ \mathbf{F}'_1$, let $f'_2$ be parameterized by $\mathbf{w}', b'$. We have
\[
\begin{aligned}
L(f'_2 \circ \mathbf{F}'_1, S)
&\geq \kappa\, \ell_+\big(f'_2 \circ \mathbf{F}'_1(\mathbf{x}'_+)\big) + (1 - \kappa)\, \ell_-\big(f'_2 \circ \mathbf{F}'_1(\mathbf{x}'_-)\big) + \lambda g(\|\mathbf{w}'\|) & \text{(B-57)} \\
&= \kappa\, \ell_+\Big(\Big\langle \tfrac{\mathbf{w}'}{\|\mathbf{w}'\|}, \boldsymbol{\phi}'_+ \Big\rangle \|\mathbf{w}'\| + b'\Big)
+ (1 - \kappa)\, \ell_-\Big(\Big\langle \tfrac{\mathbf{w}'}{\|\mathbf{w}'\|}, \boldsymbol{\phi}'_- \Big\rangle \|\mathbf{w}'\| + b'\Big) + \lambda g(\|\mathbf{w}'\|) & \text{(B-58)}
\end{aligned}
\]
for $\mathbf{x}'_+, \mathbf{x}'_-$ with $\mathbf{x}'_+$ maximizing $f'_2 \circ \mathbf{F}'_1(\mathbf{x}_i)$ over $\mathbf{x}_i \in S_+$ and $\mathbf{x}'_-$ minimizing $f'_2 \circ \mathbf{F}'_1(\mathbf{x}_j)$ over $\mathbf{x}_j \in S_-$, where $\boldsymbol{\phi}'_+ := \boldsymbol{\phi}\big(\mathbf{F}'_1(\mathbf{x}'_+)\big)$ and $\boldsymbol{\phi}'_- := \boldsymbol{\phi}\big(\mathbf{F}'_1(\mathbf{x}'_-)\big)$.

Using the assumption on $\mathbf{F}^\star_1$,
\[
\|\boldsymbol{\phi}^\star_+ - \boldsymbol{\phi}^\star_-\| \geq \|\boldsymbol{\phi}'_+ - \boldsymbol{\phi}'_-\|. \tag{B-59}
\]
Then using Lemma 5, there exists a unit vector $\mathbf{e}^\star$ such that
\[
\langle \mathbf{e}^\star, \boldsymbol{\phi}^\star_+ \rangle \geq \Big\langle \tfrac{\mathbf{w}'}{\|\mathbf{w}'\|}, \boldsymbol{\phi}'_+ \Big\rangle; \tag{B-60}
\]
\[
\langle \mathbf{e}^\star, \boldsymbol{\phi}^\star_- \rangle \leq \Big\langle \tfrac{\mathbf{w}'}{\|\mathbf{w}'\|}, \boldsymbol{\phi}'_- \Big\rangle. \tag{B-61}
\]
Let $\mathcal{A} := \{\mathbf{w} : \|\mathbf{w}\| = \|\mathbf{w}'\|\}$. Then, evidently, $\mathbf{e}^\star\|\mathbf{w}'\| \in \mathcal{A}$, and we have
\[
\begin{aligned}
L(f'_2 \circ \mathbf{F}'_1, S)
&\geq \kappa\, \ell_+\big(\langle \mathbf{e}^\star, \boldsymbol{\phi}^\star_+ \rangle \|\mathbf{w}'\| + b'\big)
+ (1 - \kappa)\, \ell_-\big(\langle \mathbf{e}^\star, \boldsymbol{\phi}^\star_- \rangle \|\mathbf{w}'\| + b'\big) + \lambda g(\|\mathbf{w}'\|) & \text{(B-63)} \\
&= \kappa\, \ell_+\Big(\Big\langle \tfrac{\mathbf{e}^\star\|\mathbf{w}'\|}{\|\mathbf{e}^\star\|\mathbf{w}'\|\|}, \boldsymbol{\phi}^\star_+ \Big\rangle \big\|\mathbf{e}^\star\|\mathbf{w}'\|\big\| + b'\Big)
+ (1 - \kappa)\, \ell_-\Big(\Big\langle \tfrac{\mathbf{e}^\star\|\mathbf{w}'\|}{\|\mathbf{e}^\star\|\mathbf{w}'\|\|}, \boldsymbol{\phi}^\star_- \Big\rangle \big\|\mathbf{e}^\star\|\mathbf{w}'\|\big\| + b'\Big) + \lambda g\big(\big\|\mathbf{e}^\star\|\mathbf{w}'\|\big\|\big) & \text{(B-64, B-65)} \\
&\geq \min_{\mathbf{w} \in \mathcal{A}} \; \kappa\, \ell_+\Big(\Big\langle \tfrac{\mathbf{w}}{\|\mathbf{w}\|}, \boldsymbol{\phi}^\star_+ \Big\rangle \|\mathbf{w}\| + b'\Big)
+ (1 - \kappa)\, \ell_-\Big(\Big\langle \tfrac{\mathbf{w}}{\|\mathbf{w}\|}, \boldsymbol{\phi}^\star_- \Big\rangle \|\mathbf{w}\| + b'\Big) + \lambda g(\|\mathbf{w}\|) & \text{(B-66)} \\
&\geq \min_{\mathbf{w} \in \mathcal{A},\, b} \; \kappa\, \ell_+\Big(\Big\langle \tfrac{\mathbf{w}}{\|\mathbf{w}\|}, \boldsymbol{\phi}^\star_+ \Big\rangle \|\mathbf{w}\| + b\Big)
+ (1 - \kappa)\, \ell_-\Big(\Big\langle \tfrac{\mathbf{w}}{\|\mathbf{w}\|}, \boldsymbol{\phi}^\star_- \Big\rangle \|\mathbf{w}\| + b\Big) + \lambda g(\|\mathbf{w}\|) & \text{(B-67)} \\
&\geq \min_{\mathbf{w} \in \mathcal{A},\, b} L(f_2 \circ \mathbf{F}^\star_1, S) & \text{(B-68)} \\
&\geq \min_{\mathbf{w},\, b} L(f_2 \circ \mathbf{F}^\star_1, S) & \text{(B-69)} \\
&= L(f^\star_2 \circ \mathbf{F}^\star_1, S). & \text{(B-70)}
\end{aligned}
\]
This proves the result.


APPENDIX C
ADDITIONAL TRANSFERABILITY ESTIMATION PLOTS


[Figure C-1 appears here: pairs of panels comparing "True Transferability" against "Estimated Transferability (Using 1000 Examples)" with each of the tasks auca, audo, autr, cado, deau, deca, dedo, and deho as the source/target task; the horizontal axis of each panel ranges over the tasks auca, audo, autr, cado, deau, deca, dedo, deho, detr, hoau, hoca, hodo, hotr, trca, and trdo.]

Figure C-1. Additional figures with other tasks as the source/target task, supplementing Fig. 7-7.


[Figure C-2 appears here: pairs of panels comparing "True Transferability" against "Estimated Transferability (Using 1000 Examples)" with each of the tasks detr, hoau, hoca, hodo, hotr, and trca as the source/target task; the horizontal axis of each panel ranges over the same fifteen tasks as in Figure C-1.]

Figure C-2. Additional figures with other tasks as the source/target task, supplementing Fig. 7-7.


REFERENCES

Achille, A., Lam, M., Tewari, R., Ravichandran, A., Maji, S., Fowlkes, C. C., Soatto, S., &Perona, P. (2019). Task2vec: Task embedding for meta-learning. In Proceedings of the IEEEInternational Conference on Computer Vision, (pp. 6430–6439).

Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American mathemat-ical society , 68(3), 337–404.

Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R. R., & Wang, R. (2019a). On exactcomputation with an infinitely wide neural net. In Advances in Neural Information ProcessingSystems, (pp. 8139–8148).

Arora, S., Du, S. S., Li, Z., Salakhutdinov, R., Wang, R., & Yu, D. (2019b). Harnessing thepower of infinitely wide deep nets on small-data tasks. arXiv preprint arXiv:1910.01663 .

Arora, S., Khandeparkar, H., Khodak, M., Plevrakis, O., & Saunshi, N. (2019c). A theoreticalanalysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229 .

Balduzzi, D., Vanchinathan, H., & Buhmann, J. (2015). Kickback cuts backprop’s red-tape:biologically plausible credit assignment in neural networks. In Twenty-Ninth AAAI Confer-ence on Artificial Intelligence.

Bao, Y., Li, Y., Huang, S.-L., Zhang, L., Zheng, L., Zamir, A., & Guibas, L. (2019). Aninformation-theoretic approach to transferability in task transfer learning. In 2019 IEEEInternational Conference on Image Processing (ICIP), (pp. 2309–2313). IEEE.

Bartlett, P. L., & Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov), 463–482.

Ben-David, S., Blitzer, J., Crammer, K., & Pereira, F. (2007). Analysis of representations fordomain adaptation. In Advances in neural information processing systems, (pp. 137–144).

Ben-David, S., & Schuller, R. (2003). Exploiting task relatedness for multiple task learning. InLearning Theory and Kernel Machines, (pp. 567–580). Springer.

Bengio, Y. (2014). How auto-encoders could provide credit assignment in deep networks viatarget propagation. arXiv preprint arXiv:1407.7906 .

Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and newperspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8),1798–1828.

Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise training ofdeep networks. In Advances in neural information processing systems, (pp. 153–160).

Broomhead, D. S., & Lowe, D. (1988). Radial basis functions, multi-variable functionalinterpolation and adaptive networks. Tech. rep., Royal Signals and Radar EstablishmentMalvern (United Kingdom).


Buckner, C., & Garson, J. (2019). Connectionism. In E. N. Zalta (Ed.) The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, fall 2019 ed. URL https://plato.stanford.edu/archives/fall2019/entries/connectionism/

Carreira-Perpinan, M., & Wang, W. (2014). Distributed optimization of deeply nested systems.In Artificial Intelligence and Statistics, (pp. 10–19).

Chen, K. (2015). Deep and modular neural networks. In Springer Handbook of ComputationalIntelligence, (pp. 473–494). Springer.

Cho, Y., & Saul, L. K. (2009). Kernel methods for deep learning. In Advances in neuralinformation processing systems, (pp. 342–350).

Cristianini, N., Shawe-Taylor, J., Elisseeff, A., & Kandola, J. S. (2002). On kernel-targetalignment. In Advances in neural information processing systems, (pp. 367–373).

Duan, S., Yu, S., Chen, Y., & Principe, J. C. (2019). On kernel method–based connectionistmodels and supervised deep learning without backpropagation. Neural computation, (pp.1–39).

Erdogmus, D., Fontenla-Romero, O., Principe, J. C., Alonso-Betanzos, A., & Castillo, E.(2005). Linear-least-squares initialization of multilayer perceptrons through backpropagationof the desired response. IEEE Transactions on Neural Networks, 16(2), 325–337.

Fahlman, S. E., & Lebiere, C. (1990). The cascade-correlation learning architecture. InAdvances in neural information processing systems, (pp. 524–532).

Freund, Y., Schapire, R., & Abe, N. (1999). A short introduction to boosting. Journal-Japanese Society For Artificial Intelligence, 14(771-780), 1612.

Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M.,& Lempitsky, V. (2016). Domain-adversarial training of neural networks. The Journal ofMachine Learning Research, 17(1), 2096–2030.

Gardner, J. R., Upchurch, P., Kusner, M. J., Li, Y., Weinberger, K. Q., Bala, K., & Hopcroft,J. E. (2015). Deep manifold traversal: Changing labels with convolutional features. arXivpreprint arXiv:1511.06421 .

Gatys, L. A., Ecker, A. S., & Bethge, M. (2015). A neural algorithm of artistic style. arXivpreprint arXiv:1508.06576 .

Gehler, P. V., & Nowozin, S. (2008). Infinite kernel learning.

Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep sparse rectifier neural networks. InProceedings of the fourteenth international conference on artificial intelligence and statistics,(pp. 315–323).

Gonen, M., & Alpaydın, E. (2011). Multiple kernel learning algorithms. Journal of machinelearning research, 12(Jul), 2211–2268.


Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville,A., & Bengio, Y. (2014). Generative adversarial nets. In Advances in neural informationprocessing systems, (pp. 2672–2680).

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition.In Proceedings of the IEEE conference on computer vision and pattern recognition, (pp.770–778).

Hermans, M., & Schrauwen, B. (2012). Recurrent kernel machines: Computing with infiniteecho state networks. Neural Computation, 24(1), 104–133.

Hinton, G. E. (2007). Learning multiple layers of representation. Trends in cognitive sciences,11(10), 428–434.

Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). A fast learning algorithm for deep beliefnets. Neural computation, 18(7), 1527–1554.

Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neuralnetworks. science, 313(5786), 504–507.

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation,9(8), 1735–1780.

Huang, F. J., & LeCun, Y. (2006). Large-scale learning with svm and convolutional for genericobject categorization. In 2006 IEEE Computer Society Conference on Computer Vision andPattern Recognition (CVPR’06), vol. 1, (pp. 284–291). IEEE.

Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training byreducing internal covariate shift. arXiv preprint arXiv:1502.03167 .

Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of localexperts. Neural computation, 3(1), 79–87.

Jacot, A., Gabriel, F., & Hongler, C. (2018). Neural tangent kernel: Convergence andgeneralization in neural networks. In Advances in neural information processing systems, (pp.8571–8580).

Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O., Graves, A., Silver, D., &Kavukcuoglu, K. (2017). Decoupled neural interfaces using synthetic gradients. In Pro-ceedings of the 34th International Conference on Machine Learning-Volume 70 , (pp.1627–1635). JMLR. org.

Janati, H., Cuturi, M., & Gramfort, A. (2018). Wasserstein regularization for sparse multi-taskregression. arXiv preprint arXiv:1805.07833 .

Jordan, M. I. (1997). Serial order: A parallel distributed processing approach. In Advances inpsychology , vol. 121, (pp. 471–495). Elsevier.


Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the em algorithm.Neural computation, 6(2), 181–214.

Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprintarXiv:1412.6980 .

Kloft, M., Brefeld, U., Sonnenburg, S., & Zien, A. (2011). Lp-norm multiple kernel learning.The Journal of Machine Learning Research, 12 , 953–997.

Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images.

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In P. L. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.) Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States, (pp. 1106–1114). URL http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks

Kulis, B., Sustik, M. A., & Dhillon, I. S. (2009). Low-rank kernel learning with bregman matrixdivergences. Journal of Machine Learning Research, 10(Feb), 341–376.

Lanckriet, G. R., Cristianini, N., Bartlett, P., Ghaoui, L. E., & Jordan, M. I. (2004). Learningthe kernel matrix with semidefinite programming. Journal of Machine learning research,5(Jan), 27–72.

Larochelle, H., Erhan, D., Courville, A., Bergstra, J., & Bengio, Y. (2007). An empiricalevaluation of deep architectures on problems with many factors of variation. In Proceedingsof the 24th international conference on Machine learning , (pp. 473–480).

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied todocument recognition. Proceedings of the IEEE , 86(11), 2278–2324.

Lee, C.-Y., Xie, S., Gallagher, P., Zhang, Z., & Tu, Z. (2015a). Deeply-supervised nets. InArtificial intelligence and statistics, (pp. 562–570).

Lee, D.-H., Zhang, S., Fischer, A., & Bengio, Y. (2015b). Difference target propagation. InJoint european conference on machine learning and knowledge discovery in databases, (pp.498–515). Springer.

Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J., & Sohl-Dickstein, J. (2017).Deep neural networks as gaussian processes. arXiv preprint arXiv:1711.00165 .

Lee, J., Schoenholz, S. S., Pennington, J., Adlam, B., Xiao, L., Novak, R., & Sohl-Dickstein,J. (2020). Finite versus infinite neural networks: an empirical study. arXiv preprintarXiv:2007.15801 .


Li, Y., Carlson, D. E., et al. (2018). Extracting relationships by multi-domain matching. InAdvances in Neural Information Processing Systems, (pp. 6798–6809).

Li, Z., Wang, R., Yu, D., Du, S. S., Hu, W., Salakhutdinov, R., & Arora, S. (2019). Enhancedconvolutional neural tangent kernels. arXiv preprint arXiv:1911.00809 .

Liu, P., Qiu, X., & Huang, X. (2017). Adversarial multi-task learning for text classification.arXiv preprint arXiv:1704.05742 .

Liu, W., Pokharel, P. P., & Prıncipe, J. C. (2007). Correntropy: Properties and applications innon-gaussian signal processing. IEEE Transactions on Signal Processing , 55(11), 5286–5298.

Liu, W., Pokharel, P. P., & Principe, J. C. (2008). The kernel least-mean-square algorithm.IEEE Transactions on Signal Processing , 56(2), 543–554.

Liu, W., Principe, J. C., & Haykin, S. (2011). Kernel adaptive filtering: a comprehensiveintroduction, vol. 57. John Wiley & Sons.

Lowe, S., O’Connor, P., & Veeling, B. (2019). Putting an end to end-to-end: Gradient-isolatedlearning of representations. In Advances in Neural Information Processing Systems, (pp.3033–3045).

Mairal, J., Koniusz, P., Harchaoui, Z., & Schmid, C. (2014). Convolutional kernel networks. InAdvances in neural information processing systems, (pp. 2627–2635).

Micchelli, C. A., Xu, Y., & Zhang, H. (2006). Universal kernels. Journal of Machine LearningResearch, 7(Dec), 2651–2667.

Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2018). Foundations of machine learning . MITpress.

Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted boltzmann machines.In Proceedings of the 27th international conference on machine learning (ICML-10), (pp.807–814).

Neal, R. M. (1995). BAYESIAN LEARNING FOR NEURAL NETWORKS . Ph.D. thesis,University of Toronto.

Nguyen, C. V., Hassner, T., Archambeau, C., & Seeger, M. (2020). Leep: A new measure toevaluate transferability of learned representations. arXiv preprint arXiv:2002.12462 .

Park, J., & Sandberg, I. W. (1991). Universal approximation using radial-basis-functionnetworks. Neural computation, 3(2), 246–257.

Pisier, G. (1999). The volume of convex bodies and Banach space geometry, vol. 94. Cambridge University Press.

Principe, J. C. (2010). Information theoretic learning: Renyi’s entropy and kernel perspectives.Springer Science & Business Media.


Raghu, M., Gilmer, J., Yosinski, J., & Sohl-Dickstein, J. (2017). Svcca: Singular vectorcanonical correlation analysis for deep learning dynamics and interpretability. In Advances inNeural Information Processing Systems, (pp. 6076–6085).

Rahimi, A., & Recht, B. (2008). Random features for large-scale kernel machines. In Advancesin neural information processing systems, (pp. 1177–1184).

Rosenblatt, F. (1957). The perceptron, a perceiving and recognizing automaton Project Para.Cornell Aeronautical Laboratory.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations byback-propagating errors. nature, 323(6088), 533–536.

Scholkopf, B., Herbrich, R., & Smola, A. J. (2001). A generalized representer theorem. InInternational conference on computational learning theory , (pp. 416–426). Springer.

Scholkopf, B., Smola, A., & Muller, K.-R. (1998). Nonlinear component analysis as a kerneleigenvalue problem. Neural computation, 10(5), 1299–1319.

Scholkopf, B., & Smola, A. J. (2001). Learning with kernels: support vector machines,regularization, optimization, and beyond . MIT press.

Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding machine learning: From theory toalgorithms. Cambridge university press.

Shankar, V., Fang, A., Guo, W., Fridovich-Keil, S., Schmidt, L., Ragan-Kelley, J., & Recht, B.(2020). Neural kernels without tangents. arXiv preprint arXiv:2003.02237 .

Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale imagerecognition. arXiv preprint arXiv:1409.1556 .

Sohn, K., Berthelot, D., Li, C.-L., Zhang, Z., Carlini, N., Cubuk, E. D., Kurakin, A., Zhang,H., & Raffel, C. (2020). Fixmatch: Simplifying semi-supervised learning with consistency andconfidence. arXiv preprint arXiv:2001.07685 .

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout:a simple way to prevent neural networks from overfitting. The journal of machine learningresearch, 15(1), 1929–1958.

Sun, S., Chen, W., Wang, L., Liu, X., & Liu, T.-Y. (2015). On the depth of deep neuralnetworks: A theoretical view. arXiv preprint arXiv:1506.05232 .

Suykens, J. A. (2017). Deep restricted kernel machines using conjugate feature duality. Neuralcomputation, 29(8), 2123–2163.

Suykens, J. A., & Vandewalle, J. (1999). Training multilayer perceptron classifiers based on amodified support vector method. IEEE transactions on Neural Networks, 10(4), 907–911.

Synced (2018). LeCun vs Rahimi: Has Machine Learning Become Alchemy? URL https://medium.com/@Synced/lecun-vs-rahimi-has-machine-learning-become-alchemy-21cb1557920d

Tang, Y. (2013). Deep learning using linear support vector machines. arXiv preprintarXiv:1306.0239 .

Tieleman, T., & Hinton, G. (2012). Lecture 6.5-rmsprop, coursera: Neural networks formachine learning. University of Toronto, Technical Report.

Tran, A. T., Nguyen, C. V., & Hassner, T. (2019). Transferability and hardness of supervisedclassification tasks. In Proceedings of the IEEE International Conference on ComputerVision, (pp. 1395–1405).

Vapnik, V. (2000a). The nature of statistical learning theory.

Vapnik, V., & Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequenciesof events to their probabilities. Theory of Probability & Its Applications, 16(2), 264–280.

Vapnik, V. N. (2000b). The Nature of Statistical Learning Theory, Second Edition. Statisticsfor Engineering and Information Science. Springer.

Varma, M., & Babu, B. R. (2009). More generality in efficient multiple kernel learning.In Proceedings of the 26th Annual International Conference on Machine Learning , (pp.1065–1072).

Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.-A., & Bottou, L. (2010).Stacked denoising autoencoders: Learning useful representations in a deep network with alocal denoising criterion. Journal of machine learning research, 11(12).

Watanabe, C., Hiramatsu, K., & Kashino, K. (2018). Modular representation of layered neuralnetworks. Neural Networks, 97 , 62–73.

Williams, C. K., & Rasmussen, C. E. (2006). Gaussian processes for machine learning , vol. 2.MIT press Cambridge, MA.

Wilson, A. G., Hu, Z., Salakhutdinov, R., & Xing, E. P. (2016). Deep kernel learning. InArtificial intelligence and statistics, (pp. 370–378).

Xiao, H., Rasul, K., & Vollgraf, R. (2017). Fashion-mnist: a novel image dataset forbenchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 .

Xu, D., & Principe, J. C. (1999). Training mlps layer-by-layer with the information potential.In IJCNN’99. International Joint Conference on Neural Networks. Proceedings (Cat. No.99CH36339), vol. 3, (pp. 1716–1720). IEEE.

Xu, Z., Jin, R., King, I., & Lyu, M. (2009). An extended level method for efficient multiplekernel learning. In Advances in neural information processing systems, (pp. 1825–1832).

Yu, S., Shaker, A., Alesiani, F., & Principe, J. C. (2020). Measuring the discrepancy betweenconditional distributions: Methods, properties and applications.


Zamir, A. R., Sax, A., Shen, W., Guibas, L. J., Malik, J., & Savarese, S. (2018). Taskonomy:Disentangling task transfer learning. In Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition, (pp. 3712–3722).

Zhang, S., Li, J., Xie, P., Zhang, Y., Shao, M., Zhou, H., & Yan, M. (2017). Stacked kernelnetwork. arXiv preprint arXiv:1711.09219 .

Zhou, Z.-H., & Feng, J. (2017). Deep forest. arXiv preprint arXiv:1702.08835 .

Zhuang, J., Tsang, I. W., & Hoi, S. C. (2011). Two-layer multiple kernel learning. InProceedings of the Fourteenth International Conference on Artificial Intelligence andStatistics, (pp. 909–917).


BIOGRAPHICAL SKETCH

Shiyu Duan received his B.S. in electronic engineering from Fudan University in 2016.

From 2016 to 2020, he worked at the Computational NeuroEngineering Laboratory at the University of Florida as a research assistant under the supervision of Dr. Jose C. Príncipe. He received his Ph.D. in electrical and computer engineering from the University of Florida in 2020.

His research interests include machine learning theory and computer vision.
