MetaReg: Towards Domain Generalization using Meta-Regularization

Yogesh Balaji
Department of Computer Science
University of Maryland
College Park, MD
[email protected]

Swami Sankaranarayanan∗
Butterfly Network Inc.
New York, NY
[email protected]

Rama Chellappa
Department of Electrical and Computer Engineering
University of Maryland
College Park, MD
[email protected]

Abstract

Training models that generalize to new domains at test time is a problem of fundamental importance in machine learning. In this work, we encode this notion of domain generalization using a novel regularization function. We pose the problem of finding such a regularization function in a learning-to-learn (or meta-learning) framework. The objective of domain generalization is explicitly modeled by learning a regularizer that makes a model trained on one domain perform well on another domain. Experimental validations on computer vision and natural language datasets indicate that our method can learn regularizers that achieve good cross-domain generalization.

1 Introduction

Existing machine learning algorithms, including deep neural networks, achieve good performance in cases where the training and the test data are sampled from the same distribution. While this is a reasonable assumption to make, it might not hold true in practice. Deploying the perception system of an autonomous vehicle in environments that differ from its training setting might lead to failure owing to the shift in data distribution. Even strong learners such as deep neural networks are known to be sensitive to such domain shifts [9][33]. Approaches that resolve this issue in a domain adaptation framework have access to the target distribution. This is hardly true in practice: deploying real systems involves generalizing to unseen sources of data. This problem, also known as domain generalization, is the focus of this paper.

Most machine learning models (including neural networks) are susceptible to domain shift: models trained on one dataset perform poorly on a related but shifted dataset. Approaches for addressing this issue can be broadly grouped into two major categories: domain adaptation and domain generalization. Domain adaptation techniques assume access to the target dataset; the models are trained using a combination of labeled source data and unlabeled or sparsely labeled target data. Domain generalization, on the other hand, is a much harder problem, as we assume no access to target information. Instead, the variations in multiple source domains are utilized to generalize to novel test distributions.

∗Work done while at University of Maryland, College Park.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Illustration of the proposed approach. Figure (a) depicts the network design: we employ a shared feature network F and p task networks {T_i}_{i=1}^p. Each task network T_i is trained only on the data from domain i, and the shared network F is trained on all p source domains. The figure on the right illustrates the optimization updates. At each iteration, we sample a pair of domains (i, j) from the training set. The black arrows are the SGD updates of the task network T_i trained on domain i. From each point in the black path, we take l gradient steps using the regularized loss and the samples from domain i to reach a new point ∗. We then compute the loss on domain j at ∗. The regularizer parameters φ are updated so that this meta-loss is minimized. This ensures that the task network T_i trained with the proposed regularizer generalizes to domain j.

The conventional approach to improve the generalization of a parametric model is to introduce regularization in the loss function [32]. In a Bayesian setting, regularization can be interpreted as priors on the parameters. Several regularization schemes have been proposed for neural networks, including weight decay [17], Dropout [28], DropConnect [31], batch normalization [14], etc. While these schemes have been shown to reduce test error on samples drawn from the same training distribution, they do not generalize when there is a training-test distribution mismatch. Hence, the objective of this work is to learn a regularizer that generalizes to novel distributions not present in the training data.

Designing such regularizers for achieving cross-domain generalization is a challenging problem. The difficulty in mathematically modeling domain shifts makes it hard to design hand-crafted regularizers. Instead, we take a data-driven approach where we aim to learn the regularization function using the variability in the source domains. We cast the problem of learning regularizers in a learning-to-learn, or meta-learning, framework, which has seen a resurgence of interest recently with applications including few-shot learning [7][24] and learning optimizers [20][1]. Similar to [7], we follow an episodic paradigm where, at each iteration, we sample an episode comprising meta-train and meta-test data such that the domains contained in the meta-train and meta-test sets are disjoint. The objective is then to train the regularizer such that k steps of gradient descent using the meta-train data decrease the loss on the meta-test data. This procedure is repeated for multiple episodes sampled from the source dataset. After the regularizer is trained, we fine-tune a new model on the entire source dataset using the trained regularizer.

The primary contribution of this work is a scheme for learning regularization functions that enable domain generalization. We show how the notion of domain generalization can be explicitly encoded in a regularization function, which can then be used to train models that are more robust to domain shifts. This framework is also scalable, as the same regularizer can later be used to fine-tune on a larger dataset. Experiments indicate that our approach can learn regularizers that achieve good cross-domain generalization on benchmark domain generalization datasets.

2 Related work

Domain Adaptation Domain adaptation has received significant attention in recent years from the machine learning, computer vision and natural language processing communities. Non-deep-learning approaches to this problem include feature engineering methods [6], learning intermediate subspaces using manifolds [11][12] and dictionaries [23], etc. Recent methods harness the expressive power of deep neural networks to learn domain invariant representations. The method in [9] attempts to reduce the distributional distance between source and target embeddings by formulating a minimax game between the feature network and a domain discriminator network. [5] uses stacked denoising autoencoders for learning robust data representations that adapt well to target domains. Other notable works include the use of Maximum Mean Discrepancy (MMD) [21], Generative Adversarial Networks (GAN) [25][26], co-training [4], etc.

Meta-learning The concept of meta-learning (or learning to learn) has a long-standing history; some of the earlier works include [30] [27]. Recently, there has been a lot of interest in applying such strategies to deep neural networks. One interesting application is the problem of learning the optimization updates of neural networks by casting it as a policy learning problem in a Markov decision process [1] [20]. Few-shot learning is another problem where meta-learning strategies have been widely explored. [24] proposes an LSTM-based meta-learner for learning the optimization updates of a few-shot classifier. Instead of learning the updates, [7] learns transferable weight representations that quickly adapt to a new task using only a few samples. Other recent applications that use meta-learning include imitation learning [8], visual question answering [29], etc.

Domain Generalization Unlike domain adaptation, domain generalization is a relatively less explored area of research. [22] proposes domain invariant component analysis, a kernel-based algorithm for minimizing the differences in the marginal distributions of multiple domains. [10] attempts to learn a domain-invariant feature representation by using multi-view autoencoders to perform cross-domain reconstructions. The method in [15] decomposes the parameters of a model (an SVM classifier) into domain-specific and domain-invariant components, and uses the domain-invariant parameters to make predictions on the unseen domain. [18] extends this idea to decompose the weights of deep neural networks using a multi-linear model and tensor decomposition.

Finn et al. [7] recently proposed a model-agnostic meta-learning (MAML) procedure for few-shot learning problems. The objective of the MAML approach is to find a good initialization θ such that a few gradient steps from θ result in a good task-specific network. The focus of MAML is to adapt quickly in few-shot settings. Recently, [19] proposed a meta-learning based approach (MLDG) extending MAML to the domain generalization problem. This approach has the following limitations. First, the objective function of MAML is more suited for fast task adaptation, for which it was originally proposed; in domain generalization, however, we do not have access to samples from a new domain, and so a MAML-like objective might not be effective. The second issue is scalability: it is hard to scale MLDG to deep architectures like Resnet [13]. Our approach attempts to tackle both these problems: (1) We explicitly address the notion of domain generalization in our episodic training procedure by using a regularizer to go from a task-specific representation to a task-general representation at each episode. (2) We make our approach scalable by freezing the feature network and performing meta-learning only on the task network. This enables us to use our approach to train deeper models like Resnet-50. A similar approach for training meta-learning algorithms in feature space has been explored in a recent work [34].

3 Method

3.1 Problem Setup

We begin with a formal description of the domain generalization problem. Let X denote the instance space (which can be images, text, etc.) and Y denote the label space. Domain generalization involves data sampled from p source distributions and q target distributions, each containing data for performing the same task. Classification tasks are considered in this work. Hence, Y is the discrete set {1, 2, ..., N_c}, where N_c denotes the number of classes. Let {D_i}_{i=1}^{p+q} represent the p + q distributions, each of which exists on the joint space X × Y. Let D_i = {(x_j^(i), y_j^(i))}_{j=1}^{N_i} represent the dataset sampled from the i-th distribution, i.e. each (x_j^(i), y_j^(i)) is drawn i.i.d. from D_i. In the rest of the paper, D_i is referred to as the i-th domain. Note that every D_i shares the same label space. In the domain generalization problem, each of the p + q domains has varied domain statistics. The objective is to train models on the p source domains so that they generalize well to the q novel target domains.


We are interested in training a parametric model M_Θ : X → Y using data only from the p source domains. In this work, we consider M_Θ to be a deep neural network. We decompose the network M into a feature network F and a task network T, i.e. M_Θ(x) = (T_θ ∘ F_ψ)(x), where Θ = {ψ, θ}. Here, ψ denotes the weights of the feature network F, and θ denotes the weights of the task network. The output of M_Θ(x) is a vector of dimension N_c whose i-th entry denotes the probability that the instance x belongs to class i. Standard neural network training involves minimizing the cross-entropy loss function given by Eq. (1):

L(ψ, θ) = E_{(x,y)∼D}[−y · log(M_Θ(x))] = Σ_{i=1}^{p} Σ_{j=1}^{N_i} −y_j^(i) · log(M_Θ(x_j^(i)))     (1)

Here, y_j^(i) is the one-hot representation of the label y_j^(i), and '·' denotes the dot product between two vectors. The above loss function does not take into account any factor that models domain shifts, so generalization to a new domain is not expected. To accomplish this, we propose using a regularizer R(ψ, θ). The new loss function then becomes L_reg(ψ, θ) = L(ψ, θ) + R(ψ, θ). The regularizer R(ψ, θ) should capture the notion of domain generalization, i.e. it should enable generalization to a new distribution with varied domain statistics. Designing such regularizers is hard in general, so we propose to learn it using meta-learning.

3.2 Learning the regularizer

In this work, we model the regularizer R as a neural network parametrized by weights φ. Moreover, the regularization is applied only on the parameters θ of the task network to enable scalable meta-learning. So, the regularizer is denoted as Rφ(θ) in the rest of the paper. We now discuss how the parameters of the regularizer Rφ(θ) are estimated. In this stage of the training pipeline, the neural network architecture consists of a feature network F and p task networks {T_i}_{i=1}^p (with the parameters of T_i denoted by θ_i) as shown in Fig. 1. Each T_i is trained only on the samples from domain i, and F is the shared network trained on all p source domains. The reason for using p task networks is to enforce domain-specificity in the models so that the regularizer can be trained to make them domain-invariant.

We now describe the procedure for learning the regularizer:

• Base model updates: We begin by training the shared network F and p task networks {T_i}_{i=1}^p using the supervised classification loss L(ψ, θ) given by Eq. (1). Note that there is no regularization in this step. Let the network parameters at the k-th step of this optimization be denoted as [ψ^(k), θ_1^(k), ..., θ_p^(k)].

• Episode creation: To train Rφ(θ), we follow an episodic training procedure similar to [19]. Let a, b be two randomly chosen domains from the training set. Each episode contains data partitioned into two subsets: (1) m1 labeled samples from domain a, denoted the metatrain set, and (2) m2 labeled samples from domain b, denoted the metatest set. The domains contained in the two sets are disjoint (i.e. a ≠ b), and the data is sampled only from the source distributions (i.e. a, b ∈ {1, 2, ..., p}).

• Regularizer updates: At iteration k, a new task network T_new is initialized with θ_a^(k), the base model's task network parameters for the a-th domain at iteration k. Using the samples from the metatrain set (which contains domain a), l steps of gradient descent are performed with the regularized loss function L_reg(ψ, θ) on T_new. Let θ̂_a^(k) denote the parameters of T_new after these l gradient steps. We treat each update of the network T_new as a separate variable in the computational graph; θ̂_a^(k) thus depends on φ through these l gradient steps. The unregularized loss on the metatest set, computed using T_new (with parameters θ̂_a^(k)), is then minimized with respect to the regularizer parameters φ, and each regularizer update unrolls through the l gradient steps. This entire procedure can be expressed by the following set of equations:

β_1 ← θ_a^(k)

β_t = β_{t−1} − α ∇_{β_{t−1}} [ L^(a)(ψ^(k), β_{t−1}) + Rφ(β_{t−1}) ]   ∀ t ∈ {2, ..., l}     (2)

θ̂_a^(k) = β_l

φ^(k+1) = φ^(k) − α ∇_φ L^(b)(ψ^(k), θ̂_a^(k)) |_{φ=φ^(k)}     (3)

Here, L^(i)(ψ, θ_new) = E_{(x,y)∼D_i}[−y · log(T_{θ_new}(F_ψ(x)))], i.e. the loss of the task network T_new on samples from domain i, and α is the learning rate. Eq. (2) represents l steps of gradient descent from the initial point θ_a^(k) using samples from the metatrain set, with β_t denoting the output at the t-th step. Eq. (3) is the meta-update step for the parameters of the regularizer. This update ensures that l steps of gradient descent using the regularized loss on samples from domain a result in task network a performing well on domain b. It is important to note that the dependence of θ̂_a^(k) on φ comes from the l gradient steps performed in Eq. (2), so the gradient with respect to φ propagates through these l unrolled gradient steps.

Since the same regularizer Rφ(θ) is trained on every (a, b) pair, the resulting regularizer captures the notion of domain generalization. Please refer to Fig. 1 for a pictorial description of the meta-update step. The entire algorithm is given in Algorithm 1.
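To illustrate how Eqs. (2)-(3) can be implemented, here is a minimal sketch under simplifying assumptions: the task network is a single linear layer handled functionally, feats_a and feats_b are precomputed outputs of the frozen feature network on minibatches from domains a and b, the regularizer is the weighted L1 form used later in the paper, and phi is a flat tensor created with requires_grad=True. Names, sizes and learning rates are illustrative.

```python
import torch
import torch.nn.functional as nnF

feat_dim, n_classes = 64, 7   # illustrative; theta has feat_dim*n_classes + n_classes entries

def task_forward(theta, feats):
    # Single-linear-layer task network handled functionally, so that the inner
    # updates remain differentiable with respect to the regularizer parameters phi.
    W = theta[:-n_classes].view(n_classes, feat_dim)
    b = theta[-n_classes:]
    return feats @ W.t() + b

def weighted_l1(phi, theta):
    # R_phi(theta) = sum_i phi_i * |theta_i|
    return (phi * theta.abs()).sum()

def meta_step(phi, theta_a, feats_a, y_a, feats_b, y_b, l=3, alpha=1e-3):
    """One regularizer update: l inner steps on domain a (Eq. 2), then a
    meta-gradient from the unregularized loss on domain b (Eq. 3)."""
    beta = theta_a.detach().clone().requires_grad_(True)     # beta_1 <- theta_a^(k)
    for _ in range(l):
        inner = nnF.cross_entropy(task_forward(beta, feats_a), y_a) + weighted_l1(phi, beta)
        (g,) = torch.autograd.grad(inner, beta, create_graph=True)
        beta = beta - alpha * g                               # unrolled step, keeps the graph to phi
    meta_loss = nnF.cross_entropy(task_forward(beta, feats_b), y_b)
    (g_phi,) = torch.autograd.grad(meta_loss, phi)            # backprop through the l unrolled steps
    return (phi - alpha * g_phi).detach().requires_grad_(True)
```

The essential point is that the inner updates are kept in the autograd graph (create_graph=True), so the meta-loss on domain b can be differentiated with respect to φ through the l unrolled steps.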

3.3 Training the final model

Once the regularizer is learnt, the regularization parameters φ are frozen and the final task network, initialized from scratch, is trained on all p source domains using the regularized loss function L_reg(ψ, θ). The network architecture consists of just one F-T pair. In this paper, we use a weighted L1 loss as our regularization function, i.e. Rφ(θ) = Σ_i φ_i |θ_i|. The weights of this regularizer are estimated using the meta-learning procedure discussed above. However, our approach is general and can be extended to any class of regularizers (refer to Section 5). The use of a weighted L1 loss can be interpreted as a learnable weight decay mechanism: weights θ_i for which φ_i is positive will be decayed to 0, and those for which φ_i is negative will be boosted. By using our meta-learning procedure, we select a common set of weights that achieves good cross-domain generalization across every pair of source domains (a, b).
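As a rough sketch of this final stage (reusing the illustrative module sizes from the earlier sketch, a hypothetical DataLoader all_source_loader over all p source domains, and frozen meta-learned weights stored in phi):

```python
import torch
import torch.nn as nn
import torch.nn.functional as nnF

# Final training stage (Sec. 3.3): the learned weighted-L1 parameters phi are
# frozen, and a fresh F-T pair is trained on all source domains with
# L_reg = cross-entropy + sum_i phi_i * |theta_i|.
feature_net = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                            nn.AdaptiveAvgPool2d(1), nn.Flatten())
task_net = nn.Linear(64, 7)
n_task_params = sum(p.numel() for p in task_net.parameters())
phi = torch.zeros(n_task_params)          # in practice: the meta-learned weights, kept frozen

opt = torch.optim.SGD(list(feature_net.parameters()) + list(task_net.parameters()), lr=5e-4)
for x, y in all_source_loader:            # hypothetical DataLoader over all p source domains
    theta = torch.cat([p.flatten() for p in task_net.parameters()])
    loss = nnF.cross_entropy(task_net(feature_net(x)), y) + (phi * theta.abs()).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```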

Algorithm 1 MetaReg training algorithm

Require: N_iter: number of training iterations
Require: α1, α2: learning rate hyperparameters
1: for t in 1 : N_iter do
2:   for i in 1 : p do
3:     Sample n_b labeled images {(x_j^(i), y_j^(i)) ∼ D_i}_{j=1}^{n_b}
4:     Perform supervised classification updates:
5:       ψ^(t) ← ψ^(t−1) − α1 ∇_ψ L^(i)(ψ^(t−1), θ_i^(t−1))
6:       θ_i^(t) ← θ_i^(t−1) − α1 ∇_{θ_i} L^(i)(ψ^(t−1), θ_i^(t−1))
7:   end for
8:   Choose a, b ∈ {1, 2, ..., p} randomly such that a ≠ b
9:   β_1 ← θ_a^(t)
10:  for i = 2 : l do
11:    Sample metatrain set {(x_j^(a), y_j^(a)) ∼ D_a}_{j=1}^{n_b}
12:    β_i = β_{i−1} − α2 ∇_{β_{i−1}} [L^(a)(ψ^(t), β_{i−1}) + Rφ(β_{i−1})]
13:  end for
14:  θ̂_a^(t) = β_l
15:  Sample metatest set {(x_j^(b), y_j^(b)) ∼ D_b}_{j=1}^{n_b}
16:  Perform the meta-update for the regularizer: φ^(t) = φ^(t−1) − α2 ∇_φ L^(b)(ψ^(t), θ̂_a^(t)) |_{φ=φ^(t−1)}
17: end for


Table 1: Cross-domain recognition accuracy (in %) averaged over 5 runs on the PACS dataset using the Alexnet architecture. For the baseline setting, the numbers in parentheses indicate the baseline performance as reported by [19].

Method          | Art painting         | Cartoon              | Photo                | Sketch               | Average
Baseline        | 67.21 ± 0.72 (64.91) | 66.12 ± 0.51 (64.28) | 88.47 ± 0.63 (86.67) | 55.32 ± 0.44 (53.08) | 69.28 (67.24)
D-MTAE ([10])   | 60.27                | 58.65                | 91.12                | 47.68                | 64.48
DSN ([3])       | 61.13                | 66.54                | 83.25                | 58.58                | 67.37
DBA-DG ([18])   | 62.86                | 66.97                | 89.50                | 57.51                | 69.21
MLDG ([19])     | 66.23                | 66.88                | 88.0                 | 58.96                | 70.01
MetaReg (Ours)  | 69.82 ± 0.76         | 70.35 ± 0.63         | 91.07 ± 0.41         | 59.26 ± 0.31         | 72.62

3.4 Summary of the training pipeline

The feature network is first trained using combined data from all source domains, and is kept frozen for the rest of training. The regularizer parameters are then estimated using the meta-learning procedure described in the previous section. As the individual task networks are updated on their respective source domain data, the regularizer updates are derived from each point of this SGD path with the objective of cross-domain generalization (refer to Alg. 1). To learn the regularizer effectively at the early stages of the task network updates, a replay memory is used, where the regularizer updates are periodically derived from the early stages of the task networks' SGD paths (see the sketch below). The learnt regularizer is used in the final step of the training process, where a single F-T network is trained using the regularized cross-entropy loss.
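The replay memory is not specified in detail here; one plausible reading, sketched below purely as an assumption, is a small buffer of task-network parameter snapshots taken along the SGD path, from which the starting point of Eq. (2) is occasionally drawn instead of the latest iterate.

```python
import random
from collections import deque

# Hypothetical replay buffer of task-network parameter snapshots: the inner loop
# of Eq. (2) is occasionally started from an earlier point on a task network's
# SGD path rather than from its latest iterate.
snapshot_buffer = deque(maxlen=100)

def maybe_store(step, theta_a, every=50):
    # Periodically snapshot the task-network parameters along their SGD path.
    if step % every == 0:
        snapshot_buffer.append(theta_a.detach().clone())

def sample_start(theta_a_current, p_replay=0.3):
    # With probability p_replay, start the regularizer update from an old snapshot.
    if snapshot_buffer and random.random() < p_replay:
        return random.choice(snapshot_buffer)
    return theta_a_current
```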

4 Experiments

In this section, we describe the experimental validation of our proposed approach. We perform experiments on two benchmark domain generalization datasets: multi-domain image recognition using the PACS dataset [18] and sentiment classification using the Amazon Reviews dataset [2].

4.1 PACS dataset

The PACS dataset is a recently proposed benchmark dataset for domain generalization. It contains images from four domains: Photo, Art painting, Cartoon and Sketch. Following [19], we perform experiments in four settings: in each setting, one of the four domains is treated as the unseen target domain, and the model is trained on the other three source domains.

Alexnet The first set of experiments is based on the Alexnet [16] model pretrained on Imagenet. The feature network F comprises the layers of the Alexnet model up to the pool5 layer, while the task network T contains the fc6, fc7 and fc8 layers. For the regularizer network, we used the weighted L1 loss, i.e. Rφ(θ) = Σ_i φ_i |θ_i|, where φ_i are the parameters estimated using meta-learning. In all our experiments, the Baseline setting denotes training a neural network (Alexnet in this case) on all of the source domains without performing any domain generalization. Other comparison methods include Multi-task Autoencoders (MTAE) [10], Domain Separation Networks (DSN) [3], Artier Domain Generalization (DBA-DG) [18] and MLDG [19]. While some of these methods were originally proposed for domain adaptation, they were adapted to the domain generalization problem as done in [19].

All our models are trained using the SGD optimizer with learning rate 5e−4 and a batch size of 64. This is in accordance with the setup used in [19]. Table 1 presents the results of our approach along with the comparison methods. We observe that our method obtains a performance improvement of 3.34% over the baseline, thus achieving state-of-the-art performance on this dataset.

Resnet One disadvantage of approaches like MLDG [19] is that they require differentiating through k steps of optimization updates, which might not be scalable to deeper architectures like Resnet. Even our approach requires a similar optimization process. However, unlike [19], we perform meta-learning only on the task network. Since the task network is much shallower than the feature network, our approach is scalable even to some of the contemporary deep architectures. In this section, we show experiments using two such architectures: Resnet-18 and Resnet-50.


We use the Resnet-18 and Resnet-50 models pretrained on ImageNet as our feature network, and the last fully connected layer as our task network. Similar to the previous experiment, we used the weighted L1 loss as our class of regularizers. All models were trained using the SGD optimizer with a learning rate of 0.001 and momentum 0.9. The hyper-parameters α1 and α2 are both set to 0.001. The results of our experiments are reported in Table 2. Our method performs better than the baseline in both settings. It is important to note that the baseline numbers for the Resnet architectures are much higher than those of Alexnet. Even on such stronger baselines, our method gives a performance improvement.
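The analogous split for the Resnet models, where only the final fully connected layer forms the task network, might look as follows (again a sketch, not the authors' code):

```python
import torch.nn as nn
import torchvision

# Analogous split for Resnet-18: everything up to global average pooling acts as
# the frozen feature network F, and a fresh final fully connected layer is the
# task network T on which the regularizer operates.
resnet = torchvision.models.resnet18(pretrained=True)
task_net = nn.Linear(resnet.fc.in_features, 7)   # last FC layer = task network (7 PACS classes)
resnet.fc = nn.Identity()                        # the backbone now outputs 512-d features
feature_net = resnet
```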

4.2 Sentiment Classification

In this section, we perform experiments on the task of sentiment classification on the Amazon Reviews dataset as pre-processed by [5]. The dataset contains reviews of products belonging to four domains: books, DVD, electronics and kitchen appliances. The differences in the textual descriptions of the reviews in these product categories manifest as domain shift. Following [9], we use unigrams and bigrams as features, resulting in 5000-dimensional vector representations. The reviews are assigned binary labels: 0 if the rating of the product is up to 3 stars, and 1 if the rating is 4 or 5 stars.

We conduct 4 cross-domain experiments: in each setting, one of the four domains is treated as the unseen test domain, and the other three domains are used as source domains. Similar to [9], we used a neural network with one hidden layer (with 100 neurons) as our task network. All models were trained using an SGD optimizer with learning rate 0.01 and momentum 0.9 for 5000 iterations. The results of our experiments are reported in Table 3. Since there is significant variation in performance over runs, each experiment was repeated 10 times with different random weight initializations, and the averages of these 10 runs are reported. We observe that our method performs better than the baseline in all of the settings. However, the performance improvement is smaller compared to the previous experiments because of the nature of the problem and the architectural choice. We would like to point out that even domain adaptation methods that make use of unlabeled target data achieve similar gains in performance [9] on this dataset.
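A sketch of this setup, assuming a scikit-learn CountVectorizer for the 5000-dimensional unigram/bigram features (the exact preprocessing pipeline of [5]/[9] may differ) and hypothetical per-domain lists book_review_texts and book_star_ratings:

```python
import torch.nn as nn
from sklearn.feature_extraction.text import CountVectorizer

# Unigram + bigram bag-of-words features capped at 5000 dimensions, and binary
# labels derived from star ratings; `book_review_texts` and `book_star_ratings`
# are hypothetical lists standing in for one domain's data.
vectorizer = CountVectorizer(ngram_range=(1, 2), max_features=5000)
X_books = vectorizer.fit_transform(book_review_texts)
y_books = [0 if stars <= 3 else 1 for stars in book_star_ratings]

# Task network used in this experiment: one hidden layer with 100 neurons.
task_net = nn.Sequential(nn.Linear(5000, 100), nn.ReLU(), nn.Linear(100, 2))
```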

5 Ablation Study

For all the ablation experiments except 5.3, we use the Resnet-18 model as our neural network architecture, and the Art painting setting of the PACS dataset as our experimental setting, i.e. we use the Art painting domain as the test domain, and Cartoon, Photo and Sketch as source domains.

5.1 Class of Regularizers

In this experiment, we study the effect of different regularizers on the performance of our approach. We experimented with the following classes of regularizers: (1) weighted L1 loss: Rφ(θ) = Σ_i φ_i |θ_i|, (2) weighted L2 loss: Rφ(θ) = Σ_i φ_i θ_i^2, and (3) a two-layer neural network: Rφ(θ) = (φ^(2))ᵀ ReLU((φ^(1))ᵀ θ). The performance of these regularizers is reported in Table 4. We observe that the weighted L1 regularizer performs the best among the three. Also, we observed that training networks with the weighted L1 regularizer leads to better convergence and stability in performance compared to the other two. We also compare our approach with two other schemes: (1) DropConnect [31] and (2) default L1 regularization, which is weighted L1 regularization with all weights φ_i = 1. We observe that neither of these schemes improves the baseline performance.
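Written out as code, the three regularizer families compared here might look as follows; the hidden width of the two-layer network is an illustrative assumption, not a value reported in the paper.

```python
import torch
import torch.nn as nn

# The three regularizer families compared in Table 4, acting on a flattened
# task-parameter vector theta.
def weighted_l1(phi, theta):                  # R_phi(theta) = sum_i phi_i |theta_i|
    return (phi * theta.abs()).sum()

def weighted_l2(phi, theta):                  # R_phi(theta) = sum_i phi_i theta_i^2
    return (phi * theta.pow(2)).sum()

class TwoLayerRegularizer(nn.Module):         # R_phi(theta) = phi2^T ReLU(phi1^T theta)
    def __init__(self, n_params, hidden=128):
        super().__init__()
        self.layer1 = nn.Linear(n_params, hidden, bias=False)
        self.layer2 = nn.Linear(hidden, 1, bias=False)
    def forward(self, theta):
        return self.layer2(torch.relu(self.layer1(theta))).squeeze()
```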

Table 2: Cross-domain recognition accuracy (in %) averaged over 5 runs on the PACS dataset using Resnet architectures

Method          | Art painting | Cartoon     | Photo       | Sketch      | Average

Resnet-18
Baseline        | 79.9 ± 0.22  | 75.1 ± 0.35 | 95.2 ± 0.18 | 69.5 ± 0.37 | 79.9
MetaReg (Ours)  | 83.7 ± 0.19  | 77.2 ± 0.31 | 95.5 ± 0.24 | 70.3 ± 0.28 | 81.7

Resnet-50
Baseline        | 85.4 ± 0.24  | 77.7 ± 0.31 | 97.8 ± 0.17 | 69.5 ± 0.42 | 82.6
MetaReg (Ours)  | 87.2 ± 0.13  | 79.2 ± 0.27 | 97.6 ± 0.31 | 70.3 ± 0.18 | 83.6


Table 3: Cross-domain classification accuracy (in %) averaged over 10 runs on the Amazon Reviews dataset

Method          | Books       | DVD         | Electronics | Kitchen     | Average
Baseline        | 75.5 ± 0.52 | 79.0 ± 0.37 | 83.7 ± 0.44 | 84.7 ± 0.63 | 80.7
MetaReg (Ours)  | 76.1 ± 0.41 | 79.6 ± 0.32 | 83.9 ± 0.28 | 85.1 ± 0.43 | 81.2

Table 4: Effect of different classes of regularization functions

Baseline | DropConnect [31] | Default L1 | Weighted L1 | Weighted L2 | 2-layer NN
79.9     | 80.1             | 79.7       | 83.7        | 83.2        | 83.3

5.2 Availability of data over time

In all of the previous experiments, we assumed that the entire training data is available from the start of the training process. But consider a more general setting where we train our model on some initial data, and more data becomes available over time. Is it possible to make use of the newly available data to improve our models without having to perform meta-learning again? We propose the following solution: train the feature network, task network and regularizer on the initial dataset; on the new data, finetune the task network and feature network using the regularizer trained on the initial data. Note that we do not perform meta-learning again on the new data, so this is computationally efficient, since the meta-learning procedure incurs significant overhead over a regular finetuning process. With approaches like MLDG [19], meta-learning has to be performed even on the new data.

We simulate these experimental conditions as follows: in each setting, we consider a fraction f of the PACS dataset as our initial dataset, on which our model and the regularizer are trained. We then finetune our model on the remaining data using the trained regularizer. The performance of these models on the test set is shown in Table 5. We observe that there is little drop in performance for all f values. Our approach is able to learn good regularizers even with 10% of the entire dataset.

Table 5: Experiments for training models on less data

Data fraction f  | 0.1   | 0.2   | 0.3   | 0.4   | 0.5   | 1.0
Accuracy (in %)  | 82.86 | 83.11 | 83.42 | 83.62 | 83.60 | 83.71

5.3 Effect of the number of layers regularized

In our training paradigm, the neural network is decomposed into a feature network and a task network, and the meta-regularization is performed only on the task network. Deciding this feature/task network split is a design choice that needs to be understood. The effect of varying the number of regularized layers on domain generalization performance is reported in Table 6. This experiment is performed using the Alexnet architecture on the PACS dataset with Cartoon as the target domain. We observe that as the number of regularized layers increases, the generalization performance increases and saturates beyond a point.

[Figure 2 panels: (a) No reg (b) Reg, f=0.1 (c) Reg, f=0.5 (d) Reg, f=1]

Figure 2: Histogram of the weights learnt by the task network. "No reg" corresponds to the network without regularization, and "Reg, f=x" corresponds to the regularized network, where the regularizer R is trained on only a fraction x of the data.

Table 6: Effect of varying the number of regularized layers on cross-domain generalization, on the PACS dataset using the Alexnet model. Cartoon is used as the test domain.

Layers regularized | None  | fc8   | fc7 + fc8 | fc6 + fc7 + fc8
Accuracy (in %)    | 66.12 | 67.31 | 70.10     | 70.35

5.4 Visualizing the weights

We plot the histogram of the weights learnt by the task network with and without the use of our regularizer in Fig. 2. The following observations can be made: (1) For the network with regularization, there is a sharp peak at 0. This is because the weights θ_i for which φ_i is positive are decayed to 0. (2) The weights of the network with regularization have a wider spread compared to the network without regularization. This is because the weights θ_i for which φ_i is negative are boosted, due to which certain weights have high values.

6 Conclusion and Future Work

In this work, we addressed the problem of domain generalization by using regularization. The task of finding the desired regularizer that captures the notion of domain generalization is modeled as a meta-learning problem. Experiments indicate that the learnt regularizers achieve good cross-domain generalization on benchmark domain generalization datasets. Some avenues for future work include scalable meta-learning approaches for learning regularization functions over convolutional layers while preserving the spatial dependency between the channels, and extending our approach to deep reinforcement learning problems.

7 Acknowledgement

This research was supported by MURI from the Army Research Office under Grant No. W911NF-17-1-0304. This is part of the collaboration between US DOD, UK MOD and UK Engineering and Physical Research Council (EPSRC) under the Multidisciplinary University Research Initiative.

References

[1] Marcin Andrychowicz, Misha Denil, Sergio Gómez, Matthew W. Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3981–3989. Curran Associates, Inc., 2016.

[2] John Blitzer, Ryan McDonald, and Fernando Pereira. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP '06, 2006.

[3] Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. Domain separation networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 343–351. Curran Associates, Inc., 2016.

[4] Minmin Chen, Kilian Q. Weinberger, and John Blitzer. Co-training for domain adaptation. In Advances in Neural Information Processing Systems 24. Curran Associates, Inc., 2011.

[5] Minmin Chen, Zhixiang Eddie Xu, Kilian Q. Weinberger, and Fei Sha. Marginalized denoising autoencoders for domain adaptation. In ICML. icml.cc / Omnipress, 2012.

[6] Hal Daume III. Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, June 2007.

[7] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 1126–1135, 2017.

[8] Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot visual imitation learning via meta-learning. Proceedings of Machine Learning Research, 2017.

[9] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. J. Mach. Learn. Res., 17(1):2096–2030, January 2016.

[10] Muhammad Ghifary, W. Bastiaan Kleijn, Mengjie Zhang, and David Balduzzi. Domain generalization for object recognition with multi-task autoencoders. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, 2015.

[11] Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsupervised domain adaptation. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2066–2073, 2012.

[12] Raghuraman Gopalan, Ruonan Li, and Rama Chellappa. Domain adaptation for object recognition: An unsupervised approach. In Proceedings of the 2011 International Conference on Computer Vision, ICCV '11, 2011.

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.

[14] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML'15, pages 448–456, 2015.

[15] Aditya Khosla, Tinghui Zhou, Tomasz Malisiewicz, Alexei A. Efros, and Antonio Torralba. Undoing the damage of dataset bias. In Proceedings of the 12th European Conference on Computer Vision - Volume Part I, ECCV'12, 2012.

[16] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097–1105, 2012.

[17] Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems, pages 950–957, 1992.

[18] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M. Hospedales. Deeper, broader and artier domain generalization. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 5543–5551, 2017.

[19] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M. Hospedales. Learning to generalize: Meta-learning for domain generalization. CoRR, abs/1710.03463, 2017.

[20] Ke Li and Jitendra Malik. Learning to optimize neural nets. CoRR, abs/1703.00441, 2017.

[21] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I. Jordan. Learning transferable features with deep adaptation networks. In Proceedings of the 32nd International Conference on Machine Learning, pages 97–105, 2015.

[22] K. Muandet, D. Balduzzi, and B. Schölkopf. Domain generalization via invariant feature representation. In Proceedings of the 30th International Conference on Machine Learning, W&CP 28(1), pages 10–18. JMLR, 2013.

[23] Jie Ni, Qiang Qiu, and Rama Chellappa. Subspace interpolation via dictionary learning for unsupervised domain adaptation. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23-28, 2013, pages 692–699, 2013.

[24] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR), 2017.

[25] Swami Sankaranarayanan, Yogesh Balaji, Carlos D. Castillo, and Rama Chellappa. Generate to adapt: Aligning domains using generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[26] Swami Sankaranarayanan, Yogesh Balaji, Arpit Jain, Ser Nam Lim, and Rama Chellappa. Learning from synthetic data: Addressing domain shift for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[27] Jürgen Schmidhuber. On learning how to learn learning strategies. Technical report, 1995.

[28] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1), January 2014.

[29] Damien Teney and Anton van den Hengel. Visual question answering as a meta learning task. CoRR, abs/1711.08105, 2017.

[30] Sebastian Thrun and Lorien Pratt, editors. Learning to Learn. Kluwer Academic Publishers, Norwell, MA, USA, 1998.

[31] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using DropConnect. In Proceedings of the 30th International Conference on Machine Learning, pages 1058–1066, 2013.

[32] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.

[33] Yang Zhang, Philip David, and Boqing Gong. Curriculum domain adaptation for semantic segmentation of urban scenes. In The IEEE International Conference on Computer Vision (ICCV), October 2017.

[34] Fengwei Zhou, Bin Wu, and Zhenguo Li. Deep meta-learning: Learning to learn in the concept space. CoRR, abs/1802.03596, 2018.