
Learning Transferable Feature Representations Using Neural Networks

Himanshu S. Bhatt* (AI Labs, American Express, Bengaluru, India) [email protected]
Arun Rajkumar (IIT Madras, Chennai, India) [email protected]
Shourya Roy* (AI Labs, American Express, Bengaluru, India) [email protected]
Sriranjani Ramakrishnan (Conduent Labs, Bengaluru, India) [email protected]

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4124-4134, Florence, Italy, July 28 - August 2, 2019. © 2019 Association for Computational Linguistics.

Abstract

Learning representations such that the source and target distributions appear as similar as possible has benefited transfer learning tasks across several applications. Generally, it requires labeled data from the source and only unlabeled data from the target to learn such representations. While these representations act as a bridge to transfer knowledge learned in the source to the target, they may lead to negative transfer when the source specific characteristics detract from their ability to represent the target data. We present a novel neural network architecture to simultaneously learn a two-part representation based on the principle of segregating the source specific representation from the common representation. The first part captures the source specific characteristics while the second part captures the truly common representation. Our architecture optimizes an objective function which acts adversarially for the source specific part if it contributes towards the cross-domain learning. We empirically show that the two parts of the representation, in different arrangements, outperform existing learning algorithms on the source learning as well as cross-domain tasks on multiple datasets.

1 Introduction

Unsupervised domain adaptation is a subfield of machine learning where one learns from annotated data in a source domain with the aim of performing well on non-annotated data in a target domain. This attractive feature, wherein the data, distributions and tasks may vary across domains, has led to the widespread use of domain adaptation algorithms in several real world applications.

* Research done while working for Xerox Research Centre India.

Figure 1: Illustrates the fundamental idea of learning transferable feature representations behind the proposed technique.

A typical domain adaptation algorithm is provided with annotated source data and non-annotated target data, and it learns a 'common representation' where the source and target data distributions look similar. In this common representation, a model trained on the source data is expected to perform well on the target data as well.

While learning a common representation is useful for transferring knowledge from the source to the target domain, it may often lead to 'negative transfer' if we do not account for the fundamental question of "what to transfer". It is observed that each domain has specific features that are highly discriminating only within that domain and contribute negatively if transferred across domains in a brute force manner (Pan and Yang, 2010), as shown in Figure 1. Traditional domain adaptation algorithms, being oblivious to such source specific characteristics, learn common representations which suffer from transfer loss as the source specific characteristics restrict their transferability. Moreover, it is also observed that the representation learned for domain adaptation optimizes for performance in the target domain, often at the cost of source classification performance.

While this can be justified for domain adaptation, where the primary objective is maximizing the target performance, a technique that simultaneously sustains the source performance will always be preferred.

Our primary contribution is a novel neural network learning algorithm based on the principle of a two-part hidden representation where individual parts can be disentangled or combined for learning tasks in different domains. We highlight some of the salient features of our algorithm:

• A novel technique for learning a two-part representation between domains, one part comprising source specific and the other comprising common characteristics.

• The two-part representation behaves differently for different learning objectives:

1. For the cross-domain task, explicitly learning the source specific representation and keeping it separate from the common representation enhances the performance in the target domain.

2. For the source learning task, the source specific and common units come together to sustain the source performance, where the performance of most domain adaptation algorithms is compromised.

The proposed neural network architecture achieves this through an objective function which acts adversarially for a part of the representation (the source specific part) if it contributes to the cross-domain learning. Moreover, the proposed two-part representation learning approach also mitigates the possible effects of "negative transfer", as learning separate source specific and common representations evades the influence of source specific characteristics on the common representation.

2 Related Work

The problem of domain adaptation has gained a lot of attention due to its huge practical implications. Pan et al. (2010) focus on learning a common representation that minimizes the divergence between the source and target domains. A large body of work exists in the literature, including learning non-linear mappings (Daume III, 2009; Pan et al., 2011; Blitzer et al., 2007; Pan et al., 2010; Barnes et al., 2018), mappings to mitigate domain divergence (Pan et al., 2010), common features (Dai et al., 2007; Dhillon et al., 2003), ensemble based approaches (Bhatt et al., 2015), subspace based methods (Gopalan et al., 2011; Gong et al., 2012; Harel and Mannor, 2010; Fernando et al., 2013) and neural network based methods (Glorot et al., 2011; Chopra et al., 2013; Long and Wang, 2015; Tzeng et al., 2014).

A variant of unsupervised models, namely marginalized stacked denoising autoencoders (mSDA) (Chen et al., 2012a), learns representations robust to input corruption noise, which are stable across changes in domains and thus allow cross-domain transfer. Existing literature exploits the principle of representations generalizing across domains for classification, without labelled data from the target (Sarma et al., 2018; Bhatt et al., 2016) and with labelled data from the target (Zhang et al., 2018). Our work emphasizes domain discrimination by incorporating domain divergence and source risk minimization into the objective for learning a better transferable representation without any labelled data from the target domain. Another line of work aims to achieve distribution consistency between the source and target domains with linear data reconstruction, such as co-regularization based augmented spaces (Kumar et al., 2010), coupled learning to link target-specific features to source features (Blitzer et al., 2011) and transfer of source examples to the target and vice-versa (Zhou et al., 2016).

Domain adversarial neural networks (DANN) (Ajakan et al., 2014; Ganin et al., 2016), closely similar in philosophy to our work, learn a single representation by using an adversarial (Liu et al., 2017) gradient reversal component for domain divergence. In DANN, the entire hidden layer contributes unanimously towards the source classification and domain divergence objectives. Unlike DANN, our approach segregates the hidden layer, and the two components of the hidden layer are treated differently for different objectives. Both the source specific and common parts contribute positively to the source classification objective. However, for the domain divergence objective, the common part contributes positively (i.e., tries to minimize divergence by maximizing the domain regressor's loss), whereas the source specific part contributes negatively (i.e., tries to maximize divergence by minimizing the domain regressor's loss).

Generative adversarial networks (GANs) (Goodfellow et al., 2014) build generative models to synthesize samples and fall into a closely related category due to the similar method of measuring and minimizing the discrepancy between feature distributions. However, the GAN model learns the representation in a generative mode, while our work is based on discriminative learning.

Domain separation networks (DSN) (Bousmalis et al., 2016), inspired by shared-space component analysis, explicitly and jointly model the domain-specific (private) and shared components of the domain representation. DSN is based on a CNN, while ours is a feed-forward network based on a discriminating adversarial framework. The objective function of DSN has separate losses for difference, similarity, reconstruction and the task, while our approach follows a min-max optimization criterion that minimizes the domain specific component's loss and maximizes the shared component's loss.

Jiang and Zhai (2007) also proposed a two-stage approach for domain generalization and adaptation, where the first stage finds a generalizable feature representation across domains and its appropriate weights. The second stage picks features useful for the target domain using semi-supervised learning. Their approach is semi-supervised, using labelled data from the source and target domains along with linear classifiers. In contrast, our framework is unsupervised (no labeled data from the target) and leverages a non-linear neural network classifier.

3 Problem Formulation

Let us consider a binary classification task where $X \subseteq \mathbb{R}^n$ is the input space and $Y = \{0, 1\}$ is the label space. We have two different distributions over $X \times Y$, called the source domain $\mathcal{D}_s$ and the target domain $\mathcal{D}_t$. We have labeled samples $S$ drawn i.i.d. from $\mathcal{D}_s$ and unlabeled samples $T$ drawn i.i.d. from $\mathcal{D}_t$:

$$S = \{(x^s_i, y^s_i)\}_{i=1}^{m} \sim (\mathcal{D}_s)^m; \qquad T = \{x^t_i\}_{i=1}^{m'} \sim (\mathcal{D}_t)^{m'}$$

where $m$ and $m'$ are the numbers of labeled source and unlabeled target samples. Let $h(\cdot)$ be the $D$-dimensional hidden representation of the network, which is further decomposed as $h(\cdot) = h_{ss}(\cdot) \oplus h_c(\cdot)$, where $h_{ss}(\cdot)$, $h_c(\cdot)$ and $\oplus$ denote the source specific representation, the common representation and concatenation, respectively. The neural network is parametrized by $\{W, V, b, c\}$. Our objective is to learn the two parts of the hidden layer such that the source specific characteristics $h_{ss}(\cdot)$ do not detract from the ability of the common representation $h_c(\cdot)$ to generalize to the target task. Let $W$ be the weight matrix between the input and hidden units, and let $W'$ and $W''$ be the weight matrices between the input units and the common and source specific units, respectively. Let $o(\cdot)$ and $o'(\cdot)$ be the domain regressors for the common and source specific representations, parametrized by $\{u, d\}$ and $\{u', d'\}$ respectively.

4 Proposed Neural Network Architecture

The proposed neural network is a fully connected architecture, as shown in Figure 2. The emphasis of our work, in contrast to most previous work, is not only on modeling the similarity between the domains but also on modeling the differences, i.e., the domain specific information. We propose to achieve this by learning a two-part hidden layer comprising the source specific part and the common part. The network tries to optimize two objectives: a classification objective and a domain divergence objective. The classification objective tries to minimize the mis-classifications in the labeled source data, while the domain divergence objective attempts to learn a representation where the source and target domain data appear close to each other. In our network, both the source specific part and the common part contribute positively to the source classification objective (i.e., minimize the mis-classification loss). However, for the domain divergence objective, the common part contributes positively (i.e., tries to minimize divergence) whereas the source specific part contributes negatively (i.e., tries to maximize divergence). Thus, the common representation acquires domain independence and generalizable classification abilities, while the source specific representation remains domain-specific and highly discriminating for the in-domain classification task.

4.1 Learning in Source Domain

A neural network architecture with one hidden layer learns the function $h : X \rightarrow \mathbb{R}^D$ to map the input to a $D$-dimensional representation:

$$h(x) = \mathrm{sigm}(Wx + b),$$

where $h(x) = h_{ss}(x) \oplus h_c(x)$, $\mathrm{sigm}(a) = \left[\frac{1}{1+\exp(-a_i)}\right]_{i=1}^{|a|}$, and $h$ is parametrized by a matrix-vector pair $(W, b) \in \mathbb{R}^{D \times n} \times \mathbb{R}^D$.
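To make the notation concrete, the following minimal NumPy sketch (not the authors' code; the layer sizes, the random initialization, and the convention of placing the source specific units first are our assumptions) computes the hidden layer and splits it into its two parts:

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_ss, n_c = 5000, 50, 200   # hypothetical input / source specific / common sizes
D = n_ss + n_c                    # total hidden dimension

W = rng.normal(scale=0.01, size=(D, n_in))   # input-to-hidden weights
b = np.zeros(D)                              # hidden bias

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def hidden(x):
    """h(x) = sigm(Wx + b), split as h_ss(x) (+) h_c(x)."""
    h = sigm(W @ x + b)
    return h, h[:n_ss], h[n_ss:]

x = rng.random(n_in)              # a dummy tf-idf style input vector
h, h_ss, h_c = hidden(x)
```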


Figure 2: Illustrates the architecture to simultaneously learn the common and source specific representations.

For source classification, our network follows a standard neural network architecture, where the output function $f : \mathbb{R}^D \rightarrow [0, 1]^L$ is given as:

$$f(x) = \mathrm{softmax}(V h(x) + c)$$

Given source examples $S = \{(x^s_i, y^s_i)\}_{i=1}^{m}$ and the classification loss as the negative log-probability of the correct label:

$$\ell(f(x), y) = \log \frac{1}{f_y(x)}$$

the objective function for the source classification task becomes:

$$\min_{W,V,b,c} \left[\frac{1}{m} \sum_{i=1}^{m} \ell(f(x^s_i), y^s_i)\right] \quad (1)$$
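Continuing the same illustrative setup, a short sketch (hypothetical sizes; not the paper's implementation) of the softmax output and the negative log-probability loss used in Eq. 1:

```python
import numpy as np

rng = np.random.default_rng(1)
D, L = 250, 2                       # hidden size and number of labels (binary task)

V = rng.normal(scale=0.01, size=(L, D))
c = np.zeros(L)

def softmax(a):
    e = np.exp(a - a.max())         # shifted for numerical stability
    return e / e.sum()

def classify(h):
    """f(x) = softmax(V h(x) + c)."""
    return softmax(V @ h + c)

def nll_loss(f, y):
    """l(f(x), y) = log(1 / f_y(x)) = -log f_y(x)."""
    return -np.log(f[y])

h = rng.random(D)                   # hidden representation h_ss (+) h_c from the previous sketch
loss = nll_loss(classify(h), 1)
```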

4.2 Domain Divergence

Theoretical results in the transfer learning literature (Ben-David et al., 2010) show that adapting to a target domain from a source domain depends on a measure of similarity between the two. A formal measure used in this context is known as the $\mathcal{H}$-divergence. Intuitively, it is based on the capacity of a hypothesis class $\mathcal{H}$ to distinguish between examples generated by a pair of source-target tasks.

Definition 1 Given feature distributions of two domains, $\mathcal{D}_s$ and $\mathcal{D}_t$, and a hypothesis class $\mathcal{H}$, the $\mathcal{H}$-divergence between $\mathcal{D}_s$ and $\mathcal{D}_t$ is defined as:

$$d_{\mathcal{H}}(\mathcal{D}_s, \mathcal{D}_t) = 2 \sup_{\eta \in \mathcal{H}} \left| \Pr_{x^s \sim \mathcal{D}_s}[\eta(x^s) = 1] - \Pr_{x^t \sim \mathcal{D}_t}[\eta(x^t) = 1] \right|$$

We employ a result due to Ben-David et al. (2010), who proved that for a symmetric hypothesis class $\mathcal{H}$, one can compute an approximate empirical $\mathcal{H}$-divergence by running a learning algorithm on the problem of discriminating between source and target examples. For this, we construct a new dataset:

$$\{(x^s_i, 1)\}_{i=1}^{m} \cup \{(x^t_j, 0)\}_{j=1}^{m'},$$

where the target and source samples are labeled as 0 and 1 respectively. Then, the error $\varepsilon$ of the classifier trained on the above dataset can be used as an approximation of the $\mathcal{H}$-divergence, termed the proxy-$\mathcal{A}$ distance (PAD), given as:

$$d_{\mathcal{A}} = 2(1 - 2\varepsilon)$$
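The PAD can be approximated with a few lines of code. The sketch below is only an illustration of the recipe above; the choice of scikit-learn's logistic regression as the domain discriminator and of a held-out split to estimate the error ε are our assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def proxy_a_distance(h_src, h_tgt, seed=0):
    """Approximate d_A = 2(1 - 2*eps) from a source-vs-target domain classifier."""
    X = np.vstack([h_src, h_tgt])
    z = np.concatenate([np.ones(len(h_src)), np.zeros(len(h_tgt))])  # source = 1, target = 0
    X_tr, X_te, z_tr, z_te = train_test_split(X, z, test_size=0.3, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, z_tr)
    eps = 1.0 - clf.score(X_te, z_te)          # generalization error of the domain classifier
    return 2.0 * (1.0 - 2.0 * eps)

# Dummy representations: a low PAD means the two domains are hard to tell apart.
rng = np.random.default_rng(0)
print(proxy_a_distance(rng.normal(size=(200, 50)), rng.normal(size=(200, 50))))
```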

Let the common representations for the source and target samples be $h_c(S) = \{h_c(x^s_i)\}_{i=1}^{m}$ and $h_c(T) = \{h_c(x^t_i)\}_{i=1}^{m'}$ respectively. Let $d^c_{\mathcal{H}}(h_c(S), h_c(T))$ be the empirical $\mathcal{H}$-divergence on the common representation, given as:

$$d^c_{\mathcal{H}}(h_c(S), h_c(T)) = 2\left(1 - \min_{\eta \in \mathcal{H}}\left[\frac{1}{m}\sum_{i=1}^{m} I[\eta(h_c(x^s_i)) = 1] + \frac{1}{m'}\sum_{i=1}^{m'} I[\eta(h_c(x^t_i)) = 0]\right]\right)$$


Algorithm 1 Learning Two-Part Hidden Representation for a Neural Network

Input: samples $S = \{(x^s_i, y^s_i)\}_{i=1}^{m}$ and $T = \{x^t_i\}_{i=1}^{m'}$, hidden layer size $l$ with $n_{ss}$ source specific and $n_c$ common nodes, adaptation parameter $\lambda$, learning rate $\alpha$.
Output: neural network $\{W, V, b, c\}$.
Initialization: $W, V \leftarrow \mathrm{random\_init}(l)$; $b, c, u, d, u', d' \leftarrow 0$

while stopping criteria is not met do
  for $i$ from 1 to $m$ do
    # Forward propagation
    $h(x^s_i) \leftarrow \sigma(b + W x^s_i)$    # where $h(x^s_i) = h_c(x^s_i) \oplus h_{ss}(x^s_i)$
    $f(x^s_i) \leftarrow \mathrm{softmax}(c + V h(x^s_i))$

    # Backpropagation
    $\Delta_c \leftarrow -(e(y^s_i) - f(x^s_i))$
    $\Delta_V \leftarrow \Delta_c \, h(x^s_i)^\top$
    $\Delta_b \leftarrow (V^\top \Delta_c) \odot h(x^s_i) \odot (1 - h(x^s_i))$
    $\Delta_W \leftarrow \Delta_b \, (x^s_i)^\top$

    # Domain adaptation regularizer...
    # ...from current domain, common representation
    $o(x^s_i) \leftarrow \sigma(d + u^\top h_c(x^s_i))$
    $\Delta_d \leftarrow \lambda(1 - o(x^s_i))$;  $\Delta_u \leftarrow \lambda(1 - o(x^s_i)) h_c(x^s_i)$
    $tmp \leftarrow \lambda(1 - o(x^s_i)) \, u \odot h_c(x^s_i) \odot (1 - h_c(x^s_i))$
    $\Delta_b \leftarrow \Delta_b + tmp$;  $\Delta_{W'} \leftarrow \Delta_{W'} + tmp \cdot (x^s_i)^\top$

    # ...from current domain, source specific representation
    $o'(x^s_i) \leftarrow \sigma(d' + u'^\top h_{ss}(x^s_i))$
    $\Delta_{d'} \leftarrow \lambda(1 - o'(x^s_i))$;  $\Delta_{u'} \leftarrow \lambda(1 - o'(x^s_i)) h_{ss}(x^s_i)$
    $tmp \leftarrow \lambda(1 - o'(x^s_i)) \, u' \odot h_{ss}(x^s_i) \odot (1 - h_{ss}(x^s_i))$
    $\Delta_b \leftarrow \Delta_b + tmp$;  $\Delta_{W''} \leftarrow \Delta_{W''} + tmp \cdot (x^s_i)^\top$

    # ...from other domain, common representation
    $j \leftarrow \mathrm{uniform\_integer}(1, \ldots, m')$
    $h_c(x^t_j) \leftarrow \sigma(b + W x^t_j)$
    $o(x^t_j) \leftarrow \sigma(d + u^\top h_c(x^t_j))$
    $\Delta_d \leftarrow \Delta_d - \lambda o(x^t_j)$;  $\Delta_u \leftarrow \Delta_u - \lambda o(x^t_j) h_c(x^t_j)$
    $tmp \leftarrow -\lambda o(x^t_j) \, u \odot h_c(x^t_j) \odot (1 - h_c(x^t_j))$
    $\Delta_b \leftarrow \Delta_b + tmp$;  $\Delta_{W'} \leftarrow \Delta_{W'} + tmp \cdot (x^t_j)^\top$

    # ...from other domain, source specific representation
    $j \leftarrow \mathrm{uniform\_integer}(1, \ldots, m')$
    $h_{ss}(x^t_j) \leftarrow \sigma(b + W x^t_j)$
    $o'(x^t_j) \leftarrow \sigma(d' + u'^\top h_{ss}(x^t_j))$
    $\Delta_{d'} \leftarrow \Delta_{d'} + \lambda o'(x^t_j)$;  $\Delta_{u'} \leftarrow \Delta_{u'} + \lambda o'(x^t_j) h_{ss}(x^t_j)$
    $tmp \leftarrow \lambda o'(x^t_j) \, u' \odot h_{ss}(x^t_j) \odot (1 - h_{ss}(x^t_j))$
    $\Delta_b \leftarrow \Delta_b + tmp$;  $\Delta_{W''} \leftarrow \Delta_{W''} + tmp \cdot (x^t_j)^\top$

    # Update neural network parameters
    $W \leftarrow W - \alpha \Delta_W$;  $V \leftarrow V - \alpha \Delta_V$
    $W' \leftarrow W' - \alpha \Delta_{W'}$;  $W'' \leftarrow W'' - \alpha \Delta_{W''}$
    $b \leftarrow b - \alpha \Delta_b$;  $c \leftarrow c - \alpha \Delta_c$

    # Update domain classifier parameters
    $u \leftarrow u + \alpha \Delta_u$;  $d \leftarrow d + \alpha \Delta_d$
    $u' \leftarrow u' + \alpha \Delta_{u'}$;  $d' \leftarrow d' + \alpha \Delta_{d'}$
  end for
end while

where $I[\cdot]$ is the indicator function and $\eta(\cdot)$ is a hypothesis function from $\mathcal{H}$. To estimate the "min" part of the above equation, we use a logistic regression model that predicts the probability that a given input (using the common representation) is from the source domain $\mathcal{D}^x_S$ (denoted by $z = 1$) or the target domain $\mathcal{D}^x_T$ (denoted by $z = 0$):

$$p(z = 1 \mid \phi) = o(\phi) = \mathrm{sigm}(d + u^\top \phi)$$

where $\phi$ is either $h_c(x^s)$ or $h_c(x^t)$ and $o(\cdot)$ is the domain (logistic) regressor on the common representation with loss function $\ell_d(\cdot, \cdot)$ defined as:

$$\ell_d(o(\cdot), z) = -z \log(o(\cdot)) - (1 - z) \log(1 - o(\cdot)) \quad (2)$$
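For concreteness, a small sketch of the domain regressor $o(\cdot)$ on the common representation and the loss of Eq. 2 (parameter shapes and initialization are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
n_c = 200                                # hypothetical number of common hidden units

u = rng.normal(scale=0.01, size=n_c)     # domain regressor weights
d = 0.0                                  # domain regressor bias

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def o(phi):
    """p(z = 1 | phi) = sigm(d + u^T phi)."""
    return sigm(d + u @ phi)

def domain_loss(o_phi, z):
    """l_d(o(.), z) = -z log o(.) - (1 - z) log(1 - o(.))."""
    return -z * np.log(o_phi) - (1 - z) * np.log(1 - o_phi)

phi_s = rng.random(n_c)                  # common representation of a source example (z = 1)
print(domain_loss(o(phi_s), 1))
```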

Similarly, the divergence on the source specific representation, $d^{ss}_{\mathcal{H}}(h_{ss}(S), h_{ss}(T))$, is given as:

$$d^{ss}_{\mathcal{H}}(h_{ss}(S), h_{ss}(T)) = 2\left(1 - \min_{\eta \in \mathcal{H}}\left[\frac{1}{m}\sum_{i=1}^{m} I[\eta(h_{ss}(x^s_i)) = 1] + \frac{1}{m'}\sum_{i=1}^{m'} I[\eta(h_{ss}(x^t_i)) = 0]\right]\right)$$

The "min" part of the above equation is estimated using the domain regressor for the source specific representation, $o'(\phi') = \mathrm{sigm}(d' + u'^\top \phi')$, where $\phi'$ is either $h_{ss}(x^s)$ or $h_{ss}(x^t)$ and $\ell_{d'}(\cdot, \cdot)$ is its loss, defined similarly to Eq. 2.

4.3 The Learning Algorithm

Adding the domain regressor terms to the objective of Eq. 1, we get the final objective function:

$$\min_{W,V,b,c}\Bigg[\frac{1}{m}\sum_{i=1}^{m}\ell(f(x^s_i), y^s_i) + \lambda \max_{W',u,b,d}\bigg(-\frac{1}{m}\sum_{i=1}^{m}\ell_d(o(x^s_i), 1) - \frac{1}{m'}\sum_{i=1}^{m'}\ell_d(o(x^t_i), 0)\bigg) + \lambda \min_{W'',u',b,d'}\bigg(-\frac{1}{m}\sum_{i=1}^{m}\ell_{d'}(o'(x^s_i), 1) - \frac{1}{m'}\sum_{i=1}^{m'}\ell_{d'}(o'(x^t_i), 0)\bigg)\Bigg]$$

where the hyper-parameter $\lambda > 0$ is the domain adaptation regularization term that controls the trade-off between the source risk and the domain divergence terms. In other words, it controls how much weight is put on the generalizable common representation versus the source specific representation.

The optimization problem involves minimization with respect to some parameters and maximization with respect to the others. We use a stochastic gradient descent (SGD) approach which samples a pair of source and target examples $x^s_i, x^t_i$ and updates all the parameters of the neural network. The first term in the objective represents the source classification loss, and the updates for its associated parameters, i.e. $\{W, V, b, c\}$, follow the negative of the gradient to minimize this loss. The second term in the objective represents the loss of $o(\cdot)$, the domain regressor on the common representation. This term is maximized so as to diminish the ability of the domain regressor to detect whether a sample belongs to the source or the target domain using the common representation. This makes the two domains look similar by minimizing their divergence. Note that minimizing the domain divergence is equivalent to maximizing the loss of the domain regressor. Therefore, the associated parameters, $\{W', u, d\}$, are updated in the direction of the gradient (since we maximize with respect to them, instead of minimizing). The last term in the objective represents the loss of $o'(\cdot)$, the domain regressor on the source specific representation. We want the source specific representation to make the two domains appear largely distinct and hence, we minimize the loss of its domain regressor.

Table 1: Collections from the OSM dataset.

Target | Description | # Unlabelled | # Labelled
Col1 | Mobile support | 22645 | 5650
Col2 | Obama Healthcare | 36902 | 11050
Col3 | Microsoft Kinect | 20907 | 3258
Col4 | X-box | 36000 | 4580

The updates for its associated parameters, $\{W'', u', d'\}$, follow the negative of the gradient. The algorithm is detailed in Algorithm 1, where $e(y)$ represents a one-hot vector consisting of all 0s except for a 1 at position $y$, and $\odot$ represents the element-wise product.
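The update directions described above can be made concrete in a NumPy sketch of a single SGD step. It mirrors the update rules of Algorithm 1, but all sizes, initializations and variable names are our own simplifications rather than the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(3)
n_in, n_ss, n_c, L = 100, 10, 40, 2       # hypothetical sizes
D = n_ss + n_c
lam, alpha = 0.1, 1e-4                    # adaptation weight lambda and learning rate alpha

# Rows [0:n_ss] of W feed the source specific units (W''), rows [n_ss:] feed the common units (W').
W = rng.normal(scale=0.01, size=(D, n_in)); b = np.zeros(D)
V = rng.normal(scale=0.01, size=(L, D));    c = np.zeros(L)
u  = np.zeros(n_c);  d  = 0.0             # domain regressor on the common part
u2 = np.zeros(n_ss); d2 = 0.0             # domain regressor on the source specific part (u', d')

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def sgd_step(xs, ys, xt):
    """One Algorithm 1 style update from a labeled source example (xs, ys) and an unlabeled target example xt."""
    global W, b, V, c, u, d, u2, d2
    # Forward pass on the source example; hidden layer laid out as [source specific | common]
    h = sigm(b + W @ xs)
    h_ss, h_c = h[:n_ss], h[n_ss:]
    f = softmax(c + V @ h)

    # Backpropagation of the source classification loss
    dc = -(np.eye(L)[ys] - f)
    dV = np.outer(dc, h)
    db = (V.T @ dc) * h * (1 - h)
    dW = np.outer(db, xs)

    # Domain regularizer contribution from the source example, common representation
    o_s = sigm(d + u @ h_c)
    dd, du = lam * (1 - o_s), lam * (1 - o_s) * h_c
    tmp = np.zeros(D); tmp[n_ss:] = lam * (1 - o_s) * u * h_c * (1 - h_c)
    db += tmp; dW += np.outer(tmp, xs)

    # Domain regularizer contribution from the source example, source specific representation
    o2_s = sigm(d2 + u2 @ h_ss)
    dd2, du2 = lam * (1 - o2_s), lam * (1 - o2_s) * h_ss
    tmp = np.zeros(D); tmp[:n_ss] = lam * (1 - o2_s) * u2 * h_ss * (1 - h_ss)
    db += tmp; dW += np.outer(tmp, xs)

    # Domain regularizer contribution from the target example, common representation
    ht = sigm(b + W @ xt)
    ht_ss, ht_c = ht[:n_ss], ht[n_ss:]
    o_t = sigm(d + u @ ht_c)
    dd -= lam * o_t; du = du - lam * o_t * ht_c
    tmp = np.zeros(D); tmp[n_ss:] = -lam * o_t * u * ht_c * (1 - ht_c)
    db += tmp; dW += np.outer(tmp, xt)

    # Domain regularizer contribution from the target example, source specific representation
    o2_t = sigm(d2 + u2 @ ht_ss)
    dd2 += lam * o2_t; du2 = du2 + lam * o2_t * ht_ss
    tmp = np.zeros(D); tmp[:n_ss] = lam * o2_t * u2 * ht_ss * (1 - ht_ss)
    db += tmp; dW += np.outer(tmp, xt)

    # Parameter updates with the directions used in Algorithm 1
    W -= alpha * dW; V -= alpha * dV; b -= alpha * db; c -= alpha * dc
    u += alpha * du; d += alpha * dd; u2 += alpha * du2; d2 += alpha * dd2

sgd_step(rng.random(n_in), 1, rng.random(n_in))
```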

5 Experimental Evaluation

The effectiveness of the proposed technique, which learns source specific and common shared representations between domains, is evaluated on a cross-domain sentiment classification task.

5.1 Datasets

The first dataset used in this research is the Amazon review dataset (Blitzer et al., 2007), which has four domains comprising user reviews about Books (B), DVDs (D), Kitchen appliances (K) and Electronics (E) respectively. Each domain has 2000 reviews in total, with an equal number of positive and negative reviews. Each review is encoded as a 5000-dimensional unigram/bigram feature vector pre-processed into tf-idf values. The performance is compared on 12 different cross-domain classification tasks on the Amazon review dataset and is reported as classification accuracy for binary classification. For each task, 1400 labeled reviews from one domain constitute the source and 1400 unlabeled reviews from a different domain constitute the target. Unseen, non-overlapping sets of 200 and 400 reviews from the target domain are used as the validation and test sets.
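The feature construction described above can be approximated with scikit-learn; the vectorizer settings (5000 unigram/bigram tf-idf features) follow the text, while the placeholder review strings and variable names are ours:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder corpora: in practice these would be the Blitzer et al. (2007) review texts.
source_reviews = ["great book , well written", "boring plot and weak characters"]
target_reviews = ["the blender broke after a week", "sharp knives , solid build"]

# 5000-dimensional unigram/bigram tf-idf features, as described for the Amazon dataset.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)
X_source = vectorizer.fit_transform(source_reviews)   # labeled source domain
X_target = vectorizer.transform(target_reviews)       # unlabeled target domain
```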

The second dataset is from Twitter.com, comprising tweets about products and services in different domains, and is referred to as the online social media (OSM) dataset. Table 1 lists the different collections, where the tweets are collected based on user-defined keywords captured in a listening engine which then crawls the social media and fetches comments matching the keywords. This dataset, being noisy and comprising short text, is more challenging than the first dataset. We use labelled comments from the source and unlabelled comments from the target for learning.


Table 2: Comparing the cross-domain performance of different approaches on the Amazon review dataset. D→B represents the performance of an algorithm on unlabeled target domain B with D as the labeled source domain.

Method | D→B | E→B | K→B | B→D | E→D | K→D | D→E | B→E | K→E | D→K | B→K | E→K
SS | 43.2 | 47.3 | 47.0 | 43.5 | 47.4 | 46.7 | 48.6 | 47.6 | 48.4 | 51.2 | 51.0 | 49.6
NN | 76.8 | 71.4 | 74.2 | 71.4 | 72.0 | 67.3 | 71.6 | 72.4 | 77.6 | 76.5 | 77.8 | 79.6
SVM | 77.9 | 73.2 | 75.4 | 73.2 | 73.5 | 70.4 | 73.6 | 71.5 | 78.2 | 77.7 | 78.8 | 82.3
SCL | 78.7 | 75.3 | 76.8 | 78.2 | 75.0 | 73.1 | 75.3 | 75.8 | 84.0 | 77.1 | 79.3 | 85.4
SFA | 80.5 | 75.9 | 76.6 | 77.6 | 75.3 | 74.2 | 75.4 | 77.0 | 84.2 | 78.1 | 80.3 | 85.8
PJNMF | 81.8 | 77.2 | 78.8 | 79.4 | 76.3 | 75.8 | 76.4 | 77.8 | 84.4 | 79.0 | 81.6 | 86.4
SDA | 81.1 | 76.6 | 76.8 | 78.2 | 75.4 | 75.4 | 75.8 | 77.4 | 83.8 | 78.4 | 80.8 | 87.2
mSDA | 81.3 | 77.6 | 78.5 | 79.5 | 76.5 | 76.4 | 75.4 | 77.2 | 83.6 | 78.5 | 81.2 | 88.2
TLDA | 81.5 | 78.0 | 80.6 | 79.8 | 76.6 | 76.4 | 76.2 | 78.0 | 84.2 | 79.4 | 81.8 | 87.6
BTDNNs | 81.9 | 78.6 | 81.2 | 80.0 | 77.9 | 76.2 | 76.8 | 78.6 | 85.2 | 80.5 | 82.7 | 88.3
SS+Common | 78.8 | 76.7 | 77.3 | 74.4 | 77.8 | 73.6 | 74.4 | 76.8 | 80.4 | 78.6 | 80.2 | 83.5
DANN | 79.5 | 77.4 | 78.2 | 76.3 | 78.4 | 76.3 | 75.2 | 77.2 | 81.4 | 78.9 | 80.6 | 85.8
DSN | 81.5 | 78.9 | 79.0 | 78.3 | 79.5 | 77.4 | 76.0 | 78.3 | 83.4 | 79.5 | 81.4 | 87.7
Proposed | 83.2 | 81.8 | 83.8 | 81.3 | 81.8 | 82.2 | 82.4 | 83.2 | 86.0 | 86.2 | 88.4 | 89.9
Gold-standard | 84.6 | 84.6 | 84.6 | 83.4 | 83.4 | 83.4 | 86.7 | 86.7 | 86.7 | 90.2 | 90.2 | 90.2

Table 3: Comparing the cross-domain performance of different approaches on the OSM dataset.

Method | Col2→1 | Col3→1 | Col4→1 | Col1→2 | Col3→2 | Col4→2 | Col1→3 | Col2→3 | Col4→3 | Col1→4 | Col2→4 | Col3→4
SS | 35.0 | 39.4 | 35.6 | 32.8 | 40.2 | 38.6 | 40.7 | 41.9 | 42.5 | 45.0 | 44.9 | 42.4
NN | 66.4 | 65.2 | 68.3 | 65.8 | 66.8 | 63.8 | 65.2 | 67.2 | 68.2 | 67.3 | 67.2 | 68.1
SVM | 67.1 | 63.2 | 64.3 | 62.6 | 64.3 | 60.4 | 62.8 | 63.2 | 65.8 | 68.2 | 69.3 | 72.4
SCL | 68.2 | 67.5 | 67.2 | 67.1 | 67.3 | 64.1 | 64.5 | 65.3 | 72.1 | 68.8 | 70.1 | 73.6
SFA | 71.3 | 67.6 | 67.8 | 69.1 | 70.2 | 67.8 | 68.2 | 68.4 | 74.2 | 69.5 | 72.3 | 76.3
PJNMF | 72.0 | 67.2 | 68.3 | 70.4 | 70.5 | 68.4 | 69.3 | 69.1 | 74.8 | 70.0 | 72.5 | 74.8
SDA | 71.5 | 66.3 | 67.6 | 68.2 | 69.3 | 70.2 | 67.6 | 68.3 | 68.7 | 72.4 | 69.3 | 72.6
mSDA | 72.1 | 67.5 | 68.2 | 69.0 | 70.4 | 70.8 | 68.3 | 69.1 | 69.2 | 73.0 | 70.2 | 73.1
TLDA | 72.4 | 67.8 | 68.6 | 69.7 | 71.1 | 71.5 | 68.8 | 69.8 | 70.0 | 73.8 | 70.7 | 73.8
BTDNNs | 73.1 | 68.3 | 69.0 | 70.2 | 71.6 | 72.1 | 69.4 | 70.2 | 70.6 | 74.2 | 71.3 | 74.2
SS+Common | 68.7 | 67.9 | 67.7 | 67.5 | 67.8 | 64.9 | 65.0 | 65.7 | 72.6 | 69.4 | 70.7 | 74.2
DANN | 69.6 | 69.5 | 69.8 | 70.0 | 68.7 | 66.2 | 66.3 | 66.6 | 73.4 | 70.6 | 71.4 | 75.7
DSN | 72.9 | 68.6 | 69.4 | 70.5 | 72.0 | 72.2 | 69.5 | 70.3 | 70.8 | 74.3 | 71.5 | 74.6
Proposed | 77.6 | 74.5 | 75.5 | 76.2 | 77.8 | 78.2 | 75.2 | 75.7 | 76.1 | 80.1 | 77.9 | 80.9
Gold-standard | 78.2 | 78.2 | 78.2 | 79.1 | 79.1 | 79.1 | 81.0 | 81.0 | 81.0 | 81.4 | 81.4 | 81.4

While reporting performance on the target, we used the comments for which the actual labels are available; however, the label information is used only as ground truth to report the performance. The comments were pre-processed by converting them to lowercase, followed by stemming. Further, feature selection was based on document frequency (DF = 5), which reduces the number of features as well as speeds up the learning task.
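A sketch of this preprocessing pipeline, assuming NLTK's PorterStemmer for stemming and a scikit-learn CountVectorizer with min_df=5 as one way to realize the DF = 5 filtering (both are our choices, not stated in the paper):

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()

def preprocess(comment):
    """Lowercase the comment and stem each whitespace-separated token."""
    return " ".join(stemmer.stem(tok) for tok in comment.lower().split())

# Placeholder tweets; the real OSM collections are listed in Table 1.
tweets = ["My phone support was AWFUL today!!", "Loving the new xbox update",
          "healthcare rollout is confusing", "kinect games are fun"] * 5

# Keep only terms appearing in at least 5 documents (DF = 5).
vectorizer = CountVectorizer(min_df=5)
X = vectorizer.fit_transform(preprocess(t) for t in tweets)
```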

5.2 Experimental Protocol

The performance of the proposed architecture is compared with a standard neural network architecture with one hidden layer ("NN") (as described in Eq. 1) and a support vector machine ("SVM") (Chih-Wei Hsu and Lin, 2003) with a linear kernel, where training is performed on the labelled source domain and performance is reported on the target domain. "Gold-standard" refers to the target domain supervised performance of the SVM. The performance is further compared with popular shared representation learning approaches for domain adaptation, including Structural Correspondence Learning ("SCL") (Blitzer et al., 2006, 2007), Spectral Feature Alignment ("SFA") (Pan et al., 2010) and "PJNMF" (Zhou et al., 2015). We also compared the performance with "DANN" (Ajakan et al., 2014), stacked Denoising Auto-encoders ("SDA") (Glorot et al., 2011), marginalized SDA ("mSDA") (Chen et al., 2012b), transfer learning with deep auto-encoders ("TLDA") (Pan et al., 2008), "BTDNNs" (Zhou et al., 2016) and "DSN" (Bousmalis et al., 2016), which are some of the popular approaches for cross-domain sentiment analysis. The performance is also compared with different components of the learned representations, i.e. source specific ("SS"), common ("Proposed"), and "SS+Common" representations. For SDA, mSDA, TLDA, BTDNNs, SS, SS+Common and the proposed approach, a standard SVM is trained on the learned representation and applied to predict the sentiment labels for the target data.

Training is done using stochastic gradient descent (SGD) with a minibatch size of 50. The initial learning rate was fixed at 0.01 and then empirically varied to find the optimal value of 0.0001. The number of epochs was fixed at 25, above which the gradients were found to saturate. The hyperparameter λ was varied in the range [0, 1].
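The reported settings and the final SVM step can be summarized as a small, illustrative configuration; the dictionary layout and the use of scikit-learn's LinearSVC on dummy learned representations are our own choices, not the authors' code:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hyperparameters reported in the text.
config = {"minibatch_size": 50, "learning_rate": 1e-4, "epochs": 25, "lambda_range": (0.0, 1.0)}

# Once the two-part representation is learned, a standard SVM is trained on it
# (here on dummy common-representation features) and applied to the target data.
rng = np.random.default_rng(0)
hc_source, y_source = rng.random((1400, 200)), rng.integers(0, 2, 1400)
hc_target = rng.random((400, 200))

svm = LinearSVC(C=1.0, max_iter=5000).fit(hc_source, y_source)
target_predictions = svm.predict(hc_target)
```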


5.3 Results and Analysis

Results in Table 2 show the efficacy of the proposed neural network architecture for learning a common shared representation while limiting the source specific representation from negatively affecting its generalization capabilities in the target domain. The results suggest that the learned common representation, referred to as "Proposed", consistently outperforms the other existing algorithms on all cross-domain sentiment analysis tasks on the Amazon review dataset (Blitzer et al., 2007). The source specific (SS) representation performs consistently poorly on all target tasks, as it is trained to emphasize only the source task. The results also suggest that combining the source specific representation with the common representation, referred to as "SS+Common", leads to lower performance than the common representation alone. This validates our assertion that combining source specific characteristics with the common representation negatively affects the generalization capabilities of the common representation in the target domain. The proposed method also surpasses BTDNNs (state-of-the-art), which focuses on the feasibility of transfer between domains with a linear data reconstruction for distribution consistency. Contrary to the proposed two-part representations, this suggests that enforcing distribution consistency across all hidden units suppresses the discriminating information, which results in lower classification performance for BTDNNs. The proposed approach even outperforms the deep learning based methods (SDA, mSDA and TLDA), as these approaches learn a unified domain-invariant feature representation by combining the source domain and target domain data, which may not separate out the domain-specific features from the commonality of the domains. On the contrary, the objective used in this paper is based on a min-max optimization criterion that minimizes the domain specific component's loss as well as maximizes the shared component's loss. In other words, the proposed approach not only models the similarity between domains but also models and mitigates the source domain specific information, thus leading to better cross-domain performance.

Results in Table 3 compare the performance of all algorithms on the OSM dataset. We observe that the overall performance of all the algorithms is lower on the OSM dataset than on the first dataset, as it is more challenging due to its short and noisy text.

Figure 3: Compares the proxy-A distances (PAD) computed on the common representations learned using the proposed technique v/s the (a) source specific and (b) representations learned using DANN.

Both Tables 2 and 3 demonstrate that the domain adaptation methods perform better than the baselines and the "SS" representation, which suggests that transferring knowledge across domains benefits the cross-domain sentiment classification task. The improvement achieved by the proposed technique, which comes closest to the target domain supervised performance ("Gold-standard"), is consistently better than the existing algorithms, as it explicitly keeps any source specific components away from the learned common representation so as to yield the best generalization on the target domain.

5.3.1 The Common Representation:

The primary objective of the common representation is to make the source and target distributions appear similar. In other words, the representation should be such that it becomes arduous to distinguish between source and target examples for a model trained on this representation. We compute the proxy-A distance (PAD) between the two domains, as described in Section 4.2.


Figure 4: Compares the performance on the source classification task. For example, B→D here represents the performance of an algorithm on the source domain B when the representations are learned with B as the labeled source and D as the unlabeled target domain.

Figure 3 illustrates that the learned common representation leads to a lower PAD when compared with either the source specific representation or the representation learned using DANN. A low PAD between domains for a given representation signifies that the divergence between the domains is reduced.

5.3.2 The Source Specific Representation:

We evaluate the performance of different parts of the learned representation on the source domain. Optimizing the target domain performance is indeed the primary objective of domain adaptation; however, existing domain adaptation algorithms generally exhibit lower performance on the source. We empirically demonstrate that the proposed method for learning source specific and common parts of the hidden layer sustains a higher level of performance in the source as well.

Results in Figure 4 compare the performance of the different representations on the source domain. We compare the performance of the source specific representation and the common representation learned using the proposed approach with the representation learned using DANN (Ajakan et al., 2014) and the skyline source domain performance. The results suggest that while the two individual parts of the learned representation yield lower source domain performance, the combined source specific and common representation ("combined") outperforms the source domain performance of the representation learned using DANN. This signifies that the two parts of the learned representation capture complementary characteristics, i.e., source specific and general.
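The two arrangements used throughout the experiments, the common part alone for the cross-domain task and the combined parts for the source task, amount to selecting a slice of the hidden vector; a small illustrative helper (names and sizes hypothetical) makes this explicit:

```python
import numpy as np

def arrange(h, n_ss, task):
    """Select the part of the two-part hidden layer used for a given task.

    h     : hidden vector laid out as [source specific | common]
    n_ss  : number of source specific units
    task  : 'target' uses the common part only; 'source' uses both parts combined.
    """
    if task == "target":
        return h[n_ss:]          # common representation alone for the cross-domain task
    return h                     # combined representation for the source task

h = np.random.default_rng(0).random(250)   # dummy hidden vector with 50 source specific units
target_features = arrange(h, 50, "target")
source_features = arrange(h, 50, "source")
```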

5.3.3 Source Specific & Common Units:

While learning the two-part representation with our neural network architecture, the number of source specific and common units in the hidden layer is an important factor influencing the cross-domain performance. In our experiments, we observed that when the source and target domains were similar (as measured by the PAD), a hidden layer with a higher proportion of common versus source specific units resulted in better cross-domain performance than when the source and target domains were dissimilar. This intuitively suggests that for similar domains there are more commonalities than domain specific characteristics and hence, a higher number of common units is required to capture this commonality. Similarly, we observed that for less similar domains, the source specific units dominate the common units.

6 Conclusion & Future Work

This paper proposed a novel neural network learning algorithm based on the principle of learning a two-part representation where each part optimizes a different objective. One part captures the source specific characteristics that are discriminating for learning in the source domain. The other part captures the common representation between the source and target domain pair, which contributes to source domain learning as well as generalizes to the unlabelled target domain task. The major contribution of this work is to learn the common shared representation between domains by explicitly disentangling the source specific characteristics so that they do not detract from the capabilities of the common representation for the cross-domain task. In the cross-domain task, the common part of the representation performs best when it is isolated from the source specific part. On the contrary, both the source specific and common parts of the representation come together for efficient performance on the source domain task. Finally, we demonstrated the efficacy of the proposed approach for cross-domain classification on different datasets.


References

Hana Ajakan, Pascal Germain, Hugo Larochelle, Francois Laviolette, and Mario Marchand. 2014. Domain-adversarial neural networks. arXiv preprint arXiv:1412.4446.
Jeremy Barnes, Roman Klinger, and Sabine Schulte im Walde. 2018. Projecting embeddings for domain adaption: Joint modeling of sentiment analysis in diverse domains. In Proceedings of the 27th International Conference on Computational Linguistics, pages 818-830. Association for Computational Linguistics.
Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. 2010. A theory of learning from different domains. Machine Learning, 79(1-2):151-175.
Himanshu Sharad Bhatt, Arun Rajkumar, and Shourya Roy. 2016. Multi-source iterative adaptation for cross-domain classification. In Proceedings of the International Joint Conference on Artificial Intelligence.
Himanshu Sharad Bhatt, Deepali Semwal, and Shourya Roy. 2015. An iterative similarity based adaptation technique for cross-domain text classification. In Proceedings of the Conference on Natural Language Learning, pages 52-61.
J. Blitzer, R. McDonald, and F. Pereira. 2006. Domain adaptation with structural correspondence learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 120-128.
John Blitzer, Mark Dredze, and Fernando Pereira. 2007. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the Association for Computational Linguistics, pages 440-447.
John Blitzer, Sham Kakade, and Dean Foster. 2011. Domain adaptation with coupled subspaces. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 173-181, Fort Lauderdale, FL, USA. PMLR.
Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. 2016. Domain separation networks. arXiv preprint arXiv:1608.06019.
Minmin Chen, Zhixiang Xu, Kilian Weinberger, and Fei Sha. 2012a. Marginalized denoising autoencoders for domain adaptation.
Minmin Chen, Zhixiang Xu, Kilian Q. Weinberger, and Fei Sha. 2012b. Marginalized denoising autoencoders for domain adaptation. In Proceedings of the 29th International Conference on Machine Learning, ICML'12, pages 1627-1634, USA. Omnipress.
Chih-Chung Chang, Chih-Wei Hsu, and Chih-Jen Lin. 2003. A practical guide to support vector classification. Technical report, Department of Computer Science, National Taiwan University.
S. Chopra, S. Balakrishnan, and R. Gopalan. 2013. DLID: Deep learning for domain adaptation by interpolating between domains. ICML Workshop on Challenges in Representation Learning.
Wenyuan Dai, Gui-Rong Xue, Qiang Yang, and Yong Yu. 2007. Co-clustering based classification for out-of-domain documents. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, pages 210-219.
Hal Daume III. 2009. Frustratingly easy domain adaptation. arXiv preprint arXiv:0907.1815.
I. S. Dhillon, S. Mallela, and D. S. Modha. 2003. Information-theoretic co-clustering. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, pages 89-98.
Basura Fernando, Amaury Habrard, Marc Sebban, and Tinne Tuytelaars. 2013. Unsupervised visual domain adaptation using subspace alignment. In Proceedings of the IEEE International Conference on Computer Vision, pages 2960-2967.
Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, Francois Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1-35.
Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the International Conference on Machine Learning.
Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. 2012. Geodesic flow kernel for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2066-2073. IEEE.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680.
Raghuraman Gopalan, Ruonan Li, and Rama Chellappa. 2011. Domain adaptation for object recognition: An unsupervised approach. In 2011 International Conference on Computer Vision, pages 999-1006. IEEE.
Maayan Harel and Shie Mannor. 2010. Learning from multiple outlooks. arXiv preprint arXiv:1005.0027.
Jing Jiang and ChengXiang Zhai. 2007. A two-stage approach to domain adaptation for statistical classifiers. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, pages 401-410. ACM.
Abhishek Kumar, Avishek Saha, and Hal Daume. 2010. Co-regularization based semi-supervised domain adaptation. In Advances in Neural Information Processing Systems 23, pages 478-486. Curran Associates, Inc.
Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2017. Adversarial multi-task learning for text classification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1-10.
Mingsheng Long and Jianmin Wang. 2015. Learning transferable features with deep adaptation networks. arXiv preprint arXiv:1502.02791.
S. J. Pan and Q. Yang. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345-1359.
Sinno Jialin Pan, Ivor W. Tsang, James T. Kwok, and Qiang Yang. 2011. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199-210.
Sinno Jialin Pan, James T. Kwok, and Qiang Yang. 2008. Transfer learning via dimensionality reduction. In AAAI, volume 8, pages 677-682.
Sinno Jialin Pan, Xiaochuan Ni, Jian-Tao Sun, Qiang Yang, and Zheng Chen. 2010. Cross-domain sentiment classification via spectral feature alignment. In Proceedings of the International Conference on World Wide Web, pages 751-760.
Prathusha K. Sarma, Yingyu Liang, and William A. Sethares. 2018. Domain adapted word embeddings for improved sentiment classification. CoRR, abs/1805.04576.
Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. 2014. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474.
Qi Zhang, Xuanjing Huang, Minlong Peng, and Yu-Gang Jiang. 2018. Cross-domain sentiment classification with target domain specific information. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2505-2513.
Guangyou Zhou, Tingting He, Wensheng Wu, and Xiaohua Tony Hu. 2015. Linking heterogeneous input features with pivots for domain adaptation. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI), pages 1419-1425.
Guangyou Zhou, Zhiwen Xie, Jimmy Xiangji Huang, and Tingting He. 2016. Bi-transferring deep neural networks for domain adaptation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 322-332. Association for Computational Linguistics.