
Transcript of arXiv:1704.06933v4 [cs.CL] 30 Sep 2018

Adversarial Neural Machine Translation

Lijun Wu1, Yingce Xia2, Li Zhao3, Fei Tian3, Tao Qin3, Jianhuang Lai1,4 and Tie-Yan Liu3

1 School of Data and Computer Science, Sun Yat-sen University
2 University of Science and Technology of China
3 Microsoft Research Asia
4 Guangdong Key Laboratory of Information Security Technology

[email protected]; [email protected]; {lizo,fetia,taoqin,tie-yan.liu}@microsoft.com; [email protected]

Abstract

In this paper, we study a new learning paradigm for Neural Machine Translation (NMT). Instead of maximizing the likelihood of the human translation as in previous works, we minimize the distinction between human translation and the translation given by an NMT model. To achieve this goal, inspired by the recent success of Generative Adversarial Networks (GANs), we employ an adversarial training architecture and name it Adversarial-NMT. In Adversarial-NMT, the training of the NMT model is assisted by an adversary, which is an elaborately designed Convolutional Neural Network (CNN). The goal of the adversary is to differentiate the translation generated by the NMT model from that by a human. The goal of the NMT model is to produce high-quality translations so as to cheat the adversary. A policy gradient method is leveraged to co-train the NMT model and the adversary. Experimental results on English→French and German→English translation tasks show that Adversarial-NMT can achieve significantly better translation quality than several strong baselines.

1 Introduction

Neural Machine Translation (NMT) (Cho et al., 2014; Bahdanau et al., 2014) has drawn more and more attention in both academia and industry (Luong and Manning, 2016; Jean et al., 2015; Shen et al., 2016; Tu et al., 2016b; Sennrich et al., 2016; Wu et al., 2016). Compared with traditional Statistical Machine Translation (SMT) (Koehn et al., 2003), NMT achieves similar or even better translation results in an end-to-end framework. The sentence-level maximum likelihood principle and gating units in LSTM/GRU (Hochreiter and Schmidhuber, 1997; Cho et al., 2014), together with attention mechanisms, grant NMT the ability to better translate long sentences.

Despite its success, the translation quality of the latest NMT systems is still far from satisfactory, and there remains large room for improvement. For example, NMT usually adopts the Maximum Likelihood Estimation (MLE) principle for training, i.e., maximizing the probability of the target ground-truth sentence conditioned on the source sentence. Such an objective does not guarantee that the translation results are as natural, sufficient, and accurate as the ground-truth translations by humans. Previous works (Ranzato et al., 2015; Shen et al., 2016; Bahdanau et al., 2016) aim to alleviate such limitations of maximum likelihood training by adopting sequence-level objectives (e.g., directly maximizing BLEU (Papineni et al., 2002)) to reduce the objective inconsistency between NMT training and inference. Yet, although somewhat improved, such objectives still cannot fully bridge the gap between NMT translations and ground-truth translations.

In this paper, we adopt a thoroughly different training objective for NMT, directly minimizing the difference between human translation and the translation given by an NMT model. To achieve this target, inspired by the recent success of Generative Adversarial Networks (GANs) (Goodfellow et al., 2014a), we design an adversarial training protocol for NMT and name it Adversarial-NMT.


In Adversarial-NMT, besides the typical NMT model, an adversary is introduced to distinguish the translation generated by NMT from that by a human (i.e., the ground truth). Meanwhile, the NMT model tries to improve its translation results such that it can successfully cheat the adversary.

These two modules in Adversarial-NMT are co-trained, and their performance gets mutually improved. In particular, the discriminative power of the adversary can be improved by learning from more and more training samples (both positive ones generated by humans and negative ones sampled from NMT); and the ability of the NMT model to cheat the adversary can be improved by taking the output of the adversary as a reward. In this way, the NMT translation results are professor forced (Lamb et al., 2016) to be as close as possible to the ground-truth translation.

Different from previous GANs, which assume the existence of a generator in continuous space, in our proposed framework the NMT model is in fact not a typical generative model, but instead a probabilistic transformation that maps a source language sentence to a target language sentence, both in discrete space. Such differences make it necessary to design both new network architectures and optimization methods to make adversarial training possible for NMT. We therefore, on one hand, leverage a specially designed Convolutional Neural Network (CNN) model as the adversary, which takes the (source, target) sentence pair as input; on the other hand, we turn to a policy gradient method named REINFORCE (Williams, 1992), widely used in the reinforcement learning literature (Sutton and Barto, 1998), to guarantee that both modules are effectively optimized in an adversarial manner. We conduct extensive experiments, which demonstrate that Adversarial-NMT can achieve significantly better translation results than traditional NMT models, even those with much larger vocabulary size and higher model complexity.

2 Related Work

End-to-end Neural Machine Translation (NMT) (Bahdanau et al., 2014; Cho et al., 2014; Sutskever et al., 2014; Jean et al., 2015; Wu et al., 2016; Zhou et al., 2016) has been the recent research focus of the community. A typical NMT system is built within the RNN-based encoder-decoder framework. In such a framework, the encoder RNN sequentially processes the words in a source language sentence into fixed-length vectors, which act as the inputs to the decoder RNN that decodes the translation sentence. NMT typically adopts the principle of Maximum Likelihood Estimation (MLE) for training, i.e., maximizing the per-word likelihood of the target sentence. Other training criteria, such as Minimum Risk Training (MRT) based on reinforcement learning (Ranzato et al., 2015; Shen et al., 2016) and translation reconstruction (Tu et al., 2016a), are shown to improve over this word-level MLE principle since these objectives take the translation sentence as a whole.

The training principle we propose is based on the spirit of Generative Adversarial Networks (GANs) (Goodfellow et al., 2014a; Salimans et al., 2016), or more generally, adversarial training (Goodfellow et al., 2014b). In adversarial training, a discriminator and a generator compete with each other, forcing the generator to produce high-quality outputs that are able to fool the discriminator. Adversarial training typically succeeds in image generation (Goodfellow et al., 2014a; Reed et al., 2016), with limited contribution to natural language processing tasks (Yu et al., 2016; Li et al., 2017), mainly due to the difficulty of propagating the error signals from the discriminator to the generator through the discretely generated natural language tokens. Yu et al. (2016) alleviate this difficulty with a reinforcement learning approach for sequence (e.g., music) generation. However, as far as we know, there are limited efforts on adversarial training for sequence-to-sequence tasks where a conditional mapping between two sequences is involved, and our work is among the first endeavors to explore this direction, especially for Neural Machine Translation (Yang et al., 2017).

3 Adversarial-NMT

The overall framework of our Adversarial-NMT is shown in Figure 1. Let (x = {x_1, x_2, ..., x_{T_x}}, y = {y_1, y_2, ..., y_{T_y}}) be a bilingual aligned sentence pair for training, where x_i is the i-th word in the source sentence and y_j is the j-th word in the target sentence.

Figure 1: The Adversarial-NMT framework. 'Ref' is short for 'Reference', which means the ground-truth translation, and 'Hyp' is short for 'Hypothesis', denoting the model translation sentence. All the yellow parts denote the NMT model G, which maps a source sentence x to a translation sentence. The red parts are the adversary network D, which predicts whether a given target sentence is the ground-truth translation of the given source sentence x. G and D combat with each other, generating both sampled translations y′ to train D, and the reward signals to train G by policy gradient (the blue arrows).

Let y′ denote the translation sentence output by an NMT system for the source sentence x. As previously stated, the goal of Adversarial-NMT is to force y′ to be as 'similar' to y as possible. In the perfect case, y′ is so similar to the human translation y that even a human cannot tell whether y′ is generated by machine or human. In order to achieve that, we introduce an extra adversary network, which acts similarly to the discriminator adopted in GANs (Goodfellow et al., 2014a). The goal of the adversary is to differentiate human translation from machine translation, and the NMT model G tries to produce a target sentence as similar as possible to human translation so as to fool the adversary.

3.1 NMT Model

We adopt the Recurrent Neural Network (RNN) based encoder-decoder as the NMT model to seek a target language translation y′ given source sentence x. In particular, a probabilistic mapping G(y|x) is first learnt, and the translation result y′ ~ G(·|x) is sampled from it. To be specific, given the source sentence x and previously generated words y_{<t}, the probability of generating word y_t is:

G(y_t | y_{<t}, x) ∝ exp(y_t; r_t, c_t)   (1)

r_t = g(r_{t-1}, y_{t-1}, c_t)   (2)

where r_t is the decoding state from the decoder at time t. Here g is the recurrent unit such as the Long Short-Term Memory (LSTM) unit (Hochreiter and Schmidhuber, 1997) or Gated Recurrent Unit (GRU) (Cho et al., 2014), and c_t is a distinct source representation at time t calculated by an attention mechanism (Bahdanau et al., 2014):

c_t = Σ_{i=1}^{T_x} α_{it} h_i   (3)

α_{it} = exp{a(h_i, r_{t-1})} / Σ_j exp{a(h_j, r_{t-1})}   (4)

where T_x is the source sentence length, a(·, ·) is a feed-forward neural network, and h_i is the hidden state from the RNN encoder computed from h_{i-1} and x_i:

h_i = f(h_{i-1}, x_i)   (5)

The translation result y′ can be sampled from G(·|x) either in a greedy way at each timestep, or using beam search (Sutskever et al., 2014) to seek a globally optimized result.
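To make the decoding step concrete, below is a minimal sketch of the additive attention of Eqns. (3)-(4) in PyTorch (the paper releases no code, so the module and variable names here are our own illustrative choices):

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Sketch of the attention mechanism in Eqns. (3)-(4) (Bahdanau et al., 2014)."""
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        # a(h_i, r_{t-1}) in Eq. (4) is a small feed-forward network
        self.W_h = nn.Linear(enc_dim, attn_dim, bias=False)
        self.W_r = nn.Linear(dec_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, h, r_prev):
        # h: (T_x, enc_dim) encoder states; r_prev: (dec_dim,) previous decoder state
        scores = self.v(torch.tanh(self.W_h(h) + self.W_r(r_prev))).squeeze(-1)
        alpha = torch.softmax(scores, dim=0)        # Eq. (4): attention weights
        c_t = (alpha.unsqueeze(-1) * h).sum(dim=0)  # Eq. (3): context vector c_t
        return c_t, alpha
```

The returned context c_t then feeds, together with y_{t-1} and r_{t-1}, into the recurrent unit g of Eq. (2).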

3.2 Adversary Model

The adversary is used to differentiate the translation result y′ from the ground-truth translation y, given the source language sentence x. To achieve that, one needs to measure the translative matching degree of the source-target sentence pair (x, y). We turn to the Convolutional Neural Network (CNN) for this task (Yin et al., 2015; Hu et al., 2014), since with its layer-by-layer convolution and pooling strategies, CNN is able to accurately capture the hierarchical correspondence of (x, y) at different abstraction levels.

The general structure is shown in Figure 2. Specifically, given a sentence pair (x, y), we first construct a 2D image-like representation by simply concatenating the embedding vectors of words in x and y. That is, for the i-th word x_i in x and the j-th word y_j in sentence y, we have the following feature map:

z^{(0)}_{i,j} = [x_i^T, y_j^T]^T

Based on such a 2D image-like representation, we perform convolution on every 3 × 3 window, with the purpose of capturing the correspondence between segments in x and segments in y by the following feature map of type f:

z^{(1,f)}_{i,j} = σ(W^{(1,f)} z^{(0)}_{i,j} + b^{(1,f)})

where σ(·) is the sigmoid activation function, σ(x) = 1/(1 + exp(−x)).

After that, we perform max-pooling over non-overlapping 2 × 2 windows:

z^{(2,f)}_{i,j} = max({z^{(1,f)}_{2i-1,2j-1}, z^{(1,f)}_{2i-1,2j}, z^{(1,f)}_{2i,2j-1}, z^{(1,f)}_{2i,2j}})

We could go on with more layers of convolution and max-pooling, aiming at capturing the correspondence at different levels of abstraction. The extracted features are then fed into a multi-layer perceptron (MLP), with sigmoid activation at the last layer, to give the probability that (x, y) is from the ground-truth data, i.e., D(x, y). The optimization target of such a CNN adversary is to minimize the cross-entropy loss for binary classification, with ground-truth data (x, y) as positive instances and sampled data (from G) (x, y′) as negative ones.

Figure 2: The CNN adversary framework.
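As an illustration of how such an adversary can be assembled, here is a minimal sketch assuming PyTorch, fixed-length padded inputs, and the layer sizes later reported in Section 4.1 (two 3 × 3 convolution + 2 × 2 pooling layers, 20 feature maps, 20 MLP hidden units); the class name, interface, and fixed length T are our own assumptions:

```python
import torch
import torch.nn as nn

class CNNAdversary(nn.Module):
    """Sketch of the adversary D(x, y); inputs are padded to a fixed length T."""
    def __init__(self, emb_dim, T=50, n_maps=20, mlp_hidden=20):
        super().__init__()
        in_ch = 2 * emb_dim  # each "pixel" z(0)_{i,j} concatenates x_i and y_j
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, n_maps, kernel_size=3, padding=1), nn.Sigmoid(),
            nn.MaxPool2d(2),
            nn.Conv2d(n_maps, n_maps, kernel_size=3, padding=1), nn.Sigmoid(),
            nn.MaxPool2d(2),
        )
        self.mlp = nn.Sequential(
            nn.Linear(n_maps * (T // 4) * (T // 4), mlp_hidden), nn.Sigmoid(),
            nn.Linear(mlp_hidden, 1), nn.Sigmoid(),  # probability D(x, y)
        )

    def forward(self, x_emb, y_emb):
        # x_emb, y_emb: (T, emb_dim) word embeddings of the two sentences
        T = x_emb.size(0)
        z0 = torch.cat([x_emb.unsqueeze(1).expand(T, T, -1),
                        y_emb.unsqueeze(0).expand(T, T, -1)], dim=-1)
        z0 = z0.permute(2, 0, 1).unsqueeze(0)  # (1, 2*emb_dim, T, T)
        feats = self.features(z0).flatten(1)
        return self.mlp(feats).squeeze()       # D(x, y) in (0, 1)
```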

3.3 Policy Gradient Algorithm to Train Adversarial-NMT

With the notations for the NMT model G and the adversary model D, the final training objective is:

min_G max_D V(D, G) = E_{(x,y)~P_data(x,y)}[log D(x, y)] + E_{x~P_data(x), y′~G(·|x)}[log(1 − D(x, y′))]   (6)

That is, the translation model G tries to produce high-quality translations y′ to fool the adversary D (the outer-loop min), whose objective is to successfully classify translation results from real data (i.e., ground truth) and from G (the inner-loop max).

Eqn. (6) reveals that it is straightforward to train the adversary D, by continually providing D with the ground-truth sentence pair (x, y) and the sampled translation pair (x, y′) from G, respectively as positive and negative training data. However, when it turns to the NMT model G, it is non-trivial to design the training process, given that the discretely sampled y′ from G makes it difficult to directly back-propagate the error signals from D to G, making V(D, G) non-differentiable with respect to G's model parameters Θ_G.
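Concretely, one update of D under Eqn. (6) can be sketched as below, using binary cross-entropy over one positive and one negative pair; D is assumed to return the probability D(x, y), and all helper names are illustrative:

```python
import torch
import torch.nn.functional as F

def adversary_step(D, optimizer_D, x, y_human, y_model):
    """One update of the adversary D for Eq. (6): the human pair (x, y) is a
    positive example, the sampled pair (x, y') a negative one."""
    pos = D(x, y_human)  # should move towards 1
    neg = D(x, y_model)  # should move towards 0
    loss = F.binary_cross_entropy(pos, torch.ones_like(pos)) + \
           F.binary_cross_entropy(neg, torch.zeros_like(neg))
    optimizer_D.zero_grad()
    loss.backward()
    optimizer_D.step()
    return loss.item()
```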

To tackle the above challenge, we leverage the REINFORCE algorithm (Williams, 1992), a Monte-Carlo policy gradient method from the reinforcement learning literature, to optimize G. Note that the objective of training G under a fixed source language sentence x and a fixed D is to minimize the following loss:

L = E_{y′~G(·|x)}[log(1 − D(x, y′))]   (7)

whose gradient w.r.t. Θ_G is:

∇_{Θ_G} L = ∇_{Θ_G} E_{y′~G(·|x)}[log(1 − D(x, y′))]
          = E_{y′~G(·|x)}[log(1 − D(x, y′)) ∇_{Θ_G} log G(y′|x)]   (8)

A single sample y′ from G(·|x) is used to approximate the above gradient:

∇_{Θ_G} L ≈ ∇̂_{Θ_G} L = log(1 − D(x, y′)) ∇_{Θ_G} log G(y′|x)   (9)

in which ∇_{Θ_G} log G(y′|x) are the gradients specified by standard sequence-to-sequence NMT networks. Such a gradient approximation is used to update Θ_G:

Θ_G ← Θ_G − α ∇̂_{Θ_G} L   (10)

where α is the learning rate.

Using the language of reinforcement learning, in the above Eqns. (7) to (9), the NMT model G(·|x) is the conditional policy faced with x, while the term −log(1 − D(x, y′)), provided by the adversary D, acts as a Monte-Carlo estimate of the reward. Intuitively speaking, Eqn. (9) implies that the more likely y′ is to successfully fool D (i.e., the larger D(x, y′)), the larger the reward the NMT model will get, and the 'pseudo' training data (x, y′) will correspondingly be more favored to improve the policy G(·|x).

Note that here we in fact use one sample −log(1 − D(x, y′)) from a trajectory y′ to estimate the terminal reward given by D. Acting in this way brings high variance; to reduce the variance, a moving average of the historical reward values is set as a reward baseline (Weaver and Tao, 2001). One could sample multiple trajectories in each decoding step, regarding G as the roll-out policy, to reduce the estimation variance of the immediate reward (Silver et al., 2016; Yu et al., 2016). However, empirically we find such an approach intolerably time-consuming in our task, given that the decoding space in NMT is typically extremely large (the same as the vocabulary size).
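Putting Eqns. (7)-(10) and the moving-average baseline together, one update of G can be sketched as below; `G.sample(x)` returning a sampled translation y′ together with log G(y′|x) is an assumed interface, not the paper's actual API:

```python
import torch

def reinforce_step(G, D, optimizer, x, baseline, decay=0.95):
    """One REINFORCE update of G, Eqns. (7)-(10), with a moving-average
    reward baseline (Weaver and Tao, 2001)."""
    y_sampled, log_prob = G.sample(x)               # y' ~ G(.|x) and log G(y'|x)
    with torch.no_grad():
        reward = -torch.log(1.0 - D(x, y_sampled))  # terminal reward from D
    advantage = reward - baseline                   # baseline reduces variance
    loss = -advantage * log_prob                    # gradient matches Eq. (9)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                # Eq. (10)
    # update the moving average of historical rewards
    baseline = decay * baseline + (1.0 - decay) * reward.item()
    return baseline
```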

It is worth comparing our adversarial training with existing methods that directly maximize sequence-level measures such as BLEU (Ranzato et al., 2015; Shen et al., 2016; Bahdanau et al., 2016) in training NMT models, using reinforcement learning approaches similar to ours. We argue that Adversarial-NMT makes the optimization easier compared with these methods. Firstly, the reward learned by our adversary D provides rich and global information to evaluate the translation, which goes beyond BLEU's simple low-level n-gram matching criteria. Acting in this way provides a much smoother objective compared with BLEU, since the latter is highly sensitive to slight translation differences at the word or phrase level. Secondly, the NMT model G and the adversary D in Adversarial-NMT co-evolve. The dynamics of the adversary D make the NMT model G grow in an adaptive way, rather than being controlled by a fixed evaluation metric such as BLEU. For these two reasons, Adversarial-NMT makes the optimization process towards sequence-level objectives much more robust and better controlled, which is further verified by its superior performance over the aforementioned methods, as reported in Section 4.

4 Experiments

4.1 Settings

We report experimental results on both English→French translation (En→Fr for short) and German→English translation (De→En for short).

Dataset: For En→Fr translation, for the sake of fair comparison with previous works, we use the same dataset as (Bahdanau et al., 2014; Shen et al., 2016). The dataset is composed of a subset of the WMT 2014 training corpus as the training set, the combination of news-test 2012 and news-test 2013 as the dev set, and news-test 2014 as the test set, which respectively contain roughly 12M, 6k and 3k sentence pairs. The maximal sentence length is 50. We use the top 30k most frequent English and French words and replace the other words with an 'UNK' token.

For De→En translation, following previous works (Ranzato et al., 2015; Bahdanau et al., 2016), the dataset is from the IWSLT 2014 evaluation campaign (Cettolo et al., 2014), consisting of training/dev/test corpora with approximately 153k, 7k and 6.5k bilingual sentence pairs respectively. The maximal sentence length is also set to 50. The dictionaries for the English and German corpora respectively include the 22,822 and 32,009 most frequent words (Bahdanau et al., 2016), with other words replaced by a special token 'UNK'.

Implementation Details: In Adversarial-NMT, the structure of the NMT model G is the same as the RNNSearch model (Bahdanau et al., 2014), an RNN-based encoder-decoder framework with attention mechanism. Single-layer GRUs act as encoder and decoder. For En→Fr translation, the dimensions of word embedding and GRU hidden state are respectively set to 620 and 1000; for De→En translation they are both 256.

For the adversary D, the CNN consists of two convolution+pooling layers, one MLP layer and one softmax layer, with 3 × 3 convolution window size, 2 × 2 pooling window size, feature map size 20 and MLP hidden layer size 20.

For the training of the NMT model G, similar to what is commonly done in previous works (Shen et al., 2016; Tu et al., 2016a), we warm-start G from a well-trained RNNSearch model, and optimize it using vanilla SGD with mini-batch size 80 for En→Fr translation and 32 for De→En translation. Gradient clipping is used with clipping value 1 for En→Fr and 10 for De→En. The initial learning rate is chosen by cross-validation on the dev set (0.02 for En→Fr and 0.001 for De→En) and we halve it every 80k iterations.
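For reference, the En→Fr optimization setting for G described above could look as follows in PyTorch; this is a sketch of the reported hyper-parameters, not released code, and reading the clipping value 1 as a gradient-norm threshold is our assumption:

```python
import torch

def make_optimizer(G):
    """SGD with lr = 0.02, halved every 80k iterations, per Section 4.1 (En->Fr)."""
    optimizer = torch.optim.SGD(G.parameters(), lr=0.02)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=80_000, gamma=0.5)
    return optimizer, scheduler

def sgd_update(G, optimizer, scheduler, loss):
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(G.parameters(), max_norm=1.0)  # clipping value 1
    optimizer.step()
    scheduler.step()  # one scheduler step per training iteration
```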

System | System Configuration | BLEU

Representative end-to-end NMT systems:
Sutskever et al. (2014) | LSTM with 4 layers + 80k vocabs | 30.59
Bahdanau et al. (2014) | RNNSearch | 29.97 (a)
Jean et al. (2015) | RNNSearch + UNK Replace | 33.08
Jean et al. (2015) | RNNSearch + 500k vocabs + UNK Replace | 34.11
Luong et al. (2015) | LSTM with 4 layers + 40k vocabs | 29.50
Luong et al. (2015) | LSTM with 4 layers + 40k vocabs + PosUnk | 31.80
Shen et al. (2016) | RNNSearch + Minimum Risk Training Objective | 31.30
Sennrich et al. (2016) | RNNSearch + Monolingual Data | 30.40 (b)
He et al. (2016) | RNNSearch + Monolingual Data + Dual Objective | 32.06

Adversarial-NMT:
this work | RNNSearch + Adversarial Training Objective | 31.91†
this work | RNNSearch + Adversarial Training Objective + UNK Replace | 34.78

(a) Reported in (Jean et al., 2015). (b) Reported in (He et al., 2016).

Table 1: Different NMT systems' performances on En→Fr translation. The default setting is single-layer GRU + 30k vocabs + MLE training objective, trained with no monolingual data, i.e., the RNNSearch model proposed by Bahdanau et al. (2014). †: significantly better than Shen et al. (2016) (p < 0.05).

An important factor we find in successfully training G is the combination of the adversarial objective with MLE. That is, we force 50% of randomly chosen mini-batch data to be trained with Adversarial-NMT, while applying the MLE principle to the other mini-batches. Acting in this way significantly improves stability in model training, which is also reported in other tasks such as language modeling (Lamb et al., 2016) and neural dialogue generation (Li et al., 2017). We conjecture that the reason is that MLE acts as a regularizer to guarantee smooth model updates, alleviating the negative effects brought by the high gradient estimation variance of the one-step Monte-Carlo sample in REINFORCE.
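A sketch of this 50/50 schedule, reusing the illustrative `reinforce_step` helper from Section 3.3 and a hypothetical `mle_loss` helper for the likelihood objective:

```python
import random

def train_epoch(batches, G, D, optimizer, baseline):
    """Each mini-batch is trained with the adversarial objective with
    probability 0.5, and with plain MLE otherwise (`mle_loss` is assumed)."""
    for x, y in batches:
        if random.random() < 0.5:
            baseline = reinforce_step(G, D, optimizer, x, baseline)  # adversarial
        else:
            loss = mle_loss(G, x, y)  # MLE acts as a regularizer for smooth updates
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return baseline
```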

As the first step, the CNN adversary network D is pre-trained using data (x, y′) sampled from the RNNSearch model and the ground-truth translations (x, y). After that, in the joint G-D training of Adversarial-NMT, the adversary is optimized using Nesterov SGD (Nesterov, 1983) with batch size 32. The initial learning rate is 0.002 for En→Fr and 0.001 for De→En, both chosen by validation on the dev set. The dimension of the word embedding is the same as that of G, and we fix the word embeddings during training. Batch normalization (Ioffe and Szegedy, 2015) is observed to significantly improve D's performance. Considering efficiency, all the negative training instances (x, y′) used in D's training are generated using beam search with beam size 4.

In generating model translations for evaluation, we set the beam width to 4 and 12 for En→Fr and De→En respectively, according to BLEU on the dev set. The translation quality is measured by the tokenized case-sensitive BLEU (Papineni et al., 2002) score.^1

4.2 Result on Enā†’Fr translation

In Table 1 we provide the En→Fr translation results of Adversarial-NMT, together with several strong NMT baselines, such as the representative attention-based NMT model RNNSearch (Bahdanau et al., 2014). In addition, to make our comparison comprehensive, we cover several well-acknowledged techniques whose effectiveness in improving En→Fr translation has been verified by previously published works, including: 1) using a large vocabulary to handle rare words (Jean et al., 2015; Luong et al., 2015); 2) different training objectives (Shen et al., 2016; Ranzato et al., 2015; Bahdanau et al., 2016), such as Minimum Risk Training (MRT) to directly optimize the evaluation measure (Shen et al., 2016), and dual learning to enhance both the primal and dual tasks (e.g., En→Fr and Fr→En) (He et al., 2016); 3) improved inference processes such as beam search optimization (Wiseman and Rush, 2016) and postprocessing UNK (Luong et al., 2015; Jean et al., 2015); 4) leveraging additional monolingual data (Sennrich et al., 2016; Zhang and Zong, 2016; He et al., 2016).

^1 https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl

[Figure 3 omitted: two plots of dev set BLEU against training iterations. (a) fixed lr_D = 0.002 with lr_G in {0.002, 0.02, 0.2}; (b) fixed lr_G = 0.02 with lr_D in {0.0002, 0.002, 0.02}.]

Figure 3: Dev set BLEU during the En→Fr Adversarial-NMT training process, with the same learning rate for D and different learning rates for G in 3(a), and the same learning rate for G and different learning rates for D in 3(b).

From the table, we can clearly observe that Adversarial-NMT obtains satisfactory translation quality against the baseline systems. In particular, it even surpasses the performance of models with much larger vocabularies (Jean et al., 2015), deeper layers (Luong et al., 2015), a much larger monolingual training corpus (Sennrich et al., 2016), and the goal of directly maximizing BLEU (Shen et al., 2016). In fact, as far as we know, Adversarial-NMT achieves the state-of-the-art result (34.78) on the news-test 2014 test set for En→Fr translation among single-layer GRU sequence-to-sequence models trained with only a supervised bilingual corpus.

Human Evaluation: Apart from the comparison based on objective BLEU scores, to better appraise the performance of our model, we also involve human judgements as a subjective measure. To be more specific, we generate translation results for 500 randomly selected English sentences from the En→Fr news-test 2014 dataset using both MRT (Shen et al., 2016) and our Adversarial-NMT. Here MRT is chosen since it is a representative of previous NMT methods that maximize sequence-level objectives, achieving satisfactory results among all single-layer models (i.e., 31.3 in Table 1). Afterwards we ask three human labelers to choose the better one from the two versions of translated sentences. The evaluation process is conducted on Amazon Mechanical Turk^2 with all the workers being native English or French speakers.

 | Adversarial-NMT | MRT
evaluator 1 | 286 (57.2%) | 214 (42.8%)
evaluator 2 | 310 (62.0%) | 190 (38.0%)
evaluator 3 | 295 (59.0%) | 205 (41.0%)
Overall | 891 (59.4%) | 609 (40.6%)

Table 2: Human evaluations for Adversarial-NMT and MRT on English→French translation. "286 (57.2%)" means that evaluator 1 judged 286 (57.2%) of the 500 translations generated by Adversarial-NMT to be better than those by MRT.

Results in Table 2 show that 59.4% of the sentences are better translated by our Adversarial-NMT than by MRT (Shen et al., 2016). Such human evaluation further demonstrates the effectiveness of our model and matches the expectation that Adversarial-NMT provides more human-desired translations.

^2 https://www.mturk.com

Adversarial Training: Slow or Fast: In this subsection we analyze how to set the pace for training the NMT model G and the adversary D, to make them combat effectively. Specifically, for En→Fr translation, we inspect how dev set BLEU varies along the adversarial training process with different initial learning rates for G (shown in 3(a)) and for D (shown in 3(b)), conditioned on the other one being fixed.

Overall, these two figures show that Adversarial-NMT is much more robust with regard to the pace of D making progress than to that of G, since the three curves in 3(b) grow in a similar pattern while the curves in 3(a) differ drastically from each other. We conjecture the reason is that in Adversarial-NMT, the CNN-based D is powerful in classification tasks, especially when it is warm-started with sampled data from RNNSearch. As a comparison, the translation model G is relatively weak in providing qualified translations. Therefore, training G needs careful configuration of the learning rate: a small value (e.g., 0.002) leads to slower convergence (blue line in 3(a)), while a large value (e.g., 0.2) brings instability (green line in 3(a)). A proper learning rate (e.g., 0.02) induces G to make fast, and meanwhile stable, progress along training.

4.3 Result on De→En translation

In Table 3 we provide the De→En translation results of Adversarial-NMT, compared with some strong baselines such as RNNSearch (Bahdanau et al., 2014) and MRT (Shen et al., 2016).

Again, we can see from Table 3 that Adversarial-NMT performs best against the other models, achieving a 27.94 BLEU score, which is also a state-of-the-art result.

Effect of Adversarial Training: To better visualize and understand the advantages of adversarial training brought by Adversarial-NMT, we show several translation cases in Table 4. Concretely speaking, we give two German→English translation examples, including the source language sentence x, the ground-truth translation sentence y, and two NMT model translation sentences, respectively from RNNSearch and Adversarial-NMT (trained after 20 epochs), with the differing parts that lead to different translation quality emphasized in bold fonts. For each model translation y′, we also list D(x, y′), i.e., the probability that the adversary D regards y′ as ground truth, in the third column, and the sentence-level BLEU score of y′ in the last column.

Since the RNNSearch model acts as the warm start for training Adversarial-NMT, its translation can be viewed as the result of Adversarial-NMT at its initial phase. Therefore, from Table 4, we can observe:

• As adversarial training goes on, the quality of the translation sentences output by G gets improved, both in terms of subjective feelings and in terms of BLEU scores as a quantitative measure.

• Correspondingly, the growth in translation quality makes the adversary D deteriorate, as shown by D's successful recognition of the y′ by RNNSearch as translated by a model, whereas D makes mistakes in classifying the y′ from Adversarial-NMT as ground truth (i.e., by a human).

5 Conclusion

In this paper we propose a novel and intuitive training objective for NMT, namely, to force the translation results to be as similar as possible to ground-truth translations generated by humans. Such an objective is achieved via an adversarial training framework called Adversarial-NMT, which complements the original NMT model with an adversary based on a CNN. Adversarial-NMT adopts both a new network architecture to reflect the mapping within the (source, target) sentence pair, and an efficient policy gradient algorithm to tackle the optimization difficulty brought by the discrete nature of machine translation. The experiments on both English→French and German→English translation tasks clearly demonstrate the effectiveness of such an adversarial training method for NMT.

As to future work, with the hope of achieving new state-of-the-art performance for NMT systems, we plan to fully exploit the potential of Adversarial-NMT by combining it with other powerful methods listed in subsection 4.2, such as training with a large vocabulary, the minimum-risk principle, and deep structures. We additionally would like to emphasize and explore the feasibility of adversarial training for other text processing tasks, such as image captioning, dependency parsing and sentiment classification.

System | System Configuration | BLEU

Representative end-to-end NMT systems:
Bahdanau et al. (2014) | RNNSearch | 23.87 (a)
Ranzato et al. (2015) | CNN encoder + Sequence level objective | 21.83
Bahdanau et al. (2016) | CNN encoder + Sequence level actor-critic objective | 22.45
Wiseman and Rush (2016) | RNNSearch + Beam search optimization | 25.48
Shen et al. (2016) | RNNSearch + Minimum Risk Training Objective | 25.84 (b)

Adversarial-NMT:
this work | RNNSearch + Adversarial Training Objective | 26.98†
this work | RNNSearch + Adversarial Training Objective + UNK Replace | 27.94

(a) Reported in (Wiseman and Rush, 2016). (b) Result from our implementation; we reproduced their reported En→Fr result.

Table 3: Different NMT systems' performances on De→En translation. The default setting is a single-layer GRU encoder-decoder model with the MLE training objective, i.e., the RNNSearch model proposed by Bahdanau et al. (2014). †: significantly better than Shen et al. (2016) (p < 0.05).

[Table 4, example 1]
Source sentence x: ich weiß , dass wir es können , und soweit es mich betrifft ist das etwas , was die welt jetzt braucht .
Ground-truth translation y: i know that we can , and as far as i 'm concerned , that 's something the world needs right now .
Translation by RNNSearch y′: i know we can do it , and as far as it 's in time , what the world needs now . | D(x, y′) = 0.14 | BLEU = 27.26
Translation by Adversarial-NMT y′: i know that we can , and as far as it is to be something that the world needs now . | D(x, y′) = 0.67 | BLEU = 50.28

[Table 4, example 2]
Source sentence x: wir müssen verhindern , dass die menschen kenntnis erlangen von dingen , vor allem dann , wenn sie wahr sind .
Ground-truth translation y: we have to prevent people from finding about things , especially when they are true .
Translation by RNNSearch y′: we need to prevent people who are able to know that people have to do , especially if they are true . | D(x, y′) = 0.15 | BLEU = 0.00
Translation by Adversarial-NMT y′: we need to prevent people who are able to know about things , especially if they are true . | D(x, y′) = 0.93 | BLEU = 25.45

Table 4: Case studies to demonstrate the translation quality improvement brought by Adversarial-NMT. We provide two De→En translation examples, with the source German sentence, the ground-truth English sentence, and two translation results respectively provided by RNNSearch and Adversarial-NMT. D(x, y′) is the probability, calculated by the adversary D, that the model translation y′ is the ground-truth translation of x. BLEU is the per-sentence BLEU score for each translated sentence.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2016. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086.

Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico. 2014. Report on the 11th IWSLT evaluation campaign, IWSLT 2014.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, October.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014a. Generative adversarial nets. In NIPS.

Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014b. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.

Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. 2016. Dual learning for machine translation. In NIPS.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation.

Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional neural network architectures for matching natural language sentences. In NIPS, pages 2042–2050.

Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML-15, pages 448–456.

Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In ACL, July.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In NAACL. Association for Computational Linguistics.

Alex M. Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron C. Courville, and Yoshua Bengio. 2016. Professor forcing: A new algorithm for training recurrent networks. In NIPS.

Jiwei Li, Will Monroe, Tianlin Shi, Alan Ritter, and Dan Jurafsky. 2017. Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547.

Minh-Thang Luong and Christopher D. Manning. 2016. Achieving open vocabulary neural machine translation with hybrid word-character models. arXiv preprint arXiv:1604.00788.

Thang Luong, Ilya Sutskever, Quoc Le, Oriol Vinyals, and Wojciech Zaremba. 2015. Addressing the rare word problem in neural machine translation. In ACL, July.

Yurii Nesterov. 1983. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). In Doklady AN SSSR, pages 543–547.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL. Association for Computational Linguistics.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.

Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. 2016. Generative adversarial text to image synthesis. In ICML.

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved techniques for training GANs. In NIPS.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In ACL, August.

Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Minimum risk training for neural machine translation. In ACL, August.

David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.

Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction. MIT Press, Cambridge.

Zhaopeng Tu, Yang Liu, Lifeng Shang, Xiaohua Liu, and Hang Li. 2016a. Neural machine translation with reconstruction. arXiv preprint arXiv:1611.01874.

Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016b. Modeling coverage for neural machine translation. In ACL, August.

Lex Weaver and Nigel Tao. 2001. The optimal reward baseline for gradient-based reinforcement learning. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 538–545. Morgan Kaufmann Publishers Inc.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning.

Sam Wiseman and Alexander M. Rush. 2016. Sequence-to-sequence learning as beam-search optimization. In EMNLP, November.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Zhen Yang, Wei Chen, Feng Wang, and Bo Xu. 2017. Improving neural machine translation with conditional sequence generative adversarial nets. arXiv preprint arXiv:1703.04887.

Wenpeng Yin, Hinrich Schütze, Bing Xiang, and Bowen Zhou. 2015. ABCNN: Attention-based convolutional neural network for modeling sentence pairs. arXiv preprint arXiv:1512.05193.

Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2016. SeqGAN: Sequence generative adversarial nets with policy gradient. arXiv preprint arXiv:1609.05473.

Jiajun Zhang and Chengqing Zong. 2016. Exploiting source-side monolingual data in neural machine translation. In EMNLP, November.

Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. 2016. Deep recurrent models with fast-forward connections for neural machine translation. arXiv preprint arXiv:1606.04199.