
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers), pages 76–86, Melbourne, Australia, July 15–20, 2018. © 2018 Association for Computational Linguistics


The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation

Mia Xu Chen∗ Orhan Firat∗ Ankur Bapna∗

Melvin Johnson Wolfgang Macherey George Foster Llion Jones Niki Parmar

Noam Shazeer Ashish Vaswani Jakob Uszkoreit Lukasz Kaiser

Mike Schuster Zhifeng Chen
miachen,orhanf,ankurbpn,[email protected]

Google AI

Yonghui Wu Macduff Hughes

Abstract

The past year has witnessed rapid advances in sequence-to-sequence (seq2seq) modeling for Machine Translation (MT). The classic RNN-based approaches to MT were first outperformed by the convolutional seq2seq model, which was then outperformed by the more recent Transformer model. Each of these new approaches consists of a fundamental architecture accompanied by a set of modeling and training techniques that are in principle applicable to other seq2seq architectures. In this paper, we tease apart the new architectures and their accompanying techniques in two ways. First, we identify several key modeling and training techniques, and apply them to the RNN architecture, yielding a new RNMT+ model that outperforms all of the three fundamental architectures on the benchmark WMT’14 English→French and English→German tasks. Second, we analyze the properties of each fundamental seq2seq architecture and devise new hybrid architectures intended to combine their strengths. Our hybrid models obtain further improvements, outperforming the RNMT+ model on both benchmark datasets.

1 Introduction

In recent years, the emergence of seq2seq models (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014) has revolutionized the field of MT by replacing traditional phrase-based approaches with neural machine translation (NMT) systems based on the encoder-decoder paradigm. In the first architectures that surpassed the quality of phrase-based MT, both the encoder and decoder were implemented as Recurrent Neural Networks (RNNs), interacting via a soft-attention mechanism (Bahdanau et al., 2015). The RNN-based NMT approach, or RNMT, was quickly established as the de-facto standard for NMT, and gained rapid adoption into large-scale systems in industry, e.g. Baidu (Zhou et al., 2016), Google (Wu et al., 2016), and Systran (Crego et al., 2016).

∗ Equal contribution.

Following RNMT, convolutional neural network based approaches (LeCun and Bengio, 1998) to NMT have recently drawn research attention due to their ability to fully parallelize training to take advantage of modern fast computing devices, such as GPUs and Tensor Processing Units (TPUs) (Jouppi et al., 2017). Well known examples are ByteNet (Kalchbrenner et al., 2016) and ConvS2S (Gehring et al., 2017). The ConvS2S model was shown to outperform the original RNMT architecture in terms of quality, while also providing greater training speed.

Most recently, the Transformer model (Vaswani et al., 2017), which is based solely on a self-attention mechanism (Parikh et al., 2016) and feed-forward connections, has further advanced the field of NMT, both in terms of translation quality and speed of convergence.

In many instances, new architectures are accompanied by a novel set of techniques for performing training and inference that have been carefully optimized to work in concert. This ‘bag of tricks’ can be crucial to the performance of a proposed architecture, yet it is typically under-documented and left for the enterprising researcher to discover in publicly released code (if any) or through anecdotal evidence. This is not simply a problem for reproducibility; it obscures the central scientific question of how much of the observed gains come from the new architecture and how much can be attributed to the associated training and inference techniques. In some cases, these new techniques may be broadly applicable to other architectures and thus constitute a major, though implicit, contribution of an architecture paper. Clearly, they need to be considered in order to ensure a fair comparison across different model architectures.

In this paper, we therefore take a step back and look at which techniques and methods contribute significantly to the success of recent architectures, namely ConvS2S and Transformer, and explore applying these methods to other architectures, including RNMT models. In doing so, we come up with an enhanced version of RNMT, referred to as RNMT+, that significantly outperforms all individual architectures in our setup. We further introduce new architectures built with different components borrowed from RNMT+, ConvS2S and Transformer. In order to ensure a fair setting for comparison, all architectures were implemented in the same framework, use the same pre-processed data and apply no further post-processing, as this may confound bare model performance.

Our contributions are three-fold:

1. In ablation studies, we quantify the effect of several modeling improvements (including multi-head attention and layer normalization) as well as optimization techniques (such as synchronous replica training and label smoothing), which are used in recent architectures. We demonstrate that these techniques are applicable across different model architectures.

2. Combining these improvements with the RNMT model, we propose the new RNMT+ model, which significantly outperforms all fundamental architectures on the widely-used WMT’14 En→Fr and En→De benchmark datasets. We provide a detailed model analysis and comparison of RNMT+, ConvS2S and Transformer in terms of model quality, model size, and training and inference speed.

3. Inspired by our understanding of the relative strengths and weaknesses of individual model architectures, we propose new model architectures that combine components from the RNMT+ and the Transformer model, and achieve better results than both individual architectures.

We quickly note two prior works that provided empirical solutions to the difficulty of training NMT architectures (specifically RNMT). In (Britz et al., 2017) the authors systematically explore which elements of NMT architectures have a significant impact on translation quality. In (Denkowski and Neubig, 2017) the authors recommend three specific techniques for strengthening NMT systems and empirically demonstrate how incorporating those techniques improves the reliability of the experimental results.

2 Background

In this section, we briefly discuss the commonly used NMT architectures.

2.1 RNN-based NMT Models - RNMT

RNMT models are composed of an encoder RNN and a decoder RNN, coupled with an attention network. The encoder summarizes the input sequence into a set of vectors while the decoder conditions on the encoded input sequence through an attention mechanism, and generates the output sequence one token at a time.

The most successful RNMT models consist of stacked RNN encoders with one or more bidirectional RNNs (Schuster and Paliwal, 1997; Graves and Schmidhuber, 2005), and stacked decoders with unidirectional RNNs. Both encoder and decoder RNNs consist of either LSTM (Hochreiter and Schmidhuber, 1997; Gers et al., 2000) or GRU units (Cho et al., 2014), and make extensive use of residual (He et al., 2015) or highway (Srivastava et al., 2015) connections.

In Google-NMT (GNMT) (Wu et al., 2016), the best performing RNMT model on the datasets we consider, the encoder network consists of one bi-directional LSTM layer, followed by 7 uni-directional LSTM layers. The decoder is equipped with a single attention network and 8 uni-directional LSTM layers. Both the encoder and the decoder use residual skip connections between consecutive layers.

In this paper, we adopt GNMT as the starting point for our proposed RNMT+ architecture.

2.2 Convolutional NMT Models - ConvS2S

In the most successful convolutional sequence-to-sequence model (Gehring et al., 2017), both the encoder and decoder are constructed by stacking multiple convolutional layers, where each layer contains 1-dimensional convolutions followed by gated linear units (GLU) (Dauphin et al., 2016). Each decoder layer computes a separate dot-product attention by using the current decoder layer output and the final encoder layer outputs. Positional embeddings are used to provide explicit positional information to the model. Following the practice in (Gehring et al., 2017), we scale the gradients of the encoder layers to stabilize training. We also use residual connections across each convolutional layer and apply weight normalization (Salimans and Kingma, 2016) to speed up convergence. We follow the public ConvS2S codebase¹ in our experiments.
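To make the layer structure concrete, the following minimal Python sketch illustrates a 1-dimensional convolution followed by a gated linear unit and a residual connection. It is an illustrative assumption (simplified shapes, padding, and no gradient scaling), not the fairseq/ConvS2S implementation:

```python
import numpy as np

def glu_conv_layer(x, w, b):
    """One simplified ConvS2S-style block: 1-D convolution followed by a
    gated linear unit (GLU) and a residual connection.

    x: (seq_len, d_model) input sequence
    w: (kernel_width, d_model, 2 * d_model) convolution filters
    b: (2 * d_model,) bias
    Assumes symmetric padding so the output length matches the input length.
    """
    k, _, _ = w.shape
    pad = k // 2
    x_pad = np.pad(x, ((pad, pad), (0, 0)))
    seq_len = x.shape[0]
    # 1-D convolution producing 2 * d_model channels per position.
    conv = np.stack([
        np.tensordot(x_pad[t:t + k], w, axes=([0, 1], [0, 1])) + b
        for t in range(seq_len)
    ])
    # GLU: first half of the channels is the candidate output,
    # second half is a sigmoid gate.
    a, g = np.split(conv, 2, axis=-1)
    out = a * (1.0 / (1.0 + np.exp(-g)))
    return out + x  # residual connection across the convolutional layer
```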

2.3 Conditional Transformation-based NMT Models - Transformer

The Transformer model (Vaswani et al., 2017) is motivated by two major design choices that aim to address deficiencies in the former two model families: (1) Unlike RNMT, but similar to ConvS2S, the Transformer model avoids any sequential dependencies in both the encoder and decoder networks to maximally parallelize training. (2) To address the limited context problem (limited receptive field) present in ConvS2S, the Transformer model makes pervasive use of self-attention networks (Parikh et al., 2016) so that each position in the current layer has access to information from all other positions in the previous layer.

The Transformer model still follows the encoder-decoder paradigm. Encoder transformer layers are built with two sub-modules: (1) a self-attention network and (2) a feed-forward network. Decoder transformer layers have an additional cross-attention layer sandwiched between the self-attention and feed-forward layers to attend to the encoder outputs.

There are two details which we found very important to the model’s performance: (1) Each sub-layer in the transformer (i.e. self-attention, cross-attention, and the feed-forward sub-layer) follows a strict computation sequence: normalize → transform → dropout → residual-add. (2) In addition to per-layer normalization, the final encoder output is again normalized to prevent a blow-up after consecutive residual additions.
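The sub-layer computation sequence can be summarized in a short sketch. The wrapper below is a hypothetical illustration of the order normalize → transform → dropout → residual-add, not the Tensor2Tensor code:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position over its feature dimension."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer(x, transform, dropout_rate=0.1, train=True):
    """Transformer sub-layer wrapper following the order
    normalize -> transform -> dropout -> residual-add."""
    y = layer_norm(x)              # normalize
    y = transform(y)               # self-attention, cross-attention, or feed-forward
    if train:                      # (inverted) dropout
        mask = (np.random.rand(*y.shape) >= dropout_rate)
        y = y * mask / (1.0 - dropout_rate)
    return x + y                   # residual-add
```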

In this paper, we follow the latest version of the Transformer² model in the Tensor2Tensor codebase.

¹ https://github.com/facebookresearch/fairseq-py

2.4 A Theory-Based Characterization of NMT Architectures

From a theoretical point of view, RNNs belong to the most expressive members of the neural network family (Siegelmann and Sontag, 1995)³. Possessing an infinite Markovian structure (and thus an infinite receptive field) equips them to effectively model sequential data (Elman, 1990), especially natural language (Grefenstette et al., 2015). In practice, RNNs are notoriously hard to train (Hochreiter, 1991; Bengio et al., 1994; Hochreiter et al., 2001), confirming the well-known dilemma of trainability versus expressivity.

Convolutional layers are adept at capturing local context and local correlations by design. A fixed and narrow receptive field for each convolutional layer limits their capacity when the architecture is shallow. In practice, this weakness is mitigated by stacking more convolutional layers (e.g. 15 layers as in the ConvS2S model), which makes the model harder to train and demands meticulous initialization schemes and carefully designed regularization techniques.

The Transformer network is capable of approximating arbitrary squashing functions (Hornik et al., 1989), and can be considered a strong feature extractor with extended receptive fields capable of linking salient features from the entire sequence. On the other hand, the lack of a memory component (as present in the RNN models) prevents the network from modeling a state space, reducing its theoretical strength as a sequence model; it therefore requires additional positional information (e.g. sinusoidal positional encodings).

These theoretical characterizations will drive our explorations in the following sections.

3 Experiment Setup

We train our models on the standard WMT’14 En→Fr and En→De datasets that comprise 36.3M and 4.5M sentence pairs, respectively. Each sentence was encoded into a sequence of sub-word units obtained by first tokenizing the sentence with the Moses tokenizer, then splitting tokens into sub-word units (also known as “wordpieces”) using the approach described in (Schuster and Nakajima, 2012).

² https://github.com/tensorflow/tensor2tensor
³ Assuming that data complexity is satisfied.


Figure 1: Model architecture of RNMT+. On the left side, the encoder network has 6 bidirectional LSTM layers. At the end of each bidirectional layer, the outputs of the forward layer and the backward layer are concatenated. On the right side, the decoder network has 8 unidirectional LSTM layers, with the first layer used for obtaining the attention context vector through multi-head additive attention. The attention context vector is then fed directly into the rest of the decoder layers as well as the softmax layer.

We use a shared vocabulary of 32K sub-word units for each source-target language pair. No further manual or rule-based post-processing of the output was performed beyond combining the sub-word units to generate the targets. We report all our results on newstest 2014, which serves as the test set. A combination of newstest 2012 and newstest 2013 is used for validation.

To evaluate the models, we compute the BLEU metric on tokenized, true-case output.⁴ For each training run, we evaluate the model every 30 minutes on the dev set. Once the model converges, we determine the best window based on the average dev-set BLEU score over 21 consecutive evaluations. We report the mean test score and standard deviation over the selected window. This allows us to compare model architectures based on their mean performance after convergence rather than individual checkpoint evaluations, as the latter can be quite noisy for some models.
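The window-selection protocol can be sketched as follows; the helper function and its argument names are hypothetical, with only the window size of 21 consecutive evaluations taken from the description above:

```python
import numpy as np

def select_report_window(dev_bleu, test_bleu, window=21):
    """Pick the window of consecutive evaluations with the highest mean
    dev-set BLEU and report mean and std of the test BLEU over that window.

    dev_bleu, test_bleu: aligned lists of scores, one pair per evaluation.
    """
    dev = np.asarray(dev_bleu)
    test = np.asarray(test_bleu)
    best_start, best_mean = 0, -np.inf
    for start in range(len(dev) - window + 1):
        mean_dev = dev[start:start + window].mean()
        if mean_dev > best_mean:
            best_start, best_mean = start, mean_dev
    chosen = test[best_start:best_start + window]
    return chosen.mean(), chosen.std()
```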

To enable a fair comparison of architectures, we use the same pre-processing and evaluation methodology for all our experiments. We refrain from using checkpoint averaging (exponential moving averages of parameters) (Junczys-Dowmunt et al., 2016) or checkpoint ensembles (Jean et al., 2015; Chen et al., 2017) to focus on evaluating the performance of individual models.

⁴ This procedure is used in the literature to which we compare (Gehring et al., 2017; Wu et al., 2016).

4 RNMT+

4.1 Model Architecture of RNMT+

The newly proposed RNMT+ model architecture is shown in Figure 1. Here we highlight the key architectural choices that differ between the RNMT+ model and the GNMT model. There are 6 bidirectional LSTM layers in the encoder instead of 1 bidirectional LSTM layer followed by 7 unidirectional layers as in GNMT. For each bidirectional layer, the outputs of the forward layer and the backward layer are concatenated before being fed into the next layer. The decoder network consists of 8 unidirectional LSTM layers similar to the GNMT model. Residual connections are added to the third layer and above for both the encoder and decoder. Inspired by the Transformer model, per-gate layer normalization (Ba et al., 2016) is applied within each LSTM cell. Our empirical results show that layer normalization greatly stabilizes training. No non-linearity is applied to the LSTM output. A projection layer is added to the encoder final output.⁵ Multi-head additive attention is used instead of the single-head attention in the GNMT model. Similar to GNMT, we use the bottom decoder layer and the final encoder layer output after projection for obtaining the recurrent attention context. In addition to feeding the attention context to all decoder LSTM layers, we also feed it to the softmax by concatenating it with the layer input. This is important for both the quality of the models with multi-head attention and the stability of the training process.

⁵ The additional projection aims to reduce the dimensionality of the encoder output representations to match the decoder stack dimension.
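As a rough sketch of the multi-head additive attention described above, the numpy function below splits the query and the projected encoder outputs into heads, scores each source position with an additive (Bahdanau-style) scorer, and concatenates the per-head context vectors. Shapes and parameter layout are illustrative assumptions rather than the exact RNMT+ configuration:

```python
import numpy as np

def multi_head_additive_attention(query, keys, params, num_heads=4):
    """Multi-head additive (Bahdanau-style) attention.

    query:  (d,)         bottom decoder layer output at the current step
    keys:   (src_len, d) projected final encoder layer outputs
    params: dict with arrays "W_q", "W_k" of shape (num_heads, d_h, d_h)
            and "v" of shape (num_heads, d_h), where d_h = d // num_heads
    Returns the attention context of shape (d,), the concatenation of
    the per-head context vectors.
    """
    d = query.shape[-1]
    d_h = d // num_heads
    contexts = []
    for h in range(num_heads):
        q = query[h * d_h:(h + 1) * d_h]          # head slice of the query
        k = keys[:, h * d_h:(h + 1) * d_h]        # head slice of the keys
        # Additive scoring: v^T tanh(W_q q + W_k k_j) for every source position j.
        scores = np.tanh(q @ params["W_q"][h] + k @ params["W_k"][h]) @ params["v"][h]
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                   # softmax over source positions
        contexts.append(weights @ k)               # weighted sum of the head's keys
    return np.concatenate(contexts)                # (d,)
```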

Since the encoder network in RNMT+ consists solely of bi-directional LSTM layers, model parallelism is not used during training. We compensate for the resulting longer per-step time with increased data parallelism (more model replicas), so that the overall time to reach convergence of the RNMT+ model is still comparable to that of GNMT.

We apply the following regularization techniques during training.

• Dropout: We apply dropout to both embedding layers and each LSTM layer output before it is added to the next layer’s input. Attention dropout is also applied.

• Label Smoothing: We use uniform label smoothing with an uncertainty = 0.1 (Szegedy et al., 2015). Label smoothing was shown to have a positive impact on both Transformer and RNMT+ models, especially in the case of RNMT+ with multi-head attention (a sketch of this smoothed loss follows this list). Similar to the observations in (Chorowski and Jaitly, 2016), we found it beneficial to use a larger beam size (e.g. 16, 20, etc.) during decoding when models are trained with label smoothing.

• Weight Decay: For the WMT’14 En→De task, we apply L2 regularization to the weights with λ = 10⁻⁵. Weight decay is only applied to the En→De task as the corpus is smaller and thus more regularization is required.
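For concreteness, here is a minimal sketch of the uniform label-smoothed cross-entropy loss mentioned above, assuming uncertainty = 0.1 spread evenly over the non-gold vocabulary entries; it is a simplified formulation, not the actual training code:

```python
import numpy as np

def label_smoothed_cross_entropy(logits, target_ids, uncertainty=0.1):
    """Cross-entropy against a smoothed target distribution that puts
    (1 - uncertainty) mass on the gold token and spreads `uncertainty`
    uniformly over the rest of the vocabulary.

    logits:     (num_tokens, vocab_size) unnormalized scores
    target_ids: (num_tokens,) gold token ids
    """
    num_tokens, vocab_size = logits.shape
    # Numerically stable log-softmax.
    m = logits.max(axis=-1, keepdims=True)
    log_probs = logits - m - np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))
    # Smoothed target distribution.
    smooth = np.full((num_tokens, vocab_size), uncertainty / (vocab_size - 1))
    smooth[np.arange(num_tokens), target_ids] = 1.0 - uncertainty
    return -(smooth * log_probs).sum(axis=-1).mean()
```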

We use the Adam optimizer (Kingma and Ba, 2014) with β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁶ and vary the learning rate according to this schedule:

$$ lr = 10^{-4} \cdot \min\left(1 + \frac{t\,(n-1)}{np},\; n,\; n \cdot (2n)^{\frac{s - nt}{e - s}}\right) \qquad (1) $$

Here, t is the current step, n is the number of concurrent model replicas used in training, p is the number of warmup steps, s is the start step of the exponential decay, and e is the end step of the decay. Specifically, we first increase the learning rate linearly during the number of warmup steps, keep it constant until the decay start step s, then exponentially decay until the decay end step e, and keep it at 5 · 10⁻⁵ after the decay ends. This learning rate schedule is motivated by a similar schedule that was successfully applied in training the Resnet-50 model with a very large batch size (Goyal et al., 2017).
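Eq. 1 and the description above translate directly into a small function. The sketch below is a plain transcription under those definitions; the example values at the bottom are hypothetical placeholders:

```python
def rnmt_plus_learning_rate(t, n, p, s, e):
    """Learning rate schedule of Eq. 1.

    t: current step
    n: number of concurrent model replicas
    p: number of warmup steps
    s: start step of the exponential decay
    e: end step of the exponential decay
    """
    if t >= e:
        return 5e-5  # constant floor after the decay ends
    warmup = 1.0 + t * (n - 1) / (n * p)               # linear warmup term
    decay = n * (2.0 * n) ** ((s - n * t) / (e - s))   # exponential decay term
    return 1e-4 * min(warmup, n, decay)

# Example with hypothetical values: 32 replicas, 500 warmup steps,
# decay between steps 600k and 1.2M.
lr = rnmt_plus_learning_rate(t=1000, n=32, p=500, s=600_000, e=1_200_000)
```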

In contrast to the asynchronous training used for GNMT (Dean et al., 2012), we train RNMT+ models with synchronous training (Chen et al., 2016). Our empirical results suggest that when hyper-parameters are tuned properly, synchronous training often leads to improved convergence speed and superior model quality.

To further stabilize training, we also use adaptive gradient clipping. We discard a training step completely if an anomaly in the gradient norm value is detected, which is usually an indication of an imminent gradient explosion. More specifically, we keep track of a moving average and a moving standard deviation of the log of the gradient norm values, and we abort a step if the norm of the gradient exceeds four standard deviations of the moving average.
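A minimal sketch of this adaptive step-skipping logic is given below. The exponential-moving-average decay factor, the warmup period before the check is enabled, and the exact update rule are not specified in the text and are assumptions here:

```python
import math

class GradientAnomalyDetector:
    """Track moving statistics of the log gradient norm and flag steps whose
    gradient norm exceeds the moving average by four moving standard deviations."""

    def __init__(self, decay=0.99, num_std=4.0, warmup_steps=100):
        self.decay = decay            # assumed EMA decay factor
        self.num_std = num_std        # four standard deviations, as in the text
        self.warmup_steps = warmup_steps
        self.mean = 0.0               # moving average of log grad norm
        self.var = 0.0                # moving variance of log grad norm
        self.steps = 0

    def should_skip_step(self, grad_norm):
        log_norm = math.log(grad_norm + 1e-12)
        if (self.steps >= self.warmup_steps and
                log_norm > self.mean + self.num_std * math.sqrt(self.var)):
            return True  # likely an imminent gradient explosion; abort this step
        # Otherwise update the moving statistics and keep the step.
        if self.steps == 0:
            self.mean = log_norm
        else:
            diff = log_norm - self.mean
            self.mean += (1.0 - self.decay) * diff
            self.var = self.decay * (self.var + (1.0 - self.decay) * diff * diff)
        self.steps += 1
        return False
```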

4.2 Model Analysis and Comparison

In this section, we compare the results of RNMT+ with ConvS2S and Transformer.

All models were trained with synchronous training. RNMT+ and ConvS2S were trained with 32 NVIDIA P100 GPUs, while the Transformer Base and Big models were trained using 16 GPUs.

For RNMT+, we use sentence-level cross-entropy loss. Each training batch contained 4096 sentence pairs (4096 source sequences and 4096 target sequences). For ConvS2S and Transformer models, we use token-level cross-entropy loss. Each training batch contained 65536 source tokens and 65536 target tokens. For the GNMT baselines on both tasks, we cite the largest BLEU score reported in (Wu et al., 2016) without reinforcement learning.

Table 1 shows our results on the WMT’14 En→Fr task. Both the Transformer Big model and RNMT+ outperform GNMT and ConvS2S by about 2 BLEU points. RNMT+ is slightly better than the Transformer Big model in terms of its mean BLEU score. RNMT+ also yields a much lower standard deviation, and hence we observed much less fluctuation in the training curve. It takes approximately 3 days for the Transformer Base model to converge, while both RNMT+ and the Transformer Big model require about 5 days to converge. Although the batching schemes are quite different between the Transformer Big and the RNMT+ model, they have processed about the same amount of training samples upon convergence.

Model         Test BLEU      Epochs   Training Time
GNMT          38.95          -        -
ConvS2S⁶      39.49 ± 0.11   62.2     438h
Trans. Base   39.43 ± 0.17   20.7     90h
Trans. Big⁷   40.73 ± 0.19   8.3      120h
RNMT+         41.00 ± 0.05   8.5      120h

Table 1: Results on WMT’14 En→Fr. The numbers before and after ‘±’ are the mean and standard deviation of test BLEU score over an evaluation window. Note that Transformer models are trained using 16 GPUs, while ConvS2S and RNMT+ are trained using 32 GPUs.

Table 2 shows our results on the WMT’14 En→De task. The Transformer Base model improves over GNMT and ConvS2S by more than 2 BLEU points, while the Big model improves by over 3 BLEU points. RNMT+ further outperforms the Transformer Big model and establishes a new state of the art with an averaged value of 28.49. In this case, RNMT+ converged slightly faster than the Transformer Big model and maintained much more stable performance after convergence with a very small standard deviation, which is similar to what we observed on the En→Fr task.


⁶ Since the ConvS2S model convergence is very slow, we did not explore further tuning on En→Fr, and validated our implementation on En→De.

⁷ The BLEU scores for the Transformer model are slightly lower than those reported in (Vaswani et al., 2017) due to four differences:

1) We report the mean test BLEU score using the strategy described in Section 3.

2) We did not perform checkpoint averaging, since it would be inconsistent with our evaluation for other models.

3) We avoided any manual post-processing, like unicode normalization using Moses replace-unicode-punctuation.perl or output tokenization using Moses tokenizer.perl, to rule out its effect on the evaluation. We observed a significant BLEU increase (about 0.6) on applying these post-processing techniques.

4) In (Vaswani et al., 2017), reported BLEU scores are calculated using mteval-v13a.pl from Moses, which re-tokenizes its input.

Model         Test BLEU      Epochs   Training Time
GNMT          24.67          -        -
ConvS2S       25.01 ± 0.17   38       20h
Trans. Base   27.26 ± 0.15   38       17h
Trans. Big    27.94 ± 0.18   26.9     48h
RNMT+         28.49 ± 0.05   24.6     40h

Table 2: Results on WMT’14 En→De. Note that Transformer models are trained using 16 GPUs, while ConvS2S and RNMT+ are trained using 32 GPUs.

Table 3 summarizes training performance and model statistics. The Transformer Base model is the fastest model in terms of training speed. RNMT+ is slower to train than the Transformer Big model on a per-GPU basis. However, since the RNMT+ model is quite stable, we were able to offset the lower per-GPU throughput with higher concurrency by increasing the number of model replicas, and hence the overall time to convergence was not slowed down much. We also computed the number of floating point operations (FLOPs) in the model’s forward path as well as the number of total parameters for all architectures (cf. Table 3). RNMT+ requires fewer FLOPs than the Transformer Big model, even though both models have a comparable number of parameters.

Model         Examples/s   FLOPs    Params
ConvS2S       80           15.7B    263.4M
Trans. Base   160          6.2B     93.3M
Trans. Big    50           31.2B    375.4M
RNMT+         30           28.1B    378.9M

Table 3: Performance comparison. Examples/s are normalized by the number of GPUs used in the training job. FLOPs are computed assuming that source and target sequence lengths are both 50.

5 Ablation Experiments

In this section, we evaluate the importance of four main techniques for both the RNMT+ and the Transformer Big models. We believe that these techniques are universally applicable across different model architectures, and should always be employed by NMT practitioners for best performance.

We take our best RNMT+ and Transformer Big models and remove each one of these techniques independently. By doing this we hope to learn two things about each technique: (1) How much does it affect the model performance? (2) How useful is it for stable training of other techniques and hence the final model?

Model                    RNMT+    Trans. Big
Baseline                 41.00    40.73
- Label Smoothing        40.33    40.49
- Multi-head Attention   40.44    39.83
- Layer Norm.            *        *
- Sync. Training         39.68    *

Table 4: Ablation results of RNMT+ and the Transformer Big model on WMT’14 En→Fr. We report average BLEU scores on the test set. An asterisk ‘*’ indicates an unstable training run (training halts due to non-finite elements).

From Table 4 we draw the following conclusions about the four techniques:

• Label Smoothing: We observed that label smoothing improves both models, leading to an average increase of 0.7 BLEU for RNMT+ and 0.2 BLEU for Transformer Big models.

• Multi-head Attention: Multi-head attention contributes significantly to the quality of both models, resulting in an average increase of 0.6 BLEU for RNMT+ and 0.9 BLEU for Transformer Big models.

• Layer Normalization: Layer normalization is most critical to stabilize the training process of either model, especially when multi-head attention is used. Removing layer normalization results in unstable training runs for both models. Since, by design, we remove one technique at a time in our ablation experiments, we were unable to quantify how much layer normalization helped in either case. To be able to successfully train a model without layer normalization, we would have to adjust other parts of the model and retune its hyper-parameters.

• Synchronous training: Removing synchronous training has different effects on RNMT+ and Transformer. For RNMT+, it results in a significant quality drop, while for the Transformer Big model, it causes the model to become unstable. We also notice that synchronous training is only successful when coupled with a tailored learning rate schedule that has a warmup stage at the beginning (cf. Eq. 1 for RNMT+ and Eq. 2 for Transformer). For RNMT+, removing this warmup stage during synchronous training causes the model to become unstable.

6 Hybrid NMT Models

In this section, we explore hybrid architectures that shed some light on the salient behavior of each model family. These hybrid models outperform the individual architectures on both benchmark datasets and provide a better understanding of the capabilities and limitations of each model family.

6.1 Assessing Individual Encoders and Decoders

In an encoder-decoder architecture, a natural assumption is that the role of an encoder is to build feature representations that can best encode the meaning of the source sequence, while a decoder should be able to process and interpret the representations from the encoder and, at the same time, track the current target history. Decoding is inherently auto-regressive, and keeping track of the state information should therefore be intuitively beneficial for conditional generation.

We set out to study which family of encoders is more suitable to extract rich representations from a given input sequence, and which family of decoders can make the best use of such rich representations. We start by combining the encoder and decoder from different model families. Since it takes a significant amount of time for a ConvS2S model to converge, and because the final translation quality was not on par with the other models, we focus on two types of hybrids only: Transformer encoder with RNMT+ decoder and RNMT+ encoder with Transformer decoder.

Encoder      Decoder      En→Fr Test BLEU
Trans. Big   Trans. Big   40.73 ± 0.19
RNMT+        RNMT+        41.00 ± 0.05
Trans. Big   RNMT+        41.12 ± 0.16
RNMT+        Trans. Big   39.92 ± 0.21

Table 5: Results for encoder-decoder hybrids.

From Table 5, it is clear that the Transformer encoder is better at encoding or feature extraction than the RNMT+ encoder, whereas RNMT+ is better at decoding or conditional language modeling, confirming our intuition that a stateful decoder is beneficial for conditional language generation.

6.2 Assessing Encoder Combinations

Next, we explore how the features extracted by an encoder can be further enhanced by incorporating additional information. Specifically, we investigate the combination of transformer layers with RNMT+ layers in the same encoder block to build even richer feature representations. We exclusively use RNMT+ decoders in the following architectures, since stateful decoders show better performance according to Table 5.

We study two mixing schemes in the encoder (see Fig. 2):

(1) Cascaded Encoder: The cascaded encoder aims at combining the representational power of RNNs and self-attention. The idea is to enrich a set of stateful representations by cascading a feature extractor with a focus on vertical mapping, similar to (Pascanu et al., 2013; Devlin, 2017). Our best performing cascaded encoder involves fine-tuning transformer layers stacked on top of a pre-trained frozen RNMT+ encoder. Using a pre-trained encoder avoids optimization difficulties while significantly enhancing encoder capacity. As shown in Table 6, the cascaded encoder improves over the Transformer encoder by more than 0.5 BLEU points on the WMT’14 En→Fr task. This suggests that the Transformer encoder is able to extract richer representations if the input is augmented with sequential context.

(2) Multi-Column Encoder: As illustrated in Fig. 2b, a multi-column encoder merges the outputs of several independent encoders into a single combined representation. Unlike a cascaded encoder, the multi-column encoder enables us to investigate whether an RNMT+ decoder can distinguish information received from two different channels and benefit from its combination. A crucial operation in a multi-column encoder is therefore how different sources of information are merged into a unified representation. Our best multi-column encoder performs a simple concatenation of individual column outputs.
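A minimal sketch of the concatenation merge might look as follows; the projection back to the model dimension and all shapes are assumptions added for clarity, not the exact configuration:

```python
import numpy as np

def multi_column_merge(rnmt_outputs, transformer_outputs, w_proj):
    """Merge the outputs of two encoder columns by concatenation.

    rnmt_outputs:        (src_len, d) outputs of the RNMT+ encoder column
    transformer_outputs: (src_len, d) outputs of the Transformer encoder column
    w_proj:              (2 * d, d) assumed projection back to the model dimension
    """
    combined = np.concatenate([rnmt_outputs, transformer_outputs], axis=-1)
    return combined @ w_proj  # unified representation consumed by the RNMT+ decoder
```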

The model details and hyperparameters of the above two encoders are described in Appendix A.5 and A.6. As shown in Table 6, the multi-column encoder followed by an RNMT+ decoder achieves better results than the Transformer and the RNMT model on both WMT’14 benchmark tasks.

Model        En→Fr BLEU      En→De BLEU
Trans. Big   40.73 ± 0.19    27.94 ± 0.18
RNMT+        41.00 ± 0.05    28.49 ± 0.05
Cascaded     41.67 ± 0.11    28.62 ± 0.06
MultiCol     41.66 ± 0.11    28.84 ± 0.06

Table 6: Results for hybrids with cascaded encoder and multi-column encoder.

Figure 2: Vertical and horizontal mixing of Transformer and RNMT+ components in an encoder: (a) Cascaded Encoder; (b) Multi-Column Encoder.

7 Conclusion

In this work we explored the efficacy of several architectural and training techniques proposed in recent studies on seq2seq models for NMT. We demonstrated that many of these techniques are broadly applicable to multiple model architectures. Applying these new techniques to RNMT models yields RNMT+, an enhanced RNMT model that significantly outperforms the three fundamental architectures on WMT’14 En→Fr and En→De tasks. We further presented several hybrid models developed by combining encoders and decoders from the Transformer and RNMT+ models, and empirically demonstrated the superiority of the Transformer encoder and the RNMT+ decoder in comparison with their counterparts. We then enhanced the encoder architecture by horizontally and vertically mixing components borrowed from these architectures, leading to hybrid architectures that obtain further improvements over RNMT+.

We hope that our work will motivate NMT researchers to further investigate generally applicable training and optimization techniques, and that our exploration of hybrid architectures will open paths for new architecture search efforts for NMT.


Our focus on a standard single-language-pair translation task leaves important open questions to be answered: How do our new architectures compare in multilingual settings, i.e., modeling an interlingua? Which architecture is more efficient and powerful in processing finer-grained inputs and outputs, e.g., characters or bytes? How transferable are the representations learned by the different architectures to other tasks? And what are the characteristic errors that each architecture makes, e.g., linguistic plausibility?

Acknowledgments

We would like to thank the entire Google Brain Team and Google Translate Team for their foundational contributions to this project. We would also like to thank the entire Tensor2Tensor development team for their useful inputs and discussions.

References

Lei Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. CoRR abs/1607.06450. http://arxiv.org/abs/1607.06450.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations. http://arxiv.org/abs/1409.0473.

Y. Bengio, P. Simard, and P. Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. Trans. Neur. Netw. 5(2):157–166.

Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc Le. 2017. Massive exploration of neural machine translation architectures. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, pages 1442–1451. https://www.aclweb.org/anthology/D17-1151.

Hugh Chen, Scott Lundberg, and Su-In Lee. 2017. Checkpoint ensembles: Ensemble methods from a single training process. CoRR abs/1710.03282. http://arxiv.org/abs/1710.03282.

Jianmin Chen, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. 2016. Revisiting distributed synchronous SGD. CoRR abs/1604.00981. http://arxiv.org/abs/1604.00981.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing. http://arxiv.org/abs/1406.1078.

Jan Chorowski and Navdeep Jaitly. 2016. Towards better decoding and language model integration in sequence to sequence models. CoRR abs/1612.02695. http://arxiv.org/abs/1612.02695.

Josep Maria Crego, Jungi Kim, Guillaume Klein, Anabel Rebollo, Kathy Yang, Jean Senellart, Egor Akhanov, Patrice Brunelle, Aurelien Coquard, Yongchao Deng, Satoshi Enoue, Chiyo Geiss, Joshua Johanson, Ardas Khalsa, Raoum Khiari, Byeongil Ko, Catherine Kobus, Jean Lorieux, Leidiana Martins, Dang-Chuan Nguyen, Alexandra Priori, Thomas Riccardi, Natalia Segal, Christophe Servan, Cyril Tiquet, Bo Wang, Jin Yang, Dakun Zhang, Jing Zhou, and Peter Zoldan. 2016. Systran’s pure neural machine translation systems. CoRR abs/1610.05540. http://arxiv.org/abs/1610.05540.

Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. 2016. Language modeling with gated convolutional networks. CoRR abs/1612.08083. http://arxiv.org/abs/1612.08083.

Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. 2012. Large scale distributed deep networks. In NIPS.

Michael Denkowski and Graham Neubig. 2017. Stronger baselines for trustable results in neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation. Association for Computational Linguistics, Vancouver, pages 18–27. http://www.aclweb.org/anthology/W17-3203.

Jacob Devlin. 2017. Sharp models on dull hardware: Fast and accurate neural machine translation decoding on the CPU. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 2820–2825. http://aclweb.org/anthology/D17-1300.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science 14(2):179–211.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. CoRR abs/1705.03122. http://arxiv.org/abs/1705.03122.

Felix A. Gers, Jurgen Schmidhuber, and Fred Cummins. 2000. Learning to forget: Continual prediction with LSTM. Neural Computation 12(10):2451–2471.

Priya Goyal, Piotr Dollar, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch SGD: Training ImageNet in 1 hour. CoRR abs/1706.02677. http://arxiv.org/abs/1706.02677.

Alex Graves and Jurgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18(5):602–610. IJCNN 2005.

Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. 2015. Learning to transduce with unbounded memory. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2. MIT Press, Cambridge, MA, USA, NIPS’15, pages 1828–1836. http://dl.acm.org/citation.cfm?id=2969442.2969444.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. CoRR abs/1512.03385. http://arxiv.org/abs/1512.03385.

Sepp Hochreiter. 1991. Untersuchungen zu dynamischen neuronalen Netzen. Diploma, Technische Universitat Munchen 91:1.

Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jurgen Schmidhuber. 2001. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies.

Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2(5):359–366.

Sebastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing.

Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, et al. 2017. In-datacenter performance analysis of a tensor processing unit. CoRR abs/1704.04760. http://arxiv.org/abs/1704.04760.

Marcin Junczys-Dowmunt, Tomasz Dwojak, and Rico Sennrich. 2016. The AMU-UEDIN submission to the WMT16 news translation task: Attention-based NMT models as feature functions in phrase-based SMT. arXiv preprint arXiv:1605.04809.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Conference on Empirical Methods in Natural Language Processing.

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. CoRR abs/1610.10099. http://arxiv.org/abs/1610.10099.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR abs/1412.6980. http://arxiv.org/abs/1412.6980.

Yann LeCun and Yoshua Bengio. 1998. The handbook of brain theory and neural networks. MIT Press, Cambridge, MA, USA, chapter Convolutional Networks for Images, Speech, and Time Series, pages 255–258. http://dl.acm.org/citation.cfm?id=303568.303704.

Ankur P. Parikh, Oscar Tackstrom, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In EMNLP.

Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2013. How to construct deep recurrent neural networks. CoRR abs/1312.6026. http://arxiv.org/abs/1312.6026.

Tim Salimans and Diederik P. Kingma. 2016. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. CoRR abs/1602.07868. http://arxiv.org/abs/1602.07868.

M. Schuster and K. K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45(11):2673–2681.

Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. 2012 IEEE International Conference on Acoustics, Speech and Signal Processing.

H. T. Siegelmann and E. D. Sontag. 1995. On the computational power of neural nets. Journal of Computer and System Sciences 50(1):132–150.

Rupesh Kumar Srivastava, Klaus Greff, and Jurgen Schmidhuber. 2015. Highway networks. CoRR abs/1505.00387.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2015. Rethinking the inception architecture for computer vision. CoRR abs/1512.00567. http://arxiv.org/abs/1512.00567.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. CoRR abs/1706.03762. http://arxiv.org/abs/1706.03762.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144. http://arxiv.org/abs/1609.08144.

Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. 2016. Deep recurrent models with fast-forward connections for neural machine translation. CoRR abs/1606.04199. http://arxiv.org/abs/1606.04199.