arXiv:1609.07317v2 [cs.CL] 14 Oct 2016


Language as a Latent Variable: Discrete Generative Models for Sentence Compression

Yishu Miao^1, Phil Blunsom^1,2

^1 University of Oxford, ^2 Google Deepmind
{yishu.miao, phil.blunsom}@cs.ox.ac.uk

Abstract

In this work we explore deep generative models of text in which the latent representation of a document is itself drawn from a discrete language model distribution. We formulate a variational auto-encoder for inference in this model and apply it to the task of compressing sentences. In this application the generative model first draws a latent summary sentence from a background language model, and then subsequently draws the observed sentence conditioned on this latent summary. In our empirical evaluation we show that generative formulations of both abstractive and extractive compression yield state-of-the-art results when trained on a large amount of supervised data. Further, we explore semi-supervised compression scenarios where we show that it is possible to achieve performance competitive with previously proposed supervised models while training on a fraction of the supervised data.

1 Introduction

The recurrent sequence-to-sequence paradigm for natural language generation (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014) has achieved remarkable recent success and is now the approach of choice for applications such as machine translation (Bahdanau et al., 2015), caption generation (Xu et al., 2015) and speech recognition (Chorowski et al., 2015). While these models have developed sophisticated conditioning mechanisms, e.g. attention, fundamentally they are discriminative models trained only to approximate the conditional output distribution of strings. In this paper we explore modelling the joint distribution of string pairs using a deep generative model and employing a discrete variational auto-encoder (VAE) for inference (Kingma and Welling, 2014; Rezende et al., 2014; Mnih and Gregor, 2014). We evaluate our generative approach on the task of sentence compression. This approach provides both alternative supervised objective functions and the opportunity to perform semi-supervised learning by exploiting the VAE's ability to marginalise the latent compressed text for unlabelled data.

Auto-encoders (Rumelhart et al., 1985) are a typical neural network architecture for learning compact data representations, with the general aim of performing dimensionality reduction on embeddings (Hinton and Salakhutdinov, 2006). In this paper, rather than seeking to embed inputs as points in a vector space, we describe them with explicit natural language sentences. This approach is a natural fit for summarisation tasks such as sentence compression. Accordingly, we propose a generative auto-encoding sentence compression (ASC) model, where we introduce a latent language model to provide the variable-length compact summary. The objective is to perform Bayesian inference for the posterior distribution of summaries conditioned on the observed utterances. Hence, in the VAE framework, we construct an inference network as the variational approximation of the posterior, which generates compression samples to optimise the variational lower bound.

The most common family of variational auto-encoders relies on the reparameterisation trick, which is not applicable for our discrete latent language model. Instead, we employ the REINFORCE algorithm (Mnih et al., 2014; Mnih and Gregor, 2014)


[Figure 1 diagram: Encoder, Compressor and Decoder RNNs; Compression (Pointer Networks); Reconstruction (Soft Attention).]

Figure 1: Auto-encoding Sentence Compression Model

to mitigate the problem of high variance during sampling-based variational inference. Nevertheless, when directly applying the RNN encoder-decoder to model the variational distribution it is very difficult to generate reasonable compression samples in the early stages of training, since each hidden state of the sequence would have |V| possible words to be sampled from. To combat this we employ pointer networks (Vinyals et al., 2015) to construct the variational distribution. This biases the latent space to sequences composed of words only appearing in the source sentence (i.e. the size of the softmax output for each state becomes the length of the current source sentence), which amounts to applying an extractive compression model for the variational approximation.

In order to further boost the performance on sentence compression, we employ a supervised forced-attention sentence compression (FSC) model trained on labelled data to teach the ASC model to generate compression sentences. The FSC model shares the pointer network of the ASC model and combines it with a softmax output layer over the whole vocabulary. Therefore, while training on sentence-compression pairs, it is able to balance copying a word from the source sentence with generating it from the background distribution. More importantly, by jointly training on the labelled and unlabelled datasets, this shared pointer network enables the model to work in a semi-supervised scenario. In this case, the FSC teaches the ASC to generate reasonable samples, while the pointer network trained on a large unlabelled data set helps the FSC model to perform better abstractive summarisation.

In Section 6, we evaluate the proposed model by jointly training the generative (ASC) and discriminative (FSC) models on the standard Gigaword sentence compression task with varying amounts of labelled and unlabelled data. The results demonstrate that by introducing a latent language variable we are able to match the previous benchmarks with a small amount of supervised data. When we employ our mixed discriminative and generative objective with all of the supervised data, the model significantly outperforms all previously published results.

2 Auto-Encoding Sentence Compression

In this section, we introduce the auto-encoding sentence compression model (Figure 1)1 in the framework of variational auto-encoders. The ASC model consists of four recurrent neural networks – an encoder, a compressor, a decoder and a language model.

Let s be the source sentence, and c be the compression sentence. The compression model (encoder-compressor) is the inference network qφ(c|s) that takes source sentences s as inputs and generates extractive compressions c.

1 The language model, layer connections and decoder soft attentions are omitted in Figure 1 for clarity.

The reconstruction model (compressor-decoder) is the generative network pθ(s|c) that reconstructs source sentences s based on the latent compressions c. Hence, the forward pass starts from the encoder to the compressor and ends at the decoder. As the prior distribution, a language model p(c) is pre-trained to regularise the latent compressions so that the samples drawn from the compression model are likely to be reasonable natural language sentences.

2.1 Compression

For the compression model (encoder-compressor), qφ(c|s), we employ a pointer network consisting of a bidirectional LSTM encoder that processes the source sentences, and an LSTM compressor that generates compressed sentences by attending to the encoded source words.

Let s_i be the words in the source sentences and h^e_i be the corresponding state outputs of the encoder. h^e_i are the concatenated hidden states from each direction:

h^e_i = \overrightarrow{f}_{enc}(\overrightarrow{h}^e_{i-1}, s_i) \,\Vert\, \overleftarrow{f}_{enc}(\overleftarrow{h}^e_{i+1}, s_i)    (1)

Further, let c_j be the words in the compressed sentences and h^c_j be the state outputs of the compressor. We construct the predictive distribution by attending to the words in the source sentences:

h^c_j = f_{com}(h^c_{j-1}, c_{j-1})    (2)

u_j(i) = w_3^T \tanh(W_1 h^c_j + W_2 h^e_i)    (3)

q_φ(c_j | c_{1:j-1}, s) = \mathrm{softmax}(u_j)    (4)

where c_0 is the start symbol for each compressed sentence and h^c_0 is initialised by the source sentence vector h^e_{|s|}. In this case, all the words c_j sampled from q_φ(c_j | c_{1:j-1}, s) are a subset of the words appearing in the source sentence (i.e. c_j ∈ s).
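To make the pointer-network parameterisation concrete, the following is a minimal PyTorch sketch of Eqs. 1 to 4. The class name, the toy interface and the teacher-forced loop are our own illustration under stated assumptions, not the authors' code; masking, sampling and the language-model prior are omitted.

```python
import torch
import torch.nn as nn


class PointerCompressor(nn.Module):
    """Sketch of the compression model q_phi(c|s) of Eqs. 1-4: a bidirectional
    LSTM encoder plus an LSTM compressor whose output distribution at each step
    is a softmax over source positions (a pointer network)."""

    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        self.compressor = nn.LSTMCell(dim, 2 * dim)
        self.W1 = nn.Linear(2 * dim, dim, bias=False)   # W_1 h^c_j
        self.W2 = nn.Linear(2 * dim, dim, bias=False)   # W_2 h^e_i
        self.w3 = nn.Linear(dim, 1, bias=False)         # w_3^T tanh(.)

    def forward(self, src, cmp_prev):
        # Eq. 1: concatenated forward/backward encoder states h^e_i
        h_enc, _ = self.encoder(self.embed(src))               # (B, |s|, 2*dim)
        # h^c_0 is initialised from the last encoder state h^e_{|s|}
        h_c = h_enc[:, -1, :]
        cell = torch.zeros_like(h_c)
        attn_logits = []
        for j in range(cmp_prev.size(1)):
            # Eq. 2: h^c_j = f_com(h^c_{j-1}, c_{j-1})
            h_c, cell = self.compressor(self.embed(cmp_prev[:, j]), (h_c, cell))
            # Eq. 3: u_j(i) = w_3^T tanh(W_1 h^c_j + W_2 h^e_i)
            u_j = self.w3(torch.tanh(self.W1(h_c).unsqueeze(1)
                                     + self.W2(h_enc))).squeeze(-1)
            attn_logits.append(u_j)
        # Eq. 4: q_phi(c_j | c_{1:j-1}, s) = softmax over source positions
        return torch.softmax(torch.stack(attn_logits, dim=1), dim=-1)
```

At inference time a compression would be sampled token by token from each returned distribution; here the previous compression tokens are supplied up front (teacher forcing) purely for clarity.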

2.2 Reconstruction

For the reconstruction model (compressor-decoder) p_θ(s|c), we apply a soft-attention sequence-to-sequence model to generate the source sentence s based on the compression samples c ∼ q_φ(c|s).

Let s_k be the words in the reconstructed sentences and h^d_k be the corresponding state outputs of the decoder:

h^d_k = f_{dec}(h^d_{k-1}, s_{k-1})    (5)

In this model, we directly use the recurrent cell of the compressor to encode the compression samples2:

h^c_j = f_{com}(h^c_{j-1}, c_j)    (6)

where the state outputs h^c_j corresponding to the word inputs c_j are different from the outputs h^c_j in the compression model, since we block the information from the source sentences. We also introduce a start symbol s_0 for the reconstructed sentence, and h^d_0 is initialised by the last state output h^c_{|c|}. The soft attention model is defined as:

v_k(j) = w_6^T \tanh(W_4 h^d_k + W_5 h^c_j)    (7)

γ_k(j) = \mathrm{softmax}(v_k(j))    (8)

d_k = \sum_j^{|c|} γ_k(j) h^c_j    (9)

We then construct the predictive probability distribution over reconstructed words using a softmax:

p_θ(s_k | s_{1:k-1}, c) = \mathrm{softmax}(W_7 d_k)    (10)
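A corresponding sketch of a single soft-attention decoding step (Eqs. 7 to 10); again this is our own hedged illustration of the equations rather than the authors' implementation, and it assumes the compressor states have already been recomputed as in Eq. 6.

```python
import torch
import torch.nn as nn


class SoftAttentionDecoder(nn.Module):
    """Sketch of one attention step of the reconstruction model p_theta(s|c)
    (Eqs. 7-10): attend over the compressor states h^c_j, form a context d_k,
    then predict the next source word with a softmax over the decoder vocab."""

    def __init__(self, dec_vocab, dim=256):
        super().__init__()
        self.W4 = nn.Linear(dim, dim, bias=False)        # W_4 h^d_k
        self.W5 = nn.Linear(dim, dim, bias=False)        # W_5 h^c_j
        self.w6 = nn.Linear(dim, 1, bias=False)          # w_6^T tanh(.)
        self.W7 = nn.Linear(dim, dec_vocab, bias=False)  # output projection

    def forward(self, h_dec_k, h_cmp):
        # Eq. 7: v_k(j) = w_6^T tanh(W_4 h^d_k + W_5 h^c_j)
        v = self.w6(torch.tanh(self.W4(h_dec_k).unsqueeze(1)
                               + self.W5(h_cmp))).squeeze(-1)
        # Eq. 8: attention weights over the compression tokens
        gamma = torch.softmax(v, dim=-1)                  # (B, |c|)
        # Eq. 9: context vector d_k = sum_j gamma_k(j) h^c_j
        d_k = torch.bmm(gamma.unsqueeze(1), h_cmp).squeeze(1)
        # Eq. 10: p_theta(s_k | s_{1:k-1}, c) = softmax(W_7 d_k)
        return torch.log_softmax(self.W7(d_k), dim=-1)
```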

2.3 Inference

In the ASC model there are two sets of parameters, φ and θ, that need to be updated during inference. Due to the non-differentiability of the model, the reparameterisation trick of the VAE is not applicable in this case. Thus, we use the REINFORCE algorithm (Mnih et al., 2014; Mnih and Gregor, 2014) to reduce the variance of the gradient estimator.

The variational lower bound of the ASC model is:

L = E_{q_φ(c|s)}[\log p_θ(s|c)] − D_{KL}[q_φ(c|s) \Vert p(c)]
  ≤ \log \int q_φ(c|s) \frac{p_θ(s|c)\, p(c)}{q_φ(c|s)} \, dc = \log p(s)    (11)

Therefore, by optimising the lower bound (Eq. 11), the model balances the selection of keywords for the summaries and the efficacy of the composed compressions, corresponding to the reconstruction error and KL divergence respectively.

In practice, the pre-trained language model prior p(c) prefers short sentences for compressions. As one of the drawbacks of VAEs, the KL divergence term in the lower bound pushes every sample drawn from the variational distribution towards the prior.

2 The recurrent parameters of the compressor are not updated by the gradients from the reconstruction model.

[Figure 2 diagram: Encoder and Compressor; Compression (Combined Pointer Networks) mixing pointer attention α over the source words with vocabulary predictions β selected from V.]

Figure 2: Forced Attention Sentence Compression Model

This acts to regularise the posterior, but also restricts the learning of the encoder. If the estimator keeps sampling short compressions during inference, the LSTM decoder would gradually rely on the contexts from the decoded words instead of the information provided by the compressions, which does not yield the best performance on sentence compression.

Here, we introduce a coefficient λ to scale the learning signal of the KL divergence:

L = E_{q_φ(c|s)}[\log p_θ(s|c)] − λ D_{KL}[q_φ(c|s) \Vert p(c)]    (12)

Although we are no longer optimising the exact variational lower bound, the ultimate goal of learning an effective compression model depends mostly on the reconstruction error. In Section 6, we empirically apply λ = 0.1 for all the experiments on the ASC model. Interestingly, λ controls the compression rate of the sentences, which could be a useful property to explore in future work.
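A minimal sketch of how the scaled objective of Eq. 12 can be estimated from Monte Carlo samples of the compression model; the function name and interface are our own assumptions, and the per-sample log-probabilities are taken as already computed.

```python
import torch


def scaled_lower_bound(log_p_recon, log_q_c, log_p_c, lam=0.1):
    """Sketch of the lambda-scaled objective of Eq. 12, estimated from samples
    c ~ q_phi(c|s).  All arguments are summed token log-probabilities per
    sampled compression:
      log_p_recon = log p_theta(s|c), log_q_c = log q_phi(c|s), log_p_c = log p(c).
    The KL term is approximated on the samples by log q_phi(c|s) - log p(c);
    lam = 0.1 matches the value reported in Section 6."""
    kl_sample = log_q_c - log_p_c
    # L = E_q[log p_theta(s|c)] - lambda * D_KL[q_phi(c|s) || p(c)]
    return (log_p_recon - lam * kl_sample).mean()
```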

During inference, we have different strategies for updating the parameters φ and θ. For the parameters θ in the reconstruction model, we directly update them by the gradients:

\frac{∂L}{∂θ} = E_{q_φ(c|s)}\left[ \frac{∂ \log p_θ(s|c)}{∂θ} \right] ≈ \frac{1}{M} \sum_m \frac{∂ \log p_θ(s|c^{(m)})}{∂θ}    (13)

where we draw M samples c^{(m)} ∼ q_φ(c|s) independently for computing the stochastic gradients.

For the parameters φ in the compression model, we firstly define the learning signal,

l(s, c) = \log p_θ(s|c) − λ(\log q_φ(c|s) − \log p(c)).

Then, we update the parameters φ by:

\frac{∂L}{∂φ} = E_{q_φ(c|s)}\left[ l(s, c) \frac{∂ \log q_φ(c|s)}{∂φ} \right] ≈ \frac{1}{M} \sum_m l(s, c^{(m)}) \frac{∂ \log q_φ(c^{(m)}|s)}{∂φ}    (14)

However, this gradient estimator has high variance because the learning signal l(s, c^{(m)}) relies on the samples from q_φ(c|s). Therefore, following the REINFORCE algorithm, we introduce two baselines, b and b(s), the centred learning signal and input-dependent baseline respectively, to help reduce the variance.

Here, we build an MLP to implement the input-dependent baseline b(s). During training, we learn the two baselines by minimising the expectation:

E_{q_φ(c|s)}[(l(s, c) − b − b(s))^2].    (15)

Hence, the gradients w.r.t. φ are derived as,

\frac{∂L}{∂φ} ≈ \frac{1}{M} \sum_m (l(s, c^{(m)}) − b − b(s)) \frac{∂ \log q_φ(c^{(m)}|s)}{∂φ}    (16)

which is basically a likelihood-ratio estimator.
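The following sketch shows how the estimators of Eqs. 13 to 16 can be folded into a single surrogate loss for automatic differentiation. It is our own illustration under stated assumptions (summed per-sample log-probabilities, a learned scalar baseline b and the output b(s) of a baseline MLP), not the authors' code.

```python
import torch


def reinforce_losses(log_p_recon, log_q_c, log_p_c, b, b_s, lam=0.1):
    """Surrogate losses for Eqs. 13-16 on one minibatch of sampled compressions.
    log_p_recon, log_q_c, log_p_c: summed log-probs per sample, differentiable
    w.r.t. theta and phi respectively; b: learned scalar baseline parameter;
    b_s: per-example output of the input-dependent baseline MLP b(s)."""
    # Learning signal l(s, c) = log p_theta(s|c) - lambda(log q_phi(c|s) - log p(c)),
    # treated as a constant w.r.t. all parameters.
    signal = (log_p_recon - lam * (log_q_c - log_p_c)).detach()
    centred = signal - b - b_s
    # Eq. 13: plain Monte Carlo gradient for the reconstruction parameters theta.
    loss_theta = -log_p_recon.mean()
    # Eq. 16: likelihood-ratio gradient for phi with variance-reducing baselines.
    loss_phi = -(centred.detach() * log_q_c).mean()
    # Eq. 15: the baselines are regressed onto the learning signal.
    loss_baseline = (centred ** 2).mean()
    # The three terms touch disjoint parameter groups, so their sum can be
    # backpropagated in one call.
    return loss_theta + loss_phi + loss_baseline
```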

3 Forced-attention Sentence Compression

In neural variational inference, the effectiveness of training largely depends on the quality of the inference network gradient estimator. Although we introduce a biased estimator by using pointer networks, it is still very difficult for the compression model to generate reasonable natural language sentences at the early stage of learning, which results in high variance for the gradient estimator. Here, we introduce our supervised forced-attention sentence compression (FSC) model to teach the compression model to generate coherent compressed sentences.

Rather than directly replicating the pointer network of the ASC model, or using a typical sequence-to-sequence model, the FSC model employs a forced-attention strategy (Figure 2) that encourages the compressor to select words appearing in the source sentence but keeps the original full output vocabulary V. The forced-attention strategy is basically a combined pointer network that chooses whether to select a word from the source sentence s or to predict a word from V at each recurrent state. Hence, the combined pointer network learns to copy the source words while predicting the word sequences of compressions. By sharing the pointer network between the ASC and FSC models, the biased estimator obtains further positive biases by training on a small set of labelled source-compression pairs.

Here, the FSC model makes use of the compression model (Eqs. 1 to 4) of the ASC model,

α_j = \mathrm{softmax}(u_j),    (17)

where α_j(i), i ∈ (1, . . . , |s|), denotes the probability of selecting s_i as the prediction for c_j.

On the basis of the pointer network, we further introduce the probability of predicting c_j selected from the full vocabulary,

β_j = \mathrm{softmax}(W h^c_j),    (18)

where β_j(w), w ∈ (1, . . . , |V|), denotes the probability of selecting the w-th word from V as the prediction for c_j. To combine these two probabilities in the RNN, we define a selection factor t for each state output, which computes the semantic similarity between the current state and the attention vector,

η_j = \sum_i^{|s|} α_j(i) h^e_i    (19)

t_j = σ(η_j^T M h^c_j).    (20)

Hence, the probability distribution over compressed words is defined as,

p(c_j | c_{1:j-1}, s) = \begin{cases} t_j α_j(i) + (1 − t_j) β_j(c_j), & c_j = s_i \\ (1 − t_j) β_j(c_j), & c_j \notin s \end{cases}    (21)

Essentially, the FSC model is the extended compression model of ASC, incorporating the pointer network with a softmax output layer over the full vocabulary. So we employ φ to denote the parameters of the FSC model p_φ(c|s), which covers the parameters of the variational distribution q_φ(c|s).
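A hedged sketch of the combined output distribution of Eqs. 17 to 21; the argument names, the bilinear parameter M and the scatter-based mixing over the vocabulary are our own rendering of Eq. 21, not the authors' implementation.

```python
import torch


def combined_pointer_distribution(u_j, beta_logits, h_enc, h_cmp_j, M, src_ids):
    """Mix the pointer distribution alpha_j over source positions with the
    vocabulary distribution beta_j via the gate t_j (Eqs. 17-21).
    u_j:         (B, |s|) pointer logits from Eq. 3
    beta_logits: (B, |V|) vocabulary logits, Eq. 18
    h_enc:       (B, |s|, d) encoder states, h_cmp_j: (B, d) compressor state
    M:           (d, d) bilinear parameter of Eq. 20
    src_ids:     (B, |s|) LongTensor mapping source positions to vocab ids."""
    alpha = torch.softmax(u_j, dim=-1)                      # Eq. 17
    beta = torch.softmax(beta_logits, dim=-1)               # Eq. 18
    eta = torch.bmm(alpha.unsqueeze(1), h_enc).squeeze(1)   # Eq. 19
    # Eq. 20: t_j = sigmoid(eta_j^T M h^c_j)
    t = torch.sigmoid((eta.unsqueeze(1) @ M @ h_cmp_j.unsqueeze(-1))
                      .squeeze(-1).squeeze(-1))
    # Eq. 21: words outside the source get (1 - t) * beta only; source words
    # additionally receive t * alpha, scattered onto their vocabulary ids.
    p = (1.0 - t).unsqueeze(-1) * beta
    p = p.scatter_add(1, src_ids, t.unsqueeze(-1) * alpha)
    return p
```

If a word occurs at several source positions, its pointer probabilities are accumulated, which is one natural reading of the first case of Eq. 21.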

4 Semi-supervised Training

As the auto-encoding sentence compression (ASC) model grants the ability to make use of an unlabelled dataset, we explore a semi-supervised training framework for the ASC and FSC models. In this scenario we have a labelled dataset that contains source-compression parallel sentences, (s, c) ∈ L, and an unlabelled dataset that contains only source sentences s ∈ U. The FSC model is trained on L so that we are able to learn the compression model by maximising the log-probability,

F = \sum_{(c,s) ∈ L} \log p_φ(c|s).    (22)

The ASC model, meanwhile, is trained on U, where we maximise the modified variational lower bound,

L = \sum_{s ∈ U} \left( E_{q_φ(c|s)}[\log p_θ(s|c)] − λ D_{KL}[q_φ(c|s) \Vert p(c)] \right).    (23)

The joint objective function of the semi-supervised learning is,

J = \sum_{s ∈ U} \left( E_{q_φ(c|s)}[\log p_θ(s|c)] − λ D_{KL}[q_φ(c|s) \Vert p(c)] \right) + \sum_{(c,s) ∈ L} \log p_φ(c|s).    (24)

Hence, the pointer network is trained on both unlabelled data, U, and labelled data, L, by a mixed criterion of REINFORCE and cross-entropy.
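A minimal sketch of one semi-supervised update under the joint objective J of Eq. 24, assuming hypothetical helper callables for the two loss terms; it only illustrates how the supervised cross-entropy and the REINFORCE-based bound are combined on the shared pointer network.

```python
def semi_supervised_step(fsc_loss_fn, asc_loss_fn, labelled_batch,
                         unlabelled_batch, optimiser):
    """One update of Eq. 24.  fsc_loss_fn is assumed to return the cross-entropy
    -log p_phi(c|s) on a labelled (s, c) batch (Eq. 22), and asc_loss_fn the
    negated lambda-scaled bound of Eq. 23 (estimated with REINFORCE) on an
    unlabelled batch.  Gradients from both terms reach the shared pointer
    network parameters phi."""
    optimiser.zero_grad()
    loss = fsc_loss_fn(*labelled_batch) + asc_loss_fn(unlabelled_batch)
    loss.backward()
    optimiser.step()
    return loss.item()
```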

5 Related Work

As one of the typical sequence-to-sequence tasks, sentence-level summarisation has been explored by a series of discriminative encoder-decoder neural models. Filippova et al. (2015) carries out extractive summarisation via deletion with LSTMs, while Rush et al. (2015) applies a convolutional encoder and an attentional feed-forward decoder to generate abstractive summaries, which provides the benchmark for the Gigaword dataset. Nallapati et al. (2016) further improves the performance by exploring multiple variants of RNN encoder-decoder models. The recent works of Gulcehre et al. (2016), Ling et al. (2016), Nallapati et al. (2016) and Gu et al. (2016) also apply a similar idea of combining pointer networks and softmax outputs. However, different from all these discriminative models above, we explore generative models for sentence compression. Instead of training the discriminative model on a big labelled dataset, our original intuition for introducing a combined pointer network is to bridge the unsupervised generative model (ASC) and the supervised model (FSC) so that we can utilise a large additional dataset, either labelled or unlabelled, to boost the compression performance. Dai and Le (2015) also explored semi-supervised sequence learning, but in a purely deterministic model focused on learning better vector representations.

Recently, variational auto-encoders have been applied in a variety of fields as deep generative models. In computer vision, Kingma and Welling (2014), Rezende et al. (2014), and Gregor et al. (2015) have demonstrated strong performance on the task of image generation, and Eslami et al. (2016) proposed variable-sized variational auto-encoders to identify multiple objects in images. In natural language processing, there are variants of VAEs for modelling documents (Miao et al., 2016), sentences (Bowman et al., 2015) and the discovery of relations (Marcheggiani and Titov, 2016). Apart from these typical applications of VAEs, there is also a series of works that employs generative models for supervised learning tasks. For instance, Ba et al. (2015) learns visual attention for multiple objects by optimising a variational lower bound, Kingma et al. (2014) implements a semi-supervised framework for image classification, and Miao et al. (2016) applies a conditional variational approximation to the task of factoid question answering. Dyer et al. (2016) proposes a generative model that explicitly extracts syntactic relationships among words and phrases, which further supports the argument that generative models can be a statistically efficient method for learning neural networks from small data.

6 Experiments

6.1 Dataset & Setup

We evaluate the proposed models on the standard Gigaword3 sentence compression dataset. This dataset was generated by pairing the headline of each article with its first sentence to create a source-compression pair. Rush et al. (2015) provided scripts4 to filter out outliers, resulting in roughly 3.8M training pairs, a 400K validation set, and a 400K test set. In the following experiments all models are trained on the training set with different data sizes5 and tested on a 2K subset, which is identical to the test set used by Rush et al. (2015) and Nallapati et al. (2016). We decode the sentences with beam search (k = 5) and test with the full-length Rouge score.

For the ASC and FSC models, we use 256 for the dimension of both hidden units and lookup tables. In the ASC model, we apply a 3-layer bidirectional RNN with skip connections as the encoder, a 3-layer RNN pointer network with skip connections as the compressor, and a 1-layer vanilla RNN with soft attention as the decoder. The language model prior is trained on the article sentences of the full training set using a 3-layer vanilla RNN with 0.5 dropout. To lower the computational cost, we apply different vocabulary sizes for the encoder and compressor (119,506 and 68,897), which corresponds to the settings of Rush et al. (2015). Specifically, the vocabulary of the decoder is filtered by taking the most frequent 10,000 words from the vocabulary of the encoder, with the remaining words tagged as '<unk>'. In further consideration of efficiency, we use only one sample for the gradient estimator. We optimise the model by Adam (Kingma and Ba, 2015) with a 0.0002 learning rate and 64 sentences per batch. The model converges in 5 epochs. Except for the pre-trained language model, we do not use dropout or embedding initialisation for the ASC and FSC models.
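For reference, the reported configuration can be summarised as a plain dictionary; the keys and the optimiser line are our own illustration, and only the values come from the paper.

```python
import torch

# Hedged summary of the reported training setup.
config = {
    "hidden_dim": 256, "embedding_dim": 256,
    "encoder_vocab": 119_506, "compressor_vocab": 68_897,
    "decoder_vocab": 10_000,        # most frequent encoder words, rest -> <unk>
    "lm_dropout": 0.5,
    "beam_size": 5, "batch_size": 64,
    "learning_rate": 2e-4, "samples_per_gradient": 1,
    "lambda_kl": 0.1, "epochs_to_converge": 5,
}

# e.g. optimiser = torch.optim.Adam(model.parameters(), lr=config["learning_rate"])
```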

6.2 Extractive Summarisation

The first set of experiments evaluates the models on extractive summarisation.

3 https://catalog.ldc.upenn.edu/LDC2012T21
4 https://github.com/facebook/NAMAS
5 The hyperparameters were tuned on the validation set to maximise the perplexity of the summaries rather than the reconstructed source sentences.

Model      Labelled  Unlabelled   Recall R-1/R-2/R-L       Precision R-1/R-2/R-L    F-1 R-1/R-2/R-L
FSC        500K      -            30.817/10.861/28.263     22.357/7.998/20.520      23.415/8.156/21.468
ASC+FSC1   500K      500K         29.117/10.643/26.811     28.558/10.575/26.344     26.987/9.741/24.874
ASC+FSC2   500K      3.8M         28.236/10.359/26.218     30.112/11.131/27.896     27.453/9.902/25.452
FSC        1M        -            30.889/11.645/28.257     27.169/10.266/24.916     26.984/10.028/24.711
ASC+FSC1   1M        1M           30.490/11.443/28.097     28.109/10.799/25.943     27.258/10.189/25.148
ASC+FSC2   1M        3.8M         29.034/10.780/26.801     31.037/11.521/28.658     28.336/10.313/26.145
FSC        3.8M      -            30.112/12.436/27.889     34.135/13.813/31.704     30.225/12.258/28.035
ASC+FSC1   3.8M      3.8M         29.946/12.558/27.805     35.538/14.699/32.972     30.568/12.553/28.366

Table 1: Extractive Summarisation Performance. (1) The extractive summaries of these models are decoded by the pointer network (i.e. the shared component of the ASC and FSC models). (2) R-1, R-2 and R-L represent the Rouge-1, Rouge-2 and Rouge-L scores respectively.

Here, we denote the joint models by ASC+FSC1 and ASC+FSC2, where ASC is trained on unlabelled data and FSC is trained on labelled data. The ASC+FSC1 model employs equivalently sized labelled and unlabelled datasets, where the article sentences of the unlabelled data are the same article sentences as in the labelled data, so no additional unlabelled data is applied in this case. The ASC+FSC2 model employs the full unlabelled dataset in addition to the existing labelled dataset, which is the true semi-supervised setting.

Table 1 presents the test Rouge scores on extractive compression. We can see that the ASC+FSC1 model achieves significant improvements in F-1 scores when compared to the supervised FSC model trained only on labelled data. Moreover, fixing the labelled data size, the ASC+FSC2 model achieves better performance by using additional unlabelled data than the ASC+FSC1 model, which means that semi-supervised learning works in this scenario. Interestingly, learning on the unlabelled data largely increases the precision (though the recall does not benefit from it), which leads to significant improvements in the F-1 Rouge scores. And surprisingly, the extractive ASC+FSC1 model trained on the full labelled data outperforms the abstractive NABS (Rush et al., 2015) baseline model (in Table 4).

6.3 Abstractive Summarisation

The second set of experiments evaluates performance on abstractive summarisation (Table 2). Consistently, we see that adding the generative objective to the discriminative model (ASC+FSC1) results in a significant boost on all the Rouge scores, while employing extra unlabelled data increases performance further (ASC+FSC2). This validates the effectiveness of transferring the knowledge learned on unlabelled data to the supervised abstractive summarisation.

In Figure 3, we present the validation perplexity to compare the abilities of the three models to learn the compression languages. ASC+FSC1 (red) employs the same dataset for unlabelled and labelled training, while ASC+FSC2 (black) employs the full unlabelled dataset. Here, the joint ASC+FSC1 model obtains better perplexities than the single discriminative FSC model, but there is not much difference between ASC+FSC1 and ASC+FSC2 when the size of the labelled dataset grows. From the perspective of language modelling, the generative ASC model indeed helps the discriminative model learn to generate good summary sentences. Table 3 displays the validation perplexities of the benchmark models, where the joint ASC+FSC1 model trained on the full labelled and unlabelled datasets performs best at modelling the compression language.

Table 4 compares the test Rouge scores on abstractive summarisation. Encouragingly, the semi-supervised model ASC+FSC2 outperforms the baseline model NABS when trained on 500K supervised pairs, which is only about an eighth of the supervised data. In Nallapati et al. (2016), the authors exploit the full limits of discriminative RNN encoder-decoder models by incorporating a sampled softmax, expanded vocabulary, additional lexical features, and combined pointer networks6, which yields the best performance listed in Table 4. However, when all the data is employed with the mixed objective ASC+FSC1 model, the result is significantly better than this previous state of the art.

6 The idea of the combined pointer networks is similar to the FSC model, but the implementations are slightly different.

Model      Labelled  Unlabelled   Recall R-1/R-2/R-L       Precision R-1/R-2/R-L    F-1 R-1/R-2/R-L
FSC        500K      -            27.147/10.039/25.197     33.781/13.019/31.288     29.074/10.842/26.955
ASC+FSC1   500K      500K         27.067/10.717/25.239     33.893/13.678/31.585     29.027/11.461/27.072
ASC+FSC2   500K      3.8M         27.662/11.102/25.703     35.756/14.537/33.212     30.140/12.051/27.99
FSC        1M        -            28.521/11.308/26.478     33.132/13.422/30.741     29.580/11.807/27.439
ASC+FSC1   1M        1M           28.333/11.814/26.367     35.860/15.243/33.306     30.569/12.743/28.431
ASC+FSC2   1M        3.8M         29.017/12.007/27.067     36.128/14.988/33.626     31.089/12.785/28.967
FSC        3.8M      -            31.148/13.553/28.954     36.917/16.127/34.405     32.327/14.000/30.087
ASC+FSC1   3.8M      3.8M         32.385/15.155/30.246     39.224/18.382/36.662     34.156/15.935/31.915

Table 2: Abstractive Summarisation Performance. The abstractive summaries of these models are decoded by the combined pointer network (i.e. the shared pointer network together with the softmax output layer over the full vocabulary).

Model                                        Labelled Data   Perplexity
Bag-of-Word (BoW) (Rush et al., 2015)        3.8M            43.6
Convolutional (TDNN) (Rush et al., 2015)     3.8M            35.9
Attention-Based (NABS) (Rush et al., 2015)   3.8M            27.1
Forced-Attention (FSC)                       3.8M            18.6
Auto-encoding (ASC+FSC1)                     3.8M            16.6

Table 3: Comparison of validation perplexity. BoW, TDNN and NABS are the baseline neural compression models with different encoders in Rush et al. (2015).

Model                      Labelled Data   R-1     R-2     R-L
(Rush et al., 2015)        3.8M            29.78   11.89   26.97
(Nallapati et al., 2016)   3.8M            33.17   16.02   30.98
ASC+FSC2                   500K            30.14   12.05   27.99
ASC+FSC2                   1M              31.09   12.79   28.97
ASC+FSC1                   3.8M            34.17   15.94   31.92

Table 4: Comparison of test Rouge scores.

As the semi-supervised ASC+FSC2 model can be trained on unlimited unlabelled data, there is still significant space left for further performance improvements.

Table 5 presents examples of the compression sentences decoded by the joint model ASC+FSC1 and by the FSC model trained on the full dataset.

7 Discussion

From the perspective of generative models, a significant contribution of our work is a process for reducing variance in discrete sampling-based variational inference. The first step is to introduce two baselines in the control variates method, due to the fact that the reparameterisation trick is not applicable for discrete latent variables.

[Figure 3 plot: validation perplexity (y-axis) against labelled data size (x-axis, 0 to 3.8M) for the FSC, ASC+FSC1 and ASC+FSC2 models.]

Figure 3: Perplexity on validation dataset.

However, it is the second step of using a pointer network as the biased estimator that makes the key contribution. This results in a much smaller state space, bounded by the length of the source sentence (mostly between 20 and 50 tokens), compared to the full vocabulary. The final step is to apply the FSC model to transfer the knowledge learned from the supervised data to the pointer network. This further reduces the sampling variance by acting as a sort of bootstrap or constraint on the unsupervised latent space, which could encode almost anything but which thus becomes biased towards matching the supervised distribution. By using these variance reduction methods, the ASC model is able to carry out effective variational inference for the latent language model so that it learns to summarise the sentences from the large unlabelled training data.

In a different vein, according to the reinforcement learning interpretation of sequence-level training (Ranzato et al., 2016), the compression model of the ASC model acts as an agent which iteratively generates words (takes actions) to compose the compression sentence, and the reconstruction model acts as the reward function evaluating the quality of the compressed sentence, which is provided as a reward signal. Ranzato et al. (2016) presents a thorough empirical evaluation on three different NLP tasks by using an additional sequence-level reward (BLEU and Rouge-2) to train the models. In the context of this paper, we apply a variational lower bound (mixed reconstruction error and KL divergence regularisation) instead of the explicit Rouge score. Thus the ASC model is granted the ability to explore unlimited unlabelled data resources. In addition, we introduce a supervised FSC model to teach the compression model to generate stable sequences instead of starting with a random policy. In this case, the pointer network that bridges the supervised and unsupervised models is trained by a mixed criterion of REINFORCE and cross-entropy in an incremental learning framework. Eventually, according to the experimental results, the joint ASC and FSC model is able to learn a robust compression model by exploring both labelled and unlabelled data, which outperforms the other single discriminative compression models that are only trained by a cross-entropy reward signal.

8 Conclusion

In this paper we have introduced a generative model for jointly modelling pairs of sequences and evaluated its efficacy on the task of sentence compression. The variational auto-encoding framework provided an effective inference algorithm for this approach and also allowed us to explore combinations of discriminative (FSC) and generative (ASC) compression models. The evaluation results show that supervised training of the combination of these models improves upon the state-of-the-art performance for the Gigaword compression dataset. When we train the supervised FSC model on a small amount of labelled data and the unsupervised ASC model on a large set of unlabelled data, the combined model is able to outperform previously reported benchmarks trained on a great deal more supervised data. These results demonstrate that we are able to model language as a discrete latent variable in a variational auto-encoding framework, and that the resultant generative model is able to effectively exploit both supervised and unsupervised data in sequence-to-sequence tasks.

src   the sri lankan government on wednesday announced the closure of government schools with immediate effect as a military campaign against tamil separatists escalated in the north of the country .
ref   sri lanka closes schools as war escalates
asca  sri lanka closes government schools
asce  sri lankan government closure schools escalated
fsca  sri lankan government closure with tamil rebels closure

src   factory orders for manufactured goods rose #.# percent in september , the commerce department said here thursday .
ref   us september factory orders up #.# percent
asca  us factory orders up #.# percent in september
asce  factory orders rose #.# percent in september
fsca  factory orders #.# percent in september

src   hong kong signed a breakthrough air services agreement with the united states on friday that will allow us airlines to carry freight to asian destinations via the territory .
ref   hong kong us sign breakthrough aviation pact
asca  us hong kong sign air services agreement
asce  hong kong signed air services agreement with united states
fsca  hong kong signed air services pact with united states

src   a swedish un soldier in bosnia was shot and killed by a stray bullet on tuesday in an incident authorities are calling an accident , military officials in stockholm said tuesday .
ref   swedish un soldier in bosnia killed by stray bullet
asca  swedish un soldier killed in bosnia
asce  swedish un soldier shot and killed
fsca  swedish soldier shot and killed in bosnia

src   tea scores on the fourth day of the second test between australia and pakistan here monday .
ref   australia vs pakistan tea scorecard
asca  australia v pakistan tea scores
asce  australia tea scores
fsca  tea scores on #th day of #nd test

src   india won the toss and chose to bat on the opening day in the opening test against west indies at the antigua recreation ground on friday .
ref   india win toss and elect to bat in first test
asca  india win toss and bat against west indies
asce  india won toss on opening day against west indies
fsca  india chose to bat on opening day against west indies

src   a powerful bomb exploded outside a navy base near the sri lankan capital colombo tuesday , seriously wounding at least one person , military officials said .
ref   bomb attack outside srilanka navy base
asca  bomb explodes outside sri lanka navy base
asce  bomb outside sri lankan navy base wounding one
fsca  bomb exploded outside sri lankan navy base

src   press freedom in algeria remains at risk despite the release on wednesday of prominent newspaper editor mohamed <unk> after a two-year prison sentence , human rights organizations said .
ref   algerian press freedom at risk despite editor ’s release <unk> picture
asca  algeria press freedom remains at risk
asce  algeria press freedom remains at risk
fsca  press freedom in algeria at risk

Table 5: Examples of the compression sentences. src and ref are the source and reference sentences provided in the test set. asca and asce are the abstractive and extractive compression sentences decoded by the joint model ASC+FSC1, and fsca denotes the abstractive compression obtained by the FSC model.

References

[Ba et al.2015] Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. 2015. Multiple object recognition with visual attention. In Proceedings of ICLR.

[Bahdanau et al.2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.

[Bowman et al.2015] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. 2015. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349.

[Chorowski et al.2015] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. 2015. Attention-based models for speech recognition. In Proceedings of NIPS, pages 577–585.

[Dai and Le2015] Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Proceedings of NIPS, pages 3061–3069.

[Dyer et al.2016] Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A Smith. 2016. Recurrent neural network grammars. In Proceedings of NAACL.

[Eslami et al.2016] SM Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, Koray Kavukcuoglu, and Geoffrey E Hinton. 2016. Attend, infer, repeat: Fast scene understanding with generative models. arXiv preprint arXiv:1603.08575.

[Filippova et al.2015] Katja Filippova, Enrique Alfonseca, Carlos A Colmenares, Lukasz Kaiser, and Oriol Vinyals. 2015. Sentence compression by deletion with LSTMs. In Proceedings of EMNLP, pages 360–368.

[Gregor et al.2015] Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. 2015. DRAW: A recurrent neural network for image generation. In Proceedings of ICML.

[Gu et al.2016] Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393.

[Gulcehre et al.2016] Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. arXiv preprint arXiv:1603.08148.

[Hinton and Salakhutdinov2006] Geoffrey E Hinton and Ruslan R Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

[Kalchbrenner and Blunsom2013] Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of EMNLP.

[Kingma and Ba2015] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of ICLR.

[Kingma and Welling2014] Diederik P Kingma and Max Welling. 2014. Auto-encoding variational bayes. In Proceedings of ICLR.

[Kingma et al.2014] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. 2014. Semi-supervised learning with deep generative models. In Proceedings of NIPS.

[Ling et al.2016] Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomas Kocisky, Andrew Senior, Fumin Wang, and Phil Blunsom. 2016. Latent predictor networks for code generation.

[Marcheggiani and Titov2016] Diego Marcheggiani and Ivan Titov. 2016. Discrete-state variational autoencoders for joint discovery and factorization of relations. Transactions of the Association for Computational Linguistics, 4.

[Miao et al.2016] Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural variational inference for text processing. In Proceedings of ICML.

[Mnih and Gregor2014] Andriy Mnih and Karol Gregor. 2014. Neural variational inference and learning in belief networks. In Proceedings of ICML.

[Mnih et al.2014] Volodymyr Mnih, Nicolas Heess, and Alex Graves. 2014. Recurrent models of visual attention. In Proceedings of NIPS.

[Nallapati et al.2016] Ramesh Nallapati, Bowen Zhou, Çaglar Gulçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. arXiv preprint arXiv:1602.06023.

[Ranzato et al.2016] Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks.

[Rezende et al.2014] Danilo J Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of ICML.

[Rumelhart et al.1985] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. 1985. Learning internal representations by error propagation. Technical report, DTIC Document.

[Rush et al.2015] Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of EMNLP.

[Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of NIPS.

[Vinyals et al.2015] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Proceedings of NIPS, pages 2674–2682.

[Xu et al.2015] Kelvin Xu, Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention.