arXiv:2106.00104v1 [cs.CL] 31 May 2021


Text Summarization with Latent Queries

Yumo Xu and Mirella Lapata
Institute for Language, Cognition and Computation

School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB

[email protected] [email protected]

Abstract

The availability of large-scale datasets has driven the development of neural models that create summaries from single documents, for generic purposes. When using a summarization system, users often have specific intents with various language realizations, which, depending on the information need, can range from a single keyword to a long narrative composed of multiple questions. Existing summarization systems, however, often either fail to support or act robustly on this query focused summarization task. We introduce LAQSUM, the first unified text summarization system that learns Latent Queries from documents for abstractive summarization with any existing query forms. Under a deep generative framework, our system jointly optimizes a latent query model and a conditional language model, allowing users to plug-and-play queries of any type at test time. Despite learning from only generic summarization data and requiring no further optimization for downstream summarization tasks, our system robustly outperforms strong comparison systems across summarization benchmarks with different query types, document settings, and target domains.

1 Introduction

The neural encoder-decoder framework has become increasingly popular in generic summarization (See et al. 2017; Gehrmann et al. 2018; Liu and Lapata 2019a, inter alia) thanks to the availability of large-scale datasets containing hundreds of thousands of document-summary pairs. Query focused summarization (QFS; Dang 2005), which aims to create a short summary from one or multiple document(s) that answers a specific query, in comparison, does not have training data of this magnitude; existing QFS corpora (Dang, 2005; Hoa, 2006; Nema et al., 2017; Baumel et al., 2016) are relatively small for training large neural architectures and have been mostly used for evaluation.

Therefore, how to leverage generic summarization data for the benefit of QFS has become a research topic of interest recently (Xu and Lapata, 2020a; Laskar et al., 2020a). It, however, remains a challenging research question due to the absence of queries in generic summarization data.

Early attempts in QFS sidestep this problem by seeking distant supervision from query-relevant NLP tasks (Xu and Lapata, 2020b; Su et al., 2020; Laskar et al., 2020b) and exploiting existing resources, including datasets and pretrained models in question answering (Rajpurkar et al., 2016; Chakraborty et al., 2020) and paraphrase identification (Dolan and Brockett, 2005). Since these resources can also be extremely expensive to acquire (Bajaj et al., 2016), recent work proposes to induce proxy queries from generic summaries (Xu and Lapata, 2020a), enabling an abstractive QFS system to be trained with only query-free resources.

In this work, we note that queries can be realized in various language forms, including but not limited to one or multiple keyword(s) (Baumel et al., 2016; Zhu et al., 2019), a natural question (Nema et al., 2017) and a composition of multiple sub-queries (Dang, 2006). These diversified query languages inevitably lead to a mismatch between the actual queries a system has to take as inputs at test time, and the queries pre-collected or generated from distant supervision for training purposes. Moreover, the scalability of these approaches is constrained in terms of handling different query types. For instance, the performance of an abstractive QFS system trained on proxy queries can be sensitive to the format of the target queries at test time which the proxy ones were created to mimic (Xu and Lapata, 2020a). To make a summarization system work well with every new query type, one may have to re-design the proxy algorithm, re-create the proxy queries, and re-train one or more system modules, which is computationally inefficient and sometimes practically infeasible.


In this work, we aim at building an abstractive text summarization system that is robust across observed and latent query settings. In particular, we treat generic summarization data as a special case of QFS where the query is unspecified and empty, and assume it to be the only accessible resource for both model training and development. To build a generalizable summarization system, we propose to model queries as discrete latent variables from documents, and build a latent query representation that is compatible with different query language realizations as inputs. Specifically, we formulate an abstractive summarization task as a generative process, and decompose the objective into two components: (1) latent query modeling (i.e., generating latent query variables from a document observation) and (2) conditional language modeling (i.e., generating an abstractive summary conditioned on the observed document and latent queries). To further enable user-specified query inputs of different formats at test time, we provide a non-parametric calibration to the latent query distribution; it can be plugged into the latent variable model without re-training, and therefore enables zero-shot QFS.

Our contributions in this work are threefold: we propose the first text summarization system that unifies abstractive generic summarization and QFS; in particular, no query-related resource is required for model training or development. We provide a general, deep generative formulation for text summarization, under which we validate the effectiveness of representing queries directly from input documents in the latent space, i.e., without resorting to pipeline-style query extraction or generation. Finally, we provide experimental results on a wide spectrum of summarization benchmarks and show that across query types, document settings, and target domains, our system achieves better results than strong comparison systems.

2 Related Work

2.1 Generic Abstractive Summarization

Rush et al. (2015) and Nallapati et al. (2016) are among the first to apply the neural encoder-decoder architecture to abstractive summarization. See et al. (2017) enhance the system with a pointer-generator model where a copy mechanism is proposed to allow words in the source document to be copied directly in summary generation. In a bottom-up fashion, Gehrmann et al. (2018) propose to train a word-level tagging model separately; at test time, it produces content selection probabilities for each word, which are then used to restrict the copy mechanism by performing hard masking over the input document. Inspired by Gehrmann et al. (2018), we also propose to train a tagging model for abstractive summarization. In this work, a tagger serves as a parameterized component of an inference model for latent query variables and is jointly optimized via a generic summarization task and a weakly supervised tagging task.

Another line of research in abstractive summarization proposes to control text generation via topics (Perez-Beltrachini et al., 2019; Wang et al., 2020), retrieved summaries (Cao et al., 2018), or factual relations (Zhu et al., 2021). Notably, Dou et al. (2020) propose a controllable system for abstractive summarization which achieves state-of-the-art performance by taking sentences extracted by another state-of-the-art extractive system (Zhong et al., 2020) as guidance at test time. Despite the conceptual similarity between guidance and queries, we note that one fundamental difference between guided summarization and query focused summarization lies in their task objectives: guidance is created for improving generic summarization on aspects such as informativeness (Cao et al., 2018) and faithfulness (Zhu et al., 2021), while QFS handles user intents of various forms in a low-resource setting, where the lack of training data for high-quality guidance creation renders a guided system not directly applicable.

2.2 Query Focused Summarization

Extractive QFS, the dominant approach in QFS research, composes summaries by selecting central and query-relevant sentences in documents, based on different ways of estimating and incorporating centrality and relevance (Wan et al., 2007; Badrinath et al., 2011; Wan and Zhang, 2014; Li et al., 2017b,a). More recently, Xu and Lapata (2020b) propose a coarse-to-fine framework that leverages distant supervision from question answering for summary sentence extraction.

Abstractive QFS, compared to either its generic or extractive counterpart, has received significantly less attention, due to generation models being particularly data-hungry (Lebanoff et al., 2018; Liu and Lapata, 2019a). To alleviate the scarcity of QFS data, resources from a wider range of NLP tasks have been investigated for generating query focused abstracts. Su et al. (2020) rank document paragraphs against queries with a plethora of QA and machine reading datasets (Su et al., 2019; Rajpurkar et al., 2016), then summarize selected paragraphs iteratively. Similarly, Laskar et al. (2020b) jointly exploit supervision from QFS data (typically reserved for evaluation) and related QA and paraphrase identification tasks.

Figure 1: Proposed text summarization framework. Dashed lines denote non-parametric query observation modeling at testing, which is optional. Shadow denotes components for latent query modeling, which infers discrete latent variables as queries. Latent queries create a query focused view of the input document, which, together with a query-agnostic view, is input to a conditional language model for summary generation.

Since external query-related resources such as QA datasets can also be costly to obtain (Bajaj et al., 2016; Kwiatkowski et al., 2019), Xu and Lapata (2020a) assume access to only query-free resources for abstractive QFS training. They discover a type of connection between queries and summaries, and, alternatively, create proxy queries from generic summaries. These proxy queries are created by selectively masking information slots in summaries. To minimize covariate shift, they are created to be as close to target queries as possible, on aspects such as content and length. Despite the promising system performance and relaxed training resource assumption, this approach still assumes prior knowledge of the target query type at testing, and therefore relies on a development set for proxy query generation and model selection (Xu and Lapata, 2020a). Also, the system is particularly tailored for multi-document QFS with an evidence selection component.

In this work, we aim at a zero-shot transfer setting for QFS tasks, i.e., we do not assume the availability of even one small QFS dataset for development purposes, as it is challenging to obtain a development set for every query language form. We present a system that performs well on both single- and multi-document QFS across both observed and latent query settings.

3 Problem Formulation

Let {(D, Q, S)} denote a summarization dataset, where D is a document token sequence with corresponding summary S, and query Q additionally specifies an information request. We have Q = ∅ in generic summarization, while in QFS, depending on the test set, Q can be in various formats, from keywords to composite questions (see Table 2 for examples of different query types).

In this work, we aim at building a summarization system that learns only from generic summarization data, while robustly generalizing to a range of summarization tasks at test time, including in-domain generic summarization and out-of-domain QFS tasks. One common characteristic of generic summarization and QFS is the under-specified user intent: even in QFS, despite Q ≠ ∅, a query usually presents incomplete guidance signals and does not fully represent a user's intent, as a query usually seeks information, so it is unlikely for a user to specify every aspect of what is needed to compose a good summary (Xu and Lapata, 2020a). Therefore, for both generic summarization and QFS, it is favorable that a system can identify latent query signals from D, to which Q can, optionally, serve as an additional observation for belief update.

Generative Model  We model the observed input document D as a sequence of random variables x = [x_1; x_2; ...; x_M], where x_i is a token and M is the document length. We define a latent query as a sequence of discrete latent states over the input document tokens: z = [z_1; z_2; ...; z_M]. Specifically, from each document token x_i, we generate a binary query variable z_i, whose distribution p(z_i) indicates the belief that x_i contributes to a potential query for the document D. The output variable y = [y_1; y_2; ...; y_T] is then generated from {x, z} using teacher-forcing at training time. Note that at test time, one may have access to additional query knowledge Q; we also ground this optional query information to the input document as discrete observed variables z̄ = [z̄_1; z̄_2; ...; z̄_M], and generate y by additionally conditioning on z̄ (if it exists) in an autoregressive manner.

We consider modeling the conditional distribution p_θ(y|x) and write down its factorization according to the generative process described above:

$$p_\theta(\mathbf{y}|\mathbf{x}) = \sum_{\mathbf{z}} p_\theta(\mathbf{y}|\mathbf{z},\mathbf{x})\, p_\theta(\mathbf{z}|\mathbf{x}) = \sum_{\mathbf{z}} p_\theta(\mathbf{y}|\mathbf{z},\mathbf{x}) \prod_i p_\theta(z_i|x_i) \quad (1)$$

Inference Model  The posterior distribution of the latent variable z is calculated as:

$$p_\theta(\mathbf{z}|\mathbf{x},\mathbf{y}) = \frac{p_\theta(\mathbf{x},\mathbf{y},\mathbf{z})}{p_\theta(\mathbf{x},\mathbf{y})} = \frac{p_\theta(\mathbf{x},\mathbf{y},\mathbf{z})}{\sum_{\mathbf{z}} p_\theta(\mathbf{x},\mathbf{y},\mathbf{z})}. \quad (2)$$

However, exact inference of this posterior is computationally intractable due to the joint probability p_θ(x, y). We, therefore, opt for a variational posterior q_φ(z|x, y) to approximate it. Inspired by β-VAE (Higgins et al., 2017), we maximize the probability of generating text y, while keeping the distance between the prior and variational posterior distributions under a small constant δ:

$$\max_{\phi,\theta}\; \mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\mathcal{D}}\big[\mathbb{E}_{\mathbf{z}\sim q_\phi(\mathbf{z}|\mathbf{x},\mathbf{y})} \log p_\theta(\mathbf{y}|\mathbf{x},\mathbf{z})\big] \quad (3)$$

$$\text{subject to}\quad D_{\mathrm{KL}}\big(q_\phi(\mathbf{z}|\mathbf{x},\mathbf{y})\,\|\,p_\theta(\mathbf{z}|\mathbf{x})\big) < \delta \quad (4)$$

Since we cannot solve Equation (4) directly, we apply the Karush–Kuhn–Tucker conditions (KKT; Kuhn et al. 1951) and cast the constrained optimization problem into unconstrained optimization, with the following form of ELBO objective:

$$\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x},\mathbf{y})}[\log p_\theta(\mathbf{y}|\mathbf{x},\mathbf{z})] - \beta\, D_{\mathrm{KL}}\big(q_\phi(\mathbf{z}|\mathbf{x},\mathbf{y})\,\|\,p_\theta(\mathbf{z}|\mathbf{x})\big) \quad (5)$$

In this work, to minimize our system's dependence on testing data, we adopt a uniform prior p_θ(z|x), based on the hypothesis that given all instances of x in data (which may come from different domains), the aggregated probability of variable z, i.e., whether a word is a query word under various contexts, follows a uniform distribution. In this case, minimizing the KL term is equivalent to maximizing the entropy of the variational posterior.¹

For posterior approximation, we further assume z ⊥⊥ y and therefore q_φ(z|x, y) = q_φ(z|x). While this reduces the risk of exposure bias from y during learning, it is more challenging to learn meaningful latent variables as they solely condition on x. We mitigate this issue by introducing a new type of weak supervision o(z|x, y) that we are able to automatically extract from data. This weak supervision, which is in the form of sequence tagging and will be discussed in Section 4, leads to an extra objective term for posterior regularization. Formally, we rewrite the final optimization objective as:

$$\mathcal{L} = \underbrace{\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{y}|\mathbf{x},\mathbf{z})]}_{\text{conditional language modeling}} + \underbrace{\beta\,\mathrm{H}\big(q_\phi(\mathbf{z}|\mathbf{x})\big) - \omega\,\mathrm{H}\big(o(\mathbf{z}|\mathbf{x},\mathbf{y}),\, q_\phi(\mathbf{z}|\mathbf{x})\big)}_{\text{latent query modeling}} \quad (6)$$

where H(·) denotes posterior entropy and H(·, ·) denotes cross entropy for posterior regularization from weak supervision. In particular, we decompose a summarization task into two modeling objectives, latent query modeling and conditional language modeling. Inside query modeling, hyper-parameter ω controls the influence from the weak supervision z, while β controls the strength of a form of label smoothing (see Section 4 for details).

Neural Parameterization  We show our framework in Figure 1. We parameterize the two modeling objectives in Equation (6) with a latent query model and a conditional language model, respectively. The query model is optimized to estimate the latent query z from its input variable x. At inference time, it optionally conditions on prior query knowledge z̄ (when available in QFS data). The conditional language model, on the other hand, augments the standard encoder-decoder structure to take as input two encoding sources: a query-agnostic view and a query focused view. We call them two views as they are separate encodings of the same input D, with the only difference lying in the direct dependence on z generated from the query model. In contrast to using only the query-focused view, maintaining two views allows the original document context to be retained as a complementary feature space. Finally, a decoder is used

¹When p_θ(z|x) ∼ U(a, b), D_KL(q_φ(z|x, y) || p_θ(z|x)) = −H(q_φ(z|x)) + log(b − a + 1) always holds.
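The identity in footnote 1 follows from expanding the KL divergence against a uniform prior over K = b − a + 1 discrete states:

```latex
D_{\mathrm{KL}}\big(q_\phi(\mathbf{z}|\mathbf{x},\mathbf{y})\,\|\,p_\theta(\mathbf{z}|\mathbf{x})\big)
  = \sum_{\mathbf{z}} q_\phi \log \frac{q_\phi}{1/K}
  = \underbrace{\sum_{\mathbf{z}} q_\phi \log q_\phi}_{-\,\mathrm{H}(q_\phi)} + \log K
  = -\mathrm{H}\big(q_\phi(\mathbf{z}|\mathbf{x},\mathbf{y})\big) + \log(b - a + 1).
```

Since the log K term is constant in φ, minimizing the KL indeed reduces to maximizing the posterior entropy.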

Type I
  Summary: Real Madrid slump to defeat against Athletic Bilbao. Solitary goal from Aritz Aduriz enough to give the Basques victory. Bayern Munich continue Bundesliga domination.
  Document: ..., with a convincing 4-1 win over Lens at the Parc de Princes. ...
  Annotation: ..., with#0 a#0 convincing#0 4-1#0 win#0 over#0 Lens#0 at#0 the#0 Parc#0 de#1 Princes.#0 ...

Type II
  Summary: A man in suburban Boston is selling snow online to customers in warmer states. For $89, he will ship 6 pounds of snow in an insulated Styrofoam box.
  Document: ... For $89, self-styled entrepreneur Kyle Waring will ship you 6 pounds of Boston-area snow in an insulated Styrofoam box ...
  Annotation: ... For#1 $89,#1 self-styled#0 entrepreneur#0 Kyle#0 Waring#0 will#1 ship#1 you#0 6#1 pounds#1 of#1 Boston-area#0 snow#1 in#1 an#1 insulated#1 Styrofoam#1 box#1 ...

Table 1: Error analysis for the longest common sub-sequence tagging scheme (LCS; Gehrmann et al. 2018), with instances from the CNN/DM validation set. Document tokens and binary annotations are separated with #. Bold fonts denote erroneous annotations and their corresponding tokens in summaries/documents.

to generate summary y auto-regressively. Different from the previous abstractive QFS formulation (Xu and Lapata, 2020a), we jointly train these two models in a fully differentiable summarization system.

4 Latent Query Model

In this section we detail the neural parameterization for latent query modeling in Equation (6), including the architecture for query inference and document representation in a query focused view. We also introduce learning from weak supervision o(z|x, y) and testing with belief update.

Inference Network for Latent Queries  We construct a neural network model to infer the query belief for each token in the input document. Given a contextual token representation matrix H_q ∈ R^{M×d_h}, we project it to R^{M×2} with a two-layer MLP as a scoring function:

$$\mathbf{H}_s = \mathrm{ReLU}(\mathbf{H}_q \mathbf{W}_h + \mathbf{b}_h^\top) \quad (7)$$

$$\boldsymbol{\pi} = \mathbf{H}_s \mathbf{W}_s + \mathbf{b}_s^\top \quad (8)$$

where W_h ∈ R^{d_h×d_h}, b_h ∈ R^{d_h×1}, W_s ∈ R^{d_h×2} and b_s ∈ R^{2×1} are learnable model parameters.

Let G(0) denote the standard Gumbel distribution, and let g_ℓ ∼ G(0), ℓ ∈ {0, 1}, be i.i.d. Gumbel noise. We normalize π to form a variational distribution as:

$$q_\phi(z_i = \ell\,|\,\mathbf{x}) = \mathrm{softmax}_\ell\big([\pi_0 + g_0,\; \pi_1 + g_1]\big) = \frac{\exp((\pi_\ell + g_\ell)/\tau)}{\sum_{\ell' \in \{0,1\}} \exp((\pi_{\ell'} + g_{\ell'})/\tau)} \quad (9)$$

where τ is the temperature controlling how close q_φ(z|x) is to argmax_ℓ q_φ(z|x), and is optimized on the development set. Note that Gumbel noise is only applied during learning and is set to its mode, i.e., 0, for inference.
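Equation (9) amounts to a two-class Gumbel-softmax relaxation. A minimal pure-Python sketch is shown below; the function name and scalar interface are illustrative, not from the paper:

```python
import math
import random

def gumbel_softmax_binary(pi0, pi1, tau=1.0, training=True, rng=random):
    """Relaxed binary query belief q_phi(z_i | x) as in Eq. (9).

    pi0, pi1: unnormalized scores for z_i = 0 and z_i = 1.
    At test time the Gumbel noise is set to its mode, 0.
    Returns (q(z_i = 0 | x), q(z_i = 1 | x)).
    """
    if training:
        # i.i.d. standard Gumbel noise: g = -log(-log(u)), u ~ Uniform(0, 1)
        u0 = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        u1 = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        g0, g1 = -math.log(-math.log(u0)), -math.log(-math.log(u1))
    else:
        g0 = g1 = 0.0
    a0, a1 = (pi0 + g0) / tau, (pi1 + g1) / tau
    m = max(a0, a1)  # subtract max for numerical stability
    e0, e1 = math.exp(a0 - m), math.exp(a1 - m)
    q1 = e1 / (e0 + e1)
    return 1.0 - q1, q1
```

Lowering τ sharpens the distribution toward the argmax; setting `training=False` reproduces the noise-free inference-time behavior described above.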

Query Focused View  In addition to the query-agnostic document representation, a direct encoding of the input document D adopted by a plethora of summarization models (which we leave to Section 5), we further introduce a query focused view, an encoding factorized via latent queries z.

Specifically, for the i-th token, we take the continuous relaxation of its discrete latent variable z_i, and ground the query to the input document in the representation space via:²

$$\mathbf{Q}_i = q_\phi(z_i = 1|\mathbf{x}) \cdot \mathbf{H}_{q,i}. \quad (10)$$

As we can see, the query focused view explicitly models the dependency on latent queries. From the learning perspective, this factorization leads to the following partial derivatives of the query focused view with respect to the query encoder states:

$$\frac{\partial \mathbf{Q}_i}{\partial \mathbf{H}_{q,i}} = \underbrace{\big(1 - q_\phi^{(1)}\big)}_{\text{carry gate}} \cdot \frac{\partial \Delta\pi}{\partial \mathbf{H}_{q,i}} \odot \mathbf{Q}_i + \underbrace{q_\phi^{(1)}}_{\text{transform gate}} \cdot \mathbf{1} \quad (11)$$

where q_φ^{(ℓ)} is shorthand for the variational probability of z_i = ℓ|x, and Δπ = π_1 − π_0 (see Equation (8)). 1 denotes an all-one vector. This can be seen as a special case of highway networks (Srivastava et al., 2015) with a zero-map transform function, where the transform gate q_φ^{(1)} controls the information compression rate.

²We also experimented with drawing hard samples of z via the straight-through trick (Jang et al., 2016), which is differentiable with biased gradient estimation. However, it did not yield better results than continuous relaxation.

Dataset | Task | Domain | Size | D/Q/S Tokens | Query Type | Query Example
CNN/DM | SDS | News | 11,490 | 760.5/0.0/45.7 | Empty | ∅
WikiRef | SDS | Wiki | 12,000 | 398.7/6.7/36.2 | Keywords | Marina Beach, Incidents
Debatepedia | SDS | Debate | 1,000 | 66.4/10.0/11.3 | Question | Is euthanasia better than withdrawing life support?
DUC 2006 | MDS | Cross | 1,250 (50) | 699.3/32.8/250 | Composite | Amnesty International - What is the scope of operations of Amnesty International and what are the international reactions to its activities?
DUC 2007 | MDS | Cross | 1,125 (45) | 540.3/30.5/250 | Composite | —
TD-QFS | MDS | Medical | 7,099 (50) | 182.9/3.0/250 | Title | Alzheimer's Disease

Table 2: Test data statistics. SDS and MDS stand for single- and multi-document summarization, respectively. Size refers to the number of documents for single-document test sets; for MDS, we additionally specify the number of clusters in brackets. In the composite query example, red and blue fonts denote its title and narrative, respectively.

Sequence Tagging as Weak Supervision  Since our system is fully differentiable, it is possible to optimize latent queries solely based on conditional language modeling. In this work, we additionally propose to exploit sequence tagging as weak supervision. This can be advantageous since it imposes extra regularization via posterior constraints to prevent posterior collapse, in which case the decoder may learn to ignore the query focused view and instead rely solely on the query-agnostic view.

A challenge with applying sequence tagging to summarization is the absence of gold query annotation in training data. Inspired by Gehrmann et al. (2018), we align summary and document by searching for their longest common sub-sequences (LCS). Nevertheless, we note that there exist a few drawbacks hindering its direct application as weak supervision. Primarily, LCS treats a document as a word sequence and annotates it at word level, while our tagging model (built on top of a pretrained language model) operates on subwords. Apart from this incompatibility of granularity, LCS can lead to false-negative word labels (Type II errors), and it is also sensitive to false-positive cases (Type I errors) since a summary is seen as a character sequence in LCS. We provide examples of these two error types in Table 1. We propose a simple but effective solution to fix the abovementioned issues: we first byte-pair encode (BPE; Sennrich et al. 2016) documents and summaries, and then search for the LCS over the paired document-summary BPE sequences. Compared to the reverse order, doing BPE as the first step allows finer-grained unit matching (which reduces Type II errors), while still retaining basic semantics (which reduces Type I errors). We annotate BPEs in the extracted LCS as 1 and the rest as 0. Note that if there exist multiple identical LCS, only the one appearing at the earliest document position is tagged as positive. We refer to this query tagging scheme as BPE-LCS.
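The BPE-LCS annotation can be sketched as follows. This is a simplified stand-in: it assumes the inputs are already BPE-encoded (e.g., by the model's tokenizer), uses a canonical dynamic-programming LCS, and does not reproduce the paper's earliest-position tie-break among multiple identical LCSs:

```python
def lcs_tags(doc, summ):
    """Weak-supervision sketch: tag each doc unit 1 if it lies on a
    longest common subsequence with the summary, else 0.

    doc, summ: lists of subword units (assumed already BPE-encoded).
    """
    m, n = len(doc), len(summ)
    # dp[i][j] = LCS length of the suffixes doc[i:] and summ[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if doc[i] == summ[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Greedy forward walk recovers one optimal alignment and tags it.
    tags = [0] * m
    i = j = 0
    while i < m and j < n:
        if doc[i] == summ[j] and dp[i][j] == dp[i + 1][j + 1] + 1:
            tags[i] = 1
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return tags
```

The returned 0/1 sequence plays the role of the weak annotation z used in the tagging loss below.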

Training  We use a cross entropy loss for sequence tagging, together with the posterior entropy term in Equation (6):

$$\mathcal{L}_{\mathrm{query}} = -\omega \mathcal{L}_{\mathrm{tag}} + \beta \mathcal{L}_{\mathrm{entropy}} = -\sum_{j=1}^{N}\sum_{i=1}^{M} \Big( \big(\omega z_i^{\,j} - \beta q_\phi^{(1)}\big) \log q_\phi^{(1)} + \big(\omega(1 - z_i^{\,j}) - \beta q_\phi^{(0)}\big) \log q_\phi^{(0)} \Big) \quad (12)$$

where z_i is a binary annotation automatically assigned via BPE-LCS(D, S). As we can see, the entropy term smooths the weak annotation z_i, with a dynamic smoothing strength dependent on q_φ. We optimize ω, β on the development set.
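Expanding Equation (12) for a single document gives a per-token weighted cross-entropy with dynamic smoothing. A pure-Python sketch (the interface is illustrative):

```python
import math

def query_loss(z, q1, omega=1.0, beta=0.1, eps=1e-12):
    """Single-document expansion of Eq. (12): tagging cross-entropy
    weighted by omega, smoothed by the beta-weighted entropy term.

    z:  binary weak labels from BPE-LCS, one per token.
    q1: posterior probabilities q_phi(z_i = 1 | x), one per token.
    """
    total = 0.0
    for zi, q in zip(z, q1):
        q0 = 1.0 - q
        total -= (omega * zi - beta * q) * math.log(q + eps)
        total -= (omega * (1 - zi) - beta * q0) * math.log(q0 + eps)
    return total
```

With β = 0 this reduces to the plain tagging cross-entropy; with ω = 0 it reduces to the negative posterior entropy term alone.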

We notice that at the initial training phase, the under-optimized tagger produces an inaccurate posterior q_φ(z_i|x) and, consequently, hurts learning an abstractive summarization model which heavily relies on a high-quality query focused view. To tackle this issue, we propose a posterior dropout mechanism: with a probability δ, we replace the estimated posterior with the weak supervision o(z|x). We initialize δ to 1.0, i.e., only o(z|x) is used, and the tagger is supervised via Equation (12). We then linearly anneal δ over optimization steps to δ_end, so the gradients from the summarization objective (which will be introduced in Section 5) can further optimize the tagger jointly.
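The posterior dropout schedule can be sketched as a linear interpolation from 1.0 down to δ_end; `pick_posterior` is a hypothetical helper illustrating the per-example replacement:

```python
import random

def posterior_dropout_schedule(step, total_steps, delta_end=0.1):
    """Linearly anneal the posterior-dropout probability delta from 1.0
    (only the weak supervision o(z|x) is used) down to delta_end."""
    frac = min(step / max(total_steps, 1), 1.0)
    return 1.0 + frac * (delta_end - 1.0)

def pick_posterior(q_posterior, weak_labels, delta, rng=random):
    """With probability delta, use the weak-supervision distribution in
    place of the estimated posterior (illustrative helper)."""
    return weak_labels if rng.random() < delta else q_posterior
```

Early in training the tagger's noisy posterior is almost always replaced; as δ shrinks, the jointly trained posterior takes over.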

Testing  To plug-and-play query focus during testing, we model the optional query knowledge Q one may have access to with a query belief updater Δ(z_i|x, z̄). Specifically, when no prior query knowledge Q is accessible (i.e., in generic summarization), one may assume zero increment for all tokens' query belief. In QFS, however, we have prior knowledge Q ≠ ∅ that some tokens come with high query belief and should therefore be biased toward: we set Δ(z_i = 1|x, z̄) = 1.0, ∀w_i ∈ BPE-LCS(D, Q), and the rest to zero.

Systems | R-1 | R-2 | R-L
ORACLE | 55.8 | 33.2 | 51.8
LEAD | 40.4 | 17.6 | 36.7
Extractive
BERTEXT (Liu and Lapata, 2019b) | 43.9 | 20.3 | 39.9
MATCHSUM (Zhong et al., 2020) | 43.9 | 20.6 | 39.8
Abstractive
PTGEN (See et al., 2017) | 39.5 | 17.3 | 36.4
BOTTOMUP (Gehrmann et al., 2018) | 41.2 | 18.7 | 38.4
BERTABS (Liu and Lapata, 2019b) | 41.7 | 19.4 | 38.8
BART (Lewis et al., 2020) | 44.2 | 21.3 | 40.9
GSUM (Dou et al., 2020) | 45.9 | 22.3 | 42.5
GSUM (our implementation) | 45.0 | 21.9 | 41.8
LAQSUM | 45.1 | 22.0 | 41.9

Table 3: Supervised performance on the CNN/DM test set. R-1, R-2 and R-L stand for the F1 score of ROUGE-1, -2, and -L, respectively. GSUM (our implementation) uses the same training configurations as our model.

Systems | R-1 | R-2 | R-L
ORACLE | 54.5 | 37.5 | 48.5
LEAD | 26.3 | 10.5 | 21.8
LEXRANK | 29.9 | 12.3 | 26.1
Supervised (Extractive)
TRANSFORMER (Zhu et al., 2019) | 28.1 | 12.8 | 23.8
BERTEXT (Zhu et al., 2019) | 36.0 | 18.8 | 30.7
Zero-shot Abstractive
BART (Lewis et al., 2020) | 30.0 | 12.2 | 26.0
GSUM+QUERYE | 30.2 | 12.5 | 26.3
LAQSUM | 31.1 | 12.6 | 27.1

Table 4: Zero-shot performance on the WikiRef test set (with keywords as queries). R-1, R-2 and R-L stand for the F1 score of ROUGE-1, -2, and -L, respectively.

We further incorporate prior query information via a simple calibration as:

$$q_\phi(z_i = 1|\mathbf{x}, \bar{\mathbf{z}}) = \min\big\{1.0,\; q_\phi(z_i = 1|\mathbf{x}) + \Delta(z_i = 1|\mathbf{x}, \bar{\mathbf{z}})\big\}. \quad (13)$$

Note that we adopt a non-parametric query belief calibration, as we do not assume the availability of a development set for each query type for hyper-parameter optimization. This enables zero-shot transfer to QFS tasks with different settings.
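Equation (13) is a clipped per-token addition; a minimal sketch (names are illustrative):

```python
def calibrate(q1, delta):
    """Eq. (13) sketch: add the query-belief increment and cap at 1.0.

    q1:    list of q_phi(z_i = 1 | x) per token.
    delta: per-token increments; 1.0 for tokens in BPE-LCS(D, Q), else 0.0.
    """
    return [min(1.0, q + d) for q, d in zip(q1, delta)]
```

Because the update is a fixed arithmetic rule with no learnable parameters, it can be applied at test time without any re-training.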

5 Conditional Language Model

In this section we introduce conditional language modeling, which models the expectation of the log-likelihood of a summary word sequence over the variational posterior distribution in Equation (6). As shown in Figure 1, we adopt an encoder-decoder architecture tailored for text summarization with latent queries.

Encoder We encode the same input document into two different views, a query-agnostic view,

Systems                                    R-1   R-2   R-L
LEAD                                       18.1  5.6   15.9
LEXRANK                                    17.4  5.3   15.1
Supervised (Abstractive)
DDA (Laskar et al., 2020a)                 7.4   2.8   7.2
BERTABS+RANK (Abdullah and Chali, 2020)    19.2  10.6  17.9
BERTABS+CONCAT (Laskar et al., 2020a)      26.4  11.9  25.1
Zero-shot Abstractive
BERTABS† (Liu and Lapata, 2019b)           13.3  2.8   2.8
BART (Lewis et al., 2020)                  21.4  6.3   18.4
GSUM+QUERYE                                21.2  6.2   18.2
LAQSUM                                     23.5  7.2   20.6

Table 5: Zero-shot performance on Debatepedia test set (with natural questions as queries). R-1, R-2 and R-L stand for the F1 score of ROUGE-1, 2, and L, respectively. † denotes models optimized on XSum (Narayan et al., 2018); numbers are borrowed from Laskar et al. (2020a).

and a query focused view. Therefore, our encoder module consists of three encoders: a shared encoder, a document encoder, and a query encoder. The intuition is straightforward: since both views are created from the same document, we use a shared encoder for general document understanding, which also reduces model parameters. The shared document representation is then input to the other two separate encoders to encode high-level view-specific features. Each encoder contains one or multiple Transformer layers (Vaswani et al., 2017), each composed of a multi-head attention (MHA) layer and a feed-forward (FFN) layer:

H^(enc) = LN(H^(enc) + MHA(H^(enc), H^(enc), H^(enc)))
H^(enc) = LN(H^(enc) + FFN(H^(enc))).   (14)

where LN denotes layer normalization. As outputs of the encoding module, the query focused view Q directly conditions on latent query variables, while the query-agnostic view D retains original contexts.
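As an illustration, the three-encoder module can be sketched in PyTorch as follows. The layer structure follows the text (a deep shared trunk feeding one view-specific layer each, reduced here to toy sizes), while the way the query focused view conditions on the latent query variables, gating the shared states by the token-level query belief z, is our assumption rather than a detail specified above:

```python
import torch
import torch.nn as nn

class DualViewEncoder(nn.Module):
    """Sketch: a shared Transformer trunk feeds a 1-layer document
    encoder (query-agnostic view D) and a 1-layer query encoder
    (query focused view Q). Dimensions are toy values; the paper
    uses an 11-layer shared encoder initialized from BART."""
    def __init__(self, d_model=16, nhead=2, shared_layers=2):
        super().__init__()
        def layer():
            return nn.TransformerEncoderLayer(
                d_model, nhead, dim_feedforward=32, batch_first=True)
        self.shared = nn.TransformerEncoder(layer(), num_layers=shared_layers)
        self.doc_encoder = layer()
        self.query_encoder = layer()

    def forward(self, x, z):
        # x: (batch, seq, d_model) token states; z: (batch, seq) query beliefs
        h = self.shared(x)                 # general document understanding
        d_view = self.doc_encoder(h)       # query-agnostic view D
        # Assumed conditioning: weight shared states by latent query belief.
        q_view = self.query_encoder(h * z.unsqueeze(-1))
        return d_view, q_view
```

A forward pass returns the two views that the decoder consumes sequentially.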

Decoder We adopt a decoder structure similar to that of Dou et al. (2020) to handle multiple inputs. Instead of incorporating the pre-extracted guidance of Dou et al. (2020), our decoder attends to the two encoded views of the same document sequentially:

H^(dec) = LN(H^(dec) + MHA(H^(dec), H^(dec), H^(dec)))
H^(dec) = LN(H^(dec) + MHA(H^(dec), Q, Q))
H^(dec) = LN(H^(dec) + MHA(H^(dec), D, D))
H^(dec) = LN(H^(dec) + FFN(H^(dec))).   (15)
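A minimal PyTorch rendering of one such decoder layer (illustrative sizes; causal masking omitted; module and variable names are ours) could look like:

```python
import torch
import torch.nn as nn

class SequentialCrossAttnLayer(nn.Module):
    """One decoder layer following Equation (15): self-attention over
    the previous generation, cross-attention over the query focused
    view Q, cross-attention over the query-agnostic view D, then a
    feed-forward block; each sub-layer uses a residual connection
    followed by layer normalization."""
    def __init__(self, d_model=16, nhead=2):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.q_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.d_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 32), nn.ReLU(), nn.Linear(32, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, h, Q, D):
        h = self.norms[0](h + self.self_attn(h, h, h)[0])  # previous context
        h = self.norms[1](h + self.q_attn(h, Q, Q)[0])     # fuse query signals
        h = self.norms[2](h + self.d_attn(h, D, D)[0])     # original document
        return self.norms[3](h + self.ffn(h))
```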

After taking in the context of the previous generation, the decoder first fuses in query signals from Q,

Models                               DUC 2006            DUC 2007            TD-QFS
                                     R-1  R-2  R-SU4     R-1  R-2  R-SU4     R-1  R-2  R-SU4
GOLD                                 45.7 11.2 17.0      47.9 14.1 19.1      —    —    —
ORACLE                               40.6 9.1  14.8      41.8 10.4 16.0      44.9 18.9 23.0
LEAD                                 32.1 5.3  10.4      33.4 6.5  11.3      33.5 5.2  10.4
LEXRANK                              34.2 6.4  11.4      35.8 7.7  12.7      35.3 7.6  12.2
Distantly Supervised
QUERYSUM* (Xu and Lapata, 2020b)     41.6 9.5  15.3      43.3 11.6 16.8      44.3 16.1 20.7
BART-CAQ (Su et al., 2020)           38.3 7.7  12.9      40.5 9.2  14.4      —    —    —
PQSUM (Laskar et al., 2020b)         40.9 9.4  14.8      42.2 10.8 16.0      —    —    —
Few- or Zero-shot Abstractive
MARGESUM† (Xu and Lapata, 2020a)     40.2 9.7  15.1      42.5 12.0 16.9      45.5 16.6 20.9
BART (Lewis et al., 2020)            38.3 7.8  13.1      40.2 9.9  14.6      45.1 16.9 21.4
GSUM+QUERYE                          38.1 7.9  13.1      39.5 9.5  14.3      45.5 18.0 22.4
LAQSUM                               39.1 8.5  13.7      40.4 10.2 15.0      45.7 18.1 22.1

Table 6: Zero-shot performance on multi-document QFS test sets DUC (with composed queries) and TD-QFS (with titles as queries). */†: extractive/few-shot system. R-1, R-2 and R-SU4 stand for the F1 score of ROUGE-1, 2, and SU4, respectively.

which then drives the incorporation of the original document context D. The final summary generation objective is calculated auto-regressively as:

L_lm = ∑_{j=1}^{N} ∑_{t=1}^{T} log p_θ(y_t | y_{<t}, D, Q)   (16)

which is jointly trained with the query model (see Equation (12)) as: L = L_lm + L_query.

6 Experimental Setup

Datasets We used CNN/DM (Hermann et al., 2015), a generic single-document summarization dataset containing news articles and associated highlights, for model training and development (with 287,227/13,368 instances). For model evaluation, we evaluated our system on the CNN/DM test set (11,490 instances) under a supervised setting. We also performed experiments on QFS under a zero-shot transfer setting, on five test sets with various formats of queries, domains, and document settings, including WikiRef (Zhu et al., 2019), Debatepedia (Nema et al., 2017), DUC 2006-07, and TD-QFS (Baumel et al., 2016). Statistics for all test sets are given in Table 2. Note that we do not assume any development data in QFS, which departs from Xu and Lapata (2020a).

Implementation Details The shared encoder consists of 11 Transformer layers. The document and query encoders have one separate Transformer layer each. The shared encoder, document encoder, and decoder are initialized with a pretrained BART model (Lewis et al., 2020), while the query encoder is randomly initialized. We used 4 GeForce RTX 2080 GPUs for training; we set the batch size to 8 (i.e., one sample per GPU), and accumulate gradients every 32 steps. Following the standard configurations for BART finetuning on CNN/DM, we used a learning rate of 3 × 10^-5 for 20,000 optimization steps, with 500 warmup steps. Due to memory constraints, we used half float precision for efficient training and also set the maximum length of an input document to 640 tokens, with the excess clipped. We set β = 0.1 and ω = 10 in the learning objective. We used τ = 0.9 for latent query modeling. For the proposed posterior dropout, we annealed the dropout rate δ from 1.0 to 0.5 over the whole training session.
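The annealing of δ could, for instance, be implemented with a simple linear schedule; the linear shape is our assumption, as the text specifies only the endpoints 1.0 and 0.5:

```python
def dropout_rate(step, total_steps, start=1.0, end=0.5):
    """Linearly anneal the posterior dropout rate delta from `start`
    to `end` over the whole training session (linear schedule is an
    assumption; the paper states only the 1.0 -> 0.5 annealing)."""
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return start + frac * (end - start)
```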

7 Results

Generic Summarization Table 3 summarizes our results on CNN/DM. The first block in the table includes an ORACLE extractive system as an upper bound. The LEAD baseline takes the first 3 sentences of a document as the summary.

The second block presents two extractive systems. BERTEXT (Liu and Lapata, 2019b) is the first system to use BERT as a pretrained encoder (Devlin et al., 2019) for text summarization. MATCHSUM (Zhong et al., 2020) is a state-of-the-art extractive system that extracts an optimal set of sentences via text matching.

The third block includes various abstractive systems (see Section 2 for an overview). Particularly, PTGEN (See et al., 2017) and BOTTOMUP (Gehrmann et al., 2018) do not use pretrained LMs, while BERTABS is built on a pretrained BERT encoder, and GSUM (Dou et al., 2020), similar to our work, is initialized with pretrained BART parameters (Lewis et al., 2020). Our system outperforms the standard BART model finetuned on CNN/DM by a

Models               CNN/DM  WikiRef  Debatepedia  DUC 2006  DUC 2007  TD-QFS
LAQSUM               41.9    27.1     20.6         13.7      15.0      22.1
-Δ(z|x, z)           —       ↓0.2     ↓0.6         ↓0.6      ↓1.3      ↓0.4
-Joint training      ↓0.4    ↓2.8     ↓2.8         ↓1.6      ↓1.7      ↓0.4
-Weak supervision    ↓0.7    ↓0.5     ↓1.3         ↓0.2      ↓0.3      ↓0.0
-Dual view           ↓2.5    ↓10.5    ↓6.6         ↓1.8      ↓2.5      ↓2.8
-Posterior dropout   ↓0.8    ↓0.7     ↓1.2         ↓0.2      ↓0.5      ↑0.1

Table 7: Ablation results for our abstractive summarization system. ↑/↓: absolute performance increase/decrease in ROUGE-L (on CNN/DM, WikiRef and Debatepedia) or ROUGE-SU4 (on DUC 2006-07 and TD-QFS).

fairly large margin, which demonstrates the effectiveness of modeling the query focused view with latent queries even for generic summarization. Under the same training resources and configurations, it also performs on par with GSUM, a state-of-the-art abstractive model, despite being significantly more computationally efficient, as no access to high-quality guidance sentences (which are produced by another well-trained extractive system, e.g., MATCHSUM) is required.

Single-Document QFS We show results on WikiRef and Debatepedia in Table 4 and Table 5, respectively, for single-document QFS evaluation.

In the first block of these two tables, we show two unsupervised extractive baselines: LEAD, and LEXRANK, which estimates sentence-level centrality via Markov random walks on graphs. The second block presents various supervised systems on WikiRef and Debatepedia. Note that no abstractive QFS system has been evaluated on WikiRef, while Debatepedia is a short-document, short-summary dataset mainly used for abstractive summarization.

The third block of the two tables highlights system performance in the zero-shot transfer setting, including BART and GSUM. Particularly, GSUM requires guidance from a generic extractive summarization system, which is hard to obtain due to the data scarcity in QFS. Also, it is not straightforward how a given query can be incorporated into GSUM to generate query-specific summaries. To adapt it to QFS test settings, we build GSUM+QUERYE, where we employ an unsupervised query focused extractive system to pre-extract the top-K ranked sentences for each test document as its guidance. Specifically, we choose a query focused version of LEXRANK described in Xu and Lapata (2020b), which performs well on extractive QFS tasks by jointly estimating sentence centrality and query relevance (Wan, 2008).

In an end-to-end fashion, our system achieves the highest ROUGE scores on both datasets in the zero-shot transfer setting. Compared to the results on generic data, our system shows a clearer edge over systems without latent query modeling.

Multi-Document QFS To apply a summarization system trained on single-document data to a multi-document setting, we adopt a simple iterative generation approach (Baumel et al., 2018): we first rank documents in a cluster via query term frequency, and then generate summaries iteratively for each document. The final summary for the whole cluster is composed by concatenating document-level summaries.³ Repetitive generated sentences are skipped to remove redundancy.

Table 6 presents results on multi-document QFS datasets. The first block reports the performance of two upper-bound systems, GOLD and ORACLE, and two unsupervised systems taken from Xu and Lapata (2020a). The second block contains previous distantly supervised approaches. QUERYSUM (Xu and Lapata, 2020b) is a state-of-the-art extractive system which adopts QA datasets for a coarse-to-fine salience estimation process. On the abstractive side, BART-CAQ (Su et al., 2020) uses an ensembled QA model for answer evidence extraction, and then uses finetuned BART (Lewis et al., 2020) to iteratively generate summaries from paragraphs. PQSUM (Laskar et al., 2020b) uses finetuned BERTSUM to generate summaries for each document in a cluster, and a QA model for summary sentence re-ranking.

The third block compares our model with a state-of-the-art few-shot approach, MARGESUM (Xu and Lapata, 2020a), which requires a small QFS development set, and zero-shot systems including BART and GSUM+QUERYE. As we can see, without recourse to expensive QA/QFS annotations, our system achieves significantly better results than BART-CAQ, which exploits QA data as external training resources, on DUC test sets (except in ROUGE-1 on DUC 2007); on TD-QFS, it surpasses MARGESUM, which uses QFS data for proxy query generation and model development, across all metrics. Also, our system outperforms strong zero-shot abstractive systems including BART and GSUM+QUERYE on all three datasets.

³An alternative is to generate a long summary at once. However, this requires a model to be trained on an MDS dataset, or at least a proxy one (Xu and Lapata, 2020a). Since we build our system also for single-document summarization, we choose to generate and then compose.

Ablation Studies We provide the results of ablation studies on LAQSUM in Table 7. Removing the query belief update at test time (-Δ(z|x, z)) hurts model performance on QFS test sets, demonstrating the usefulness of incorporating query information via a simple calibration on the variational posterior distribution. When it comes to learning meaningful latent queries that benefit summarization tasks, relying only on tagging (-Joint training, where we adopt argmax to stop gradients from the generation loss), or only on generation (-Weak supervision, where we set ω = 0), significantly decreases performance. We conclude that latent query learning performs a trade-off between exploiting direct but weak supervision from the tagging objective (i.e., based on synthetic token annotation), and exploring the natural but indirect supervision from the generation objective (i.e., based on human-written summaries). Removing the query-agnostic view (-Dual view) causes a significant performance drop, as this view retains the original document context that the decoder can leverage, especially when the query model performs poorly. This is further supported by solely using the estimated posterior to create the query focused view during training (-Posterior dropout), which also hurts model performance as it leads to more severe error propagation to the downstream generation model.

8 Conclusion

In this work we provide a deep generative formulation for text summarization, and present a general text summarization system that supports generating both generic and query focused abstracts. Under this formulation, queries are represented as discrete latent variables, whose approximated posterior distribution can optionally be calibrated with additional query observations during testing without further adaptation. As a result, our system does not rely on any query-related resource. Experimental results across datasets of various characteristics show that the proposed system yields strong performance on generic summarization, and state-of-the-art performance on zero-shot abstractive QFS.

References

Deen Mohammad Abdullah and Yllias Chali. 2020. Towards generating query to perform query focused abstractive summarization using pre-trained model. In Proceedings of the 13th International Conference on Natural Language Generation, pages 80–85, Dublin, Ireland.

Rama Badrinath, Suresh Venkatasubramaniyan, and C. E. Veni Madhavan. 2011. Improving query focused summarization using look-ahead strategy. In Proceedings of the 33rd European Conference on Advances in Information Retrieval, pages 641–652, Dublin, Ireland.

Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. 2016. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.

Tal Baumel, Raphael Cohen, and Michael Elhadad. 2016. Topic concentration in query focused summarization datasets. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, pages 2573–2579, Phoenix, Arizona.

Tal Baumel, Matan Eyal, and Michael Elhadad. 2018. Query focused abstractive summarization: Incorporating query relevance, multi-document coverage, and summary length constraints into seq2seq models. arXiv preprint arXiv:1801.07704.

Ziqiang Cao, Wenjie Li, Sujian Li, and Furu Wei. 2018. Retrieve, rerank and rewrite: Soft template based neural summarization. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 152–161, Melbourne, Australia.

Souradip Chakraborty, Ekaba Bisong, Shweta Bhatt, Thomas Wagner, Riley Elliott, and Francesco Mosconi. 2020. BioMedBERT: A pre-trained biomedical language model for QA and IR. In Proceedings of the 28th International Conference on Computational Linguistics, pages 669–679, Online.

Hoa Trang Dang. 2005. Overview of DUC 2005. In Proceedings of the 2005 Document Understanding Conference, pages 1–12, Vancouver, Canada.

Hoa Trang Dang. 2006. DUC 2005: Evaluation of question-focused summarization systems. In Proceedings of the Workshop on Task-Focused Summarization and Question Answering, pages 48–55, Stroudsburg, PA, USA.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186, Minneapolis, Minnesota.

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing, pages 9–16, Jeju Island, Korea.

Zi-Yi Dou, Pengfei Liu, Hiroaki Hayashi, Zhengbao Jiang, and Graham Neubig. 2020. GSum: A general framework for guided neural abstractive summarization. arXiv preprint arXiv:2010.08014.

Sebastian Gehrmann, Yuntian Deng, and Alexander Rush. 2018. Bottom-up abstractive summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4098–4109, Brussels, Belgium.

Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proceedings of the 28th International Conference on Neural Information Processing Systems, pages 1693–1701, Cambridge, MA, USA.

I. Higgins, Loïc Matthey, A. Pal, Christopher P. Burgess, Xavier Glorot, M. Botvinick, S. Mohamed, and Alexander Lerchner. 2017. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR.

TD Hoa. 2006. Overview of DUC 2006. In Proceedings of the 2006 Document Understanding Conference, New York, USA.

Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144.

H. W. Kuhn, A. W. Tucker, et al. 1951. Nonlinear programming. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466.

Md Tahmid Rahman Laskar, Enamul Hoque, and Jimmy Huang. 2020a. Query focused abstractive summarization via incorporating query relevance and transfer learning with transformer models. In Canadian Conference on Artificial Intelligence, pages 342–348. Springer.

Md Tahmid Rahman Laskar, Enamul Hoque, and Jimmy Xiangji Huang. 2020b. WSL-DS: Weakly supervised learning with distant supervision for query focused multi-document abstractive summarization. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5647–5654, Online.

Logan Lebanoff, Kaiqiang Song, and Fei Liu. 2018. Adapting the neural encoder-decoder framework from single to multi-document summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4131–4141, Brussels, Belgium.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online.

Piji Li, Wai Lam, Lidong Bing, Weiwei Guo, and Hang Li. 2017a. Cascaded attention based unsupervised information distillation for compressive summarization. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2081–2090, Copenhagen, Denmark.

Piji Li, Zihao Wang, Wai Lam, Zhaochun Ren, and Lidong Bing. 2017b. Salience estimation via variational auto-encoders for multi-document summarization. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 3497–3503, San Francisco, California, USA.

Yang Liu and Mirella Lapata. 2019a. Hierarchical transformers for multi-document summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5070–5081, Florence, Italy.

Yang Liu and Mirella Lapata. 2019b. Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290, Berlin, Germany.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797–1807, Brussels, Belgium.

Preksha Nema, Mitesh M. Khapra, Anirban Laha, and Balaraman Ravindran. 2017. Diversity driven attention model for query-based abstractive summarization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1063–1072, Vancouver, Canada.

Laura Perez-Beltrachini, Yang Liu, and Mirella Lapata. 2019. Generating summaries with topic templates and structured convolutional decoders. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5107–5116, Florence, Italy.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389, Lisbon, Portugal.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1073–1083, Vancouver, Canada.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany.

Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Training very deep networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Volume 2, pages 2377–2385, Montreal, Quebec, Canada.

Dan Su, Yan Xu, Genta Indra Winata, Peng Xu, Hyeondey Kim, Zihan Liu, and Pascale Fung. 2019. Generalizing question answering system with pre-trained language model fine-tuning. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 203–211, Hong Kong, China.

Dan Su, Yan Xu, Tiezheng Yu, Farhad Bin Siddique, Elham Barezi, and Pascale Fung. 2020. CAiRE-COVID: A question answering and query-focused multi-document summarization system for COVID-19 scholarly information management. In Proceedings of the 1st Workshop on NLP for COVID-19 at EMNLP 2020, Online.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.

Xiaojun Wan. 2008. Using only cross-document relationships for both generic and topic-focused multi-document summarizations. Information Retrieval, 11(1):25–49.

Xiaojun Wan, Jianwu Yang, and Jianguo Xiao. 2007. Manifold-ranking based topic-focused multi-document summarization. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 2903–2908, Hyderabad, India.

Xiaojun Wan and Jianmin Zhang. 2014. CTSUM: Extracting more certain summaries for news articles. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 787–796, New York, United States.

Zhengjue Wang, Zhibin Duan, Hao Zhang, Chaojie Wang, Long Tian, Bo Chen, and Mingyuan Zhou. 2020. Friendly topic assistant for transformer based abstractive summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 485–497, Online.

Yumo Xu and Mirella Lapata. 2020a. Abstractive query focused summarization with query-free resources. arXiv preprint arXiv:2012.14774.

Yumo Xu and Mirella Lapata. 2020b. Coarse-to-fine query focused multi-document summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 3632–3645, Online.

Ming Zhong, Pengfei Liu, Yiran Chen, Danqing Wang, Xipeng Qiu, and Xuanjing Huang. 2020. Extractive summarization as text matching. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6197–6208, Online.

Chenguang Zhu, William Hinthorn, Ruochen Xu, Qingkai Zeng, Michael Zeng, Xuedong Huang, and Meng Jiang. 2021. Enhancing factual consistency of abstractive summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 718–733, Online.

Haichao Zhu, Li Dong, Furu Wei, Bing Qin, and Ting Liu. 2019. Transforming Wikipedia into augmented data for query-focused summarization. arXiv preprint arXiv:1911.03324.