

Structured Pruning of Large Language Models

Ziheng Wang *

ASAPP, [email protected]

Jeremy Wohlwend *

ASAPP, [email protected]

Tao Lei *

ASAPP, [email protected]

Abstract

Large language models have recently achieved state of the art performance across a wide variety of natural language tasks. Meanwhile, the size of these models and their latency have significantly increased, which makes their usage costly, and raises an interesting question: do language models need to be large? We study this question through the lens of model compression. We present a generic, structured pruning approach by parameterizing each weight matrix using its low-rank factorization, and adaptively removing rank-1 components during training. On language modeling tasks, our structured approach outperforms other unstructured and block-structured pruning baselines at various compression levels, while achieving significant speedups during both training and inference. We also demonstrate that our method can be applied to pruning adaptive word embeddings in large language models, and to pruning the BERT model on several downstream fine-tuning classification benchmarks.1

1 Introduction

Recent advances in language modeling have led to remarkable improvements on a variety of natural language tasks (Dai and Le, 2015; Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019; Liu et al., 2019; Dai et al., 2019; Zhang et al., 2019). These models, however, have grown increasingly large, rendering them slow and expensive for real-world applications. Through the use of model compression, we aim to reduce this overhead, and to better understand the role of model capacity in large language models.

A common approach to model compression is known as weight pruning (Zhu and Gupta, 2017; Han et al., 2015a; See et al., 2016). Model weights are progressively removed, resulting in sparse matrices across the network. Earlier work focuses mostly on unstructured pruning, where weights are pruned individually (Narang et al., 2017a; Zhu and Gupta, 2017). While this method is effective, it results in unstructured sparse matrices that are difficult to support on common hardware (Han et al., 2016), making it challenging to obtain training and inference speedups despite a significant reduction in model size.

* Denotes equal contribution.
1 Our code is publicly available at https://github.com/asappresearch/flop.

On the other hand, structured pruning imposes structured sparse patterns by removing groups of consecutive parameters, such as rows, columns or k×k sub-blocks of the weight matrix (Narang et al., 2017b; Wen et al., 2018; Cao et al., 2019). These methods lead to significant speedups, but tend to give lower performance than unstructured pruning given the same parameter budget (Yao et al., 2019). Another caveat is that some of these methods require special linear algebra implementations (Gray et al., 2017; Yao et al., 2019) or hardware (Cao et al., 2019) in order to accelerate matrix multiplication, therefore limiting their application to a broad set of existing models.

We propose a generic, improved structured pruning approach based on adaptive low-rank factorization. As an alternative to unstructured sparse and block sparse representations, low-rank factorization retains the full dense structure of weight matrices, eliminating the need for special linear algebra primitives and hardware for computation speedups. Compared to row (and column) based pruning, low-rank factorization better preserves the linear transformation of the uncompressed matrices. During training, our method adaptively learns which low-rank components to remove in order to achieve a strong performance-compression trade-off. We show that a simple magnitude based pruning strategy is sufficient to achieve strong results. In addition, we further improve performance via an improved l0 regularization (Louizos et al., 2018) technique which uses an augmented Lagrangian method to directly control the final compression level of the model. Our method, which we refer to as FLOP (Factorized Low-rank Pruning), applies to any matrix multiplication.

Pruning large language models introduces unique challenges in handling the large input and output layers. Although our method is generic, it is particularly well suited to this task. In particular, we show that FLOP can dynamically learn the embedding dimensions of different word clusters, effectively extending the idea of adaptive embeddings and softmax (Grave et al., 2017; Baevski and Auli, 2019). Since these embedding layers take up a significant fraction of the parameters in language models, learning flexible dimensions instead of specifying them manually results in a better trade-off between parameter reduction and performance.

We evaluate our method on common language modeling and language understanding tasks, including the Wiki-103, Enwik8 and GLUE benchmarks, and by testing our method on both recurrent networks and the Transformer (Vaswani et al., 2017). Our results demonstrate that factorization based pruning significantly outperforms block-structured pruning and even surpasses unstructured pruning, while our improved l0 regularization further improves performance in most cases. When pruning a large word-level language model with adaptive embeddings, for example, our method achieves 50% compression while losing only 0.8 perplexity. Moreover, our method is able to achieve over 2x speed-up during both training and inference with no additional hardware or software requirements. Our method will be released as a PyTorch (Paszke et al., 2017) library.

2 Related Work

The development of model compression techniques can be categorized into three areas of research: weight pruning (Han et al., 2015b; Zhu and Gupta, 2017), knowledge distillation (Ba and Caruana, 2014; Hinton et al., 2015; Kim and Rush, 2016), and quantization (Gong et al., 2014; Zhu et al., 2017; Shen et al., 2019).

Recent efforts have successfully applied compression to various architectures and NLP applications, such as pruning multi-head attention for machine translation (Voita et al., 2019), learning adaptive embeddings and softmax layers for language models (Grave et al., 2017; Baevski and Auli, 2019; Li et al., 2018; Variani et al., 2019), and compressing BERT models via distillation (Chia et al., 2019; Jiao et al., 2019; Sanh et al., 2019; Sun et al., 2019; Tsai et al., 2019; Turc et al., 2019). For simplicity, these works typically employ only one compression technique, such as distillation. However, these techniques can be combined to achieve greater compression (Han et al., 2015a; Shangguan et al., 2019). Our pruning method is compatible with quantization and distillation, as it can be applied to compress any matrix multiplication in a network.

Previous work has considered different weight pruning approaches such as unstructured pruning based on magnitude (Narang et al., 2017a; Frankle and Carbin, 2019), dropout (Gale et al., 2019; Fan et al., 2020; Molchanov et al., 2017), and structured pruning (Wen et al., 2018; Louizos et al., 2017). Model weights are often removed via thresholding and l1 regularization during the pruning process (Narang et al., 2017b; Liu et al., 2018). Our method differs from previous work by using a low-rank parameterization for compression. Furthermore, we extend l0 regularization using an augmented Lagrangian optimization method to control the final model size.

3 Background

We formalize the task of model pruning as an end-to-end learning problem with l0 regularization, following the prior work of Louizos et al. (2018).

Consider a given neural network model f(·; θ) parameterized by θ = {θ_j}_{j=1}^n, where each θ_j represents an individual parameter weight or a block of weights (e.g. a column of a weight matrix) and n denotes the number of blocks. A pruning strategy of the model can be parameterized by introducing additional binary variables z = {z_j}_{j=1}^n such that z_j ∈ {0, 1} and

\[ \tilde{\theta} = \theta \odot z, \qquad \tilde{\theta}_j = \theta_j\, z_j \;\; \forall j. \]

Here θ̃ = {θ̃_j} denotes the set of model parameters after pruning, and its l0 norm, ‖θ̃‖_0 = Σ_{j=1}^n z_j, measures the effective size of the pruned model. The choice of binary variables z can be regulated by some prior distribution and optimized given the training data. That is, let q_j(z) be the density function of the learnable prior of z_j. The optimization objective during training can be formulated as minimizing the expected training loss

\[ \mathbb{E}_z\!\left[\frac{1}{D}\sum_{i=1}^{D} \mathcal{L}(x_i, y_i; \tilde{\theta}) + \lambda \lVert\tilde{\theta}\rVert_0\right], \tag{1} \]

where {x_i, y_i}_{i=1}^D are training examples, L is the training loss function and λ > 0 is a constant hyperparameter for l0 norm regularization encouraging the model to be sparse. Note that in practice optimizing this objective is intractable due to the discrete nature of z_j and the exponential number of 2^n choices.

The key to the method of Louizos et al. (2018), called the re-parameterization trick, enables z to be differentiable and jointly trained with the model parameters θ. Specifically, the random variables z are relaxed as continuous variables distributed within the interval [0, 1]. In addition, instead of learning the probability density function q_j(z), the re-parameterization trick proposes to learn the inverse of the cumulative density function (CDF). Note that if G() is the inverse CDF for a variable z, then z can be easily sampled by first sampling u ∼ U(0, 1) and computing z = G(u). Assuming the inverse CDF function is parameterized by some learnable parameters α = {α_j}_{j=1}^n and the function G(·; α) is differentiable, we obtain an overall end-to-end learning objective,

\[ \min_{\theta,\alpha}\; \mathbb{E}_{u\sim U(0,1)}\!\left[\frac{1}{D}\sum_{i=1}^{D} \mathcal{L}(x_i, y_i; \tilde{\theta}) + \lambda \lVert\tilde{\theta}\rVert_0\right], \qquad z_j = G(u_j; \alpha_j), \;\; \forall j = 1 \cdots n, \tag{2} \]

where u = {u_1, · · · , u_n} denotes the iid samples from the uniform distribution. Since z is now the output of the parameterized function G(·; α) and is used as an intermediate representation for the neural network (with θ̃ = θ ⊙ z), gradient based optimization methods can perform gradient updates for θ and α.

Following previous work, we choose the Hard Concrete distribution for the random variables z = {z_j}. The inverse CDF G(·; α) of this distribution is defined as follows:

\[
\begin{aligned}
u &\sim U(0, 1) \\
s &= \operatorname{sigmoid}(\log u - \log(1 - u) + \alpha) \\
\bar{s} &= s \times (r - l) + l \\
z &= \min(1, \max(0, \bar{s}))
\end{aligned}
\]

where l < 0 and r > 1 are two constants used to 'stretch' the sigmoid outputs s into the interval (l, r), and the final outputs z are rectified into [0, 1]. The stretch-and-rectify process has the effect of assigning a significant portion of probability mass to the integer values {0, 1}, which makes it a good relaxation of the binary (Bernoulli) distribution. During training, we sample u and compute z and the loss L() for each training batch. The expected l0 norm regularization can be separately computed via a closed form,

\[ \mathbb{E}\!\left[\lVert\tilde{\theta}\rVert_0\right] = \sum_{j=1}^{n} \Pr[z_j > 0] = \sum_{j=1}^{n} \operatorname{sigmoid}\!\left(\alpha_j - \log\frac{-l}{r}\right), \tag{3} \]

which is differentiable as well.
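To make the reparameterization concrete, a minimal PyTorch sketch of a Hard Concrete gate and the closed-form l0 penalty might look as follows. This is our own illustration rather than the released FLOP code; the stretch constants l, r and the size of alpha are placeholder choices.

```python
import torch

l, r = -0.1, 1.1                                 # stretch constants, l < 0 < 1 < r

def sample_gates(alpha):
    """Sample relaxed gates z in [0, 1] during training (stretch and rectify)."""
    u = torch.rand_like(alpha)                   # u ~ U(0, 1)
    s = torch.sigmoid(torch.log(u) - torch.log(1 - u) + alpha)
    s_bar = s * (r - l) + l                      # stretch into (l, r)
    return s_bar.clamp(0.0, 1.0)                 # rectify into [0, 1]

def expected_l0(alpha):
    """Closed-form expected number of non-zero gates, as in Eq. (3)."""
    return torch.sigmoid(alpha - torch.log(torch.tensor(-l / r))).sum()

alpha = torch.zeros(10, requires_grad=True)      # one parameter per block
z = sample_gates(alpha)                          # differentiable w.r.t. alpha
reg = expected_l0(alpha)                         # add lambda * reg to the training loss
```

Both the sampled gates and the regularizer are differentiable in alpha, which is what allows the pruning decisions to be trained jointly with the model.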

4 Method

In this section, we introduce FLOP, an improved structured pruning method. FLOP proposes a different parameterization of the weight matrices using low-rank factorization. In addition, we introduce a revised optimization objective that allows for explicit control of the compression size.

4.1 Structured Pruning using Factorization

In weight pruning, a key choice is how we define the parameter blocks θ_1, · · · , θ_n to achieve the most effective pruning results. One obvious method is to prune each individual parameter weight, which often retains strong performance but poses challenges to achieving a computation speedup given unstructured sparse matrices.

Structured pruning chooses to remove groups of consecutive parameters as a remedy. For example, consider a fully connected layer which performs a multiplication Wx for an input feature x ∈ R^d and weight matrix W ∈ R^{d'×d}. One popular method, sometimes referred to as neuron or input feature pruning, consists of adding the sparsity variables as a sparse diagonal matrix G = diag(z_1, · · · , z_d) to the multiplication, i.e., WGx. This effectively removes the subset of the columns in W with z_k = 0, where k is the column index. In practice, this method produces significant speedups at both training and inference time (by selecting a small subset of columns and performing matrix multiplications with much smaller matrices). However, it is reported to achieve lower performance compared to unstructured pruning (Yao et al., 2019) due to its more restrictive sparse patterns.
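As a quick illustration of why this structure yields speedups, the toy snippet below (our own, not from the paper's code release) checks that the masked product WGx equals a smaller matrix multiplication over only the active columns of W.

```python
import torch

d_out, d_in = 8, 16
W = torch.randn(d_out, d_in)
x = torch.randn(d_in)
z = (torch.rand(d_in) > 0.5).float()        # hypothetical binary pruning mask

full = W @ (torch.diag(z) @ x)              # W G x with G = diag(z)

active = z.nonzero(as_tuple=True)[0]        # columns with z_k = 1
small = W[:, active] @ x[active]            # same result, smaller matrices

assert torch.allclose(full, small, atol=1e-6)
```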

We propose to use low-rank factorization as a less restrictive, yet powerful representation, and obtain parameter reduction by pruning rank-1 components. That is, we reparameterize and factorize the matrix W into the product of two smaller matrices W = PQ, where P ∈ R^{d'×r}, Q ∈ R^{r×d} and r ≤ min{d, d'} is the number of columns of P (equivalently the number of rows of Q). Let p_k and q_k be the k-th column of P and the k-th row of Q respectively. Since W is now the sum of r rank-1 components p_k q_k, we can achieve structured pruning by introducing a pruning variable z_k for each component:

\[ W = P\,G\,Q = \sum_{k=1}^{r} z_k \times (p_k \times q_k), \]

where G = diag(z_1, · · · , z_r) is again a diagonal matrix of pruning variables. Intuitively, learning the factorization has the potential of keeping the most effective rank-1 components, and thereby better preserves the model performance.1

1 It is also easy to see that input feature pruning is a special case of low-rank pruning: by fixing P = W and Q = I, PGQ = WGI = WG.

After training, only the columns and rows corresponding to non-zero diagonal values need to be stored, resulting in much smaller (but still dense) matrices. The non-zero values of G can be absorbed into either P or Q. The computation boils down to simple matrix multiplications at inference time, maximizing efficiency on common hardware. Unlike unstructured pruning, we need not store the indices of the sparse weights, resulting in greater memory savings.
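In code, such a factorized layer can be sketched as the PyTorch module below: a linear map parameterized as P G Q with one gate per rank-1 component and a helper that drops pruned components after training. This is a simplified illustration, not the released FLOP library; in particular, the gates here are plain parameters, whereas FLOP samples them from the Hard Concrete distribution during training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedLinear(nn.Module):
    """Linear layer W = P diag(z) Q with a pruning gate per rank-1 component."""

    def __init__(self, d_in, d_out, rank):
        super().__init__()
        self.P = nn.Parameter(torch.randn(d_out, rank) / rank ** 0.5)
        self.Q = nn.Parameter(torch.randn(rank, d_in) / d_in ** 0.5)
        self.z = nn.Parameter(torch.ones(rank))       # per-component gates

    def forward(self, x):
        h = F.linear(x, self.Q)                       # project to the rank dimension
        h = h * self.z                                # apply the diagonal gate G
        return F.linear(h, self.P)                    # project back to d_out

    def compact(self, keep):
        """Keep only the components in `keep`, absorbing the gate values into P."""
        P = (self.P[:, keep] * self.z[keep]).detach()
        Q = self.Q[keep, :].detach()
        return P, Q
```

After pruning, compact() returns two smaller dense matrices, so inference reduces to two ordinary matrix multiplications with no sparse indexing.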

4.2 Pruning Adaptive Embedding and Softmax Layer

The input embedding and softmax output layers can take the vast majority of parameters in a language model when the vocabulary size is large. Previous work has considered various techniques that are specifically tailored to compressing the embedding and softmax layers. For instance, the adaptive embedding and softmax methods of Grave et al. (2017) and Baevski and Auli (2019) have been shown to achieve impressive results in preserving perplexity while significantly reducing the total number of embedding parameters.

We describe how FLOP fits naturally with these adaptive methods, giving them more potential. The core idea behind the adaptive methods is to apply different embedding dimensions and projections to different word clusters. Consider the recent method of Baevski and Auli (2019) without loss of generality. Let i ∈ {1, · · · , C} denote the index of the i-th word cluster (sorted by word frequency). Two parameter matrices E_i ∈ R^{n_i×d_i} and O_i ∈ R^{d_i×d} are introduced for the i-th cluster, where n_i is the number of words in the cluster, d is the original embedding dimension and d_i is the reduced word dimension for this cluster. In other words, each word embedding in this cluster has dimension d_i but is projected back into dimension d using a projection O_i (and vice versa). This is indeed a low-rank factorization

\[ \tilde{E}_i = E_i\,O_i \in \mathbb{R}^{n_i \times d} \]

for an underlying embedding matrix Ẽ_i. While the reduced dimensions {d_i} usually have to be manually specified, our method automatically learns a separate diagonal pruning mask G_i for each cluster, i.e. Ẽ_i = E_i G_i O_i. During training and pruning, it adaptively learns to adjust the parameter budget of each word cluster based on what is needed to achieve good performance. Unsurprisingly, our method prunes most of the dimensions for rare words, which is consistent with the empirical choice made in prior work.
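A possible sketch of this per-cluster factorization is shown below, with one gate vector per frequency cluster so that each cluster's effective dimension can shrink independently. The cluster sizes and dimensions are illustrative placeholders (loosely following the Wiki-103 setup in the appendix), and the class is our own construction rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class PrunedAdaptiveEmbedding(nn.Module):
    """Per-cluster factorization E_i G_i O_i with a learnable gate vector per cluster."""

    def __init__(self, cluster_sizes=(20000, 40000, 207000),
                 cluster_dims=(1024, 256, 64), d_model=1024):
        super().__init__()
        self.E = nn.ParameterList([nn.Parameter(torch.randn(n, d) * 0.02)
                                   for n, d in zip(cluster_sizes, cluster_dims)])
        self.O = nn.ParameterList([nn.Parameter(torch.randn(d, d_model) * 0.02)
                                   for d in cluster_dims])
        self.z = nn.ParameterList([nn.Parameter(torch.ones(d))
                                   for d in cluster_dims])    # gates G_i = diag(z_i)

    def embed_cluster(self, i, token_ids):
        # (E_i G_i O_i)[token_ids]: look up in the small space, gate, project up
        h = self.E[i][token_ids] * self.z[i]
        return h @ self.O[i]

emb = PrunedAdaptiveEmbedding()
vectors = emb.embed_cluster(1, torch.tensor([0, 17, 42]))   # three words of cluster 1
```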

4.3 Augmented Lagrangian Method

Our method can be implemented with a magnitude based pruning strategy, or directly trained with the training objective (2), which uses an l0 regularization λ‖θ̃‖_0 to promote weight pruning. One limitation of this regularization, however, is the lack of effective control over the size of the pruned model. For instance, we observe that training with the same λ can converge to very different model sizes when using slightly different learning rates or pruning schedules. This can be problematic because a desired model size or parameter budget is often needed in real-world applications.

We make use of an Augmented Lagrangian method to overcome this training limitation. Lagrangian relaxation methods have been explored in many NLP problems (Bastings et al., 2019; Martins et al., 2011; Flanigan et al., 2014; Rush et al., 2010). We use the following Lagrangian variant for our task. Let t be the target model size and s(α) be the expected model size determined by the Hard Concrete parameters α. Note that s(α) can be computed based on Eq. (3) by multiplying Pr[z_j > 0] by the size of the j-th parameter block. Our Augmented Lagrangian method imposes an equality constraint s(α) = t by introducing a violation penalty,

\[ g(\lambda, \alpha) = \lambda_1 \cdot (s(\alpha) - t) + \lambda_2 \cdot (s(\alpha) - t)^2, \]

where λ_1, λ_2 ∈ R are two Lagrangian multipliers that are jointly updated during training. The overall training optimization is an adversarial game,

\[ \max_{\lambda_1, \lambda_2}\; \min_{\theta, \alpha}\; \mathbb{E}_u\!\left[\frac{1}{D}\sum_{i=1}^{D} \mathcal{L}(x_i, y_i; \tilde{\theta})\right] + g(\lambda, \alpha). \]

The updates of λ_1 and λ_2 would always increase the training loss unless the equality constraint is met, which gives us the desired model size.

We gradually increase the target size t at a linear rate during the process of pruning training. That is, given the desired size t_max, we set the sparsity at the k-th pruning iteration as

\[ t_k = \min\!\left(1, \frac{k}{m}\right) \cdot t_{\max}, \]

where m is a hyperparameter specifying the number of annealing steps.

We perform joint gradient updates for the model parameters θ and α as well as the Lagrangian multipliers λ_1 and λ_2. For each training batch, we sample the pruning mask z = {z_1, · · · , z_n} and share it across the training examples within the batch. Since the pruning mask is shared, we can select the parameters that are active for the current batch and compute smaller matrix multiplications in the forward and backward passes. This results in a training speedup when z becomes sparse.
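A toy PyTorch sketch of the controller is given below: it computes s(α) from Eq. (3), forms the penalty g, anneals the target, and takes one descent step on α together with one ascent step on the multipliers. All names, constants and learning rates here are our own placeholders; a real training step would also include the task loss and the model parameters.

```python
import torch

l, r = -0.1, 1.1                                   # Hard Concrete stretch constants

def expected_size(alpha, block_sizes):
    # s(alpha) = sum_j P(z_j > 0) * size_j, with P(z_j > 0) from Eq. (3)
    p_nonzero = torch.sigmoid(alpha - torch.log(torch.tensor(-l / r)))
    return (p_nonzero * block_sizes).sum()

def penalty(alpha, block_sizes, target, lam1, lam2):
    # g(lambda, alpha) = lam1 * (s - t) + lam2 * (s - t)^2
    gap = expected_size(alpha, block_sizes) - target
    return lam1 * gap + lam2 * gap ** 2

def target_at_step(k, m, t_max):
    # linear annealing: t_k = min(1, k/m) * t_max
    return min(1.0, k / m) * t_max

alpha = torch.zeros(5, requires_grad=True)         # one gate parameter per block
lam1 = torch.zeros((), requires_grad=True)         # Lagrangian multipliers
lam2 = torch.zeros((), requires_grad=True)
sizes = torch.tensor([100.0, 100.0, 200.0, 200.0, 400.0])

loss = penalty(alpha, sizes, target_at_step(k=500, m=1000, t_max=500.0), lam1, lam2)
loss.backward()
with torch.no_grad():
    alpha -= 1e-2 * alpha.grad                     # descent on the gate parameters
    lam1 += 1e-2 * lam1.grad                       # ascent on the multipliers
    lam2 += 1e-2 * lam2.grad
```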

4.4 Inference

During training, the pruning mask is a random variable drawn from the Hard Concrete distribution. At inference time, however, we must use a deterministic, fixed mask z for each weight matrix to obtain the compressed factorization matrices P and Q (by keeping the i-th low-rank component if z_i > 0). We do so by computing the expected value of each z_i in z using Eq. (3) described in Section 3, and then keeping the top values of {z_1, · · · , z_n} and clipping the rest to zero, so as to match the l0 norm (i.e. the compression level).
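A small sketch of this compaction step is given below. The deterministic gate estimate (a clamped, stretched sigmoid of α) and the constants are illustrative assumptions; the kept gate values are absorbed into P so that inference only needs the two smaller dense matrices.

```python
import torch

def compact_factors(P, Q, alpha, k, l=-0.1, r=1.1):
    """Keep the top-k rank-1 components by a deterministic gate estimate."""
    z = torch.clamp(torch.sigmoid(alpha) * (r - l) + l, 0.0, 1.0)
    keep = torch.topk(z, k).indices
    return P[:, keep] * z[keep], Q[keep, :]

P, Q = torch.randn(512, 64), torch.randn(64, 512)
alpha = torch.randn(64)
P_small, Q_small = compact_factors(P, Q, alpha, k=32)   # 512x32 and 32x512 factors
```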

5 Experimental Setup

Tasks We evaluate the performance of our method on language modeling and BERT fine-tuning. Specifically, we consider the following task setup.

1. Recurrent word-level language models on the Wiki-103 dataset. We adopt SRU (Lei et al., 2018) as the recurrent architecture and tied adaptive embedding and softmax layers (Baevski and Auli, 2019). Our base model consists of 12 recurrent layers and 100M parameters in total. About 50% of the parameters are used for the adaptive layers.

2. Recurrent character-level language models on the Enwik8 dataset. We use the same SRU architecture. The base model uses 6 recurrent layers and 35M parameters in total.

3. Transformer-XL model on the Enwik8 dataset. We use the 12-layer base model from Dai et al. (2019) containing 41M parameters. We introduce pruning for the matrices in the self-attention layers as well as those in the feed-forward layers. For factorization based pruning, we choose the starting rank r for each matrix such that the number of parameters remains the same as in the unfactorized model.2

4. BERT fine-tuning on several classification benchmarks (Socher et al., 2013; Dolan and Brockett, 2005; Cer et al., 2017; Wang et al., 2019). In this experiment, we use the pre-trained RoBERTa base model of Liu et al. (2019).

2 In effect, we set r = d_1 d_2 / (d_1 + d_2), where d_1 and d_2 are the dimensions of the original weight matrix.

We extend the implementations of Transformer, SRU and the adaptive embedding / softmax layers to support factorization based pruning (and the other baselines).

Baselines We compare with the following unstructured, structured and/or factorization based pruning baselines.

• FAC, which trains low-rank factorized models from scratch by reducing all dimensions with the same ratio to reach the desired compression.

• NP-l0 (Louizos et al., 2018), which adopts l0 regularization and performs neuron pruning (i.e. removing input features and columns of weight matrices). No factorization is used for this baseline. We add the Augmented Lagrangian optimization, similar to FLOP, to achieve the exact desired compression.

• AGP (Zhu and Gupta, 2017), which gradually prunes individual parameters based on weight magnitude. AGP is one of the state-of-the-art unstructured pruning methods. We use the implementation provided in the Nervana Distiller library (Zmora et al., 2019).

• FLOP-AGP, a variant of our full method that prunes low-rank components, but uses magnitude-based gradual pruning on the diagonal mask G instead. We also tune l1 regularization on the masks to encourage sparsity, similar to Narang et al. (2017b).

These baselines serve as competitive pruning alternatives, and also provide data points that let us isolate the effectiveness of the sub-components of our method, such as low-rank factorization and l0 pruning. All methods use the same training configurations, such as learning rate and dropout. We tune hyper-parameters related to pruning, such as the compression schedule and the learning rate of the Lagrangian variables, for each method. More training and implementation details are provided in the appendix.

6 Results

Word-level Language Model Table 1 presents the results of FLOP as well as the baseline methods. The SRU base model (unpruned) achieves a test perplexity of 24.5, which is a strong starting point, competitive with top-performing models such as the Transformer (Dai et al., 2019).

The pruning results conform to our expectations that pruning a large model is consistently better than training a small model from scratch, and that low-rank based pruning yields better performance than removing matrix columns and input features. FLOP exceeds the performance of the FAC, NP-l0 and AGP baselines at all compression levels tested. The performance of FLOP-AGP, especially in comparison with its unstructured counterpart AGP, highlights the effectiveness of factorization based pruning. Moreover, we achieve a test perplexity (PPL) of 25.3 with the FLOP-l0 method, a loss of only 0.8 perplexity, while removing 50% of the model parameters. This result is notable since our base model adopts the adaptive word embedding and softmax layers, which already reduce the model size significantly.

Figure 1 illustrates how our method adaptively controls the size of different model components.

Method               Size   Compress   PPL
Trans. (Dai et al.)  151M   -          24.1
SRU (base)           100M   -          24.5
FAC                  50M    50%        28.2
AGP                  50M    50%        25.7
NP-l0                51M    50%        26.7
FLOP-AGP             51M    50%        25.6
FLOP-l0              50M    50%        25.3
FAC                  30M    70%        31.0
AGP                  30M    70%        28.4
NP-l0                31M    70%        31.3
FLOP-AGP             31M    70%        28.1
FLOP-l0              30M    70%        27.7
FAC                  21M    80%        35.2
AGP                  20M    80%        32.6
NP-l0                18M    80%        39.1
FLOP-AGP             21M    80%        31.3
FLOP-l0              21M    80%        31.9

Table 1: Comparison of FLOP and all baselines on the Wiki-103 dataset. We report test perplexity (PPL) at three different compression levels. All methods use adaptive embedding and softmax layers.

Figure 1: Number of parameters used in the RNN and adaptive embeddings at different compression levels. We also show the number of parameters used for the most, second most and least frequent words.

We show the overall size of the recurrent encoder and the adaptive embedding layers at the compression levels tested, and break down the use of parameters within three word clusters based on their frequency. FLOP learns to prune the dimensions more aggressively for less-frequent words. This result showcases the benefit of adaptively reducing word dimensions.

Char-level Language Model Table 3 shows the results of pruning character-level language models. Our base model achieves a test bits-per-character score (BPC) of 1.24, which is comparable with previously reported results of RNN-based models.


Parameters   Compression   SST2    MRPC    STS-B   QNLI    Average
125M         0%            92.43   90.9    90.22   89.77   90.83
80M          35%           92.09   88.61   88.18   89.05   89.48

Table 2: Compression on downstream fine-tuning.

Method                Size   Comp.   BPC
LSTM (Wu et al.)      17M    -       1.44
QRNN (Merity et al.)  26M    -       1.33
SRU (base)            35M    -       1.24
FAC                   11M    70%     1.33
AGP                   11M    70%     1.27
NP-l0                 11M    70%     1.31
FLOP-AGP              11M    70%     1.27
FLOP-l0               11M    70%     1.25
FAC                   8M     80%     1.38
AGP                   8M     80%     1.29
NP-l0                 8M     80%     1.34
FLOP-AGP              8M     80%     1.29
FLOP-l0               8M     80%     1.27
FAC                   4M     90%     1.47
AGP                   4M     90%     1.35
NP-l0                 4M     90%     1.43
FLOP-AGP              4M     90%     1.34
FLOP-l0               4M     90%     1.33

Table 3: Comparison of FLOP and all baselines on the Enwik8 dataset. We report bits-per-character (BPC) on the test set. We also include previously reported results of recurrent language models on this dataset as additional data points.

Method            Size   Compress   BPC
Trans-XL (base)   41M    -          1.08
FAC               8M     80%        1.20
AGP               8M     80%        1.14
FLOP-AGP          8M     80%        1.17
FLOP-l0           8M     80%        1.13
FLOP-AGP          4M     90%        1.25
FLOP-l0           4M     90%        1.17

Table 4: Results of pruning Transformer-XL models on the Enwik8 dataset. We report bits-per-character (BPC) on the test set.

As shown in Table 3, we again see the benefit of low-rank pruning, matching or improving on unstructured pruning. Furthermore, FLOP-l0 obtains the best performance across all pruning levels. Notably, we achieve a BPC of 1.25 at 70% compression, nearly matching the uncompressed model at 1.24.

Table 4 presents the results of pruning 12-layer Transformer-XL models on the Enwik8 dataset. We compare FAC, unstructured AGP, FLOP-AGP and FLOP-l0 at the 80% compression level, and also report the results of the FLOP variants at 90% compression. FLOP-l0 outperforms the other methods in comparison. In addition, it is able to achieve 1.17 BPC using 4M parameters, showcasing the effectiveness of our method when applied to another neural architecture.

BERT on Classification Tasks Finally, we demonstrate that our method can also be applied to language model fine-tuning on downstream tasks. We use the RoBERTa base model in this experiment. Since the model was pretrained without matrix factorization, we first compute the singular value decomposition of each matrix and then introduce the pruning mask in between the resulting factored matrices. Note that this procedure temporarily increases the total number of parameters. We compare here the final number of parameters to the initial number pre-factorization.
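The following snippet sketches the factorization step described above for a single pretrained weight matrix: an SVD splits W into two factors with a gate per singular component, in between which the pruning mask can then be learned. The helper is our own illustration; the actual fine-tuning code is not reproduced here.

```python
import torch

def factorize_pretrained(W):
    """Rewrite a pretrained weight as W = P diag(z) Q via SVD."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    P = U * S.sqrt()                  # absorb sqrt of the singular values into P
    Q = S.sqrt()[:, None] * Vh        # ... and into Q
    z = torch.ones(S.numel())         # pruning gates over the rank-1 components
    return P, Q, z

W = torch.randn(768, 768)             # e.g. one attention projection matrix
P, Q, z = factorize_pretrained(W)
assert torch.allclose(W, (P * z) @ Q, atol=1e-3)
```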

Our results are shown in Table 2. We are able to preserve nearly 99% of the performance while reducing the number of parameters by 35%. Our target compression level is limited by the fact that the embedding layers make up a significant portion of the remaining parameters. As demonstrated in the previous experiment on Wiki-103, we believe that higher levels of compression could be obtained by factorizing the embedding layer, similar to Lan et al. (2020).

7 Analysis

In this section, we perform an analysis of several aspects of our method.

Factorization One of the key hypotheses outlined in this paper is that pruning input dimensions (equivalently rows or columns of weight matrices) is a more restrictive form of pruning compared to our factorization based strategy. However, one could also argue that the factorization method works better simply because the hidden size can be initially set much larger than in an unfactorized model, not because of pruning itself.

Variants   Size   0%     70%            80%            85%            90%
NP-l0      37M    1.30   1.31 (-0.8%)   1.34 (-3.2%)   1.37 (-5.4%)   1.43 (-10.0%)
NP-l0      66M    1.25   1.28 (-2.4%)   1.31 (-4.8%)   1.32 (-5.6%)   1.37 (-9.6%)
FLOP-l0    35M    1.24   1.25 (-0.8%)   1.27 (-2.4%)   1.29 (-4.0%)   1.33 (-7.3%)

Table 5: Further comparison between factorization-based pruning FLOP and input feature pruning NP-l0 (Louizos et al., 2018) using 6-layer SRU models and the Enwik8 dataset. We show BPC at different compression levels and the loss of performance relative to the un-compressed model. Factorization results in less decrease in relative and absolute performance.

Figure 2: Histograms of Hard Concrete parameters during training. We show the changes of the histograms for the first SRU layer (left figure) and the last layer (right figure). We compute the histogram every 3,000 training steps.

For instance, the SRU model used by the unfactorized NP-l0 baseline has a hidden size of 1536, while the factorized baselines with a similar parameter budget use a hidden size of 3056. To avoid a potentially unfair comparison, we also train a larger model with hidden size 2048 containing 90% more parameters, and apply the NP-l0 baseline. This larger model obtains 1.25 BPC, which is on par with the factorized base model used in previous experiments.

Table 5 compares the pruning performance of FLOP and NP-l0 at four compression levels. We show the test BPC and the loss of performance relative to the model without pruning. These results further substantiate our hypothesis: factorization based pruning is able to retain relative model performance much more effectively than input feature pruning.

Speed analysis Thanks to its structured nature, FLOP can achieve significant computation speedups. As shown in Table 6, we achieve an inference speedup ranging from 1.5x to 2.2x at the compression levels tested, using CPUs. Similar speedups of up to 2.4x are also observed using GPUs during training. In contrast, the computations of unstructured sparse matrices are harder to optimize. For models obtained using unstructured AGP, we experimented with the sparse matrix multiplication routine provided in PyTorch (Paszke et al., 2017) and a recent linear algebra compiler (Kjolstad et al., 2017), but were unable to achieve a speedup.

Size   Compress   Time (s)   Speedup
35M    0%         0.39       1.0x
8M     80%        0.21       1.9x
4M     90%        0.18       2.2x
41M    0%         1.33       1.0x
8M     80%        0.87       1.5x
4M     90%        0.82       1.6x

Table 6: Inference timing measurements of character-level language models using SRU (top block) and Transformer-XL (bottom block).

Learning dynamics Figure 2 demonstrates the training dynamics of the Hard Concrete distribution. We plot the histogram of the Hard Concrete parameters α after every few thousand training iterations. A negative value of α indicates that the associated parameter is likely to be pruned, while a positive value indicates the opposite. The magnitude of the value reflects the certainty of the pruning decision. As illustrated by the figure, the distribution of α becomes bi-modal after initial exploration. Certain parameters within each layer are completely pruned while others are kept with (almost) absolute certainty. In addition, the dynamics vary across different layers. For instance, for SRU the first recurrent layer gets pruned more aggressively than the last layer.


8 Conclusion

In this work, we present a generic structured pruning method based on adaptive low-rank factorization. We systematically evaluate the performance of this method on large language models. We show that our method can provide significant speedups and compression rates on large models while losing minimal performance compared to other methods, including unstructured magnitude pruning. This work contributes to reducing the growing overhead of large language models, and shines a light on the role of model capacity in language modeling.

Acknowledgement

We would like to thank ASAPP Inc. for making this work possible. We would also like to thank Hugh Perkins, Sam Bowman, Nicholas Matthews, Josh Shapiro and the other members of the Language Technology and Research teams who helped review this work and contributed their thoughts throughout the project. We would also like to thank the EMNLP reviewers and area chair for their helpful comments.

References

Jimmy Ba and Rich Caruana. 2014. Do deep nets really need to be deep? In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2654–2662. Curran Associates, Inc.

Alexei Baevski and Michael Auli. 2019. Adaptive input representations for neural language modeling. In International Conference on Learning Representations.

Jasmijn Bastings, Wilker Aziz, and Ivan Titov. 2019. Interpretable neural predictions with differentiable binary variables. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2963–2977, Florence, Italy. Association for Computational Linguistics.

Shijie Cao, Chen Zhang, Zhuliang Yao, Wencong Xiao, Lanshun Nie, Dechen Zhan, Yunxin Liu, Ming Wu, and Lintao Zhang. 2019. Efficient and effective sparse LSTM on FPGA with bank-balanced sparsity. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA '19, pages 63–72, New York, NY, USA. Association for Computing Machinery.

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada. Association for Computational Linguistics.

Yew Ken Chia, Sam Witteveen, and Martin Andrews. 2019. Transformer to CNN: Label-scarce distillation for efficient text classification. arXiv preprint arXiv:1909.03508.

Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3079–3087. Curran Associates, Inc.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, Florence, Italy. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

Angela Fan, Edouard Grave, and Armand Joulin. 2020. Reducing transformer depth on demand with structured dropout. In International Conference on Learning Representations.

Jeffrey Flanigan, Sam Thomson, Jaime Carbonell, Chris Dyer, and Noah A. Smith. 2014. A discriminative graph-based parser for the Abstract Meaning Representation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1426–1436, Baltimore, Maryland. Association for Computational Linguistics.

Jonathan Frankle and Michael Carbin. 2019. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations.

Trevor Gale, Erich Elsen, and Sara Hooker. 2019. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574.

Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. 2014. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115.

Edouard Grave, Armand Joulin, Moustapha Cisse, Herve Jegou, et al. 2017. Efficient softmax approximation for GPUs. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1302–1310. JMLR.org.

Scott Gray, Alec Radford, and Diederik P Kingma. 2017. GPU kernels for block-sparse weights. arXiv preprint arXiv:1711.09224.

Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz, and William J Dally. 2016. EIE: Efficient inference engine on compressed deep neural network. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pages 243–254. IEEE.

Song Han, Huizi Mao, and William J Dally. 2015a. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.

Song Han, Jeff Pool, John Tran, and William Dally. 2015b. Learning both weights and connections for efficient neural network. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1135–1143. Curran Associates, Inc.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2019. TinyBERT: Distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351.

Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327, Austin, Texas. Association for Computational Linguistics.

Fredrik Kjolstad, Shoaib Kamil, Stephen Chou, David Lugato, and Saman Amarasinghe. 2017. The tensor algebra compiler. Proc. ACM Program. Lang., 1(OOPSLA).

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations.

Tao Lei, Yu Zhang, Sida I. Wang, Hui Dai, and Yoav Artzi. 2018. Simple recurrent units for highly parallelizable recurrence. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4470–4481, Brussels, Belgium. Association for Computational Linguistics.

Zhongliang Li, Raymond Kulhanek, Shaojun Wang, Yunxin Zhao, and Shuang Wu. 2018. Slim embedding layers for recurrent neural language models. In Thirty-Second AAAI Conference on Artificial Intelligence.

Liyuan Liu, Xiang Ren, Jingbo Shang, Xiaotao Gu, Jian Peng, and Jiawei Han. 2018. Efficient contextualized representation: Language model pruning for sequence labeling. In EMNLP, pages 1215–1225.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Christos Louizos, Karen Ullrich, and Max Welling. 2017. Bayesian compression for deep learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3288–3298. Curran Associates, Inc.

Christos Louizos, Max Welling, and Diederik P. Kingma. 2018. Learning sparse neural networks through l0 regularization. In International Conference on Learning Representations.

Andre F. T. Martins, Mario A. T. Figueiredo, Pedro M. Q. Aguiar, Noah A. Smith, and Eric P. Xing. 2011. An augmented Lagrangian approach to constrained MAP inference. In ICML, pages 169–176.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2018. An analysis of neural language modeling at multiple scales. CoRR, abs/1803.08240.

Dmitry Molchanov, Arsenii Ashukha, and Dmitry P. Vetrov. 2017. Variational dropout sparsifies deep neural networks. In ICML, pages 2498–2507.

Sharan Narang, Erich Elsen, Gregory Diamos, and Shubho Sengupta. 2017a. Exploring sparsity in recurrent neural networks. arXiv preprint arXiv:1704.05119.

Sharan Narang, Eric Undersander, and Gregory Diamos. 2017b. Block-sparse recurrent neural networks. arXiv preprint arXiv:1711.02782.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Alexander M. Rush, David Sontag, Michael Collins, and Tommi Jaakkola. 2010. On dual decomposition and linear programming relaxations for natural language processing. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1–11, Cambridge, MA. Association for Computational Linguistics.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Abigail See, Minh-Thang Luong, and Christopher D. Manning. 2016. Compression of neural machine translation models via pruning. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 291–301, Berlin, Germany. Association for Computational Linguistics.

Yuan Shangguan, Jian Li, Liang Qiao, Raziel Alvarez, and Ian McGraw. 2019. Optimizing speech recognition for the edge. arXiv preprint arXiv:1909.12408.

Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. 2019. Q-BERT: Hessian based ultra low precision quantization of BERT. arXiv preprint arXiv:1909.05840.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.

Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019. Patient knowledge distillation for BERT model compression. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics.

Henry Tsai, Jason Riesa, Melvin Johnson, Naveen Arivazhagan, Xin Li, and Amelia Archer. 2019. Small and practical BERT models for sequence labeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics.

Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-read students learn better: The impact of student initialization on knowledge distillation. arXiv preprint arXiv:1908.08962.

Ehsan Variani, Ananda Theertha Suresh, and Mitchel Weintraub. 2019. WEST: Word encoded sequence transducers. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7340–7344. IEEE.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy. Association for Computational Linguistics.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations.

Wei Wen, Yuxiong He, Samyam Rajbhandari, Minjia Zhang, Wenhan Wang, Fang Liu, Bin Hu, Yiran Chen, and Hai Li. 2018. Learning intrinsic sparse structures within long short-term memory. In International Conference on Learning Representations.

Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, and Ruslan R Salakhutdinov. 2016. On multiplicative integration with recurrent neural networks. In Advances in Neural Information Processing Systems.

Zhuliang Yao, Shijie Cao, Wencong Xiao, Chen Zhang, and Lanshun Nie. 2019. Balanced sparsity for efficient DNN inference on GPU. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5676–5683.

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. arXiv preprint arXiv:1905.07129.

Chenzhuo Zhu, Song Han, Huizi Mao, and William J. Dally. 2017. Trained ternary quantization. In 5th International Conference on Learning Representations, ICLR.

Michael Zhu and Suyog Gupta. 2017. To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878.

Neta Zmora, Guy Jacob, Lev Zlotnik, Bar Elharar, and Gal Novik. 2019. Neural network distiller: A Python package for DNN compression research.


A Appendix

A.1 Optimization details

In our implementation, FLOP trains the factorized model for a number of warmup epochs and then starts pruning. The other pruning baselines use the same warmup training process, except that the FAC baseline directly trains a smaller factorized model from scratch. Recall that our augmented Lagrangian training objective during pruning is

\[ \max_{\lambda_1, \lambda_2}\; \min_{\theta, \alpha}\; \mathbb{E}_u\!\left[\frac{1}{D}\sum_{i=1}^{D} \mathcal{L}(x_i, y_i; \tilde{\theta})\right] + g(\lambda, \alpha), \]

\[ g(\lambda, \alpha) = \lambda_1 \cdot (s(\alpha) - t) + \lambda_2 \cdot (s(\alpha) - t)^2. \]

We gradually increase the target size t at a linear rate. That is, given the desired size t_max, we set the sparsity at the k-th pruning iteration as

\[ t_k = \min\!\left(1, \frac{k}{m}\right) \cdot t_{\max}, \]

where m is a hyperparameter specifying the number of annealing steps.

The Lagrangian multipliers are initialized to zero at the start of training. We perform joint gradient updates for the parameters and the Lagrangian multipliers at every iteration, but use and tune a different learning rate for the Lagrangian multipliers. For each training batch, we sample the pruning mask z = {z_1, · · · , z_n} and share it across the training examples within the batch. Since the pruning mask is shared, we can select the parameters that are active for the current batch and compute smaller matrix multiplications in the forward and backward passes. This can result in a training speedup when z becomes sparse.

A.2 Experimental Details

Our experiments are performed using the standard train/dev/test splits of the Wiki-103, Enwik8 and GLUE benchmarks. We describe the training configurations in the following paragraphs. A detailed experimental setup can be found at https://github.com/asappresearch/flop.

SRU Following the practice of Lei et al. (2018), for the Enwik8 dataset we train a 6-layer SRU model using a batch size of 64 and an unroll length of 256. We use a hidden size of 3056 and set the initial factorization dimension r of the parameter matrices to 512. That is, we replace each weight matrix W in SRU with an explicit factorization PQ with an inner dimension of 512. We train the model without pruning for 30 warmup epochs, and start pruning for a maximum of 100 epochs.

For the Wiki-103 dataset, our 12-layer SRU base model uses a hidden dimension of 2048 and a factorization dimension of 512 for the weight matrices in SRU. Following Baevski and Auli (2019), the adaptive embedding layer uses 1024, 256 and 64 dimensions respectively for the 20K most frequent words, the 40K less frequent words and the remaining least frequent words. We train for 50 warm-up epochs and run the pruning process for an additional 100 epochs. We use a batch size of 64 or 96 and an unroll length of 256.

For all SRU runs, we use inverse-square-root learning rate scheduling (Vaswani et al., 2017) and a learning rate of l_0/√d, where d is the hidden size and l_0 is the initial factor. We set l_0 ∈ {2, 3} for the model parameters. For AGP methods, we tune the start and end epoch of the compression scheduler. For l0 regularization, we tune the learning rate l_0 ∈ {3, · · · , 6} for the Lagrangian multipliers.
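For reference, a minimal sketch of an inverse-square-root schedule with the base scale l_0/√d mentioned above is shown below; the warmup length is a hypothetical value, not one taken from the paper.

```python
def inv_sqrt_lr(step, d, l0=2.0, warmup=4000):
    """Inverse-square-root schedule (Vaswani et al., 2017) scaled by l0 / sqrt(d)."""
    step = max(step, 1)
    return (l0 / d ** 0.5) * min(step ** -0.5, step * warmup ** -1.5)

print(inv_sqrt_lr(step=10000, d=3056))   # e.g. with the Enwik8 SRU hidden size
```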

Transformer-XL Following Dai et al. (2019),we train Transformer-XL base model using cosinelearning rate scheduling. For the 12-layer basemodel, we train a maximum of 200k iterations,a batch size of 48 and an initial learning rate of0.0003 and use 8 GPUs in parallel. For pruningruns, we train up to 300k iterations using a learningrate of 0.00025, a batch size of 32 and 4 GPUsin parallel for each run. We use the same inverse-square-root learning rate scheduling for Lagrangianmultipliers and set l0 ∈ {0.5, 1.0, 1.5}.