Proceedings of 3rd Workshop on Structured Prediction for NLP



NAACL HLT 2019

Structured Prediction for NLP

Proceedings of the Third Workshop

June 7, 2019, Minneapolis, Minnesota, USA

©2019 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
[email protected]

ISBN 978-1-950737-10-9


Introduction

Welcome to the Third Workshop on Structured Prediction for NLP!

Structured prediction has a strong tradition within the natural language processing (NLP) community, owing to the discrete, compositional nature of words and sentences, which leads to natural combinatorial representations such as trees, sequences, segments, or alignments, among others. It is no surprise that structured output models have been successful and popular in NLP applications since their inception. Many NLP tasks, including, but not limited to, semantic parsing, slot filling, machine translation, and information extraction, are commonly modeled as structured problems, and accounting for said structure has often led to performance gains.

Of late, continuous representation learning via neural networks has been a significant complementary direction, leading to improvements in unsupervised and semi-supervised pre-training, transfer learning, domain adaptation, etc. Using word embeddings as features for structured models such as part-of-speech taggers was among the very first uses of continuous embeddings in NLP, and the symbiosis between the two approaches is an exciting research direction today.

The five papers (as well as three additional non-archival papers) accepted for presentation in this edition of the workshop, after double-blind peer review, all explore this interplay between structure and neural data representations from different, important points of view. The program includes work on structure-informed representation learning, transfer learning, partial supervision, and parallelization of computation in structured computation graphs. Our program also includes six invited presentations from influential researchers.

Our warmest thanks go to the program committee, for their time and effort providing valuable feedback; to all submitting authors, for their thought-provoking work; and to the invited speakers, for doing us the honor of joining our program. We look forward to seeing you in Minneapolis!

Zornitsa Kozareva
Julia Kreutzer
Gerasimos Lampouras
André Martins
Vlad Niculae
Sujith Ravi
Andreas Vlachos


Organizers:

Zornitsa Kozareva, Google, USA
Julia Kreutzer, Heidelberg University, Germany
Gerasimos Lampouras, University of Cambridge, UK
André F.T. Martins, Unbabel & University of Lisbon, Portugal
Vlad Niculae, University of Lisbon, Portugal
Sujith Ravi, Google, USA
Andreas Vlachos, University of Cambridge, UK

Program Committee:

Yoav Artzi, Cornell University, USA
Wilker Aziz, University of Amsterdam, Netherlands
Joost Bastings, University of Amsterdam, Netherlands
Xilun Chen, Cornell University, USA
Shay Cohen, University of Edinburgh, UK
Hal Daumé, Microsoft & University of Maryland, USA
Kevin Gimpel, TTI Chicago, USA
Matt Gormley, CMU, USA
Arzoo Katiyar, Cornell University, USA
Yoon Kim, Harvard University, USA
Parisa Kordjamshidi, Tulane University, USA
Amandla Mabona, University of Cambridge, UK
Pranava Madhyastha, Imperial College London, UK
Sebastian Mielke, Johns Hopkins University, USA
Roi Reichart, Technion - Israel Institute of Technology, Israel
Marek Rei, University of Cambridge, UK
Stefan Riezler, Heidelberg University, Germany
Hiko Schamoni, Heidelberg University, Germany
Tianze Shi, Cornell University, USA
Artem Sokolov, Amazon, Germany
Vivek Srikumar, University of Utah, USA
Ivan Titov, University of Edinburgh, Scotland
Luke Zettlemoyer, University of Washington, USA

Invited Speakers:

Claire Cardie, Cornell University, USA
Chris Dyer, DeepMind, UK
Jason Eisner, Johns Hopkins University, USA
Hannaneh Hajishirzi, University of Washington, USA
He He, Stanford University, USA
Andrew McCallum, University of Massachusetts Amherst, USA


Table of Contents

Parallelizable Stack Long Short-Term Memory
Shuoyang Ding and Philipp Koehn .......................................... 1

Tracking Discrete and Continuous Entity State for Process Understanding
Aditya Gupta and Greg Durrett .............................................. 7

SPARSE: Structured Prediction using Argument-Relative Structured Encoding
Rishi Bommasani, Arzoo Katiyar and Claire Cardie ......................... 13

Lightly-supervised Representation Learning with Global Interpretability
Andrew Zupon, Maria Alexeeva, Marco Valenzuela-Escárcega, Ajay Nagesh and Mihai Surdeanu ......................... 18

Semi-Supervised Teacher-Student Architecture for Relation Extraction
Fan Luo, Ajay Nagesh, Rebecca Sharp and Mihai Surdeanu ................... 29


Workshop Program

Friday, June 7, 2019

9:00–9:10 Opening Remarks

9:10–9:50 Invited Talk: Andrew McCallum

9:50–10:30 Invited Talk: Hannaneh Hajishirzi

10:30–11:00 Coffee Break

11:00–11:40 Invited Talk: He He

11:40–12:10 Session 1: Contributed Talks

11:40–11:55 Parallelizable Stack Long Short-Term Memory
Shuoyang Ding and Philipp Koehn

11:55–12:10 Tracking Discrete and Continuous Entity State for Process Understanding
Aditya Gupta and Greg Durrett

12:10–14:00 Lunch

14:00–14:40 Invited Talk: Chris Dyer

14:40–15:00 Poster Spotlight

15:00–16:00 Poster Session (and non-archival papers)

SPARSE: Structured Prediction using Argument-Relative Structured Encoding
Rishi Bommasani, Arzoo Katiyar and Claire Cardie


Friday, June 7, 2019 (continued)

Lightly-supervised Representation Learning with Global Interpretability
Andrew Zupon, Maria Alexeeva, Marco Valenzuela-Escárcega, Ajay Nagesh and Mihai Surdeanu

Semi-Supervised Teacher-Student Architecture for Relation Extraction
Fan Luo, Ajay Nagesh, Rebecca Sharp and Mihai Surdeanu

15:30–16:00 Coffee Break (With Posters)

16:00–16:40 Invited Talk: Claire Cardie

16:40–17:20 Invited Talk: Jason Eisner


Proceedings of the 3rd Workshop on Structured Prediction for NLP, pages 1–6, Minneapolis, Minnesota, June 7, 2019. ©2019 Association for Computational Linguistics

Parallelizable Stack Long Short-Term Memory

Shuoyang Ding    Philipp Koehn
Center for Language and Speech Processing
Johns Hopkins University
{dings, phi}@jhu.edu

Abstract

Stack Long Short-Term Memory (StackLSTM) is useful for various applications such as parsing and string-to-tree neural machine translation, but it is also known to be notoriously difficult to parallelize for GPU training because its computations depend on discrete operations. In this paper, we tackle this problem by utilizing the state access patterns of StackLSTM to homogenize computations across different discrete operations. Our parsing experiments show that the method scales up almost linearly with increasing batch size, and that our parallelized PyTorch implementation trains significantly faster than the DyNet C++ implementation.

1 Introduction

Tree-structured representations of language have been successfully applied to various applications including dependency parsing (Dyer et al., 2015), sentiment analysis (Socher et al., 2011) and neural machine translation (Eriguchi et al., 2017). However, most of the neural network architectures used to build tree-structured representations are not able to exploit the full parallelism of GPUs through minibatched training, as the computation performed for each instance is conditioned on the input/output structures and hence cannot be naïvely grouped together as a batch. This lack of parallelism is one of the major hurdles that prevent these representations from wider practical adoption (e.g., in neural machine translation), as many natural language processing tasks currently require the ability to scale up to very large training corpora in order to reach state-of-the-art performance.

We seek to advance the state of the art on this problem by proposing a parallelization scheme for one such network architecture, the Stack Long Short-Term Memory (StackLSTM) proposed in Dyer et al. (2015). This architecture has been successfully applied to dependency parsing (Dyer et al., 2015, 2016; Ballesteros et al., 2017) and syntax-aware neural machine translation (Eriguchi et al., 2017) in the previous research literature, but none of these results were produced with minibatched training. We show that our parallelization scheme is feasible in practice by demonstrating that it scales up near-linearly with increasing batch size, while reproducing a set of results reported in Ballesteros et al. (2017).

2 StackLSTM

StackLSTM (Dyer et al., 2015) is an LSTM architecture (Hochreiter and Schmidhuber, 1997) augmented with a stack H that stores some of the hidden states built in the past. Unlike traditional LSTMs, which always build the state ht from ht−1, the states of StackLSTM are built from the head of the state stack H, maintained by a stack top pointer p(H). At each time step, StackLSTM takes a real-valued input vector together with an additional discrete operation on the stack, which determines what computation needs to be conducted and how the stack top pointer should be updated. Throughout this section, we index the input vector (e.g., word embeddings) xt by the time step t at which it is fed into the network, and hidden states in the stack hj by their position j in the stack H, where j is the 0-based index starting from the stack bottom.

The set of input discrete actions typically contains at least Push and Pop operations. When these operations are taken as input, the corresponding computations on the StackLSTM are listed below:1

• Push: read the previous hidden state hp(H), perform the LSTM forward computation with xt and hp(H), write the new hidden state to hp(H)+1, and update the stack top pointer with p(H) ← p(H) + 1.

1To simplify the presentation, we omit the updates on cell states, because in practice the operations performed on cell states and hidden states are the same.


Transition System   Transition Op   Stack Op          Buffer Op   Composition Op
Arc-Standard        Shift           push              pop         none
                    Left-Arc        pop, pop, push    hold        S1 ← g(S0, S1)
                    Right-Arc       pop               hold        S1 ← g(S1, S0)
Arc-Eager           Shift           push              pop         none
                    Reduce          pop               hold        none
                    Left-Arc        pop               hold        B0 ← g(B0, S0)
                    Right-Arc       push              pop         B0 ← g(S0, B0)
Arc-Hybrid          Shift           push              pop         none
                    Left-Arc        pop               hold        B0 ← g(B0, S0)
                    Right-Arc       pop               hold        S1 ← g(S1, S0)

Table 1: Correspondence between transition operations and stack/buffer operations for StackLSTM, where g denotes the composition function proposed by Dyer et al. (2015). S0 and B0 refer to the token-level representations corresponding to the top elements of the stack and buffer, while S1 and B1 refer to those second to the top. We use a different notation here to avoid confusion with the states in StackLSTM, which represent non-local information beyond the token level.

hp(H), write new hidden state to hp(H)+1 , up-date stack top pointer with p(H)← p(H)+1.

• Pop: update stack top pointer with p(H) ←p(H)− 1.

Reflecting on the aforementioned discussion of parallelism, one should notice that StackLSTM falls into the category of neural network architectures for which minibatched training is difficult. This is because the computation performed by StackLSTM at each time step depends on the discrete input actions. The following section proposes a solution to this problem.

3 Parallelizable StackLSTM

Continuing the formulation in the previous section, we start by discussing our proposed solution for the case where the set of discrete actions contains only Push and Pop operations; we then move on to discuss the applicability of our proposed solution to the transition systems that are used for building representations of dependency trees.

The first modification we make to the Push and Pop operations above is to unify their pointer updates as p(H) ← p(H) + op, where op is the input discrete operation, taking the value +1 for Push and −1 for Pop. After this modification, we make the following observations:

Observation 1 The computation performed for the Pop operation is a subset of that performed for the Push operation.

What remains in order to homogenize the Push and Pop operations is to conduct the extra computation needed for the Push operation when Pop is fed in as well, while guaranteeing the correctness of the resulting hidden state both at the current time step and in the future. The next observation points out a way to provide this guarantee:

Observation 2 In a StackLSTM, given the current stack top pointer position p(H), any hidden state hi where i > p(H) will not be read until it is overwritten by a Push operation.

What follows from this observation is the guarantee that we can always safely overwrite hidden states hi that are indexed higher than the current stack top pointer, because any read operation on these states is known to happen only after another overwrite. This allows us to do the extra computation anyway when a Pop operation is fed in, because the extra computation, in particular updating hp(H)+1, will not harm the validity of the hidden states at any time step.

Algorithm 1 gives the final forward computation for the Parallelizable StackLSTM. Note that this algorithm does not contain any if-statements that depend on stack operations and hence is homogeneous when grouped into batches consisting of multiple operation trajectories.

In transition systems (Nivre, 2008; Kuhlmann et al., 2011) used in real tasks (e.g., transition-based parsing), as shown in Table 1, more than push and pop operations are needed for the StackLSTM.


Algorithm 1: Forward Computation for Parallelizable StackLSTM
Input: input vector xt; discrete stack operation op
Output: current top hidden state hp(H)
  h_prev ← hp(H)
  h ← LSTM(xt, h_prev)
  hp(H)+1 ← h
  p(H) ← p(H) + op
  return hp(H)

Fortunately, for the Arc-Eager and Arc-Hybrid transition systems, we can simply add a hold operation, denoted by the value 0 for the discrete operation input. For that reason, we focus on parallelizing these two transition systems in this paper. It should be noted that both observations discussed above remain valid after adding the hold operation.

4 Experiments

4.1 Setup

We implemented2 the architecture described above in PyTorch (Paszke et al., 2017). We implemented the batched stack as a float tensor wrapped in a non-leaf variable, thus enabling in-place operations on that variable. At each time step, the batched stack is queried/updated with a batch of stack head positions represented by an integer vector, an operation made possible by the gather operation and advanced indexing. Due to this implementation choice, the stack size has to be determined at initialization time and cannot be dynamically grown. Nonetheless, a fixed stack size of 150 works for all the experiments we conducted.
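To make the batching concrete, below is a minimal PyTorch sketch of one step of the batched stack update (Algorithm 1). It is illustrative rather than the released implementation: the class and variable names are ours, and it clones the stack tensors for clarity instead of relying on in-place updates of a non-leaf variable as described above.

```python
import torch
import torch.nn as nn

class BatchedStackLSTM(nn.Module):
    """Sketch of the batched Parallelizable StackLSTM step (Algorithm 1)."""

    def __init__(self, input_size, hidden_size, stack_size=150):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.stack_size = stack_size
        self.hidden_size = hidden_size

    def init_state(self, batch_size, device=None):
        # position 0 serves as the empty-stack state; the pointer starts there
        h = torch.zeros(batch_size, self.stack_size, self.hidden_size, device=device)
        c = torch.zeros(batch_size, self.stack_size, self.hidden_size, device=device)
        ptr = torch.zeros(batch_size, dtype=torch.long, device=device)
        return h, c, ptr

    def forward(self, x, op, state):
        # x:  (batch, input_size) real-valued inputs for this time step
        # op: (batch,) long tensor in {-1, 0, +1} for pop / hold / push
        h_stack, c_stack, ptr = state
        rows = torch.arange(x.size(0), device=x.device)

        # read the current stack top of every instance (advanced indexing)
        h_prev = h_stack[rows, ptr]
        c_prev = c_stack[rows, ptr]

        # run the LSTM cell unconditionally, whatever the operation is
        h, c = self.cell(x, (h_prev, c_prev))

        # always write to ptr + 1; Observation 2 guarantees this is safe
        h_stack = h_stack.clone()
        c_stack = c_stack.clone()
        h_stack[rows, ptr + 1] = h
        c_stack[rows, ptr + 1] = c

        # unified pointer update; a valid oracle never pops an empty stack
        ptr = ptr + op
        return h_stack[rows, ptr], (h_stack, c_stack, ptr)
```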

We use the dependency parsing task to evaluate the correctness and the scalability of our method. For comparison with previous work, we follow the architecture introduced in Dyer et al. (2015) and Ballesteros et al. (2017) and choose the Arc-Hybrid transition system. We follow the data setup of Chen and Manning (2014), Dyer et al. (2015), and Ballesteros et al. (2017), using the Stanford Dependency Treebank (de Marneffe et al., 2006) for dependency parsing, and we extract the Arc-Hybrid static oracle using the code associated with Qi and Manning (2017). The part-of-speech (POS) tags are generated with the Stanford POS tagger (Toutanova et al., 2003).

2https://github.com/shuoyangd/hoolock

[Figure 1 plots training speed (sentences/second) against batch size on a log-scale x-axis, with the Dyer et al. (2015) implementation shown as a reference line.]

Figure 1: Training speed at different batch sizes. Note that the x-axis is in log scale in order to show all the data points properly.

The tagger achieves a test set accuracy of 97.47%. We use exactly the same pre-trained English word embeddings as Dyer et al. (2015).

We use Adam (Kingma and Ba, 2014) as the optimization algorithm. Following Goyal et al. (2017), we apply linear warmup to the learning rate, with an initial value of τ = 5 × 10−4 over a total of 5 epochs. The target learning rate is set to τ multiplied by the batch size, but capped at 0.02 because we find Adam to be unstable beyond that learning rate. After warmup, we reduce the learning rate by half every time the loss value on the development set fails to improve (ReduceLROnPlateau). We clip all gradient norms to 5.0 and apply L2 regularization with weight 1 × 10−6.
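A minimal PyTorch sketch of this schedule is given below; `model`, `train_loader`, and `dev_loss` are hypothetical placeholders, the warmup interpolation is our reading of the description rather than the authors' script, and the L2 term is approximated with Adam's weight decay.

```python
import torch

tau, batch_size, warmup_epochs, num_epochs = 5e-4, 64, 5, 30   # illustrative values
target_lr = min(tau * batch_size, 0.02)                        # cap where Adam becomes unstable

# weight_decay implements the L2 regularization with weight 1e-6
optimizer = torch.optim.Adam(model.parameters(), lr=tau, weight_decay=1e-6)
plateau = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.5)

for epoch in range(num_epochs):
    if epoch < warmup_epochs:
        # linear warmup from tau to the capped target learning rate
        lr = tau + (target_lr - tau) * epoch / (warmup_epochs - 1)
        for group in optimizer.param_groups:
            group["lr"] = lr
    for x, op, y in train_loader:
        optimizer.zero_grad()
        loss = model(x, op, y)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)   # clip gradient norms to 5.0
        optimizer.step()
    if epoch >= warmup_epochs:
        plateau.step(dev_loss(model))   # halve lr when development loss stops improving
```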

We started with the hyperparameter choices in Dyer et al. (2015) but made some modifications based on performance on the development set: we use a hidden dimension of 200 for all LSTM units, 200 for the parser state representation before the final softmax layer, and an embedding dimension of 48 for the action embedding.

We use a Tesla K80 for all experiments, in order to compare with Neubig et al. (2017b) and Dyer et al. (2015). We also use the same hyperparameter setting as Dyer et al. (2015) for the speed comparison experiments. All speeds are measured by running through one training epoch and averaging.

4.2 Results

Figure 1 shows the training speed at different batch sizes up to 256.3

3At a batch size of 512, the longest sentence in the training data cannot fit onto the GPU.


          dev              test
b         UAS     LAS      UAS     LAS
1*        92.50*  89.79*   92.10*  89.61*
8         92.93   90.42    92.54   90.11
16        92.62   90.19    92.53   90.13
32        92.43   89.89    92.31   89.94
64        92.53   90.04    92.22   89.73
128       92.39   89.73    92.55   90.02
256       92.15   89.46    91.99   89.43

Table 2: Dependency parsing results with various training batch sizes b and without the composition function. Results marked with asterisks were reported in Ballesteros et al. (2017).

The speed-up of our model is close to linear, which means there is very little overhead associated with our batching scheme. Quantitatively, according to Amdahl's Law (Amdahl, 1967), the proportion of parallelized computations is 99.92% at batch size 64. We also compared our implementation with the implementation that accompanies Dyer et al. (2015), which is written in C++ with DyNet (Neubig et al., 2017a). DyNet is known to be highly optimized for CPU computation, and hence their implementation is reasonably fast even without batching and GPU acceleration, as shown in Figure 1.4 But we would like to point out that we focus on the speed-up we are able to obtain rather than the absolute speed, and that our batching scheme is framework-universal; superior speed might be obtained by combining our scheme with alternative frameworks or languages (for example, the Torch C++ interface).
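To see where that 99.92% comes from, one can invert Amdahl's Law, treating the batch size as the degree of parallelism N and using the roughly 60.8x speed-up observed at batch size 64 (Section 5); this back-of-the-envelope reading is ours, not a calculation spelled out in the paper:

    S(N) = \frac{1}{(1-p) + p/N}
    \quad\Longrightarrow\quad
    p = \frac{1 - 1/S(N)}{1 - 1/N} \approx \frac{1 - 1/60.8}{1 - 1/64} \approx 0.9992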

The dependency parsing results are shown in Table 2. Our implementation yields better test set performance than that reported in Ballesteros et al. (2017) for all batch size configurations except 256, where we observe a modest performance loss. Like Goyal et al. (2017), Keskar et al. (2016), and Masters and Luschi (2018), we initially observed more significant test-time performance deterioration (around 1% absolute difference) for models trained without learning rate warmup, and concurring with the findings in Goyal et al. (2017), we find warmup very helpful for stabilizing large-batch training. We did not run experiments with batch sizes below 8, as they are too slow due to Python's inherent performance issues.

4Measured on one core of an Intel Xeon E7-4830 CPU.

Python’s inherent performance issue.

5 Related Work

DyNet has support for automatic minibatching (Neubig et al., 2017b), which figures out what computation can be batched by traversing the computation graph to find homogeneous computations. While we cannot directly compare with that framework's automatic batching solution for StackLSTM5, we can draw a loose comparison to the results reported in that paper for BiLSTM transition-based parsing (Kiperwasser and Goldberg, 2016). Comparing batch size 64 to batch size 1, they obtained a 3.64x speed-up on CPU and a 2.73x speed-up on a Tesla K80 GPU, while our architecture-specific manual batching scheme obtained a 60.8x speed-up. The main reason for this difference is that their graph-traversing automatic batching scheme carries a much larger overhead compared to our manual batching approach.

Another toolkit that supports automatic minibatching is Matchbox6, which operates by analyzing the single-instance model definition and deterministically converting the operations into their minibatched counterparts. While such a mechanism eliminates the need to traverse the whole computation graph, it cannot homogenize the operations in each branch of an if-statement. Instead, it needs to perform each operation separately and apply masking to the result, while our method does not require any masking. Unfortunately, we are also not able to compare with this toolkit at the time of this work, as it lacks support for several operations we need.

Similar in spirit to our work, Bowman et al. (2016) attempted to parallelize StackLSTM by using a thin stack, a data structure that reduces the space complexity by storing all the intermediate stack top elements in a tensor and using a queue to control element access. However, thanks to PyTorch, our implementation does not directly depend on the notion of a thin stack. Instead, when an element is popped from the stack, we simply shift the stack top pointer and potentially re-write the corresponding sub-tensor later. In other words, there is no need for us to directly maintain all the intermediate stack top elements, because in PyTorch, when an element in the stack is re-written, its underlying sub-tensor will not be destructed.

5This is due to the fact that DyNet automatic batching cannot handle graph structures that depend on runtime input values, which is the case for StackLSTM.

6https://github.com/salesforce/matchbox


This is because there are still nodes in the computation graph that point to it. Hence, when performing back-propagation, the gradient is still able to flow back to the elements that were previously popped from the stack and their respective precedents, and we also effectively store each intermediate stack top element only once. Besides, Bowman et al. (2016) did not attempt to eliminate the conditional branches in the StackLSTM algorithm, which is the main algorithmic contribution of this work.

6 Conclusion

We propose a parallelizable version of StackLSTM that is able to fully exploit GPU parallelism by performing minibatched training. Empirical results show that our parallelization scheme yields comparable performance to previous work, and that our method scales nearly linearly with increasing batch size.

Because our parallelization scheme is based on the observations made in Section 3, we cannot yet efficiently incorporate batching for either the Arc-Standard transition system or the token-level composition function proposed in Dyer et al. (2015). We leave the parallelization of these architectures to future work.

Our parallelization scheme makes it feasible to run large-data experiments for various tasks that require large training data to perform well, such as RNNG-based syntax-aware neural machine translation (Eriguchi et al., 2017).

Acknowledgement

The authors would like to thank Peng Qi for helping with data preprocessing and James Bradbury for helpful technical discussions. This material is based upon work supported in part by the DARPA LORELEI and IARPA MATERIAL programs.

References

Gene M. Amdahl. 1967. Validity of the single processor approach to achieving large scale computing capabilities. In American Federation of Information Processing Societies: Proceedings of the AFIPS '67 Spring Joint Computer Conference, April 18-20, 1967, Atlantic City, New Jersey, USA, pages 483–485.

Miguel Ballesteros, Chris Dyer, Yoav Goldberg, and Noah A. Smith. 2017. Greedy transition-based dependency parsing with stack LSTMs. Computational Linguistics, 43(2):311–347.

Samuel R. Bowman, Jon Gauthier, Abhinav Rastogi, Raghav Gupta, Christopher D. Manning, and Christopher Potts. 2016. A fast unified model for parsing and sentence understanding. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.

Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 740–750.

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pages 334–343.

Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12-17, 2016, pages 199–209.

Akiko Eriguchi, Yoshimasa Tsuruoka, and Kyunghyun Cho. 2017. Learning to parse and translate improves neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 2: Short Papers, pages 72–78.

Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch SGD: Training ImageNet in 1 hour. CoRR, abs/1706.02677.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. 2016. On large-batch training for deep learning: Generalization gap and sharp minima. CoRR, abs/1609.04836.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. CoRR, abs/1603.04351.

Marco Kuhlmann, Carlos Gómez-Rodríguez, and Giorgio Satta. 2011. Dynamic programming algorithms for transition-based dependency parsers. In The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA, pages 673–682.

Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of the Fifth International Conference on Language Resources and Evaluation, LREC 2006, Genoa, Italy, May 22-28, 2006, pages 449–454.

Dominic Masters and Carlo Luschi. 2018. Revisiting small batch training for deep neural networks. CoRR, abs/1804.07612.

Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Gaurav Kumar, Chaitanya Malaviya, Paul Michel, Yusuke Oda, Matthew Richardson, Naomi Saphra, Swabha Swayamdipta, and Pengcheng Yin. 2017a. DyNet: The dynamic neural network toolkit. CoRR, abs/1701.03980.

Graham Neubig, Yoav Goldberg, and Chris Dyer. 2017b. On-the-fly operation batching in dynamic computation graphs. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 3974–3984.

Joakim Nivre. 2008. Algorithms for deterministic incremental dependency parsing. Computational Linguistics, 34(4):513–553.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.

Peng Qi and Christopher D. Manning. 2017. Arc-swift: A novel transition system for dependency parsing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 2: Short Papers, pages 110–117.

Richard Socher, Cliff Chiung-Yu Lin, Andrew Y. Ng, and Christopher D. Manning. 2011. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pages 129–136.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, HLT-NAACL 2003, Edmonton, Canada, May 27 - June 1, 2003.


Proceedings of the 3rd Workshop on Structured Prediction for NLP, pages 7–12, Minneapolis, Minnesota, June 7, 2019. ©2019 Association for Computational Linguistics

Tracking Discrete and Continuous Entity State for Process Understanding

Aditya Gupta and Greg Durrett
Department of Computer Science
The University of Texas at Austin
{agupta,gdurrett}@cs.utexas.edu

Abstract

Procedural text, which describes entities and their interactions as they undergo some process, depicts entities in a uniquely nuanced way. First, each entity may have some observable discrete attributes, such as its state or location; modeling these involves imposing global structure and enforcing consistency. Second, an entity may have properties which are not made explicit but can be effectively induced and tracked by neural networks. In this paper, we propose a structured neural architecture that reflects this dual nature of entity evolution. The model tracks each entity recurrently, updating its hidden continuous representation at each step to contain relevant state information. The global discrete state structure is explicitly modelled with a neural CRF over the changing hidden representation of the entity. This CRF can explicitly capture constraints on entity states over time, enforcing that, for example, an entity cannot move to a location after it is destroyed. We evaluate the performance of our proposed model on QA tasks over process paragraphs in the PROPARA dataset (Dalvi et al., 2018) and find that our model achieves state-of-the-art results.

1 Introduction

Many reading comprehension question answering tasks (Richardson et al., 2013; Rajpurkar et al., 2016; Joshi et al., 2017) require looking at primarily one point in the passage to answer each question, or sometimes two or three (Yang et al., 2018; Welbl et al., 2018). As a result, modeling surface-level correspondences can work well (Seo et al., 2017) and holistic passage comprehension is not necessary. However, certain QA settings require deeper analysis by focusing specifically on entities, asking questions about their states over time (Weston et al., 2015; Long et al., 2016), combination in recipes (Bosselut et al., 2018), and participation in scientific processes (Dalvi et al., 2018). These settings then suggest more highly structured models as a way of dealing with the more highly structured tasks. One crucial aspect of such texts is the way an entity's state evolves, with both discrete (observable state and location changes) and continuous (changes in unobserved hidden attributes) phenomena going on. Additionally, the discrete changes unfold in a way that maintains state consistency: an entity cannot be destroyed before it even starts to exist.

In this work, we present a model which both recurrently tracks the entity in a continuous space and imposes discrete constraints using a conditional random field (CRF). We focus on the scientific process understanding setting introduced in Dalvi et al. (2018). For each entity, we instantiate a sentence-level LSTM to distill continuous state information from each of that entity's mentions. Separate LSTMs integrate entity-location information into this process. These continuous components then produce potentials for a sequential CRF tagging layer, which predicts discrete entity states. The CRF's problem-specific tag scheme, along with transition constraints, ensures that the model's predictions of these observed entity properties are structurally coherent. For example, in procedural texts, this involves ensuring existence before destruction and unique creation and destruction points. Because we use global inference, identifying implicit creation or destruction events is made easier, since the model resolves conflicts among competing time steps and chooses the best time step for these events during sequence prediction.

Past approaches in the literature have typically been end-to-end continuous task-specific frameworks (Henaff et al., 2017; Bosselut et al., 2018), sometimes for tasks that are simpler and more synthetic (Weston et al., 2015), or continuous entity-centric neural language models (Clark et al., 2018; Ji et al., 2017).


[Figure 1 shows the process text, the raw Turker annotations, the task-specific sequence tags (Sec. 2.1), and the model: per-entity LSTMs feeding emission potentials into a CRF layer for discrete state tags, alongside a separate continuous location tracking LSTM per entity-location pair.]

Figure 1: Task and our proposed model. Top: raw text descriptions are annotated with entity-state change information; we modify this in a rule-based way for our model. Bottom: our model. Entity mention and verb information is aggregated in per-entity LSTMs (right). A CRF layer then predicts entity state. A separate sentence-level LSTM (left) tracks each entity-location pair using the combined entity and location mention information.

For process understanding specifically, past work has effectively captured global information (Dalvi et al., 2018) and temporal characteristics (Das et al., 2019). However, these models do not leverage the structural constraints of the problem, or only handle them heuristically (Tandon et al., 2018). We find that our model outperforms these past approaches on the PROPARA dataset of Dalvi et al. (2018), with a significant boost on questions concerning entity state, regardless of location.

2 Model

We propose a structured neural model for the process paragraph comprehension task of Dalvi et al. (2018). An example from their dataset is shown in Figure 1. It consists of annotation over a process paragraph w = {w_1, ..., w_P} of P tokens, described by a sequence of T sentences s = {s_1, ..., s_T}. A pre-specified set of entities E = {e_1, ..., e_m} is given as well. For each entity, gold annotation is provided consisting of the state (EXISTS, MOVES, etc.) and location (soil, leaf) after each sentence. From this information, a set of questions about the process can be answered deterministically, as outlined in Tandon et al. (2018).

Our model, as depicted in Figure 1, consists of two core modules: (i) state tracking and (ii) location tracking. We follow past work on neural CRFs (Collobert et al., 2011; Durrett and Klein, 2015; Lample et al., 2016), leveraging continuous LSTMs to distill information and a discrete CRF layer for prediction.

2.1 State Tracking

This part of the model is charged with modeling each entity's state over time. Our model places a distribution over state sequences y given a passage w and an entity e: P(y | w, e).

Contextual Embeddings Our model first computes contextual embeddings for each word in the paragraph using a single-layer bidirectional LSTM. Each token w_i is encoded as a vector x_i = [emb(w_i); v_i], which serves as input to the LSTM. Here, emb(w_i) ∈ R^{d_1} is an embedding for the word produced by either pre-trained GloVe (Pennington et al., 2014) or ELMo (Peters et al., 2018) embeddings, and v_i is a binary scalar indicating whether the current word is a verb. We denote by h_i = LSTM([x_i]) the LSTM's output for the i-th token in w.

Entity Tracking LSTM To track entities across sentences for state changes, we use another task-specific bidirectional LSTM on top of the base LSTM, which operates at the sentence level. The aim of this BiLSTM is to get a continuous representation of the entity's state at each time step, since not all time steps mention that entity. This representation can capture long-range information about the entity's state which may not be summarized in the discrete representation.

For a fixed entity e and each sentence s_t in the paragraph, the input to the entity tracking LSTM is the contextual embedding of the mention location1 of entity e in s_t, or a mask vector when the entity is not present in s_t. Let x^e_t denote the representation of entity e in sentence t. Then

    x^e_t = \begin{cases} [h^e_t ; h^v_t] & \text{if } e \in s_t \\ \mathbf{0} & \text{otherwise} \end{cases}    (1)

where h^e_t and h^v_t denote the contextual embeddings of the entity and the associated verb, respectively, from the base BiLSTM. In the case of multiple tokens, mean pooling over the token representations is used. The information about the verb is extracted using POS tags from an off-the-shelf POS tagger. The entity tracking LSTM then produces representations h^e_t = LSTM([x^e_t]).

Neural CRF We use the output of the entity tracking BiLSTM to generate emission potentials for each tag in our possible tag set at each time step t:

    \phi(y_t, t, \mathbf{w}, e) = W_{y_t}^{\top} h^e_t    (2)

where W is a learnable parameter matrix. For the specific case of entity tracking, we propose a 6-tag scheme, with tags as follows:

Tag       Description
O_B, O_A  None state before and after existence, respectively
C, D      Creation and destruction event for the entity, respectively
E         Exists in the process without any state change
M         Entity moves from loc_a to loc_b

Table 1: Proposed tag scheme for the neural CRF-based model for entity tracking.

Additionally, we train a transition matrix to obtain transition potentials between tags, which we denote by ψ(y_{i−1}, y_i), along with two extra tags: 〈START〉 and 〈STOP〉.

1We use "mention location" to differentiate these from the physical entity locations present in this QA domain.

Finally, for a tag sequence y, we obtain the probability as

    P(\mathbf{y} \mid \mathbf{w}, e) \propto \exp\Big( \sum_{i=0}^{T} \phi(y_i, i, \mathbf{w}, e) + \psi(y_{i-1}, y_i) \Big)    (3)
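For concreteness, here is a minimal PyTorch sketch of this layer: emission potentials as a linear map of the entity-tracking LSTM states (Eq. 2), a learned transition matrix, and the negative log likelihood of Eq. (3) computed with the forward algorithm. It is an illustrative reimplementation, not the authors' code; hard tag constraints can be imposed by fixing forbidden transition entries to a large negative constant.

```python
import torch
import torch.nn as nn

class LinearChainCRF(nn.Module):
    """Sketch of the neural CRF in Eqs. (2)-(3); illustrative only."""

    def __init__(self, hidden_size, num_tags):
        super().__init__()
        self.emit = nn.Linear(hidden_size, num_tags)                # W in Eq. (2)
        self.trans = nn.Parameter(torch.zeros(num_tags, num_tags))  # psi(y_{i-1}, y_i)
        self.start = nn.Parameter(torch.zeros(num_tags))            # <START> potentials
        self.stop = nn.Parameter(torch.zeros(num_tags))             # <STOP> potentials

    def neg_log_likelihood(self, h, tags):
        # h: (T, hidden) entity-tracking LSTM states; tags: (T,) gold tag ids
        phi = self.emit(h)                                          # (T, num_tags)
        # score of the gold sequence: emissions plus transitions
        gold = self.start[tags[0]] + phi[0, tags[0]]
        for t in range(1, h.size(0)):
            gold = gold + self.trans[tags[t - 1], tags[t]] + phi[t, tags[t]]
        gold = gold + self.stop[tags[-1]]
        # log partition function via the forward algorithm
        alpha = self.start + phi[0]
        for t in range(1, h.size(0)):
            alpha = torch.logsumexp(alpha.unsqueeze(1) + self.trans, dim=0) + phi[t]
        log_z = torch.logsumexp(alpha + self.stop, dim=0)
        return log_z - gold
```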

2.2 Location Tracking

To complement the entity's state changes with changes in its physical location, we use a separate recurrent module to predict locations. Given a set of potential locations L = (l_1, l_2, ..., l_n), where each l_j ∈ L is a continuous span in w, the location predictor outputs a distribution over locations for a passage w and entity e at a given time step t: P(l | w, e, t).

Identifying potential locations Instead of considering all spans of text as candidates for potential locations, we systematically reduce the set of locations by utilizing the part-of-speech (POS) tags of the tokens, extracting all maximal noun and noun + adjective spans as potential physical location spans. Thus, using an off-the-shelf POS tagger, we get a set L = (l_1, l_2, ..., l_n) of potential locations for each w. These heuristics lead to an 85% recall classifier for locations which are not null or unk.2
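A rough sketch of this heuristic is given below. It is our reconstruction over generic POS-tagger output; the paper does not spell out the exact extraction rules.

```python
def candidate_location_spans(tagged_tokens):
    """tagged_tokens: list of (token, coarse POS tag) pairs.
    Returns maximal adjective+noun spans as (start, end) token-index pairs."""
    spans, start = [], None
    for i, (_, pos) in enumerate(tagged_tokens + [("", "X")]):  # sentinel flushes the last span
        if pos in ("ADJ", "NOUN", "PROPN"):
            if start is None:
                start = i
        else:
            if start is not None:
                # keep the span only if it contains at least one noun
                if any(p in ("NOUN", "PROPN") for _, p in tagged_tokens[start:i]):
                    spans.append((start, i))
                start = None
    return spans

# e.g. "roots absorb water from soil" (NOUN VERB NOUN ADP NOUN) -> [(0, 1), (2, 3), (4, 5)]
```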

Location Tracking LSTM For a given location l and an entity e, we take the mean of the hidden representations of the tokens in the span of l in s_t (or else a mask vector), analogous to the input for the entity state tracking LSTM, and concatenate it with the representation of the mention location of entity e in s_t; this forms the input at time step t for tracking the entity-location pair, with h^{e,l}_t = LSTM([x^{e,l}_t]). Figure 1 shows an example where we instantiate location tracking LSTMs for each pair of entity e and potential location l. In the example, e ∈ {water, CO2, sugar} and l ∈ {soil, leaf}.

Softmax over Location Potentials The output of the location tracking LSTM is used to generate a potential for each entity e and location l pair at time step t. Taking a softmax over these potentials gives a probability distribution over the locations l at time step t for entity e:

    p^{e,l}_t = \mathrm{softmax}(\mathbf{w}_{loc}^{\top} h^{e,l}_t)

2Major non-matching cases include long phrases like "deep in the earth", "side of the fault line", and "area of high elevation", where the heuristic picks "earth", "fault line", and "area", respectively.


                                 Task-1                                          Task-2
Model                            Cat-1   Cat-2   Cat-3   Macro-Avg   Micro-Avg   Precision   Recall   F1
EntNet (Henaff et al., 2017)     51.62   18.83    7.77   26.07       25.96       50.2        33.5     40.2
QRN (Seo et al., 2017)           52.37   15.51   10.92   26.26       26.49       55.5        31.3     40.0
ProGlobal (Dalvi et al., 2018)   62.95   36.39   35.90   45.08       45.37       46.7        52.4     49.4
ProStruct (Tandon et al., 2018)    -       -       -       -           -         74.2        42.1     53.75
KG-MRC (Das et al., 2019)        62.86   40.00   38.23   47.03       46.62       64.52       50.68    56.77
This work: NCET                  70.55   44.57   41.34   52.15       52.31       64.2        53.9     58.6
This work: NCET + ELMo           73.68   47.09   41.03   53.93       53.97       67.1        58.5     62.5

Table 2: Results on the sentence-level (Task-1) and document-level (Task-2) evaluation tasks of the PROPARA dataset (test set). Our proposed CRF-based model achieves state-of-the-art results on both tasks compared to the previous work of Das et al. (2019). Incorporating ELMo further improves the performance of the state tracking module, as we see from the gains in Cat-1 and Cat-2.

2.3 Learning and Inference

The full model is trained end-to-end by minimizing the negative log likelihood of the gold state tag sequence for each entity and process paragraph pair. The location predictor is only trained to make predictions when the gold location is defined for that entity in the dataset (i.e., the entity exists).

At inference time, we perform global state change inference coupled with location prediction in a pipelined fashion. First, we use the state tracking module to predict the state change sequence with the maximum score using Viterbi decoding. Subsequently, we predict locations where the predicted tag is either create or move, which is sufficient to identify the entity's location at all times, since these are the only points where it can change.
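A minimal Viterbi decoder for the CRF sketched after Eq. (3) might look as follows (illustrative only); location prediction would then be run only at time steps whose predicted tag is C or M.

```python
def viterbi_decode(self, h):
    # h: (T, hidden); returns the highest-scoring tag sequence under Eq. (3)
    phi = self.emit(h)
    score = self.start + phi[0]                  # best score of length-1 prefixes, per tag
    backptr = []
    for t in range(1, h.size(0)):
        total = score.unsqueeze(1) + self.trans  # [i, j]: best prefix ending in i, then tag j
        best, idx = total.max(dim=0)
        backptr.append(idx)
        score = best + phi[t]
    tags = [int((score + self.stop).argmax())]
    for idx in reversed(backptr):
        tags.append(int(idx[tags[-1]]))
    return list(reversed(tags))
```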

3 Experiments

We evaluate the performance of the proposed model on the two comprehension tasks of the PROPARA dataset (Dalvi et al., 2018). This dataset consists of 488 crowdsourced real-world process paragraphs about 183 distinct topics in the science genre. The names of the participating entities and their existence spans are identified by expert annotators. Finally, crowd workers label the locations of participant entities at each time step (sentence). The final data consists of 3.3k sentences, with an average of 6.7 sentences and 4.17 entities per process paragraph. We compare our model, the Neural CRF Entity Tracking (NCET) model, with benchmark systems from past work.

3.1 Task 1: Sentence Level

This comprehension task concerns answering 10 fine-grained sentence-level templated questions grouped into three categories: (Cat-1) Is e Created (Moved, Destroyed) in the process (yes/no for each)? (Cat-2) When was e Created (Moved, Destroyed)? (Cat-3) Where was e Created (Moved from/to, Destroyed)? The ground truth for these questions was extracted by applying simple rules to the annotated location state data. Note that Cat-1 and Cat-2 can be answered from our state-tracking model alone, and only Cat-3 involves location.

As shown in Table 2, our model using GloVe achieves state-of-the-art performance on the test set. The performance gain is attributed to gains in Cat-1 and Cat-2 (7.69% and 4.57% absolute), owing to the structural constraints imposed by the CRF layer. The gain in Cat-3 is relatively lower, as it is the only sub-task involving location tracking. Additionally, using frozen ELMo embeddings further improves performance, with major improvements in Cat-1 and Cat-2.

3.2 Task 2: Document Level

The document-level evaluation tries to capture a more global context, where the templated3 questions concern the whole paragraph structure: (i) What are the inputs to the process? (ii) What are the outputs of the process? (iii) What conversions occur, when and where? (iv) What movements occur, when and where? Table 2 shows the performance of the model on this task. We achieve state-of-the-art results with an F1 of 58.6.

3Inputs refer to entities which existed prior to the process and are destroyed during it. Outputs refer to entities which are created in the process without subsequent destruction. A conversion refers to a simultaneous event involving the creation of some entities coupled with the destruction of others.


Model       C-1     C-2     C-3     Mac.    Mic.
NCET        72.27   46.08   40.82   53.06   53.13
Tag Set 1   71.53   41.89   41.42   51.61   51.94
Tag Set 2   71.97   41.85   39.71   51.18   51.43
No trans.   71.68   44.22   40.38   52.09   52.24
No verb     73.16   42.58   41.85   52.53   52.85
Attn.       61.69   22.80   36.44   40.31   41.38

Table 3: Ablation studies for the proposed architecture.

3.3 Model Ablations

We now examine the performance of the model by comparing its variants along two different dimensions: (i) modifying the structural constraints of the CRF layer, and (ii) making changes to the continuous entity tracking.

Discrete Structural Constraints We experiment with two new tag schemes: (i) tag1: O_A = O_B, and (ii) tag2: O_A = E = O_B. As shown in Table 3, the proposed 6-tag scheme outperforms the simpler tag schemes, indicating that the model is able to gain more from a better structural annotation. Additionally, we experiment with removing the transition features from our CRF layer, though we still use structural constraints. Taken together, these results show that carefully capturing the domain constraints on how entities change over time is an important factor in our model.

Continuous Entity Tracking To evaluate the importance of different modules in our continuous entity tracking model, we experiment with (i) removing the verb information, and (ii) taking attention-based input for the entity tracking LSTM instead of the entity-mention information. In this way, instead of applying hard attention by focusing exactly on the entity, we let the model learn soft attention across the tokens at each time step. The model can now learn to look anywhere in a sentence for entity information, but is not given prior knowledge of how to do so. As shown, using attention-based input for entity tracking performs substantially worse, indicating the structural importance of passing the mask vector.

4 Conclusion

In this paper, we present a structured architecture for entity tracking which leverages both the discrete and continuous characterization of entity evolution. We use a neural CRF approach to model our discrete constraints while tracking entities and locations recurrently. Our model achieves state-of-the-art results on the PROPARA dataset.

Acknowledgments

This work was partially supported by NSF Grant IIS-1814522, NSF Grant SHF-1762299, a Bloomberg Data Science Grant, and an equipment grant from NVIDIA. The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources used to conduct this research. Thanks to the anonymous reviewers for their helpful comments.

References

Antoine Bosselut, Corin Ennis, Omer Levy, Ari Holtzman, Dieter Fox, and Yejin Choi. 2018. Simulating Action Dynamics with Neural Process Networks. In Proceedings of the International Conference on Learning Representations (ICLR).

Elizabeth Clark, Yangfeng Ji, and Noah A. Smith. 2018. Neural Text Generation in Stories Using Entity Representations as Context. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL): Human Language Technologies.

Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.

Bhavana Dalvi, Lifu Huang, Niket Tandon, Wen-tau Yih, and Peter Clark. 2018. Tracking State Changes in Procedural Text: a Challenge Dataset and Models for Process Paragraph Comprehension. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL): Human Language Technologies.

Rajarshi Das, Tsendsuren Munkhdalai, Xingdi Yuan, Adam Trischler, and Andrew McCallum. 2019. Building Dynamic Knowledge Graphs from Text using Machine Reading Comprehension. In Proceedings of the International Conference on Learning Representations (ICLR).

Greg Durrett and Dan Klein. 2015. Neural CRF Parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL) and the 7th International Joint Conference on Natural Language Processing.

Mikael Henaff, Jason Weston, Arthur Szlam, Antoine Bordes, and Yann LeCun. 2017. Tracking the World State with Recurrent Entity Networks. In Proceedings of the International Conference on Learning Representations (ICLR).

Yangfeng Ji, Chenhao Tan, Sebastian Martschat, Yejin Choi, and Noah A. Smith. 2017. Dynamic Entity Representations in Neural Language Models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL).

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural Architectures for Named Entity Recognition. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL): Human Language Technologies.

Reginald Long, Panupong Pasupat, and Percy Liang. 2016. Simpler Context-Dependent Logical Forms via Model Projections. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL).

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL): Human Language Technologies.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Matthew Richardson, Christopher J.C. Burges, and Erin Renshaw. 2013. MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Minjoon Seo, Sewon Min, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Query-Reduction Networks for Question Answering. In Proceedings of the International Conference on Learning Representations (ICLR).

Niket Tandon, Bhavana Dalvi, Joel Grus, Wen-tau Yih, Antoine Bosselut, and Peter Clark. 2018. Reasoning about Actions and State Changes by Injecting Commonsense Knowledge. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing Datasets for Multi-hop Reading Comprehension Across Documents. In Transactions of the Association for Computational Linguistics (TACL).

Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart van Merrienboer, Armand Joulin, and Tomas Mikolov. 2015. Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks. arXiv preprint arXiv:1502.05698.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).


Proceedings of the 3rd Workshop on Structured Prediction for NLP, pages 13–17, Minneapolis, Minnesota, June 7, 2019. ©2019 Association for Computational Linguistics

SPARSE: Structured Prediction Using Argument-Relative Structured Encoding

Rishi Bommasani
Dept. of Computer Science
Cornell University
[email protected]

Arzoo Katiyar
Dept. of Computer Science
Cornell University
[email protected]

Claire Cardie
Dept. of Computer Science
Cornell University
[email protected]

Abstract

We propose structured encoding as a novel approach to learning representations for relations and events in neural structured prediction. Our approach explicitly leverages the structure of available relation and event metadata to generate these representations, which are parameterized by both the attribute structure of the metadata and the learned representations of the arguments of the relations and events. We consider affine, biaffine, and recurrent operators for building hierarchical representations and modelling underlying features. Without task-specific knowledge sources or domain engineering, we significantly improve over systems and baselines that neglect the available metadata or its hierarchical structure. We observe across-the-board improvements on the BeSt 2016/2017 sentiment analysis task of at least 2.3 (absolute) and 10.6% (relative) F-measure over the previous state of the art.

1 Introduction

Information extraction has long been an active subarea of natural language processing (NLP) (Onyshkevych et al., 1993; Freitag, 2000). A particularly important class of extraction tasks is ERE detection, in which an object, typically an element in a knowledge base, is created for each ENTITY, RELATION, and EVENT identified in a given text (Song et al., 2015). In some variants of ERE detection, metadata descriptions and specific mentions of the ERE objects also need to be recorded, as shown graphically in Figure 1. Subsequent second-order extraction tasks can further build upon first-order ERE information.

In this work, in particular, we consider the second-order structured prediction task studied in the 2016/2017 Belief and Sentiment analysis evaluations (BeSt) (Rambow et al., 2016a): given a document and its EREs (including metadata and mentions), determine the sentiment of each ENTITY towards every RELATION and EVENT in the document.1

Figure 1: ERE graph for "China, a growing world power, is developing its army." Entities are denoted in blue, entity mentions in green, and relations in red.


Until quite recently, existing approaches to this type of second-order extraction task have relied heavily on domain knowledge and feature engineering (Rambow et al., 2016b; Niculae et al., 2016; Dalton et al., 2016; Gutiérrez et al., 2016). And while end-to-end neural network methods that bypass feature engineering have been developed for information extraction problems, they have largely been applied to first-order extraction tasks (Katiyar and Cardie, 2016; Miwa and Bansal, 2016; Katiyar and Cardie, 2017; Orr et al., 2018). Possibly more importantly, these techniques ignore, or have no access to, the internal structure of relations and events.


1The BeSt evaluation also requires the identification of sentiment towards entities. For reasons that will become clear later, we do not consider entity-to-entity sentiment here.


We hypothesize that utilizing the metadata-induced structure of relations and events will improve performance on the second-order sentiment analysis task. To this end, we propose structured encoding as an approach to learning representations for relations and events in end-to-end neural structured prediction. In particular, our approach models not only the available metadata but also its hierarchical nature.

We evaluate the structured encoding approach on the BeSt 2016/2017 dataset. Without sentiment-specific resources, we are able to see significant improvements over baselines that do not take into account the available structure. We achieve state-of-the-art F1 scores of 22.9 for discussion forum documents and 14.0 for newswire documents. We also openly release our implementation to promote further experimentation and reproducible research.

2 Task Formulation

In the BeSt sentiment analysis setting, we are given a corpus D where every document d_i ∈ D is annotated with the set of entities E_i, relation mentions R_i, and event mentions2 EV_i present in d_i. For each entity e_j ∈ E_i, we are additionally given the span, or variable-length n-gram, associated with (each of) its mention(s). Similarly, for each relation and event, we are given metadata specifications of its type and subtype as well as the arguments that constitute it. Our task is then the following: for each potential source-target pair, (src, trg) ∈ (E_i × R_i) ∪ (E_i × EV_i), predict the sentiment (one of POSITIVE, NEGATIVE, or NONE) of the src entity towards the trg relation/event.

3 Model

Our structured encoding model is depicted in Figure 2. Its goal is to compute representations for the src, trg, and context c; we then pass the concatenated triple into a FFNN for sentiment classification. We describe the model components in the paragraphs below.

Word and Entity Mention Representation  Given a span s = w_1, . . . , w_|s| of the input document, a noncontextual embedding mapping f, and a contextual mapping g, we create the initial word representation w_t for each word w_t in s by concatenating its noncontextual and contextual embedding. We then pass the initial word representations through a bidirectional RNN to obtain a domain-specific contextual word representation h_t.3

2 In this work, the distinction between relations/events and relation/event mentions is not considered, so we use 'relation' and 'event' as a shorthand.

Figure 2: Model for representing the source (blue), target (red), and context (yellow) for the directed sentiment analysis task.

$$\mathbf{w}_t = [f(w_t);\, g(w_t \mid s)] \qquad \mathbf{h}_t = [\overrightarrow{\mathrm{RNN}}(\mathbf{w}_t);\, \overleftarrow{\mathrm{RNN}}(\mathbf{w}_t)] \tag{1}$$

For each entity mention s, we compute a representation of s as the mean-pooling of h_1, . . . , h_|s|.
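A minimal PyTorch sketch of this encoding step is given below, assuming the noncontextual and contextual vectors are precomputed and simply concatenated before the bidirectional encoder; the module and variable names are illustrative, not taken from the released implementation, although the two-layer bidirectional LSTM with hidden size 128 follows the implementation details reported later in the paper.

```python
import torch
import torch.nn as nn

class SpanEncoder(nn.Module):
    """Encode a span of concatenated [f(w_t); g(w_t|s)] word vectors with a
    bidirectional LSTM, then mean-pool to obtain a mention representation."""
    def __init__(self, emb_dim, hidden_dim=128):
        super().__init__()
        self.rnn = nn.LSTM(emb_dim, hidden_dim, num_layers=2,
                           bidirectional=True, batch_first=True)

    def forward(self, span_embeddings):
        # span_embeddings: (1, span_len, emb_dim)
        h, _ = self.rnn(span_embeddings)   # (1, span_len, 2 * hidden_dim)
        return h, h.mean(dim=1)            # per-word states and mention vector

# toy usage: a 5-word mention with 300d GloVe + 3072d ELMo vectors
enc = SpanEncoder(emb_dim=300 + 3072)
states, mention_vec = enc(torch.randn(1, 5, 300 + 3072))
```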

Source Encoding  For each source entity e_j we are given its grounded entity mentions e_j^1, . . . , e_j^n. Similar to the representation methodology for entity mentions, we represent each source as the average of its entity mention representations.

$$\mathbf{e}_j = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\left|[e_j^i]\right|} \sum_{t=1}^{\left|[e_j^i]\right|} \mathbf{h}_t \tag{2}$$

The dataset is also annotated with entities corresponding to the authors of articles. To more appropriately handle these entities, we learn two author embeddings. The post author embedding represents the source entity if the target occurs in a post written by the source. If this is not the case, we use the other author embedding.

Target Encoding  To encode targets, we leverage the hierarchical structure provided in the relation and event metadata. Initially, we construct a fixed-length representation for each of the target's arguments, which are (role, entity mention) pairs, (r, e_j^k), via concatenation (flat) or an affine map (affine) as follows (where U_r and v_r are learned):

$$\mathbf{arg}^{r}_{j,k} = U_r\,\mathbf{e}_j^{k} \;\;\text{(affine)} \qquad \mathbf{arg}^{r}_{j,k} = [\mathbf{v}_r;\, \mathbf{e}_j^{k}] \;\;\text{(flat)} \tag{3}$$

For relations, given that the dataset enforces the constraint of two arguments per relation, we pool the relation vectors by concatenating them. On the other hand, for events, as the number of arguments is variable, we propose to pool them with a GRU encoder (Cho et al., 2014). This allows us to deal with the variable number of arguments and learn compositional relationships between different arguments (arguments are ordered by their semantic roles in the overarching event).

3 This process is not shown in Figure 2.

Both relations and events are annotated with their types and subtypes. Based on our hypothesis that hierarchical modeling of structure is important in encoding targets, we consider encoding the subtype and type using flat concatenation or learned affine maps applied successively, in a similar fashion to role encoding for arguments.
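The sketch below illustrates the affine variant of this target encoding for the event path (role-specific affine maps over argument mention vectors, GRU pooling over the role-ordered arguments, then successive subtype- and type-specific affine maps); relations would instead concatenate their two argument vectors. All class and parameter names are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class AffineTargetEncoder(nn.Module):
    """Hierarchical (affine) target encoding for an event target."""
    def __init__(self, mention_dim, hidden_dim, n_roles, n_subtypes, n_types):
        super().__init__()
        self.role_maps = nn.ModuleList(
            [nn.Linear(mention_dim, hidden_dim) for _ in range(n_roles)])
        self.arg_pool = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.subtype_maps = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(n_subtypes)])
        self.type_maps = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(n_types)])

    def forward(self, arg_mentions, arg_roles, subtype, etype):
        # arg_mentions: list of (mention_dim,) vectors ordered by semantic role
        args = torch.stack([torch.tanh(self.role_maps[r](m))
                            for m, r in zip(arg_mentions, arg_roles)])
        _, pooled = self.arg_pool(args.unsqueeze(0))   # final GRU state
        trg = pooled.squeeze(0).squeeze(0)
        trg = torch.tanh(self.subtype_maps[subtype](trg))
        return torch.tanh(self.type_maps[etype](trg))
```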

Context Representation  We encode the source and target independently, capturing the inductive bias that these two components of the input have stand-alone meanings. We find that this is computationally efficient as opposed to approaches that model the source and target on a pair-specific basis. We introduce source-target interaction into the architecture by modeling the textual context in which they appear: we first select a source-target specific context span; we then construct a fixed-length encoding of the context using an attention mechanism conditioned on the source-target pair. To identify the context span, we identify the closest mention of the source that precedes the target. We begin the context span at the first word beyond the end of this mention (w_i). Similarly, we conclude the span at the last word preceding the target (w_j). Denoting the context span as [w_i, w_j], we compute the context vector c as follows (where the U maps are learned):

$$\begin{aligned}
\mathbf{u}_{src} &= U_{src}\,\mathbf{src}; \quad \mathbf{u}_{trg} = U_{trg}\,\mathbf{trg}; \quad \mathbf{u}_t = U_c\,\mathbf{h}_t \\
\alpha_t &= \mathbf{u}_t^{\top}(\mathbf{u}_{src} \odot \mathbf{u}_{trg}) \\
\boldsymbol{\alpha} &= \mathrm{softmax}([\alpha_i, \ldots, \alpha_j]^{\top}) \\
\mathbf{c} &= [\mathbf{h}_i, \ldots, \mathbf{h}_j]\,\boldsymbol{\alpha}
\end{aligned} \tag{4}$$

We truncate long spans (length greater than 20) to be the 20 words preceding the start of the target (i = j − 20).
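A compact PyTorch sketch of the attention in Eq. 4, assuming the source, target, and context word states are already computed; module names and the attention dimensionality argument are illustrative placeholders.

```python
import torch
import torch.nn as nn

class ContextAttention(nn.Module):
    """Attention over the context span h_i..h_j, conditioned on the
    source-target pair (Eq. 4)."""
    def __init__(self, src_dim, trg_dim, word_dim, attn_dim=32):
        super().__init__()
        self.U_src = nn.Linear(src_dim, attn_dim, bias=False)
        self.U_trg = nn.Linear(trg_dim, attn_dim, bias=False)
        self.U_c = nn.Linear(word_dim, attn_dim, bias=False)

    def forward(self, src, trg, context):       # context: (span_len, word_dim)
        u_src, u_trg = self.U_src(src), self.U_trg(trg)
        u = self.U_c(context)                   # (span_len, attn_dim)
        scores = u @ (u_src * u_trg)            # elementwise product, then dot
        alpha = torch.softmax(scores, dim=0)    # attention weights over the span
        return context.t() @ alpha              # weighted sum of h_i..h_j
```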

4 Results and Analysis

Dataset and Evaluation  We use the LDCE114 dataset that was used in the BeSt 2016/2017 English competition. The dataset contains 165 documents, 84 of which are multi-post discussion forums (DF) and 81 are single-post newswire articles (NW), and is summarized in Table 1. We observe that the data is extremely sparse with respect to positive examples: only 0.203% of all source-target pairs are positive examples, with 431 positive examples in 211310 candidate pairs in the training set.

                                 DF        NW
Length (in num. tokens)        750.36    538.96
Relation Pairs                 775.34   1474.51
Relation Positive Examples       1.32      1.02
Event Pairs                    810.36   1334.53
Event Positive Examples          2.60      3.98

Table 1: Average document statistics

Consistent with the BeSt evaluation framework, we report the microaverage adjusted F1 score4 and treat NONE as the negative class and both POSITIVE and NEGATIVE as the positive classes. The BeSt metric introduces a notion of partial credit to reward predictions that are incorrect but share global properties with the correct predictions. We find that this metric is well-suited for structured prediction treatments of the task that also consider global structure.

Implementation Details  We use frozen word representations that concatenate noncontextual 300-dimensional GloVe embeddings (Pennington et al., 2014) and contextual 3072-dimensional ELMo embeddings (Peters et al., 2018). We use a 2-layer bidirectional LSTM encoder (Hochreiter and Schmidhuber, 1997) and consistently use both a hidden dimensionality of 128 and tanh nonlinearities throughout the work. Based on results on the development set, we use an attention dimensionality of 32, a dropout (Srivastava et al., 2014) probability in all hidden layers of 0.2, and a batch size of 10 documents. We also find that a single-layer GRU worked best for event argument pooling and, interestingly, that a unidirectional GRU is preferable to a bidirectional GRU. The final classifier is a single-layer FFNN. The model is trained with respect to the class-weighted cross-entropy loss, where the weights are the ℓ1-normalized vector of inverse class frequencies, and optimized using Adam (Kingma and Ba, 2014) with the default parameters in PyTorch (Paszke et al., 2017). All models are trained for 50 epochs.
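One way this class weighting might look in PyTorch is sketched below; the class counts are illustrative numbers, not the actual dataset statistics, and the stand-in classifier is a placeholder for the final FFNN.

```python
import torch
import torch.nn as nn

# illustrative class counts for NONE, POSITIVE, NEGATIVE
counts = torch.tensor([210879., 250., 181.])
weights = 1.0 / counts
weights = weights / weights.sum()              # l1-normalized inverse frequencies

model = nn.Linear(384, 3)                      # stand-in for the final FFNN
criterion = nn.CrossEntropyLoss(weight=weights)
optimizer = torch.optim.Adam(model.parameters())   # PyTorch default parameters

logits = model(torch.randn(10, 384))
loss = criterion(logits, torch.randint(0, 3, (10,)))
loss.backward()
optimizer.step()
```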

4 Complete results can be found in Appendix A.


                              DF       NW
Baseline                    14.53     7.24
Columbia/GWU                20.69    10.10
Cornell-Pitt-Michigan       19.48     0.70

Target     Arg      (Sub)Type     DF       NW
Relation   Flat     Flat         15.2      8.5
Event      Flat     Flat         15.5      8.5
Both       Affine   Affine       22.9     14.0

Table 2: Experimental results on the BeSt dataset. The specified target types are encoded using structured encoding with the specified argument and subtype/type encoding methods. The other target type is encoded with affine maps for the arguments, subtypes, and types.

Overall Results  In Table 2, the upper half of the table describes the official BeSt baseline system as well as the top systems on the task. The baseline employs a rule-based heuristic that considers pairs for which the source is the article author as potential positive examples and classifies positive examples as NEGATIVE (the more frequent positive class) (Rambow et al., 2016a). The Columbia/GWU system treats second-order sentiment classification as first-order relation extraction and extends a relation extraction system that uses phrase structure trees, dependency trees, and semantic parses with an SVM using tree kernels (Rambow et al., 2016b). The Cornell-Pitt-Michigan system employs a rule-based heuristic to prune candidate pairs in the sense of link prediction and then performs sentiment classification via multinomial logistic regression (Niculae et al., 2016). They find that the link prediction system sometimes predicts spurious links and permit the classifier to overrule the link prediction judgment by predicting NONE.

Note that these results include entities as targets, and we argue that this makes the task comparatively easier. In particular, sparsity and dataset size are less of a concern for entity targets, as positive examples are equally frequent (0.196%) and there are significantly more entity pairs (438165 pairs) than relation and event targets combined (211310 pairs).5

In the lower half of Table 2, we report results for different structured encoding approaches. As we simultaneously predict sentiment towards relation targets and event targets, we find that the encoding method for one target category tends to improve performance on the other due to shared representations (refer to Appendix A). We find that using affine encoders, which is consistent with the inductive bias of hierarchical encoding, performs best in all settings.

5 Additionally, entities are unstructured targets and therefore are not well-suited for the contributions of this work.

Embeddings       Encoder      DF      NW
GloVe + ELMo     Bidi-LSTM   22.9    14.0
GloVe + ELMo*    Bidi-LSTM    8.5     2.4
GloVe            Bidi-LSTM   12.6     8.3
ELMo             Bidi-LSTM   15.0     9.9
GloVe + ELMo     Uni-LSTM    20.9    13.6
GloVe + ELMo     none        22.1    13.6

Table 3: Experimental results on the effects of word representations. * indicates embeddings are not frozen.

Word Representations  To understand the effect of pretraining, given its pervasive success throughout NLP, as well as the extent to which domain-specific LSTM encoders are beneficial, we perform experiments on the development set. When the LSTM encoder is omitted, we introduce an affine projection that yields output vectors of the same dimension as the original LSTM (128). This ensures that the capacity of subsequent model components is unchanged but means there is no domain-specific context modelling. As shown in Table 3, the combination of contextual and noncontextual embeddings leads to an improvement, but the contextual embeddings perform better stand-alone. In doing this comparison, we consider pretrained embeddings of differing dimensionalities; however, since in most settings the embeddings are frozen, we do not find that this significantly affects the run time or model capacity. When embeddings are learned, we observe a substantial decline in performance, which we posit is due to catastrophic forgetting caused by noisy and erratic signal from the loss. The results further indicate that a bidirectional task-oriented encoder improves performance.

5 Conclusion

In this paper, we propose structured encoding to model structured targets in semantic link prediction. We demonstrate that this framework along with pretrained word embeddings can be an effective combination, as we achieve state-of-the-art performance on several metrics in the BeSt 2016/2017 Task in English.


Acknowledgements

We thank Vlad Niculae for his tremendous support and guidance in the early stages of this work, and RB especially thanks him for his advice on research and for being the gold standard for code quality. We thank the anonymous reviewers for their insightful feedback.

References

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078.

Adam Dalton, Morgan Wixted, Yorick Wilks, Meenakshi Alagesan, Gregorios Katsios, Ananya Subburathinam, and Tomek Strzalkowski. 2016. Target-focused sentiment and belief extraction and classification using cubism. In TAC.

Dayne Freitag. 2000. Machine learning for information extraction in informal domains. Machine Learning, 39(2):169–202.

Yoan Gutiérrez, Salud María Jiménez-Zafra, Isabel Moreno, M. Teresa Martín-Valdivia, David Tomás, Arturo Montejo-Ráez, Andrés Montoyo, L. Alfonso Ureña-López, Patricio Martínez-Barco, Fernando Martínez-Santiago, Rafael Muñoz, and Eladio Blanco-López. 2016. Redes at TAC Knowledge Base Population 2016: EDL and BeSt tracks. In TAC.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Arzoo Katiyar and Claire Cardie. 2016. Investigating LSTMs for joint extraction of opinion entities and relations. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 919–929. Association for Computational Linguistics.

Arzoo Katiyar and Claire Cardie. 2017. Going out on a limb: Joint extraction of entity mentions and relations without dependency trees. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 917–928. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Makoto Miwa and Mohit Bansal. 2016. End-to-end relation extraction using LSTMs on sequences and tree structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1105–1116. Association for Computational Linguistics.

Vlad Niculae, Kai Sun, Xilun Chen, Yao Cheng, Xinya Du, Esin Durmus, Arzoo Katiyar, and Claire Cardie. 2016. Cornell belief and sentiment system at TAC 2016. In TAC.

Boyan Onyshkevych, Mary Ellen Okurowski, and Lynn Carlson. 1993. Tasks, domains, and languages for information extraction. In Proceedings of a workshop held at Fredericksburg, Virginia: September 19–23, 1993, pages 123–133. Association for Computational Linguistics.

Walker Orr, Prasad Tadepalli, and Xiaoli Fern. 2018. Event detection with neural networks: A rigorous empirical evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 999–1004. Association for Computational Linguistics.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.

Owen Rambow, Meenakshi Alagesan, Michael Arrigo, Daniel Bauer, Claire Cardie, Adam Dalton, Mona Diab, Greg Dubbin, Gregorios Katsios, Axinia Radeva, Tomek Strzalkowski, and Jennifer Tracey. 2016a. The 2016 TAC KBP BeSt evaluation. In TAC.

Owen Rambow, Tao Yu, Axinia Radeva, Alexander Richard Fabbri, Christopher Hidey, Tianrui Peng, Kathleen McKeown, Smaranda Muresan, Sardar Hamidian, Mona T. Diab, and Debanjan Ghosh. 2016b. The Columbia-GWU system at the 2016 TAC KBP BeSt evaluation. In TAC.

Zhiyi Song, Ann Bies, Stephanie Strassel, Tom Riese, Justin Mott, Joe Ellis, Jonathan Wright, Seth Kulick, Neville Ryant, and Xiaoyi Ma. 2015. From light to rich ERE: Annotation of entities, relations, and events. In Proceedings of the 3rd Workshop on EVENTS: Definition, Detection, Coreference, and Representation, pages 89–98.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.


Proceedings of the 3rd Workshop on Structured Prediction for NLP, pages 18–28, Minneapolis, Minnesota, June 7, 2019. ©2019 Association for Computational Linguistics

Lightly-supervised Representation Learning with Global Interpretability

Andrew Zupon, Maria Alexeeva, Marco A. Valenzuela-Escarcega, Ajay Nagesh, and Mihai Surdeanu

University of Arizona, Tucson, AZ, USA
{zupon, alexeeva, marcov, ajaynagesh, msurdeanu}@email.arizona.edu

Abstract

We propose a lightly-supervised approach for information extraction, in particular named entity classification, which combines the benefits of traditional bootstrapping, i.e., use of limited annotations and interpretability of extraction patterns, with the robust learning approaches proposed in representation learning. Our algorithm iteratively learns custom embeddings for both the multi-word entities to be extracted and the patterns that match them from a few example entities per category. We demonstrate that this representation-based approach outperforms three other state-of-the-art bootstrapping approaches on two datasets: CoNLL-2003 and OntoNotes. Additionally, using these embeddings, our approach outputs a globally-interpretable model consisting of a decision list, by ranking patterns based on their proximity to the average entity embedding in a given class. We show that this interpretable model performs close to our complete bootstrapping model, proving that representation learning can be used to produce interpretable models with small loss in performance. This decision list can be edited by human experts to mitigate some of that loss and in some cases outperform the original model.

1 Introduction

One strategy for mitigating the cost of supervised learning in information extraction (IE) is to bootstrap extractors with light supervision from a few provided examples (or seeds). Traditionally, bootstrapping approaches iterate between learning extraction patterns such as word n-grams, e.g., the pattern "@ENTITY , former president" could be used to extract person names,1 and applying these patterns to extract the desired structures (entities, relations, etc.) (Carlson et al., 2010; Gupta and Manning, 2014, 2015, inter alia). One advantage of this direction is that these patterns are interpretable, which mitigates the maintenance cost associated with machine learning systems (Sculley et al., 2014).

1 In this work we use surface patterns, but the proposed algorithm is agnostic to the types of patterns learned.

On the other hand, representation learning has proven to be useful for natural language processing (NLP) applications (Mikolov et al., 2013; Riedel et al., 2013; Toutanova et al., 2015, 2016, inter alia). Representation learning approaches often include a component that is trained in an unsupervised manner, e.g., predicting words based on their context from large amounts of data, mitigating the brittle statistics affecting traditional bootstrapping approaches. However, the resulting real-valued embedding vectors are hard to interpret.

Here we argue that these two directions are complementary, and should be combined. We propose such a bootstrapping approach for information extraction (IE), which blends the advantages of both directions. As a use case, we instantiate our idea for named entity classification (NEC), i.e., classifying a given set of unknown entities into a predefined set of categories (Collins and Singer, 1999). The contributions of this work are:
(1) We propose an approach for bootstrapping NEC that iteratively learns custom embeddings for both the multi-word entities to be extracted and the patterns that match them from a few example entities per category. Our approach changes the objective function of a neural network language model (NNLM) to include a semi-supervised component that models the known examples, i.e., by attracting entities and patterns in the same category to each other and repelling them from elements in different categories, and it adds an external iterative process that "cautiously" augments the pools of known examples (Collins and Singer, 1999). In other words, our contribution is an example of combining representation learning and bootstrapping.
(2) We demonstrate that our representation learning approach is suitable for semi-supervised NEC. We compare our approach against several state-of-the-art semi-supervised approaches on two datasets: CoNLL-2003 (Tjong Kim Sang and De Meulder, 2003) and OntoNotes (Pradhan et al., 2013). We show that, despite its simplicity, our method outperforms all other approaches.

(3) Our approach also outputs an interpretation of the learned model, consisting of a decision list of patterns, where each pattern gets a score per class based on the proximity of its embedding to the average entity embedding in the given class. This interpretation is global, i.e., it explains the entire model rather than local predictions. We show that this decision-list model performs comparably to the complete model on the two datasets.

(4) We also demonstrate that the resulting system can be understood, debugged, and maintained by non-machine learning experts. We compare the decision-list model edited by human domain experts with the unedited decision-list model and see a modest improvement in overall performance, with some categories getting a bigger boost. This improvement shows that, for non-ambiguous categories that are well-defined by the local contexts captured by our patterns, these patterns truly are interpretable to end users.

2 Related Work

Bootstrapping is an iterative process that alternates between learning representative patterns and acquiring new entities (or relations) belonging to a given category (Riloff, 1996; McIntosh, 2010). Patterns and extractions are ranked using either formulas that measure their frequency and association with a category, or classifiers, which increases robustness due to regularization (Carlson et al., 2010; Gupta and Manning, 2015). While semi-supervised learning is not novel (Yarowsky, 1995; Gupta and Manning, 2014), our approach performs better than some modern implementations of these methods, such as Gupta and Manning (2014).

Distributed representations of words (Deerwester et al., 1990; Mikolov et al., 2013; Levy and Goldberg, 2014) serve as the underlying representation for many NLP tasks such as information extraction and question answering (Riedel et al., 2013; Toutanova et al., 2015, 2016; Sharp et al., 2016). Mrkšić et al. (2017) build on traditional distributional models by incorporating synonymy and antonymy relations as supervision to fine-tune word vector spaces, using an Attract/Repel method similar to our idea. However, most of these works that customize embeddings for a specific task rely on some form of supervision. In contrast, our approach is lightly supervised, with only a few seed examples per category. Batista et al. (2015) perform bootstrapping for relation extraction using pre-trained word embeddings. They do not learn custom pattern embeddings that apply to multi-word entities and patterns. We show that customizing embeddings for the learned patterns is important for interpretability.

Recent work has focused on explanations of machine learning models that are model-agnostic but local, i.e., they interpret individual model predictions (Ribeiro et al., 2018, 2016a). In contrast, our work produces a global interpretation, which explains the entire extraction model rather than individual decisions.

Lastly, our work addresses the interpretability aspect of information extraction methods. Interpretable models mitigate the technical debt of machine learning (Sculley et al., 2014). For example, interpretability allows domain experts to make manual, gradual improvements to the models. This is why rule-based approaches are commonly used in industry applications, where software maintenance is crucial (Chiticariu et al., 2013). Furthermore, the need for interpretability also arises in critical systems, e.g., recommending treatment to patients, where these systems are deployed to aid human decision makers (Lakkaraju and Rudin, 2016). The benefits of interpretability have encouraged efforts to either extract interpretable models from opaque ones (Craven and Shavlik, 1996), or to explain their decisions (Ribeiro et al., 2016b).

As machine learning models are becoming more complex, the focus on interpretability has become more important, with new funding programs focused on this topic.2 Our approach for exporting an interpretable model (§3) is similar to Valenzuela-Escarcega et al. (2016), but we start from distributed representations, whereas they started from a logistic regression model with explicit features.

2 DARPA's Explainable AI program: http://www.darpa.mil/program/explainable-artificial-intelligence.


3 Approach

Bootstrapping with representation learning

Our algorithm iteratively grows a pool of multi-word entities (entPool_c) and n-gram patterns (patPool_c) for each category of interest c, and learns custom embeddings for both, which we will show are crucial for both performance and interpretability.

The entity pools are initialized with a few seed examples (seeds_c) for each category. For example, in our experiments we initialize the pool for a person names category with 10 names such as Mother Teresa. Then the algorithm iteratively applies the following three steps for T epochs:

(1) Learning custom embeddings: The algorithm learns custom embeddings for all entities and patterns in the dataset, using the current entPool_cs as supervision. This is a key contribution, and is detailed in the second part of this section.

(2) Pattern promotion: We generate the patterns that match the entities in each pool entPool_c, rank those patterns using point-wise mutual information (PMI) with the corresponding category, and select the top ranked patterns for promotion to the corresponding pattern pool patPool_c. In this work, we use surface patterns consisting of up to 4 words before/after the entity of interest, e.g., the pattern "@ENTITY , former president" matches any entity followed by the three tokens ",", "former", and "president". However, our method is agnostic to the types of patterns learned, and can be trivially adapted to other types of patterns, e.g., over syntactic dependency paths.

(3) Entity promotion: Entities are promoted to entPool_c using a multi-class classifier that estimates the likelihood of an entity belonging to each class (Gupta and Manning, 2015). Our feature set includes, for each category c: (a) edit distance over characters between the candidate entity e and current e_c^s ∈ entPool_c, (b) the PMI (with c) of the patterns in patPool_c that matched e in the training documents, and (c) similarity between e and e_c^s in a semantic space. For the latter feature group, we use the set of embedding vectors learned in step (1). These features are taken from Gupta and Manning (2015). We use these vectors to compute the cosine similarity score of a given candidate entity e to the entities in entPool_c, and add the average and maximum similarities as features. The top 10 entities classified with the highest confidence for each class are promoted to the corresponding entPool_c after each epoch.
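The pattern-promotion step (step 2 above) can be sketched with a simple count-based PMI, as below; the exact PMI formulation used in the experiments is not spelled out here, so the scoring function and helper names are assumptions for illustration.

```python
import math

def rank_patterns_by_pmi(matches, pool, k=10):
    """Rank patterns by PMI with a category's entity pool.

    matches: dict pattern -> set of entities the pattern matched in the corpus
    pool:    set of entities currently in entPool_c
    Returns the top-k patterns under a simple count-based PMI (assumed form).
    """
    total = sum(len(ents) for ents in matches.values())     # all (pattern, entity) matches
    in_pool = sum(len(ents & pool) for ents in matches.values())
    scores = {}
    for p, ents in matches.items():
        joint = len(ents & pool)
        if joint == 0:
            continue
        p_xy = joint / total                # P(pattern, category)
        p_x = len(ents) / total             # P(pattern)
        p_y = in_pool / total               # P(category)
        scores[p] = math.log(p_xy / (p_x * p_y))
    return sorted(scores, key=scores.get, reverse=True)[:k]
```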

Learning custom embeddings

We train our embeddings for both entities and patterns by maximizing the objective function J:

$$J = SG + Attract + Repel \tag{1}$$

where SG, Attract, and Repel are individual components of the objective function designed to model both the unsupervised, language model part of the task as well as the light supervision coming from the seed examples, as detailed below. A similar approach is proposed by Mrkšić et al. (2017), who use an objective function modified with Attract and Repel components to fine-tune word embeddings with synonym and antonym pairs.

The SG term is formulated identically to the original objective function of the Skip-Gram model of Mikolov et al. (2013), but, crucially, adapted to operate over multi-word entities and contexts consisting not of bags of context words, but of the patterns that match each entity. Thus, intuitively, our SG term encourages the embeddings of entities to be similar to the embeddings of the patterns matching them:

$$SG = \sum_{e} \Big[ \log\big(\sigma(V_e^{\top} V_{pp})\big) + \sum_{np} \log\big(\sigma(-V_e^{\top} V_{np})\big) \Big] \tag{2}$$

where e represents an entity, pp represents a positive pattern, i.e., a pattern that matches entity e in the training texts, np represents a negative pattern, i.e., one that has not been seen with this entity, and σ is the sigmoid function. Intuitively, this component forces the embeddings of entities to be similar to the embeddings of the patterns that match them, and dissimilar to the negative patterns.

The second component, Attract, encourages entities or patterns in the same pool to be close to each other. For example, if we have two entities in the pool known to be person names, they should be close to each other in the embedding space:

$$Attract = \sum_{P} \sum_{x_1, x_2 \in P} \log\big(\sigma(V_{x_1}^{\top} V_{x_2})\big) \tag{3}$$

where P is the entity/pattern pool for a category, and x_1, x_2 are entities/patterns in said pool.


Lastly, the third term, Repel, encourages that the pools be mutually exclusive, which is a soft version of the counter training approach of Yangarber (2003) or the weighted mutual-exclusive bootstrapping algorithm of McIntosh and Curran (2008). For example, person names should be far from organization names in the semantic embedding space:

$$Repel = \sum_{\substack{P_1, P_2 \\ P_1 \neq P_2}} \sum_{x_1 \in P_1} \sum_{x_2 \in P_2} \log\big(\sigma(-V_{x_1}^{\top} V_{x_2})\big) \tag{4}$$

where P_1, P_2 are different pools, and x_1 and x_2 are entities/patterns in P_1 and P_2, respectively.
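Putting Eqs. 1-4 together, a direct (unoptimized) PyTorch sketch of the objective is shown below; how entities and patterns are batched, negative patterns are sampled, and pools are populated is assumed here for illustration, and unordered pairs are used inside each pool.

```python
import torch
import torch.nn.functional as F

def emboot_objective(V_ent, V_pat, pos_pairs, neg_pairs, pools):
    """Sketch of J = SG + Attract + Repel.

    V_ent, V_pat : (n_entities, d) and (n_patterns, d) embedding tensors
    pos_pairs    : (entity_idx, pattern_idx) pairs seen together in text
    neg_pairs    : sampled (entity_idx, pattern_idx) pairs not seen together
    pools        : dict category -> list of embedding vectors in that pool
    """
    # SG: entities similar to matching patterns, dissimilar to negative patterns
    sg = sum(F.logsigmoid(V_ent[e] @ V_pat[p]) for e, p in pos_pairs) \
       + sum(F.logsigmoid(-(V_ent[e] @ V_pat[p])) for e, p in neg_pairs)

    # Attract: members of the same pool pulled together
    attract = sum(F.logsigmoid(x1 @ x2)
                  for members in pools.values()
                  for i, x1 in enumerate(members) for x2 in members[i + 1:])

    # Repel: members of different pools pushed apart
    cats = list(pools)
    repel = sum(F.logsigmoid(-(x1 @ x2))
                for i, c1 in enumerate(cats) for c2 in cats[i + 1:]
                for x1 in pools[c1] for x2 in pools[c2])

    return sg + attract + repel   # maximized during training
```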

We term the complete algorithm that learns and uses custom embeddings Emboot (Embeddings for bootstrapping), and the stripped-down version without them EPB (Explicit Pattern-based Bootstrapping). EPB is similar to Gupta and Manning (2015); the main difference is that we use pretrained embeddings in the entity promotion classifier rather than Brown clusters. In other words, EPB relies on pretrained embeddings for both patterns and entities rather than the custom ones that Emboot learns.3

Interpretable model

In addition to its output (entPool_cs), Emboot produces custom entity and pattern embeddings that can be used to construct a decision-list model, which provides a global, deterministic interpretation of what Emboot learned.

This interpretable model is constructed as follows. First, we produce an average embedding per category by averaging the embeddings of the entities in each entPool_c. Second, we estimate the cosine similarity between each of the pattern embeddings and these category embeddings, and convert them to a probability distribution using a softmax function; prob_c(p) is the resulting probability of pattern p for class c.

After being constructed, the interpretable model is used as follows. First, each candidate entity to be classified, e, receives a score for a given class c from all patterns in patPool_c that match it. The entity score aggregates the relevant pattern probabilities using Noisy-Or:

3 For multi-word entities and patterns, we simply average word embeddings to generate entity and pattern embeddings for EPB.

$$Score(e, c) = 1 - \prod_{\{p_c \in patPool_c \,\mid\, matches(p_c, e)\}} \big(1 - prob_c(p_c)\big) \tag{5}$$

Each entity is then assigned to the category with the highest overall score.
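A small NumPy sketch of this decision-list construction and Noisy-Or scoring follows; the dictionary-based data layout and function names are assumptions made for illustration rather than the released implementation.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def build_decision_list(entity_pools, pattern_emb):
    """prob_c(p): softmax over cosine similarities between each pattern embedding
    and the average entity embedding of each category."""
    cat_emb = {c: np.mean(vecs, axis=0) for c, vecs in entity_pools.items()}
    cats = list(cat_emb)
    probs = {}
    for p, v in pattern_emb.items():
        sims = np.array([v @ cat_emb[c] /
                         (np.linalg.norm(v) * np.linalg.norm(cat_emb[c]))
                         for c in cats])
        probs[p] = dict(zip(cats, softmax(sims)))
    return probs

def classify(entity_patterns, probs, pattern_pools):
    """Noisy-Or score (Eq. 5) of an entity for each category, given the patterns
    that matched it; returns the highest-scoring category."""
    scores = {}
    for c, pool in pattern_pools.items():
        matched = [p for p in entity_patterns if p in pool]
        prod = np.prod([1.0 - probs[p][c] for p in matched]) if matched else 1.0
        scores[c] = 1.0 - prod
    return max(scores, key=scores.get), scores
```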

4 Experiments

We evaluate the above algorithms on the task of named entity classification from free text.

Datasets: We used two datasets, the CoNLL-2003 shared task dataset (Tjong Kim Sang and De Meulder, 2003), which contains 4 entity types, and the OntoNotes dataset (Pradhan et al., 2013), which contains 11.4 These datasets contain marked entity boundaries with labels for each marked entity. Here we only use the entity boundaries but not the labels of these entities during the training of our bootstrapping systems. To simulate learning from large texts, we tuned hyperparameters on development, but ran the actual experiments on the train partitions.

Baselines: In addition to the EPB algorithm, we compare against the approach proposed by Gupta and Manning (2014).5 This algorithm is a simpler version of the EPB system, where entities are promoted with a PMI-based formula rather than an entity classifier.6 Further, we compare against label propagation (LP) (Zhu and Ghahramani, 2002), with the implementation available in the scikit-learn package.7 In each bootstrapping epoch, we run LP, select the entities with the lowest entropy, and add them to their top category. Each entity is represented by a feature vector that contains the co-occurrence counts of the entity and each of the patterns that matches it in text.8
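One plausible way to wire up this LP baseline with scikit-learn is sketched below, using random toy data in place of the real entity-by-pattern count matrix; the promotion threshold and feature matrix are illustrative assumptions, not the exact experimental setup.

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation
from scipy.stats import entropy

# X: entity-by-pattern co-occurrence counts; y: seed labels, -1 for unlabeled
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(100, 50)).astype(float)
y = np.full(100, -1)
y[:10] = 0     # ten seeds for one category
y[10:20] = 1   # ten seeds for another category

lp = LabelPropagation()
lp.fit(X, y)

# promote the lowest-entropy unlabeled entities to their top category
probs = lp.predict_proba(X)
ent = entropy(probs.T)
unlabeled_ent = np.where(y == -1, ent, np.inf)
promoted = np.argsort(unlabeled_ent)[:10]
```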

Settings: For all baselines and proposed models, we used the same set of 10 seeds/category, which were manually chosen from the most frequent entities in the dataset. For the custom embedding

4 We excluded numerical categories such as DATE.
5 https://nlp.stanford.edu/software/patternslearning.shtml
6 We did not run this system on the OntoNotes dataset as it uses a builtin NE classifier with a predefined set of labels which did not match the OntoNotes labels.
7 http://scikit-learn.org/stable/modules/generated/sklearn.semi_supervised.LabelPropagation.html
8 We experimented with other feature values, e.g., pattern PMI scores, but all performed worse than raw counts.


Figure 1: t-SNE visualizations of the entity embeddings at three stages during training: (a) embeddings initialized randomly; (b) bootstrapping epoch 5; (c) bootstrapping epoch 10. Legend markers denote LOC, ORG, PER, and MISC. This figure is best viewed in color.

Figure 2: t-SNE visualization of the entity embeddings learned by Emboot after training completes, with the PER entities split into two clusters (mostly performers and mostly politicians). Legend markers denote LOC, ORG, PER, and MISC.

features, we used randomly initialized 15d embeddings. Here we consider patterns to be n-grams of size up to 4 tokens on either side of an entity. For instance, "@ENTITY , former President" is one of the patterns learned for the class person. We ran all algorithms for 20 bootstrapping epochs, and the embedding learning component for 100 epochs in each bootstrapping epoch. We add 10 entities and 10 patterns to each category during every bootstrapping epoch.

5 Discussion

Qualitative Analysis

Before we discuss overall results, we provide a qualitative analysis of the learning process for Emboot on the CoNLL dataset in Figure 1. The figure shows t-SNE visualizations (van der Maaten and Hinton, 2008) of the entity embeddings at several stages of the algorithm. This visualization matches our intuition: as training advances, entities belonging to the same category are indeed grouped together. In particular, Figure 1c shows five clusters, four of which are dominated by one category (and centered around the corresponding seeds), and one, in the upper left corner, with the entities that haven't yet been added to any of the pools.

Figure 2 shows a more detailed view of the t-SNE projections of entity embeddings after Emboot's training completes. Again, this demonstrates that Emboot's semi-supervised approach clusters most entities based on their (unseen) categories. Interestingly, Emboot learned two clusters for the PER category. Upon a manual inspection of these clusters, we observed that one contains mostly performers (e.g., athletes or artists such as Stephen Ames, a professional golfer), while the other contains many politicians (e.g., Arafat and Clinton). Thus, Emboot learned correctly that, at least at the time when the CoNLL 2003 dataset was created, the context in which politicians and performers were mentioned was different. The cluster in the bottom left part of the figure contains the remaining working pool of patterns, which were not assigned to any category cluster after the training epochs.

Quantitative Analysis

A quantitative comparison of the different models on the two datasets is shown in Figure 3.

Figure 3 shows that Emboot considerably outperforms LP and Gupta and Manning (2014), and has an occasional improvement over EPB. While EPB sometimes outperforms Emboot, Emboot has the potential for manual curation of its model, which we will explore later in this section. This demonstrates the value of our approach, and the importance of custom embeddings.

Importantly, we compare Emboot against: (a) its interpretable version (Emboot_int), which is constructed as a decision list containing the patterns learned (and scored) after each bootstrapping epoch, and (b) an interpretable system built similarly for EPB (EPB_int), using the pretrained Levy and Goldberg embeddings9 rather than our custom ones. This analysis shows that Emboot_int performs close to Emboot on both datasets, demonstrating that most of the benefits of representation learning are available in an interpretable model. Please see the discussion on the edited interpretable model in the next section.

9 For multi-word entities, we averaged the embeddings of the individual words in the entity to create an overall entity embedding.

Importantly, the figure also shows that EPB_int, which uses generic entity embeddings rather than the custom ones learned for the task, performs considerably worse than the other approaches. This highlights the importance of learning a dedicated distributed representation for this task.

Interpretability Analysis

Is the list of patterns generated by the interpretable model actually interpretable to end users? To investigate this, we asked two linguists to curate the models learned by Emboot_int, by editing the list of patterns in a given model. First, the experts performed an independent analysis of all the patterns. Next, the two experts conferred with each other and came to a consensus when their independent decisions on a particular pattern disagreed. These first two stages took the experts 2 hours for the CoNLL dataset and 3 hours for the OntoNotes dataset. The experts did not have access to the original texts the patterns were pulled from, so they had to make their decisions based on the patterns alone. They made one of three decisions for each pattern: (1) no change, when the pattern is a good fit for the category; (2) changing the category, when the pattern clearly belongs to a different category; and (3) deleting the pattern, when the pattern is either not informative enough for any category or when the pattern could occur with entities from multiple categories. The experts did not have the option of modifying the content of the pattern, because each pattern is associated with an embedding learned during training. Table 1 shows several examples of the patterns and decisions made by the annotators. A summary of the changes made for the CoNLL dataset is given in Figure 4, and a summary of the changes made for the OntoNotes dataset is given in Figure 5.

As Figure 3 shows, this edited interpretable model (Emboot_int-edited) performs similarly to the unedited interpretable model. When we look a little deeper, the observed overall similarity between the unchanged and the edited interpretable Emboot models for both datasets depends on the specific categories and the specific patterns involved. For example, when we look at the CoNLL dataset, we observe that the edited model outperforms the unchanged model on PER entities, but performs worse than the unchanged model on ORG entities (Figure 6).


Figure 3: Overall results (precision vs. throughput) on the CoNLL and OntoNotes datasets for EPB, EPB_int, Emboot, Emboot_int, Emboot_int-edited, LP, and Gupta and Manning (2014). Throughput is the number of entities classified, and precision is the proportion of entities that were classified correctly. Please see Sec. 4 for a description of the systems listed in the legend.

Pattern                            Original Label   Decision        Rationale
@ENTITY was the                    LOC              delete          the pattern is too broad
@ENTITY ) Ferrari                  LOC              delete          the pattern is uninformative
citizen of @ENTITY                 MISC             change to LOC   the pattern is more likely to occur with a location
According to @ENTITY officials     ORG              no change       the pattern is likely to occur with an organization

Table 1: Examples of patterns and experts' decisions and rationales from the CoNLL dataset.


Figure 4: Summary of expert decisions when editing the Emboot_int model, for the CoNLL dataset, by original category. Dark blue (bottom) is no change, medium blue (middle) is deletions, light blue (top) is change of category.

Figure 5: Summary of expert decisions when editing the Emboot_int model, for the OntoNotes dataset, by original category. Dark blue (bottom) is no change, medium blue (middle) is deletions, light blue (top) is change of category.

We observe a similar pattern with the OntoNotes dataset, where the Emboot_int-edited model outperforms the Emboot_int model greatly for GPE but not for LAW (Figure 7). Overall, for OntoNotes, Emboot_int-edited outperforms Emboot_int for 5 categories out of 11. For the categories where Emboot_int-edited performs worse, the data is sparse, so few patterns are promoted (30 FAC patterns compared to 200 GPE patterns), and many of them were deleted or changed by the two linguists (14 FAC deletions and 13 FAC changes, with only 3 FAC patterns remaining).

This difference in outcome partially has to do with the amount of local vs. global information available in the patterns.

Figure 6: CoNLL results (precision vs. throughput) for PER and ORG. The Emboot_int-edited model generally outperforms the Emboot_int model when it comes to PER entities, but not for ORG entities. This discrepancy seems to relate to the amount of local information available in PER patterns versus ORG patterns that can aid human domain experts in correcting the patterns. The EPB_int model (incorrectly) classifies very few entities as ORG, which is why it only shows up as a single point in the bottom left of the lower plot.

For example, local patterns are common for the PER and MISC categories in CoNLL, and for the GPE category in OntoNotes; e.g., the entity "Syrian" is correctly classified as MISC (which includes demonyms) due to two patterns matching it in the CoNLL dataset: "@ENTITY President" and "@ENTITY troops". In general, the majority of predictions are triggered by 1 or 2 patterns, which makes these decisions explainable. For the CoNLL dataset, 59% of Emboot_int's predictions are triggered by 1 or 2 patterns; 84% are generated by 5 or fewer patterns; only 1.1% of predictions are generated by 10 or more patterns.

On the other hand, without seeing the full source text, the experts were not able to


Figure 7: OntoNotes results (precision vs. throughput) for GPE and LAW. The Emboot_int-edited model greatly outperforms both the Emboot and Emboot_int models when it comes to GPE entities, but not for LAW entities. This discrepancy seems to relate to the amount of local information available in GPE patterns versus LAW patterns that can aid human domain experts in correcting the patterns. EPB_int does not classify any entities as LAW.

make an accurate judgment on the validity of some patterns; for instance, while the pattern "@ENTITY and Portugal" clearly indicates a geo-political entity, the pattern "@ENTITY has been" (labeled facility originally when training on OntoNotes documents) can co-occur with entities from any category. Such patterns are common for the LAW category in OntoNotes, and for the ORG category in CoNLL (due to the focus on sport events in CoNLL, where location names are commonly used as a placeholder for the corresponding team), which partially explains the poor curation results on these categories in Figures 6 and 7. Additionally, lower performance on certain categories can be partially explained by the small amount of data in those categories and the fact that the edits made by the experts drastically changed the number of patterns that occur with some categories (see Figures 4 and 5).

6 Conclusion

This work introduced an example of representation learning being successfully combined with traditional, pattern-based bootstrapping for information extraction, in particular named entity classification. Our approach iteratively learns custom embeddings for multi-word entities and the patterns that match them, as well as cautiously augmenting the pools of known examples. This approach outperforms several state-of-the-art semi-supervised approaches to NEC on two datasets, CoNLL 2003 and OntoNotes.

Our approach can also export the model learned into an interpretable list of patterns, which human domain experts can use to understand why an extraction was generated. These patterns can be manually curated to improve the performance of the system by modifying the model directly, with minimal effort. For example, we used a team of two linguists to curate the model learned for OntoNotes in 3 hours. The model edited by human domain experts shows a modest improvement over the unedited model, demonstrating the usefulness of these interpretable patterns. Interestingly, the manual curation of these patterns performed better for some categories that rely mostly on local context that is captured by the type of patterns used in this work, and less well for categories that require global context that is beyond the n-gram patterns used here. This observation raises opportunities for future work, such as how to learn global context in an interpretable way, and how to adjust the amount of global information depending on the category learned.

7 Acknowledgments

We gratefully thank Yoav Goldberg for his suggestions for the manual curation experiments.

This work was supported by the Defense Advanced Research Projects Agency (DARPA) under the Big Mechanism program, grant W911NF-14-1-0395, and by the Bill and Melinda Gates Foundation HBGDki Initiative. Marco Valenzuela-Escarcega and Mihai Surdeanu declare a financial interest in lum.ai. This interest has been properly disclosed to the University of Arizona Institutional Review Committee and is managed in accordance with its conflict of interest policies.


References

David S. Batista, Bruno Martins, and Mario J. Silva. 2015. Semi-supervised bootstrapping of relationship extractors with distributional semantics. In Empirical Methods in Natural Language Processing. ACL.

Andrew Carlson, Justin Betteridge, Richard C. Wang, Estevam R. Hruschka Jr., and Tom M. Mitchell. 2010. Coupled semi-supervised learning for information extraction. In Proceedings of the third ACM international conference on Web search and data mining, pages 101–110. ACM.

Laura Chiticariu, Yunyao Li, and Frederick R. Reiss. 2013. Rule-based information extraction is dead! Long live rule-based information extraction systems! In EMNLP, October, pages 827–832.

Michael Collins and Yoram Singer. 1999. Unsupervised models for named entity classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Mark W. Craven and Jude W. Shavlik. 1996. Extracting tree-structured representations of trained networks. Advances in Neural Information Processing Systems, pages 24–30.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.

Sonal Gupta and Christopher D. Manning. 2014. Improved pattern learning for bootstrapped entity extraction. In CoNLL, pages 98–108.

Sonal Gupta and Christopher D. Manning. 2015. Distributed representations of words to guide bootstrapped entity classifiers. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics.

Himabindu Lakkaraju and Cynthia Rudin. 2016. Learning cost-effective treatment regimes using Markov decision processes. CoRR, abs/1610.06972.

Omer Levy and Yoav Goldberg. 2014. Dependency-based word embeddings. In ACL (2), pages 302–308.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. The Journal of Machine Learning Research, 9(2579-2605):85.

Tara McIntosh. 2010. Unsupervised discovery of negative categories in lexicon bootstrapping. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 356–365. Association for Computational Linguistics.

Tara McIntosh and James R. Curran. 2008. Weighted mutual exclusion bootstrapping for domain independent lexicon and template acquisition. In Proceedings of the Australasian Language Technology Association Workshop, volume 2008.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Nikola Mrkšić, Ivan Vulić, Diarmuid Ó Séaghdha, Ira Leviant, Roi Reichart, Milica Gašić, Anna Korhonen, and Steve Young. 2017. Semantic specialization of distributional word vector spaces using monolingual and cross-lingual constraints. Transactions of the Association for Computational Linguistics, 5:309–324.

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. 2013. Towards robust linguistic analysis using OntoNotes. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 143–152, Sofia, Bulgaria. Association for Computational Linguistics.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016a. "Why should I trust you?": Explaining the predictions of any classifier. In Knowledge Discovery and Data Mining (KDD).

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016b. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Anchors: High-precision model-agnostic explanations. In AAAI Conference on Artificial Intelligence (AAAI).

Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In Proceedings of NAACL-HLT.

Ellen Riloff. 1996. Automatically generating extraction patterns from untagged text. In Proceedings of the National Conference on Artificial Intelligence, pages 1044–1049.

D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, and Michael Young. 2014. Machine learning: The high interest credit card of technical debt. In SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop).

Rebecca Sharp, Mihai Surdeanu, Peter Jansen, Peter Clark, and Michael Hammond. 2016. Creating causal embeddings for question answering with minimal supervision. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of CoNLL-2003, pages 142–147. Edmonton, Canada.

Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. 2015. Representing text for joint embedding of text and knowledge bases. In EMNLP, volume 15, pages 1499–1509. Citeseer.

Kristina Toutanova, Xi Victoria Lin, Wen-tau Yih, Hoifung Poon, and Chris Quirk. 2016. Compositional learning of embeddings for relation paths in knowledge bases and text. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1434–1444.

Marco A. Valenzuela-Escarcega, Gus Hahn-Powell, Dane Bell, and Mihai Surdeanu. 2016. SnapToGrid: From statistical to interpretable models for biomedical information extraction. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, pages 56–65.

Roman Yangarber. 2003. Counter-training in discovery of semantic patterns. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

X. Zhu and Z. Ghahramani. 2002. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University.

Proceedings of the 3rd Workshop on Structured Prediction for NLP, pages 29–37, Minneapolis, Minnesota, June 7, 2019. ©2019 Association for Computational Linguistics

Semi-Supervised Teacher-Student Architecture for Relation Extraction

Fan Luo, Ajay Nagesh, Rebecca Sharp, and Mihai Surdeanu
University of Arizona, Tucson, AZ, USA

{fanluo, ajaynagesh, bsharp, msurdeanu}@email.arizona.edu

Abstract

Generating a large amount of training data for information extraction (IE) is either costly (if annotations are created manually), or runs the risk of introducing noisy instances (if distant supervision is used). On the other hand, semi-supervised learning (SSL) is a cost-efficient solution to combat lack of training data. In this paper, we adapt Mean Teacher (Tarvainen and Valpola, 2017), a denoising SSL framework, to extract semantic relations between pairs of entities. We explore the sweet spot of the amount of supervision required for good performance on this binary relation extraction task. Additionally, different syntax representations are incorporated into our models to enhance the learned representation of sentences. We evaluate our approach on the Google-IISc Distant Supervision (GDS) dataset, which removes test data noise present in all previous distant supervision datasets, which makes it a reliable evaluation benchmark (Jat et al., 2017). Our results show that the SSL Mean Teacher approach nears the performance of fully-supervised approaches even with only 10% of the labeled corpus. Further, the syntax-aware model outperforms other syntax-free approaches across all levels of supervision.

1 Introduction

Occurrences of entities in a sentence are often linked through well-defined relations; e.g., occurrences of PERSON and ORGANIZATION in a sentence may be linked through relations such as employed at. The task of relation extraction (RE) is to identify such relations automatically (Pawar et al., 2017). RE is not only crucial for populating knowledge bases with triples consisting of the entity pairs and the extracted relations, but it is also important for any NLP task involving text understanding and inference, such as question answering. In this work, we focus on binary RE, such as identifying the nationality relation between two entities, Kian Tajbakhsh and Iran, in the sentence: Kian Tajbakhsh is a social scientist who lived for many years in England and the United States before returning to Iran a decade ago.

Semi-supervised learning (SSL) methods have been shown to work for alleviating the lack of training data in information extraction. For example, bootstrapping learns from a few seed examples and iteratively augments the labeled portion of the data during the training process (Yarowsky, 1995; Collins and Singer, 1999; Carlson et al., 2010; Gupta and Manning, 2015, inter alia). However, an important drawback of bootstrapping, which is typically iterative, is that, as learning advances, the task often drifts semantically into a related but different space, e.g., from learning women's names into learning flower names (Yangarber, 2003; McIntosh, 2010). Unlike iterative SSL methods such as bootstrapping, SSL teacher-student networks (Tarvainen and Valpola, 2017; Laine and Aila, 2016; Rasmus et al., 2015) make full use of both a small set of labeled examples as well as a large number of unlabeled examples in a one-shot, non-iterative learning process. The unsupervised component uses the unlabeled examples to learn a robust representation of the input data, while the supervised component forces these learned representations to stay relevant to the task.

The contributions of our work are:

(1) We provide a novel application of the Mean Teacher (MT) SSL framework to the task of binary relation extraction. Our approach is simple: we build a student classifier using representation learning of the context between the entity mentions that participate in the given relation. When exposed to unlabeled data, the MT framework learns by maximizing the consistency between a student classifier and a teacher that is an average of past students, where both classifiers are exposed to different noise.


For RE, we introduce a strategy to generate noise through word dropout (Iyyer et al., 2015). We provide all code and resources needed to reproduce our results.1

(2) We represent sentences with several types of syntactic abstractions, and employ a simple neural sequence model to embed these representations. We demonstrate in our experiments that the representation that explicitly models syntactic information helps SSL to make the most of limited training data in the RE task.

(3) We evaluate the proposed approach on the Google-IISc Distant Supervision (GDS) dataset introduced in Jat et al. (2017). Our empirical analysis demonstrates that the proposed SSL architecture approaches the performance of state-of-the-art fully-supervised approaches, even when using only 10% of the original labeled corpus.

2 Related work

The exploration of teacher-student models for semi-supervised learning has produced impressive results for image classification (Tarvainen and Valpola, 2017; Laine and Aila, 2016; Rasmus et al., 2015). However, these models have not yet been well studied in the context of natural language processing. Hu et al. (2016) propose a teacher-student model for the tasks of sentiment classification and named entity recognition, where the teacher is derived from a manually specified set of rule templates that regularizes a neural student, thereby allowing one to combine neural and symbolic systems. Our MT system is different in that the teacher is a simple running average of the students across different epochs of training, which removes the need for human supervision through rules. More recently, Nagesh and Surdeanu (2018) applied the MT architecture to the task of semi-supervised named entity classification, which is a simpler task compared to our RE task.

Recent works (Liu et al., 2015; Xu et al., 2015; Su et al., 2018) use neural networks to learn syntactic features for relation extraction by traversing the shortest dependency path. Following this trend, we adapt such syntax-based neural models to both our student and teacher classifiers in the MT architecture.

1 https://github.com/Fan-Luo/MT-RelationExtraction

Both Liu et al. (2015) and Su et al. (2018) use neural networks to encode the words and dependencies along the shortest path between the two entities, and Liu et al. additionally encode the dependency subtrees of the words for additional context. We include this representation (words and dependencies) in our experiments. While the inclusion of the subtrees gives Liu et al. a slight performance boost, here we opt to focus only on the varying representations of the dependency path between the entities, without the additional context.

Su et al. (2018) use an LSTM to model the shortest path between the entities, but keep their lexical and syntactic sequences in separate channels (and they have other channels for additional information such as part of speech (POS)). Rather than maintaining distinct channels for the different representations, here we elect to keep both surface and syntactic forms in the same sequence and instead experiment with different degrees of syntactic representation. We also do not include other types of information (e.g., POS) here, as it is beyond the scope of the current work.

There are several more structured (and more complex) models proposed for relation extraction, e.g., tree-based recursive neural networks (Socher et al., 2010) and tree LSTMs (Tai et al., 2015). Our semi-supervised framework is an orthogonal improvement, and it is flexible enough to potentially incorporate any of these more complex models.

3 Mean Teacher Framework

Our approach repurposes the Mean Teacher framework for relation extraction. This section provides an overview of the general MT framework; the next section discusses the adaptation of this framework for RE.

The Mean Teacher (MT) framework, like other teacher-student algorithms, learns from a limited set of labeled data coupled with a much larger corpus of unlabeled data. Intuitively, the larger, unlabeled corpus allows the model to learn a robust representation of the input (i.e., one which is not sensitive to noise), and the smaller set of labeled data constrains this learned representation to also be task-relevant. This is similar in spirit to an auto-encoder, which learns a representation that is robust to noise in the input. However, auto-encoders generally suffer from a confirmation bias, especially when used in a semi-supervised setting (i.e., they tend to overly rely on their own predictions instead of the gold labels (Tarvainen and Valpola, 2017; Laine and Aila, 2016)), providing the motivation for MT.


Specifically, to mitigate confirmation bias and regularize the model, the teacher weights in MT are not directly learned; rather, they are an average of the learned student weights, similar in spirit to the averaged perceptron. This provides a more robust scaffolding that the student model can rely on during training when gold labels are unavailable (Nagesh and Surdeanu, 2018).

The MT architecture is summarized in Figure 1. It consists of two models, teacher and student, both with identical architectures (but different weights). While the weights for the student model are learned through standard backpropagation, the weights in the teacher network are set through an exponential moving average of the student weights. The same data point, augmented with noise (see Section 4.2), is input to both the teacher and the student models.

The MT algorithm is designed for semi-supervised tasks, where only a few of the data points are labeled during training. The cost function is a linear combination of two different types of costs: classification and consistency. The classification cost applies to labeled data points, and can be instantiated with any standard supervised cost function (e.g., we used categorical cross-entropy, which is based on the softmax over the label categories). The consistency cost is used for unlabeled data points, and aims to minimize the differences in predictions between the teacher and the student models. The consistency cost J is:

J(θ) = E_{x,η′,η} [ || f(x, θ′, η′) - f(x, θ, η) ||^2 ]    (1)

where, given an input x, this function is defined as the expected distance between the prediction of the student model (f(x, θ, η), with weights θ and noise η) and the prediction of the teacher model (f(x, θ′, η′), with weights θ′ and noise η′). Here, our predictions are the models' output distributions (with a softmax) across all labels.
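For concreteness, the consistency cost can be computed as in the following PyTorch sketch. This is not our released implementation; the function name and the use of mean squared error over the softmaxed outputs are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def consistency_cost(student_logits: torch.Tensor,
                     teacher_logits: torch.Tensor) -> torch.Tensor:
    """Squared distance between the student and teacher label distributions
    (Eq. 1). The teacher output is detached so that gradients flow only
    into the student."""
    student_probs = F.softmax(student_logits, dim=-1)
    teacher_probs = F.softmax(teacher_logits, dim=-1).detach()
    return F.mse_loss(student_probs, teacher_probs)
```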

Importantly, as hinted above, only the student model is updated via backpropagation. The gradients are not backpropagated through the teacher weights; rather, the teacher weights are deterministically updated after each mini-batch of gradient descent in the student. The update uses an exponentially-weighted moving average (EMA) that combines the previous teacher with the latest version of the student:

θ′_t = α θ′_{t-1} + (1 - α) θ_t    (2)

where θ′_t is defined at training step t as the EMA of successive student weights θ. This averaging strategy is reminiscent of the averaged perceptron, except that in this case the average is not constructed through an error-driven algorithm but is instead controlled by a hyperparameter α.
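A minimal PyTorch sketch of the update in Eq. 2 is shown below; it assumes the student and teacher are modules with identical architectures so that their parameter lists align, and the helper name is illustrative rather than taken from our released code.

```python
import torch

@torch.no_grad()
def update_teacher(teacher: torch.nn.Module,
                   student: torch.nn.Module,
                   alpha: float = 0.99) -> None:
    """Eq. 2: the teacher weights become an exponential moving average of
    the student weights. Assumes both modules have identical architectures,
    so zipping their parameters pairs them correctly."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(alpha).add_(s_param, alpha=1.0 - alpha)
```

In training, this update would be applied after every mini-batch of student gradient descent.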

4 Task-specific Representation

The MT framework contains two components that are task-specific. The first is the representation of the input to be modeled, which consists of entity mention pairs and the context between the two entity mentions. The second is the strategy for inserting noise into the inputs received by the student and teacher models. We detail both of these components here.

4.1 Input Representations

Within the Mean Teacher framework, we investigate input representations of four types of syntactic abstraction. Each corresponds to one of the inputs shown in Figure 2 for the example (Robert Kingsley, perGraduatedInstitution, University of Minnesota).

Average: Our shallowest learned representation uses the surface form of the entities (e.g., Robert Kingsley and University of Minnesota) and the context between them. If there are several mentions of one or both of the entities in a given sentence, we use the closest pair. The model averages the word embeddings for all tokens and then passes the result through a fully-connected feed-forward neural network with one hidden layer, and finally to an output layer for classification into the relation labels.
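A minimal PyTorch sketch of this Average model follows; the class name, dimensions, and padding-free batching are illustrative assumptions, not our released implementation.

```python
import torch
import torch.nn as nn

class AverageRelationClassifier(nn.Module):
    """Average the embeddings of the entity tokens and the context between
    them, then classify with a one-hidden-layer feed-forward network."""
    def __init__(self, vocab_size: int, emb_dim: int = 100,
                 hidden_dim: int = 100, num_labels: int = 5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.ff = nn.Sequential(
            nn.Linear(emb_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_labels),
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len); average over the sequence dimension.
        avg = self.embed(token_ids).mean(dim=1)
        return self.ff(avg)  # unnormalized label scores
```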

Surface LSTM (surfaceLSTM): The next learned representation also uses the surface form of the input, i.e., the sequence of words between the two entities, but replaces the word embedding average with a single-layer bidirectional LSTM, which has been shown to encode some syntactic information (Peters et al., 2018). The representation from the LSTM is passed to the feed-forward network as above.
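A corresponding sketch of the surfaceLSTM encoder is below; pooling by concatenating the final hidden states of the two LSTM directions is an assumption, as are the names and sizes.

```python
import torch
import torch.nn as nn

class BiLSTMRelationClassifier(nn.Module):
    """Encode the token sequence with a single-layer BiLSTM and classify
    from the concatenated final hidden states of the two directions."""
    def __init__(self, vocab_size: int, emb_dim: int = 100,
                 lstm_hidden: int = 50, num_labels: int = 5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.ff = nn.Sequential(
            nn.Linear(2 * lstm_hidden, 100),
            nn.ReLU(),
            nn.Linear(100, num_labels),
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        # h_n: (2, batch, lstm_hidden); concatenate forward and backward states.
        encoded = torch.cat([h_n[0], h_n[1]], dim=-1)
        return self.ff(encoded)
```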

Head LSTM (headLSTM): Under the intuition that the trigger is the most important lexical item in the context of a relation, relation extraction can often be distilled into the task of identifying and classifying triggers, i.e., words which signal specific relations (Yu and Ji, 2016).


Figure 1: The Mean Teacher (MT) framework for relation extraction, which makes use of a large, unlabeled corpus and a small number of labeled data points. Intuitively, the larger, unlabeled corpus allows the model to learn a robust representation of the input (i.e., one which is not sensitive to noise), and the smaller set of labeled data constrains this learned representation to also be task-relevant. Here, for illustration, we depict the training step with one labeled example. The MT framework consists of two models: a student model, which is trained through backpropagation, and a teacher model, which is not directly trained but rather is an average of previous student models (i.e., its weights are updated through an exponential moving average of the student weights at each epoch). Each model takes the same input, but augmented with different noise (here, word dropout). The inputs consist of (a) a pair of entity mentions (here, <E1>: Robert Kingsley and <E2>: University of Minnesota), and (b) the context in which they occur (i.e., (1903-1988) was an American legal scholar and California judge who graduated from the). Each model produces a prediction for the label of the data point (i.e., a distribution over the classes). There are two distinct costs in the MT framework: the consistency cost and the classification cost. The consistency cost constrains the output label distributions of the two models to be similar, and it is applied to both labeled and unlabeled inputs (as these distributions can be constrained even if the true label is unknown) to ensure that the learned representations (distributions) are not sensitive to the applied noise. When the model is given a labeled input, it additionally incurs a classification cost computed from the prediction of the student model. This guides the learned representation to be useful to the task at hand.

For example, the verb died is highly indicative of the placeOfDeath relation. There are several ways of finding a candidate trigger, such as the PageRank-inspired approach of Yu and Ji (2016). Here, for simplicity, we use the governing head of the two entities within their token interval (i.e., the node in the dependency graph that dominates the two entities) as a proxy for the trigger. We then create the shortest dependency paths from it to each of the two entities, and concatenate these two sequences to form the representation of the corresponding relation mention. This is shown as the head-lexicalized syntactic path in Figure 2. For this representation, we once again pass this token sequence to the LSTM and subsequent feed-forward network.

Syntactic LSTM (synLSTM): Compared with headLSTM, our syntactic LSTM traverses the directed shortest path between the two entities through the dependency syntax graph of the sentence, instead of the concatenated shortest dependency paths between the trigger and each entity. As shown in the fully lexicalized syntactic path in Figure 2, we include in the representation the directed dependencies traversed (e.g., <nsubj for an incoming nsubj dependency), as well as all the words along the path (i.e., scholar and graduated).2 We feed these tokens, both words and dependencies, to the same LSTM as above.
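As an illustration of how such a fully lexicalized syntactic path could be assembled from a dependency parse, the sketch below uses networkx to find the shortest path between the two entity heads. The helper name, the edge representation, and the direction-marking convention are assumptions made for this sketch; our released code may differ.

```python
import networkx as nx

def syntactic_path_tokens(edges, e1: int, e2: int, words):
    """Build a fully lexicalized syntactic path between two entity head
    tokens: directed dependency labels (<label or >label) interleaved with
    the interior words along the shortest path. `edges` is a list of
    (head_index, dependent_index, label) triples from a dependency parse."""
    graph = nx.Graph()
    for head, dep, label in edges:
        graph.add_edge(head, dep, label=label, head=head)
    path = nx.shortest_path(graph, source=e1, target=e2)
    tokens = []
    for prev, curr in zip(path, path[1:]):
        attrs = graph.edges[prev, curr]
        # '>' if the arc is traversed from head to dependent, '<' otherwise
        # (a simplification of the direction marking shown in Figure 2).
        direction = ">" if attrs["head"] == prev else "<"
        tokens.append(direction + attrs["label"])
        if curr != e2:  # interior word on the path (entities excluded)
            tokens.append(words[curr])
    return tokens
```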

4.2 Noise Regularization

One critical component of the Mean Teacher framework is the addition of independent noise to both the student and the teacher models (Laine and Aila, 2016; Tarvainen and Valpola, 2017).

2 For multi-sentence instances where no single sentence contains both entities, in order to obtain a syntactic parse, we concatenate the closest sentences containing each of the entities with a semicolon and use that as our sentence. While this could be resolved more cleanly with discourse information or cross-sentence dependencies, that is beyond the scope of the current work.


Figure 2: Examples of the varying degrees of syntactic structure used for our models. The surface representation consists of a bag of entities (shown in blue) and the words that come between them. The surface syntactic path includes the words along the shortest path between the entities. The head lexicalized syntactic representation contains the governing head of the structure that dominates the two entities, and the directed dependencies connecting it to each of the entities. The fully lexicalized syntactic representation contains the directed dependencies along the shortest path in addition to the words along the shortest path. For both the head lexicalized syntactic and the fully lexicalized syntactic representations, the dependency notation includes the dependency labels as well as the direction of the dependency arc (<: right-to-left, >: left-to-right). Both words and syntactic dependencies are converted to embeddings that are trained along with the rest of the model.

While in some tasks, e.g., image classification, Gaussian noise is added to the network (Rasmus et al., 2015), here we opted to follow the example of Iyyer et al. (2015) and Nagesh and Surdeanu (2018) and use word dropout for the needed noise. For each training instance we randomly drop k tokens from the input sequence.3 The random dropout for the student model is independent from that of the teacher model, in principle forcing the learned representations to be robust to noise and to encode noise-invariant abstract latent representations for the task.
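A minimal sketch of this word-dropout noise is given below; the helper name is illustrative.

```python
import random

def word_dropout(tokens, k: int = 1):
    """Randomly drop k tokens from the input sequence."""
    if len(tokens) <= k:
        return list(tokens)
    dropped = set(random.sample(range(len(tokens)), k))
    return [tok for i, tok in enumerate(tokens) if i not in dropped]
```

In our setting this would be called twice per instance, once for the student input and once for the teacher input, so the two models see independently corrupted copies of the same example.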

5 Experiments

5.1 Data

We evaluate our approach on the Google-IISc Distant Supervision (GDS) dataset. This is a distant supervision dataset introduced by Jat et al. (2017) that aligns the Google Relation Extraction Corpus4 with texts retrieved through web search. Importantly, the dataset was curated to ensure that at least one sentence for a given entity pair supports the corresponding relation between the entities. It contains five relations, including NA. The training partition contains 11,297 instances (2,772 are NA), the development partition contains 1,864 instances (447 NA), and the test partition contains 5,663 instances (1,360 NA). The dataset is divided such that there is no overlap among entity pairs between these partitions.

3 In this work we use k = 1. In initial experiments, other values of k did not change results significantly.

4 https://research.googleblog.com/2013/04/50000-lessons-on-how-to-read-relation.html


5.2 Semi-Supervised Setup

In order to examine the utility of our semi-supervised approach, we evaluate it with different levels of supervision by artificially manipulating the fully-labeled dataset described in Section 5.1 to contain varying amounts of unlabeled instances. To do this, we incrementally selected random portions of the training data to serve as labeled data, with the rest being treated as unlabeled data (i.e., their labels are masked so that they are not visible to the classifier). In other words, this unlabeled data was used to calculate the consistency cost, but not the classification cost. We experimented with using 1%, 10%, 50%, and 100% of the labeled training data for supervision.
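The masking itself can be simulated as in the sketch below, where the training set is a list of (input, label) pairs and NO_LABEL is a sentinel that the loss treats as "unlabeled"; the names and data structures are illustrative assumptions.

```python
import random

NO_LABEL = -1  # sentinel for instances whose gold label is hidden

def mask_labels(examples, labeled_fraction: float, seed: int = 0):
    """Keep gold labels for a random `labeled_fraction` of the training
    instances and mask the rest. Masked instances contribute only to the
    consistency cost, never to the classification cost."""
    rng = random.Random(seed)
    n_labeled = int(round(labeled_fraction * len(examples)))
    labeled_idx = set(rng.sample(range(len(examples)), n_labeled))
    return [(x, y if i in labeled_idx else NO_LABEL)
            for i, (x, y) in enumerate(examples)]
```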

5.3 Baselines

Previous work: We compare our model against the previous state-of-the-art work of Jat et al. (2017). They propose two fully-supervised models: a bidirectional gated recurrent unit neural network with word attention (their BGWA model) and a piecewise convolutional neural network (PCNN) with an entity attention layer (their EA model), which is itself based on the PCNN approach of Zeng et al. (2015). Since we use single models only, we omit Jat et al.'s supervised ensemble for a more direct comparison. Note that these two approaches are state-of-the-art methods for RE that rely on more complex attention-based models. On the other hand, in this work we use only vanilla LSTMs for both of our student and teacher models. However, since the MT framework is agnostic to the student model, in future work we can potentially further increase performance by incorporating these more complex methods in our semi-supervised approach.



Student Only: In order to determine the contribution of the Mean Teacher framework itself, in each setting (amount of supervision and input representation) we additionally compare against a baseline consisting of our student model trained without the teacher (i.e., we remove the consistency cost component from the cost function).

5.4 Model Tuning

We lightly tuned the approach and hyperparameters on the development partition. We built our architecture in PyTorch5, a popular deep learning library, extending the framework of Tarvainen and Valpola (2017)6. For our model, we used a bidirectional LSTM with a hidden size of 50. The subsequent fully connected network has one hidden layer of 100 dimensions and ReLU activations. The training was done with the Adam optimizer (Kingma and Ba, 2015) with the default learning rate (0.1). We initialized our word embeddings with GloVe 100-dimensional embeddings (Pennington et al., 2014)7. The dependency embeddings were also of size 100 and were randomly initialized. Both of these embeddings were allowed to update during training. Any word or token which occurred fewer than 5 times in a dataset was treated as unknown. For our word dropout, we removed one word from each input at random. MT has some additional parameters: the consistency cost weight (the weight given to the consistency component of the cost function) and the EMA decay rate (α in Eq. 2). We set these to 1.0 and 0.99, respectively. Akin to a burn-in at the beginning of training, we used an initial consistency cost weight of 0.0 and allowed it to ramp up to the full value over the course of training, using a consistency ramp-up of 5.
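The ramp-up can be implemented as in the sketch below. The sigmoid-shaped schedule mirrors the reference Mean Teacher implementation that we extend; the exact shape is an assumption, since the text above only fixes the ramp-up length (5) and the final weight (1.0).

```python
import math

def consistency_weight(epoch: int, max_weight: float = 1.0,
                       rampup_epochs: int = 5) -> float:
    """Ramp the consistency cost weight from roughly 0 to `max_weight`
    over the first `rampup_epochs` epochs (sigmoid-shaped ramp-up)."""
    if rampup_epochs == 0:
        return max_weight
    phase = min(max(epoch, 0), rampup_epochs) / rampup_epochs
    return max_weight * math.exp(-5.0 * (1.0 - phase) ** 2)
```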

During training, we saved model checkpoints every 10 epochs, and ran our models for up to a maximum of 200 epochs. We chose the best model by tracking the teacher's performance on the development partition and using the model checkpoint immediately following the maximum performance.

5 https://pytorch.org
6 https://github.com/CuriousAI/mean-teacher
7 https://nlp.stanford.edu/projects/glove

6 Results

We evaluate our approach on the GDS dataset, across the various levels of supervision detailed in Section 5. The precision, recall, and F1 scores are provided in Table 1.

In the evaluation, we consider only the top label assigned by the model, counting it as a match only if (a) it matches the correct label and (b) the label is not NA. For each setting, we provide the final performance of both the student network and the teacher network. We also compare against the baseline student network (i.e., the network trained without the consistency cost).
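Concretely, this scoring can be sketched as below. The exact denominators (non-NA predictions for precision, non-NA gold labels for recall) reflect the standard convention and are an assumption of this sketch, as is the function name.

```python
def precision_recall_f1(gold, predicted, na_label="NA"):
    """Score top-1 predictions while ignoring NA: a prediction is correct
    only if it equals the gold label and is not NA."""
    correct = sum(1 for g, p in zip(gold, predicted)
                  if p == g and p != na_label)
    pred_non_na = sum(1 for p in predicted if p != na_label)
    gold_non_na = sum(1 for g in gold if g != na_label)
    precision = correct / pred_non_na if pred_non_na else 0.0
    recall = correct / gold_non_na if gold_non_na else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```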

In Table 1 we see that, in general, the MT framework improves performance over the student-only baseline, with the sole exception of the setting with only 1% supervision. The large performance gap between the 1% and 10% supervision configurations for the student-only model indicates that 1% supervision provides an inadequate amount of labeled data to support learning the different relation types. Further, when the unlabeled data overwhelms the supervised data, the MT framework encourages the student model to drift away from the correct labels through the consistency cost. Therefore, in this situation with insufficient supervision, the student-only approach performs better than MT. This result highlights the importance of a balance between the supervised and unsupervised training partitions within the MT framework.

Among our MT models, we see that while the surfaceLSTM models perform better than the Average models, the models which explicitly represent the syntactic dependency path between the entities have the strongest performance. This is true for both the student-only setting and the MT framework. In particular, we find that the best performing model is the synLSTM. These results show that: (a) while surface LSTMs may loosely model syntax (Linzen et al., 2016), they are still largely complemented by syntax-based representations, and, more importantly, (b) syntax provides improved generalization power over surface information.

6.1 Comparison to Prior Work

We additionally compare our Mean Teacher synLSTM model against the BGWA and EA models of Jat et al. (2017). For an accurate comparison, here we follow their setup closely. For each entity pair that occurs in the test partition, we aggregate the model's predictions for all mentions of that entity pair using the Noisy-Or formula into a single distribution for our model across all labels, except for NA, and plot the precision-recall (PR) curve accordingly.
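The Noisy-Or aggregation can be sketched as follows, where `mention_probs` stacks the per-mention label distributions for one entity pair; the helper is illustrative, and the NA column would be dropped before plotting the PR curve.

```python
import numpy as np

def noisy_or(mention_probs: np.ndarray) -> np.ndarray:
    """Combine mention-level label distributions for one entity pair:
    P(r) = 1 - prod_m (1 - P_m(r)), computed independently per relation.
    `mention_probs` has shape (num_mentions, num_labels)."""
    return 1.0 - np.prod(1.0 - mention_probs, axis=0)
```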


Model | 100% (P / R / F1) | 50% (P / R / F1) | 10% (P / R / F1) | 1% (P / R / F1)

Student-Only Baseline
Average Student | 0.724 / 0.793 / 0.757±0.00 | 0.716 / 0.774 / 0.744±0.01 | 0.718 / 0.684 / 0.700±0.01 | 0.624 / 0.572 / 0.597±0.01
surfaceLSTM Student | 0.754 / 0.830 / 0.790±0.01 | 0.748 / 0.834 / 0.788±0.01 | 0.735 / 0.752 / 0.744±0.00 | 0.691 / 0.622 / 0.655±0.01
headLSTM Student | 0.739 / 0.808 / 0.772±0.01 | 0.725 / 0.816 / 0.767±0.00 | 0.697 / 0.723 / 0.709±0.01 | 0.633 / 0.630 / 0.631±0.01
synLSTM Student | 0.757 / 0.828 / 0.791±0.00 | 0.759 / 0.827 / 0.791±0.00 | 0.716 / 0.770 / 0.742±0.01 | 0.667 / 0.677 / 0.671±0.01

Mean Teacher Framework
Average Student | 0.719 / 0.717 / 0.716±0.04 | 0.739 / 0.718 / 0.728±0.01 | 0.715 / 0.645 / 0.677±0.02 | 0.611 / 0.509 / 0.555±0.02
Average Teacher | 0.733 / 0.767 / 0.750±0.00 | 0.724 / 0.747 / 0.735±0.00 | 0.718 / 0.649 / 0.682±0.01 | 0.611 / 0.503 / 0.552±0.01
surfaceLSTM Student | 0.760 / 0.745 / 0.752±0.01 | 0.762 / 0.792 / 0.777±0.00 | 0.725 / 0.712 / 0.718±0.00 | 0.546 / 0.440 / 0.486±0.07
surfaceLSTM Teacher | 0.761 / 0.819 / 0.789±0.00 | 0.760 / 0.817 / 0.787±0.00 | 0.735 / 0.733 / 0.734±0.01 | 0.538 / 0.438 / 0.482±0.06
headLSTM Student | 0.718 / 0.766 / 0.741±0.01 | 0.728 / 0.786 / 0.756±0.01 | 0.700 / 0.684 / 0.692±0.01 | 0.613 / 0.525 / 0.565±0.00
headLSTM Teacher | 0.728 / 0.815 / 0.769±0.00 | 0.731 / 0.800 / 0.764±0.00 | 0.703 / 0.717 / 0.710±0.01 | 0.611 / 0.527 / 0.566±0.01
synLSTM Student | 0.753 / 0.780 / 0.766±0.01 | 0.751 / 0.826 / 0.786±0.01 | 0.724 / 0.732 / 0.728±0.01 | 0.634 / 0.590 / 0.609±0.03
synLSTM Teacher | 0.764 / 0.846 / 0.803±0.00 | 0.763 / 0.833 / 0.796±0.00 | 0.734 / 0.759 / 0.746±0.00 | 0.632 / 0.574 / 0.601±0.03

Table 1: Performance on the GDS dataset for our baseline models (i.e., student-only, without the Mean Teacher framework) and for our Mean Teacher models (both the student and the teacher performance is provided). We report precision, recall, and F1 score for each of the input representations (Average, surfaceLSTM, headLSTM, and synLSTM), with varying proportions of training labels (100%, 50%, 10%, and 1% of the training data labeled). Each value is the average of three runs with different random seed initializations, and for the F1 scores we additionally provide the standard deviation across the different runs.

Figure 3: Precision-recall curves for the GDS dataset. We include two prior systems (BGWA and EA) as well as our Mean Teacher approach using our best performing input representation (synLSTM), for increasing amounts of supervision (from 1% of the training labels to 100%, i.e., fully supervised).



Data for the BGWA and EA models were taken from author-provided results files8, allowing us to plot additional values beyond what was previously published. The PR curves for each level of supervision are shown in Figure 3.

We observe that our fully supervised model performs very similarly to the fully-supervised model of Jat et al. (2017). This is encouraging, considering that our approach is simpler, e.g., we do not have an attention model. Critically, we see that under the Mean Teacher framework this pattern continues even as we decrease the amount of supervision, such that even with only 10% of the original training data we approach their fully-supervised performance. It is not until we use two orders of magnitude fewer labeled examples that we see a major degradation in performance. This demonstrates the utility of our syntax-aware MT approach for learning robust representations for relation extraction.

7 Conclusion

This paper introduced a neural model for semi-supervised relation extraction. Our syntactically-informed, semi-supervised approach learns to effectively extract a number of relations from very limited labeled training data by employing a Mean Teacher (MT) architecture. This framework combines a student trained through backpropagation with a teacher that is an average of past students. Each model is exposed to different noise, which we created in this work through word dropout. The approach uses unlabeled data through a consistency cost, which encourages the student to stay close to the predictions of the teacher, and labeled data through a standard classification cost. The consistency requirement between the student and the teacher ensures that the learned representation of the input is robust to noise, while the classification requirement ensures that this learned representation encodes the task-specific information necessary to correctly extract relations between entities.

We empirically demonstrated that our approach is able to perform close to more complex, fully-supervised approaches using only 10% of the training data for relation extraction.

8 https://github.com/SharmisthaJat/RE-DS-Word-Attention-Models

This work further supports our previous work on MT for the task of named entity classification, where we observed similar gains (Nagesh and Surdeanu, 2018). Further, we show that the MT framework performs reliably for various input representations (from purely surface forms all the way to primarily syntactic). We additionally demonstrate that explicitly representing syntax, even in simplified ways, is beneficial to the models across all levels of supervision.

8 Acknowledgments

This work was supported by the Defense Advanced Research Projects Agency (DARPA) under the World Modelers program, grant number W911NF1810014, and by the Bill and Melinda Gates Foundation HBGDki Initiative. Mihai Surdeanu declares a financial interest in lum.ai. This interest has been properly disclosed to the University of Arizona Institutional Review Committee and is managed in accordance with its conflict of interest policies.

References

Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr., and Tom M. Mitchell. 2010. Toward an architecture for never-ending language learning. In Proc. of AAAI.

Michael Collins and Yoram Singer. 1999. Unsupervised models for named entity classification. In Proc. of EMNLP.

Sonal Gupta and Christopher D. Manning. 2015. Distributed representations of words to guide bootstrapped entity classifiers. In Proc. of NAACL.

Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard Hovy, and Eric Xing. 2016. Harnessing deep neural networks with logic rules. arXiv preprint arXiv:1603.06318.

Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daume III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proc. of ACL-IJCNLP, volume 1, pages 1681–1691.

Sharmistha Jat, Siddhesh Khandelwal, and Partha Talukdar. 2017. Improving distantly supervised relation extraction using word and entity based attention. In Proc. of AKBC.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR).

Samuli Laine and Timo Aila. 2016. Temporal ensembling for semi-supervised learning. CoRR, abs/1610.02242.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.

Yang Liu, Furu Wei, Sujian Li, Heng Ji, Ming Zhou, and Houfeng Wang. 2015. A dependency-based neural network for relation classification. In Proc. of ACL-IJCNLP, pages 285–290, Beijing, China.

Tara McIntosh. 2010. Unsupervised discovery of negative categories in lexicon bootstrapping. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 356–365.

Ajay Nagesh and Mihai Surdeanu. 2018. An exploration of three lightly-supervised representation learning approaches for named entity classification. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2312–2324.

Sachin Pawar, Girish K. Palshikar, and Pushpak Bhattacharyya. 2017. Relation extraction: A survey.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL, pages 2227–2237.

Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. 2015. Semi-supervised learning with ladder networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Proc. of NIPS, pages 3546–3554. Curran Associates, Inc.

Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2010. Learning continuous phrase representations and syntactic parsing with recursive neural networks. In Proceedings of the NIPS-2010 Deep Learning and Unsupervised Feature Learning Workshop, volume 2010, pages 1–9.

Yu Su, Honglei Liu, Semih Yavuz, Izzeddin Gur, Huan Sun, and Xifeng Yan. 2018. Global relation embedding for relation extraction. In Proc. of NAACL-HLT, pages 820–830, New Orleans, Louisiana.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075.

Antti Tarvainen and Harri Valpola. 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems, pages 1195–1204.

Yan Xu, Lili Mou, Ge Li, Yunchuan Chen, Hao Peng, and Zhi Jin. 2015. Classifying relations via long short term memory networks along shortest dependency paths. In Proc. of EMNLP, pages 1785–1794.

Roman Yangarber. 2003. Counter-training in discovery of semantic patterns. In Proc. of ACL.

David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proc. of ACL.

Dian Yu and Heng Ji. 2016. Unsupervised person slot filling based on graph mining. In Proc. of ACL, volume 1, pages 44–53.

Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In EMNLP.


Author Index

Alexeeva, Maria, 18

Bommasani, Rishi, 13

Cardie, Claire, 13

Ding, Shuoyang, 1
Durrett, Greg, 7

Gupta, Aditya, 7

Katiyar, Arzoo, 13
Koehn, Philipp, 1

Luo, Fan, 29

Nagesh, Ajay, 18, 29

Sharp, Rebecca, 29
Surdeanu, Mihai, 18, 29

Valenzuela-Escárcega, Marco, 18

Zupon, Andrew, 18
