
CoNLL 2015

The 19th Conference on Computational Natural Language Learning

Proceedings of the Conference

July 30-31, 2015
Beijing, China

Production and Manufacturing by
Taberg Media Group AB
Box 94, 562 02 Taberg
Sweden

CoNLL 2015 Best Paper Sponsors:

©2015 The Association for Computational Linguistics

ISBN 978-1-941643-77-8


Introduction

The 2015 Conference on Computational Natural Language Learning is the nineteenth in the series of annual meetings organized by SIGNLL, the ACL special interest group on natural language learning. CoNLL 2015 will be held in Beijing, China on July 30-31, 2015, in conjunction with ACL-IJCNLP 2015.

For the first time this year, CoNLL 2015 has accepted both long (9 pages of content plus 2 additional pages of references) and short papers (5 pages of content plus 2 additional pages of references). We received 144 submissions in total, of which 81 were long and 61 were short papers, and 17 were eventually withdrawn. Of the remaining 127 papers, 29 long and 9 short papers were selected to appear in the conference program, resulting in an overall acceptance rate of almost 30%. All accepted papers appear here in the proceedings.

As in previous years, CoNLL 2015 has a shared task, this year on Shallow Discourse Parsing. Papers accepted for the shared task are collected in a companion volume of CoNLL 2015.

To fit the paper presentations in a 2-day program, 16 long papers were selected for oral presentation and the remaining 13 long and the 9 short papers were presented as posters. The papers selected for oral presentation are distributed in four main sessions, each consisting of 4 talks. Each of these sessions also includes 3 or 4 spotlights of the long papers selected for the poster session. In contrast, the spotlights for short papers are presented in a single session of 30 minutes. The remaining sessions were used for presenting a selection of 4 shared task papers, two invited keynote speeches and a single poster session, including long, short and shared task papers.

We would like to thank all the authors who submitted their work to CoNLL 2015, as well as the program committee for helping us select the best papers out of many high-quality submissions. We are also grateful to our invited speakers, Paul Smolensky and Eric Xing, who graciously agreed to give talks at CoNLL.

Special thanks are due to the SIGNLL board members, Xavier Carreras and Julia Hockenmaier, for their valuable advice and assistance in putting together this year's program, and to Ben Verhoeven, for redesigning and maintaining the CoNLL 2015 web page. We are grateful to the ACL organization for helping us with the program, proceedings and logistics. Finally, our gratitude goes to our sponsors, Google Inc. and Microsoft Research, for supporting the best paper award and student scholarships at CoNLL 2015.

We hope you enjoy the conference!

Afra Alishahi and Alessandro Moschitti

CoNLL 2015 conference co-chairs


Organizers:

Afra Alishahi, Tilburg University
Alessandro Moschitti, Qatar Computing Research Institute

Invited Speakers:

Paul Smolensky, Johns Hopkins University
Eric Xing, Carnegie Mellon University

Program Committee:

Steven Abney, University of Michigan
Eneko Agirre, University of the Basque Country (UPV/EHU)
Bharat Ram Ambati, University of Edinburgh
Yoav Artzi, University of Washington
Marco Baroni, University of Trento
Alberto Barrón-Cedeño, Qatar Computing Research Institute
Roberto Basili, University of Roma, Tor Vergata
Ulf Brefeld, TU Darmstadt
Claire Cardie, Cornell University
Xavier Carreras, Xerox Research Centre Europe
Asli Celikyilmaz, Microsoft
Daniel Cer, Google
Kai-Wei Chang, University of Illinois
Colin Cherry, NRC
Grzegorz Chrupała, Tilburg University
Alexander Clark, King's College London
Shay B. Cohen, University of Edinburgh
Paul Cook, University of New Brunswick
Bonaventura Coppola, Technical University of Darmstadt
Danilo Croce, University of Roma, Tor Vergata
Aron Culotta, Illinois Institute of Technology
Scott Cyphers, Massachusetts Institute of Technology
Giovanni Da San Martino, Qatar Computing Research Institute
Walter Daelemans, University of Antwerp, CLiPS
Ido Dagan, Bar-Ilan University
Dipanjan Das, Google Inc.
Doug Downey, Northwestern University
Mark Dras, Macquarie University
Jacob Eisenstein, Georgia Institute of Technology
James Fan, IBM Research
Raquel Fernandez, ILLC, University of Amsterdam
Simone Filice, University of Roma "Tor Vergata"
Antske Fokkens, VU Amsterdam
Bob Frank, Yale University
Stefan L. Frank, Radboud University Nijmegen
Dayne Freitag, SRI International
Michael Gamon, Microsoft Research
Wei Gao, Qatar Computing Research Institute
Andrea Gesmundo, Google Inc.
Kevin Gimpel, Toyota Technological Institute at Chicago
Francisco Guzmán, Qatar Computing Research Institute
Thomas Gärtner, University of Bonn and Fraunhofer IAIS
Iryna Haponchyk, University of Trento
James Henderson, Xerox Research Centre Europe
Julia Hockenmaier, University of Illinois Urbana-Champaign
Dirk Hovy, Center for Language Technology, University of Copenhagen
Rebecca Hwa, University of Pittsburgh
Richard Johansson, University of Gothenburg
Shafiq Joty, Qatar Computing Research Institute
Lukasz Kaiser, CNRS & Google
Mike Kestemont, University of Antwerp
Philipp Koehn, Johns Hopkins University
Anna Korhonen, University of Cambridge
Annie Louis, University of Edinburgh
André F. T. Martins, Priberam, Instituto de Telecomunicacoes
Yuji Matsumoto, Nara Institute of Science and Technology
Stewart McCauley, Cornell University
David McClosky, IBM Research
Ryan McDonald, Google
Margaret Mitchell, Microsoft Research
Roser Morante, Vrije Universiteit Amsterdam
Lluís Màrquez, Qatar Computing Research Institute
Preslav Nakov, Qatar Computing Research Institute
Aida Nematzadeh, University of Toronto
Ani Nenkova, University of Pennsylvania
Vincent Ng, University of Texas at Dallas
Joakim Nivre, Uppsala University
Naoaki Okazaki, Tohoku University
Miles Osborne, Bloomberg
Alexis Palmer, Heidelberg University
Andrea Passerini, University of Trento
Daniele Pighin, Google Inc.
Barbara Plank, University of Copenhagen
Thierry Poibeau, LATTICE-CNRS
Sameer Pradhan, cemantix.org
Verónica Pérez-Rosas, University of Michigan
Jonathon Read, School of Computing, Teesside University
Sebastian Riedel, UCL
Alan Ritter, The Ohio State University
Brian Roark, Google Inc.
Dan Roth, University of Illinois
Josef Ruppenhofer, Hildesheim University
Rajhans Samdani, Google Research
Hinrich Schütze, University of Munich
Djamé Seddah, Université Paris Sorbonne
Aliaksei Severyn, University of Trento
Avirup Sil, IBM T.J. Watson Research Center
Khalil Sima'an, ILLC, University of Amsterdam
Kevin Small, Amazon
Vivek Srikumar, University of Utah
Mark Steedman, University of Edinburgh
Anders Søgaard, University of Copenhagen
Christoph Teichmann, University of Potsdam
Kristina Toutanova, Microsoft Research
Ioannis Tsochantaridis, Google Inc.
Jun'ichi Tsujii, Artificial Intelligence Research Centre at AIST
Kateryna Tymoshenko, University of Trento
Olga Uryupina, University of Trento
Antonio Uva, University of Trento
Antal van den Bosch, Radboud University Nijmegen
Menno van Zaanen, Tilburg University
Aline Villavicencio, Institute of Informatics, Federal University of Rio Grande do Sul
Michael Wiegand, Saarland University
Sander Wubben, Tilburg University
Fabio Massimo Zanzotto, University of Rome "Tor Vergata"
Luke Zettlemoyer, University of Washington
Feifei Zhai, The City University of New York
Min Zhang, Soochow University


Table of Contents

Keynote Talks

On Spectral Graphical Models, and a New Look at Latent Variable Modeling in Natural Language Processing

Eric Xing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xx

Does the Success of Deep Neural Network Language Processing Mean – Finally! – the End of Theoretical Linguistics?

Paul Smolensky . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxii

Long Papers

A Coactive Learning View of Online Structured Prediction in Statistical Machine Translation
Artem Sokolov, Stefan Riezler and Shay B. Cohen . . . . . . . . . . 1

A Joint Framework for Coreference Resolution and Mention Head Detection
Haoruo Peng, Kai-Wei Chang and Dan Roth . . . . . . . . . . 12

A Supertag-Context Model for Weakly-Supervised CCG Parser Learning
Dan Garrette, Chris Dyer, Jason Baldridge and Noah A. Smith . . . . . . . . . . 22

A Synchronous Hyperedge Replacement Grammar based approach for AMR parsing
Xiaochang Peng, Linfeng Song and Daniel Gildea . . . . . . . . . . 32

AIDA2: A Hybrid Approach for Token and Sentence Level Dialect Identification in Arabic
Mohamed Al-Badrashiny, Heba Elfardy and Mona Diab . . . . . . . . . . 42

An Iterative Similarity based Adaptation Technique for Cross-domain Text Classification
Himanshu Sharad Bhatt, Deepali Semwal and Shourya Roy . . . . . . . . . . 52

Analyzing Optimization for Statistical Machine Translation: MERT Learns Verbosity, PRO Learns Length
Francisco Guzmán, Preslav Nakov and Stephan Vogel . . . . . . . . . . 62

Annotation Projection-based Representation Learning for Cross-lingual Dependency Parsing
Min Xiao and Yuhong Guo . . . . . . . . . . 73

Big Data Small Data, In Domain Out-of Domain, Known Word Unknown Word: The Impact of Word Representations on Sequence Labelling Tasks
Lizhen Qu, Gabriela Ferraro, Liyuan Zhou, Weiwei Hou, Nathan Schneider and Timothy Baldwin . . . . . . . . . . 83

Contrastive Analysis with Predictive Power: Typology Driven Estimation of Grammatical Error Distributions in ESL
Yevgeni Berzak, Roi Reichart and Boris Katz . . . . . . . . . . 94

Cross-lingual syntactic variation over age and gender
Anders Johannsen, Dirk Hovy and Anders Søgaard . . . . . . . . . . 103

Cross-lingual Transfer for Unsupervised Dependency Parsing Without Parallel Data
Long Duong, Trevor Cohn, Steven Bird and Paul Cook . . . . . . . . . . 113

Detecting Semantically Equivalent Questions in Online User Forums
Dasha Bogdanova, Cicero dos Santos, Luciano Barbosa and Bianca Zadrozny . . . . . . . . . . 123

Entity Linking Korean Text: An Unsupervised Learning Approach using Semantic Relations
Youngsik Kim and Key-Sun Choi . . . . . . . . . . 132

Incremental Recurrent Neural Network Dependency Parser with Search-based Discriminative Training
Majid Yazdani and James Henderson . . . . . . . . . . 142

Instance Selection Improves Cross-Lingual Model Training for Fine-Grained Sentiment Analysis
Roman Klinger and Philipp Cimiano . . . . . . . . . . 153

Labeled Morphological Segmentation with Semi-Markov Models
Ryan Cotterell, Thomas Müller, Alexander Fraser and Hinrich Schütze . . . . . . . . . . 164

Learning to Exploit Structured Resources for Lexical Inference
Vered Shwartz, Omer Levy, Ido Dagan and Jacob Goldberger . . . . . . . . . . 175

Linking Entities Across Images and Text
Rebecka Weegar, Kalle Åström and Pierre Nugues . . . . . . . . . . 185

Making the Most of Crowdsourced Document Annotations: Confused Supervised LDA
Paul Felt, Eric Ringger, Jordan Boyd-Graber and Kevin Seppi . . . . . . . . . . 194

Multichannel Variable-Size Convolution for Sentence Classification
Wenpeng Yin and Hinrich Schütze . . . . . . . . . . 204

Opinion Holder and Target Extraction based on the Induction of Verbal Categories
Michael Wiegand and Josef Ruppenhofer . . . . . . . . . . 215

Quantity, Contrast, and Convention in Cross-Situated Language Comprehension
Ian Perera and James Allen . . . . . . . . . . 226

Recovering Traceability Links in Requirements Documents
Zeheng Li, Mingrui Chen, LiGuo Huang and Vincent Ng . . . . . . . . . . 237

Structural and lexical factors in adjective placement in complex noun phrases across Romance languages
Kristina Gulordava and Paola Merlo . . . . . . . . . . 247

Symmetric Pattern Based Word Embeddings for Improved Word Similarity Prediction
Roy Schwartz, Roi Reichart and Ari Rappoport . . . . . . . . . . 258

Task-Oriented Learning of Word Embeddings for Semantic Relation Classification
Kazuma Hashimoto, Pontus Stenetorp, Makoto Miwa and Yoshimasa Tsuruoka . . . . . . . . . . 268

Temporal Information Extraction from Korean Texts
Young-Seob Jeong, Zae Myung Kim, Hyun-Woo Do, Chae-Gyun Lim and Ho-Jin Choi . . . . . . . . . . 279

Transition-based Spinal Parsing
Miguel Ballesteros and Xavier Carreras . . . . . . . . . . 289

Short Papers

Accurate Cross-lingual Projection between Count-based Word Vectors by Exploiting Translatable Context Pairs
Shonosuke Ishiwatari, Nobuhiro Kaji, Naoki Yoshinaga, Masashi Toyoda and Masaru Kitsuregawa . . . . . . . . . . 300

Deep Neural Language Models for Machine Translation
Thang Luong, Michael Kayser and Christopher D. Manning . . . . . . . . . . 305

Finding Opinion Manipulation Trolls in News Community Forums
Todor Mihaylov, Georgi Georgiev and Preslav Nakov . . . . . . . . . . 310

Do dependency parsing metrics correlate with human judgments?
Barbara Plank, Héctor Martínez Alonso, Željko Agić, Danijela Merkler and Anders Søgaard . . . . . . . . . . 315

Feature Selection for Short Text Classification using Wavelet Packet Transform
Anuj Mahajan, Sharmistha Jat and Shourya Roy . . . . . . . . . . 321

Learning Adjective Meanings with a Tensor-Based Skip-Gram Model
Jean Maillard and Stephen Clark . . . . . . . . . . 327

Model Selection for Type-Supervised Learning with Application to POS Tagging
Kristina Toutanova, Waleed Ammar, Pallavi Choudhury and Hoifung Poon . . . . . . . . . . 332

One Million Sense-Tagged Instances for Word Sense Disambiguation and Induction
Kaveh Taghipour and Hwee Tou Ng . . . . . . . . . . 338

Reading behavior predicts syntactic categories
Maria Barrett and Anders Søgaard . . . . . . . . . . 345

Conference Program

Thursday, July 30, 2015

08:45-09:00 Opening Remarks

09:00-10:10 Session 1.a: Embedding input and output representations

Multichannel Variable-Size Convolution for Sentence Classification
Wenpeng Yin and Hinrich Schütze

Task-Oriented Learning of Word Embeddings for Semantic Relation Classification
Kazuma Hashimoto, Pontus Stenetorp, Makoto Miwa and Yoshimasa Tsuruoka

Symmetric Pattern Based Word Embeddings for Improved Word Similarity Prediction
Roy Schwartz, Roi Reichart and Ari Rappoport

A Coactive Learning View of Online Structured Prediction in Statistical Machine Translation
Artem Sokolov, Stefan Riezler and Shay B. Cohen

10:10-10:30 Session 1.b: Entity Linking (spotlight presentations)

A Joint Framework for Coreference Resolution and Mention Head Detection
Haoruo Peng, Kai-Wei Chang and Dan Roth

Entity Linking Korean Text: An Unsupervised Learning Approach using Semantic Relations
Youngsik Kim and Key-Sun Choi

Linking Entities Across Images and Text
Rebecka Weegar, Kalle Åström and Pierre Nugues

Recovering Traceability Links in Requirements Documents
Zeheng Li, Mingrui Chen, LiGuo Huang and Vincent Ng

10:30-11:00 Coffee Break

11:00-12:00 Session 2.a: Keynote Talk

On Spectral Graphical Models, and a New Look at Latent Variable Modeling in Natural Language Processing
Eric Xing, Carnegie Mellon University

12:00-12:30 Session 2.b: Short Paper Spotlights

Deep Neural Language Models for Machine Translation
Thang Luong, Michael Kayser and Christopher D. Manning

Reading behavior predicts syntactic categories
Maria Barrett and Anders Søgaard

One Million Sense-Tagged Instances for Word Sense Disambiguation and Induction
Kaveh Taghipour and Hwee Tou Ng

Model Selection for Type-Supervised Learning with Application to POS Tagging
Kristina Toutanova, Waleed Ammar, Pallavi Choudhury and Hoifung Poon

Feature Selection for Short Text Classification using Wavelet Packet Transform
Anuj Mahajan, Sharmistha Jat and Shourya Roy

Do dependency parsing metrics correlate with human judgments?
Barbara Plank, Héctor Martínez Alonso, Željko Agić, Danijela Merkler and Anders Søgaard

Learning Adjective Meanings with a Tensor-Based Skip-Gram Model
Jean Maillard and Stephen Clark

Accurate Cross-lingual Projection between Count-based Word Vectors by Exploiting Translatable Context Pairs
Shonosuke Ishiwatari, Nobuhiro Kaji, Naoki Yoshinaga, Masashi Toyoda and Masaru Kitsuregawa

Finding Opinion Manipulation Trolls in News Community Forums
Todor Mihaylov, Georgi Georgiev and Preslav Nakov

12:30-14:00 Lunch Break

14:00-15:30 Session 3: CoNLL Shared Task

The CoNLL-2015 Shared Task on Shallow Discourse Parsing
Nianwen Xue, Hwee Tou Ng, Sameer Pradhan, Rashmi Prasad, Christopher Bryant, Attapol Rutherford

A Refined End-to-End Discourse Parser
Jianxiang Wang and Man Lan

The UniTN Discourse Parser in CoNLL 2015 Shared Task: Token-level Sequence Labeling with Argument-specific Models
Evgeny Stepanov, Giuseppe Riccardi and Ali Orkan Bayer

The SoNLP-DP System in the CoNLL-2015 shared Task
Fang Kong, Sheng Li and Guodong Zhou

15:30-16:00 Coffee Break

16:00-17:10 Session 4.a: Syntactic Parsing

A Supertag-Context Model for Weakly-Supervised CCG Parser Learning
Dan Garrette, Chris Dyer, Jason Baldridge and Noah A. Smith

Transition-based Spinal Parsing
Miguel Ballesteros and Xavier Carreras

Cross-lingual Transfer for Unsupervised Dependency Parsing Without Parallel Data
Long Duong, Trevor Cohn, Steven Bird and Paul Cook

Incremental Recurrent Neural Network Dependency Parser with Search-based Discriminative Training
Majid Yazdani and James Henderson

17:10-17:30 Session 4.b: Cross-language studies (spotlight presentations)

Structural and lexical factors in adjective placement in complex noun phrases across Romance languages
Kristina Gulordava and Paola Merlo

Instance Selection Improves Cross-Lingual Model Training for Fine-Grained Sentiment Analysis
Roman Klinger and Philipp Cimiano

Annotation Projection-based Representation Learning for Cross-lingual Dependency Parsing
Min Xiao and Yuhong Guo

Friday, July 31, 2015

09:00-10:10 Session 5.a: Semantics

Detecting Semantically Equivalent Questions in Online User Forums
Dasha Bogdanova, Cicero dos Santos, Luciano Barbosa and Bianca Zadrozny

An Iterative Similarity based Adaptation Technique for Cross-domain Text Classification
Himanshu Sharad Bhatt, Deepali Semwal and Shourya Roy

Making the Most of Crowdsourced Document Annotations: Confused Supervised LDA
Paul Felt, Eric Ringger, Jordan Boyd-Graber and Kevin Seppi

Big Data Small Data, In Domain Out-of Domain, Known Word Unknown Word: The Impact of Word Representations on Sequence Labelling Tasks
Lizhen Qu, Gabriela Ferraro, Liyuan Zhou, Weiwei Hou, Nathan Schneider and Timothy Baldwin

10:10-10:30 Session 5.b: Extraction and Labeling (spotlight presentations)

Labeled Morphological Segmentation with Semi-Markov Models
Ryan Cotterell, Thomas Müller, Alexander Fraser and Hinrich Schütze

Opinion Holder and Target Extraction based on the Induction of Verbal Categories
Michael Wiegand and Josef Ruppenhofer

Temporal Information Extraction from Korean Texts
Young-Seob Jeong, Zae Myung Kim, Hyun-Woo Do, Chae-Gyun Lim and Ho-Jin Choi

10:30-11:00 Coffee Break

11:00-12:00 Session 6.a: Keynote Talk

Does the Success of Deep Neural Network Language Processing Mean – Finally! – the End of Theoretical Linguistics?
Paul Smolensky, Johns Hopkins University

12:00-12:30 Session 6.b: Business Meeting

12:30-14:00 Lunch Break

14:00-15:10 Session 7.a: Language Structure

Cross-lingual syntactic variation over age and gender
Anders Johannsen, Dirk Hovy and Anders Søgaard

A Synchronous Hyperedge Replacement Grammar based approach for AMR parsing
Xiaochang Peng, Linfeng Song and Daniel Gildea

Contrastive Analysis with Predictive Power: Typology Driven Estimation of Grammatical Error Distributions in ESL
Yevgeni Berzak, Roi Reichart and Boris Katz

AIDA2: A Hybrid Approach for Token and Sentence Level Dialect Identification in Arabic
Mohamed Al-Badrashiny, Heba Elfardy and Mona Diab

15:10-15:30 Session 7.b: CoNLL Mix (spotlight presentations)

Analyzing Optimization for Statistical Machine Translation: MERT Learns Verbosity, PRO Learns Length
Francisco Guzmán, Preslav Nakov and Stephan Vogel

Learning to Exploit Structured Resources for Lexical Inference
Vered Shwartz, Omer Levy, Ido Dagan and Jacob Goldberger

Quantity, Contrast, and Convention in Cross-Situated Language Comprehension
Ian Perera and James Allen

15:30-16:00 Coffee Break


16:00-17:30 Session 8.a: Joint Poster Presentation (long, short and shared task papers)

Long Papers

A Joint Framework for Coreference Resolution and Mention Head Detection
Haoruo Peng, Kai-Wei Chang and Dan Roth

Entity Linking Korean Text: An Unsupervised Learning Approach using Semantic Relations
Youngsik Kim and Key-Sun Choi

Linking Entities Across Images and Text
Rebecka Weegar, Kalle Åström and Pierre Nugues

Recovering Traceability Links in Requirements Documents
Zeheng Li, Mingrui Chen, LiGuo Huang and Vincent Ng

Structural and lexical factors in adjective placement in complex noun phrases across Romance languages
Kristina Gulordava and Paola Merlo

Instance Selection Improves Cross-Lingual Model Training for Fine-Grained Sentiment Analysis
Roman Klinger and Philipp Cimiano

Annotation Projection-based Representation Learning for Cross-lingual Dependency Parsing
Min Xiao and Yuhong Guo

Labeled Morphological Segmentation with Semi-Markov Models
Ryan Cotterell, Thomas Müller, Alexander Fraser and Hinrich Schütze

Opinion Holder and Target Extraction based on the Induction of Verbal Categories
Michael Wiegand and Josef Ruppenhofer

Temporal Information Extraction from Korean Texts
Young-Seob Jeong, Zae Myung Kim, Hyun-Woo Do, Chae-Gyun Lim and Ho-Jin Choi

Analyzing Optimization for Statistical Machine Translation: MERT Learns Verbosity, PRO Learns Length
Francisco Guzmán, Preslav Nakov and Stephan Vogel

Learning to Exploit Structured Resources for Lexical Inference
Vered Shwartz, Omer Levy, Ido Dagan and Jacob Goldberger

Quantity, Contrast, and Convention in Cross-Situated Language Comprehension
Ian Perera and James Allen

Short Papers

Deep Neural Language Models for Machine Translation
Thang Luong, Michael Kayser and Christopher D. Manning

Reading behavior predicts syntactic categories
Maria Barrett and Anders Søgaard

One Million Sense-Tagged Instances for Word Sense Disambiguation and Induction
Kaveh Taghipour and Hwee Tou Ng

Model Selection for Type-Supervised Learning with Application to POS Tagging
Kristina Toutanova, Waleed Ammar, Pallavi Choudhury and Hoifung Poon

Feature Selection for Short Text Classification using Wavelet Packet Transform
Anuj Mahajan, Sharmistha Jat and Shourya Roy

Do dependency parsing metrics correlate with human judgments?
Barbara Plank, Héctor Martínez Alonso, Željko Agić, Danijela Merkler and Anders Søgaard

Learning Adjective Meanings with a Tensor-Based Skip-Gram Model
Jean Maillard and Stephen Clark

Accurate Cross-lingual Projection between Count-based Word Vectors by Exploiting Translatable Context Pairs
Shonosuke Ishiwatari, Nobuhiro Kaji, Naoki Yoshinaga, Masashi Toyoda and Masaru Kitsuregawa

Finding Opinion Manipulation Trolls in News Community Forums
Todor Mihaylov, Georgi Georgiev and Preslav Nakov

Shared Task Papers

A Hybrid Discourse Relation Parser in CoNLL 2015
Sobha Lalitha Devi, Sindhuja Gopalan, Lakshmi S, Pattabhi RK Rao, Vijay Sundar Ram and Malarkodi C.S.

A Minimalist Approach to Shallow Discourse Parsing and Implicit Relation Recognition
Christian Chiarcos and Niko Schenk

A Shallow Discourse Parsing System Based On Maximum Entropy Model
Jia Sun, Peijia Li, Weiqun Xu and Yonghong Yan

Hybrid Approach to PDTB-styled Discourse Parsing for CoNLL-2015
Yasuhisa Yoshida, Katsuhiko Hayashi, Tsutomu Hirao and Masaaki Nagata

Improving a Pipeline Architecture for Shallow Discourse Parsing
Yangqiu Song, Haoruo Peng, Parisa Kordjamshidi, Mark Sammons and Dan Roth

JAIST: A two-phase machine learning approach for identifying discourse relations in newswire texts
Son Nguyen, Quoc Ho and Minh Nguyen

Shallow Discourse Parsing Using Constituent Parsing Tree
Changge Chen, Peilu Wang and Hai Zhao

Shallow Discourse Parsing with Syntactic and (a Few) Semantic Features
Shubham Mukherjee, Abhishek Tiwari, Mohit Gupta and Anil Kumar Singh

The CLaC Discourse Parser at CoNLL-2015
Majid Laali, Elnaz Davoodi and Leila Kosseim

The DCU Discourse Parser for Connective, Argument Identification and Explicit Sense Classification
Longyue Wang, Chris Hokamp, Tsuyoshi Okita, Xiaojun Zhang and Qun Liu

The DCU Discourse Parser: A Sense Classification Task
Tsuyoshi Okita, Longyue Wang and Qun Liu

17:30-17:45 Session 8.b: Best Paper Award and Closing


Keynote Talk

On Spectral Graphical Models, and a New Look at Latent Variable Modeling in Natural Language Processing

Eric Xing

Carnegie Mellon University

[email protected]

Abstract

Latent variable and latent structure modeling, as widely seen in parsing systems, machine translation systems, topic models, and deep neural networks, represents a key paradigm in Natural Language Processing, where discovering and leveraging syntactic and semantic entities and relationships that are not explicitly annotated in the training set provide a crucial vehicle to obtain various desirable effects such as simplifying the solution space, incorporating domain knowledge, and extracting informative features. However, latent variable models are difficult to train and analyze in that, unlike fully observed models, they suffer from non-identifiability, non-convexity, and over-parameterization, which make them often hard to interpret, and tend to rely on local-search heuristics and heavy manual tuning.

In this talk, I propose to tackle these challenges using spectral graphical models (SGM), which view latent variable models through the lens of linear algebra and tensors. I show how SGMs exploit the connection between latent structure and low rank decomposition, and allow one to develop models and algorithms for a variety of latent variable problems, which unlike traditional techniques, enjoy provable guarantees on correctness and global optimality, can straightforwardly incorporate additional modern techniques such as kernels to achieve more advanced modeling power, and empirically offer a 1-2 orders of magnitude speed up over existing methods while giving comparable or better performance.

This is joint work with Ankur Parikh, Carnegie Mellon University.


Biography of the Speaker

Dr. Eric Xing is a Professor of Machine Learning in the School of Computer Science at Carnegie Mellon University, and Director of the CMU/UPMC Center for Machine Learning and Health. His principal research interests lie in the development of machine learning and statistical methodology, and large-scale computational system and architecture; especially for solving problems involving automated learning, reasoning, and decision-making in high-dimensional, multimodal, and dynamic possible worlds in artificial, biological, and social systems. Professor Xing received a Ph.D. in Molecular Biology from Rutgers University, and another Ph.D. in Computer Science from UC Berkeley. He serves (or served) as an associate editor of the Annals of Applied Statistics (AOAS), the Journal of American Statistical Association (JASA), the IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), the PLoS Journal of Computational Biology, and an Action Editor of the Machine Learning Journal (MLJ), the Journal of Machine Learning Research (JMLR). He was a member of the DARPA Information Science and Technology (ISAT) Advisory Group, a recipient of the NSF Career Award, the Sloan Fellowship, the United States Air Force Young Investigator Award, and the IBM Open Collaborative Research Award. He is the Program Chair of ICML 2014.


Keynote Talk

Does the Success of Deep Neural Network Language Processing Mean – Finally! – the End of Theoretical Linguistics?

Paul Smolensky

Johns Hopkins University

[email protected]

Abstract

Statistical methods in natural-language processing that rest on heavily empirically-based language learning – especially those centrally deploying neural networks – have witnessed dramatic improvement in the past few years, and their success restores the urgency of understanding the relationship between (i) these neural/statistical language systems and (ii) the view of linguistic representation, processing, and structure developed over centuries within theoretical linguistics.

Two hypotheses concerning this relationship arise from our own mathematical and experimental results from past work, which we will present. These hypotheses can guide – we will argue – important future research in the seemingly sizable gap separating computational linguistics from linguistic theories of human language acquisition. These hypotheses are:

1. The internal representational format used in deep neural networks for language – numerical vectors – is covertly an implementation of a system of discrete, symbolic, structured representations which are processed so as to optimally meet the demands of a symbolic grammar recognizable from the perspective of theoretical linguistics.

2. It will not be successes but rather the *failures* of future machine learning approaches to language acquisition which will be most telling for determining whether such approaches capture the crucial limitations on human language learning – limitations, documented in recent artificial-grammar-learning experimental results, which support the nativist Chomskian hypothesis asserting that

• reliably and efficiently learning human grammars from available evidence requires

• that the hypothesis space entertained by the child concerning the set of possible (or likely) human languages

• be limited by abstract, structure-based constraints;

• these constraints can then also explain (in principle at least) the many robustly-respected universals observed in cross-linguistic typology.

This is joint work with Jennifer Culbertson, University of Edinburgh.


Biography of the Speaker

Paul Smolensky is the Krieger-Eisenhower Professor of Cognitive Science at Johns Hopkins University in Baltimore, Maryland, USA. He studies the mutual implications between the theories of neural computation and of universal grammar and has published in distinguished venues including Science and the Proceedings of the National Academy of Science USA. He received the David E. Rumelhart Prize for Outstanding Contributions to the Formal Analysis of Human Cognition (2005), the Chaire de Recherche Blaise Pascal (2008–9), and the Sapir Professorship of the Linguistic Society of America (2015). Primary results include:

• Contradicting widely-held convictions, (i) structured symbolic and (ii) neural network models of cognition are mutually compatible: formal descriptions of the same systems, the mind/brain, at (i) a highly abstract, and (ii) a more physical, level of description. His article "On the proper treatment of connectionism" (1988) was until recently one of the 10-most cited articles in The Behavioral and Brain Sciences, itself the most-cited journal of all the behavioral sciences.

• That the theory of neural computation can in fact strengthen the theory of universal grammar is attested by the revolutionary impact in theoretical linguistics (within phonology in particular) of Optimality Theory, a neural-network-derived symbolic grammar formalism that he developed with Alan Prince (in a book widely released 1993, officially published 2004).

• The learnability theory for Optimality Theory was founded at nearly the same time as the theory itself, in joint work of Smolensky and his PhD student Bruce Tesar (TR 1993; article in the premier linguistic theory journal, Linguistic Inquiry 1998; MIT Press book 2000). This work laid the foundation upon which rests most of the flourishing formal theory of learning in Optimality Theory.

• There is considerable power in formalizing neural network computation as statistical inference/optimization within a dynamical system. Smolensky's Harmony Theory (1981–6) analyzed network computation as Harmony Maximization (an independently-developed homologue to Hopfield's "energy minimization" formulation) and first deployed principles of statistical inference for processing and learning in the bipartite network structure later to be known as the 'Restricted Boltzmann Machine' in the initial work on deep neural network learning (Hinton et al., 2006–).

• Powerful recursive symbolic computation can be achieved with massive parallelism in neural networks designed to process tensor product representations (TR 1987; journal article in Artificial Intelligence 1990). Related uses of the tensor product to structure numerical vectors are currently under rapid development in the field of distributional vector semantics.

Most recently, as argued in Part 1 of the talk, his work shows the value for theoretical and psycholinguistics of representations that share both the discrete structure of symbolic representations and the continuous variation of activity levels in neural network representations (initial results in an article in Cognitive Science by Smolensky, Goldrick & Mathis 2014).


Proceedings of the 19th Conference on Computational Language Learning, pages 1–11,
Beijing, China, July 30-31, 2015. ©2015 Association for Computational Linguistics

A Coactive Learning View of Online Structured Prediction
in Statistical Machine Translation

Artem Sokolov and Stefan Riezler
Computational Linguistics & IWR
69120 Heidelberg, Germany
{sokolov,riezler}@cl.uni-heidelberg.de

Shay B. Cohen
University of Edinburgh
Edinburgh EH8 9LE, [email protected]

Abstract

We present a theoretical analysis of online parameter tuning in statistical machine translation (SMT) from a coactive learning view. This perspective allows us to give regret and generalization bounds for latent perceptron algorithms that are common in SMT, but fall outside of the standard convex optimization scenario. Coactive learning also introduces the concept of weak feedback, which we apply in a proof-of-concept experiment to SMT, showing that learning from feedback that consists of slight improvements over predictions leads to convergence in regret and translation error rate. This suggests that coactive learning might be a viable framework for interactive machine translation. Furthermore, we find that surrogate translations replacing references that are unreachable in the decoder search space can be interpreted as weak feedback and lead to convergence in learning, if they admit an underlying linear model.

1 Introduction

Online learning has become the tool of choice for large scale machine learning scenarios. Compared to batch learning, its advantages include memory efficiency, due to parameter updates being performed on the basis of single examples, and runtime efficiency, where a constant number of passes over the training sample is sufficient for convergence (Bottou and Bousquet, 2004). Statistical Machine Translation (SMT) has embraced the potential of online learning, both to handle millions of features and/or millions of data in parameter tuning via online structured prediction (see Liang et al. (2006) for seminal early work), and in interactive learning from user post-edits (see Cesa-Bianchi et al. (2008) for pioneering work on online computer-assisted translation). Online learning algorithms can be given a theoretical analysis in the framework of online convex optimization (Shalev-Shwartz, 2012), however, the application of online learning techniques to SMT sacrifices convexity because of latent derivation variables, and because of surrogate translations replacing human references that are unreachable in the decoder search space. For example, the objective function actually optimized in Liang et al.'s (2006) application of Collins' (2002) structure perceptron has been analyzed by Gimpel and Smith (2012) as a non-convex ramp loss function (McAllester and Keshet, 2011; Do et al., 2008; Collobert et al., 2006). Since online convex optimization does not provide convergence guarantees for the algorithm of Liang et al. (2006), Gimpel and Smith (2012) recommend CCCP (Yuille and Rangarajan, 2003) instead for optimization, but fail to provide a theoretical analysis of Liang et al.'s (2006) actual algorithm under the new objective.

The goal of this paper is to present an alternative theoretical analysis of online learning algorithms for SMT from the viewpoint of coactive learning (Shivaswamy and Joachims, 2012). This framework allows us to make three main contributions:

• Firstly, the proof techniques of Shivaswamy and Joachims (2012) are a simple and elegant tool for a theoretical analysis of perceptron-style algorithms that date back to the perceptron mistake bound of Novikoff (1962). These techniques provide an alternative to an online gradient descent view of perceptron-style algorithms, and can easily be extended to obtain regret bounds for a latent perceptron algorithm at a rate of $O(\frac{1}{\sqrt{T}})$, with possible improvements by using re-scaling. This bound can be directly used to derive generalization guarantees for online and online-to-batch conversions of the algorithm, based on well-known concentration inequalities. Our analysis covers the approach of Liang et al. (2006) and supersedes Sun et al. (2013)'s analysis of the latent perceptron by providing simpler proofs and by adding a generalization analysis. Furthermore, an online learning framework such as coactive learning covers problems such as changing n-best lists after each update that were explicitly excluded from the batch analysis of Gimpel and Smith (2012) and considered fixed in the analysis of Sun et al. (2013).

• Our second contribution is an extension of the online learning scenario in SMT to include a notion of "weak feedback" for the latent perceptron: Coactive learning follows an online learning protocol, where at each round $t$, the learner predicts a structured object $y_t$ for an input $x_t$, and the user corrects the learner by responding with an improved, but not necessarily optimal, object $\bar{y}_t$ with respect to a utility function $U$. The key asset of coactive learning is the ability of the learner to converge to predictions that are close to optimal structures $y^*_t$, although the utility function is unknown to the learner, and only weak feedback in form of slightly improved structures $\bar{y}_t$ is seen in training. We present a proof-of-concept experiment in which translation feedback of varying grades is chosen from the n-best list of an "optimal" model that has access to full information. We show that weak feedback structures correspond to improvements in TER (Snover et al., 2006) over predicted structures, and that learning from weak feedback minimizes regret and TER.

• Our third contribution is to show that certain practices of computing surrogate references actually can be understood as a form of weak feedback. Coactive learning decouples the learner (performing prediction and updates) from the user (providing feedback in form of an improved translation) so that we can compare different surrogacy modes as different ways of approximate utility maximization. We show experimentally that learning from surrogate "hope" derivations (Chiang, 2012) minimizes regret and TER, thus favoring surrogacy modes that admit an underlying linear model, over "local" updates (Liang et al., 2006) or "oracle" derivations (Sokolov et al., 2013), for which learning does not converge.

It is important to note that the goal of our experiments is not to present improvements of coactive learning over the "optimal" full-information model in terms of standard SMT performance. Instead, our goal is to present experiments that serve as a proof-of-concept of the feasibility of coactive learning from weak feedback for SMT, and to propose a new perspective on standard practices of learning from surrogate translations. The rest of this paper is organized as follows. After a review of related work (Section 2), we present a latent perceptron algorithm and analyze its convergence and generalization properties (Section 3). Our first set of experiments (Section 4.1) confirms our theoretical analysis by showing convergence in regret and TER for learning from weak and strong feedback. Our second set of experiments (Section 4.2) analyzes the relation of different surrogacy modes to minimization of regret and TER.

2 Related Work

Our work builds on the framework of coactive learning, introduced by Shivaswamy and Joachims (2012). We extend their algorithms and proofs to the area of SMT where latent variable models are appropriate, and additionally present generalization guarantees and an online-to-batch conversion. Our theoretical analysis is easily extendable to the full information case of Sun et al. (2013). We also extend our own previous work (Sokolov et al., 2015) with theory and experiments for online-to-batch conversion, and with experiments on coactive learning from surrogate translations.

Online learning has been applied for discriminative training in SMT, based on perceptron-type algorithms (Shen et al. (2004), Watanabe et al. (2006), Liang et al. (2006), Yu et al. (2013), inter alia), or large-margin approaches (Tillmann and Zhang (2006), Watanabe et al. (2007), Chiang et al. (2008), Chiang et al. (2009), Chiang (2012), inter alia). The latest incarnations are able to handle millions of features and millions of parallel sentences (Simianer et al. (2012), Eidelmann (2012), Watanabe (2012), Green et al. (2013), inter alia). Most approaches rely on hidden derivation variables, use some form of surrogate references, and involve n-best lists that change after each update.

Online learning from post-edits has mostly been confined to "simulated post-editing" where independently created human reference translations, or post-edits on the output from similar SMT systems, are used for online learning (Cesa-Bianchi et al. (2008), López-Salcedo et al. (2012), Martínez-Gómez et al. (2012), Saluja et al. (2012), Saluja and Zhang (2014), inter alia). Recent approaches extend online parameter updating by online phrase extraction (Wäschle et al. (2013), Bertoldi et al. (2014), Denkowski et al. (2014), Green et al. (2014), inter alia). We exclude dynamic phrase table extension, which has been shown to be important in online learning for post-editing, in our theoretical analysis (Denkowski et al., 2014).

Learning from weak feedback is related to binary response-based learning where a meaning representation is "tried out" by iteratively generating system outputs, receiving feedback from world interaction, and updating the model parameters. Such world interaction consists of database access in semantic parsing (Kwiatkowski et al. (2013), Berant et al. (2013), or Goldwasser and Roth (2013), inter alia). Feedback in response-based learning is given by a user accepting or rejecting system predictions, but not by user corrections.

Lastly, feedback in form of numerical utility values for actions is studied in the frameworks of reinforcement learning (Sutton and Barto, 1998) or in online learning with limited feedback, e.g., multi-armed bandit models (Cesa-Bianchi and Lugosi, 2006). Our framework replaces quantitative feedback with immediate qualitative feedback in form of a structured object that improves upon the utility of the prediction.

3 Coactive Learning for Online Latent Structured Prediction

3.1 Notation and Background

Let $\mathcal{X}$ denote a set of input examples, e.g., sentences, and let $\mathcal{Y}(x)$ denote a set of structured outputs for $x \in \mathcal{X}$, e.g., translations. We define $\mathcal{Y} = \cup_x \mathcal{Y}(x)$. Furthermore, by $\mathcal{H}(x, y)$ we denote a set of possible hidden derivations for a structured output $y \in \mathcal{Y}(x)$, e.g., for phrase-based SMT, the hidden derivation is determined by a phrase segmentation and a phrase alignment between source and target sentences. Every hidden derivation $h \in \mathcal{H}(x, y)$ deterministically identifies an output $y \in \mathcal{Y}(x)$. We define $\mathcal{H} = \cup_{x,y} \mathcal{H}(x, y)$. Let $\phi : \mathcal{X} \times \mathcal{Y} \times \mathcal{H} \rightarrow \mathbb{R}^d$ denote a feature function that maps a triplet $(x, y, h)$ to a $d$-dimensional vector. For phrase-based SMT, we use 14 features, defined by phrase translation probabilities, language model probability, distance-based and lexicalized reordering probabilities, and word and phrase penalty. We assume that the feature function has a bounded radius, i.e. that $\|\phi(x, y, h)\| \leq R$ for all $x, y, h$. By $\Delta_{h,h'}$ we denote a distance function that is defined for any $h, h' \in \mathcal{H}$, and is used to scale the step size of updates during learning. In our experiments, we use the ordinary Euclidean distance between the feature vectors of derivations. We assume a linear model with fixed parameters $w^*$ such that each input example is mapped to its correct derivation and structured output by using $(y^*, h^*) = \arg\max_{y \in \mathcal{Y}(x),\, h \in \mathcal{H}(x,y)} {w^*}^\top \phi(x, y, h)$. We define for each given input $x$ its highest scoring derivation over all outputs $\mathcal{Y}(x)$ such that $h(x; w) = \arg\max_{h' \in \mathcal{H}(x,y)} \max_{y \in \mathcal{Y}(x)} w^\top \phi(x, y, h')$, and the highest scoring derivation for a given output $y \in \mathcal{Y}(x)$ such that $h(x|y; w) = \arg\max_{h' \in \mathcal{H}(x,y)} w^\top \phi(x, y, h')$. In the following theoretical exposition we assume that the $\arg\max$ operation can be computed exactly.

Algorithm 1 Feedback-based Latent Perceptron
1: Initialize $w \leftarrow 0$
2: for $t = 1, \ldots, T$ do
3:   Observe $x_t$
4:   $(y_t, h_t) \leftarrow \arg\max_{(y,h)} w_t^\top \phi(x_t, y, h)$
5:   Obtain weak feedback $\bar{y}_t$
6:   if $y_t \neq \bar{y}_t$ then
7:     $\bar{h}_t \leftarrow \arg\max_{h} w_t^\top \phi(x_t, \bar{y}_t, h)$
8:     $w_{t+1} \leftarrow w_t + \Delta_{\bar{h}_t, h_t} \big( \phi(x_t, \bar{y}_t, \bar{h}_t) - \phi(x_t, y_t, h_t) \big)$

3.2 Feedback-based Latent Perceptron

We assume an online setting, in which examples are presented one-by-one. The learner observes an input $x_t$, predicts an output structure $y_t$, and is presented with feedback $\bar{y}_t$ about its prediction, which is used to make an update to an existing parameter vector. Algorithm 1 is called "Feedback-based Latent Perceptron" to stress the fact that it only uses weak feedback to its predictions for learning, but does not necessarily observe optimal structures as in the full information case (Sun et al., 2013). Learning from full information can be recovered by setting the informativeness parameter $\alpha$ to 1 in Equation (2) below, in which case the feedback structure $\bar{y}_t$ equals the optimal structure $y^*_t$. Algorithm 1 differs from the algorithm of Shivaswamy and Joachims (2012) by a joint maximization over output structures $y$ and hidden derivations $h$ in prediction (line 4), by choosing a hidden derivation $\bar{h}$ for the feedback structure $\bar{y}$ (line 7), and by the use of the re-scaling factor $\Delta_{\bar{h}_t, h_t}$ in the update (line 8), where $\bar{h}_t = h(x_t|\bar{y}_t; w_t)$ and $h_t = h(x_t; w_t)$ are the derivations of the feedback structure and the prediction at time $t$, respectively. In our theoretical exposition, we assume that $\bar{y}_t$ is reachable in the search space of possible outputs, that is, $\bar{y}_t \in \mathcal{Y}(x_t)$.
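To make the update rule concrete, here is a minimal Python sketch of Algorithm 1 under the assumptions of Section 3.1. The callables phi, Y, H, and get_feedback are hypothetical placeholders for the decoder's feature function, the candidate output and derivation sets (in practice an n-best list), and the source of weak feedback; the constant learning rate mirrors the rates reported in Section 4.

    import numpy as np

    def argmax_joint(w, x, phi, Y, H):
        # Line 4: joint maximization over outputs y and hidden derivations h.
        return max(((y, h) for y in Y(x) for h in H(x, y)),
                   key=lambda yh: w @ phi(x, yh[0], yh[1]))

    def argmax_derivation(w, x, y, phi, H):
        # Line 7: best derivation for a fixed (feedback) output y.
        return max(H(x, y), key=lambda h: w @ phi(x, y, h))

    def feedback_latent_perceptron(examples, phi, Y, H, get_feedback, dim,
                                   lr=1e-5, rescale=True):
        # Sketch of the Feedback-based Latent Perceptron (Algorithm 1).
        w = np.zeros(dim)                                       # line 1
        for x in examples:                                      # lines 2-3
            y_pred, h_pred = argmax_joint(w, x, phi, Y, H)      # line 4
            y_bar = get_feedback(x, y_pred)                     # line 5
            if y_bar != y_pred:                                 # line 6
                h_bar = argmax_derivation(w, x, y_bar, phi, H)  # line 7
                diff = phi(x, y_bar, h_bar) - phi(x, y_pred, h_pred)
                # Re-scaling factor Delta: here the Euclidean distance between
                # the feature vectors of the two derivations (Section 3.1).
                delta = np.linalg.norm(diff) if rescale else 1.0
                w = w + lr * delta * diff                       # line 8
        return w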

3.3 Feedback of Graded Utility

The key in the theoretical analysis in Shivaswamy and Joachims (2012) is the notion of a linear utility function, determined by parameter vector $w^*$, that is unknown to the learner:
$$U_h(x, y) = {w^*}^\top \phi(x, y, h).$$
Upon a system prediction, the user approximately maximizes utility, and returns an improved object $\bar{y}_t$ that has higher utility than the predicted $y_t$ s.t.
$$U(x_t, \bar{y}_t) > U(x_t, y_t),$$
where for given $x \in \mathcal{X}$, $y \in \mathcal{Y}(x)$, and $h^* = \arg\max_{h \in \mathcal{H}(x,y)} U_h(x, y)$, we define $U(x, y) = U_{h^*}(x, y)$ and drop the subscript unless $h \neq h^*$. Importantly, the feedback is typically not the optimal structure $y^*_t$ that is defined as
$$y^*_t = \arg\max_{y \in \mathcal{Y}(x_t)} U(x_t, y).$$
While not receiving optimal structures in training, the learning goal is to predict objects with utility close to optimal structures $y^*_t$. The regret that is suffered by the algorithm when predicting object $y_t$ instead of the optimal object $y^*_t$ is
$$\mathrm{REG}_T = \frac{1}{T} \sum_{t=1}^{T} \big( U(x_t, y^*_t) - U(x_t, y_t) \big). \quad (1)$$
To quantify the amount of information in the weak feedback, Shivaswamy and Joachims (2012) define a notion of $\alpha$-informative feedback, which we generalize as follows for the case of latent derivations. We assume that there exists a derivation $\bar{h}_t$ for the feedback structure $\bar{y}_t$, such that for all predictions $y_t$, the (re-scaled) utility of the weak feedback $\bar{y}_t$ is higher than the (re-scaled) utility of the prediction $y_t$ by a fraction $\alpha$ of the maximum possible utility range (under the given utility model). Thus $\forall t, \exists \bar{h}_t, \forall h$ and for $\alpha \in (0, 1]$:
$$\big( U_{\bar{h}_t}(x_t, \bar{y}_t) - U_h(x_t, y_t) \big) \cdot \Delta_{\bar{h}_t, h} \geq \alpha \big( U(x_t, y^*_t) - U(x_t, y_t) \big) - \xi_t, \quad (2)$$
where $\xi_t \geq 0$ are slack variables allowing for violations of (2) for given $\alpha$. For slack $\xi_t = 0$, user feedback is called strictly $\alpha$-informative.
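In the simulation experiments of Section 4.1, feedback of a chosen grade is generated by inspecting the optimal $w^*$ and selecting translations from an n-best list. The sketch below shows one way such a selection could be instantiated: among n-best candidates (standing in for $\mathcal{Y}(x_t)$), it returns the weakest candidate whose utility gain over the prediction still meets the $\alpha$-informativeness threshold with zero slack. The helper names are hypothetical, and the re-scaling factor of condition (2) is ignored for simplicity.

    import numpy as np

    def utility(w_star, phi, x, y, H):
        # U(x, y): linear utility under w*, maximized over hidden derivations of y.
        return max(w_star @ phi(x, y, h) for h in H(x, y))

    def weakest_alpha_informative(w_star, phi, x, y_pred, nbest, H, alpha):
        # The n-best list stands in for Y(x); its best element plays the role of y*.
        u_pred = utility(w_star, phi, x, y_pred, H)
        u_star = max(utility(w_star, phi, x, y, H) for y in nbest)
        threshold = u_pred + alpha * (u_star - u_pred)
        admissible = [y for y in nbest
                      if utility(w_star, phi, x, y, H) >= threshold]
        # Return the admissible candidate with the lowest utility, i.e. the
        # weakest feedback that is still strictly alpha-informative.
        return min(admissible, key=lambda y: utility(w_star, phi, x, y, H))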

3.4 Convergence Analysis

A central theoretical result in learning from weak feedback is an analysis that shows that Algorithm 1 minimizes an upper bound on the average regret (1), despite the fact that optimal structures are not used in learning:

Theorem 1. Let $D_T = \sum_{t=1}^{T} \Delta^2_{\bar{h}_t, h_t}$. Then the average regret of the feedback-based latent perceptron can be upper bounded for any $\alpha \in (0, 1]$, for any $w^* \in \mathbb{R}^d$:
$$\mathrm{REG}_T \leq \frac{1}{\alpha T} \sum_{t=1}^{T} \xi_t + \frac{2R\|w^*\|}{\alpha} \frac{\sqrt{D_T}}{T}.$$

A proof for Theorem 1 is similar to the proof of Shivaswamy and Joachims (2012) and the original mistake bound for the perceptron of Novikoff (1962).¹ The theorem can be interpreted as follows: we expect lower average regret for higher values of $\alpha$; due to the dominant term $T$, regret will approach the minimum of the accumulated slack (in case feedback structures violate Equation (2)) or 0 (in case of strictly $\alpha$-informative feedback). The main difference between the above result and the result of Shivaswamy and Joachims (2012) is the term $D_T$ following from the re-scaled distance of latent derivations. Their analysis is agnostic of latent derivations, and can be recovered by setting this scaling factor to 1. This yields $D_T = T$, and thus recovers the main factor $\frac{\sqrt{D_T}}{T} = \frac{1}{\sqrt{T}}$ in their regret bound. In our algorithm, penalizing large distances of derivations can help to move derivations $h_t$ closer to $\bar{h}_t$, therefore decreasing $D_T$ as learning proceeds. Thus in case $D_T < T$, our bound is better than the original bound of Shivaswamy and Joachims (2012) for a perceptron without re-scaling. As we will show experimentally, re-scaling leads to a faster convergence in practice.

3.5 Generalization Analysis

Regret bounds measure how good the average prediction of the current model is on the next example in the given sequence, thus it seems plausible that a low regret on a sequence of examples should imply good generalization performance on the entire domain of examples.

¹ Short proofs are provided in the appendix.


Generalization for Online Learning. First we present a generalization bound for the case of online learning on a sequence of random examples, based on generalization bounds for expected average regret as given by Cesa-Bianchi et al. (2004). Let probabilities $P$ and expectations $\mathbb{E}$ be defined with respect to the fixed unknown underlying distribution according to which all examples are drawn. Furthermore, we bound our loss function $\ell_t = U(x_t, y^*_t) - U(x_t, y_t)$ to $[0, 1]$ by adding a normalization factor $2R\|w^*\|$ s.t. $\mathrm{REG}_T = \frac{1}{T} \sum_{t=1}^{T} \ell_t$. Plugging the bound on $\mathrm{REG}_T$ of Theorem 1 directly into Proposition 1 of Cesa-Bianchi et al. (2004) gives the following theorem:

Theorem 2. Let $0 < \delta < 1$, and let $x_1, \ldots, x_T$ be a sequence of examples that Algorithm 1 observes. Then with probability at least $1 - \delta$,
$$\mathbb{E}[\mathrm{REG}_T] \leq \frac{1}{\alpha T} \sum_{t=1}^{T} \xi_t + \frac{2R\|w^*\|}{\alpha} \frac{\sqrt{D_T}}{T} + 2\|w^*\| R \sqrt{\frac{2}{T} \ln \frac{1}{\delta}}.$$

The generalization bound tells us how far the expected average regret $\mathbb{E}[\mathrm{REG}_T]$ (or average risk, in terms of Cesa-Bianchi et al. (2004)) is from the average regret that we actually observe in a specific instantiation of the algorithm.

Generalization for Online-to-Batch Conversion. In practice, perceptron-type algorithms are often applied in a batch learning scenario, i.e., the algorithm is applied for $K$ epochs to a training sample of size $T$ and then used for prediction on an unseen test set (Freund and Schapire, 1999; Collins, 2002). The difference to the online learning scenario is that we treat the multi-epoch algorithm as an empirical risk minimizer that selects a final weight vector $w_{T,K}$ whose expected loss on unseen data we would like to bound. We assume that the algorithm is fed with a sequence of examples $x_1, \ldots, x_T$, and at each epoch $k = 1, \ldots, K$ it makes a prediction $y_{t,k}$. The correct label is $y^*_t$. For $k = 1, \ldots, K$ and $t = 1, \ldots, T$, let $\ell_{t,k} = U(x_t, y^*_t) - U(x_t, y_{t,k})$, and denote by $\Delta_{t,k}$ and $\xi_{t,k}$ the distance at epoch $k$ for example $t$, and the slack at epoch $k$ for example $t$, respectively. Finally, we denote by $D_{T,K} = \sum_{t=1}^{T} \Delta^2_{t,K}$, and by $w_{T,K}$ the final weight vector returned after $K$ epochs. We state a condition of convergence²:

Condition 1. Algorithm 1 has converged on training instances $x_1, \ldots, x_T$ after $K$ epochs if the predictions on $x_1, \ldots, x_T$ using the final weight vector $w_{T,K}$ are the same as the predictions on $x_1, \ldots, x_T$ in the $K$th epoch.

Denote by $\mathbb{E}_X(\ell(x))$ the expected loss on unseen data when using $w_{T,K}$, where $\ell(x) = U(x, y^*) - U(x, y')$, $y^* = \arg\max_y U(x, y)$ and $y' = \arg\max_y \max_h w_{T,K}^\top \phi(x, y, h)$. We can now state the following result:

Theorem 3. Let $0 < \delta < 1$, and let $x_1, \ldots, x_T$ be a sample for the multiple-epoch perceptron algorithm such that the algorithm converged on it (Condition 1). Then, with probability at least $1 - \delta$, the expected loss of the feedback-based latent perceptron satisfies:
$$\mathbb{E}_X(\ell(x)) \leq \frac{1}{\alpha T} \sum_{t=1}^{T} \xi_{t,K} + \frac{2R\|w^*\|}{\alpha} \frac{\sqrt{D_{T,K}}}{T} + R\|w^*\| \sqrt{\frac{8 \ln \frac{2}{\delta}}{T}}.$$

The theorem can be interpreted as a bound on the generalization error (lefthand-side) by the empirical error (the first two righthand-side terms) and the variance caused by the finite sample (the third term in the theorem). The result follows directly from McDiarmid's concentration inequality.

² This condition is too strong for large datasets. However, we believe that a weaker condition based on ideas from the perceptron cycling theorem (Block and Levin, 1970; Gelfand et al., 2010) should suffice to show a similar bound.

4 Experiments

We used the LIG corpus3 which consists of 10,881tuples of French-English post-edits (Potet et al.,2012). The corpus is a subset of the news-commentary dataset provided at WMT4 and con-tains input French sentences, MT outputs, post-edited outputs and English references. To prepareSMT outputs for post-editing, the creators of thecorpus used their own WMT10 system (Potet etal., 2010), based on the Moses phrase-based de-coder (Koehn et al., 2007) with dense features.We replicated a similar Moses system using thesame monolingual and parallel data: a 5-gramlanguage model was estimated with the KenLMtoolkit (Heafield, 2011) on news.en data (48.65Msentences, 1.13B tokens), pre-processed with thetools from the cdec toolkit (Dyer et al., 2010).


3http://www-clips.imag.fr/geod/User/marion.potet/index.php?page=download

4http://statmt.org/wmt10/translation-task.html


Figure 1: Regret and TER vs. iterations for α-informative feedback ranging from weak (α = 0.1) to strong (α = 1.0) informativeness, with (lower part) and without re-scaling (upper part).

Parallel data (europarl + news-comm, 1.64M sentences) were similarly pre-processed and aligned with fast_align (Dyer et al., 2013). In all experiments, training is started with the Moses default weights. The size of the n-best list, where used, was set to 1,000. Irrespective of the use of re-scaling in perceptron training, a constant learning rate of 10^-5 was used for learning from simulated feedback, and 10^-4 for learning from surrogate translations.

Our experiments on online learning require a random sequence of examples for learning. Following the techniques described in Bertsekas (2011) to generate random sequences for incremental optimization, we compared cyclic order (K epochs of T examples in fixed order), randomized order (sampling datapoints with replacement), and random shuffling of datapoints after each cycle, and found nearly identical regret curves for all three scenarios. In the following, all figures are shown for sequences in the cyclic order, with re-decoding after each update. Furthermore, note that in all three definitions of sequence, we never see the fixed optimal feedback y*_t in training, but instead in general a different feedback structure ȳ_t (and a different prediction ŷ_t) every time we see the same input x_t.

4.1 Idealized Weak and Strong Feedback

In a first experiment, we apply Algorithm 1 to user feedback of varying utility grade. The goal of this experiment is to confirm our theoretical analysis by showing convergence in regret for learning from weak and strong feedback. We select feedback of varying grade by directly inspecting the optimal w*, thus this feedback is idealized. However, the experiment also has a realistic background since we show that α-informative feedback corresponds to improvements under standard evaluation metrics such as lowercased and tokenized TER, and that learning from weak and strong feedback leads to convergence in TER on test data.

                         strict (ξ_t = 0)   slack (ξ_t > 0)
  # datapoints                5,725              1,155
  TER(ȳ_t) < TER(ŷ_t)        52.17%             32.55%
  TER(ȳ_t) = TER(ŷ_t)        23.95%             20.52%
  TER(ȳ_t) > TER(ŷ_t)        23.88%             46.93%

Table 1: Improved utility vs. improved TER distance to human post-edits for α-informative feedback ȳ_t compared to prediction ŷ_t using default weights at α = 0.1.

For this experiment, the post-edit data from the LIG corpus were randomly split into 3 subsets: PE-train (6,881 sentences), PE-dev, and PE-test (2,000 sentences each). PE-train was used for our online learning experiments. PE-test was held out for testing the algorithms' progress on unseen data. PE-dev was used to obtain w* to define the utility model. This was done by MERT optimization (Och, 2003) towards post-edits under the TER target metric.


             % strictly α-informative
  local      39.46%
  filtered   47.73%
  hope       83.30%

Table 2: α-informativeness of surrogacy modes.

Note that the goal of our experiments is not to improve SMT performance over any algorithm that has access to full information to compute w*. Rather, we want to show that learning from weak feedback leads to convergence in regret with respect to the optimal model, albeit at a slower rate than learning from strong feedback. The feedback data in this experiment were generated by searching the n-best list for translations that are α-informative at α ∈ {0.1, 0.5, 1.0} (with possible non-zero slack). This is achieved by scanning the n-best list output for every input x_t and returning the first ȳ_t ≠ ŷ_t that satisfies Equation (2).5 This setting can be thought of as an idealized scenario where a user picks translations from the n-best list that are considered improvements under the optimal w*.
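A sketch of one plausible implementation of this feedback simulation; `utility` stands for the utility U under the optimal w* and is an assumed callback, as are the remaining argument names (Equation (2) itself is not repeated here).

```python
def simulate_feedback(nbest, y_hat, y_star, utility, alpha):
    """Scan the n-best list and return the first candidate different from the
    prediction y_hat that improves utility under w*, together with the slack
    xi needed for it to count as alpha-informative w.r.t. the optimum y_star."""
    u_hat, u_star = utility(y_hat), utility(y_star)
    for y_bar in nbest:
        if y_bar == y_hat:
            continue
        gain = utility(y_bar) - u_hat
        if gain > 0.0:
            slack = max(0.0, alpha * (u_star - u_hat) - gain)
            return y_bar, slack
    return None, None  # no improving candidate found in the n-best list
```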

In order to verify that our notion of graded utility corresponds to a realistic concept of graded translation quality, we compared improvements in utility to improved TER distance to human post-edits. Table 1 shows that for predictions under default weights, we obtain strictly α-informative (for α = 0.1) feedback for 5,725 out of 6,881 datapoints in PE-train. These feedback structures improve utility per definition, and they also yield better TER distance to post-edits in the majority of cases. A non-zero slack has to be used in 1,155 datapoints. Here the majority of feedback structures do not improve TER distance.

Convergence results for different learning scenarios are shown in Figure 1. The left upper part of Figure 1 shows average utility regret against iterations for a setup without re-scaling, i.e., setting Δ_{h̄,ĥ} = 1 in the definition of α-informative feedback (Equation (2)) and in the update of Algorithm 1 (line 8). As predicted by our regret analysis, higher α leads to faster convergence, but all three curves converge towards a minimal regret.

5Note that feedback provided in this way might be stronger than required at a particular value of α, since for all β ≥ α, strictly β-informative feedback is also strictly α-informative. On the other hand, because of the limited size of the n-best list, we cannot assume strictly α-informative user feedback with zero slack ξ_t. In experiments where updates are only done if feedback is strictly α-informative we found similar convergence behavior.

Figure 3: Average loss ℓ_t on heldout and train data.

Also, the difference between the curves for α = 0.1 and α = 1.0 is much smaller than a factor of ten. As expected from the correspondence of α-informative feedback to improvements in TER, similar relations are obtained when plotting TER scores on test data for training from weak feedback at different utility grades. This is shown in the right upper part of Figure 1.

The left lower part of Figure 1 shows average utility regret plotted against iterations for a setup that uses re-scaling. We define Δ_{h̄_t,ĥ_t} by the ℓ2-distance between the feature vector φ(x_t, ȳ_t, h̄_t) of the derivation of the feedback structure and the feature vector φ(x_t, ŷ_t, ĥ_t) of the derivation of the predicted structure. We see that the curves for all grades of feedback converge faster than the corresponding curves for un-scaled feedback shown in the upper part of Figure 1. Furthermore, as shown in the right lower part of Figure 1, TER is decreased on test data as well at a faster rate.6
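A small numpy sketch of the resulting re-scaled update; the function and argument names are ours, and folding the constant learning rate from the experimental setup into this step is an assumption.

```python
import numpy as np

def rescaled_update(w, phi_feedback, phi_predicted, learning_rate=1e-5):
    """Update w with the feature difference between the feedback derivation and
    the predicted derivation, re-scaled by their l2-distance Delta."""
    diff = phi_feedback - phi_predicted
    delta = np.linalg.norm(diff)        # Delta_{h_bar_t, h_hat_t}
    return w + learning_rate * delta * diff
```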

Lastly, we present an experimental validation of the online-to-batch application of our algorithm. That is, we would like to evaluate predictions that use the final weight vector w_{T,K} by comparing the generalization error with the empirical error stated in Theorem 3. The standard way to do this is to compare the average loss on heldout data with the average loss on the training sequence. Figure 3 shows these results for models trained on α-informative feedback of α ∈ {0.1, 0.5, 1.0} for 10 epochs. Similar to the online learning setup, higher α results in faster convergence. Furthermore, curves for training and heldout evaluation converge at the same rate.

4.2 Feedback from Surrogate Translations

In this section, we present experiments on learning from real human post-edits.

6We also conducted online-to-batch experiments for simulated feedback at α ∈ {0.1, 0.5, 1.0}. Similar to the online learning setup, higher α results in faster convergence.


Figure 2: Regret and TER for online learning from oracles, local, filtered, and hope surrogates.

The goal of this experiment is to investigate whether the standard practices for extracting feedback from observed user post-edits for discriminative SMT can be matched with the modeling assumptions of the coactive learning framework. The customary practice in discriminative learning for SMT is to replace observed user translations by surrogate translations since the former are often not reachable in the search space of the SMT decoder. In our case, only 29% of the post-edits in the LIG corpus were reachable by the decoder. We compare four heuristics of generating surrogate translations: oracles are generated using the lattice oracle approach of Sokolov et al. (2013) which returns the closest path in the decoder search graph as reachable surrogate translation.7 A local surrogate ȳ is chosen from the n-best list of the linear model as the translation that achieves the best TER score with respect to the actual post-edit y: ȳ = argmin_{y′ ∈ n-best(x_t; w_t)} TER(y′, y). This corresponds to the local update mode of Liang et al. (2006). A filtered surrogate translation ȳ is found by scanning down the n-best list, and accepting the first translation as feedback that improves TER score with respect to the human post-edit y over the 1-best prediction ŷ_t of the linear model: TER(ȳ, y) < TER(ŷ_t, y). Finally, a hope surrogate is chosen from the n-best list as the translation that jointly maximizes model score under the linear model and negative TER score with respect to the human post-edit: ȳ = argmax_{y′ ∈ n-best(x_t; w_t)} (−TER(y′, y) + w_t^⊤ φ(x_t, y′, h)). This corresponds to what Chiang (2012) termed "hope derivations". Informally, oracles are model-agnostic, as they can pick a surrogate even from outside of the n-best list; local is constrained to the n-best list, though still ignoring the ordering according to the linear model; finally, filtered and hope represent different ways of letting the model score influence the selected surrogate.

7While the original algorithm is designed to maximize the BLEU score of the returned path, we tuned its two free parameters to maximize TER.
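The three n-best-list-based surrogacy modes can be written down compactly. In the sketch below, `ter` and `model_score` are assumed callbacks (TER against the post-edit, and w_t^T φ under the current linear model); the oracle mode is omitted since it operates on the decoder lattice rather than the n-best list.

```python
def local_surrogate(nbest, post_edit, ter):
    """Local update (Liang et al., 2006): the n-best entry closest to the post-edit."""
    return min(nbest, key=lambda y: ter(y, post_edit))

def filtered_surrogate(nbest, post_edit, y_hat, ter):
    """First n-best entry with better TER w.r.t. the post-edit than the 1-best prediction."""
    for y in nbest:
        if ter(y, post_edit) < ter(y_hat, post_edit):
            return y
    return None

def hope_surrogate(nbest, post_edit, model_score, ter):
    """Hope derivation (Chiang, 2012): jointly maximize model score and negative TER."""
    return max(nbest, key=lambda y: model_score(y) - ter(y, post_edit))
```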

As shown in Figure 2, regret and TER decrease with the increased amount of information about the assumed linear model that is induced by the surrogate translations: Learning from oracle surrogates does not converge in regret and TER. The local surrogates extracted from 1,000-best lists still do not make effective use of the linear model, while filtered surrogates enforce an improvement over the prediction under TER towards the human post-edit, and improve convergence in learning. Empirically, convergence is achieved only for hope surrogates that jointly maximize negative TER and linear model score, with a convergence behavior that is very similar to learning from weak α-informative feedback at α ≈ 0.1. We quantify this in Table 2, where we see that the improvement in TER over the prediction that holds for any hope derivation corresponds to an improvement in α-informativeness: hope surrogates are strictly α-informative in 83.3% of the cases in our experiment, whereas we find a correspondence to strict α-informativeness only in 45.74% or 39.46% of the cases for filtered and local surrogates, respectively.

5 Discussion

We presented a theoretical analysis of online learning for SMT from a coactive learning perspective. This viewpoint allowed us to give regret and generalization bounds for perceptron-style online learners that fall outside the convex optimization scenario because of latent variables and changing feedback structures. We introduced the concept of weak feedback into online learning for SMT, and provided proof-of-concept experiments whose goal was to show that learning from weak feedback converges to minimal regret, albeit at a slower rate than learning from strong feedback. Furthermore, we showed that the SMT standard of learning from surrogate hope derivations can be interpreted as a search for weak improvements under the assumed linear model. This justifies the importance of admitting an underlying linear model in computing surrogate derivations from a coactive learning perspective.

Finally, we hope that our analysis motivates further work in which the idea of learning from weak feedback is taken a step further. For example, our results could perhaps be strengthened by applying richer feature sets or dynamic phrase table extension in experiments on interactive SMT. Our theory would support a new post-editing scenario where users pick translations from the n-best list that they consider improvements over the prediction. Furthermore, it would be interesting to see if "light" post-edits that are more easily reachable and easier to elicit than "full" post-edits provide a strong enough signal for learning.

Acknowledgments

This research was supported in part by DFG grant RI-2221/2-1 "Grounding Statistical Machine Translation in Perception and Action."

Appendix: Proofs of Theorems

Proof of Theorem 1

Proof. First we bound w_{T+1}^⊤ w_{T+1} from above:

\begin{align*}
w_{T+1}^\top w_{T+1} &= w_T^\top w_T
 + 2\, w_T^\top \big(\phi(x_T,\bar y_T,\bar h_T) - \phi(x_T,\hat y_T,\hat h_T)\big)\,\Delta_{\bar h_T,\hat h_T} \\
&\quad + \Delta_{\bar h_T,\hat h_T}^2\, \big(\phi(x_T,\bar y_T,\bar h_T) - \phi(x_T,\hat y_T,\hat h_T)\big)^\top \big(\phi(x_T,\bar y_T,\bar h_T) - \phi(x_T,\hat y_T,\hat h_T)\big) \\
&\le w_T^\top w_T + 4R^2 \Delta_{\bar h_T,\hat h_T}^2 \;\le\; 4R^2 D_T. \tag{3}
\end{align*}

The first equality uses the update rule from Algorithm 1. The second step uses the fact that w_T^⊤(φ(x_T, ȳ_T, h̄_T) − φ(x_T, ŷ_T, ĥ_T)) ≤ 0 by the definition of (ŷ_T, ĥ_T) in Algorithm 1. By the assumption ‖φ(x, y, h)‖ ≤ R, ∀x, y, h, and by the triangle inequality, ‖φ(x, y, h) − φ(x, y′, h′)‖ ≤ ‖φ(x, y, h)‖ + ‖φ(x, y′, h′)‖ ≤ 2R. Finally, D_T = Σ_{t=1}^T Δ²_{h̄_t,ĥ_t} by definition, and the last inequality follows by induction.

The connection to average regret is as follows:

\begin{align*}
w_{T+1}^\top w^* &= w_T^\top w^* + \Delta_{\bar h_T,\hat h_T}\big(\phi(x_T,\bar y_T,\bar h_T) - \phi(x_T,\hat y_T,\hat h_T)\big)^\top w^* \\
&= \sum_{t=1}^{T} \Delta_{\bar h_t,\hat h_t}\big(\phi(x_t,\bar y_t,\bar h_t) - \phi(x_t,\hat y_t,\hat h_t)\big)^\top w^* \\
&= \sum_{t=1}^{T} \Delta_{\bar h_t,\hat h_t}\big(U_{\bar h_t}(x_t,\bar y_t) - U_{\hat h_t}(x_t,\hat y_t)\big). \tag{4}
\end{align*}

The first equality again uses the update rule from Algorithm 1. The second follows by induction. The last equality applies the definition of utility.

Next we upper bound the utility difference:

\[
\sum_{t=1}^{T} \Delta_{\bar h_t,\hat h_t}\big(U_{\bar h_t}(x_t,\bar y_t) - U_{\hat h_t}(x_t,\hat y_t)\big)
\;\le\; \|w^*\|\,\|w_{T+1}\| \;\le\; \|w^*\|\, 2R\sqrt{D_T}. \tag{5}
\]

The first inequality follows from applying the Cauchy-Schwarz inequality w_{T+1}^⊤ w* ≤ ‖w*‖‖w_{T+1}‖ to Equation (4). The second follows from applying Equation (3) to ‖w_{T+1}‖ = √(w_{T+1}^⊤ w_{T+1}).

The final result is obtained simply by lower bounding Equation (5) using the assumption in Equation (2):

\begin{align*}
\|w^*\|\, 2R\sqrt{D_T} &\ge \sum_{t=1}^{T} \Delta_{\bar h_t,\hat h_t}\big(U_{\bar h_t}(x_t,\bar y_t) - U_{\hat h_t}(x_t,\hat y_t)\big) \\
&\ge \alpha \sum_{t=1}^{T} \big(U(x_t, y^*_t) - U(x_t,\hat y_t)\big) - \sum_{t=1}^{T} \xi_t \\
&= \alpha\, T\, REG_T - \sum_{t=1}^{T} \xi_t.
\end{align*}

Proof of Theorem 3

Proof. The theorem can be shown by an application of McDiarmid's concentration inequality:

Theorem 4 (McDiarmid, 1989). Let Z_1, ..., Z_m be a set of random variables taking values in a set Z. Further, let f : Z^m → R be a function that satisfies for all i and z_1, ..., z_m, z′_i ∈ Z:

\[
|f(z_1, \ldots, z_i, \ldots, z_m) - f(z_1, \ldots, z'_i, \ldots, z_m)| \le c, \tag{6}
\]

for some c. Then for all ε > 0,

\[
P(|f - \mathbb{E}(f)| > \epsilon) \le 2\exp\Big(\frac{-2\epsilon^2}{m c^2}\Big). \tag{7}
\]

Let f be the average loss for predicting ŷ_t on example x_t in epoch K: f(x_1, ..., x_T) = REG_{T,K} = (1/T) Σ_{t=1}^T ℓ_{t,K}. Because of the convergence condition (Condition 1), ℓ_{t,K} = ℓ(x_t). The expectation of f is E(f) = (1/T) Σ_{t=1}^T E[ℓ_{t,K}] = (1/T) Σ_{t=1}^T E[ℓ(x_t)] = E_X(ℓ(x)).

The first and second term on the right-hand side of Theorem 3 follow from upper bounding REG_T in the Kth epoch, using Theorem 1. The third term is derived by calculating c in Equation (6) as follows:

\begin{align*}
|f(x_1, \ldots, x_t, \ldots, x_T) - f(x_1, \ldots, x'_t, \ldots, x_T)|
&= \Big|\frac{1}{T}\sum_{t=1}^{T}\ell_{t,K} - \frac{1}{T}\sum_{t=1}^{T}\ell'_{t,K}\Big|
 = \Big|\frac{1}{T}\sum_{t=1}^{T}\big(\ell_{t,K} - \ell'_{t,K}\big)\Big| \\
&\le \frac{1}{T}\sum_{t=1}^{T}\big(|\ell_{t,K}| + |\ell'_{t,K}|\big)
 \;\le\; \frac{4R\|w^*\|}{T} \;=\; c.
\end{align*}

The first inequality uses the triangle inequality; the second uses the upper bound |ℓ_{t,K}| ≤ 2R‖w*‖. Setting the right-hand side of Equation (7) to δ and solving for ε, using c, concludes the proof.


References

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In EMNLP, Seattle, WA.

Nicola Bertoldi, Patrick Simianer, Mauro Cettolo, Katharina Waschle, Marcello Federico, and Stefan Riezler. 2014. Online adaptation to post-edits for phrase-based statistical machine translation. Machine Translation, 29:309–339.

Dimitri P. Bertsekas. 2011. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. In Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright, editors, Optimization for Machine Learning. MIT Press.

Henry D. Block and Simon A. Levin. 1970. On the boundedness of an iterative procedure for solving a system of linear inequalities. Proceedings of the American Mathematical Society, 26(2):229–235.

Leon Bottou and Olivier Bousquet. 2004. Large scale online learning. In NIPS, Vancouver, Canada.

Nicolo Cesa-Bianchi and Gabor Lugosi. 2006. Prediction, Learning, and Games. Cambridge University Press.

Nicolo Cesa-Bianchi, Alex Conconi, and Claudio Gentile. 2004. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057.

Nicolo Cesa-Bianchi, Gabriele Reverberi, and Sandor Szedmak. 2008. Online learning algorithms for computer-assisted translation. Technical report, SMART (www.smart-project.eu).

David Chiang, Yuval Marton, and Philip Resnik. 2008. Online large-margin training of syntactic and structural translation features. In EMNLP, Waikiki, HA.

David Chiang, Kevin Knight, and Wei Wang. 2009. 11,001 new features for statistical machine translation. In NAACL, Boulder, CO.

David Chiang. 2012. Hope and fear for discriminative training of statistical translation models. Journal of Machine Learning Research, 12:1159–1187.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In EMNLP, Philadelphia, PA.

Ronan Collobert, Fabian Sinz, Jason Weston, and Leon Bottou. 2006. Trading convexity for scalability. In ICML, Pittsburgh, PA.

Michael Denkowski, Chris Dyer, and Alon Lavie. 2014. Learning from post-editing: Online model adaptation for statistical machine translation. In EACL, Gothenburg, Sweden.

Chuong B. Do, Quoc Le, and Choon Hui Teo. 2008. Tighter bounds for structured estimation. In NIPS, Vancouver, Canada.

Chris Dyer, Adam Lopez, Juri Ganitkevitch, Jonathan Weese, Ferhan Ture, Phil Blunsom, Hendra Setiawan, Vladimir Eidelman, and Philip Resnik. 2010. cdec: A decoder, alignment, and learning framework for finite-state and context-free translation models. In ACL, Uppsala, Sweden.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. In NAACL, Atlanta, GA.

Vladimir Eidelman. 2012. Optimization strategies for online large-margin learning in machine translation. In WMT, Montreal, Canada.

Yoav Freund and Robert E. Schapire. 1999. Large margin classification using the perceptron algorithm. Machine Learning, 37:277–296.

Andrew E. Gelfand, Yutian Chen, Max Welling, and Laurens van der Maaten. 2010. On herding and the perceptron cycling theorem. In NIPS, Vancouver, Canada.

Kevin Gimpel and Noah A. Smith. 2012. Structured ramp loss minimization for machine translation. In NAACL, Montreal, Canada.

Dan Goldwasser and Dan Roth. 2013. Learning from natural instructions. Machine Learning, 94(2):205–232.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In CHI, Paris, France.

Spence Green, Sida I. Wang, Jason Chuang, Jeffrey Heer, Sebastian Schuster, and Christopher D. Manning. 2014. Human effort and machine learnability in computer aided translation. In EMNLP, Doha, Qatar.

Kenneth Heafield. 2011. KenLM: faster and smaller language model queries. In WMT, Edinburgh, UK.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In ACL, Prague, Czech Republic.

Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In EMNLP, Seattle, WA.

Percy Liang, Alexandre Bouchard-Cote, Dan Klein, and Ben Taskar. 2006. An end-to-end discriminative approach to machine translation. In COLING-ACL, Sydney, Australia.

Francisco-Javier Lopez-Salcedo, German Sanchis-Trilles, and Francisco Casacuberta. 2012. Online learning of log-linear weights in interactive machine translation. In IberSpeech, Madrid, Spain.

Pascual Martinez-Gomez, German Sanchis-Trilles, and Francisco Casacuberta. 2012. Online adaptation strategies for statistical machine translation in post-editing scenarios. Pattern Recognition, 45(9):3193–3202.

David McAllester and Joseph Keshet. 2011. Generalization bounds and consistency for latent structural probit and ramp loss. In NIPS, Granada, Spain.

Colin McDiarmid. 1989. On the method of bounded differences. Surveys in Combinatorics, 141(1):148–188.

Albert B.J. Novikoff. 1962. On convergence proofs on perceptrons. Symposium on the Mathematical Theory of Automata, 12:615–622.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In NAACL, Edmonton, Canada.

Marion Potet, Laurent Besacier, and Herve Blanchon. 2010. The LIG machine translation system for WMT 2010. In WMT, Uppsala, Sweden.

Marion Potet, Emanuelle Esperanca-Rodier, Laurent Besacier, and Herve Blanchon. 2012. Collection of a large database of French-English SMT output corrections. In LREC, Istanbul, Turkey.

Avneesh Saluja and Ying Zhang. 2014. Online discriminative learning for machine translation with binary-valued feedback. Machine Translation, 28:69–90.

Avneesh Saluja, Ian Lane, and Ying Zhang. 2012. Machine translation with binary feedback: A large-margin approach. In AMTA, San Diego, CA.

Shai Shalev-Shwartz. 2012. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194.

Libin Shen, Anoop Sarkar, and Franz Josef Och. 2004. Discriminative reranking for machine translation. In NAACL, Boston, MA.

Pannaga Shivaswamy and Thorsten Joachims. 2012. Online structured prediction via coactive learning. In ICML, Edinburgh, UK.

Patrick Simianer, Stefan Riezler, and Chris Dyer. 2012. Joint feature selection in distributed stochastic learning for large-scale discriminative training in SMT. In ACL, Jeju, Korea.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In AMTA, Cambridge, MA.

Artem Sokolov, Guillaume Wisniewski, and Francois Yvon. 2013. Lattice BLEU oracles in machine translation. Transactions on Speech and Language Processing, 10(4):18.

Artem Sokolov, Stefan Riezler, and Shay B. Cohen. 2015. Coactive learning for interactive machine translation. In ICML Workshop on Machine Learning for Interactive Systems (MLIS), Lille, France.

Xu Sun, Takuya Matsuzaki, and Wenjie Li. 2013. Latent structured perceptrons for large scale learning with hidden information. IEEE Transactions on Knowledge and Data Engineering, 25(9):2064–2075.

Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction. The MIT Press.

Christoph Tillmann and Tong Zhang. 2006. A discriminative global training algorithm for statistical MT. In COLING-ACL, Sydney, Australia.

Katharina Waschle, Patrick Simianer, Nicola Bertoldi, Stefan Riezler, and Marcello Federico. 2013. Generative and discriminative methods for online adaptation in SMT. In MT Summit, Nice, France.

Taro Watanabe, Jun Suzuki, Hajime Tsukada, and Hideki Isozaki. 2006. NTT statistical machine translation for IWSLT 2006. In IWSLT, Kyoto, Japan.

Taro Watanabe, Jun Suzuki, Hajime Tsukada, and Hideki Isozaki. 2007. Online large-margin training for statistical machine translation. In EMNLP, Prague, Czech Republic.

Taro Watanabe. 2012. Optimized online rank learning for machine translation. In NAACL, Montreal, Canada.

Heng Yu, Liang Huang, Haitao Mi, and Kai Zhao. 2013. Max-violation perceptron and forced decoding for scalable MT training. In EMNLP, Seattle, WA.

Alan Yuille and Anand Rangarajan. 2003. The concave-convex procedure. Neural Computation, 15:915–936.


Proceedings of the 19th Conference on Computational Natural Language Learning, pages 12–21, Beijing, China, July 30-31, 2015. © 2015 Association for Computational Linguistics

A Joint Framework for Coreference Resolution and Mention Head Detection

Haoruo Peng, Kai-Wei Chang and Dan Roth
University of Illinois, Urbana-Champaign
Urbana, IL, 61801
{hpeng7,kchang10,danr}@illinois.edu

Abstract

In coreference resolution, a fair amount of research treats mention detection as a preprocessed step and focuses on developing algorithms for clustering coreferred mentions. However, there are significant gaps between the performance on gold mentions and the performance on the real problem, when mentions are predicted from raw text via an imperfect Mention Detection (MD) module. Motivated by the goal of reducing such gaps, we develop an ILP-based joint coreference resolution and mention head formulation that is shown to yield significant improvements on coreference from raw text, outperforming existing state-of-the-art systems on both the ACE-2004 and the CoNLL-2012 datasets. At the same time, our joint approach is shown to improve mention detection by close to 15% F1. One key insight underlying our approach is that identifying and co-referring mention heads is not only sufficient but is more robust than working with complete mentions.

1 Introduction

Mention detection is rarely studied as a stand-alone research problem (Recasens et al. (2013) is one key exception). Most coreference resolution work simply mentions it in passing as a module in the pipelined system (Chang et al., 2013; Durrett and Klein, 2013; Lee et al., 2011; Bjorkelund and Kuhn, 2014). However, the lack of emphasis is not due to this being a minor issue, but rather, we think, its difficulty. Indeed, many papers report results in terms of gold mentions versus system generated mentions, as shown in Table 1. Current state-of-the-art systems show a very significant drop in performance when running on system generated mentions. These performance gaps are worrisome, since the real goal of NLP systems is to process raw data.

  System     Dataset    Gold    Predict   Gap
  Illinois   CoNLL-12   77.05   60.00     17.05
  Illinois   CoNLL-11   77.22   60.18     17.04
  Illinois   ACE-04     79.42   68.27     11.15
  Berkeley   CoNLL-11   76.68   60.42     16.26
  Stanford   ACE-04     81.05   70.33     10.72

Table 1: Performance gaps between using gold mentions and predicted mentions for three state-of-the-art coreference resolution systems. Performance gaps are always larger than 10%. Illinois's system (Chang et al., 2013) is evaluated on the CoNLL (2012, 2011) Shared Task and ACE-2004 datasets. It reports an average F1 score of the MUC, B3 and CEAFe metrics using the CoNLL v7.0 scorer. Berkeley's system (Durrett and Klein, 2013) reports the same average score on the CoNLL-2011 Shared Task dataset. Results of Stanford's system (Lee et al., 2011) are for the B3 metric on the ACE-2004 dataset.

This paper focuses on improving end-to-end coreference performance. We do this by: 1) Developing a new ILP-based joint learning and inference formulation for coreference and mention head detection. 2) Developing a better mention head candidate generation algorithm. Importantly, we focus on heads rather than mention boundaries since those can be identified more robustly and used effectively in an end-to-end system. As we show, this results in a dramatic improvement in the quality of the MD component and, consequently, a significant reduction in the performance gap between coreference on gold mentions and coreference on raw data.

Existing coreference systems usually consider a pipelined system, where the mention detection step is followed by that of clustering mentions into coreference chains. Higher quality mention identification naturally leads to better coreference performance. Standard methods define mentions as boundaries of text, and expect exact boundaries as input in the coreference step. However, mentions have an intrinsic structure, in which mention heads carry the crucial information. Here, we define a mention head as the last token of a syntactic head, or the whole syntactic head for proper names.1

1Here, we follow the ACE annotation guideline. Note that the CoNLL-2012 dataset is built from the OntoNotes-5.0 corpus.


For example, in "the incumbent [Barack Obama]" and "[officials] at the Pentagon", "Barack Obama" and "officials" serve as mention heads, respectively. Mention heads can be used as auxiliary structures for coreference. In this paper, we first identify mention heads, and then detect mention boundaries based on heads. We rely heavily on the first, head identification, step, which we show to be sufficient to support coreference decisions. Moreover, this step also provides enough information for "understanding" the coreference output, and can be evaluated more robustly (since minor disagreements on mention boundaries are often a reason for evaluation issues when dealing with predicted mentions). We only identify the mention boundaries at the end, after we make the coreference decisions, to be consistent with current evaluation standards in the coreference resolution community. Consider the following example2:

[Multinational companies investing in [China]] had become so angry that [they] recently set up an anti-piracy league to pressure [the [Chinese] government] to take action. [Domestic manufacturers, [who] are also suffering], launched a similar body this month. [They] hope [the government] can introduce a new law increasing fines against [producers of fake goods] from the amount of profit made to the value of the goods produced.

Here, phrases in the brackets are mentions and the underlined simple phrases are mention heads. Moreover, mention boundaries can be nested (the boundary of a mention is inside the boundary of another mention), but mention heads never overlap. This property also simplifies the problem of mention head candidate generation. In the example above, the first "they" refers to "Multinational companies investing in China" and the second "They" refers to "Domestic manufacturers, who are also suffering". In both cases, the mention heads are sufficient to support the decisions: "they" refers to "companies", and "They" refers to "manufacturers". In fact, most of the features3 implemented in existing coreference resolution systems rely solely on mention heads (Bengtson and Roth, 2008).

Furthermore, consider the possible mention candidate "league" (italic in the text). It is not chosen as a mention because the surrounding context is not focused on "anti-piracy league".

2This example is chosen from the ACE-2004 corpus.
3All features except for those that rely on modifiers.

Figure 1: Comparison between a traditional pipelined system and our proposed system. We split up mention detection into two steps: mention head candidate generation and (an optional) mention boundary detection. We feed mention heads rather than complete mentions into the coreference model. During the joint head-coreference process, we reject some mention head candidates and then recover complete mention boundaries after coreference decisions are made.

So, mention detection can be viewed as a global decision problem, which involves considering the relevance of a mention to its context. The fact that the coreference decision provides a way to represent this relevance further motivates considering mention detection and coreference jointly. The insight here is that a mention candidate will be more likely to be valid when it has more high-confidence coreference links.

This paper develops a joint coreference resolution and mention head detection framework as an Integer Linear Program (ILP), following Roth and Yih (2004). Figure 1 compares a traditional pipelined system with our proposed system. Our joint formulation includes decision variables both for coreference links between pairs of mention heads and for all mention head candidates, and we simultaneously learn the ILP coefficients for all these variables. During joint inference, some of the mention head candidates will be rejected (that is, the corresponding variables will be assigned '0'), contributing to improvement both in MD and in coreference performance. The aforementioned joint approach builds on an algorithm that generates mention head candidates. Our candidate generation process consists of a statistical component and a component that makes use of existing resources, and is designed to ensure high recall on head candidates.

Ideally, after making coreference decisions, we extend the remaining mention heads to complete mentions; we employ a binary classifier, which shares all features with the mention head detection model in the joint step.

Our proposed system can work on both the ACE and OntoNotes datasets, even though their styles of annotation are different.


There are two main differences to be addressed. First, OntoNotes removes singleton mentions, even if they are valid mentions. This causes additional difficulty in learning a good mention detector in a pipelined framework. However, our joint framework can adapt to it by rejecting those singleton mentions. More details will be discussed in Sec. 2. Second, ACE uses shortest denotative phrases to identify mentions while OntoNotes tends to use long text spans. This makes identifying mention boundaries unnecessarily hard. Our system focuses on mention heads in the coreference stage to ensure robustness. As OntoNotes does not contain head annotations, we preprocess the data to extract mention heads which conform with the ACE style.

Results on the ACE-2004 and CoNLL-2012 datasets show that our system4 reduces the performance gap for coreference by around 25% (measured as the ratio of performance improvement over performance gap) and improves the overall mention detection by over 10 F1 points. With such significant improvements, we achieve the best end-to-end coreference resolution results reported so far.

The main contributions of our work can be summarized as follows:

1. We develop a new, end-to-end, coreference approach that is based on a joint learning and inference model for mention heads and coreference decisions.

2. We develop an improved mention head candidate generation module and a mention boundary detection module.

3. We achieve the best coreference results on predicted mentions and reduce the performance gap compared to using gold mentions.

The rest of the paper is organized as follows. We explain the joint head-coreference learning and inference framework in Sec. 2. Our mention head candidate generation module and mention boundary detection module are described in Sec. 3. We report our experimental results in Sec. 4, review related work in Sec. 5 and conclude in Sec. 6.

2 A Joint Head-Coreference Framework

This section describes our joint coreference resolution and mention head detection framework. Our work is inspired by the latent left-linking model in Chang et al. (2013) and the ILP formulation from Chang et al. (2011).

4Available at http://cogcomp.cs.illinois.edu/page/software_view/Coref

The joint learning and inference model takes as input mention head candidates (Sec. 3) and jointly (1) determines if they are indeed mention heads and (2) learns a similarity metric between mentions. This is done by simultaneously learning a binary mention head detection classifier and a mention-pair coreference classifier. The mention head detection model here is mainly trained to differentiate valid mention heads from invalid ones. By learning and making decisions jointly, it also serves as a singleton mention head classifier, building on insights from Recasens et al. (2013). This joint framework aims to improve performance on both mention head detection and on coreference.

We first describe the formulation of the mention head detection and the ILP-based mention-pair coreference separately, and then propose the joint head-coreference framework.

2.1 Mention Head Detection

The mention head detection model is a binary classifier g_m = w_1^⊤ ϕ(m), in which ϕ(m) is a feature vector for mention head candidate m and w_1 is the corresponding weight vector. We identify a candidate m as a mention head if g_m > 0. The features utilized in the vector ϕ(m) consist of: 1) Gazetteer features, 2) Part-Of-Speech features, 3) WordNet features, 4) Features from the previous and next tokens, 5) Length of mention head, 6) Normalized Pointwise Mutual Information (NPMI) on the tokens across a mention head boundary, 7) Feature conjunctions. Altogether there are hundreds of thousands of sparse features.

2.2 ILP-based Mention-Pair Coreference

Let M be the set of all mentions. We train a coreference model by learning a pairwise mention scoring function. Specifically, given a mention-pair (u, v) (u, v ∈ M, u is the antecedent of v), we learn a left-linking scoring function f_{u,v} = w_2^⊤ φ(u, v), where φ(u, v) is a pairwise feature vector and w_2 is the weight vector. The inference algorithm is inspired by the best-left-link approach (Chang et al., 2011), where they solve the following ILP problem:

\begin{align*}
\arg\max_{y}\; & \sum_{u<v,\; u,v \in M} f_{u,v}\, y_{u,v}, \\
\text{s.t.}\; & \sum_{u<v} y_{u,v} \le 1, \quad \forall v \in M, \\
& y_{u,v} \in \{0,1\}, \quad \forall u, v \in M. \tag{1}
\end{align*}

Here, y_{u,v} = 1 iff mentions u, v are directly linked. Thus, we can construct a forest and the mentions in the same connected component (i.e., in the same tree) are co-referred. For this mention-pair coreference model φ(u, v), we use the same set of features used in Bengtson and Roth (2008).
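Because the constraints in (1) only couple the variables that share the same mention v, the ILP decomposes and can be solved by letting every mention pick its single best-scoring antecedent, if any. A minimal sketch, with `score(u, v)` standing in for f_{u,v} = w_2^T φ(u, v):

```python
def best_left_link(mentions, score):
    """Solve objective (1): each mention keeps at most one antecedent link,
    namely its highest-scoring earlier mention if that score is positive."""
    links = {}
    for j, v in enumerate(mentions):
        best_u, best_score = None, 0.0
        for u in mentions[:j]:
            s = score(u, v)
            if s > best_score:
                best_u, best_score = u, s
        if best_u is not None:
            links[v] = best_u       # y_{u,v} = 1 for the chosen antecedent
    return links                    # the links form a forest; trees are entities
```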


2.3 Joint Inference Framework

We extend expression (1) to facilitate joint inference on mention heads and coreference as follows:

\begin{align*}
\arg\max_{y}\; & \sum_{u<v,\; u,v \in M'} f_{u,v}\, y_{u,v} + \sum_{m \in M'} g_m\, y_m, \\
\text{s.t.}\; & \sum_{u<v} y_{u,v} \le 1, \quad \forall v \in M', \\
& \sum_{u<v} y_{u,v} \le y_v, \quad \forall v \in M', \\
& y_{u,v} \in \{0,1\},\; y_m \in \{0,1\}, \quad \forall u, v, m \in M'.
\end{align*}

Here, M' is the set of all mention head candidates. y_m is the decision variable for mention head candidate m. y_m = 1 if and only if the mention head m is chosen. To consider coreference decisions and mention head decisions together, we add the constraint Σ_{u<v} y_{u,v} ≤ y_v, which ensures that if a candidate mention head v is not chosen, then it will not have coreference links with other mention heads.
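A sketch of this joint ILP using the PuLP modelling library; the `link_score` and `head_score` callbacks stand for f_{u,v} and g_m, the candidate heads are assumed to be distinct hashable identifiers, and the solver choice is arbitrary.

```python
import pulp

def joint_head_coref_inference(heads, link_score, head_score):
    """Joint ILP over link variables y_{u,v} and head variables y_m, with the
    constraint that a rejected head candidate cannot receive a coreference link."""
    prob = pulp.LpProblem("joint_head_coref", pulp.LpMaximize)
    y_m = {m: pulp.LpVariable(f"head_{i}", cat="Binary") for i, m in enumerate(heads)}
    y_uv = {(u, v): pulp.LpVariable(f"link_{i}_{j}", cat="Binary")
            for j, v in enumerate(heads) for i, u in enumerate(heads[:j])}
    # Objective: pairwise link scores plus mention-head scores.
    prob += (pulp.lpSum(link_score(u, v) * var for (u, v), var in y_uv.items())
             + pulp.lpSum(head_score(m) * var for m, var in y_m.items()))
    for j, v in enumerate(heads):
        incoming = [y_uv[(u, v)] for u in heads[:j]]
        if incoming:
            prob += pulp.lpSum(incoming) <= 1         # at most one antecedent per head
            prob += pulp.lpSum(incoming) <= y_m[v]    # no link into a rejected head
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    kept = [m for m in heads if y_m[m].value() == 1]
    links = [(u, v) for (u, v), var in y_uv.items() if var.value() == 1]
    return kept, links
```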

2.4 Joint Learning Framework

To support joint learning of the parameters w_1 and w_2 described above, we define a joint training objective function C(w_1, w_2) for mention head detection and coreference, which uses a max-margin approach to learn both weight vectors. Suppose we have a collection of documents D, and we generate n_d mention head candidates for each document d (d ∈ D). We use an indicator function δ(u, m) to represent whether mention heads u, m are in the same coreference cluster based on gold annotations (δ(u, m) = 1 iff they are in the same cluster). Similarly, Ω(m) is an indicator function representing whether mention head m is valid in the gold annotations.

For simplicity, we first define

\begin{align*}
u' &= \arg\max_{u<m} \big(w_2^\top \phi(u,m) - \delta(u,m)\big), \\
u'' &= \arg\max_{u<m,\; \delta(u,m)=1} w_2^\top \phi(u,m)\, \Omega(m).
\end{align*}

We then minimize the following joint training objective function C(w_1, w_2):

\[
C(w_1,w_2) = \frac{1}{|D|} \sum_{d \in D} \frac{1}{n_d} \sum_{m} \big(C_{coref,m}(w_2) + C_{local,m}(w_1) + C_{trans,m}(w_1)\big) + R(w_1,w_2).
\]

C(w_1, w_2) is composed of four parts. The first part is the loss function for coreference, where we have

\[
C_{coref,m}(w_2) = -w_2^\top \phi(u'',m)\, \Omega(m) + \big(w_2^\top \phi(u',m) - \delta(u',m)\big)\big(\Omega(m) \vee \Omega(u')\big).
\]

It is similar to the loss function for a latent left-linking coreference model5. As the second component, we have the quadratic loss for the mention head detection model,

\[
C_{local,m}(w_1) = \frac{1}{2}\big(w_1^\top \varphi(m) - \Omega(m)\big)^2.
\]

Using the third component, we further maximize the margin between valid and invalid mention head candidates when they are selected as the best-left-link mention heads for any valid mention head. It can be represented as

\[
C_{trans,m}(w_1) = \frac{1}{2}\big(w_1^\top \varphi(u') - \Omega(u')\big)^2\, \Omega(m).
\]

The last part is the regularization term

\[
R(w_1,w_2) = \frac{\lambda_1}{2}\|w_1\|^2 + \frac{\lambda_2}{2}\|w_2\|^2.
\]

2.5 Stochastic Subgradient Descent for Joint Learning

For joint learning, we choose a stochastic subgradient descent (SGD) approach to facilitate performing SGD on a per mention head basis. Next, we describe the weight update algorithm by defining the subgradients.

The partial subgradient w.r.t. mention head m for the head weight vector w_1 is given by

\[
\nabla_{w_1,m} C(w_1,w_2) = \frac{1}{|D|\, n_d}\big(\nabla C_{local,m}(w_1) + \nabla C_{trans,m}(w_1)\big) + \lambda_1 w_1, \tag{2}
\]

where

\begin{align*}
\nabla C_{local,m}(w_1) &= \big(w_1^\top \varphi(m) - \Omega(m)\big)\varphi(m), \\
\nabla C_{trans,m}(w_1) &= \big(w_1^\top \varphi(u') - \Omega(u')\big)\varphi(u')\, \Omega(m).
\end{align*}

The partial subgradient w.r.t. mention head m for the coreference weight vector w_2 is given by

\[
\nabla_{w_2,m} C(w_1,w_2) = \lambda_2 w_2 +
\begin{cases}
\phi(u',m) - \phi(u'',m) & \text{if } \Omega(m) = 1, \\
\phi(u',m) & \text{if } \Omega(m) = 0 \text{ and } \Omega(u') = 1, \\
0 & \text{if } \Omega(m) = 0 \text{ and } \Omega(u') = 0.
\end{cases} \tag{3}
\]

Here λ_1 and λ_2 are regularization coefficients which are tuned on the development set. To learn the mention head detection model, we consider two different parts of the gradient in expression (2).

5More details can be found in Chang et al. (2013). The difference here is that we also consider the validity of mention heads using Ω(u), Ω(m).


∇C_{local,m}(w_1) is exactly the local gradient of mention head m, while we add ∇C_{trans,m}(w_1) to represent the gradient for mention head u', the mention head chosen by the current best-left-linking model for m. This serves to maximize the margin between valid mention heads and invalid ones. As invalid mention heads will not be linked to any other mention head, ∇C_{trans} is zero when m is invalid. When training the mention-pair coreference model, we only consider gradients when at least one of the two mention heads m, u' is valid, as shown in expression (3). When mention head m is valid (Ω(m) = 1), the gradient is the same as local training for the best-left-link of m (first condition in expression (3)). When m is not valid while u' is valid, we only demote the coreference link between them (second condition in expression (3)). We consider only the gradient from the regularization term when both m, u' are invalid.
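A sketch of the resulting per-mention-head SGD step with numpy; the variable names (head feature vectors of m and u', pairwise features of (u', m) and (u'', m), and the indicator values) are ours, and `norm` stands for the factor 1/(|D| n_d) in expression (2).

```python
import numpy as np

def sgd_step(w1, w2, feat_m, feat_uprime, pair_uprime, pair_udprime,
             omega_m, omega_uprime, norm, lr, lam1, lam2):
    """One SGD step for mention head m, following expressions (2) and (3)."""
    # Head-detection subgradient: local term plus the margin ("trans") term.
    grad_local = (w1 @ feat_m - omega_m) * feat_m
    grad_trans = (w1 @ feat_uprime - omega_uprime) * feat_uprime * omega_m
    grad_w1 = norm * (grad_local + grad_trans) + lam1 * w1
    # Coreference subgradient: the three cases of expression (3).
    if omega_m == 1:
        grad_w2 = lam2 * w2 + (pair_uprime - pair_udprime)
    elif omega_uprime == 1:
        grad_w2 = lam2 * w2 + pair_uprime
    else:
        grad_w2 = lam2 * w2
    return w1 - lr * grad_w1, w2 - lr * grad_w2
```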

As mentioned before, our framework can handle annotations with or without singleton mentions. When the gold data contains no singleton mentions, we have Ω(m) = 0 for all singleton mention heads among mention head candidates. Then, our mention head detection model partly serves as a singleton head detector, and tries to reject singletons in the joint decisions with coreference. When the gold data contains singleton mentions, we have Ω(m) = 1 for all valid singleton mention heads. Our mention head detection model then only learns to differentiate invalid mention heads from valid ones, and thus has the ability to preserve valid singleton heads.

Most of the head mentions proposed by the algorithms described in Sec. 3 are positive examples. We ensure a balanced training of the mention head detection model by adding sub-sampled invalid mention head candidates as negative examples. Specifically, after mention head candidate generation (described in Sec. 3), we train on a set of candidates with precision larger than 50%. We then use the Illinois Chunker (Punyakanok and Roth, 2001)6 to extract more noun phrases from the text and employ Collins head rules (Collins, 1999) to identify their heads. When these extracted heads do not overlap with gold mention heads, we treat them as negative examples.

We note that the aforementioned joint framework can take as input either complete mention candidates or mention head candidates. However, in this paper we only feed mention heads into it. Our experimental results support our intuition that this provides better results.

6http://cogcomp.cs.illinois.edu/page/software_view/Chunker

3 Mention Detection Modules

This section describes the module that generates our mention head candidates, and then how the mention heads are expanded to complete mentions.

3.1 Mention Head Candidate Generation

The goal of the mention head candidate generation process is to acquire candidates from multiple sources to ensure high recall, given that our joint framework acts as a filter and increases precision. We view the sources as independent components and merge all mention heads generated. A sequence labelling component and a named entity recognition component employ statistical learning methods. These are augmented by additional heads that we acquire from Wikipedia and a "known heads" resource, which we incorporate utilizing string matching algorithms.

3.1.1 Statistical Components

Sequence Labelling Component We use the following notations. Let O = <o_1, o_2, ..., o_n> represent an input token sequence over an alphabet Ω. A mention is a substring of consecutive input tokens m_{i,j} = <o_i, o_{i+1}, ..., o_j> for 1 ≤ i ≤ j ≤ n. We consider the positions of mentions in the text: two mentions with an identical sequence of tokens that differ in position are considered different mentions.

The sequence labeling component builds on the following assumption:

Assumption Different mentions have different heads, and heads do not overlap with each other. That is, for each m_{i,j}, we have a corresponding head h_{a,b} where i ≤ a ≤ b ≤ j. Moreover, for another head h_{a',b'}, we have a − b' > 0 or b − a' < 0, ∀ h_{a,b}, h_{a',b'}.

Based on this assumption, the problem of identifying mention heads is a sequential phrase identification problem, and we choose to employ the BILOU representation, as it has advantages over the traditional BIO representation, as shown, e.g., in Ratinov and Roth (2009). The BILOU representation suggests learning classifiers that identify the Beginning, Inside and Last tokens of multi-token chunks as well as Unit-length chunks. The problem is then transformed into a simple, but constrained, 5-class classification problem.
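For illustration, the encoding of non-overlapping head spans into BILOU tags can be sketched as follows (the span format is an assumption on our part):

```python
def spans_to_bilou(num_tokens, head_spans):
    """Encode non-overlapping head spans [(start, end_inclusive), ...] as BILOU tags."""
    tags = ["O"] * num_tokens
    for start, end in head_spans:
        if start == end:
            tags[start] = "U"            # unit-length head
        else:
            tags[start] = "B"
            tags[end] = "L"
            for i in range(start + 1, end):
                tags[i] = "I"
    return tags

# spans_to_bilou(6, [(1, 2), (4, 4)]) -> ['O', 'B', 'L', 'O', 'U', 'O']
```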

The BILOU classifier shares all features with the mention head detection model described in Sec. 2.1 except for two: length of mention heads and NPMI over the head boundary.


For each instance, the feature vector is sparse and we use a sparse perceptron (Jackson and Craven, 1996) for supervised training. We also apply a two layer prediction aggregation. First, we apply a baseline BILOU classifier, and then use the resulting predictions as additional features in a second level of inference to take interactions into account in an efficient manner. A similar technique has been applied in Ratinov and Roth (2009), and has shown favorable results over other "standard" sequential prediction models.

Named Entity Recognition Component We use existing tools to extract named entities as additional mention head candidates. We choose the state-of-the-art "Illinois Named Entity Tagger" package7. It uses distributional word representations that improve its generalization. This package gives the standard Person/Location/Organization/Misc labels and we take all output named entities as candidates.

3.1.2 Resource-Driven Matching Components

Wikipedia Many mention heads can be directly matched to a Wikipedia title. We get 4,045,764 Wikipedia titles from Wikipedia dumps and use all of them as potential mention heads. The Wikipedia matching component includes an efficient hashing algorithm implemented via a DJB2 hash function8. One important advantage of using Wikipedia is that it keeps updating. This component can contribute steadily to ensure a good coverage of mention heads. We first run this matching component on training documents and compute the precision of entries that appear in the text (the probability of appearing as mention heads). We then get the set of entries with precision higher than a threshold α, which is tuned on the development set using F1-score. We use them as candidates for mention head matching.

Known Head Some mention heads appear repeatedly in the text. To fully utilize the training data, we construct a known mention head candidate set and identify them in the test documents. To balance between recall and precision, we set a parameter β > 0 as a precision threshold and only allow those mention heads with precision larger than β on the training set. Please note that threshold β is also tuned on the development set using F1-score.
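The DJB2 hash referred to above is a standard string hash; a minimal version (the 32-bit truncation is our choice) looks as follows. Bucketing the Wikipedia titles by such a hash makes matching candidate head strings against the title set cheap.

```python
def djb2(s):
    """Classic DJB2 string hash: h = h * 33 + c, truncated here to 32 bits."""
    h = 5381
    for ch in s:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF
    return h
```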

We also employ a simple word variation tolerance algorithm in our matching components, to generalize over small variations (plural/singular, etc.).

7http://cogcomp.cs.illinois.edu/page/software_view/NETagger

8http://www.cse.yorku.ca/˜oz/hash.html

3.2 Mention Boundary Detection

Once the joint learning and inference process determines the set of mention heads (and their coreference chains), we extend the heads to complete mentions. Note that this process may not be necessary, since in many applications the head clusters often provide enough information. However, for consistency with existing coreference resolution systems, we describe below how we expand the heads to complete mentions.

We learn a binary classifier to expand mentions, which determines if the mention head should include the token to its left and to its right. We follow the notations in Sec. 2.1. We construct positive examples as (o_p, h_{a,b}, dir), ∀ m_{i,j}(h_{a,b}). Here p ∈ {i, i+1, ..., a−1} ∪ {b+1, b+2, ..., j}, and when p = i, i+1, ..., a−1, dir = L; when p = b+1, b+2, ..., j, dir = R. We construct negative examples as (o_{i−1}, h_{a,b}, L) and (o_{j+1}, h_{a,b}, R). Once trained, the binary classifier takes in the head, a token and the direction of the token relative to the head, and decides whether the token is inside or outside the mention corresponding to the head. At test time, this classifier is used around each confirmed head to determine the mention boundaries. The features used here are similar to the mention head detection model described in Sec. 2.1.
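A sketch of this example construction, assuming mentions and heads are given as inclusive (start, end) token spans with the head contained in the mention:

```python
def boundary_examples(tokens, mention, head):
    """Positive examples: tokens inside the mention but outside the head, with their
    direction relative to the head. Negative examples: the tokens immediately
    outside the mention boundary."""
    (i, j), (a, b) = mention, head
    pos = [(tokens[p], head, "L") for p in range(i, a)] + \
          [(tokens[p], head, "R") for p in range(b + 1, j + 1)]
    neg = []
    if i - 1 >= 0:
        neg.append((tokens[i - 1], head, "L"))
    if j + 1 < len(tokens):
        neg.append((tokens[j + 1], head, "R"))
    return pos, neg
```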

4 Experiments

We present experiments on the two standard coreference resolution datasets, ACE-2004 (NIST, 2004) and OntoNotes-5.0 (Hovy et al., 2006). Our approach results in a substantial reduction in the coreference performance gap between gold and predicted mentions, and significantly outperforms existing state-of-the-art results on coreference resolution; in addition, it achieves significant performance improvement on MD for both datasets.

4.1 Experimental Setup

Datasets The ACE-2004 dataset contains 443 documents. We use a standard split of 268 training documents, 68 development documents, and 106 testing documents (Culotta et al., 2007; Bengtson and Roth, 2008). The OntoNotes-5.0 dataset, which is released for the CoNLL-2012 Shared Task (Pradhan et al., 2012), contains 3,145 annotated documents. These documents come from a wide range of sources which include newswire, bible, transcripts, magazines, and web blogs. We report results on the test documents for both datasets.


                MUC     B3      CEAFe   AVG
  Gold_M/H      78.17   81.64   78.45   79.42
  Stanford_M    63.89   70.33   70.21   68.14
  Predicted_M   64.28   70.37   70.16   68.27
  H-M-Coref_M   65.81   71.97   71.14   69.64
  H-Joint-M_M   67.28   73.06   73.25   71.20
  Stanford_H    70.28   73.93   73.04   72.42
  Predicted_H   71.35   75.33   74.02   73.57
  H-M-Coref_H   71.81   75.69   74.45   73.98
  H-Joint-M_H   72.74   76.69   75.18   74.87

Table 2: Performance of coreference resolution for all systems on the ACE-2004 dataset. Subscripts (M, H) indicate evaluations on (mentions, mention heads), respectively. For gold mentions and mention heads, they yield the same performance for coreference. Our proposed H-Joint-M system achieves the highest performance. Parameters of our proposed system are tuned as α = 0.9, β = 0.8, λ1 = 0.2 and λ2 = 0.3.

The ACE-2004 dataset is annotated with both mentions and mention heads, while the OntoNotes-5.0 dataset only has mention annotations. Therefore, we preprocess OntoNotes-5.0 to derive mention heads using Collins head rules (Collins, 1999) with gold constituency parsing information and gold named entity information. The parsing information9 is only needed to generate training data for the mention head candidate generator, and named entities are directly set as heads. We set these extracted heads as gold, which enables us to train the two layer BILOU classifier described in Sec. 3.1.1. The non-overlapping mention head assumption in Sec. 3.1.1 can be verified empirically on both the ACE-2004 and OntoNotes-5.0 datasets.

Baseline Systems We choose three publicly available state-of-the-art end-to-end coreference systems as our baselines: the Stanford system (Lee et al., 2011), the Berkeley system (Durrett and Klein, 2014) and the HOTCoref system (Bjorkelund and Kuhn, 2014).

Developed Systems Our developed system is built on the work by Chang et al. (2013), using the Constrained Latent Left-Linking Model (CL3M) as our mention-pair coreference model in the joint framework10. When the CL3M coreference system uses gold mentions or heads, we call the system Gold; when it uses predicted mentions or heads, we call the system Predicted. The mention head candidate generation module along with the mention boundary detection module can be grouped together to form a complete mention detection system, and we call it H-M-MD.

9No parsing information is needed at evaluation time.
10We use Gurobi v5.0.1 as our ILP solver.

                MUC     B3      CEAFe   AVG
  Gold_M/H      82.03   70.59   66.76   73.12
  Stanford_M    64.62   51.89   48.23   54.91
  HOTCoref_M    70.74   58.37   55.47   61.53
  Berkeley_M    71.24   58.71   55.18   61.71
  Predicted_M   69.63   57.46   53.16   60.08
  H-M-Coref_M   70.95   59.11   54.98   61.68
  H-Joint-M_M   72.22   60.50   56.37   63.03
  Stanford_H    68.53   56.68   52.36   59.19
  HOTCoref_H    72.94   60.27   57.53   63.58
  Berkeley_H    73.05   60.39   57.43   63.62
  Predicted_H   72.11   60.12   55.68   62.64
  H-M-Coref_H   73.22   61.42   56.21   63.62
  H-Joint-M_H   74.83   62.77   57.93   65.18

Table 3: Performance of coreference resolution for all systems on the CoNLL-2012 dataset. Subscripts (M, H) indicate evaluations on (mentions, mention heads), respectively. For gold mentions and mention heads, they yield the same performance for coreference. Our proposed H-Joint-M system achieves the highest performance. Parameters of our proposed system are tuned as α = 0.9, β = 0.9, λ1 = 0.25 and λ2 = 0.2.

We can feed the predicted mentions from H-M-MD directly into the mention-pair coreference model that we implemented, resulting in a traditional pipelined end-to-end coreference system, namely H-M-Coref. We name our new proposed end-to-end coreference resolution system, incorporating both the mention head candidate generation module and the joint framework, H-Joint-M.

Evaluation Metrics We compare all systems using three popular metrics for coreference resolution: MUC (Vilain et al., 1995), B3 (Bagga and Baldwin, 1998), and Entity-based CEAF (CEAFe) (Luo, 2005). We use the average F1 scores (AVG) of these three metrics as the main metric for comparison. We use the v7.0 scorer provided by the CoNLL-2012 Shared Task11. We also evaluate the mention detection performance based on precision, recall and F1 score. As mention heads are important for both mention detection and coreference resolution, we also report results evaluated on mention heads.

4.2 Performance for Coreference Resolution

Performance of coreference resolution for all systems on the ACE-2004 and CoNLL-2012 datasets is shown in Table 2 and Table 3, respectively.12

11The latest scorer is version v8.01, but the MUC, B3, CEAFe and CoNLL average scores are not changed. For evaluation on ACE-2004, we convert the system output and gold annotations into CoNLL format.

12We do not provide results from Berkeley and HOTCoref on the ACE-2004 dataset as they do not directly support ACE input. Results for HOTCoref are slightly different from the results reported in Bjorkelund and Kuhn (2014). For the Berkeley system, we use the reported results from Durrett and Klein (2014).


These results show that our developed system H-Joint-M shows significant improvement on all metrics for both datasets. Existing systems only report results on mentions. Here, we also show their performance evaluated on mention heads. When evaluated on mention heads rather than mentions13, we can always expect a performance increase for all systems on both datasets. Even though evaluating on mentions is more common in the literature, it is often enough to identify just mention heads in coreference chains (as shown in the example from Sec. 1). H-M-Coref can already bring substantial performance improvement, which indicates that it is helpful for coreference to just identify high quality mention heads. Our proposed H-Joint-M system outperforms all baselines and achieves the best results reported so far.

4.3 Performance for Mention Detection

The performance of mention detection for all systems on the ACE-2004 and CoNLL-2012 datasets is shown in Table 4. These results show that our developed system exhibits significant improvement on precision and recall for both datasets. H-M-MD mainly improves on recall, indicating, as expected, that the mention head candidate generation module ensures high recall on mention heads. H-Joint-M mainly improves on precision, indicating, as expected, that the joint framework correctly rejects many of the invalid mention head candidates during joint inference. Our joint model can adapt to annotations with or without singleton mentions. Based on training data, our system has the ability to preserve true singleton mentions in ACE while rejecting many singleton mentions in OntoNotes14. Note that we have better mention detection results on the ACE-2004 dataset than on the OntoNotes-5.0 dataset. We believe that this is due to the fact that extracting mention heads in the OntoNotes dataset is somewhat noisy.

4.4 Analysis of Performance Improvement

The improvement of our H-Joint-M system is due to two distinct but related modules: the mention head candidate generation module ("Head") and the joint learning and inference framework ("Joint").15 We

13 Here, we treat mention heads as mentions. Thus, in the evaluation script, we set the boundary of a mention to be the boundary of its corresponding mention head.

14 Please note that when evaluating on OntoNotes, we eventually remove all singleton mentions from the output.

15 "Joint" rows are computed as "H-Joint-M" rows minus "Head" rows. They reflect the contribution of the joint framework to mention detection (by rejecting some mention heads).

Systems         Precision  Recall  F1-score
ACE-2004
Predicted_M         75.11   73.03     74.06
H-M-MD_M            77.45   92.97     83.90
H-Joint-M_M         85.34   91.73     88.42
Predicted_H         76.84   86.99     79.87
H-M-MD_H            80.82   93.45     86.68
H-Joint-M_H         88.85   92.27     90.53
CoNLL-2012
Predicted_M         65.28   63.41     64.33
H-M-MD_M            70.09   76.72     73.26
H-Joint-M_M         78.51   75.52     76.99
Predicted_H         76.38   74.02     75.18
H-M-MD_H            77.73   83.99     80.74
H-Joint-M_H         85.07   82.31     83.67

Table 4: Performance of mention detection for all systems on the ACE-2004 and CoNLL-2012 datasets. Subscripts (M, H) indicate evaluations on (mentions, mention heads) respectively. Our proposed H-Joint-M system dramatically improves the MD performance.

evaluate the effect of these two modules in terms of Mention Detection Error Reduction (MDER) and Performance Gap Reduction (PGR) for coreference. MDER is computed as the ratio of performance improvement for mention detection over the original mention detection error rate, while PGR is computed as the ratio of performance improvement for coreference over the performance gap for coreference. Results on the ACE-2004 and CoNLL-2012 datasets are shown in Table 5.16
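To make the two ratios concrete, they can be computed as in the sketch below. The MDER formula reproduces the Table 5 values from the Table 4 mention-detection F1 scores (e.g., 55.36 for H-Joint-M_M on ACE-2004); the PGR call uses placeholder numbers, since the gold-mention coreference scores it needs are not reproduced here, and the function names are ours.

```python
def mention_detection_error_reduction(baseline_md_f1, system_md_f1):
    """MDER: mention detection F1 improvement divided by the baseline's
    mention detection error rate (100 - baseline F1)."""
    return 100.0 * (system_md_f1 - baseline_md_f1) / (100.0 - baseline_md_f1)

def performance_gap_reduction(baseline_avg, system_avg, gold_mention_avg):
    """PGR: coreference AVG F1 improvement divided by the gap between the
    baseline on predicted mentions and the same model on gold mentions."""
    return 100.0 * (system_avg - baseline_avg) / (gold_mention_avg - baseline_avg)

# ACE-2004, mention level: Predicted_M F1 = 74.06, H-Joint-M_M F1 = 88.42 (Table 4)
print(mention_detection_error_reduction(74.06, 88.42))  # ~55.36, as in Table 5

# Hypothetical coreference AVG scores, purely for illustration:
print(performance_gap_reduction(55.0, 60.0, 75.0))      # 25.0
```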

The mention head candidate generation module has a bigger impact on MDER compared to the joint framework. However, they both have the same level of positive effects on PGR for coreference resolution. On both datasets, we achieve more than 20% performance gap reduction for coreference.

5 Related Work

Coreference resolution has been extensively studied, with several state-of-the-art approaches addressing this task (Lee et al., 2011; Durrett and Klein, 2013; Bjorkelund and Kuhn, 2014; Song et al., 2012). Many of the early rule-based systems like Hobbs (1978) and Lappin and Leass (1994) gained considerable popularity. The early designs were easy to understand and the rules were designed manually. Machine learning approaches were introduced in many works (Connolly et al., 1997; Ng and

16 We use bootstrap resampling (10 times from the test data) with a signed rank test. All the improvements shown are statistically significant.


ACE-2004       MDER   PGR (AVG)
Head_M         37.93   12.29
Joint_M        17.43   13.99
H-Joint-M_M    55.36   26.28
Head_H         34.00    7.01
Joint_H        19.22   15.21
H-Joint-M_H    53.22   22.22

CoNLL-2012     MDER   PGR (AVG)
Head_M         25.04   12.16
Joint_M        10.45   10.44
H-Joint-M_M    35.49   22.60
Head_H         22.40   10.58
Joint_H        11.81   13.75
H-Joint-M_H    34.21   24.33

Table 5: Analysis of performance improvement in terms of Mention Detection Error Reduction (MDER) and Performance Gap Reduction (PGR) for coreference resolution on the ACE-2004 and CoNLL-2012 datasets. "Head" represents the mention head candidate generation module, "Joint" represents the joint learning and inference framework, and "H-Joint-M" indicates the end-to-end system.

Cardie, 2002; Bengtson and Roth, 2008; Soon et al., 2001). The introduction of ILP methods has influenced the coreference area too (Chang et al., 2011; Denis and Baldridge, 2007). In this paper, we use the Constrained Latent Left-Linking Model (CL3M) described in Chang et al. (2013) in our experiments.

The task of mention detection is closely related to Named Entity Recognition (NER). Punyakanok and Roth (2001) thoroughly study phrase identification in sentences and propose three different general approaches. They aim to learn several different local classifiers and combine them to optimally satisfy some global constraints. Cardie and Pierce (1998) propose to select certain rules based on a given corpus to identify base noun phrases. However, the phrases detected are not necessarily mentions that we need to discover. Ratinov and Roth (2009) present detailed studies on the task of named entity recognition, discussing and comparing different methods on multiple aspects including chunk representation, inference method, utility of non-local features, and integration of external knowledge. NER can be regarded as a sequential labeling problem, which can be modeled by several proposed models, e.g. Hidden Markov Models (Rabiner, 1989) or Conditional Random Fields (Sarawagi and Cohen, 2004). The typical BIO representation was introduced in Ramshaw and Marcus (1995); OC representations were introduced in Church (1988), while Finkel and Manning (2009) further study nested named entity recognition, which employs a tree structure as a representation for identifying named entities within other named entities.

The most relevant study on mentions in the context of coreference was done in Recasens et al. (2013); this work studies distinguishing singleton mentions from coreferent mentions. Our joint framework provides similar insights, where the added mention decision variable partly reflects whether the mention is a singleton or not.

Several recent works suggest studying coreference jointly with other tasks. Lee et al. (2012) model entity coreference and event coreference jointly; Durrett and Klein (2014) consider joint coreference and entity linking. The work closest to ours is that of Lassalle and Denis (2015), which studies a joint anaphoricity detection and coreference resolution framework. While their inference objective is similar, their work assumes gold mentions are given and thus their modeling is very different.

6 Conclusion

This paper proposes a joint inference approach to the end-to-end coreference resolution problem. By moving to identify mention heads rather than mentions, and by developing an ILP-based, joint, online learning and inference approach, we close a significant fraction of the existing gap between coreference systems' performance on gold mentions and their performance on raw data. At the same time, we show substantial improvements in mention detection. We believe that our approach will generalize well to many other NLP problems, where the performance on raw data (the result that really matters) is still significantly lower than the performance on gold data.

Acknowledgments

This work is partly supported by NSF grant #SMA12-09359 and by DARPA under agreement number FA8750-13-2-0008. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government.


References

A. Bagga and B. Baldwin. 1998. Algorithms for scoring coreference chains. In The First International Conference on Language Resources and Evaluation Workshop on Linguistics Coreference.

E. Bengtson and D. Roth. 2008. Understanding the value of features for coreference resolution. In EMNLP.

A. Bjorkelund and J. Kuhn. 2014. Learning structured perceptrons for coreference resolution with latent antecedents and non-local features. In ACL.

C. Cardie and D. Pierce. 1998. Error-driven pruning of treebank grammars for base noun phrase identification. In Proceedings of ACL-98.

K. Chang, R. Samdani, A. Rozovskaya, N. Rizzolo, M. Sammons, and D. Roth. 2011. Inference protocols for coreference resolution. In CoNLL.

K.-W. Chang, R. Samdani, and D. Roth. 2013. A constrained latent variable model for coreference resolution. In EMNLP.

K. Church. 1988. A stochastic parts program and noun phrase parser for unrestricted text. In ANLP.

M. Collins. 1999. Head-driven Statistical Models for Natural Language Parsing. Ph.D. thesis, Computer Science Department, University of Pennsylvania.

D. Connolly, J. Burger, and D. Day. 1997. A machine learning approach to anaphoric reference. In New Methods in Language Processing.

A. Culotta, M. Wick, and A. McCallum. 2007. First-order probabilistic models for coreference resolution. In NAACL.

P. Denis and J. Baldridge. 2007. Joint determination of anaphoricity and coreference resolution using integer programming. In NAACL.

G. Durrett and D. Klein. 2013. Easy victories and uphill battles in coreference resolution. In EMNLP.

G. Durrett and D. Klein. 2014. A joint model for entity analysis: Coreference, typing, and linking.

J. Finkel and C. Manning. 2009. Nested named entity recognition. In EMNLP.

J. R. Hobbs. 1978. Resolving pronoun references. Lingua.

E. Hovy, M. Marcus, M. Palmer, L. Ramshaw, and R. Weischedel. 2006. OntoNotes: The 90% solution. In Proceedings of HLT/NAACL.

J. Jackson and M. Craven. 1996. Learning sparse perceptrons. In Proceedings of the 1996 Advances in Neural Information Processing Systems.

S. Lappin and H. Leass. 1994. An algorithm for pronominal anaphora resolution. Computational Linguistics.

E. Lassalle and P. Denis. 2015. Joint anaphoricity detection and coreference resolution with constrained latent structures.

H. Lee, Y. Peirsman, A. Chang, N. Chambers, M. Surdeanu, and D. Jurafsky. 2011. Stanford's multi-pass sieve coreference resolution system at the CoNLL-2011 shared task. In Proceedings of the CoNLL-2011 Shared Task.

H. Lee, M. Recasens, A. Chang, M. Surdeanu, and D. Jurafsky. 2012. Joint entity and event coreference resolution across documents. In EMNLP.

X. Luo. 2005. On coreference resolution performance metrics. In EMNLP.

V. Ng and C. Cardie. 2002. Improving machine learning approaches to coreference resolution. In ACL.

NIST. 2004. The ACE evaluation plan.

S. Pradhan, A. Moschitti, N. Xue, O. Uryupina, and Y. Zhang. 2012. CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In CoNLL.

V. Punyakanok and D. Roth. 2001. The use of classifiers in sequential inference. In NIPS.

L. R. Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE.

L. A. Ramshaw and M. P. Marcus. 1995. Text chunking using transformation-based learning. In Proceedings of the Third Annual Workshop on Very Large Corpora.

L. Ratinov and D. Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proc. of the Annual Conference on Computational Natural Language Learning (CoNLL).

M. Recasens, M.-C. de Marneffe, and C. Potts. 2013. The life and death of discourse entities: Identifying singleton mentions. In NAACL.

D. Roth and W. Yih. 2004. A linear programming formulation for global inference in natural language tasks. In Hwee Tou Ng and Ellen Riloff, editors, CoNLL.

S. Sarawagi and W. Cohen. 2004. Semi-Markov conditional random fields for information extraction. In NIPS.

Y. Song, J. Jiang, W.-X. Zhao, S. Li, and H. Wang. 2012. Joint learning for coreference resolution with Markov logic. In Proceedings of the 2012 Joint Conference of EMNLP-CoNLL.

W. M. Soon, H. T. Ng, and D. C. Y. Lim. 2001. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics.

M. Vilain, J. Burger, J. Aberdeen, D. Connolly, and L. Hirschman. 1995. A model-theoretic coreference scoring scheme. In Proceedings of the 6th conference on Message Understanding.


Proceedings of the 19th Conference on Computational Language Learning, pages 22–31, Beijing, China, July 30-31, 2015. ©2015 Association for Computational Linguistics

A Supertag-Context Model for Weakly-Supervised CCG Parser Learning

Dan Garrette* Chris Dyer† Jason Baldridge‡ Noah A. Smith†

*Computer Science & Engineering, University of Washington, [email protected]
†School of Computer Science, Carnegie Mellon University, {cdyer,nasmith}@cs.cmu.edu
‡Department of Linguistics, University of Texas at Austin, [email protected]

Abstract

Combinatory Categorial Grammar (CCG) is a lexicalized grammar formalism in which words are associated with categories that specify the syntactic configurations in which they may occur. We present a novel parsing model with the capacity to capture the associative adjacent-category relationships intrinsic to CCG by parameterizing the relationships between each constituent label and the preterminal categories directly to its left and right, biasing the model toward constituent categories that can combine with their contexts. This builds on the intuitions of Klein and Manning's (2002) "constituent-context" model, which demonstrated the value of modeling context, but has the advantage of being able to exploit the properties of CCG. Our experiments show that our model outperforms a baseline in which this context information is not captured.

1 Introduction

Learning parsers from incomplete or indirect supervision is an important component of moving NLP research toward new domains and languages. But with less information, it becomes necessary to devise ways of making better use of the information that is available. In general, this means constructing inductive biases that take advantage of unannotated data to train probabilistic models.

One important example is the constituent-context model (CCM) of Klein and Manning (2002), which was specifically designed to capture the linguistic observation made by Radford (1988) that there are regularities to the contexts in which constituents appear. This phenomenon, known as substitutability, says that phrases of the same type appear in similar contexts. For example, the part-of-speech (POS) sequence ADJ NOUN frequently occurs between the tags DET and VERB. This DET—VERB context also frequently applies to the single-word sequence NOUN and to ADJ ADJ NOUN. From this, we might deduce that DET—VERB is a likely context for a noun phrase. CCM is able to learn which POS contexts are likely, and does so via a probabilistic generative model, providing a statistical, data-driven take on substitutability. However, since there is nothing intrinsic about the POS pair DET—VERB that indicates a priori that it is a likely constituent context, this fact must be inferred entirely from the data.

Baldridge (2008) observed that unlike opaque, atomic POS labels, the rich structures of Combinatory Categorial Grammar (CCG) (Steedman, 2000; Steedman and Baldridge, 2011) categories reflect universal grammatical properties. CCG is a lexicalized grammar formalism in which every constituent in a sentence is associated with a structured category that specifies its syntactic relationship to other constituents. For example, a category might encode that "this constituent can combine with a noun phrase to the right (an object) and then a noun phrase to the left (a subject) to produce a sentence" instead of simply VERB. CCG has proven useful as a framework for grammar induction due to its ability to incorporate linguistic knowledge to guide parser learning by, for example, specifying rules in lexical-expansion algorithms (Bisk and Hockenmaier, 2012; 2013) or encoding that information as priors within a Bayesian framework (Garrette et al., 2015).

Baldridge observed that, cross-linguistically, grammars prefer simpler syntactic structures when possible, and that due to the natural correspondence of categories and syntactic structure, biasing toward simpler categories encourages simpler structures. In previous work, we were able to incorporate this preference into a Bayesian parsing model, biasing PCFG productions toward simpler categories by encoding a notion of category simplicity into a prior (Garrette et al., 2015). Baldridge further notes that due to the natural associativity of CCG, adjacent categories tend to be combinable. We previously showed that incorporating this intuition into a Bayesian prior can help train a CCG supertagger (Garrette et al., 2014).

In this paper, we present a novel parsing model that is designed specifically for the capacity to capture both of these universal, intrinsic properties of CCG. We do so by extending our previous, PCFG-based parsing model to include parameters that govern the relationship between constituent categories and the preterminal categories (also known as supertags) to the left and right. The advantage of modeling context within a CCG framework is that while CCM must learn which contexts are likely purely from the data, the CCG categories give us obvious a priori information about whether a context is likely for a given constituent based on whether the categories are combinable. Biasing our model towards both simple categories and connecting contexts encourages learning structures with simpler syntax and that have a better global "fit".

The Bayesian framework is well-matched to our problem since our inductive biases — those derived from universal grammar principles, weak supervision, and estimations based on unannotated data — can be encoded as priors, and we can use Markov chain Monte Carlo (MCMC) inference procedures to automatically blend these biases with unannotated text that reflects the way language is actually used "in the wild". Thus, we learn context information based on statistics in the data like CCM, but have the advantage of additional, a priori biases. It is important to note that the Bayesian setup allows us to use these universal biases as soft constraints: they guide the learner toward more appropriate grammars, but may be overridden when there is compelling contradictory evidence in the data.

Methodologically, this work serves as an example of how linguistic-theoretical commitments can be used to benefit data-driven methods, not only through the construction of a model family from a grammar, as done in our previous work, but also when exploiting statistical associations about which the theory is silent. While there has been much work in computational modeling of the interaction between universal grammar and observable data in the context of studying child language acquisition (e.g., Villavicencio, 2002; Goldwater, 2007), we are interested in applying these principles to the design of models and learning procedures that result in better parsing tools. Given our desire to train NLP models in low-supervision scenarios, the possibility of constructing inductive biases out of universal properties of language is enticing: if we can do this well, then it only needs to be done once, and can be applied to any language or domain without adaptation.

In this paper, we seek to learn from only raw data and an incomplete dictionary mapping some words to sets of potential supertags. In order to estimate the parameters of our model, we develop a blocked sampler based on that of Johnson et al. (2007) to sample parse trees for sentences in the raw training corpus according to their posterior probabilities. However, due to the very large sets of potential supertags used in a parse, computing inside charts is intractable, so we design a Metropolis-Hastings step that allows us to sample efficiently from the correct posterior. Our experiments show that the incorporation of supertag-context parameters into the model improves learning, and that placing combinability-preferring priors on those parameters yields further gains in many scenarios.

2 Combinatory Categorial Grammar

In the CCG formalism, every constituent, including those at the lexical level, is associated with a structured CCG category that defines that constituent's relationships to the other constituents in the sentence. Categories are defined by a recursive structure, where a category is either atomic (possibly with features), or a function from one category to another, as indicated by a slash operator:

C → {s, s_dcl, s_adj, s_b, np, n, n_num, pp, ...}
C → {(C/C), (C\C)}

Categories of adjacent constituents can be combined using one of a set of combination rules to form categories of higher-level constituents, as seen in Figure 1. The direction of the slash operator gives the behavior of the function. A category (s\np)/pp might describe an intransitive verb with a prepositional phrase complement; it combines on the right (/) with a constituent with category pp, and then on the left (\) with a noun phrase (np) that serves as its subject.
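For concreteness, a CCG category can be represented as a small recursive data type; the sketch below is our own illustration (the class names and feature notation are assumptions, not the authors' code).

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass(frozen=True)
class Atom:
    """Atomic category, optionally carrying a feature, e.g. s with feature dcl."""
    base: str                      # e.g. "s", "np", "n", "pp"
    feature: Optional[str] = None  # e.g. "dcl", "adj", "b", "num"

    def __str__(self):
        return self.base if self.feature is None else f"{self.base}[{self.feature}]"

@dataclass(frozen=True)
class Slash:
    """Functional category: result combines with an argument to the right (/) or left (\\)."""
    result: "Category"
    direction: str                 # "/" or "\\"
    arg: "Category"

    def __str__(self):
        return f"({self.result}{self.direction}{self.arg})"

Category = Union[Atom, Slash]

# (s\np)/pp: combine with a pp on the right, then an np (the subject) on the left, to give s.
walks = Slash(Slash(Atom("s"), "\\", Atom("np")), "/", Atom("pp"))
print(walks)  # ((s\np)/pp)
```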


Figure 1: CCG parse for "The man walks to work." (lexical categories: The → np/n, man → n, walks → (s\np)/pp, to → pp/np, work → np)

We follow Lewis and Steedman (2014) in allowing a small set of generic, linguistically-plausible unary and binary grammar rules. We further add rules for combining with punctuation to the left and right and allow for the merge rule X → X X of Clark and Curran (2007).

3 Generative Model

In this section, we present our novel supertag-context model (SCM) that augments a standard PCFG with parameters governing the supertags to the left and right of each constituent.

The CCG formalism is said to be naturally associative since a constituent label is often able to combine on either the left or the right. As a motivating example, consider the sentence "The lazy dog sleeps", as shown in Figure 2. The word lazy, with category n/n, can either combine with dog (n) via the Forward Application rule (>), or with The (np/n) via the Forward Composition (>B) rule. Baldridge (2008) showed that this tendency for adjacent supertags to be combinable can be used to bias a sequence model in order to learn better CCG supertaggers. However, we can see that if the supertags of adjacent words lazy (n/n) and dog (n) combine, then they will produce the category n, which describes the entire constituent span "lazy dog". Since we have produced a new category that subsumes that entire span, a valid parse must next combine that n with one of the remaining supertags to the left or right, producing either (The·(lazy·dog))·sleeps or The·((lazy·dog)·sleeps). Because we know that one (or both) of these combinations must be valid, we will similarly want a strong prior on the connectivity between lazy·dog and its supertag context: The ↔ (lazy·dog) ↔ sleeps.

Assuming T is the full set of known categories, the generative process for our model is:

Figure 2: "The lazy dog sleeps" with supertags np/n, n/n, n, s\np. The higher-level category n subsumes the categories of its constituents; thus, n should have a strong prior on combinability with its adjacent supertags np/n and s\np.

Parameters:
  θ^ROOT ~ Dir(α^ROOT, θ^ROOT-0)
  θ^BIN_t ~ Dir(α^BIN, θ^BIN-0)        ∀ t ∈ T
  θ^UN_t ~ Dir(α^UN, θ^UN-0)           ∀ t ∈ T
  θ^TERM_t ~ Dir(α^TERM, θ^TERM-0_t)   ∀ t ∈ T
  λ_t ~ Dir(α^λ, λ^0)                  ∀ t ∈ T
  θ^LCTX_t ~ Dir(α^LCTX, θ^LCTX-0_t)   ∀ t ∈ T
  θ^RCTX_t ~ Dir(α^RCTX, θ^RCTX-0_t)   ∀ t ∈ T

Sentence:
  do
    s ~ Cat(θ^ROOT)
    y | s ~ SCM(s)
  until the tree y is valid

where ⟨ℓ, y, r⟩ | t ~ SCM(t) is defined as:
  z ~ Cat(λ_t)
  if z = B:  ⟨u, v⟩ | t ~ Cat(θ^BIN_t)
             y_L | u ~ SCM(u),  y_R | v ~ SCM(v)
             y = ⟨y_L, y_R⟩
  if z = U:  ⟨u⟩ | t ~ Cat(θ^UN_t)
             y | u ~ SCM(u)
  if z = T:  w | t ~ Cat(θ^TERM_t)
             y = w
  ℓ | t ~ Cat(θ^LCTX_t),  r | t ~ Cat(θ^RCTX_t)

The process begins by sampling the parameters from Dirichlet distributions: a distribution θ^ROOT over root categories, a conditional distribution θ^BIN_t over binary branching productions given category t, θ^UN_t for unary rewrite productions, θ^TERM_t for terminal (word) productions, and θ^LCTX_t and θ^RCTX_t for left and right contexts. We also sample parameters λ_t for the probability of t producing a binary branch, unary rewrite, or terminal word.

Next we sample a sentence. This begins by sampling first a root category s and then recursively sampling subtrees. For each subtree rooted by a category t, we generate a left context supertag ℓ and a right context supertag r.


Figure 3: The generative process starting with non-terminal A_ij, where t_x is the supertag for w_x, the word at position x, and "A → B C" is a valid production in the grammar. Non-terminal A_ij generates nonterminals B_ik and C_kj (solid arrows) as well as generating left context t_{i-1} and right context t_j (dashed arrows); likewise for B_ik and C_kj. The triangle under a non-terminal indicates the complete subtree rooted by the node.

Then, we sample a production type z corresponding to either a (B) binary, (U) unary, or (T) terminal production. Depending on z, we then sample either a binary production ⟨u, v⟩ and recurse, a unary production ⟨u⟩ and recurse, or a terminal word w and end that branch. A tree is complete when all branches end in terminal words. See Figure 3 for a graphical depiction of the generative behavior of the process. Finally, since it is possible to generate a supertag context category that does not match the actual category generated by the neighboring constituent, we must allow our process to reject such invalid trees and re-attempt to sample.
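The generative story can be sketched in code as below. This is a minimal illustration under assumed data structures: the params dictionary stands in for draws of θ^ROOT, θ^BIN, θ^UN, θ^TERM, λ, θ^LCTX, and θ^RCTX, the toy grammar at the bottom is invented for the demo, and the validity check is left as a stub.

```python
import random

def sample_from(dist):
    """Draw a key from a {outcome: probability} dictionary."""
    r, total, outcome = random.random(), 0.0, None
    for outcome, p in dist.items():
        total += p
        if r <= total:
            return outcome
    return outcome  # guard against floating-point rounding

def scm(t, params):
    """Generate <left_context, subtree, right_context> rooted by category t."""
    z = sample_from(params["lam"][t])               # production type: B, U, or T
    if z == "B":
        u, v = sample_from(params["theta_bin"][t])  # binary production <u, v>
        y = (scm(u, params), scm(v, params))
    elif z == "U":
        u = sample_from(params["theta_un"][t])      # unary rewrite <u>
        y = scm(u, params)
    else:
        y = sample_from(params["theta_term"][t])    # terminal word w
    l = sample_from(params["theta_lctx"][t])        # left context supertag
    r = sample_from(params["theta_rctx"][t])        # right context supertag
    return (l, y, r)

def is_valid(tree):
    # Placeholder validity check; a real implementation would compare each
    # generated context supertag against its actual neighbor in the tree.
    return True

def sample_tree(params):
    """Rejection-sample: re-draw whenever the generated tree is invalid."""
    while True:
        s = sample_from(params["theta_root"])
        tree = scm(s, params)
        if is_valid(tree):
            return tree

# Toy parameters (purely illustrative): s -> np (s\np), np -> "dogs", s\np -> "sleep"
ctx = {"<S>": 0.5, "<E>": 0.5}
toy = {
    "theta_root": {"s": 1.0},
    "lam": {"s": {"B": 1.0}, "np": {"T": 1.0}, "s\\np": {"T": 1.0}},
    "theta_bin": {"s": {("np", "s\\np"): 1.0}},
    "theta_un": {},
    "theta_term": {"np": {"dogs": 1.0}, "s\\np": {"sleep": 1.0}},
    "theta_lctx": {c: dict(ctx) for c in ("s", "np", "s\\np")},
    "theta_rctx": {c: dict(ctx) for c in ("s", "np", "s\\np")},
}
print(sample_tree(toy))
```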

Like CCM, this model is deficient since the same supertags are generated multiple times, and parses with conflicting supertags are not valid. Since we are not generating from the model, this does not introduce difficulties (Klein and Manning, 2002).

One additional complication that must be addressed is that left-frontier non-terminal categories — those whose subtree span includes the first word of the sentence — do not have a left-side supertag to use as context. For these cases, we use the special sentence-start symbol ⟨S⟩ to serve as context. Similarly, we use the end symbol ⟨E⟩ for the right-side context of the right-frontier.

We next discuss how the prior distributions are constructed to encode desirable biases, using universal CCG properties.

3.1 Non-terminal production prior means

For the root, binary, and unary parameters, we want to choose prior means that encode our bias toward cross-linguistically-plausible categories. To formalize the notion of what it means for a category to be more "plausible", we extend the category generator of our previous work, which we will call PCAT. We can define PCAT using a probabilistic grammar (Garrette et al., 2014). The grammar may first generate a start or end category (⟨S⟩, ⟨E⟩) with probability p_se or a special token-deletion category (⟨D⟩; explained in §5) with probability p_del, or a standard CCG category C:

X → ⟨S⟩ | ⟨E⟩     p_se
X → ⟨D⟩           p_del
X → C             (1 − (2·p_se + p_del)) · PC(C)

For each sentence s, there will be one ⟨S⟩ and one ⟨E⟩, so we set p_se = 1/(25 + 2), since the average sentence length in the corpora is roughly 25. To discourage the model from deleting tokens (this only applies during testing), we set p_del = 10^−100.

For PC, the distribution over standard categories, we use a recursive definition based on the structure of a CCG category. If p̄ = 1 − p, then:1

C → a      p_term · p_atom(a)
C → A/A    p̄_term · p_fwd · (p_mod · PC(A) + p̄_mod · PC(A)²)
C → A/B    p̄_term · p_fwd · p̄_mod · PC(A) · PC(B)
C → A\A    p̄_term · p̄_fwd · (p_mod · PC(A) + p̄_mod · PC(A)²)
C → A\B    p̄_term · p̄_fwd · p̄_mod · PC(A) · PC(B)

The category grammar captures important aspects of what makes a category more or less likely: (1) simplicity is preferred, with a higher p_term meaning a stronger emphasis on simplicity;2 (2) atomic types may occur at different rates, as given by p_atom; (3) modifier categories (A/A or A\A) are more likely than similar-complexity non-modifiers (such as an adverb that modifies a verb); and (4) operators may occur at different rates, as given by p_fwd.
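The recursive definition of PC can be transcribed almost directly; the sketch below encodes a category as either a string (atom) or a (result, operator, argument) tuple. The p_term, p_fwd, and p_mod defaults are the values reported in Section 5, while the uniform p_atom over a small atom set is purely an assumption for illustration.

```python
def p_cat(c, p_term=0.7, p_fwd=0.5, p_mod=0.2, p_atom=None):
    """P_C(C) over categories encoded as nested tuples: an atom is a string
    like "np"; a slash category is (result, op, arg) with op "/" or "\\"."""
    if p_atom is None:
        atoms = ("s", "np", "n", "pp")              # assumed atom inventory
        p_atom = {a: 1.0 / len(atoms) for a in atoms}
    if isinstance(c, str):                          # atomic category
        return p_term * p_atom.get(c, 0.0)
    result, op, arg = c
    p_dir = p_fwd if op == "/" else 1.0 - p_fwd
    p_res = p_cat(result, p_term, p_fwd, p_mod, p_atom)
    if result == arg:                               # modifier category A/A or A\A
        return (1 - p_term) * p_dir * (p_mod * p_res + (1 - p_mod) * p_res ** 2)
    return (1 - p_term) * p_dir * (1 - p_mod) * p_res * p_cat(arg, p_term, p_fwd, p_mod, p_atom)

# Simpler categories receive higher prior probability:
print(p_cat("np"))                            # atomic np
print(p_cat(("s", "\\", "np")))               # s\np
print(p_cat((("s", "\\", "np"), "/", "np")))  # (s\np)/np
```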

We can use PCAT to define priors on our production parameters that bias our model toward rules

1 Note that this version has also updated the probability definitions for modifiers to be sums, incorporating the fact that any A/A is also an A/B (likewise for A\A). This ensures that our grammar defines a valid probability distribution.

2 The probability distribution over categories is guaranteed to be proper so long as p_term > 1/2, since the probability of the depth of a tree will decrease geometrically (Chi, 1999).


that result in a priori more likely categories:3

θ^ROOT-0(t) = PCAT(t)
θ^BIN-0(⟨u, v⟩) = PCAT(u) · PCAT(v)
θ^UN-0(⟨u⟩) = PCAT(u)

For simplicity, we assume the production-type mixture prior to be uniform: λ^0 = ⟨1/3, 1/3, 1/3⟩.

3.2 Terminal production prior means

We employ the same procedure as our previous work for setting the terminal production prior distributions θ^TERM-0_t(w) by estimating word-given-category relationships from the weak supervision: the tag dictionary and raw corpus (Garrette and Baldridge, 2012; Garrette et al., 2015).4 This procedure attempts to automatically estimate the frequency of each word/tag combination by dividing the number of raw-corpus occurrences of each word in the dictionary evenly across all of its associated tags. These counts are then combined with estimates of the "openness" of each tag in order to assess its likelihood of appearing with new words.
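The count-splitting idea behind this estimate can be sketched as follows; this simplified version omits the per-tag "openness" adjustment, and the function and variable names are ours.

```python
from collections import Counter, defaultdict

def estimate_terminal_prior(raw_tokens, tag_dictionary):
    """Split each dictionary word's raw-corpus count evenly across its allowed
    supertags, then normalize per tag to get an approximate P(word | tag)."""
    word_counts = Counter(raw_tokens)
    tag_word_counts = defaultdict(Counter)
    for word, tags in tag_dictionary.items():
        share = word_counts[word] / len(tags) if tags else 0.0
        for tag in tags:
            tag_word_counts[tag][word] += share
    prior = {}
    for tag, counts in tag_word_counts.items():
        total = sum(counts.values())
        prior[tag] = {w: c / total for w, c in counts.items()} if total else {}
    return prior

# Toy example (hypothetical tag dictionary and raw corpus):
raw = "the dog sleeps the dog barks".split()
td = {"the": ["np/n"], "dog": ["n"], "sleeps": ["s\\np"], "barks": ["s\\np"]}
print(estimate_terminal_prior(raw, td)["s\\np"])
```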

3.3 Context parameter prior means

In order to encourage our model to choose trees in which the constituent labels "fit" into their supertag contexts, we want to bias our context parameters toward context categories that are combinable with the constituent label.

The right-side context of a non-terminal category — the probability of generating a category to the right of the current constituent's category — corresponds directly to the category transitions used for the HMM supertagger of Garrette et al. (2014). Thus, the right-side context prior mean θ^RCTX-0_t can be biased in exactly the same way as the HMM supertagger's transitions: toward context supertags that connect to the constituent label.

To encode a notion of combinability, we follow Baldridge's (2008) definition. Briefly, let κ(t, u) ∈ {0, 1} be an indicator of whether t combines with u (in that order). For any binary rule that can combine t to u, κ(t, u) = 1. To ensure that our prior captures the natural associativity of CCG, we define combinability in this context to include composition rules as well as application rules. If atoms have features associated, then the atoms are allowed to unify if the features match, or if at least one of them does not have a feature. In defining κ, it is also important to ignore possible arguments on the wrong side of the combination since they can be consumed without affecting the connection between the two. To achieve this for κ(t, u), it is assumed that it is possible to consume all preceding arguments of t and all following arguments of u. So κ(np, (s\np)/np) = 1. This helps to ensure the associativity discussed earlier. For "combining" with the start or end of a sentence, we define κ(⟨S⟩, u) = 1 when u seeks no left-side arguments (since there are no tags to the left with which to combine) and κ(t, ⟨E⟩) = 1 when t seeks no right-side arguments. So κ(⟨S⟩, np/n) = 1, but κ(⟨S⟩, s\np) = 0. Finally, due to the frequent use of the unary rule that allows n to be rewritten as np, the atom np is allowed to unify with n if n is the argument. So κ(n, s\np) = 1, but κ(np/n, np) = 0.

3 For our experiments, we normalize PCAT by dividing by Σ_{c∈T} PCAT(c). This allows for experiments contrasting with a uniform prior (1/|T|) without adjusting α values.

4 We refer the reader to the previous work (Garrette et al., 2015) for a fuller discussion and implementation details.
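A simplified rendering of the combinability indicator is sketched below, reusing the tuple encoding of categories (a string for an atom, a (result, op, arg) tuple for a slash category). It ignores atomic features and approximates composition by stripping consumable outer arguments, so it illustrates the definition above rather than reproducing the authors' exact implementation.

```python
def seeks_left(c):
    """Does category c still need an argument on its left?"""
    if isinstance(c, str):
        return False
    result, op, _ = c
    return True if op == "\\" else seeks_left(result)

def seeks_right(c):
    """Does category c still need an argument on its right?"""
    if isinstance(c, str):
        return False
    result, op, _ = c
    return True if op == "/" else seeks_right(result)

def strip(c, op):
    """Assume all of c's outer arguments on the given side have been consumed."""
    while isinstance(c, tuple) and c[1] == op:
        c = c[0]
    return c

def unifies(slot, supplied):
    """Argument slot matches the supplied category; an n may fill an np slot."""
    return slot == supplied or (slot == "np" and supplied == "n")

def kappa(t, u):
    """1 if t (on the left) can connect with u (on the right), else 0."""
    if t == "<S>":
        return 0 if seeks_left(u) else 1
    if u == "<E>":
        return 0 if seeks_right(t) else 1
    t2 = strip(t, "\\")   # t may consume all of its left-side arguments
    u2 = strip(u, "/")    # u may consume all of its right-side arguments
    if isinstance(t2, tuple) and t2[1] == "/" and unifies(t2[2], u2):
        return 1          # forward connection (application or composition)
    if isinstance(u2, tuple) and u2[1] == "\\" and unifies(u2[2], t2):
        return 1          # backward connection
    return 0

# Examples from the text:
print(kappa("np", (("s", "\\", "np"), "/", "np")))  # 1
print(kappa("<S>", ("np", "/", "n")))               # 1
print(kappa("<S>", ("s", "\\", "np")))              # 0
print(kappa("n", ("s", "\\", "np")))                # 1
print(kappa(("np", "/", "n"), "np"))                # 0
```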

The prior mean of producing a right-context supertag r from a constituent category t, P^right(r | t), is defined so that combinable pairs are given higher probability than non-combinable pairs. We further experimented with a prior that biases toward both combinability and category likelihood, replacing the uniform treatment of categories with our prior over categories, yielding P^right_CAT(r | t). If T is the full set of known CCG categories:

P^right(r | t)     = σ · 1/|T|     if κ(t, r);   1/|T|     otherwise        (σ > 1)
P^right_CAT(r | t) = σ · PCAT(r)   if κ(t, r);   PCAT(r)   otherwise

Distributions P^left(ℓ | t) and P^left_CAT(ℓ | t) are defined in the same way, but with the combinability direction flipped: κ(ℓ, t), since the left context supertag precedes the constituent category.
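Constructing the two prior means is then mechanical; the sketch below scores each candidate context category and normalizes so the prior mean is a proper distribution (the normalization step is our own choice, and kappa and p_cat refer to functions like the ones sketched above, passed in as arguments).

```python
def right_context_prior(t, categories, kappa, sigma=1e5, p_cat=None):
    """Prior mean over right-context supertags r for constituent category t.
    With p_cat=None this is the uniform variant P^right; otherwise P^right_CAT."""
    base = {r: (1.0 / len(categories) if p_cat is None else p_cat(r))
            for r in categories}
    scores = {r: (sigma * b if kappa(t, r) else b) for r, b in base.items()}
    z = sum(scores.values())
    return {r: s / z for r, s in scores.items()}  # normalize to a distribution

def left_context_prior(t, categories, kappa, sigma=1e5, p_cat=None):
    """Same construction with the combinability direction flipped: kappa(l, t)."""
    return right_context_prior(t, categories, lambda a, b: kappa(b, a), sigma, p_cat)

# Toy usage with a trivial combinability predicate:
cats = ["np", "n", ("np", "/", "n")]
toy_kappa = lambda t, u: 1 if (t, u) == (("np", "/", "n"), "n") else 0
print(right_context_prior(("np", "/", "n"), cats, toy_kappa, sigma=10.0))
```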

4 Posterior Inference

We wish to infer the distribution over CCG parses, given the model we just described and a corpus of sentences. Since there is no way to analytically compute these modes, we resort to Gibbs sampling to find an approximate solution. Our strategy is based on the approach presented by Johnson et al. (2007). At a high level, we alternate between resampling model parameters (θ^ROOT, θ^BIN, θ^UN, θ^TERM, λ, θ^LCTX, θ^RCTX) given the current set of parse trees and resampling those trees given the current model parameters and observed word sequences. To efficiently sample new model parameters, we exploit Dirichlet-multinomial conjugacy. By repeating these alternating steps and accumulating the productions, we obtain an approximation of the required posterior quantities.

Our inference procedure takes as input the distribution prior means, along with the raw corpus and tag dictionary. During sampling, we restrict the tag choices for a word w to categories allowed by the tag dictionary. Since real-world learning scenarios will always lack complete knowledge of the lexicon, we, too, want to allow for unknown words; for these, we assume the word may take any known supertag. We refer to the sequence of word tokens as w and a non-terminal category covering the span i through j − 1 as y_ij.

While it is technically possible to sample directly from our context-sensitive model, the high number of potential supertags available for each context means that computing the inside chart for this model is intractable for most sentences. In order to overcome this limitation, we employ an accept/reject Metropolis-Hastings (MH) step. The basic idea is that we sample trees according to a simpler proposal distribution Q that approximates the full distribution and for which direct sampling is tractable, and then choose to accept or reject those trees based on the true distribution P.

For our model, there is a straightforward and intuitive choice for the proposal distribution: the PCFG model without our context parameters (θ^ROOT, θ^BIN, θ^UN, θ^TERM, λ), which is known to have an efficient sampling method. Our acceptance step is therefore based on the remaining parameters: the context (θ^LCTX, θ^RCTX).

To sample from our proposal distribution, we use a blocked Gibbs sampler based on the one proposed by Goodman (1998) and used by Johnson et al. (2007) that samples entire parse trees. For a sentence w, the strategy is to use the Inside algorithm (Lari and Young, 1990) to inductively compute, for each potential non-terminal position spanning words w_i through w_{j−1} and category t, going "up" the tree, the probability of generating w_i, ..., w_{j−1} via any arrangement of productions that is rooted by y_ij = t.

p(w_i | y_{i,i+1} = t) = λ_t(T) · θ^TERM_t(w_i)
                       + Σ_{t→⟨u⟩} λ_t(U) · θ^UN_t(⟨u⟩) · p(w_i | y_{i,i+1} = u)

p(w_{i:j−1} | y_ij = t) = Σ_{t→⟨u⟩} λ_t(U) · θ^UN_t(⟨u⟩) · p(w_{i:j−1} | y_ij = u)
                        + Σ_{t→⟨u,v⟩} Σ_{i<k<j} λ_t(B) · θ^BIN_t(⟨u, v⟩)
                              · p(w_{i:k−1} | y_ik = u) · p(w_{k:j−1} | y_kj = v)

y0n ⇠ ✓ROOTt

· p(w0:n�1 | y0n = t)

x | yij ⇠⌦✓BINyij

(hu,vi) · p(wi:k�1 | yik = u)

· p(wk:j�1 | ykj = v)

8 yik, ykj when j > i + 1,

✓UNyij

(hui) · p(wi:j�1 | y0ij = u) 8 y0ij ,

✓TERMyij

(wi) when j = i + 1

where x is either a split point k and pair of cate-gories yik, ykj resulting from a binary rewrite rule,a single category y0ij resulting from a unary rule, ora word w resulting from a terminal rule.
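The inside pass over the proposal PCFG can be sketched as below. For readability this version omits unary rewrites (which require closing over unary chains) and runs on an invented toy grammar, so it illustrates the recursion rather than reproducing the authors' implementation; a top-down pass would then sample a production at each node in proportion to these chart entries.

```python
from collections import defaultdict

def inside_chart(words, categories, theta_term, theta_bin, lam):
    """Simplified inside pass: chart[(i, j)][t] = p(w_i .. w_{j-1} | y_ij = t),
    considering only terminal and binary productions (unary rules omitted)."""
    n = len(words)
    chart = defaultdict(dict)
    for i, w in enumerate(words):                 # width-1 spans: terminal productions
        for t in categories:
            p = lam[t]["T"] * theta_term[t].get(w, 0.0)
            if p > 0.0:
                chart[(i, i + 1)][t] = p
    for width in range(2, n + 1):                 # wider spans: binary productions
        for i in range(n - width + 1):
            j = i + width
            for t in categories:
                total = 0.0
                for (u, v), p_rule in theta_bin.get(t, {}).items():
                    for k in range(i + 1, j):     # all split points i < k < j
                        total += (lam[t]["B"] * p_rule
                                  * chart[(i, k)].get(u, 0.0)
                                  * chart[(k, j)].get(v, 0.0))
                if total > 0.0:
                    chart[(i, j)][t] = total
    return chart

# Toy grammar: s -> np s\np ; np -> "dogs" ; s\np -> "sleep"
cats = ["s", "np", "s\\np"]
lam = {"s": {"B": 1.0, "T": 0.0}, "np": {"B": 0.0, "T": 1.0}, "s\\np": {"B": 0.0, "T": 1.0}}
theta_bin = {"s": {("np", "s\\np"): 1.0}}
theta_term = {"s": {}, "np": {"dogs": 1.0}, "s\\np": {"sleep": 1.0}}
chart = inside_chart("dogs sleep".split(), cats, theta_term, theta_bin, lam)
print(chart[(0, 2)]["s"])  # 1.0 under this deterministic toy grammar
```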

The MH procedure requires an acceptance distribution A that is used to accept or reject a tree sampled from the proposal Q. The probability of accepting a new tree y′ given the previous tree y is:

A(y′ | y) = min( 1, [P(y′) / P(y)] · [Q(y) / Q(y′)] )

Since Q is defined as a subset of P's parameters, it is the case that:

P(y) = Q(y) · p(y | θ^LCTX, θ^RCTX)

After substituting this for each P in A, all of the Q factors cancel, yielding the acceptance distribution defined purely in terms of context parameters:

A(y′ | y) = min( 1, p(y′ | θ^LCTX, θ^RCTX) / p(y | θ^LCTX, θ^RCTX) )

For completeness, we note that the probability of a tree y given only the context parameters is:5

p(y | θ^LCTX, θ^RCTX) = Π_{0≤i<j≤n} θ^LCTX(y_{i−1,i} | y_ij) · θ^RCTX(y_{j,j+1} | y_ij)

5 Note that there may actually be multiple y_ij due to unary rules that "loop back" to the same position (i, j); all of these must be included in the product.
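The acceptance step itself is only a ratio of context probabilities; a minimal sketch in log space is shown below, assuming the list of constituents and the context parameter dictionaries are built elsewhere.

```python
import math
import random

def context_log_prob(constituents, theta_lctx, theta_rctx):
    """log p(y | theta_LCTX, theta_RCTX): product over every constituent of the
    probability of its actual left and right neighbor supertags.
    `constituents` is a list of (category, left_supertag, right_supertag) triples,
    with <S>/<E> used at the sentence boundaries; probabilities assumed nonzero."""
    total = 0.0
    for cat, left, right in constituents:
        total += math.log(theta_lctx[cat][left]) + math.log(theta_rctx[cat][right])
    return total

def mh_accept(new_constituents, old_constituents, theta_lctx, theta_rctx):
    """Accept the proposed tree with probability min(1, p(y')/p(y)) under the
    context parameters only; the proposal terms cancel as described above."""
    log_ratio = (context_log_prob(new_constituents, theta_lctx, theta_rctx)
                 - context_log_prob(old_constituents, theta_lctx, theta_rctx))
    return math.log(random.random()) < min(0.0, log_ratio)
```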


Before we begin sampling, we initialize each distribution to its prior mean (θ^ROOT = θ^ROOT-0, θ^BIN_t = θ^BIN-0, etc.). Since MH requires an initial set of trees to begin sampling, we parse the raw corpus with probabilistic CKY using these initial parameters (excluding the context parameters) to guess an initial tree for each raw sentence.

The sampler alternates sampling parse trees for the entire corpus of sentences using the above procedure with resampling the model parameters. Resampling the parameters requires empirical counts of each production. These counts are taken from the trees resulting from the previous round of sampling: new trees that have been "accepted" by the MH step, as well as existing trees for sentences in which the newly-sampled tree was rejected.

θ^ROOT ~ Dir( ⟨ α^ROOT · θ^ROOT-0(t) + C_root(t) ⟩_{t∈T} )
θ^BIN_t ~ Dir( ⟨ α^BIN · θ^BIN-0(⟨u, v⟩) + C(t → ⟨u, v⟩) ⟩_{u,v∈T} )
θ^UN_t ~ Dir( ⟨ α^UN · θ^UN-0(⟨u⟩) + C(t → ⟨u⟩) ⟩_{u∈T} )
θ^TERM_t ~ Dir( ⟨ α^TERM · θ^TERM-0_t(w) + C(t → w) ⟩_{w∈V} )
λ_t ~ Dir( ⟨ α^λ · λ^0(B) + Σ_{u,v∈T} C(t → ⟨u, v⟩),
             α^λ · λ^0(U) + Σ_{u∈T} C(t → ⟨u⟩),
             α^λ · λ^0(T) + Σ_{w∈V} C(t → w) ⟩ )
θ^LCTX_t ~ Dir( ⟨ α^LCTX · θ^LCTX-0_t(ℓ) + C_left(t, ℓ) ⟩_{ℓ∈T} )
θ^RCTX_t ~ Dir( ⟨ α^RCTX · θ^RCTX-0_t(r) + C_right(t, r) ⟩_{r∈T} )

It is important to note that this method of resampling allows the draws to incorporate both the data, in the form of counts, and the prior mean, which includes all of our carefully-constructed biases derived from both the intrinsic, universal CCG properties as well as the information we induced from the raw corpus and tag dictionary.
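Each of these conjugate updates is a single Dirichlet draw over pseudo-counts; a small sketch using NumPy is shown below (the support, prior, counts, and alpha values are placeholders; alpha = 1000 mirrors the α^RCTX = 10^3 setting reported in Section 5).

```python
import numpy as np

def resample_dirichlet(prior_mean, counts, alpha, support):
    """Draw new parameters from Dir(alpha * prior_mean + empirical counts),
    exploiting Dirichlet-multinomial conjugacy.  prior_mean and counts map
    outcomes to values; support fixes the order of the outcome vector."""
    pseudo = np.array([alpha * prior_mean.get(x, 0.0) + counts.get(x, 0.0)
                       for x in support])
    sample = np.random.dirichlet(pseudo)
    return dict(zip(support, sample))

# e.g. resampling the right-context distribution for one category t:
support = ["np", "n", "s\\np", "<E>"]
prior = {x: 1.0 / len(support) for x in support}   # placeholder prior mean
counts = {"n": 12.0, "<E>": 3.0}                   # counts from sampled trees
theta_rctx_t = resample_dirichlet(prior, counts, alpha=1000.0, support=support)
print(theta_rctx_t)
```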

After all sampling iterations have completed, the final model is estimated by pooling the trees resulting from each sampling iteration, including trees accepted by the MH steps as well as the duplicated trees retained due to rejections. We use this pool of trees to compute model parameters using the same procedure as we used directly above to sample parameters, except that instead of drawing a Dirichlet sample based on the vector of counts, we simply normalize those counts. However, since we require a final model that can parse sentences efficiently, we drop the context parameters, making the model a standard PCFG, which allows us to use the probabilistic CKY algorithm.

5 Experiments

In our evaluation we compared our supertag-context approach to (our reimplementation of) the

best-performing model of our previous work (Garrette et al., 2015), which SCM extends. We evaluated on the English CCGBank (Hockenmaier and Steedman, 2007), which is a transformation of the Penn Treebank (Marcus et al., 1993); the CTB-CCG (Tse and Curran, 2010) transformation of the Penn Chinese Treebank (Xue et al., 2005); and the CCG-TUT corpus (Bos et al., 2009), built from the TUT corpus of Italian text (Bosco et al., 2000).

Each corpus was divided into four distinct data sets: a set from which we extract the tag dictionaries, a set of raw (unannotated) sentences, a development set, and a test set. We use the same splits as Garrette et al. (2014). Since these treebanks use special representations for conjunctions, we chose to rewrite the trees to use conjunction categories of the form (X\X)/X rather than introducing special conjunction rules. In order to increase the amount of raw data available to the sampler, we supplemented the English data with raw, unannotated newswire sentences from the NYT Gigaword 5 corpus (Parker et al., 2011) and supplemented Italian with the out-of-domain WaCky corpus (Baroni et al., 1999). For English and Italian, this allowed us to use 100k raw tokens for training (Chinese uses 62k). For Chinese and Italian, for training efficiency, we used only raw sentences that were 50 words or fewer (note that we did not drop tag dictionary set or test set sentences).

The English development set was used to tune hyperparameters using grid search, and the same hyperparameters were then used for all three languages. For the category grammar, we used p_punc = 0.1, p_term = 0.7, p_mod = 0.2, p_fwd = 0.5. For the priors, we use α^ROOT = 1, α^BIN = 100, α^UN = 100, α^TERM = 10^4, α^λ = 3, α^LCTX = α^RCTX = 10^3.6 For the context prior, we used σ = 10^5. We ran our sampler for 50 burn-in and 50 sampling iterations.

CCG parsers are typically evaluated on the dependencies they produce instead of their CCG derivations directly, since there can be many different CCG parse trees that all represent the same dependency relationships (spurious ambiguity), and CCG-to-dependency conversion can collapse those differences. To convert a CCG tree into a dependency tree, we follow Lewis and Steedman

6 In order to ensure that these concentration parameters, while high, were not dominating the posterior distributions, we ran experiments in which they were set much higher (including using the prior alone), and found that accuracies plummeted in those cases, demonstrating that there is a good balance with the prior.


Size of the corpus (tokens) from which the tag dictionary is extracted
                                          250k   200k   150k   100k    50k    25k
English  no context                      60.43  61.22  59.69  58.61  56.26  54.70
         context (uniform)               64.02  63.89  62.58  61.80  59.44  57.08
         + P^left / P^right              65.44  63.26  64.28  62.90  59.63  57.86
         + P^left_CAT / P^right_CAT      59.34  59.89  59.32  58.47  57.85  55.77
Chinese  no context                      32.70  32.07  28.99
         context (uniform)               36.02  33.84  32.55
         + P^left / P^right              35.34  33.04  31.48
         + P^left_CAT / P^right_CAT      35.15  34.04  33.53
Italian  no context                      51.54
         context (uniform)               53.57
         + P^left / P^right              52.54
         + P^left_CAT / P^right_CAT      53.29

Table 1: Experimental results in three languages. For each language, four experiments were executed: (1) a no-context model baseline, Garrette et al. (2015) directly; (2) our supertag-context model, with uniform priors on contexts; (3) the supertag-context model with priors that prefer combinability; (4) the supertag-context model with priors that prefer combinability and simpler categories. Results are shown for six different levels of supervision, as determined by the size of the corpus used to extract a tag dictionary.

(2014). We traverse the parse tree, dictating at every branching node which words will be the dependents of which. For binary branching nodes of forward rules, the right side—the argument side—is the dependent, unless the left side is a modifier (X/X) of the right, in which case the left is the dependent. The opposite is true for backward rules. For punctuation rules, the punctuation is always the dependent. For merge rules, the right side is always made the parent. The results presented in this paper are dependency accuracy scores: the proportion of words that were assigned the correct parent (or "root" for the root of a tree).
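The head-assignment rules of this conversion can be written as a small decision function; the sketch below uses the tuple category encoding from the earlier snippets, treats punctuation and merge as explicit flags, and is our own simplification of the traversal described above.

```python
def choose_dependent(rule_dir, left_cat, right_cat, is_punct_right=False, is_merge=False):
    """Decide which child is the dependent at a binary CCG node, following the
    conversion rules described above.  Categories use the (result, op, arg)
    tuple encoding; returns "left" or "right"."""
    def is_modifier_of(a, b):
        # a is a modifier X/X (or X\X) of b: its result equals its argument,
        # and that argument matches the neighboring category b.
        return isinstance(a, tuple) and a[0] == a[2] and a[2] == b

    if is_merge:
        return "left"                # merge rule X -> X X: the right side is the parent
    if is_punct_right:
        return "right"               # punctuation is always the dependent
    if rule_dir == "forward":        # forward rules: argument (right) is the dependent
        return "left" if is_modifier_of(left_cat, right_cat) else "right"
    return "right" if is_modifier_of(right_cat, left_cat) else "left"  # backward rules

# n/n modifying n under a forward rule: the modifier (left child) is the dependent.
print(choose_dependent("forward", ("n", "/", "n"), "n"))  # "left"
```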

When evaluating on test set sentences, if the model is unable to find a parse given the constraints of the tag dictionary, then we would have to take a score of zero for that sentence: every dependency would be "wrong". Thus, it is important that we make a best effort to find a parse. To accomplish this, we implemented a parsing backoff strategy. The parser first tries to find a valid parse that has either s_dcl or np at its root. If that fails, then it searches for a parse with any root. If no parse is found yet, then the parser attempts to strategically allow tokens to subsume a neighbor by making it a dependent (first with a restricted root set, then without). This is similar to the "deletion" strategy employed by Zettlemoyer and Collins (2007), but we do it directly in the grammar. We add unary rules of the form ⟨D⟩ → u for every potential supertag u in the tree. Then, at each node spanning exactly two tokens (but no higher in the tree), we allow rules t → ⟨⟨D⟩, v⟩ and t → ⟨v, ⟨D⟩⟩. Recall that in §3.1, we stated that ⟨D⟩ is given extremely low probability, meaning that the parser will avoid its use unless it is absolutely necessary. Additionally, since u will still remain as the preterminal, it will be the category examined as the context by adjacent constituents.

For each language and level of supervision, we executed four experiments. The no-context baseline used (a reimplementation of) the best model from our previous work (Garrette et al., 2015): using only the non-context parameters (θ^ROOT, θ^BIN, θ^UN, θ^TERM, λ) along with the category prior PCAT to bias toward likely categories throughout the tree, and θ^TERM-0_t estimated from the tag dictionary and raw corpus. We then added the supertag-context parameters (θ^LCTX, θ^RCTX), but used uniform priors for those (still using PCAT for the rest). Then, we evaluated the supertag-context model using context parameter priors that bias toward categories that combine with their contexts: P^left and P^right (see §3.3). Finally, we evaluated the supertag-context model using context parameter priors that bias toward combinability and toward a priori more likely categories, based on the category grammar (P^left_CAT and P^right_CAT).

Because we are interested in understanding how our models perform under varying amounts of supervision, we executed sequences of experiments in which we reduced the size of the corpus from which the tag dictionary is drawn, thus reducing the amount of information provided to the model. As this information is reduced, so is the size of the full inventory of known CCG categories that can be used as supertags. Additionally, a smaller tag dictionary means that there will be vastly more unknown words; since our model must assume that these words may take any supertag from the full set of known labels, the model must contend with a greatly increased level of ambiguity.

The results of our experiments are given in Table 1. We find that the incorporation of supertag-context parameters into a CCG model improves performance in every scenario we tested; we see gains of 2–5% across the board. Adding context parameters never hurts, and in most cases, using priors based on intrinsic, cross-lingual aspects of the CCG formalism to bias those parameters toward connectivity provides further gains. In particular, biasing the model toward trees in which constituent labels are combinable with their adjacent supertags frequently helps the model.

However, for English, we found that additionally biasing context priors toward simpler categories using P^left_CAT/P^right_CAT degraded performance. This is likely due to the fact that the priors on production parameters (θ^BIN, θ^UN) are already biasing the model toward likely categories, and that having the context parameters do the same ends up over-emphasizing the need for simple categories, preventing the model from choosing more complex categories when they are needed. On the other hand, this bias helps in Chinese and Italian.

6 Related Work

Klein and Manning (2002)'s CCM is an unlabeled bracketing model that generates the span of part-of-speech tags that make up each constituent and the pair of tags surrounding each constituent span (as well as the spans and contexts of each non-constituent). They found that modeling constituent context aids in parser learning because it is able to capture the observation that the same contexts tend to appear repeatedly in a corpus, even with different constituents. While CCM is designed to learn which tag pairs make for likely contexts, without regard for the constituents themselves, our model attempts to learn the relationships between context categories and the types of the constituents, allowing us to take advantage of the natural a priori knowledge about which contexts fit with which constituent labels.

Other researchers have shown positive results for grammar induction by introducing relatively small amounts of linguistic knowledge. Naseem et al. (2010) induced dependency parsers by hand-constructing a small set of linguistically-universal dependency rules and using them as soft constraints during learning. These rules were useful for disambiguating between various structures in cases where the data alone suggests multiple valid analyses. Boonkwan and Steedman (2011) made use of language-specific linguistic knowledge collected from non-native linguists via a questionnaire that covered a variety of syntactic parameters. They were able to use this information to induce CCG parsers for multiple languages. Bisk and Hockenmaier (2012; 2013) induced CCG parsers by using a smaller number of linguistically-universal principles to propose syntactic categories for each word in a sentence, allowing EM to estimate the model parameters. This allowed them to induce the inventory of language-specific types from the training data, without prior language-specific knowledge.

7 Conclusion

Because of the structured nature of CCG categories and the logical framework in which they must assemble to form valid parse trees, the CCG formalism offers multiple opportunities to bias model learning based on universal, intrinsic properties of the grammar. In this paper we presented a novel parsing model with the capacity to capture the associative adjacent-category relationships intrinsic to CCG by parameterizing supertag contexts, the supertags appearing on either side of each constituent. In our Bayesian formulation, we place priors on those context parameters to bias the model toward trees in which constituent labels are combinable with their contexts, thus preferring trees that "fit" together better. Our experiments demonstrate that, across languages, this additional context helps in weak-supervision scenarios.

Acknowledgements

We would like to thank Yonatan Bisk for his valuable feedback. This work was supported by the U.S. Department of Defense through the U.S. Army Research Office grant W911NF-10-1-0533.


References

Jason Baldridge. 2008. Weakly supervised supertagging with grammar-informed initialization. In Proc. of COLING.

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 1999. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3).

Yonatan Bisk and Julia Hockenmaier. 2012. Simple robust grammar induction with combinatory categorial grammar. In Proc. of AAAI.

Yonatan Bisk and Julia Hockenmaier. 2013. An HDP model for inducing combinatory categorial grammars. Transactions of the Association for Computational Linguistics, 1.

Prachya Boonkwan and Mark Steedman. 2011. Grammar induction from text using small syntactic prototypes. In Proc. of IJCNLP.

Johan Bos, Cristina Bosco, and Alessandro Mazzei. 2009. Converting a dependency treebank to a categorial grammar treebank for Italian. In M. Passarotti, Adam Przepiorkowski, S. Raynaud, and Frank Van Eynde, editors, Proc. of the Eighth International Workshop on Treebanks and Linguistic Theories (TLT8).

Cristina Bosco, Vincenzo Lombardo, Daniela Vassallo, and Leonardo Lesmo. 2000. Building a treebank for Italian: a data-driven annotation schema. In Proc. of LREC.

Zhiyi Chi. 1999. Statistical properties of probabilistic context-free grammars. Computational Linguistics.

Stephen Clark and James R. Curran. 2007. Wide-coverage efficient statistical parsing with CCG and log-linear models. Computational Linguistics, 33.

Dan Garrette and Jason Baldridge. 2012. Type-supervised hidden Markov models for part-of-speech tagging with incomplete tag dictionaries. In Proc. of EMNLP.

Dan Garrette, Chris Dyer, Jason Baldridge, and Noah A. Smith. 2014. Weakly-supervised Bayesian learning of a CCG supertagger. In Proc. of CoNLL.

Dan Garrette, Chris Dyer, Jason Baldridge, and Noah A. Smith. 2015. Weakly-supervised grammar-informed Bayesian CCG parser learning. In Proc. of AAAI.

Sharon Goldwater. 2007. Nonparametric Bayesian Models of Lexical Acquisition. Ph.D. thesis, Brown University.

Joshua Goodman. 1998. Parsing inside-out. Ph.D. thesis, Harvard University.

Julia Hockenmaier and Mark Steedman. 2007. CCGbank: A corpus of CCG derivations and dependency structures extracted from the Penn Treebank. Computational Linguistics, 33(3).

Mark Johnson, Thomas Griffiths, and Sharon Goldwater. 2007. Bayesian inference for PCFGs via Markov chain Monte Carlo. In Proc. of NAACL.

Dan Klein and Christopher D. Manning. 2002. A generative constituent-context model for improved grammar induction. In Proc. of ACL.

Karim Lari and Steve J. Young. 1990. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, 4:35–56.

Mike Lewis and Mark Steedman. 2014. A* CCG parsing with a supertag-factored model. In Proc. of EMNLP.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2).

Tahira Naseem, Harr Chen, Regina Barzilay, and Mark Johnson. 2010. Using universal linguistic knowledge to guide grammar induction. In Proc. of EMNLP.

Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2011. English Gigaword Fifth Edition LDC2011T07. Linguistic Data Consortium.

Andrew Radford. 1988. Transformational Grammar. Cambridge University Press.

Mark Steedman and Jason Baldridge. 2011. Combinatory categorial grammar. In Robert Borsley and Kersti Borjars, editors, Non-Transformational Syntax: Formal and Explicit Models of Grammar. Wiley-Blackwell.

Mark Steedman. 2000. The Syntactic Process. MIT Press.

Daniel Tse and James R. Curran. 2010. Chinese CCGbank: Extracting CCG derivations from the Penn Chinese Treebank. In Proc. of COLING.

Aline Villavicencio. 2002. The acquisition of a unification-based generalised categorial grammar. Ph.D. thesis, University of Cambridge.

Nianwen Xue, Fei Xia, Fu-Dong Chiou, and Martha Palmer. 2005. The Penn Chinese TreeBank: Phrase structure annotation of a large corpus. Natural Language Engineering, 11(2):207–238.

Luke S. Zettlemoyer and Michael Collins. 2007. Online learning of relaxed CCG grammars for parsing to logical form. In Proc. of EMNLP.


Proceedings of the 19th Conference on Computational Language Learning, pages 32–41, Beijing, China, July 30-31, 2015. ©2015 Association for Computational Linguistics

A Synchronous Hyperedge Replacement Grammar based approach for AMR parsing

Xiaochang Peng, Linfeng Song and Daniel Gildea
Department of Computer Science, University of Rochester, Rochester, NY 14627

Abstract

This paper presents a synchronous-graph-grammar-based approach for string-to-AMR parsing. We apply Markov Chain Monte Carlo (MCMC) algorithms to learn Synchronous Hyperedge Replacement Grammar (SHRG) rules from a forest that represents likely derivations consistent with a fixed string-to-graph alignment. We make an analogy of string-to-AMR parsing to the task of phrase-based machine translation and come up with an efficient algorithm to learn graph grammars from string-graph pairs. We propose an effective approximation strategy to resolve the complexity issue of graph compositions. We also show some useful strategies to overcome existing problems in an SHRG-based parser and present preliminary results of a graph-grammar-based approach.

1 Introduction

Abstract Meaning Representation (AMR) (Banarescu et al., 2013) is a semantic formalism where the meaning of a sentence is encoded as a rooted, directed graph. Figure 1 shows an example of the edge-labeled representation of an AMR graph where the edges are labeled while the nodes are not. The label of the leaf edge going out of a node represents the concept of the node, and the label of a non-leaf edge shows the relation between the concepts of the two nodes it connects to. This formalism is based on propositional logic and neo-Davidsonian event representations (Parsons, 1990; Davidson, 1967). AMR does not encode quantifiers, tense and modality, but it jointly encodes a set of selected semantic phenomena which renders it useful in applications like question answering and semantics-based machine translation.

Figure 1: An example of an AMR graph representing the meaning of "The boy wants the girl to believe him", with concept nodes want-01, believe-01, boy, and girl connected by ARG0 and ARG1 relations.
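The graph in Figure 1 can be written down directly as a set of labeled edges; the sketch below uses our own node identifiers and a plain triple list (the concept and role labels follow the figure, everything else is an illustrative choice).

```python
from collections import defaultdict

# Edge-labeled AMR graph for "The boy wants the girl to believe him",
# written as (source, relation, target) triples over invented node ids.
amr_edges = [
    ("w", "instance", "want-01"),     # leaf edge: concept of the node
    ("b", "instance", "boy"),
    ("g", "instance", "girl"),
    ("v", "instance", "believe-01"),
    ("w", "ARG0", "b"),               # the boy does the wanting
    ("w", "ARG1", "v"),               # what is wanted: the believing event
    ("v", "ARG0", "g"),               # the girl does the believing
    ("v", "ARG1", "b"),               # the boy is believed (re-entrant node)
]

# Adjacency view rooted at the want-01 node:
graph = defaultdict(list)
for src, rel, tgt in amr_edges:
    graph[src].append((rel, tgt))
print(dict(graph))
```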

The task of AMR graph parsing is to map natural language strings to AMR semantic graphs. Flanigan et al. (2014) propose a two-stage parsing algorithm which first maps meaningful continuous spans on the string side to concept fragments on the graph side, and then in the second stage adds additional edges to make all these fragments connected. Concept identification (Flanigan et al., 2014; Pourdamghani et al., 2014) can be considered as an important first step to relate components of the string to components in the graph.

Wang et al. (2015) also present a two-stage procedure where they first use a dependency parser trained on a large corpus to generate a dependency tree for each sentence. In the second step, a transition-based algorithm is used to greedily modify the dependency tree into an AMR graph. The benefit of starting with a dependency tree instead of the original sentence is that the dependency structure is more linguistically similar to an AMR graph and provides more direct feature information within limited context.

Hyperedge replacement grammar (HRG) is a context-free rewriting formalism for generating graphs (Drewes et al., 1997). Its synchronous counterpart, SHRG, can be used for transforming a graph from/to another structured representation such as a string or tree structure. HRG has great potential for applications in natural language understanding and generation, and also semantics-based machine translation.

Given a graph as input, finding its derivation of HRG rules is NP-complete (Drewes et al., 1997). Chiang et al. (2013) describe in detail a graph recognition algorithm and present an optimization scheme which enables the parsing algorithm to run in polynomial time when the treewidth and degree of the graph are bounded. However, there is still no real system available for parsing large graphs.

An SHRG can be used for AMR graph parsing where each SHRG rule consists of a pair of a CFG rule and an HRG rule, which can generate strings and AMR graphs in parallel. Jones et al. (2012) present a Syntactic Semantic Algorithm that learns an SHRG by matching minimal parse constituents to aligned graph fragments and incrementally collapses them into hyperedge nonterminals. The basic idea is to use the string-to-graph alignment and syntax information to constrain the possible HRGs.

Learning SHRG rules from fixed string-to-graph alignments is a similar problem to extracting machine translation rules from fixed word alignments, where we wish to automatically learn the best granularity for the rules with which to analyze each sentence. Chung et al. (2014) present an MCMC sampling schedule to learn Hiero-style SCFG rules (Chiang, 2007) by sampling tree fragments from phrase decomposition forests, which represent all possible rules that are consistent with a set of fixed word alignments, making use of the property that each SCFG rule in the derivation is in essence the decomposition of a larger phrase pair into smaller ones.

In this paper, we make an analogy to treat fragments in the graph language as phrases in the natural language string, and SHRG rules as decompositions of larger substring, graph fragment pairs into smaller ones. Graph language is different from string language in that there is no explicit order to compose the graph and there is an exponential number of possible compositions. We propose a strategy that uses the left-to-right order of the string to constrain the structure of the derivation forest and experiment with different tactics in dealing with unaligned words on the string side and unaligned edges on the graph side.

Specifically, we make the following contributions:

1. We come up with an alternative SHRG-based AMR parsing strategy and present a reasonable preliminary result without additional dependency information and global features, showing promising future applications when a language model is applied or larger datasets are available.

2. We present the novel notion of the fragment decomposition forest and come up with an efficient algorithm to construct the forest from a fixed string-to-graph alignment.

3. We propose an MCMC algorithm which samples a special type of SHRG rules that helps maintain the properties of AMR graphs, and which should generalize to learning other synchronous grammars with a CFG left side.

4. We augment the concept identification procedure of Flanigan et al. (2014) with a phrase-to-graph-fragment alignment table which makes use of the dependency between concepts.

5. We discovered that an SHRG-based approach is especially sensitive to missing alignment information. We present some simple yet effective ways motivated by the AMR guidelines to deal with this issue.

[Figure 2: The series of HRG rules applied to derive the AMR graph of “The boy wants the girl to believe him”. The first rule is directly shown; the other HRG rules are either above or below each right arrow. The white circle shows the root of each hyperedge. The indexes in each rule show the one-to-one mapping between the attachment nodes of l.h.s. nonterminal edges and the external nodes of the r.h.s. subgraph.]

2 Hyperedge Replacement Grammar

Hyperedge replacement grammar (HRG) is a context-free rewriting formalism for graph generation (Drewes et al., 1997). HRG is like CFG in that it rewrites nonterminals independently. While CFG generates natural language strings by successively rewriting nonterminal tokens, the nonterminals in HRG are hyperedges, and each rewriting step in HRG replaces a hyperedge nonterminal with a subgraph instead of a span of a string.

2.1 Definitions

In this paper we only use edge-labeled graphs because using both node and edge labels complicates the definitions in our HRG-based approach. Figure 2 shows a series of HRG rules applied to derive the AMR graph shown in Figure 1.

We start with the definition of hypergraphs. An edge-labeled, directed hypergraph is a tuple $H = \langle V, E, l, X \rangle$, where $V$ is a finite set of nodes, and $E \subseteq V^{+}$ is a finite set of hyperedges, each of which connects to one or more nodes in $V$. $l : E \rightarrow L$ defines a mapping from each hyperedge to its label from a finite set $L$. Each hyperedge is an atomic item with an ordered list of nodes it connects to, which are called attachment nodes. The type of a hyperedge is defined as the number of its attachment nodes. $X \in V^{*}$ defines an ordered list of distinct nodes called external nodes. The ordered external nodes specify how to fuse a hypergraph with another graph, as we will see below. In this paper, we alternately use the terms hypergraph and graph, hyperedge and edge, and also phrase, substring and span for brevity.
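These definitions map directly onto a small data structure. Below is a minimal Python sketch (ours, not the paper's; all class and field names are assumptions) of a hypergraph $H = \langle V, E, l, X \rangle$ with ordered attachment nodes per hyperedge and ordered external nodes:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass(frozen=True)
class Hyperedge:
    nodes: Tuple[str, ...]   # ordered attachment nodes; len(nodes) is the edge type
    label: str               # l(e): a terminal or nonterminal label from L

@dataclass
class Hypergraph:
    nodes: List[str] = field(default_factory=list)        # V
    edges: List[Hyperedge] = field(default_factory=list)  # E
    external: List[str] = field(default_factory=list)     # X: ordered external nodes

    def edge_type(self, e: Hyperedge) -> int:
        # the type of a hyperedge is the number of its attachment nodes
        return len(e.nodes)
```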

An HRG is a rewriting formalism $G = \langle N, T, P, S \rangle$, where $N$ and $T$ define two disjoint finite sets called nonterminals and terminals. $S \in N$ is a special nonterminal called the start symbol. $P$ is a finite set of productions of the form $A \rightarrow R$, where $A \in N$ and $R$ is a hypergraph with edge labels over $N \cup T$ and with nonempty external nodes $X_R$. We have the constraint that the type of the hyperedge with label $A$ should coincide with the number of nodes in $X_R$. In our grammar, each nonterminal has the form Xn, where n indicates the type of the hyperedge. Our special start symbol is separately denoted as X0.

The rewriting mechanism replaces a nonterminal hyperedge with the graph fragment specified by a production's right-hand side (r.h.s.), attaching each external node of the r.h.s. to the corresponding attachment node of the left-hand side. Take Figure 2 as an example. Starting from our initial hypergraph with one edge labeled with the start symbol “X0”, we select one edge with a nonterminal label in our current hypergraph and rewrite it using a rule in our HRG. The first rule rewrites the start symbol with the subgraph shown on its r.h.s. We continue the rewriting steps until there are no more nonterminal-labeled edges.

[Figure 3: A series of symbol-refined SHRG rules used to derive the AMR graph for the sentence “The boy wants the girl to believe him”.]
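The rewriting step just described can be sketched in a few lines of Python, reusing the Hypergraph/Hyperedge classes from the sketch in Section 2.1. The function below is our illustration under stated assumptions, not the paper's implementation, and it assumes freshly generated internal node names do not collide with existing ones:

```python
import uuid

def rewrite(host: Hypergraph, nt_edge: Hyperedge, rhs: Hypergraph) -> Hypergraph:
    """Replace nonterminal hyperedge `nt_edge` in `host` with hypergraph `rhs`,
    fusing the i-th external node of `rhs` with the i-th attachment node of `nt_edge`."""
    assert len(rhs.external) == len(nt_edge.nodes), "edge type must equal |external nodes|"
    # external nodes fuse with the attachment nodes; internal rhs nodes get fresh names
    mapping = dict(zip(rhs.external, nt_edge.nodes))
    for v in rhs.nodes:
        if v not in mapping:
            mapping[v] = "n_" + uuid.uuid4().hex[:8]
            host.nodes.append(mapping[v])
    host.edges.remove(nt_edge)
    for e in rhs.edges:
        host.edges.append(Hyperedge(tuple(mapping[v] for v in e.nodes), e.label))
    return host
```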

The synchronous counterpart of HRG can be used for transforming graphs from/to another form of natural language representation. Productions have the form $(A \rightarrow \langle S, R \rangle, \sim)$, where $A \in N$, and $S$ and $R$ are called the source and the target, at least one of which should be a hypergraph over $N \cup T$. $\sim$ is a bijection linking nonterminal mentions in $S$ and $R$. In our case, the source side is a CFG and the target side is an HRG. Given such a synchronous grammar and a string as input, we can parse the string with the CFG side and then derive the counterpart graph by deduction from the derivation. The benefit of parsing with SHRG is that the complexity is bounded by a CFG-like parsing.

2.2 SHRG-based AMR graph parsing

We write down AMR graphs as rooted, directed, edge-labeled graphs. There is exactly one leaf edge going out of each node, the label of which represents the concept of the node. We define this leaf edge as the concept edge. In Figure 1, for example, the edge labeled with “boy”, “want-01”, “girl” or “believe-01” connects to only one node in the AMR graph and each label represents the concept of that node. AMR concepts are either English words (“boy”), PropBank framesets (“want-01”), or special keywords like special entity types, quantities, and logical conjunctions. The label of each non-leaf edge shows the relation between the AMR concepts of the two nodes it connects to.

The constraint of having exactly one concept edge for each node is not guaranteed in general SHRG. Our strategy for maintaining the AMR graph structure is to refine the edge nonterminal label with an extra binary flag, for each external node, representing whether it will have a concept edge in the final rewriting result. The basic intuition is to explicitly enforce the one concept edge constraint in each nonterminal so that no additional concept edge is introduced after applying each rule. The graph derived from this type of SHRG is guaranteed to have exactly one concept edge at each node.

Figure 3 shows one example of our symbol-refined SHRG. For each nonterminal Xi-b1···bi, i defines the type of the nonterminal, while each bi indicates whether the i-th external node will have a concept edge in the rewriting result.¹ The second rule, for example, rewrites nonterminal X3-100 with wants on the string side and a hypergraph with three external nodes where the root has a concept edge :want-01, as the first binary flag 1 indicates, while the other two external nodes do not, as their binary flags 0 indicate. This guarantees that when we integrate the r.h.s. into another graph, it will introduce the concept edge :want-01 to the first fusing position and no concept edge to the next two.

While this refinement might result in an exponential number of nonterminals with respect to the maximum type of hyperedges, we found in our experiments that most of the nonterminals do not appear in our grammar. We use a maximum edge type of 5, which also results in a relatively small nonterminal set.

3 Sampling SHRG from forests

The fragment decomposition forest provides a compact representation of all possible SHRG rules that are consistent with a fixed string-to-graph alignment. Each SHRG rule in the derivation is in essence the decomposition of larger phrase, graph fragment pairs on the left-hand side (l.h.s.) into smaller ones on the r.h.s. and is encoded in a tree fragment in the forest. Our goal is to learn an SHRG from this forest. We first build a forest representation of possible derivations and then use an MCMC algorithm to sample tree fragments from this forest representing each rule in the derivation.

3.1 Fragment Decomposition Forest

We first proceed to define the fragment decomposition forest. The fragment decomposition forest is a variation of the phrase decomposition forest defined by Chung et al. (2014) where the target side is a graph instead of a string.

¹ X0-1 is different, as X0 is the start symbol of type one and should always have a concept edge at the root.

[Alignment example: The→∅, boy→boy, wants→want-01, the→∅, girl→girl, to→∅, believe→believe-01, him→∅]

A phrase $p = [i, j]$ is a set of continuous word indices $\{i, i+1, \ldots, j-1\}$. A fragment $f$ is a hypergraph with external nodes $X_f$. A string-to-graph alignment $h : P \rightarrow F$ defines the mapping from spans in the sentence to fragments in the graph. Our smallest phrase-fragment pairs are the string-to-graph alignments extracted using heuristic rules from Flanigan et al. (2014). The figure above shows an example of the alignments for the sentence “The boy wants the girl to believe him”. The symbol ∅ represents that the word is not aligned to any concept in the AMR graph, and this word is called an unaligned word. After this alignment, there are also left-over edges that are not aligned to any substrings, which are called unaligned edges.

Given an aligned string, AMR graph pair, a phrase-fragment pair $n$ is a pair $([i, j], f)$ which defines a pair of a phrase $[i, j]$ and a fragment $f$ such that words in positions $[i, j]$ are only aligned to concepts in the fragment $f$ and vice versa (with unaligned words and edges omitted). A fragment forest $H = \langle V, E \rangle$ is a hypergraph made of a set of hypernodes $V$ and hyperedges $E$. Each node $n = ([i, j], f)$ is tight on the string side, similar to the definition by Koehn et al. (2003), i.e., $n$ contains no unaligned words at its boundaries. Note that here we do not have the constraint that $f$ should be connected or single-rooted, but we will deal with these constraints separately in the sampling procedure.

We define two phrases $[i_1, j_1]$, $[i_2, j_2]$ to be adjacent if word indices $\{j_1, j_1+1, \ldots, i_2-1\}$ are all unaligned. We also define two fragments $f_1 = \langle V_1, E_1 \rangle$, $f_2 = \langle V_2, E_2 \rangle$ to be disjoint if $E_1 \cap E_2 = \emptyset$, and $f_1$ and $f_2$ are adjacent if they are disjoint and $f = \langle V_1 \cup V_2, E_1 \cup E_2 \rangle$ is connected. We also define the compose operation of two nodes: it takes two nodes $n_1 = ([i_1, j_1], f_1)$ and $n_2 = ([i_2, j_2], f_2)$ ($j_1 \leq i_2$) as input and computes $f = \langle V_1 \cup V_2, E_1 \cup E_2 \rangle$; the output is a composed node $n = ([i_1, j_2], f)$. We say $n_1$ and $n_2$ are immediately adjacent if $f$ is connected and single-rooted.
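The disjointness test and the compose operation can be sketched as follows. This is our illustration only, again reusing the Hypergraph/Hyperedge classes sketched in Section 2.1; the connectivity and single-rootedness checks needed for immediate adjacency are left out:

```python
def disjoint(f1: Hypergraph, f2: Hypergraph) -> bool:
    # two fragments are disjoint if they share no hyperedges (E1 ∩ E2 = ∅)
    return not set(f1.edges) & set(f2.edges)

def compose(n1, n2):
    """Compose n1 = ([i1, j1], f1) and n2 = ([i2, j2], f2) with j1 <= i2:
    the result spans the phrase [i1, j2] and takes the union of the fragments."""
    (i1, j1), f1 = n1
    (i2, j2), f2 = n2
    assert j1 <= i2, "n1 must precede n2 on the string side"
    f = Hypergraph(nodes=sorted(set(f1.nodes) | set(f2.nodes)),
                   edges=list(f1.edges) + list(f2.edges))
    return ((i1, j2), f)
```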

We keep composing larger phrase-fragment pairs (each one kept in a node of the forest) from smaller ones until we reach the root of the forest, whose phrase side is the whole sentence and whose fragment side is the complete AMR graph. We define the fragment decomposition forest to be made of all possible phrase-fragment pairs that can be decomposed from the sentence, AMR graph pair. The fragment decomposition forest has the important property that any SHRG rule consistent with the string-to-graph alignment corresponds to a continuous tree fragment of a complete tree found in the forest.

[Figure 4: The fragment decomposition forest for the (sentence, AMR graph) pair for “The boy wants the girl to believe him”.]

While we can compose larger phrases from smaller ones from left to right, there is no explicit order for composing the graph fragments. Also, the number of possible graph fragments is highly exponential, as we need to make a binary decision to decide each boundary node of the fragment and also choose the edges going out of each boundary node of the fragment, unlike the polynomial number of phrases for a fixed string alignment.

Our bottom-up construction procedure starts from the smallest phrase-fragment pairs. We first index these smallest phrase-fragment pairs $([i_k, j_k], f_k)$, $k = 1, 2, \ldots, n$ based on ascending order of their start positions on the string side, i.e., $j_k \leq i_{k+1}$ for $k = 1, 2, \ldots, n-1$. Even with this left-to-right order constraint from the string side, the complexity of building the forest is still exponential due to the possible choices in attaching graph edges that are not aligned to the string. Our strategy is to deterministically attach each unaligned relation edge to one of the identified concept fragments it connects to. We attach ARGs and ops to its head node and all other types of unaligned relations to its tail node.²

Algorithm 1 A CYK-like algorithm for building a fragment decomposition forest
1: For each smallest phrase-fragment pair ([ik, jk], fk), k = 1, 2, ..., n, attach unaligned edges to fragment fk, denoting the result as f'k. Build a node for ([ik, jk], f'k) and add it to chart item c[k][k+1].
2: Extract all the remaining unaligned fragments, build a special unaligned node for each of them and add it to the unaligned node set unaligned_nodes.
3: Keep composing unaligned nodes with nodes in different chart items if they are immediately adjacent and add the result to the same chart item.
4: for span from 2 to n do
5:   for i from 1 to n − span + 1 do
6:     j = i + span
7:     for k from i + 1 to j − 1 do
8:       for n1 = ([start1, end1], f1) in c[i][k] do
9:         for n2 = ([start2, end2], f2) in c[k][j] do
10:          if f1 and f2 are disjoint then
11:            new_node = compose(n1, n2)
12:            add incoming edge (n1, n2) to new_node
13:            if n1 and n2 are not immediately adjacent then
14:              new_node.nosample_cut = True
15:            insert_node(new_node, c[i][j])

Algorithm 1 shows our CYK-like forest construction algorithm. We maintain the length-1 chart items according to the order of each smallest phrase-fragment pair instead of its position in the string.³ In line 1, we first attach unaligned edges to the smallest phrase-fragment pairs as stated before. After this procedure, we build a node for the k-th phrase-fragment pair (with unaligned edges added) and add it to chart item c[k][k+1]. Note that we still have remaining unaligned edges; in line 2 we attach all unaligned edges going out from the same node as a single fragment, build a special unaligned node with an empty phrase side, and add it to the unaligned_nodes set. In line 3, we try to compose each unaligned node with one of the nodes in the length-1 chart items c[k][k+1]. If they are immediately adjacent, we add the composed node to c[k][k+1]. The algorithm then composes smaller phrase-fragment pairs into larger ones (line 4). When we have composed two nodes n1, n2, we need to keep track of this incoming edge. We have the constraint in our grammar that the r.h.s. hypergraph of each rule should be connected and single rooted.⁴ Lines 13 to 14 enforce this constraint by marking the node with a nosample_cut flag, which we will use in the MCMC sampling stage. The insert_node function checks whether the node already exists in the chart item. If it already exists, then we only update the incoming edges for that node; otherwise we add it to the chart item.

² Our intuition is that the ARG types for verbs and ops structures usually go with the concept of the head node. We assume that other relations are additionally introduced to the head node, which resembles a simple binarization step for other relations.

³ We use this strategy mainly because the alignments available do not have overlapping alignments, while our algorithm could still be easily adapted to a version that maintains the chart items with string positions when overlapping alignments are available.

For some sentence-AMR pairs where there are too many nodes with unaligned edges going out, considering all possible compositions would result in a huge complexity overhead. One solution we have adopted is to disallow disconnected graph fragments and not add them to the chart items (line 15). In practice, this pruning procedure does not affect the final performance much in our current setting. Figure 4 shows the procedure of building the fragment decomposition forest for the sentence “The boy wants the girl to believe him”.
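The chart loop of Algorithm 1 (lines 4–15) can be sketched in Python as follows. This is our simplified illustration under stated assumptions, not the paper's code: forest nodes are represented as hashable pairs of a string span and a frozenset of hyperedges, the unaligned-node handling of lines 1–3 is assumed done before the call, and the connectivity/single-rootedness test behind immediate adjacency is stubbed out:

```python
from collections import defaultdict

def build_forest(leaves, immediately_adjacent=lambda f1, f2: True):
    """Sketch of Algorithm 1, lines 4-15. `leaves` are the n smallest phrase-fragment
    nodes ((i, j), frozenset_of_edges), already augmented with unaligned edges."""
    n = len(leaves)
    chart = defaultdict(dict)   # chart[(i, j)]: node -> list of incoming hyperedges (n1, n2)
    nosample_cut = set()        # nodes whose cut variable will be frozen (lines 13-14)
    for k, node in enumerate(leaves, start=1):
        chart[(k, k + 1)][node] = []
    for span in range(2, n + 1):
        for i in range(1, n - span + 2):
            j = i + span
            for k in range(i + 1, j):
                for n1 in chart[(i, k)]:
                    for n2 in chart[(k, j)]:
                        (s1, f1), (s2, f2) = n1, n2
                        if f1 & f2:                            # not disjoint (line 10)
                            continue
                        new_node = ((s1[0], s2[1]), f1 | f2)   # compose (line 11)
                        chart[(i, j)].setdefault(new_node, []).append((n1, n2))  # line 12
                        if not immediately_adjacent(f1, f2):
                            nosample_cut.add(new_node)         # lines 13-14
    return chart, nosample_cut
```

The pruning discussed above (disallowing disconnected fragments at line 15) would simply skip the chart insertion when the composed fragment is not connected.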

3.2 MCMC sampling

Sampling methods have been used to learn Tree Substitution Grammar (TSG) rules from derivation trees (Cohn et al., 2009; Post and Gildea, 2009). The basic intuition is to automatically learn the best granularity for the rules with which to analyze our data. Our problem, however, is different in that we need to sample rules from a compact forest representation. We need to sample one tree from the forest, and then sample one derivation from this tree structure, where each tree fragment represents one rule in the derivation. Sampling tree fragments from forests is described in detail in Chung et al. (2014) and Peng and Gildea (2014).

We formulate the rule sampling procedure with two types of variables: an edge variable $e_n$ representing which incoming hyperedge is chosen at a given node $n$ in the forest (allowing us to sample one tree from a forest), and a cut variable $z_n$ representing whether node $n$ in the forest is a boundary between two SHRG rules or is internal to an SHRG rule (allowing us to sample rules from a tree). Figure 5 shows one sampled derivation from the forest. We have sampled one tree from the forest using the edge variables. We also have a 0–1 variable at each node in this tree, where 0 represents that the current node is internal to an SHRG rule, while 1 represents that the current node is the boundary between two SHRG rules.

⁴ We should be able to get rid of both constraints as we are parsing on the string side.

[Figure 5: The sampled derivation for the (sentence, AMR graph) pair for “The boy wants the girl to believe him”.]

Let all the edge variables form the random vector $Y$ and all the cut variables form the random vector $Z$. Given an assignment $y$ to the edge variables and an assignment $z$ to the cut variables, our desired distribution is proportional to the product of the weights of the rules specified by the assignment:

$$P_t(Y = y, Z = z) \propto \prod_{r \in \tau(y,z)} w(r) \qquad (1)$$

where $\tau(y, z)$ is the set of rules identified by the assignment and $w(r)$ is the weight for each individual rule. We use a generative model based on a Dirichlet Process (DP) defined over composed rules. We draw a distribution $G$ over rules from a DP, and then rules from $G$:

$$G \mid \alpha, P_0 \sim \mathrm{Dir}(\alpha, P_0)$$
$$r \mid G \sim G$$

We define two rules to have the same rule type if they have the same string and hypergraph representation (including the order of external nodes) on the r.h.s. For the base distribution $P_0$, we use a uniform distribution where all rules of the same size have equal probability. By marginalizing out $G$ we get a simple posterior distribution over rules which can be derived using the Chinese Restaurant Process (CRP). We define a table of counts $N = \{N_C\}_{C \in I}$ which memorizes different categories of counts in the previous assignments, where $I$ is an index set for different categories of counts. Each $N_C$ is a vector of counts for category $C$. We have the following probability over rule $r$ given the previous count table $N$:

$$P(r_i = r \mid N) = \frac{N_R(r) + \alpha P_0(r)}{n + \alpha} \qquad (2)$$

Here, in the case of DP, $I = \{R\}$, where $R$ is the index for the category of rule counts.
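As a concrete reading of Eq. (2), the CRP posterior over a rule type can be computed from the rule-count table as in this small Python sketch (ours; the function and argument names are assumptions):

```python
def rule_posterior(rule, rule_counts, total_rules, alpha, p0):
    """P(r_i = rule | N) under the DP/CRP of Eq. (2):
    (N_R(rule) + alpha * P0(rule)) / (n + alpha), where `rule_counts` holds N_R,
    `total_rules` is n (the number of previously generated rules), and `p0`
    is the base distribution, passed in as a function of the rule."""
    return (rule_counts.get(rule, 0) + alpha * p0(rule)) / (total_rules + alpha)
```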

We use the top-down sampling algorithm of Chung et al. (2014), which samples cut and edge variables top-down and one at a time. For each node $n$, we denote the composed rule type that we get when we set the cut of node $n$ to 0 as $r_1$, and the two split rule types that we get when we set the cut to 1 as $r_2, r_3$. We sample the cut value $z_i$ of the current node according to the posterior probability:

$$P(z_i = z \mid N) =
\begin{cases}
\dfrac{P(r_1 \mid N)}{P(r_1 \mid N) + P(r_2 \mid N)\,P(r_3 \mid N')} & \text{if } z = 0 \\[1.5ex]
\dfrac{P(r_2 \mid N)\,P(r_3 \mid N')}{P(r_1 \mid N) + P(r_2 \mid N)\,P(r_3 \mid N')} & \text{otherwise}
\end{cases} \qquad (3)$$

where the posterior probability $P(r_i \mid N)$ is according to a DP, and $N, N'$ are tables of counts. In the case of DP, $N$ and $N'$ differ only in the rule counts of $r_2$, where $N'_R(r_2) = N_R(r_2) + 1$.

As for edge variables $e_i$, we refer to the set of composed rules turned on below $n$, including the composed rule fragments having $n$ as an internal or root node, as $\{r_1, \ldots, r_m\}$. We have the following posterior probability over the edge variable $e_i$:

$$P(e_i = e \mid N) \propto \prod_{i=1}^{m} P(r_i \mid N^{i-1}) \prod_{v \in \tau(e) \cap in(n)} \deg(v) \qquad (4)$$

where $\deg(v)$ is the number of incoming edges for node $v$, $in(n)$ is the set of nodes in all subtrees under $n$, and $\tau(e)$ is the tree specified when we set $e_i = e$. $N^0$ to $N^m$ are tables of counts, where $N^0 = N$ and $N^i_R(r_i) = N^{i-1}_R(r_i) + 1$ in the case of DP.
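For illustration, sampling a single cut variable according to Eq. (3) reduces to a Bernoulli draw over the two (unnormalized) branch scores. The sketch below is ours and takes the three rule posteriors (e.g. computed with the function above) as inputs:

```python
import random

def sample_cut(p_r1, p_r2, p_r3_prime):
    """Sample z_i per Eq. (3), given P(r1|N), P(r2|N) and P(r3|N').
    Returns 0 (node internal to a composed rule) or 1 (boundary between two rules)."""
    keep_score = p_r1                  # numerator for z = 0: one composed rule r1
    split_score = p_r2 * p_r3_prime    # numerator otherwise: two split rules r2 and r3
    if random.random() < keep_score / (keep_score + split_score):
        return 0
    return 1
```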

After we have sampled one SHRG derivation from the forest, we still need to keep track of the place where each nonterminal edge attaches. As we have maintained the graph fragment it represents in each node of the forest, we can retrieve the attachment nodes of each hyperedge in the r.h.s. by tracing at which graph nodes two fragments fuse with each other. We perform this rule extraction procedure top-down and maintain the order of the attachment nodes of each r.h.s. nonterminal edge. When we further rewrite a nonterminal edge, we need to make sure that it keeps the order of the attachment nodes in its parent rule.

As for the unaligned words, we just insert all the omitted unaligned words in the composition procedure. We also add additional rules including the surrounding 2 unaligned words of context to make sure there are terminals on the string side.

3.3 Phrase-to-Graph-Fragment Alignment Extraction

Aside from the rules sampled using the MCMC algorithm, we also extract a phrase-to-graph-fragment alignment table from the fragment decomposition forest. This step can be considered as a mapping of larger phrases made of multiple identified spans (plus unaligned words) to larger fragments made of multiple concept fragments (plus the way they connect using unaligned edges).

Our extraction happens along with the forest construction procedure. In line 1 of Algorithm 1 we extract one rule for each smallest phrase-fragment pair before and after the unaligned edges are attached. We also extract one rule for each newly constructed node after line 11 if the fragment side of the node is single-rooted.⁵ We do not extract rules after line 2 because doing so usually introduces additional noise from meaningful concepts which are unrecognized in the concept identification stage.

4 Decoding

4.1 Concept identification

During the decoding stage, we first need to identify meaningful spans in the sentence and map them to graph fragments on the graph side. Then we use SHRG rules to parse each sentence bottom-up and left to right, which is similar to constituent parsing. The recall of the concept identification stage from Flanigan et al. (2014) is 0.79, which means 21% of the meaningful concepts are already lost at the beginning of the next stage.

Our strategy is to use lemma and POS tag information after the concept identification stage to recall some meaningful concepts. We find that, except for some special function words, most nouns, verbs, and adjectives should be aligned. We use the lemma information to retrieve unaligned words whose morphological form does not appear in our training data. We also use POS tag information to deal with nouns and quantities. Motivated by the fact that AMR makes extensive use of PropBank framesets, we look up the argument structure of the verbs in PropBank. Although the complicated abstraction of AMR makes it hard to get the correct concept for each word, the more complete structure can reduce the propagation of errors along the derivation tree.

⁵ Here we will also look at the surrounding 2 unaligned words to fix partial alignment and noise introduced by meaningful unaligned words.

4.2 AMR graph parsing

We use the Earley algorithm with cube pruning (Chiang, 2007) for the string-to-AMR parsing. For each synchronous rule with N nonterminals on its l.h.s., we build an N+1 dimensional cube and generate the top K candidates. Out of all the hypotheses generated by all satisfied rules within each span (i, j), we keep at most K candidates for this span. Our glue rules create a pseudo R/ROOT concept and use ARG relations to connect disconnected components to make a connected graph.

We use the following local features (a sketch of combining them into a single hypothesis score follows the list):

1. StringToGraphProbability: the probability of a hypergraph given the input string.

2. RuleCount: the number of rules used to compose the AMR graph.

3. RuleEdgeCount: the number of edges in the r.h.s. hypergraph.

4. EdgeType: the type of the l.h.s. nonterminal. For rules with the same source side tokens, we prefer rules with smaller edge types.

5. AllNonTerminalPunish: one for rules which only have non-terminals on the source side.

6. GlueCount: one for glue rules.
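Since the feature weights are hand-tuned (see Section 6), a hypothesis score is a weighted combination of these local features. The sketch below only illustrates that combination; the weight values and the dictionary layout are invented, not the paper's:

```python
# Illustrative weights only; the paper tunes these by hand and does not report values.
WEIGHTS = {
    "StringToGraphProbability": 1.0,
    "RuleCount": -0.2,
    "RuleEdgeCount": -0.1,
    "EdgeType": -0.3,
    "AllNonTerminalPunish": -0.5,
    "GlueCount": -1.0,
}

def hypothesis_score(features):
    """`features` maps each local feature name above to its value for one hypothesis."""
    return sum(WEIGHTS[name] * value for name, value in features.items())
```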

As our forest structure is highly binarized, it is hard to capture the :opn structure when n is large, because we limit the number of external nodes to 5. The most common :op structure in the AMR annotation is the coordinate structure of items separated by “;” or separated by “,” along with and. We add the following two rules:

[X1-1] → [X1-1, 1]; [X1-1, 2]; ···; [X1-1, n] | (. :a/and :op1 [X1-1, 1] :op2 [X1-1, 2] ··· :opn [X1-1, n])

[X1-1] → [X1-1, 1], [X1-1, 2], ··· and [X1-1, n] | (. :a/and :op1 [X1-1, 1] :op2 [X1-1, 2] ··· :opn [X1-1, n])

where the HRG side is a :a/and coordinate structure of X1-1s connected with relation :ops.

5 Experiments

We use the same newswire section of LDC2013E117 as Flanigan et al. (2014), which consists of 3955 training sentences, 2132 dev sentences and 2132 test sentences. We also use the string-to-graph alignment from Flanigan et al. (2014) to construct the fragment decomposition forest and to extract the phrase-to-fragment table.

                        Precision  Recall  F-score
Concept id only         0.37       0.53    0.44
+ MCMC                  0.57       0.53    0.55
+ MCMC + phrase table   0.60       0.54    0.57
+ All                   0.59       0.58    0.58

Table 1: Comparisons of different strategies of extracting lexical rules on dev.

In the fragment decomposition forest construction procedure, we have experimented with different ways of dealing with the unaligned edges. First we tried to directly use the alignment and group all unaligned edges going out from the same node as an unaligned fragment. Using this constraint would take a few hours or longer for some sentences. The reason is that the many unaligned edges can connect to each branch of the aligned or unaligned fragments below them, and there is no explicit order of composition with each branch. Another constraint we tried is to attach all unaligned edges to the head node concept. The problem with this constraint is that it is very hard to generalize and it introduces a lot of additional redundant relation edges.

As for sampling, we initialize all cut variables in the forest as 1 (except for nodes that are marked as nosample_cut, which we initialize with 0 and keep fixed) and uniformly sample an incoming edge for each node. We evaluate the performance of our SHRG-based parser using Smatch v1.0 (Cai and Knight, 2013), which evaluates the precision, recall and F1 of the concepts and relations all together. Table 1 shows the dev results of our sampled grammar using different lexical rules that map substrings to graph fragments. Concept id only is the result of using the concepts identified by Flanigan et al. (2014). From the second line on, we replace the concept identification result with the lexical rules we have extracted from the training data (except for named entities and time expressions). +MCMC shows the result using additional alignments identified using our sampling approach. We can see that using the phrase to graph fragment alignment learned from our training data can significantly improve the Smatch score. We have also tried extracting all phrase-to-fragment alignments of length at most 6 on the string side from our constructed forest. We can see that using this alignment table further improves the Smatch score. This is because the larger phrase-fragment pairs can make better use of the dependency information between continuous concepts. The improvement over MCMC alone is not large, perhaps because MCMC can also learn meaning blocks that frequently appear together. As the dataset is relatively small, there are a lot of meaningful concepts that are not aligned. We use the lemma as a backoff strategy to find the alignment for the unaligned words. We have also used POS tag information to retrieve some unaligned nouns and a PropBank dictionary to retrieve the argument structure of the first sense of the verbs. +All shows the result after using lemma, POS tag and PropBank information; we can see that fixing the alignment can improve the recall, but the precision does not change much.

                Precision  Recall  F-score
JAMR            0.67       0.58    0.62
Wang et al.     0.64       0.62    0.63
Our approach    0.59       0.57    0.58

Table 2: Comparisons of Smatch score results

Table 2 shows our results on the test data. JAMR is the baseline result from Flanigan et al. (2014). Wang et al. (2015) is the current state of the art for string-to-AMR parsing. Without the dependency parse information and complex global features, our SHRG-based approach can already achieve competitive results in comparison with these two algorithms.

6 Discussion

In comparison to the spanning tree algorithm of Flanigan et al. (2014), an SHRG-based approach is more sensitive to the alignment. If many of the meaningful concepts are not aligned, the lost information breaks down the structure of our grammar. Using more data would definitely help ease this issue. Building overlapping alignments for the training data with more concept alignments would also be helpful.

Another thing to note is that Flanigan et al. (2014) use path information from dependency arc labels and part-of-speech tags. Using this global information can help the prediction of the relation edge labels. One interesting way to include this kind of path information is to add a graph language model into our CFG decoder, which should also help improve the performance.

All the weights of the local features mentioned in Section 4.2 are tuned by hand. We have tried tuning with MERT (Och, 2003), but the computation of the Smatch score for the k-best list has become a major overhead. This issue might come from the NP-completeness of the problem Smatch tries to evaluate, unlike the simple counting of N-grams in BLEU (Papineni et al., 2001). Parallelization might be a consideration for tuning the Smatch score with MERT.

7 Conclusion

We presented an MCMC sampling schedule for learning SHRG rules from a fragment decomposition forest constructed from a fixed string-to-AMR-graph alignment. While the complexity of building a fragment decomposition forest is highly exponential, we have come up with an effective constraint from the string side that enables an efficient construction algorithm. We have also evaluated our sampled SHRG on a string-to-AMR graph parsing task and achieved a reasonable result without using a dependency parse. Interesting future work might include adding a language model on the graph structure and also learning SHRG from overlapping alignments.

Acknowledgments  Funded by NSF IIS-1446996, NSF IIS-1218209, the 2014 Jelinek Memorial Workshop, and a Google Faculty Research Award. We are grateful to Jeff Flanigan for providing the results of his concept identification.

References

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse.

Shu Cai and Kevin Knight. 2013. Smatch: an evaluation metric for semantic feature structures. In ACL (2), pages 748–752.

David Chiang, Jacob Andreas, Daniel Bauer, Karl Moritz Hermann, Bevan Jones, and Kevin Knight. 2013. Parsing graphs with hyperedge replacement grammars. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 924–932.

David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228.

Tagyoung Chung, Licheng Fang, Daniel Gildea, and Daniel Stefankovic. 2014. Sampling tree fragments from forests. Computational Linguistics, 40:203–229.

Trevor Cohn, Sharon Goldwater, and Phil Blunsom. 2009. Inducing compact but accurate tree-substitution grammars. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 548–556, Boulder, Colorado, June. Association for Computational Linguistics.

Donald Davidson. 1967. The logical form of action sentences. In Nicholas Rescher, editor, The Logic of Decision and Action, pages 81–120. Univ. of Pittsburgh Press.

Frank Drewes, Hans-Jorg Kreowski, and Annegret Habel. 1997. Hyperedge replacement, graph grammars. In Handbook of Graph Grammars, volume 1, pages 95–162. World Scientific, Singapore.

Jeffrey Flanigan, Sam Thomson, Jaime Carbonell, Chris Dyer, and Noah A. Smith. 2014. A discriminative graph-based parser for the abstract meaning representation. In Proc. of ACL, pages 1426–1436.

Bevan Jones, Jacob Andreas, Daniel Bauer, Karl Moritz Hermann, and Kevin Knight. 2012. Semantics-based machine translation with hyperedge replacement grammars. In COLING, pages 1359–1376.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of NAACL-03, pages 48–54, Edmonton, Alberta.

Franz Josef Och. 2003. Minimum error rate training for statistical machine translation. In Proceedings of ACL-03, pages 160–167, Sapporo, Japan.

K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. 2001. BLEU: a method for automatic evaluation of MT. Technical Report Research Report, Computer Science RC22176 (W0109-022), IBM Research Division, T.J. Watson Research Center.

Terence Parsons. 1990. Events in the Semantics of English, volume 5. Cambridge, MA: MIT Press.

Xiaochang Peng and Daniel Gildea. 2014. Type-based MCMC for sampling tree fragments from forests. In Conference on Empirical Methods in Natural Language Processing (EMNLP-14).

Matt Post and Daniel Gildea. 2009. Bayesian learning of a tree substitution grammar. In Proc. Association for Computational Linguistics (short paper), pages 45–48, Singapore.

Nima Pourdamghani, Yang Gao, Ulf Hermjakob, and Kevin Knight. 2014. Aligning English strings with abstract meaning representation graphs. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 425–429.

Chuan Wang, Nianwen Xue, and Sameer Pradhan. 2015. A transition-based algorithm for AMR parsing. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 366–375, Denver, Colorado, May–June. Association for Computational Linguistics.


Proceedings of the 19th Conference on Computational Natural Language Learning, pages 42–51, Beijing, China, July 30-31, 2015. ©2015 Association for Computational Linguistics

AIDA2: A Hybrid Approach for Token and Sentence Level Dialect Identification in Arabic

Mohamed Al-Badrashiny, Heba Elfardy† and Mona Diab
Department of Computer Science, The George Washington University
{badrashiny,mtdiab}@gwu.edu

†Department of Computer Science, Columbia University
[email protected]

Abstract

In this paper, we present a hybrid approach for performing token and sentence level dialect identification in Arabic. Specifically, we try to identify whether each token in a given sentence belongs to Modern Standard Arabic (MSA), Egyptian Dialectal Arabic (EDA) or some other class, and whether the whole sentence is mostly EDA or MSA. The token level component relies on a Conditional Random Field (CRF) classifier that uses decisions from several underlying components such as language models, a named entity recognizer and a morphological analyzer to label each word in the sentence. The sentence level component uses a classifier ensemble system that relies on two independent underlying classifiers that model different aspects of the language. Using a feature-selection heuristic, we select the best set of features for each of these two classifiers. We then train another classifier that uses the class labels and the confidence scores generated by each of the two underlying classifiers to decide upon the final class for each sentence. The token level component yields a new state of the art F-score of 90.6% (compared to the previous state of the art of 86.8%) and the sentence level component yields an accuracy of 90.8% (compared to 86.6% obtained by the best state of the art system).

1 Introduction

In this age of social media ubiquity, we note the pervasive presence of informal language mixed in with formal language. The degree of mixing formal and informal language registers varies across languages, making it ever harder to process. The problem is quite pronounced in Arabic, where the difference between the formal Modern Standard Arabic (MSA) and the informal dialects of Arabic (DA) can add up to a difference in language morphologically, lexically, syntactically, semantically and pragmatically, exacerbating the challenges for almost all NLP tasks. MSA is used in formal settings, edited media, and education. The informal vernaculars, on the other hand, are spoken, are currently written in social media, and are penetrating formal media. There are multiple dialects corresponding to different parts of the Arab world: (1) Egyptian, (2) Levantine, (3) Gulf, (4) Moroccan, and (5) Iraqi. For each one of these, sub-dialectal variants exist. Speakers/writers code switch between the two forms of the language, especially in social media text, both inter and intra sententially. Automatically identifying code-switching between variants of the same language (Dialect Identification) is quite challenging due to the lexical overlap and significant semantic and pragmatic variation, yet it is crucial as a preprocessing step before building any Arabic NLP tool. MSA-trained tools perform very badly when applied directly to DA or to intrasentential code-switched DA and MSA text (ex. Alfryq fAz bAlEAfyp bs tSdr qA}mp Aldwry, where the words correspond to MSA MSA DA DA MSA MSA MSA, respectively).¹ Dialect Identification has been shown to be an important preprocessing step for statistical machine translation (SMT). Salloum et al. (2014) explored the impact of using Dialect Identification on the performance of MT and found that it improves the results. They trained four different SMT systems: (a) a DA-to-English SMT, (b) an MSA-to-English SMT, (c) a DA+MSA-to-English SMT, and (d) a DA-to-English hybrid MT system, and treated the task of choosing which SMT system to invoke as a classification task. They built a classifier that uses various features derived from the input sentence that indicate, among other things, how dialectal the input sentence is, and found that this approach improved the performance by 0.9% BLEU points.

¹ We use the Buckwalter transliteration scheme to represent Arabic in Romanized script throughout the paper. http://www.qamus.org/transliteration.htm

In this paper, we address the problem of token and sentence level dialect identification in Arabic, specifically between Egyptian Arabic and MSA. For the token level task, we treat the problem as a sequence labeling task by training a CRF classifier that relies on the decisions made by a language model, a morphological analyzer, a shallow named entity recognition system, a modality lexicon and other features pertaining to the sentence statistics to decide upon the class of each token in the given sentence. For the sentence level task we resort to a classifier ensemble approach that combines independent decisions made by two classifiers and uses their decisions to train a new one. The proposed approaches for both tasks beat the current state of the art performance by a significant margin, while creating a pipelined system.

2 Related Work

Dialect Identification in Arabic has recently gained interest among Arabic NLP researchers. Early work on the topic focused on speech data. Biadsy et al. (2009) presented a system that identifies dialectal words in speech through acoustic signals. More recent work targets textual data. The main task for textual data is to decide the class of each word in a given sentence, whether it is MSA, EDA or some other class such as Named-Entity or punctuation, and whether the whole sentence is mostly MSA or EDA. The first task is referred to as “Token Level Dialect Identification” while the second is “Sentence Level Dialect Identification”.

For sentence level dialect identification in Arabic, the most recent works are (Zaidan and Callison-Burch, 2011), (Elfardy and Diab, 2013), and (Cotterell and Callison-Burch, 2014a). Zaidan and Callison-Burch (2011) annotate MSA-DA news commentaries on Amazon Mechanical Turk and explore the use of a language-modeling based approach to perform sentence-level dialect identification. They target three Arabic dialects, Egyptian, Levantine and Gulf, and develop different models to distinguish each of them against the others and against MSA. They achieve an accuracy of 80.9%, 79.6%, and 75.1% for the Egyptian-MSA, Levantine-MSA, and Gulf-MSA classification, respectively. These results support the common assumption that Egyptian, relative to the other Arabic dialectal variants, is the most distinct dialect variant of Arabic from MSA. Elfardy and Diab (2013) propose a supervised system to perform Egyptian Arabic sentence identification. They evaluate their approach on the Egyptian part of the dataset presented by Zaidan and Callison-Burch (2011) and achieve an accuracy of 85.3%. Cotterell and Callison-Burch (2014b) extend the work of Zaidan and Callison-Burch (2011) by handling two more dialects (Iraqi and Moroccan) and targeting a new genre, specifically tweets. Their system outperforms Zaidan and Callison-Burch (2011) and Elfardy and Diab (2013), achieving a classification accuracy of 89%, 79%, and 88% on the same Egyptian, Levantine and Gulf datasets. For token level dialect identification, King et al. (2014) use a language-independent approach that utilizes character n-gram probabilities, lexical probabilities, word label transition probabilities and existing named entity recognition tools within a Markov model framework.

Jain and Bhat (2014) use a CRF based token level language identification system that uses a set of easily computable features (ex. isNum, isPunc, etc.). Their analysis showed that the most important features are the word n-gram posterior probabilities and word morphology.

Lin et al. (2014) use a CRF model that relies on character n-gram probabilities (tri and quad grams), prefixes, suffixes, the unicode page of the first character, capitalization case, alphanumeric case, and tweet-level language ID predictions from two off-the-shelf language identifiers: cld2² and ldig.³ They increase the size of the training data using a semi-supervised CRF autoencoder approach (Ammar et al., 2014) coupled with unsupervised word embeddings.

MSR-India (Chittaranjan et al., 2014) use character n-grams to train a maximum entropy classifier that identifies whether a word is MSA or EDA. The resultant labels are then used together with word length, existence of special characters in the word, and the current, previous and next words to train a CRF model that predicts the token level classes of words in a given sentence/tweet.

² https://code.google.com/p/cld2/
³ https://github.com/shuyo/ldig


[Figure 1: Token-level identification pipeline.]

In our previously published system AIDA (Elfardy et al., 2014) we use a weakly supervised rule based approach that relies on a language model to tag each word in the given sentence as MSA, EDA, or unk. We then use the LM decision for each word in the given sentence/tweet and combine it with other morphological information, in addition to a named entity gazetteer, to decide upon the final class of each word.

3 Approach

We introduce AIDA2. This is an improved version of our previously published tool AIDA (Elfardy et al., 2014). It tackles the problems of dialect identification in Arabic on both the token and sentence levels in mixed Modern Standard Arabic (MSA) and Egyptian dialect (EDA) text. We first classify each word in the input sentence into one of the following six tags, as defined in the shared task for “Language Identification in Code-Switched Data” in the first workshop on computational approaches to code-switching [ShTk] (Solorio et al., 2014):

• lang1: If the token is MSA (ex. AlwAqE, “The reality”)

• lang2: If the token is EDA (ex. m$, “Not”)

• ne: If the token is a named entity (ex. >mrykA, “America”)

• ambig: If the given context is not sufficient to identify the token as MSA or EDA (ex. slAm Elykm, “Peace be upon you”)

• mixed: If the token is of mixed morphology (ex. b>myT, meaning “I'm always removing”)

• other: If the token is or is attached to any non-Arabic token (ex. numbers, punctuation, Latin characters, emoticons, etc.)

The fully tagged tokens in the given sentence are then used, in addition to some other features, to classify the sentence as being mostly MSA or EDA.

3.1 Token Level Identification

Identifying the class of a token in a given sentence requires knowledge of its surrounding tokens, since these surrounding tokens can be the trigger for identifying a word as being MSA or EDA. This suggests that the best way to approach the problem is by treating it as a sequence labeling task. Hence we use a Conditional Random Field (CRF) classifier to classify each token in the input sentence. The CRF is trained using decisions from the following underlying components:

• MADAMIRA: a publicly available tool for morphological analysis and disambiguation of EDA and MSA text (Pasha et al., 2014).⁴ MADAMIRA uses SAMA (Maamouri et al., 2010) to analyze the MSA words and CALIMA (Habash et al., 2012) for the EDA words. We use MADAMIRA to tokenize both the language model and input sentences using the D3 tokenization scheme, the most detailed level of tokenization provided by the tool (ex. bAlfryq, “By the team” becomes “b+ Al+ fryq”) (Habash and Sadat, 2006). This is important in order to maximize the Language Model (LM) coverage. Furthermore, we also use MADAMIRA to tag each token in the input sentence as MSA or EDA based on the source of the morphological analysis: if MADAMIRA analyzes the word using SAMA, then the token is tagged MSA, while if the analysis comes from CALIMA, the token is tagged EDA. Out of vocabulary words are tagged unk.

• Language Model: a D3-tokenized 5-gram language model. It is built using the 119K manually annotated words of the training data of the shared task ShTk, in addition to 8M words from weblogs data (4M from MSA sources and 4M from EDA ones). The weblogs are automatically annotated based on their source, namely, if the source of the data is dialectal, all the words from this source are tagged as EDA; otherwise they are tagged MSA. Since we are using D3-tokenized data, all D3 tokens of a word are assigned the same tag as their corresponding word (ex. if the word “bAlfryq” is tagged MSA, then each of “b+”, “Al+”, and “fryq” is tagged MSA). During runtime, the Language Model classifier module creates a lattice of all possible tags for each word in the input sentence after it has been tokenized by MADAMIRA. The Viterbi search algorithm (Forney, 1973) is then used to find the best sequence of tags for the given sentence (a sketch of such a lattice Viterbi search follows this list). If the input sentence contains out of vocabulary words, they are tagged unk. This module also provides a binary flag called “isMixed”. It is “true” only if the LM decisions for the prefix, stem, and suffix are not the same.

• Modality List: ModLex (Al-Sabbagh et al., 2013) is a manually compiled lexicon of Arabic modality triggers (i.e. words and phrases that convey modality). It provides the lemma with a context and the class of this lemma (MSA, EDA, or both) in that context. In our approach, we match the lemma of the input word that is provided by MADAMIRA and its surrounding context with an entry in ModLex. Then we assign this word the corresponding class from the lexicon. If we find more than one match, we use the class of the longest matched context. If there is no match, the word takes the unk tag. Ex. the word “Sdq”, which means “told the truth”, gets the class “both” in the context “>flH An Sdq”, meaning “He will succeed if he told the truth”.

• NER: a shallow named entity recognition module. It provides a binary flag “isNE” for each word in the input sentence. This flag is set to “true” if the input word has been tagged as ne. It uses a list of all sequences of words that are tagged as ne in the training data of ShTk, in addition to the named entities from ANERGazet (Benajiba et al., 2007), to identify the named entities in the input sentence. This module also checks the POS provided by MADAMIRA for each input word. If a token is tagged with the noun_prop POS, then the token is classified as ne.

⁴ http://nlp.ldeo.columbia.edu/madamira/
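The lattice-plus-Viterbi step used by the Language Model component can be sketched as below. This is our generic illustration, not AIDA2's code: `lattice[i]` holds candidate tags with log emission scores for token i, and `transition` is an assumed callable that returns a log score for a tag following the previous tag (with `None` standing for the sentence start); in the actual system these scores come from the 5-gram LM:

```python
def viterbi_tags(lattice, transition):
    """Return the highest-scoring tag sequence through the lattice."""
    backpointers, prev = [], {None: 0.0}        # prev: best score ending in each tag
    for token_scores in lattice:
        cur, bp = {}, {}
        for tag, emit in token_scores.items():
            best_prev = max(prev, key=lambda p: prev[p] + transition(p, tag))
            cur[tag] = prev[best_prev] + transition(best_prev, tag) + emit
            bp[tag] = best_prev
        backpointers.append(bp)
        prev = cur
    # recover the best sequence by following back-pointers from the best final tag
    tags = [max(prev, key=prev.get)]
    for bp in reversed(backpointers[1:]):
        tags.append(bp[tags[-1]])
    return list(reversed(tags))
```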

Using these four components, we generate the following features for each word:

• MADAMIRA-features: the input word, prefix, stem, suffix, POS, MADAMIRA decision, and associated confidence score;

• LM-features: the “isMixed” flag, in addition to the prefix-class, stem-class, suffix-class and the confidence score for each of them as provided by the language model;

• Modality-features: the Modality List decision;

• NER-features: the “isNE” flag from the NER;

• Meta-features: “isOther”, a binary flag that is set to “true” only if the input word is a non-Arabic token, and “hasSpeechEff”, another binary flag set to “true” only if the input word has speech effects (i.e. word lengthening).

[Figure 2: Sentence-level identification pipeline.]

We then use these features to train a CRF classifier using the CRF++ toolkit (Sha and Pereira, 2003), and we set the window size to 16.⁵ Figure 1 illustrates the different components of the token-level system.
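For concreteness, the sketch below shows one way to serialize the per-token features into the column-based training format that CRF++ expects (one token per line, tab-separated feature columns with the gold label last, and a blank line between sentences). The column order and field names are our assumptions, not the exact configuration used by AIDA2:

```python
def crfpp_line(feats, label):
    """One CRF++ training line for a token; `feats` is a dict of the features above."""
    columns = [
        feats["word"], feats["prefix"], feats["stem"], feats["suffix"], feats["pos"],
        feats["madamira_decision"], str(feats["madamira_conf"]),
        feats["lm_prefix_class"], feats["lm_stem_class"], feats["lm_suffix_class"],
        str(feats["is_mixed"]), feats["modality_decision"], str(feats["is_ne"]),
        str(feats["is_other"]), str(feats["has_speech_eff"]),
        label,                                   # gold tag goes in the last column
    ]
    return "\t".join(columns)

def write_crfpp_file(path, sentences):
    """`sentences`: list of sentences, each a list of (feats, label) token pairs."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            for feats, label in sentence:
                f.write(crfpp_line(feats, label) + "\n")
            f.write("\n")                        # blank line separates sentences in CRF++
```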

3.2 Sentence Level Identification

For this level of identification, we rely on a classifier ensemble to generate the class label for each sentence. The underlying classifiers are trained on gold labeled data with sentence level binary decisions of either being MSA or EDA. Figure 2 shows the pipeline of the sentence level identification component. The pipeline consists of two main pathways with some pre-processing components. The first classifier (Comprehensive Classifier/Comp-Cl) is intended to cover dialectal statistics, token statistics, and writing style, while the second one (Abstract Classifier/Abs-Cl) covers semantic and syntactic relations between words. The decisions from the two classifiers are fused together using a decision tree classifier to predict the final class of the input sentence.⁶

⁵ The window size is set empirically; we experimented with window sizes of 2, 4, 6, 8, 12.

⁶ We experimented with different classifiers, Naive Bayes and Bayesian Network classifiers, but Decision Trees yielded the best results.

3.2.1 Comprehensive Classifier

The first classifier is intended to explicitly model detailed aspects of the language. We identify multiple features that are relevant to the task and we group them into different sets. Using the D3 tokenized version of the input data in addition to the classes provided by the “Token Level Identification” module for each word in the given sentence, we conduct a suite of experiments using the decision tree implementation in the WEKA toolkit (Hall et al., 2009) to exhaustively search over all features in each group in the first phase, and then exhaustively search over all of the remaining features from all groups to find the best combination of features that maximizes 10-fold cross-validation on the training data. We explore the same features used by Elfardy and Diab (2013) in addition to three other features that we refer to as “Modality Features”. The full list of features includes the following (a sketch of assembling some of them into a feature dictionary follows the list):

• Perplexity-Features [PF]: We run the tokenized input sentence through a tokenized MSA and a tokenized EDA 5-gram LM to get the sentence perplexity from each LM (msaPPL and edaPPL). These two LMs are built using the same data and the same procedure as the LMs used in the “Token Level Identification” module;

• Dia-Statistics-Features [DSF]:

  – The percentage of words tagged as EDA in the input sentence by the “Token Level Identification” module (diaPercent);

  – The percentage of words tagged as EDA and MSA by MADAMIRA in the input sentence (calimaWords and samaWords, respectively), and the percentage of words found in egyWords, a pre-compiled EDA lexicon used and provided by Elfardy and Diab (2013);

  – hasUnk, a binary feature set to “true” only if the language model of the “Token Level Identification” module yielded at least one unk tag in the input sentence;

  – Modality features: the percentage of words tagged as EDA, MSA, and both (modEDA, modMSA, and modBoth, respectively) using the Modality List component in the “Token Level Identification” module.

• Sentence-Statistics-Features [SSF]: The percentage of Latin words, numbers, and punctuation (latinPercent, numPercent, and puncPercent, respectively) in the input sentence, in addition to the average word length (avgWordLen) and the total number of words (sentLength) in the same sentence;

• Sentence-Decoration-Features [SDF]: Binary features indicating whether the sentence has diacritics (hasDiac), speech effects (hasSpeechEff), an exclamation mark (hasExMark), emoticons (hasEmot), a question mark (hasQuesMark), decoration effects (hasDecEff) (ex: ****), or repeated punctuation (hasRepPunc).
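A partial sketch of assembling some of these features into a dictionary for one sentence is given below. It is our illustration only: the perplexities are assumed to be computed elsewhere with the two LMs, the token-level tags come from Section 3.1, and only a subset of the feature groups is shown:

```python
import string

def comp_cl_features(tokens, token_tags, msa_ppl, eda_ppl):
    """Partial Comp-Cl feature dict for one sentence; feature names follow the paper."""
    n = max(len(tokens), 1)
    return {
        "msaPPL": msa_ppl,
        "edaPPL": eda_ppl,
        "diaPercent": sum(t == "lang2" for t in token_tags) / n,
        "hasUnk": any(t == "unk" for t in token_tags),
        "latinPercent": sum(tok.isascii() and tok.isalpha() for tok in tokens) / n,
        "numPercent": sum(tok.isdigit() for tok in tokens) / n,
        "puncPercent": sum(tok != "" and all(c in string.punctuation for c in tok)
                           for tok in tokens) / n,
        "avgWordLen": sum(len(tok) for tok in tokens) / n,
        "sentLength": len(tokens),
        "hasExMark": any("!" in tok for tok in tokens),
        "hasQuesMark": any("?" in tok for tok in tokens),
    }
```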

3.2.2 Abstract Classifier

The second classifier, Abs-Cl, is intended to cover the implicit semantic and syntactic relations between words. It runs the input sentence in its surface form, without tokenization, through a surface-form MSA and a surface-form EDA 5-gram LM to get the sentence probability from each of the respective LMs (msaProb and edaProb). These two LMs are built using the same data used for the "Token Level Identification" module LM, but without tokenization.

This classifier complements the information provided by Comp-Cl. While Comp-Cl yields detailed and specific information about the tokens as it uses tokenized-level LMs, Abs-Cl is able to better capture semantic and syntactic relations between words since it can see a longer context in terms of the number of words compared to that seen by Comp-Cl (on average, a span of two words in the surface-level LM corresponds to almost five words in the tokenized-level LM) (Rashwan et al., 2011).

3.2.3 DT Ensemble

In the final step, we use the classes and confidence scores of the preceding two classifiers on the training data to train a decision tree classifier. Accordingly, an input test sentence goes through Comp-Cl and Abs-Cl, where each classifier assigns the sentence a label and a confidence score for this label. The decision tree then uses the two labels and the two confidence scores to provide the final classification for the input sentence.
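A minimal sketch of this fusion step is shown below. It assumes scikit-learn as a stand-in for the WEKA decision tree used in the paper, and it assumes that the per-sentence labels and confidence scores of Comp-Cl and Abs-Cl have already been computed on the training sentences; the variable names and toy values are illustrative only.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical outputs of the two underlying classifiers on the training sentences:
# comp_label/abs_label in {0, 1} (0 = MSA, 1 = EDA), comp_conf/abs_conf in [0, 1].
comp_label = np.array([1, 0, 1, 0])
comp_conf  = np.array([0.92, 0.61, 0.55, 0.88])
abs_label  = np.array([1, 1, 0, 0])
abs_conf   = np.array([0.70, 0.52, 0.66, 0.90])
gold       = np.array([1, 0, 1, 0])            # sentence-level gold labels

# The meta-classifier sees only the two labels and the two confidence scores.
X_meta = np.column_stack([comp_label, comp_conf, abs_label, abs_conf])
ensemble = DecisionTreeClassifier(random_state=0).fit(X_meta, gold)

# At test time, a new sentence is first run through Comp-Cl and Abs-Cl,
# and their outputs are stacked the same way before calling predict().
new_sentence_meta = np.array([[1, 0.80, 0, 0.58]])
print(ensemble.predict(new_sentence_meta))      # final MSA/EDA decision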

4 Experimental Setup

4.1 Data

To our knowledge, there is no publicly available standard dataset that is annotated at both the token and sentence levels and could be used to evaluate both levels of classification. Accordingly, we use two separate standard datasets for the two tasks.

For the token level identification, we use the training and test data provided by the shared task ShTk. Additionally, we manually annotate more token-level data using the same guidelines used to annotate this dataset and use this additional data for training and tuning our system.

• tokTrnDB: the ShTk training set. It consists of 119,326 words collected from Twitter;


• tokTstDB: the ShTk test set. It consists of 87,373 words of tweets collected from users unseen in the training set and 12,017 words of sentences collected from Arabic commentaries;

• tokDevDB: 42,245 words collected from weblogs and manually annotated in house using the same guidelines as the shared task.7 We only use this set for system tuning to decide upon the best configuration;

• tokTrnDB2: 171,419 words collected from weblogs and manually annotated in house using the same guidelines as the shared task. We use it as an extra training set in addition to tokTrnDB to study the effect of increasing the training data size on system performance.8

Table 1 shows the distribution of each of these subsets of the token-level dataset.

            lang1    lang2    ambig   ne       other   mixed
tokTrnDB    79,059   16,291   1,066   14,110   8,688   15
tokTstDB    57,740   21,871   240     11,412   8,121   6
tokTrnDB2   77,856   69,407   46      14,902   9,190   18
tokDevDB    23,733   11,542   34      4,017    2,916   3

Table 1: Tag distribution in the datasets used in our token level identification component.

For sentence level dialect identification, we use the code-switched EDA-MSA portion of the crowdsource annotated dataset of Zaidan and Callison-Burch (2011). The dataset consists of user commentaries on Egyptian news articles. The data is split into training (sentTrnDB) and test (sentTstDB) sets using the same split reported by Elfardy and Diab (2013). Table 2 shows the statistics for that data.

            MSA Sent.   EDA Sent.   MSA Tok.   EDA Tok.
sentTrnDB   12,160      11,274      300,181    292,109
sentTstDB   1,352       1,253       32,048     32,648

Table 2: Number of EDA and MSA sentences and tokens in the training and test sets.

4.2 Baselines

4.2.1 Token Level Baselines

For the token level task, we evaluate our approach against the results reported by all systems participating in the ShTk evaluation test bed.

7 The task organizers kindly provided the guidelines for the task.

8 We expect to release both tokDevDB and tokTrnDB2, in addition to some other data still under development, to the community by 2016.

These baselines include:

• IUCL: The best results obtained by King et al. (2014);

• IIIT: The best results obtained by Jain and Bhat (2014);

• CMU: The best results obtained by Lin et al. (2014);

• MSR-India: The best results obtained by Chittaranjan et al. (2014);

• AIDA: The best results obtained by us using the older version of our system, AIDA (Elfardy et al., 2014).

4.2.2 Sentence Level Baselines

For the sentence level component, we evaluate our approach against all published results on the publicly available Arabic Online Commentary (AOC) dataset (Zaidan and Callison-Burch, 2011). The sentence level baselines include:

• Zaidan et al.: The best results obtained by Zaidan and Callison-Burch (2011);

• Elfardy et al.: The best results obtained by Elfardy and Diab (2013);

• Cotterell et al.: The best result obtained by Cotterell and Callison-Burch (2014a);

• All Features: This baseline combines all features from Comp-Cl and Abs-Cl to train a single decision tree classifier.

5 Evaluation

5.1 Token Level Evaluation

Table 3 compares our token level identification approach to all baselines. It shows that our proposed approach significantly outperforms all baselines using the same training and test sets. AIDA2 achieves a 90.6% weighted average F-score while the nearest baseline gets 86.8% (a 28.8% error reduction over the best published approach). By using both tokTrnDB and tokTrnDB2 for training, the weighted average F-score is further improved by 2.3%, as shown in the last row of the table.

5.2 Sentence Level Evaluation

For all experiments, we use a decision-tree classifier as implemented in the WEKA toolkit (Hall et al., 2009). Table 4 shows the 10-fold cross-validation results on sentTrnDB.

• "Comp-Cl" shows the results of the best selected set of features from each group (the ones that yield the best cross-validation results on sentTrnDB).


Baseline     lang1   lang2   ambig   ne     other   mixed   Avg-F
AIDA         89.4    76.0    0.0     87.9   99.0    0.0     86.8
CMU          89.9    81.1    0.0     72.5   98.1    0.0     86.4
IIIT         86.2    52.9    0.0     70.1   84.2    0.0     76.6
IUCL         81.1    59.5    0.0     5.8    1.2     0.0     61.0
MSR-India    86.0    56.4    0.7     49.6   74.8    0.0     74.2
AIDA2        92.9    82.9    0.0     89.5   99.3    0.0     90.6
AIDA2+       94.6    88.3    0.0     90.2   99.4    0.0     92.9

Table 3: F-scores on the held-out test set tokTstDB using our best setup against the baselines. AIDA2+ shows the results of training our system using both tokTrnDB and tokTrnDB2.

              Group                           Accuracy
Comp-Cl       Perplexity-Features             80.0%
              Dia-Statistics-Features         85.1%
              Sentence-Statistics-Features    61.6%
              Sentence-decoration-features    53.1%
              Best of all groups              87.3%
Abs-Cl                                        78.4%
DT Ensemble                                   89.9%

Table 4: Cross-validation accuracy on sentTrnDB using the best selected features in each group.

"Best-of-all-groups" shows the result of the best selected features from the retained feature groups, which in turn is the final set of features for the comprehensive classifier. In our case, the best selected features are msaPPL, edaPPL, diaPercent, hasUnk, calimaWords, modEDA, egyWords, latinPercent, puncPercent, avgWordLen, and hasDiac.

• "Abs-Cl" shows the results and the best set of features (msaProb and edaProb) for the abstract classifier.

• "DT Ensemble" reflects the results of combining the labels and confidence scores from Comp-Cl and Abs-Cl using a decision tree classifier.

Among the different configurations, the ensemble system yields the best 10-fold cross-validation accuracy of 89.9%. We compare the performance of this best setup to our baselines on both the cross-validation and held-out test sets. As Table 5 shows, the proposed approach significantly outperforms all baselines on all sets.

6 Results Discussion

6.1 Token Level Results Discussion

The last row in Table 3 shows that the system achieves a 24.5% error reduction by adding 171K words of gold data to the training set.

Baseline           sentTrnDB   sentTstDB   sentTrnDB + sentTstDB
Zaidan et al.      N/A         N/A         80.9
Elfardy et al.     85.3        83.3        85.5
Cotterell et al.   N/A         N/A         86.6
All Features       85.8        85.3        85.5
DT Ensemble        89.9        87.3        90.8

Table 5: Results of our best setup (DT Ensemble) against the baselines.

This shows that the system has not yet reached saturation, which means that adding more gold data can further increase performance.

Table 6 shows the confusion matrix of our best setup for all six labels over tokTstDB. The table shows that the highest confusability is between the lang1 and lang2 classes: 2.9% of tokens are classified as lang1 instead of lang2 and 1.6% are classified as lang2 instead of lang1. This accounts for 63.8% of the total errors. The table also shows that our system never produces the mixed class, probably because of the tiny number of mixed cases in the training data (only 33 words out of 270.7K words). The same applies to the ambig class, as it represents only 0.4% of the whole training data. lang1 and ne are also quite highly confusable. Most ne words have another, non-named-entity meaning, and in most cases these other meanings tend to be MSA. Therefore, we expect that a more sophisticated NER system would help in identifying these cases.

                           Predicted
Gold      lang1    lang2    ambig   ne       other   mixed
lang1     55.7%    1.6%     0.0%    0.9%     0.0%    0.0%
lang2     2.9%     18.9%    0.0%    0.2%     0.0%    0.0%
ambig     0.1%     0.1%     0.0%    0.0%     0.0%    0.0%
ne        0.8%     0.2%     0.0%    10.3%    0.1%    0.0%
other     0.0%     0.0%     0.0%    0.0%     8.2%    0.0%
mixed     0.0%     0.0%     0.0%    0.0%     0.0%    0.0%

Table 6: Token-level confusion matrix for the best performing setup on tokTstDB.
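Note that the cell values in Table 6 are percentages of all test tokens (the cells sum to 100%) rather than row-normalized rates. A short sketch of how such a matrix can be computed, assuming the gold and predicted tag sequences are available as flat lists, is shown below; the toy lists are placeholders.

from sklearn.metrics import confusion_matrix

labels = ["lang1", "lang2", "ambig", "ne", "other", "mixed"]

# Hypothetical flat lists of gold and predicted token-level tags.
gold = ["lang1", "lang2", "ne", "lang1", "other"]
pred = ["lang1", "lang1", "ne", "lang1", "other"]

# normalize="all" divides every cell by the total number of tokens,
# matching the percentage-of-all-tokens convention used in Table 6.
cm = confusion_matrix(gold, pred, labels=labels, normalize="all")
for row_label, row in zip(labels, cm):
    print(row_label, [f"{100 * v:.1f}%" for v in row])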

Table 7 shows examples of words that are misclassified by our system. The misclassified word in the first example (bED, meaning "each other") has the gold class other. However, the gold label is incorrect, and our system predicted it correctly as lang2 given the context. In the second example, the misclassified named entity refers to the name of a charitable organization, but the word also means "message", which is a lang1 word. The third example shows a lang1 word that is incorrectly classified by our system as lang2.


Similarly, in the last example, our system incorrectly classified a lang2 word as lang1.

Sentence (Buckwalter transliteration; the Arabic-script column is omitted here)    Word    Gold    Pred

1. tlt twytr mzwr . nSh AkwntAt m$ $gAlh w AlbAqy mblkyn bED
   "One third of twitter is forged. Half of the accounts are not working while the rest block each other."
   Word: bED ("each other")       Gold: other    Pred: lang2

2. kmA Anny mTlE Ely mA tqwmwn bh fy mxtlf AljmEyAt wAlAn$Tp AlAhlyp . mvl rsAlp .
   "Also I know what you are doing in different domains and civil activities like Resala."
   Word: rsAlp ("Resala")         Gold: ne       Pred: lang1

3. >nA bxyr . SHty wAlHmd llh fy >fDl HAl .
   "I am fine. Thank God, my health is in best condition."
   Word: SHty ("my health")       Gold: lang1    Pred: lang2

4. lm Aqr> AlbyAn w qrrt AEtrD Elyh glAsp
   "I did not read the statement and decided to object to it just to be annoying."
   Word: AEtrD ("object")         Gold: lang2    Pred: lang1

Table 7: Examples of the words that were misclassified by our system.

6.2 Sentence Level Results Discussion

The best selected features show that Comp-Cl benefits most from using only 11 features. By studying the excluded features, we found the following:

• Five features (hasSpeechEff, hasEmot, hasDecEff, hasExMark, and hasQuesMark) are zero for most records and hence extremely sparse, which explains why they are not selected as relevant distinctive features. It should be noted, however, that the hasSpeechEff and hasEmot features are markers of informal language, especially in social media (users also write in MSA using these devices, but much less frequently). Accordingly, we anticipate that if the data contained more of these features, they would have a significant impact on modeling the phenomena;

• Five features are not strong indicators of dialectalness. For the sentLength feature, the average length of the MSA and EDA sentences in the training data is almost the same, while the numPercent, modMSA, modBoth, and hasRepPunc features are almost uniformly distributed across the two classes;

• The initial assumption was that SAMA is exclusively MSA while CALIMA is exclusively EDA, and therefore that the samaWords feature would be a strong indicator for MSA sentences and the calimaWords feature a strong indicator for EDA sentences. Yet, on closer inspection, we found that in 96.5% of the EDA sentences calimaWords is higher than samaWords, but in only 23.6% of the MSA sentences is samaWords higher than calimaWords. This means that the samaWords feature is not able to distinguish the MSA sentences efficiently. Accordingly, the samaWords feature was not selected as a distinctive feature in the final feature selection process.

Although modEDA is selected as one of the representative features, it only occurs in a small percentage of the training data (10% of the EDA sentences and 1% of the MSA sentences). Accordingly, we repeated the best setup (DT Ensemble) without the modality features, as an ablation study, to measure the impact of the modality features on performance. In the 10-fold cross-validation on sentTrnDB using Comp-Cl alone, we note that performance decreased slightly (from 87.3% to 87.0%). However, given the sparsity of the feature (it occurs in less than 1% of the tokens in the EDA sentences), a 0.3% drop in performance is significant. This suggests that if the modality lexicon had more coverage, we would observe a more significant impact.

Table 8 shows some examples of our system's predictions. The first example is correctly classified with high confidence (92%). Example 2 is quite challenging: the second word is a typo where two words are concatenated due to a missing white space, while the first and third words can be used in both MSA and EDA contexts. Therefore, the system gives a wrong prediction with a low confidence score (59%). In principle, this sentence could be either EDA or MSA. The last example should be tagged as EDA; however, our system tagged it as MSA with a very high confidence score (94%).


Input sentence (Buckwalter transliteration; the Arabic-script column is omitted here)    Gold    Pred    Conf

1. wlA AElAnAt fY Altlyfzywn nAfEp wlA jrArAt jdydp nAfEp.. w bEdyn.
   "Neither TV commercials nor new tractors work. So now what."
   Gold: EDA    Pred: EDA    Conf: 92%

2. Allhm AgfrlhA wArHmhA
   "May God forgive her and have mercy on her."
   Gold: MSA    Pred: EDA    Conf: 59%

3. tsmHly >qwlk yAbAbA?
   "Do you allow me to call you father?"
   Gold: EDA    Pred: MSA    Conf: 94%

Table 8: Examples of the sentences classified by our system.

7 Conclusion

We presented AIDA2, a hybrid system for token and sentence level dialect identification in code-switched Modern Standard Arabic and Egyptian Dialectal Arabic text. The proposed system uses a classifier ensemble approach to perform dialect identification at both levels. In the token level module, we run the input sentence through four different classifiers, each of which classifies every word in the sentence. A CRF model is then used to predict the final class of each word using the information provided by the four underlying classifiers. The output of the token level module is then used to train one of the two underlying classifiers of the sentence level module. A decision tree classifier is then used to predict the final label of any new input sentence using the predictions and confidence scores of the two underlying classifiers. The sentence level module also uses a heuristic feature selection approach to select the best features for each of its two underlying classifiers by maximizing the accuracy on a cross-validation set. Our approach significantly outperforms all published systems on the same training and test sets. We achieve a 90.6% weighted average F-score on token level identification, compared to 86.8% for the state of the art using the same datasets. Adding more training data further improves performance to 92.9%. On the sentence level, AIDA2 yields an accuracy of 90.8% using cross-validation, compared to the latest state-of-the-art performance of 86.6% on the same data.

References

Rania Al-Sabbagh, Jana Diesner, and Roxana Girju. 2013. Using the semantic-syntactic interface for reliable Arabic modality annotation. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pages 410–418, Nagoya, Japan, October. Asian Federation of Natural Language Processing.

Waleed Ammar, Chris Dyer, and Noah A. Smith. 2014. Conditional random field autoencoders for unsupervised structured prediction. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3311–3319. Curran Associates, Inc.

Yassine Benajiba, Paolo Rosso, and Jos Miguel Benedruiz. 2007. ANERsys: An Arabic named entity recognition system based on maximum entropy. In Proceedings of CICLing-2007.

Fadi Biadsy, Julia Hirschberg, and Nizar Habash. 2009. Spoken Arabic dialect identification using phonotactic modeling. In Proceedings of the Workshop on Computational Approaches to Semitic Languages at the meeting of the European Association for Computational Linguistics (EACL), Athens, Greece.

Gokul Chittaranjan, Yogarshi Vyas, Kalika Bali, and Monojit Choudhury. 2014. Word-level language identification using CRF: Code-switching shared task report of MSR India system. In Proceedings of the First Workshop on Computational Approaches to Code Switching, pages 73–79. Association for Computational Linguistics.

Ryan Cotterell and Chris Callison-Burch. 2014a. A multi-dialect, multi-genre corpus of informal written Arabic. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland, May. European Language Resources Association (ELRA).

Ryan Cotterell and Chris Callison-Burch. 2014b. A multi-dialect, multi-genre corpus of informal written Arabic. In Proceedings of the Language Resources and Evaluation Conference (LREC).

Heba Elfardy and Mona Diab. 2013. Sentence-level dialect identification in Arabic. In Proceedings of ACL 2013, Sofia, Bulgaria, August.

Heba Elfardy, Mohamed Al-Badrashiny, and Mona Diab. 2014. AIDA: Identifying code switching in informal Arabic text. In Proceedings of the First Workshop on Computational Approaches to Code Switching, pages 94–101. Association for Computational Linguistics.


G. D. Forney, Jr. 1973. The Viterbi algorithm. Proceedings of the IEEE, 61(3):268–278, March.

Nizar Habash and Fatiha Sadat. 2006. Arabic preprocessing schemes for statistical machine translation. In Proceedings of the 7th Meeting of the North American Chapter of the Association for Computational Linguistics/Human Language Technologies Conference (HLT-NAACL06).

Nizar Habash, Ramy Eskander, and AbdelAti Hawwari. 2012. A morphological analyzer for Egyptian Arabic. In NAACL-HLT 2012 Workshop on Computational Morphology and Phonology (SIGMORPHON 2012), pages 1–9.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1).

Naman Jain and Ahmad Riyaz Bhat. 2014. Language identification in code-switching scenario. In Proceedings of the First Workshop on Computational Approaches to Code Switching, pages 87–93. Association for Computational Linguistics.

Levi King, Eric Baucom, Timur Gilmanov, Sandra Kubler, Daniel Whyatt, Wolfgang Maier, and Paul Rodrigues. 2014. The IUCL+ system: Word-level language identification via extended Markov models. In Proceedings of the First Workshop on Computational Approaches to Code-Switching, EMNLP 2014, Doha, Qatar.

Chu-Cheng Lin, Waleed Ammar, Lori Levin, and Chris Dyer. 2014. The CMU submission for the shared task on language identification in code-switched data. In Proceedings of the First Workshop on Computational Approaches to Code Switching, pages 80–86. Association for Computational Linguistics.

Mohamed Maamouri, Dave Graff, Basma Bouziri, Sondos Krouna, Ann Bies, and Seth Kulick. 2010. LDC Standard Arabic Morphological Analyzer (SAMA) version 3.1.

Arfath Pasha, Mohamed Al-Badrashiny, Mona Diab, Ahmed El Kholy, Ramy Eskander, Nizar Habash, Manoj Pooleery, Owen Rambow, and Ryan M. Roth. 2014. MADAMIRA: A fast, comprehensive tool for morphological analysis and disambiguation of Arabic. In Proceedings of LREC, Reykjavik, Iceland.

M. A. A. Rashwan, M. A. S. A. A. Al-Badrashiny, M. Attia, S. M. Abdou, and A. Rafea. 2011. A stochastic Arabic diacritizer based on a hybrid of factorized and unfactorized textual features. IEEE Transactions on Audio, Speech, and Language Processing, 19(1):166–175, January.

Wael Salloum, Heba Elfardy, Linda Alamir-Salloum, Nizar Habash, and Mona Diab. 2014. Sentence level dialect identification for machine translation system selection. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Fei Sha and Fernando Pereira. 2003. Shallow parsing with conditional random fields. pages 213–220.

Thamar Solorio, Elizabeth Blair, Suraj Maharjan, Steve Bethard, Mona Diab, Mahmoud Gonheim, Abdelati Hawwari, Fahad AlGhamdi, Julia Hirschberg, Alison Chang, and Pascale Fung. 2014. Overview for the first shared task on language identification in code-switched data. In Proceedings of the First Workshop on Computational Approaches to Code-Switching, EMNLP 2014, Doha, Qatar, October.

Omar F. Zaidan and Chris Callison-Burch. 2011. The Arabic online commentary dataset: an annotated dataset of informal Arabic with high dialectal content. In ACL 2011.



An Iterative Similarity based Adaptation Technique for Cross Domain Text Classification

Himanshu S. Bhatt, Deepali Semwal, Shourya Roy
Xerox Research Center India, Bengaluru, INDIA
{Himanshu.Bhatt,Deepali.Semwal,Shourya.Roy}@xerox.com

Abstract

Supervised machine learning classification algorithms assume that both train and test data are sampled from the same domain or distribution. However, the performance of these algorithms degrades for test data from a different domain. Such cross-domain classification is arduous as features in the test domain may be different, and the absence of labeled data could further exacerbate the problem. This paper proposes an algorithm to adapt a classification model by iteratively learning domain specific features from the unlabeled test data. Moreover, this adaptation transpires in a similarity-aware manner by integrating the similarity between domains into the adaptation setting. Cross-domain classification experiments on different datasets, including a real world dataset, demonstrate the efficacy of the proposed algorithm over the state-of-the-art.

1 Introduction

A fundamental assumption in supervised statistical learning is that training and test data are independently and identically distributed (i.i.d.) samples drawn from a distribution. Otherwise, good performance on test data cannot be guaranteed even if the training error is low. In real life applications such as business process automation, this assumption is often violated. While researchers develop new techniques and models for machine learning based automation of one or a handful of business processes, large scale adoption is hindered owing to poor generalized performance. In our interactions with analytics software development teams, we noticed such pervasive diversity of learning tasks and associated inefficiency. Novel predictive analytics techniques on standard

datasets (or limited client data) did not generalize across different domains (new products and services) and had limited applicability. Training models from scratch for every new domain requires human-annotated labeled data, which is expensive and time consuming, and hence not pragmatic.

On the other hand, transfer learning techniques

allow the domains, tasks, and distributions used in training and testing to be different, but related. It works in contrast to traditional supervised techniques on the principle of transferring learned knowledge across domains. While transfer learning has generally proved useful in reducing the labelled data requirement, brute force techniques suffer from the problem of negative transfer (Pan and Yang, 2010a). One cannot use transfer learning as the proverbial hammer, but needs to gauge when to transfer and also how much to transfer.

To address these issues, this paper proposes

a domain adaptation technique for cross-domain text classification. In our setting for cross-domain classification, a classifier trained on one domain with sufficient labelled training data is applied to a different test domain with no labelled data. As shown in Figure 1, this paper proposes an iterative similarity based adaptation algorithm which starts with a shared feature representation of the source and target domains. To adapt, it iteratively learns domain specific features from the unlabeled target domain data. In this process, the similarity between the two domains is incorporated into the adaptation setting for similarity-aware transfer. The major contributions of this research are:

• An iterative algorithm for learning domain specific discriminative features from unlabeled data in the target domain, starting with an initial shared feature representation.

• Facilitating similarity-aware domain adaptation by seamlessly integrating the similarity between the two domains into the adaptation settings.


Figure 1: Outlines the different stages of the proposed algorithm, i.e., shared feature representation, domain similarity, and the iterative learning process.

To the best of our knowledge, this is the first-of-its-kind approach in cross-domain text classification which integrates similarity between domains in the adaptation setting to learn domain specific features in an iterative manner. The rest of the paper is organized as follows: Section 2 summarizes the related work, Section 3 presents details about the proposed algorithm, and Section 4 presents the databases, experimental protocol, and results. Finally, Section 5 concludes the paper.

2 Related Work

Transfer learning in text analysis (domain adaptation) has shown promising results in recent years (Pan and Yang, 2010a). Prior work on domain adaptation for text classification can be broadly classified into instance re-weighing and feature-representation based adaptation approaches.

Instance re-weighing approaches address the

difference between the joint distributions of observed instances and class labels in the source domain and those of the target domain. Towards this direction, Liao et al. (2005) learned the mismatch between two domains and used active learning to select instances from the source domain to enhance the adaptability of the classifier. Jiang and Zhai (2007) proposed an instance weighing scheme for domain adaptation in NLP tasks which exploits independence between feature mapping and instance weighing approaches. Saha et al. (2011) leveraged knowledge from the source domain to actively select the most informative samples from the target domain. Xia et al. (2013) proposed a hybrid method for the sentiment classification task that also addresses the challenge of mutually opposite orientation words.

A number of domain adaptation techniques are

based on learning a common feature representation (Pan and Yang, 2010b; Blitzer et al., 2006; Ji et al., 2011; Daume III, 2009) for text classification. The basic idea is to identify a suitable feature space where the projected source and target domain data follow similar distributions, so that a standard supervised learning algorithm can be trained on the former to predict instances from the latter. Among them, Structural Correspondence Learning (SCL) (Blitzer et al., 2007) is the most representative one, explained later. Daume (2009) proposed a heuristic based non-linear mapping of source and target data to a high dimensional space. Pan et al. (2008) proposed a dimensionality reduction method, Maximum Mean Discrepancy Embedding, to identify a latent space. Subsequently, Pan et al. (2010) proposed to map domain specific words into unified clusters using a spectral clustering algorithm. In another follow-up work, Pan et al. (2011) proposed a novel feature representation to perform domain adaptation via a Reproducing Kernel Hilbert Space using Maximum Mean Discrepancy. A similar approach, based on co-clustering (Dhillon et al., 2003), was proposed in Dai et al. (2007) to leverage common words as a bridge between two domains. Bollegala et al. (2011) used a sentiment sensitive thesaurus to expand features for cross-domain sentiment classification. In a comprehensive evaluation study, it was observed that their approach tends to increase the adaptation performance when multiple source domains are used (Bollegala et al., 2013).

Domain adaptation based on iterative learning has been explored by Chen et al. (2011) and Garcia-Fernandez et al. (2014), and these approaches are similar in philosophy to the proposed approach in appending pseudo-labeled test data to the training set. The first approach uses an expensive feature split to co-train two classifiers, while the latter presents a single-classifier self-training based setting. However, the proposed algorithm offers novel contributions in terms of 1) leveraging two independent feature representations capturing the shared and target specific representations, and 2) an ensemble of classifiers that uses labelled source domain and pseudo-labelled target domain instances, carefully moderated based on the similarity between the two domains. Ensemble based domain adaptation for text classification was first proposed by Aue and Gamon (2005), though their approach could not achieve significant improvements over the baseline. Later, Zhao and Hoi (2010) proposed the online transfer learning (OTL) framework,


which forms the basis of our ensemble based domain adaptation. However, the proposed algorithm differs in the following ways: 1) it is an unsupervised approach that transforms unlabeled data into pseudo-labeled data, unlike OTL which is supervised, and 2) it incorporates similarity in the adaptation setting for gradual transfer.

3 Iterative Similarity based Adaptation

The philosophy of our algorithm is gradual transfer of knowledge from the source to the target domain while being cognizant of the similarity between the two domains. To accomplish this, we have developed a technique based on an ensemble of two classifiers. Transfer occurs within the ensemble, where a classifier learned on the shared representation transforms unlabeled test data into pseudo-labeled data to learn a domain specific classifier. Before explaining the algorithm, we highlight its salient features:

Common Feature Space Representation: Our objective is to find a good feature representation which minimizes the divergence between the source and target domains as well as the classification error. There have been several works on the feature-representation-transfer approach, such as (Blitzer et al., 2007; Ji et al., 2011), which derive a transformation matrix Q that gives a shared representation between the source and target domains. One of the widely used approaches is Structural Correspondence Learning (SCL) (Blitzer et al., 2006), which aims to learn the co-occurrence between features expressing similar meaning in different domains. The top k eigenvectors of a matrix W represent the principal predictors for the weight space Q. Features from both domains are projected on this principal predictor space Q to obtain a shared representation. The source domain classifier in our approach is based on this SCL representation. In Section 4, we empirically show how our algorithm generalizes to different shared representations.

Iterative Building of Target Domain Labeled Data: If we had enough labeled data from the target domain, then a classifier could be trained without the need for adaptation. Hence, we wanted to explore if and how (pseudo) labeled data for the target domain can be created. Our hypothesis is that certain target domain instances are more similar to source domain instances than the rest. Hence, a classifier trained on (a suitably chosen transformed representation of) source domain instances will be able to categorize similar target

domain instances confidently. Such confidently predicted instances can be considered as pseudo-labeled data, which are then used to initialize a classifier in the target domain.

Only a handful of instances in the target domain can be confidently predicted using the shared representation; therefore, we further iterate to create pseudo-labeled instances in the target domain. In the next round of iterations, the remaining unlabeled target domain instances are passed through both classifiers and their outputs are suitably combined. Again, confidently labeled instances are added to the pool of pseudo-labeled data and the classifier in the target domain is updated. This process is repeated until all unlabeled data is labeled or a certain maximum number of iterations is performed. This way we gradually adapt the target domain classifier on pseudo-labeled data using the knowledge transferred from the source domain. In Section 4, we empirically demonstrate the effectiveness of this technique compared to one-shot adaptation approaches.

Domain Similarity-based Aggregation: The performance of domain adaptation is often constrained by the dissimilarity between the source and target domains (Luo et al., 2012; Rosenstein et al., 2005; Chin, 2013; Blitzer et al., 2007). If the two domains are largely similar, the knowledge learned in the source domain can be aggressively transferred to the target domain. On the other hand, if the two domains are less similar, the knowledge learned in the source domain should be transferred in a conservative manner so as to mitigate the effects of negative transfer. Therefore, it is imperative for domain adaptation techniques to account for the similarity between domains and transfer knowledge in a similarity-aware manner. While this may sound obvious, we do not see many works in the domain adaptation literature that leverage inter-domain similarity for transfer of knowledge. In this work, we use the cosine similarity measure to compute the similarity between two domains and, based on that, gradually transfer knowledge from the source to the target domain. While it would be interesting to compare how different similarity measures fare in preventing negative transfer, that is not the focus of this work. In Section 4, we empirically show the marginal gains of transferring knowledge in a similarity-aware manner.
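As a small illustration of this similarity measure (formalized later in Eq. 4), the sketch below computes the cosine similarity between two domains represented as bag-of-words vectors built with scikit-learn; the corpora shown are placeholders, not the paper's data, and the exact domain representation used by the authors is not specified here.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder corpora standing in for the source and target domains.
source_docs = ["the battery life of this camera is great", "poor battery, bad screen"]
target_docs = ["the plot of this book is great", "poor writing, bad ending"]

# One bag-of-words vector per domain, over a shared vocabulary.
vectorizer = CountVectorizer()
vectorizer.fit(source_docs + target_docs)
a = vectorizer.transform([" ".join(source_docs)])
b = vectorizer.transform([" ".join(target_docs)])

# cosine_similarity normalizes internally, so sim = a.b / (||a|| ||b||), as in Eq. 4.
sim = cosine_similarity(a, b)[0, 0]
print(f"domain similarity = {sim:.2f}")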


Table 1: Notations used in this research.

Symbol                                                      Description
{x^s_i, y^s_i}_{i=1:n_s}; x^s_i in R^d; y^s_i in {-1,+1}    Labeled source domain instances
{x^t_i}_{i=1:n_t}; y~_i in {-1,+1}                          Unlabeled target domain instances and predicted label for the target domain
Q                                                           Co-occurrence based projection matrix
P_u, P_s                                                    Pools of unlabeled and pseudo-labeled target domain instances, respectively
C_s, C_t; functions from R^d to {-1,+1}                     Classifier C_s is trained on {(Qx^s_i, y^s_i)}; classifier C_t is trained on {x^t_i, y~^t_i}, where x^t_i in P_s and y~ is the pseudo label
E                                                           Weighted ensemble of C_s and C_t; E produces the predicted labels
alpha                                                       Confidence of prediction
theta_1, theta_2                                            Confidence thresholds for C_s and the ensemble E
w^s, w^t                                                    Weights for C_s and C_t, respectively

3.1 Algorithm

Table 1 lists the notations used in this research. Inputs to the algorithm are the labeled source domain instances {x^s_i, y^s_i}_{i=1:n_s} and a pool of unlabeled target domain instances {x^t_i}_{i=1:n_t}, denoted by P_u. As shown in Figure 2, the steps of the algorithm are as follows:

1. Learn Q, a shared representation projection matrix, from the source and target domains using any of the existing techniques. SCL is used in this research.

2. Learn C_s on the SCL-based representation of the labeled source domain instances {Qx^s_i, y^s_i}.

3. Use C_s to predict labels y~_i for the instances in P_u using the SCL-based representation Qx^t_i. Instances which are predicted with confidence greater than a pre-defined threshold theta_1 are moved from P_u to P_s with their pseudo label y~.

4. Learn C_t from the instances {x^t_i, y~^t_i} in P_s to incorporate target specific features. P_s only contains instances added in step 3 and grows iteratively (hence the training set here is small).

5. C_s and C_t are combined in an ensemble E as a weighted combination with weights w^s and w^t, which are both initialized to 0.5.

6. The ensemble E is applied to all remaining instances in P_u to obtain the label y~_i as:

   E(x_i^t) \rightarrow \tilde{y}_i = w^s \, C_s(Q x_i^t) + w^t \, C_t(x_i^t)   (1)

   (a) If the ensemble classifies an instance with confidence greater than the threshold theta_2, then it is moved from P_u to P_s along with its pseudo label y~_i.

Figure 2: Illustrates the learning of the initial classifiers and the iterative learning process of the proposed similarity-aware domain adaptation algorithm.

   (b) Repeat step 6 for all x_i^t in P_u.

7. The weights w^s and w^t are updated as shown in Eqs. 2 and 3. This update facilitates knowledge transfer within the ensemble, guided by the similarity between the domains (a small sketch of this update is given after the step list):

   w^s_{l+1} = \frac{sim \cdot w^s_l \cdot I(C_s)}{sim \cdot w^s_l \cdot I(C_s) + (1 - sim) \cdot w^t_l \cdot I(C_t)}   (2)

   w^t_{l+1} = \frac{(1 - sim) \cdot w^t_l \cdot I(C_t)}{sim \cdot w^s_l \cdot I(C_s) + (1 - sim) \cdot w^t_l \cdot I(C_t)}   (3)

where l is the iteration index and sim is the similarity score between the domains, computed using the cosine similarity metric as shown in Eq. 4:

   sim = \frac{a \cdot b}{\|a\| \, \|b\|}   (4)

where a and b are normalized vector representations of the two domains. I(\cdot) is the loss function used to measure the errors of the individual classifiers in each iteration:

   I(\cdot) = \exp\{-\eta \, \ell(C, Y)\}   (5)

where \eta is the learning rate, set to 0.1, \ell(y, \tilde{y}) = (y - \tilde{y})^2 is the square loss function, y is the label predicted by the classifier, and \tilde{y} is the label predicted by the ensemble.

8. Re-train classifier C_t on P_s.

9. Repeat steps 6–8 until P_u is empty or the maximum number of iterations is reached.
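The following is a minimal sketch of the weight update in Eqs. 2, 3, and 5 under the stated assumptions (square loss of each classifier against the ensemble's labels, learning rate 0.1); the function and argument names are illustrative rather than the authors' code.

import math

def classifier_loss(pred_labels, ensemble_labels, eta=0.1):
    """I(.) from Eq. 5: exponentiated square loss of a classifier w.r.t. the ensemble."""
    sq_loss = sum((y - y_ens) ** 2 for y, y_ens in zip(pred_labels, ensemble_labels))
    return math.exp(-eta * sq_loss)

def update_weights(w_s, w_t, sim, cs_preds, ct_preds, ensemble_preds):
    """Similarity-aware re-weighting of C_s and C_t (Eqs. 2 and 3)."""
    i_cs = classifier_loss(cs_preds, ensemble_preds)
    i_ct = classifier_loss(ct_preds, ensemble_preds)
    denom = sim * w_s * i_cs + (1.0 - sim) * w_t * i_ct
    return (sim * w_s * i_cs) / denom, ((1.0 - sim) * w_t * i_ct) / denom

# Example: with a dissimilar domain pair (sim = 0.3), weight shifts toward C_t faster.
w_s, w_t = 0.5, 0.5
w_s, w_t = update_weights(w_s, w_t, 0.3, [1, -1, 1], [1, 1, 1], [1, 1, 1])
print(w_s, w_t)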


In this iterative manner, the proposed algorithm transforms unlabeled data in the test domain into pseudo-labeled data and progressively learns the classifier C_t. The confidence of prediction, alpha_i for the i-th instance, is measured as the distance from the decision boundary (Hsu et al., 2003), which is computed as shown in Eq. 6:

   \alpha = \frac{R}{|v|}   (6)

where R is the un-normalized output of the support vector machine (SVM) classifier, v is the weight vector over the support vectors, and |v| = \sqrt{v^{\top}v}. The weights of the individual classifiers in the ensemble are updated in each iteration, which gradually shifts the emphasis from the classifier learned on the shared representation to the classifier learned on the target domain. Algorithm 1 illustrates the proposed iterative learning algorithm.
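In practice, this distance-to-hyperplane confidence can be read off an SVM implementation directly. The sketch below uses scikit-learn's decision_function, whose absolute value plays the role of alpha in Eq. 6; the toy data are placeholders, and this is an illustration rather than the authors' exact computation.

import numpy as np
from sklearn.svm import SVC

# Toy binary training data standing in for SCL-projected source instances.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([-1, -1, 1, 1])

clf = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)

X_test = np.array([[0.05, 0.1], [0.5, 0.5]])
labels = clf.predict(X_test)
# decision_function returns the signed margin score; its magnitude is used as alpha.
alphas = np.abs(clf.decision_function(X_test))

for x, y, a in zip(X_test, labels, alphas):
    print(f"x={x}, predicted={int(y):+d}, confidence alpha={a:.3f}")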

Algorithm 1: Iterative Learning Algorithm

Input: C_s trained on the shared co-occurrence based representation Qx; C_t initiated on the TFIDF representation from P_s; P_u, the remaining unlabeled target domain instances.
Iterate: for l = 0, 1, ... until P_u = {} or iterMax is reached
Process: Construct the ensemble E as a weighted combination of C_s and C_t with initial weights w^s_l and w^t_l set to 0.5, and sim = similarity between the domains.
  for i = 1 to n (size of P_u) do
    Predict labels: E(Qx_i, x_i) -> y~_i; calculate alpha_i
    if alpha_i > theta_2 then
      Remove the i-th instance from P_u and add it to P_s with pseudo label y~_i.
    end if
  end for
  Retrain C_t on P_s and update w^s_l and w^t_l.
end iterate
Output: Updated C_t, w^s_l and w^t_l.
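Below is a compact, self-contained sketch of this iterative loop under simplifying assumptions: both classifiers are scikit-learn SVMs, the shared representation is reduced to a fixed projection matrix Q, and the similarity, thresholds, and data are toy placeholders. It is meant to illustrate the control flow of Algorithm 1, not to reproduce the authors' implementation.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy stand-ins: Q plays the role of the shared (SCL) projection; similarity,
# thresholds, and data are placeholders, not the paper's values.
d = 5
Q = np.eye(d)
Xs = rng.normal(size=(60, d))
ys = np.where(Xs[:, 0] > 0, 1, -1)              # labeled source domain
Xt = rng.normal(loc=0.3, size=(40, d))          # unlabeled target domain
sim, theta2, iter_max, eta = 0.6, 0.5, 10, 0.1

def confidence(clf, X):
    """alpha from Eq. 6: magnitude of the SVM margin score."""
    return np.abs(clf.decision_function(X))

# Steps 1-3: source classifier on the projected space and initial pseudo-labels.
Cs = SVC(kernel="rbf", gamma="scale").fit(Xs @ Q, ys)
conf = confidence(Cs, Xt @ Q)
theta1 = np.quantile(conf, 0.75)                # placeholder threshold choice
pseudo = conf >= theta1
Ps_X, Ps_y = Xt[pseudo], Cs.predict(Xt[pseudo] @ Q)
Pu = Xt[~pseudo]

w_s = w_t = 0.5
for l in range(iter_max):                       # steps 6-9
    if len(Pu) == 0 or len(np.unique(Ps_y)) < 2:
        break
    Ct = SVC(kernel="rbf", gamma="scale").fit(Ps_X, Ps_y)                 # steps 4-5
    ens_score = w_s * Cs.decision_function(Pu @ Q) + w_t * Ct.decision_function(Pu)  # Eq. 1
    ens_label = np.where(ens_score >= 0, 1, -1)
    confident = np.abs(ens_score) > theta2      # step 6(a)
    Ps_X = np.vstack([Ps_X, Pu[confident]])
    Ps_y = np.concatenate([Ps_y, ens_label[confident]])
    # Step 7: similarity-aware weight update (Eqs. 2, 3, and 5).
    i_cs = np.exp(-eta * np.sum((Cs.predict(Pu @ Q) - ens_label) ** 2))
    i_ct = np.exp(-eta * np.sum((Ct.predict(Pu) - ens_label) ** 2))
    denom = sim * w_s * i_cs + (1 - sim) * w_t * i_ct
    w_s, w_t = sim * w_s * i_cs / denom, (1 - sim) * w_t * i_ct / denom
    Pu = Pu[~confident]                         # C_t is retrained on the next pass

print(f"final weights: w_s={w_s:.2f}, w_t={w_t:.2f}; pseudo-labeled pool size: {len(Ps_y)}")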

4 Experimental Results

The efficacy of the proposed algorithm is evaluated on different datasets for cross-domain text classification (Blitzer et al., 2007; Dai et al., 2007). In our experiments, performance is evaluated on a two-class classification task and reported in terms of classification accuracy.

4.1 Datasets & Experimental Protocol

The first dataset is the Amazon review dataset (Blitzer et al., 2007), which has four different

domains: Books, DVDs, Kitchen appliances, and Electronics. Each domain comprises 1000 positive and 1000 negative reviews. In all experiments, 1600 labeled reviews from the source and 1600 unlabeled reviews from the target domain are used in training, and performance is reported on the non-overlapping 400 reviews from the target domain.

The second dataset is the 20 Newsgroups

dataset (Lang, 1995), which is a text collection of approximately 20,000 documents evenly partitioned across 20 newsgroups. For cross-domain text classification on the 20 Newsgroups dataset, we followed the protocol of Dai et al. (2007), where it is divided into six different datasets and the top two categories in each are picked as the two classes. The data is further segregated based on sub-categories, where each sub-category is considered as a different domain. Table 2 lists how the different sub-categories are combined to represent the source and target domains. In our experiments, 4/5th of the source and target data is used to learn the shared feature representation and results are reported on the remaining 1/5th of the target data.

Table 2: Elaborates the data segregation of the 20 Newsgroups dataset for cross-domain classification.

dataset        Ds                                              Dt
comp vs rec    comp.graphics, comp.sys.ibm.pc.hardware,        comp.os.ms-windows.misc, comp.windows.x,
               comp.sys.mac.hardware, rec.motorcycles,         rec.autos, rec.sport.baseball
               rec.sport.hockey
comp vs sci    comp.graphics, comp.os.ms-windows.misc,         comp.sys.ibm.pc.hardware, comp.sys.mac.hardware,
               sci.crypt, sci.electronics, sci.space           comp.windows.x, sci.med
comp vs talk   comp.graphics, comp.sys.mac.hardware,           comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware,
               comp.windows.x, talk.politics.mideast,          talk.politics.guns, talk.politics.misc
               talk.religion.misc
rec vs sci     rec.autos, rec.sport.baseball,                  rec.motorcycles, rec.sport.hockey,
               sci.med, sci.space                              sci.crypt, sci.electronics
rec vs talk    rec.autos, rec.motorcycles,                     rec.sport.baseball, rec.sport.hockey,
               talk.politics.guns, talk.politics.misc          talk.politics.mideast, talk.religion.misc
sci vs talk    sci.electronics, sci.med,                       sci.crypt, sci.space,
               talk.politics.misc, talk.religion.misc          talk.politics.guns, talk.politics.mideast

The third dataset is a real world dataset comprising tweets about products and services in different domains. The dataset comprises tweets/posts from three collections: Coll1 about gaming, Coll2 about Microsoft products, and Coll3 about mobile support. Each collection has 218 positive and negative tweets. These tweets are collected based on user-defined keywords


captured in a listening engine, which then crawls social media and fetches comments matching the keywords. This dataset, being noisy and comprising short text, is more challenging than the previous two datasets.

All datasets are pre-processed by converting to

lowercase followed by stemming. Feature selection based on document frequency (DF = 5) reduces the number of features as well as speeds up the classification task. For the Amazon review dataset, TF is used for feature weighing, whereas TFIDF is used for feature weighing in the other two datasets. In all our experiments, the constituent classifiers used in the ensemble are support vector machines (SVMs) with a radial basis function kernel. The performance of the proposed algorithm for the cross-domain classification task is compared with different techniques,1 including 1) an in-domain classifier trained and tested on the same domain data, 2) a baseline classifier which is trained on the source and directly tested on the target domain, 3) SCL,2 a widely used domain adaptation technique for cross-domain text classification, and 4) "Proposed w/o sim", removing similarity from Eqs. 2 & 3.
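A minimal sketch of this preprocessing and classifier setup is shown below, assuming scikit-learn and NLTK's Snowball stemmer as stand-ins for the tools actually used (the paper does not name them); min_df mirrors the document-frequency cut-off (set to 5 in the paper, smaller here only so the toy corpus is not emptied), and the corpus is a placeholder.

from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

stemmer = SnowballStemmer("english")

def preprocess(text):
    # Lowercase and stem every token, as described in the experimental setup.
    return " ".join(stemmer.stem(tok) for tok in text.lower().split())

docs = ["This camera is GREAT value", "Terrible battery, would not buy again",
        "Great read, highly recommended", "Boring book, not recommended"]
labels = [1, -1, 1, -1]

# TFIDF weighting with document-frequency-based feature selection.
vectorizer = TfidfVectorizer(preprocessor=preprocess, min_df=1)
X = vectorizer.fit_transform(docs)

clf = SVC(kernel="rbf", gamma="scale").fit(X, labels)
print(clf.predict(vectorizer.transform(["great camera, great value"])))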

4.2 Results and Analysis

For cross-domain classification, performance degrades mainly due to 1) feature divergence and 2) negative transfer owing to largely dissimilar domains. Table 3 shows the accuracy of the individual classifiers and the ensemble for cross-domain classification on the Amazon review dataset. The ensemble has better accuracy than the individual classifiers; therefore, in our experiments the final reported performance is the accuracy of the ensemble. The combination weights in the ensemble represent the contributions of the individual classifiers toward classification accuracy. In our experiments, the maximum number of iterations (iterMax) is set to 30. It is observed that at the end of the iterative learning process, the target specific classifier is assigned more weight mass as compared to the classifier trained on the shared representation. On average, the weights of the two classifiers converge to w^s = 0.22 and w^t = 0.78 at the end of the iterative learning process.

1 We also compared our performance with the sentiment sensitive thesaurus (SST) proposed by Bollegala et al. (2013), and our algorithm outperformed it under our protocol. However, we did not include comparative results because of the difference in experimental protocol, as SST is tailored for using multiple source domains while our protocol uses a single source domain.

2 Our implementation of SCL is used in this paper.

Table 3: Comparing the performance of the individual classifiers and the ensemble when training on the Books domain and testing across different domains. C_s and C_t are applied on the test domain data before performing the iterative learning process.

SD -> TD    C_s     C_t     Ensemble
B -> D      63.1    34.8    72.1
B -> E      64.5    39.1    75.8
B -> K      68.4    42.3    76.2

Table 4: Lists some examples of domain specific discriminative features learned by the proposed algorithm on the Amazon review dataset.

Domain        Domain specific features
Books         pictures illustrations, more detail, to read
DVDs          Definite buy, delivery prompt
Kitchen       invaluable resource, rust, delicious
Electronics   Bargain, Energy saving, actually use

This further validates our assertion that the target specific features are more discriminative than the shared features in classifying target domain instances, and that they are efficiently captured by the proposed algorithm. Key observations and analysis from the experiments on the different datasets are summarized below.

4.2.1 Results on the Amazon Review dataset

To study the effects of the different components of the proposed algorithm, comprehensive experiments are performed on the Amazon review dataset.3

1) Effect of learning target specific features: Results in Figure 3 show that iteratively learning the target specific feature representation (slow transfer as opposed to one-shot transfer) yields better performance across different cross-domain classification tasks as compared to SCL, SFA (Pan et al., 2010),4 and the baseline. Unlike SCL and SFA, the proposed approach uses both shared and target specific feature representations for the cross-domain classification task. Table 4 illustrates some examples of the target specific discriminative features learned by the proposed algorithm that lead to enhanced performance. At 95% confidence, a parametric t-test suggests that the proposed algorithm and SCL are significantly (statistically) different.

2) Effect of similarity on performance: It is observed that existing domain adaptation techniques enhance the accuracy for cross-domain classification, though negative transfer exists in camouflage.

3 Due to space restrictions, we show this analysis only on one dataset; however, similar conclusions were drawn from the other datasets as well.

4 We directly compared our results with the performance reported in (Pan et al., 2010).


Figure 3: Comparing the performance of the proposed approach with existing techniques for cross-domain classification on the Amazon review dataset.

The results in Figure 3(b) (for the case K -> B) describe an evident scenario of negative transfer, where the adaptation performance with SCL descends lower than the baseline. However, the proposed algorithm still sustains the performance by transferring knowledge proportionate to the similarity between the two domains. To further analyze the effect of similarity, we segregated the 12 cross-domain classification cases into two categories based on the similarity between the two participating domains, i.e., 1) > 0.5 and 2) < 0.5. Table 5 shows that for the 6 out of 12 cases that fall in the first category, the average accuracy gain is 10.8% as compared to the baseline, while for the remaining 6 cases that fall in the second category, the average accuracy gain is 15.4% as compared to the baseline. This strongly elucidates that the proposed similarity-based iterative algorithm not only adapts well when the domain similarity is high but also yields gains in accuracy when the domains are largely dissimilar. Figure 4 also shows how the weight for the target domain classifier w^t varies with the number of iterations. It further strengthens our assertion that if the domains are similar, the algorithm can readily adapt and converges in a few iterations. On the other hand, for dissimilar domains, slow iterative transfer, as opposed to one-shot transfer, can achieve similar performance; however, it may take more iterations to converge. While the effect of similarity on domain adaptation performance is evident, this work opens possibilities for further investigations.

3) Effect of varying thresholds theta_1 & theta_2: Figure 5(a) explains the effect of varying theta_1 on the final classification accuracy. If theta_1 is low, C_t may get trained on incorrectly predicted pseudo-labeled instances; whereas, if theta_1 is high, C_t may be deficient of instances to learn a good decision boundary. On the other hand, theta_2 influences the number of iterations required by the algorithm to reach the

Table 5: Effect of similarity on accuracy gain for cross-domain classification on the Amazon review dataset.

Category   SD -> TD   Sim    Gain   Avg. (SD)
> 0.5      E -> K     0.78   13.1   10.8 (4.9)
           K -> E     0.78   10.6
           B -> K     0.54   8.0
           K -> B     0.54   2.9
           B -> E     0.52   13.1
           E -> B     0.52   17.2
< 0.5      K -> D     0.34   8.9    15.4 (4.4)
           D -> K     0.34   21.6
           E -> D     0.33   14.5
           D -> E     0.33   14.5
           B -> D     0.29   14.1
           D -> B     0.29   19.1

Figure 4: Illustrates how the weight (w^t) of the target domain classifier varies for the most and least similar domains with the number of iterations.

stopping criterion. If this threshold is low, the algorithm converges aggressively (in a few iterations) and does not benefit from the iterative nature of learning the target specific features, whereas a high threshold tends to make the algorithm conservative. It hampers the accuracy because of the unavailability of sufficient instances to update the classifier after each iteration, which also leads to a large number of iterations to converge (it may not even converge).

theta_1 and theta_2 are set empirically on a held-out set, with values ranging from zero to the distance of the farthest classified instance from the SVM hyperplane (Hsu et al., 2003). The knee-shaped curve on the graphs in Figure 5 shows that there exists


Figure 5: The bar plot shows the % of data that crosses the confidence threshold; the lower and upper parts of each bar represent the % of correctly and wrongly predicted pseudo labels. The black line shows how the final classification accuracy is affected by the threshold.

an optimal value for theta_1 and theta_2 which yields the best accuracy. We observed that the best accuracy is obtained when the thresholds are set to the distance between the hyperplane and the farthest support vector in each class.

4) Effect of using different shared representations in the ensemble: To study the generalization ability of the proposed algorithm to different shared representations, experiments are performed using three different shared representations on the Amazon review dataset. Apart from using the SCL representation, the accuracy is compared with the proposed algorithm using two other representations: 1) common features between the two domains ("common") and 2) a multi-view principal component analysis based representation ("MVPCA") (Ji et al., 2011), as they were previously used for cross-domain sentiment classification on the same dataset. Table 6 shows that the proposed algorithm yields significant gains in cross-domain classification accuracy with all three representations and is not restricted to any specific representation. The final accuracy depends on the initial classifier trained on the shared representation; therefore, if a shared representation sufficiently captures the characteristics of both the source and target domains, the proposed algorithm can be built on any such representation for enhanced cross-domain classification accuracy.

4.2.2 Results on 20 Newsgroups data

Results in Figure 6 compare the accuracy of the proposed algorithm with existing approaches on the 20 Newsgroups dataset. Since the different domains are crafted out of sub-categories of the same dataset, the domains are exceedingly similar and, therefore, the baseline accuracy is relatively better

Table 6: Comparing the accuracy of the proposed algorithm built on different shared representations.

SD -> TD   Common   MVPCA   SCL
B -> D     66.8     76.4    78.2
B -> E     69.0     79.2    80.6
B -> K     71.4     79.2    79.8
D -> B     64.5     78.4    79.3
D -> E     62.8     76.4    76.2
D -> K     64.3     80.9    82.4
E -> B     68.9     77.8    78.5
E -> D     65.7     77.0    77.3
E -> K     75.1     85.4    86.2
K -> B     71.3     71.0    71.1
K -> D     70.4     75.0    76.1
K -> E     76.7     85.7    86.4

Figure 6: Results comparing the accuracy of the proposed approach with existing techniques for cross-domain categorization on the 20 Newsgroups dataset.

than that on the other two datasets. The proposed algorithm still yields an improvement of at least 10.8% over the baseline accuracy. As compared to other existing domain adaptation approaches like SCL (Blitzer et al., 2007) and CoCC (Dai et al., 2007), the proposed algorithm outperforms by at least 4% and 1.9%, respectively. This also validates our assertion that domain adaptation techniques generally perform well when the participating domains are largely similar; however, the similarity aggregation and the iterative learning give the proposed algorithm an edge over one-shot adaptation algorithms.

4.2.3 Results on real world data

Results in Figure 7 exhibit the challenges associated with the real world dataset. The baseline accuracy for the cross-domain classification task is severely affected on this dataset. SCL based domain adaptation does not yield generous improvements, as selecting the pivot features and computing the co-occurrence statistics on noisy short text is arduous and inept. On the other hand, the proposed algorithm iteratively learns discriminative target specific features from such perplexing data and translates them into an improvement of at least 6.4% and 3.5% over the baseline and SCL,


Figure 7: Results comparing the accuracy of the proposed approach with existing techniques for cross-domain categorization on the real world dataset.

respectively.

5 Conclusion

The paper presents an iterative similarity-aware domain adaptation algorithm that progressively learns domain specific features from the unlabeled test domain data, starting with a shared feature representation. In each iteration, the proposed algorithm assigns pseudo labels to the unlabeled data, which are then used to update the constituent classifiers and their weights in the ensemble. Updating the target specific classifier in each iteration helps to better learn the domain specific features and thus results in enhanced cross-domain classification accuracy. The similarity between the two domains is aggregated while updating the weights of the constituent classifiers, which facilitates a gradual shift of knowledge from the source to the target domain. Finally, experimental results for cross-domain classification on different datasets show the efficacy of the proposed algorithm as compared to other existing approaches.

References

A. Aue and M. Gamon. 2005. Customizing sentiment classifiers to new domains: A case study. Technical report, Microsoft Research.

J. Blitzer, R. McDonald, and F. Pereira. 2006. Domain adaptation with structural correspondence learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 120–128.

J. Blitzer, M. Dredze, and F. Pereira. 2007. Biographies, bollywood, boomboxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the Association for Computational Linguistics, pages 187–205.

D. Bollegala, D. Weir, and J. Carroll. 2011. Using multiple sources to construct a sentiment sensitive thesaurus for cross-domain sentiment classification. In Proceedings of the Association for Computational Linguistics: Human Language Technologies, pages 132–141.

D. Bollegala, D. Weir, and J. Carroll. 2013. Cross-domain sentiment classification using a sentiment sensitive thesaurus. IEEE Transactions on Knowledge and Data Engineering, 25(8):1719–1731.

M. Chen, K. Q. Weinberger, and J. Blitzer. 2011. Co-training for domain adaptation. In Proceedings of Advances in Neural Information Processing Systems, pages 2456–2464.

Si-Chi Chin. 2013. Knowledge transfer: what, how, and why.

W. Dai, G.-R. Xue, Q. Yang, and Y. Yu. 2007. Co-clustering based classification for out-of-domain documents. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, pages 210–219.

Hal Daume III. 2009. Frustratingly easy domain adaptation. arXiv preprint arXiv:0907.1815.

I. S. Dhillon, S. Mallela, and D. S. Modha. 2003. Information-theoretic co-clustering. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, pages 89–98.

A. Garcia-Fernandez, O. Ferret, and M. Dinarelli. 2014. Evaluation of different strategies for domain adaptation in opinion mining. In Proceedings of the International Conference on Language Resources and Evaluation, pages 26–31.

C.-W. Hsu, C.-C. Chang, and C.-J. Lin. 2003. A practical guide to support vector classification. Technical report, Department of Computer Science, National Taiwan University.

Y.-S. Ji, J.-J. Chen, G. Niu, L. Shang, and X.-Y. Dai. 2011. Transfer learning via multi-view principal component analysis. Journal of Computer Science and Technology, 26(1):81–98.

J. Jiang and C. Zhai. 2007. Instance weighting for domain adaptation in NLP. In Proceedings of the Association for Computational Linguistics, volume 7, pages 264–271.

K. Lang. 1995. Newsweeder: Learning to filter netnews. In Proceedings of the International Conference on Machine Learning.

X. Liao, Y. Xue, and L. Carin. 2005. Logistic regression with an auxiliary data source. In Proceedings of the International Conference on Machine Learning, pages 505–512.

C. Luo, Y. Ji, X. Dai, and J. Chen. 2012. Active learning with transfer learning. In Proceedings of the Association for Computational Linguistics Student Research Workshop, pages 13–18. Association for Computational Linguistics.

S. J. Pan and Q. Yang. 2010a. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.

Sinno Jialin Pan and Qiang Yang. 2010b. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.

Sinno Jialin Pan, James T. Kwok, and Qiang Yang. 2008. Transfer learning via dimensionality reduction. In AAAI, volume 8, pages 677–682.

60

S. J. Pan, X. Ni, J-T Sun, Q. Yang, and Z. Chen. 2010. Cross-domain sentiment classification via spectral feature align-ment. In Proceedings International Conference on WorldWide Web, pages 751–760. ACM.

S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang. 2011. Do-main adaptation via transfer component analysis. IEEETransactions on Neural Networks, 22(2):199–210.

M. T. Rosenstein, Z. Marx, L. P. Kaelbling, and T. G. Di-etterich. 2005. To transfer or not to transfer. In Pro-ceedings of Advances in Neural Information ProcessingSystems Workshop, Inductive Transfer: 10 Years Later.

A. Saha, P. Rai, H. Daume, S. Venkatasubramanian, and S. L.DuVall. 2011. Active supervised domain adaptation. InProceedings of European Conference on Machine Learn-ing and Knowledge Discovery in Databases, pages 97–112.

R. Xia, C. Zong, X. Hu, and E. Cambria. 2013. Feature en-semble plus sample selection: domain adaptation for sen-timent classification. IEEE Intelligent Systems, 28(3):10–18.

P. Zhao and S. C. H. Hoi. 2010. OTL: A Framework ofOnline Transfer Learning. In Proceeding of InternationalConference on Machine Learning.


Proceedings of the 19th Conference on Computational Natural Language Learning, pages 62–72, Beijing, China, July 30-31, 2015. ©2015 Association for Computational Linguistics

Analyzing Optimization for Statistical Machine Translation: MERT Learns Verbosity, PRO Learns Length

Francisco Guzman, Preslav Nakov and Stephan Vogel
ALT Research Group
Qatar Computing Research Institute, HBKU
{fguzman,pnakov,svogel}@qf.org.qa

Abstract

We study the impact of source length and verbosity of the tuning dataset on the performance of parameter optimizers such as MERT and PRO for statistical machine translation. In particular, we test whether the verbosity of the resulting translations can be modified by varying the length or the verbosity of the tuning sentences. We find that MERT learns the tuning set verbosity very well, while PRO is sensitive to both the verbosity and the length of the source sentences in the tuning set; yet, overall PRO learns best from high-verbosity tuning datasets.

Given these dependencies, and potentially some others such as the amount of reordering, the number of unknown words, syntactic complexity, and the evaluation measure, to mention just a few, we argue for the need of controlled evaluation scenarios, so that the selection of tuning set and optimization strategy does not overshadow scientific advances in modeling or decoding. In the meantime, until we develop such controlled scenarios, we recommend using PRO with a high-verbosity tuning set, which, in our experiments, yields the highest BLEU across datasets and language pairs.

1 Introduction

Statistical machine translation (SMT) systems nowadays are complex and consist of many components such as a translation model, a reordering model, a language model, etc., each of which could have several sub-components. All components and their elements work together to score full and partial hypotheses proposed by the SMT system's search algorithms.

Thus, putting them together requires assigning them relative weights, e.g., how much weight we should give to the translation model vs. the language model vs. the reordering table. These relative weights are typically learned discriminatively in a log-linear framework, and their values are optimized to maximize some automatic metric, typically BLEU, on a tuning dataset.
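Although not spelled out in the text, the standard log-linear formulation behind this setup scores a candidate translation e of a source sentence f with feature functions h_k (translation model, language model, reordering model, etc.) and weights \lambda_k, and tuning searches for the weights that maximize the metric on the tuning set:

\[
\hat{e}(f;\lambda) = \arg\max_{e} \sum_{k} \lambda_k\, h_k(e, f),
\qquad
\hat{\lambda} = \arg\max_{\lambda} \mathrm{BLEU}\big(\{\hat{e}(f_i;\lambda)\}_i, \{\text{references}\}\big).
\]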

Given this setup, it is clear that the choice of a tuning set and its characteristics can have significant impact on the SMT system's performance: if the experimental framework (training data, tuning set, and test set) is highly consistent, i.e., there is close similarity in terms of genre, domain and verbosity,1 then translation quality can be improved by careful selection of tuning sentences that exhibit a high degree of similarity to the test set (Zheng et al., 2010; Li et al., 2010).

In our recent work (Nakov et al., 2012), we have studied the relationship between optimizers such as MERT, PRO and MIRA, and we have pointed out that PRO tends to generate relatively shorter translations, which could lead to lower BLEU scores on testing. Our solution there was to fix the objective function being optimized: PRO uses sentence-level smoothed BLEU+1, as opposed to the standard dataset-level BLEU.

Here we are interested in a related but different question: the relationship between properties of the tuning dataset and the optimizer's performance. More specifically, we study how the verbosity, i.e., the average target/source sentence length ratio, learned by optimizers such as MERT and PRO depends on the nature of the tuning dataset. This could potentially allow us to manipulate the verbosity of the translation hypotheses generated at test time by changing some characteristics of the tuning dataset.

1 Verbosity also depends on the translator; it is often a stylistic choice, and not necessarily related to fluency or adequacy. This aspect is beyond the scope of the present work.


2 Related Work

Tuning the parameters of a log-linear model for statistical machine translation is an active area of research. The standard approach is to use minimum error rate training, or MERT (Och, 2003), which optimizes BLEU directly.

Recently, there has been a surge in new optimization techniques for SMT. Two parameter optimizers that have recently become popular include the margin-infused relaxed algorithm or MIRA (Watanabe et al., 2007; Chiang et al., 2008; Chiang et al., 2009), which is an online sentence-level perceptron-like passive-aggressive optimizer, and pairwise ranking optimization or PRO (Hopkins and May, 2011), which operates in batch mode and sees tuning as ranking.

A number of improved versions thereof have been proposed in the literature, including a batch version of MIRA (Cherry and Foster, 2012), a version with local updates (Liu et al., 2012), a linear regression version of PRO (Bazrafshan et al., 2012), and a non-sampling version of PRO (Dreyer and Dong, 2015); another example is Rampeon (Gimpel and Smith, 2012). We refer the interested reader to three recent overviews on parameter optimization for SMT: (McAllester and Keshet, 2011; Cherry and Foster, 2012; Gimpel and Smith, 2012).

Still, MERT remains the de-facto standard in the statistical machine translation community. Its stability has been of concern and is widely studied. Suggestions to improve it include using regularization (Cer et al., 2008), random restarts (Moore and Quirk, 2008), multiple replications (Clark et al., 2011), and parameter aggregation (Cettolo et al., 2011).

With the emergence of new optimization techniques, there have also been studies that compare stability between MIRA–MERT (Chiang et al., 2008; Chiang et al., 2009; Cherry and Foster, 2012), PRO–MERT (Hopkins and May, 2011), and MIRA–PRO–MERT (Cherry and Foster, 2012; Gimpel and Smith, 2012; Nakov et al., 2012). Pathological verbosity was reported when using MERT on recall-oriented metrics such as METEOR (Lavie and Denkowski, 2009; Denkowski and Lavie, 2011), as well as large variance with MIRA (Simianer et al., 2012). However, we are not aware of any previous studies of the impact of sentence length and dataset verbosity across optimizers.

3 Method

For the following analysis, we need to define four quantities:

• source-side length: the number of words in the source sentence;

• length ratio: the ratio of the number of words in the output hypothesis to those in the reference;2

• verbosity: the ratio of the number of words in the reference to those in the source;3

• hypothesis verbosity: the ratio of the number of words in the hypothesis to those in the source.

Naturally, the verbosity varies across different tuning/testing datasets, e.g., because of style, translator choice, etc. Interestingly, verbosity can also differ across sentences with different source lengths drawn from the same dataset. This is illustrated in Figure 1, which plots the average sample source length vs. the average verbosity for 100 samples, each containing 500 randomly selected sentence pairs, drawn from the concatenation of the MT04, MT05, MT06, MT09 datasets for Arabic-English and of newstest2008-2011 for Spanish-English.4
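For illustration, the per-sample statistics plotted in Figure 1 can be computed along the following lines; the corpus is assumed to be a list of (source tokens, reference tokens) pairs, and the verbosity of a sample is taken here as the ratio of total reference words to total source words (a per-sentence average would be an alternative reading).

```python
import random

def figure1_points(corpus, n_samples=100, sample_size=500, seed=0):
    """corpus: list of (source_tokens, reference_tokens) pairs (assumed format)."""
    rng = random.Random(seed)
    points = []
    for _ in range(n_samples):
        sample = rng.sample(corpus, sample_size)
        src_words = sum(len(src) for src, _ in sample)
        ref_words = sum(len(ref) for _, ref in sample)
        avg_src_len = src_words / sample_size   # x axis: average source-side length
        verbosity = ref_words / src_words       # y axis: average verbosity of the sample
        points.append((avg_src_len, verbosity))
    return points
```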

We can see that for Arabic-English, the English translations are longer than the Arabic source sentences, i.e., the verbosity is greater than one. This relationship is accentuated by length: verbosity increases with sentence length (see the slightly positive slope of the regression line). Note that the increasing verbosity can be observed in single-reference sets (we used the first reference), and to a lesser extent in multiple-reference sets (five references for MT04 and MT05, and four for MT06 and MT09). For Spanish-English, the story is different: here the English sentences tend to be shorter than the Spanish ones, and the verbosity decreases as the sentence length increases. Overall, in all three cases, the verbosity appears to be length-dependent.

2 For multi-reference sets, we use the length of the reference that is closest to the length of the hypothesis. This is the best match length from the original paper on BLEU (Papineni et al., 2002); it is the default in the NIST scoring tool v13a, which we use in our experiments.

3 When dealing with multi-reference sets, we use the average reference length.

4 The datasets we experiment with are described in more detail in Section 4 below.


Figure 1: Average source sentence length (x axis) vs. average verbosity (y axis) for 100 random samples, each with 500 sentence pairs extracted from NIST (Left: Arabic-English, multi- and single-reference) and from WMT (Right: Spanish-English, single-reference) data.

The main research question we are interested in, and which we will explore in this paper, is whether the SMT parameter optimizers are able to learn the verbosity from the tuning set. We are also interested in the question of how the hypothesis verbosity learned by optimizers such as MERT and PRO depends on the nature of the tuning dataset, i.e., its verbosity. Understanding this could potentially allow us to manipulate the hypothesis verbosity of the translations generated at test time simply by changing the characteristics of the tuning dataset in a systematic and controlled way. While controlling the verbosity of a tuning set might be an appealing idea, this is unrealistic in practice, given that the verbosity of a test set is always unknown. However, the results in Figure 1 suggest that it is possible to manipulate verbosity by controlling the average source sentence length of the dataset (and the source-side length is always known for any test set). Thus, in our study, we use the source-side sentence length as a data selection criterion; still, we also report results for selection based on verbosity.

In order to shed some light on our initial question (whether the SMT parameter optimizers are able to learn the verbosity from the tuning dataset), we contrast the verbosity that two different optimizers, MERT and PRO, learn as a function of the average length of the sentences in the tuning dataset.5

5 In this work, we consider both optimizers, MERT and PRO, as black boxes. For a detailed analysis of how their inner workings can affect optimization, see our earlier work (Nakov et al., 2012).

4 Experiments and Evaluation

We experimented with single-reference and multi-reference tuning and testing datasets for two language pairs: Spanish-English and Arabic-English. For Spanish-English, we used the single-reference datasets newstest2008, newstest2009, newstest2010, and newstest2011 from the WMT 2012 Workshop on Machine Translation Evaluation.6 For Arabic-English, we used the multi-reference datasets MT04, MT05, MT06, and MT09 from the NIST 2012 OpenMT Evaluation;7 we further experimented with single-reference versions of the MT0x datasets, using the first reference only.

In addition to the above datasets, we constructed tuning sets of different source-side sentence lengths: short, middle and long. Given an original tuning dataset, we selected 50% of its sentence pairs: the shortest 50%, the middle 50%, or the longest 50%. This yielded tuning datasets with the same number of sentence pairs but with different numbers of words, e.g., for our Arabic-English datasets, longest has about twice as many English words as middle, and about four times as many words as shortest. Constructing tuning datasets with the same number of sentences instead of the same number of tokens is intentional, as we wanted to ensure that in each of the conditions, the SMT parameter optimizers learn from the same number of training examples.
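A straightforward way to build these subsets, assuming each tuning set is a list of sentence pairs and a function giving the source-side length of a pair, is to sort by source length and slice; this is a sketch under those assumptions, not the authors' scripts.

```python
def length_subsets(tuning_set, src_len):
    """Split a tuning set into 50% subsets by source-side sentence length."""
    ordered = sorted(tuning_set, key=src_len)   # shortest first
    n, half = len(ordered), len(ordered) // 2
    short = ordered[:half]                      # shortest 50%
    long_ = ordered[n - half:]                  # longest 50%
    start = (n - half) // 2
    mid = ordered[start:start + half]           # middle 50%
    return short, mid, long_
```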

6 www.statmt.org/wmt12/
7 www.nist.gov/itl/iad/mig/openmt12.cfm


4.1 Experimental Setup

We experimented with the phrase-based SMT model (Koehn et al., 2003) as implemented in Moses (Koehn et al., 2007). For Arabic-English, we trained on all data that was allowed for use in the NIST 2012 evaluation except for the UN corpus. For Spanish-English, we used all WMT12 data, again except for the UN data.

We tokenized and truecased the English and the Spanish side of all bi-texts and also the monolingual data for language modeling using the standard tokenizer of Moses. We segmented the words on the Arabic side using the MADA ATB segmentation scheme (Roth et al., 2008). We built our phrase tables using the Moses pipeline with max-phrase-length 7 and Kneser-Ney smoothing. We also built a lexicalized reordering model (Koehn et al., 2005): msd-bidirectional-fe. We used a 5-gram language model trained on GigaWord v.5 with Kneser-Ney smoothing using KenLM (Heafield, 2011).

On tuning and testing, we dropped the unknown words for Arabic-English, and we used monotone-at-punctuation decoding for Spanish-English. We tuned using MERT and PRO. We used the standard implementation of MERT from the Moses toolkit, and a fixed version of PRO, as we recommended in (Nakov et al., 2013), which solves instability issues when tuning on long sentences; we will discuss our PRO fix and the reasons it is needed in Section 5 below. In order to ensure convergence, we allowed both MERT and PRO to run for up to 25 iterations (default: 16); we further used 1000-best lists (default: 100).

In our experiments below, we perform three reruns of parameter optimization, tuning on each of the twelve tuning datasets; in the figures, we plot the results of the three reruns, while in the tables, we report BLEU averaged over the three reruns, as suggested by Clark et al. (2011).

4.2 Learning Verbosity

We performed parameter optimization using MERT and PRO on each dataset, and we used the resulting parameters to translate the same dataset. The purpose of this experiment was to study the ability of the optimizers to learn the verbosity of the tuning sets. Getting the hypothesis verbosity right means that it is highly correlated with the tuning set verbosity, which in turn is determined by the dataset's source length.

The results are shown in Figure 2. In each graph, there are 36 points (many of them very close and overlapping) since we performed three reruns with our twelve tuning datasets (three length-based subsets for each of the four original tuning datasets). There are several observations that we can make:

(1) MERT is fairly stable with respect to the length of the input tuning sentences. Note how the MERT regression lines imitate those in Figure 1. In fact, the correlation between the verbosity and the hypothesis verbosity for MERT is r=0.980. PRO, on the other hand, has a harder time learning the tuning set verbosity, and the correlation with the hypothesis verbosity is only r=0.44. Interestingly, its length ratio is more sensitive to the input length (r=0.67): on short sentences, it learns to output translations that are slightly shorter than the reference, while on long sentences, it yields increasingly longer translations. The dependence of PRO on source length can be explained by the sentence-level smoothing in BLEU+1 and the broken balance between BLEU's precision component and BP (Nakov et al., 2012). The problem is bigger for short sentences, since there the +1 is added to smaller counts; this results in a preference for shorter translations.
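To see why the +1 matters more for short sentences, recall the corpus-level BLEU and one common add-one smoothing used for sentence-level BLEU+1 (the exact smoothing variant differs across implementations):

\[
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\Big(\sum_{n=1}^{4} \tfrac{1}{4} \log p_n\Big),
\qquad
\mathrm{BP} = \min\big(1,\; e^{\,1 - r/c}\big),
\]

where p_n is the n-gram precision, c the hypothesis length, and r the reference length. BLEU+1 replaces p_n by (m_n + 1)/(h_n + 1) for n >= 2, with m_n the matched and h_n the total hypothesis n-grams, while BP is left unchanged; for a short hypothesis h_n is small, so the +1 inflates the precision noticeably while the brevity penalty does not compensate, and shorter outputs are under-penalized.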

(2) Looking at the results for Arabic-English, we observe that having multiple references makes both MERT and PRO appear more stable, allowing them to generate hypotheses that are less spread out, and closer to 1. This can be attributed to the best-match reference length, which naturally dampens the effect of verbosity during optimization by selecting the reference that is closest to the respective hypothesis.

Overall, we can conclude that MERT learns the tuning set's verbosity more accurately than PRO, whereas PRO learns a verbosity that is more dependent on the source-side length of the sentences in the tuning dataset.

4.3 Performance on the Test Dataset

Next, we study the performance of MERT and PRO when testing on datasets that are different from the one used for tuning. First, we test the robustness of the parameters obtained for specific tuning datasets when testing on various test datasets. Second, we test whether selecting a tuning dataset based on the length of the testing dataset (i.e., the closest one) is a good strategy.


Figure 2: Source-side length vs. hypothesis verbosity for the tuning dataset. There are 36 points per language pair: four tuning sets, each split into three datasets (short, middle, and long), times three reruns.

For this purpose, we perform a grid comparison of tuning and testing on all our datasets: we tune on each short/middle/long dataset, and we test on the remaining short/middle/long datasets.

The results are shown in Table 1, where each cell is an average over 36 BLEU scores: four tuning sets times three test sets times three reruns. For example, 49.63 in row 1 (tune: short), column 2 (test: middle), corresponds to the average over three reruns of (i) tune on MT04-short and test on MT05-middle, MT06-middle, and MT09-middle, (ii) tune on MT05-short and test on MT04-middle, MT06-middle, and MT09-middle, (iii) tune on MT06-short and test on MT04-middle, MT05-middle, and MT09-middle, and (iv) tune on MT09-short and test on MT04-middle, MT05-middle, and MT06-middle. We further include two statistics: (1) the range of values (max-min), measuring how much the test BLEU varies depending on the tuning set, and (2) the loss in BLEU when tuning on the closest dataset instead of on the best-performing one.
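The per-cell averaging and the two statistics amount to simple bookkeeping; the dictionary layout below is hypothetical, but the arithmetic follows the description above (tuning and testing on the same original dataset is excluded, giving 4 x 3 x 3 = 36 scores per cell).

```python
import itertools
import statistics

def table1_cell(bleu, tune_len, test_len, datasets, reruns=3):
    """Average BLEU for one (tuning length, test length) cell.

    bleu: dict mapping (tune_set, tune_len, test_set, test_len, rerun) -> BLEU
          (hypothetical bookkeeping structure).
    """
    scores = [bleu[(tu, tune_len, te, test_len, r)]
              for tu, te, r in itertools.product(datasets, datasets, range(reruns))
              if tu != te]                       # never test on the tuning dataset itself
    return statistics.mean(scores)

def column_statistics(cell, test_len, tune_lens=("short", "mid", "long")):
    """Max-min range and loss of the 'closest' strategy for one test column.

    cell: dict mapping (tune_len, test_len) -> average BLEU.
    """
    col = {t: cell[(t, test_len)] for t in tune_lens}
    spread = max(col.values()) - min(col.values())
    loss_closest = max(col.values()) - col[test_len]   # 'closest' tunes on the matching length
    return spread, loss_closest
```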

There are several interesting observations:

(1) PRO and MERT behave quite differently with respect to the input tuning set. For MERT, tuning on a specific length condition yields the best results when testing on a similar condition, i.e., zero loss. This is a satisfactory result, since it confirms the common wisdom that tuning datasets should be as similar as possible to the test-time input in terms of source-side length. In contrast, PRO behaves better when tuning on mid-length tuning sets. However, the average loss incurred by applying the closest strategy with PRO is rather small, and in practice, choosing a tuning set based on the test set's average length is a good strategy.

(2) MERT has higher variance than PRO and fluctuates more depending on the input tuning set. PRO, on the contrary, tends to perform more consistently, regardless of the length of the tuning set.

(3) MERT yields the best BLEU across datasets and language pairs. Thus, when several tuning sets are available, we recommend choosing the one closest in length to the test set and using MERT.


                          Arabic-English (multi-ref)    Arabic-English (1-ref)     WMT Spanish-English
tuning                    short    mid     long         short    mid     long      short    mid     long     avg
MERT
  short                   47.26*   50.71   50.82        26.69*   28.14   27.49     25.17*   25.94   27.64
  mid                     46.53    51.11*  51.31        26.22    28.39*  27.96     24.96    26.27*  27.97
  long                    46.23    50.84   51.74*       25.80    28.20   28.27*    24.57    26.08   28.29*
  max-min                 1.04     0.40    0.91         0.89     0.25    0.78      0.59     0.34    0.65     0.65
  loss if using closest   0.00     0.00    0.00         0.00     0.00    0.00      0.00     0.00    0.00     0.00
PRO-fix
  short                   46.74    50.57   50.97        25.95    27.66   27.28     24.66    25.83   27.89
  mid                     46.59    50.83   51.41        25.98    28.23   28.19     24.67    25.81   27.64
  long                    46.08    50.56   51.18        25.87    28.11   28.05     24.58    25.77   27.81
  max-min                 0.66     0.27    0.44         0.11     0.58    0.92      0.09     0.06    0.25     0.38
  loss if using closest   0.00     0.00    0.23         0.02     0.00    0.15      0.01     0.00    0.08     0.06

Table 1: Average test BLEU when tuning on each short/mid/long dataset, and testing on the remaining short/mid/long datasets. The column groups give the test condition for each test collection. Each cell represents the average over 36 scores (see the text). The best score for either MERT or PRO is in bold; the best overall score is marked with a *.

4.3.1 Performance vs. Length and Verbosity

The above results give rise to some interesting questions: What if we do not know the source-side length of the test set? What if we can choose a tuning set based on its verbosity? Would it then be better to choose based on length or based on verbosity?

To answer these questions, we analyzed the average results according to two orthogonal views: one based on the tuning set length (using the above 50% length-based subsets of the tuning sets: short, mid, long), and another one based on the tuning set verbosity (using new 50% verbosity-based subsets of the tuning sets: low-verb, mid-verb, high-verb). This time, we translated the full test datasets (e.g., MT06, MT09); the results are shown in Table 2. We can make the following observations:

(1) The best results for PRO are better than the best results for MERT, in all conditions.

(2) Length-based tuning subsets: With a single reference, PRO performs best when tuning on short sentences, but with multiple references, it works best with mid-length sentences. MERT, on the other hand, prefers tuning on long sentences for all testing datasets.

(3) Verbosity-based tuning subsets: PRO yields the best results when the tuning sets have high verbosity; in fact, the best verbosity-based results in the table are obtained with this setting. With multiple references, MERT performs best when tuning on high-verbosity datasets; however, with a single reference, it prefers mid-verbosity.

Based on the above results, we recommend that, whenever we have no access to the input side of the testing dataset beforehand, we should tune on datasets with high verbosity.

4.4 Test vs. Tuning Verbosity and Source Length

In the previous subsection, we have seen that MERT and PRO perform differently in terms of BLEU, depending on the characteristics of the tuning dataset. Below we study a different aspect: how other characteristics of the output of MERT and PRO are affected by the tuning set verbosity and source-side length.

4.4.1 MERT – Sensitive to Verbosity

Figure 3 shows a scatter plot of tuning verbosity vs. test hypothesis verbosity when using MERT to tune under different conditions, and testing on each of the unseen full datasets. We test on full datasets to avoid the verbosity bias that might occur for specific conditions (see Section 3).

We can see a strong positive correlation between the tuning set verbosity and the hypothesis verbosity on the test datasets. The average correlation for Arabic-English is r=0.95 with multiple references and r=0.98 with a single reference; for Spanish-English, it is r=0.97.
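The reported values are plain Pearson correlation coefficients; given paired arrays of tuning-set verbosities and the corresponding test-set hypothesis verbosities, they can be reproduced with, for example:

```python
import numpy as np

def pearson_r(tuning_verbosity, hypothesis_verbosity):
    """Pearson correlation between tuning-set verbosity and test hypothesis verbosity."""
    return np.corrcoef(tuning_verbosity, hypothesis_verbosity)[0, 1]
```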


                Arabic-English (multi-ref)   Arabic-English (1-ref)   WMT Spanish-English
tuning          MERT      PRO-fix            MERT      PRO-fix        MERT      PRO-fix
length
  short         48.71     49.12              26.74     27.35          26.79     27.07
  mid           49.27     49.59              26.97     27.23          26.99     26.88
  long          49.35     49.20              27.23     27.28          27.02     26.84
verbosity
  low-verb      47.90     47.60              25.89     25.88          26.70     26.61
  mid-verb      49.16     49.52              27.69     27.95          27.09     26.81
  high-verb     50.28     50.79*             27.36     28.03*         27.01     27.38*

Table 2: Average test BLEU scores when tuning on different length- and verbosity-based datasets, and testing on the remaining full datasets. Each cell represents the average over 36 scores. The best score for either MERT or PRO is in bold; the best overall score is marked with a *.

Figure 3: Tuning set verbosity vs. test hypothesis verbosity when using MERT. Each point represents the result for an unseen testing dataset, given a specific tuning condition. The linear regressions show the tendencies for each of the test datasets (note that they all overlap for Es-En and look like a single line).

These are very strong positive correlations, and they show that MERT tends to learn SMT parameters that yield translations preserving the verbosity, e.g., lower verbosity on the tuning dataset will yield test-time translations that are less verbose, while higher verbosity on the tuning dataset will yield test-time translations that are more verbose. In other words, MERT learns to generate a fixed number of words per input word. This can be explained by the fact that MERT optimizes the BLEU score directly, and thus learns to output the “right” verbosity on the tuning dataset (in contrast, PRO optimizes sentence-level BLEU+1, which is an approximation to BLEU, but not the actual BLEU). This explains why MERT performs best when the tuning conditions and the testing conditions are in sync. Yet, this makes it dependent on a parameter that we do not necessarily control or have access to beforehand: the length of the test references.

4.4.2 PRO – Sensitive to Source Length

Figure 4 shows the tuning set average source-side length vs. the testing hypothesis/reference length ratio when using PRO to tune on short, middle, and long, and testing on each of the unseen full datasets, as in the previous subsection. We can see that there is a positive correlation between the tuning set average source-side length and the testing hypothesis/reference length ratio. For Spanish-English, it is quite strong (r=0.64), and for Arabic-English, it is more clearly expressed with one reference (r=0.42) than with multiple references (r=0.34). The correlation is significant (p < 0.001) when we take into account the contribution of the tuning set verbosity in the model. This suggests that for PRO, both source length and verbosity influence the hypothesis lengths, i.e., PRO learns the tuning set's verbosity, much like MERT; yet, the contribution of the length of the source sentences from the tuning dataset is not negligible.


Figure 4: Tuning set average source length vs. test hypothesis/reference length ratio for PRO. Each point represents the result for an unseen testing dataset, given a specific tuning condition. The linear regressions show the tendencies across each of the testing datasets.

Finally, note the “stratification” effect for the Arabic-English single-reference data. We attribute it to the differences across test datasets. These differences are attenuated with multiple references due to the closest-match reference length.

5 Discussion

We have observed that high-verbosity tuning sets yield better results with PRO. We have further seen that we can manipulate verbosity by adjusting the average length of the tuning dataset. This leads to the natural question: can this yield better BLEU? It turns out that the answer is “yes”. Below, we present an example that makes this evident.

First, recall that for Arabic-English, longer tuning datasets have higher verbosity. Moreover, our previous findings suggest that for PRO, higher-verbosity tuning datasets will perform better in this situation. Therefore, we should expect that longer tuning datasets could yield better BLEU. Table 3 presents the results for PRO with Arabic-English when tuning on MT06, or subsets thereof, and testing on MT09. The table shows the results for both multi- and single-reference experiments; naturally, manipulating the tuning set has a stronger effect with a single reference. Lines 1-3 show that as the average length of the tuning dataset increases, so does the length ratio, which means a better brevity penalty for BLEU and thus a higher BLEU score. Line 4 shows that selecting a random 50% subset (included here to show the effect of using mixed-length sentences) yields results that are very close to those for middle.

Comparing line 3 to lines 4 and 5, we can see that tuning on long yields longer translations and also higher BLEU, compared to tuning on the full dataset or on random.

Next, lines 6 and 7 show the results when applying our smoothing fix for sentence-level BLEU+1 (Nakov et al., 2012), which prevents translations from becoming too short; we can see that long yields very comparable results. Yet, manipulating the tuning dataset might be preferable, since it allows (i) faster tuning, by using part of the tuning dataset, (ii) flexibility in the selection of the desired verbosity, and (iii) applicability to other MT evaluation measures. Point (ii) is illustrated in Figure 5, which shows that there is a direct positive correlation between verbosity, length ratio, and BLEU; note that the tuning set size does not matter much: in fact, better results are obtained when using less tuning data.

Figure 5: PRO, Arabic-English, 1-ref: tune on the N% longest sentences from MT06, test on MT09. The plot shows BLEU (left axis, 28.5-30.0) and the brevity penalty BP (right axis, 0.91-0.99) for the tuning subsets all, top90, top80, top70, top60, and top50.


                                        multi-ref            1-ref
  Tuning                                BLEU    len. ratio   BLEU    len. ratio
1 tune-short                            46.38   0.961        27.44   0.894
2 tune-mid                              47.44   0.977        29.11   0.950
3 tune-long                             47.47   0.980        29.51   0.969
4 tune-random                           47.43   0.978        28.96   0.941
5 tune-full                             47.18   0.972        28.88   0.934
6 tune-full, BP-smooth=1                47.52   0.984        29.43   0.962
7 tune-full, BP-smooth=1, grounded      47.61   0.991        29.68   0.979

Table 3: PRO, Arabic-English: tuning on MT06, or subsets thereof, and testing on MT09. Statistically significant improvements over tune-full are in bold: using the sign test (Collins et al., 2005), p < 0.05.

6 Conclusion and Future Work

Machine translation has benefited, and continues to benefit, immensely from automatic evaluation measures. However, we frequently observe delicate dependencies between the evaluation metric, the system optimization strategy, and the pairing of tuning and test datasets. This leaves us with a situation in which getting lucky in the selection of tuning datasets and optimization strategy overshadows scientific advances in modeling or decoding. Understanding these dependencies in detail puts us in a better position to construct tuning sets that match the test datasets in such a way that improvements in models, training, and decoding algorithms can be measured more reliably.

To this end, we have studied the impact that the source-side length and verbosity of tuning sets have on the performance of the translation system when tuning the system with different optimizers such as MERT and PRO. We observed that MERT learns the verbosity of the tuning dataset very well, but this can be a disadvantage because we do not know the verbosity of unseen test sentences. In contrast, PRO is affected by both the verbosity and the source-side length of the tuning dataset.

There may be other characteristics of test datasets, e.g., the amount of reordering, the number of unknown words, the complexity of the sentences in terms of syntactic structure, etc., that could have similar effects of creating good or bad luck when deciding how to tune an SMT system. Until we have such controlled evaluation scenarios, our short-term recommendations are as follows:

• Know your tuning datasets: Different language pairs and translation directions may have different source-side length–verbosity dependencies.

• When optimizing with PRO: select or construct a high-verbosity dataset, as this could potentially compensate for PRO's tendency to yield too short translations. Note that for Arabic-English, higher verbosity means longer tuning sentences, while for Spanish-English, it means shorter ones; the translation direction might matter too.

• When optimizing with MERT: If you know the test set beforehand, select the closest tuning set. Otherwise, tune on longer sentences.

We plan to extend this study in a number of directions. First, we would like to include other parameter optimizers such as Rampeon (Gimpel and Smith, 2012) and MIRA. Second, we want to experiment with other metrics, such as TER (Snover et al., 2006), which typically yields short translations, and METEOR (Lavie and Denkowski, 2009), which yields too long translations. Third, we would like to explore other SMT models such as hierarchical (Chiang, 2005) and syntax-based (Galley et al., 2004; Quirk et al., 2005) models, and other decoders such as cdec (Dyer et al., 2010), Joshua (Li et al., 2009), and Jane (Vilar et al., 2010).

A long-term objective would be to design a metric that measures the closeness between tuning and test datasets, taking into account their different characteristics, such as length distribution, verbosity distribution, syntactic complexity, etc., in order to guarantee more stable evaluation situations, but which would also allow us to systematically test the robustness of translation systems when deviating from the matching conditions.

Acknowledgments

We would like to thank the anonymous reviewers for their constructive comments, which have helped us to improve the paper.


References

Marzieh Bazrafshan, Tagyoung Chung, and Daniel Gildea. 2012. Tuning as linear regression. In Proceedings of the 2012 Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT '12, pages 543–547, Montreal, Canada.

Daniel Cer, Daniel Jurafsky, and Christopher D. Manning. 2008. Regularization and search for minimum error rate training. In Proceedings of the Third Workshop on Statistical Machine Translation, WMT '08, pages 26–34, Columbus, Ohio, USA.

Mauro Cettolo, Nicola Bertoldi, and Marcello Federico. 2011. Methods for smoothing the optimizer instability in SMT. In Proceedings of the Thirteenth Machine Translation Summit, MT Summit XIII, pages 32–39, Xiamen, China.

Colin Cherry and George Foster. 2012. Batch tuning strategies for statistical machine translation. In Proceedings of the 2012 Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT '12, pages 427–436, Montreal, Canada.

David Chiang, Yuval Marton, and Philip Resnik. 2008. Online large-margin training of syntactic and structural translation features. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, EMNLP '08, pages 224–233, Honolulu, Hawaii, USA.

David Chiang, Kevin Knight, and Wei Wang. 2009. 11,001 new features for statistical machine translation. In Proceedings of the 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT '09, pages 218–226, Boulder, Colorado, USA.

David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, ACL '05, pages 263–270, Ann Arbor, Michigan, USA.

Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL-HLT '11, pages 176–181, Portland, Oregon, USA.

Michael Collins, Philipp Koehn, and Ivona Kucerova. 2005. Clause restructuring for statistical machine translation. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL '05, pages 531–540, Ann Arbor, Michigan.

Michael Denkowski and Alon Lavie. 2011. Meteor-tuned phrase-based SMT: CMU French-English and Haitian-English systems for WMT 2011. Technical Report CMU-LTI-11-011, Language Technologies Institute, Carnegie Mellon University.

Markus Dreyer and Yuanzhe Dong. 2015. APRO: All-pairs ranking optimization for MT tuning. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT '15, pages 1018–1023, Denver, Colorado, USA.

Chris Dyer, Adam Lopez, Juri Ganitkevitch, Jonathan Weese, Ferhan Ture, Phil Blunsom, Hendra Setiawan, Vladimir Eidelman, and Philip Resnik. 2010. cdec: A decoder, alignment, and learning framework for finite-state and context-free translation models. In Proceedings of the ACL 2010 System Demonstrations, ACL '10, pages 7–12, Uppsala, Sweden.

Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. 2004. What's in a translation rule? In Proceedings of the 2004 Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT '04, pages 273–280, Boston, Massachusetts, USA.

Kevin Gimpel and Noah A. Smith. 2012. Structured ramp loss minimization for machine translation. In Proceedings of the 2012 Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT '12, pages 221–231, Montreal, Canada.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, WMT '11, Edinburgh, United Kingdom.

Mark Hopkins and Jonathan May. 2011. Tuning as ranking. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 1352–1362, Edinburgh, Scotland, United Kingdom.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (Volume 1), HLT-NAACL '03, pages 48–54, Edmonton, Canada.

Philipp Koehn, Amittai Axelrod, Alexandra Birch Mayne, Chris Callison-Burch, Miles Osborne, and David Talbot. 2005. Edinburgh system description for the 2005 IWSLT speech translation evaluation. In Proceedings of the International Workshop on Spoken Language Translation, IWSLT '05, Pittsburgh, Pennsylvania, USA.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (Demonstration session), ACL '07, pages 177–180, Prague, Czech Republic.

Alon Lavie and Michael J. Denkowski. 2009. The METEOR metric for automatic evaluation of machine translation. Machine Translation, 23:105–115.

Zhifei Li, Chris Callison-Burch, Chris Dyer, Sanjeev Khudanpur, Lane Schwartz, Wren Thornton, Jonathan Weese, and Omar Zaidan. 2009. Joshua: An open source toolkit for parsing-based machine translation. In Proceedings of the Fourth Workshop on Statistical Machine Translation, WMT '09, pages 135–139, Athens, Greece.

Mu Li, Yinggong Zhao, Dongdong Zhang, and Ming Zhou. 2010. Adaptive development data selection for log-linear model in statistical machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING '10, pages 662–670, Beijing, China.

Lemao Liu, Hailong Cao, Taro Watanabe, Tiejun Zhao, Mo Yu, and Conghui Zhu. 2012. Locally training the log-linear model for SMT. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 402–411, Jeju Island, Korea.

David McAllester and Joseph Keshet. 2011. Generalization bounds and consistency for latent structural probit and ramp loss. In J. Shawe-Taylor, R. S. Zemel, P. Bartlett, F. C. N. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, NIPS '11, pages 2205–2212, Granada, Spain.

Robert C. Moore and Chris Quirk. 2008. Random restarts in minimum error rate training for statistical machine translation. In Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1, COLING '08, pages 585–592, Manchester, United Kingdom.

Preslav Nakov, Francisco Guzman, and Stephan Vogel. 2012. Optimizing for sentence-level BLEU+1 yields short translations. In Proceedings of the 24th International Conference on Computational Linguistics, COLING '12, pages 1979–1994, Mumbai, India.

Preslav Nakov, Francisco Guzman, and Stephan Vogel. 2013. A tale about PRO and monsters. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), ACL '13, pages 12–17, Sofia, Bulgaria.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, ACL '03, pages 160–167, Sapporo, Japan.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, ACL '02, pages 311–318, Philadelphia, Pennsylvania, USA.

Chris Quirk, Arul Menezes, and Colin Cherry. 2005. Dependency treelet translation: Syntactically informed phrasal SMT. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL '05, pages 271–279, Ann Arbor, Michigan, USA.

Ryan Roth, Owen Rambow, Nizar Habash, Mona Diab, and Cynthia Rudin. 2008. Arabic morphological tagging, diacritization, and lemmatization using lexeme models and feature ranking. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL '08, pages 117–120, Columbus, Ohio, USA.

Patrick Simianer, Stefan Riezler, and Chris Dyer. 2012. Joint feature selection in distributed stochastic learning for large-scale discriminative training in SMT. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL '12, pages 11–21, Jeju Island, Korea.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Biennial Conference of the Association for Machine Translation in the Americas, AMTA '06, pages 223–231, Cambridge, Massachusetts, USA.

David Vilar, Daniel Stein, Matthias Huck, and Hermann Ney. 2010. Jane: Open source hierarchical translation, extended with reordering and lexicon models. In Proceedings of the 5th Workshop on Statistical Machine Translation and MetricsMATR, WMT '10, pages 262–270, Uppsala, Sweden.

Taro Watanabe, Jun Suzuki, Hajime Tsukada, and Hideki Isozaki. 2007. Online large-margin training for statistical machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '07, pages 764–773, Prague, Czech Republic.

Zhongguang Zheng, Zhongjun He, Yao Meng, and Hao Yu. 2010. Domain adaptation for statistical machine translation in development corpus selection. In Proceedings of the 4th International Universal Communication Symposium, IUCS '10, pages 2–7, Beijing, China.


Proceedings of the 19th Conference on Computational Natural Language Learning, pages 73–82, Beijing, China, July 30-31, 2015. ©2015 Association for Computational Linguistics

Annotation Projection-based Representation Learning for Cross-lingual Dependency Parsing

Min Xiao and Yuhong Guo
Department of Computer and Information Sciences
Temple University, Philadelphia, PA 19122, USA

{minxiao,yuhong}@temple.edu

Abstract

Cross-lingual dependency parsing aims to train a dependency parser for an annotation-scarce target language by exploiting annotated training data from an annotation-rich source language, which is of great importance in the field of natural language processing. In this paper, we propose to address cross-lingual dependency parsing by inducing latent cross-lingual data representations via matrix completion and annotation projections on a large amount of unlabeled parallel sentences. To evaluate the proposed learning technique, we conduct experiments on a set of cross-lingual dependency parsing tasks with nine different languages. The experimental results demonstrate the efficacy of the proposed learning method for cross-lingual dependency parsing.

1 Introduction

The natural language processing (NLP) community has witnessed an enormous development of multilingual resources, which draws increasing attention to developing cross-lingual NLP adaptation systems. Cross-lingual dependency parsing aims to train a dependency parser for a target language where labeled data is rare or unavailable by exploiting the abundant annotated data from a source language. Cross-lingual dependency parsing can effectively reduce the expensive manual annotation effort in individual languages and has been increasingly studied in the multilingual community. Previous works have demonstrated the success of cross-lingual dependency parsing for a variety of languages (Durrett et al., 2012; McDonald et al., 2013; Tackstrom et al., 2013; Søgaard and Wulff, 2012).

One fundamental issue of cross-lingual dependency parsing lies in how to effectively transfer the annotation information from the source language domain to the target language domain. Due to the language divergence over the word-level representations and the sentence structures, simply training a monolingual dependency parser on the labeled source language data without adaptation learning will fail to produce a dependency parser that works in the target language domain. To tackle this problem, a variety of works in the literature have designed better algorithms to exploit the annotated resources in the source languages, including the cross-lingual annotation projection methods (Hwa et al., 2005; Smith and Eisner, 2009; Zhao et al., 2009), the cross-lingual direct transfer with linguistic constraints methods (Ganchev et al., 2009; Naseem et al., 2010; Naseem et al., 2012), and the cross-lingual representation learning methods (Durrett et al., 2012; Tackstrom et al., 2012; Zhang et al., 2012).

In this work, we propose a novel representation learning method to address cross-lingual dependency parsing, which exploits annotation projections on a large amount of unlabeled parallel sentences to induce latent cross-lingual features via matrix completion. It combines the advantages of the cross-lingual annotation projection methods, which project labeled information into the target language domain, and the cross-lingual representation learning methods, which learn latent interlingual features. Specifically, we first train a dependency parser on the labeled source language data and use it to infer labels for the unlabeled source language sentences of the parallel resources. We then project the annotations from the source language to the target language via the word alignments on the parallel sentences. Afterwards, we define a set of interlingual features and construct a word-feature matrix by associating each word with these language-independent features. We then use the original labeled source language data and the predicted (or projected) labeled information on the parallel sentences to fill in the observed entries of the word-feature matrix, while matrix completion is performed to fill the remaining missing entries. The completed word-feature matrix provides a set of consistent cross-lingual representation features for the words in both languages. We use these features as augmenting features to train a dependency parsing system on the labeled data in the source language and perform prediction on the test sentences in the target language. To evaluate the proposed learning method, we conduct experiments on eight cross-lingual dependency parsing tasks with nine different languages. The experimental results demonstrate the superior performance of the proposed cross-lingual transfer learning method, compared to other approaches.
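The matrix completion step is not pinned down to a specific algorithm in the description above; as one concrete and purely illustrative instantiation, the missing entries of the word-feature matrix can be filled with a regularized low-rank factorization fitted only to the observed entries, e.g., by alternating least squares.

```python
import numpy as np

def complete_matrix(M, observed, rank=50, n_iters=30, reg=0.1, seed=0):
    """Fill the missing entries of a word-feature matrix by low-rank factorization.

    M:        (n_words, n_features) matrix holding the observed values;
    observed: boolean mask of the same shape (True where an entry was filled in
              from the labeled or projected data).
    This alternating-least-squares sketch is only one possible instantiation of
    the matrix-completion step; the paper's actual formulation may differ.
    """
    rng = np.random.default_rng(seed)
    n, d = M.shape
    U = rng.normal(scale=0.1, size=(n, rank))      # word factors
    V = rng.normal(scale=0.1, size=(d, rank))      # feature factors
    I = reg * np.eye(rank)
    for _ in range(n_iters):
        for i in range(n):                         # update each word factor on its observed columns
            cols = observed[i]
            if cols.any():
                Vi = V[cols]
                U[i] = np.linalg.solve(Vi.T @ Vi + I, Vi.T @ M[i, cols])
        for j in range(d):                         # update each feature factor on its observed rows
            rows = observed[:, j]
            if rows.any():
                Uj = U[rows]
                V[j] = np.linalg.solve(Uj.T @ Uj + I, Uj.T @ M[rows, j])
    completed = U @ V.T
    completed[observed] = M[observed]              # keep the observed entries as given
    return completed
```

The completed rows (or, equivalently, the learned word factors U) would then serve as the consistent cross-lingual representation features mentioned above.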

2 Related Work

A variety of cross-lingual dependency parsing methods have been developed in the literature. We provide a brief review of the related works in this section.

Much work developed in the literature is based on annotation projection (Hwa et al., 2005; Liu et al., 2013; Smith and Eisner, 2009; Zhao et al., 2009). Basically, these methods exploit parallel sentences and first project the annotations of the source language sentences to the corresponding target language sentences via the word-level alignments. Then, they train a dependency parser in the target language by using the target language sentences with projected annotations. The performance of annotation projection-based methods can be affected by the quality of the word-level alignments and the specific projection schema. Therefore, Hwa et al. (2005) proposed to heuristically correct or modify the projected annotations in order to increase the projection performance, while Smith and Eisner (2009) used a more robust projection method, quasi-synchronous grammar projection, to address cross-lingual dependency parsing. Moreover, Liu et al. (2013) proposed to project the discrete dependency arcs instead of the treebank as the training set. These works, however, assume that the parallel sentences are already available, or can be obtained by using free machine translation tools. Instead, Zhao et al. (2009) considered the cost of machine translation and used a bilingual lexicon to obtain a translated treebank with projected annotations from the source language.

A number of works have been developed based on representation learning (Durrett et al., 2012; Tackstrom et al., 2012; Zhang et al., 2012; Xiao and Guo, 2014). In general, these methods first automatically learn some language-independent features and then train a dependency parser in this interlingual feature space with labeled data in the source language and apply it on the data in the target language. Durrett et al. (2012) used a bilingual lexicon, which can be manually constructed or induced on parallel sentences, to learn language-independent projection features for cross-lingual dependency parsing. Tackstrom et al. (2012) used unlabeled parallel sentences to induce cross-lingual word clusterings and used these word clusterings as interlingual features. Both (Durrett et al., 2012) and (Tackstrom et al., 2012) assumed that the twelve universal part-of-speech (POS) tags (Petrov et al., 2012) are available and used them as the basic interlingual features. Moreover, Zhang et al. (2012) proposed to automatically map language-specific POS tags to universal POS tags to address cross-lingual dependency parsing, instead of using the manually defined mapping rules. Recently, Xiao and Guo (2014) used a set of bilingual word pairs as pivots to learn interlingual distributed word representations via deep neural networks as augmenting features for cross-lingual dependency parsing.

Some other works are proposed based on multilingual linguistic constraints (Ganchev et al., 2009; Gillenwater et al., 2010; Naseem et al., 2010; Naseem et al., 2012). Basically, they first construct a set of linguistic constraints and then train a dependency parsing system by incorporating the linguistic constraints via posterior regularization. The constraints are expected to bridge the language differences. Ganchev et al. (2009) automatically learned the constraints by using parallel data, while some other works manually constructed them by using the universal dependency rules (Naseem et al., 2010) or the typological features (Naseem et al., 2012).

3 Proposed Approach

In this section, we present a novel representation learning method for cross-lingual dependency parsing, which combines annotation projection and matrix completion-based feature representation learning together to produce effective interlingual features.


Figure 1: The architecture of the proposed cross-lingual representation learning framework, which consists of two steps, cross-lingual annotation projection and cross-lingual representation learning.

We consider the following cross-lingual dependency parsing setting. We have a large amount of labeled sentences in the source language and a set of unlabeled sentences in the target language. In addition, we also have a large set of auxiliary unlabeled parallel sentences across the two languages. We aim to learn interlingual feature representations such that a dependency parser trained on the source language sentences can be applied in the target language domain. The framework of the proposed cross-lingual representation learning system is given in Figure 1. The system has two steps: cross-lingual annotation projection and cross-lingual representation learning. We present each of the two steps below.

3.1 Cross-Lingual Annotation Projection

In the first step, we employ a large amount of unlabeled parallel sentences to transfer dependency relations from the source language to the target language. We first train a lexicalized dependency parser with the labeled training data in the source language. Then we use this parser to produce parse trees on the source language sentences of the auxiliary parallel data. Simultaneously, we perform word-level alignments on the unlabeled parallel sentences using existing alignment tools. Finally, we project the predicted dependency relations of the source language sentences to their

Figure 2: An example of cross-lingual annotation projection, where a partial word-level alignment is shown to demonstrate two cases of annotation projection.

parallel counterparts in the target language via the word-level alignments. Instead of projecting whole dependency trees, which requires more sophisticated algorithms, we simply project each dependency arc on the source sentences to the target language side.

We now use a specific example in Figure 2 to illustrate the projection step. This example contains an English sentence and its parallel sentence in German. The English sentence is fully labeled, with each dependency relation indicated by a solid directed arc. The dashed lines between the English sentence and the German sentence show the alignments between them. For each dependency arc instance, we consider the following properties: the parent word, the child word, the parent POS, the child POS, the dependency direction, and the dependency distance. The projection of the dependency relations from the source language to the target language is conducted based on the word-level alignment. There are two different scenarios. The first scenario is that the two source language words involved in the dependency relation are aligned to two different words in the corresponding target sentence. For example, the English words "the" and "quota" are aligned to the German words "die" and "Quote" separately. We then copy this dependency relation into the target language side. The second scenario is that a source language word is aligned to a word in the target language sentence and has a dependency relation with the "<root>" word. For example, the English word "want" is aligned to "wollen" and it has a dependency arc with "<root>". We then project the dependency relation from the English side to the German side as well. Moreover, we also directly project the POS tags of the source language


Figure 3: Example of how to collect queries for each specific dependency relation and how to obtain the abstract signatures (adapted from (Durrett et al., 2012)).

words onto the target language words. Since the word order for each aligned word pair in parallel sentences can be different, we recalculate the dependency direction and the dependency distance for the projected dependency arc instance. Note that the example in Figure 2 only shows a partial word-level alignment to demonstrate the two cases of annotation projection. The word alignment tool can align more words than shown in the example.
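To make the two projection cases concrete, the following Python sketch (not the authors' released code; the function and variable names are illustrative assumptions) copies each source arc to the target side when its endpoints are aligned, handles the <root> case, and recomputes direction and distance on the target side.

# A minimal sketch of the arc-projection step described above. `alignment` maps
# source token indices to target token indices; index 0 stands for the
# artificial <root> token on both sides.
def project_arcs(src_arcs, alignment):
    """src_arcs: list of (head_idx, child_idx, head_pos, child_pos) on the source side."""
    projected = []
    for head, child, head_pos, child_pos in src_arcs:
        # Case 2: the head is <root>; only the child needs to be aligned.
        if head == 0 and child in alignment:
            tgt_head, tgt_child = 0, alignment[child]
        # Case 1: both endpoints are aligned to two different target words.
        elif head in alignment and child in alignment and alignment[head] != alignment[child]:
            tgt_head, tgt_child = alignment[head], alignment[child]
        else:
            continue  # the arc cannot be projected
        # Word order may differ, so direction and distance are recomputed on the target side,
        # while the source POS tags are carried over to the aligned target words.
        direction = "RIGHT" if tgt_child > tgt_head else "LEFT"
        distance = abs(tgt_child - tgt_head)
        projected.append((tgt_head, tgt_child, head_pos, child_pos, direction, distance))
    return projected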

3.2 Cross-Lingual Representation Learning

After cross-lingual annotation projection, we have a set of projected dependency arc instances in the target language. However, the sentences in the target language are not fully labeled. Dependency relation related features are not readily available for all the words in the target language domain. Hence, in this step, we first generate a set of interlingual features and then automatically fill the missing feature values for the target language words with matrix completion, based on the projected feature values.

3.2.1 Generating Interlingual Features

We use the signature method in (Durrett et al., 2012) to construct a set of interlingual features for the words in the source and target language domains. The signatures proposed in (Durrett et al., 2012) for dependency parsing are universal across different languages, and have numerical values that are computed in specific dependency relations. Here we illustrate the

signature generation process using the example in Figure 3, which is adapted from (Durrett et al., 2012). Note that for each dependency relation between a parent (also known as the head) word and a child (also known as the dependent) word, we can collect a number of queries based on the dependency properties. For example, given the dependency arc between "want" and "to" in the English sentence in Figure 3, and assuming we consider the child word "to", we produce queries by considering each non-empty subset of the dependency properties (the parent POS, the dependency direction, the dependency distance), which gives us 7 queries: "VERB→to", "→to, RIGHT", "→to, 1", "VERB→to, RIGHT", "VERB→to, 1", "→to, RIGHT, 1", "VERB→to, RIGHT, 1", where VERB is the parent POS tag, RIGHT is the dependency direction and 1 is the dependency distance. We can then abstract the specific queries into signatures by replacing the considered word ("to") with its POS tag ("PRT"), the parent POS tag with "PARENT", the specific dependency distance with "DIST" and the dependency direction with "DIR". This produces the following 7 signatures: "PARENT→[PRT]", "→[PRT], DIR", "→[PRT], DIST", "PARENT→[PRT], DIR", "PARENT→[PRT], DIST", "→[PRT], DIR, DIST", and "PARENT→[PRT], DIR, DIST", where the brackets indicate that the POS tag belongs to the considered word. Similarly, we can perform the same abstraction process for the parent word "want" and get another 7 signatures (see Table 1). Since each signature contains one POS tag and there are 13 different POS types (12 universal POS tags and 1 special type for the "<root>" word), we can get a total of 7 × 2 × 13 = 182

signatures. These signatures are independent of specific languages, though their numerical values should be computed in a specific dependency relation for each considered target word.

A set of interlingual features can then be generated from these abstract signatures by considering different instantiations of their items. A given target word with an observed POS tag has 14 signatures (see Table 1). For each signature, we consider all possible instantiations of its other items given the fixed target word. For example, for the target word "to", its signature "→[PRT], DIR" can be instantiated into 2 features: "→ LEFT" and "→ RIGHT". Similarly, its signature "→[PRT],


Signatures                    # Features
[PRT] → DIR                   2
[PRT] → DIST                  5
[PRT] → CHILD                 13
[PRT] → DIR, DIST             10
[PRT] → CHILD, DIR            26
[PRT] → CHILD, DIST           65
[PRT] → CHILD, DIR, DIST      130
→ [PRT], DIR                  2
→ [PRT], DIST                 5
PARENT → [PRT]                13
→ [PRT], DIR, DIST            10
PARENT → [PRT], DIR           26
PARENT → [PRT], DIST          65
PARENT → [PRT], DIR, DIST     130
Total                         502

Table 1: The number of induced "features" of each signature for a given word.

DIST" can be instantiated into 5 features since DIST has 5 different values ({1, 2, 3–5, 6–10, 11+}), and its signature "[PRT] → CHILD" can be instantiated into 13 features since CHILD denotes the child word's POS tag and can have 13 different values. Hence, as shown in Table 1, we can get 502 features from the 14 signatures.
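The query-and-signature construction above follows directly from enumerating the non-empty subsets of the three dependency properties. A rough illustration (not the paper's code; the function name is hypothetical) for a considered child word, reproducing the "to" example:

from itertools import combinations

def queries_and_signatures(child_word, child_pos, parent_pos, direction, distance):
    # the three dependency properties considered for a child word
    props = {"PARENT": parent_pos, "DIR": direction, "DIST": str(distance)}
    out = []
    for r in range(1, 4):                       # non-empty subsets: 2^3 - 1 = 7
        for subset in combinations(["PARENT", "DIR", "DIST"], r):
            q_head = props["PARENT"] if "PARENT" in subset else ""
            s_head = "PARENT" if "PARENT" in subset else ""
            q_tail = [props[k] for k in subset if k != "PARENT"]
            s_tail = [k for k in subset if k != "PARENT"]
            query = q_head + "->" + ", ".join([child_word] + q_tail)
            signature = s_head + "->" + ", ".join(["[" + child_pos + "]"] + s_tail)
            out.append((query, signature))
    return out

# e.g. the arc VERB("want") -> PRT("to"), direction RIGHT, distance 1
for q, s in queries_and_signatures("to", "PRT", "VERB", "RIGHT", 1):
    print(q, " | ", s)

Running this prints the 7 query/signature pairs listed in the text, such as "VERB->to | PARENT->[PRT]" and "->to, RIGHT | ->[PRT], DIR".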

3.2.2 Learning Feature Values with Matrix Completion

The 502 signature-based interlingual features together with the 13 universal POS tag features can be used as language-independent features for all the words in the vocabulary constructed across the source and target language domains. In particular, we can form a word-feature matrix with the constructed vocabulary and the total of 515 language-independent features. For each word that appears in the dependency relation arcs, we can use the number of occurrences of its interlingual features as the corresponding feature values. However, the sentences in the target language are not fully labeled. Some words in the target language domain may not be observed in the projected dependency arc instances, and we cannot compute their feature values for the 502 interlingual features, though the 13 universal POS tag features are available for all words. Moreover, since we only have a limited number of projected dependency arc instances in the target language, even for target words that appear in the projected arc instances of the parallel data, we may only observe a subset of the 502 interlingual features, with the rest of the features missing. Hence the constructed word-feature matrix is only partially

Figure 4: The word-feature matrix. There are three parts of words: the source language words, target language words from the projected dependency arc instances, and additional target language words. The signature features are the 502 interlingual features and the POS features are the 13 universal POS tags. Solid lines indicate observed entries, dashed lines indicate partially observed entries, and empty regions indicate missing entries.

observed, as shown in Figure 4. Furthermore, there could also be some noise in the observed feature values, as some word features may not have received sufficient observations.

To solve the missing feature problem and simultaneously perform data denoising, we exploit a feature correlation assumption: the 502 constructed interlingual features and the 13 universal POS tags are not mutually independent; they usually contain a lot of statistical correlation information. For example, for a word "want" with POS tag "VERB", its feature value for "VERB → want, RIGHT" is likely to be very small, such as zero, while its feature value for "want → NOUN, LEFT" is likely to be large. Moreover, the existence of either one of these two interlingual features can also indicate the non-existence of the other. The existence of feature correlations establishes the low-rank property of the word-feature matrix. We hence propose to fill the missing feature values and reduce the noise in the word-feature matrix by performing matrix completion. Low-rank matrix completion has been successfully used in many applications to fill missing entries of partially observed low-rank matrices and perform matrix denoising (Cabral et al., 2011; Xiao and Guo, 2013) by exploiting the feature correlations and underlying low-dimensional representations. Following the same principle, we expect to automatically discover the missing


feature values in our word-feature matrix and perform denoising through low-rank matrix completion.

Let $M^0 \in \mathbb{R}^{n \times k}$ denote the partially observed word-feature matrix in Figure 4, where $n$ is the number of words and $k$ is the dimensionality of the feature set, which is 515 in this study. Let $\Omega$ denote the set of indices of the observed entries. Hence, for each observed entry $(i, j) \in \Omega$, $M^0_{ij}$ contains the frequency collected for the $j$-th feature of the $i$-th word. We then formulate matrix completion as the following optimization problem to recover a full matrix $M$ from the partially observed matrix $M^0$:

\min_{M \geq 0} \; \beta \|M\|_* + \alpha \|M\|_{1,1} + \sum_{(i,j) \in \Omega} \left(M_{ij} - M^0_{ij}\right)^2 \qquad (1)

where the trace norm $\|M\|_*$ enforces the low-rank property of the matrix, and $\|M\|_{1,1}$ denotes the entrywise L1 norm. Since many words usually only have observed values for a small subset of the 502 interlingual features, due to the simple fact that they are only associated with very few POS tags, a fully observed word-feature matrix is typically sparse and contains many zero entries. Hence we use the L1 norm regularizer to encode the sparsity of the matrix $M$. The nonnegativity constraint $M \geq 0$ encodes the fact that our frequency-based feature values in the word-feature matrix are all nonnegative. The minimization problem in Eq. (1) can be solved using a standard projected gradient descent algorithm (Xiao and Guo, 2013).
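As a sanity check on the notation (and a minimal sketch rather than the authors' solver), the objective value in Eq. (1) for a candidate completion can be evaluated with numpy as follows; beta and alpha are the two regularisation weights, and the projected gradient descent solver of Xiao and Guo (2013) is not reproduced here.

import numpy as np

def completion_objective(M, M0, observed_mask, beta, alpha):
    """M, M0: n x k arrays; observed_mask: boolean array marking the entries in Omega."""
    trace_norm = np.linalg.norm(M, ord="nuc")          # sum of singular values, ||M||_*
    l1_norm = np.abs(M).sum()                          # entrywise L1 norm, ||M||_{1,1}
    sq_loss = ((M - M0)[observed_mask] ** 2).sum()     # squared loss on the observed entries only
    return beta * trace_norm + alpha * l1_norm + sq_loss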

3.3 Cross-Lingual Dependency Parsing

After matrix completion, we obtain a set of interlingual features for all the words in the word-feature matrix. We then use the interlingual features of each word as augmenting features and train a delexicalized dependency parser on the labeled sentences in the source language. The parser is then applied to perform prediction on the test sentences in the target language, which are also delexicalized and augmented with the interlingual features.

4 Experiments

4.1 Datasets

We used the multilingual dependency parsing dataset from the CoNLL-X shared tasks (Buchholz and Marsi, 2006; Nivre et al., 2007) and experimented with nine different languages: Danish (Da), Dutch (Nl), English (En), German (De),

Greek (El), Italian (It), Portuguese (Pt), Spanish (Es) and Swedish (Sv). For each language, the original dataset contains a training set and a test set. We constructed eight cross-lingual dependency parsing tasks by using English as the label-rich source language and each of the other eight languages as the label-poor target language. For example, the task En2Da means that we used English sentences as the source language data and Danish sentences as the target language data. For each task, we used the original English training set as the labeled source language data, the original training set in the target language as unlabeled training data, and the original test set in the target language as test sentences. Each sentence in the dataset is labeled with gold standard POS tags. We manually mapped these language-specific POS tags to 12 universal POS tags: NOUN (nouns), NUM (numerals), PRON (pronouns), ADJ (adjectives), ADP (prepositions or postpositions), ADV (adverbs), CONJ (conjunctions), DET (determiners), PRT (particles), PUNC (punctuation marks), VERB (verbs) and X (others).

We used the unlabeled parallel sentences from the European Parliament proceedings parallel corpus (Koehn, 2005), which contains parallel sentences between multiple languages, as auxiliary unlabeled parallel sentences in our experiments. For the representation learning over each cross-lingual dependency parsing task, we used all the parallel sentences for the given language pair from this corpus. The number of parallel sentences for the eight language pairs ranges from 1,235,976 to 1,997,775, and the number of tokens involved in these sentences in each language ranges from 31,929,703 to 50,602,994.

4.2 Representation Learning

For the proposed representation learning, we first trained a lexicalized dependency parser on the labeled source language data using the MSTParser tool (proj with the first order set) (McDonald et al., 2005) and used it to predict the parsing annotations of the source language sentences in the unlabeled parallel dataset. The sentences of the parallel data only contain sequences of words, without additional POS tag information. We then used an existing POS tagging tool (Collobert et al., 2011) to infer POS tags for them. Next we produced word-level alignments on the unlabeled parallel


Basic                   Conj with dist             Conj with dir             Conj with dist and dir
upos h                  dist, upos h               dir, upos h               dist, dir, upos h
upos d                  dist, upos d               dir, upos d               dist, dir, upos d
upos h, upos d          dist, upos h, upos d       dir, upos h, upos d       dist, dir, upos h, upos d

Table 2: Feature templates for training a basic delexicalized dependency parser. upos stands for the universal POS tag, h stands for the head word, d stands for the dependent word, dist stands for the dependency distance, which has five values {1, 2, 3–5, 6–10, 11+}, and dir stands for the dependency direction, which has two values {left, right}.
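To make the Table 2 template notation concrete, the sketch below (a hedged illustration only; the actual parser, MSTParser, generates its features internally) instantiates the twelve templates for a single candidate arc.

def delex_arc_features(upos_h, upos_d, dist, direction):
    # bin the distance into the five values used in Table 2
    dist_bin = dist if dist <= 2 else ("3-5" if dist <= 5 else ("6-10" if dist <= 10 else "11+"))
    basic = [f"h:{upos_h}", f"d:{upos_d}", f"h:{upos_h}|d:{upos_d}"]
    feats = list(basic)
    feats += [f"dist:{dist_bin}|{b}" for b in basic]                   # conjoined with distance
    feats += [f"dir:{direction}|{b}" for b in basic]                   # conjoined with direction
    feats += [f"dist:{dist_bin}|dir:{direction}|{b}" for b in basic]   # conjoined with both
    return feats

# e.g. delex_arc_features("VERB", "NOUN", 3, "right") yields 12 feature strings,
# one per cell of Table 2.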

                  Wikitionary                    Parallel Data
Tasks     Delex   Proj1   Δ     DNN    Δ         Proj2   Δ     RLAP   Δ     X-lingual
En2Da     36.5    41.3    4.8   42.6   6.1       42.9    6.4   43.6   7.1   38.7
En2De     46.2    49.2    3.0   49.5   3.3       49.7    3.5   50.5   4.3   50.7
En2El     61.5    62.4    0.9   63.0   1.5       63.5    2.0   64.3   2.8   63.0
En2Es     52.1    54.5    2.4   55.7   3.6       56.2    4.1   56.3   4.2   62.9
En2It     56.4    57.7    1.3   59.1   2.7       59.2    2.8   60.4   4.0   68.8
En2Nl     62.0    64.4    2.4   65.1   3.1       64.9    2.9   66.1   4.1   54.3
En2Pt     68.7    71.5    2.8   72.4   3.7       71.9    3.2   72.8   4.1   71.0
En2Sv     57.8    61.0    3.2   61.9   4.1       62.9    5.1   63.7   5.9   56.9
Average   55.2    57.8    2.6   58.7   3.5       58.9    3.8   59.7   4.6   58.3

Table 3: Comparison results in terms of unlabeled attachment score (UAS) for the eight cross-lingual dependency parsing tasks (English is used as the source language). The evaluation results are on all the test sentences. The Delex method uses no auxiliary resource, Proj1 and DNN use Wikitionary as the auxiliary resource, and Proj2, RLAP, and X-lingual use parallel sentences as the auxiliary resource. Δ denotes the improvement of each method over the baseline Delex method. The bottom row contains the average results over the eight tasks.

sentences by using the Berkeley alignment tool (Liang et al., 2006). With the word alignments, we then projected the predicted dependency relations from the source language sentences of the parallel data to the target language side, which produces a set of dependency arc instances in the target language. Finally, we constructed the partially observed word-feature matrix from these labeled data and conducted matrix completion to recover the whole matrix. For matrix completion, we used the first task, En2Da, to perform parameter selection based on the test performance. We selected $\beta$ from {0.1, 1, 10} and $\alpha$ from {10^{-3}, 10^{-4}, 10^{-5}}. The selected values $\beta = 1$ and $\alpha = 10^{-4}$ were then used for all the experiments.

4.3 Experimental Results

4.3.1 Test Results on All the Test Sentences

We first compared the proposed representation learning with annotation projection method, RLAP, to the following methods in our

experiments: Delex, Proj1, Proj2, DNN and X-lingual. The Delex method is a baseline method, which replaces the language-specific word sequence with the universal POS tag sequence and then trains a delexicalized dependency parser. The feature templates used in this baseline delexicalized dependency parser are listed in Table 2. The Proj1 and Proj2 methods are from (Durrett et al., 2012). Durrett et al. (2012) proposed to use a bilingual lexicon to learn cross-lingual features and provided two ways to construct the bilingual lexicon: one is based on Wikitionary and the other is based on unlabeled parallel sentences with observed word-level alignments. We used these two ways separately to construct the bilingual lexicon between the languages for learning cross-lingual features, which are then used as augmenting features for training delexicalized dependency parsers. We denote the Wikitionary-based method as Proj1 and the parallel-sentence-based method as Proj2. The DNN method, developed in (Xiao and Guo,


Tasks     Delex   Proj2   Δ     RLAP   Δ     USR    PGI    PR     MLC
En2Da     46.7    54.6    7.9   55.7   9.0   51.9   41.6   44.0   -
En2De     62.0    63.0    1.0   64.0   2.0   -      -      39.6   62.8
En2El     60.9    61.9    1.0   63.2   2.3   -      -      -      61.4
En2Es     55.2    58.3    3.1   59.6   4.4   67.2   58.4   62.4   57.3
En2It     55.5    56.9    1.4   58.3   2.8   -      -      -      56.2
En2Nl     60.3    62.5    2.2   63.7   3.4   -      45.1   37.9   62.0
En2Pt     80.2    84.5    4.3   85.7   5.5   71.5   63.0   47.8   83.8
En2Sv     73.4    76.0    2.6   76.4   3.0   63.3   58.3   42.2   74.9
Average   61.8    64.7    2.9   65.8   4.1   -      -      -      -

Table 4: Comparison results on the short test sentences (length of 10 or less) in terms of unlabeled attachment score (UAS). Δ denotes the improvement of each method over the baseline Delex method.

Figure 5: Unlabeled attachment score (UAS) on all the test sentences in the target language as the number of labeled training sentences in the target language varies. Each of the eight panels (En2Da, En2De, En2El, En2Es, En2It, En2Nl, En2Pt, En2Sv) plots UAS against the number of labeled target-language instances (0 to 1500) for Delex, Proj2 and RLAP.

2014), uses Wikitionary to construct bilingual word pairs and then uses a deep neural network to learn interlingual word embeddings as augmenting features for training delexicalized dependency parsers. The X-lingual method uses unlabeled parallel sentences to induce cross-lingual word clusters as augmenting features for a delexicalized dependency parser (Tackstrom et al., 2012). For X-lingual, we cite the results reported in its original paper. For the other methods, we used the MSTParser (McDonald et al., 2005) as the underlying dependency parsing tool. To train the MSTParser, we set the maximum number of perceptron training iterations to 10 and the number of best-k dependency tree candidates to 1.

We evaluated the empirical performance of each comparison method on all the test sentences. The comparison results on the eight cross-lingual dependency parsing tasks in terms of unlabeled attachment score (UAS) are reported in Table 3. We can see that the baseline method, Delex, performs

poorly across the eight tasks. This is not surprising, since the sequence of universal POS tags is not discriminative enough for the dependency parsing task; even two sentences with exactly the same POS tag sequence may have different dependency trees. By using auxiliary bilingual word pairs from Wikitionary, the two cross-lingual representation learning methods, Proj1 and DNN, outperform Delex across all eight tasks. Between these two methods, DNN consistently outperforms Proj1, which suggests that the interlingual word embeddings induced by deep neural networks are very effective. By using unlabeled parallel sentences as an auxiliary resource, the two methods, Proj2 and RLAP, consistently outperform the baseline Delex method, while X-lingual outperforms Delex on six tasks. Moreover, Proj2 outperforms its variant Proj1 across all eight tasks and achieves performance comparable to the deep neural network based method DNN. This suggests that unlabeled parallel sentences form


a stronger auxiliary resource than the free Wikitionary. Our proposed approach, RLAP, which has the capacity to exploit the unlabeled parallel sentences, consistently outperforms the four comparison methods, Delex, Proj1, DNN and Proj2, across all eight tasks. It also outperforms the X-lingual method on five tasks. The average UAS over all eight tasks for the RLAP method is 1.4 points higher than for the X-lingual method. All these results demonstrate the effectiveness of the proposed representation learning method for cross-lingual dependency parsing.

4.3.2 Test Results on Short Test Sentences

We also conducted empirical evaluations on short test sentences (with length of 10 or less). We compared Delex, Proj2 and RLAP with four other methods: USR, PGI, PR and MLC. The USR method is a cross-lingual direct transfer method which uses universal dependency rules to construct linguistic constraints (Naseem et al., 2010). The PGI method is a phylogenetic grammar induction model (Berg-Kirkpatrick and Klein, 2010). The PR method is a posterior regularization approach (Gillenwater et al., 2010). The MLC method is the multilingual linguistic constraints-based method which uses typological features for cross-lingual dependency parsing (Naseem et al., 2012). Here we used this method in our setting with only one source domain. Moreover, since we do not have typological features for Danish, we did not conduct the experiment on the first task with MLC. For USR, PGI and PR, we cite the results reported in their original papers. All the cited results are also produced on the short sentences of the CoNLL-X shared task dataset. We cite them as reference points for measuring the progress of cross-lingual dependency parsing on each given target language.

The comparison results are reported in Table 4. We can see that, for the same method, the results on the short test sentences are in general better than those on the whole test set (Table 3) across most tasks. This suggests that it is easier to infer the dependency tree of a short sentence than of a long sentence. Nevertheless, Proj2 consistently outperforms Delex, and RLAP consistently outperforms Proj2 across all the tasks. Moreover, RLAP achieves the highest test scores in seven of the eight cross-lingual tasks among all the comparison systems. This again demonstrates the efficacy of the proposed approach for cross-lingual

dependency parsing.

4.4 Impact of Labeled Training Data in Target Language

We have also conducted experiments for learning scenarios where a small set of labeled training sentences from the target language is available. Specifically, we conducted experiments with a few different numbers of additional labeled training sentences from the target language, {500, 1000, 1500}, using three methods: RLAP, Delex and Proj2. The comparison results on all the test sentences are reported in Figure 5. We can see that the performance of all three methods increases very slowly, and with a similar trend, as more labeled training instances from the target language are added. However, both Proj2 and RLAP outperform Delex by large margins across all experiments. Moreover, the proposed method, RLAP, produces the best results across all eight tasks. The results again verify the efficacy of the proposed method and demonstrate that filling in the missing feature values with matrix completion is indeed useful.

5 Conclusion

In this paper, we proposed a novel representation learning method with annotation projection to address cross-lingual dependency parsing. The proposed approach exploits unlabeled parallel sentences and combines cross-lingual annotation projection and matrix completion-based interlingual feature learning to automatically induce a set of language-independent numerical features. We used these interlingual features as augmenting features to train a delexicalized dependency parser on the labeled sentences in the source language and tested it in the target language domain. Our experimental results on eight cross-lingual dependency parsing tasks showed that the proposed representation learning method outperforms a number of comparison methods.

Acknowledgments

This research was supported in part by NSF grant IIS-1065397.

References

T. Berg-Kirkpatrick and D. Klein. 2010. Phylogenetic grammar induction. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL).


S. Buchholz and E. Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proc. of the Conference on Computational Natural Language Learning (CoNLL).

R. Cabral, F. Torre, J. Costeira, and A. Bernardino. 2011. Matrix completion for multi-label image classification. In Advances in Neural Information Processing Systems (NIPS).

R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research (JMLR), 12:2493–2537.

G. Durrett, A. Pauls, and D. Klein. 2012. Syntactic transfer using a bilingual lexicon. In Proc. of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

K. Ganchev, J. Gillenwater, and B. Taskar. 2009. Dependency grammar induction via bitext projection constraints. In Proc. of the Joint Conference of the Annual Meeting of the ACL and the International Joint Conference on Natural Language Processing of the AFNLP (ACL-IJCNLP).

J. Gillenwater, K. Ganchev, J. Graca, F. Pereira, and B. Taskar. 2010. Sparsity in dependency grammar induction. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL).

R. Hwa, P. Resnik, A. Weinberg, C. Cabezas, and O. Kolak. 2005. Bootstrapping parsers via syntactic projection across parallel texts. Natural Language Engineering, 11:11–311.

P. Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proc. of the Machine Translation Summit.

P. Liang, B. Taskar, and D. Klein. 2006. Alignment by agreement. In Proc. of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

K. Liu, Y. Lu, W. Jiang, and Q. Liu. 2013. Bilingually-guided monolingual dependency grammar induction. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL).

R. McDonald, K. Crammer, and F. Pereira. 2005. Online large-margin training of dependency parsers. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL).

R. McDonald, J. Nivre, Y. Quirmbach-Brundage, Y. Goldberg, D. Das, K. Ganchev, K. Hall, S. Petrov, H. Zhang, O. Tackstrom, C. Bedini, N. Castello, and J. Lee. 2013. Universal dependency annotation for multilingual parsing. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL).

T. Naseem, H. Chen, R. Barzilay, and M. Johnson. 2010. Using universal linguistic knowledge to guide grammar induction. In Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

T. Naseem, R. Barzilay, and A. Globerson. 2012. Selective sharing for multilingual dependency parsing. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL).

J. Nivre, J. Hall, S. Kubler, R. McDonald, J. Nilsson, S. Riedel, and D. Yuret. 2007. The CoNLL 2007 shared task on dependency parsing. In Proc. of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

S. Petrov, D. Das, and R. McDonald. 2012. A universal part-of-speech tagset. In Proc. of the International Conference on Language Resources and Evaluation (LREC).

D. Smith and J. Eisner. 2009. Parser adaptation and projection with quasi-synchronous grammar features. In Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

A. Søgaard and J. Wulff. 2012. An empirical study of non-lexical extensions to delexicalized transfer. In Proc. of the Conference on Computational Linguistics (COLING).

O. Tackstrom, R. McDonald, and J. Uszkoreit. 2012. Cross-lingual word clusters for direct transfer of linguistic structure. In Proc. of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL).

O. Tackstrom, R. McDonald, and J. Nivre. 2013. Target language adaptation of discriminative transfer parsers. In Proc. of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL).

M. Xiao and Y. Guo. 2013. A novel two-step method for cross language representation learning. In Advances in Neural Information Processing Systems (NIPS).

M. Xiao and Y. Guo. 2014. Distributed word representation learning for cross-lingual dependency parsing. In Proc. of the Conference on Computational Natural Language Learning (CoNLL).

Y. Zhang, R. Reichart, R. Barzilay, and A. Globerson. 2012. Learning to map into a universal POS tagset. In Proc. of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

H. Zhao, Y. Song, C. Kit, and G. Zhou. 2009. Cross language dependency parsing using a bilingual lexicon. In Proc. of the Joint Conference of the ACL and AFNLP (ACL-IJCNLP).


Proceedings of the 19th Conference on Computational Natural Language Learning, pages 83–93, Beijing, China, July 30-31, 2015. ©2015 Association for Computational Linguistics

Big Data Small Data, In Domain Out-of Domain, Known Word Unknown Word: The Impact of Word Representations on Sequence Labelling Tasks

Lizhen Qu1,2, Gabriela Ferraro1,2, Liyuan Zhou1, Weiwei Hou1, Nathan Schneider3 and Timothy Baldwin1,4

1 NICTA, Australia
2 The Australian National University
3 University of Edinburgh
4 The University of Melbourne

{lizhen.qu,gabriela.ferraro,liyuan.zho,weiwei.hou}@nicta.com.au

[email protected]

[email protected]

Abstract

Word embeddings — distributed word representations that can be learned from unlabelled data — have been shown to have high utility in many natural language processing applications. In this paper, we perform an extrinsic evaluation of four popular word embedding methods in the context of four sequence labelling tasks: part-of-speech tagging, syntactic chunking, named entity recognition, and multiword expression identification. A particular focus of the paper is analysing the effects of task-based updating of word representations. We show that when using word embeddings as features, as few as several hundred training instances are sufficient to achieve competitive results, and that word embeddings lead to improvements over out-of-vocabulary words and also out of domain. Perhaps more surprisingly, our results indicate there is little difference between the different word embedding methods, and that simple Brown clusters are often competitive with word embeddings across all tasks we consider.

1 Introduction

Recently, distributed word representations have grown to become a mainstay of natural language processing (NLP), and have been shown to have empirical utility in a myriad of tasks (Collobert and Weston, 2008; Turian et al., 2010; Baroni et al., 2014; Andreas and Klein, 2014). The underlying idea behind distributed word representations is simple: to map each word w in vocabulary V onto a continuous-valued vector of dimensionality d ≪ |V|. Words that are similar (e.g.,

with respect to syntax or lexical semantics) will ideally be mapped to similar regions of the vector space, implicitly supporting both generalisation across in-vocabulary (IV) items, and countering the effects of data sparsity for low-frequency and out-of-vocabulary (OOV) items.

Without some means of automatically deriving the vector representations without reliance on labelled data, however, word embeddings would have little practical utility. Fortunately, it has been shown that they can be "pre-trained" from unlabelled text data using various algorithms to model the distributional hypothesis (i.e., that words which occur in similar contexts tend to be semantically similar). Pre-training methods have been refined considerably in recent years, and scaled up to increasingly large corpora.

As with other machine learning methods, it is well known that the quality of the pre-trained word embeddings depends heavily on factors including parameter optimisation, the size of the training data, and the fit with the target application. For example, Turian et al. (2010) showed that the optimal dimensionality for word embeddings is task-specific. One factor which has received relatively little attention in NLP is the effect of "updating" the pre-trained word embeddings as part of the task-specific training, based on self-taught learning (Raina et al., 2007). Updating leads to word representations that are task-specific, but often at the cost of over-fitting low-frequency and OOV words.

In this paper, we perform an extensive evaluation of four recently proposed word embedding approaches under fixed experimental conditions, applied to four sequence labelling tasks: part-of-speech (POS) tagging, full-text chunking, named entity recognition (NER), and multiword


expression (MWE) identification.1 We build on previous empirical studies (Collobert et al., 2011; Turian et al., 2010; Pennington et al., 2014) in considering a broader range of word embedding approaches and evaluating them over more sequence labelling tasks. In addition, we explore the following research questions:

RQ1: are word embeddings better than baseline approaches of one-hot unigram2 features and Brown clusters?

RQ2: do word embeddings require less training data (i.e., generalise better) than one-hot unigram features? If so, to what degree can word embeddings reduce the amount of labelled data?

RQ3: what is the impact of updating word embeddings in sequence labelling tasks, both empirically over the target task and geometrically over the vectors?

RQ4: what is the impact of these word embeddings (with and without updating) on both OOV items (relative to the training data) and out-of-domain data?

RQ5: overall, are some word embeddings better than others in a sequence labelling context?

2 Word Representations

2.1 Types of Word Representations

Turian et al. (2010) identifies three varieties of word representations: distributional, cluster-based, and distributed.

Distributional representation methods map each word w to a context word vector Cw, which is constructed directly from co-occurrence counts between w and its context words. The learning methods either store the co-occurrence counts between two words w and i directly in Cwi (Sahlgren, 2006; Turney and Pantel, 2010; Honkela, 1997) or project the co-occurrence counts between words into a lower dimensional space (Rehurek and Sojka, 2010; Lund and Burgess, 1996), using dimensionality reduction techniques such as SVD (Dumais et al., 1988) or LDA (Blei et al., 2003).

1 MWEs are lexicalized combinations of two or more simplex words that are exceptional enough to be considered as single units in the lexicon (Baldwin and Kim, 2010; Schneider et al., 2014a), e.g., pick up or part of speech.

2 Word vectors with one-hot representation are binary vectors with a single dimension per word in the vocabulary (i.e., d = |V|), with the single dimension corresponding to the target word set to 1 and all other dimensions set to 0.

Cluster-based representation methods build clusters of words by applying either soft or hard clustering algorithms (Lin and Wu, 2009; Li and McCallum, 2005). Some of them also rely on a co-occurrence matrix of words (Pereira et al., 1993). The Brown clustering algorithm (Brown et al., 1992) is the best-known method in this category.

Distributed representation methods usually map words into dense, low-dimensional, continuous-valued vectors, with x ∈ R^d, where d is referred to as the word dimension.

2.2 Selected Word Representations

Over a range of sequence labelling tasks, we evaluate four methods for inducing word representations: Brown clustering (Brown et al., 1992) ("BROWN"), the continuous bag-of-words model ("CBOW") (Mikolov et al., 2013a), the continuous skip-gram model ("SKIP-GRAM") (Mikolov et al., 2013b), and Global Vectors ("GLOVE") (Pennington et al., 2014). All have been shown to be at or near the state of the art in recent empirical studies (Turian et al., 2010; Pennington et al., 2014).3

The training of these word representations is unsupervised: the common underlying idea is to predict the occurrence of words in the neighbouring context. Their training objectives share the same form, which is a sum of local training factors J(w, ctx(w)):

L = \sum_{w \in T} J(w, \mathrm{ctx}(w))

where T is the set of tokens in a given corpus, and ctx(w) denotes the local context of word w. The local context of a word is conventionally its preceding m words, or alternatively the m words surrounding it. Local training factors are designed to capture the relationship between w and its local contexts of use, either by predicting w based on its local context, or using w to predict the context words. Other than BROWN, which utilises a cluster-based representation, all the other methods employ a distributed representation.

The starting point for CBOW and SKIP-GRAM is to employ softmax to predict word occurrence:

J(w, \mathrm{ctx}(w)) = -\log\left(\frac{\exp(v_w^\top v_{\mathrm{ctx}(w)})}{\sum_{j \in V} \exp(v_j^\top v_{\mathrm{ctx}(w)})}\right)

3 The word embedding approach proposed in Collobert et al. (2011) is not considered because it was found to be inferior to our four target word embedding approaches in previous work.


where v_{ctx(w)} denotes the distributed representation of the local context of word w, and V is the vocabulary of a given corpus. CBOW derives v_{ctx(w)} based on averaging over the context words. That is, it estimates the probability of each w given its local context. In contrast, SKIP-GRAM applies softmax to each context word of a given occurrence of word w. In this case, v_{ctx(w)} corresponds to the representation of one of its context words. This model can be characterised as predicting context words based on w. In practice, softmax is too expensive to compute over large corpora, and thus Mikolov et al. (2013b) use hierarchical softmax and negative sampling to scale up the training.
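A minimal numpy sketch of this full-softmax local factor is shown below (illustrative only; real word2vec training replaces the exact softmax with hierarchical softmax or negative sampling, as noted above). For CBOW, v_ctx would be the average of the context word vectors; for SKIP-GRAM, it is the vector of a single context word.

import numpy as np

def local_factor(v_ctx, w_index, V):
    """v_ctx: context vector of shape (d,); V: matrix of output word vectors, shape (|V|, d)."""
    scores = V @ v_ctx                         # v_j^T v_ctx for every word j in the vocabulary
    log_norm = np.logaddexp.reduce(scores)     # log sum_j exp(v_j^T v_ctx), computed stably
    return -(scores[w_index] - log_norm)       # negative log softmax probability of the target word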

GLOVE assumes the dot product of two word embeddings should be similar to the logarithm of the co-occurrence count X_{ij} of the two words. As such, the local factor J(w, ctx(w)) becomes:

g(X_{ij})\left(v_i^\top v_j + b_i + b_j - \log(X_{ij})\right)^2

where b_i and b_j are the bias terms of words i and j, respectively, and g(X_{ij}) is a weighting function based on the co-occurrence count. This weighting function controls the degree of agreement between the parametric function v_i^\top v_j + b_i + b_j and \log(X_{ij}). Frequently co-occurring word pairs will have larger weight than infrequent pairs, up to a threshold.
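The sketch below evaluates this local factor, assuming the capped power-law weighting g(x) = min((x/x_max)^0.75, 1) from Pennington et al. (2014), with x_max as the threshold hyperparameter mentioned in the text.

import numpy as np

def glove_factor(v_i, v_j, b_i, b_j, x_ij, x_max=100.0):
    g = min((x_ij / x_max) ** 0.75, 1.0)               # frequent pairs get larger weight, capped at 1
    residual = v_i @ v_j + b_i + b_j - np.log(x_ij)    # dot product should match log co-occurrence
    return g * residual ** 2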

BROWN partitions words into a finite set of word classes V. The conditional probability of seeing the next word is defined to be:

p(w_k \mid w_{k-m}^{k-1}) = p(w_k \mid h_k)\, p(h_k \mid h_{k-m}^{k-1})

where h_k denotes the word class of the word w_k, w_{k-m}^{k-1} are the previous m words, and h_{k-m}^{k-1} are their respective word classes. Then J(w, ctx(w)) = -\log p(w_k \mid w_{k-m}^{k-1}). Since there is no tractable method to find an optimal partition of word classes, the method uses only a bigram class model, and utilises hierarchical clustering as an approximation method to find a sufficiently good partition of words.
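A toy sketch of the bigram class factorisation, with maximum-likelihood estimates from counts (the count dictionaries are assumed inputs; the actual Brown algorithm searches over clusterings to maximise the likelihood of this model rather than scoring a fixed one):

def class_bigram_prob(w_k, w_prev, word2class, count_word_in_class, count_class_bigram, count_class):
    h_k, h_prev = word2class[w_k], word2class[w_prev]
    p_word_given_class = count_word_in_class[(w_k, h_k)] / count_class[h_k]
    p_class_given_class = count_class_bigram[(h_prev, h_k)] / count_class[h_prev]
    return p_word_given_class * p_class_given_class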

2.3 Building Word Representations

To ensure the comparison of different word representations is fair, we train BROWN, CBOW, SKIP-GRAM, and GLOVE on a fixed corpus, comprised of freely available corpora, as detailed in Table 1. The joint corpus was preprocessed with

Data set                            Size      Words
UMBC (Han et al., 2013)             48.1GB    3G
One Billion (Chelba et al., 2013)   4.1GB     1G
English Wikipedia                   49.6GB    3G

Table 1: Corpora used to pre-train the word embeddings

Figure 1: Linear-chain graph transformer

the Stanford CoreNLP sentence splitter and tokeniser. All consecutive digit substrings were replaced by NUMf, where f is the length of the digit substring (e.g., 10.20 is replaced by NUM2.NUM2).
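One regex that reproduces the described digit normalisation is shown below; it is assumed to be equivalent to, but is not necessarily, the authors' exact preprocessing script.

import re

def normalise_digits(text):
    # replace each run of digits with NUM followed by the length of the run
    return re.sub(r"\d+", lambda m: "NUM" + str(len(m.group(0))), text)

print(normalise_digits("10.20"))    # NUM2.NUM2
print(normalise_digits("in 1992"))  # in NUM4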

The dimensionality of the word embeddings and the size of the context window are the key hyperparameters when learning distributed representations. We use all combinations of the following values to train word embeddings on the combined corpus:

• Embedding dim. d ∈ {25, 50, 100, 200}

• Context window size m ∈ {1, 5, 10}

BROWN requires only the number of clusters as a hyperparameter. Here, we perform clustering with b ∈ {250, 500, 1000, 2000, 4000} clusters.

3 Sequence Labelling Tasks

We evaluate the different word representations over four sequence labelling tasks: POS tagging (POS tagging), full-text chunking (Chunking), NER (NER), and MWE identification (MWE). For each task, we fed features into a first-order linear-chain graph transformer (Collobert et al., 2011) made up of two layers: the upper layer is identical to a linear-chain CRF (Lafferty et al., 2001), and the lower layer consists of word representation and hand-crafted features. If we treat the word representations as fixed, the graph transformer is a simple linear-chain CRF. On the other hand, if we treat the word representations as model parameters, the model is equivalent to a neural network with word embeddings as the input layer, as shown in Figure 1. We trained all models using AdaGrad (Duchi et al., 2011).

As in Turian et al. (2010), at each word position, we construct word representation features from the words in a context window of size two to either


side of the target word, based on the pre-trained representation of each word type. For BROWN, the features are the prefix features extracted from word clusters in the same way as Turian et al. (2010). As a baseline (and to test RQ1), we include a one-hot representation (which is equivalent to a linear-chain CRF with only lexical context features).
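A sketch of this window-based feature construction for the distributed representations, assuming the pre-trained embeddings are held in a Python dict keyed by word type (padding and unknown types are mapped to zero vectors; the hand-crafted features discussed next would be appended to this vector in the full model):

import numpy as np

def window_features(tokens, i, embeddings, dim):
    feats = []
    for offset in range(-2, 3):                                         # window of size two to either side
        j = i + offset
        if 0 <= j < len(tokens):
            feats.append(embeddings.get(tokens[j], np.zeros(dim)))      # zero vector for unseen types
        else:
            feats.append(np.zeros(dim))                                 # padding outside the sentence
    return np.concatenate(feats)                                        # 5 * dim feature vector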

Our hand-crafted features for POS tagging, Chunking and MWE are those used by Collobert et al. (2011), Turian et al. (2010) and Schneider et al. (2014b), respectively. For NER, we use the same feature space as Turian et al. (2010), except for the previous two predictions, because we want to evaluate all word representations with the same type of model — a first-order graph transformer.

In training the distributed word representations, we consider two settings: (1) the word representations are fixed during sequence model training; and (2) the graph transformer updates the token-level word representations during training.

As outlined in Table 2, for each sequence labelling task, we experiment over the de facto corpus, based on pre-existing training–dev–test splits where available:4

POS tagging: the Wall Street Journal portion of the Penn Treebank (Marcus et al. (1993): "WSJ") with Penn POS tags

Chunking: the Wall Street Journal portion of the Penn Treebank ("WSJ"), converted into IOB-style full-text chunks using the CoNLL conversion scripts for training and dev, and the WSJ-derived CoNLL-2000 full-text chunking test data for testing (Tjong Kim Sang and Buchholz, 2000)

NER: the English portion of the CoNLL-2003 English Named Entity Recognition data set, for which the source data was taken from Reuters newswire articles (Tjong Kim Sang and De Meulder (2003): "Reuters")

MWE: the MWE dataset of Schneider et al. (2014b), over a portion of text from the English Web Treebank5 ("EWT")

For all tasks other than MWE,6 we additionally have an out-of-domain test set, in order to evaluate the out-of-domain robustness of the different

4 For the MWE dataset, no such split pre-existed, so we constructed our own.

5 https://catalog.ldc.upenn.edu/LDC2012T13

6 Unfortunately, there is no second domain which has been hand-tagged with MWEs using the method of Schneider et al. (2014b) to use as an out-of-domain test corpus.

word representations, with and without updating. These datasets are as follows:

POS tagging: the English Web Treebank with Penn POS tags ("EWT")

Chunking: the Brown Corpus portion of the Penn Treebank ("Brown"), converted into IOB-style full-text chunks using the CoNLL conversion scripts

NER: the MUC-7 named entity recognition corpus7 ("MUC7")

For reproducibility, we tuned the hyperparameters with random search over the development data for each task (Bergstra and Bengio, 2012). In this, we randomly sampled 50 distinct hyperparameter sets with the same random seed for the non-updating models (i.e., the models that don't update the word representation), and sampled 100 distinct hyperparameter sets for the updating models (i.e., the models that do). For each set of hyperparameters and task, we train a model over its training set and choose the best one based on its performance on the development data (Turian et al., 2010). We also tune the word representation hyperparameters — namely, the word vector size d and context window size m (distributed representations), and in the case of BROWN, the number of clusters.

For the updating models, we found that the results over the test data were always inferior to those of the models that do not update the word representations, due to the higher number of hyperparameters and the small sample size (i.e., 100). Since the two-layer model of the graph transformer contains a distinct set of hyperparameters for each layer, we reuse the best-performing hyperparameter settings from the non-updating models, and only tune the hyperparameters of AdaGrad for the word representation layer. This method requires only 32 additional runs and achieves consistently better results than 100 random draws.

In order to test the impact of the volume of training data on the different models (RQ2), we split the training set into 10 partitions based on a base-2 log scale (i.e., the second smallest partition will be twice the size of the smallest partition), and created 10 successively larger training sets by merging these partitions from the smallest one to the largest one, and used each of these to train a model. From these, we construct learning curves over each task.
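One way to realise the described split (assuming the partition sizes are exactly proportional to powers of two, which the paper does not state explicitly) is sketched below: the cumulative merge of the partitions gives the 10 training-set sizes used for the learning curves.

def learning_curve_sizes(n_train, n_partitions=10):
    weights = [2 ** i for i in range(n_partitions)]          # partition i is twice as large as i-1
    total = sum(weights)                                     # 2^10 - 1 = 1023
    sizes = [round(n_train * w / total) for w in weights]
    cumulative, acc = [], 0
    for s in sizes:
        acc += s
        cumulative.append(acc)
    cumulative[-1] = n_train                                 # the largest training set is the full data
    return cumulative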

7 https://catalog.ldc.upenn.edu/LDC2001T02


Task          Training                   Development               In-domain Test            Out-of-domain Test
POS tagging   WSJ Sec. 0–18              WSJ Sec. 19–21            WSJ Sec. 22–24            EWT
Chunking      WSJ                        WSJ (1K sentences)        WSJ (CoNLL-00 test)       Brown
NER           Reuters (CoNLL-03 train)   Reuters (CoNLL-03 dev)    Reuters (CoNLL-03 test)   MUC7
MWE           EWT (500 docs)             EWT (100 docs)            EWT (123 docs)            —

Table 2: Training, development and test (in- and out-of-domain) data for each sequence labelling task.

For ease of comparison with previous results, we evaluate both in- and out-of-domain using chunk/entity/expression-level F1-measure ("F1") for all tasks except POS tagging, for which we use token-level accuracy ("ACC"). To test performance over OOV (unknown) tokens — i.e., the words that do not occur in the training set — we use token-level accuracy for all tasks (e.g., for Chunking, we evaluate whether the full IOB tag is correct or not), because chunks/NEs/MWEs can consist of a mixture of in-vocabulary and OOV tokens, which makes the use of chunk-based evaluation measures inappropriate.

4 Experimental Results and Discussion

We structure our evaluation by stepping through each of our five research questions (RQ1–5) from the start of the paper. In this, we make reference to: (1) the best-performing method both in- and out-of-domain vs. the state of the art (Table 3); (2) a heat map for each task indicating the convergence rate for each word representation, with and without updating (Figure 2); (3) OOV accuracy both in-domain and out-of-domain for each task (Figure 3); and (4) a visualisation of the impact of updating on word embeddings, based on t-SNE (Figure 4).

RQ1: Are the selected word embeddings better than one-hot unigram features and Brown clusters? As shown in Table 3, the best-performing method for every task except in-domain Chunking is a word embedding method, although the precise method varies greatly. Figure 2, on the other hand, tells a more subtle story: the difference between UNIGRAM and the other word representations is relatively modest, especially as the amount of training data increases. Additionally, the difference between BROWN and the word embedding methods is modest across all tasks. So, the overall answer would appear to be: yes, word embeddings are better than unigrams when there is little training data, but they are not markedly better than Brown clusters.

RQ2: Do word embedding features require less training data? Figure 2 shows that for POS tagging and NER, with only several hundred training instances, word embedding features achieve superior results to UNIGRAM. For example, when trained with 561 instances, the POS tagging model using SKIP-GRAM+UP embeddings is 5.3% above UNIGRAM; and when trained with 932 instances, the NER model using SKIP-GRAM is 11.7% above UNIGRAM. Similar improvements are also found for other types of word embeddings and BROWN when the training set is small. However, all word representations perform similarly for Chunking regardless of training data size. For MWE, BROWN performs slightly better than the other methods when trained with approximately 25% of the training instances. Therefore, we conjecture that the POS tagging and NER tasks benefit more from distributional similarity than Chunking and MWE.

RQ3: Does task-specific updating improve all word embeddings across all tasks? Based on Figure 2, updating of word representations can correct poorly-learned word representations, but can equally harm pre-trained representations due to overfitting. For example, GLOVE performs substantially worse than SKIP-GRAM for both POS tagging and NER without updating, but with updating, the empirical gap relative to the best-performing method becomes smaller. In contrast, SKIP-GRAM performs worse over the test data with updating, despite the results on the development set improving by 1%.

To further investigate the effects of updating, we sampled 60 words and plotted the changes in their word embeddings under updating, using 2-d vector fields generated using matplotlib and t-SNE (van der Maaten and Hinton, 2008). Half of the words were chosen manually to include known word clusters such as days of the week and names of countries; the other half were selected randomly. Additional plots with 100 randomly-sampled words and the top-100 most frequent words, for all the methods and all the tasks, can be found in the supplementary material and at


Task                Benchmark                         In-domain Test set      Out-of-domain Test set
POS tagging (ACC)   0.972 (Toutanova et al., 2003)    0.959 (SKIP-GRAM+UP)    0.910 (SKIP-GRAM)
Chunking (F1)       0.942 (Sha and Pereira, 2003)     0.938 (BROWN b=2000)    0.676 (GLOVE)
NER (F1)            0.893 (Ando and Zhang, 2005)      0.868 (SKIP-GRAM)       0.736 (SKIP-GRAM)
MWE (F1)            0.625 (Schneider et al., 2014a)   0.654 (CBOW+UP)         —

Table 3: State-of-the-art results vs. our best results for in-domain and out-of-domain test sets.

Figure 2: Results for each type of word representation over (a) POS tagging (ACC), (b) Chunking (F1), (c) NER (F1) and (d) MWE (F1), optionally with updating ("+UP"). The y-axis indicates the training data sizes (on a log scale). Green = high performance, and red = low performance, based on a linear scale of the best to worst result for each task.

https://goo.gl/Y8bk2w. In each plot, a single arrow signifies one word, pointing from the position of the original word embedding to the updated representation.

In Figure 4, we show vector field plots for Chunking and NER using SKIP-GRAM embeddings. For Chunking, most of the vectors were changed with similar magnitude, but in very different directions, including within the clusters of days of the week and country names. In contrast, for NER, there was more homogeneous change in word vectors belonging to the same cluster. This greater consistency is further evidence that semantic homogeneity appears to be more beneficial for

NER than Chunking.

RQ4: What is the impact of word embeddings cross-domain and for OOV words? As shown in Table 3, results predictably drop when we evaluate out of domain. The difference is most pronounced for Chunking, where there is an absolute drop in F1 of around 30% for all methods, indicating that word embeddings and unigram features provide similar information for Chunking.

Another interesting observation is that updating often hurts out-of-domain performance because the distribution between domains is different. This suggests that, if the objective is to optimise performance across domains, it is best not to perform


Figure 3: ACC over out-of-vocabulary (OOV) words for in-domain and out-of-domain test sets.

Figure 4: A t-SNE plot of the impact of updating on SKIP-GRAM, for (a) Chunking and (b) NER

updating.

We also analyse performance on OOV words, both in-domain and out-of-domain, in Figure 3.

As expected, word embeddings and BROWN excel in out-of-domain OOV performance. Consistent with our overall observations about cross-domain


generalisation, the OOV results are better when updating is not performed.

RQ5: Overall, are some word embeddings better than others? Comparing the different word embedding techniques over our four sequence labelling tasks, for the different evaluations (overall, out-of-domain and OOV), there is no clear winner among the word embeddings — for POS tagging, SKIP-GRAM appears to have a slight advantage, but this does not generalise to the other tasks.

While the aim of this paper was not to achieve the state of the art over the respective tasks, it is important to concede that our best (in-domain) results for NER, POS tagging and Chunking are slightly worse than the state of the art (Table 3). The 2.7% difference between our NER system and the best-performing system is due to the fact that we use a first-order instead of a second-order CRF (Ando and Zhang, 2005), and for the other tasks, there are similar differences in the learner and the complexity of the features used. Another difference is that we tuned the hyperparameters with random search, to enable replication using the same random seed. In contrast, the hyperparameters for the state-of-the-art methods are tuned more extensively by experts, making them more difficult to reproduce.

5 Related Work

Collobert et al. (2011) proposed a unified neural network framework that learns word embeddings and applied it to POS tagging, Chunking, NER and semantic role labelling. When they combined word embeddings with hand-crafted features (e.g., word suffixes for POS tagging; gazetteers for NER) and applied other tricks like cascading and classifier combination, they achieved state-of-the-art performance. Similarly, Turian et al. (2010) evaluated three different word representations on NER and Chunking, and concluded that unsupervised word representations improved NER and Chunking. They also found that combining different word representations can further improve performance. Guo et al. (2014) also explored different ways of using word embeddings for NER. Owoputi et al. (2013) and Schneider et al. (2014a) found that BROWN clustering enhances Twitter POS tagging and MWE, respectively. Compared to previous work, we consider more word representations, including the most recent work, and evaluate them on more sequence labelling tasks,

wherein the models are trained with training setsof varying size.

Bansal et al. (2014) reported that direct use of word embeddings in dependency parsing did not show improvement. They achieved an improvement only when they performed hierarchical clustering of the word embeddings, and used features extracted from the cluster hierarchy. In a similar vein, Andreas and Klein (2014) explored the use of word embeddings for constituency parsing and concluded that the information contained in word embeddings might duplicate that acquired by a syntactic parser, unless the training set is extremely small. Other syntactic parsing studies that reported improvements from using word embeddings include Koo et al. (2008), Koo et al. (2010), Haffari et al. (2011), Tratz and Hovy (2011) and Chen and Manning (2014).

Word embeddings have also been applied to other (non-sequential) NLP tasks like grammar induction (Spitkovsky et al., 2011), and semantic tasks such as semantic relatedness, synonymy detection, concept categorisation, selectional preference learning and analogy (Baroni et al., 2014; Levy and Goldberg, 2014; Levy et al., 2015).

Huang and Yates (2009) demonstrated that using distributional word representation methods (like TF-IDF and LSA) as features improves the labelling of OOV words, when tested on POS tagging and Chunking. In our study, we evaluate the labelling performance on OOV words for updated vs. non-updated word embedding representations, relative to the training set and with out-of-domain data.

6 Conclusions

We have performed an extensive extrinsic evaluation of four word embedding methods under fixed experimental conditions, and evaluated their applicability to four sequence labelling tasks: POS tagging, Chunking, NER and MWE identification. We found that word embedding features reliably outperformed unigram features, especially with limited training data, but that there was relatively little difference over Brown clusters, and no one embedding method was consistently superior across the different tasks and settings. Word embeddings and Brown clusters were also found to improve out-of-domain performance and performance on OOV words. We expected a performance gap between the fixed and task-updated embeddings, but the observed difference was marginal. Indeed, we found that updating can result in overfitting. We also carried out preliminary analysis of the impact of updating on the vectors, a direction which we intend to pursue further.

7 Acknowledgments

NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.

References

Rie Kubota Ando and Tong Zhang. 2005. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853.

Jacob Andreas and Dan Klein. 2014. How much do word embeddings encode about syntax? In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 822–827, Baltimore, USA.

Timothy Baldwin and Su Nam Kim. 2010. Multiword expressions. In Nitin Indurkhya and Fred J. Damerau, editors, Handbook of Natural Language Processing. CRC Press, Boca Raton, USA, 2nd edition.

Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2014. Tailoring continuous word representations for dependency parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 809–815, Baltimore, USA.

Marco Baroni, Georgiana Dinu, and German Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 238–247, Baltimore, USA.

James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(1):281–305.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18:467–479.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. Technical report, Google.

Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), pages 740–750, Doha, Qatar.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167, Helsinki, Finland.

Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159.

Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Scott Deerwester, and Richard Harshman. 1988. Using latent semantic analysis to improve access to textual information. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 281–285.

Jiang Guo, Wanxiang Che, Haifeng Wang, and Ting Liu. 2014. Revisiting embedding features for simple semi-supervised learning. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), pages 110–120, Doha, Qatar.

Gholamreza Haffari, Marzieh Razavi, and Anoop Sarkar. 2011. An ensemble model that combines syntactic and semantic clustering for discriminative dependency parsing. In ACL 2011 (Short Papers), pages 710–714, Portland, USA.

Lushan Han, Abhay L. Kashyap, Tim Finin, James Mayfield, and Johnathan Weese. 2013. UMBC EBIQUITY CORE: Semantic textual similarity systems. In Proceedings of the Second Joint Conference on Lexical and Computational Semantics, pages 44–52, Atlanta, USA.

Timo Honkela. 1997. Self-organizing maps of words for natural language processing applications. In Proceedings of the International ICSC Symposium on Soft Computing, pages 401–407, Nimes, France.

Fei Huang and Alexander Yates. 2009. Distributional representations for handling sparsity in supervised sequence-labeling. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 495–503, Suntec, Singapore.


Terry Koo, Xavier Carreras, and Michael Collins. 2008. Simple semi-supervised dependency parsing. In Proceedings of ACL-08: HLT, pages 595–603, Columbus, USA.

Terry Koo, Alexander M. Rush, Michael Collins, Tommi Jaakkola, and David Sontag. 2010. Dual decomposition for parsing with non-projective head automata. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP 2010), pages 1288–1298, Cambridge, USA.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, pages 282–289, Williamstown, USA.

Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2177–2185. Curran Associates, Inc.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3(1):211–225.

Wei Li and Andrew McCallum. 2005. Semi-supervised sequence modeling with syntactic topic models. In Proceedings of the National Conference on Artificial Intelligence, Pittsburgh, USA.

Dekang Lin and Xiaoyun Wu. 2009. Phrase clustering for discriminative learning. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 1030–1038, Suntec, Singapore.

Kevin Lund and Curt Burgess. 1996. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2):203–208.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313–330.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. 2013. Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2013), pages 380–390, Atlanta, USA.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), pages 1532–1543, Doha, Qatar.

Fernando Pereira, Naftali Tishby, and Lillian Lee. 1993. Distributional clustering of English words. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 183–190, Columbus, USA.

Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y. Ng. 2007. Self-taught learning: Transfer learning from unlabeled data. In Proceedings of the 24th International Conference on Machine Learning, pages 759–766, Corvallis, USA.

Radim Rehurek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 51–55, Valetta, Malta.

Magnus Sahlgren. 2006. The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. Ph.D. thesis, Institutionen for lingvistik.

Nathan Schneider, Emily Danchik, Chris Dyer, and Noah A. Smith. 2014a. Discriminative lexical semantic segmentation with gaps: Running the MWE gamut. Transactions of the Association for Computational Linguistics, 2(1):193–206.

Nathan Schneider, Spencer Onuffer, Nora Kazour, Emily Danchik, Michael T. Mordowanec, Henrietta Conrad, and Noah A. Smith. 2014b. Comprehensive annotation of multiword expressions in a social web corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation, pages 455–461, Reykjavik, Iceland.

Fei Sha and Fernando Pereira. 2003. Shallow parsing with conditional random fields. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 134–141, Edmonton, Canada.

Valentin I. Spitkovsky, Hiyan Alshawi, Angel X. Chang, and Daniel Jurafsky. 2011. Unsupervised dependency parsing without gold part-of-speech tags. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1281–1290, Edinburgh, UK.

Erik F. Tjong Kim Sang and Sabine Buchholz. 2000. Introduction to the CoNLL-2000 shared task: Chunking. In Proceedings of the 4th Conference on Computational Natural Language Learning (CoNLL-2000), pages 127–132, Lisbon, Portugal.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the 7th Conference on Natural Language Learning (CoNLL-2003), pages 142–147, Edmonton, Canada.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 173–180, Edmonton, Canada.

Stephen Tratz and Eduard Hovy. 2011. A fast, accurate, non-projective, semantically-enriched parser. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1257–1268, Edinburgh, UK.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394, Uppsala, Sweden.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188.

Laurens J.P. van der Maaten and Geoffrey Hinton. 2008. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.



Contrastive Analysis with Predictive Power: Typology Driven Estimation of Grammatical Error Distributions in ESL

Yevgeni Berzak, CSAIL MIT

[email protected]

Roi Reichart, Technion IIT

[email protected]

Boris Katz, CSAIL MIT

[email protected]

Abstract

This work examines the impact of cross-linguistic transfer on grammatical errors in English as Second Language (ESL) texts. Using a computational framework that formalizes the theory of Contrastive Analysis (CA), we demonstrate that language specific error distributions in ESL writing can be predicted from the typological properties of the native language and their relation to the typology of English. Our typology driven model enables us to obtain accurate estimates of such distributions without access to any ESL data for the target languages. Furthermore, we present a strategy for adjusting our method to low-resource languages that lack typological documentation using a bootstrapping approach which approximates native language typology from ESL texts. Finally, we show that our framework is instrumental for linguistic inquiry seeking to identify first language factors that contribute to a wide range of difficulties in second language acquisition.

1 Introduction

The study of cross-linguistic transfer, whereby properties of a native language influence performance in a foreign language, has a long tradition in Linguistics and Second Language Acquisition (SLA). Much of the linguistic work on this topic was carried out within the framework of Contrastive Analysis (CA), a theoretical approach that aims to explain difficulties in second language learning in terms of the relations between structures in the native and foreign languages.

The basic hypothesis of CA was formulated by Lado (1957), who suggested that “we can predict and describe the patterns that will cause difficulty in learning, and those that will not cause difficulty, by comparing systematically the language and culture to be learned with the native language and culture of the student”. In particular, Lado postulated that divergences between the native and foreign languages will negatively affect learning and lead to increased error rates in the foreign language. This and subsequent hypotheses were soon met with criticism, targeting their lack of ability to provide reliable predictions, leading to an ongoing debate on the extent to which foreign language errors can be explained and predicted by examining native language structure.

Differently from the SLA tradition, which emphasizes manual analysis of error case studies (Odlin, 1989), we address the heart of this controversy from a computational, data-driven perspective, focusing on the issue of predictive power. We provide a formalization of the CA framework, and demonstrate that the relative frequency of grammatical errors in ESL can be reliably predicted from the typological properties of the native language and their relation to the typology of English using a regression model.

Tested on 14 languages in a leave-one-out fashion, our model achieves a Mean Absolute Error (MAE) reduction of 21.8% in predicting the language specific relative frequency of the 20 most common ESL structural error types, as compared to the relative frequency of each of the error types in the training data, yielding improvements across all the languages and the large majority of the error types. Our regression model also outperforms a stronger, nearest neighbor based baseline that projects the error distribution of a target language from its typologically closest language.

While our method presupposes the existence of typological annotations for the test languages, we also demonstrate its viability in low-resource scenarios for which such annotations are not available. To address this setup, we present a bootstrapping framework in which the typological features required for prediction of grammatical errors are approximated from automatically extracted ESL morpho-syntactic features using the method of Berzak et al. (2014). Despite the noise introduced in this process, our bootstrapping strategy achieves an error reduction of 13.9% compared to the average frequency baseline.

Finally, the utilization of typological features as predictors enables us to shed light on linguistic factors that could give rise to different error types in ESL. For example, in accordance with common linguistic knowledge, feature analysis of the model suggests that the main contributor to increased rates of determiner omission in ESL is the lack of determiners in the native language. A more complex case of missing pronouns is intriguingly tied by the model to native language subject pronoun marking on verbs.

To summarize, the main contribution of this work is a CA inspired computational framework for learning language specific grammatical error distributions in ESL. Our approach is both predictive and explanatory. It enables us to obtain improved estimates for language specific error distributions without access to ESL error annotations for the target language. Coupling grammatical errors with typological information also provides meaningful explanations for some of the linguistic factors that drive the observed error rates.

The paper is structured as follows. Section 2 surveys related linguistic and computational work on cross-linguistic transfer. Section 3 describes the ESL corpus and the typological data used in this study. In section 4 we motivate our native language oriented approach by providing a variance analysis for ESL errors across native languages. Section 5 presents the regression model for prediction of ESL error distributions. The bootstrapping framework which utilizes automatically inferred typological features is described in section 6. Finally, we present the conclusion and directions for future work in section 7.

2 Related Work

Cross-linguistic transfer was extensively studied in SLA, Linguistics and Psychology (Odlin, 1989; Gass and Selinker, 1992; Jarvis and Pavlenko, 2007). Within this area of research, our work is most closely related to the Contrastive Analysis (CA) framework. Rooted in the comparative linguistics tradition, CA was first suggested by Fries (1945) and formalized by Lado (1957). In essence, CA examines foreign language performance, with a particular focus on learner difficulties, in light of a structural comparison between the native and the foreign languages. From its inception, CA was criticized for the lack of a solid predictive theory (Wardhaugh, 1970; Whitman and Jackson, 1972), leading to an ongoing scientific debate on the relevance of comparison based approaches. Important to our study is that the type of evidence used in this debate typically relies on small scale manual case study analysis. Our work seeks to reexamine the issue of predictive power of CA based methods using a computational, data-driven approach.

Computational work touching on cross-linguistic transfer was mainly conducted in relation to the Native Language Identification (NLI) task, in which the goal is to determine the native language of the author of an ESL text. Much of this work focuses on experimentation with different feature sets (Tetreault et al., 2013), including features derived from the CA framework (Wong and Dras, 2009). A related line of inquiry which is closer to our work deals with the identification of ESL syntactic patterns that are specific to speakers of different native languages (Swanson and Charniak, 2013; Swanson and Charniak, 2014). Our approach differs from this research direction by focusing on grammatical errors, and emphasizing prediction of language specific patterns rather than their identification.

Previous work on grammatical error correction that examined determiner and preposition errors (Rozovskaya and Roth, 2011; Rozovskaya and Roth, 2014) incorporated native language specific priors in models that are otherwise trained on standard English text. Our work extends the native language tailored treatment of grammatical errors to a much larger set of error types. More importantly, that approach is limited by the availability of manual error annotations for the target language in order to obtain the required error counts. Our framework enables us to bypass this annotation bottleneck by predicting language specific priors from typological information.

The current investigation is most closely related to studies that demonstrate that ESL signal can be used to infer pairwise similarities between native languages (Nagata and Whittaker, 2013; Berzak et al., 2014) and in particular, tie the similarities to the typological characteristics of these languages (Berzak et al., 2014). Our work inverts the direction of this analysis by starting with typological features, and utilizing them to predict error patterns in ESL. We also show that the two approaches can be combined in a bootstrapping strategy by first inferring typological properties from automatically extracted morpho-syntactic ESL features, and in turn, using these properties for prediction of language specific error distributions in ESL.

3 Data

3.1 ESL Corpus

We obtain ESL essays from the Cambridge First Certificate in English (FCE) learner corpus (Yannakoudakis et al., 2011), a publicly available subset of the Cambridge Learner Corpus (CLC)1. The corpus contains upper-intermediate level essays by native speakers of 16 languages2. Discarding Swedish and Dutch, which have only 16 documents combined, we take into consideration the remaining 14 languages, with the corresponding number of documents in parentheses: Catalan (64), Chinese (66), French (146), German (69), Greek (74), Italian (76), Japanese (82), Korean (86), Polish (76), Portuguese (68), Russian (83), Spanish (200), Thai (63) and Turkish (75). The resulting dataset contains 1228 documents with an average of 379 words per document.

The FCE corpus has an elaborate error annotation scheme (Nicholls, 2003) and high quality error annotations, making it particularly suitable for our investigation. The annotation scheme encompasses 75 different error types, covering a wide range of grammatical errors on different levels of granularity. As the typological features used in this work refer mainly to structural properties, we filter out spelling errors, punctuation errors and open class semantic errors, remaining with a list of grammatical errors that are typically related to language structure. We focus on the 20 most frequent error types3 in this list, which are presented and exemplified in Table 1. In addition to concentrating on the most important structural ESL errors, this cutoff prevents us from being affected by data sparsity issues associated with less frequent errors.

1 http://www.cambridge.org/gb/elt/catalogue/subject/custom/item3646603

2 We plan to extend our analysis to additional proficiency levels and languages when error annotated data for these learner profiles becomes publicly available.

3 Filtered errors that would have otherwise appeared in the top 20 list, with their respective rank in brackets: Spelling (1), Replace Punctuation (2), Replace Verb (3), Missing Punctuation (7), Replace (8), Replace Noun (9), Unnecessary Punctuation (13), Replace Adjective (18), Replace Adverb (20).

3.2 Typological Database

We use the World Atlas of Language Structures (WALS; Dryer and Haspelmath, 2013), a repository of typological features of the world's languages, as our source of linguistic knowledge about the native languages of the ESL corpus authors. The features in WALS are divided into 11 categories: Phonology, Morphology, Nominal Categories, Nominal Syntax, Verbal Categories, Word Order, Simple Clauses, Complex Sentences, Lexicon, Sign Languages and Other. Table 2 presents examples of WALS features belonging to different categories. The features can be associated with different variable types, including binary, categorical and ordinal, making their encoding a challenging task. Our strategy for addressing this issue is feature binarization (see section 5.3).

An important challenge introduced by the WALS database is incomplete documentation. Previous studies (Daume III, 2009; Georgi et al., 2010) have estimated that only 14% of all the language-feature combinations in the database have documented values. While this issue is most acute for low-resource languages, even the well studied languages in our ESL dataset are lacking a significant portion of the feature values, inevitably hindering the effectiveness of our approach.

We perform several preprocessing steps in order to select the features that will be used in this study. First, as our focus is on structural features that can be expressed in written form, we discard all the features associated with the categories Phonology, Lexicon4, Sign Languages and Other. We further discard 24 features which either have a documented value for only one language, or have the same value in all the languages. The resulting feature set contains 119 features, with an average of 2.9 values per feature, and 92.6 documented features per language.
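The feature selection described above can be sketched as follows, assuming the WALS data has been loaded into a {language: {feature_id: value}} mapping with undocumented values simply absent; the data structures and names are illustrative assumptions rather than the actual preprocessing code:

EXCLUDED_CATEGORIES = {"Phonology", "Lexicon", "Sign Languages", "Other"}

def filter_wals_features(wals, feature_category):
    """wals: {language: {feature_id: value}}; feature_category: {feature_id: category}.
    Keeps structural features that are documented for more than one language
    and that do not take the same value in every documented language."""
    all_features = {f for values in wals.values() for f in values}
    kept = []
    for f in sorted(all_features):
        if feature_category.get(f) in EXCLUDED_CATEGORIES:
            continue
        observed = [values[f] for values in wals.values() if f in values]
        if len(observed) <= 1 or len(set(observed)) == 1:
            continue  # documented for only one language, or constant across languages
        kept.append(f)
    return kept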

4 Variance Analysis of Grammatical Errors in ESL

To motivate a native language based treatment of grammatical error distributions in ESL, we begin by examining whether there is a statistically significant difference in ESL error rates based on the native language of the learners.

4 The discarded Lexicon features refer to properties such as the number of words in the language that denote colors, and identity of word pairs such as “hand” and “arm”.


Rank | Code | Name | Example | Count | KW | MW
1 | TV | Verb Tense | I hope I give have given you enough details | 3324 | ** | 34
2 | RT | Replace Preposition | on in July | 3311 | ** | 31
3 | MD | Missing Determiner | I went for the interview | 2967 | ** | 57
4 | FV | Wrong Verb Form | had time to played play | 1789 | ** | 21
5 | W | Word Order | Probably our homes will probably be | 1534 | ** | 34
6 | MT | Missing Preposition | explain to you | 1435 | ** | 22
7 | UD | Unnecessary Determiner | a course at the Cornell University | 1321 | |
8 | UT | Unnecessary Preposition | we need it on each minute | 1079 | |
9 | MA | Missing Pronoun | because it is the best conference | 984 | ** | 33
10 | AGV | Verb Agreement | the teachers was were very experienced | 916 | ** | 21
11 | FN | Wrong Form Noun | because of my study studies | 884 | ** | 24
12 | RA | Replace Pronoun | she just met Sally, which who | 847 | ** | 17
13 | AGN | Noun Agreement | two month months ago | 816 | ** | 24
14 | RD | Replace Determiner | of a the last few years | 676 | ** | 35
15 | DJ | Wrongly Derived Adjective | The mother was pride proud | 608 | * | 8
16 | DN | Wrongly Derived Noun | working place workplace | 536 | |
17 | DY | Wrongly Derived Adverb | Especial Especially | 414 | ** | 14
18 | UA | Unnecessary Pronoun | feel ourselves comfortable | 391 | * | 9
19 | MC | Missing Conjunction | reading, and playing piano at home | 346 | * | 11
20 | RC | Replace Conjunction | not just the car, and but also the train | 226 | |

Table 1: The 20 most frequent error types in the FCE corpus that are related to language structure. In the Example column, words marked in italics are corrections for the words marked in bold. The Count column lists the overall count of each error type in the corpus. The KW column depicts the result of the Kruskal-Wallis test whose null hypothesis is that the relative error frequencies for different native languages are drawn from the same distribution. Error types for which this hypothesis is rejected with p < 0.01 are denoted with '*'. Error types with p < 0.001 are marked with '**'. The MW column denotes the number of language pairs (out of the total 91 pairs) which pass the post-hoc Mann-Whitney test with p < 0.01.

ID | Category | Name | Values
23A | Morphology | Locus of Marking in the Clause | No case marking, Core cases only, Core and non-core, No syncretism
67A | Verbal Categories | The Future Tense | Inflectional future, No inflectional future
30A | Nominal Categories | Number of Genders | None, Two, Three, Four, Five or more
87A | Word Order | Order of Adjective and Noun | AN, NA, No dominant order, Only internally headed relative clauses

Table 2: Examples of WALS features.

This analysis provides empirical justification for our approach, and to the best of our knowledge was not conducted in previous studies.

To this end, we perform a Kruskal-Wallis (KW) test (Kruskal and Wallis, 1952) for each error type5. We treat the relative error frequency per word in each document as a sample6 (i.e. the relative frequencies of all the error types in a document sum to 1). The samples are associated with 14 groups, according to the native language of the document's author. For each error type, the null hypothesis of the test is that error fraction samples of all the native languages are drawn from the same underlying distribution. In other words, rejection of the null hypothesis implies a significant difference between the relative error frequencies of at least one language pair.

As shown in Table 1, we can reject the null hypothesis for 16 of the 20 grammatical error types with p < 0.01, where Unnecessary Determiner, Unnecessary Preposition, Wrongly Derived Noun, and Replace Conjunction are the error types that do not exhibit dependence on the native language.

5 We chose the non-parametric KW rank-based test over ANOVA, as according to the Shapiro-Wilk (1965) and Levene (1960) tests, the assumptions of normality and homogeneity of variance do not hold for our data. In practice, the ANOVA test yields similar results to those of the KW test.

6 We also performed the KW test on the absolute error frequencies (i.e. raw counts) per word, obtaining similar results to the ones reported here on the relative frequencies per word.


Furthermore, the null hypothesis can be rejected for 13 error types with p < 0.001. These results suggest that the relative error rates of the majority of the common structural grammatical errors in our corpus indeed differ between native speakers of different languages.

We further extend our analysis by performing pairwise post-hoc Mann-Whitney (MW) tests (Mann and Whitney, 1947) in order to determine the number of language pairs that significantly differ with respect to their native speakers' error fractions in ESL. Table 1 presents the number of language pairs that pass this test with p < 0.01 for each error type. This inspection suggests Missing Determiner as the error type with the strongest dependence on the author's native language, followed by Replace Determiner, Verb Tense, Word Order, Missing Pronoun and Replace Preposition.
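Both tests are available in standard statistics libraries. The following sketch mirrors the procedure described above, assuming per-document relative frequencies of one error type grouped by native language; the variable names are illustrative:

from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

def test_error_type(samples_by_language, alpha=0.01):
    """samples_by_language: {language: [per-document relative frequency of one
    error type]}. Returns the Kruskal-Wallis p-value and the number of language
    pairs whose samples differ under a pairwise Mann-Whitney test at level alpha."""
    _, kw_p = kruskal(*samples_by_language.values())
    significant_pairs = 0
    for l1, l2 in combinations(samples_by_language, 2):
        _, p = mannwhitneyu(samples_by_language[l1], samples_by_language[l2],
                            alternative="two-sided")
        if p < alpha:
            significant_pairs += 1
    return kw_p, significant_pairs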

5 Predicting Language Specific Error Distributions in ESL

5.1 Task Definition

Given a language $l \in L$, our task is to predict for this language the relative error frequency $y_{l,e}$ of each error type $e \in E$, where $L$ is the set of all native languages, $E$ is the set of grammatical errors, and $\sum_e y_{l,e} = 1$.

5.2 Model

In order to predict the error distribution of a native language, we train regression models on individual error types:

$y'_{l,e} = \theta_{l,e} \cdot f(t_l, t_{eng})$   (1)

In this equation $y'_{l,e}$ is the predicted relative frequency of an error of type $e$ for ESL documents authored by native speakers of language $l$, and $f(t_l, t_{eng})$ is a feature vector derived from the typological features of the native language $t_l$ and the typological features of English $t_{eng}$.

The model parameters $\theta_{l,e}$ are obtained using Ordinary Least Squares (OLS) on the training data $D$, which consists of typological feature vectors paired with relative error frequencies of the remaining 13 languages:

$D = \{(f(t_{l'}, t_{eng}), y_{l',e}) \mid l' \in L, l' \neq l\}$   (2)

To guarantee that the individual relative error frequency estimates sum to 1 for each language, we renormalize them to obtain the final predictions:

$\hat{y}_{l,e} = \frac{y'_{l,e}}{\sum_e y'_{l,e}}$   (3)
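A minimal sketch of equations (1)-(3), fitting one OLS model per error type with numpy and renormalizing the predictions; the clipping of negative raw predictions is an added safeguard for the sketch and is not part of the formulation above:

import numpy as np

def predict_error_distribution(X_train, Y_train, x_test):
    """X_train: (n_languages, n_features) typological feature vectors.
    Y_train: (n_languages, n_error_types) relative error frequencies.
    x_test:  (n_features,) feature vector of the held-out language.
    Fits one OLS model per error type (eqs. 1-2) and renormalizes (eq. 3)."""
    n_errors = Y_train.shape[1]
    raw = np.empty(n_errors)
    for e in range(n_errors):
        theta, *_ = np.linalg.lstsq(X_train, Y_train[:, e], rcond=None)
        raw[e] = x_test @ theta
    raw = np.clip(raw, 0.0, None)  # guard against negative raw predictions (not in eq. 3)
    return raw / raw.sum()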

5.3 Features

Our feature set can be divided into two subsets. The first subset, used in a version of our model called Reg, contains the typological features of the native language. In a second version of our model, called RegCA, we also utilize additional features that explicitly encode differences between the typological features of the native language and the typological features of English.

5.3.1 Typological Features

In the Reg model, we use the typological features of the native language that are documented in WALS. As mentioned in section 3.2, WALS features belong to different variable types, and are hence challenging to encode. We address this issue by binarizing all the features. Given $k$ possible values $v_k$ for a given WALS feature $t_i$, we generate $k$ binary typological features of the form:

$f_{i,k}(t_l, t_{eng}) = \begin{cases} 1 & \text{if } t_{l,i} = v_k \\ 0 & \text{otherwise} \end{cases}$   (4)

When a WALS feature of a given language does not have a documented value, all $k$ entries of the feature for that language are assigned the value of 0. This process transforms the original 119 WALS features into 340 binary features.

5.3.2 Divergences from English

In the spirit of CA, in the model RegCA, we also utilize features that explicitly encode differences between the typological features of the native language and those of English. These features are also binary, and take the value 1 when the value of a WALS feature in the native language is different from the corresponding value in English:

$f_i(t_l, t_{eng}) = \begin{cases} 1 & \text{if } t_{l,i} \neq t_{eng,i} \\ 0 & \text{otherwise} \end{cases}$   (5)

We encode 104 such features, in accordance with the typological features of English available in WALS. The features are activated only when a typological feature of English has a corresponding documented feature in the native language. The addition of these divergence features brings the total number of features in our feature set to 444.
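The combined feature construction of sections 5.3.1 and 5.3.2 can be sketched as follows, assuming each language's WALS features are given as a {feature_id: value} dictionary with missing features absent; the data structures are illustrative assumptions:

def build_features(t_l, t_eng, feature_values):
    """t_l, t_eng: {feature_id: value} for the native language and for English.
    feature_values: {feature_id: list of possible values v_k} for the kept WALS features.
    Returns the binarized native-language features (eq. 4) followed by the
    divergence-from-English indicators (eq. 5)."""
    feats = []
    for f in sorted(feature_values):
        for v in feature_values[f]:
            # eq. (4): one indicator per possible value; all zeros if f is undocumented
            feats.append(1.0 if t_l.get(f) == v else 0.0)
    for f in sorted(feature_values):
        if f not in t_eng:
            continue  # eq. (5): defined only for features documented for English
        differs = f in t_l and t_l[f] != t_eng[f]
        feats.append(1.0 if differs else 0.0)
    return feats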


5.4 Results

We evaluate the model predictions using two metrics. The first metric, Absolute Error, measures the distance between the predicted and the true relative frequency of each grammatical error type7:

$\text{Absolute Error} = |\hat{y}_{l,e} - y_{l,e}|$   (6)

When averaged across different predictions we refer to this metric as Mean Absolute Error (MAE).

The second evaluation score is the Kullback-Leibler divergence $D_{KL}$, a standard measure for evaluating the difference between two distributions. This metric is used to evaluate the predicted grammatical error distribution of a native language:

$D_{KL}(y_l \| \hat{y}_l) = \sum_e y_{l,e} \ln \frac{y_{l,e}}{\hat{y}_{l,e}}$   (7)
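Both measures are straightforward to compute; a sketch follows, where the small epsilon used to guard against zero frequencies is an assumption rather than part of the definitions above:

import numpy as np

def absolute_error(y_true, y_pred):
    """|y_hat - y| per error type, multiplied by 100 as in the reported results."""
    return 100 * np.abs(np.asarray(y_pred) - np.asarray(y_true))

def kl_divergence(y_true, y_pred, eps=1e-12):
    """D_KL(y || y_hat) over one language's error distribution; eps avoids log(0)."""
    p = np.asarray(y_true) + eps
    q = np.asarray(y_pred) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))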

Metric | Base | NN | Reg | RegCA
MAE | 1.28 | 1.11 | 1.02 | 1.0
Error Reduction | - | 13.3 | 20.4 | 21.8
#Languages | - | 9/14 | 12/14 | 14/14
#Mistakes | - | 11/20 | 15/20 | 14/20
AVG D_KL | 0.052 | 0.046 | 0.033 | 0.032
#Languages | - | 10/14 | 14/14 | 14/14

Table 3: Results for prediction of relative error frequencies using the MAE metric across languages and error types, and the D_KL metric averaged across languages. #Languages and #Mistakes denote the number of languages and grammatical error types on which a model outperforms Base.

Table 3 summarizes the grammatical error prediction results8. The baseline model Base sets the relative frequencies of the grammatical errors of a test language to the respective relative error frequencies in the training data. We also consider a stronger, language specific model called Nearest Neighbor (NN), which projects the error distribution of a target language from the typologically closest language in the training set, according to the cosine similarity measure. This baseline provides a performance improvement for the majority of the languages and error types, with an average error reduction of 13.3% on the MAE metric compared to Base, and improving from 0.052 to 0.046 on the KL divergence metric, thus emphasizing the general advantage of a native language adapted approach to ESL error prediction.

7 For clarity of presentation, all the reported results on this metric are multiplied by 100.

8 As described in section 5.2, we report the performance of regression models trained and evaluated on relative error frequencies obtained by normalizing the rates of the different error types. We also experimented with training and evaluating the models on absolute error counts per word, obtaining results that are similar to those reported here.

Our regression model introduces further substantial performance improvements. The Reg model, which uses the typological features of the native language for predicting ESL relative error frequencies, achieves a 20.4% MAE reduction over the Base model. The RegCA version of the regression model, which also incorporates differences between the typological features of the native language and English, surpasses the Reg model, reaching an average error reduction of 21.8% from the Base model, with improvements across all the languages and the majority of the error types. Strong performance improvements are also obtained on the KL divergence measure, where the RegCA model scores 0.032, compared to the baseline score of 0.052.

To illustrate the outcome of our approach, consider the example in Table 4, which compares the top 10 predicted errors for Japanese using the Base and RegCA models. In this example, RegCA correctly places Missing Determiner as the most common error in Japanese, with a significantly higher relative frequency than in the training data. Similarly, it provides an accurate prediction for the Missing Preposition error, whose frequency and rank are underestimated by the Base model. Furthermore, RegCA correctly predicts the frequency of Replace Preposition and Word Order to be lower than the average in the training data.

5.5 Feature Analysis

An important advantage of our typology-based approach is the clear semantics of the features, which facilitate the interpretation of the model. Inspection of the model parameters allows us to gain insight into the typological features that are potentially involved in causing different types of ESL errors. Although such inspection is unlikely to provide a comprehensive coverage of all the relevant causes for the observed learner difficulties, it can serve as a valuable starting point for exploratory linguistic analysis and formulation of a cross-linguistic transfer theory.

Table 5 lists the most salient typological features, as determined by the feature weights averaged across the models of different languages, for the error types Missing Determiner and Missing Pronoun.


Rank | Base | Frac. | RegCA | Frac. | True | Frac.
1 | Replace Preposition | 0.14 | Missing Determiner | 0.18 | Missing Determiner | 0.20
2 | Tense Verb | 0.14 | Tense Verb | 0.12 | Tense Verb | 0.12
3 | Missing Determiner | 0.12 | Replace Preposition | 0.12 | Replace Preposition | 0.10
4 | Wrong Verb Form | 0.07 | Missing Preposition | 0.08 | Missing Preposition | 0.08
5 | Word Order | 0.06 | Unnecessary Determiner | 0.06 | Unnecessary Preposition | 0.06
6 | Missing Preposition | 0.06 | Wrong Verb Form | 0.05 | Unnecessary Determiner | 0.05
7 | Unnecessary Determiner | 0.06 | Unnecessary Preposition | 0.05 | Replace Determiner | 0.05
8 | Unnecessary Preposition | 0.04 | Wrong Noun Form | 0.05 | Wrong Verb Form | 0.05
9 | Missing Pronoun | 0.04 | Word Order | 0.05 | Word Order | 0.04
10 | Wrong Noun Form | 0.04 | Verb Agreement | 0.04 | Wrong Noun Form | 0.06

Table 4: Comparison between the fractions and ranks of the top 10 predicted error types by the Base and RegCA models for Japanese. As opposed to the Base method, the RegCA model correctly predicts Missing Determiner to be the most frequent error committed by native speakers of Japanese. It also correctly predicts Missing Preposition to be more frequent and Replace Preposition and Word Order to be less frequent than in the training data.

In the case of determiners, the model identifies the lack of definite and indefinite articles in the native language as the strongest factors related to increased rates of determiner omission. Conversely, features that imply the presence of an article system in the native language, such as ‘Indefinite word same as one’ and ‘Definite word distinct from demonstrative’, are indicative of reduced error rates of this type.

A particularly intriguing example concerns the Missing Pronoun error. The most predictive typological factor for increased pronoun omissions is pronominal subject marking on the verb in the native language. Differently from the case of determiners, it is not the lack of the relevant structure in the native language, but rather its different encoding that seems to drive erroneous pronoun omission. Decreased error rates of this type correlate most strongly with obligatory pronouns in subject position, as well as a verbal person marking system similar to the one in English.

6 Bootstrapping with ESL-based Typology

Thus far, we presupposed the availability of substantial typological information for our target languages in order to predict their ESL error distributions. However, the existing typological documentation for the majority of the world's languages is scarce, limiting the applicability of this approach for low-resource languages.

We address this challenge for scenarios in which an unannotated collection of ESL texts authored by native speakers of the target language is available.

Missing Determiner
37A Definite Articles: Different from English  .057
38A Indefinite Articles: No definite or indefinite article  .055
37A Definite Articles: No definite or indefinite article  .055
49A Number of Cases: 6-7 case  .052
100A Alignment of Verbal Person Marking: Accusative  -.073
38A Indefinite Articles: Indefinite word same as ‘one’  -.050
52A Comitatives and Instrumentals: Identity  -.044
37A Definite Articles: Definite word distinct from demonstrative  -.036

Missing Pronoun
101A Expression of Pronominal Subjects: Subject affixes on verb  .015
71A The Prohibitive: Different from English  .012
38A Indefinite Articles: Indefinite word same as ‘one’  .011
71A The Prohibitive: Special imperative + normal negative  .010
104A Order of Person Markers on the Verb: A & P do not or do not both occur on the verb  -.016
102A Verbal Person Marking: Only the A argument  -.013
101A Expression of Pronominal Subjects: Obligatory pronouns in subject position  -.011
71A The Prohibitive: Normal imperative + normal negative  -.010

Table 5: The most predictive typological features of the RegCA model for the errors Missing Determiner and Missing Pronoun. The right column depicts the feature weight averaged across all the languages. Missing determiners are related to the absence of a determiner system in the native language. Missing pronouns are correlated with subject pronoun marking on the verb.

Given such data, we propose a bootstrapping strategy which uses the method proposed in Berzak et al. (2014) in order to approximate the typology of the native language from morpho-syntactic features in ESL. The inferred typological features serve, in turn, as a proxy for the true typology of that language in order to predict its speakers' ESL grammatical error rates with our regression model.


To put this framework into effect, we use the FCE corpus to train a log-linear model for native language classification using morpho-syntactic features obtained from the output of the Stanford Parser (de Marneffe et al., 2006):

$p(l \mid x; \theta) = \frac{\exp(\theta \cdot f(x, l))}{\sum_{l' \in L} \exp(\theta \cdot f(x, l'))}$   (8)

where $l$ is the native language, $x$ is the observed English document and $\theta$ are the model parameters. We then derive pairwise similarities between languages by averaging the uncertainty of the model with respect to each language pair:

$S'^{ESL}_{l,l'} = \begin{cases} \frac{1}{|D_l|} \sum_{(x,l) \in D_l} p(l' \mid x; \theta) & \text{if } l' \neq l \\ 1 & \text{otherwise} \end{cases}$   (9)

In this equation, $x$ is an ESL document, $\theta$ are the parameters of the native language classification model and $D_l$ is a set of documents whose native language is $l$. For each pair of languages $l$ and $l'$, the matrix $S'^{ESL}$ contains an entry $S'^{ESL}_{l,l'}$, which represents the average probability of confusing $l$ for $l'$, and an entry $S'^{ESL}_{l',l}$, which captures the opposite confusion. A similarity estimate for a language pair is then obtained by averaging these two scores:

$S^{ESL}_{l,l'} = S^{ESL}_{l',l} = \frac{1}{2}\left(S'^{ESL}_{l,l'} + S'^{ESL}_{l',l}\right)$   (10)

As shown in Berzak et al. (2014), given the similarity matrix $S^{ESL}$, one can obtain an approximation for the typology of a native language by projecting the typological features from its most similar languages. Here, we use the typology of the closest language, an approach that yields 70.7% accuracy in predicting the typological features of our set of languages.

In the bootstrapping setup, we train the regression models on the true typology of the languages in the training set, and use the approximate typology of the test language to predict the relative error rates of its speakers in ESL.
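A sketch of the bootstrapping step (equations (8)-(10) followed by nearest-language projection), assuming a trained native-language classifier whose posteriors over documents are already available as an array; names and data layout are illustrative:

import numpy as np

def esl_similarity(posteriors, labels, n_languages):
    """posteriors: (n_docs, n_languages) array of p(l'|x) from the NLI classifier.
    labels: (n_docs,) array of true native-language indices.
    Returns the symmetrized similarity matrix S_ESL of eqs. (9)-(10)."""
    s_prime = np.eye(n_languages)
    for l in range(n_languages):
        docs = posteriors[labels == l]
        for l2 in range(n_languages):
            if l2 != l:
                s_prime[l, l2] = docs[:, l2].mean()  # average confusion of l for l2
    return 0.5 * (s_prime + s_prime.T)               # eq. (10)

def project_typology(similarity, wals_by_language, target):
    """Approximate the target language's WALS features by copying those of its
    most similar language (excluding the target itself)."""
    scores = similarity[target].copy()
    scores[target] = -np.inf
    nearest = int(np.argmax(scores))
    return dict(wals_by_language[nearest])

The projected feature dictionary can then be passed through the same feature construction as in section 5.3 before applying the regression models.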

6.1 Results

Table 6 summarizes the error prediction results using approximate typological features for the test languages. As can be seen, our approach continues to provide substantial performance gains despite the inaccuracy of the typological information used for the test languages. The best performing method, RegCA, reduces the MAE of Base by 13.9%, with performance improvements for most of the languages and error types. Performance gains are also obtained on the D_KL metric, whereby RegCA scores 0.041, compared to the Base score of 0.052, improving on 11 out of our 14 languages.

Metric | Base | NN | Reg | RegCA
MAE | 1.28 | 1.12 | 1.13 | 1.10
Error Reduction | - | 12.6 | 11.6 | 13.9
#Languages | - | 11/14 | 11/14 | 11/14
#Mistakes | - | 10/20 | 10/20 | 11/20
AVG D_KL | 0.052 | 0.048 | 0.043 | 0.041
#Languages | - | 10/14 | 11/14 | 11/14

Table 6: Results for prediction of relative error frequencies using the bootstrapping approach. In this setup, the true typology of the test language is substituted with approximate typology derived from morpho-syntactic ESL features.

7 Conclusion and Future Work

We present a computational framework for predicting native language specific grammatical error distributions in ESL, based on the typological properties of the native language and their compatibility with the typology of English. Our regression model achieves substantial performance improvements as compared to a language oblivious baseline, as well as a language dependent nearest neighbor baseline. Furthermore, we address scenarios in which the typology of the native language is not available, by bootstrapping typological features from ESL texts. Finally, inspection of the model parameters allows us to identify native language properties which play a pivotal role in generating different types of grammatical errors.

In addition to the theoretical contribution, the outcome of our work has a strong potential to be beneficial in practical setups. In particular, it can be utilized for developing educational curricula that focus on the areas of difficulty that are characteristic of different native languages. Furthermore, the derived error frequencies can be integrated as native language specific priors in systems for automatic error correction. In both application areas, previous work relied on the existence of error tagged ESL data for the languages of interest. Our approach paves the way for addressing these challenges even in the absence of such data.


Acknowledgments

This material is based upon work supported by the Center for Brains, Minds, and Machines (CBMM), funded by NSF STC award CCF-1231216.

References

Yevgeni Berzak, Roi Reichart, and Boris Katz. 2014. Reconstructing native language typology from foreign language usage. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 21–29. Association for Computational Linguistics, June.

Hal Daume III. 2009. Non-parametric Bayesian areal linguistics. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 593–601. Association for Computational Linguistics.

Marie-Catherine de Marneffe, Bill MacCartney, Christopher D. Manning, et al. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC, volume 6, pages 449–454.

Charles C. Fries. 1945. Teaching and Learning English as a Foreign Language.

Susan M. Gass and Larry Selinker. 1992. Language Transfer in Language Learning: Revised edition, volume 5. John Benjamins Publishing.

Ryan Georgi, Fei Xia, and William Lewis. 2010. Comparing language similarity across genetic and typologically-based groupings. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 385–393. Association for Computational Linguistics.

Scott Jarvis and Aneta Pavlenko. 2007. Crosslinguistic Influence in Language and Cognition. Routledge.

William H. Kruskal and W. Allen Wallis. 1952. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47(260):583–621.

Robert Lado. 1957. Linguistics Across Cultures: Applied Linguistics for Language Teachers.

Howard Levene. 1960. Robust tests for equality of variances. Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling, 2:278–292.

Henry B. Mann and Donald R. Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, pages 50–60.

Ryo Nagata and Edward Whittaker. 2013. Reconstructing an Indo-European family tree from non-native English texts. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 1137–1147, Sofia, Bulgaria. Association for Computational Linguistics.

Diane Nicholls. 2003. The Cambridge Learner Corpus: Error coding and analysis for lexicography and ELT. In Proceedings of the Corpus Linguistics 2003 Conference, pages 572–581.

Terence Odlin. 1989. Language Transfer: Cross-linguistic Influence in Language Learning. Cambridge University Press.

Alla Rozovskaya and Dan Roth. 2011. Algorithm selection and model adaptation for ESL correction tasks. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 924–933. Association for Computational Linguistics.

Alla Rozovskaya and Dan Roth. 2014. Building a state-of-the-art grammatical error correction system. Transactions of the Association for Computational Linguistics, 2(10):419–434.

Samuel Sanford Shapiro and Martin B. Wilk. 1965. An analysis of variance test for normality (complete samples). Biometrika, pages 591–611.

Ben Swanson and Eugene Charniak. 2013. Extracting the native language signal for second language acquisition. In HLT-NAACL, pages 85–94.

Ben Swanson and Eugene Charniak. 2014. Data driven language transfer hypotheses. EACL 2014, page 169.

Joel Tetreault, Daniel Blanchard, and Aoife Cahill. 2013. A report on the first native language identification shared task. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 48–57. Citeseer.

Ronald Wardhaugh. 1970. The contrastive analysis hypothesis. TESOL Quarterly, pages 123–130.

Randal L. Whitman and Kenneth L. Jackson. 1972. The unpredictability of contrastive analysis. Language Learning, 22(1):29–41.

Sze-Meng Jojo Wong and Mark Dras. 2009. Contrastive analysis and native language identification. In Proceedings of the Australasian Language Technology Association Workshop, pages 53–61. Citeseer.

Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In ACL, pages 180–189.



Cross-lingual syntactic variation over age and gender

Anders Johannsen, Dirk Hovy, and Anders Søgaard
Center for Language Technology
University of Copenhagen, Denmark
Njalsgade 140

{ajohannsen,dirk.hovy,soegaard}@hum.ku.dk

Abstract

Most computational sociolinguistics studies have focused on phonological and lexical variation. We present the first large-scale study of syntactic variation among demographic groups (age and gender) across several languages. We harvest data from online user-review sites and parse it with universal dependencies. We show that several age and gender-specific variations hold across languages, for example that women are more likely to use VP conjunctions.

1 Introduction

Language varies between demographic groups. To detect this variation, sociolinguistic studies require both a representative corpus of text and meta-information about the speakers. Traditionally, this data was collected from a combination of interview transcriptions and questionnaires. Both methods are time-consuming, so population sizes have been small, sometimes including less than five subjects (Rickford and Price, 2013). While these resources enable detailed qualitative analyses, small sample sizes may lead to false research findings (Button et al., 2013). Sociolinguistic studies, in other words, often lack statistical power to establish relationships between language use and socio-economic variables.

Obtaining large enough data sets becomes even more challenging the more complex the target variables are. So while syntactic variation has been identified as an important factor of variation (Cheshire, 2005), it was not approached, due to its high complexity. This paper addresses the issue systematically on a large scale. In contrast to previous work in both sociolinguistics and NLP, we consider syntactic variation across groups at the level of treelets, as defined by dependency structures, and make use of a large corpus that includes demographic information on both age and gender.

The impact of such findings goes beyond sociolinguistic insights: knowledge about systematic differences among demographic groups can help us build better and fairer NLP tools. Volkova et al. (2013), Hovy and Søgaard (2015), Jørgensen et al. (2015), and Hovy (2015) have shown the impact of demographic factors on NLP performance. Recently, the company Textio introduced a tool to help phrase job advertisements in a gender-neutral way.1 While their tool addresses lexical variation, our results indicate that linguistic differences extend to the syntactic level.

Previous work on demographic variation in both sociolinguistics and NLP has begun to rely on corpora from social media, most prominently Twitter. Twitter offers a sufficiently large data source with broad coverage (albeit limited to users with access to social media). Indeed, results show that this resource reflects the phonological and morpho-lexical variation of spoken language (Eisenstein, 2013b; Eisenstein, 2013a; Doyle, 2014).

However, Twitter is not well-suited for the study of syntactic variation for two reasons. First, the limited length of the posts compels the users to adopt a terse style that leaves out many grammatical markers. As a consequence, the performance of syntactic parsers is prohibitively low for linguistic analysis in this domain. Second, Twitter provides little meta-information about the users, except for regional origin and time of posting. Existing work has thus been restricted to these demographic variables. One line of research has focused on predictive models for age and gender (Alowibdi et al., 2013; Ciot et al., 2013) to add meta-data on Twitter, but again, error rates are too high for use in sociolinguistic hypothesis testing.

We use a new source of data, namely the user review site Trustpilot. The meta-information on Trustpilot is both more prevalent and more reliable, and textual data is not restricted in length (see Table 2). We use state-of-the-art dependency parsers trained on universal treebanks (McDonald et al., 2013) to obtain comparable syntactic analyses across several different languages and demographics.

1 http://recode.net/2015/04/20/textio-spell-checks-for-gender-bias/

Contributions We present the first study of morpho-syntactic variation with respect to demographic variables across several languages at a large scale. We collect syntactic features within demographic groups and analyze them to retrieve the most significant differences. For the analysis we use a method that preserves statistical power, even when the number of possible syntactic features is very large. Our results show that demographic differences extend beyond lexical choice.

2 Data collection

The TRUSTPILOT CORPUS consists of user reviews from the Trustpilot website. On Trustpilot, users can review company websites and leave a one to five star rating, as well as a written review. The data is available for 24 countries, using 13 different languages (Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish). In our study, we are limited by the availability of comparable syntactically annotated corpora (McDonald et al., 2013) to five languages used in eleven countries, i.e., English (Australia, Canada, UK, and US), French (Belgium and France), German (Switzerland and Germany), Italian, Spanish, and Swedish. We treat the different variants of these languages separately in the experiments below.[2]

Many users opt to provide a public profile. There are no mandatory fields other than name, but many users also supply their birth year, gender, and location. We crawl the publicly available information on the website for users and reviews, with different fields. Table 1 contains a list of the fields that are available for each type of entity. For more information on the data as a source for demographic information, see Hovy et al. (2015).

[2] While this might miss some dialectal idiosyncrasies, it is based on standard NLP practice, e.g., when using WSJ-trained parsers in translation of (British) Europarl.

Users:    Name, ID, profile text, location (city and country), gender, year of birth
Reviews:  Title, text, rating (1–5), User ID, Company ID, Date and time of review

Table 1: Meta-information in TRUSTPILOT data

We enhance the data set for our analysis by adding gender information based on first names. In order to add missing gender information, we measure the distribution over genders for each name. If a name occurs with sufficient frequency and is found predominantly with one gender, we propagate this gender to all occurrences of the name that lack gender information. In our experiments, we used a gender-purity factor of 0.95 (the name occurs with one gender 95% of the time) and a minimum frequency of 3 (the name appears at least 3 times in the data). Since names are language-specific (Angel is male in Spanish, but female in English), we run this step separately for each language. On average, this measure doubled the amount of gender information for a language.
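The propagation step just described can be sketched in a few lines of Python. This is our own minimal illustration (to be run once per language), not the authors' code, and the record format is hypothetical:

from collections import Counter, defaultdict

def propagate_gender(users, purity=0.95, min_count=3):
    """Fill in missing gender labels from first names, as described above.

    `users` is a list of dicts with (at least) 'name' and 'gender' keys,
    where 'gender' is 'M', 'F', or None. Thresholds follow the paper:
    a name must occur at least `min_count` times and at least `purity`
    of its gendered occurrences must agree.
    """
    by_name = defaultdict(Counter)
    for u in users:
        first = u["name"].split()[0].lower()
        if u["gender"] in ("M", "F"):
            by_name[first][u["gender"]] += 1

    # Majority gender per name, kept only if frequent and pure enough.
    name_gender = {}
    for name, counts in by_name.items():
        gender, freq = counts.most_common(1)[0]
        total = sum(counts.values())
        if total >= min_count and freq / total >= purity:
            name_gender[name] = gender

    filled = 0
    for u in users:
        if u["gender"] not in ("M", "F"):
            first = u["name"].split()[0].lower()
            if first in name_gender:
                u["gender"] = name_gender[first]
                filled += 1
    return filled

# Example with hypothetical records:
users = [{"name": "Anna B.", "gender": "F"}] * 3 + [{"name": "Anna K.", "gender": None}]
print(propagate_gender(users), users[-1]["gender"])  # 1 F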

Note that the domain (reviews) potentially introduces a bias, but since our analysis is largely at the syntactic level, we expect the effect to be limited. While there is certainly a domain effect at the lexical level, we assume that the syntactic findings generalize better to other domains.

Country       Users    Age   Gender  Place  All
UK            1,424k    7%     62%     5%    4%
France          741k    3%     53%     2%    1%
Denmark         671k   23%     87%    17%   16%
US              648k    8%     59%     7%    4%
Netherlands     592k    9%     39%     7%    5%
Germany         329k    8%     47%     6%    4%
Sweden          170k    5%     64%     4%    3%
Italy           132k   10%     61%     8%    6%
Spain            56k    6%     37%     5%    3%
Norway           51k    5%     50%     4%    3%
Belgium          36k   13%     42%    11%    8%
Australia        31k    8%     36%     7%    5%
Finland          16k    6%     36%     5%    3%
Austria          15k   10%     43%     7%    5%
Switzerland      14k    8%     41%     7%    4%
Canada           12k   10%     19%     9%    4%
Ireland          12k    8%     30%     7%    4%

Table 2: No. of users per variable per country (after augmentations), for countries with 10k+ users.

3 Methodology

For each language, we train a state-of-the-art dependency parser (Martins et al., 2013) on a treebank annotated with the Stanford dependency labels (McDonald et al., 2013) and the universal POS tag set (Petrov et al., 2011). This gives us syntactic analyses across all languages that describe the same syntactic phenomena in the same way. Figure 1 shows two corpus sentences annotated with this harmonized representation.

The style of the reviews is much more canonical than social web data, say, Twitter. Expected parse performance can be estimated from the SANCL 2012 shared task on dependency parsing of web data (Petrov and McDonald, 2012). The best result on the review domain there was 83.86 LAS and 88.31 UAS, close to the average over all web domains (83.45 LAS and 87.62 UAS).

From the parses, we extract all subtrees of up to three tokens (treelets). We do not distinguish between right- and left-branching relations: the representation is basically a "bag of relations". The purpose of this is to increase comparability across languages with different word orderings (Naseem et al., 2012). A one-token treelet is simply the POS tag of the token, e.g. NOUN or VERB. A two-token treelet is a typed relation between head and dependent, e.g. VERB -nsubj-> NOUN. Treelets of three tokens have two possible structures. Either the head directly dominates two tokens, or the tokens are linked together in a chain, as shown below:

  NOUN <-nsubj- VERB -dobj-> NOUN    (head dominating two tokens)
  PRON <-poss- NOUN <-nsubj- VERB    (chain)
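As an illustration of the treelet inventory, the following Python sketch (our own, with a hypothetical token format) extracts one-, two- and three-token treelets from a single parsed sentence, ignoring branching direction:

from itertools import combinations

def extract_treelets(tokens):
    """Extract treelets of up to three tokens from one parsed sentence.

    `tokens` is a list of dicts with 'pos', 'head' (0-based index of the
    head token, or None for the root) and 'label' (dependency relation).
    Treelets ignore linear order, so each one is encoded as a canonical
    tuple ("bag of relations").
    """
    treelets = []
    pos = [t["pos"] for t in tokens]

    # One-token treelets: just the POS tag.
    treelets.extend(pos)

    # Two-token treelets: head --label--> dependent.
    deps_of = {i: [] for i in range(len(tokens))}
    for i, t in enumerate(tokens):
        if t["head"] is not None:
            h = t["head"]
            treelets.append((pos[h], t["label"], pos[i]))
            deps_of[h].append(i)

    # Three-token treelets, chain type: a --l1--> b --l2--> c.
    for b, t in enumerate(tokens):
        if t["head"] is None:
            continue
        a = t["head"]
        for c in deps_of[b]:
            treelets.append((pos[a], t["label"], pos[b], tokens[c]["label"], pos[c]))

    # Three-token treelets, sibling type: one head dominating two dependents.
    for h, deps in deps_of.items():
        for i, j in combinations(deps, 2):
            # Sort the two (label, POS) pairs so their order does not matter.
            left, right = sorted([(tokens[i]["label"], pos[i]), (tokens[j]["label"], pos[j])])
            treelets.append((pos[h], left, right))
    return treelets

# "Je recommande vivement ce site" (0-based heads; the root has head None)
sent = [
    {"pos": "PRON", "head": 1, "label": "nsubj"},
    {"pos": "VERB", "head": None, "label": "root"},
    {"pos": "ADV", "head": 1, "label": "advmod"},
    {"pos": "DET", "head": 4, "label": "det"},
    {"pos": "NOUN", "head": 1, "label": "dobj"},
]
print(extract_treelets(sent)[:8])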

3.1 Treelet reduction

We extract between 500,000 and a million distinct treelets for each language. In principle, we could directly check for significant differences between the demographic groups and use Bonferroni correction to control the family-wise error (i.e., the probability of obtaining a false positive). However, given the large number of treelets, the correction for multiple comparisons would underpower our analyses and potentially cause us to miss many significant differences. We therefore reduce the number of treelets by two methods.

First, we set the minimum number of occurrences of a feature in each language to 50. We apply this heuristic both to ensure statistical power and to focus our analyses on prevalent rather than rare syntactic phenomena.

Second, we perform feature selection using L1-randomized logistic regression models, with age or gender as the target variable and the treelets as input features. However, direct feature selection with L1-regularized models (Ng, 2004) is problematic when variables are highly correlated (as in our treelets, where e.g. three-token structures can subsume smaller ones). As a result, small and inessential variations in the dataset can determine which of the variables are selected to represent the group, so we end up with random within-group feature selection.

We therefore use stability selection (Meinshausen and Bühlmann, 2010). Stability selection mitigates the correlation problem by fitting the logistic regression model hundreds of times with perturbed data (75% subsampling and feature-wise regularization scaling). Features that receive non-zero weights across many runs can be assumed to be highly indicative. Stability selection thus gives all features a chance to be selected. It controls the false positive rate, which is less conservative than family-wise error. We use the default parameters of a publicly available stability selection implementation,[3] run it on the whole data set, and discard features selected less than 50% of the time.
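The implementation referenced in footnote 3 was presumably scikit-learn's (since removed) RandomizedLogisticRegression; the sketch below spells the same procedure out directly, under our own choice of perturbation parameters:

import numpy as np
from sklearn.linear_model import LogisticRegression

def stability_selection(X, y, n_runs=200, subsample=0.75, threshold=0.5, seed=0):
    """Return indices of features selected in >= `threshold` of the runs.

    X: (n_samples, n_features) treelet count matrix; y: 0/1 group labels.
    Each run fits an L1-penalized logistic regression on a 75% subsample
    with a random per-feature rescaling of the columns (perturbed data).
    """
    rng = np.random.RandomState(seed)
    n, d = X.shape
    selected = np.zeros(d)
    for _ in range(n_runs):
        idx = rng.choice(n, size=int(subsample * n), replace=False)
        # Feature-wise regularization scaling: randomly shrink some columns.
        scale = rng.choice([0.5, 1.0], size=d)
        clf = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")
        clf.fit(X[idx] * scale, y[idx])
        selected += (np.abs(clf.coef_[0]) > 1e-8)
    return np.where(selected / n_runs >= threshold)[0]

# Toy example: 100 samples, 20 count features, only feature 0 is informative.
rng = np.random.RandomState(1)
X = rng.poisson(1.0, size=(100, 20)).astype(float)
y = (X[:, 0] + rng.normal(0, 0.5, 100) > 1).astype(int)
print(stability_selection(X, y, n_runs=50))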

With the reduced feature set, we check for usage differences between demographic groups (age and gender) using a χ² test. We distinguish two age groups: speakers that are younger than 35, and speakers older than 45. These thresholds were chosen to balance the size of both groups. At this stage we set the desired p-value at 0.02 and apply Bonferroni correction, effectively dividing the p-value threshold by the number of remaining treelets.[4]

Note, finally, that the average number of words written by a reviewer differs between the demographic groups (younger users tend to write more than older ones, women more than men). To counteract this effect, the expected counts in our null hypothesis use the proportion of words written by people in a group, rather than the proportion of people in the group (which would skew the results towards the groups with longer reviews).
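A minimal sketch of the per-treelet test as described, with expected counts proportional to the words written by each group and a Bonferroni-corrected threshold (scipy assumed; the data format is hypothetical):

from scipy.stats import chisquare

def group_difference_tests(counts, words_per_group, alpha=0.02):
    """Bonferroni-corrected chi-square tests, one per treelet.

    counts: dict treelet -> (count_in_group_A, count_in_group_B)
    words_per_group: (total_words_A, total_words_B); expected counts are
    proportional to words written, not to the number of users, so groups
    that write longer reviews are not over-weighted.
    """
    w_a, w_b = words_per_group
    p_a = w_a / (w_a + w_b)
    cutoff = alpha / len(counts)          # Bonferroni correction
    significant = {}
    for treelet, (c_a, c_b) in counts.items():
        total = c_a + c_b
        expected = [total * p_a, total * (1 - p_a)]
        stat, p = chisquare([c_a, c_b], f_exp=expected)
        if p < cutoff:
            # Which group uses the treelet more, relative to expectation?
            high = "A" if c_a / expected[0] > c_b / expected[1] else "B"
            significant[treelet] = (high, p)
    return significant

# Toy example: group A wrote 40% of all words.
counts = {("VERB", "conj", "VERB"): (1100, 1200), ("NOUN",): (4000, 6000)}
print(group_difference_tests(counts, words_per_group=(400_000, 600_000)))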

[3] http://scikit-learn.org/
[4] Choosing a p-value is somewhat arbitrary. Effectively, our p-value cutoff is several orders of magnitude lower than 0.02, due to the Bonferroni correction.

[Figure 1: Universal dependency relations for an English and a French sentence, both with adverbial modifiers. English: "My husband especially likes the bath oils" (PRON NOUN ADV VERB DET NOUN NOUN; relations poss, nsubj, advmod, dobj, det, compmod). French: "Je recommande vivement ce site" (PRON VERB ADV DET NOUN; relations nsubj, advmod, dobj, det).]

Signif. in # lang. | Rank | Feature                            | High | By   | Subsumes
11                 | 1    | NUM                                | M    | 32%  |
                   | 2    | PRON                               | F    | 11%  |
                   | 3    | NOUN                               | M    |  6%  |
10                 | 4    | VERB -acomp-> ADJ                  | F    | 22%  | 5
                   | 5    | VERB                               | F    |  6%  |
 9                 | 6    | ADJ <-acomp- VERB -conj-> VERB     | F    | 36%  | 4, 5, 14
                   | 7    | VERB -acomp-> ADJ -advmod-> ADV    | F    | 35%  | 4, 5
                   | 8    | NOUN -compmod-> NOUN               | M    | 22%  | 3
                   | 9    | VERB -nsubj-> PRON                 | F    | 14%  | 2, 5
 8                 | 10   | VERB -conj-> VERB -acomp-> ADJ     | F    | 40%  | 4, 5, 14
                   | 11   | VERB -acomp-> ADJ -conj-> ADJ      | F    | 36%  | 4, 5
                   | 12   | ADJ <-acomp- VERB -cc-> CONJ       | F    | 28%  | 4, 5
                   | 13   | CONJ <-cc- VERB -conj-> VERB       | F    | 16%  | 5, 14
                   | 14   | VERB -conj-> VERB                  | F    | 14%  | 5
                   | 15   | ADP <-adpmod- VERB -nsubj-> NOUN   | M    | 14%  | 3, 5
                   | 16   | NOUN -adpmod-> ADP -adpobj-> NOUN  | M    | 13%  | 3, 17
                   | 17   | NOUN -adpmod-> ADP                 | M    | 13%  | 3
                   | 18   | VERB -aux-> VERB                   | F    | 10%  | 5
 7                 | 19   | ADP -adpobj-> NUM                  | M    | 43%  | 1
                   | 20   | ADJ <-acomp- VERB -nsubj-> PRON    | F    | 41%  | 2, 4, 5, 9

Table 3: Gender comparison: Significant syntactic features across languages. Features are ordered by the number of languages in which they are significant. The right-hand side shows the gender for which the feature is indicative, by which margin, and whether it subsumes other features (indexed by rank).

4 Results

We are interested in robust syntactic variation across languages; that is, patterns that hold across most or all of the languages considered here. We therefore score each of the identified treelets by the number of languages with a significant difference in occurrence between the groups of the given demographic variable. Again, we use a rather conservative non-parametric hypothesis test, with Bonferroni correction.

Tables 3 and 4 show the results for gender and age, respectively. The first column shows the number of languages in which the treelet (third column) is significant. The fourth and fifth columns indicate for which age or gender subgroup the feature is indicative, and how much larger the rate of occurrence is there in percent. The indices in the last column represent containment relationships, i.e., when a treelet is strictly contained in another treelet (indexed by the rank given in the second column).

In the case of gender, three atomic treelets (parts of speech) correlate significantly across all 11 languages. Two treelets correlate significantly across 10 languages. For age, five treelets correlate significantly across 10 languages.

In sum, men seem to use numerals and nouns more than women across languages, whereas women use pronouns and verbs more often. Men use nominal compounds more often than women in nine out of eleven languages. Women, on the other hand, use VP coordinations more in eight out of eleven languages.

For age, some of the more striking patterns involve prepositional phrases, which see higher use in the older age group. Among atomic treelets, noun use is slightly higher in the older group, while pronouns are more often used by younger reviewers.

Our results address a central question in variational linguistics, namely whether syntax plays a role in language variation among groups. While this has long been suspected, it was never empirically researched, due to the perceived complexity. Our findings are the first to corroborate the hypothesis that language variation goes beyond the lexical level.

Gender:
     FR    DE    IT    ES    SE    UK    US
FR   592   42%   73%   88%   46%   47%   38%
DE         365   46%   56%   45%   52%   43%
IT               138   50%   32%   72%   65%
ES                     78    28%   79%   73%
SE                           182   49%   45%
UK                                 1056  88%
US                                       630

Age:
     FR    DE    UK    US
FR   108   57%   53%   35%
DE         237   46%   27%
UK               370   56%
US                     173

Table 5: Gender (top) and age (bottom): Pairwise overlap in significant features. Languages with 50 or fewer significant features were left out. The diagonal gives the number of features per language.

We also present the pairwise overlap in significant treelets between (a subset of the) languages; see Table 5. The diagonal values give the number of significant treelets for each language. Percentages in the pairwise comparisons are normalized by the smaller of the pair. For instance, the 49% overlap between Sweden (SE) and the United Kingdom (UK) in Table 5 means that 49% of the 182 SE treelets were also significant in UK.

We observe that the English variants (UK and US) share many features. The Romance languages also share many features with each other, but Italian and Spanish also share many features with English. In Section 5, we analyze our results in more depth.

5 Analysis of syntactic variation

Due to space constraints, we restrict our analysis to a few select treelets with good coverage and interpretable results.

5.1 Gender differences

The top features for gender differences are mostly atomic (pre-terminals), indicating that we observe the same effect as mentioned previously in the literature (Schler et al., 2006), namely that certain parts of speech are prevalent in one gender.

Features 1, 2, 3: For all languages, the use of numerals and nouns is significantly correlated with men, while pronouns and verbs are more indicative of women. When looking at the types of pronouns used by men and women, we see very similar distributions, but men tend to use impersonal pronouns (it, what) more than women do. Nouns and numbers are associated with the alleged "information emphasis" of male language use (Schler et al., 2006). Numbers typically indicate prices or model numbers, while nouns are usually company names.

The robustness of the POS features could to some extent be explained by the different company categories reviewed by each gender: in COMPUTER & ACCESSORIES and CAR LIGHTS the reviews are predominantly by men, while the reviews in the PETS and CLOTHES & FASHION categories are mainly posted by women. Using numerals and nouns is more likely when talking about computers and car lights than when talking about pets and clothing, for example.

Feature 4: In English, this treelet is instantiated by examples such as:

(1) is/was/are great/quick/easy and is/was/arrived

In German, the corresponding examples would be:

(2) bin/war zufrieden und werde/wurde wieder bestellen (am/was satisfied and will/would order again)

Feature 8: This feature mainly encompasses noun compounds, including company names. Again, this feature is indicative of male language use. This may be a side-effect of male use of nouns, but note that the effect is much larger with noun compounds.


Signif. in # lang. | Rank | Feature                              | High | By   | Subsumes
8                  | 1    | NOUN                                 | >45  |  5%  |
7                  | 2    | ADP -adpobj-> NOUN -adpmod-> ADP     | >45  | 20%  | 1, 5, 9
                   | 3    | NOUN -adpmod-> ADP -adpobj-> NOUN    | >45  | 14%  | 1, 5, 9
                   | 4    | VERB -advmod-> ADV                   | <35  | 12%  |
                   | 5    | ADP -adpobj-> NOUN                   | >45  |  8%  | 1
6                  | 6    | ADV <-advmod- VERB -conj-> VERB      | <35  | 34%  | 4, 19
                   | 7    | VERB <-advcl- VERB -advmod-> ADV     | <35  | 27%  | 4, 20
                   | 8    | VERB -cc-> CONJ                      | <35  | 15%  |
                   | 9    | NOUN -adpmod-> ADP                   | >45  | 12%  | 1
                   | 10   | PRON                                 | <35  | 10%  |
5                  | 11   | ADP <-adpmod- NOUN -compmod-> NOUN   | >45  | 40%  | 1, 9, 18
                   | 12   | VERB -conj-> VERB -nsubj-> PRON      | <35  | 32%  | 10, 19
                   | 13   | ADV <-advmod- VERB -cc-> CONJ        | <35  | 25%  | 4, 8
                   | 14   | ADP -adpobj-> NOUN -compmod-> NOUN   | >45  | 23%  | 1, 5, 18
                   | 15   | CONJ <-cc- VERB -nsubj-> PRON        | <35  | 21%  | 8, 10
                   | 16   | CONJ <-cc- VERB -conj-> VERB         | <35  | 20%  | 8, 19
                   | 17   | ADV <-advmod- VERB -nsubj-> PRON     | <35  | 19%  | 4, 10
                   | 18   | NOUN -compmod-> NOUN                 | >45  | 17%  | 1
                   | 19   | VERB -conj-> VERB                    | <35  | 16%  |
                   | 20   | VERB -advcl-> VERB                   | <35  | 11%  |

Table 4: Age group comparison: Significant syntactic features across languages. Layout as in Table 3.

5.2 Age differences

For age, the features vary a lot more than for gender, i.e., there is less support for each of them than there was for the gender features. A few patterns still stand out.

Feature 2: This pattern, which is mostly used by the >45 age group, is often realized in English to express temporal relations, such as:

(1) (with)in a couple/days/hours of
(2) in time for

In German, it is mostly used to express comparisons:

(1) im Vergleich/Gegensatz zu (compared/in contrast to)
(2) auf Suche nach (in search of)
(3) in Höhe/im Wert von (valued at)

Feature 3: This pattern, which is indicative of the >45 age group, is mostly realized in English to express a range of prepositional phrases, some of them overlapping with the previous pattern:

(1) value for money
(2) couple of days
(3) range of products

German also shows prepositional phrases, yet with no overlap with feature 2:

(1) Qualität zu Preisen (quality for price)
(2) Auswahl an Weinen/Hotels (selection of wines/hotels)

In French, this pattern mostly concerns delivery:

(1) délai(s) de livraison (delivery time)
(2) rapidité de livraison (speed of delivery)

And in Spanish, the main contenders are complex (and slightly more formal) expressions:

(1) gastos de envío (shipping)
(2) atención al cliente (customer service)

Feature 4: This pattern is mostly used by the younger group, and realized to express positive recommendations in all languages:

(1) use again/definitely
(2) recommend highly/definitely

German:
(1) empfehle nur/sehr (just recommend)
(2) bestelle wieder/dort/schon (order again/there/already)

French:
(1) recommande vivement (vividly recommend)
(2) emballé/passé/fait bien (packaged/delivered/made well)

Feature 5: This pattern is again predominant in the older group, and mostly used in English to complement the prepositional phrases in feature 3:

(1) at price
(2) with service

In German, it is mostly used to express comparisons:

(1) in Ordnung (alright)
(2) am Tag (on the day)

6 Semantic variation within syntactic categories

Given that a number of the indicative features are single treelets (POS tags), we wondered whether there are certain semantic categories that fill these slots. Since we work across several languages, we are looking for semantically equivalent classes. We collect the most significant adjectives and adverbs for each gender and each language and map the words to all of their possible lexical groups in BabelNet (Navigli and Ponzetto, 2010). This creates lexical equivalence classes. Table 6 shows the results. We purposefully exclude nouns and verbs here, as there is too much variation to detect any patterns.
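The grouping step could look roughly as follows. This is a hedged sketch that assumes a precomputed word-to-synset lookup rather than the live BabelNet API, and the synset identifier shown is invented for illustration:

from collections import defaultdict

def babelnet_classes(significant_words, synsets_of):
    """Group indicative words into cross-lingual BabelNet equivalence classes.

    significant_words: dict lang -> list of (word, group) pairs, where
        group is 'M' or 'F' (or '<35'/'>45' for the age analysis).
    synsets_of: callable (lang, word) -> set of BabelNet synset ids;
        assumed here to be a precomputed lookup, not the live API.
    Returns synset id -> (number of languages sharing it, majority group).
    """
    langs_per_class = defaultdict(set)
    groups_per_class = defaultdict(list)
    for lang, pairs in significant_words.items():
        for word, group in pairs:
            for synset in synsets_of(lang, word):
                langs_per_class[synset].add(lang)
                groups_per_class[synset].append(group)
    result = {}
    for synset, langs in langs_per_class.items():
        groups = groups_per_class[synset]
        majority = max(set(groups), key=groups.count)
        result[synset] = (len(langs), majority)
    return result

# Hypothetical lookup covering two languages (synset id is made up):
lookup = {("en", "really"): {"bn:intensifier"}, ("de", "wirklich"): {"bn:intensifier"}}
sig = {"en": [("really", "F")], "de": [("wirklich", "F")]}
print(babelnet_classes(sig, lambda l, w: lookup.get((l, w), set())))
# {'bn:intensifier': (2, 'F')}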

The number of languages that share lexical items from the same BabelNet class is typically smaller than the number of languages that share a treelet. Nevertheless, we observe certain patterns.

The results for gender are presented in Table 6. For adverbs, the division seems to be about intensity: men use more downtoners (approximately; almost; still), while women use more intensifiers (actually; really; truly; quite; lots). This finding is new, in that it directly contradicts the perceived wisdom of female language as being more restrained and hedging.

In their use of adjectives, on the other hand, men highlight "factual" properties of the subject, such as price (inexpensive) and quality (cheap; best; professional), whereas women use more qualitative adjectives that express the speaker's opinion about the subject (fantastic; amazing; pretty) or their own state (happy), although we also find the "factual" assessment simple.

Table 7 shows the results for age. There are not many adjectives that group together, and they do not show a clear pattern. Most of the adverbs are indicative of the younger group, although there is overlap with the older group (this is due to different sets of words mapping to the same class). We did not find any evidence for pervasive age effects across languages.

Langs. | BABELNET class                 | Highest

Adverbs
5      | just about; approximately      | M
       | actually; indeed               | F
       | real; really; very             | F
       | really; truly; genuinely       | F
       | quite                          | F
4      | almost; nearly; virtually      | M
       | still                          | M
       | however; still; nevertheless   | M
       | soon; presently; shortly       | F
       | a good deal; lots; very much   | F

Adjectives
6      | fantastic; wondrous; wonderful | F
5      | inexpensive; cheap; economic   | M
       | amazing; awesome; marvelous    | F
       | tinny; bum; cheap              | M
4      | happy                          | F
       | best (quality)                 | M
       | professional                   | M
       | pretty                         | F
       | easy; convenient; simple       | F
       | okay; o.k.; all right          | M

Table 6: Gender: equivalence classes in BabelNet

7 Related Work

Sociolinguistic studies investigate the relation between a speaker's linguistic choices and socio-economic variables. This includes regional origin (Schmidt and Herrgen, 2001; Nerbonne, 2003; Wieling et al., 2011), age (Barke, 2000; Barbieri, 2008; Rickford and Price, 2013), gender (Holmes, 1997; Rickford and Price, 2013), social class (Labov, 1964; Milroy and Milroy, 1992; Macaulay, 2001; Macaulay, 2002), and ethnicity (Carter, 2013; Rickford and Price, 2013). We focus on age and gender in this work.

Corpus-based studies of variation have largely been conducted either by testing for the presence or absence of a set of pre-defined words (Pennebaker et al., 2001; Pennebaker et al., 2003), or by analysis of the unigram distribution (Barbieri, 2008). This approach restricts the findings to the phenomena defined in the hypothesis, in this case the word list used. In contrast, our approach works beyond the lexical level, is data-driven, and is thus unconstrained by prior hypotheses.

Langs. | BABELNET class                    | Highest

Adverbs
5      | actually; really; in fact         | <35
       | truly; genuinely; really          | <35
4      | however; nevertheless             | <35
       | however                           | <35
3      | merely; simply; just              | <35
       | reasonably; moderately; fairly    | <35
       | very                              | >45
       | in truth; really                  | <35
       | very; really; real                | >45
       | very; really; real                | <35

Adjectives
3      | easy; convenient; simple          | <35
       | quick; speedy                     | >45
       | costly; pricy; expensive          | <35
       | simple                            | <35
       | excellent; first-class            | >45
2      | spacious; wide                    | <35
       | expensive                         | <35
       | simple (unornamented)             | <35
       | new                               | >45
       | best                              | <35

Table 7: Age: Lexical equivalences in BabelNet

Eisenstein et al. (2011) use multi-output regression to predict demographic attributes from term frequencies, and vice versa. Using sparsity-inducing priors, they identify key lexical variations between linguistic communities. While they mention syntactic variation as possible future work, their method has not yet been applied to syntactically parsed data. Our method is simpler than theirs, yet goes beyond words. We learn demographic attributes from raw counts of syntactic treelets rather than term frequencies, and test for group differences between the most predictive treelets and the demographic variables. We also use a sparsity-inducing regularizer.

Kendall et al. (2011) study dative alternations on a 250k-word corpus of transcribed spoken Afro-American Vernacular English. They use logistic regression to correlate syntactic features and dialect, similar to Eisenstein et al. (2011), but their study differs from ours in using manually annotated data, studying only one dialect and demographic variable, and using much less data.

Stewart (2014) uses POS tags to study morpho-syntactic features of Afro-American Vernacular English on Twitter, such as copula deletion, habitual be, null genitive marking, etc. Our study is different from his in using full syntactic analyses, studying variation across age and gender rather than ethnicity, and in studying syntactic variation across several languages.

8 Conclusion

Syntax has been identified as an important factor in language variation among groups, but it has rarely been addressed empirically. Previous work has been limited by data size or the availability of demographic meta-data. Existing studies on variation have thus mostly focused on lexical and phonological variation.

In contrast, we study the effect of age and gender on syntactic variation across several languages. We use a large-scale data source (international user-review websites) and parse the data using the same formalisms to maximize comparability. We find several highly significant age- and gender-specific syntactic patterns.

As NLP applications for social media become more widespread, we need to address their performance issues. Our findings suggest that including extra-linguistic factors (which become more and more available) could help improve the performance of these systems. This requires a discussion of approaches to corpus construction and the development of new models.

Acknowledgements

We would like to thank the anonymous reviewers for their comments, which helped improve the paper. This research is funded by the ERC Starting Grant LOWLANDS No. 313695.

References

Jalal S. Alowibdi, Ugo A. Buy, and Philip Yu. 2013. Empirical evaluation of profile characteristics for gender classification on Twitter. In Machine Learning and Applications (ICMLA), 2013 12th International Conference on, volume 1, pages 365–369. IEEE.

Federica Barbieri. 2008. Patterns of age-based linguistic variation in American English. Journal of Sociolinguistics, 12(1):58–88.

Andrew J. Barke. 2000. The effect of age on the style of discourse among Japanese women. In Proceedings of the 14th Pacific Asia Conference on Language, Information and Computation, pages 23–34.

Katherine Button, John Ioannidis, Claire Mokrysz, Brian Nosek, Jonathan Flint, Emma Robinson, and Marcus Munafò. 2013. Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14:365–376.

Phillip M. Carter. 2013. Shared spaces, shared structures: Latino social formation and African American English in the US South. Journal of Sociolinguistics, 17(1):66–92.

Jenny Cheshire. 2005. Syntactic variation and beyond: Gender and social class variation in the use of discourse-new markers. Journal of Sociolinguistics, 9(4):479–508.

Morgane Ciot, Morgan Sonderegger, and Derek Ruths. 2013. Gender inference of Twitter users in non-English contexts. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Wash., pages 18–21.

Gabriel Doyle. 2014. Mapping dialectal variation by querying social media. In EACL.

Jacob Eisenstein, Noah Smith, and Eric Xing. 2011. Discovering sociolinguistic associations with structured sparsity. In Proceedings of ACL.

Jacob Eisenstein. 2013a. Phonological factors in social media writing. In Workshop on Language Analysis in Social Media, NAACL.

Jacob Eisenstein. 2013b. What to do about bad language on the internet. In Proceedings of NAACL.

Janet Holmes. 1997. Women, language and identity. Journal of Sociolinguistics, 1(2):195–223.

Dirk Hovy and Anders Søgaard. 2015. Tagging performance correlates with author age. In Proceedings of ACL.

Dirk Hovy, Anders Johannsen, and Anders Søgaard. 2015. User review-sites as a source for large-scale sociolinguistic studies. In Proceedings of WWW.

Dirk Hovy. 2015. Demographic factors improve classification performance. In Proceedings of ACL.

Anna Jørgensen, Dirk Hovy, and Anders Søgaard. 2015. Challenges of studying and processing dialects in social media. In Workshop on Noisy User-generated Text (W-NUT).

Tyler Kendall, Joan Bresnan, and Gerard van Herk. 2011. The dative alternation in African American English. Corpus Linguistics and Linguistic Theory, 7(2):229–244.

William Labov. 1964. The social stratification of English in New York City. Ph.D. thesis, Columbia University.

Ronald Macaulay. 2001. You're like 'why not?' The quotative expressions of Glasgow adolescents. Journal of Sociolinguistics, 5(1):3–21.

Ronald Macaulay. 2002. Extremely interesting, very interesting, or only quite interesting? Adverbs and social class. Journal of Sociolinguistics, 6(3):398–417.

André F. T. Martins, Miguel Almeida, and Noah A. Smith. 2013. Turning on the turbo: Fast third-order non-projective turbo parsers. In ACL (2), pages 617–622.

Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang, Oscar Täckström, Claudia Bedini, Núria Bertomeu Castelló, and Jungmee Lee. 2013. Universal dependency annotation for multilingual parsing. In ACL.

Nicolai Meinshausen and Peter Bühlmann. 2010. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–473.

Lesley Milroy and James Milroy. 1992. Social network and social class: Toward an integrated sociolinguistic model. Language in Society, 21(01):1–26.

Tahira Naseem, Regina Barzilay, and Amir Globerson. 2012. Selective sharing for multilingual dependency parsing. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 629–637. Association for Computational Linguistics.

Roberto Navigli and Simone Paolo Ponzetto. 2010. BabelNet: Building a very large multilingual semantic network. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 216–225. Association for Computational Linguistics.

John Nerbonne. 2003. Linguistic variation and computation. In Proceedings of EACL, pages 3–10. Association for Computational Linguistics.

Andrew Y. Ng. 2004. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the Twenty-First International Conference on Machine Learning, page 78. ACM.

James W. Pennebaker, Martha E. Francis, and Roger J. Booth. 2001. Linguistic Inquiry and Word Count: LIWC 2001. Mahwah: Lawrence Erlbaum Associates, 71:2001.

James W. Pennebaker, Matthias R. Mehl, and Kate G. Niederhoffer. 2003. Psychological aspects of natural language use: Our words, our selves. Annual Review of Psychology, 54(1):547–577.

Slav Petrov and Ryan McDonald. 2012. Overview of the 2012 shared task on parsing the web. In Notes of the First Workshop on Syntactic Analysis of Non-Canonical Language (SANCL), volume 59. Citeseer.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2011. A universal part-of-speech tagset. CoRR, abs/1104.2086.

John Rickford and Mackenzie Price. 2013. Girlz II women: Age-grading, language change and stylistic variation. Journal of Sociolinguistics, 17(2):143–179.

Jonathan Schler, Moshe Koppel, Shlomo Argamon, and James W. Pennebaker. 2006. Effects of age and gender on blogging. In AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, volume 6, pages 199–205.

Jürgen Erich Schmidt and Joachim Herrgen. 2001. Digitaler Wenker-Atlas (DiWA). Bearbeitet von Alfred Lameli, Tanja Giessler, Roland Kehrein, Alexandra Lenz, Karl-Heinz Müller, Jost Nickel, Christoph Purschke und Stefan Rabanus. Erste vollständige Ausgabe von Georg Wenkers "Sprachatlas des Deutschen Reichs".

Ian Stewart. 2014. Now we stronger than ever: African-American syntax in Twitter. EACL 2014, page 31.

Svitlana Volkova, Theresa Wilson, and David Yarowsky. 2013. Exploring demographic language variations to improve multilingual sentiment analysis in social media. In Proceedings of EMNLP, pages 1815–1827.

Martijn Wieling, John Nerbonne, and R. Harald Baayen. 2011. Quantitative social dialectology: Explaining linguistic variation geographically and socially. PloS ONE, 6(9):e23613.


Cross-lingual Transfer for Unsupervised Dependency Parsing Without Parallel Data

Long Duong,1,2 Trevor Cohn,1 Steven Bird,1 and Paul Cook3

1 Department of Computing and Information Systems, University of Melbourne
2 National ICT Australia, Victoria Research Laboratory
3 Faculty of Computer Science, University of New Brunswick
[email protected]  {t.cohn,sbird}@unimelb.edu.au  [email protected]

Abstract

Cross-lingual transfer has been shown to produce good results for dependency parsing of resource-poor languages. Although this avoids the need for a target language treebank, most approaches have still used large parallel corpora. However, parallel data is scarce for low-resource languages, and we report a new method that does not need parallel data. Our method learns syntactic word embeddings that generalise over the syntactic contexts of a bilingual vocabulary, and incorporates these into a neural network parser. We show empirical improvements over a baseline delexicalised parser on both the CoNLL and Universal Dependency Treebank datasets. We analyse the importance of the source languages, and show that combining multiple source languages leads to a substantial improvement.

1 Introduction

Dependency parsing is a crucial component of many natural language processing (NLP) systems for tasks such as relation extraction (Bunescu and Mooney, 2005), statistical machine translation (Xu et al., 2009), text classification (Özgür and Güngör, 2010), and question answering (Cui et al., 2005). Supervised approaches to dependency parsing have been very successful for many resource-rich languages, where relatively large treebanks are available (McDonald et al., 2005a). However, for many languages, annotated treebanks are not available, and are very costly to create (Böhmová et al., 2001). This motivates the development of unsupervised approaches that can make use of unannotated, monolingual data. However, purely unsupervised approaches have relatively low accuracy (Klein and Manning, 2004; Gelling et al., 2012).

Most recent work on unsupervised dependency parsing for low-resource languages has used the idea of delexicalized parsing and cross-lingual transfer (Zeman et al., 2008; Søgaard, 2011; McDonald et al., 2011; Ma and Xia, 2014). In this setting, a delexicalized parser is trained on a resource-rich source language, and is then applied directly to a resource-poor target language. The only requirement here is that the source and target languages are POS tagged with the same tagset. This assumption is pertinent for resource-poor languages since it is relatively quick to manually POS tag the data. Moreover, there are many reports of high-accuracy POS tagging for resource-poor languages (Duong et al., 2014; Garrette et al., 2013; Duong et al., 2013b). The cross-lingual delexicalized approach has been shown to significantly outperform unsupervised approaches (McDonald et al., 2011; Ma and Xia, 2014).

Parallel data can be used to boost the performance of a cross-lingual parser (McDonald et al., 2011; Ma and Xia, 2014). However, parallel data may be hard to acquire for truly resource-poor languages.[1] Accordingly, we propose a method to improve the performance of a cross-lingual delexicalized parser using only monolingual data.

Our approach is based on augmenting the delexicalized parser using syntactic word embeddings. Words from both source and target language are mapped to a shared low-dimensional space based on their syntactic context, without recourse to parallel data. While prior work has struggled to efficiently incorporate word embedding information into the parsing model (Bansal et al., 2014; Andreas and Klein, 2014; Chen et al., 2014), we present a method for doing so using a neural network parser. We train our parser using a two-stage process: first learning cross-lingual syntactic word embeddings, then learning the other parameters of the parsing model using a source language treebank. When applied to the target language, we show consistent gains across all studied languages.

[1] Note that most research in this area (as do we) evaluates on simulated low-resource languages, through selective use of data in high-resource languages. Consequently parallel data is plentiful, however this is often not the case in the real setting, e.g., for Tagalog, where only scant parallel data exists (e.g., dictionaries, Wikipedia and the Bible).

This work is a stepping stone towards the more ambitious goal of a universal parser that can efficiently parse many languages with little modification. This aspiration is supported by the recent release of the Universal Dependency Treebank (Nivre et al., 2015), which has consensus dependency relation types and POS annotation for many languages.

When multiple source languages are available, we can attempt to boost performance by choosing the best source language, or by combining information from several source languages. To the best of our knowledge, no prior work has proposed a means for selecting the best source language given a target language. To address this, we introduce two metrics which outperform the baseline of always picking English as the source language. We also propose a method for combining all available source languages, which leads to a substantial improvement.

The rest of this paper is organized as follows: Section 2 reviews prior work on unsupervised cross-lingual dependency parsing. Section 3 presents the methods for improving the delexicalized parser using syntactic word embeddings. Section 4 describes experiments on the CoNLL dataset and the Universal Dependency Treebank. Section 5 presents methods for selecting the best source language given a target language.

2 Unsupervised Cross-lingual Dependency Parsing

There are two main approaches for building dependency parsers for resource-poor languages without using target-language treebanks: delexicalized parsing and projection (Hwa et al., 2005; Ma and Xia, 2014; Täckström et al., 2013; McDonald et al., 2011).

The delexicalized approach was proposed by Zeman et al. (2008). They built a delexicalized parser from a treebank in a resource-rich source language. This parser can be trained using any standard supervised approach, but without including any lexical features, and then applied directly to parse sentences from the resource-poor language. Delexicalized parsing relies on the fact that parts of speech are highly informative of dependency relations. For example, an English lexicalized discriminative arc-factored dependency parser achieved 84.1% accuracy, whereas a delexicalized version achieved 78.9% (McDonald et al., 2005b; Täckström et al., 2013). Zeman et al. (2008) built a parser for Swedish using Danish, two closely related languages. Søgaard (2011) adapted this method for less similar languages by choosing sentences from the source language that are similar to the target language. Täckström et al. (2012) additionally used cross-lingual word clustering as a feature for their delexicalized parser. Also related is the work by Naseem et al. (2012) and Täckström et al. (2013), who incorporated linguistic features from the World Atlas of Language Structures (WALS; Dryer and Haspelmath (2013)) for joint modelling of multilingual syntax.

In contrast, projection approaches use parallel data to project source language dependency relations to the target language (Hwa et al., 2005). Given a source-language parse tree along with word alignments, they generate the target-language parse tree by projection. However, their approach relies on many heuristics which would be difficult to adapt to other languages. McDonald et al. (2011) exploit both delexicalized parsing and parallel data, using an English delexicalized parser as the seed parser for the target languages, and updating it according to word alignments. The model encourages the target-language parse tree to look similar to the source-language parse tree with respect to the head-modifier relation. Ma and Xia (2014) use parallel data to transfer source language parser constraints to the target side via word alignments. For null alignments, they use a delexicalized parser instead of the source language lexicalized parser.

In summary, existing work generally starts with a delexicalized parser, and uses parallel data or typological information to improve it. In contrast, we want to improve the delexicalized parser without using parallel data or any explicit linguistic resources.

3 Improving Delexicalized Parsing

We propose a novel method to improve the performance of a delexicalized cross-lingual parser without recourse to parallel data. Our method uses no additional resources and is designed to complement other methods. The approach is based on syntactic word embeddings, where a word is represented as a low-dimensional vector in syntactic space. The idea is simple: we want to relexicalize the delexicalized parser using word embeddings, where source and target language lexical items are represented in the same space.

Word embeddings typically capture both syntactic and semantic information. However, we hypothesize (and later show empirically) that for dependency parsing, word embeddings need to better reflect syntax. In the next subsection, we review some cross-lingual word embedding methods and propose our syntactic word embeddings. Section 4 empirically compares these word embeddings when incorporated into a dependency parser.

3.1 Cross-lingual word embeddings

We review methods that can represent words in both source and target languages in a low-dimensional space. There are many benefits of using a low-dimensional space. Instead of the traditional "one-hot" representation with the number of dimensions equal to the vocabulary size, words are represented using far fewer dimensions. This confers the benefit of generalising over the vocabulary to alleviate issues of data sparsity, through learning representations that encode lexical relations such as synonymy.

Several approaches have sought to learn cross-lingual word embeddings from parallel data (Hermann and Blunsom, 2014a; Hermann and Blunsom, 2014b; Xiao and Guo, 2014; Zou et al., 2013; Täckström et al., 2012). Hermann and Blunsom (2014a) induced a cross-lingual word representation based on the idea that representations for parallel sentences should be close together. They constructed a sentence-level representation as a bag-of-words sum over word-level representations, and then optimized a hinge loss function to match a latent representation of both sides of a parallel sentence pair. While this might seem well suited to our needs as a word representation in cross-lingual parsing, it may lead to overly semantic embeddings, which are important for translation but less useful for parsing. For example, "economic" and "economical" will have a similar representation despite having different syntactic features.

Also related is Täckström et al. (2012), who build cross-lingual word representations using a variant of the Brown clusterer (Brown et al., 1992) applied to parallel data. Bansal et al. (2014) and Turian et al. (2010) showed that for monolingual dependency parsing, the simple Brown-clustering-based algorithm outperformed many word embedding techniques. In this paper we compare our approach to forming cross-lingual word embeddings with those of both Hermann and Blunsom (2014a) and Täckström et al. (2012).

[Figure 1: Examples of the syntactic word embeddings for Spanish and English. In each case, the highlighted tags are predicted by the highlighted word. The Spanish sentence means "your pet looks lovely". Spanish: "Tu mascota parece adorable" (PRON NOUN VERB ADJ); English: "The weather is horrible today" (DET NOUN VERB ADJ NOUN).]

3.2 Syntactic Word Embedding

We now propose a novel approach for learning cross-lingual word embeddings that is more heavily skewed towards syntax. Word embedding methods typically exploit word co-occurrences, building on traditional techniques for distributional similarity, e.g., the co-occurrences of words in a context window around a central word. Bansal et al. (2014) suggested that for dependency parsing, word embeddings be trained over dependency relations instead of adjacent tokens, such that embeddings capture head and modifier relations. They showed that this strategy performed much better than surface embeddings for monolingual dependency parsing. However, their method is not applicable to our low-resource setting, as it requires a parse tree for training. Instead we consider a simpler representation, namely part-of-speech contexts. This requires only POS tagging, rather than full parsing, while providing syntactic information linking words to their POS context, which we expect to be informative for characterising dependency relations.


Algorithm 1: Syntactic word embedding
1: Match the source and target tagsets to the Universal Tagset.
2: Extract word n-gram sequences for both the source and target language.
3: For each n-gram, keep the middle word, and replace the other words by their POS.
4: Train a skip-gram word embedding model on the resulting list of word and POS sequences from both the source and target language.
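Steps 2–3 of Algorithm 1 amount to a simple preprocessing pass. The following sketch (our own naming, ±1 window) produces the word/POS-context pairs that the skip-gram model is trained on:

def syntactic_contexts(tagged_sentences, window=1):
    """Steps 2-3 of Algorithm 1: for every token, keep the word itself and
    replace the surrounding tokens (here +/- `window`) by their POS tags.

    tagged_sentences: list of [(word, universal_pos), ...] per sentence.
    Yields (word, [context_tags]) pairs used to train the skip-gram model.
    """
    for sent in tagged_sentences:
        for i, (word, _tag) in enumerate(sent):
            context = []
            for j in range(i - window, i + window + 1):
                if j != i and 0 <= j < len(sent):
                    context.append(sent[j][1])      # POS of the neighbour
            yield word.lower(), context

# English and Spanish sentences from Figure 1, tagged with the universal tagset:
en = [("The", "DET"), ("weather", "NOUN"), ("is", "VERB"), ("horrible", "ADJ"), ("today", "NOUN")]
es = [("Tu", "PRON"), ("mascota", "NOUN"), ("parece", "VERB"), ("adorable", "ADJ")]
for word, ctx in syntactic_contexts([en, es]):
    print(word, ctx)
# "is" and "parece" both predict the context ['NOUN', 'ADJ'],
# so they will end up with similar embeddings.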

We assume the same POS tagset is used for both the source and target language,[2] and learn word embeddings for each word type in both languages into the same syntactic space of nearby POS contexts. In particular, we develop a predictive model of the tags to the left and right of a word, as illustrated in Figure 1 and outlined in Algorithm 1. Figure 1 illustrates two training contexts extracted from our English source and Spanish target language, where the highlighted fragments reflect the tags being predicted around each focus word. Note that for this example, the POS contexts for the English and Spanish verbs are identical, and therefore the model would learn similar word embeddings for these terms, and bias the parser to generate similar dependency structures for both terms.

There are several motivations for our approach: (1) POS tags are too coarse-grained for accurate parsing, but with access to local context they can be made more informative; (2) leaving out the middle tag avoids duplication because this is already known to the parser; (3) dependency edges are often local, as shown in Figure 1, i.e., there are dependency relations between most words and their immediate neighbours. Consequently, training our embeddings to predict adjacent tags is likely to learn similar information to training over dependency edges.[3] Bansal et al. (2014) studied the effect of word embeddings on dependency parsing, and found that larger embedding windows captured more semantic information, while smaller windows better reflected syntax. Therefore we choose a small ±1 word window in our experiments. We also experimented with bigger windows (±2, ±3) but observed performance degradation in these cases, supporting the argument above.

[2] Later we consider multiple source languages, but for now assume a single source language.
[3] For the 16 languages in the CoNLL-X and CoNLL-07 datasets we observed that approx. 50% of dependency relations span a distance of one word and 20% span two words. Thus our POS context of a ±1 word window captures the majority of dependency relations.

[Figure 2: Neural Network Parser Architecture from Chen and Manning (2014). The configuration (stack, queue, arcs) yields words, POS tags and arc labels, which are mapped through embedding matrices (E_word, E_pos, E_arc), fed to a hidden layer (W1) and a softmax output layer (W2).]

Step 4 of Algorithm 1 finds the word embeddings as a side-effect of training a neural language model. We use the skip-gram model (Mikolov et al., 2013), trained to predict context tags for each word. The model is formulated as a simple bilinear logistic classifier

    P(t_c \mid w) = \frac{\exp(u_{t_c}^\top v_w)}{\sum_{z=1}^{T} \exp(u_z^\top v_w)}    (1)

where t_c is the context tag around the current word w, U \in \mathbb{R}^{T \times D} is the tag embedding matrix, V \in \mathbb{R}^{V \times D} is the word embedding matrix, with T the number of tags, V the total number of word types over both languages, and D the capacity of the embeddings. Given a training set of word and POS contexts, (t^L_i, w_i, t^R_i), i = 1, \ldots, N,[4] we maximize the log-likelihood \sum_{i=1}^{N} \log P(t^L_i \mid w_i) + \log P(t^R_i \mid w_i) with respect to U and V using stochastic gradient descent. The learned matrix V of word embeddings is later used in parser training (the source word embeddings) and inference (the target word embeddings).

[4] Note that w here can be a word type in either the source or target language, such that both embeddings will be learned for all word types in both languages.
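A self-contained numpy sketch of Equation 1 and its gradient updates is given below; the dimensionality, learning rate and toy data are our own choices, not the paper's settings:

import numpy as np

def train_syntactic_skipgram(pairs, words, tags, dim=50, lr=0.05, epochs=5, seed=0):
    """Train the bilinear skip-gram of Eq. (1) by stochastic gradient ascent.

    pairs: iterable of (word, context_tag) training pairs (both languages);
    words/tags: vocabularies. Returns the word embedding matrix V, which is
    what the parser later uses.
    """
    rng = np.random.RandomState(seed)
    w_idx = {w: i for i, w in enumerate(words)}
    t_idx = {t: i for i, t in enumerate(tags)}
    V = rng.normal(0, 0.1, (len(words), dim))   # word embeddings
    U = rng.normal(0, 0.1, (len(tags), dim))    # tag embeddings
    pairs = list(pairs)
    for _ in range(epochs):
        rng.shuffle(pairs)
        for word, tag in pairs:
            w, t = w_idx[word], t_idx[tag]
            scores = U @ V[w]                    # (T,)
            p = np.exp(scores - scores.max())
            p /= p.sum()                         # softmax over all tags
            err = -p
            err[t] += 1.0                        # gradient of log P(t|w) w.r.t. scores
            grad_V = err @ U                     # gradient for the word vector
            U += lr * np.outer(err, V[w])
            V[w] += lr * grad_V
    return V

# Tiny bilingual toy set: both verbs occur between NOUN and ADJ contexts.
pairs = [("is", "NOUN"), ("is", "ADJ"), ("parece", "NOUN"), ("parece", "ADJ"),
         ("weather", "DET"), ("weather", "VERB"), ("mascota", "PRON"), ("mascota", "VERB")]
V = train_syntactic_skipgram(pairs, ["is", "parece", "weather", "mascota"],
                             ["DET", "NOUN", "VERB", "ADJ", "PRON"])
print(V.shape)  # (4, 50)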

3.3 Parsing Algorithm

In this section, we show how to incorporate the syntactic word embeddings into a parsing model. Our parsing model is built on the work of Chen and Manning (2014), who built a transition-based dependency parser using a neural network. The neural network classifier decides which transition is applied for each configuration. The architecture of the parser is illustrated in Figure 2, where each layer is fully connected to the layer above.

For each configuration, the selected list of words, POS tags and labels from the stack, queue and arcs is extracted. Each word, POS tag or label is mapped to a low-dimensional vector representation (embedding) through the mapping layer. This layer simply concatenates the embeddings, which are then fed into a two-layer neural network classifier to predict the next parsing action. The set of parameters for the neural network classifier is E_word, E_pos, E_labels for the mapping layer, W1 for the hidden layer and W2 for the softmax output layer. We incorporate the syntactic word embeddings into the neural network model by setting E_word to the syntactic word embeddings, which remain fixed during training so as to retain the cross-lingual mapping.[5]
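One possible way to wire the fixed cross-lingual embeddings into such a classifier is sketched below in PyTorch. This is our illustration of the idea (the feature counts and ReLU activation are placeholder choices), not the authors' implementation:

import numpy as np
import torch
import torch.nn as nn

class TransitionClassifier(nn.Module):
    """Feed-forward transition classifier in the style of Chen and Manning (2014),
    with the word embedding table (E_word) initialised from the cross-lingual
    syntactic embeddings and frozen, so its rows can later be swapped for the
    target-language vectors without retraining."""

    def __init__(self, word_vectors, n_pos, n_labels, n_word_feats=18,
                 n_pos_feats=18, n_label_feats=12, dim=50, hidden=200, n_actions=81):
        super().__init__()
        self.E_word = nn.Embedding.from_pretrained(
            torch.tensor(word_vectors, dtype=torch.float), freeze=True)  # fixed embeddings
        self.E_pos = nn.Embedding(n_pos, dim)
        self.E_label = nn.Embedding(n_labels, dim)
        in_dim = (n_word_feats + n_pos_feats + n_label_feats) * dim
        self.hidden = nn.Linear(in_dim, hidden)   # W1
        self.out = nn.Linear(hidden, n_actions)   # W2 (softmax layer)

    def forward(self, word_ids, pos_ids, label_ids):
        # Concatenate all feature embeddings, then score the parser actions.
        x = torch.cat([self.E_word(word_ids).flatten(1),
                       self.E_pos(pos_ids).flatten(1),
                       self.E_label(label_ids).flatten(1)], dim=1)
        return self.out(torch.relu(self.hidden(x)))

# Example: a 1000-word bilingual vocabulary with 50-dimensional embeddings.
clf = TransitionClassifier(np.random.randn(1000, 50), n_pos=17, n_labels=41)
# At parsing time, substituting target-language embeddings amounts to
# replacing the rows of clf.E_word.weight with the target-language part of V.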

3.4 Model Summary

To apply the parser to a resource-poor target language, we start by building syntactic word embeddings between the source and target languages as shown in Algorithm 1. Next, we incorporate the syntactic word embeddings using the approach proposed in Section 3.3. The third step is to substitute the source-language syntactic word embeddings with the target-language ones. Finally, we parse the target language using this substituted model. In this way, the model will recognize lexical items of the target language.

4 Experiments

We test our method of incorporating syntactic word embeddings into a neural network parser on both the existing CoNLL datasets (Buchholz and Marsi, 2006; Nivre et al., 2007) and the newly released Universal Dependency Treebank (Nivre et al., 2015). We employ the Unlabeled Attachment Score (UAS) without punctuation for comparison with prior work on the CoNLL dataset. Where possible we also report Labeled Attachment Score (LAS) without punctuation. We use English as the source language for this experiment.

[5] This is a consequence of only training the parser on the source language. If we were to update embeddings during parser training, they would no longer align with the target language embeddings.

4.1 Experiments on CoNLL Data

In this section we report experiments involving the CoNLL-X and CoNLL-07 datasets. Running on these datasets makes our model comparable with prior work. For languages included in both datasets, we use only the newer one. Crucially, for the delexicalized parser we map language-specific tags to the universal tagset (Petrov et al., 2012). The syntactic word embeddings are trained using POS information from the CoNLL data.

There are two baselines for our experiment. The first is the unsupervised dependency parser of Klein and Manning (2004); the second is the delexicalized parser of Täckström et al. (2012). We also compare our syntactic word embeddings with the cross-lingual word embeddings of Hermann and Blunsom (2014a). These word embeddings are induced for each language pair using Europarl (Koehn, 2005). We incorporated Hermann and Blunsom (2014a)'s cross-lingual word embeddings into the parsing model in the same way as the syntactic word embeddings. Table 1 shows the UAS for 8 languages for several models. The first observation is that the direct transfer delexicalized parser outperformed the unsupervised approach. This is consistent with many prior studies. Our implementation of the direct transfer model performed on par with Täckström et al. (2012) on average. Table 1 also shows that using HB embeddings improves performance over the direct transfer model. Our model using syntactic word embeddings consistently outperformed the direct transfer model and the HB embeddings across all 8 languages. On average, it is 1.5% and 1.3% better, respectively.[6]

The improvement over the HB embeddings varies across languages, falling in the range of 0.3 to 2.6%. This confirms our initial hypothesis that we need word embeddings that capture syntactic rather than semantic information.

It is not strictly fair to compare our method with prior approaches to unsupervised dependency parsing, since they have different resource requirements, i.e. parallel data or typological resources. Compared with the baseline direct transfer model, our approach delivered a 1.5% mean performance gain, whereas Täckström et al. (2012) and McDonald et al. (2011) report approximately a 3% gain, and Ma and Xia (2014) and Naseem et al. (2012) report approximately a 6% gain. As we have stated above, our approach is complementary to the approaches used in these other systems. For example, we could incorporate the cross-lingual word clustering feature (Täckström et al., 2012) or WALS features (Naseem et al., 2012) into our model, or use our improved delexicalized parser as the reference model for Ma and Xia (2014), which we expect would lead to better results yet.

[6] All performance comparisons in this paper are absolute.

                                    da    de    el    es    it    nl    pt    sv    Avg
Unsupervised                       33.4  18.0  39.9  28.5  43.1  38.5  20.1  44.0  33.2
Täckström et al. (2012) DT         36.7  48.9  59.5  60.2  64.4  52.8  66.8  55.4  55.6
Our Direct Transfer                44.1  44.9  63.3  52.2  57.7  59.7  67.5  55.4  55.6
Our Model + HB embedding           45.0  44.5  63.8  52.2  56.7  59.8  68.7  55.6  55.8
Our Model + Syntactic embedding    45.9  45.9  64.1  52.9  59.1  61.1  69.5  58.1  57.1

Table 1: Comparative results on the CoNLL corpora showing UAS for several parsers: unsupervised induction (Klein and Manning, 2004), the Direct Transfer (DT) delexicalized parser of Täckström et al. (2012), our implementation of Direct Transfer, and our neural network parsing model using the cross-lingual embeddings of Hermann and Blunsom (2014a) (HB) and our proposed syntactic embeddings.

         cs      de     en     es     fi     fr     ga    hu    it     sv
Train  1173.3  269.6  204.6  382.4  162.7  354.7  16.7  20.8  194.1  66.6
Dev     159.3   12.4   25.1   41.7    9.2   38.9   3.2   3.0   10.5   9.8
Test    173.9   16.6   25.1    8.5    9.1    7.1   3.8   2.7   10.2  20.4
Total  1506.5  298.6  254.8  432.6  181.0  400.7  23.7  26.5  214.8  96.8

Table 2: Number of tokens (×1000) for each language in the Universal Dependency Treebank.

4.2 Experiments with Universal Dependency Treebank

We also experimented with the Universal Dependency Treebank v1.0, which has many desirable properties for our system, e.g. dependency types and coarse POS tags are the same across languages. This removes the need for mapping the source and target language tagsets to a common tagset, as was done for the CoNLL data. Secondly, instead of only reporting UAS we can report LAS, which is impossible on the CoNLL dataset, where the dependency edge labels differ among languages.

Table 2 shows the size, in thousands of tokens, of each language in the treebank. The first thing to observe is that some languages have an abundant amount of data, such as Czech (cs), French (fr) and Spanish (es). However, there are also languages of modest size, i.e. Hungarian (hu) and Irish (ga).

We ran our model with and without syntactic word embeddings for all languages, with English as the source language. The results are shown in Table 3. The first observation is that our model using syntactic word embeddings outperformed direct transfer for all languages on both UAS and LAS. We observed an average improvement of 3.6% (UAS) and 3.1% (LAS). This consistent improvement shows the robustness of our method of incorporating syntactic word embeddings into the model. The second observation is that the gap between UAS and LAS is as big as 13% on average for both models. This reflects the increased difficulty of labelling the edges, with unlabelled edge prediction involving only a 3-way classification,[7] while labelled edge prediction involves an 81-way classification.[8] Narrowing the gap between UAS and LAS for resource-poor languages is an important research area for future work.

5 Different Source Languages

In the previous sections, we used English as the source language. However, English might not be the best choice. For the delexicalized parser, it is crucial that the source and target languages have similar syntactic structures. Therefore a different choice of source language might substantially change the performance, as observed in prior studies (Täckström et al., 2013; Duong et al., 2013a; McDonald et al., 2011).

[7] Since there are only 3 transitions: SHIFT, LEFT-ARC, RIGHT-ARC.
[8] Since the Universal Dependency Treebank has 40 universal relations, each relation is attached to LEFT-ARC or RIGHT-ARC. The number 81 comes from 1 (SHIFT) + 40 (LEFT-ARC) + 40 (RIGHT-ARC).


                                  cs    de    es    fi    fr    ga    hu    it    sv   | UAS   LAS
Direct Transfer                  47.2  57.9  64.7  44.9  64.8  49.1  47.8  64.9  55.5  | 55.2  42.7
Our Model + Syntactic embedding  50.2  60.9  67.9  51.4  66.0  51.6  52.3  69.2  59.6  | 58.8  45.8

Table 3: Results comparing a direct transfer parser and our model with syntactic word embeddings, evaluating UAS over the Universal Dependency Treebank. (We observed a similar pattern for LAS.) The rightmost UAS and LAS columns show the average scores for the respective metric across the 9 languages.

                              TARGET LANGUAGE
SOURCE   cs    de    en    es    fi    fr    ga    hu    it    sv   | UAS   LAS
cs      76.8  65.9  60.8  70.0  53.7  66.8  59.0  55.2  70.7  56.8  | 62.1  38.7
de      60.0  78.2  61.7  63.1  52.4  60.6  49.8  56.7  64.0  59.5  | 58.6  45.5
en      50.2  60.9  81.0  67.9  51.4  66.0  51.6  52.3  69.2  59.6  | 58.8  45.8
es      60.5  58.5  60.4  80.9  45.7  73.3  53.8  46.9  77.4  55.3  | 59.1  46.2
fi      49.0  41.8  44.5  33.6  71.5  35.2  24.4  44.6  31.7  43.1  | 38.7  25.5
fr      54.2  55.7  63.2  74.8  43.6  79.2  54.7  44.3  76.2  54.8  | 57.9  46.3
ga      32.8  35.3  39.8  56.3  23.5  52.6  72.3  26.0  58.3  32.6  | 39.7  26.7
hu      42.3  53.4  45.4  43.8  53.3  42.1  29.2  72.1  41.2  42.5  | 43.7  22.7
it      57.6  53.4  53.2  72.1  42.7  71.4  54.7  42.2  85.9  54.2  | 55.7  45.0
sv      49.1  59.2  54.9  59.8  47.9  55.7  48.5  52.7  62.2  78.4  | 54.4  41.2

Table 4: UAS for each language pair in the Universal Dependency Treebank using our best model. The UAS/LAS columns show the average UAS/LAS for all target languages, excluding the source language. The best UAS for each target language is shown in bold.

In this section we assume that we have multiple source languages. To see how the performance changes when using a different source language, we run our best model (i.e., using syntactic embeddings) for each language pair in the Universal Dependency Treebank. Table 4 shows the UAS for each language pair, and the average across all target languages for each source language. We also considered LAS, but observed similar trends, and therefore only report the average LAS for each source language. Observe that English is rarely the best source language; Czech and French give a higher average UAS and LAS, respectively. Interestingly, while Czech gives high UAS on average, it performs relatively poorly in terms of LAS.

One might expect that the relative performance from using different source languages is affected by the source corpus size, which varies greatly. We tested this question by limiting the source corpora to 66K sentences (and excluding the very small ga and hu datasets), which resulted in a slight reduction in scores but an overall near-identical pattern of results to using the full-sized source corpora reported in Table 4. Only in one instance did the best source language change (for target fi, with source de rather than cs), and the average rankings by UAS and LAS remained unchanged.

The ten languages considered belong to five families: Romance (French, Spanish, Italian), Germanic (German, English, Swedish), Slavic (Czech), Uralic (Hungarian, Finnish), and Celtic (Irish). At first glance it seems that language pairs in the same family tend to perform well. For example, the best source language for both French and Italian is Spanish, while the best source language for Spanish is French. However, this does not hold true for many target languages. For example, the best source language for both Finnish and German is Czech. It appears that the best choice of source language is not predictable from language family information alone.

We therefore propose two methods to predict the best source language for a given target language. In devising these methods we assume that for a given resource-poor target language we do not have access to any parsed data, as this is expensive to construct. The first method is based on the Jensen-Shannon divergence between the distributions of POS n-grams (1 < n < 6) in a pair of languages. The second method converts each language into a vector of binary features based on word-order information from WALS, the World Atlas of Language Structures (Dryer and Haspelmath, 2013). These features include the relative order of adjective and noun, etc., and we compute the cosine similarity between the vectors for a pair of languages.

           cs    de    en    es    fi    fr    ga    hu    it    sv   UAS   LAS
English  50.2  60.9    —   67.9  51.4  66.0  51.6  52.3  69.2  59.6  58.8  45.8
WALS     50.2  59.2  44.5  72.1  51.4  73.3  53.8  44.6  77.4  59.6  60.2  47.1
POS      49.1  58.5  53.2  74.8  53.7  73.3  53.8  56.7  76.2  56.8  61.4  47.7
Oracle   60.5  65.9  63.2  74.8  53.7  73.3  59.0  56.7  77.4  59.6  64.5  50.8
Combined 61.1  67.5  64.4  75.1  54.2  72.8  58.7  57.9  76.7  60.5  64.9  52.0

Table 5: UAS for target languages where the source language is selected in different ways. English uses English as the source language. WALS and POS choose the best source language using the WALS-based or POS-n-gram-based method, respectively. Oracle always uses the best source language. Combined is the model that combines information from all available source languages. The UAS/LAS columns show the average UAS/LAS performance across 9 languages (English is excluded).
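As an illustration of the two source-language selection methods just described, the sketch below (plain Python, with hypothetical POS-tagged corpora and WALS feature vectors; not the authors' code) computes the Jensen-Shannon divergence between POS n-gram distributions and the cosine similarity between binary WALS word-order vectors.

```python
import math
from collections import Counter

def pos_ngrams(tag_sequences, n_min=2, n_max=5):
    """POS n-gram counts for 1 < n < 6, i.e. n = 2..5."""
    counts = Counter()
    for tags in tag_sequences:
        for n in range(n_min, n_max + 1):
            for i in range(len(tags) - n + 1):
                counts[tuple(tags[i:i + n])] += 1
    return counts

def js_divergence(p_counts, q_counts):
    """Jensen-Shannon divergence between two n-gram count distributions."""
    vocab = set(p_counts) | set(q_counts)
    p = {t: p_counts.get(t, 0) / sum(p_counts.values()) for t in vocab}
    q = {t: q_counts.get(t, 0) / sum(q_counts.values()) for t in vocab}
    m = {t: 0.5 * (p[t] + q[t]) for t in vocab}
    kl = lambda a, b: sum(a[t] * math.log2(a[t] / b[t]) for t in vocab if a[t] > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def cosine(u, v):
    """Cosine similarity between two binary WALS feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical inputs: POS-tagged sentences for two languages, and
# binary word-order feature vectors extracted from WALS.
target_tags = [["DET", "NOUN", "VERB"], ["PRON", "VERB", "NOUN"]]
source_tags = [["NOUN", "VERB", "DET", "NOUN"]]
print(js_divergence(pos_ngrams(target_tags), pos_ngrams(source_tags)))
print(cosine([1, 0, 1, 1], [1, 1, 1, 0]))
```

A lower divergence (or a higher cosine similarity) indicates a more promising source language for the given target.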

As an alternative to selecting a single source language, we further propose a method to combine information from all available source languages to build a parser for a target language. To do so, we first train the syntactic word embeddings on all the languages. After this step, lexical items from all source languages and the target language will be in the same space. We then train our parser with syntactic word embeddings on the combined corpus of all source languages. This parser is then applied directly to the target language. The intuition here is that training on multiple source languages limits over-fitting to any single source language, and learns the "universal" structure of languages.

Table 5 shows the performance of each target language with the source language given by the model (in the case of models that select a single source language). Always choosing English as the source language performs worst. Using WALS features outperforms English on 7 out of 9 languages. Using POS n-grams outperforms the WALS feature model on average for both UAS and LAS, although the improvement is small. The combined model, which combines information from all available source languages, outperforms choosing a single source language. Moreover, this model performs even better than the oracle model, which always chooses the single best source language, especially for LAS. Compared with the baseline of always choosing English, our combined model gives an improvement of about 6% for both UAS and LAS.

6 Conclusions

Most prior work on cross-lingual transfer dependency parsing has relied on large parallel corpora. However, parallel data is scarce for resource-poor languages. In the first part of this paper we investigated building a dependency parser for a resource-poor language without parallel data. We improved the performance of a delexicalized neural network parser using syntactic word embeddings. We showed that syntactic word embeddings are better at capturing syntactic information, and are particularly suitable for dependency parsing. In contrast to the state-of-the-art for unsupervised cross-lingual dependency parsing, our method does not rely on parallel data. Although the state-of-the-art achieves bigger gains over the baseline than our method, our approach could be more widely applied to resource-poor languages because of its lower resource requirements. Moreover, we have described how our method could be used to complement previous approaches.

The second part of this paper studied ways of improving performance when multiple source languages are available. We proposed two methods to select a single source language that both lead to improvements over always choosing English as the source language. We then showed that we can further improve performance by combining information from all the source languages. In summary, without any parallel data, we managed to improve the direct transfer delexicalized parser by about 10% for both UAS and LAS on average, for 9 languages in the Universal Dependency Treebank.

In this paper we focused only on word embeddings; however, in future work we could also build POS embeddings and arc-label embeddings across languages. This could help our system to move more freely across languages, facilitating not only the development of NLP for resource-poor languages, but also cross-language comparisons.

Acknowledgments

This work was supported by the University of Melbourne and National ICT Australia (NICTA). Dr Cohn is the recipient of an Australian Research Council Future Fellowship (project number FT130101105).

References

Jacob Andreas and Dan Klein. 2014. How much do word embeddings encode about syntax? In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 822–827, Baltimore, Maryland.

Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2014. Tailoring continuous word representations for dependency parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 809–815.

Alena Bohmova, Jan Hajic, Eva Hajicova, and Barbora Hladka. 2001. The Prague Dependency Treebank: A three-level annotation scenario. In Anne Abeille, editor, Treebanks: Building and Using Syntactically Annotated Corpora, pages 103–127.

Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18:467–479.

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the Tenth Conference on Computational Natural Language Learning.

Razvan C. Bunescu and Raymond J. Mooney. 2005. A shortest path dependency kernel for relation extraction. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT '05, pages 724–731.

Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750, Doha, Qatar.

Wenliang Chen, Yue Zhang, and Min Zhang. 2014. Feature embedding for dependency parsing. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 816–826, Dublin, Ireland.

Hang Cui, Renxu Sun, Keya Li, Min-Yen Kan, and Tat-Seng Chua. 2005. Question answering passage retrieval using dependency relations. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '05, pages 400–407, New York, NY, USA.

Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online. Leipzig.

Long Duong, Paul Cook, Steven Bird, and Pavel Pecina. 2013a. Increasing the quality and quantity of source language data for unsupervised cross-lingual POS tagging. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pages 1243–1249, Nagoya, Japan.

Long Duong, Paul Cook, Steven Bird, and Pavel Pecina. 2013b. Simpler unsupervised POS tagging with bilingual projections. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 634–639, Sofia, Bulgaria.

Long Duong, Trevor Cohn, Karin Verspoor, Steven Bird, and Paul Cook. 2014. What can we get from 1000 tokens? A case study of multilingual POS tagging for resource-poor languages. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 886–897, Doha, Qatar.

Dan Garrette, Jason Mielens, and Jason Baldridge. 2013. Real-world semi-supervised learning of POS-taggers for low-resource languages. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL-2013), pages 583–592, Sofia, Bulgaria.

Douwe Gelling, Trevor Cohn, Phil Blunsom, and Joao Graca. 2012. The PASCAL challenge on grammar induction.

Karl Moritz Hermann and Phil Blunsom. 2014a. Multilingual distributed representations without word alignment. In Proceedings of ICLR.

Karl Moritz Hermann and Phil Blunsom. 2014b. Multilingual models for compositional distributed semantics. CoRR, abs/1404.4641.

Rebecca Hwa, Philip Resnik, Amy Weinberg, Clara Cabezas, and Okan Kolak. 2005. Bootstrapping parsers via syntactic projection across parallel texts. Natural Language Engineering, 11:311–325.

Dan Klein and Christopher Manning. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, ACL '04.


Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit (MT Summit X), pages 79–86, Phuket, Thailand.

Xuezhe Ma and Fei Xia. 2014. Unsupervised dependency parsing with transferring distribution via parallel guidance and entropy regularization. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1337–1348.

Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005a. Online large-margin training of dependency parsers. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL '05, pages 91–98.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajic. 2005b. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT '05, pages 523–530.

Ryan McDonald, Slav Petrov, and Keith Hall. 2011. Multi-source transfer of delexicalized dependency parsers. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 62–72.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119.

Tahira Naseem, Regina Barzilay, and Amir Globerson. 2012. Selective sharing for multilingual dependency parsing. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL '12, pages 629–637.

Joakim Nivre, Johan Hall, Sandra Kubler, Ryan McDonald, Jens Nilsson, Sebastian Riedel, and Deniz Yuret. 2007. The CoNLL 2007 shared task on dependency parsing. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 915–932, Prague, Czech Republic.

Joakim Nivre, Cristina Bosco, Jinho Choi, Marie-Catherine de Marneffe, Timothy Dozat, Richard Farkas, Jennifer Foster, Filip Ginter, Yoav Goldberg, Jan Hajic, Jenna Kanerva, Veronika Laippala, Alessandro Lenci, Teresa Lynn, Christopher Manning, Ryan McDonald, Anna Missila, Simonetta Montemagni, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Maria Simi, Aaron Smith, Reut Tsarfaty, Veronika Vincze, and Daniel Zeman. 2015. Universal dependencies 1.0.

Levent Ozgur and Tunga Gungor. 2010. Text classification with the support of pruned dependency patterns. Pattern Recognition Letters, 31:1598–1607.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey.

Anders Søgaard. 2011. Data point selection for cross-language adaptation of dependency parsers. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, HLT '11, pages 682–686.

Oscar Tackstrom, Ryan McDonald, and Jakob Uszkoreit. 2012. Cross-lingual word clusters for direct transfer of linguistic structure. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT '12, pages 477–487.

Oscar Tackstrom, Ryan McDonald, and Joakim Nivre. 2013. Target language adaptation of discriminative transfer parsers. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1061–1071, Atlanta, Georgia.

Joseph Turian, Lev-Arie Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394, Uppsala, Sweden.

Min Xiao and Yuhong Guo. 2014. Distributed word representation learning for cross-lingual dependency parsing. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 119–129.

Peng Xu, Jaeho Kang, Michael Ringgaard, and Franz Och. 2009. Using a dependency parser to improve SMT for subject-object-verb languages. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 245–253, Boulder, Colorado.

Daniel Zeman and Philip Resnik. 2008. Cross-language parser adaptation between related languages. In IJCNLP-08 Workshop on NLP for Less Privileged Languages, pages 35–42.

Will Y. Zou, Richard Socher, Daniel Cer, and Christopher D. Manning. 2013. Bilingual word embeddings for phrase-based machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1393–1398, Seattle, Washington, USA.


Proceedings of the 19th Conference on Computational Natural Language Learning, pages 123–131, Beijing, China, July 30-31, 2015. ©2015 Association for Computational Linguistics

Detecting Semantically Equivalent Questions in Online User Forums

Dasha Bogdanova*, Cícero dos Santos†, Luciano Barbosa† and Bianca Zadrozny†
*ADAPT Centre, School of Computing, Dublin City University, Dublin, Ireland
[email protected]
†IBM Research, 138/146 Av. Pasteur, Rio de Janeiro, Brazil
{cicerons,lucianoa,biancaz}@br.ibm.com

Abstract

Two questions asking the same thing can be very different in terms of vocabulary and syntactic structure, which makes identifying their semantic equivalence challenging. This study aims to detect semantically equivalent questions in online user forums. We perform an extensive number of experiments using data from two different Stack Exchange forums. We compare standard machine learning methods such as Support Vector Machines (SVM) with a convolutional neural network (CNN). The proposed CNN generates distributed vector representations for pairs of questions and scores them using a similarity metric. We evaluate in-domain word embeddings versus ones trained on Wikipedia, estimate the impact of the training set size, and evaluate some aspects of domain adaptation. Our experimental results show that the convolutional neural network with in-domain word embeddings achieves high performance even with limited training data.

1 Introduction

Question-answering (Q&A) community sites, such as Yahoo! Answers,1 Quora2 and Stack Exchange,3 have gained a lot of attention in recent years. Most Q&A community sites advise users to search the forum for an answer before posting a new question. However, this is not always an easy task because different users can formulate the same question in completely different ways. Some user forums, such as those of the Stack Exchange online community, have a duplication policy. Exact duplicates, such as copy-and-paste questions, and nearly exact duplicates are usually quickly detected, closed and removed from the forum. Nevertheless, some duplicate questions are kept. The main reason for this is that there are many ways to ask the same question, and a user might not be able to find the answer if they ask it in a different way.4

1 https://answers.yahoo.com/
2 http://www.quora.com
3 http://stackexchange.com/

In this study we define two questions as semantically equivalent if they can be adequately answered by the exact same answer. Table 1 presents an example of a pair of such questions from the Ask Ubuntu forum. Detecting semantically equivalent questions is a very difficult task due to two main factors: (1) the same question can be rephrased in many different ways; and (2) two questions could be asking different things but look for the same solution. Therefore, traditional similarity measures based on word overlap, such as shingling and the Jaccard coefficient (Broder, 1997) and its variations (Wu et al., 2011), are not able to capture many cases of semantic equivalence.

In this paper, we propose a convolutional neural network architecture to detect semantically equivalent questions. The proposed CNN first transforms words into word embeddings (Mikolov et al., 2013), using a large collection of unlabeled data, and then applies a convolutional network to build distributed vector representations for pairs of questions. Finally, it scores the questions using a similarity metric. Pairs of questions with similarity above a threshold, defined based on a held-out set, are considered duplicates. The CNN is trained using positive and negative question pairs, i.e., pairs that are and are not semantically equivalent. During training, the CNN is induced to produce similar vector representations for questions that are semantically equivalent.

We perform an extensive number of experiments using data from two different Stack Exchange forums. We compare CNN performance with a traditional classification algorithm (Support Vector Machines (Cortes and Vapnik, 1995)) and a duplicate detection approach (shingling (Broder, 1997)). The results show that the CNN outperforms the baselines by a large margin.

4 http://stackoverflow.com/help/duplicates

Question 1 Title: I can't download anything and I can't watch videos
Question 1 Body: Two days ago I tried to download skype and it says an error occurred it says end of central directory signature not found Either this file is not a zipfile, or it constitutes one disk of a multi-part archive. In the latter case the central directory and zipfile comment will be found on the last disk(s) of this archive. zipinfo: cannot find zipfile directory in one of /home/maria/Downloads/SkypeSetup-aoc-jd.exe or /home/maria/Downloads/SkypeSetup-aoc-jd.exe.zip, and cannot find /home/maria/Downloads/SkypeSetup-aoc-jd.exe.ZIP pe... this happens whenever I try to download anything like games and also i can't watch videoss it's looking for plug ins but it doesn't find them i hate this [sic!]
Question 1 Link: http://askubuntu.com/questions/364350

Question 2 Title: How can I install Windows software or games?
Question 2 Body: Can .exe and .msi files (Windows software) be installed in Ubuntu? [sic!]
Question 2 Link: http://askubuntu.com/questions/988

Possible Answer (shortened version): .exe files are not binary-compatible with Ubuntu. There are, however, compatibility layers for Linux, such as Wine, that are capable of running .exe.

Table 1: An example of semantically equivalent questions from the Ask Ubuntu community.

We also investigate the impact of different word embeddings by analyzing the performance of the network with: (1) word embeddings pre-trained on in-domain data and on all of the English Wikipedia; (2) word vectors of different dimensionalities; (3) training sets of different sizes; and (4) out-of-domain training data and in-domain word embeddings. The numbers show that: (1) word embeddings pre-trained on domain-specific data achieve very high performance; (2) bigger word embeddings obtain higher accuracy; (3) in-domain word embeddings provide better performance independent of the training set size; and (4) in-domain word embeddings achieve relatively high accuracy even using out-of-domain training data.

2 Task

This work focuses on the task of predicting semantically equivalent questions in online user forums. Following the duplication policy of the Stack Exchange online community,5 we define semantically equivalent questions as follows:

Definition 1. Two questions are semantically equivalent if they can be adequately answered by the exact same answer.

Since our definition of semantically equivalent questions corresponds to the rules of the Stack Exchange duplication policy, we assume that all questions of this community that were marked as duplicates are semantically equivalent. An example of such questions is given in Table 1. These questions vary significantly in vocabulary, style, length and content quality. However, both questions require the exact same answer.

5 http://blog.stackoverflow.com/2010/11/dr-strangedupe-or-how-i-learned-to-stop-worrying-and-love-duplication/; http://meta.stackexchange.com/questions/32311/do-not-delete-good-duplicates

The exact task that we approach in this study consists in, given two problem definitions, predicting whether they are semantically equivalent. By problem definition we mean the concatenation of the title and the body of a question. Throughout this paper we use the term question as a synonym for problem definition.

3 Related Work

The development of CNN architectures for tasks that involve sentence-level and document-level processing is currently an area of intensive research in natural language processing and information retrieval, with many recent encouraging results.

Kim (2014) proposes a simple CNN for sentence classification built on top of word2vec (Mikolov et al., 2013). A multichannel variant, which combines static word vectors from word2vec and word vectors that are fine-tuned via backpropagation, is also proposed. Experiments with different variants are performed on a number of benchmarks for sentence classification, showing that the simple CNN performs remarkably well, with state-of-the-art results on many of the benchmarks, highlighting the importance of unsupervised pre-training of word vectors for this task.

Hu et al. (2014) propose a CNN architecture for hierarchical sentence modeling and, based on that, two architectures for sentence matching.


They train the latter networks using a ranking-based loss function on three sentence matching tasks of different nature: sentence completion, response matching and paraphrase identification. The proposed architectures outperform previous work for sentence completion and response matching, while the results are slightly worse than the state-of-the-art in paraphrase identification.

Yih et al. (2014) focus on the task of answering single-relation factual questions, using a novel semantic similarity model based on a CNN architecture. Using this architecture, they train two models: one for linking a question to an entity in the DB and the other for mapping a relation pattern to a relation in the DB. Both models are then combined for inferring the entity that is the answer. This approach leads to higher precision on Q&A data from the WikiAnswers corpus than the existing rules-based approach for this task.

Dos Santos & Gatti (2014) developed a CNN architecture for sentiment analysis of short texts that jointly uses character-level, word-level and sentence-level information, achieving state-of-the-art results on well-known sentiment analysis benchmarks.

For the specific task of detecting semantically equivalent questions that we address in this paper, we are not aware of any previous work using CNNs. Muthmann and Petrova (2014) approach the task of identifying topical near-duplicate relations between questions from social media as a classification task. They use a simple lexico-syntactical feature set and evaluate different classifiers, with logistic regression reported as the best performing one. However, it is not possible to directly compare our results to theirs because their experimental methodology is not clearly described in the paper.

There are several tasks related to identifying semantically equivalent questions. These tasks include near-duplicate detection, paraphrase identification and textual semantic similarity estimation. In what follows, we outline the differences between these tasks and the one addressed in this work.

Duplicate and Near-Duplicate Detection aims to detect exact copies or almost exact copies of the same document in corpora. Duplicate detection is an important component of systems for Web crawling and Web search, where it is important to identify redundant data in large corpora. Common techniques to detect duplicate documents include shingling (Broder, 1997; Alonso et al., 2013) and fingerprinting (Manku et al., 2007). State-of-the-art work also focuses on the efficiency issues of the task (Wu et al., 2011). It is worth noting that even though all duplicate and near-duplicate questions are also semantically equivalent, the reverse is not true. Semantically equivalent questions could have small or no word overlap (see Table 1 for an example), and thus are not duplicates.

Paraphrase Identification is the task of examining two sentences and determining whether they have the same meaning (Socher et al., 2011). If two questions are paraphrases, they are also semantically equivalent. However, many semantically equivalent questions are not paraphrases. The questions shown in Table 1 significantly differ in the details they provide, and thus could not be considered as having the same meaning. State-of-the-art approaches to paraphrase identification include using machine translation evaluation metrics (Madnani et al., 2012) and deep learning techniques (Socher et al., 2011).

Textual Semantic Similarity is the task of measuring the degree of semantic similarity between two texts, usually on a graded scale from 0 to 5, with 5 being the most similar (Agirre et al., 2013) and meaning that the texts are paraphrases. All semantically equivalent questions are somewhat semantically similar, but the semantic equivalence of questions defined here does not correspond to the highest value of textual semantic similarity, for the same reason that these questions are not always paraphrases.

4 Neural Network Architecture

In this section, we present our neural-network strategy for detecting semantically equivalent questions.

4.1 Feed Forward Processing

As detailed in Figure 1, the input for the network is the tokenized text strings of the two questions. In the first step, the CNN transforms words into real-valued feature vectors, also known as word embeddings or word representations. Next, a convolutional layer is used to construct two distributed vector representations $r_{q_1}$ and $r_{q_2}$, one for each input question. Finally, the CNN computes a similarity score between $r_{q_1}$ and $r_{q_2}$. Pairs of questions with similarity above a threshold, defined based on a held-out set, are considered duplicates.

Figure 1: Convolutional neural network for semantically equivalent questions detection.

4.2 Word Representations

The first layer of the network transforms words into representations that capture syntactic and semantic information about the words. Given a question consisting of $N$ words $q = \{w_1, w_2, ..., w_N\}$, every word $w_n$ is converted into a real-valued vector $r_{w_n}$. Therefore, for each question, the input to the next NN layer is a sequence of real-valued vectors $q^{emb} = \{r_{w_1}, r_{w_2}, ..., r_{w_N}\}$.

Word representations are encoded by column vectors in an embedding matrix $W^0 \in \mathbb{R}^{d \times |V|}$, where $V$ is a fixed-sized vocabulary. Each column $W^0_i \in \mathbb{R}^d$ corresponds to the word embedding of the $i$-th word in the vocabulary. We transform a word $w$ into its word embedding $r_w$ by using the matrix-vector product:

$$r_w = W^0 v_w \qquad (1)$$

where $v_w$ is a vector of size $|V|$ which has value 1 at index $w$ and zero in all other positions. The matrix $W^0$ is a parameter to be learned, and the size of the word embedding $d$ is a hyper-parameter to be chosen by the user.
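As a toy illustration of Equation (1) (a NumPy sketch with made-up dimensions, not the authors' implementation), the embedding lookup can be computed either as the matrix-vector product above or, equivalently, by selecting the corresponding column of $W^0$:

```python
import numpy as np

d, V = 4, 6                        # toy embedding size and vocabulary size
rng = np.random.default_rng(0)
W0 = rng.normal(size=(d, V))       # embedding matrix W^0, learned in training

w = 2                              # index of word w in the vocabulary
v_w = np.zeros(V)
v_w[w] = 1.0                       # one-hot vector v_w

r_w = W0 @ v_w                     # Equation (1): r_w = W^0 v_w
assert np.allclose(r_w, W0[:, w])  # identical to selecting column w of W^0
```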

4.3 Question Representation and Scoring

The next step in the CNN consists in creating distributed vectors from the word embedding representations of the input questions. To perform this task, the CNN must deal with two main challenges: different questions can have different sizes, and important information can appear at any position in the question. The convolutional approach (Waibel et al., 1989) is a natural choice to tackle these challenges. In recent work, convolutional approaches have been used to solve similar problems when creating representations for text segments of different sizes (dos Santos and Gatti, 2014) and character-level representations of words of different sizes (dos Santos and Zadrozny, 2014). Here, we use a convolutional layer to compute the question-wide distributed vector representations $r_{q_1}$ and $r_{q_2}$. For each question, the convolutional layer first produces local features around each word in the question. Then, it combines these local features using a sum operation to create a fixed-sized feature vector (representation) for the question.

Given a question $q_1$, the convolutional layer applies a matrix-vector operation to each window of size $k$ of successive windows in $q_1^{emb} = \{r_{w_1}, r_{w_2}, ..., r_{w_N}\}$. Let us define the vector $z_n \in \mathbb{R}^{dk}$ as the concatenation of a sequence of $k$ word embeddings, centralized in the $n$-th word:6

$$z_n = \left(r_{w_{n-(k-1)/2}}, ..., r_{w_{n+(k-1)/2}}\right)^T$$

The convolutional layer computes the $j$-th element of the vector $r_{q_1} \in \mathbb{R}^{clu}$ as follows:

$$[r_{q_1}]_j = f\left( \sum_{1 < n < N} \left[ f\left( W^1 z_n + b^1 \right) \right]_j \right) \qquad (2)$$

where $W^1 \in \mathbb{R}^{clu \times dk}$ is the weight matrix of the convolutional layer and $f$ is the hyperbolic tangent function. The same matrix is used to extract local features around each word window of the given question. The global fixed-sized feature vector for the question is obtained by using the sum over all word windows.7 Matrix $W^1$ and vector $b^1$ are parameters to be learned. The number of convolutional units $clu$ (which corresponds to the size of the question representation) and the size of the word context window $k$ are hyper-parameters to be chosen by the user.

Given $r_{q_1}$ and $r_{q_2}$, the representations for the input pair of questions $(q_1, q_2)$, the last layer of the CNN computes a similarity score between $q_1$ and $q_2$. In our experiments we use the cosine similarity $s(q_1, q_2) = \frac{\vec{r}_{q_1} \cdot \vec{r}_{q_2}}{\|\vec{r}_{q_1}\|\,\|\vec{r}_{q_2}\|}$.

6 Words with indices outside of the sentence boundaries use a common padding embedding.

7 Using the max operation instead of the sum produces very similar results.
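The following sketch (plain NumPy with made-up dimensions; not the Theano implementation used in the paper) illustrates how a fixed-size question representation is built from word embeddings following Equation (2), and how a pair of questions is scored with the cosine similarity:

```python
import numpy as np

def question_representation(embeddings, W1, b1, k=3):
    """Builds a fixed-size question vector from an (N, d) matrix of word
    embeddings: tanh over each k-word window, summed over all windows, with
    an outer tanh, as in Equation (2)."""
    N, d = embeddings.shape
    pad = k // 2
    # Boundary words use a common padding embedding (zeros in this sketch).
    padded = np.vstack([np.zeros((pad, d)), embeddings, np.zeros((pad, d))])
    windows = [padded[n:n + k].reshape(-1) for n in range(N)]  # the z_n vectors
    summed = sum(np.tanh(W1 @ z + b1) for z in windows)
    return np.tanh(summed)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy setup: d = 5-dimensional embeddings, clu = 8 convolutional units, k = 3.
rng = np.random.default_rng(1)
d, clu, k = 5, 8, 3
W1, b1 = rng.normal(size=(clu, d * k)), np.zeros(clu)
q1 = rng.normal(size=(6, d))   # embeddings of a 6-word question
q2 = rng.normal(size=(9, d))   # embeddings of a 9-word question
score = cosine(question_representation(q1, W1, b1, k),
               question_representation(q2, W1, b1, k))
print(score)  # pairs scoring above a held-out threshold are taken as duplicates
```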


4.4 Training Procedure

Our network is trained by minimizing the mean-squared error over the training set $D$. Given a question pair $(q_1, q_2)$, the network with parameter set $\theta$ computes a similarity score $s_\theta(q_1, q_2)$. Let $y_{(q_1,q_2)}$ be the correct label of the pair, where its possible values are 1 (equivalent questions) or 0 (not equivalent questions). We use stochastic gradient descent (SGD) to minimize the mean-squared error with respect to $\theta$:

$$\theta \mapsto \sum_{(x,y) \in D} \frac{1}{2}\left( y - s_\theta(x) \right)^2 \qquad (3)$$

where $x = (q_1, q_2)$ corresponds to a question pair in the training set $D$ and $y$ represents its respective label $y_{(q_1,q_2)}$.

We use the backpropagation algorithm to compute the gradients of the network. In our experiments, we implement the CNN architecture and the backpropagation algorithm using Theano (Bergstra et al., 2010).
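A minimal sketch of this objective (Equation (3)) on a hypothetical mini-batch of network outputs; in the paper the gradients of this loss with respect to all parameters are obtained by backpropagation in Theano, so only the loss itself is shown here:

```python
import numpy as np

def mse_loss(scores, labels):
    """Equation (3): sum over the training pairs of 1/2 (y - s_theta(x))^2."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return 0.5 * np.sum((labels - scores) ** 2)

# Hypothetical similarity scores produced by the network for three question
# pairs, together with their gold labels (1 = equivalent, 0 = not equivalent).
scores = [0.91, 0.15, 0.62]
labels = [1, 0, 0]
print(mse_loss(scores, labels))
```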

5 Experimental Setup

5.1 Data

In our experiments we use data from the Ask Ubuntu Community Questions and Answers (Q&A) site.8 Ask Ubuntu is a community for Ubuntu users and developers, and it is part of the Stack Exchange9 Q&A communities. The users of these communities can ask and answer questions, and vote up and down both questions and answers. Users with high reputation become moderators and can label a new question as a duplicate of an existing question.10 Usually it takes five votes from different moderators to close a question or to mark it as a duplicate.

We use the Ask Ubuntu data dump provided in May 2014. We extract all question pairs linked as duplicates. The data dump we use contains 15277 such pairs. For our experiments, we randomly select a training set of 24K pairs, a test set of 6K and a validation set of 1K, making sure there are no overlaps between the sets. Half of each set contains pairs of semantically equivalent questions (positive pairs) and half are pairs of questions that are not semantically equivalent. The latter pairs are randomly generated from the corpus. The data was tokenized with NLTK (Bird et al., 2009), and all links were replaced by a unique string.

8 http://askubuntu.com/
9 http://stackexchange.com
10 More information about the Stack Exchange communities can be found here: http://stackexchange.com/tour

For the experiments on a different domain (see Section 6.4) we use the Meta Stack Exchange11 data dump provided in September 2014. Meta Stack Exchange (Meta) is used to discuss the Stack Exchange community itself. People ask questions about the rules, features and possible bugs. The data dump we use contains 67746 questions, of which 19456 are marked as duplicates. For the experiments on this data set, we select random balanced disjoint sets of 20K pairs for training, 1K for validation and 4K for testing. We prepare the data in exactly the same manner as the Ask Ubuntu data.

5.2 Baselines

We explore three main baselines: a method based on the Jaccard coefficient, which was reported to provide high accuracy for the task of duplicate detection (Wu et al., 2011); a Support Vector Machine (SVM) classifier (Cortes and Vapnik, 1995); and the combination of the two.

For the first baseline, documents are first represented as sets of shingles of lengths from one to four, and then the Jaccard coefficient for a pair of documents is calculated as follows:

$$J(S(d_1), S(d_2)) = \frac{|S(d_1) \cap S(d_2)|}{|S(d_1) \cup S(d_2)|},$$

where $S(d_i)$ is the set of shingles generated from the $i$-th document. High values of the Jaccard coefficient denote high similarity between the documents. If the value exceeds a threshold $T$, the documents are considered semantically equivalent. In this case, the training data is used to select the optimal threshold $T$.
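A small sketch of this baseline (not the authors' implementation): shingles of length one to four are extracted from each question and compared with the Jaccard coefficient, and the decision threshold T would be tuned on the training data.

```python
def shingles(tokens, max_n=4):
    """All word n-grams (shingles) of length 1..max_n, as a set."""
    return {tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)}

def jaccard(tokens1, tokens2, max_n=4):
    s1, s2 = shingles(tokens1, max_n), shingles(tokens2, max_n)
    return len(s1 & s2) / len(s1 | s2) if (s1 or s2) else 0.0

q1 = "how can i install windows software".split()
q2 = "can windows software be installed in ubuntu".split()
T = 0.1  # illustrative threshold; in the paper T is selected on training data
print(jaccard(q1, q2), jaccard(q1, q2) > T)
```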

For the SVM baseline, we represent documents with n-grams of length up to four. For each pair of questions and each n-gram we generate three features: (1) whether the n-gram is present in the first question; (2) whether the n-gram is present in the second question; and (3) the overall normalized count of the n-gram in the two questions. We use the RBF kernel and perform a grid search to optimize the values of the C and γ parameters. We use a frequency threshold12 to reduce the number of features. The implementation provided by LibSVM (Chang and Lin, 2011) is used.

11 meta.stackexchange.com
12 Several values (2, 5, 35 and 100) were tried with cross-validation; the threshold with value 5 was selected.

In order to combine the two baselines, for a pair of questions we calculate the values of the Jaccard coefficient with shingle sizes up to four, and then add these values as additional features used by the SVM classifier.

5.3 Word Embeddings

The word embeddings used in our experiments are initialized by means of unsupervised pre-training. We perform pre-training using the skip-gram NN architecture (Mikolov et al., 2013) available in the word2vec13 tool. Two different corpora are used to train word embeddings for most of the experiments: the English Wikipedia and the Ask Ubuntu community data. The experiments presented in Section 6.4 also use word embeddings trained on the Meta Stack Exchange community data.

In the experiments with the English Wikipedia word embeddings, we use the embeddings previously produced by dos Santos & Gatti (2014). They used the December 2013 snapshot of the English Wikipedia corpus to obtain word embeddings with word2vec.

In the experiments with Ask Ubuntu and Meta Stack Exchange word embeddings, we use the Stack Exchange data dump provided in May 2014 to train word2vec. Three main steps are used to process all questions and answers from these Stack Exchange dumps: (1) tokenization of the text using the NLTK tokenizer; (2) image removal, URL replacement and prefixing/removal of the code if necessary (see Section 6.1 for more information); and (3) lowercasing of all tokens. The resulting corpora contain about 121 million and 19 million tokens for Ask Ubuntu and Meta Stack Exchange, respectively.
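A sketch of this pre-training step, using NLTK for tokenization and gensim's skip-gram implementation as a stand-in for the word2vec tool (the URL placeholder string, the toy corpus and the min_count setting are illustrative; the paper uses 200-dimensional vectors):

```python
import re
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec

URL = re.compile(r"https?://\S+")

def preprocess(post: str):
    """Replace links by a unique string, tokenize, and lowercase."""
    post = URL.sub("URLTOKEN", post)
    return [tok.lower() for tok in word_tokenize(post)]

# 'posts' stands for all questions and answers from the Stack Exchange dump.
posts = ["How do I install Wine? See https://askubuntu.com/q/988",
         "Use the package manager to install it."]
sentences = [preprocess(p) for p in posts]

# Skip-gram (sg=1) embeddings with 200 dimensions, as in the experiments.
model = Word2Vec(sentences, vector_size=200, sg=1, min_count=1, workers=2)
vector = model.wv["install"]   # pre-trained embedding used to initialize the CNN
```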

6 Experimental Results

6.1 Comparison with Baselines

The Ask Ubuntu community gives users the opportunity to format parts of their posts as code by using code tags (an example is shown in italics in Table 1). This includes not only programming code, but also commands, paths to directories, names of packages, error messages and links. Around 30% of all posts in the data dump contain code tags. Since the rules for code formatting are not well defined, it was not clear whether a learning algorithm would benefit from including it or not. Therefore, for each algorithm we tested three different approaches to handling code: keeping it as text; removing it; and prefixing it with a special tag. The latter is done in order to distinguish between the same term used within text and within code or a command (e.g., for used as a preposition versus for in a for loop). When creating the word embeddings, the same approach to the code as for the training data was followed.

13 http://code.google.com/p/word2vec/

System            Valid. Acc.  Test Acc.
SVM + shingles    85.5         82.4
CNN + Askubuntu   93.4         92.9

Table 3: CNN and SVM accuracy on the validation and the test set using the full training set.
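A minimal sketch of the three code-handling strategies described above (assuming posts where code spans are wrapped in <code> tags; the CODE_ prefix is an illustrative choice of special tag, not necessarily the one used by the authors):

```python
import re

CODE_SPAN = re.compile(r"<code>(.*?)</code>", flags=re.DOTALL)

def keep_code(post):
    """Keep code spans as ordinary text."""
    return CODE_SPAN.sub(lambda m: m.group(1), post)

def remove_code(post):
    """Drop code spans entirely."""
    return CODE_SPAN.sub(" ", post)

def prefix_code(post):
    """Prefix every token inside a code span, so that e.g. 'for' as a
    preposition and 'for' in a for loop become different tokens."""
    return CODE_SPAN.sub(
        lambda m: " ".join("CODE_" + tok for tok in m.group(1).split()), post)

post = "run <code>for a in range(3): print(a)</code> to loop"
print(keep_code(post))
print(remove_code(post))
print(prefix_code(post))
```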

The 1K-example validation set is used to tune the hyper-parameters of the algorithms. In order to speed up computations, we perform our initial experiments using a balanced 4K-example subset of the training set. The best validation accuracies are reported in Table 2.

We test the shingling-based approach with different shingle sizes. As Table 2 indicates, the accuracy decreases as the shingle size increases. The fact that much better accuracy is achieved when comparing questions based on simple word overlap (shingle size 1) suggests that semantically equivalent questions are not duplicates but rather have topical similarity. The SVM baseline performs well only when combined with the shingling approach, by using the values of the Jaccard coefficient for shingle sizes up to four as additional features. A possible reason for this is that n-gram representations do not capture enough information about semantic equivalence. The CNN with word embeddings outperforms the baselines by a significant margin.

The results presented in Table 2 indicate that the algorithms do not benefit from including the code. This is probably because the code tags are not always used appropriately and some code examples include long error messages, which makes the user-generated data even noisier. Therefore, in the following experiments the code is removed.

The validation accuracy and the test accuracy using the full 24K training set are presented in Table 3. The SVM with four additional shingling features is found to be the best among the baselines (see Table 2) and is used as the baseline in this experiment. Again, the CNN with word embeddings outperforms the best baseline by a significant margin.


Algorithm           Features                   Code      Best validation acc.  Optimal hyper-parameters
SVM-RBF             binary + freq.             kept      66.2                  C=8.0, γ≈3.05e-05
SVM-RBF             binary + freq.             removed   66.53                 C=2.0, γ≈1.2e-04
SVM-RBF             binary + freq.             prefixed  66.53                 C=8.0, γ≈3.05e-05
Shingling (size 1)  -                          kept      72.35                 -
Shingling (size 1)  -                          removed   72.65                 -
Shingling (size 1)  -                          prefixed  70.94                 -
Shingling (size 2)  -                          kept      69.24                 -
Shingling (size 2)  -                          removed   66.83                 -
Shingling (size 2)  -                          prefixed  67.74                 -
Shingling (size 3)  -                          kept      65.23                 -
Shingling (size 3)  -                          removed   62.93                 -
Shingling (size 3)  -                          prefixed  64.43                 -
SVM-RBF             binary + freq. + shingles  kept      74.0                  C=32.0, γ≈3.05e-05
SVM-RBF             binary + freq. + shingles  removed   77.4                  C=32.0, γ≈3.05e-05
SVM-RBF             binary + freq. + shingles  prefixed  73.6                  C=32.0, γ≈3.05e-05
CNN                 Askubuntu word vectors     kept      91.3
CNN                 Askubuntu word vectors     removed   92.4                  d=200, k=3, clu=300, λ=0.005
CNN                 Askubuntu word vectors     prefixed  91.4

Table 2: Validation accuracy and best parameters for the baselines and the Convolutional Neural Network.

6.2 Impact of Domain-Specific Word Embeddings

We perform two experiments to evaluate the impact of the word embeddings on the CNN accuracy. In the first experiment, we gradually increase the dimensionality of the word embeddings from 50 to 400. The results are presented in Figure 2. The vertical axis corresponds to validation accuracy and the horizontal axis represents the training time in epochs. As has been shown in (Mikolov et al., 2013), word embeddings of higher dimensionality trained on a large enough data set capture semantic information better than those of smaller dimensionality. The experimental results presented in Figure 2 correspond to these findings: we can see improvements in the neural network performance when increasing the word embedding dimensionality from 50 to 100 and from 100 to 200. However, the Ask Ubuntu data set, containing approximately 121M tokens, is not big enough for an improvement when increasing the dimensionality from 200 to 400.

In the second experiment, we evaluate the impact of in-domain word embeddings on the network's performance. We obtain word embeddings trained on two different corpora: Ask Ubuntu community data and the English Wikipedia (see Section 5.3). Both word embeddings have 200 dimensions. The results presented in Table 4 show that training on in-domain data is more beneficial for the network, even though the corpus used to create the word embeddings is much smaller.

Figure 2: CNN accuracy depending on the size of word embeddings.

Word Embeddings  Num. tokens  Valid. Acc.
Wikipedia        ≈1.6B        85.5
AskUbuntu        ≈121M        92.4

Table 4: Validation accuracy of the CNN with word embeddings pre-trained on different corpora.

6.3 Impact of Training Set Size

In order to measure the impact of the training set size, we perform experiments using subsets of the training data, starting from 100 question pairs and gradually increasing the size to the full 24K training set.14 Figure 3 compares the learning curves for the SVM baseline (with the parameters and features described in Section 6.1) and for the CNN with word embeddings trained on Ask Ubuntu and English Wikipedia. The vertical axis corresponds to the validation accuracy, and the horizontal axis represents the training set size. As Figure 3 indicates, increasing the size of the training set provides improvements. Nonetheless, the difference in accuracy between training with the full 24K training set and the 4K subset is about 9% for the SVM and only about 1% for the CNN. This difference is small for word embeddings pre-trained on both Ask Ubuntu and Wikipedia, but the in-domain word embeddings provide better performance independently of the training set size.

14 We use sets of 100, 1000, 4000, 12000 and 24000 question pairs.

Figure 3: Validation accuracy for the baseline and the CNN depending on the size of the training set.

6.4 Domain Adaptation

Muthmann and Petrova (2014) report that the Meta Stack Exchange community15 is one of the hardest for finding semantically equivalent questions.

We perform the same experiments described in the previous sections using the Meta data set. In Table 5, we can see that the CNN accuracy on Meta test data (92.68%) is similar to the one for the Ask Ubuntu community on test data (92.4%) (see Table 3).

Also, in Table 5, we show the results of a domain adaptation experiment in which we do not use training data from the Meta forum. In this case, the CNN is trained using Ask Ubuntu data only. The numbers show that even in this case using in-domain word embeddings helps to achieve relatively high accuracy: 83.35% on the test set.

7 Error Analysis

As we expected, the CNN with in-domain word vectors outperforms the vocabulary-based baselines in identifying semantically equivalent questions that are very different in terms of vocabulary. The CNN is also better at distinguishing questions with similar vocabulary but different meanings. For example, the question pair (q1) How can I install Ubuntu without removing Windows? and (q2) How do I upgrade from x86 to x64 without losing settings?16 is erroneously classified as a positive pair by the SVM, while the CNN classifies it correctly as a negative pair.

15 http://meta.stackexchange.com/

Train. Data  Size  Word Vect.  Val. Acc.  Test Acc.
META         4K    META        91.1       89.97
META         4K    Wikipedia   86.9       86.27
META         20K   META        92.8       92.68
META         20K   Wikipedia   90.6       90.52
AskUbuntu    24K   META        83.9       83.35
AskUbuntu    24K   AskUbuntu   76.8       80.0

Table 5: Convolutional Neural Network accuracy tested on Meta Stack Exchange community data.

There are some cases where both the CNN and the SVM fail to identify semantic equivalence. Some of these cases include questions where essential information is presented as an image, e.g., a screenshot, which was removed during preprocessing.17

8 Conclusions and Future Work

In this paper, we propose a method for identifying semantically equivalent questions based on a convolutional neural network. We experimentally show that the proposed CNN achieves very high accuracy, especially when the word embeddings are pre-trained on in-domain data. The performance of an SVM-based approach to this task was shown to depend highly on the size of the training data. In contrast, the CNN with in-domain word embeddings provides very high performance even with limited training data. Furthermore, experiments on a different domain have demonstrated that the neural network achieves high accuracy independently of the domain.

The next step in our research is building a system for the retrieval of semantically equivalent questions. In particular, given a corpus and a question, the task is to find all questions in the corpus that are semantically equivalent to the given one. We believe that a CNN architecture similar to the one proposed in this paper might be a good fit to tackle this problem.

Acknowledgments

The work of Dasha Bogdanova is supported by Science Foundation Ireland through the CNGL Programme (Grant 12/CE/I2267) in the ADAPT Centre (www.adaptcentre.ie) at Dublin City University. Her contributions were made during an internship at IBM Research.

16 Due to space constraints we only report the titles of the questions.
17 For instance, http://askubuntu.com/questions/450843

References

Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. 2013. *SEM 2013 shared task: Semantic textual similarity. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1, pages 32–43, Atlanta, Georgia, USA, June.

Omar Alonso, Dennis Fetterly, and Mark Manasse. 2013. Duplicate news story detection revisited. In Rafael E. Banchs, Fabrizio Silvestri, Tie-Yan Liu, Min Zhang, Sheng Gao, and Jun Lang, editors, Information Retrieval Technology, volume 8281 of Lecture Notes in Computer Science, pages 203–214. Springer Berlin Heidelberg.

James Bergstra, Olivier Breuleux, Frederic Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. 2010. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy).

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit.

A. Broder. 1997. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences 1997, SEQUENCES'97, Washington, DC, USA. IEEE Computer Society.

Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning, 20(3):273–297.

Cícero Nogueira dos Santos and Maíra Gatti. 2014. Deep convolutional neural networks for sentiment analysis of short texts. In Proceedings of the 25th International Conference on Computational Linguistics (COLING), Dublin, Ireland.

Cícero Nogueira dos Santos and Bianca Zadrozny. 2014. Learning character-level representations for part-of-speech tagging. In Proceedings of the 31st International Conference on Machine Learning (ICML), JMLR: W&CP volume 32, Beijing, China.

Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional neural network architectures for matching natural language sentences. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, pages 2042–2050, Montreal, Quebec, Canada.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1746–1751, Doha, Qatar.

Nitin Madnani, Joel Tetreault, and Martin Chodorow. 2012. Re-examining machine translation metrics for paraphrase identification. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 182–190, Montreal, Canada.

Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma. 2007. Detecting near-duplicates for web crawling. In Proceedings of the 16th International Conference on World Wide Web, WWW '07, pages 141–150, New York, NY, USA.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of the International Conference on Learning Representations, ICLR 2013, Scottsdale, AZ, USA.

Klemens Muthmann and Alina Petrova. 2014. An automatic approach for identifying topical near-duplicate relations between questions from social media Q/A. In Proceedings of Web-scale Classification: Classifying Big Data from the Web, WSCBD 2014.

Richard Socher, Eric H. Huang, Jeffrey Pennington, Andrew Y. Ng, and Christopher D. Manning. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In John Shawe-Taylor, Richard S. Zemel, Peter L. Bartlett, Fernando C. N. Pereira, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 801–809.

Alexander Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, and Kevin J. Lang. 1989. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech and Signal Processing, 37(3):328–339.

Yan Wu, Qi Zhang, and Xuanjing Huang. 2011. Efficient near-duplicate detection for Q&A forum. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 1001–1009, Chiang Mai, Thailand.

Wen-tau Yih, Xiaodong He, and Christopher Meek. 2014. Semantic parsing for single-relation question answering. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 643–648.


Proceedings of the 19th Conference on Computational Natural Language Learning, pages 132–141, Beijing, China, July 30-31, 2015. ©2015 Association for Computational Linguistics

Entity Linking Korean Text: An Unsupervised Learning Approach using Semantic Relations

Youngsik Kim
KAIST / Korea, The Republic of
[email protected]

Key-Sun Choi
KAIST / Korea, The Republic of
[email protected]

Abstract

Although entity linking is a widely researched topic, the same cannot be said for entity linking geared towards languages other than English. Several limitations, including syntactic features and the relative lack of resources, prevent typical approaches to entity linking from being used as effectively for other languages in general. We describe an entity linking system that leverages semantic relations between entities within an existing knowledge base to learn and perform entity linking using a minimal environment consisting of a part-of-speech tagger. We measure the performance of our system against Korean Wikipedia abstract snippets, using the Korean DBpedia knowledge base for training. Based on these results, we argue for both the feasibility of our system and the possibility of extending it to other domains and languages in general.

1 Introduction

A crucial step in creating the Web of Data is the process of extracting structured data, or RDF (Adida et al., 2012), from unstructured text. This step enables machines to read and understand the unstructured Web pages that constitute the majority of the Web. Three tasks play a part in extracting RDF from unstructured text: named entity recognition (NER), where strings representing named entities are extracted from the given text; entity linking (EL), where each named entity recognized by NER is mapped to an appropriate resource from a knowledge base; and relation extraction (Usbeck et al., 2014). Although entity linking is an extensively researched field, most research done is aimed primarily at the English language. Research on entity linking for languages other than English has also been performed (Jakob et al., 2013), but most state-of-the-art entity linking systems are not fully language independent.

The reason for this is two-fold. Firstly, most entity linking systems depend on an existing system to perform named entity recognition beforehand. For instance, the system proposed by Usbeck (2014) uses FOX (Speck et al., 2014) to perform named entity recognition as the starting point. The problem with this approach is that an existing named entity recognition system is required, and thus the performance of entity linking is bound by the performance of the named entity recognition system. Named entity recognition systems for English achieve high performance even by utilizing a simple dictionary-based approach augmented by part-of-speech annotations, but this approach does not work in all languages in general. CJK1 languages in particular are difficult to perform named entity recognition on because language traits such as capitalization and strict token separation by white-space do not exist. Other approaches to named entity recognition, such as statistical models or projection models, respectively require a large amount of annotated training data and an extensive parallel corpus to work, but the cost of creating these resources is also non-trivial. Some approaches to entity linking utilizing supervised and semi-supervised learning also suffer from the lack of manually annotated training data for some languages. Currently, there is no proper gold-standard dataset for entity linking for the Korean language.

In this paper, we present an entity linking system for Korean that overcomes these obstacles with an unsupervised learning approach utilizing semantic relations between entities obtained from a given knowledge base. Our system uses these semantic relations as hints to learn feature values for both named entity recognition and entity linking. In Section 3, we describe the requirements of our system, and present the architecture of both the training and actual entity linking processes. In Section 4, we compare the performance of our system against both a rule-based baseline and the current state-of-the-art system based on Kim's (2014) research.

1 Chinese, Japanese, Korean

2 Related Work

The current state-of-the-art entity linking system for Korean handles both named entity recognition and entity linking as two separate steps (Kim et al., 2014). This system uses hyperlinks within the Korean Wikipedia and a small amount of text manually annotated with entity information as training data. It employs an SVM model trained with character-based features to perform named entity recognition, and uses the TF*ICF (Mendes et al., 2011) and LDA metrics to disambiguate between entity resources during entity linking. Kim (2014) reports an F1-score of 75.66% for a simplified task in which only surface forms and entities which appear as at least one hyperlink within the Korean Wikipedia are recognized as potential entity candidates.

Our system utilizes relations between entities, which can be said to be a graph-based approach to entity linking. There has been recent research on entity linking that exploits the graph structure of both the named entities within text and RDF knowledge bases. Han (2011) uses a graph-based collective method which can model and exploit the global interdependence between different entity linking decisions. Alhelbawy (2014) uses graph ranking combined with clique partitioning. Moro (2014) introduces Babelfy, a unified graph-based approach to entity linking and word sense disambiguation based on a loose identification of candidate meanings coupled with a densest subgraph heuristic which selects high-coherence semantic interpretations. Usbeck (2014) combines the Hypertext-Induced Topic Search (HITS) algorithm with label expansion strategies and string similarity measures.

There has also been research on named entity recognition for Korean. Kim (2012) proposes a method to automatically label multilingual data with named entity tags, combining Wikipedia meta-data with information obtained through English-foreign language parallel Wikipedia sentences. We do not use this approach in our system because our scope of entities is wider than the named entity scope defined in the MUC-7 annotation guidelines, which is the scope of Kim's research.

3 The System

Due to the limitations of performing entity linking for the Korean language described in Section 1, our system is designed with some requirements in mind. The requirements are:

• The system should be able to be trained and run within a minimal environment, which we define as an existing RDF knowledge base containing semantic relations between entities, and a part-of-speech tagger. In this paper, we use the 2014 Korean DBpedia RDF knowledge base and the ETRI Korean part-of-speech tagger.

• The system should be able to perform entity linking without using external information not derived from the knowledge base.

We define the task of our system as follows: given a list of entity uniform resource identifiers (URIs) derived from the knowledge base, annotate any given text with the appropriate entity URIs and their positions within the text.

3.1 Preprocessing

The preprocessing step of our system consists of querying the knowledge base to build a dictionary of entity URIs and their respective surface forms. As we are using the Korean DBpedia as our knowledge base, we define all resources with a URI starting with the namespace 'http://ko.dbpedia.org/resource/' and which are not mapped to disambiguation nor redirection Wikipedia pages as valid entities. For each entity, we define all literal strings connected via the property 'rdfs:label' to the entity, and all entities that disambiguate or redirect to the entity, as possible surface forms.

The dictionary that results from preprocessing contains 303,779 distinct entity URIs and 569,908 entity URI-surface form pairs.
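As a rough illustration of this step, the sketch below builds such a dictionary from an iterable of RDF triples. The triple source, the helper names, and the exact DBpedia property URIs for redirects and disambiguations are our assumptions, not part of the paper.

```python
from collections import defaultdict

KO_RES = "http://ko.dbpedia.org/resource/"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
# assumed property URIs for redirect/disambiguation links in the dump
REDIRECTS = "http://dbpedia.org/ontology/wikiPageRedirects"
DISAMBIGUATES = "http://dbpedia.org/ontology/wikiPageDisambiguates"

def local_label(uri):
    # crude fallback: derive a label from the local name of the URI
    return uri[len(KO_RES):].replace("_", " ")

def build_surface_dict(triples, invalid_pages):
    """Map each valid entity URI to its set of possible surface forms.

    triples: iterable of (subject, predicate, obj) strings.
    invalid_pages: URIs of disambiguation/redirection pages (not valid entities).
    """
    surface_forms = defaultdict(set)
    for s, p, o in triples:
        if p == RDFS_LABEL and s.startswith(KO_RES) and s not in invalid_pages:
            surface_forms[s].add(o)                    # rdfs:label literals
        elif p in (REDIRECTS, DISAMBIGUATES):
            # a page that redirects/disambiguates to o contributes its label
            # as a surface form of o
            if o.startswith(KO_RES) and o not in invalid_pages:
                surface_forms[o].add(local_label(s))
    return surface_forms
```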

3.2 Training

After the preprocessing step, our system performs training by adjusting feature values based on data from a large amount of unannotated text documents. As the given text is not annotated with entity data, we use the following assumption to help distinguish potential entities within the text:

Assumption. Entity candidates which have a high degree of semantic relations with nearby entity candidates are likely actual entities. We define an 'entity candidate' of a document as a substring-entity URI pair in which the substring appears at a specific position within the document and the pair exists in the dictionary created during preprocessing, and a 'semantic relation' between two entity candidates c1, c2 (RelPred(c1, c2)) as an undirected relation consisting of all predicates in the knowledge base that connect the entity URIs of the entity candidates.

The basis for this assumption is that some mentions of 'popular' (having a high degree within the knowledge base RDF graph) entity URIs will be accompanied by related terms, and that the knowledge base will have RDF triples connecting the entity to the other entity URIs representing the related terms. Thus we assume that by selecting all entity candidates with a high degree of semantic relations, these candidates display features more representative of actual entities than the remaining entity candidates do.

For each document in the training set, we first perform part-of-speech tagging to split the text into individual morphemes. Because the concatenation of the morphemes of a Korean word is not always identical to the word itself, we transform the morphemes into 'atomic' substrings which represent minimal building blocks for entity candidates and can have multiple POS tags.

After part-of-speech tagging is complete, we then gather a set of entity candidates from the document. We first find all non-overlapping substrings within the document that correspond to entity surface forms. It is possible for such substrings to overlap with each other ([[Chicago] Bulls]); because the average length of entity surface forms in Korean is very short (between 2 and 3 characters), we opt to reduce the problem size of the training process by choosing only the substring with the longest length when multiple substrings overlap with each other.

Figure 1: All possible entity candidates within the text fragment 'Gyeongjong of Goryeo is' (Kim et al., 2014)

To further reduce the number of entity candidates we consider for training, we only use the entity candidate with the most 'popular' entity URI per group of candidates that share the same substring within the document. We define the 'popularity' of an entity URI in terms of the RDF triples in the knowledge base that have the URI as the object. More formally, we define UriPop(c) for an entity candidate c with the entity URI c_u with the equations below. A larger UriPop value means a more 'popular' entity URI, and Pred is meant to prevent overly frequent predicates in the knowledge base from dominating UriPop.

UriPop(c) = \log \Big( \sum_{p \in KB_{predicates}} Pred(p) \times |\{ s \mid (s, p, c_u) \in KB \}| \Big)    (1)

Pred(p) = 1 - \frac{|\{ (s, o) \mid (s, p, o) \in KB \}|}{|\{ (s, p, o) \in KB \}|}    (2)
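A minimal sketch of these two statistics, assuming the knowledge base is available as a plain in-memory list of (subject, predicate, object) triples (our simplification):

```python
import math
from collections import Counter, defaultdict

def predicate_weights(triples):
    """Pred(p) = 1 - (# triples using p) / (total # triples in the KB)."""
    per_predicate = Counter(p for _, p, _ in triples)
    total = len(triples)
    return {p: 1.0 - count / total for p, count in per_predicate.items()}

def uri_popularity(entity_uri, triples, pred_w):
    """UriPop(c) = log( sum_p Pred(p) * |{s : (s, p, c_u) in KB}| )."""
    incoming = defaultdict(int)          # predicate -> number of incoming triples
    for s, p, o in triples:
        if o == entity_uri:
            incoming[p] += 1
    weighted = sum(pred_w[p] * n for p, n in incoming.items())
    return math.log(weighted) if weighted > 0 else float("-inf")
```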

At this stage of the training process, we have a set of non-overlapping entity candidates. We now classify these candidates into two classes: eT (entity) and eF (non-entity) according to our previous assumption. We measure the degree of semantic relations of an entity candidate c, SemRel(c), with the following equation, where N_c is the set of entity candidates which are within 20 words of the target candidate in the document:

SemRel(c) = \frac{\sum_{c' \in N_c} \sum_{p \in RelPred(c, c')} Pred(p)}{|N_c|}    (3)

Since we do not have enough evidence at this point to distinguish entities from non-entities, we use semantic relations from all nearby entity candidates to calculate SemRel. We order the entity candidates into a list in decreasing order of SemRel, and classify entity candidates from the start of the list into eT until either 1) |eT| / (number of words in the document) > l_w or 2) SemRel(c) < l_s is satisfied, where both l_w and l_s are constants that continuously get adjusted during the training process. The remaining entity candidates all get classified into eF.
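The classification step can be sketched as follows; the candidate representation, the precomputed neighbour sets, and the function names are our own simplifications.

```python
def sem_rel(cand, neighbours, rel_pred, pred_w):
    """SemRel(c): average predicate weight of relations to nearby candidates.

    neighbours: candidates within 20 words of `cand` (N_c in the paper).
    rel_pred(c1, c2): set of KB predicates linking the two entity URIs.
    """
    if not neighbours:
        return 0.0
    total = sum(pred_w[p] for other in neighbours for p in rel_pred(cand, other))
    return total / len(neighbours)

def split_candidates(cands, sem_rel_of, doc_word_count, l_w, l_s):
    """Fill eT in decreasing SemRel order until one of the two limits is hit."""
    ranked = sorted(cands, key=sem_rel_of, reverse=True)
    e_true, e_false = [], []
    for c in ranked:
        if len(e_true) / doc_word_count > l_w or sem_rel_of(c) < l_s:
            e_false.append(c)
        else:
            e_true.append(c)
    return e_true, e_false
```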

We now update the current feature values based on the entity candidates in eT and eF. For each feature f, we first define two sub-classes e_fT = \{ e \mid e \text{ has } f \wedge e \in eT \} and e_fF = \{ e \mid e \text{ has } f \wedge e \in eF \}. We then update the feature value of f from \frac{\alpha}{\beta} to \frac{\alpha + |e_{fT}|}{\beta + |e_{fT}| + |e_{fF}|}, which represents the probability that an entity candidate having feature f is classified into eT (is an entity). The full list of features we used is shown below:

f1: String length. The length (in characters) of the substring of the candidate.

f2: POS tag. The POS tag(s) of the substring of the candidate. If multiple POS tags exist, we take the average of the f2 values for each tag as the representative f2 value.

f3: Head POS tag. The first POS tag of the substring of the candidate.

f4: Tail POS tag. The last POS tag of the substring of the candidate.

f5: Previous POS tag. The POS tag of the substring right before the candidate. If the candidate is preceded by white-space, we use the special tag 'BLANK'.

f6: Next POS tag. The POS tag of the substring right after the candidate. If the candidate is followed by white-space, we use the special tag 'BLANK'.

f7: UriPop. The UriPop score of the entity URI of the candidate. Since UriPop has a continuous range, we keep separate features for UriPop score intervals of 0.2.

Given these features, we define IndScore(c) of an entity candidate c, which represents the overall probability that c would be classified in eT independently of its surrounding context, as the average of the feature values for c.
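A sketch of the feature-value bookkeeping and of IndScore; the (alpha, beta) prior counts and the feature-key representation are our assumptions.

```python
class FeatureValues:
    """Running estimate of P(candidate with feature f ends up in eT)."""

    def __init__(self, alpha0=1.0, beta0=2.0):      # assumed prior counts
        self.alpha0, self.beta0 = alpha0, beta0
        self.alpha, self.beta = {}, {}

    def update(self, f, n_true, n_false):
        # alpha/beta  ->  (alpha + |e_fT|) / (beta + |e_fT| + |e_fF|)
        self.alpha[f] = self.alpha.get(f, self.alpha0) + n_true
        self.beta[f] = self.beta.get(f, self.beta0) + n_true + n_false

    def value(self, f):
        return self.alpha.get(f, self.alpha0) / self.beta.get(f, self.beta0)

def ind_score(candidate_feature_keys, fv):
    """IndScore(c): the average of the feature values of c's features."""
    values = [fv.value(f) for f in candidate_feature_keys]
    return sum(values) / len(values) if values else 0.0
```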

We also define WeightedSemRel(c) as the amount of evidence, via semantic relations, that c has of being classified in eT. We define this score in terms of semantic relations relative to the UriPop score of c, in order to positively consider entity candidates which have more semantic relations than their entity URIs would normally have.

Finally, we define EntityScore(c), representing the overall evidence for c to be in eT. The respective equations for these scores are shown below.

WeightedSemRel(c) = \frac{\sum_{c' \in N_c} \sum_{p \in RelPred(c, c')} Pred(p) \times UriPop(c')}{UriPop(c)}    (4)

EntityScore(c) = IndScore(c) + WeightedSemRel(c)    (5)

As we want to assign stronger evidence to semantic relations with entity candidates that are likely actual entities (as opposed to relations with non-entities), we define WeightedSemRel to have a recursive relation with EntityScore.

We end the training process for a single document by computing the EntityScore for each entity candidate in eT and eF, and adding these scores respectively into the lists of scores distT and distF. As the EntityScore for the same entity candidate will change as training proceeds, we only maintain the 10,000 most recent scores for both lists.
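Equations (4) and (5), together with the distT/distF bookkeeping, might look roughly like the sketch below; the 10,000-score cap follows the text, while the data structures and argument shapes are our simplification.

```python
from collections import deque

MAX_SCORES = 10_000
dist_true = deque(maxlen=MAX_SCORES)    # distT: most recent eT scores
dist_false = deque(maxlen=MAX_SCORES)   # distF: most recent eF scores

def weighted_sem_rel(cand, neighbours, rel_pred, pred_w, uri_pop):
    numerator = sum(pred_w[p] * uri_pop(other)
                    for other in neighbours
                    for p in rel_pred(cand, other))
    return numerator / uri_pop(cand)

def entity_score(cand, ind_score_of, weighted_sem_rel_of):
    return ind_score_of(cand) + weighted_sem_rel_of(cand)

def finish_document(e_true, e_false, score_of):
    # append this document's scores; the oldest scores fall off the deques
    dist_true.extend(score_of(c) for c in e_true)
    dist_false.extend(score_of(c) for c in e_false)
```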

3.3 Entity Linking

Since the actual entity linking process is also about selecting entity candidates from the given document, the process itself is similar to the training process. We list the differences of the entity linking process compared to training below.

When we choose the initial entity candidates, we do not remove candidates with overlapping substrings. Although we do limit the number of candidates that map to the same substring, we choose the top 3 candidates based on UriPop instead of just 1.

We compute the EntityScore for each entity candidate without performing candidate classification or feature value updates. Although the value of EntityScore(c) is intended to be proportional to the possibility that the candidate c is actually an entity, we need a way to define a threshold constant to determine which candidates to actually classify as entities. Thus, we normalize any given EntityScore(c) score of an entity candidate c into a confidence score Conf(c), which represents the relative probability of c being a member of eT against being a member of eF. This is computed by comparing the score against the lists distT and distF, as shown in the following equations:

ConfT(c) = \frac{|\{ x \mid x \in distT \wedge x < EntityScore(c) \}|}{|distT|}    (6)

ConfF(c) = \frac{|\{ x \mid x \in distF \wedge x > EntityScore(c) \}|}{|distF|}    (7)

Conf(c) = \frac{ConfT(c)}{ConfT(c) + ConfF(c)}    (8)

Conf is a normalized score which satisfies 0 \le Conf(c) \le 1 for any entity candidate c. This gives us the flexibility to define a threshold 0 \le \theta \le 1 so that only entity candidates satisfying Conf(c) \ge \theta are classified as entities.
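The normalisation in equations (6)-(8) amounts to comparing a score against the two empirical score distributions; a direct sketch:

```python
def confidence(score, dist_true, dist_false):
    """Conf(c) from equations (6)-(8); returns a value in [0, 1]."""
    conf_t = sum(1 for x in dist_true if x < score) / len(dist_true)
    conf_f = sum(1 for x in dist_false if x > score) / len(dist_false)
    if conf_t + conf_f == 0.0:
        return 0.0      # degenerate case, not discussed in the paper
    return conf_t / (conf_t + conf_f)
```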

Algorithm 1 The entity selection algorithm

E ← []
for all c ∈ C do
    if Conf(c) ≥ θ then
        unsetE ← []
        valid = true
        for all e ∈ E do
            if substring of c contains e then
                unsetE ← unsetE + e
            else if substring of c overlaps with e then
                valid = false
            end if
        end for
        if valid is true then
            E ← E + c
            for all e ∈ unsetE do
                E ← E − e
            end for
        end if
    end if
end for
return E

Algorithm 1 shows the entity selection process, where C is initialized as a list of all entity candidates ordered by decreasing Conf score. We only select entity candidates which have a confidence score of at least θ. When the substrings of multiple entity candidates overlap with each other, we prioritize the candidate with the highest confidence, with the exception of candidates that contain other candidates (one is completely covered by the other, without the two being identical), in which case we choose the candidate that contains the other candidates.
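For readers who prefer running code, a direct Python transcription of Algorithm 1; the Candidate class and its (start, end) span fields are our own minimal stand-in. With candidates pre-sorted by decreasing Conf, this reproduces the greedy priority described above.

```python
from dataclasses import dataclass

@dataclass
class Candidate:            # minimal stand-in for a substring/URI pair
    start: int
    end: int
    uri: str
    conf: float

def contains(a, b):
    """a completely covers b without being identical to it."""
    return a.start <= b.start and b.end <= a.end and (a.start, a.end) != (b.start, b.end)

def overlaps(a, b):
    return a.start < b.end and b.start < a.end and not contains(a, b)

def select_entities(candidates, theta):
    """Algorithm 1; `candidates` must be sorted by decreasing conf."""
    selected = []
    for c in candidates:
        if c.conf < theta:
            continue
        displaced, valid = [], True
        for e in selected:
            if contains(c, e):        # c swallows an already chosen candidate
                displaced.append(e)
            elif overlaps(c, e):      # partial overlap: the higher-conf e wins
                valid = False
        if valid:
            selected.append(c)
            for e in displaced:
                selected.remove(e)
    return selected
```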

4 Experiments

4.1 Entity Linking for Korean

4.1.1 The Dataset

Although the dataset used by Kim (2014) exists, we do not use this dataset because it is for a simplified version of the entity linking problem, as discussed in Section 2. We instead have created a new dataset intended to serve as answer data for the entity linking for Korean task, based on guidelines derived from the TAC KBP 2014 guidelines.2 The guidelines used to annotate text for our dataset are shown below:

• All entities must be tagged with an entity URI that exists in the 2014 Korean DBpedia knowledge base.

• All entities must be tagged with an entity URI which is correct within the context in which the entity appears.

• All entity URIs must identify a single thing, not a group of things.

• Verbs and adjectives that have an equivalent noun that is identified by an entity URI should not be tagged as entities.

• Only words that directly identify the entity should be tagged.

• If several possible entities are contained within each other, the entity that contains all other entities should be tagged.

• If several consecutive words can each be represented by the same entity URI, these words must be tagged as a single entity, as opposed to tagging each word separately.

• Indirect mentions of entity URIs (e.g., pronouns) should not be tagged, even if coreference resolution could be performed in order to identify the entity URI the mention stands for.

Our dataset consists of 60 Korean Wikipedia abstract documents annotated by 3 different annotators.

2 http://nlp.cs.rpi.edu/kbp/2014/elquery.pdf


4.1.2 Evaluation

We evaluate our system against the work of Kim (2014) in terms of precision, recall, and F1-score metrics. As Kim's (2014) system (which we will refer to as KEL2014) was originally trained with the Korean Wikipedia, it required only slight adjustments to properly operate on our dataset.

We first train our system using 4,000 documents randomly gathered from the Korean Wikipedia. We then evaluate our system with our dataset, while adjusting the confidence threshold θ from 0.01 to 0.99 in increments of 0.01. We compare our results with those of the best-performing settings of KEL2014.

4.1.3 Results

Figure 2: The distribution of distT and distF after training with 4,000 documents

The results of the training process are shown in Figure 2. We can see that the EntityScore values of entity candidates classified in eT are generally higher than those in eF. This can be seen as evidence for our claim that the features of entities with a high degree of semantic relations can be used to distinguish entities from non-entities in general.

We show additional evidence for this claim with Table 1, which lists the feature values for the feature f1 (substring length) after the training process is complete. We observe that this feature gives relatively little weight to entity candidates with a substring of length 1, which is consistent with our initial observation that most of these candidates are not entities.

Table 1: Feature values of the feature f1 after training with 4,000 documents

Length   eT      eT + eF   Value
1        37629   312543    0.120
2        41355   168672    0.245
3        20164   41098     0.490
4        12542   19619     0.637
5        13953   19722     0.704
6        5299    7503      0.698
7        3223    4187      0.753
8        2686    4282      0.616
9        2346    3034      0.751
10       922     1099      0.774
11       917     1011      0.830
12       477     546       0.750
13       249     276       0.687
14       218     274       0.615
15       155     170       0.618
16       109     116       0.567
17       99      118       0.523
18       48      52        0.434
19       49      55        0.433
20       34      40        0.384

Figure 3 shows the performance of our system compared to the performance of KEL2014 against our dataset. As we increase the confidence threshold, the recall decreases and the precision increases. The maximum F1-score of our system using our dataset, 0.630, was obtained with the confidence threshold θ = 0.79. This is an improvement over KEL2014, which scored an F1-score of 0.598. Although the performance difference itself is small, this shows that our system trained with unannotated text documents performs better than KEL2014, which is trained with both partially annotated documents (Wikipedia articles) and a small number of documents completely annotated with entity data. This also shows that KEL2014 does not perform as well outside the scope of the simplified version of entity linking it was designed for.

Our system is currently implemented as a simple RESTful service where both training and actual entity linking are initiated via POST requests. Our system can be deployed to a server at its initial state or at a pre-trained state.

4.2 Alternative Methods and Deviations

4.2.1 Using Feature Subsets

In order to test the effectiveness of each feature we use in training, we compare the results obtained in Section 4.1.3 against the performance of our system trained with smaller subsets of the features shown in Section 3.2. Using the same training data as in Section 4.1.2, we evaluate the performance of our system using two different subsets of features, as shown below:

Dev1: All features except the POS-related ones: f1, f7.

Dev2: All POS-related features only: f2, f3, f4, f5, f6.

Figure 3: The performance of our system against KEL2014. The dots represent the results of our system, and the cross represents the results of KEL2014.

Figure 4: A temporary web interface for our system showing entity linking results for a sample text snippet

Figures 5 and 6 show the performance of our system using the feature subsets Dev1 and Dev2. For the feature subset Dev1, our system displays a maximum F1-score of 0.433 with θ = 0.99; for the feature subset Dev2, our system performs best with θ = 0.98, for an F1-score of 0.568.

Figure 5: The performance of our system using the feature subset Dev1.

Figure 6: The performance of our system using the feature subset Dev2.

As expected, both deviations perform worse than our system trained with the full set of features. We observe that the performance of Dev1 is significantly worse than that of Dev2. This suggests that POS-based features are more important in distinguishing entities from non-entities than the other features.

4.2.2 Using SVM Models

Kim (2014) effectively utilized trained SVM models to detect entities within text. Based on Kim's experiments, we investigate whether an SVM can also be used to raise the performance of our system.

Although we now need to base entity predictions on a trained SVM model instead of the confidence metric Conf, the training process described in Section 3.2 is mostly unaltered, because the classification of entity candidates into eT and eF produces the training data required to train an SVM binary classification model. The differences in the training process are shown below:

• After we process each document, a list of features is produced for each entity candidate classified as eT or eF. Instead of appending the EntityScore values of these entity candidates to distT and distF, we append the feature lists of the entity candidates themselves to two lists of lists, listT and listF. These lists still contain only the data of the newest 10,000 entity candidates.

• After the entire training set is processed, we use the lists of feature lists in listT and listF to train the SVM model. As the original features are not all numbers, we transform each list of features into a 7-dimensional vector by replacing each feature key with the feature value of the respective feature.

We use the same training data as in Section 4.1.2, and train an SVM model with a 3-degree polynomial kernel. We choose this kernel according to Kim's (2014) work, where Kim compares the performance of SVM for the task of entity linking for Korean using multiple kernels.

Our system, using a trained SVM model, results in an F1-score of 0.569, which is about 0.07 lower than the best performance of our system using the confidence model. One possible reason our system performed worse using SVM classification is that the training data that we feed to the classifier is not correctly classified, but rather classified based on the semantic relations assumption in Section 3.2. As this assumption does not cover any characteristics of non-entities, the effectiveness of the SVM decreases, as many entity candidates which are actual entities get classified as eF due to them not having enough semantic relations.

4.3 Application on Other Languages

As our system does not explicitly exploit any characteristics of the Korean language, it is theoretically a language-independent entity linking system. We investigate this point using Japanese as a sample language, and MeCab3 as the part-of-speech tagger.

3 http://taku910.github.io/mecab/

As no proper dataset for the entity linking for Korean task exists, we have created a new dataset in order to measure the performance of our system. As we believe many other languages will also lack a proper dataset, we must devise an alternative method to measure the performance of our system for other languages in general.

We use Wikipedia documents and the links annotated within them as the dataset for measuring the performance of our system for languages other than Korean. Although the manually annotated links within Wikipedia documents do represent actual entities and their surface forms, we cannot completely rely on these links as answer data for the following reasons:

• The majority of entities within a Wikipedia document are not actually tagged as links. This includes self-references (entities that appear within the document describing that entity), frequently appearing entities that were tagged as links for their first few occurrences but were not tagged afterwards, and entities that were not considered important enough to tag as links by the annotators. Due to the existence of so many untagged entities, we effectively cannot use precision as a performance measure.

• Some links represent entities that are outside the scope of our research. This includes links that point to non-existent Wikipedia pages ('red links'), and links that require coreference resolution to resolve.

Due to these problems, we only use the recall metric to measure the approximate performance of our system when using Wikipedia documents as answer data. We compare the performance of our system for Korean and Japanese by first training our system with 3,000 Wikipedia documents, and then measuring the recall of our system on 100 Wikipedia documents for each language, while adjusting the confidence threshold θ from 0.01 to 0.99 in increments of 0.01.

Figure 7: The recall of our system against Korean and Japanese Wikipedia documents

Figure 7 shows the recall of our system against Korean and Japanese Wikipedia documents. For the confidence threshold optimized via the experiments performed in Section 4.1 (θ = 0.79), our system shows a large difference in recall between Korean and Japanese. Although this does not accurately represent the actual performance of our system, we leave the task of improving our system to display consistent performance across multiple languages as future work.

5 Future Work

As our system currently only uses the labels of entity URIs to determine surface forms of entities, it cannot detect entities with irregular surface forms. For instance, the entity 'Steve_Jobs' has the labels 'Steve Jobs' and 'Jobs' within the Korean DBpedia knowledge base, but does not have the label 'Steve'. This results in the inability of our system to detect certain entities within our dataset, regardless of training. We plan to improve our system to handle derivative surface forms such as 'Steve' for 'Steve_Jobs', without relying on external dictionaries if possible.

Kim (2014) shows that an SVM classifier using character-based features trained with Wikipedia articles achieves better named entity recognition performance than a rule-based classifier using part-of-speech tags. Although our system uses trained features based on part-of-speech tags rather than a rule-based method, we may be able to remove even the part-of-speech tagger from the requirements of our system by substituting these features with the character-based features suggested in Kim's (2014) work.

Finally, future work must be performed on replacing the part-of-speech tagger currently used in our system with a chunking algorithm that does not utilize supervised training. Our system currently utilizes the list of morphemes and their respective POS tags that are produced by the part-of-speech tagger. Since our system does not require this information to be completely accurate, a dictionary-based approach to chunking might be applicable as well.

6 Conclusion

In this paper, we present an entity linking system for Korean that utilizes several features trained with plain text documents. By taking an unsupervised learning approach, our system is able to perform entity linking within a minimal environment consisting of an RDF knowledge base and a part-of-speech tagger. We compare the performance of our system against the state-of-the-art system KEL2014, and show that our system outperforms KEL2014 in terms of F1-score. We also briefly describe variations to our system's training process, such as using feature subsets and utilizing SVM models instead of our confidence metric.

Many languages, including Korean, are not as rich in resources as the English language, and the lack of resources might prevent the state-of-the-art systems for entity linking for English from performing as well in other languages. By utilizing a minimal amount of resources, our system may provide a firm starting point for research on entity linking for these languages.

7 Acknowledgements

This work was supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No. R0101-15-0054, WiseKB: Big data based self-evolving knowledge base and reasoning platform).

References

Adida, B., Herman, I., Sporny, M., and Birbeck, M. (2012). RDFa 1.1 Primer. Technical report, World Wide Web Consortium, http://www.w3.org/TR/2012/NOTE-rdfa-primer-20120607/, June 2012.

Usbeck, R., Ngomo, A. C. N., Röder, M., Gerber, D., Coelho, S. A., Auer, S., and Both, A. (2014). AGDISTIS - graph-based disambiguation of named entities using linked data. In The Semantic Web - ISWC 2014 (pp. 457-471). Springer International Publishing.

Daiber, J., Jakob, M., Hokamp, C., and Mendes, P. N. (2013, September). Improving efficiency and accuracy in multilingual entity extraction. In Proceedings of the 9th International Conference on Semantic Systems (pp. 121-124). ACM.

Speck, R., and Ngomo, A. C. N. (2014). Ensemble learning for named entity recognition. In The Semantic Web - ISWC 2014 (pp. 519-534). Springer International Publishing.

Kim, Y., Hamn, Y., Kim, J., Hwang, D., and Choi, K.-S. (2014). A Non-morphological Approach for DBpedia URI Spotting within Korean Text. In Proceedings of HCLT 2014, Chuncheon.

Mendes, P. N., Jakob, M., García-Silva, A., and Bizer, C. (2011). DBpedia Spotlight: shedding light on the web of documents. In Proceedings of i-SEMANTICS 2011 (7th International Conference on Semantic Systems).

Han, X., Sun, L., and Zhao, J. (2011, July). Collective entity linking in web text: a graph-based method. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 765-774). ACM.

Alhelbawy, A., and Gaizauskas, R. (2014). Collective Named Entity Disambiguation using Graph Ranking and Clique Partitioning Approaches.

Moro, A., Raganato, A., and Navigli, R. (2014). Entity linking meets word sense disambiguation: a unified approach. Transactions of the Association for Computational Linguistics, 2, 231-244.

Kim, S., Toutanova, K., and Yu, H. (2012, July). Multilingual named entity recognition using parallel data and metadata from Wikipedia. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1 (pp. 694-702). Association for Computational Linguistics.


Incremental Recurrent Neural Network Dependency Parser with Search-based Discriminative Training

Majid Yazdani
Computer Science Department
University of Geneva
[email protected]

James Henderson
Xerox Research Center Europe
[email protected]

Abstract

We propose a discriminatively trained recurrent neural network (RNN) that predicts the actions for a fast and accurate shift-reduce dependency parser. The RNN uses its output-dependent model structure to compute hidden vectors that encode the preceding partial parse, and uses them to estimate probabilities of parser actions. Unlike a similar previous generative model (Henderson and Titov, 2010), the RNN is trained discriminatively to optimize a fast beam search. This beam search prunes after each shift action, so we add a correctness probability to each shift action and train this score to discriminate between correct and incorrect sequences of parser actions. We also speed up parsing by caching computations for frequent feature combinations, including during training, giving us both faster training and a form of backoff smoothing. The resulting parser is over 35 times faster than its generative counterpart with nearly the same accuracy, producing state-of-the-art dependency parsing results while requiring minimal feature engineering.

1 Introduction and Motivation

There has been significant interest recently in the machine learning and natural language processing communities in models that learn hidden multi-layer representations to solve various tasks. Neural networks have been popular in this area as powerful and yet efficient models. For example, feed-forward neural networks were used in language modeling (Bengio et al., 2003; Collobert and Weston, 2008), and recurrent neural networks (RNNs) have yielded state-of-the-art results in language modeling (Mikolov et al., 2010), language generation (Sutskever et al., 2011) and language understanding (Yao et al., 2013).

1.1 Neural Network Parsing

Neural networks have also been popular in parsing. These models can be divided into those whose design is motivated mostly by inducing useful vector representations (e.g. (Socher et al., 2011; Socher et al., 2013; Collobert, 2011)), and those whose design is motivated mostly by efficient inference and decoding (e.g. (Henderson, 2003; Henderson and Titov, 2010; Henderson et al., 2013; Chen and Manning, 2014)).

The first group of neural network parsers are all deep models, such as RNNs, which gives them the power to induce vector representations for complex linguistic structures without extensive feature engineering. However, decoding in these models can only be done accurately if they are used to re-rank the best parse trees of another parser (Socher et al., 2013).

The second group of parsers use a shift-reduce parsing architecture so that they can use search-based decoding algorithms with effective pruning strategies. The more accurate parsers also use an RNN architecture (see Section 6), and use generative models to allow beam search. These models are accurate but relatively slow, and accuracy degrades when decoding settings are chosen to optimize speed. Because they are generative, they need to predict the words as the parse proceeds through the sentence, which requires normalization over the entire vocabulary of words. Also, the beam search must maintain many candidates in the beam in order to check how well each one predicts future words. Recently, (Chen and Manning, 2014) proposed a discriminative neural network shift-reduce parser, which is very fast but less accurate (see Section 6). However, this parser uses a feed-forward neural network with a large set of hand-coded features, making it of limited interest for inducing vector representations of complex linguistic structures.

1.2 Incremental Recurrent Neural Network Architecture

In both approaches to neural network parsing, RNN models have the advantage that they need minimal feature engineering and can therefore be used with little effort for a variety of languages and applications. As with other deep neural network architectures, RNNs induce complex features automatically by passing induced (hidden) features as input to other induced features in a recursive structure. This is a particular advantage for domain adaptation, multi-task learning, transfer learning, and semi-supervised learning, where hand-crafted feature engineering is often particularly difficult. The information that is transferred from one task to another is embedded in the induced feature vectors in a shared latent space, which is input to another hidden layer for the target model (Henderson et al., 2013; Raina et al., 2007; Collobert et al., 2011; Glorot et al., 2011). This transferred information has proven to be particularly useful when it comes from very large datasets, such as web-scale text corpora, but learning and inference on such datasets is only practical with efficient algorithms.

In this work, we propose a fast discriminative RNN model of shift-reduce dependency parsing. We choose a left-to-right shift-reduce dependency parsing architecture to benefit from efficient decoding. It also easily supports incremental interpretation in dialogue systems, or incremental language modeling for speech recognition. We choose an RNN architecture to benefit from the automatic induction of informative vector representations of complex linguistic structures and the resulting reduction in the required feature engineering. This hidden vector representation is trained to encode the partial parse tree that has been built by the preceding parse, and is used to predict the next parser action conditioned on this history.

As our RNN architecture, we use the neural network approximation of ISBNs (Henderson and Titov, 2010), which we refer to as an Incremental Neural Network (INN). INNs are a kind of RNN where the model structure is built incrementally as a function of the values of previous output variables. In our case, the hidden vector used to make the current parser decision is connected to the hidden vector from previous decisions based on the partial parse structure that has been built by the previous decisions. So any information about the unbounded parse history can potentially be passed to the current decision through a chain of hidden vectors that reflects locality in the parse tree, and not just locality in the derivation sequence (Henderson, 2003; Titov and Henderson, 2007b). As in all deep neural network architectures, this chaining of nonlinear vector computations gives the model a very powerful mechanism to induce complex features from combinations of features in the history, which is difficult to replicate with hand-coded features.

1.3 Search-based Discriminative Training

We propose a discriminative model because it allows us to use lookahead instead of word prediction. As mentioned above, generative word prediction is costly, both to compute and because it requires larger beams to be effective. With lookahead, it is possible to condition on words that are farther ahead in the string, and thereby avoid hypothesizing parses that are incompatible with those future words. This allows the parser to prune much more aggressively without losing accuracy. Discriminative learning further improves this aggressive pruning, because it can optimize for the discrete choice of whether to prune or not (Huang et al., 2012; Zhang and Clark, 2011).

Our proposed model primarily differs from previous discriminative models of shift-reduce dependency parsing in the nature of the discriminative choices that are made and the way these decisions are modeled and learned. Rather than learning to make pruning decisions at each parse action, we learn to choose between sequences of actions that occur in between two shift actions. This way of grouping action sequences into chunks associated with each word has been used previously for efficient pruning strategies in generative parsing (Henderson, 2003), and for synchronizing syntactic parsing and semantic role labeling in a joint model (Henderson et al., 2013). We show empirically that making discriminative parsing decisions at the scale of these chunks also provides a good balance between grouping decisions so that more context can be used to make accurate parsing decisions and dividing decisions so that the space of alternatives for each decision can be considered quickly (see Figure 4 below).

In line with this pruning strategy, we define a score called the correctness probability for every shift action. This score is trained discriminatively to indicate whether the entire parse prefix is correct or not. This gives us a score function that is trained to optimize the pruning decisions during search (cf. (Daume III and Marcu, 2005)). By combining the scores for all the shift actions in each candidate parse, we can also discriminate between multiple parses in a beam of parses, thereby giving us the option of using beam search to improve accuracy in cases where bounded lookahead does not provide enough information to make a deterministic decision. The correctness probability is estimated by only looking at the hidden vector at its shift action, which encourages the hidden units to encode any information about the parse history that is relevant to deciding whether this is a good or bad parse, including long distance features.

1.4 Feature Decomposition and Caching

Another popular method from previous neural network models that we use and extend in this paper is the decomposition of input feature parameters using vector-matrix multiplication (Bengio et al., 2003; Collobert et al., 2011; Collobert and Weston, 2008). As the previous work shows, this decomposition overcomes the feature sparsity common in NLP tasks and also enables us to use unlabeled data effectively. For example, the parameter vector for the feature word-on-top-of-stack is decomposed into the multiplication of a parameter vector representing the word and a parameter matrix representing top-of-stack. But sparsity is not always a problem: since the frequency of such features follows a power law distribution, there are some very frequent feature combinations. Previous work has noticed that the vector-matrix multiplication of these frequent feature combinations takes most of the computation time during testing, so they cache these computations (Bengio et al., 2003; Devlin et al., 2014; Chen and Manning, 2014).

We note that these computations also take most of the computation time during training, and that the abundance of data for these feature combinations removes the statistical motivation for decomposing them. We propose to treat the cached vectors for high frequency feature combinations as parameters in their own right, using them both during training and during testing.

In summary, this paper makes several contributions to neural network parsing by considering different scales in the parse sequence and in the parametrization. We propose a discriminative recurrent neural network model of dependency parsing that is trained to optimize an efficient form of beam search that prunes based on the sub-sequences of parser actions between two shifts, rather than pruning after each parser action. We cache high frequency parameter computations during both testing and training, and train the cached vectors as separate parameters. As shown in Section 6, these improvements significantly reduce both training and testing times while preserving accuracy.

2 History Based Neural Network Parsing

In this section we briefly specify the action sequences that we model and the neural network architecture that we use to model them.

2.1 The Parsing Model

In shift-reduce dependency parsing, at each step of the parse, the configuration of the parser consists of a stack S of words, the queue Q of words, and the partial labeled dependency trees constructed by the previous history of parser actions. The parser starts with an empty stack S and all the input words in the queue Q, and terminates when it reaches a configuration with an empty queue Q. We use an arc-eager algorithm, which has 4 actions that all manipulate the word s on top of the stack S and the word q on the front of the queue Q. The decision Left-Arc_r adds a dependency arc from q to s labeled r; word s is then popped from the stack. The decision Right-Arc_r adds an arc from s to q labeled r. The decision Reduce pops s from the stack. The decision Shift shifts q from the queue to the stack. For more details we refer the reader to (Nivre et al., 2004). In this paper we chose the exact definition of the parse actions that are used in (Titov and Henderson, 2007b).
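Taken literally, the four actions amount to the following sketch of a parser configuration (our own simplification; the cited works add preconditions and details that we omit):

```python
class Configuration:
    """Stack S, queue Q, and the dependency arcs built so far."""

    def __init__(self, words):
        self.stack, self.queue, self.arcs = [], list(words), []

    def left_arc(self, label):     # arc from front of queue to top of stack; pop s
        s = self.stack.pop()
        self.arcs.append((self.queue[0], label, s))

    def right_arc(self, label):    # arc from top of stack to front of queue
        self.arcs.append((self.stack[-1], label, self.queue[0]))

    def reduce(self):              # pop s from the stack
        self.stack.pop()

    def shift(self):               # move q from the queue onto the stack
        self.stack.append(self.queue.pop(0))

    def terminal(self):            # parsing ends when the queue is empty
        return not self.queue
```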

At each step of the parse, the parser needs to choose between the set of possible next actions. To train a classifier to choose the best actions, previous work has proposed memory-based classifiers (Nivre et al., 2004), SVMs (Nivre et al., 2006), structured perceptrons (Huang et al., 2012; Zhang and Clark, 2011), two-layer neural networks (Chen and Manning, 2014), and Incremental Sigmoid Belief Networks (ISBNs) (Titov and Henderson, 2007b), amongst other approaches.

We take a history-based approach to model these sequences of parser actions, which decomposes the conditional probability of the parse using the chain rule:

P(T|S) = P(D^1 \cdots D^m \mid S) = \prod_t P(D^t \mid D^1 \cdots D^{t-1}, S)

where T is the parse tree, D^1 \cdots D^m is its equivalent sequence of shift-reduce parser actions, and S is the input sentence. The probabilities of Left-Arc_r and Right-Arc_r include both the probability of the attachment decision and the chosen label r. But unlike in (Titov and Henderson, 2007b), the probability of Shift does not include a probability predicting the next word, since all the words of S are included in the conditioning.

2.2 Estimating Action Probabilities

To estimate each P(D^t \mid D^1 \cdots D^{t-1}, S), we need to handle the unbounded nature of both D^1 \cdots D^{t-1} and S. We can divide S into the words that have already been shifted, which are handled as part of our encoding of D^1 \cdots D^{t-1}, and the words on the queue. To condition on the words in the queue, we use a bounded lookahead:

P(T|S) \approx \prod_t P(D^t \mid D^1 \cdots D^{t-1}, w^t_{a_1} \cdots w^t_{a_k})

where w^t_{a_1} \cdots w^t_{a_k} are the first k words on the front of the queue at time t. At every Shift action the lookahead changes, moving one word onto the stack and adding a new word from the input.

To estimate the probability of a decision at time t conditioned on the history of actions D^1 \cdots D^{t-1}, we overcome the problem of conditioning on an unbounded amount of information by using a neural network to induce hidden representations of the parse history sequence. The relevant information about the whole parse history at time t is encoded in its hidden representation, denoted by the vector h^t of size d.

\prod_t P(D^t \mid D^1 \cdots D^{t-1}, w^t_{a_1} \cdots w^t_{a_k}) = \prod_t P(D^t \mid h^t)

The hidden representation at time t is induced from the hidden representations of the relevant previous states, plus pre-defined features F computed from the previous decision and the current queue and stack:

h^t = \sigma\Big( \sum_{c \in C} h^{t_c} W^c_{HH} + \sum_{f \in F} W_{IH}(f, :) \Big)

in which C is the set of link types for the previous relevant hidden representations, h^{t_c} is the hidden representation of the time t_c < t that is relevant to h^t by the relation c, W^c_{HH} is the hidden-to-hidden transition weight matrix for link type c, and W_{IH} is the weight matrix from features F to hidden representations. \sigma is the sigmoid function, and W(i, :) denotes row i of matrix W. F and C are the only hand-coded parts of the model.

The decomposition of features has attracted a lot of attention in NLP tasks, because it overcomes feature sparsity. There is transfer learning from the same word (or POS tag, dependency label, etc.) in different positions, or from similar words. Also, unsupervised training of word embeddings can be used effectively within decomposed features. The use of unsupervised word embeddings in various natural language processing tasks has received much attention (Bengio et al., 2003; Collobert and Weston, 2008; Collobert, 2011). Word embeddings are real-valued feature vectors that are induced from large corpora of unlabeled text data. Using word embeddings with a large dictionary improves domain adaptation, and in the case of a small training set can improve the performance of the model.

Given these advantages, we use feature decompositions to define the input-to-hidden weights W_{IH}:

W_{IH}(f, :) = W_{emb.}(val(f), :) \, W^f_{HH}

Every row in W_{emb.} is an embedding for a feature value, which may be a word, lemma, POS tag, or dependency relation. val(f) is the index of the value for feature type f, for example the particular word that is at the front of the queue. The matrix W^f_{HH} is the transition matrix from the feature value embeddings to the hidden vector, for the given feature type f. For simplicity, we assume here that the size of the embeddings and the size of the hidden representations of the INN are the same.

In this way, the parameters of the embedding matrix W_{emb.} are shared among the various feature input link types f, which can improve the model in the case of sparse features. It also allows the use of word embeddings that are available on the web to improve coverage of sparse features, but we leave this investigation to future work since it is orthogonal to the contributions of this paper.

Finally, the probability of each decision is normalized across the alternative decisions, and only conditioned on the hidden representation (softmax layer):

P(D^t = d \mid h^t) = \frac{e^{h^t W_{HO}(:, d)}}{\sum_{d'} e^{h^t W_{HO}(:, d')}}

where W_{HO} is the weight matrix from hidden representations to the outputs.
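In NumPy terms, one decision step of the model described above looks roughly as follows; the dictionary-based parameter layout and the function signature are our own notation, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decision_step(prev_hiddens, features, W_hh, W_emb, W_f_hh, W_ho, feat_index):
    """One INN decision: combine linked hidden states and decomposed features.

    prev_hiddens: dict link_type -> hidden vector h^{t_c} of shape (d,)
    features:     dict feature_type -> feature value (word, POS tag, ...)
    W_hh:         dict link_type -> (d, d) hidden-to-hidden matrix W^c_HH
    W_emb:        (V, d) embedding matrix; W_f_hh: dict feature_type -> (d, d)
    W_ho:         (d, n_actions) hidden-to-output matrix
    feat_index:   maps a feature value to its row in W_emb (val(f) in the paper)
    """
    pre_activation = np.zeros(W_ho.shape[0])
    for c, h_prev in prev_hiddens.items():          # sum over link types in C
        pre_activation += h_prev @ W_hh[c]
    for f, value in features.items():               # decomposed input features
        pre_activation += W_emb[feat_index(value)] @ W_f_hh[f]
    h = sigmoid(pre_activation)
    logits = h @ W_ho                               # softmax over parser actions
    exp = np.exp(logits - logits.max())
    return h, exp / exp.sum()
```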

3 Discrimination of Partial Parses

Unlike in a generative model, the above formulas for computing the probability of a tree make independence assumptions, in that words to the right of w^t_{a_k} are assumed to be independent of D^t. And even for words in the lookahead, it can be difficult to learn dependencies with the unstructured lookahead string. The generative model first constructs a structure and then uses word prediction to test how well that structure matches the next word. If a discriminative model uses normalized estimates for decisions, then once a wrong decision is made there is no way for the estimates to express that this decision has led to a structure that is incompatible with the current or future lookahead string (see (Lafferty et al., 2001) for more discussion). More generally, any discriminative model that is trained to predict individual actions has this problem. In this section we discuss how to overcome this issue.

3.1 Discriminating Correct Parse Chunks

Due to this problem, discriminative parsers typically make irrevocable choices for each individual action in the parse. We propose a method for training a discriminative parser which addresses this problem in two ways. First, we train the model to discriminate between larger sub-sequences of actions, namely the actions between two Shift actions, which we call chunks. This allows the parser to delay choosing between actions that occur early in a chunk until all the structure associated with that chunk has been built. Second, the model's score can be used to discriminate between two parses long after they have diverged, making it appropriate for a beam search.

We employ a search strategy where we prune at each shift action, but in between shift actions we consider all possible sequences of actions, similarly to the generative parser in (Henderson, 2003). The INN model is discriminatively trained to choose between these chunks of sequences of actions.

The most straightforward way to model these chunk decisions would be to use unnormalized scores for the decisions in a chunk and sum these scores to make the decision, as would be done for a structured perceptron or conditional random field. Preliminary experiments applying this approach to our INN parser did not work as well as having local normalization of action decisions. We hypothesize that the main reason for this result is that updating on entire chunks does not provide a sufficiently focused training signal. With a locally normalized action score, such as softmax, increasing the score of the correct chunk has the effect of decreasing the score of all the incorrect actions at each individual action decision. This update is an approximation to a discriminative update on all incorrect parses that continue from an incorrect decision (Henderson, 2004). Another possible reason is that local normalization prevents one action's score from dominating the score of the whole parse, as can happen with high frequency decisions. In general, this problem cannot be solved just by using norm regularization on weights.

The places in a parse where the generative update and the discriminative update differ substantially are at word predictions, where the generative model considers that all words are possible, but a discriminative model already knows what word comes next and so does not need to predict anything. We discriminatively train the INN to choose between chunks of actions by adding a new score at these places in the parse. After each shift action, we introduce a correctness probability that is trained to discriminate between cases where the chunk of actions since the previous shift is correct and those where this chunk is incorrect. Thus, the search strategy chooses between all possible sequences of actions between two shifts using a combination of the normalized scores for each action and the correctness probability.

Figure 1: INN computations for one decision

In addition to discriminative training at the chunk level, the correctness probability allows us to search using a beam of parses. If a correct decision cannot be disambiguated, because of the independence assumptions with words beyond the lookahead or because of the difficulty of inferring from an unstructured lookahead, the correctness probability score will drop whenever the mistake becomes evident. This means that we can not only compare two partial parses that differ in the most recent chunk, but we can also compare two partial parses that differ in earlier chunks. This allows us to use beam search decoding. Instead of deterministically choosing a single partial parse at each shift action, we can maintain a small beam of alternatives and choose between them based on how compatible they are with future lookahead strings, by comparing their respective correctness probabilities for those future shifts.

We combine this correctness probability with the action probabilities by simply multiplying:

P(T|S) \approx \prod_t P(D^t \mid h^t) \, P(Correct \mid h^t)

where we train P(Correct \mid h^t) to be the correctness probability for the cases where D^t is a shift action, and define P(Correct \mid h^t) = 1 otherwise. For the shift actions, P(Correct \mid h^t) is defined using the sigmoid function:

P(Correct \mid h^t) = \sigma(h^t W_{Cor})

In this way, the hidden representations are trained not only to choose the right action, but also to encode the correctness of the partial parses. Figure 1 shows the model schematically.
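As a sketch of how this score might enter the search, the snippet below accumulates a beam item's log-score, adding the correctness term only at shift actions; the BeamItem fields and the argument list are our invention, not the paper's implementation.

```python
import math
from dataclasses import dataclass, field

@dataclass
class BeamItem:
    log_score: float = 0.0                 # sum of log P(D^t|h^t) (+ correctness terms)
    actions: list = field(default_factory=list)

def extend(item, action, action_logprob, h=None, w_cor=None, is_shift=False):
    """Extend a partial parse; fold in sigma(h^t W_Cor) only at shift actions."""
    new_score = item.log_score + action_logprob
    if is_shift and h is not None and w_cor is not None:
        z = float(sum(hi * wi for hi, wi in zip(h, w_cor)))   # h^t . W_Cor
        p_correct = 1.0 / (1.0 + math.exp(-z))
        new_score += math.log(p_correct)
    return BeamItem(new_score, item.actions + [action])
```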

3.2 Training the Parameters

We want to train the correctness probabilities to discriminate correct parses from incorrect parses. The correct parses can be extracted from the training treebank by converting each dependency tree into its equivalent sequence of arc-eager shift-reduce parser actions (Nivre et al., 2004; Titov and Henderson, 2007b). These sequences of actions provide us with positive examples. For discriminative training, we also need incorrect parses to act as the negative examples. In particular, we want negative examples that will allow us to optimize the pruning decisions made by the parser.

To optimize the pruning decisions made by the parsing model, we use the parsing model itself to generate the negative examples (Collins and Roark, 2004). Using the current parameters of the model, we apply our search-based decoding strategy to find our current approximation to the highest scoring complete parse, which is the output of our current parsing model. If this parse differs from the correct one, then we train the parameters of all the correctness probabilities in each parse so as to increase the score of the correct parse and decrease the score of the incorrect output parse. By repeatedly decreasing the score of the incorrect best parse as the model parameters are learned, training efficiently decreases the score of all incorrect parses.

As discussed above, we train the scores of individual parser actions to optimize the locally-normalized conditional probability of the correct action. Putting this together with the above training of the correctness probabilities P(Correct \mid h^t), we get the following objective function:

\arg\max_{\Theta} \sum_{T \in T_{pos}} \sum_{t \in T} \Big( \log P(d^t \mid h^t) + \log P(Correct \mid h^t) \Big) - \sum_{T \in T^{\Theta}_{neg}} \sum_{t \in T} \log P(Correct \mid h^t)

where \Theta is the set of all parameters of the model (namely W_{HH}, W_{emb.}, W_{HO}, and W_{Cor}), T_{pos} is the set of correct parses, and T^{\Theta}_{neg} is the set of incorrect parses which the model \Theta scores higher than their corresponding correct parses. The derivative of this objective function is the error signal that the neural network learns to minimize. This error signal is illustrated schematically in Figure 2.

We optimize the above objective using stochastic gradient descent. For each parameter update, a positive tree is chosen randomly and a negative tree is built using the above strategy. The resulting error signals are backpropagated through the INN to compute the derivatives for gradient descent.


Figure 2: Positive and negative derivation branches and their training signals

Figure 3: Feature frequency distributions. (a) Log-log frequency by rank of features. (b) Cumulative fraction of features ranked by frequency.

4 Caching Frequent Features

Decomposing the parametrization of input fea-tures using vector-matrix multiplication, as wasdescribed in section 2, overcomes the featuressparsity common in NLP tasks and also makes itpossible to use unlabeled data effectively. But itadds a huge amount of computation to both thetraining and decoding, since for every input fea-ture a vector-matrix multiplication is needed. Thisproblem is even more severe in our algorithm be-cause at every training iteration we search for thebest incorrect parse.

The frequency of features follows a power lawdistribution; there are a few very frequent features,and a long tail of infrequent features. Previouswork has noticed that the vector-matrix multiplica-tion of the frequent features takes most of the com-putation time during testing, so they cache thesecomputations (Bengio et al., 2003; Devlin et al.,2014; Chen and Manning, 2014). For examplein the Figure 3(b), only 2100 features are respon-sible of 90% of the computations among ⇠400kfeatures, so cashing these computations can havea huge impact on speed. We note that these com-putations also take most of the computation timeduring training. First, this computation is the dom-

inant part of the forward and backward error com-putations. Second, at every iteration we need todecode to find the highest scoring incorrect parse.

We propose to treat the cached vectors for high frequency features as parameters in their own right, using them both during training and during testing. This speeds up training because it is no longer necessary to do the high-frequency vector-matrix multiplications, neither for the forward error computations nor for backpropagation through the vector-matrix multiplications. Also, the cached vectors used in decoding do not need to be recomputed every time the parameters change, since the vectors are updated directly by the parameter updates.

Another possible motivation for treating the cached vectors as parameters is that it results in a kind of backoff smoothing; high frequency features are given specific parameters, and for low frequency features we back off to the decomposed model. Results from smoothing methods for symbolic statistical models indicate that it is better to smooth low frequency features with other low frequency features, and treat high frequency features individually. In this paper we do not systematically investigate this potential advantage, leaving it for future work.

In our experiments we cache features that make up to 90% of the feature frequencies. This gives us about a 20 times speed-up during training and about a 7 times speed-up during testing, while performance is preserved.
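As an illustration of this caching idea (not the authors' implementation), the sketch below assumes a decomposed parametrization in which a feature embedding matrix w_emb is projected by a matrix w_proj, and feature_counts is a NumPy array of corpus frequencies; the features covering 90% of all occurrences get their own directly updated vectors, while the rest back off to the decomposition. All names are illustrative.

import numpy as np

class CachedFeatureEncoder:
    """Sketch: cache the projected vectors of high-frequency features and
    treat them as parameters in their own right."""

    def __init__(self, w_emb, w_proj, feature_counts, coverage=0.90):
        self.w_emb, self.w_proj = w_emb, w_proj
        order = np.argsort(-feature_counts)
        cum = np.cumsum(feature_counts[order]) / feature_counts.sum()
        n_frequent = int(np.searchsorted(cum, coverage)) + 1
        # Each cached vector is initialized from the decomposition but is
        # updated directly from now on.
        self.cache = {int(f): w_emb[f] @ w_proj for f in order[:n_frequent]}

    def encode(self, feature_id):
        vec = self.cache.get(feature_id)
        if vec is not None:
            return vec                                # no multiplication needed
        return self.w_emb[feature_id] @ self.w_proj   # back off to decomposition

    def update_cached(self, feature_id, grad, lr=0.01):
        # Frequent features receive direct updates; infrequent features would
        # be updated by backpropagating through w_emb and w_proj (omitted).
        if feature_id in self.cache:
            self.cache[feature_id] -= lr * grad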

5 Parsing Complexity

If the length of the latent vectors is $d$, then for a sentence of length $L$ and beam $B$, the decoding complexity is $O(L \times M \times B \times (|F| + |C|) \times d^2)$.¹ If we choose the best partial parse tree at every shift, then $B = 1$. $M$ is the total number of actions in the candidate chunks that are generated between two shifts, so $L \times M$ is the total number of candidate actions in a parse. For shift-reduce dependency parsing, the total number of chosen actions is necessarily linear in $L$, but because we are applying best-first search in between shifts, $M$ is not necessarily independent of $L$. To investigate the impact of $M$ on the speed of the parser, we empirically measure the number of candidate actions generated by the parser between 33368 different shifts. The resulting distribution is plotted in Figure 4. We observe that it forms a power law distribution. Most of the time the number of actions is very small (2 or 3), with the maximum number being 40 and the average being 2.25. We conclude from this that the value of $M$ is not a major factor in the parser's speed.

Figure 4: Histogram of the number of candidate actions between shifts

Remember that $|F|$ is the number of input feature types. Caching 90% of the input feature computations allows us to reasonably neglect this term. Because $|C|$ is the number of hidden-to-hidden connection types, we cannot apply caching to reduce this term. However, $|C|$ is much smaller than $|F|$ (here $|C| = 3$). This is why caching input feature computations has such a large impact on parser speed.

¹For this analysis, we assume that the output computation is negligible compared to the hidden representation computation, because the output computation grows with $d$ while the hidden computation grows with $d^2$.

6 Experimental Results

We used syntactic dependencies from the English section of the CoNLL 2009 shared task dataset (Hajic et al., 2009). Standard splits of training, development and test sets were used. We compare our model to the generative INN model (Titov and Henderson, 2007b), the MALT parser, the MST parser, and the feed-forward neural network parser of Chen and Manning (2014) (“C&M”). All these models and our own models are trained only on the CoNLL 2009 syntactic data; they use no external word embeddings or other unsupervised training on additional data. This is one reason for choosing these models for comparison. In addition, the generative model (“Generative INN, large beam” in Table 1) was compared extensively to state-of-the-art parsers on various languages and tasks in previous work (Titov and Henderson, 2007b; Titov and Henderson, 2007a; Henderson et al., 2008). Therefore, our objective here is not to repeat an extensive comparison to the available parsers.

Table 1 shows the labeled and unlabeled attachment accuracy for these models. The MALT and MST parser scores come from (Surdeanu and Manning, 2010), which compared different parsing models using the CoNLL 2008 shared task dataset, which is the same as CoNLL 2009 for English syntactic parsing. The results for the generative INN with a large beam were taken from (Henderson and Titov, 2010), which uses an architecture with 80 hidden units. We replicate this setting for the other generative INN results and our discriminative INN results. The parser of (Chen and Manning, 2014) was run with their architecture of 200 hidden units with dropout training (“C&M”). All parsing speeds were computed using the latest downloadable versions of the parsers, on a single 3.4GHz CPU.

Our model with beam 1 (i.e. deterministic choices of chunks) (“DINN, beam 1”) produces state-of-the-art results while being over 35 times faster than the generative model with beam size 10. Moreover, we are able to achieve higher accuracies using larger beams (“DINN, beam 10”). The discriminative training of the correctness probabilities to optimize search is crucial to these levels of accuracy, as indicated by the relatively poor performance of our model when this training is removed (“Discriminative INN, no search training”). Previous deterministic shift-reduce parsers (“MALTAE” and “C&M”) are around twice as fast as our beam 1 model, but at the cost of significant reductions in accuracy.

Table 1: Labelled and unlabelled attachment accuracies and speeds on the test set.

Model                                   LAA    UAA    wrd/sec
MALTAE                                  85.96  88.64  7549
C&M                                     86.49  89.17  9589
MST                                     87.07  89.95  290
MALT-MST                                87.45  90.22  NA
Generative INN, beam 1                  77.83  81.49  1122
Generative INN, beam 10                 87.67  90.61  107
Generative INN, large beam              88.65  91.44  NA
Discriminative INN, no search training  85.28  88.98  4012
DINN, beam 1                            87.26  90.13  4035
DINN, beam 10                           88.14  90.75  433

Table 2: Labelled and unlabelled attachment accuracies on the test set of CoNLL 2009.

Model   German         Spanish        Czech
        LAA    UAA     LAA    UAA     LAA    UAA
C&M     82.5   86.1    81.5   85.4    58.6   70.6
MALT    80.7   83.1    82.4   86.6    67.3   77.4
MST     84.1   87.6    82.7   87.3    73.4   81.7
DINN    86.0   89.6    85.4   88.3    77.5   85.2

To evaluate our RNN model’s ability to induce informative features automatically, we trained our deterministic model, MALT, MST and C&M on three diverse languages from CoNLL 2009, using the same features as used in the above experiments on English (model “DINN, beam 1”). We did no language-specific feature engineering for any of these parsers. Table 2 shows that our RNN model generalizes substantially better than all these models to new languages, demonstrating the power of this model’s feature induction.

7 Conclusion

We propose an efficient and accurate recurrent neural network dependency parser that uses neural network hidden representations to encode arbitrarily large partial parses for predicting the next parser action. This parser uses a search strategy that prunes to a deterministic choice at each shift action, so we add a correctness probability to each shift operation, and train this score to discriminate between correct and incorrect sequences of parser actions. All other probability estimates are trained to optimize the conditional probability of the parse given the sentence. We also speed up both parsing and training times by only decomposing infrequent features, giving us both a form of backoff smoothing and twenty times faster training.

The discriminative training for this pruning strategy allows high accuracy to be preserved while greatly speeding up parsing time. The recurrent neural network architecture provides powerful automatic feature induction, resulting in high accuracy on diverse languages without tuning.

Acknowledgments

The research leading to this work was funded by the EC FP7 programme FP7/2011-14 under grant agreement no. 287615 (PARLANCE).

References

Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155, March.

Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750. Association for Computational Linguistics, October.

Michael Collins and Brian Roark. 2004. Incremental parsing with the perceptron algorithm. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, ACL ’04. Association for Computational Linguistics.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, pages 160–167. ACM.

Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. J. Mach. Learn. Res., 12:2493–2537, November.

Ronan Collobert. 2011. Deep learning for efficient discriminative parsing. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS-11), volume 15, pages 224–232. Journal of Machine Learning Research - Workshop and Conference Proceedings.

Hal Daume III and Daniel Marcu. 2005. Learning as search optimization: Approximate large margin methods for structured prediction. In Proceedings of the 22nd International Conference on Machine Learning, pages 169–176.

Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), ICML ’11, pages 513–520. ACM, June.

Jan Hajic, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Pado, Jan Stepanek, Pavel Stranak, Mihai Surdeanu, Nianwen Xue, and Yi Zhang. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning: Shared Task, CoNLL ’09, pages 1–18. Association for Computational Linguistics.

James Henderson and Ivan Titov. 2010. Incremental sigmoid belief networks for grammar learning. J. Mach. Learn. Res., 11:3541–3570, December.

James Henderson, Paola Merlo, Gabriele Musillo, and Ivan Titov. 2008. A latent variable model of synchronous parsing for syntactic and semantic dependencies. In Proceedings of the Twelfth Conference on Computational Natural Language Learning, CoNLL ’08, pages 178–182. Association for Computational Linguistics.

James Henderson, Paola Merlo, Ivan Titov, and Gabriele Musillo. 2013. Multilingual joint parsing of syntactic and semantic dependencies with a latent variable model. Comput. Linguist., 39(4):949–998, December.

James Henderson. 2003. Inducing history representations for broad coverage statistical parsing. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL ’03, pages 24–31. Association for Computational Linguistics.

James Henderson. 2004. Discriminative training of a neural network statistical parser. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), Main Volume, pages 95–102, July.

Liang Huang, Suphan Fayong, and Yang Guo. 2012. Structured perceptron with inexact search. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 142–151. Association for Computational Linguistics, June.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, pages 282–289. Morgan Kaufmann Publishers Inc.

Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proceedings of the 11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010), volume 2010, pages 1045–1048. International Speech Communication Association.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2004. Memory-based dependency parsing. In Hwee Tou Ng and Ellen Riloff, editors, HLT-NAACL 2004 Workshop: Eighth Conference on Computational Natural Language Learning (CoNLL-2004), pages 49–56. Association for Computational Linguistics, May 6 - May 7.

Joakim Nivre, Johan Hall, Jens Nilsson, Gulsen Eryigit, and Svetoslav Marinov. 2006. Labeled pseudo-projective dependency parsing with support vector machines. In Proceedings of the Tenth Conference on Computational Natural Language Learning, CoNLL-X ’06, pages 221–225. Association for Computational Linguistics.

Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y. Ng. 2007. Self-taught learning: Transfer learning from unlabeled data. In Proceedings of the 24th International Conference on Machine Learning, ICML ’07, pages 759–766. ACM.

Richard Socher, Cliff Chiung-Yu Lin, Andrew Ng, and Chris Manning. 2011. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), ICML ’11, pages 129–136. ACM.

Richard Socher, John Bauer, Christopher D. Manning, and Ng Andrew Y. 2013. Parsing with compositional vector grammars. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 455–465. Association for Computational Linguistics, August.

Mihai Surdeanu and Christopher D. Manning. 2010. Ensemble models for dependency parsing: Cheap and good? In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT ’10, pages 649–652. Association for Computational Linguistics.


Ilya Sutskever, James Martens, and Geoffrey Hinton. 2011. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), ICML ’11, pages 1017–1024. ACM, June.

Ivan Titov and James Henderson. 2007a. Fast and robust multilingual dependency parsing with a generative latent variable model. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 947–951. Association for Computational Linguistics, June.

Ivan Titov and James Henderson. 2007b. A latent variable model for generative dependency parsing. In Proceedings of the 10th International Conference on Parsing Technologies, IWPT ’07, pages 144–155. Association for Computational Linguistics.

Kaisheng Yao, Geoffrey Zweig, Mei-Yuh Hwang, Yangyang Shi, and Dong Yu. 2013. Recurrent neural networks for language understanding. In INTERSPEECH, pages 2524–2528.

Yue Zhang and Stephen Clark. 2011. Syntactic processing using the generalized perceptron and beam search. Comput. Linguist., 37(1):105–151, March.


Proceedings of the 19th Conference on Computational Natural Language Learning, pages 153–163, Beijing, China, July 30-31, 2015. ©2015 Association for Computational Linguistics

Instance Selection Improves Cross-Lingual Model Training for Fine-Grained Sentiment Analysis

Roman Klinger*‡
*Institute for Natural Language Processing, University of Stuttgart, 70569 Stuttgart, Germany

Philipp Cimiano‡
‡Semantic Computing Group, CIT-EC, Bielefeld University, 33615 Bielefeld, Germany

[email protected]@cit-ec.uni-bielefeld.de

Abstract

Scarcity of annotated corpora for many languages is a bottleneck for training fine-grained sentiment analysis models that can tag aspects and subjective phrases. We propose to exploit statistical machine translation to alleviate the need for training data by projecting annotated data in a source language to a target language such that a supervised fine-grained sentiment analysis system can be trained. To avoid a negative influence of poor-quality translations, we propose a filtering approach based on machine translation quality estimation measures to select only high-quality sentence pairs for projection. We evaluate on the language pair German/English on a corpus of product reviews annotated for both languages and compare to in-target-language training. Projection without any filtering leads to 23 % F1 in the task of detecting aspect phrases, compared to 41 % F1 for in-target-language training. Our approach obtains up to 47 % F1. Further, we show that the detection of subjective phrases is competitive to in-target-language training without filtering.

1 Introduction

An important task in fine-grained sentiment analysis and opinion mining is the extraction of mentioned aspects, evaluative subjective phrases and the relation between them. For instance, in the sentence

“I really like the display but the battery seems weak to me.”

the task is to detect evaluative (subjective) phrases (in this example “really like” and “seems weak”) and aspects (“display” and “battery”) as well as their relation (that “really like” refers to “display” and “seems weak” refers to “battery”).

Annotating data for learning a model to extract such detailed information is a tedious and time-consuming task. Therefore, given the scarcity of such annotated corpora in most languages, it is interesting to generate models which can be applied to languages without manually created training data. In this paper, we perform annotation projection, which is one of the two main categories for cross-language model induction (next to direct model transfer (Agic et al., 2014)).

Figure 1 shows an example of a sentence together with its automatically derived translation (source language on top, target language on bottom) and the alignment between both. Such an alignment can be used to project annotations across languages, e.g., from a source to a target language, to produce data to train a system for the target language. As shown in the example, translation errors as well as alignment errors can occur. When using a projection-based approach, the performance of a system on the target language crucially depends on the quality of the translation and the alignment. In this paper we address two questions:

• What is the performance on the task when training data for the source language is projected into a target language, compared to an approach where training data for the target language is available?

• Can the performance be increased by selecting only high-quality translations and alignments?

Towards answering these questions, we present the following contributions:


Es gibt mit Sicherheit bessere Maschinen , aber die bietet das beste Preis-Leistungs-Verhältnis .

There are certainly better machines , but offers the best price-performance ratio .

Figure 1: Example for the projection of an annotation from the source language to the target language. The translation has been generated with the Google Translate API (https://cloud.google.com/translate/). The alignment is induced with FastAlign (Dyer et al., 2013).

• We propose to use a supervised approach to induce a fine-grained sentiment analysis model to predict aspect and subjective phrases on some target language, given training data in some source language. This approach relies on automatic translation of source training data and projection of annotations to the target language data.

• We present an instance selection method that only selects sentences with a certain translation quality. For this, we incorporate different measures of translation and alignment confidence. We show that such an instance selection method leads to increased performance compared to a system without instance selection for the prediction of aspects. Remarkably, for the prediction of aspects the performance is comparable to an upper baseline using manually annotated target language data for training (we refer to the latter setting as in-target-language training).

• In contrast, for the prediction of subjective phrases, we show that, while a competitive result compared to target language training can be observed when training with the projected training data, there is no beneficial effect of the filtering.

In the following, we describe our methodology in detail, including the description of the machine translation, annotation projection, and quality estimation methods (Section 2), and present the evaluation on manually annotated data (Section 3). Related work is discussed in Section 4. We conclude with Section 5 and mention promising future steps.

2 Methods

2.1 Supervised Model for Aspect and Subjective Phrase Detection

We use a supervised model induced from training data to detect aspect phrases, subjective (evaluative) phrases and their relations. The structure follows the pipeline approach proposed by Klinger and Cimiano (2013).¹ However, in contrast to their work, we focus on the detection of phrases only, and exploit the detection of relations only during inference, such that the detection of relations has an effect on the detection of phrases, but is not evaluated directly.

The phrase detection follows the idea of semi-Markov conditional random fields (Sarawagi and Cohen, 2004; Yang and Cardie, 2012) and models phrases as spans over tokens as variables. Factor templates for spans of type aspect and subjective take into account token strings, prefixes, suffixes, the inclusion of digits, and part-of-speech tags, both as full strings and as bigrams, for the spans and their vicinity. In addition, the length of the span is modeled by cumulative binning. The relation template indicates how close an aspect is to a subjective phrase based on token distance and on the length of the shortest path in the dependency tree. The edge names of the shortest path are also included as features. It is further checked whether no noun other than the aspect is close to the subjective phrase.

Inference during training and testing is done via Markov Chain Monte Carlo (MCMC). In each sampling step (with options of adding a span, removing a span, adding an aspect as target to a subjective phrase), the respective factors lead to an associated model score. The model parameters are adapted based on SampleRank (Wick et al., 2011) using an objective function which computes the fraction of correctly predicted tokens in a span. For details on the model configuration and its implementation in FACTORIE (McCallum et al., 2009), we refer to the description in the original paper (Klinger and Cimiano, 2013). The objective function to evaluate a span $r$ during training is

$$f(r) = \max_{g \in s} \frac{|r \cap g|}{|g|} - \alpha \cdot |r \setminus g|,$$

where $g$ ranges over the set $s$ of all gold spans, $|r \cap g|$ is the number of tokens shared by the gold and predicted spans, and $|r \setminus g|$ is the number of predicted tokens which are not part of the gold span. The parameter $\alpha$ is set to 0.1 as in the original paper.² The objective for the predictions in a whole sentence $s$ containing spans is $f(s) = \sum_{r \in s} f(r)$.

¹https://bitbucket.org/rklinger/jfsa
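For concreteness, a small Python sketch of this span objective follows; spans are represented as (start, end) token offsets with an exclusive end, which is an assumed representation, and alpha defaults to 0.1 as in the text.

def span_objective(predicted, gold_spans, alpha=0.1):
    """Span objective f(r): overlap with the best-matching gold span,
    minus alpha times the number of spurious predicted tokens."""
    r = set(range(*predicted))
    scores = []
    for gs, ge in gold_spans:
        g = set(range(gs, ge))
        scores.append(len(r & g) / len(g) - alpha * len(r - g))
    # With no gold spans only the penalty remains (a convention of this
    # sketch; the text does not cover this case).
    return max(scores) if scores else -alpha * len(r)

def sentence_objective(predicted_spans, gold_spans, alpha=0.1):
    """f(s) = sum of f(r) over all predicted spans r in the sentence."""
    return sum(span_objective(r, gold_spans, alpha) for r in predicted_spans)

# A predicted span over tokens 1-3 against a gold span over tokens 1-2:
# overlap 2/2 = 1.0, one spurious token -> 1.0 - 0.1 = 0.9
print(span_objective((1, 4), [(1, 3)]))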

This model does not take into account language-specific features and can therefore be trained for different languages. In the following, we explain our procedure for inducing a model for a target language for which no annotations are available.

2.2 Statistical Machine Translation and Annotation Projection

Annotating textual corpora with fine-grained sentiment information is a time-consuming and therefore costly process. In order to adapt a model to a new domain and to a new language, corresponding training data is needed. In order to circumvent the need for additional training data when addressing a new language, we project training data automatically from a source to a target language. As input to our approach we require a corpus annotated for some source language and a translation from the source to a target language. As the availability of a parallel training corpus cannot be assumed in general, we use statistical machine translation (SMT) methods, relying on phrase-based translation models that use large amounts of parallel data for training (Koehn, 2010).

While using an open-source system such as Moses³ would have been an option, we note that the quality would be limited by whether the system can be trained on a representative corpus. A standard dataset that SMT systems are trained on is EuroParl (Koehn, 2005). EuroParl covers 21 languages and contains 1,920,209 sentences for the pair German/English. The corpus includes only 4 sentences with the term “toaster”, 12 with “knives” (mostly in the context of violence), 6 with “dishwasher” (in the context of regulations) and 0 with “trash can”. The terms “camera” and “display” are more frequent, with 208 and 1186 mentions, respectively, but they never occur together.⁴ The corpus is thus not representative of the product reviews we consider in this paper.

²Note that the learning is independent from the actual value for all $0 < \alpha < (\max_{g \in \mathit{Corpus}} |g|)^{-1}$.
³www.statmt.org/moses/
⁴These example domains are taken from the USAGE corpus (Klinger and Cimiano, 2014), which is used in Section 3.

Thus, we opt for using a closed translation system that is trained on larger amounts of data, namely Google Translate, through the available API⁵. The alignment is then computed as a post-processing step relying on FastAlign (Dyer et al., 2013), a reparametrization of IBM Model 2 with a reduced set of parameters. It is trained in an unsupervised fashion via expectation maximization.

Projecting the annotations from the source to the target language works as follows: given an annotated sentence in the source language $s_1, \ldots, s_n$ and some translation of this sentence $t_1, \ldots, t_m$ into the target language, we induce a mapping $a : [1 \ldots n] \rightarrow [1 \ldots m]$ using FastAlign. For a source language phrase $s_{i,j} = s_i, \ldots, s_j$ we refer by $a(s_{i,j})$ to the set of tokens that some token in $s_{i,j}$ has been aligned to, that is: $a(s_{i,j}) = \bigcup_{i \leq k \leq j} \{a(k)\}$. Note that the tokens in $a(s_{i,j})$ are not necessarily consecutive; therefore the annotation in the target language is defined as the minimal sequence including all tokens $t_k \in a(s_{i,j})$, i.e., the leftmost and rightmost tokens define the span of the target annotation.

This procedure leads to the same number of span annotations in source and target language, with the only exception that we exclude projected annotations for which $|n - m| > 10$.
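A minimal sketch of this projection step follows, assuming the alignment is given as a dictionary from source token indices to sets of target token indices (as could be built from a word aligner's output); the example indices at the bottom are hypothetical.

def project_span(span, alignment, n_source, n_target, max_len_diff=10):
    """Project a source span (i, j) (inclusive token indices) onto the target
    side as the minimal sequence covering all aligned target tokens.
    Returns None if the annotation is excluded (|n - m| > 10) or if no token
    in the span is aligned."""
    if abs(n_source - n_target) > max_len_diff:
        return None
    i, j = span
    targets = set()
    for k in range(i, j + 1):
        targets |= alignment.get(k, set())      # a(s_{i,j}) = union of a(k)
    if not targets:
        return None
    return min(targets), max(targets)           # leftmost and rightmost token

# Hypothetical alignment for two source tokens of the Figure 1 sentence pair:
alignment = {4: {2}, 5: {3}}
print(project_span((4, 5), alignment, n_source=14, n_target=13))  # -> (2, 3)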

2.3 Quality Estimation-based Instance Filtering

The performance of an approach relying on projection of training data from a source to a target language and using this automatically projected data to train a supervised model crucially depends on the quality of the translations and alignments. In order to reduce the impact of spurious translations, we filter out low-quality sentence pairs. To estimate this quality, we take three measures into consideration (following approaches described by Shah and Specia (2014), in addition to a manual assessment of the translation quality as an upper baseline):

1. The probability of the sentence in the source language given a language model built on unannotated text in the source language (measuring if the language to be translated is typical, referred to as Source LM).

2. The probability of the machine-translated sentence given a language model built on unannotated text in the target language (measuring if the translation is typical, referred to as Target LM).

3. The likelihood that the alignment is correct, directly computed on the basis of the alignment probability (referred to as Alignment): $p(e \mid f) = \prod_{i=1}^{m} p(e_i \mid f, m, n)$, where $e$ and $f$ are source and target sentences and $m$ and $n$ denote the sentence lengths (Dyer et al., 2013, Eq. 1f.).

⁵https://cloud.google.com/translate/

Table 1: Frequencies of the corpus used in our experiments (Klinger and Cimiano, 2014).

                   # reviews
                   en    de
coffee machine     75    108
cutlery            49    72
microwave          100   100
toaster            100   4
trash can          100   99
vacuum cleaner     51    140
washing machine    49    88
dish washer        98    0

For building the language models, we employ the toolkit SRILM (Stolcke, 2002; Stolcke et al., 2011). The likelihood for the alignment as well as the language model probability are normalized by the number of tokens in the sentence.
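The filtering step itself is then straightforward; the following sketch assumes that per-token alignment probabilities and sentence-level language model log-probabilities have already been obtained (e.g., from the aligner and the language model toolkit), and all function names are illustrative.

import math

def normalized_alignment_score(token_probs):
    """Length-normalized log of p(e|f) = prod_i p(e_i | f, m, n)."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)

def normalized_lm_score(sentence_logprob, num_tokens):
    """Length-normalized language model log-probability (Source or Target LM)."""
    return sentence_logprob / num_tokens

def filter_instances(instances, score_fn, threshold):
    """Keep only the sentence pairs whose quality score exceeds the threshold."""
    return [inst for inst in instances if score_fn(inst) >= threshold]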

3 Experiments

3.1 Corpus and Setting

The proposed approach is evaluated on the language pair German/English in both directions (projecting German annotations into an automatically generated English corpus and testing on English annotations, and vice versa). As a resource, we use the recently published USAGE corpus (Klinger and Cimiano, 2014), which consists of 622 English and 611 German product reviews from http://www.amazon.com/ and http://www.amazon.de/, respectively. The reviews are on coffee machines, cutlery sets, microwaves, toasters, trash cans, vacuum cleaners, washing machines, and dish washers. Frequencies of entries in the corpus are summarized in Table 1. Each review has been annotated by two annotators. We take into account the data generated by the first annotator in this work to avoid the design of an aggregation procedure. The corpus is unbalanced between the product classes. The average number of annotated aspects in each review in the German corpus (10.4) is smaller than in English (13.7). The average number of subjective phrases is more similar, with 8.6 and 8.3, respectively. The total number of aspects is 8545 for English and 6340 for German; the number of subjective phrases is 5321 and 5086, respectively.

The experiments are performed in a leave-one-domain-out setting, e.g., testing on coffee machine reviews is based on a model trained on all other products except coffee machines. This holds for the cross-language and the in-target-language training results and therefore leads to comparable settings. We use exact match precision, recall and F1-measure for evaluation. However, it should be noted that partial matching scores are also commonly applied in fine-grained sentiment analysis due to the fact that boundaries of annotations can differ substantially between annotators. For simplicity, we limit ourselves to the more strict evaluation measure.
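As a reference for the evaluation measure, the following sketch computes exact-match precision, recall and F1 over predicted and gold spans; spans are assumed to be (type, start, end) tuples, which is an illustrative representation, not one specified in the paper.

def exact_match_prf(predicted, gold):
    """A predicted span counts as correct only if it matches a gold span exactly."""
    tp = len(set(predicted) & set(gold))
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(exact_match_prf([("aspect", 3, 4), ("subjective", 0, 2)],
                      [("aspect", 3, 4), ("aspect", 7, 8)]))   # (0.5, 0.5, 0.5)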

The language models are trained on 7,413,774 German and 9,650,633 English sentences sampled from Amazon product reviews concerning the product categories in the USAGE corpus. The FastAlign model is trained on the EuroParl corpus and the automatically translated USAGE corpus in both directions (German translated to English and English translated to German).

3.2 Results

We evaluate and compare the impact of the three automatic quality estimation methods and compare them to a manual sentence-based judgement for the language projection from German to English (testing on English). The manual judgement was performed by assigning values ranging from 0 (not understandable), over 1 and 2 (slightly understandable) to 8 (some flaws in translation), 9 (minor flaws in translation) and 10 (perfect translation).⁶

Figure 2 shows the results for all four methods (including manual quality assessment) for the projection direction German to English for the product category of coffee machines, compared to in-target-language training results. The x-axis corresponds to different values of the filtering threshold. Thus, when increasing the threshold, the number of sentences used for training decreases.

⁶This annotated data is available at http://www.romanklinger.de/translation-quality-review-corpus



Figure 2: Complete results for the reviews for coffee machines for the projection direction German to English.

For all quality estimation methods except the language model for the source language, the performance on the target language increases significantly for the prediction of aspect phrases. The English in-target-language training performance represents an upper baseline, resulting from training the system on manually annotated data for the target language (F1 = 0.42). Without any instance filtering, relying on all automatically produced translations to induce projected annotations for the target language, an F1 measure of 0.21 is reached. With filtering based on the manually assigned translation quality estimation, a result of F1 = 0.43 is reached. Using the alignment as quality score for filtering, the best result obtained is F1 = 0.48. However, results start decreasing from this threshold value on, which is likely due to the fact that the positive effect of instance filtering is outweighed by the performance drop due to training with a smaller dataset. The filtering based on the target language model leads to F1 = 0.42, while the source language model cannot filter the training instances such that the performance increases over the initial value.

Surprisingly, instance filtering has no impact on the detection of subjective phrases. Without any filtering, for the prediction of subjective phrases we get an F1 of 0.42, which is close to the performance of in-target-language training of F1 = 0.49. For the case of aspect phrase detection, the difference between training with all data (21%) and in-target-language training (42%) is considerably higher. Decreasing the size of the training set by filtering only decreases the F1 measure.

Figure 3 shows the macro-average results summarizing the different domains in precision, recall and F1 measure. The thresholds for the filtering have been fixed to the best result of the product coffee machine for all products. The manual quality estimation as well as the alignment and target language model lead to comparable (or superior) results compared to target language training for aspect detection. This holds for nearly all product domains; only for trash cans and cutlery is the performance achieved by filtering slightly lower for the direction German-to-English. The initial performance for the whole data set is on average higher for the projection direction English to German; therefore the values for the source language model are comparably higher than for the other projection direction.

For the aspect detection, all filtering methods except using the source language model lead to an improvement over the baseline (without filtering) that is significant according to a Welch t-test (α = 0.05).



Figure 3: Macro-average results of cross-language training over the different domains showing precision, recall and F1 measure. Significant differences of the F1 results to the no-filter baseline are marked with a star (Welch t-test comparing the different separate domain results, p < 0.05).

For English to German, in-target-language training and the target language model filtering provide improved results over the baseline that are significant. For subjective phrase detection, the in-language training is significantly better than the baseline.

It is notable that in all experiments the model’s performance without filtering is mainly limited in recall, which lowers the F1 performance. Instance filtering therefore mainly has an effect on recall.

3.3 Discussion

Our results show that an approach based on automatic training data projection across languages is feasible and provides competitive results compared to training on manually annotated target language data. We have in particular quantified the loss of performance of such an automatic approach compared to using a system trained on manually annotated data in the target language. We have shown that the performance of aspect detection of a system using all available projected training data yields a drop of ≈ 50% in F1-measure compared to a model trained using manually annotated data in the target language. The instance filtering approach, in which only the sentences with highest quality are selected to project training data to the target language using a threshold, has a significant positive impact on the performance of a model trained on automatically projected training data when predicting aspect phrases on the target language, increasing results from 23% to 47% on average. Our approach relying on instance filtering comes close to results produced by a system trained on manually annotated data for the target language on the task of predicting aspect phrases.

In contrast to these results for aspect phrase recognition, the impact of filtering training instances is negligible for the detection of subjective phrases. The highest performance is achieved with the full set of training instances. Therefore, it may be concluded that for aspect name detection, high-quality training data is crucial, whereas for subjective phrase detection, a larger training set is crucial. In contrast to aspect recognition, the drop in subjective phrase recognition performance is comparatively low when training on all instances.

Filtering translations by manually provided translation scores (as an upper baseline for the filtering) yields comparable results to using the alignment and the language model on the target language. Using the language model on the source language for filtering does not lead to any improvement. Predicting the quality of translation relying on the probability of the source sentence via a source language model therefore seems not to be a viable approach on the task in question. Using the target language model as a filter leads to the most consistent results and is therefore to be preferred over the source language model and the alignment score.

Including more presumably noisy instances by using a smaller filtering threshold leads to a decreased recall throughout all methods in aspect detection, and to a lesser extent for subjective phrase detection. The precision is affected to a smaller degree. This can also be observed in the number of predictions the models based on different thresholds generate: while the number of true positive aspects for the coffee machine subdomain is 1100, only 221 are predicted with a manual quality assignment threshold of 0. However, a threshold of 9 leads to 560 predictions and a threshold of 10 to 1291. This effect can be observed for subjective phrases as well: the number increases from 465 to 827, while the gold number is 676. These observations hold analogously for all filtering methods.

4 Related Work

In-target-language training approaches for fine-grained sentiment analysis include those targeting the extraction of phrases or modelling it as text classification (Choi et al., 2010; Johansson and Moschitti, 2011; Yang and Cardie, 2012; Hu and Liu, 2004; Li et al., 2010; Popescu and Etzioni, 2005; Jakob and Gurevych, 2010b). Such models are typically trained or optimized on manually annotated data (Klinger and Cimiano, 2013; Yang and Cardie, 2012; Jakob and Gurevych, 2010a; Zhang et al., 2011). The necessary data, at least containing fine-grained annotations for aspects and subjective phrases instead of only an overall polarity score, are mainly available for the English language to a sufficient extent. Popular corpora used for training are for instance the J.D. Power and Associates Sentiment Corpora (Kessler et al., 2010) or the MPQA corpora (Wilson and Wiebe, 2005).

Non-English resources are scarce. Examples are a YouTube corpus consisting of English and Italian comments (Uryupina et al., 2014) and a not publicly available German Amazon review corpus of 270 sentences (Boland et al., 2013), in addition to the USAGE corpus (Klinger and Cimiano, 2014) we have used in this work, consisting of German and English reviews. The (non-fine-grained annotated) Spanish TASS corpus consists of Twitter messages (Saralegi and Vicente, 2012). The “Multilingual Subjectivity Analysis Gold Standard Data Set” focuses on subjectivity in the news domain (Balahur and Steinberger, 2009). A Chinese corpus annotated at the aspect and subjective phrase level is described by Zhao et al. (2014).

There has not been much work on approaches to transfer a model either directly or via annotation projection in the area of sentiment analysis. One example is based on sentence-level annotations which are automatically translated to yield a resource in another language. This approach has been proven to work well across several languages (Banea et al., 2010; Mihalcea et al., 2007; Balahur and Turchi, 2014). Recent work approached multi-lingual opinion mining on the above-mentioned multi-lingual YouTube corpus with tree kernels predicting the polarity of a comment and whether it concerns the product or the video in which the product is featured (Severyn et al., 2015). Brooke et al. (2009) compare dictionary and classification transfer from English to Spanish in a similar classification setting.

While cross-lingual annotation projection has been investigated in the context of polarity computation, we are only aware of two approaches exploiting cross-lingual annotation projection for the task of identifying aspects specifically, with an evaluation on manually annotated data in more than one language. The CLOpinionMiner (Zhou et al., 2015) uses an English data set which is transferred to Chinese. Models are further improved by co-training. Xu et al. (2013) perform self-training based on a corpus projected from English to Chinese to detect opinion holders. Due to the lack of existing manually annotated resources, to our knowledge no cross-language projection approach for fine-grained annotation at the level of aspect and subjective phrases has been proposed before.

The projection of annotated data sets has been investigated in a variety of applications. Early work includes approaches to the projection of part-of-speech tags and noun phrases (Yarowsky et al., 2001; Yarowsky and Ngai, 2001) and parsing information (Hwa et al., 2005) on a parallel corpus. Especially in syntactic and semantic parsing, heuristics to remove or correct spuriously projected annotations have been developed (Pado and Lapata, 2009; Agic et al., 2014). It is typical for these approaches to be applied to existing parallel corpora (one counter-example is the work by Basili et al. (2009), who perform postprocessing of machine-translated resources to improve the annotation for training semantic role labeling models). In cases in which no such parallel resources containing pertinent annotations are available, models can be transferred after training. Early work includes a cross-lingual parser adaptation (Zeman and Resnik, 2008). A recent example is the projection of a metaphor detection model using a bilingual dictionary (Tsvetkov et al., 2014). A combination of model transfer and annotation projection for dependency parsing has been proposed by Kozhevnikov and Titov (2014).

To improve the quality of the overall corpus of projected annotations, the selection of data points for dependency parsing has been studied (Søgaard, 2011). Similarly, Axelrod et al. (2011) improve the average quality of machine translation systems by selection of promising training examples and show that such a selection approach has a positive impact. Related to the latter, a generic instance weighting scheme has been proposed for domain adaptation (Jiang and Zhai, 2007).

Other work has attempted to exploit information available in multiple languages to induce a model for a language for which sufficient training data is not available. For instance, universal tag sets take advantage of annotations that are aligned across languages (Snyder et al., 2008). Delexicalization allows for applying a model to other languages (McDonald et al., 2011).

Focusing on cross-lingual sentiment analysis, joint training of classification models on multiple languages shows an improvement over separated models. Balahur and Turchi (2014) analyzed the impact of using different machine translation approaches in such settings. Differences in sentiment expressions have been analyzed between English and Dutch (Bal et al., 2011). Co-training with non-annotated corpora has been shown to yield good results for Chinese (Wan, 2009). Ghorbel (2012) analyzed the impact of automatic translation on sentiment analysis.

Finally, SentiWordNet has been used for multi-lingual sentiment analysis (Denecke, 2008). Building dictionaries for languages with scarce resources can be supported by bootstrapping approaches (Banea et al., 2008).

Estimating the quality of machine translation can be understood as a ranking problem and thus be modeled as regression or classification. An important research focus is on investigating the impact of different features on predicting translation quality. For instance, sentence length, the output probability, the number of unknown words of a target language as well as parsing-based features have been used (Avramidis et al., 2011). The alignment context can also be taken into account (Bach et al., 2011). An overview of confidence measures for machine translation is for instance provided by Ueffing et al. (2003). The impact of different features has been analyzed by Shah et al. (2013). A complete system and framework for quality estimation (including a list of possible features) is QuEst (Specia et al., 2013).

For an overview of other cross-lingual applications and methods, we refer to Bikel and Zitouni (2012).

5 Conclusion and Future Work

We have presented an approach that alleviates the need for training data for a target language when adapting a fine-grained sentiment analysis system to a new language. Our approach relies on training data available for a source language and on automatic machine translation, in particular statistical methods, to project training data to the target language, thus creating a training corpus on which a supervised sentiment analysis approach can be trained. We have in particular shown that our results are competitive to training with manually annotated data in the target language, both for the prediction of aspect phrases as well as subjective phrases. We have further shown that performance for aspect detection can be almost doubled by estimating the quality of translations and selecting only the translations with highest quality for training. Such an effect cannot be observed in the prediction of subjective phrases, which nevertheless delivers results comparable to training using target language data when using all automatically projected training data. Predicting translation quality by both the alignment probability and the target language model probability has been shown to deliver good results, while an approach exploiting the source language model probability does not perform well.

Our hypothesis for the failure of translation filtering for the prediction of subjective phrases is that translation quality for subjective phrases is generally higher, as their coverage in standard parallel corpora is reasonable and they are often domain-independent. A further possible explanation is that subjective phrases have a more complex structure (for instance, their average length is 2.38 tokens in English and 2.57 tokens in German, while the aspect length is 1.6 and 1.3, respectively). Therefore, translation as well as filtering might be more challenging. These hypotheses should be verified and investigated further in future work.

Further work should also be devoted to the investigation of other quality estimation procedures, in particular combinations of those investigated in this paper. Preliminary experiments have shown that the correlation between the filters incorporated in this paper is low. Thus, their combination could indeed have an additional impact. Similarly, the projection quality can be affected by the translation itself and by the alignment. These two aspects should be analyzed separately.

In addition, instead of Boolean filtering (using an instance or not), weighting the impact of the instance in the learning procedure might be beneficial, as lower-quality instances can still be taken into account, although with a lower impact proportional to their corresponding score or probability.

In addition to the presented approach of projecting annotations, a comparison to directly transferring a trained model across languages would allow for a deeper understanding of the processes involved. Finally, it is an important and promising step to apply the presented methods to other languages.

Acknowledgments

We thank Nils Reiter, John McCrae, and Matthias Hartung for fruitful discussions. Thanks to Lucia Specia for her comments on quality estimation methods in our work. We thank Chris Dyer for his help with FastAlign. Thanks to the anonymous reviewers for their comments. This research was partially supported by the “It’s OWL” project (“Intelligent Technical Systems Ostwestfalen-Lippe”, http://www.its-owl.de/), a leading-edge cluster of the German Ministry of Education and Research, and by the Cluster of Excellence Cognitive Interaction Technology ‘CITEC’ (EXC 277) at Bielefeld University, which is funded by the German Research Foundation (DFG).

References

Zeljko Agic, Jorg Tiedemann, Danijela Merkler, Simon Krek, Kaja Dobrovoljc, and Sara Moze. 2014. Cross-lingual dependency parsing of related languages with rich morphosyntactic tagsets. In Workshop on Language Technology for Closely Related Languages and Language Variants, EMNLP.

Eleftherios Avramidis, Maja Popovic, David Vilar, and Aljoscha Burchardt. 2011. Evaluate with confidence estimation: Machine ranking of translation outputs using grammatical features. In Workshop on Statistical Machine Translation, ACL.

Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In EMNLP.

Nguyen Bach, Fei Huang, and Yaser Al-Onaizan. 2011. Goodness: A method for measuring machine translation confidence. In ACL-HLT.

Daniella Bal, Malissa Bal, Arthur van Bunningen, Alexander Hogenboom, Frederik Hogenboom, and Flavius Frasincar. 2011. Sentiment analysis with a multilingual pipeline. In Web Information System Engineering WISE 2011.

Alexandra Balahur and Ralf Steinberger. 2009. Rethinking sentiment analysis in the news: from theory to practice and back. In Proceedings of the Workshop on Opinion Mining and Sentiment Analysis (WOMSA).

Alexandra Balahur and Marco Turchi. 2014. Comparative experiments using supervised learning and machine translation for multilingual sentiment analysis. Computer Speech & Language, 28(1).


Carmen Banea, Rada Mihalcea, and Janyce Wiebe. 2008. A bootstrapping method for building subjectivity lexicons for languages with scarce resources. In LREC.

Carmen Banea, Rada Mihalcea, and Janyce Wiebe. 2010. Multilingual subjectivity: Are more languages better? In COLING.

Roberto Basili, Diego De Cao, Danilo Croce, Bonaventura Coppola, and Alessandro Moschitti. 2009. Cross-language frame semantics transfer in bilingual corpora. In CICLING, volume 5449 of Lecture Notes in Computer Science. Springer Berlin Heidelberg.

Daniel Bikel and Imed Zitouni, editors. 2012. Multilingual Natural Language Processing Applications: From Theory to Practice. IBM Press, 1st edition.

Katarina Boland, Andias Wira-Alam, and Reinhard Messerschmidt. 2013. Creating an annotated corpus for sentiment analysis of German product reviews. Technical Report 2013/05, GESIS Institute.

Julian Brooke, Milan Tofiloski, and Maite Taboada. 2009. Cross-linguistic sentiment analysis: From English to Spanish. In RANLP, Borovets, Bulgaria, September. Association for Computational Linguistics.

Yoonjung Choi, Seongchan Kim, and Sung-Hyon Myaeng. 2010. Detecting Opinions and their Opinion Targets in NTCIR-8. In NTCIR8.

Kerstin Denecke. 2008. Using SentiWordNet for multilingual sentiment analysis. In ICDEW.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. In NAACL.

Hatem Ghorbel. 2012. Experiments in cross-lingual sentiment analysis in discussion forums. In Karl Aberer, Andreas Flache, Wander Jager, Ling Liu, Jie Tang, and Christophe Guret, editors, Social Informatics, volume 7710 of Lecture Notes in Computer Science. Springer Berlin Heidelberg.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In ACM SIGKDD.

Rebecca Hwa, Philip Resnik, Amy Weinberg, Clara Cabezas, and Okan Kolak. 2005. Bootstrapping parsers via syntactic projection across parallel texts. Nat. Lang. Eng., 11(3), September.

Niklas Jakob and Iryna Gurevych. 2010a. Extracting opinion targets in a single- and cross-domain setting with conditional random fields. In EMNLP.

Niklas Jakob and Iryna Gurevych. 2010b. Using anaphora resolution to improve opinion target identification in movie reviews. In ACL.

Jing Jiang and ChengXiang Zhai. 2007. Instance weighting for domain adaptation in NLP. In ACL.

Richard Johansson and Alessandro Moschitti. 2011. Extracting opinion expressions and their polarities: exploration of pipelines and joint models. In ACL-HLT.

Jason S. Kessler, Miriam Eckert, Lyndsie Clark, and Nicolas Nicolov. 2010. The 2010 ICWSM JDPA Sentiment Corpus for the Automotive Domain. In Proc. of the 4th International AAAI Conference on Weblogs and Social Media Data Workshop Challenge (ICWSM-DWC 2010).

Roman Klinger and Philipp Cimiano. 2013. Bi-directional inter-dependencies of subjective expressions and targets and their value for a joint model. In ACL.

Roman Klinger and Philipp Cimiano. 2014. The USAGE review corpus for fine grained multi lingual opinion analysis. In LREC.

Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Conference Proceedings: the tenth Machine Translation Summit. AAMT.

Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York, NY, USA, 1st edition.

Mikhail Kozhevnikov and Ivan Titov. 2014. Cross-lingual model transfer using feature representation projection. In ACL.

Fangtao Li, Minlie Huang, and Xiaoyan Zhu. 2010. Sentiment analysis with global topics and local dependency. In AAAI.

Andrew McCallum, Karl Schultz, and Sameer Singh. 2009. FACTORIE: Probabilistic programming via imperatively defined factor graphs. In NIPS.

Ryan McDonald, Slav Petrov, and Keith Hall. 2011. Multi-source transfer of delexicalized dependency parsers. In EMNLP.

Rada Mihalcea, Carmen Banea, and Janyce Wiebe. 2007. Learning multilingual subjective language via cross-lingual projections. In ACL.

Sebastian Pado and Mirella Lapata. 2009. Cross-lingual annotation projection for semantic roles. Journal of Artificial Intelligence Research (JAIR), 36.

Ana-Maria Popescu and Oren Etzioni. 2005. Extracting product features and opinions from reviews. In HLT-EMNLP.

Xabier Saralegi and Inaki San Vicente. 2012. TASS: Detecting sentiments in Spanish tweets. In Workshop on Sentiment Analysis at SEPLN (TASS). SEPLN, 09/2012.

Sunita Sarawagi and William W. Cohen. 2004. Semi-Markov conditional random fields for information extraction. In NIPS.


Aliaksei Severyn, Alessandro Moschitti, Olga Uryupina, Barbara Plank, and Katja Filippova. 2015. Multi-lingual opinion mining on YouTube. Information Processing and Management. In press.

Kashif Shah and Lucia Specia. 2014. Quality estimation for translation selection. In Conference of the European Association for Machine Translation, EAMT, Dubrovnik, Croatia.

Kashif Shah, Trevor Cohn, and Lucia Specia. 2013. An investigation on the effectiveness of features for translation quality estimation. In Machine Translation Summit XIV.

Benjamin Snyder, Tahira Naseem, Jacob Eisenstein, and Regina Barzilay. 2008. Unsupervised multilingual learning for POS tagging. In EMNLP.

Anders Søgaard. 2011. Data point selection for cross-language adaptation of dependency parsers. In ACL-HLT.

Lucia Specia, Kashif Shah, Jose G.C. de Souza, and Trevor Cohn. 2013. QuEst - a translation quality estimation framework. In ACL.

Andreas Stolcke, Jing Zheng, Wen Wang, and Victor Abrash. 2011. SRILM at sixteen: Update and outlook. In Proceedings IEEE Automatic Speech Recognition and Understanding Workshop.

Andreas Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Proceedings International Conference on Spoken Language Processing.

Yulia Tsvetkov, Leonid Boytsov, Anatole Gershman, Eric Nyberg, and Chris Dyer. 2014. Metaphor detection with cross-lingual model transfer. In ACL.

Nicola Ueffing, Klaus Macherey, and Hermann Ney. 2003. Confidence measures for statistical machine translation. In Proc. MT Summit IX.

Olga Uryupina, Barbara Plank, Aliaksei Severyn, Agata Rotondi, and Alessandro Moschitti. 2014. SenTube: A corpus for sentiment analysis on YouTube social media. In LREC.

Xiaojun Wan. 2009. Co-training for cross-lingual sentiment classification. In Keh-Yih Su, Jian Su, and Janyce Wiebe, editors, ACL/IJCNLP.

M. Wick, K. Rohanimanesh, K. Bellare, A. Culotta, and A. McCallum. 2011. SampleRank: Training factor graphs with atomic gradients. In ICML.

Theresa Wilson and Janyce Wiebe. 2005. Annotating attributions and private states. In CorpusAnno, ACL.

Ruifeng Xu, Lin Gui, Jun Xu, Qin Lu, and Kam-Fai Wong. 2013. Cross lingual opinion holder extraction based on multi-kernel SVMs and transfer learning. World Wide Web Journal.

Bishan Yang and Claire Cardie. 2012. Extracting opinion expressions with semi-Markov conditional random fields. In EMNLP-CoNLL.

David Yarowsky and Grace Ngai. 2001. Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In NAACL.

David Yarowsky, Grace Ngai, and Richard Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In HLT.

Daniel Zeman and Philip Resnik. 2008. Cross-language parser adaptation between related languages. In IJCNLP-08 Workshop on NLP for Less Privileged Languages.

Qi Zhang, Yuanbin Wu, Yan Wu, and Xuanjing Huang. 2011. Opinion mining with sentiment graph. In Proceedings of the 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01.

Y. Zhao, B. Qin, and T. Liu. 2014. Creating a fine-grained corpus for Chinese sentiment analysis. Intelligent Systems, IEEE, PP(99).

Xinjie Zhou, Xiaojun Wan, and Jianguo Xiao. 2015. CLOpinionMiner: Opinion target extraction in a cross-language scenario. IEEE/ACM Transactions on Audio, Speech & Language Processing, 23(4).


Proceedings of the 19th Conference on Computational Natural Language Learning, pages 164–174, Beijing, China, July 30-31, 2015. ©2015 Association for Computational Linguistics

Labeled Morphological Segmentation with Semi-Markov Models

Ryan Cotterell1,2

Department of Computer Science1

Johns Hopkins University, [email protected]

Thomas Müller2 Alexander Fraser2

Center for Information and Language Processing2

University of Munich, [email protected]

Hinrich Schütze2

Abstract

We present labeled morphological segmentation, an alternative view of morphological processing that unifies several tasks. We introduce a new hierarchy of morphotactic tagsets and CHIPMUNK, a discriminative morphological segmentation system that, contrary to previous work, explicitly models morphotactics. We show improved performance on three tasks for all six languages: (i) morphological segmentation, (ii) stemming and (iii) morphological tag classification. For morphological segmentation our method shows absolute improvements of 2-6 points F1 over a strong baseline.

1 Introduction

Morphological processing is often an overlookedproblem since many well-studied languages (e.g.,Chinese and English) are morphologically impov-erished. But for languages with complex mor-phology (e.g., Finnish and Turkish) morphologicalprocessing is essential. A specific form of mor-phological processing, morphological segmenta-tion, has shown its utility for machine translation(Dyer et al., 2008), sentiment analysis (Abdul-Mageed et al., 2014), bilingual word alignment(Eyigöz et al., 2013), speech processing (Creutz etal., 2007b) and keyword spotting (Narasimhan etal., 2014), inter alia. We advance the state-of-the-art in supervised morphological segmentation bydescribing a high-performance, data-driven toolfor handling complex morphology, even in low-resource settings.

In this work, we make the distinction betweenunlabeled morphological segmentation (UMS )(often just called “morphological segmentation”)and labeled morphological segmentation (LMS).The labels in our supervised discriminative model

for LMS capture the distinctions between differenttypes of morphemes and directly model the mor-photactics. We further create a hierarchical uni-versal tagset for labeling morphemes, with differ-ent levels appropriate for different tasks. Our hi-erarchical tagset was designed by creating a stan-dard representation from heterogeneous resourcesfor six languages. This allows us to use a singleunified framework to obtain strong performanceon three common morphological tasks that havetypically been viewed as separate problems andaddressed using different methods. We give anoverview of the tasks addressed in this paper inFigure 1. The figure shows the expected outputfor the Turkish word gençlesmelerin ‘of rejuvenat-ings’. In particular, it shows the full labeled mor-phological segmentation, from which three repre-sentations can be directly derived: the unlabeledmorphological segmentation, the stem/root 1 andthe structured morphological tag containing POSand inflectional features.

We model these tasks with CHIPMUNK, a semi-Markov conditional random field (semi-CRF)(Sarawagi and Cohen, 2004), a model that is well-suited for morphology. We provide a robust eval-uation and analysis on six languages and CHIP-MUNK yields strong results on all three tasks, in-cluding state-of-the-art accuracy on morphologi-cal segmentation.

Section 2 presents our LMS framework and the morphotactic tagsets we use, i.e., the labels of the sequence prediction task CHIPMUNK solves. Section 3 introduces our semi-CRF model. Section 4 presents our novel features. Section 5 compares CHIPMUNK to previous work. Section 6 presents experiments on the three complementary tasks of segmentation (UMS), stemming, and morphological tag classification. Section 7 briefly discusses finite-state morphology. Section 8 concludes.

1 Terminological notes: We use root to refer to a morpheme with concrete meaning, stem to refer to the concatenation of all roots and derivational affixes, root detection to refer to stripping both derivational and inflectional affixes, and stemming to refer to stripping only inflectional affixes.


gençlesmelerin
UMS    genç   les   me    ler   in
Gloss  young  -ate  -ion  -s    GENITIVE MARKER
LMS    genç/ROOT:ADJECTIVAL  les/SUFFIX:DERIV:VERB  me/SUFFIX:DERIV:NOUN  ler/SUFFIX:INFL:NOUN:PLURAL  in/SUFFIX:INFL:NOUN:GENITIVE
Root   genç    Stem   gençlesme    Morphological Tag   PLURAL:GENITIVE

Figure 1: Examples of the tasks addressed for the Turkish word gençlesmelerin ‘of rejuvenatings’: traditional unlabeled segmentation (UMS), labeled morphological segmentation (LMS), stemming / root detection and (inflectional) morphological tag classification. The morphotactic annotations produced by LMS allow us to solve these tasks using a single model.
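To make the relationship between the representations in Figure 1 concrete, the sketch below derives the unlabeled segmentation, the root, the stem, and the inflectional tag from a labeled analysis. It is a minimal illustration that assumes the colon-separated tag strings shown in the figure; the function names are ours, not part of CHIPMUNK.

```python
# Minimal sketch: deriving UMS, root, stem, and inflectional tag from an LMS
# analysis, assuming the colon-separated tag strings of Figure 1.
lms = [
    ("genç", "ROOT:ADJECTIVAL"),
    ("les",  "SUFFIX:DERIV:VERB"),
    ("me",   "SUFFIX:DERIV:NOUN"),
    ("ler",  "SUFFIX:INFL:NOUN:PLURAL"),
    ("in",   "SUFFIX:INFL:NOUN:GENITIVE"),
]

def ums(analysis):
    """Unlabeled segmentation: keep the segments, drop the labels."""
    return [seg for seg, _ in analysis]

def root(analysis):
    """Root(s): segments whose tag starts with ROOT."""
    return "".join(seg for seg, tag in analysis if tag.startswith("ROOT"))

def stem(analysis):
    """Stem: concatenation of all roots and derivational affixes."""
    return "".join(seg for seg, tag in analysis if ":INFL" not in tag)

def morph_tag(analysis):
    """Inflectional tag: feature values of the inflectional affixes, in order."""
    return ":".join(tag.split(":")[-1] for seg, tag in analysis if ":INFL" in tag)

print(ums(lms))        # ['genç', 'les', 'me', 'ler', 'in']
print(root(lms))       # genç
print(stem(lms))       # gençlesme
print(morph_tag(lms))  # PLURAL:GENITIVE
```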

The datasets created in this work, additional description of our novel tagsets and CHIPMUNK can be found at http://cistern.cis.lmu.de/chipmunk.

2 Labeled Segmentation and Tagset

We define the framework of labeled morphologi-cal segmentation (LMS), an enhancement of mor-phological segmentation that—in addition to iden-tifying the boundaries of segments—assigns afine-grained morphotactic tag to each segment.LMS leads to both better modeling of segmenta-tion and subsumes several other tasks, e.g., stem-ming.

Most previous approaches to morphologicalsegmentation are either unlabeled or use a small,coarse-grained set such as prefix, root, suffix. Incontrast, our labels are fine-grained. This finergranularity has two advantages. (i) The labels areneeded for many tasks, for instance in sentimentanalysis detecting morphologically encoded nega-tion, as in Turkish, is crucial. In other words,for many applications UMS is insufficient. (ii)The LMS framework allows us to learn a prob-abilistic model of morphotactics. Working withLMS results in higher UMS accuracy. So even inapplications that only need segments and no la-bels, LMS is beneficial. Note that the concate-nation of labels across segments yields a bundleof morphological attributes similar to those foundin the CoNLL datasets often used to train mor-phological taggers (Buchholz and Marsi, 2006)—thus LMS helps to unify UMS and morphologicaltagging. We believe that LMS is a needed exten-sion of current work in morphological segmenta-tion. Our framework concisely allows the modelto capture interdependencies among various mor-phemes and model relations between entire mor-

pheme classes—a neglected aspect of the problem.We first create a hierarchical tagset with in-

creasing granularity, which we created by analyz-ing the heterogeneous resources for the six lan-guages we work on. The optimal level of gran-ularity is task and language dependent: the levelis a trade-off between simplicity and expressivity.We illustrate our tagset with the decomposition ofthe German word Enteisungen ‘defrostings’ (Fig-ure 2).

The level 0 tagset involves a single tag indicating a segment. It ignores morphotactics completely and is similar to previous work. The level 1 tagset crudely approximates morphotactics: it consists of the tags {PREFIX, ROOT, SUFFIX}. This scheme has been successfully used by unsupervised segmenters, e.g., MORFESSOR CAT-MAP (Creutz et al., 2007a). It allows the model to learn simple morphotactics, for instance that a prefix cannot be followed by a suffix. This makes a decomposition like reed → re+ed unlikely. We also add an additional UNKNOWN tag for morphemes that do not fit into this scheme. The level 2 tagset splits affixes into DERIVATIONAL and INFLECTIONAL, effectively increasing the maximal tagset size from 4 to 6. These tags can encode that many languages allow for transitions from derivational to inflectional endings, but rarely the opposite. This makes the incorrect decomposition of German Offenheit ‘openness’ into Off, inflectional en and derivational heit unlikely2. This tagset is also useful for building statistical stemmers. The level 3 tagset adds POS, i.e., whether a root is VERBAL, NOMINAL or ADJECTIVAL, and the POS of the word that an affix derives. The level 4 tagset includes the inflectional feature a suffix adds, e.g., CASE or NUMBER. This is helpful for certain agglutinative languages, in which, e.g., CASE must follow NUMBER. The level 5 tagset adds the actual value of the inflectional feature, e.g., PLURAL, and corresponds to the annotation in the datasets. In preliminary experiments we found that the level 5 tagset is too rich and does not yield consistent improvements; we thus do not explore it. Table 1 shows tagset sizes for the six languages.3

2 Like en in English open, en in German Offen is part of the root.

3 As converting segmentation datasets to tagsets is not always straightforward, we include tags that lack some features, e.g., some level 4 German tags lack POS because our German data does not specify it.

German   Ent                eis        ung                en
English  de                 frost      ing                s
Level 5  PREFIX:DERIV:VERB  ROOT:NOUN  SUFFIX:DERIV:NOUN  SUFFIX:INFL:NOUN:PLURAL
Level 4  PREFIX:DERIV:VERB  ROOT:NOUN  SUFFIX:DERIV:NOUN  SUFFIX:INFL:NOUN:NUMBER
Level 3  PREFIX:DERIV:VERB  ROOT:NOUN  SUFFIX:DERIV:NOUN  SUFFIX:INFL:NOUN
Level 2  PREFIX:DERIV       ROOT       SUFFIX:DERIV       SUFFIX:INFL
Level 1  PREFIX             ROOT       SUFFIX             SUFFIX
Level 0  SEGMENT            SEGMENT    SEGMENT            SEGMENT

Figure 2: Example of the different morphotactic tagset granularities for German Enteisungen ‘defrostings’.

level:      0  1  2  3   4
English     1  4  5  13  16
Finnish     1  4  6  14  17
German      1  4  6  13  17
Indonesian  1  4  4  8   8
Turkish     1  3  4  10  20
Zulu        1  4  6  14  17

Table 1: Morphotactic tagset size at each level of granularity.
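One way to see how the hierarchy works is to derive the coarser levels from the most fine-grained tags by truncating the colon-separated string, as in Figure 2. The sketch below is our own illustration under that reading: levels 0-3 are simple truncations, while level 4 additionally needs a mapping from feature values to feature categories (e.g. PLURAL → NUMBER), which we stub out here; the paper's actual conversion rules are only described informally.

```python
# Hedged sketch: mapping a full tag such as SUFFIX:INFL:NOUN:PLURAL down to a
# coarser level, following Figure 2. VALUE_TO_CATEGORY is a stand-in mapping,
# not the paper's conversion table.
VALUE_TO_CATEGORY = {"PLURAL": "NUMBER", "SINGULAR": "NUMBER",
                     "GENITIVE": "CASE", "NOMINATIVE": "CASE"}

def coarsen(tag, level):
    """Reduce a fine-grained tag to the requested granularity level (0-5)."""
    if level == 0:
        return "SEGMENT"
    parts = tag.split(":")
    if parts[0] == "ROOT":
        # Roots only carry a POS, which appears from level 3 onwards.
        return parts[0] if level < 3 else ":".join(parts[:2])
    if level == 4 and len(parts) == 4:
        # Level 4 keeps the feature category (NUMBER, CASE, ...) instead of its value.
        return ":".join(parts[:3] + [VALUE_TO_CATEGORY.get(parts[3], parts[3])])
    return ":".join(parts[:min(level, len(parts))])

print(coarsen("SUFFIX:INFL:NOUN:PLURAL", 2))  # SUFFIX:INFL
print(coarsen("SUFFIX:INFL:NOUN:PLURAL", 4))  # SUFFIX:INFL:NOUN:NUMBER
print(coarsen("ROOT:NOUN", 1))                # ROOT
```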

3 Model

CHIPMUNK is a supervised model implemented using the well-understood semi-Markov conditional random field (semi-CRF) (Sarawagi and Cohen, 2004) that naturally fits the task of LMS. Semi-CRFs generalize linear-chain CRFs and model segmentation jointly with sequence labeling. Just as linear-chain CRFs are discriminative adaptations of hidden Markov models (Lafferty et al., 2001), semi-CRFs are an analogous adaptation of hidden semi-Markov models (Murphy, 2002). Semi-CRFs allow us to elegantly integrate new features that look at complete segments; this is not possible with CRFs, making semi-CRFs a natural choice for morphology.

A semi-CRF represents w (a word) as a sequence of segments s = ⟨s1, . . . , sn⟩, each of which is assigned a label ℓi. The concatenation of all segments equals w. We seek a log-linear distribution p_θ(s, ℓ | w) over all possible segmentations and label sequences for w, where θ is the parameter vector. Note that we recover the standard CRF if we restrict the segment length to 1. Formally, we define p_θ as

p_\theta(s, \ell \mid w) \stackrel{\text{def}}{=} \frac{1}{Z_\theta(w)} \prod_i e^{\theta^\top f(s_i, \ell_i, \ell_{i-1}, i)}    (1)

where f is the feature function and Z_θ(w) is the partition function. To keep the notation uncluttered, we will write f without all its arguments in the future. We use a generalization of the forward-backward algorithm for efficient gradient computation (Sarawagi and Cohen, 2004). Inspection of the semi-Markov forward recursion,

\alpha(t, \ell) = \sum_i \sum_{\ell'} e^{\theta^\top f} \cdot \alpha(t - i, \ell')    (2)

shows that the algorithm runs in O(n² · L²) time, where n is the length of the word w and L is the number of labels (size of the tagset).
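For readers who prefer code to recursions, the following is a minimal sketch of the semi-Markov forward pass of Equation 2 in log space. The scoring function, label set and maximum segment length are placeholders of our own; this illustrates the dynamic program, not the CHIPMUNK implementation.

```python
import math

def log_sum_exp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def forward(word, labels, score, max_seg_len=8):
    """Semi-Markov forward pass: alpha[t][l] sums over all segmentations of
    word[:t] whose last segment ends at t with label l.
    score(word, start, end, label, prev_label) plays the role of theta^T f."""
    n = len(word)
    BOS = "<s>"
    alpha = [{l: float("-inf") for l in labels} for _ in range(n + 1)]
    alpha[0] = {BOS: 0.0}
    for t in range(1, n + 1):
        for l in labels:
            terms = []
            # Sum over the length of the last segment and the previous label.
            for length in range(1, min(max_seg_len, t) + 1):
                start = t - length
                for prev, prev_score in alpha[start].items():
                    if prev_score != float("-inf"):
                        terms.append(prev_score + score(word, start, t, l, prev))
            if terms:
                alpha[t][l] = log_sum_exp(terms)
    return log_sum_exp(list(alpha[n].values()))   # log of the partition function

# Toy usage with a uniform scorer (illustration only): the result is the log of
# the number of labeled segmentations of the word.
labels = ["PREFIX", "ROOT", "SUFFIX"]
print(forward("unfold", labels, lambda w, s, e, l, p: 0.0))
```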

We employ the maximum-likelihood criterionto estimate the parameters with L-BFGS (Liu andNocedal, 1989), a gradient-based optimization al-gorithm. As in all exponential family models, thegradient of the log-likelihood takes the form of thedifference between the observed and expected fea-tures counts (Wainwright and Jordan, 2008) andcan be computed efficiently with the semi-Markovextension of the forward-backward algorithm. Weuse L2 regularization with a regularization coeffi-cient tuned during cross-validation.

We note that semi-Markov models have the po-tential to obviate typical errors made by standardMarkovian sequence models with an IOB label-ing scheme over characters. For instance, con-sider the incorrect segmentation of the Englishverb sees into se+es. These are reasonable splitpositions as many English stems end in se (e.g.,consider abuse-s). Semi-CRFs have a major ad-vantage here as they can have segmental featuresthat allow them to learn se is not a good morph.


            # Affixes  Random Examples
English     394        -ard -taxy -odon -en -otic -fold
Finnish     120        -tä -llä -ja -t -nen -hön -jä -ton
German      112        -nomie -lichenes -ell -en -yl -iv
Indonesian  5          -kau -an -nya -ku -mu
Turkish     263        -ten -suz -mek -den -t -ünüz
Zulu        72         i- u- za- tsh- mi- obu- olu-

Table 2: Sizes of the various affix gazetteers.

4 Features

We introduce several novel features for LMS. Weexploit existing resources, e.g., spell checkers andWiktionary, to create straightforward and effectivefeatures and we incorporate ideas from related ar-eas: named-entity recognition (NER) and morpho-logical tagging.

Affix Features and Gazetteers. In contrast tosyntax and semantics, the morphology of a lan-guage is often simple to document and a list of themost common morphs can be found in any goodgrammar book. Wiktionary, for example, con-tains affix lists for all the six languages used inour experiments.4 Providing a supervised learnerwith such a list is a great boon, just as gazetteerfeatures aid NER (Smith and Osborne, 2006)—perhaps even more so since suffixes and prefixesare generally closed-class; hence these lists arelikely to be comprehensive. These features arebinary and fire if a given substring occurs in thegazetteer list. In this paper, we simply use suffixlists from English Wiktionary, except for Zulu, forwhich we use a prefix list, see Table 2.

We also include a feature that fires on the con-junction of tags and substrings observed in thetraining data. In the level 5 tagset this allows usto link all allomorphs of a given morpheme. In thelower level tagsets, this links related morphemes.Virpioja et al. (2010) explored this idea for un-supervised segmentation. Linking allomorphs to-gether under a single tag helps combat sparsity inmodeling the morphotactics.

Stem Features. A major problem in statistical segmentation is the reluctance to posit morphs not observed in training; this particularly affects roots, which are open-class. This makes it nearly impossible to correctly segment compounds that contain unseen roots, e.g., to correctly segment homework you need to know that home and work are independent English words. We solve this problem by incorporating spell-check features: binary features that fire if a segment is valid for a given spell checker. Spell-check features function effectively as a proxy for a “root detector”. We use the open-source ASPELL dictionaries as they are freely available in 91 languages. Table 3 shows the coverage of these dictionaries.

4 A good example of such a resource is en.wiktionary.org/wiki/Category:Turkish_suffixes.

English     119,839
Finnish     6,690,417
German      364,564
Indonesian  35,269
Turkish     80,261
Zulu        73,525

Table 3: Number of words covered by the respective ASPELL dictionary.

Integrating the Features. Our model uses the features discussed in this section and additionally the simple n-gram context features of Ruokolainen et al. (2013). The n-gram features look at variable length substrings of the word on both the right and left side of each potential boundary. We create conjunctive features from the cross-product between the morphotactic tagset (Section 2) and the features.
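As a rough illustration of how these segment-level features combine, the sketch below builds binary features for one candidate segment from an affix gazetteer and a spell-check dictionary, and conjoins every feature with the candidate's morphotactic tag. The feature names, the set-based "spell checker", and the n-gram template are our own placeholders, not the paper's exact feature templates.

```python
def segment_features(word, start, end, tag, affix_gazetteer, dictionary, max_ngram=3):
    """Binary features for the candidate segment word[start:end] labeled `tag`.
    Conjoining each feature with the tag mirrors the cross-product described above."""
    seg = word[start:end]
    feats = set()
    if seg in affix_gazetteer:          # gazetteer feature (closed-class affixes)
        feats.add("in_affix_gazetteer")
    if seg.lower() in dictionary:       # spell-check feature, a proxy root detector
        feats.add("in_dictionary")
    feats.add("seg=" + seg)             # tag-substring conjunction (links allomorphs)
    for k in range(1, max_ngram + 1):   # simple character n-gram context at the boundaries
        feats.add("left=" + word[max(0, start - k):start])
        feats.add("right=" + word[end:end + k])
    return {f + "&tag=" + tag for f in feats}

# Toy usage with made-up resources:
print(segment_features("homework", 4, 8, "ROOT:NOUN",
                       affix_gazetteer={"s", "ing", "ed"},
                       dictionary={"home", "work"}))
```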

5 Related Work

Van den Bosch and Daelemans (1999) and Marsi et al. (2005) present memory-based approaches to discriminative learning of morphological segmentation. This is the previous work most similar to our work. They address the problem of LMS. We distinguish our work from theirs in that we define a cross-lingual schema for defining a hierarchical tagset for LMS. Moreover, we tackle the problem with a feature-rich log-linear model, allowing us to easily incorporate disparate sources of knowledge into a single framework, as we show in our extensive evaluation.

UMS has been mainly addressed by unsupervised algorithms. LINGUISTICA (Goldsmith, 2001) and MORFESSOR (Creutz and Lagus, 2002) are built around an idea of optimally encoding the data, in the sense of minimal description length (MDL). MORFESSOR CAT-MAP (Creutz et al., 2007a) formulates the model as sequence prediction based on HMMs over a morph dictionary and MAP estimation. The model also attempts to induce basic morphotactic categories (PREFIX, ROOT, SUFFIX). Kohonen et al. (2010a,b) and Grönroos et al. (2014) present variations of MORFESSOR for semi-supervised learning. Poon et al. (2009) introduce a Bayesian state-space model with corpus-wide priors. The model resembles a semi-CRF, but dynamic programming is no longer possible due to the priors. They employ the three-state tagset of Creutz and Lagus (2004) (row 1 in Figure 2) for Arabic and Hebrew UMS. Their gradient and objective computation is based on an enumeration of a heuristically chosen subset of the exponentially many segmentations. This limits its applicability to languages with complex concatenative morphology, e.g., Turkish and Finnish.

Ruokolainen et al. (2013) present an averagedperceptron (Collins, 2002), a discriminative struc-tured prediction method, for UMS. The model out-performs the semi-supervised model of Poon et al.(2009) on Arabic and Hebrew morpheme segmen-tation as well as the semi-supervised model of Ko-honen et al. (2010a) on English, Finnish and Turk-ish.

Finally, Ruokolainen et al. (2014) get further consistent improvements by using features extracted from large corpora, based on the letter successor variety (LSV) model (Harris, 1995) and on unsupervised segmentation models such as Morfessor CatMAP (Creutz et al., 2007a). The idea behind LSV is that, for example, talking should be split into talk and ing, because talk can also be followed by different letters than i, such as e (talked) and s (talks).

Chinese word segmentation (CWS) is relatedto UMS. Andrew (2006) successfully apply semi-CRFs to CWS. The problem of joint CWS andPOS tagging (Ng and Low, 2004; Zhang andClark, 2008) is related to LMS. To our knowl-edge, joint CWS and POS tagging has not beenaddressed by a simple single semi-CRF, possi-bly because POS tagsets typically used in Chinesetreebanks are much bigger than our morphotactictagsets and the morphological poverty of Chinesemakes higher-order models necessary and the di-rect application of semi-CRFs infeasible.

6 Experiments

We experimented on six languages from diverse language families. The segmentation data for English, Finnish and Turkish was taken from MorphoChallenge 2010 (Kurimo et al., 2010).5 Despite typically being used for UMS tasks, the MorphoChallenge datasets do contain morpheme level labels. The German data was extracted from the CELEX2 collection (Baayen et al., 1993). The Zulu data was taken from the Ukwabelana corpus (Spiegler et al., 2010). Finally, the Indonesian portion was created applying the rule-based analyzer MORPHIND (Larasati et al., 2011) to the Indonesian portion of an Indonesian-English bilingual corpus.6

5 http://research.ics.aalto.fi/events/morphochallenge2010/

             Un. Data  Train  Tune  Dev  Test
English      878k      800    100   100  694
Finnish      2,928k    800    100   100  835
German       2,338k    800    100   100  751
Indonesian   88k       800    100   100  2500
Turkish      617k      800    100   100  763
Zulu         123k      800    100   100  9040

Table 4: Dataset sizes (number of types).

We did not have access to the MorphoChallengetest set and thus used the original development setas our final evaluation set (Test). We developedCHIPMUNK using 10-fold cross-validation on the1000 word training set and split every fold intotraining (Train), tuning (Tune) and developmentsets (Dev).7 For German, Indonesian and Zulu werandomly selected 1000 word forms as training setand used the rest as evaluation set. For our finalevaluation we trained CHIPMUNK on the concate-nation of Train, Tune and Dev (the original 1000word training set), using the optimal parametersfrom the cross-evaluation and tested on Test.

One of our baselines also uses unlabeled train-ing data. MorphoChallenge provides word lists forEnglish, Finnish, German and Turkish. We use theunannotated part of Ukwabelana for Zulu; and forIndonesian, data from Wikipedia and the corpus ofKrisnawati and Schulz (2013).

Table 4 shows the important statistics of ourdatasets.

In all evaluations, we use variants of the standard MorphoChallenge evaluation approach. Importantly, for word types with multiple correct segmentations, this involves finding the maximum score by comparing our hypothesized segmentation with each correct segmentation, as is standardly done in MorphoChallenge.
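The sketch below shows one way to implement this evaluation: boundary precision, recall and F1 for a hypothesized segmentation, taking the maximum over the alternative gold segmentations of a type. It follows the description above rather than the official MorphoChallenge scripts, so treat details such as tie-breaking as assumptions.

```python
def boundaries(segments):
    """Segment-boundary positions, e.g. ['genç', 'les', 'me'] -> {4, 7}."""
    positions, i = set(), 0
    for seg in segments[:-1]:
        i += len(seg)
        positions.add(i)
    return positions

def boundary_prf(hyp, golds):
    """Boundary precision/recall/F1 of `hyp`, maximised over gold alternatives."""
    best = (0.0, 0.0, 0.0)                    # (f, p, r), compared by F1 first
    h = boundaries(hyp)
    for gold in golds:
        g = boundaries(gold)
        tp = len(h & g)
        p = tp / len(h) if h else 1.0
        r = tp / len(g) if g else 1.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        best = max(best, (f, p, r))
    f, p, r = best
    return p, r, f

# Toy usage with two acceptable gold analyses:
print(boundary_prf(["talk", "ing"], [["talk", "ing"], ["talki", "ng"]]))  # (1.0, 1.0, 1.0)
```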

6https://github.com/desmond86/Indonesian-English-Bilingual-Corpus

7We used both Tune and Dev in order to both optimizehyperparameters on held-out data (Tune) and perform quali-tative error analysis on separate held-out data (Dev).


                              English  Finnish  Indonesian  German  Turkish  Zulu
CRF-MORPH                     83.23    81.98    93.09       84.94   88.32    88.48
CRF-MORPH +LSV                84.45    84.35    93.50       86.90   89.98    89.06
First-order CRF               84.66    85.05    93.31       85.47   90.03    88.99
Higher-order CRF              84.66    84.78    93.88       85.40   90.65    88.85
CHIPMUNK                      84.40    84.40    93.76       85.53   89.72    87.80
CHIPMUNK +Morph               83.27    84.71    93.17       84.84   90.48    90.03
CHIPMUNK +Affix               83.81    86.02    93.51       85.81   89.72    89.64
CHIPMUNK +Dict                86.10    86.11    95.39       87.76   90.45    88.66
CHIPMUNK +Dict,+Affix,+Morph  86.31    88.38    95.41       87.85   91.36    90.16

Table 5: Test F1 for UMS. Features: LSV = letter successor variety, Affix = affix, Dict = dictionary, Morph = optimal (on Tune) morphotactic tagset.

6.1 UMS Experiments

We first evaluate CHIPMUNK on UMS, by predicting LMS and then discarding the labels. Our primary baseline is the state-of-the-art supervised system CRF-MORPH of Ruokolainen et al. (2013). We ran the version of the system that the authors published on their website.8 We optimized the model’s two hyperparameters on Tune: the number of epochs and the maximal length of n-gram character features. The system also supports Harris’s letter successor variety (LSV) features (Section 5), extracted from large unannotated corpora, our second baseline. For completeness, we also compare CHIPMUNK with a first-order CRF and a higher-order CRF (Müller et al., 2013), both used the same n-gram features as CRF-MORPH, but without the LSV features.9 We evaluate all models using the traditional macro F1 of the segmentation boundaries.

Discussion. The UMS results on held-out data are displayed in Table 5. Our most complex model beats the best baseline by between 1 (German) and 3 (Finnish) points F1 on all six languages. We additionally provide extensive ablation studies to highlight the contribution of our novel features. We find that the properties of each specific language highly influence which features are most effective. For the agglutinative languages, i.e., Finnish, Turkish and Zulu, the affix based features (+Affix) and the morphotactic tagset (+Morph) yield consistent improvements over the semi-CRF models with a single state. Improvements for the affix features range from 0.2 for Turkish to 2.14 for Zulu. The morphological tagset yields improvements of 0.77 for Finnish, 1.89 for Turkish and 2.10 for Zulu. We optimized tagset granularity on Tune and found that levels 4 and level 2 yielded the best results for the three agglutinative and the three other languages, respectively.

8 http://users.ics.tkk.fi/tpruokol/software/crfs_morph.zip

9 Model order, maximal character n-gram length and regularization coefficients were optimized on Tune.

The dictionary features (+Dict) help universally,but their effects are particularly salient in lan-guages with productive compounding, i.e., En-glish, Finnish and German, where we see improve-ments of > 1.7.

In comparison with previous work (Ruoko-lainen et al., 2013) we find that our most complexmodel yields consistent improvements over CRF-MORPH +LSV for all languages: The improve-ments range from > 1 for German over > 1.5 forZulu, English, and Indonesian to > 2 for Turkishand > 4 for Finnish.

To illustrate the effect of modeling morphotac-tics through the larger morphotactic tagset on per-formance, we provide a detailed analysis of Turk-ish. See Table 6. We consider three different fea-ture sets and increase the size of the morphotactictagsets depicted in Figure 2. The results evince thegeneral trend that improved morphotactic model-ing benefits segmentation. Additionally, we ob-serve that the improvements are complementary tothose from the other features.

As discussed earlier, a key problem in UMS, especially in low-resource settings, is the detection of novel roots and affixes. Since many of our features were designed to combat this problem specifically, we investigated this aspect independently. Table 7 shows the number of novel roots and affixes found by our best model and the baseline. In all languages, CHIPMUNK correctly identifies between 5% (English) and 22% (Finnish) more novel roots than the baseline. We do not see major improvements for affixes, but this is of less interest as there are far fewer novel affixes.

                  +Affix  +Dict,+Affix
Level 0   90.11   90.13   91.66
Level 1   90.73   90.68   92.80
Level 2   89.80   90.46   92.04
Level 3   91.03   90.83   92.31
Level 4   91.80   92.19   93.21

Table 6: Example of the effect of larger tagsets (Figure 2) on Turkish segmentation measured on our development set. As Turkish is an agglutinative language with hundreds of affixes, the efficacy of our approach is expected to be particularly salient here. Recall we optimized for the best tagset granularity for our experiments on Tune.

Figure 3: This figure represents a comparative analysis of undersegmentation. Each column (labels at the bottom) shows how often CRF-MORPH +LSV (top number in heatmap) and CHIPMUNK (bottom number in heatmap) select a segment that is two separate segments in the gold standard. E.g., Rt-Sx indicates how a root and a suffix were treated as a single segment. The color depends on the difference of the two counts.

We further explore how CHIPMUNK and thebaseline perform on different boundary types bylooking at missing boundaries between differentmorphotactic types; this error type is also knownas undersegmentation. Figure 3 shows a heatmapthat overviews errors broken down by morphotac-tic tag. We see that most errors are caused betweenroot and suffixes across all languages. This is re-lated to the problem of finding new roots, as a newroot is often mistaken as a root-affix composition.

6.2 Root Detection and Stemming

Root detection1 and stemming1 are two important NLP problems that are closely related to morphological segmentation and used in applications such as MT, information retrieval, parsing and information extraction. Here we explore the utility of CHIPMUNK as a statistical stemmer and root detector.

             CRF-MORPH        CHIPMUNK
             Roots  Affixes   Roots  Affixes
English      614    6         644    12
Finnish      502    10        613    11
German       360    6         414    9
Indonesian   593    0         639    0
Turkish      435    22        514    19
Zulu         146    10        160    11

Table 7: Dev number of unseen root and affix types correctly identified by CRF-MORPH +LSV and CHIPMUNK +Affix,+Dict,+Morph.

Stemming is closely related to the task of lemmatization, which involves the additional step of normalizing to the canonical form.10 Consider the German particle verb participle auf-ge-schrieb-en ‘written down’. The participle is built by applying an alternation to the verbal root schreib ‘write’, adding the participial circumfix ge-en and finally adding the verb particle auf. In our segmentation-based definition, we would consider schrieb ‘write’ as its root and auf-schrieb as its stem. In order to additionally restore the lemma, we would also have to reverse the stem alternation that replaced ei with ie and add the infinitival ending en, yielding the infinitive auf-schreib-en.

Our baseline MORFETTE (Chrupala et al., 2008) is a statistical transducer that first extracts edit paths between input and output and then uses a perceptron classifier to decide which edit path to apply. In short, MORFETTE treats the task as a string-to-string transduction problem, whereas we view it as a labeled segmentation problem.11 Note that MORFETTE would in principle be able to handle stem alternations, although these usually lead to an increase in the number of edit paths. We use level 2 tagsets for all experiments (the smallest tagsets complex enough for stemming) and extract the relevant segments.

10 Thus in our experiments there are no stem alternations. The output is equivalent to that of the Porter stemmer (Porter, 1980).

11 Note that MORFETTE is a pipeline that first tags and then lemmatizes. We only make use of this second part of MORFETTE, for which it is a strong string-to-string transduction baseline.

Discussion. Our results are shown in Table 8. We see consistent improvements across all tasks. For the fusional languages (English, German and Indonesian) we see modest gains in performance on both root detection and stemming. However, for the agglutinative languages (Finnish, Turkish and Zulu) we see absolute gains as high as 50% (Turkish) in accuracy. This significant improvement is due to the complexity of the tasks in these languages: their productive morphology increases sparsity and makes the unstructured string-to-string transduction approach suboptimal. We view this as solid evidence that labeled segmentation has utility in many components of the NLP pipeline.

                      English  Finnish  German  Indonesian  Turkish  Zulu
Root       MORFETTE   62.82    39.28    43.81   86.00       26.08    30.76
Detection  CHIPMUNK   70.31    69.85    67.37   90.00       75.62    62.23
Stemming   MORFETTE   91.35    51.74    79.49   86.00       28.57    58.12
           CHIPMUNK   94.24    79.23    85.75   89.36       85.06    67.64

Table 8: Test accuracies for root detection and stemming.

                       Finnish  Turkish
F1    MaxEnt           75.61    69.92
      MaxEnt +Split    74.02    76.61
      CHIPMUNK +All    80.34    85.07
Acc.  MaxEnt           60.96    37.88
      MaxEnt +Split    59.04    44.30
      CHIPMUNK +All    65.00    56.06

Table 9: Test F-scores / accuracies for morphological tag classification.

6.3 Morphological Tag Classification

The joint modeling of segmentation and morpho-tactic tags allows us to use CHIPMUNK for a crudeform of morphological analysis: the task of mor-phological tag classification , which we define asannotation of a word with its most likely inflec-tional features.12 To be concrete, our task is topredict the inflectional features of word type basedonly on its character sequence and not its sen-tential context. To this end, we take Finnish andTurkish as two examples of languages that shouldsuit our approach particularly well as both havehighly complex inflectional morphologies. We useour most fine-grained tagset and replace all non-inflectional tags with a simple segment tag. Thetagset sizes are listed in Table 10.

We use the same experimental setup as in Section 6.2 and compare CHIPMUNK to a maximum entropy classifier (MaxEnt), whose features are character n-grams of up to a maximal length of k.13 The maximum entropy classifier is L1-regularized and its regularization coefficient as well as the value for k are optimized on Tune. As a second, stronger baseline we use a MaxEnt classifier that splits tags into their constituents and concatenates the features with every constituent as well as the complete tag (MaxEnt +Split). Both of the baselines in Table 9 are 0th-order versions of the state-of-the-art CRF-based morphological tagger MARMOT (Müller et al., 2013) (since our model is type-based), making this a strong baseline. We report full analysis accuracy and macro F1 on the set of individual inflectional features.

12 We recognize that this task is best performed with sentential context (token-based). Integration with a POS tagger, however, is beyond the scope of this paper.

13 Prefixes and suffixes are explicitly marked.

          Morpheme Tags  Full Word Tags
Finnish   43             172
Turkish   50             636

Table 10: Number of full word and morpheme tags in the datasets.

Discussion. The results in Table 9 show that ourproposed method outperforms both baselines onboth performance metrics. We see gains of over6% in accuracy in both languages. This is evi-dence that our proposed approach could be suc-cessfully integrated into a morphological tagger togive a stronger character-based signal.

7 Comparison to Finite-State Morphology

A morphological finite-state analyzer is customarily a hand-crafted tool that generates all the possible morphological readings with their associated features. We believe that, for many applications, high quality finite-state morphological analysis is superior to our techniques. Finite-state morphological analyzers output a small set of linguistically valid analyses of a type, typically with only limited overgeneration. However, there are two significant problems. The first is that significant effort is required to develop the transducers modeling the “grammar” of the morphology and there is significant effort in creating and updating the lexicon. The second is that it is difficult to use finite-state morphology to guess analyses involving roots not covered in the lexicon.14 In fact, this is usually solved by viewing it as a different problem, morphological guessing, where linguistic knowledge similar to the features we have presented is used to try to guess POS and morphological analysis for types with no analysis.

In contrast, our training procedure learns aprobabilistic transducer, which is a soft version ofthe type of hand-engineered grammar that is usedin finite-state analyzers. The 1-best labeled mor-phological segmentation our model produces of-fers a simple and clean representation which willbe of great use in many downstream applications.Furthermore our model unifies analysis and guess-ing into a single simple framework. Nevertheless,finite-state morphologies are still extremely use-ful, high-precision tools. A primary goal of fu-ture work will be to use CHIPMUNK to attemptto induce higher-quality morphological processingsystems.

8 Conclusion and Future Work

We have presented labeled morphological seg-mentation (LMS) in this paper, a new ap-proach to morphological processing. LMS uni-fies three tasks that were solved before by differ-ent methods—unlabeled morphological segmenta-tion, stemming, and morphological tag classifica-tion. LMS annotation itself has great potential foruse in downstream NLP applications. Our hierar-chy of labeled morphological segmentation tagsetscan be used to map the heterogeneous data in sixlanguages we work with to universal representa-tions of different granularities. We plan future cre-ation of gold standard segmentations in more lan-guages using our annotation scheme.

We further presented CHIPMUNK, a semi-CRF-based model for LMS that allows for the integration of various linguistic features and consistently out-performs previously presented approaches to unlabeled morphological segmentation. An important extension of CHIPMUNK is embedding it in a context-sensitive POS tagger. Current state-of-the-art models only employ character level n-gram features to model word-internals (Müller et al., 2013). We have demonstrated that our structured approach outperforms this baseline. We leave this natural extension to future work.

14 While one can in theory put in wildcard root states, this does not work in practice due to overgeneration.

The datasets used in this work, additional description of our novel tagsets and CHIPMUNK can be found at http://cistern.cis.lmu.de/chipmunk.

Acknowledgments

We would like to thank Jason Eisner, Helmut Schmid, Özlem Çetinoglu and the anonymous reviewers for their comments. This material is based upon work supported by a Fulbright fellowship awarded to the first author by the German-American Fulbright Commission and the National Science Foundation under Grant No. 1423276. The second author is a recipient of the Google Europe Fellowship in Natural Language Processing, and this research is supported by this Google Fellowship. The fourth author was partially supported by Deutsche Forschungsgemeinschaft (grant SCHU 2246/10-1). This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 644402 (HimL) and the DFG grant Models of Morphosyntax for Statistical Machine Translation.

References

Muhammad Abdul-Mageed, Mona T. Diab, and Sandra Kübler. 2014. SAMAR: Subjectivity and sentiment analysis for Arabic social media. Computer Speech & Language.

Galen Andrew. 2006. A hybrid Markov/semi-Markovconditional random field for sequence segmentation.In Proceedings of EMNLP.

R Harald Baayen, Richard Piepenbrock, and Leon Gu-likers. 1993. The CELEX lexical database on CD-ROM. Technical report, Linguistic Data Consor-tium.

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-Xshared task on multilingual dependency parsing. InProceedings of CoNLL.

Grzegorz Chrupala, Georgiana Dinu, and Josef vanGenabith. 2008. Learning morphology with mor-fette. In Proceedings of LREC.

Michael Collins. 2002. Discriminative training meth-ods for hidden Markov models: Theory and exper-iments with perceptron algorithms. In Proceedingsof EMNLP.

Mathias Creutz and Krista Lagus. 2002. Unsuperviseddiscovery of morphemes. In Proceedings of SIG-MORPHON.


Mathias Creutz and Krista Lagus. 2004. Induction of asimple morphology for highly-inflecting languages.In Proceedings of SIGMORPHON.

Mathias Creutz, Teemu Hirsimäki, Mikko Kurimo,Antti Puurula, Janne Pylkkönen, Vesa Siivola, MattiVarjokallio, Ebru Arisoy, Murat Saraclar, and An-dreas Stolcke. 2007a. Analysis of morph-basedspeech recognition and the modeling of out-of-vocabulary words across languages. In Proceedingsof HLT-NAACL.

Mathias Creutz, Teemu Hirsimäki, Mikko Kurimo,Antti Puurula, Janne Pylkkönen, Vesa Siivola, MattiVarjokallio, Ebru Arisoy, Murat Saraclar, and An-dreas Stolcke. 2007b. Morph-based speech recog-nition and modeling of out-of-vocabulary wordsacross languages. TSLP.

Christopher Dyer, Smaranda Muresan, and PhilipResnik. 2008. Generalizing word lattice translation.In Proceedings of ACL.

Elif Eyigöz, Daniel Gildea, and Kemal Oflazer. 2013.Simultaneous word-morpheme alignment for statis-tical machine translation. In Proceedings of HLT-NAACL.

John Goldsmith. 2001. Unsupervised learning of themorphology of a natural language. ComputationalLinguistics.

Stig-Arne Grönroos, Sami Virpioja, Peter Smit, andMikko Kurimo. 2014. Morfessor FlatCat: AnHMM-based method for unsupervised and semi-supervised learning of morphology. In Proceedingsof COLING.

Zellig Harris. 1995. From phoneme to morpheme.Language.

Oskar Kohonen, Sami Virpioja, and Krista Lagus.2010a. Semi-supervised learning of concatenativemorphology. In Proceedings of SIGMORPHON.

Oskar Kohonen, Sami Virpioja, Laura Leppänen, andKrista Lagus. 2010b. Semi-supervised extensions toMorfessor baseline. In Proceedings of the MorphoChallenge Workshop.

Lucia D. Krisnawati and Klaus U. Schulz. 2013. Pla-giarism detection for Indonesian texts. In Proceed-ings of iiWAS.

Mikko Kurimo, Sami Virpioja, Ville Turunen, andKrista Lagus. 2010. Morpho challenge competition2005–2010: Evaluations and results. In Proceedingsof SIGMORPHON.

John Lafferty, Andrew McCallum, and FernandoPereira. 2001. Conditional random fields: Prob-abilistic models for segmenting and labeling se-quence data. In Proceedings of ICML.

Septina Dian Larasati, Vladislav Kubon, and DanielZeman. 2011. Indonesian morphology tool (mor-phind): Towards an Indonesian corpus. In Sys-tems and Frameworks for Computational Morphol-ogy. Springer.

Dong C Liu and Jorge Nocedal. 1989. On the limitedmemory BFGS method for large scale optimization.Mathematical Programming.

Erwin Marsi, Antal van den Bosch, and AbdelhadiSoudi. 2005. Memory-based morphological analy-sis generation and part-of-speech tagging of Arabic.In Proceedings of ACL Workshop: ComputationalApproaches to Semitic Languages.

Kevin P Murphy. 2002. Hidden semi-Markov models(hsmms). Technical report, Massachusetts Instituteof Technology.

Thomas Müller, Helmut Schmid, and Hinrich Schütze.2013. Efficient higher-order CRFs for morphologi-cal tagging. In Proceedings of EMNLP.

Karthik Narasimhan, Damianos Karakos, RichardSchwartz, Stavros Tsakalidis, and Regina Barzilay.2014. Morphological segmentation for keywordspotting. In Proceedings of EMNLP.

Hwee Tou Ng and Jin Kiat Low. 2004. Chinese part-of-speech tagging: One-at-a-time or all-at-once?word-based or character-based? In Proceedings ofEMNLP.

Hoifung Poon, Colin Cherry, and Kristina Toutanova.2009. Unsupervised morphological segmentationwith log-linear models. In Proceedings of NAACL.

Martin F Porter. 1980. An algorithm for suffix strip-ping. Program.

Teemu Ruokolainen, Oskar Kohonen, Sami Virpioja,and Mikko Kurimo. 2013. Supervised morpholog-ical segmentation in a low-resource learning settingusing conditional random fields. In Proceedings ofCoNLL.

Teemu Ruokolainen, Oskar Kohonen, Sami Virpioja,and Mikko Kurimo. 2014. Painless semi-supervisedmorphological segmentation using conditional ran-dom fields. Proceedings of EACL.

Sunita Sarawagi and William W Cohen. 2004. Semi-Markov conditional random fields for informationextraction. In Proceedings of NIPS.

Andrew Smith and Miles Osborne. 2006. Usinggazetteers in discriminative information extraction.In Proceedings of CoNLL.

Sebastian Spiegler, Andrew Van Der Spuy, and Peter AFlach. 2010. Ukwabelana: An open-source mor-phological Zulu corpus. In Proceedings of COL-ING.


Antal Van den Bosch and Walter Daelemans. 1999.Memory-based morphological analysis. In Proceed-ings of ACL. Association for Computational Lin-guistics.

Sami Virpioja, Oskar Kohonen, and Krista Lagus.2010. Unsupervised morpheme analysis with Al-lomorfessor. In Multilingual Information AccessEvaluation. Springer.

Martin J Wainwright and Michael I Jordan. 2008.Graphical models, exponential families, and varia-tional inference. Foundations and Trends in Ma-chine Learning.

Yue Zhang and Stephen Clark. 2008. Joint word seg-mentation and POS tagging using a single percep-tron. In Proceedings of ACL.


Proceedings of the 19th Conference on Computational Natural Language Learning, pages 175–184, Beijing, China, July 30-31, 2015. ©2015 Association for Computational Linguistics

Learning to Exploit Structured Resources for Lexical Inference

Vered Shwartz† Omer Levy† Ido Dagan† Jacob Goldberger§† Computer Science Department, Bar-Ilan University

§ Faculty of Engineering, Bar-Ilan [email protected] {omerlevy,dagan}@cs.biu.ac.il

[email protected]

Abstract

Massive knowledge resources, such as Wikidata, can provide valuable information for lexical inference, especially for proper-names. Prior resource-based approaches typically select the subset of each resource’s relations which are relevant for a particular given task. The selection process is done manually, limiting these approaches to smaller resources such as WordNet, which lacks coverage of proper-names and recent terminology. This paper presents a supervised framework for automatically selecting an optimized subset of resource relations for a given target inference task. Our approach enables the use of large-scale knowledge resources, thus providing a rich source of high-precision inferences over proper-names.1

1 Introduction

Recognizing lexical inference is an important component in semantic tasks. Various lexical-semantic relations, such as synonymy, class-membership, part-of, and causality may be used to infer the meaning of one word from another, in order to address lexical variability. For instance, a question answering system asked “which artist’s net worth is $450 million?” might retrieve the candidates Beyonce Knowles and Lloyd Blankfein, who are both worth $450 million. To correctly answer the question, the application needs to know that Beyonce is an artist, and that Lloyd Blankfein is not.

1Our code and data are available at:https://github.com/vered1986/LinKeR

Corpus-based methods are often employed torecognize lexical inferences, based on either co-occurrence patterns (Hearst, 1992; Turney, 2006)or distributional representations (Weeds and Weir,2003; Kotlerman et al., 2010). While earlier meth-ods were mostly unsupervised, recent trends intro-duced supervised methods for the task (Baroni etal., 2012; Turney and Mohammad, 2015; Rolleret al., 2014). In these settings, a targeted lexicalinference relation is implicitly defined by a train-ing set of term-pairs, which are annotated as posi-tive or negative examples of this relation. Severalsuch datasets have been created, each representinga somewhat different flavor of lexical inference.

While corpus-based methods usually enjoy high recall, their precision is often limited, hindering their applicability. An alternative common practice is to mine high-precision lexical inferences from structured resources, particularly WordNet (Fellbaum, 1998). Nevertheless, WordNet is an ontology of the English language, which, by definition, does not cover many proper-names (Beyonce → artist) and recent terminology (Facebook → social network). A potential solution may lie in rich and up-to-date structured knowledge resources such as Wikidata (Vrandecic, 2012), DBPedia (Auer et al., 2007), and Yago (Suchanek et al., 2007). In this paper, we investigate how these resources can be exploited for lexical inference over proper-names.

We begin by examining whether the common usage of WordNet for lexical inference can be extended to larger resources. Typically, a subset of WordNet relations is manually selected (e.g. all synonyms and hypernyms). By nature, each application captures a different aspect of lexical inference, and thus defines different relations as indicative of its particular flavor of lexical inference. For instance, the hypernym relation is indicative of the is a flavor of lexical inference (e.g. musician → artist), but does not indicate causality.

Resource Relation  Example
position held      David Petraeus → Director of CIA
performer          Sheldon Cooper → Jim Parsons
operating system   iPhone → iOS

Table 1: Examples of Wikidata relations that are indicative of lexical inference.

Since WordNet has a relatively simple schema,manually finding such an optimal subset is fea-sible. However, structured knowledge resources’schemas contain thousands of relations, dozens ofwhich may be indicative. Many of these are nottrivial to identify by hand, as shown in Table 1.A manual effort to construct a distinct subset foreach task is thus quite challenging, and an auto-mated method is required.

We present a principled supervised framework,which automates the selection process of resourcerelations, and optimizes this subset for a giventarget inference relation. This automation al-lows us to leverage large-scale resources, and ex-tract many high-precision inferences over proper-names, which are absent from WordNet. Finally,we show that our framework complements state-of-the-art corpus-based methods. Combining thetwo approaches can particularly benefit real-worldtasks in which proper-names are prominent.

2 Background

2.1 Common Use of WordNet for Inference

WordNet (Fellbaum, 1998) is widely used for identifying lexical inference. It is usually used in an unsupervised setting where the relations relevant for each specific inference task are manually selected a priori.

One approach looks for chains of these predefined relations (Harabagiu and Moldovan, 1998), e.g. dog → mammal using a chain of hypernyms: dog → canine → carnivore → placental mammal → mammal. Another approach is via WordNet Similarity (Pedersen et al., 2004), which takes two synsets and returns a numeric value that represents their similarity based on WordNet’s hierarchical hypernymy structure.

While there is a broad consensus that synonyms entail each other (elevator ↔ lift) and hyponyms entail their hypernyms (cat → animal), other relations, such as meronymy, are not agreed upon, and may vary depending on task and context (e.g. living in London → living in England, but leaving London ↛ leaving England). Overall, there is no principled way to select the subset of relevant relations, and a suitable subset is usually tailored to each dataset and task. This work addresses this issue by automatically learning the subset of relations relevant to the task.

Resource  #Entities   #Properties  Version
DBPedia   4,500,000   1,367        July 2014
Wikidata  6,000,000   1,200        July 2014
Yago      10,000,000  70           December 2014
WordNet   150,000     13           3.0

Table 2: Structured resources explored in this work.

2.2 Structured Knowledge Resources

While WordNet is quite extensive, it is hand-crafted by expert lexicographers, and thus cannotcompete in terms of scale with community-builtknowledge bases such as Wikidata (Vrandecic,2012), which connect millions of entities througha rich variety of structured relations (properties).

Using these resources for various NLP tasks hasbecome exceedingly popular (Wu and Weld, 2010;Rahman and Ng, 2011; Unger et al., 2012; Be-rant et al., 2013). Little attention, however, wasgiven to leveraging them for identifying lexical in-ference; the exception being Shnarch et al. (2009),who used structured data from Wikipedia for thispurpose.

In this paper, we experimented with such re-sources, in addition to WordNet. DBPedia (Aueret al., 2007) contains structured information fromWikipedia: info boxes, redirections, disambigua-tion links, etc. Wikidata (Vrandecic, 2012) con-tains facts edited by humans to support Wikipediaand other Wikimedia projects. Yago (Suchanek etal., 2007) is a semantic knowledge base derivedfrom Wikipedia, WordNet, and GeoNames.2

Table 2 compares the scale of the resources weused. The massive scale of the more recent re-sources and their rich schemas can potentially in-crease the coverage of current WordNet-based ap-proaches, yet make it difficult to manually selectan optimized subset of relations for a task. Ourmethod automatically learns such a subset, andprovides lexical inferences on entities that are ab-sent from WordNet, particularly proper-names.

2We also considered Freebase, but it required significantlylarger computational resources to work in our framework,which, at the time of writing, exceeded our capacity. §4.1discusses complexity.


“Beyonce” BeyonceKnowles musician artist “artist”

term to concept occupation subclass of concept to term

Figure 1: An excerpt of a resource graph (Wikidata) connecting “Beyonce” to “artist”. Resource graphs contain two types ofnodes: terms (ellipses) and concepts (rectangles).

3 Task Definition and Representation

We wish to leverage the information in structuredresources to identify whether a certain lexical-inference relation R holds between a pair of terms.Formally, we wish to classify whether a term-pair(x, y) satisfies the relation R. R is implicitly de-fined by a training set of (x, y) pairs, annotated aspositive or negative examples. We are also givena set of structured resources, which we will utilizeto classify (x, y).

Each resource can be naturally viewed as a directed graph G (Figure 1). There are two types of nodes in G: term (lemma) nodes and concept (synset) nodes. The edges in G are each labeled with a property (edge type), defining a wide range of semantic relations between concepts (e.g. occupation, subclass of). In addition, terms are mapped to the concepts they represent via term-concept edge types.
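A resource graph of this kind can be stored as a typed adjacency list. The toy snippet below encodes the Wikidata excerpt of Figure 1; the node and edge-type names follow the figure, but the representation itself is our illustration, not the authors' code.

```python
from collections import defaultdict

# graph[node] -> list of (edge_type, neighbour); quoted strings are term nodes,
# unquoted identifiers are concept nodes, mirroring Figure 1.
graph = defaultdict(list)

def add_edge(src, edge_type, dst):
    graph[src].append((edge_type, dst))

add_edge('"Beyonce"', "term to concept", "Beyonce Knowles")
add_edge("Beyonce Knowles", "occupation", "musician")
add_edge("musician", "subclass of", "artist")
add_edge("artist", "concept to term", '"artist"')

print(graph['"Beyonce"'])  # [('term to concept', 'Beyonce Knowles')]
```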

When using multiple resources, G is a disconnected graph composed of a subgraph per resource, without edges connecting nodes from different resources. One may consider connecting multiple resource graphs at the term nodes. However, this may cause sense-shifts, i.e. connect two distinct concepts (in different resources) through the same term. For example, the concept January 1st in Wikidata is connected to the concept fruit in WordNet through the polysemous term date. The alternative, aligning resources in the concept space, is not trivial. Some partial mappings exist (e.g. Yago-WordNet), which can be explored in future work.

4 Algorithmic Framework

We present an algorithmic framework for learningwhether a term-pair (x, y) satisfies a relation R,given an annotated set of term-pairs and a resourcegraph G. We first represent (x, y) as the set ofpaths connecting x and y in G (§4.1). We thenclassify each such path as indicative or not of R,and decide accordingly whether xRy (§4.2).

4.1 Representing Term-Pairs as Path-Sets

We represent each (x, y) pair as the set of paths that link x and y within each resource. We retain only the shortest paths (all paths x ⇝ y of minimal length) as they yielded better performance.

Resource graphs are densely connected, and thus have a huge branching factor b. We thus limited the maximum path length to ℓ = 8 and employed bidirectional search (Russell and Norvig, 2009, Ch.3) to find the shortest paths. This algorithm runs two simultaneous instances of breadth-first search (BFS), one from x and another from y, halting when they meet in the middle. It is much more efficient, having a complexity of O(b^{ℓ/2}) = O(b⁴) instead of BFS’s O(b^ℓ) = O(b⁸).

To further reduce complexity, we split the search to two phases: we first find all nodes along the shortest paths between x and y, and then reconstruct the actual paths. Searching for relevant nodes ignores edge types, inducing a simpler resource graph, which can be represented as a sparse adjacency matrix and manipulated efficiently with matrix operations (elaborated in appendix A). Once the search space is limited to relevant nodes alone, path-finding becomes trivial.
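The following is a minimal sketch of a bidirectional BFS of the kind described above: it grows layers around x and around y alternately until the two searches meet, and returns the shortest-path length. It operates on an untyped, symmetrized adjacency list (an assumption on our part), and omits the second phase (reconstructing the actual labeled paths) as well as the matrix-based optimization of appendix A.

```python
def shortest_path_length(adj, x, y, max_len=8):
    """Bidirectional BFS: length of the shortest x-y path in the (undirected view
    of) adjacency dict `adj`, or None. Growing both ends keeps the frontiers at
    roughly O(b^(len/2)) instead of O(b^len)."""
    if x == y:
        return 0
    dist = {x: {x: 0}, y: {y: 0}}          # distances from each endpoint
    frontier = {x: [x], y: [y]}
    depth = {x: 0, y: 0}
    while frontier[x] and frontier[y] and depth[x] + depth[y] < max_len:
        side = x if len(frontier[x]) <= len(frontier[y]) else y   # expand the cheaper side
        nxt = []
        for node in frontier[side]:
            for nb in adj.get(node, []):
                if nb not in dist[side]:
                    dist[side][nb] = depth[side] + 1
                    nxt.append(nb)
        frontier[side], depth[side] = nxt, depth[side] + 1
        meeting = dist[x].keys() & dist[y].keys()                 # nodes reached from both ends
        if meeting:
            return min(dist[x][v] + dist[y][v] for v in meeting)
    return None

# Toy usage on a tiny symmetrized graph:
adj = {"x": ["a"], "a": ["x", "b"], "b": ["a", "y"], "y": ["b"]}
print(shortest_path_length(adj, "x", "y"))  # 3
```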

4.2 Classification Framework

We consider edge types that typically connect between concepts in R to be “indicative”; for example, the occupation edge type is indicative of the is a relation, as in “Beyonce is a musician”. Our framework’s goal is to learn which edge types are indicative of a given relation R, and use that information to classify new (x, y) term-pairs.

Figure 2 presents the dependencies betweenedge types, paths, and term-pairs. As discussed inthe previous section, we represent each term-pairas a set of paths. In turn, we represent each path asa “bag of edges”, a vector with an entry for eachedge type.3 To model the edges’ “indicativeness”,we assign a parameter to each edge type, and learnthese parameters from the term-pair level supervi-sion provided by the training data.

In this work, we are not only interested in opti-mizing accuracy or F1, but in exploring the entirerecall-precision trade-off. Therefore, we optimize

3We add special markers to the first and last edges withineach path. This allows the algorithm to learn that applyingterm-to-concept and concept-to-term edge types in the middleof a path causes sense-shifts.


Figure 2: The dependencies between term-pairs (x → y), paths (pj), and edge types (ei).

Therefore, we optimize the Fβ objective, where β² balances the recall-precision trade-off.4 In particular, we expect structured resources to facilitate high-precision inferences, and are thus more interested in lower values of β², which emphasize precision over recall.

4.2.1 Weighted Edge Model

A typical neural network approach is to assign a weight wi to each edge type ei, where more indicative edge types should have higher values of wi. The indicativeness of a path p is modeled using logistic regression: p ≜ σ(w · φ), where w is the weight vector and φ is the path's "bag of edges" representation, i.e. a feature vector of each edge type's frequency in the path.

The probability of a term-pair being positive can be determined using either the sum of all path scores or the score of its most indicative path (max-pooling). We trained both variants with back-propagation (Rumelhart et al., 1986) and gradient ascent. In particular, we optimized Fβ using a variant of Jansche's (2005) derivation of Fβ-optimized logistic regression (see supplementary material5 for the full derivation).

This model can theoretically quantify how indicative each edge type is of R. Specifically, it can differentiate weakly indicative edges (e.g. meronyms) from those that contradict R (e.g. antonyms). However, on our datasets, this model yielded sub-optimal results (see §6.1), and therefore serves as a baseline to the binary model presented in the following section.
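The following is a minimal sketch of the weighted edge model with max-pooling, written in NumPy; the plain log-likelihood gradient-ascent loop stands in for the Fβ-optimized training described above, and the function names (pair_score, train) are ours rather than the authors'.

# Sketch of the weighted edge model (4.2.1): logistic regression over
# "bag of edges" count vectors, with max-pooling over a term-pair's paths.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pair_score(paths, w):
    """paths: (n_paths, n_edge_types) count matrix; returns the best path score."""
    return sigmoid(paths @ w).max()

def train(pair_paths, labels, n_edge_types, lr=0.1, epochs=100):
    """pair_paths: one count matrix per term-pair; labels in {0, 1}."""
    w = np.zeros(n_edge_types)
    for _ in range(epochs):
        for paths, y in zip(pair_paths, labels):
            scores = sigmoid(paths @ w)
            best = int(np.argmax(scores))
            # gradient of the log-likelihood of the best-scoring path (max-pooling)
            w += lr * (y - scores[best]) * paths[best]
    return w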

4.2.2 Binary Edge Model

Preliminary experiments suggested that in most datasets, each edge type is either indicative or non-indicative of the target relation R. We therefore developed a binary model, which defines a global set of edge types that are indicative of R: a whitelist.

4 Fβ = (1 + β²) · precision · recall / (β² · precision + recall)

5 http://u.cs.biu.ac.il/%7enlp/wp-content/uploads/LinKeR-sup.pdf

Classification  We represent each path p as a binary "bag of edges" φ, i.e. the set of edge types that were applied in p. Given a term-pair (x, y) represented as a path-set paths(x, y), and a whitelist w, the model classifies (x, y) as positive if:

∃φ ∈ paths(x, y) : φ ⊆ w        (1)

In other words:

1. A path is classified as indicative if all its edge types are whitelisted.

2. A term-pair is classified as positive if at least one of its paths is indicative.6

The first design choice essentially assumes that R is a transitive relation. This is usually the case in most inference relations (e.g. hypernymy, causality). In addition, notice that the second modeling assumption is unidirectional; in some cases xRy, yet an indicative path between them does not exist. This can happen, for example, if the relation between them is not covered by the resource, e.g. causality in WordNet.
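A small sketch of the classification rule in Equation (1), assuming paths are given as sets of edge-type names; the example whitelist and edge types are illustrative only.

# Sketch of the binary edge model: a path is indicative iff all of its edge
# types are whitelisted; a term-pair is positive iff at least one path is.
def is_positive(paths, whitelist):
    """paths: iterable of sets of edge types; whitelist: set of edge types."""
    return any(set(path) <= whitelist for path in paths)

whitelist = {"term-to-concept", "synonym", "hypernym", "concept-to-term"}
paths = [{"term-to-concept", "hypernym", "concept-to-term"},
         {"term-to-concept", "antonym", "concept-to-term"}]
print(is_positive(paths, whitelist))    # True: the first path is indicative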

Training  Learning the optimal whitelist over a training set can be cast as a subset selection problem: given a set of possible edge types E = {e1, ..., en} and a utility function u : 2^E → ℝ, find the subset (whitelist) w ⊆ E that maximizes the utility, i.e. w* = arg max_w u(w). In our case, the utility u is the Fβ score over the training set.

Structured knowledge resources contain hundreds of different edge types, making E very large, and an exhaustive search over its powerset infeasible. The standard approach to this class of subset selection problems is to apply local search algorithms, which find an approximation of the optimal subset. We tried several local search algorithms, and found that genetic search (Russell and Norvig, 2009, Ch. 4) performed well. In general, genetic search is claimed to be a preferred strategy for subset selection (Yang and Honavar, 1998).

In our application of genetic search, each individual (candidate solution) is a whitelist, represented by a bit vector with a bit for each edge type. We defined the fitness function of a whitelist w according to the Fβ score of w over the training set.
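A hedged sketch of how such a genetic search over whitelists could look; the population size, mutation rate, single-point crossover and the regularization-style size penalty are our assumptions, not the authors' exact configuration.

# Sketch of genetic search over whitelists (bit vectors), with fitness equal
# to the F_beta score on the training set minus a penalty on whitelist size.
import random

def fitness(bits, edge_types, train_pairs, beta2, reg):
    """train_pairs: list of (paths, gold) where paths is a list of edge-type sets."""
    whitelist = {e for e, b in zip(edge_types, bits) if b}
    tp = fp = fn = 0
    for paths, gold in train_pairs:
        pred = any(p <= whitelist for p in paths)
        if pred and gold:
            tp += 1
        elif pred:
            fp += 1
        elif gold:
            fn += 1
    if tp == 0:
        return -reg * sum(bits)
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    f_beta = (1 + beta2) * prec * rec / (beta2 * prec + rec)
    return f_beta - reg * sum(bits)    # penalize large whitelists

def genetic_search(edge_types, train_pairs, beta2=0.25, reg=1e-3,
                   pop_size=50, generations=100, p_mut=0.02):
    n = len(edge_types)
    pop = [[random.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda b: fitness(b, edge_types, train_pairs, beta2, reg),
                 reverse=True)
        parents = pop[:pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n)              # single-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if random.random() < p_mut else g for g in child]
            children.append(child)
        pop = parents + children
    best = max(pop, key=lambda b: fitness(b, edge_types, train_pairs, beta2, reg))
    return {e for e, bit in zip(edge_types, best) if bit}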

6 As a corollary, if the model classifies (x, y) as negative, then every path between them is non-indicative.


Dataset         #Instances  #Positive  #Negative
kotlerman2010   2,940       880        2,060
turney2014      1,692       920        772
levy2014        12,602      945        11,657
proper2015      1,500       750        750

Table 3: Datasets evaluated in this work.

We also applied L2 regularization to reduce the fitness of large whitelists.

The binary edge model works well in practice, successfully replicating the common practice of manually selected relations from WordNet (see §6.1). In addition, the model outputs a human-interpretable set of indicative edges.

Although the weighted model's hypothesis space subsumes the binary model's, the binary model performed better on our datasets. We conjecture that this stems from the limited number of training instances, which prevents a more general model from converging to an optimal solution.

5 Datasets

We used 3 existing common-noun datasets and one new proper-name dataset. Each dataset consists of annotated (x, y) term-pairs, where both x and y are noun phrases. Since each dataset was created in a slightly different manner, the underlying semantic relation R varies as well.

5.1 Existing Datasets

kotlerman2010 (Kotlerman et al., 2010) is a manually annotated lexical entailment dataset of distributionally similar nouns. turney2014 (Turney and Mohammad, 2015) is based on a crowdsourced dataset of semantic relations, from which we removed non-nouns and lemmatized plurals. levy2014 (Levy et al., 2014) was generated from manually annotated entailment graphs of subject-verb-object tuples. Table 3 provides metadata on each dataset.

Two additional datasets were created using WordNet (Baroni and Lenci, 2011; Baroni et al., 2012), whose definition of R can be trivially captured by a resource-based approach using WordNet. Hence, they are omitted from our evaluation.

5.2 A New Proper-Name Dataset

An important linguistic component that is missing from these lexical-inference datasets is proper-names. We conjecture that much of the added value in utilizing structured resources is the ability to cover terms such as celebrities (Lady Gaga) and recent terminology (social networks) that do not appear in WordNet. We thus created a new dataset of (x, y) pairs in which x is a proper-name, y is a common noun, and R is the is a relation. For instance, (Lady Gaga, singer) is true, but (Lady Gaga, film) is false.

To construct the dataset, we sampled 70 articles in 9 different topics from a corpus of recent events (online magazines). As candidate (x, y) pairs, we extracted 24,000 pairs of noun phrases x and y that belonged to the same paragraph in the original text, selecting those in which x is a proper-name. These pairs were manually annotated by graduate students, who were instructed to use their world knowledge and the original text for disambiguation (e.g. England → team in the context of football). The agreement on a subset of 4,500 pairs was κ = 0.954.

After annotation, we had roughly 800 positive and 23,000 negative pairs. To balance the dataset, we sampled negative examples according to the frequency of y in positive pairs, creating "harder" negative examples, such as (Sherlock, lady) and (Kylie Minogue, vice president).

6 Results

We first validate our framework by checking whether it can automatically replicate the common manual usage of WordNet. We then evaluate it on the proper-name dataset using additional resources. Finally, we compare our method to state-of-the-art distributional methods.

Experimental Setup  While F1 is a standard measure of performance, it captures only one point on the recall-precision curve. Instead, we present the entire curve, while expecting the contribution of structured resources to be in the high-precision region. To create these curves, we optimized our method and the baselines using Fβ with 40 values of β² ∈ (0, 2).

We randomly split each dataset into 70% train, 25% test and 5% validation.7 We applied L2 regularization to our method and the baselines, tuning the regularization parameter on the validation set.

6.1 Performance on WordNet

We examine whether our algorithm can replicate the common use of WordNet (§2.1), by manually constructing 4 whitelists based on the literature (see Table 4), and evaluating their performance using the classification methods in §4.2.

7 Since our methods do not use lexical features, we did not use lexical splits as in Levy et al. (2015).


Figure 3: Recall-precision curve of each dataset with WordNet as the only resource. Each point in the graph stands for the performance at a certain value of β. Notice that in some of the graphs, different β values yield the same performance, causing fewer points to be displayed.

In addition, we compare our method to Resnik's (1995) WordNet similarity, which scores each pair of terms based on their lowest common hypernym. This score was used as a single feature in Fβ-optimized logistic regression to create a classifier.

Figure 4: Recall-precision curve for proper2015.

Name    Edge Types
basic   {synonym, hypernym, instance hypernym}
+holo   basic ∪ {holonym}
+mero   basic ∪ {meronym}
+hypo   basic ∪ {hyponym}

Table 4: The manual whitelists commonly used in WordNet.

Figure 3 compares our algorithm to WordNet's baselines, showing that our binary model always replicates the best-performing manually-constructed whitelists, for certain values of β². Synonyms and hypernyms are often selected, and additional edges are added to match the semantic flavor of each particular dataset. In turney2014, for example, where meronyms are common, our binary model learns that they are indicative by including meronymy in its whitelist. In levy2014, however, where meronyms are less indicative, the model does not select them.

We also observe that, in most cases, our algorithm outperforms Resnik's similarity. In addition, the weighted model does not perform as well as the binary model, as discussed in §4.2. We therefore focus our presentation on the binary model.

6.2 Lexical Inference over Proper-Names

We evaluated our model on the new proper-name dataset proper2015 described in §5.2. This time, we incorporated all the resources described in §2.2 (including WordNet) into our framework, and compared the performance to that of using WordNet alone. Indeed, our algorithm is able to exploit the information in the additional resources and greatly increase performance, particularly recall, on this dataset (Figure 4).8

8 We also evaluated our algorithm on the common-noun datasets with all resources, but adding resources did not significantly improve performance.


Figure 5: Recall-precision curve of each dataset using: (1) Supervised word2vec (2) Our binary model.

The binary model yields 97% precision at 29% recall, at the top of the "precision cliff". The whitelist learnt at this point contains 44 edge types, mainly from Wikidata and Yago. Even though the is a relation implicitly defined in proper2015 is described using many different edge types, our binary model still manages to learn which of the over 2,500 edge types are indicative. Table 5 shows some of the learnt edge types (see the supplementary material for the complete list).

The performance boost in proper2015 demonstrates that community-built resources have much added value when considering proper-names. As expected, many proper-names do not appear in WordNet (Doctor Who). That said, even when both terms appear in WordNet, they often lack important properties covered by other resources (Louisa May Alcott is a woman).

6.3 Comparison to Corpus-based Methods

Lexical inference has been thoroughly explored in distributional semantics, with recent supervised methods (Baroni et al., 2012; Turney and Mohammad, 2015) showing promising results.

Edge Type                Example
occupation               Daniel Radcliffe → actor
sex or gender            Louisa May Alcott → woman
instance of              Doctor Who → series
acted in                 Michael Keaton → Beetlejuice
genre                    Touch → drama
position played on team  Jason Collins → center

Table 5: An excerpt of the whitelist learnt for proper2015 by the binary model, with accompanying true positives that do not have an indicative path in WordNet.

While these methods leverage huge corpora to increase coverage, they often introduce noise that affects their precision. Structured resources, on the other hand, are precision-oriented. We therefore expect our approach to complement distributional methods in high-precision scenarios.

To represent term-pairs with distributional features, we downloaded the pre-trained word2vec embeddings.9 These vectors were trained over a huge corpus (100 billion words) using a state-of-the-art embedding algorithm (Mikolov et al., 2013).

9http://code.google.com/p/word2vec/


Since each vector represents a single term (either x or y), we used 3 state-of-the-art methods to construct a feature vector for each term-pair: concatenation x ⊕ y (Baroni et al., 2012), difference y − x (Roller et al., 2014; Fu et al., 2014; Weeds et al., 2014), and similarity x · y. We then used Fβ-optimized logistic regression to train a classifier. Figure 5 compares our methods to concatenation, which was the best-performing corpus-based method.10
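A sketch of these three pair representations, assuming the pre-trained vectors are loaded with gensim; the file name and the use of gensim are our assumptions, since the paper does not specify its tooling.

# Sketch of the term-pair feature constructions compared in this section.
import numpy as np
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)   # assumed file name

def pair_features(x, y, mode="concat"):
    vx, vy = vectors[x], vectors[y]
    if mode == "concat":   # concatenation, as in Baroni et al. (2012)
        return np.concatenate([vx, vy])
    if mode == "diff":     # difference, as in Roller et al. (2014) and others
        return vy - vx
    if mode == "sim":      # a single dot-product similarity feature
        return np.array([vx @ vy])
    raise ValueError(mode)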

In turney2014 and proper2015, the embeddings retain over 80% precision while boasting higher recall than our method's. In turney2014, this is often a result of the more associative relations prominent in the dataset (football → playbook), which are seldom expressed in structured resources. In proper2015, the difference in recall seems to stem from missing terminology (Twitter → social network). However, the corpus-based method's precision does not exceed the low 80's, while our binary algorithm yields 93% precision at 27% recall on turney2014 and 97% at 29% on proper2015.

In levy2014, there is an overwhelming advantage to our resource-based method over the corpus-based method. This dataset contains healthcare terms and might require a domain-specific corpus to train the embeddings. Having said that, many of its examples are of an ontological nature (drug x treats disease y), which may be more suited to our resource-based approach, regardless of domain.

7 Error Analysis

Since resource-based methods are precision-oriented, we analyzed our binary model by selecting the setting with the highest attainable recall that maintains high precision. This point is often at the top of a "precision cliff" in Figures 3 and 4. These settings are presented in Table 6.

The high-precision settings we chose resulted in few false positives, most of which are caused by annotation errors or resource errors. Naturally, regions of higher recall and lower precision will yield more false positives and fewer false negatives. We thus focus the rest of our discussion on false negatives (Table 7).

10 Note that the corpus-based method benefits from lexical memorization (Levy et al., 2015), overfitting to the lexical terms in the training set, while our resource-based method does not. This means that Figure 5 paints a relatively optimistic picture of the embeddings' actual performance.

Dataset        β²     Whitelist                        Prec.  Rec.
kotlerman2010  0.05   basic                            83%    9%
turney2014     0.05   +mero                            93%    27%
levy2014       10⁻⁵   basic                            87%    37%
proper2015     0.3    44 edge types from all resources 97%    29%
                      (see supplementary material)

Table 6: The error analysis setting of each dataset.

Error Type           kotlerman2010  levy2014  turney2014  proper2015
Not Covered          2%             12%       4%          13%
No Indicative Paths  35%            48%       73%         75%
Whitelist Error      6%             3%        5%          8%
Resource Error       15%            11%       7%          0%
Annotation Error     40%            23%       7%          1%
Other                2%             3%        4%          3%

Table 7: Analysis of false negatives in each dataset. We observed the following errors: (1) one of the terms is out-of-vocabulary; (2) all paths are not indicative; (3) an indicative path exists, but is discarded by the whitelist; (4) the resource describes an inaccurate relation between the terms; (5) the term-pair was incorrectly annotated as positive.

While structured resources cover most terms, the majority of false negatives stem from the lack of indicative paths between them. Many important relations are not explicitly covered by the resources, such as noun-quality (saint → holiness), which are abundant in turney2014, or causality (germ → infection), which appear in levy2014. These examples are occasionally captured by other (more specific) relations, and tend to be domain-specific.

In kotlerman2010, we found that many false negatives are caused by annotation errors in this dataset. Pairs are often annotated as positive based on associative similarity (e.g. transport → environment, financing → management), making it difficult to even manually construct a coherent whitelist for this dataset. This may explain the poor performance of our method and other baselines on this dataset.

8 Conclusion and Future Work

In this paper, we presented a supervised framework for utilizing structured resources to recognize lexical inference. We demonstrated that our framework replicates the common practice of WordNet and can increase the coverage of proper-names by exploiting larger structured resources. Compared to the prior practice of manually identifying useful relations in structured resources, our contribution offers a principled learning approach for automating and optimizing this common need.

While our method enjoys high precision, its recall is limited by the resources' coverage.


In future work, combining our method with high-recall corpus-based methods may have synergistic results. Another direction for increasing recall is to use cross-resource mappings to allow cross-resource paths (connected at the concept level).

Finally, our method can be extended to become context-sensitive, that is, deciding whether the lexical inference holds in a given context. This may be done by applying a resource-based WSD approach similar to that of Brody et al. (2006) and Agirre et al. (2014), detecting the concept node that matches the term's sense in the given context.

Acknowledgments

This work was supported by an Intel ICRI-CI grant, the Google Research Award Program and the German Research Foundation via the German-Israeli Project Cooperation (DIP, grant DA 1600/1-1).

References

Eneko Agirre, Oier Lopez de Lacalle, and Aitor Soroa. 2014. Random walks for knowledge-based word sense disambiguation. Computational Linguistics, 40(1):57–84.

Soren Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. Springer.

Marco Baroni and Alessandro Lenci. 2011. How we BLESSed distributional semantic evaluation. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, pages 1–10, Edinburgh, UK, July. Association for Computational Linguistics.

Marco Baroni, Raffaella Bernardi, Ngoc-Quynh Do, and Chung-chieh Shan. 2012. Entailment above the word level in distributional semantics. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 23–32, Avignon, France, April. Association for Computational Linguistics.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544, Seattle, Washington, USA, October. Association for Computational Linguistics.

Samuel Brody, Roberto Navigli, and Mirella Lapata. 2006. Ensemble methods for unsupervised WSD. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 97–104. Association for Computational Linguistics.

Christiane Fellbaum. 1998. WordNet. Wiley Online Library.

Ruiji Fu, Jiang Guo, Bing Qin, Wanxiang Che, Haifeng Wang, and Ting Liu. 2014. Learning semantic hierarchies via word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1199–1209, Baltimore, Maryland, June. Association for Computational Linguistics.

Sanda Harabagiu and Dan Moldovan. 1998. Knowledge processing on an extended WordNet. WordNet: An Electronic Lexical Database, 305:381–405.

Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In COLING 1992 Volume 2: The 15th International Conference on Computational Linguistics, pages 529–545, Nantes, France.

Martin Jansche. 2005. Maximum expected F-measure training of logistic regression models. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 692–699, Vancouver, British Columbia, Canada, October. Association for Computational Linguistics.

Lili Kotlerman, Ido Dagan, Idan Szpektor, and Maayan Zhitomirsky-Geffet. 2010. Directional distributional similarity for lexical inference. Natural Language Engineering, 16(04):359–389.

Omer Levy, Ido Dagan, and Jacob Goldberger. 2014. Focused entailment graphs for open IE propositions. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 87–97, Ann Arbor, Michigan, June. Association for Computational Linguistics.

Omer Levy, Steffen Remus, Chris Biemann, and Ido Dagan. 2015. Do supervised distributional methods really learn lexical inference relations? In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 970–976, Denver, Colorado, May–June. Association for Computational Linguistics.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Ted Pedersen, Siddharth Patwardhan, and Jason Michelizzi. 2004. WordNet::Similarity - measuring the relatedness of concepts. In Daniel Marcu, Susan Dumais, and Salim Roukos, editors, HLT-NAACL 2004: Demonstration Papers, pages 38–41, Boston, Massachusetts, USA, May 2 - May 7. Association for Computational Linguistics.


Altaf Rahman and Vincent Ng. 2011. Coreference resolution with world knowledge. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 814–824, Portland, Oregon, USA, June. Association for Computational Linguistics.

Philip Resnik. 1995. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 1, IJCAI'95, pages 448–453. Morgan Kaufmann Publishers Inc.

Stephen Roller, Katrin Erk, and Gemma Boleda. 2014. Inclusive yet selective: Supervised distributional hypernymy detection. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1025–1036, Dublin, Ireland, August. Dublin City University and Association for Computational Linguistics.

D. E. Rumelhart, G. E. Hinton, and R. J. Williams. 1986. Learning representations by back-propagating errors. Nature, pages 533–536.

Stuart Russell and Peter Norvig. 2009. Artificial Intelligence: A Modern Approach. Prentice Hall.

Eyal Shnarch, Libby Barak, and Ido Dagan. 2009. Extracting lexical reference rules from Wikipedia. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 450–458, Suntec, Singapore, August. Association for Computational Linguistics.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: a core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, pages 697–706. ACM.

Peter D. Turney and Saif M. Mohammad. 2015. Experiments with three approaches to recognizing lexical entailment. Natural Language Engineering, 21(03):437–476.

Peter D. Turney. 2006. Similarity of semantic relations. Computational Linguistics, 32(3):379–416.

Christina Unger, Lorenz Buhmann, Jens Lehmann, Axel-Cyrille Ngonga Ngomo, Daniel Gerber, and Philipp Cimiano. 2012. Template-based question answering over RDF data. In Proceedings of the 21st International Conference on World Wide Web, pages 639–648. ACM.

Denny Vrandecic. 2012. Wikidata: A new platform for collaborative data collection. In Proceedings of the 21st International Conference Companion on World Wide Web, pages 1063–1064. ACM.

Julie Weeds and David Weir. 2003. A general framework for distributional similarity. In Michael Collins and Mark Steedman, editors, Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 81–88.

Julie Weeds, Daoud Clarke, Jeremy Reffin, David Weir, and Bill Keller. 2014. Learning to distinguish hypernyms and co-hyponyms. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 2249–2259, Dublin, Ireland, August. Dublin City University and Association for Computational Linguistics.

Fei Wu and Daniel S. Weld. 2010. Open information extraction using Wikipedia. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 118–127, Uppsala, Sweden, July. Association for Computational Linguistics.

Jihoon Yang and Vasant Honavar. 1998. Feature subset selection using a genetic algorithm. In Feature Extraction, Construction and Selection, pages 117–136. Springer.

Appendix A Efficient Path-Finding

We split the search into two phases: we first find all nodes along the shortest paths between x and y, and then reconstruct the actual paths. The first phase ignores edge types, inducing a simpler resource graph, which we represent as a sparse adjacency matrix and manipulate efficiently with matrix operations (Algorithm 1). Once the search space is limited to relevant nodes only, the second phase becomes trivial.

Algorithm 1 Find Relevant Nodes
 1: function NODESINPATH(n_x, n_y, len)
 2:   if len == 1 then
 3:     return n_x ∪ n_y
 4:   for 0 < k ≤ len do
 5:     if k is odd then
 6:       n_x = n_x · A
 7:     else
 8:       n_y = n_y · A^T
 9:     if n_x · n_y > 0 then
10:       n_xy = n_x ∩ n_y
11:       n_forward = nodesInPath(n_x, n_xy, ⌈k/2⌉)
12:       n_backward = nodesInPath(n_xy, n_y, ⌊k/2⌋)
13:       return n_forward ∪ n_backward
14:   return the zero vector

The algorithm finds all nodes in the paths between x and y subject to the maximum length (len). A is the resource adjacency matrix and n_x, n_y are one-hot vectors of x, y. At each iteration, we either make a forward (line 6) or a backward (line 8) step. If the forward and backward search meet (line 9), we recursively call the algorithm for each side (lines 11-12), and merge their results (line 13). The stop conditions are len = 0, returning an empty set when no path was found, and len = 1, merging both sides when they are connected by single edges.
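For illustration, here is a hedged Python sketch of the same relevant-node computation, using plain BFS distances from both endpoints instead of the sparse-matrix products of Algorithm 1: a node lies on some shortest directed path from x to y iff its distance from x plus its distance to y equals the shortest-path length. The dict-of-neighbor-sets graph representation (with a separate reversed graph) is an assumption.

# Sketch of the relevant-node search behind Algorithm 1, via BFS distances.
from collections import deque

def bfs_distances(adjacency, source, max_len):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == max_len:
            continue
        for v in adjacency.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def nodes_on_shortest_paths(graph, rgraph, x, y, max_len=8):
    dx = bfs_distances(graph, x, max_len)    # distance from x, following edges
    dy = bfs_distances(rgraph, y, max_len)   # distance to y, against edges
    common = set(dx) & set(dy)
    if not common:
        return set()
    shortest = min(dx[v] + dy[v] for v in common)
    if shortest > max_len:
        return set()
    return {v for v in common if dx[v] + dy[v] == shortest}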


Proceedings of the 19th Conference on Computational Natural Language Learning, pages 185–193, Beijing, China, July 30-31, 2015. ©2015 Association for Computational Linguistics

Linking Entities Across Images and Text

Rebecka Weegar
DSV, Stockholm University

Kalle Astrom
Dept. of Mathematics, Lund University

Pierre Nugues
Dept. of Computer Science, Lund University

[email protected], [email protected], [email protected]

Abstract

This paper describes a set of methods to link entities across images and text. As a corpus, we used a data set of images, where each image is commented by a short caption and where the regions in the images are manually segmented and labeled with a category. We extracted the entity mentions from the captions and we computed a semantic similarity between the mentions and the region labels. We also measured the statistical associations between these mentions and the labels and we combined them with the semantic similarity to produce mappings in the form of pairs consisting of a region label and a caption entity. In a second step, we used the syntactic relationships between the mentions and the spatial relationships between the regions to rerank the lists of candidate mappings. To evaluate our methods, we annotated a test set of 200 images, where we manually linked the image regions to their corresponding mentions in the captions. Eventually, we could match objects in pictures to their correct mentions for nearly 89 percent of the segments, when such a matching exists.

1 Introduction

Linking an object in an image to a mention of that object in an accompanying text is a challenging task, which we can imagine useful in a number of settings. It could, for instance, improve image retrieval by complementing the geometric relationships extracted from the images with textual descriptions from the text. A successful mapping would also make it possible to translate knowledge and information across image and text.

In this paper, we describe methods to link mentions of entities in captions to labeled image segments and we investigate how the syntactic structure of a caption can be used to better understand the contents of an image. We do not address the closely related task of object recognition in the images. This latter task can be seen as a complement to entity linking across text and images. See Russakovsky et al. (2015) for a description of progress and results to date in object detection and classification in images.

2 An Example

Figure 1 shows an example of an image from the Segmented and Annotated IAPR TC-12 data set (Escalantea et al., 2010). It has four regions labeled cloud, grass, hill, and river, and the caption:

a flat landscape with a dry meadow in the foreground, a lagoon behind it and many clouds in the sky

containing mentions of five entities that we identify with the words meadow, landscape, lagoon, cloud, and sky. A correct association of the mentions in the caption to the image regions would map clouds to the region labeled cloud, meadow to grass, and lagoon to river.

Figure 1: Image from the Segmented and Annotated IAPR TC-12 data set with the caption: a flat landscape with a dry meadow in the foreground, a lagoon behind it and many clouds in the sky


This image, together with its caption, illustrates a couple of issues: The objects or regions labelled or visible in an image are not always mentioned in the caption, and for most of the images in the data set, more entities are mentioned in the captions than there are regions in the images. In addition, for the same entity, the words used to mention it are usually different from the words used as labels (the categories), as in the case of grass and meadow.

3 Previous Work

Related work includes the automatic generation of image captions that describe relevant objects in an image and their relationships. Kulkarni et al. (2011) assign each detected image object a visual attribute and a spatial relationship to the other objects in the image. The spatial relationships are translated into selected prepositions in the resulting captions. Elliott and Keller (2013) used manually segmented and labeled images and introduced visual dependency representations (VDRs) that describe spatial relationships between the image objects. The captions are generated using templates. Both Kulkarni et al. (2011) and Elliott and Keller (2013) used the BLEU score and human evaluators to assess the grammaticality of the generated captions and how well they describe the image.

Although much work has been done to link complete images to a whole text, there are only a few papers on the association of elements inside a text and an image. Naim et al. (2014) analyzed parallel sets of videos and written texts, where the videos show laboratory experiments. Written instructions are used to describe how to conduct these experiments. The paper describes models for matching objects detected in the video with mentions of those objects in the instructions. The authors mainly focus on objects that get touched by a hand in the video. For manually annotated videos, Naim et al. (2014) could match objects to nouns nearly 50% of the time.

Karpathy et al. (2014) proposed a system for retrieving related images and sentences. They used neural networks and they show that the results are improved if image objects and sentence fragments are included in the model. Sentence fragments are extracted from dependency graphs, where each edge in the graphs corresponds to a fragment.

4 Entity Pairs

4.1 Data Set

We used the Segmented and Annotated IAPR TC-12 Benchmark data set (Escalantea et al., 2010) that consists of about 20,000 photographs with a wide variety of themes. Each image has a short caption that describes its content, most often consisting of one to three sentences separated by semicolons. The images are manually segmented into regions with, on average, about 5 segments in each image.

Each region is labelled with one out of 275 predefined image labels. The labels are arranged in a hierarchy, where all the nodes are available as labels and where object is the top node. The labels humans, animals, man-made, landscape/nature, food, and other form the next level.

4.2 Entities and Mentions

An image caption describes a set of entities, the caption entities CE, where each entity CEi is referred to by a set of mentions M. To detect them, we applied the Stanford CoreNLP pipeline (Toutanova et al., 2003) that consists of a part-of-speech tagger, lemmatizer, named entity recognizer (Finkel et al., 2005), dependency parser, and coreference solver. We considered each noun in a caption as an entity candidate. If an entity CEi had only one mention Mj, we identified it by the head noun of its mention. We represented the entities mentioned more than once by the head noun of their most representative mention. We applied the entity extraction to all the captions in the data set, and we found 3,742 different nouns or noun compounds to represent the entities.

In addition to the caption entities, each image has a set of labeled segments (or regions) corresponding to the image entities, IE. The Cartesian product of these two sets results in pairs P generating all the possible mappings of caption entities to image labels. We considered a pair (IEi, CEj) a correct mapping, if the image label IEi and the caption entity CEj referred to the same entity. We represented a pair by the region label and the identifier of the caption entity, i.e. the head noun of the entity mention. In Fig. 1, the correct pairs are (grass, meadow), (river, lagoon), and (cloud, clouds).


4.3 Building a Test Set

As the Segmented and Annotated IAPR TC-12 data set does not provide information on links between the image regions and the mentions, we annotated a set of 200 randomly selected images from the data set to evaluate the automatic linking accuracy. We assigned the image regions to entities in the captions and we excluded these images from the training set. The annotation does not always produce a 1:1 mapping of caption entities to regions. In many cases, objects are grouped or divided into parts differently in the captions and in the segmentation. We created a set of guidelines to handle these mappings in a consistent way. Table 1 shows the sizes of the different image sets and the fraction of image regions that have a corresponding entity mention in the caption.

Set         Files   Regions  Mappings  %
Data set    19,176  –        –         –
Train. set  18,976  –        –         –
Test set    200     928      730       78.7

Table 1: The sizes of the different image sets.

5 Ranking Entity Pairs

To identify the links between the regions of an image and the entity identifiers in its caption, we first generated all the possible pairs. We then ranked these pairs using a semantic distance derived from WordNet (Miller, 1995), statistical association metrics, and finally, a combination of both techniques.

5.1 Semantic Distance

The image labels are generic English words that are semantically similar to those used in the captions. In Fig. 1, cloud and clouds are used both as label and in the caption, but the region labeled grass is described as a meadow and the region labeled river, as a lagoon. We used the WordNet Similarity for Java library (WS4J) (Shima, 2014) to compute the semantic similarity of the region labels and the entity identifiers. WS4J comes with a number of metrics that approximate similarity as distances between WordNet synsets: PATH, WUP (Wu and Palmer, 1994), RES (Resnik, 1995), JCN (Jiang and Conrath, 1997), HSO (Hirst and St-Onge, 1998), LIN (Lin, 1998), LCH (Leacock and Chodorow, 1998), and LESK (Banerjee and Banerjee, 2002).

We manually lemmatized and simplified the image labels and the entity mentions so that they are compatible with WordNet entries. It resulted in a smaller set of labels: 250 instead of the 275 original labels. We also simplified the named entities from the captions. When a person or location was not present in WordNet, we used its named entity type as identifier. In some cases, it was not possible to find an entity identifier in WordNet, mostly due to misspellings in the caption, like buldings, or buidling, or because of POS-tagging errors. We chose to identify these entities with the word entity. The normalization reduced the 3,742 entity identifiers to 2,216 unique ones.

Finally, we computed a 250 × 2,216 matrix containing the similarity scores for each (image label, entity identifier) pair for each of the WS4J semantic similarity metrics.
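A sketch of this similarity-matrix computation, using NLTK's WordNet interface as a stand-in for WS4J; only the PATH and WUP metrics are shown, and taking the maximum similarity over all noun synset pairs is our assumption.

# Sketch of the (image label x entity identifier) WordNet similarity matrix.
import numpy as np
from nltk.corpus import wordnet as wn

def best_similarity(word1, word2, metric="wup"):
    sims = []
    for s1 in wn.synsets(word1, pos=wn.NOUN):
        for s2 in wn.synsets(word2, pos=wn.NOUN):
            sim = s1.wup_similarity(s2) if metric == "wup" else s1.path_similarity(s2)
            if sim is not None:
                sims.append(sim)
    return max(sims) if sims else 0.0

def similarity_matrix(labels, identifiers, metric="wup"):
    matrix = np.zeros((len(labels), len(identifiers)))
    for i, label in enumerate(labels):
        for j, ident in enumerate(identifiers):
            matrix[i, j] = best_similarity(label, ident, metric)
    return matrix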

5.2 Statistical Associations

We used three functions to reflect the statistical association between an image label and an entity identifier:

• Co-occurrence counts, i.e. the frequencies of the region labels and entity identifiers that occur together in the pictures of the training set;

• Pointwise mutual information (PMI) (Fano, 1961) that compares the joint probability of the occurrence of an (image label, entity identifier) pair to the independent probability of the region label and the caption entity occurring by themselves; and finally

• The simplified Student's t-score as described in Church and Mercer (1993).

As with the semantic similarity scores, we used matrices to hold the scores for all the (image label, entity identifier) pairs for the three association metrics.
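A minimal sketch of the three association scores, assuming raw frequency dictionaries collected from the training set; count_pair, count_label, count_entity and the total count n are assumed data structures.

# Sketch of the co-occurrence, PMI, and simplified t-score association scores.
import math

def cooccurrence(label, entity, count_pair):
    return count_pair.get((label, entity), 0)

def pmi(label, entity, count_pair, count_label, count_entity, n):
    joint = count_pair.get((label, entity), 0)
    if joint == 0:
        return float("-inf")
    p_joint = joint / n
    p_indep = (count_label[label] / n) * (count_entity[entity] / n)
    return math.log(p_joint / p_indep)

def t_score(label, entity, count_pair, count_label, count_entity, n):
    # simplified Student's t-score in the spirit of Church and Mercer (1993)
    joint = count_pair.get((label, entity), 0)
    if joint == 0:
        return 0.0
    expected = count_label[label] * count_entity[entity] / n
    return (joint - expected) / math.sqrt(joint)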

5.3 The Mapping Algorithm

To associate the region labels of an image to the entities in its caption, we mapped the label Li to the caption entity Ej that had the highest score with respect to Li. We did this for the three association scores and the eight semantic metrics. Note that a region label is not systematically paired with the same caption entity, since each caption contains different sets of entities.


Background and foreground are two of the most frequent words in the captions and they were frequently assigned to image regions. Since they rarely represent entities, but merely tell where the entities are located, we included them in a list of stop words, as well as middle, left, right, and front, that we removed from the identifiers.
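A sketch of the mapping step of Sect. 5.3 combined with the stop-word filtering above; the scores dictionary keyed by (label, entity) pairs is an assumed layout.

# Sketch of the mapping: each region label is paired with the caption entity
# that maximizes the current scoring function, after stop-word removal.
STOP_WORDS = {"background", "foreground", "middle", "left", "right", "front"}

def map_regions(region_labels, caption_entities, scores):
    mapping = {}
    candidates = [e for e in caption_entities if e not in STOP_WORDS]
    for label in region_labels:
        if candidates:
            mapping[label] = max(candidates,
                                 key=lambda e: scores.get((label, e), 0.0))
    return mapping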

We applied the linking algorithm to the annotated set. We formed the Cartesian product of the image labels and the entity identifiers and, for each image region, we ranked the caption entities using the individual scoring functions. This results in an ordered list of entity candidates for each region. Table 2 shows the average ranks of the correct candidate for each of the scoring functions and the total number of correct candidates at different ranks.

6 Reranking

The algorithm in Sect. 5.3 determines the relationship holding between a pair of entities, where one element in the pair comes from the image and the other from the caption. The entities on each side are considered in isolation. We extended their description with relationships inside the image and the caption. Weegar et al. (2014) showed that pairs of entities in a text that were linked by the prepositions on, at, with, or in, often corresponded to pairs of segments that were close to each other. We further investigated the idea that spatial relationships in the image relate to syntactical relationships in the captions and we implemented it in the form of a reranker.

For each label-identifier pair, we included the relationship between the image segment in the pair and the closest segment in the image. As in Weegar et al. (2014), we defined the closeness as the Euclidean distance between the gravity centers of the bounding boxes of the segments. We also added the relationship between the caption entity in the label-identifier pair and the entity mentions which were the closest in the caption. We parsed the captions and we measured the distance as the number of edges between the two entities in the dependency graph.

6.1 Spatial Features

The Segmented and Annotated IAPR TC-12 data set comes with annotations for three different types of spatial relationships holding between the segment pairs in each image: topological, horizontal, and vertical (Hernandez-Gracidas and Sucar, 2007). The possible values are adjacent or disjoint for the topological category, beside or horizontally aligned for the horizontal one, and finally above, below, or vertically aligned for the vertical one.

6.2 Syntactic Features

The syntactic features are all based on the structure of the sentences' dependency graphs. We followed the graph from the caption entity in the pair to extract its closest ancestors and descendants. We only considered children to the right of the candidate. We also included all the prepositions between the entity and these ancestors and descendants.

Figure 2: Dependency graph of the sentence a flat landscape with a dry meadow in the foreground

Figure 2 shows the dependency graph of the sentence a flat landscape with a dry meadow in the foreground. The descendants of the landscape entity are meadow and foreground, linked respectively by the prepositions with and in. Its ancestor is the root node and the distance between landscape and meadow is 2. The syntactic features we extract for the entities in this sentence, arranged in the order ancestor, distance to ancestor, preposition, descendant, distance to descendant, and preposition, are for landscape, (root, 1, null, meadow, 2, with) and (root, 1, null, foreground, 2, in), for meadow, (landscape, 2, with, null, –, null), and for foreground, (landscape, 2, in, null, –, null). We discard foreground as it is part of the stop words.
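As a rough illustration of the ancestor-side extraction, the sketch below uses spaCy as a stand-in for the dependency parser used in the paper; the model name, the noun-ancestor heuristic, and the exact parse of the example sentence are assumptions.

# Sketch: walk up the head chain of a caption-entity token, collecting the
# nearest noun ancestor, the number of edges to it, and prepositions on the way.
import spacy

nlp = spacy.load("en_core_web_sm")   # assumed model

def ancestor_features(token):
    preps, dist, node = [], 0, token
    while node.head is not node:         # stop at the root of the sentence
        node = node.head
        dist += 1
        if node.pos_ == "ADP":           # preposition on the path
            preps.append(node.text)
        elif node.pos_ in ("NOUN", "PROPN"):
            return node.text, dist, preps
    return "root", dist, preps

doc = nlp("a flat landscape with a dry meadow in the foreground")
for tok in doc:
    if tok.pos_ == "NOUN":
        print(tok.text, ancestor_features(tok))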

6.3 Pairing Features

The single features consist of the label, entity identifier, and score of the pair. To take interaction into account, we also paired features characterizing properties across image and text. The list of these features is (Table 3):

1. The label of the image region and the identifier of the caption entity. In Fig. 2, we create grass meadow from (grass, meadow).

2. The label of the closest image segment to the ancestor of the caption entity. The closest segment of the grass segment is river and the ancestor of meadow is landscape. This gives the paired feature meadow landscape. The labels of the segments closest to the current segment and the descendant of meadow are also paired.


Scoring function  Average rank  Rank = 1  Rank ≤ 2  Rank ≤ 3  Rank ≤ 4
co-occurrence     1.58          338       525       609       667
PMI               1.61          340       527       624       673
t-scores          1.59          337       540       623       669
PATH              1.19          559       604       643       668
HSO               1.18          574       637       666       691
JCN               1.22          535       580       626       653
LCH               1.19          559       604       643       668
LESK              1.19          560       609       646       670
LIN               1.19          542       581       623       652
RES               1.17          559       611       638       665
WUP               1.21          546       599       640       663

Table 2: Average rank of the correct candidate obtained by each scoring function on the 200 annotated images of the test set, and number of correct candidates that are ranked first, first or second, etc. The ceiling is 730.

Label: Simplified segment label
Entity: Identifier for the caption entity
Label Entity: Label and entity features combined
Score: Score given by the current scoring function
Anc ClosestSeg: Closest segment label with the ancestor of the caption entity
Desc ClosestSeg: Closest segment label with the descendant of the caption entity
AncDist: Distance between the ancestor and the caption entity, and distance between segments
DescDist: Distance between the descendant and the caption entity, and distance between the segments
TopoRel DescPreps: Topological relationship between segments and the prepositions linking the caption entity with its descendant
TopoRel AncPreps: Topological relationship between the segments and the prepositions linking the caption entity with its ancestor
XRel DescPreps: Horizontal relationship between segments and the prepositions linking the caption entity with its descendant
XRel AncPreps: Horizontal relationship between segments and the prepositions linking the caption entity with its ancestor
YRel DescPreps: Vertical relationship between segments and the prepositions linking the caption entity with its descendant
YRel AncPreps: Vertical relationship between segments and the prepositions linking the caption entity with its ancestor
SegmentDist: Distance (in pixels) between the gravity centers of the bounding boxes framing the two closest segments

Table 3: The reranking features using the current segment and its closest segment in the image

3. The distance between the segment pairs in the image divided into seven intervals with the distance between the caption entities. We measured the distance in pixels since all the images have the same pixel dimensions.

4. The spatial relationships of the closest segments with the prepositions found between their corresponding caption entities. The segments grass and river in the image are adjacent and horizontally aligned and grass is located below the segment labeled river. Each of the spatial features is paired with the prepositions for both the ancestor and the descendant.

We trained the reranking models from the pairs of labeled segments and caption entities, where the correct mappings formed the positive examples and the rest, the negative ones. In Fig. 1, the mapping (grass, meadow) is marked as correct for the region labeled grass, while the mappings (grass, lagoon) and (grass, cloud) are marked as incorrect. We used the manually annotated images (200 images, Table 1) as training data, a leave-one-out cross-validation, and L2-regularized logistic regression from LIBLINEAR (Fan et al., 2008). We applied a cutoff of 3 for the list of candidates in the reranking and we multiplied the original score of the label-identifier pairs with the reranking probability.
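A hedged sketch of this reranking step with scikit-learn in place of LIBLINEAR: the paired features are one-hot encoded, an L2-regularized logistic regression is trained on the candidate pairs of the other images, and each of the top-3 candidates of a region is rescored by multiplying its original score with the classifier's positive probability. The data layout is an assumption.

# Sketch of reranking one region's candidates in a leave-one-out setting.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def rerank(train_candidates, region_candidates, cutoff=3):
    """train_candidates: list of (feature_dict, is_correct) from other images;
    region_candidates: list of (feature_dict, score) for one image region."""
    vec = DictVectorizer()
    X = vec.fit_transform([feats for feats, _ in train_candidates])
    y = [int(correct) for _, correct in train_candidates]
    clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X, y)
    # keep the top candidates, multiply each original score by P(correct)
    top = sorted(region_candidates, key=lambda c: -c[1])[:cutoff]
    rescored = [(feats, score * clf.predict_proba(vec.transform([feats]))[0, 1])
                for feats, score in top]
    return sorted(rescored, key=lambda c: -c[1])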

6.4 Reranking Example

Table 4, upper part, shows the two top candidates obtained from the co-occurrence scores for the four regions in Fig. 1.


Label  Entity 1   Score  Entity 2   Score
cloud  sky        2207   cloud      1096
grass  sky        1489   meadow     887
hill   sky        861    cloud      327
river  sky        655    cloud      250

cloud  cloud      769    sky        422
grass  meadow     699    landscape  176
hill   landscape  113    cloud      28
river  cloud      37     meadow     10

Table 4: An example of an assignment before (upper part) and after (lower part) reranking. The caption entities are ranked according to the number of co-occurrences with the label. We obtain the new score for a label-identifier pair by multiplying the original score by the output of the reranker for this pair

The column Entity 1 shows that the scoring function maps the caption entity sky to all of the regions. We created a reranker's feature vector for each of the 8 label-identifier pairs. Table 5 shows two of them, corresponding to the pairs (grass, sky) and (grass, meadow). The pair (grass, meadow) is a correct mapping, but it has a lower co-occurrence score than the incorrect pair (grass, sky).

In the cross-validation evaluation, we applied the classifier to these vectors and we obtained the reranking scores of 0.0244 for (grass, sky) and 0.79 for (grass, meadow), resulting in the respective final scores of 36 and 699. Table 4, lower part, shows the new rankings, where the highest scores correspond to the associations: (cloud, cloud), (grass, meadow), (hill, landscape), and (river, cloud), which are all correct except the last one.

7 Results

7.1 Individual Scoring Functions

We evaluated the three scoring functions: co-occurrence, mutual information, and t-score, and the semantic similarity functions. Each labeled segment in the annotated set was assigned the caption entity that gave the highest scoring label-identifier pair.

To address the lack of annotated data, we also investigated a self-training method. We used the statistical associations we derived from the training set and we applied the mapping procedure in Sect. 5.3 to this set.

Feature           (grass, meadow)  (grass, sky)
Label             grass            grass
Entity            meadow           sky
Label Entity      grass meadow     grass sky
Score             881              1,477
Anc ClosestSeg    landscape river  cloud river
Desc ClosestSeg   lagoon river     null river
AncDist           2 a              2 a
DescDist          1 a              100 a
TopoRel DescPrep  adj null         adj null
TopoRel AncPrep   adj with         adj in
XRel DescPrep     horiz null       horiz null
XRel AncPrep      horiz with       horiz in
YRel DescPrep     below null       below null
YRel AncPrep      below with       below in
SegmentDist       24               24
Classification    correct          incorrect

Table 5: Feature vectors for the pairs (grass, meadow) and (grass, sky). The ancestor distance 2 a means that there are two edges in the dependency graph between the words meadow and landscape, and a represents the smallest of the distance intervals, meaning that the two segments grass and river are less than 50 pixels apart

We repeated this procedure with the three statistical scoring functions. We counted all the mappings we obtained between the region labels and the caption identifiers and we used these counts to create three new scoring functions denoted with a Σ sign.

Table 6 shows the performance comparison between the different functions. The second column shows how many correct mappings were found by each function. The fourth column shows the improved score when the stop words were removed. The removal of the stop words as entity candidates improved the co-occurrence and t-score scoring functions considerably, but provided only marginal improvement for the scoring functions based on semantic similarity and pointwise mutual information. The percentage of correct mappings is based on the 730 regions that have a matching caption entity in the annotated test set.

The semantic similarity functions – PATH, HSO, JCN, LCH, LESK, LIN, RES and WUP – outperform the statistical ones, and the self-trained versions of the statistical scoring functions yield better results than the original ones.

We applied an ensemble voting procedure with the individual scoring functions, where each function was given a number of votes to place on its preferred label-identifier pair. We counted the votes and the entity that received the majority of the votes was selected as the mapping for the current label.


           With stop words     Without stop words
Function   # correct  %        # correct  %
co-oc.     208        28.5     338        46.3
PMI        339        46.4     340        46.6
t-score    241        33.0     337        46.1
Σco-oc.    226        30.0     387        53.0
ΣPMI       457        62.6     458        62.7
Σt-score   247        33.8     397        54.4
PATH       552        75.6     559        76.6
HSO        562        77.0     574        78.6
JCN        527        72.2     535        73.3
LCH        552        75.6     559        76.6
LESK       549        75.2     560        76.7
LIN        532        72.9     542        74.2
RES        539        73.8     559        76.6
WUP        540        74.0     546        74.8

Table 6: Comparison of the individual scoring functions. This test is performed on the annotated set of 200 images, with 730 possible correct mappings

Table 7 shows the results, where we reached a maximum of 79.45% correct mappings when all the functions were used together with one vote each.

Scoring function   Number of votes
co-oc.             1    0    1
PMI                1    0    1
t-score            1    0    1
Σco-oc.            1    0    1
ΣPMI               1    0    1
Σt-score           1    0    1
PATH               0    1    1
HSO                0    1    1
JCN                0    1    1
LCH                0    1    1
LESK               0    1    1
LIN                0    1    1
RES                0    1    1
WUP                0    1    1
number correct     382  569  580
percent correct    52   78   79

Table 7: Results of ensemble voting on the annotated set
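A minimal sketch of this voting scheme: each scoring function casts its configured number of votes for its preferred caption entity, and the entity with the most votes is selected for the region. The dictionaries are assumed data layouts; the weights correspond to the vote columns of Tables 7 and 10.

# Sketch of the ensemble voting over the scoring functions' top candidates.
from collections import Counter

def vote(preferred, weights):
    """preferred: {scorer_name: top entity}; weights: {scorer_name: votes}."""
    tally = Counter()
    for scorer, entity in preferred.items():
        tally[entity] += weights.get(scorer, 0)
    return tally.most_common(1)[0][0] if tally else None

# Illustrative example:
print(vote({"PMI": "sky", "PATH": "meadow", "WUP": "meadow"},
           {"PMI": 1, "PATH": 1, "WUP": 1}))    # -> "meadow"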

7.2 Reranking

We reranked all the scoring functions using the methods described in Sect. 6. We used the three label-identifier pairs with the highest score for each segment and function to build the model and we also reranked the top three label-identifier pairs for each of the assignments. Table 8 shows the results we obtained with the reranker compared to the original scoring functions.

Figure 3: A comparison of the number of correctly assigned labels when using the different scoring functions. The leftmost bars show the results of the original functions, the middle bars show the performance when the stop words are removed, and the rightmost ones show the performance of the reranked functions

The reranking procedure improves the performance of all the scoring functions, especially the statistical ones, where the maximal improvement reaches 58%.

Function   correct  correct rerank.  % Improv.
co-oc.     338      515              52.4
PMI        340      515              51.5
t-score    337      532              57.9
Σco-oc.    387      506              30.7
ΣPMI       458      552              20.5
Σt-score   397      521              31.2
PATH       559      587              5.0
HSO        574      587              2.3
JCN        535      558              4.3
LCH        559      586              4.8
LESK       560      563              0.5
LIN        542      558              3.0
RES        559      585              4.7
WUP        546      565              3.5

Table 8: The performance of the reranked scoring functions compared to the original scoring functions

Figure 3 shows the comparison between the original scoring functions, the scoring functions without stop words, and the reranked versions. There is a total of 928 segments, where 730 have a matching entity in the caption.

We applied an ensemble voting with the reranked functions (Table 9).


Reranking yields a significant improvement for the statistical scoring functions. When they get one vote each in the ensemble voting, the results increase from 52% correct mappings to 75%. When used in an ensemble with the semantic similarity scoring functions, the results improve further.

Scoring function    Number of votes
Reranked co-oc.     1    0    1
Reranked PMI        1    0    1
Reranked t-score    1    0    1
Reranked Σco-oc.    1    0    1
Reranked ΣPMI       1    0    1
Reranked Σt-score   1    0    1
Reranked PATH       0    1    1
Reranked HSO        0    1    1
Reranked JCN        0    1    1
Reranked LCH        0    1    1
Reranked LESK       0    1    1
Reranked LIN        0    1    1
Reranked RES        0    1    1
Reranked WUP        0    1    1
number correct      546  594  633
percent correct     75   81   87

Table 9: Results of ensemble voting with reranked assignments

We also evaluated ensemble voting with different numbers of votes for the different functions. We tested all the permutations of integer weights in the interval {0, 3} on the development set. Table 10 shows the best result for both the original assignments and the reranked assignments on the test set. The reranked assignments gave the best results, 88.76% correct mappings, and this is also the best result we have been able to reach.

8 Conclusion and Future Work

The extraction of relations across text and image is a new area for research. We showed in this paper that we could use semantic and statistical functions to link the entities in an image to mentions of the same entities in captions describing this image. We also showed that using the syntactic structure of the caption and the spatial structure of the image improves linking accuracy. Eventually, we managed to map correctly nearly 89% of the image segments in our data set, counting only segments that have a matching entity in the caption.

The semantic similarity functions form the most accurate mapping tool, when using functions in isolation.

                   Number of votes
Scoring function   Original  Reranked
co-oc.             0         0
PMI                2         3
t-score            0         0
Σco-oc.            0         1
ΣPMI               2         1
Σt-score           0         1
PATH               1         1
HSO                2         3
JCN                0         0
LCH                0         0
LESK               1         0
LIN                2         0
RES                0         0
WUP                0         0
number correct     298       316
percent correct    83.71     88.76

Table 10: Results of weighted ensemble voting.

The statistical functions improve their results significantly when they are used in an ensemble. This shows that it is preferable to use multiple scoring functions, as their different properties contribute to the final score.

Including the syntactic structures of the captions and pairing them with the spatial structures of the images is also useful when mapping entities to segments. By training a model on such features and using this model to rerank the assignments, the ordering of entities in the assignments is improved with a better precision for all the scoring functions.

Although we used images manually annotated with segments and labels, we believe the methods we described here can be applied to automatically segmented and labeled images. Using image recognition would then certainly introduce incorrectly classified image regions and thus probably decrease the linking scores.

Acknowledgments

This research was supported by Vetenskapsradet under grant 621-2010-4800, and the Det digitaliserade samhallet and eSSENCE programs.



Proceedings of the 19th Conference on Computational Natural Language Learning, pages 194–203, Beijing, China, July 30-31, 2015. ©2015 Association for Computational Linguistics

Making the Most of Crowdsourced Document Annotations: Confused Supervised LDA

Paul Felt, Dept. of Computer Science, Brigham Young University, [email protected]
Eric K. Ringger, Dept. of Computer Science, Brigham Young University, [email protected]
Jordan Boyd-Graber, Dept. of Computer Science, University of Colorado Boulder, [email protected]
Kevin Seppi, Dept. of Computer Science, Brigham Young University, [email protected]

Abstract

Corpus labeling projects frequently use low-cost workers from microtask marketplaces; however, these workers are often inexperienced or have misaligned incentives. Crowdsourcing models must be robust to the resulting systematic and non-systematic inaccuracies. We introduce a novel crowdsourcing model that adapts the discrete supervised topic model sLDA to handle multiple corrupt, usually conflicting (hence "confused") supervision signals. Our model achieves significant gains over previous work in the accuracy of deduced ground truth.

1 Modeling Annotators and Abilities

Supervised machine learning requires labeled training corpora, historically produced by laborious and costly annotation projects. Microtask markets such as Amazon's Mechanical Turk and Crowdflower have turned crowd labor into a commodity that can be purchased with relatively little overhead. However, crowdsourced judgments can suffer from high error rates. A common solution to this problem is to obtain multiple redundant human judgments, or annotations,1 relying on the observation that, in aggregate, non-experts often rival or exceed experts by averaging over individual error patterns (Surowiecki, 2005; Snow et al., 2008; Jurgens, 2013).

A crowdsourcing model harnesses the wisdom of the crowd and infers labels based on the evidence of the available annotations, imperfect though they be.

1 In this paper, we call human judgments annotations to distinguish them from gold standard class labels.

A common baseline crowdsourcing method aggregates annotations by majority vote, but this approach ignores important information. For example, some annotators are more reliable than others and their judgments ought to be upweighted accordingly. State-of-the-art crowdsourcing methods account for annotator expertise, often through a probabilistic formalism. Compounding the challenge, assessing unobserved annotator expertise is tangled with estimating ground truth from imperfect annotations; thus, joint inference of these interrelated quantities is necessary. State-of-the-art models also take the data into account, because data features can help ratify or veto human annotators.
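For reference, the majority-vote baseline mentioned above amounts to the following one-liner; this is an illustrative sketch (the annotation data structure is hypothetical), not part of the authors' system.

    # Majority-vote baseline: each document gets its most frequent annotation.
    from collections import Counter

    def majority_vote(annotations):
        """annotations: {doc_id: [label, label, ...]} -> {doc_id: label}; ties broken arbitrarily."""
        return {doc: Counter(labels).most_common(1)[0][0]
                for doc, labels in annotations.items() if labels}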

We introduce a model that improves on state-of-the-art crowdsourcing algorithms by modeling not only the annotations but also the features of the data (e.g., words in a document). Section 2 identifies modeling deficiencies affecting previous work and proposes a solution based on topic modeling; Section 2.4 presents inference for the new model. Experiments that contrast the proposed model with select previous work on several text classification datasets are presented in Section 3. In Section 4 we highlight additional related work.

2 Latent Representations that Reflect Labels and Confusion

Most crowdsourcing models extend the item-response model of Dawid and Skene (1979). The Bayesian version of this model, referred to here as ITEMRESP, is depicted in Figure 1. In the generative story for this model, a confusion matrix γ_j is drawn for each human annotator j.


[Figure omitted: plate diagram with nodes γ_jc, b^(θ), b^(γ), y_d, a_dj and plates d = 1...D, c = 1...C, j = 1...J.]

Figure 1: ITEMRESP as a plate diagram. Round nodes are random variables. Rectangular nodes are free parameters. Shaded nodes are observed. D, J, C are the number of documents, annotators, and classes, respectively.

Each row γ_jc of the confusion matrix γ_j is drawn from a symmetric Dirichlet distribution Dir(b^(γ)_jc) and encodes a categorical probability distribution over label classes that annotator j is apt to choose when presented with a document whose true label is c. Then for each document d an unobserved document label y_d is drawn. Annotations are generated as annotator j corrupts the true label y_d according to the categorical distribution Cat(γ_{j y_d}).
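A toy simulation of this generative story, with arbitrary sizes and a flat symmetric Dirichlet concentration, may make the notation concrete; it is a sketch under those assumptions, not the authors' code.

    # Toy sample from the ITEMRESP generative story: gamma = confusion matrices,
    # y = latent true labels, a = observed annotations.
    import numpy as np

    rng = np.random.default_rng(0)
    D, J, C = 100, 5, 3          # documents, annotators, classes
    b_gamma = np.full(C, 0.5)    # symmetric Dirichlet prior on confusion rows (arbitrary value)

    # gamma[j, c] is the distribution over labels annotator j produces when the true label is c.
    gamma = np.stack([[rng.dirichlet(b_gamma) for _ in range(C)] for _ in range(J)])

    y = rng.integers(C, size=D)                  # latent true labels (uniform here for simplicity)
    a = np.empty((D, J), dtype=int)              # observed annotations
    for d in range(D):
        for j in range(J):
            a[d, j] = rng.choice(C, p=gamma[j, y[d]])   # annotator j corrupts y_d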

2.1 Leveraging Data

Some extensions to ITEMRESP model the features of the data (e.g., words in a document). Many data-aware crowdsourcing models condition the labels on the data (Jin and Ghahramani, 2002; Raykar et al., 2010; Liu et al., 2012; Yan et al., 2014), possibly because discriminative classifiers dominate supervised machine learning. Others model the data generatively (Bragg et al., 2013; Lam and Stork, 2005; Felt et al., 2014; Simpson and Roberts, 2015). Felt et al. (2015) argue that generative models are better suited than conditional models to crowdsourcing scenarios because generative models often learn faster than their conditional counterparts (Ng and Jordan, 2001), especially early in the learning curve. This advantage is amplified by the annotation noise typical of crowdsourcing scenarios.

Extensions to ITEMRESP that model document features generatively tend to share a common high-level architecture. After the document class label y_d is drawn for each document d, features are drawn from class-conditional distributions. Felt et al. (2015) identify the MOMRESP model, reproduced in Figure 2, as a strong representative of generative crowdsourcing models.

[Figure omitted: plate diagram with nodes γ_jc, φ_c, b^(θ), b^(φ), b^(γ), y_d, x_d, a_dj and plates d = 1...D, c = 1...C, j = 1...J.]

Figure 2: MOMRESP as a plate diagram. |x_d| = V, the size of the vocabulary. Documents with similar feature vectors x tend to share a common label y. Reduces to mixture-of-multinomials clustering when no annotations a are observed.

In MOMRESP, the feature vector x_d for document d is drawn from the multinomial distribution with parameter vector φ_{y_d}. This class-conditional multinomial model of the data inherits many of the strengths and weaknesses of the naïve Bayes model that it resembles. Strengths include easy inference and a strong inductive bias, which helps the model be robust to annotation noise and scarcity. Weaknesses include overly strict conditional independence assumptions among features, leading to overconfidence in the document model and thereby causing the model to overweight feature evidence and underweight annotation evidence. This imbalance can result in degraded performance in the presence of high-quality or many annotations.

2.2 Confused Supervised LDA (CSLDA)

We solve the problem of imbalanced feature and annotation evidence observed in MOMRESP by replacing the class-conditional structure of previous generative crowdsourcing models with a richer generative story where documents are drawn first and class labels y_d are obtained afterwards via a log-linear mapping. This move towards conditioning classes on document content is sensible because in many situations document content is authored first, whereas label structure is not imposed until afterwards. It is plausible to assume that there will exist some mapping from a latent document structure to the desired document label distinctions. Moreover, by jointly modeling topics and the mapping to labels, we can learn the latent document representations that best explain how to predict and correct annotator errors.


Term              Definition
N_d               Size of document d
N_dt              Σ_n 1(z_dn = t)
N_t               Σ_{d,n} 1(z_dn = t)
N_jcc'            Σ_d a_djc' 1(y_d = c)
N_jc              ⟨N_jc1 · · · N_jcC⟩
N_vt              Σ_{d,n} 1(w_dn = v ∧ z_dn = t)
\hat{N}           Count excludes the variable being sampled
\bar{z}_d         Vector where \bar{z}_dt = (1/N_d) Σ_n 1(z_dn = t)
\hat{\bar{z}}_d   Excludes the z_dn being sampled

Table 1: Definition of counts and select notation. 1(·) is the indicator function.

We call our model confused supervised LDA (CSLDA, Figure 3), based on supervised topic modeling. Latent Dirichlet Allocation (Blei et al., 2003, LDA) models text documents as admixtures of word distributions, or topics. Although pre-calculated LDA topics as features can inform a crowdsourcing model (Levenberg et al., 2014), supervised LDA (sLDA) provides a principled way of incorporating document class labels and topics into a single model, allowing topic variables and response variables to co-inform one another in joint inference. For example, when sLDA is given movie reviews labeled with sentiment, inferred topics cluster around sentiment-heavy words (Mcauliffe and Blei, 2007), which may be quite different from the topics inferred by unsupervised LDA. One way to view CSLDA is as a discrete sLDA in settings with noisy supervision from multiple, imprecise annotators.

The generative story for CSLDA is:

1. Draw per-topic word distributions φ_t from Dir(b^(φ)).
2. Draw per-class regression parameters η_c from Gauss(μ, Σ).
3. Draw per-annotator confusion matrices γ_j with row γ_jc drawn from Dir(b^(γ)_jc).
4. For each document d,
   (a) Draw topic vector θ_d from Dir(b^(θ)).
   (b) For each token position n, draw topic z_dn from Cat(θ_d) and word w_dn from Cat(φ_{z_dn}).
   (c) Draw class label y_d with probability proportional to exp[η_{y_d}^⊤ \bar{z}_d].
   (d) For each annotator j, draw annotation vector a_dj from γ_{j y_d}.
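A compact end-to-end sample from this story may be helpful; the sizes and hyperparameter values below are arbitrary assumptions, and the script is only a sketch of the story above, not the authors' implementation.

    # Toy sample from the CSLDA generative story (Section 2.2).
    import numpy as np

    rng = np.random.default_rng(1)
    D, J, C, T, V, N_d = 50, 4, 3, 5, 200, 30   # docs, annotators, classes, topics, vocab, doc length

    phi = np.stack([rng.dirichlet(np.full(V, 0.1)) for _ in range(T)])    # step 1: topic-word dists
    eta = rng.normal(0.0, 1.0, size=(C, T))                               # step 2: regression weights
    gamma = np.stack([[rng.dirichlet(np.full(C, 0.5)) for _ in range(C)]  # step 3: confusions
                      for _ in range(J)])

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    docs, labels, annotations = [], [], []
    for d in range(D):
        theta = rng.dirichlet(np.full(T, 0.1))                # 4a: topic proportions
        z = rng.choice(T, size=N_d, p=theta)                  # 4b: token topics
        w = np.array([rng.choice(V, p=phi[t]) for t in z])    #     and words
        zbar = np.bincount(z, minlength=T) / N_d              # empirical topic vector
        y = rng.choice(C, p=softmax(eta @ zbar))              # 4c: label via log-linear mapping
        a = np.array([rng.choice(C, p=gamma[j, y]) for j in range(J)])  # 4d: annotations
        docs.append(w); labels.append(y); annotations.append(a)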

[Figure omitted: plate diagram with nodes θ_d, γ_jc, φ_t, η_c, b^(θ), b^(φ), b^(γ), μ, y_d, z_dn, w_dn, a_dj and plates d = 1...D, n = 1...N_d, c = 1...C, j = 1...J, t = 1...T.]

Figure 3: CSLDA as a plate diagram. D, J, C, T are the number of documents, annotators, classes, and topics, respectively. N_d is the size of document d. |φ_t| = V, the size of the vocabulary. η_c is a vector of regression parameters. Reduces to LDA when no annotations a are observed.

2.3 Stochastic EM

We use stochastic expectation maximization (EM) for posterior inference in CSLDA, alternating between sampling values for topics z and document class labels y (the E-step) and optimizing values of regression parameters η (the M-step). To sample z and y efficiently, we derive the full conditional distributions of z and y in a collapsed model where θ, φ, and γ have been analytically integrated out. Omitting multiplicative constants, the collapsed model joint probability is

p(z, w, y, a \mid \eta) = p(z)\, p(w \mid z)\, p(y \mid z, \eta)\, p(a \mid y)   (1)

  \propto M(a) \cdot \prod_d B(N_d + b^{(\theta)}) \cdot \prod_t B(N_t + b^{(\phi)}_t) \cdot \prod_d \frac{\exp(\eta_{y_d}^{\top} \bar{z}_d)}{\sum_c \exp(\eta_c^{\top} \bar{z}_d)} \cdot \prod_j \prod_c B(N_{jc} + b^{(\gamma)}_{jc})

where B(·) is the Beta function (multivariate as necessary), counts N and related symbols are defined in Table 1, and M(a) = \prod_{d,j} M(a_{dj}), where M(a_{dj}) is the multinomial coefficient.

Simplifying Equation 1 yields full conditionals for each word z_{dn},

p(z_{dn} = t \mid z, w, y, a, \eta) \propto \left( \hat{N}_{dt} + b^{(\theta)}_t \right) \cdot \frac{\hat{N}_{w_{dn} t} + b^{(\phi)}_{w_{dn}}}{\hat{N}_t + |b^{(\phi)}|_1} \cdot \frac{\exp(\eta_{y_d t} / N_d)}{\sum_c \exp(\eta_{ct} / N_d + \eta_c^{\top} \hat{\bar{z}}_d)}   (2)

and similarly for document label y_d:

p(y_d = c \mid z, w, y, a, \eta) \propto \frac{\exp(\eta_c^{\top} \bar{z}_d)}{\sum_{c'} \exp(\eta_{c'}^{\top} \bar{z}_d)} \cdot \prod_j \frac{\prod_{c'} \left( \hat{N}_{jcc'} + b^{(\gamma)} \right)^{\overline{a_{djc'}}}}{\left( \sum_{c'} \hat{N}_{jcc'} + b^{(\gamma)}_{jcc'} \right)^{\overline{\sum_{c'} a_{djc'}}}}   (3)

where x^{\overline{k}} \triangleq x (x + 1) \cdots (x + k - 1) is the rising factorial. In Equation 2 the first and third terms are independent of word n and can be cached at the document level for efficiency.
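The following sketch evaluates the unnormalized conditional of Equation 3 for every class, given precomputed counts. Variable names and shapes are assumptions made for illustration; the computation (softmax term times a per-annotator ratio of rising factorials) mirrors the reconstructed equation rather than the authors' actual code.

    # Sketch of the collapsed full conditional for a document label y_d (Equation 3).
    import numpy as np

    def rising(x, k):
        """Rising factorial x^(k) = x (x+1) ... (x+k-1), for a non-negative integer k."""
        out = 1.0
        for i in range(int(k)):
            out *= x + i
        return out

    def label_conditional(zbar_d, a_d, N_hat, eta, b_gamma):
        """zbar_d: (T,) topic proportions of document d; a_d: (J, C) annotation counts a_{djc'};
        N_hat: (J, C, C) counts excluding document d; eta: (C, T); b_gamma: tied scalar prior."""
        J, C = a_d.shape
        logits = eta @ zbar_d
        scores = np.exp(logits - logits.max())          # proportional to softmax(eta_c^T zbar_d)
        probs = np.empty(C)
        for c in range(C):
            p = scores[c]
            for j in range(J):
                num = 1.0
                for c2 in range(C):
                    num *= rising(N_hat[j, c, c2] + b_gamma, a_d[j, c2])
                den = rising((N_hat[j, c] + b_gamma).sum(), a_d[j].sum())
                p *= num / den
            probs[c] = p
        return probs / probs.sum()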

For the M-step, we train the regression parameters η (containing one vector per class) by optimizing the same objective function as for training a logistic regression classifier, assuming that class y is given:

p(y \mid z, \eta) = \prod_d \frac{\exp(\eta_{y_d}^{\top} \bar{z}_d)}{\sum_{c'} \exp(\eta_{c'}^{\top} \bar{z}_d)}.   (4)

We optimize the objective (Equation 4) using L-BFGS and a regularizing Gaussian prior with μ = 0, σ² = 1.
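A minimal version of this M-step is just L2-regularized multinomial logistic regression on the empirical topic vectors. The sketch below uses SciPy's L-BFGS-B with numerically approximated gradients for brevity; the function and argument names are illustrative, not the authors' API.

    # Sketch of the M-step: maximize Equation 4 with a Gaussian prior (mu = 0, sigma^2 = 1).
    import numpy as np
    from scipy.optimize import minimize

    def fit_eta(Zbar, y, C, sigma2=1.0):
        """Zbar: (D, T) topic proportions; y: (D,) integer label assignments from the E-step."""
        D, T = Zbar.shape

        def neg_log_posterior(flat_eta):
            eta = flat_eta.reshape(C, T)
            logits = Zbar @ eta.T                                   # (D, C)
            logits -= logits.max(axis=1, keepdims=True)
            log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
            ll = log_probs[np.arange(D), y].sum()                   # log of Equation 4
            prior = -0.5 * np.sum(flat_eta ** 2) / sigma2           # Gaussian prior on eta
            return -(ll + prior)

        # Gradients are approximated numerically here; an analytic gradient would be faster.
        result = minimize(neg_log_posterior, np.zeros(C * T), method="L-BFGS-B")
        return result.x.reshape(C, T)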

While EM is sensitive to initialization, CSLDA is straightforward to initialize. Majority vote is used to set initial y values ȳ. Corresponding initial values for z and η are obtained by clamping y to ȳ and running stochastic EM on z and η.

2.4 Hyperparameter Optimization

Ideally, we would test CSLDA performance under all of the many algorithms available for inference in such a model. Although that is not feasible, Asuncion et al. (2009) demonstrate that hyperparameter optimization in LDA topic models helps to bring the performance of alternative inference algorithms into approximate agreement. Accordingly, in Section 2.4 we implement hyperparameter optimization for CSLDA to make our results as general as possible.

Before moving on, however, we take a moment to validate that the observation of Asuncion et al. generalizes from LDA to the ITEMRESP model, which, together with LDA, comprises CSLDA. Figure 4 demonstrates that three ITEMRESP inference algorithms, Gibbs sampling (Gibbs), mean-field variational inference (Var), and the iterated conditional modes algorithm (ICM) (Besag, 1986), are brought into better agreement after optimizing their hyperparameters via grid search.

[Figure omitted: two panels of learning curves on NEWSGROUPS plotting accuracy against the number of annotated instances (× 1,000) for the Gibbs, Var, and ICM algorithms. (a) Hyperparameters fixed. (b) Hyperparameters optimized via grid search on validation data.]

Figure 4: Differences among the inferred label accuracy learning curves of three ITEMRESP inference algorithms are reduced when hyperparameters are optimized.

That is, the algorithms in Figure 4b are in better agreement, particularly near the extremes, than the algorithms in Figure 4a. This difference is subtle, but it holds to an equal or greater extent in the other simulation conditions we tested (experiment details are similar to those reported in Section 3).

Fixed-point Hyperparameter Updates. Although a grid search is effective, it is not practical for a model with many hyperparameters such as CSLDA. For efficiency, therefore, we use the fixed-point updates of Minka (2000). Our updates differ slightly from Minka's since we tie hyperparameters, allowing them to be learned more quickly from less data. In our implementation the matrices of hyperparameters b^(φ) and b^(θ) over the Dirichlet-multinomial distributions are completely tied such that b^(φ)_tv = b^(φ) ∀ t, v and b^(θ)_t = b^(θ) ∀ t.

This leads to

b^{(\phi)} \leftarrow b^{(\phi)} \cdot \frac{\sum_{t,v} \Psi(N_{tv} + b^{(\phi)}) - TV\,\Psi(b^{(\phi)})}{V \sum_t \left[ \Psi(N_t + V b^{(\phi)}) - \Psi(V b^{(\phi)}) \right]}   (5)

and

b^{(\theta)} \leftarrow b^{(\theta)} \cdot \frac{\sum_{d,t} \Psi(N_{dt} + b^{(\theta)}) - DT\,\Psi(b^{(\theta)})}{T \sum_d \left[ \Psi(N_d + T b^{(\theta)}) - \Psi(T b^{(\theta)}) \right]}.   (6)

The updates for b^(γ) are slightly more involved since we choose to tie the diagonal entries b^(γ)_d and, separately, the off-diagonal entries b^(γ)_o, updating each separately:

b^{(\gamma)}_d \leftarrow b^{(\gamma)}_d \cdot \frac{\sum_{j,c} \Psi(N_{jcc} + b^{(\gamma)}_d) - JC\,\Psi(b^{(\gamma)}_d)}{Z(b^{(\gamma)})}   (7)

and

b^{(\gamma)}_o \leftarrow b^{(\gamma)}_o \cdot \frac{\sum_{j,c,c' \neq c} \Psi(N_{jcc'} + b^{(\gamma)}_o) - JC(C-1)\,\Psi(b^{(\gamma)}_o)}{(C-1)\, Z(b^{(\gamma)})}   (8)

where

Z(b^{(\gamma)}) = \sum_{j,c} \Psi(N_{jc} + b^{(\gamma)}_d + (C-1) b^{(\gamma)}_o) - JC\,\Psi(b^{(\gamma)}_d + (C-1) b^{(\gamma)}_o).

As in the work of Asuncion et al. (2009), we add an algorithmic gamma prior (b^(·) ∼ G(α, β)) for smoothing by adding (α − 1)/b^(·) to the numerator and β to the denominator of Equations 5-8. Note that these algorithmic gamma "priors" should not be understood as first-class members of the CSLDA model (Figure 3). Rather, they are regularization terms that keep our hyperparameter search algorithm from straying towards problematic values such as 0 or ∞.
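As an illustration, the tied update of Equation 5 with the smoothing described above can be iterated as below; the gamma-prior parameter values and the function name are assumptions for the sketch, and SciPy's digamma stands in for the Ψ function.

    # Sketch of the tied fixed-point update for b^(phi) (Equation 5) with gamma-prior smoothing.
    import numpy as np
    from scipy.special import digamma

    def update_b_phi(b_phi, N_tv, alpha=1.001, beta=1e-3, iters=20):
        """N_tv: (T, V) topic-word counts; returns the updated tied hyperparameter."""
        T, V = N_tv.shape
        N_t = N_tv.sum(axis=1)
        for _ in range(iters):
            num = digamma(N_tv + b_phi).sum() - T * V * digamma(b_phi)
            den = V * (digamma(N_t + V * b_phi) - digamma(V * b_phi)).sum()
            # Smoothing: add (alpha - 1)/b to the numerator and beta to the denominator.
            b_phi = b_phi * (num + (alpha - 1.0) / b_phi) / (den + beta)
        return b_phi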

3 Experiments

For all experiments we set CSLDA's number of topics T to 1.5 times the number of classes in each dataset. We found that model performance was reasonably robust to this parameter. Only when T drops below the number of label classes does performance suffer. As per Section 2.3, z and η values are initialized with 500 rounds of stochastic EM, after which the full model is updated with 1000 additional rounds. Predictions are generated by aggregating samples from the last 100 rounds (the mode of the approximate marginal posterior).

We compare CSLDA with (1) a majority vote baseline, (2) the ITEMRESP model, and representatives of the two main classes of data-aware crowdsourcing models, namely (3) data-generative and (4) data-conditional. MOMRESP represents a typical data-generative model (Bragg et al., 2013; Felt et al., 2014; Lam and Stork, 2005; Simpson and Roberts, 2015). Data-conditional approaches typically model data features conditionally using a log-linear model (Jin and Ghahramani, 2002; Raykar et al., 2010; Liu et al., 2012; Yan et al., 2014). For the purposes of this paper, we refer to this model as LOGRESP. For ITEMRESP, MOMRESP, and LOGRESP we use the variational inference methods presented by Felt et al. (2015). Unlike that paper, in this work we have augmented inference with the in-line hyperparameter updates described in Section 2.4.

[Figure omitted: learning curves on WEATHER plotting accuracy against the number of annotations (× 1,000) for csLDA, MomResp, LogResp, ItemResp, and Majority.]

Figure 5: Inferred label accuracy of models on sentiment-annotated weather tweets.

3.1 Human-generated Annotations

To gauge the effectiveness of data-aware crowdsourcing models, we use the sentiment-annotated tweet dataset distributed by CrowdFlower as a part of its "data for everyone" initiative.2 In the "Weather Sentiment" task, 20 annotators judged the sentiment of 1000 tweets as either positive, negative, neutral, or unrelated to the weather. In the secondary "Weather Sentiment Evaluated" task, 10 additional annotators judged the correctness of each consensus label. We construct a gold standard from the consensus labels that were judged to be correct by 9 of the 10 annotators in the secondary task.

Figure 5 plots learning curves of the accuracy of model-inferred labels as annotations are added (ordered by timestamp). All methods, including majority vote, converge to roughly the same accuracy when all 20,000 annotations are added. When fewer annotations are available, statistical models beat majority vote, and CSLDA is considerably more accurate than other approaches. Learning curves are bumpy because annotation order is not random and because inferred label accuracy is calculated only over documents with at least one annotation. Learning curves collectively increase when average annotation depth (the number of annotations per item) increases and decrease when new documents are annotated and average annotation depth decreases. CSLDA stands out by being more robust to these changes than other algorithms, and also by maintaining a higher level of accuracy across the board. This is important because high accuracy using fewer annotations translates to decreased annotation costs.

2 http://www.crowdflower.com/data-for-everyone


Dataset     D       C    V       (1/N) Σ_d N_d
20 News     16,995  20   22,851  111
WebKB       3,543   4    5,781   131
Reuters8    6,523   8    6,776   53
Reuters52   7,735   52   5,579   58
CADE12      34,801  12   41,628  110
Enron       3,854   32   14,069  431

Table 2: Dataset statistics. D is the number of documents, C is the number of classes, V is the number of features, and (1/N) Σ_d N_d is the average document size. Values are calculated after setting aside 15% as validation data and doing feature selection.

3.2 Synthetic Annotations

Datasets including both annotations and gold standard labels are in short supply. Although plenty of text categorization datasets have been annotated, common practice reflects that initial noisy annotations be discarded and only consensus labels be published. Consequently, we follow previous work in achieving broad validation by constructing synthetic annotators that corrupt known gold standard labels. We base our experimental setup on the annotations gathered by Felt et al. (2015),3 who paid CrowdFlower annotators to relabel 1000 documents from the well-known 20 Newsgroups classification dataset. In that experiment, 136 annotators contributed, each instance was labeled an average of 6.9 times, and annotator accuracies were distributed approximately according to a Beta(3.6, 5.1) distribution. Accordingly we construct 100 synthetic annotators, each parametrized by an accuracy drawn from Beta(3.6, 5.1) and with errors drawn from a symmetric Dirichlet Dir(1). Datasets are annotated by selecting an instance (at random without replacement) and then selecting K annotators (at random without replacement) to annotate it before moving on. We choose K = 7 to mirror the empirical average in the CrowdFlower annotation set.
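A small script can make this synthetic-annotator construction concrete; the dataset size and helper names are illustrative assumptions, while the Beta(3.6, 5.1) accuracies, Dirichlet(1) errors, and annotation depth K = 7 follow the setup described above.

    # Sketch of the synthetic-annotator simulation.
    import numpy as np

    rng = np.random.default_rng(2)
    A, C, K = 100, 20, 7                      # annotators, classes (e.g., 20 Newsgroups), depth

    acc = rng.beta(3.6, 5.1, size=A)          # annotator accuracies
    conf = np.empty((A, C, C))
    for j in range(A):
        for c in range(C):
            row = rng.dirichlet(np.ones(C - 1)) * (1 - acc[j])   # error mass over wrong labels
            conf[j, c] = np.insert(row, c, acc[j])               # correct label gets acc[j]

    def annotate(gold):
        """gold: (D,) true labels -> list of (doc, annotator, label) triples."""
        triples = []
        for d in rng.permutation(len(gold)):                     # instances in random order
            for j in rng.choice(A, size=K, replace=False):       # K distinct annotators
                triples.append((d, j, rng.choice(C, p=conf[j, gold[d]])))
        return triples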

We evaluate on six text classification datasets, summarized in Table 2. The 20 Newsgroups, WebKB, Cade12, Reuters8, and Reuters52 datasets are described in more detail by Cardoso-Cachopo (2007). The LDC-labeled Enron emails dataset is described by Berry et al. (2001).

3 The dataset is available via git at git://nlp.cs.byu.edu/plf1/crowdflower-newsgroups.git

[Figure omitted: learning curves on NEWSGROUPS, WEBKB, and CADE12 plotting accuracy against the number of annotations (× 1,000) for csLDA, MomResp, LogResp, ItemResp, and Majority.]

Figure 6: Inferred label accuracy of models on synthetic annotations. The first instance is annotated 7 times, then the second, and so on.

Each dataset is preprocessed via Porter stemming and by removal of the stopwords from MALLET's stopword list (McCallum, 2002). Features occurring fewer than 5 times in the corpus are discarded. In the case of MOMRESP, features are fractionally scaled so that each document is the same length, in keeping with previous work on multinomial document models (Nigam et al., 2006).

Figure 6 plots learning curves on three representative datasets (Enron resembles Cade12, and the Reuters datasets resemble WebKB). CSLDA consistently outperforms LOGRESP, ITEMRESP, and majority vote. The generative models (CSLDA and MOMRESP) tend to excel in low-annotation portions of the learning curve, partially because generative models tend to converge quickly and partially because generative models naturally learn from unlabeled documents (i.e., semi-supervision). However, MOMRESP tends to quickly reach a performance plateau after which additional annotations do little good.


95% Accuracy   CSLDA        MOMRESP     LOGRESP       ITEMRESP     Majority
20 News        85 (5.0x)    150 (8.8x)  152 (8.9x)    168 (9.9x)   233 (13.7x)
WebKB          31 (8.8x)    -           46 (13.0x)    46 (13.0x)   -
Reuters8       25 (3.8x)    -           73 (11.2x)    62 (9.5x)    -
Reuters52      33 (4.3x)    73 (9.4x)   67.5 (8.7x)   60 (7.8x)    87 (11.2x)
CADE12         250 (7.2x)   -           295 (8.5x)    290 (8.3x)   570 (16.4x)
Enron          31 (8.0x)    -           40 (10.4x)    38 (9.9x)    47 (12.2x)

Table 3: The number of annotations ×1000 at which the algorithm reaches 95% inferred label accuracy on the indicated dataset (average annotations per instance are in parentheses). All instances are annotated once, then twice, and so on. Empty entries ('-') do not reach 95% even with 20 annotations per instance.

The performance of MOMRESP is also highly dataset-dependent: it is good on 20 Newsgroups, mediocre on WebKB, and poor on CADE12. By contrast, CSLDA is relatively stable across datasets.

To understand the different behavior of the two generative models, recall that MOMRESP is identical to ITEMRESP save for its multinomial data model. Indeed, the equations governing inference of label y in MOMRESP simply sum together terms from an ITEMRESP model and terms from a mixture-of-multinomials clustering model (and for reasons explained in Section 2.1, the multinomial data model terms tend to dominate). Therefore when MOMRESP diverges from ITEMRESP it is because MOMRESP is attracted toward a y assignment that satisfies the multinomial data model, grouping similar documents together. This can both help and hurt. When data clusters and label classes are misaligned, MOMRESP falters (as in the case of the Cade12 dataset). In contrast, CSLDA's flexible mapping from topics to labels is less sensitive: topics can diverge from label classes so long as there exists some linear transformation from the topics to the labels.

Many corpus annotation projects are not complete until the corpus achieves some target level of quality. We repeat the experiment reported in Figure 6, but rather than simulating seven annotations for each instance before moving on, we simulate one annotation for each instance, then two, and so on until each instance in the dataset is annotated 20 times. Table 3 reports the minimal number of annotations before an algorithm's inferred labels reach an accuracy of 95%, a lofty goal that can require significant amounts of annotation when using poor-quality annotations. CSLDA achieves 95% accuracy with fewer annotations, corresponding to reduced annotation cost.

[Figure omitted: learning curves on NEWSGROUPS plotting accuracy against the number of annotations (× 1,000) for csLDA and csLDA-P.]

Figure 7: Joint inference for CSLDA vs pipeline inference (CSLDA-P).

3.3 Joint vs Pipeline Inference

To isolate the effectiveness of joint inference in CSLDA, we compare against the pipeline alternative where topics are inferred first and then held constant during inference (Levenberg et al., 2014). Joint inference yields modest but consistent benefits over a pipeline approach. Figure 7 highlights a portion of the learning curve on the Newsgroups dataset (based on the experiments summarized in Table 3). This trend holds across all of the datasets that we examined.

3.4 Error Analysis

Class-conditional models like MOMRESP include a feature that data-conditional models like CSLDA lack: an explicit prior over class prevalence. Figure 8a shows that CSLDA performs poorly on the CrowdFlower-annotated Newsgroups documents described at the beginning of Section 3 (not the synthetic annotations). Error analysis uncovers that CSLDA lumps related classes together in this dataset. This is because annotators could specify up to 3 simultaneous labels for each annotation, so that similar labels (e.g., "talk.politics.misc" and "talk.politics.mideast") are usually chosen in blocks.


[Figure omitted: learning curves plotting accuracy against the number of annotated instances (× 1,000) for csLDA, MomResp, LogResp, ItemResp, and Majority on (a) CFGROUPS1000, the original data, and (b) CFSIMPLEGROUPS, after combining frequently co-annotated label classes.]

Figure 8: An illustrative failure case. CSLDA, lacking a class label prior, prefers to combine label classes that are highly co-annotated.

Suppose each member of a set of documents with similar topical content is annotated with both labels A and B. In this scenario it is apparent that CSLDA will achieve its best fit by inferring that all documents have the same label, either A or B. By contrast, MOMRESP's uniform prior distribution over θ leads it to prefer solutions with a balance of A and B.

The hypothesis that class combination explains CSLDA's performance is supported by Figure 8b, which shows that CSLDA recovers after combining the classes that were most frequently co-annotated. We greedily combine label class pairs to maximize Krippendorff's α until only 10 labels were left: "alt.atheism," religion, and politics classes were combined; also, "sci.electronics" and the computing classes. The remaining eight classes were unaltered. However, one could also argue that the original behavior of CSLDA is in some ways desirable. That is, if two classes of documents are mostly the same both topically and in terms of annotator decisions, perhaps those classes ought to be collapsed. We are not overly concerned that MOMRESP beats CSLDA in Figure 8, since this result is consistent with early relative performance in simulation.

4 Additional Related Work

This section reviews related work not already discussed. A growing body of work extends the item-response model to account for variables such as

item difficulty (Whitehill et al., 2009; Passonneau and Carpenter, 2013; Zhou et al., 2012), annotator trustworthiness (Hovy et al., 2013), correlation among various combinations of these variables (Zhou et al., 2014), and change in annotator behavior over time (Simpson and Roberts, 2015).

Welinder et al. (2010) carefully model the process of annotating objects in images, including variables for item difficulty, item class, and class-conditional perception noise. In follow-up work, Liu et al. (2012) demonstrate that similar levels of performance can be achieved with the simple item-response model by using variational inference rather than EM. Alternative inference algorithms have been proposed for crowdsourcing models (Dalvi et al., 2013; Ghosh et al., 2011; Karger et al., 2013; Zhang et al., 2014). Some crowdsourcing work regards labeled data not as an end in itself, but rather as a means to train classifiers (Lin et al., 2014). The fact-finding literature assigns trust scores to assertions made by untrusted sources (Pasternack and Roth, 2010).

5 Conclusion and Future Work

We describe CSLDA, a generative, data-aware crowdsourcing model that addresses important modeling deficiencies identified in previous work. In particular, CSLDA handles data in which the natural document clusters are at odds with the intended document labels. It also transitions smoothly from situations in which few annotations are available to those in which many annotations are available. Because of CSLDA's flexible mapping to class labels, many structural variants are possible in future work. For example, this mapping could depend not just on inferred topical content but also directly on data features (cf. Nguyen et al. (2013)) or learned embedded feature representations.

The large number of parameters in the learned confusion matrices of crowdsourcing models presents difficulties at scale. This could be addressed by modeling structure both inside the annotators and the classes. Redundant annotations give unique insights into both inter-annotator and inter-class relationships and could be used to induce annotator or label class hierarchies with parsimonious representations. Simpson et al. (2013) identify annotator clusters using community detection algorithms but do not address annotator hierarchy or scalable confusion representations.


Acknowledgments This work was supported by the collaborative NSF Grant IIS-1409739 (BYU) and IIS-1409287 (UMD). Boyd-Graber is also supported by NSF grants IIS-1320538 and NCSE-1422492. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor.

References

Arthur Asuncion, Max Welling, Padhraic Smyth, and Yee Whye Teh. 2009. On smoothing and inference for topic models. In Proceedings of Uncertainty in Artificial Intelligence.
Michael W. Berry, Murray Browne, and Ben Signer. 2001. Topic annotated Enron email data set. Linguistic Data Consortium.
Julian Besag. 1986. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society, 48(3):259–302.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.
Jonathan Bragg, Mausam, and Daniel S. Weld. 2013. Crowdsourcing multi-label classification for taxonomy creation. In First AAAI Conference on Human Computation and Crowdsourcing.
Ana Margarida de Jesus Cardoso-Cachopo. 2007. Improving Methods for Single-label Text Categorization. Ph.D. thesis, Universidade Tecnica de Lisboa.
Nilesh Dalvi, Anirban Dasgupta, Ravi Kumar, and Vibhor Rastogi. 2013. Aggregating crowdsourced binary ratings. In Proceedings of World Wide Web Conference.
Alexander P. Dawid and Allan M. Skene. 1979. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, pages 20–28.
Paul Felt, Robbie Haertel, Eric Ringger, and Kevin Seppi. 2014. MomResp: A Bayesian model for multi-annotator document labeling. In International Language Resources and Evaluation.
Paul Felt, Eric Ringger, Kevin Seppi, and Robbie Haertel. 2015. Early gains matter: A case for preferring generative over discriminative crowdsourcing models. In Conference of the North American Chapter of the Association for Computational Linguistics.
Arpita Ghosh, Satyen Kale, and Preston McAfee. 2011. Who moderates the moderators?: Crowdsourcing abuse detection in user-generated content. In Proceedings of the 12th ACM Conference on Electronic Commerce.
Dirk Hovy, Taylor Berg-Kirkpatrick, Ashish Vaswani, and Eduard H. Hovy. 2013. Learning whom to trust with MACE. In Conference of the North American Chapter of the Association for Computational Linguistics.
Rong Jin and Zoubin Ghahramani. 2002. Learning with multiple labels. In Proceedings of Advances in Neural Information Processing Systems, pages 897–904.
David Jurgens. 2013. Embracing ambiguity: A comparison of annotation methodologies for crowdsourcing word sense labels. In Proceedings of NAACL-HLT, pages 556–562.
David R. Karger, Sewoong Oh, and Devavrat Shah. 2013. Efficient crowdsourcing for multi-class labeling. In ACM SIGMETRICS Performance Evaluation Review, volume 41, pages 81–92. ACM.
Chuck P. Lam and David G. Stork. 2005. Toward optimal labeling strategy under multiple unreliable labelers. In AAAI Spring Symposium: Knowledge Collection from Volunteer Contributors.
Abby Levenberg, Stephen Pulman, Karo Moilanen, Edwin Simpson, and Stephen Roberts. 2014. Predicting economic indicators from web text using sentiment composition. International Journal of Computer and Communication Engineering, 3(2):109–115.
Christopher H. Lin, Mausam, and Daniel S. Weld. 2014. To re(label), or not to re(label). In Second AAAI Conference on Human Computation and Crowdsourcing.
Qiang Liu, Jian Peng, and Alex T. Ihler. 2012. Variational inference for crowdsourcing. In Proceedings of Advances in Neural Information Processing Systems.
Jon D. Mcauliffe and David Blei. 2007. Supervised topic models. In Proceedings of Advances in Neural Information Processing Systems.
Andrew McCallum. 2002. MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu.
Thomas Minka. 2000. Estimating a Dirichlet distribution.
Andrew Y. Ng and Michael I. Jordan. 2001. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Proceedings of Advances in Neural Information Processing Systems.
Viet-An Nguyen, Jordan L. Boyd-Graber, and Philip Resnik. 2013. Lexical and hierarchical topic regression. In Proceedings of Advances in Neural Information Processing Systems.
Kamal Nigam, Andrew McCallum, and Tom Mitchell. 2006. Semi-supervised text classification using EM. Semi-Supervised Learning, pages 33–56.
Rebecca J. Passonneau and Bob Carpenter. 2013. The benefits of a model of annotation. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 187–195.
Jeff Pasternack and Dan Roth. 2010. Knowing what to believe (when you already know something). In Proceedings of the International Conference on Computational Linguistics.
Vikas C. Raykar, Shipeng Yu, Linda H. Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. 2010. Learning from crowds. Journal of Machine Learning Research, 11:1297–1322.
Edwin Simpson and Stephen Roberts. 2015. Bayesian methods for intelligent task assignment in crowdsourcing systems. In Decision Making: Uncertainty, Imperfection, Deliberation and Scalability, pages 1–32. Springer.
E. Simpson, S. Roberts, I. Psorakis, and A. Smith. 2013. Dynamic Bayesian combination of multiple imperfect classifiers. In Decision Making and Imperfection, pages 1–35. Springer.
Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. Cheap and fast—but is it good?: Evaluating non-expert annotations for natural language tasks. In Proceedings of EMNLP. ACL.
James Surowiecki. 2005. The Wisdom of Crowds. Random House LLC.
Peter Welinder, Steve Branson, Pietro Perona, and Serge J. Belongie. 2010. The multidimensional wisdom of crowds. In NIPS, pages 2424–2432.
Jacob Whitehill, Ting-fan Wu, Jacob Bergsma, Javier R. Movellan, and Paul L. Ruvolo. 2009. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. NIPS, 22:2035–2043.
Yan Yan, Romer Rosales, Glenn Fung, Ramanathan Subramanian, and Jennifer Dy. 2014. Learning from multiple annotators with varying expertise. Machine Learning, 95(3):291–327.
Yuchen Zhang, Xi Chen, Dengyong Zhou, and Michael I. Jordan. 2014. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. In Advances in Neural Information Processing Systems 27, pages 1260–1268. Curran Associates, Inc.
Dengyong Zhou, Sumit Basu, Yi Mao, and John C. Platt. 2012. Learning from the wisdom of crowds by minimax entropy. In NIPS, volume 25, pages 2204–2212.
Dengyong Zhou, Qiang Liu, John Platt, and Christopher Meek. 2014. Aggregating ordinal labels from crowds by minimax conditional entropy. In Proceedings of the International Conference on Machine Learning.


Proceedings of the 19th Conference on Computational Natural Language Learning, pages 204–214, Beijing, China, July 30-31, 2015. ©2015 Association for Computational Linguistics

Multichannel Variable-Size Convolution for Sentence Classification

Wenpeng Yin and Hinrich Schütze
Center for Information and Language Processing
University of Munich, [email protected]

Abstract

We propose MVCNN, a convolution neural network (CNN) architecture for sentence classification. It (i) combines diverse versions of pretrained word embeddings and (ii) extracts features of multigranular phrases with variable-size convolution filters. We also show that pretraining MVCNN is critical for good performance. MVCNN achieves state-of-the-art performance on four tasks: on small-scale binary, small-scale multi-class and large-scale Twitter sentiment prediction and on subjectivity classification.

1 Introduction

Different sentence classification tasks are crucial for many Natural Language Processing (NLP) applications. Natural language sentences have complicated structures, both sequential and hierarchical, that are essential for understanding them. In addition, how to decode and compose the features of component units, including single words and variable-size phrases, is central to the sentence classification problem.

In recent years, deep learning models have achieved remarkable results in computer vision (Krizhevsky et al., 2012), speech recognition (Graves et al., 2013) and NLP (Collobert and Weston, 2008). A problem largely specific to NLP is how to detect features of linguistic units, how to conduct composition over variable-size sequences and how to use them for NLP tasks (Collobert et al., 2011; Kalchbrenner et al., 2014; Kim, 2014). Socher et al. (2011a) proposed recursive neural networks to form phrases based on parsing trees. This approach depends on the availability of a well-performing parser; for many languages and domains, especially noisy domains, reliable parsing is difficult.

Hence, convolution neural networks (CNN) are getting increasing attention, for they are able to model long-range dependencies in sentences via hierarchical structures (Dos Santos and Gatti, 2014; Kim, 2014; Denil et al., 2014). Current CNN systems usually implement a convolution layer with fixed-size filters (i.e., feature detectors), in which the concrete filter size is a hyperparameter. They essentially split a sentence into multiple sub-sentences by a sliding window, then determine the sentence label by using the dominant label across all sub-sentences. The underlying assumption is that the sub-sentence with that granularity is potentially good enough to represent the whole sentence. However, it is hard to find the granularity of a "good sub-sentence" that works well across sentences. This motivates us to implement variable-size filters in a convolution layer in order to extract features of multigranular phrases.

Breakthroughs of deep learning in NLP are also based on learning distributed word representations – also called "word embeddings" – by neural language models (Bengio et al., 2003; Mnih and Hinton, 2009; Mikolov et al., 2010; Mikolov, 2012; Mikolov et al., 2013a). Word embeddings are derived by projecting words from a sparse, 1-of-V encoding (V: vocabulary size) onto a lower dimensional and dense vector space via hidden layers and can be interpreted as feature extractors that encode semantic and syntactic features of words.

Many papers study the comparative performance of different versions of word embeddings, usually learned by different neural network (NN) architectures. For example, Chen et al. (2013) compared HLBL (Mnih and Hinton, 2009), SENNA (Collobert and Weston, 2008), Turian (Turian et al., 2010) and Huang (Huang et al., 2012), showing great variance in the quality and characteristics of the semantics captured by the tested embedding versions.


Hill et al. (2014) showed that embeddings learned by neural machine translation models outperform three representative monolingual embedding versions, skip-gram (Mikolov et al., 2013b), GloVe (Pennington et al., 2014) and C&W (Collobert et al., 2011), in some cases. These prior studies motivate us to explore combining multiple versions of word embeddings, treating each of them as a distinct description of words. Our expectation is that the combination of these embedding versions, trained by different NNs on different corpora, should contain more information than each version individually. We want to leverage this diversity of different embedding versions to extract higher quality sentence features and thereby improve sentence classification performance.

The letters "M" and "V" in the name "MVCNN" of our architecture denote the multichannel and variable-size convolution filters, respectively. "Multichannel" employs language from computer vision where a color image has red, green and blue channels. Here, a channel is a description by an embedding version.

For many sentence classification tasks, only relatively small training sets are available. MVCNN has a large number of parameters, so that overfitting is a danger when they are trained on small training sets. We address this problem by pretraining MVCNN on unlabeled data. These pretrained weights can then be fine-tuned for the specific classification task.

In sum, we attribute the success of MVCNN to: (i) designing variable-size convolution filters to extract variable-range features of sentences and (ii) exploring the combination of multiple public embedding versions to initialize words in sentences. We also employ two "tricks" to further enhance system performance: mutual learning and pretraining.

In the remaining parts, Section 2 presents related work. Section 3 gives details of our classification model. Section 4 introduces two tricks that enhance system performance: mutual-learning and pretraining. Section 5 reports experimental results. Section 6 concludes this work.

2 Related Work

Much prior work has exploited deep neural networks to model sentences.

Blacoe and Lapata (2012) represented a sentence by element-wise addition, multiplication, or a recursive autoencoder over embeddings of component single words.

Yin and Schütze (2014) extended this approach by composing on words and phrases instead of only single words.

Collobert and Weston (2008) and Yu et al. (2014) used one layer of convolution over phrases detected by a sliding window on a target sentence, then used max- or average-pooling to form a sentence representation.

Kalchbrenner et al. (2014) stacked multiple layers of one-dimensional convolution by dynamic k-max pooling to model sentences. We also adopt dynamic k-max pooling, while our convolution layer has variable-size filters.

Kim (2014) also studied multichannel representation and variable-size filters. However, their multichannel setup relies on a single version of pretrained embeddings (i.e., pretrained Word2Vec embeddings) with two copies: one is kept stable and the other one is fine-tuned by backpropagation. We develop this insight by incorporating diverse embedding versions. Additionally, their idea of variable-size filters is further developed.

Le and Mikolov (2014) initialized the representation of a sentence as a parameter vector, treating it as a global feature and combining this vector with the representations of context words to do word prediction. Finally, this fine-tuned vector is used as the representation of the sentence. Apparently, this method can only produce generic sentence representations which encode no task-specific features.

Our work is also inspired by studies that compared the performance of different word embedding versions or investigated the combination of them. For example, Turian et al. (2010) compared Brown clusters, C&W embeddings and HLBL embeddings in NER and chunking tasks. They found that Brown clusters and word embeddings both can improve the accuracy of supervised NLP systems, and demonstrated empirically that combining different word representations is beneficial. Luo et al. (2014) adapted CBOW (Mikolov et al., 2013a) to train word embeddings on different datasets: free text documents from Wikipedia, search click-through data and user query data, showing that combining them gets stronger results than using individual word embeddings in web search ranking and the word similarity task. However, these two papers either learned word representations on the same corpus (Turian et al., 2010) or enhanced the embedding quality by extending training corpora, not learning algorithms (Luo et al., 2014).


In our work, there is no limit to the type of embedding versions we can use, and they leverage not only the diversity of corpora, but also the different principles of the learning algorithms.

3 Model Description

We now describe the architecture of our model MVCNN, illustrated in Figure 1.

Multichannel Input. The input of MVCNN includes multichannel feature maps of a considered sentence, each of which is a matrix initialized by a different embedding version. Let s be the sentence length, d the dimension of word embeddings and c the total number of different embedding versions (i.e., channels). Hence, the whole initialized input is a three-dimensional array of size c × d × s. Figure 1 depicts a sentence with s = 12 words. Each word is initialized by c = 5 embeddings, each coming from a different channel. In implementation, sentences in a mini-batch will be padded to the same length, and unknown words for the corresponding channel are randomly initialized or can acquire good initialization from the mutual-learning phase described in the next section.

Multichannel initialization brings two advantages: 1) a frequent word can have c representations in the beginning (instead of only one), which means it has more available information to leverage; 2) a rare word missed in some embedding versions can be "made up" by others (we call it a "partially known word"). Therefore, this kind of initialization is able to make use of information about partially known words, without having to employ full random initialization or removal of unknown words. The vocabulary of the binary sentiment prediction task described in the experimental part contains 5232 words unknown in HLBL embeddings, 4273 in Huang embeddings, 3299 in GloVe embeddings, 4136 in SENNA embeddings and 2257 in Word2Vec embeddings. But only 1824 words find no embedding from any channel! Hence, multichannel initialization can considerably reduce the number of unknown words.
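The construction of one multichannel input can be sketched as follows; the lookup tables, the helper name, and the random fallback scale are illustrative assumptions rather than the authors' implementation.

    # Sketch of building one c x d x s multichannel input from several embedding versions.
    import numpy as np

    def build_input(tokens, embedding_versions, d, seed=3):
        """tokens: list of s words; embedding_versions: list of c dicts mapping word -> (d,) vector."""
        rng = np.random.default_rng(seed)
        c, s = len(embedding_versions), len(tokens)
        x = np.empty((c, d, s))
        for i, emb in enumerate(embedding_versions):
            for n, w in enumerate(tokens):
                # Unknown words are randomly initialized (or later filled by mutual-learning).
                x[i, :, n] = emb.get(w, rng.normal(0.0, 0.1, size=d))
        return x  # channel i holds the sentence as described by embedding version i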

Convolution Layer (Conv). For convenience, we first introduce how this work uses a convolution layer on one input feature map to generate one higher-level feature map. Given a sentence of length s: w1, w2, ..., ws, where w_i ∈ R^d denotes the embedding of word w_i, a convolution layer uses sliding filters to extract local features of that sentence.

Figure 1: MVCNN: supervised classification and pretraining.

The filter width l is a parameter. We first concatenate the initialized embeddings of l consecutive words (w_{i−l+1}, ..., w_i) as c_i ∈ R^{ld} (1 ≤ i < s + l), then generate the feature value of this phrase as p_i (the whole vector p ∈ R^{s+l−1} contains all the local features) using a tanh activation function and a linear projection vector v ∈ R^{ld} as:

p_i = \tanh(v^{\top} c_i + b)   (1)
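The following sketch implements Equation 1 for a single filter on one channel; the zero padding makes it a wide convolution producing s + l − 1 values. Array shapes and the function name are assumptions made for illustration.

    # Sketch of Equation 1: one filter of width l slides over a zero-padded sentence.
    import numpy as np

    def wide_convolution(W, v, b, l):
        """W: (d, s) one input channel; v: (l*d,) filter weights; returns (s + l - 1,) features."""
        d, s = W.shape
        padded = np.hstack([np.zeros((d, l - 1)), W, np.zeros((d, l - 1))])
        p = np.empty(s + l - 1)
        for i in range(s + l - 1):
            c_i = padded[:, i:i + l].T.reshape(-1)   # concatenation of l consecutive word vectors
            p[i] = np.tanh(v @ c_i + b)
        return p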

More generally, the convolution operation can deal with multiple input feature maps and can be stacked to yield feature maps of increasing layers. In each layer, there are usually multiple filters of the same size, but with different weights (Kalchbrenner et al., 2014). We refer to a filter with a specific set of weights as a kernel. The goal is often to train a model in which different kernels detect different kinds of features of a local region. However, this traditional way cannot detect the features of regions of different granularity.


Hence we keep the property of multiple kernels while extending it to variable sizes in the same layer.

As in CNNs for object recognition, to increase the number of kernels of a certain layer, multiple feature maps may be computed in parallel at the same layer. Further, to increase the size diversity of kernels in the same layer, more feature maps containing various-range dependency features can be learned. We denote a feature map of the i-th layer by F_i, and assume a total of n feature maps exist in layer i−1: F^1_{i−1}, ..., F^n_{i−1}. Considering a specific filter size l in layer i, each feature map F^j_{i,l} is computed by convolving a distinct set of filters of size l, arranged in a matrix V^{j,k}_{i,l}, with each feature map F^k_{i−1} and summing the results:

F^{j}_{i,l} = \sum_{k=1}^{n} V^{j,k}_{i,l} * F^{k}_{i-1}   (2)

where * indicates the convolution operation and j is the index of a feature map in layer i. The weights in V form a rank-4 tensor.

Note that we use wide convolution in this work: word representations w_g for g ≤ 0 or g ≥ s + 1 are actually zero embeddings. Wide convolution ensures that each word can be detected by all filter weights in V.

In Figure 1, the first convolution layer deals with an input with n = 5 feature maps.1 Its filters have sizes 3 and 5 respectively (i.e., l = 3, 5), and each filter has j = 3 kernels. This means this convolution layer can detect three kinds of features of phrases with length 3 and 5, respectively.

DCNN in (Kalchbrenner et al., 2014) used one-dimensional convolution: each higher-order feature is produced from values of a single dimension in the lower-layer feature map. Even though that work proposed a folding operation to model the dependencies between adjacent dimensions, this type of dependency modeling is still limited. Differently, convolution in the present work is able to model dependency across dimensions as well as adjacent words, which obviates the need for a folding step. This change also means our model has substantially fewer parameters than the DCNN, since the output of each convolution layer is smaller by a factor of d.

1 A reviewer expresses surprise at such a small number of maps. However, we will use four variable sizes (see below), so that the overall number of maps is 20. We use a small number of maps partly because training times for a network are on the order of days, so limiting the number of parameters is important.

Dynamic k-max Pooling. Kalchbrenner et al. (2014) pool the k most active features, compared with simple max (1-max) pooling (Collobert and Weston, 2008). This property enables it to connect multiple convolution layers to form a deep architecture to extract high-level abstract features. In this work, we directly use it to extract features for variable-size feature maps.

For a given feature map in layer i, dynamic k-max pooling extracts the k_i top values from each dimension, and k_top top values in the top layer. We set

k_i = \max\left(k_{top}, \left\lceil \frac{L - i}{L}\, s \right\rceil\right)   (3)

where i ∈ {1, 2, ..., L} is the order of the convolution layer from bottom to top in Figure 1; L is the total number of convolution layers; k_top is a constant determined empirically; we set it to 4 as in Kalchbrenner et al. (2014).
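Equation 3 and the pooling step it feeds can be sketched as follows; the function names are hypothetical, and the pooling keeps the selected values in their original left-to-right order, which is the usual reading of k-max pooling.

    # Sketch of dynamic k-max pooling: compute k_i (Equation 3), then keep each row's
    # k largest values in their original order (assumes k <= feature-map width).
    import math
    import numpy as np

    def dynamic_k(i, L, s, k_top=4):
        """k for convolution layer i of L, given sentence length s (Equation 3)."""
        return max(k_top, math.ceil((L - i) / L * s))

    def k_max_pooling(F, k):
        """F: (rows, width) feature map -> (rows, k)."""
        idx = np.sort(np.argsort(F, axis=1)[:, -k:], axis=1)   # positions of the k largest values
        return np.take_along_axis(F, idx, axis=1)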

As a result, the second convolution layer in Figure 1 has an input with two same-size feature maps, one resulting from filter size 3, one from filter size 5. The values in the two feature maps are for phrases of different granularity. The motivation for this convolution layer is that a feature reflected by a short phrase may not be trustworthy while the longer phrase containing it is trustworthy, or the long phrase has no trustworthy feature while its component short phrase is more reliable. This and even higher-order convolution layers can therefore make a trade-off between the features of different granularity.

Hidden Layer. On top of the final k-max pooling layer, we stack a fully connected layer to learn a sentence representation with a given dimension (e.g., d).

Logistic Regression Layer. Finally, the sentence representation is forwarded into a logistic regression layer for classification.

In brief, our MVCNN model learns from Kalchbrenner et al. (2014) to use dynamic k-max pooling to stack multiple convolution layers, and gets insight from Kim (2014) to investigate variable-size filters in a convolution layer. Compared to Kalchbrenner et al. (2014), MVCNN has rich feature maps as input and as output of each convolution layer. Its convolution operation is not only more flexible in extracting features of variable-range phrases, but also able to model dependency among all dimensions of representations.


MVCNN extends the network in Kim (2014) by a hierarchical convolution architecture and further exploration of multichannel and variable-size feature detectors.

4 Model Enhancements

This part introduces two training tricks that enhance the performance of MVCNN in practice.

Mutual-Learning of Embedding Versions. One observation in using multiple embedding versions is that they have different vocabulary coverage. An unknown word in one embedding version may be a known word in another version. Thus, there exists a proportion of words that can only be partially initialized by certain versions of word embeddings, which means these words lack the description from other versions.

To alleviate this problem, we design a mutual-learning regime to predict representations of un-known words for each embedding version bylearning projections between versions. As a result,all embedding versions have the same vocabulary.This processing ensures that more words in eachembedding version receive a good representation,and is expected to give most words occurring in aclassification dataset more comprehensive initial-ization (as opposed to just being randomly initial-ized).

Let c be the number of embedding versions under consideration, $V_1, V_2, \ldots, V_i, \ldots, V_c$ their vocabularies, $V^* = \bigcup_{i=1}^{c} V_i$ their union, and $V_i^- = V^* \setminus V_i$ ($i = 1, \ldots, c$) the vocabulary of unknown words for embedding version i. Our goal is to learn embeddings for the words in $V_i^-$ using knowledge from the other $c-1$ embedding versions.

We use the overlapping vocabulary between $V_i$ and $V_j$, denoted $V_{ij}$, as the training set, formalizing a projection $f_{ij}$ from space $V_i$ to space $V_j$ ($i \neq j$; $i, j \in \{1, 2, \ldots, c\}$) as follows:

$$\hat{w}_j = M_{ij} w_i \qquad (4)$$

where $M_{ij} \in \mathbb{R}^{d \times d}$, $w_i \in \mathbb{R}^d$ denotes the representation of word w in space $V_i$, and $\hat{w}_j$ is the projected (or learned) representation of word w in space $V_j$. The squared error between $w_j$ and $\hat{w}_j$ is the training loss to minimize. We write $\hat{w}_j = f_{ij}(w_i)$ as a shorthand for Equation 4. In total, $c(c-1)/2$ projections $f_{ij}$ are trained, each on the vocabulary intersection $V_{ij}$.

Let w be a word that is unknown in $V_i$, but known in $V_1, V_2, \ldots, V_k$. To compute an embedding for w in $V_i$, we first compute the k projections $f_{1i}(w_1), f_{2i}(w_2), \ldots, f_{ki}(w_k)$ from the source spaces $V_1, V_2, \ldots, V_k$ to the target space $V_i$. Then, the element-wise average of $f_{1i}(w_1), f_{2i}(w_2), \ldots, f_{ki}(w_k)$ is treated as the representation of w in $V_i$. Our motivation is that, assuming there is a true representation of w in $V_i$ (e.g., the one we would have obtained by training embeddings on a much larger corpus) and assuming the projections were learned well, we would expect all the projected vectors to be close to the true representation. Also, each source space contributes potentially complementary information. Hence, averaging them balances the knowledge from all source spaces.
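A minimal sketch of this mutual-learning step is given below, assuming each embedding version is a dict from words to d-dimensional numpy vectors. The paper only specifies that the squared error of Equation 4 is minimized, so the ridge-regularized closed-form fit and the names learn_projection and impute_unknown are our own illustrative choices.

```python
import numpy as np

def learn_projection(src, tgt, lam=1e-3):
    """Fit M so that M @ w_src approximates w_tgt on the shared vocabulary
    V_ij, minimizing squared error (ridge-regularized least squares)."""
    shared = sorted(set(src) & set(tgt))
    X = np.stack([src[w] for w in shared])   # |V_ij| x d source vectors
    Y = np.stack([tgt[w] for w in shared])   # |V_ij| x d target vectors
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y).T  # d x d

def impute_unknown(word, sources, projections):
    """Element-wise average of the projections of `word` from every source
    space that knows it into the target space."""
    proj = [projections[i] @ sources[i][word]
            for i in range(len(sources)) if word in sources[i]]
    return np.mean(proj, axis=0) if proj else None
```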

As discussed in Section 3, we found that for the binary sentiment classification dataset, many words were unknown in at least one embedding version. But of these words, a total of 5022 did have coverage in another embedding version and so benefit from mutual-learning. In the experiments, we will show that this is a very effective method for learning representations of unknown words, and that it increases system performance when the learned representations are used for initialization.

Pretraining. Sentence classification systems are usually implemented as supervised training regimes where the training loss is computed between the true and the predicted label distributions. In this work, we use pretraining on the unlabeled data of each task and show that it can increase the performance of classification systems.

Figure 1 shows our pretraining setup. The "sentence representation", i.e. the output of the "Fully connected" hidden layer, is used to predict the component words ("on" in the figure) in the sentence (instead of predicting the sentence label Y/N as in supervised learning). Concretely, the sentence representation is averaged with the representations of some surrounding words ("the", "cat", "sat", "the", "mat", "," in the figure) to predict the middle word ("on").

Given the sentence representation $s \in \mathbb{R}^d$ and initialized representations of the 2t context words (t left words and t right words) $w_{i-t}, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{i+t}$, with $w_i \in \mathbb{R}^d$, we average the total of $2t+1$ vectors element-wise, depicted as the "Average" operation in Figure 1. This resulting vector is then treated as a predicted representation of the middle word and is used to find the true middle word by means of noise-contrastive estimation (NCE) (Mnih and Teh, 2012). For each true example, 10 noise words are sampled.
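The prediction step can be sketched as follows; the array names, the output embedding matrix and the simplified NCE objective (which omits the noise-distribution correction term of Mnih and Teh (2012)) are our own assumptions, not details given in the paper.

```python
import numpy as np

def predict_vector(s, context_vectors):
    """Element-wise average of the sentence vector and the 2t context vectors."""
    return np.mean(np.vstack([s] + context_vectors), axis=0)

def nce_loss(pred, target_id, out_emb, noise_ids):
    """Binary logistic loss: true middle word vs. 10 sampled noise words.
    `out_emb` is the output ("NCE") embedding matrix, vocab_size x d."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    pos = -np.log(sigmoid(out_emb[target_id] @ pred) + 1e-12)
    neg = -np.sum(np.log(1.0 - sigmoid(out_emb[noise_ids] @ pred) + 1e-12))
    return pos + neg
```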

Note that in pretraining, there are three places where each word needs initialization. (i) Each word in the sentence is initialized in the "Multichannel input" layer to the whole network. (ii) Each context word is initialized as input to the average layer ("Average" in the figure). (iii) Each target word is initialized as the output of the "NCE" layer ("on" in the figure). In this work, we use multichannel initialization for case (i) and random initialization for cases (ii) and (iii). Only the fine-tuned multichannel representations (case (i)) are kept for subsequent supervised training.

The rationale for this pretraining is similar to that of an autoencoder: for an object composed of smaller-granularity elements, the representations of the whole object and of its components can learn from each other. The CNN architecture learns sentence features layer by layer, and those features are then justified by all constituent words.

During pretraining, all model parameters, including the multichannel input, the convolution parameters and the fully connected layer, are updated until they are mature enough to extract the sentence features. Subsequently, the same sets of parameters are fine-tuned for the supervised classification tasks.

In sum, this pretraining is designed to produce good initial values for both model parameters and word embeddings. It is especially helpful for pretraining the embeddings of unknown words.

5 Experiments

We test the network on four classification tasks. We begin by specifying aspects of the implementation and the training of the network. We then report the results of the experiments.

5.1 Hyperparameters and Training

In each of the experiments, the top of the network is a logistic regression that predicts the probability distribution over classes given the input sentence. The network is trained to minimize the cross-entropy of the predicted and true distributions; the objective includes an L2 regularization term over the parameters. The set of parameters comprises the word embeddings, all filter weights and the weights in the fully connected layers. A dropout operation (Hinton et al., 2012) is applied before the logistic regression layer. The network is trained by back-propagation in mini-batches, and the gradient-based optimization is performed using the AdaGrad update rule (Duchi et al., 2011).

In all datasets, the initial learning rate is 0.01, the dropout probability is 0.8, the L2 weight is $5 \cdot 10^{-3}$, and the batch size is 50. In each convolution layer, the filter sizes are {3, 5, 7, 9} and each filter has five kernels (independent of filter size).
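For concreteness, a sketch of the AdaGrad update with the reported learning rate and L2 weight is shown below; the variable names and the exact placement of the L2 term are our own assumptions.

```python
import numpy as np

LR, L2 = 0.01, 5e-3   # initial learning rate and L2 weight from Section 5.1

def adagrad_step(param, grad, cache, eps=1e-8):
    """One AdaGrad update (Duchi et al., 2011) with L2 regularization;
    param, grad and cache are numpy arrays of the same shape."""
    grad = grad + L2 * param           # gradient of the L2 penalty
    cache += grad ** 2                 # accumulate squared gradients
    param -= LR * grad / (np.sqrt(cache) + eps)
    return param, cache
```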

5.2 Datasets and Experimental Setup

Standard Sentiment Treebank (Socher et al., 2013). This small-scale dataset includes two tasks for predicting the sentiment of movie reviews. The output variable is binary in one experiment and can have five possible outcomes in the other: {negative, somewhat negative, neutral, somewhat positive, positive}. In the binary case, we use the given split of 6920 training, 872 development and 1821 test sentences. Likewise, in the fine-grained case, we use the standard 8544/1101/2210 split. Socher et al. (2013) used the Stanford Parser (Klein and Manning, 2003) to parse each sentence into subphrases. The subphrases were then labeled by human annotators in the same way as the sentences were labeled. Labeled phrases that occur as subparts of the training sentences are treated as independent training instances, as in (Le and Mikolov, 2014; Kalchbrenner et al., 2014).

Sentiment140² (Go et al., 2009). This is a large-scale dataset of tweets for sentiment classification, where a tweet is automatically labeled as positive or negative depending on the emoticon that occurs in it. The training set consists of 1.6 million tweets with emoticon-based labels and the test set of about 400 hand-annotated tweets. We preprocess the tweets minimally as follows. 1) The equivalence class symbol "url" (resp. "username") replaces all URLs (resp. all words that start with the @ symbol, e.g., @thomasss). 2) A sequence of k > 2 repetitions of a letter c (e.g., "cooooooool") is replaced by two occurrences of c (e.g., "cool"). 3) All tokens are lowercased.
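The three preprocessing rules can be written compactly as below; the regular expressions and the function name are our own illustration of the rules, not the authors' script.

```python
import re

def preprocess_tweet(text):
    text = re.sub(r"https?://\S+|www\.\S+", "url", text)  # 1) URLs -> "url"
    text = re.sub(r"@\w+", "username", text)              # 1) @words -> "username"
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)            # 2) >2 letter repeats -> 2
    return text.lower()                                   # 3) lowercase

print(preprocess_tweet("Cooooooool!! see http://t.co/xyz @thomasss"))
# -> "cool!! see url username"
```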

Subj. The subjectivity classification dataset³ released by Pang and Lee (2004) has 5000 subjective sentences and 5000 objective sentences. We report the result of 10-fold cross-validation, as the baseline systems did.

5.2.1 Pretrained Word Vectors

In this work, we use five embedding versions, as shown in Table 1, to initialize words. Four of them are directly downloaded from the Internet.

² http://help.sentiment140.com/for-students
³ http://www.cs.cornell.edu/people/pabo/movie-review-data/


Set       Training Data                    Vocab Size  Dimensionality  Source
HLBL      Reuters English newswire         246,122     50              download
Huang     Wikipedia (April 2010 snapshot)  100,232     50              download
Glove     Twitter                          1,193,514   50              download
SENNA     Wikipedia                        130,000     50              download
Word2Vec  English Gigawords                418,129     50              trained from scratch

Table 1: Description of five versions of word embedding.

             Binary  Fine-grained  Senti140  Subj
HLBL         5,232   5,562         344,632   8,621
Huang        4,273   4,523         327,067   6,382
Glove        3,299   3,485         257,376   5,237
SENNA        4,136   4,371         323,501   6,162
W2V          2,257   2,409         288,257   4,217
Voc size     18,876  19,612        387,877   23,926
Full hit     12,030  12,357        30,010    13,742
Partial hit  5,022   5,312         121,383   6,580
No hit       1,824   1,943         236,484   3,604

Table 2: Statistics of five embedding versions for four tasks. The first block with five rows provides the number of unknown words of each task when using the corresponding version to initialize. Voc size: vocabulary size. Full hit: embedding in all 5 versions. Partial hit: embedding in 1–4 versions. No hit: not present in any of the 5 versions.

(i) HLBL. The hierarchical log-bilinear model presented by Mnih and Hinton (2009) and released by Turian et al. (2010);⁴ size: 246,122 word embeddings; training corpus: RCV1 corpus, one year of Reuters English newswire from August 1996 to August 1997. (ii) Huang.⁵ Huang et al. (2012) incorporated global context to deal with the challenges raised by words with multiple meanings; size: 100,232 word embeddings; training corpus: April 2010 snapshot of Wikipedia. (iii) GloVe.⁶ Size: 1,193,514 word embeddings; training corpus: a Twitter corpus of 2B tweets with 27B tokens. (iv) SENNA.⁷ Size: 130,000 word embeddings; training corpus: Wikipedia. Note that we use their 50-dimensional embeddings. (v) Word2Vec. It has no 50-dimensional embeddings available online. We use the released code⁸ to train skip-gram on the English Gigaword Corpus (Parker et al., 2009) with this setup: window size 5, negative sampling, sampling rate $10^{-3}$, 12 threads. It is worth emphasizing that the above embedding sets are derived from different corpora with different algorithms. This is the very property that we want to exploit to improve system performance.

⁴ http://metaoptimize.com/projects/wordreprs/
⁵ http://ai.stanford.edu/~ehhuang/
⁶ http://nlp.stanford.edu/projects/glove/
⁷ http://ml.nec-labs.com/senna/
⁸ http://code.google.com/p/word2vec/

Table 2 shows the number of unknown words in each task when using the corresponding embedding version for initialization (rows "HLBL", "Huang", "Glove", "SENNA", "W2V"), the number of words fully initialized by the five embedding versions ("Full hit" row), the number of words partially initialized ("Partial hit" row) and the number of words that cannot be initialized by any of the embedding versions ("No hit" row).

About 30% of the words in each task have partially initialized embeddings, and our mutual-learning is able to initialize the missing embeddings through projections. Pretraining is expected to learn good representations for all words, but pretraining is especially important for words without initialization ("no hit"); a particularly clear example is the Senti140 task: 236,484 of 387,877 words, or 61%, are in the "no hit" category.

5.2.2 Results and Analysis

Table 3 compares test set results of MVCNN and its variants with other baselines on the four sentence classification tasks. Row 34, "MVCNN (overall)", shows the performance of the best configuration of MVCNN, optimized on dev. This version uses five versions of word embeddings, four filter sizes (3, 5, 7, 9), both mutual-learning and pretraining, three convolution layers for the Senti140 task and two convolution layers for the other tasks. Overall, our system gets the best results, beating all baselines.

The table contains five blocks from top to bottom. Each block investigates one specific configurational aspect of the system. All results in the five blocks are with respect to row 34, "MVCNN (overall)".


Model                                        Binary  Fine-grained  Senti140  Subj

baselines
 1  RAE (Socher et al., 2011b)                82.4    43.2          –         –
 2  MV-RNN (Socher et al., 2012)              82.9    44.4          –         –
 3  RNTN (Socher et al., 2013)                85.4    45.7          –         –
 4  DCNN (Kalchbrenner et al., 2014)          86.8    48.5          87.4      –
 5  Paragraph-Vec (Le and Mikolov, 2014)      87.7    48.7          –         –
 6  CNN-rand (Kim, 2014)                      82.7    45.0          –         89.6
 7  CNN-static (Kim, 2014)                    86.8    45.5          –         93.0
 8  CNN-non-static (Kim, 2014)                87.2    48.0          –         93.4
 9  CNN-multichannel (Kim, 2014)              88.1    47.4          –         93.2
10  NBSVM (Wang and Manning, 2012)            –       –             –         93.2
11  MNB (Wang and Manning, 2012)              –       –             –         93.6
12  G-Dropout (Wang and Manning, 2013)        –       –             –         93.4
13  F-Dropout (Wang and Manning, 2013)        –       –             –         93.6
14  SVM (Go et al., 2009)                     –       –             81.6      –
15  BINB (Go et al., 2009)                    –       –             82.7      –
16  MAX-TDNN (Kalchbrenner et al., 2014)      –       –             78.8      –
17  NBOW (Kalchbrenner et al., 2014)          –       –             80.9      –
18  MAXENT (Go et al., 2009)                  –       –             83.0      –

versions
19  MVCNN (-HLBL)                             88.5    48.7          88.0      93.6
20  MVCNN (-Huang)                            89.2    49.2          88.1      93.7
21  MVCNN (-Glove)                            88.3    48.6          87.4      93.6
22  MVCNN (-SENNA)                            89.3    49.1          87.9      93.4
23  MVCNN (-Word2Vec)                         88.4    48.2          87.6      93.4

filters
24  MVCNN (-3)                                89.1    49.2          88.0      93.6
25  MVCNN (-5)                                88.7    49.0          87.5      93.4
26  MVCNN (-7)                                87.8    48.9          87.5      93.1
27  MVCNN (-9)                                88.6    49.2          87.8      93.3

tricks
28  MVCNN (-mutual-learning)                  88.2    49.2          87.8      93.5
29  MVCNN (-pretraining)                      87.6    48.9          87.6      93.2

layers
30  MVCNN (1)                                 89.0    49.3          86.8      93.8
31  MVCNN (2)                                 89.4    49.6          87.6      93.9
32  MVCNN (3)                                 88.6    48.6          88.2      93.1
33  MVCNN (4)                                 87.9    48.2          88.0      92.4
34  MVCNN (overall)                           89.4    49.6          88.2      93.9

Table 3: Test set results of our CNN model against other methods. RAE: Recursive Autoencoders with pretrained word embeddings from Wikipedia (Socher et al., 2011b). MV-RNN: Matrix-Vector Recursive Neural Network with parse trees (Socher et al., 2012). RNTN: Recursive Neural Tensor Network with tensor-based feature function and parse trees (Socher et al., 2013). DCNN, MAX-TDNN, NBOW: Dynamic Convolution Neural Network with k-max pooling, Time-Delay Neural Network with max-pooling (Collobert and Weston, 2008), Neural Bag-of-Words Model (Kalchbrenner et al., 2014). Paragraph-Vec: Logistic regression on top of paragraph vectors (Le and Mikolov, 2014). SVM, BINB, MAXENT: Support Vector Machines, Naive Bayes with unigram and bigram features, Maximum Entropy (Go et al., 2009). NBSVM, MNB: Naive Bayes SVM and Multinomial Naive Bayes with uni-bigrams from Wang and Manning (2012). CNN-rand/static/multichannel/nonstatic: CNN with word embeddings randomly initialized / initialized with pretrained vectors and kept static during training / initialized with two copies (each is a "channel") of pretrained embeddings / initialized with pretrained embeddings and fine-tuned during training (Kim, 2014). G-Dropout, F-Dropout: Gaussian Dropout and Fast Dropout from Wang and Manning (2013). The minus sign "-" in MVCNN (-Huang) etc. means "Huang" is not used. "versions / filters / tricks / layers" denote the MVCNN variants with different setups: discard a certain embedding version / discard a certain filter size / discard mutual-learning or pretraining / different numbers of convolution layers.


For example, row 19 shows what happens when HLBL is removed from row 34, row 28 shows what happens when mutual-learning is removed from row 34, etc.

The block "baselines" (1–18) lists systems representative of previous work on the corresponding datasets, including the state-of-the-art systems (marked in italics). The block "versions" (19–23) shows the results of our system when one of the embedding versions was not used during training. We want to explore to what extent different embedding versions contribute to performance. The block "filters" (24–27) gives the results when an individual filter width is discarded. It also tells us how much influence a filter of a specific size has. The block "tricks" (28–29) shows the system performance when no mutual-learning or no pretraining is used. The block "layers" (30–33) demonstrates how the system performs with different numbers of convolution layers.

From the "layers" block, we can see that our system performs best with two convolution layers on the Standard Sentiment Treebank and Subjectivity Classification tasks (row 31), but with three convolution layers on Sentiment140 (row 32). This is probably because Sentiment140 is a much larger dataset; in such a case deeper neural networks are beneficial.

The block "tricks" demonstrates the effect of mutual-learning and pretraining. Apparently, pretraining has a bigger impact on performance than mutual-learning. We speculate that this is because pretraining can influence more words, and all learned word embeddings are tuned on the dataset after pretraining.

The block "filters" indicates the contribution of each filter size. The system benefits from filters of each size. Sizes 5 and 7 are most important for high performance, especially 7 (rows 25 and 26).

In the block "versions", we see that each embedding version is crucial for good performance: performance drops in every single case. Though it is not easy to compare different embedding versions fairly in NLP tasks, especially when those embeddings were trained on different corpora of different sizes using different algorithms, our results are potentially instructive for researchers making a decision about which embeddings to use for their own tasks.

6 Conclusion

This work presented MVCNN, a novel CNN architecture for sentence classification. It combines multichannel initialization – diverse versions of pretrained word embeddings are used – and variable-size filters – features of multigranular phrases are extracted with variable-size convolution filters. We demonstrated that multichannel initialization and variable-size filters enhance system performance on sentiment classification and subjectivity classification tasks.

7 Future Work

As pointed out by the reviewers, the success of the multichannel approach is likely due to a combination of several quite different effects.

First, there is the effect of the embedding learning algorithm. These algorithms differ in many aspects, including their sensitivity to word order (e.g., SENNA: yes, word2vec: no), their objective function and their treatment of ambiguity (explicitly modeled only by Huang et al. (2012)).

Second, there is the effect of the corpus. We would expect the size and genre of the corpus to have a big effect, even though we did not analyze this effect in this paper.

Third, the complementarity of word embeddings is likely to be more useful for some tasks than for others. Sentiment is a good application for complementary word embeddings because solving this task requires drawing on heterogeneous sources of information, including syntax, semantics and genre as well as the core polarity of a word. Other tasks like part-of-speech (POS) tagging may benefit less from heterogeneity since the benefit of embeddings in POS tagging often comes down to making a correct choice between two alternatives; a single embedding version may be sufficient for this.

We plan to pursue these questions in future work.

Acknowledgments

Thanks to CIS members and the anonymous reviewers for constructive comments. This work was supported by Baidu (through a Baidu scholarship awarded to Wenpeng Yin) and by Deutsche Forschungsgemeinschaft (grant DFG SCHU 2246/8-2, SPP 1335).


References

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155.

William Blacoe and Mirella Lapata. 2012. A comparison of vector-based representations for semantic composition. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 546–556. Association for Computational Linguistics.

Yanqing Chen, Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2013. The expressive power of word embeddings. In ICML Workshop on Deep Learning for Audio, Speech, and Language Processing.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.

Misha Denil, Alban Demiraj, Nal Kalchbrenner, Phil Blunsom, and Nando de Freitas. 2014. Modelling, visualising and summarising documents with a single convolutional neural network. arXiv preprint arXiv:1406.3830.

Cícero Nogueira Dos Santos and Maíra Gatti. 2014. Deep convolutional neural networks for sentiment analysis of short texts. In Proceedings of the 25th International Conference on Computational Linguistics.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159.

Alec Go, Richa Bhayani, and Lei Huang. 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, pages 1–12.

Alex Graves, A-R Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing, 2013 IEEE International Conference on, pages 6645–6649. IEEE.

Felix Hill, KyungHyun Cho, Sébastien Jean, Coline Devin, and Yoshua Bengio. 2014. Not all neural embeddings are born equal. In NIPS Workshop on Learning Semantics.

Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.

Eric H Huang, Richard Socher, Christopher D Manning, and Andrew Y Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 873–882. Association for Computational Linguistics.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, October.

Dan Klein and Christopher D Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, pages 423–430. Association for Computational Linguistics.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.

Quoc V Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning.

Yong Luo, Jian Tang, Jun Yan, Chao Xu, and Zheng Chen. 2014. Pre-trained multi-view word embedding using two-side neural network. In Twenty-Eighth AAAI Conference on Artificial Intelligence.

Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pages 1045–1048.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In Proceedings of Workshop at ICLR.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Tomas Mikolov. 2012. Statistical language models based on neural networks. Presentation at Google, Mountain View, 2nd April.


Andriy Mnih and Geoffrey E Hinton. 2009. A scalable hierarchical distributed language model. In Advances in Neural Information Processing Systems, pages 1081–1088.

Andriy Mnih and Yee Whye Teh. 2012. A fast and simple algorithm for training neural probabilistic language models. In Proceedings of the 29th International Conference on Machine Learning, pages 1751–1758.

Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, page 271. Association for Computational Linguistics.

Robert Parker, Linguistic Data Consortium, et al. 2009. English Gigaword fourth edition. Linguistic Data Consortium.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Empirical Methods in Natural Language Processing, 12.

Richard Socher, Eric H Huang, Jeffrey Pennin, Christopher D Manning, and Andrew Y Ng. 2011a. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems, pages 801–809.

Richard Socher, Jeffrey Pennington, Eric H Huang, Andrew Y Ng, and Christopher D Manning. 2011b. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 151–161. Association for Computational Linguistics.

Richard Socher, Brody Huval, Christopher D Manning, and Andrew Y Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1201–1211. Association for Computational Linguistics.

Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, volume 1631, page 1642. Citeseer.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394. Association for Computational Linguistics.

Sida Wang and Christopher D Manning. 2012. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, pages 90–94. Association for Computational Linguistics.

Sida Wang and Christopher Manning. 2013. Fast dropout training. In Proceedings of the 30th International Conference on Machine Learning, pages 118–126.

Wenpeng Yin and Hinrich Schütze. 2014. An exploration of embeddings for generalized phrases. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Student Research Workshop, pages 41–47.

Lei Yu, Karl Moritz Hermann, Phil Blunsom, and Stephen Pulman. 2014. Deep learning for answer sentence selection. In NIPS Deep Learning Workshop.



Opinion Holder and Target Extraction based on the Induction of Verbal Categories

Michael Wiegand
Spoken Language Systems
Saarland University
D-66123 Saarbrücken, Germany
[email protected]

Josef Ruppenhofer
Dept. of Information Science and Language Technology
Hildesheim University
D-31141 Hildesheim, Germany
[email protected]

Abstract

We present an approach for opinion role induction for verbal predicates. Our model rests on the assumption that opinion verbs can be divided into three different types where each type is associated with a characteristic mapping between semantic roles and opinion holders and targets. In several experiments, we demonstrate the relevance of those three categories for the task. We show that verbs can easily be categorized with semi-supervised graph-based clustering and an appropriate similarity metric. The seeds are obtained through linguistic diagnostics. We evaluate our approach against a new manually compiled opinion role lexicon and perform in-context classification.

1 Introduction

While there has been much research in sentiment analysis on subjectivity detection and polarity classification, there has been less work on the extraction of opinion roles, i.e. entities that express an opinion (opinion holders), and entities or propositions at which sentiment is directed (opinion targets). Previous research relies on large amounts of labeled training data or leverages general semantic resources which are expensive to construct, e.g. FrameNet (Baker et al., 1998).

In this paper, we present an approach to induce the opinion roles of verbal predicates. The input is a set of opinion verbs that can be found in a common sentiment lexicon. Our model rests on the assumption that those verbs can be divided into three different types. Each type has a characteristic mapping between semantic roles and opinion holders and targets. Thus, the problem of opinion role induction is reduced to automatically categorizing opinion verbs.

We frame the task of opinion role extraction as a triple (pred, const, role) where pred is a predicate evoking an opinion (we exclusively focus on opinion verbs), const is some constituent bearing a semantic role assigned by pred, and role is the opinion role that is assigned to const.

Our work assumes knowledge of opinion words. We do not cover polarity classification. Many lexicons with that kind of information already exist. Our sole interest is the assignment of opinion holder and target given some opinion verb. There does not exist any publicly available lexical resource specifically designed for this task.

For the induction of opinion verb types, we consider semi-supervised graph clustering with an appropriate similarity metric. We also propose an effective method for deriving seeds automatically by applying some linguistic diagnostics.

Our approach is evaluated in a supervised learning scenario on a set of sentences with annotated opinion holders and targets. We employ different kinds of features, including features derived from a semantic parser based on FrameNet. We also compare our proposed model based on the three opinion verb types against a new manually compiled lexicon in which the semantic roles of opinion holders and targets for each individual verb have been explicitly enumerated.

We also evaluate our approach in the context of cross-domain opinion holder extraction. Thus we demonstrate the importance of our approach in the context of previous datasets and classifiers.

This is the first work that proposes to induce both opinion holders and targets evoked by opinion verbs with data-driven methods. Unlike previous work, we are able to categorize all verbs of a pre-specified set of opinion verbs. Our approach is a low-resource approach that is also applicable to languages other than English. We demonstrate this on German. A by-product of our study are new resources, including a verb lexicon specifying semantic roles for holders and targets.

2 Lexicon-based Opinion Role Extraction

Opinion holder and target extraction is a hard task (Ruppenhofer et al., 2008). Conventional syntactic or semantic levels of representation do not capture sufficient information to allow a reliable prediction of opinion holders and targets. This is illustrated by (1) and (2), which show that, even with common semantic roles, i.e. agent and patient¹, assigned to the entities, one may not be able to discriminate between the opinion roles.

(1) Peter_agent criticized Mary_patient.
    (criticize, Peter, holder) & (criticize, Mary, target)
(2) Peter_agent disappoints Mary_patient.
    (disappoint, Peter, target) & (disappoint, Mary, holder)

We assume that it is lexical information that decides which semantic role an opinion holder or opinion target takes. As a consequence, we built a gold-standard lexicon for verbs that encodes such information. For example, it states that the target of criticize is its patient, while for disappoint, the target is its agent. This fine-grained lexicon also accounts for the fact that a constituent can have several roles given the same opinion verb. An extreme case is:

(3) [Peter]1 persuades [Mary]2 [to accept his invitation]3.

The sentence conveys that:
• Peter wants Mary to do something. (view1)
• Mary is influenced by Peter. (view2)
• Peter has some attitude towards Mary accepting his invitation. (view3)
• Mary has some attitude towards accepting Peter's invitation. (view4)

This corresponds to the role assignments:
• view1: (persuade, [1], holder), (persuade, [2], target)
• view2: (persuade, [2], holder), (persuade, [1], target)
• view3: (persuade, [1], holder), (persuade, [3], target)
• view4: (persuade, [2], holder), (persuade, [3], target)
(in short: 2 opinion holders and 3 opinion targets).

Our lexicon also includes another dimension neglected in many previous works. Many opinion verbs predominantly express the sentiment of the speaker of the utterance (or some nested source) (4). This concept is also known as expressive subjectivity (Wiebe et al., 2005) or speaker subjectivity (Maks and Vossen, 2012). In such opinions, the opinion holder is not realized as a dependent of the opinion verb.

(4) At my work, [they]1 are constantly gossiping.
    (gossip, speaker, holder) & (gossip, [1], target)

¹ By agent and patient, we mean constituents labeled as A0 and A1 in PropBank (Kingsbury and Palmer, 2002).

Our lexicon covers the 1175 verb lemmas contained in the Subjectivity Lexicon (Wilson et al., 2005). We annotated the semantic roles similarly to the format of PropBank (Kingsbury and Palmer, 2002). The basis of the annotation were online dictionaries (e.g. the Macmillan Dictionary) which provide both a verb definition and example sentences. We do not annotate implicature-related information about effects (Deng and Wiebe, 2014) but inherent sentiment (the data release² includes more details regarding the annotation process and our notion of holders and targets).

On a sample of 400 verbs, we measured an interannotator agreement of Cohen's κ = 60.8 for opinion holders, κ = 62.3 for opinion targets and κ = 59.9 for speaker views. This agreement is mostly substantial (Landis and Koch, 1977).

3 The Three Verb Categories

Rather than induce the opinion roles for individual verbs, we group verbs that share a similar opinion role subcategorization. Thus, the main task for induction is to decide which type an opinion verb belongs to. Once the verb type has been established, the typical semantic roles for opinion holders and targets can be derived from that type. The verb categorization is motivated by the semantic roles of the three common views (Table 1) that an opinion holder can take. In our lexicon, all of the opinion holders were observed with one of these semantic roles. To facilitate induction, we assume that those types are disjoint (see also §3.4).

3.1 Verbs with Agent View (AG)

Verbs with an agent view, such as criticize, love and believe, convey the sentiment of their agent. Therefore, those verbs take the agent as opinion holder and the patient as opinion target. Table 1 also exemplifies semantic role labels as a suitable basis for aligning opinion holders and targets within a particular verb type. For example, targets of AG-verbs align with the patient, yet the patient can take the form of various phrase types (i.e. NPs, PPs or infinitive/complement phrases³).

² Available at: www.coli.uni-saarland.de/~miwieg/conll_2015_op_roles_data.tgz
³ Note that infinitive and complement clauses may represent a semantic role other than patient (e.g. the infinitive clause in (3)). As these types of clauses are fairly unambiguous, we marked them as targets even if they are not patients.


Type  Example                                                      Holder  Target
AG    [They]_agent like [the idea]_patient.                        agent   pat.
      [The guests]_agent complained [about noise]_patient.
      [They]_agent argue [that this plan is infeasible]_patient.
PT    [The noise]_agent irritated [the guests]_patient.            pat.    agent
      [That gift]_agent pleased [her]_patient very much.
SP    [They]_agent cheated [in the exam]_adjunct.                  -N/A-   agent, (pat.)
      [He]_agent besmirched [the King's name]_patient.

Table 1: Verb types for opinion role extraction.

3.2 Verbs with Patient View (PT)

Verbs with a patient view (irritate, upset and disappoint) are the opposite of AG-verbs in that those verbs have the patient as opinion holder and the agent as opinion target.

3.3 Verbs with Speaker View (SP)

The third type we consider comprises all verbs whose perspective is that of the speaker. That is, these are verbs whose sentiment is primarily that of the speaker of the utterance rather than of the persons involved in the action referred to. Typical examples are gossip, improve or cheat.

While the agent is usually the target of the sentiment of the speaker, it depends on the specific verb whether its patient is also a target or not (in Table 1, only the patient of the second SP-verb, i.e. besmirch, is considered a target⁴). Since we aim at a precise induction approach, we will always (and only) mark the agent of an induced SP-verb as a target.

3.4 Relation to Fine-Grained Lexicon

Table 2 provides statistics as to how clear-cut the three prototypical verb types are in the manually compiled fine-grained lexicon. These numbers suggest that many verbs evoke several opinion views (e.g. a verb with an AG-view may also evoke a PT-view). While the fine-grained lexicon is fairly exhaustive in listing semantic roles for opinion holders and targets, it may also occasionally overgenerate. One major reason for this is that we do not annotate on the sense level (word-sense disambiguation (Wiebe and Mihalcea, 2006) is still in its infancy) but on the lemma level. Accordingly, we attribute all views to all senses, whereas actually certain views pertain only to specific senses. However, we found that usually one view is conveyed by most (if not all) senses of a word. For example, the lexicon lists both an AG-view and a PT-view for appease.

⁴ We consider the patient a target since the speaker has a positive (non-defeasible) sentiment towards that entity.

Type                          Freq   Type                          Freq
verbs with AG-view            868    verbs with PT-view            392
verbs with exclusive AG-view  371    verbs with exclusive PT-view  117
verbs with AG- and SP-view    352    verbs with PT- and AG-view    226
verbs with AG- and PT-view    226    verbs with PT- and SP-view    139
verbs with SP-view            537
verbs with exclusive SP-view  134
verbs with SP- and AG-view    352
verbs with SP- and PT-view    139

Table 2: Verb types in the fine-grained lexicon.

         Agent (AG)   Patient (PT)   Speaker (SP)
Freq     450          188            537
Percent  38.3         16.0           45.7

Table 3: Verb types in the coarse-grained lexicon.

This is correct for (5) but wrong for (6). The AG-view is derived from a definition to give your opponents what they want. (6) does not convey an agent's volitional action. Here, the verb just conveys make someone feel less angry. Similarly, the lexicon lists an SP-view and an AG-view for degrade, which is right for (7) but wrong for (8). The AG-view is derived from a lexicon definition to treat someone in a way that makes them stop respecting themselves. (8) does not convey an agent's volitional action. The verb just conveys to make something worse. That is, neither (6) nor (8) evokes an AG-view. We found that these variations occur regularly. We adopt the heuristic that verbs with an SP-view and an AG- or PT-view preserve the SP-view across their uses (7)-(8). Verbs with both a PT- and an AG-view preserve their PT-view (5)-(6). Following these observations, we converted our fine-grained lexicon into a gold-standard coarse-grained lexicon (only 3% of the verbs needed to be manually corrected after the automatic conversion) in which a verb is classified as AG, PT or SP according to its dominant view. The final class distribution of this lexicon is shown in Table 3. In §5.2, we show through an in-context evaluation that our coarse-grained representation preserves most of the information captured by the fine-grained representation.

(5) [Chamberlain]_agent appeased [Hitler]_patient.
(6) [The orange juice]_agent appeased [him]_patient for a while.
(7) [Mary]_agent degrades [Henrietta]_patient.
(8) [This technique]_agent degrades [the local water supply]_patient.

4 Induction of Verb Categories

The task is to categorize each verb as a predominant AG-, PT-, or SP-verb. Our approach comprises two steps. In the first step, seeds for the different verb types are extracted (§4.1).


AG  argue, contend, speculate, fear, doubt, complain, consider, praise, recommend, view, acknowledge, hope
PT  interest, surprise, please, excite, disappoint, delight, impress, shock, trouble, embarrass, annoy, distress
SP  murder, plot, incite, blaspheme, bewitch, bungle, despoil, plagiarize, prevaricate, instigate, molest, conspire

Table 4: The top 12 extracted verb seeds.

In the second step, a similarity metric (§4.2) is employed to propagate the verb type labels from the seeds to the remaining opinion verbs (§4.3). The North American News Text Corpus is used for seed extraction and for the computation of verb similarities.

Wiegand and Klakow (2012) proposed methods for extracting AG- and PT-verbs. We will re-use these methods for generating seeds. A major contribution of this paper is the introduction of the third dimension, i.e. SP-verbs, in the context of induction. We show that in combination with this third dimension, one can categorize all opinion verbs contained in a sentiment lexicon. Furthermore, given this three-way classification, we also obtain better results on the detection of AG-verbs and PT-verbs than by just detecting those verbs in isolation without graph clustering (this will be shown in Table 7 and discussed in §5.1).

A second major contribution of this work is that we show that these methods are equally important for opinion target extraction. So far, the significance of AG- and PT-verbs has only been demonstrated for opinion holder extraction.

In this work, we exclusively focus on the set of 1175 opinion verbs from the Subjectivity Lexicon. However, this is owed solely to the effort required to generate larger sets of evaluation data. In principle, our induction approach is applicable to any set of opinion verbs of arbitrary (e.g. larger) size.

4.1 Pattern-based Seed Initialization

For AG-verbs, we rely on the findings of Wiegand and Klakow (2012), who suggest that verbs predictive of opinion holders can be induced with the help of prototypical opinion holders. These common nouns, e.g. opponents (9) or critics (10), act like opinion holders and can therefore be seen as a proxy. Verbs co-occurring with prototypical opinion holders do not represent the entire range of opinion verbs but coincide with AG-verbs.

(9) Opponents claim these arguments miss the point.
(10) Critics argued that the proposed limits were unconstitutional.

For PT-verbs, we make use of the adjective heuristic proposed by Wiegand and Klakow (2012). The authors make use of the observation that morphologically related adjectives exist for PT-verbs, unlike for AG- and SP-verbs. Therefore, in order to extract PT-verbs, one needs to check whether a verb in its past participle form, such as upset in (11), is identical to some predicate adjective (12).

(11) He had upset_verb me.
(12) I am upset_adj.

We are not aware of any previously published approach effectively inducing SP-verbs. Noticing that many of those verbs contain some form of reproach, we came up with the patterns "accused of X_VBG" and "blamed for X_VBG", as in (13) and (14).

(13) He was accused of falsifying the documents.
(14) The UN was blamed for misinterpreting climate data.

Table 4 lists for each of the verb types the 12 seeds most frequently occurring with the respective patterns. We observed that the SP-verb seeds are exclusively negative polar expressions. That is why we also extracted seeds from an additional pattern, "help to X_VB", producing prototypical positive SP-verbs, such as stabilize, allay or heal.

4.2 Similarity Metrics

4.2.1 Word Embeddings

Recent research in machine learning has focused on inducing vector representations of words. As an example of a competitive word embedding method, we induce vectors for our opinion verbs with Word2Vec (Mikolov et al., 2013). Baroni et al. (2014) showed that this method outperforms count vector representations on a variety of tasks. For the similarity between two verbs, we compute the cosine similarity between their vectors.
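For reference, the cosine similarity over two verb vectors can be computed as in the short sketch below, where `vec` is assumed to map verbs to numpy arrays obtained from the trained word2vec model.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# e.g. sim = cosine(vec["criticize"], vec["praise"])
```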

4.2.2 WordNet::Similarity

We use WordNet::Similarity (Pedersen et al., 2004) as an alternative source of similarity metrics. The metrics are based on WordNet's graph structure (Miller et al., 1990). Various relations within WordNet have been shown to be effective for polarity classification (Esuli and Sebastiani, 2006; Rao and Ravichandran, 2009).

4.2.3 Coordination

Another method to measure similarity is obtained by leveraging coordination. Coordination is known to be a syntactic relation that also preserves great semantic coherence (Ziering et al., 2013), e.g. (15). It has been successfully applied not only to noun categorization (Riloff and Shepherd, 1997; Roark and Charniak, 1998) but also to different tasks in sentiment analysis, including polarity classification (Hatzivassiloglou and McKeown, 1997), the induction of patient polarity verbs (Goyal et al., 2010) and connotation learning (Kang et al., 2014). We use the dependency relation from the Stanford parser (Klein and Manning, 2003) to detect coordination (16).

(15) They criticize and hate him.
(16) conj(criticize, hate)

As a similarity function, we simply take the absolute frequency of observing the two words $w_1$ and $w_2$ in a conjunction, i.e. $sim(w_1, w_2) = freq(conj(w_1, w_2))$.
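A small sketch of this count-based similarity is shown below; `conj_pairs` is assumed to be an iterable of lemma pairs extracted from conj() dependency relations, and the function names are ours.

```python
from collections import Counter

def coordination_similarity(conj_pairs):
    """Build sim(w1, w2) = freq(conj(w1, w2)) from observed conj() pairs."""
    counts = Counter(frozenset(p) for p in conj_pairs)
    return lambda w1, w2: counts[frozenset((w1, w2))]

sim = coordination_similarity([("criticize", "hate"),
                               ("criticize", "hate"),
                               ("love", "admire")])
print(sim("criticize", "hate"))  # -> 2
```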

4.2.4 Dependency-based Similarity

The metric proposed by Lin (1998) exploits the rich set of dependency-relation labels in the context of distributional similarity. Moreover, it has been effectively used for the related task of extending frames of unknown predicates in semantic parsing (Das and Smith, 2011).

The metric is based on dependency triples $(w, r, w')$ where w and w' are words and r is a dependency relation (e.g. (argue-V, nsubj, critics-N)). The metric is defined as:

$$sim(w_1, w_2) = \frac{\sum_{(r,w) \in T(w_1) \cap T(w_2)} \big( I(w_1, r, w) + I(w_2, r, w) \big)}{\sum_{(r,w) \in T(w_1)} I(w_1, r, w) + \sum_{(r,w) \in T(w_2)} I(w_2, r, w)}$$

where $I(w, r, w') = \log \frac{\|w, r, w'\| \times \|\ast, r, \ast\|}{\|w, r, \ast\| \times \|\ast, r, w'\|}$ and $T(w)$ is defined as the set of pairs $(r, w')$ such that $I(w, r, w') > 0$.
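The metric can be sketched as follows, where `triples` is assumed to be a list of (word, relation, word) dependency tuples harvested from the parsed corpus; all function and variable names are ours.

```python
import math
from collections import Counter, defaultdict

def build_information(triples):
    """Precompute I(w, r, w') and the feature sets T(w) from a list of
    (word, relation, word') dependency triples."""
    c_wrw = Counter(triples)                              # ||w, r, w'||
    c_wr_ = Counter((w, r) for w, r, _ in triples)        # ||w, r, *||
    c__rw = Counter((r, w2) for _, r, w2 in triples)      # ||*, r, w'||
    c__r_ = Counter(r for _, r, _ in triples)             # ||*, r, *||

    def I(w, r, w2):
        num = c_wrw[(w, r, w2)] * c__r_[r]
        den = c_wr_[(w, r)] * c__rw[(r, w2)]
        return math.log(num / den) if num and den else 0.0

    T = defaultdict(set)                                  # features with I > 0
    for w, r, w2 in set(triples):
        if I(w, r, w2) > 0:
            T[w].add((r, w2))
    return I, T

def lin_similarity(w1, w2, I, T):
    """Lin (1998): information of shared features over total information."""
    shared = T[w1] & T[w2]
    num = sum(I(w1, r, w) + I(w2, r, w) for r, w in shared)
    den = (sum(I(w1, r, w) for r, w in T[w1]) +
           sum(I(w2, r, w) for r, w in T[w2]))
    return num / den if den else 0.0
```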

4.3 Propagation Methods

We use the k nearest neighbour classifier (kNN) (Cover and Hart, 1967) as a simple method for propagating labels from seeds to other instances. Alternatively, we consider verb categorization as a clustering task on a graph G = (V, E, W) where V is the set of nodes (i.e. our opinion verbs) and E is the set of edges connecting them with weights W : E → R⁺. W can be directly derived from any of the similarity metrics (§4.2.1–§4.2.4). The aim is that all nodes v ∈ V are assigned a label l ∈ {AG, PT, SP}. Initially, only the verb seeds are labeled. We then use the Adsorption label propagation algorithm from junto (Talukdar et al., 2008) in order to propagate the labels from the seeds to the remaining verbs.
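A minimal sketch of the kNN propagation baseline (k = 3 in the experiments) is given below; the graph-based Adsorption variant from junto is not reproduced here, and `sim` stands for any of the similarity functions above.

```python
from collections import Counter

def knn_propagate(seeds, unlabeled, sim, k=3):
    """Assign each unlabeled verb the majority label of its k most similar
    seed verbs; `seeds` maps verbs to labels in {"AG", "PT", "SP"}."""
    labels = dict(seeds)
    for verb in unlabeled:
        neighbours = sorted(seeds, key=lambda s: sim(verb, s), reverse=True)[:k]
        labels[verb] = Counter(seeds[s] for s in neighbours).most_common(1)[0][0]
    return labels
```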

                              Acc    Prec   Rec    F1
Baselines     Majority Class  45.7   14.2   33.3   20.9
              Only Seeds       8.9   87.0    9.8   17.6
Coordination  kNN             45.2   61.5   47.3   53.4
              graph           42.7   68.7   39.7   50.4
WordNet       kNN             52.8   51.5   50.7   51.1
              graph           51.1   51.9   51.5   51.5
Embedding     kNN             59.3   58.4   61.0   59.7
              graph           64.0   70.5   59.4   64.5
Dependency    kNN             65.7   63.8   65.4   64.5
              graph           70.3   72.0   68.0   70.6

Table 5: Evaluation of similarity metrics and classifiers.

5 Experiments

5.1 Evaluation of the Induced Lexicon

Table 5 compares the performance of the different similarity metrics when incorporated in either kNN or graph clustering. The resulting categorizations are compared against the gold-standard coarse-grained lexicon (§3.4). For kNN, we set k = 3, for which we obtained the best performance in all our experiments.

As seeds, we took the top 40 AG-verbs, 30 PT-verbs and 50 SP-verbs produced by the respective initialization methods (§4.1). The seed proportions should roughly correspond to the actual class distribution (Table 3). Large increases of the seed sets do not improve the quality (as shown below). 10 of the 50 SP-verbs are extracted from the positive SP-patterns, while the remaining verbs are extracted from the negative SP-patterns (§4.1).

As baselines, we include a classifier that only employs the seeds and a majority class classifier that always predicts an SP-verb. For word embeddings (§4.2.1) and WordNet::Similarity (§4.2.2), we only report the performance of the best metric/configuration, i.e. for embeddings, the continuous bag-of-words model with 500 dimensions, and for WordNet::Similarity, the Wu & Palmer measure (Wu and Palmer, 1994).

Table 5 shows that the baselines can be outperformed by large margins. The performance of the different similarity metrics varies. The dependency-based metric performs notably better than the other metrics. Together with word embeddings, it is the only metric for which graph clustering produces a notable improvement over kNN.

Table 6 illustrates the quality of the similarity metrics for the present task. The table shows that the dependency-based similarity metric provides the most suitable output. The poor quality of coordination may come as a surprise.


Coordin.  appear, believe, refuse, vow, want, offend, shock, help, exhilarate, challenge, support, distort
WordNet   appal, scandalize, anger, rage, sicken, temper, hate, fear, love, alarm, dread, tingle
Embedd.   anger, dismay, disgust, protest, alarm, enrage, shock, regret, concern, horrify, appal, sorrow
Depend.   anger, infuriate, alarm, shock, stun, enrage, incense, dismay, upset, appal, offend, disappoint

Table 6: The 12 most similar verbs to outrage (PT-verb) according to the different metrics (verbs other than PT-verbs are underlined).

          AG-verbs                 PT-verbs                 SP-verbs
          no graph        graph    no graph        graph    no graph  graph
          (Wiegand 2012)           (Wiegand 2012)
          55.45           69.12    38.59           67.66    52.16     72.03

Table 7: F-scores of the entire output of pattern-based extraction (§4.1) where no propagation is applied (no graph) vs. the best proposed induction method from Table 5 (graph).

That method suffers from data sparsity. In our corpus, the frequency of verbs co-occurring with outrage in a conjunction is 5 or lower.⁵ The table also shows that WordNet may not be appropriate for our present verb categorization task. However, it may be suitable for other subtasks in sentiment analysis, particularly polarity classification. If we consider the similar entries of outrage provided by that metric, we find that polarity is largely preserved (10 out of 12 verbs are negative). This observation is consistent with Esuli and Sebastiani (2006) and Rao and Ravichandran (2009).

In Table 5 we only used the top 40/30/50 verbs from the initialization methods as seeds. We can also compare the output of these methods (combined with propagation, i.e. graph clustering) with the entire verb lists produced by these pattern-based initialization methods where no propagation is applied. As far as AG- and PT-verbs are concerned, the entire lists of these initialization methods correspond to the original approach of Wiegand and Klakow (2012). Table 7 shows the result. The new graph induction always outperforms the original induction method by a large margin.

⁵ We found that for frequently occurring opinion verbs, this similarity metric produces more reasonable output.

Pattern_half  Pattern  Pattern_double  Gold_half  Gold   Gold_double
68.71         70.59    66.50           66.21      70.31  73.77

Table 8: Comparison of automatic and gold seeds (evaluation measure: macro-average F-score).

          Major.  Coordin.      WordNet       Embedd.       Depend.
                  kNN   graph   kNN   graph   kNN   graph   kNN   graph
English   20.9    53.4  50.4    51.1  51.5    59.7  64.5    64.5  70.6
German    22.9    43.8  48.9    53.2  59.9    54.3  60.9    58.3  63.1

Table 9: Comparison of English and German data (evaluation measure: macro-average F-score).

In Table 8, we compare our automatically generated seeds using the patterns from §4.1 (Pattern) with seeds extracted from our gold standard (Gold). We rank those verbs by frequency. Size and verb type distribution are preserved. We also examine what impact doubling the size of the seed sets (Gold|Pattern_double) and halving them (Gold|Pattern_half) has on classification. Dependency-based similarity and graph clustering are used for all configurations. Only if we double the amount of seeds are the gold seeds notably better than the automatically generated seeds.

Since our induction approach only requires a sentiment lexicon and aims at low-resource languages, we replicated the experiments for German, as shown in Table 9. We use the PolArt sentiment lexicon (Klenner et al., 2009) (1416 entries). (As a gold standard, we manually annotated that lexicon according to our three verb types.) As an unlabeled corpus, we chose the Huge German Corpus.⁶ As a parser, we used ParZu (Sennrich et al., 2009). Instead of WordNet, we used GermaNet (Hamp and Feldweg, 1997). The automatically generated seeds were manually translated from English to German. Table 9 shows that, as on the English data, dependency-based similarity combined with graph clustering performs best. The fact that we can successfully replicate our approach in another language supports the general applicability of our proposed categorization of verbs into three types for opinion role extraction.

5.2 In-Context Evaluation

We now evaluate our induced knowledge on the task of extracting opinion holders and targets from actual text. For this in-context evaluation, we sampled sentences from the North American News Corpus in which our opinion verbs occurred. We annotated all holders and targets of those verbs. (A constituent may have several roles for the same verb (§2).) The dataset contains about 1100 sentences. We need to rely on this dataset since it is the only corpus in which our opinion verbs are widely represented and both holders and targets are annotated.

⁶ www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/hgc.html


Features          Description
cand lemma        head lemma of candidate (phrase)
cand pos          part-of-speech tag of head of candidate phrase
cand phrase       phrase label of candidate
cand person       is candidate a person
verb lemma        verb lemmatized
verb pos          part-of-speech tag of verb
word              bag of words: all words within the sentence
pos               part-of-speech sequence between candidate and verb
distance          token distance between candidate and verb
const             path in constituency parse tree from candidate to verb
subcat            subcategorization frame of verb
srl_propbank/dep  semantic role/dependency path between candidate and verb (semantic roles based on PropBank)
brown             Brown clusters of cand word/verb word/words
srl_framenet      frame element name assigned to candidate and the frame name (to which the frame element belongs)
fine-grain lex    is candidate holder/target/target_speaker according to the fine-grained lexicon
coarse-grain lex  is candidate holder/target/target_speaker according to the coarse-grained lexicon
induc_graph       is candidate holder/target/target_speaker according to the coarse-grained lexicon automatically induced with graph clustering (and induced seeds (§4.1))

Table 10: Feature set for in-context classification.

widely represented and both holders and targetsare annotated.

We solve this task with supervised learning. As a classifier, we employ Support Vector Machines as implemented in SVMlight (Joachims, 1999). The task is to extract three different entities: opinion holders, opinion targets and opinion targets evoked by speaker views. All those entities are always put into relation to a specific opinion verb in the sentence. The instance space thus consists of tuples (verb, const), where verb is the mention of an opinion verb and const is any possible (syntactic) constituent in the respective sentence. The dataset contains 753 holders, 745 targets and 499 targets of a speaker view. Since a constituent may have several roles at the same time, we train three binary classifiers, one for each of the entity types. On a sample of 200 sentences, we measured an inter-annotator agreement of Cohen's κ = 0.69 for holders, κ = 0.63 for targets and also κ = 0.63 for targets of a speaker view.

Table 10 shows the features used in our supervised classifier. They have been previously found effective (Choi et al., 2005; Jakob and Gurevych, 2010; Wiegand and Klakow, 2012; Yang and Cardie, 2013). The standard features are the features from cand word to brown. For semantic role labeling of PropBank structures, we used mate-tools (Bjorkelund et al., 2009). For person detection, we employ named-entity tagging (Finkel et al., 2005) and WordNet (Miller et al., 1990).
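A minimal sketch of this setup, using scikit-learn's LinearSVC as a stand-in for SVMlight; the sentence objects, the featurize helper (covering the features of Table 10), and the gold role sets are hypothetical inputs, not the authors' actual code:

```python
from sklearn.svm import LinearSVC

ROLES = ("holder", "target", "target_speaker")

def make_instances(sentences, featurize):
    """Build one instance per (opinion verb, candidate constituent) pair.

    `sentences` is assumed to expose .opinion_verbs and .constituents;
    `featurize(verb, const)` returns a feature vector (hypothetical helper)."""
    X, pairs = [], []
    for sent in sentences:
        for verb in sent.opinion_verbs:
            for const in sent.constituents:
                X.append(featurize(verb, const))
                pairs.append((verb, const))
    return X, pairs

def train_role_classifiers(X, gold_roles):
    """Train one binary SVM per role. A constituent may carry several roles,
    so three independent classifiers are used instead of one multi-class model."""
    classifiers = {}
    for role in ROLES:
        y = [1 if role in roles else 0 for roles in gold_roles]  # gold_roles aligned with X
        clf = LinearSVC()          # linear-kernel SVM, standing in for SVM-light
        clf.fit(X, y)
        classifiers[role] = clf
    return classifiers
```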

Features                          Holder     Target     TargetSpeaker
standard                          63.59      54.18      40.06
+srl_framenet                     65.44*     55.70*     42.14
+inducgraph                       68.06*◦    59.61*◦    46.66*◦
+srl_framenet +inducgraph         69.70*◦    60.47*◦    47.33*◦
+coarse-grain lex                 68.56*◦    59.89*◦    54.31*◦†
+srl_framenet +coarse-grain lex   69.70*◦    60.68*◦    54.06*◦†
+fine-grain lex                   69.83*◦†   62.89*◦†   56.71*◦†
+srl_framenet +fine-grain lex     70.80*◦†   63.72*◦†   56.64*◦†

Statistical significance testing (paired t-test, significance level p < 0.05): *: better than standard; ◦: better than +srl_framenet; †: better than +inducgraph.

Table 11: In-context evaluation (evaluation measure: F-score).
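The significance markers above come from paired t-tests over cross-validation folds; a minimal sketch with SciPy, where the per-fold F-scores are hypothetical illustrative numbers rather than the paper's actual results:

```python
from scipy.stats import ttest_rel

def significantly_better(fscores_a, fscores_b, alpha=0.05):
    """Paired t-test over per-fold F-scores of two feature configurations.
    Returns True if configuration A is significantly better than B."""
    t_stat, p_value = ttest_rel(fscores_a, fscores_b)
    return p_value < alpha and t_stat > 0

# Hypothetical per-fold scores from 10-fold cross-validation:
standard = [62.1, 63.0, 64.2, 63.5, 62.8, 64.0, 63.1, 64.5, 63.9, 64.8]
with_lex = [69.0, 70.1, 71.2, 70.5, 69.8, 71.0, 70.2, 71.4, 70.9, 71.9]
print(significantly_better(with_lex, standard))
```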

For semantic role labeling of FrameNet structures (srl_framenet), we used Semafor (Das et al., 2010) with the argument identification based on dual decomposition (Das et al., 2012). We run the configuration that also assigns frame structures to unknown predicates (Das and Smith, 2011). This is necessary as 45% of our opinion verbs are not contained in FrameNet (v1.5). FrameNet has been shown to enable a correct role assignment for AG- and PT-verbs (Bethard et al., 2004; Kim and Hovy, 2006). For instance, in (17) and (18), the opinion holder is assigned to the same frame element EXPERIENCER. However, the PropBank representation does not produce a correct alignment: In (17), the opinion holder is the agent of the opinion verb, while in (18), the opinion holder is the patient of the opinion verb.

(17) [Peter]_EXPERIENCER/agent dislikes [Mary]_patient.        (dislike, Peter, holder)

(18) [Peter]_agent disappoints [Mary]_EXPERIENCER/patient.     (disappoint, Mary, holder)

With the feature fine-grain lex, we want to validate that our manually-compiled opinion role lexicon for verbs (§2), i.e. the lexicon that also allows multiple opinion roles for the same semantic roles, is effective for in-context evaluation. Coarse-grain lex is derived from the fine-grained lexicon (§3.4). With this feature, we measure how much we lose by dropping the fine-grained representation. Inducgraph induces the verb types of the coarse representation automatically by employing the best induction method obtained in Table 5.

Table 11 compares the different features on 10-fold cross-validation. The table shows that the features encoding opinion role information, including our induction approach, are more effective than srl_framenet. Even though the fine-grained lexicon produces the best results, we almost reach that performance with the coarse-grained lexicon. This is further evidence that our proposed three-way verb categorization, which is also the basis of our induction approach, is adequate.

                                       manual lexicons
Features                           inducgraph   coarse-grain   fine-grain
lexicon feature only (≈ unsuperv.)   52.38        55.81          60.68
lexicon with all other features      59.90*       61.71*         63.92◦

Statistical significance testing (paired t-test, significance level p < 0.05): *: better than lexicon feature only; ◦: only significant on 2 out of 3 roles.

Table 12: Lexical resources and the impact of other (not lexicon-based) features (evaluation measure: macro-average F-score).

                        Distribution of Verb Types
Corpus                   AG     PT     SP     # sentences
MPQA (training+test)     77.5    7.7   14.8   15,753
FICTION (test)           67.5   15.1   17.3      614
VERB (test)              34.8   14.0   52.7    1,073

Table 13: Statistics on the different corpora used.

Table 12 compares the performance of the different lexicons in isolation (this is comparable with an unsupervised classifier, as each lexicon feature has three values, each predicting one of the opinion roles) and in combination with the standard (+srl_framenet) features. The table shows that all lexicon features are strong features on their own. The score of induction is lowest, but this feature has been created without manual supervision. Moreover, the improvement by adding the other features is much larger for induction than for the manually-built fine-grained lexicon. This means that we can compensate for some lexical knowledge missing in induction by standard features.

Since we could substantially outperform the features relying on FrameNet with our new lexical resources, we looked closer at the predicted frame structures. Besides obvious errors in automatic frame assignment, we also found that there are problems inherent in the frame design. Particularly, the notion of SP-verbs (§3.3) is not properly reflected. Many frames, such as SCRUTINY, typically devised for AG-verbs, such as investigate or analyse, also contain SP-verbs like pry. This observation is in line with Ruppenhofer and Rehbein (2012), who claim that extensions to FrameNet are necessary to properly represent opinions evoked by verbal predicates.

5.3 Comparison to Previous Cross-Domain Opinion Holder Extraction

We now compare our proposed induction approach with previous work on opinion holder extraction. We replicate several classifiers and compare them to our new approach. (Because of the limited space of this paper, we cannot also address cross-domain opinion target extraction.) We consider three different corpora as shown in Table 13. MPQA (Wiebe et al., 2005) is the standard corpus for fine-grained sentiment analysis. FICTION, introduced in Wiegand and Klakow (2012), is a collection of summaries of classic literary works. VERB is the new corpus used in the previous evaluation (§5.2). VERB and MPQA both originate from the news domain, but VERB is sampled in such a way that mentions of all opinion verbs of the Subjectivity Lexicon are represented. The other corpora consist of contiguous sentences. They will have a bias towards only those opinion verbs frequently occurring in that particular domain. This also results in different distributions of verb types, as shown in Table 13. For example, SP-verbs are rare in MPQA. However, there exist plenty of them (Table 3). Other domains may have much more frequent SP-verbs (just as FICTION has more PT-verbs than MPQA). A robust domain-independent classifier should therefore be able to cope equally well with all three verb types.

                            in domain    out of domain
Config                      MPQA         FICTION    VERB
MultiRel                    72.54*◦      53.02      44.80
CK                          62.98        52.91      43.88
CK + induc_Wiegand2012      65.15        57.33*     50.83*
CK + inducgraph             66.06*       65.03*◦    60.91*◦
CK + coarse-grain lex       66.82*       64.13*◦    63.72*◦†
CK + fine-grain lex         66.16*       64.98*◦    70.85*◦†‡

Statistical significance testing (permutation test, significance level p < 0.05): *: better than CK; ◦: better than CK + induc_Wiegand2012; †: better than CK + inducgraph; ‡: better than CK + coarse-grain lex.

Table 14: Evaluation on opinion holder extraction on various corpora (evaluation measure: F-score).

MPQA is also the largest corpus. Following Wiegand and Klakow (2012), this corpus is chosen as a training set.7 Despite its size, however, almost every second opinion verb from our set of opinion verbs is not contained in that corpus.

In the evaluation, we only consider the opinion holders of our opinion verbs. (Other opinion holders, both in the gold standard and in the predictions of the classifiers, are ignored.) Recall that we take the knowledge of what is an opinion verb as given. Our graph-based induction can be arbitrarily extended by increasing the set of opinion verbs.

7The split-up of training and test set on the MPQA corpus follows the specification of Johansson and Moschitti (2013).


As classifiers, we consider the convolution kernels (CK) from Wiegand and Klakow (2012) and the sequence labeler from Johansson and Moschitti (2013), MultiRel, which incorporates relational features taking into account interactions between multiple opinion cues. It is currently the most sophisticated opinion holder extractor. CK can be combined with additional knowledge. We compare inducgraph with induc_Wiegand2012, which employs the word lists induced for AG- and PT-verbs in the fashion of Wiegand and Klakow (2012), i.e. without graph clustering. As an upper bound for the induction methods, coarse-grain lex and fine-grain lex are used.8 The combination of CK with this additional knowledge follows the best settings from Wiegand and Klakow (2012).9

Table 14 shows the results. MultiRel produces the best performance on MPQA but suffers similarly from a domain mismatch as CK on FICTION and VERB. MultiRel and CK cannot handle many PT- and SP-verbs in those corpora, simply because many of them do not occur in MPQA. On MPQA, only the new induction approach and the lexicons significantly improve CK. The knowledge of opinion roles has a lower impact on MPQA. In that corpus, most opinion verbs take their opinion holder as an agent. Given the large size of MPQA, this information can be easily learned from the training data. The situation is different for FICTION and VERB, where the knowledge from induction largely improves classification. In these corpora, opinion holders as agents are much less frequent than on MPQA. The new induction proposed in this paper also notably outperforms the induction from Wiegand and Klakow (2012).

Although the fine-grained lexicon is among the top performing systems, we only note large improvements on VERB. VERB has the highest proportion of PT- and SP-verbs (Table 13). Knowledge about role assignment is most critical here.

6 Related Work

Most approaches for opinion role extraction employ supervised learning. The feature design is mainly inspired by semantic role labeling (Bethard et al., 2004; Li et al., 2012). Some work also employs information from existing semantic role labelers based on FrameNet (Kim and Hovy, 2006) or PropBank (Johansson and Moschitti, 2013; Wiegand and Klakow, 2012). Although those resources give extra information for opinion role extraction in comparison to syntactic or other surface features, we showed in this work that further task-specific knowledge, i.e. either opinion verb types or a manually-built opinion role lexicon, provides even more accurate information.

8Wiegand and Klakow (2012) use a lexicon Lex which just comprises the notion of AG and PT verbs, so our manual lexicons are more accurate and harder to beat.

9For in-domain evaluation (i.e. MPQA), the trees (tree structures are the input to CK) are augmented with verb category information. For out-of-domain evaluation (i.e. FICTION and VERB), we add to the predictions of CK the prediction of a rule-based classifier using the opinion role assignment according to the respective lexicon or induction method.

There has been a substantial amount of research on opinion target extraction. It focuses, however, on the extraction of topic-specific opinion terms (Jijkoun et al., 2010; Qiu et al., 2011) rather than on the variability of semantic roles for opinion holders and targets. Mitchell et al. (2013) present a low-resource approach for target extraction, but their aim is to process Twitter messages without using general syntax tools. In this work, we use such tools. Our notion of low resources is different in that we mean the absence of semantic resources helpful for our task (e.g. FrameNet).

7 Conclusion

We presented an approach for opinion role induction for verbal predicates. We assume that those predicates can be divided into three different verb types, where each type is associated with a characteristic mapping between semantic roles and opinion holders and targets. In several experiments, we demonstrated the relevance of those three types. We showed that verbs can effectively be categorized with graph clustering given a suitable similarity metric. The seeds are automatically selected. Our proposed induction approach outperforms both a previous induction approach and features derived from semantic role labelers. We also pointed out the importance of the knowledge gained by induction in supervised cross-domain classification.

Acknowledgements

The authors would like to thank Stephanie Koser for annotating parts of the resources presented in this paper. For proofreading the paper, the authors would also like to thank Ines Rehbein, Asad Sayeed and Marc Schulder. We also thank Ashutosh Modi for advising us on word embeddings. The authors were partially supported by the German Research Foundation (DFG) under grants RU 1873/2-1 and WI 4204/2-1.


References

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet Project. In Proceedings of COLING/ACL.

Marco Baroni, Georgiana Dinu, and German Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of ACL.

Steven Bethard, Hong Yu, Ashley Thornton, VasileiosHatzivassiloglou, and Dan Jurafsky. 2004. Extract-ing Opinion Propositions and Opinion Holders us-ing Syntactic and Lexical Cues. In Computing At-titude and Affect in Text: Theory and Applications.Springer-Verlag.

Anders Bjorkelund, Love Hafdell, and Pierre Nugues.2009. Multilingual Semantic Role Labeling. In Pro-ceedings of the CoNLL – Shared Task.

Yejin Choi, Claire Cardie, Ellen Riloff, and SiddharthPatwardhan. 2005. Identifying Sources of Opinionswith Conditional Random Fields and Extraction Pat-terns. In Proceedings of HLT/EMNLP.

Thomas Cover and Peter Hart. 1967. Nearest neighbor pattern classification. Information Theory, 13(1):21–27.

Dipanjan Das and Noah A. Smith. 2011. Semi-Supervised Frame-Semantic Parsing for UnknownPredicates. In Proceedings of ACL.

Dipanjan Das, Nathan Schneider, Desai Chen, andNoah A. Smith. 2010. Probabilistic Frame-Semantic Parsing. In Proceedings of HLT/NAACL.

Dipanjan Das, Andre F.T. Martins, and Noah A. Smith.2012. An Exact Dual Decomposition Algorithm forShallow Semantic Parsing with Constraints. In Pro-ceedings of *SEM.

Lingjia Deng and Janyce Wiebe. 2014. SentimentPropagation via Implicature Constraints. In Pro-ceedings of EACL.

Andrea Esuli and Fabrizio Sebastiani. 2006. Senti-WordNet: A Publicly Available Lexical Resourcefor Opinion Mining. In Proceedings of LREC.

Jenny Rose Finkel, Trond Grenager, and ChristopherManning. 2005. Incorporating Non-local Informa-tion into Information Extraction Systems by GibbsSampling. In Proceedings of ACL.

Amit Goyal, Ellen Riloff, and Hal Daume III. 2010.Automatically Producing Plot Unit Representationsfor Narrative Text. In Proceedings of EMNLP.

Birgit Hamp and Helmut Feldweg. 1997. GermaNet -a Lexical-Semantic Net for German. In Proceedingsof ACL workshop Automatic Information Extractionand Building of Lexical Semantic Resources for NLPApplications.

Vasileios Hatzivassiloglou and Kathleen R. McKeown.1997. Predicting the Semantic Orientation of Adjec-tives. In Proceedings of EACL.

Niklas Jakob and Iryna Gurevych. 2010. ExtractingOpinion Targets in a Single- and Cross-Domain Set-ting with Conditional Random Fields. In Proceed-ings of EMNLP.

Valentin Jijkoun, Maarten de Rijke, and WouterWeerkamp. 2010. Generating Focused Topic-Specific Sentiment Lexicons. In Proceedings ofACL.

Thorsten Joachims. 1999. Making Large-Scale SVMLearning Practical. In B. Scholkopf, C. Burges,and A. Smola, editors, Advances in Kernel Meth-ods - Support Vector Learning, pages 169–184. MITPress.

Richard Johansson and Alessandro Moschitti. 2013.Relational Features in Fine-Grained Opinion Analy-sis. Computational Linguistics, 39(3):473–509.

Jun Seok Kang, Song Feng, Leman Akoglu, and YejinChoi. 2014. ConnotationWordNet: Learning Con-notation over the Word+Sense Network. In Pro-ceedings of ACL.

Soo-Min Kim and Eduard Hovy. 2006. ExtractingOpinions, Opinion Holders, and Topics Expressedin Online News Media Text. In Proceedings of ACLWorkshop on Sentiment and Subjectivity in Text.

Paul Kingsbury and Martha Palmer. 2002. From Tree-Bank to PropBank. In Proceedings of LREC.

Dan Klein and Christopher D. Manning. 2003. Accu-rate Unlexicalized Parsing. In Proceedings of ACL.

Manfred Klenner, Angela Fahrni, and Stefanos Pe-trakis. 2009. PolArt: A Robust Tool for SentimentAnalysis. In Proceedings of NoDaLiDa.

J. Richard Landis and Gary G. Koch. 1977. TheMeasurement of Observer Agreement for Categor-ical Data. Biometrics, 33(1):159–174.

Shoushan Li, Rongyang Wang, and Guodong Zhou.2012. Opinion Target Extraction via Shallow Se-mantic Parsing. In Proceedings of AAAI.

Dekang Lin. 1998. Automatic Retrieval and Clus-tering of Similar Words. In Proceedings ofACL/COLING.

Isa Maks and Piek Vossen. 2012. A lexicon model fordeep sentiment analysis and opinion mining applica-tions. Decision Support Systems, 53:680–688.

Tomas Mikolov, Kai Chen, Greg Corrado, and JeffreyDean. 2013. Efficient Estimation of Word Repre-sentations in Vector Space. In Proceedings of Work-shop at ICLR.


George Miller, Richard Beckwith, Christine Fellbaum,Derek Gross, and Katherine Miller. 1990. Intro-duction to WordNet: An On-line Lexical Database.International Journal of Lexicography, 3:235–244.

Margaret Mitchell, Jacqueline Aguilar, Theresa Wil-son, and Benjamin Van Durme. 2013. OpenDomain Targeted Sentiment. In Proceedings ofEMNLP.

Ted Pedersen, Siddharth Patwardhan, and Jason Miche-lizzi. 2004. WordNet::Similarity – Measuringthe Relatedness of Concepts. In Proceedings ofHLT/NAACL.

Guang Qiu, Bing Liu, Jiajun Bu, and Chun Chen.2011. Opinion Word Expansion and Target Extrac-tion through Double Propagation. ComputationalLinguistics, 37(1):9–27.

Delip Rao and Deepak Ravichandran. 2009. Semi-Supervised Polarity Lexicon Induction. In Proceed-ings of EACL.

Ellen Riloff and Jessica Shepherd. 1997. A Corpus-Based Approach for Building Semantic Lexicons.In Proceedings of EMNLP.

Brian Roark and Eugene Charniak. 1998. Noun-phrase co-occurrence statistics for semi-automaticsemantic lexicon construction. In Proceedings ofCOLING.

Josef Ruppenhofer and Ines Rehbein. 2012. Seman-tic frames as an anchor representation for sentimentanalysis. In Proceedings of WASSA.

Josef Ruppenhofer, Swapna Somasundaran, and JanyceWiebe. 2008. Finding the Source and Targets ofSubjective Expressions. In Proceedings of LREC.

Rico Sennrich, Gerold Schneider, Martin Volk, andMartin Warin. 2009. A New Hybrid DependencyParser for German. In Proceedings of GSCL.

Partha Pratim Talukdar, Joseph Reisinger, MariusPasca, Deepak Ravichandran, Rahul Bhagat, andFernando Pereira. 2008. Weakly-Supervised Acqui-sition of Labeled Class Instances using Graph Ran-dom Walks. In Proceedings of EMNLP.

Janyce Wiebe and Rada Mihalcea. 2006. Word Senseand Subjectivity. In Proceedings of COLING/ACL.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating Expressions of Opinions and Emotions in Language. Language Resources and Evaluation, 39(2/3):164–210.

Michael Wiegand and Dietrich Klakow. 2012. Gener-alization Methods for In-Domain and Cross-DomainOpinion Holder Extraction. In Proceedings ofEACL.

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing Contextual Polarity in Phrase-level Sentiment Analysis. In Proceedings of HLT/EMNLP.

Zhibiao Wu and Martha Palmer. 1994. Verb semantics and lexical selection. In Proceedings of ACL.

Bishan Yang and Claire Cardie. 2013. Joint Inferencefor Fine-grained Opinion Extraction. In Proceed-ings of ACL.

Patrick Ziering, Lonneke van der Plas, and HinrichSchuetze. 2013. Bootstrapping Semantic Lexiconsfor Technical Domains. In Proceedings of IJCNLP.


Proceedings of the 19th Conference on Computational Language Learning, pages 226–236, Beijing, China, July 30-31, 2015. ©2015 Association for Computational Linguistics

Quantity, Contrast, and Convention in Cross-Situated Language Comprehension

Ian Perera and James F. Allen
University of Rochester, Department of Computer Science, Rochester, NY 14627 USA
Institute for Human and Machine Cognition, Pensacola, FL 32502 USA
{iperera,jallen}@ihmc.us

Abstract

Typically, visually-grounded language learning systems only accept feature data about objects in the environment that are explicitly mentioned, whether through annotation labels or direct reference through natural language. We show that when objects are described ambiguously using natural language, a system can use a combination of the pragmatic principles of Contrast and Conventionality, and multiple-instance learning to learn from ambiguous examples in an online fashion. Applying child language learning strategies to visual learning enables more effective learning in real-time environments, which can lead to enhanced teaching interactions with robots or grounded systems in multi-object environments.

1 Introduction

As opposed to the serial nature of labeled data presented to a machine learning classifier, children and robots "in the wild" must learn object names and attributes like color, size, and shape while being surrounded by a number of stimuli and possible referents. When a child hears "the red ball", they must first identify the object mentioned, then use existing knowledge to identify that "red" and "ball" are distinct concepts, and over time, learn that objects called "red" share some similarity in color while objects called "ball" share some similarity in shape. Learning for them therefore requires both identification and establishing joint attention with the speaker before assigning a label to an object, while also applying other language learning strategies to narrow down the search space of possible referents, as illustrated by Quine's "gavagai" problem (1964).

Trying to learn attributes and objects without non-linguistic cues such as pointing and gaze might seem an insurmountable challenge. Yet a child experiences many such situations and can nevertheless learn grounded concepts over time. Fortunately, adult speakers tend to understand the limitation of these cues in certain situations and adjust their speech in accordance with Grice's Maxim of Quantity when referring to objects: be only as informative as necessary (Grice, 1975). We therefore treat the language describing a particular object in a scene as an expression of an iterative process, where the speaker is attempting to guide the listener towards the referent in a way that avoids both ambiguity and unnecessary verbosity.

Language learners additionally make use of the pragmatic assumptions of Conventionality, that speakers agree upon the meaning of a word, and Contrast, that different words have different meanings (Clark, 2009). The extension of these principles to grounded language learning yields the assumptions that the referents picked out by a referring expression will have some similarity (perceptual in our domain), and will be dissimilar compared to objects not included in the reference. Children will eventually generalize learned concepts or accept synonyms in a way that violates these principles (Baldwin, 1992), but these assumptions aid in the initial acquisition of concepts. In our system, we manifest these principles using distance metrics and thereby allow significant flexibility in the implementation of object and attribute representations while allowing a classifier to aid in reference resolution.

When faced with unresolvable ambiguity in determining the correct referent, past, ambiguous experiences can be called upon to resolve ambiguity in the current situation in a strategy called Cross-Situational Learning (XSL). There is some debate over whether people use XSL, as it requires considerable memory and computational load (Trueswell et al., 2013). However, other experiments show evidence for XSL in adults and children in certain situations (Smith and Yu, 2008; Smith et al., 2011). We believe these instances that show evidence of XSL certainly merit an implementation, both for better understanding language learning and for advancing grounded language learning in the realm of robotics, where such limitations do not exist. We show that by reasoning over multiple ambiguous learning instances and constraining possibilities with pragmatic inferences, a system can quickly learn attributes and names of objects without a single unambiguous training example.

Our overarching research goal is to learn compositional models of grounded attributes towards describing an object in a scene, rather than just identifying it. That is, we do not only learn to recognize instances of objects, but also learn attributes constrained to feature spaces that will be compatible with contextual modifiers such as dark/light in terms of color, or small/large in terms of size and object classification. Therefore, we approach the static, visual aspects of the symbol grounding problem with an eye towards ensuring that our grounded representations of attributes can be composed in the same way that their semantic analogues can. We continue our previous work (Perera and Allen, 2013) with two evaluations to demonstrate the effectiveness of applying the principles of Quantity, Contrast, and Conventionality, as well as incorporating quantifier constraints, negative information, and classification in the training step. Our first evaluation is reference resolution, to determine how well the system identifies the correct objects to attend to, and our second is description generation, to determine how well the system uses those training examples to understand attributes and object classes.

2 Related Work

Our algorithm for reference resolution and XSL fits into our previous work on a situated language learning system for grounding linguistic symbols in perception. The integration of language in a multi-modal task is a burgeoning area of research, with the grounded data being any of a range of possible situations, from objects on a table (Matuszek et al., 2012) to wetlab experiments (Naim et al., 2014). Our end goal of using natural language to learn from visual scenes is similar to work by Krishnamurthy and Kollar (2013) and Yu and Siskind (2013), and our emphasis on attributes is related to work by Farhadi et al. (2009). However, our focus is on learning from situations that a child would be exposed to, without using annotated data, and to test implementations of child language learning strategies in a computational system.

We use a tutor-directed approach to training our system, where the speaker presents objects to the system and describes them, as in work by Skocaj et al. (2011). The focus of this work is in evaluating referring expressions as in work by Mohan et al. (2013), although without any dialogue for disambiguation. Kollar et al. (2013) also incorporate quantifier and pragmatic constraints on reference resolution in a setting similar to ours. In this work, we undertake a more detailed analysis of the effects of different pragmatic constraints on system performance.

The task of training a classifier from "bags" of instances with a label applying to only some of the instances contained within is referred to as Multiple-Instance Learning (MIL) (Dietterich, 1997), and is the machine-learning analogue of cross-situational learning. There is a wide range of methods used in MIL and a number of different assumptions that can be made to fit the task at hand (Foulds and Frank, 2010). Online MIL methods so far have been used for object tracking (Li et al., 2010), and Dindo and Zambuto (2010) apply MIL to grounded language learning, but we are not aware of any research that investigates the application of online MIL to studying cognitive models of incremental grounded language learning. In addition, we find that we must relax many assumptions used in MIL to handle natural language references, such as the 1-of-N assumption used by Dindo and Zambuto. The lack of appropriate algorithms for handling this task motivates our development of a novel algorithm for language learning situations.

3 Experimental Design

3.1 Learning Environment and Data Collection

Our environment in this experiment consists of a table with blocks of nine different shapes and two toy cars, with the objects spanning four colors. A person stands behind the table, places a randomly chosen group of objects in a designated demonstration area on the table as shown in Figure 1, and describes one or more of the objects directly while possibly mentioning some relation to the surrounding objects. The goal of this setup is to facilitate object descriptions that more closely approximate child-directed speech, compared to the language in captioned images. Audio is recorded and transcribed by hand with timestamps at the utterance level, but there are no other annotations beyond timestamps. We use these intervals to match the spoken descriptions to the video data, which is recorded using the Microsoft Kinect to obtain RGB + Depth information.

Figure 1: One of the training examples, described as "The two red blocks [left-most two blocks in this figure] are next to the other blocks."

3.2 Training and Test Instances

All training instances involved multiple objects, with an average of 2.8 objects per demonstration. The subject could select any set of objects to describe (often with respect to the other objects). References to objects varied in detail, from "the cube" to "a tall yellow rectangle". Since a set of objects might have different shapes, the most common descriptor was "block". The majority (80%) of the quantifiers were definite or numeric, and 85% of the demonstrations referred to a single object. Test instances consisted solely of single objects presented one at a time. 20% of the objects used as test instances appeared in training because of the limited set of objects available, yet the objects were placed in slightly different orientations and at different locations, deforming the shape contour due to perspective.

3.3 Prior System Knowledge

We encode some existing linguistic and perceptual knowledge into the system to aid in learning from unconstrained object descriptions. The representative feature, defined as the system's feature space assigned to a property name (e.g., color for "white", or shape for "round"), was prechosen for the task's vocabulary to reduce the number of factors affecting the evaluation of the system. In previous work, we showed that the accuracy of the system's automatic choice of representative features can reach 78% after about 50 demonstrations of objects presented one at a time (Perera and Allen, 2013). In addition, we developed an extension to a semantic parser that distinguishes between attributes and object names using syntactic constructions.

3.4 Language Processing

The transcribed utterances are passed through the TRIPS parser (Allen et al., 2008) for simultaneous lexicon learning and recognition of object descriptions. The parser outputs generalized quantifiers and numeric constraints (capturing singular/plural instances, as well as specific numbers) in referring expressions, which are used for applying quantifier constraints to the possibilities of the referent object or group of objects. The parser's ability to distinguish between attributes and objects through syntax greatly increases learning performance, as demonstrated in our previous work (Perera and Allen, 2013). We extract the speech act (for detecting when an utterance is demonstrating a new object or adding additional information to a known object) and the referring expression from the TRIPS semantic output. Figure 2 shows the format of such a referring expression.

(MENTIONED :ID ONT::V11915 :TERMS
  ((TERM ONT::V11915
     :CLASS (:* ONT::REFERENTIAL-SEM W::BLOCK)
     :PROPERTIES ((:* ONT::MODIFIER W::YELLOW))
     :QUAN ONT::THE)))

Figure 2: Primary referring expression extraction from the semantic parse for "The yellow block is next to the others".

Although there may be many objects or groups of objects mentioned, we only store the properties of the reference that is the subject of the sentence. For example, in "Some blue cars are next to the yellow ones", we will extract that there exist at least two blue cars. Because it is an indefinite reference, we cannot draw any further inference about whether the reference set includes all examples of blue cars.


3.5 Feature Extraction

To extract features, we first perform object segmentation using Kinect depth information, which provides a pixel-level contour around each of the objects in the scene. Then for each object, we record its dimensions and location and extract visual features corresponding to color, shape, size, color variance, and texture. No sophisticated tracking algorithm is needed as the objects are stationary on the table. Color is represented in LAB space for perceptual similarity to humans using Euclidean distance, shape is captured using scale- and rotation-invariant 25-dimensional Zernike moments (Khotanzad and Hong, 1990), and texture is captured using 13-dimensional Haralick features (Haralick et al., 1973).
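A sketch of this per-object feature extraction, assuming the mahotas and scikit-image libraries (degree-8 Zernike moments give a 25-dimensional shape descriptor; averaged Haralick features give 13 texture values); the cropped RGB patch and the Kinect-derived segmentation mask are assumed inputs, and this is an illustrative reconstruction rather than the authors' pipeline:

```python
import numpy as np
import mahotas
from skimage.color import rgb2lab

def extract_features(rgb_patch, mask):
    """Color, shape, size and texture features for one segmented object.

    `rgb_patch`: cropped RGB image of the object; `mask`: boolean segmentation
    mask of the same height/width (both hypothetical inputs)."""
    lab = rgb2lab(rgb_patch)
    color = lab[mask].mean(axis=0)                     # mean LAB color
    color_var = lab[mask].var(axis=0)                  # color variance

    radius = max(mask.shape) // 2
    shape = mahotas.features.zernike_moments(          # 25-dim, scale/rotation invariant
        mask.astype(np.uint8), radius, degree=8)

    gray = (rgb_patch.mean(axis=2) * mask).astype(np.uint8)
    texture = mahotas.features.haralick(gray).mean(axis=0)   # 13 Haralick features

    size = np.array([mask.sum()])                      # pixel area as a size proxy
    return {"color": color, "color_var": color_var,
            "shape": shape, "texture": texture, "size": size}
```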

3.6 Classification and Distance Measures

To determine the similarity of new properties and objects to the system's previous knowledge of such descriptors, we use a k-Nearest Neighbor classifier (k-NN) with a Mahalanobis distance metric (Mahalanobis, 1936), distance weighting, and class weighting using the method described in Brown and Koplowitz (1979).

Our k-NN implementation allows negative examples so as to incorporate information that we infer about unmentioned objects. We do not train the system with any explicit negative information (i.e., we have no training examples described as "This is not a red block."), but if the system is confident that an object is not red, it can mark a training example as such. A negative example contributes a weight to the voting equal and opposite to what its weight would have been if it were a positive example of that class.

The Mahalanobis distance provides a way to incorporate a k-nearest neighbor classifier into a probabilistic framework. Because the squared Mahalanobis distance is equal to the number of standard deviations from the mean of the data assuming a normal distribution (Rencher, 2003), we can convert the Mahalanobis distance to a probability measure to be used in probabilistic reasoning.
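A minimal sketch of these two ingredients, a Mahalanobis-based probability and a distance-weighted k-NN vote with signed weights for negative examples. The chi-squared conversion is one plausible reading of the probability mapping described above, and the class weighting of Brown and Koplowitz (1979) is omitted for brevity:

```python
import numpy as np
from scipy.stats import chi2
from scipy.spatial.distance import mahalanobis

def same_class_probability(x, y, inv_cov):
    """Map the squared Mahalanobis distance between two feature vectors to a
    probability-like score via the chi-squared survival function (an assumption,
    not necessarily the authors' exact formula)."""
    d2 = mahalanobis(x, y, inv_cov) ** 2
    return chi2.sf(d2, df=len(x))

def knn_vote(query, examples, inv_cov, k=5):
    """Distance-weighted k-NN vote; negative examples contribute an equal and
    opposite weight, as described in the text.

    `examples` is a list of (feature_vector, label, is_positive) triples."""
    scored = []
    for feats, label, positive in examples:
        dist = mahalanobis(query, feats, inv_cov)
        weight = 1.0 / (dist + 1e-6)
        scored.append((dist, label, weight if positive else -weight))
    votes = {}
    for _, label, w in sorted(scored)[:k]:
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get) if votes else None
```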

4 The Reference Lattice

To learn from underspecified training examples, we must resolve the referring expression and assign the properties and object name in the expression to the correct referents. To incorporate existing perceptual knowledge, semi-supervised methods, and pragmatic constraints in the reference resolution task, we use a probabilistic lattice structure that we call the reference lattice.

The reference lattice consists of nodes corresponding to possible partitions of the scene for each descriptor (either property or object name). There is one column of nodes for each descriptor, with the object name as the final column. Edges signify the set-intersection of the connected nodes along a path.

Paths through the lattice correspond to a successive application of these set-intersections, ultimately resulting in a set of objects corresponding to the hypothesized referent group. In this way, paths represent a series of steps in referring expression generation where the speaker provides salient attributes sequentially to eventually make the referent set clear.

Figure 3: Three examples of paths in the reference lattice for the referring expression "the white square", when the visible objects are a grey square, a white square, and a white triangle.

4.1 Lattice Generation

For each descriptor, we generate a node for every possible partition of the scene into positive and negative examples of that descriptor. For example, if the descriptor is "red", each node is a hypothesized split that attempts to put red objects in the positive set and non-red objects in the negative set. For each column there are 2^n - 1 nodes, where n is the number of objects in the scene (the empty set is not included, as it would lead to an empty reference set). We then generate lattice edges between every pair of partitions in adjacent columns. We can discard a large proportion of these edges, as many will correspond to the intersection of disjoint partitions and will therefore be empty. Finally, we generate all possible paths through the lattice, and, if using quantifier constraints, discard any paths with a final output referent set that does not agree with the number constraints on the mentioned referent group.
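A minimal sketch of this enumeration: the 2^n - 1 positive/negative splits per descriptor column, and paths that intersect one node per column while discarding empty or quantifier-violating referent sets. Function names and the count_constraint argument are illustrative, not the authors' implementation:

```python
from itertools import combinations, product

def partitions(objects):
    """All 2^n - 1 splits of the scene into a non-empty positive set and the
    complementary negative set; one lattice node per split for a descriptor."""
    objs = frozenset(objects)
    for r in range(1, len(objs) + 1):
        for pos in combinations(objs, r):
            pos = frozenset(pos)
            yield pos, objs - pos

def lattice_paths(columns, count_constraint=None):
    """One partition per descriptor column, intersected left to right; paths with
    an empty referent set, or one violating the quantifier constraint, are dropped."""
    for choice in product(*columns):            # choice = one (pos, neg) node per column
        referents = choice[0][0]
        for pos, _neg in choice[1:]:
            referents = referents & pos
        if not referents:
            continue
        if count_constraint is not None and len(referents) != count_constraint:
            continue
        yield choice, referents

# Example: columns for "white" and "square" over the three-object scene of Figure 3.
scene = {"grey_square", "white_square", "white_triangle"}
columns = [list(partitions(scene)), list(partitions(scene))]
for path, referents in lattice_paths(columns, count_constraint=1):   # definite singular
    pass  # each surviving path hypothesizes positive/negative sets per descriptor
```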

The structure of the lattice is shown in Figure 3. In this figure, partitions are represented by split boxes in the first two columns, with positive examples in solid lines and negative examples in dotted lines. Not shown are the edges connecting each partition in one column with each partition in the next and the paths they create. The intersection of the partitions in path (a) leads to a null set, and the path is removed from the lattice. Path (b) is the ground truth path, as the individual partitions accurately describe the composition of the attributes. Path (c) contains an overspecified edge and achieves the correct referent set, albeit using incorrect assumptions about the attributes. The result sets from both (b) and (c) agree with the quantifier constraint (definite singular).

4.2 Node Probabilities

We consider two probabilities for determining the probability of a partition: that which can be determined from distance data (considering distances between objects in the partition), and that which requires previous labeled data to hypothesize a class using the classifier (considering distances from each object to the mean of the data labeled with the descriptor).

The distance probability, our implementation of the principle of Contrast, is a prior that enforces minimum intraclass distance for the positive examples and maximum interclass distance across the partition. The motivation and implementation share some similarities with the Diverse Density framework for multiple instance learning (Maron and Lozano-Perez, 1998), although here it also acts as an unsupervised clustering for determining the best reference set. It is the product of the minimum probability that any two objects in the positive examples are in the same set multiplied by the complement of the maximum probability that any two objects across the partition are in the same class. Therefore, for partition N with positive examples + and negative examples -:

P_intra = min_{x,y ∈ +} P(x_c = y_c)

P_inter = max_{x ∈ +, y ∈ -} P(x_c = y_c)  if |-| > 0;   P_inter = 1  if |-| = 0

P_distance = P_intra × (1 - P_inter)

The classifier probability is similar, except rather than comparing objects to other objects in the partition, the objects are compared to the mean of the column's descriptor C in the descriptor's representative feature. If the descriptor is a class name, we instead choose the Zernike shape feature, implementing the shape bias children show in word learning (Landau et al., 1998).

If there is insufficient labeled data to use, then the classifier probability is set to 1 for the entire column, meaning only the distance probabilities will affect the probabilities of the nodes. For a given descriptor C, the classifier probabilities are as follows:

P_pos(C) = min_{x ∈ +} P(x_c = C)

P_neg(C) = max_{x ∈ -} P(x_c = C)  if |-| > 0;   P_neg(C) = 1  if |-| = 0

P_classifier(C) = P_pos(C) × (1 - P_neg(C))

The final probability of a partition is the product of the distance probability and the classifier probability, and the node probabilities are normalized for each column.
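A minimal sketch of these node probabilities; `p_same(x, y)` and `p_label(x)` are hypothetical probability functions (e.g. derived from Mahalanobis distances as in §3.6), and the empty-negative-set branches follow the cases as printed above:

```python
def p_distance(pos, neg, p_same):
    """Distance (Contrast) prior for one partition.
    Singleton positive sets get no intra-class penalty (an assumption; the
    paper does not spell out that case)."""
    p_intra = min((p_same(x, y) for x in pos for y in pos if x != y), default=1.0)
    p_inter = max((p_same(x, y) for x in pos for y in neg), default=1.0)  # 1 if |-| = 0, per Sec. 4.2
    return p_intra * (1.0 - p_inter)

def p_classifier(pos, neg, p_label):
    """Classifier probability for one partition given a descriptor's labeled data;
    set to 1 when there is insufficient labeled data (p_label is None)."""
    if p_label is None:
        return 1.0
    p_pos = min(p_label(x) for x in pos)
    p_neg = max((p_label(x) for x in neg), default=1.0)   # 1 if |-| = 0, per Sec. 4.2
    return p_pos * (1.0 - p_neg)

def node_probability(pos, neg, p_same, p_label):
    """Unnormalized node probability; normalization happens per lattice column."""
    return p_distance(pos, neg, p_same) * p_classifier(pos, neg, p_label)
```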

4.3 Overspecification and Edge Probabilities

Edges have a constant transition probability equal to the overspecification probability if overspecified, or equal to its complement otherwise. We use these probabilities to incorporate the phenomenon of overspecification in our model, where, contrary to a strict interpretation of Grice's Maxim of Quantity, speakers will give more information than is needed to identify a referent (Koolen et al., 2011). An edge is considered overspecified if the hypothesis for the objects that satisfy the next descriptor does not add additional information, i.e., the set-intersection it corresponds to does not remove any possible objects from the referent set. Thus the model will prefer hypotheses for the next descriptor that narrow down the hypothesized set of referents.


4.4 Path Probabilities

The probability of each path is the product of the probabilities of each of the partitions along its path and the edge (overspecification) probabilities. If there is a single path with a probability greater than all others by an amount ε, the labels of the partitions along that path are assigned to the positive examples while also being assigned as negative properties for the negative examples. We perform this updating step after each utterance to simulate incremental continuous language learning and to provide the most current knowledge available for resolving new ambiguous data.

If there are multiple best paths within ε of the highest probability path, then the learning example is considered ambiguous and saved in memory to resolve with information from future examples.
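A minimal sketch of path scoring and the ε-ambiguity check, reusing the node representation from the earlier sketches; the default overspecification probability and ε value are placeholders (the paper tunes the former, see §7.4):

```python
def path_probability(path, node_prob, overspec_prob=0.5):
    """Product of node probabilities and edge probabilities along one path.
    An edge is overspecified when intersecting the next partition does not
    shrink the referent set."""
    p = node_prob[path[0]]
    referents = set(path[0][0])                 # positive set of the first node
    for node in path[1:]:
        new_referents = referents & set(node[0])
        overspecified = (new_referents == referents)
        p *= overspec_prob if overspecified else (1.0 - overspec_prob)
        p *= node_prob[node]
        referents = new_referents
    return p, referents

def best_paths(all_paths, node_prob, epsilon=0.05):
    """Return every path within epsilon of the most probable one; more than one
    surviving path means the demonstration is ambiguous and is stored for later
    cross-situational (multiple-instance) resolution."""
    scored = [(path_probability(p, node_prob)[0], p) for p in all_paths]
    top = max(score for score, _ in scored)
    return [p for score, p in scored if top - score <= epsilon]
```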

4.5 Multiple-Instance Learning

In many cases, especially in the system's first learning instances, there is not enough information to unambiguously learn from the demonstration. Without any unambiguous examples, our system would struggle to learn no matter how much data was available to it. An ambiguous training example yields more than one highest probability path. Our goal is to use new information from each new training demonstration to reevaluate these paths and determine a singular best path, which allows us to update our knowledge accordingly.

To do this, we independently consider columns for each unknown descriptor from unresolved demonstrations containing that descriptor and combine them to form super-partitions, which are then evaluated using our distance probability function. For example, consider two instances described with "the red box". The first has a red and a blue box, while the second has a red and a green box. Individually they are ambiguous to a system that does not know what "red" means, and therefore each demonstration would have two paths with equal probability. If we combine the partitions across the two demonstrations into four super-partitions, the highest probability will be generated when the two red boxes are in the positive set. This probability is stored in each of the constituent partitions as a meta-probability, which is otherwise 1 when multiple-instance learning is not required to resolve ambiguity. The meta-probability allows us to find the most probable path given previous instances.
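One plausible reading of this super-partition step, sketched below; it reuses the p_distance function from the node-probability sketch, and storing the best super-partition score per constituent partition is an assumption about how the meta-probability is bookkept:

```python
from itertools import product

def resolve_with_mil(columns, p_same):
    """Cross-situational resolution of one unknown descriptor.

    `columns`: one entry per unresolved demonstration, each a list of candidate
    (pos, neg) partitions for that descriptor. Partitions from different
    demonstrations are unioned into super-partitions, scored with the distance
    probability, and the score is written back as a meta-probability."""
    meta = {}
    for combo in product(*columns):
        super_pos = frozenset().union(*(pos for pos, _ in combo))
        super_neg = frozenset().union(*(neg for _, neg in combo))
        score = p_distance(super_pos, super_neg, p_same)
        for partition in combo:
            meta[partition] = max(meta.get(partition, 0.0), score)
    return meta
```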

5 System Pipeline

5.1 Training

To train the system on a video, we transcribe the video with sentence-level timestamps and extract features from the demonstration video. The system takes as input the feature data aligned with utterances from the demonstration video. It then finds the most likely path through the reference lattice and adds all hypothesized positive examples for the descriptor as class examples for the classifier. If there is more than one likely path, it saves the lattice for later resolution using multiple-instance learning.

5.2 Description Generation

During testing, the system generates a description for an object in the test set by finding examples of properties and objects similar to it in previously seen objects. For properties, the system checks each feature space separately to find previous examples of objects similar in that feature space and adds each found property label to the k-NN voting set, weighted by the distance. If the majority label does not have the matching representative feature, the system skips this feature space for adding a property to the description. The object name is chosen using a distance generated from the sum of the distances (normalized and weighted through the Mahalanobis distance metric) to the most similar previous examples. More details about the description generation process can be found in our previous paper (Perera and Allen, 2013).

6 Evaluation

To evaluate our system, we use two metrics: our evaluation method used in previous work for rating the quality of generated descriptions (Perera and Allen, 2013), and a standard precision/recall measurement to determine the accuracy of reference resolution.

Figure 4: F-score for Description Generation in grey and Reference Resolution in black for various configurations of the system run on 4 underspecified videos. Error bars are one standard deviation.

The description generated by the system is compared with a number of possible ground truth descriptions, which are generated using precision and recall equivalence classes from our previous work. Precision is calculated according to which words in the description could be found in a ground truth description, while recall is calculated according to which words in the closest ground truth description were captured by the system's description. As an example, a system output of "red rectangle" when the ground truth description is "red square" or "red cube" would have a precision score of 1 (because both "red" and "rectangle" are accurate descriptors of the object) but a recall of .5 (because the square-ness was not captured by the system's description).
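A simplified sketch of this metric, reproducing the worked example above; the exact construction of the equivalence classes is abstracted into the set of acceptable words, which is an assumption about how the original evaluation is organized:

```python
def description_scores(system_words, acceptable_words, gold_descriptions):
    """Precision/recall in the spirit of the equivalence-class evaluation.

    `acceptable_words`: every word counting as an accurate descriptor of the
    object (precision equivalence classes, e.g. "rectangle" for a square object).
    `gold_descriptions`: ground-truth descriptions used for recall."""
    system = set(system_words)
    precision = len(system & acceptable_words) / len(system)
    recall = max(len(system & gold) / len(gold) for gold in gold_descriptions)
    return precision, recall

# The worked example from the text:
p, r = description_scores(
    {"red", "rectangle"},
    acceptable_words={"red", "rectangle", "square", "cube"},
    gold_descriptions=[{"red", "square"}, {"red", "cube"}])
print(p, r)   # 1.0 and 0.5 -- "rectangle" is acceptable but misses the square-ness
```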

In the reference resolution evaluation, precision and recall are calculated on the training set according to the standard measures, by comparing the referent set obtained by the system and the ground truth referent set (those objects actually referred to by the speaker). Training instances lacking feature data because of an error in recording were excluded from the F1-score for reference resolution.

Each underspecified demonstration video consisted of 15-20 demonstrations containing one or more focal objects referenced in the description and, in most cases, distractor objects that are not mentioned. We used the same test video from our previous work, with objects removed that could not be described using terms used in the training set, leaving 15 objects.

We tested eight different system configurations. The baseline system simply guessed at a path through the lattice without any multiple-instance learning (G). We then added multiple-instance learning (M), distance probabilities (D), classifier probabilities (C), quantifier constraints (Q), and negative information (N). We show the data for these different methods in Figure 4.

7 Results and Discussion

7.1 Learning Methods

Rather than comparing our language learning system to others on a common dataset, we choose to focus our analysis on how our implementations of pragmatic inference and child language learning strategies affected the performance of reference resolution and description generation.

The relatively strong naming performance of G can be attributed to the fact that many demonstrations had similarities among the objects presented that could be learned from choosing any of the objects. However, reference resolution performance for G averaged a .34 F1-score compared with a .70 F1-score for our best performing configuration. Adding quantifier constraints (GQ) did not help, although quantifier constraints with multiple-instance learning (GMQ) led to a significant increase in reference resolution performance.

Multiple-instance learning provided a significant gain in reference resolution performance, and with quantifier constraints also yielded the highest naming performance (QDM and QDML). The relatively lower performance from including classifier probabilities with this limited training data is due to errors in classification that compound in this online-learning framework. In multiple-instance cases where there are a number of previous examples to draw from, the information provided by the classifier probability is redundant and less accurate. However, as the approach scales and retaining previous instances becomes intractable, the classifier probabilities provide a more concise representation of knowledge to be used in future learning.

We found that negative information hurt performance in this framework (QDMCN vs. QDMC) for two reasons. First, the risk of introducing negative information is high compared to its possible reward. While it promises to remove some errors in classification, an accurate piece of negative information only removes one class from consideration when multiple other alternatives exist, while an inaccurate piece of negative information contributes to erroneous classification.

Second, situations where negative information might be inferred are induced by a natural language description which, by Grice's Maxims, will attempt to be as clear as possible given the listener's information. This means that, adhering to the Contrast principle, negative examples are likely already far from the positive examples for the class.

Figure 5: Description generation results from training according to the number of training videos, with standard error bars. The solid line is QDM's performance learning from underspecified videos. The dashed line is the system's performance learning from videos where objects are presented one at a time. The dotted line is the baseline (G). The F1-score for reference resolution in the underspecified case was consistent across videos (mean .7, SD .01).

Figure 5 shows results from averaging random combinations of 4 underspecified videos, using our highest-scoring configuration QDM, to show the increase in performance as more training data is provided to the system. We compare our results on videos with multiple objects to the performance of the system with objects presented one at a time and with the baseline G. Because the training objects are slightly different, we present results on a subset of objects where at least a ground truth object name was present in the training data. Our results show that while the performance is lower in the ambiguous case, the general learning rate per video is comparable with the single-object case. In the 1-video case, guessing is equally as effective as our method, due to the system being too tentative with assigning labels to objects without more information in order to minimize errors affecting learning in later demonstrations.

We did see an effect of the order in which videos were presented to the system on performance, suggesting that learning the correct concepts early on can have long-term ramifications for an online learning process. Possible ways to mitigate this effect include a memory model with forgetting or a more robust classifier. We leave such efforts to future work.

7.2 Running Time Performance

While the number of nodes and paths in the lattice is exponential in the number of objects in the scene, our system can still perform quickly enough to serve as a language learning agent suitable for real-time interaction. The pragmatic constraints on possible referent sets allow us to remove a large number of paths, which is especially important when there are many objects in the scene or when the referring expression contains a number of descriptors. In situations with more than 4-5 objects, we expect that other cues can establish joint attention with enough resolution to remove some objects from consideration.

Visual features can be extracted from video at 3 frames per second, which is acceptable for real-time interaction as only 5 frames are needed for training or testing. Not including the feature extraction (performed separately), the QUM configuration processed our 55 demonstrations in about 1 minute on a 2.3 GHz Intel Core i7.

7.3 Relation Between Evaluation Metrics

We compared our results from the description generation metric with the reference resolution metric to evaluate how the quality of reference resolution affected learning performance. The description generation F-score was more strongly positively correlated with reference resolution precision than with recall. We found that a reference resolution F-score with β = .7 (as opposed to the standard β = 1) had the highest Pearson correlation with the F-score (r = .63, p < .0001), indicating that reference resolution precision is roughly 1.4 times more important than recall in predicting learning performance in this system.
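A minimal sketch of this analysis: the weighted F-score (with β < 1 favoring precision) correlated against description-generation F-scores via SciPy's Pearson correlation. The per-video numbers below are hypothetical, purely to make the snippet runnable:

```python
from scipy.stats import pearsonr

def f_beta(precision, recall, beta=0.7):
    """Weighted F-score; beta < 1 weights precision more heavily than recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical per-video reference-resolution and description-generation scores:
ref_precision = [0.82, 0.75, 0.68, 0.71]
ref_recall    = [0.64, 0.70, 0.59, 0.66]
desc_fscore   = [0.61, 0.58, 0.49, 0.55]

ref_f07 = [f_beta(p, r, beta=0.7) for p, r in zip(ref_precision, ref_recall)]
r_corr, p_value = pearsonr(ref_f07, desc_fscore)
print(r_corr, p_value)
```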

This result provides evidence that the quality of the first data in a limited-data learning algorithm can be critical in establishing long-term performance, especially in an online learning system, and suggests that our results could be improved by correcting hypotheses that once appeared reasonable to the system. It also suggests that the F1-score may not be the most appropriate measure of performance for a component that is relied upon to give accurate data for further learning.

7.4 Overspecification

Accounting for overspecification in the model more closely approximates human speech at the expense of a strict interpretation of the Maxim of Quantity. It allows us to use graded pragmatic constraints that admit helpful heuristics for learning without treating them as a rule. In our training data, the speaker was typically over-descriptive, leading to a high optimal overspecification. Figure 6 shows the effect of different values for the overspecification probability on the performance of the system. The strong dip in reference resolution performance at an overspecification probability of 0 shows the significant negative effect a strict interpretation of the Maxim of Quantity would have in this situation. The correct value for the overspecification probability for a given situation depends on a number of factors such as scene complexity and descriptor type (Koolen et al., 2011), but we have not yet incorporated these factors into our overspecification probability in this work.

Figure 6: Effect of varying the overspecification probability on the F1 score for both Description Generation (black dashed) and Reference Resolution (grey solid), calculated on a dataset with hand location information.

7.5 Comparison to Other Multi-Instance Learning Methods

Our multi-instance learning procedure can be classified as instance-level with witnesses, which means that we identify the positive examples that lead to the label of the "bag", or demonstration in this case. In addition, we relax the assumption that there is only a single positive instance corresponding to the label of the demonstration. This relaxation increases the complexity of cross-instance relationships, but allows for references to multiple objects simultaneously and therefore faster training than a sequential presentation would allow. In accounting for overspecification, we also must establish a dependence on the labels of the image via the edges of the lattice. This adds additional complexity, but our results show that accounting for overspecification can lead to increased performance.

8 Future Work

Work on this system is ongoing, with extensions planned for improving performance, generating more complete symbol grounding, and allowing more flexibility in both environment and language.

While the parser in our system can interpret phrases such as "the tall block", we do not have a way of resolving the non-intersective predicate "tall" in our current framework. Non-intersective predicates add complexity to the system because their reference point is not necessarily the other objects in the scene - it may be a reference to other objects in the same class (i.e., blocks).

Also, our set of features is rather rudimentary and could be improved, as we chose low-dimensional, continuous features in an attempt to facilitate a close connection between language and vision. The use of continuous features ensures that primitive concepts are grounded solely in perception and not in higher-order conceptual models (Perera and Allen, 2014). Initial results using 3D shape features show a considerable performance increase on a kitchen dataset we are developing.

9 Conclusion

We have proposed a probabilistic framework for using pragmatic inference to learn from underspecified visual descriptions. We show that this system can use pragmatic assumptions, attenuated by an overspecification probability, to learn attributes and object names from videos that include a number of distractors. We also analyzed various learning methods in an attempt to gain a deeper understanding of the theoretical and practical considerations of situated language learning, finding that Conventionality and Contrast learning strategies with quantifiers and overspecification probabilities yielded the best performing system. These results support the idea that an understanding of how humans learn and communicate can lead to better visually grounded language learning systems. We believe this work is an important step towards systems in which natural language not only stands in for manual annotation, but also enables new methods of training robots and other situated systems.

10 Acknowledgements

This work was funded by the Office of Naval Research (N000141210547), the Nuance Foundation, and the DARPA Big Mechanism program under ARO contract W911NF-14-1-0391.


References

J. Allen, Mary Swift, and Will de Beaumont. 2008. Deep Semantic Analysis of Text. In Symp. Semant. Syst. Text Process., volume 2008, pages 343–354, Morristown, NJ, USA. Association for Computational Linguistics.

D A Baldwin. 1992. Clarifying the role of shape inchildren’s taxonomic assumption. J. Exp. Child Psy-chol., 54(3):392–416.

T Brown and J Koplowitz. 1979. The WeightedNearest Neighbor Rule for Class Dependent SampleSizes. IEEE Trans. Inf. Theory, I(5):617–619.

Eve V. Clark. 2009. On the pragmatics of contrast. J.Child Lang., 17(02):417, February.

T Dietterich. 1997. Solving the multiple instanceproblem with axis-parallel rectangles. Artif. Intell.,89:31–71.

Haris Dindo and Daniele Zambuto. 2010. A prob-abilistic approach to learning a visually groundedlanguage model through human-robot interaction.IEEE/RSJ 2010 Int. Conf. Intell. Robot. Syst. IROS2010 - Conf. Proc., pages 790–796.

A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. 2009.Describing objects by their attributes. 2009 IEEEConf. Comput. Vis. Pattern Recognit., pages 1778–1785, June.

James Foulds and Eibe Frank. 2010. A review ofmulti-instance learning assumptions. Knowl. Eng.Rev., 25:1.

HP Grice. 1975. Logic and conversation. In PeterCole and Jerry L. Morgan, editors, Syntax Semant.,pages 41–58. Academic Press.

Robert M. Haralick, K. Shanmugam, and Its’hak Din-stein. 1973. Textural features for image classifica-tion. IEEE Trans. Syst. Man, Cybern. SMC-3.

A Khotanzad and Y H Hong. 1990. Invariant ImageRecognition by Zernike Moments. IEEE Trans. Pat-tern Anal. Mach. Intell., 12(5):489–497, May.

Thomas Kollar, Jayant Krishnamurthy, and GrantStrimel. 2013. Toward interactive grounded lan-guage acquisition. Proc. Robot. Sci. Syst.

Ruud Koolen, Albert Gatt, Martijn Goudbeek, andEmiel Krahmer. 2011. Factors causing over-specification in definite descriptions. J. Pragmat.,43(13):3231–3250, October.

Jayant Krishnamurthy and Thomas Kollar. 2013.Jointly Learning to Parse and Perceive: ConnectingNatural Language to the Physical World. Trans. As-soc. Comput. Linguist., 1:193–206.

Barbara Landau, Linda Smith, and Susan Jones. 1998.Object Shape, Object Function, and Object Name.J. Mem. Lang., 38(1):1–27, January.

Mu Li, James T. Kwok, and Bao Liang Lu. 2010.Online multiple instance learning with no regret.Proc. IEEE Comput. Soc. Conf. Comput. Vis. PatternRecognit., pages 1395–1401.

PC C Mahalanobis. 1936. On The Generalized Dis-tance in Statistics. Proc. Natl. Inst. Sci. India, pages49–55.

Oded Maron and Tomas Lozano-Perez. 1998. Aframework for multiple-instance learning. Adv. Neu-ral Inf. Process. Syst., 10:570 – 576.

Cynthia Matuszek, N FitzGerald, Luke Zettlemoyer,Liefeng Bo, and Dieter Fox. 2012. A Joint Modelof Language and Perception for Grounded AttributeLearning. In Proc. Int. Conf. Mach. Learn.

Shiwali Mohan, John E Laird, and Laird Umich Edu.2013. Towards an Indexical Model of SituatedLanguage Comprehension for Real-World CognitiveAgents. 2013:153–170.

Iftekhar Naim, Young Chol Song, Qiguang Liu, HenryKautz, Jiebo Luo, and Daniel Gildea. 2014. Un-supervised Alignment of Natural Language Instruc-tions with Video Segments. In AAAI.

Ian Perera and JF Allen. 2013. SALL-E: SituatedAgent for Language Learning. In Twenty-SeventhAAAI Conf. Artif. Intell.

Ian Perera and James F Allen. 2014. What is theGround ? Continuous Maps for Grounding Percep-tual Primitives. In P Bello, M. Guarini, M Mc-Shane, and Brian Scassellati, editors, Proc. 36thAnnu. Conf. Cogn. Sci. Soc., Austin, TX. CognitiveScience Society.

A C Rencher. 2003. Methods of Multivariate Analysis.Wiley Series in Probability and Statistics. Wiley.

Danijel Skocaj, Matej Kristan, Alen Vrecko, MarkoMahnic, Miroslav Janicek, Geert-Jan M. Krui-jff, Marc Hanheide, Nick Hawes, Thomas Keller,Michael Zillich, and Kai Zhou. 2011. A systemfor interactive learning in dialogue with a tutor. In2011 IEEE/RSJ Int. Conf. Intell. Robot. Syst., pages3387–3394. IEEE, September.

Linda Smith and Chen Yu. 2008. Infants rapidly learnword-referent mappings via cross-situational statis-tics. Cognition, 106:1558–1568.

Kenny Smith, Andrew D M Smith, and Richard a.Blythe. 2011. Cross-situational learning: An exper-imental study of word-learning mechanisms. Cogn.Sci., 35:480–498.

John C Trueswell, Tamara Nicol Medina, Alon Hafri,and Lila R Gleitman. 2013. Propose but verify:fast mapping meets cross-situational word learning.Cogn. Psychol., 66(1):126–56, February.

W Van Orman Quine. 1964. Word and Object. MITPress paperback series. MIT Press.


Haonan Yu and Jeffrey Mark Siskind. 2013. Grounded Language Learning from Video Described with Sentences. In Proc. 51st Annu. Meet. Assoc. Comput. Linguist., pages 53–63.


Proceedings of the 19th Conference on Computational Language Learning, pages 237–246, Beijing, China, July 30-31, 2015. ©2015 Association for Computational Linguistics

Recovering Traceability Links in Requirements Documents

Zeheng Li, Mingrui Chen, LiGuo Huang
Department of Computer Science & Engineering
Southern Methodist University
Dallas, TX 75275-0122
{zehengl,mingruic,lghuang}@smu.edu

Vincent Ng
Human Language Technology Institute
University of Texas at Dallas
Richardson, TX 75083-0688
[email protected]

Abstract

Software system development is guided by the evolution of requirements. In this paper, we address the task of requirements traceability, which is concerned with providing bi-directional traceability between various requirements, enabling users to find the origin of each requirement and track every change made to it. We propose a knowledge-rich approach to the task, where we extend a supervised baseline system with (1) additional training instances derived from human-provided annotator rationales; and (2) additional features derived from a hand-built ontology. Experiments demonstrate that our approach yields a relative error reduction of 11.1–19.7%.

1 Introduction

Software system development is guided by the evolution and refinement of requirements. Requirements specifications, which are mostly documented using natural language, are refined with additional design details and implementation information as the development life cycle progresses. A crucial task throughout the entire development life cycle is requirements traceability, which is concerned with linking requirements in which one is a refinement of the other.

Specifically, one is given a set of high-level (coarse-grained) requirements and a set of low-level (fine-grained) requirements, and the goal of requirements traceability is to find for each high-level requirement all the low-level requirements that refine it. Note that the resulting mapping between high- and low-level requirements is many-to-many, because a low-level requirement can potentially refine more than one high-level requirement. As an example, consider the three high-level requirements and two low-level requirements shown in Figure 1 about the well-known Pine email system. In this example, three traceability links should be established: (1) HR01 is refined by UC01 (because UC01 specifies the shortcut key for saving an entry in the address book); (2) HR02 is refined by UC01 (because UC01 specifies how to store contacts in the address book); and (3) HR03 is refined by UC02 (because both of them are concerned with the help system).

From a text mining perspective, requirements traceability is a very challenging task. First, there could be abundant information irrelevant to the establishment of a link in one or both of the requirements. For instance, all the information under the Description section in UC01 is irrelevant to the establishment of the link between UC01 and HR02. Worse still, as the goal is to induce a many-to-many mapping, information irrelevant to the establishment of one link could be relevant to the establishment of another link involving the same requirement. For instance, while the Description section is irrelevant when linking UC01 and HR02, it is crucial for linking UC01 and HR01. Above all, a link can exist between a pair of requirements (HR01 and UC01) even if they do not possess any overlapping or semantically similar content words.

Virtually all existing approaches to the requirements traceability task were developed in the software engineering (SE) research community. Related work on this task can be broadly divided into two categories. In manual approaches, requirements traceability links are recovered manually by developers.


Figure 1: Samples of high- and low-level requirements.

Automated approaches, on the other hand, have relied on information retrieval (IR) techniques, which recover links based on computing the similarity between a given pair of requirements. Hence, such similarity-based approaches are unable to recover links between those pairs that do not contain overlapping or semantically similar words or phrases.

In light of this weakness, we recast requirements traceability as a supervised binary classification task, where we classify each pair of high- and low-level requirements as positive (having a link) or negative (not having a link). In particular, we propose a knowledge-rich approach to the task, where we extend a supervised baseline employing only word pairs and LDA-induced topics as features (see Section 5.2) with two types of human-supplied knowledge. First, we employ annotator rationales. In the context of requirements traceability, rationales are human-annotated words or phrases in a pair of high- and low-level requirements that motivated a human annotator to establish a link between the two. In other words, rationales contain the information relevant to the establishment of a link. Therefore, using them could allow a learner to focus on the relevant portions of a requirement. Motivated by Zaidan et al. (2007), we employ rationales to create additional training instances for the learner.

Second, we employ an ontology hand-built by a domain expert. A sample ontology built for the Pine domain is shown in Table 1. As we can see, the ontology contains a verb clustering and a noun clustering: the verbs are clustered by the function they perform, whereas a noun cluster corresponds to a (domain-specific) semantic type. We employ the ontology to derive additional features.

There are at least two reasons why the ontology-based features might be useful for identifying traceability links. First, since only those verbs and nouns that (1) appear in the training data and (2) are deemed relevant by the domain expert for link identification are included in the ontology, it provides guidance to the learner as to which words/phrases in the requirements it should focus on in the learning process.1 Second, the verb and noun clusters provide a robust generalization of the words/phrases in the requirements. For instance, a word pair that is relevant for link identification may still be ignored by the learner due to its infrequency of occurrence. The features computed based on these clusters, on the other hand, will be more robust to the infrequency problem and could therefore provide better generalizations.

Our contributions are three-fold. First, the knowledge-rich approach we propose for requirements traceability significantly outperforms a supervised baseline on two traceability datasets, Pine and WorldVistA. Second, we increase the NLP community's awareness of this under-studied, challenging, yet important problem in SE, which could lead to fruitful inter-disciplinary collaboration. Third, to facilitate future research on this problem, we make our annotated resources, including the datasets, the rationales, and the ontologies, publicly available.2

1 Note that both the rationales and the words/phrases in the ontology could help the learner by allowing it to focus on relevant materials in a given pair of requirements. Nevertheless, they are not identical: rationales are words/phrases that are relevant to the establishment of a particular traceability link, whereas the words/phrases in the ontology are relevant to link establishment in general in the given domain.


(a) Noun clustering

Category | Terms
Message  | mail, message, email, e-mail, PDL, subjects
Contact  | contact, addresses, multiple addresses
Folder   | folder, folder list, tree structure
Location | address book, address field, entry, address
Platform | windows, unix, window system, unix system
Module   | help system, spelling check, Pico, shell
Protocol | MIME, SMTP
Command  | shortcut key, ctrl+c, ctrl+m, ctrl+p, ctrl+x

(b) Verb clustering

Category              | Terms
System Operation      | evoke, operate, set up, activate, log
Message Search        | search, find
Contact Manipulation  | add, store, capture
Message Manipulation  | compose, delete, edit, save, print
Folder Manipulation   | create, rename, delete, nest
Message Communication | reply, send, receive, forward, cc, bcc
User Input            | input, type, enter, press, hit, choose
Visualization         | display, list, show, prompt, highlight
Movement              | move, navigate
Function              | support, have, perform, allow, use

Table 1: Manual ontology for Pine.


2 Related Work

Related work on traceability link prediction can be broadly divided into two categories, manual approaches and automatic approaches.

Manual requirement tracing. Traditional manual requirements tracing is usually accomplished by system analysts with the help of requirement management tools, where analysts visually examine each pair of requirements documented in the requirement management tools to build the Requirement Traceability Matrix (RTM). Most existing requirement management tools (e.g., Rational DOORS3, Rational RequisitePro4, CASE5) support traceability analysis. Manual tracing is often based on observing the potential relevance between a pair of requirements belonging to different categories or at different levels of detail. The manual process is human-intensive and error-prone given a large set of requirements.

2 See our website at http://lyle.smu.edu/~lghuang/research/Traceability/ for these annotated resources.

3 http://www-03.ibm.com/software/products/en/ratidoor

4 http://www.ibm.com/developerworks/downloads/r/rrp

5 http://www.analysttool.com

Automated requirement tracing. Automated or semi-automated requirements traceability, on the other hand, generates traceability links automatically, and hence significantly increases efficiency. Pierce (1978) designed a tool that maintains a requirements database to aid automated requirements tracing. Jackson (1991) proposed a keyphrase-based approach for tracing a large number of requirements of a large Surface Ship Command System. More advanced approaches relying on information retrieval (IR) techniques, such as the tf-idf-based vector space model (Sundaram et al., 2005), Latent Semantic Indexing (Lormans and Van Deursen, 2006; De Lucia et al., 2007; De Lucia et al., 2009), probabilistic networks (Cleland-Huang et al., 2005), and Latent Dirichlet Allocation (Port et al., 2011), have been investigated, where traceability links were generated by calculating the textual similarity between requirements using similarity measures such as Dice, Jaccard, and Cosine coefficients (Dag et al., 2002). All these methods were developed based on either matching keywords or identifying similar words across a pair of requirements, and none of them have studied the feasibility of employing supervised learning to accomplish this task, unlike ours.

3 Datasets

For evaluation, we employ two publicly-available datasets annotated with traceability links. The first dataset, annotated by Sultanov and Hayes (2010), involves the Pine email system developed at the University of Washington. The second dataset, annotated by Cleland-Huang et al. (2010), involves WorldVistA, an electronic health information system developed by the USA Veterans Administration. Statistics on these datasets are shown in Table 2. For Pine, 2499 instances can be created by pairing the 49 high-level requirements with the 51 low-level use cases. For WorldVistA, 9193 instances can be created by pairing the 29 high-level requirements with the 317 low-level specifications. As expected, these datasets have skewed class distributions: only 10% (Pine) and 4.3% (WorldVistA) of the pairs are linked.

While these datasets have been annotated with traceability links, they are not annotated with annotator rationales. Consequently, we employed a software engineer specializing in requirements traceability to perform rationale annotation.


Datasets                          | Pine | WorldVistA
# of (high-level) requirements    | 49   | 29
# of (low-level) specifications   | 51   | 317
Avg. # of words per requirement   | 17   | 18
Avg. # of words per specification | 148  | 26
Avg. # of links per requirement   | 5.1  | 13.6
Avg. # of links per specification | 4.9  | 1.2
# of pairs that have links        | 250  | 394
# of pairs that do not have links | 2249 | 8799

Table 2: Statistics on the datasets.

We asked him to annotate rationales for a pair of requirements only if he believed that there should be a traceability link between them. The reason is that in traceability prediction, the absence of a traceability link between two requirements is attributed to the lack of evidence that they should be linked, rather than the presence of evidence that they should not be linked. More specifically, we asked the annotator to identify as rationales all the words/phrases in a pair of requirements that motivated him to label the pair as positive. For instance, for the link between HR01 and UC01 in Figure 1, he identified two rationales from HR01 ("shortcut key" and "control and the shortcut key are pressed") and one from UC01 ("press ctrl+x").

4 Hand-Building the Ontologies

A traceability link prediction ontology is composed of a verb clustering and a noun clustering. We asked a software engineer with expertise in requirements traceability to hand-build the ontology for each of the two datasets. Using his domain expertise, the engineer first identified the noun categories and verb categories that are relevant for traceability prediction. Then, by inspecting the training data, he manually populated each noun/verb category with words and phrases collected from the training data.

As will be discussed in Section 8, we evaluate our approach using five-fold cross validation. Since the nouns/verbs in the ontology were collected only from the training data, the software engineer built five ontologies for each dataset, one for each fold experiment. Hence, nouns/verbs that appear in only the test data in a fold experiment are not included in that experiment's ontology. In other words, our test data are truly held-out w.r.t. the construction of the ontology. Tables 1 and 3 show the complete lists of noun and verb categories identified for Pine and WorldVistA, respectively, as well as sample nouns and verbs that populate each category.

(a) Noun clustering

Category        | Terms
Signature       | signature, co-signature, identity
Prescription    | prescription, electronic prescription
Authorization   | authorized users, administrator
Patient Info    | patient data, health summary
General Info    | information, data
Component       | platform, interface, integration
Laboratory      | lab test results, laboratory results
Customization   | individual customization, customization
Discharge       | discharge, discharge instruction
Records         | progress note, final note, saved note
Details         | specifications, order details
Medication      | medications, drug
Side-effect     | allergy, drug-drug interaction, reaction
Code            | ICD-9, standards, management code
User            | user class hierarchy, roles, user roles
Order           | orderable file, order, medication order
List            | problem list, problem/diagnosis list
Rules           | business rules, C32, HITSP C32
Document        | documents, level 2 CCD documents
Schedule        | appointments, schedule, reminders
Warning         | warning, notice warning
Encounter       | encounter, encounter data
Hospitalization | hospitalization data, procedure data
Arrangement     | templates, clinical templates, data entry
Immunization    | immunization, immunization record
Protocol        | protocol, HL7, HTTP, FTP, HL7-ASTM
System data     | codified data, invalid data, data elements
Key             | key, security key, computer key
Identity        | social security number, account number
Audit           | audit log, audit records, audit trail
Duty            | assignment, task

(b) Verb clustering

Category           | Terms
Interface actions  | click, select, search, browse
Authentication     | sign, co-sign, cosign, authenticate
Customization      | fit, customize, individualize
Notification       | check, alert
Security control   | control, prevent, prohibit, protect
Data manipulation  | capture, associate, document, create
Visualization      | display, view, provide, generate
Recording          | audit, log
Data retrieval     | export, retrieve
Deletion           | remove, delete
System operation 1 | save, retain
System operation 2 | abort, restrict, delay, lock
Search             | find, query
Communication      | forward, redirect

Table 3: Manual ontology for WorldVistA.

Note that the five ontologies employ the same set of noun and verb categories, differing only w.r.t. the nouns and verbs that populate each category. As we can see, for Pine, eight groups of nouns and ten groups of verbs are defined, and for WorldVistA, 31 groups of nouns and 14 groups of verbs are defined. Each noun category represents a domain-specific semantic class, and each verb category corresponds to a function performed by the action underlying a verb.


5 Baseline Systems

In this section, we describe three baseline systems for traceability prediction.

5.1 Unsupervised Baselines

The Tf-idf baseline. We employ tf-idf as our first unsupervised baseline. Each document is represented as a vector of unigrams. The value of each vector entry is its associated word's tf-idf value. Cosine similarity is used to compute the similarity between two documents. Any pair of requirements whose similarity exceeds a given threshold is labeled as positive.

The LDA baseline. We employ LDA (Blei et al., 2003) as our second unsupervised baseline. We train an LDA model on our data to produce n topics. For Pine, we set n to 10, 20, . . ., 50. For WorldVistA, because of its larger size, we set n to 50, 60, . . ., 100. We then represent each document as a vector of length n, with each entry set to the probability that the document belongs to one of the topics. Cosine similarity is used as the similarity measure. Any pair of requirements whose similarity exceeds a given threshold is labeled as positive.
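For readers who want to reproduce this kind of similarity-based baseline, a minimal sketch is given below. It is our own illustration rather than the authors' code; it uses scikit-learn's TfidfVectorizer as one possible tf-idf implementation, and the threshold value is an arbitrary placeholder.

```python
# Hedged sketch of a tf-idf similarity baseline for traceability prediction.
# Assumes scikit-learn; the threshold 0.1 is an arbitrary illustrative value.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_links(high_reqs, low_reqs, threshold=0.1):
    """Return (i, j) index pairs whose cosine similarity exceeds the threshold."""
    vectorizer = TfidfVectorizer()
    # Fit one vocabulary over all requirements, then split the matrix back.
    matrix = vectorizer.fit_transform(high_reqs + low_reqs)
    high_vecs = matrix[:len(high_reqs)]
    low_vecs = matrix[len(high_reqs):]
    sims = cosine_similarity(high_vecs, low_vecs)
    return [(i, j) for i in range(len(high_reqs))
                   for j in range(len(low_reqs)) if sims[i, j] > threshold]
```

The LDA baseline would follow the same pattern, with topic-probability vectors in place of tf-idf vectors.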

5.2 Supervised Baseline

Each instance corresponds to a high-level requirement and a low-level requirement. Hence, we create instances by pairing each high-level requirement with each low-level requirement. The class value of an instance is positive if the two requirements involved should be linked; otherwise, it is negative. Each instance is represented using two types of features:

Word pairs. We create one binary feature for each word pair (wi, wj) collected from the training instances, where wi and wj are words appearing in a high-level requirement and a low-level requirement respectively. Its value is 1 if wi and wj appear in the high-level and low-level pair under consideration, respectively.

LDA-induced topic pairs. Motivated by previous work, we create features based on the topics induced by an LDA model for a requirement. Specifically, we first train an LDA model on our data to obtain n topics, where n is to be tuned jointly with C on the development (dev) set.6 Then, we create one binary feature for each topic pair (ti, tj), where ti and tj are the topics corresponding to a high-level requirement and a low-level requirement, respectively. Its value is 1 if ti and tj are the most probable topics of the high-level and low-level pair under consideration, respectively.

6 As in the LDA baseline, for Pine we set n to 10, 20, . . ., 50, and for WorldVistA, we set n to 50, 60, . . ., 100.
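To make the feature construction concrete, the following minimal sketch (our own illustration, not the authors' implementation) builds the binary word-pair features for a single high-level/low-level instance; the topic-pair feature would be added analogously from the two most probable LDA topics.

```python
# Minimal sketch of word-pair features for one (high-level, low-level) instance.
# 'vocab_pairs' is assumed to be the set of (w_i, w_j) pairs collected in training.
def word_pair_features(high_words, low_words, vocab_pairs):
    feats = {}
    for wi in set(high_words):
        for wj in set(low_words):
            if (wi, wj) in vocab_pairs:
                feats[("wp", wi, wj)] = 1  # binary feature fires for this pair
    return feats
```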

We employ LIBSVM (Chang and Lin, 2011) to train a binary SVM classifier on the training set. We use a linear kernel, setting all learning parameters to their default values except for the C (regularization) parameter, which we tune jointly with n (the number of LDA-induced topics) to maximize F-score on the dev set.7 Since we conduct five-fold cross validation, in all experiments that require a dev set, we use three folds for training, one fold for dev, and one fold for evaluation.
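The joint tuning described above amounts to a grid search on the development fold. The sketch below is a schematic rendering under our own assumptions, using scikit-learn's LinearSVC as a stand-in for LIBSVM's linear kernel and showing only the loop over C; the outer loop over the number of LDA topics n would wrap it.

```python
# Schematic tuning of the SVM regularization parameter C on a dev fold.
# X_train, X_dev are feature matrices; y_train, y_dev are 0/1 link labels.
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

def tune_C(X_train, y_train, X_dev, y_dev, c_values=(1, 10, 100, 1000, 10000)):
    best_c, best_f = None, -1.0
    for c in c_values:
        clf = LinearSVC(C=c).fit(X_train, y_train)
        f = f1_score(y_dev, clf.predict(X_dev))   # F-score on the dev fold
        if f > best_f:
            best_c, best_f = c, f
    return best_c, best_f
```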

6 Exploiting Rationales

In this section, we describe our first extension to the baseline: exploiting rationales to generate additional training instances for the SVM learner.

6.1 Background

The idea of using annotator rationales to improve binary text classification was proposed by Zaidan et al. (2007). A rationale is a human-annotated text fragment that motivated an annotator to assign a particular label to a training document. In their work on classifying the sentiment expressed in movie reviews as positive or negative, Zaidan et al. generate additional training instances by removing rationales from documents. Since these pseudo-instances lack information that the annotators thought was important, an SVM learner should be less confident about the labels of these weaker instances, and should therefore place the hyperplane closer to them. A learner that successfully learns this difference in confidence assigns a higher importance to the pieces of text that are present only in the original instances. Thus the pseudo-instances help the learner both by indicating which parts of the documents are important and by increasing the number of training instances.

6.2 Application to Traceability Prediction

Unlike in sentiment analysis, where rationales can be identified for both positive and negative training reviews, in traceability prediction, rationales can only be identified for the positive training instances (i.e., pairs with links).

7 C was selected from the set {1, 10, 100, 1000, 10000}.


As noted before, the reason is that in traceability prediction, an instance is labeled as negative because of the absence of evidence that the two requirements involved should be linked, rather than the presence of evidence that they should not be linked.

Using these rationales, we can create positive pseudo-instances. Note, however, that we cannot employ Zaidan et al.'s method to create positive pseudo-instances. According to their method, we would (1) take a pair of linked requirements, (2) remove the rationales from both of them, (3) create a positive pseudo-instance from the remaining text fragments, and (4) add a constraint to the SVM learner forcing it to classify it less confidently than the original positive instance. Creating positive pseudo-instances in this way is problematic for our task. The reason is simple: as discussed previously, a negative instance in our task stems from the absence of evidence that the two requirements should be linked. In other words, after removing the rationales from a pair of linked requirements, the pseudo-instance created from the remaining text fragments should be labeled as negative.

Given this observation, we create a positive pseudo-instance from each pair of linked requirements by removing any text fragments from the pair that are not part of a rationale. In other words, we use only the rationales to create positive pseudo-instances. This has the effect of amplifying the information present in the rationales.

As mentioned above, while Zaidan et al.'s method cannot be used to create positive pseudo-instances, it can be used to create negative pseudo-instances. For each pair of linked requirements, we create three negative pseudo-instances. The first one is created by removing all and only the rationales from the high-level requirement in the pair. The second one is created by removing all and only the rationales from the low-level requirement in the pair. The third one is created by removing all the rationales from both requirements in the pair.
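The pseudo-instance construction just described can be summarised in a few lines; this is our own sketch, with requirements and rationale annotations represented simply as token lists.

```python
# Sketch: build pseudo-instances for one linked (high, low) requirement pair.
# 'high_rat' / 'low_rat' are the sets of rationale tokens annotated in each side.
def make_pseudo_instances(high, low, high_rat, low_rat):
    keep = lambda toks, rat: [t for t in toks if t in rat]       # rationales only
    drop = lambda toks, rat: [t for t in toks if t not in rat]   # rationales removed
    positive = (keep(high, high_rat), keep(low, low_rat))        # one positive pseudo-instance
    negatives = [
        (drop(high, high_rat), low),                 # rationales removed from high side
        (high, drop(low, low_rat)),                  # rationales removed from low side
        (drop(high, high_rat), drop(low, low_rat)),  # rationales removed from both sides
    ]
    return positive, negatives
```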

To better understand our annotator rationale framework, let us define it more formally. Recall that in a standard soft-margin SVM, the goal is to find w and ξ to minimize

    \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_i \xi_i

subject to

    \forall i:\; c_i\, \mathbf{w} \cdot \mathbf{x}_i \geq 1 - \xi_i, \quad \xi_i > 0

where x_i is a training example; c_i ∈ {−1, 1} is the class label of x_i; ξ_i is a slack variable that allows x_i to be misclassified if necessary; and C > 0 is the misclassification penalty (a.k.a. the regularization parameter).

To enable this standard soft-margin SVM to also learn from the positive pseudo-instances, we add the following constraints:

    \forall i:\; \mathbf{w} \cdot \mathbf{v}_i \geq \mu(1 - \xi_i),

where v_i is the positive pseudo-instance created from positive example x_i, ξ_i ≥ 0 is the slack variable associated with v_i, and µ is the margin size (which controls how confident the classifier is in classifying the pseudo-instances).

Similarly, to learn from the negative pseudo-instances, we add the following constraints:

    \forall i, j:\; \mathbf{w} \cdot \mathbf{u}_{ij} \leq \mu(1 - \xi_{ij}),

where u_ij is the jth negative pseudo-instance created from positive example x_i, ξ_ij ≥ 0 is the slack variable associated with u_ij, and µ is the margin size.

We let the learner decide how confidently it wants to classify these additional training instances based on the dev data. Specifically, we tune this confidence parameter µ jointly with the C value to maximize F-score on the dev set.8

7 Extension 2: Exploiting an Ontology

Next, we describe our second extension to the baseline: exploiting ontology-based features.

7.1 Ontology-Based Features

As mentioned before, we derive additional features for the SVM learner from the verb and noun clusters in the hand-built ontology. Specifically, we derive five types of features:

Verb pairs. We create one binary feature for each verb pair (vi, vj) collected from the training instances, where (1) vi and vj appear in a high-level requirement and a low-level requirement respectively, and (2) both verbs appear in the ontology. Its value is 1 if vi and vj appear in the high-level and low-level pair under consideration, respectively. Using these verb pairs as features may allow the learner to focus on verbs that are relevant to traceability prediction.

8 C was selected from the set {1, 10, 100, 1000, 10000}, and µ was selected from the set {0.2, 0.3, 1, 3, 5}.


Verb group pairs. For each verb pair feature described above, we create one binary feature by replacing each verb in the pair with its cluster id in the ontology. Its value is 1 if the two verb groups in the pair appear in the high-level and low-level pair under consideration, respectively. These features may enable the resulting classifier to provide robust generalizations in cases where the learner chooses to ignore certain useful verb pairs owing to their infrequency of occurrence.

Noun pairs. We create one binary feature for each noun pair (ni, nj) collected from the training instances, where (1) ni and nj appear in a high-level requirement and a low-level requirement respectively, and (2) both nouns appear in the ontology. Its value is computed in the same manner as the verb pairs. These noun pairs may help the learner to focus on nouns that are relevant to traceability prediction.

Noun group pairs. For each noun pair feature described above, we create one binary feature by replacing each noun in the pair with its cluster id in the ontology. Its value is computed in the same manner as the verb group pairs. These features may enable the classifier to provide robust generalizations in cases where the learner chooses to ignore certain useful noun pairs owing to their infrequency of occurrence.

Dependency pairs. In some cases, the noun/verb pairs may not provide sufficient information for traceability prediction. For example, the verb pair feature (delete, delete) is suggestive of a positive instance, but the instance may turn out to be negative if one requirement concerns deleting messages and the other concerns deleting folders. As another example, the noun pair feature (folder, folder) is suggestive of a positive instance, but the instance may turn out to be negative if one requirement concerns creating folders and the other concerns deleting folders.

In other words, we need to develop features that encode the relationship between verbs and nouns. To do so, we first parse each requirement using the Stanford dependency parser (de Marneffe et al., 2006), and collect each noun-verb pair (ni, vj) connected by a dependency relation. We then create binary features by pairing each related noun-verb pair found in a high-level training requirement with each related noun-verb pair found in a low-level training requirement. The feature value is 1 if the two noun-verb pairs appear in the pair of requirements under consideration. To enable the learner to focus on learning from relevant verbs and nouns, only verbs and nouns that appear in the ontology are used to create these features.
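As an illustration of how these ontology-derived features could be assembled for one pair of requirements, consider the sketch below. It is our own rendering, not the authors' code: verb2cluster and noun2cluster are assumed dictionaries built from the ontology, and high_deps/low_deps are assumed lists of (noun, verb) pairs already extracted from the dependency parses.

```python
# Illustrative sketch of ontology-derived features for one requirement pair.
def ontology_features(high_verbs, low_verbs, verb2cluster,
                      high_nouns, low_nouns, noun2cluster,
                      high_deps, low_deps):
    feats = {}
    for vi in set(high_verbs) & set(verb2cluster):
        for vj in set(low_verbs) & set(verb2cluster):
            feats[("vp", vi, vj)] = 1                               # verb pair
            feats[("vg", verb2cluster[vi], verb2cluster[vj])] = 1   # verb group pair
    for ni in set(high_nouns) & set(noun2cluster):
        for nj in set(low_nouns) & set(noun2cluster):
            feats[("np", ni, nj)] = 1                               # noun pair
            feats[("ng", noun2cluster[ni], noun2cluster[nj])] = 1   # noun group pair
    for (n1, v1) in high_deps:
        for (n2, v2) in low_deps:
            if (n1 in noun2cluster and n2 in noun2cluster
                    and v1 in verb2cluster and v2 in verb2cluster):
                feats[("dep", n1, v1, n2, v2)] = 1                  # dependency pair
    return feats
```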

7.2 Learning the Ontology

An interesting question is: is it possible to learn an ontology rather than hand-building it? This question is of practical relevance, as hand-constructing the ontology is a time-consuming and error-prone process. Below we describe the steps we propose for ontology learning.

Step 1: Verb/Noun selection. We select the nouns, noun phrases (NPs) and verbs in the training set to be clustered. Specifically, we select a verb/noun/NP if (1) it appears more than once in the training data; (2) it contains at least three characters (thus avoiding verbs such as be); and (3) it appears in the high-level but not the low-level requirements and vice versa.

Step 2: Verb/Noun representation. We represent each noun/NP/verb as a feature vector. Each verb v is represented using the set of nouns/NPs collected in Step 1. The value of each feature is binary: 1 if the corresponding noun/NP occurs as the direct or indirect object of v in the training data (as determined by the Stanford dependency parser), and 0 otherwise. Similarly, each noun n is represented using the set of verbs collected in Step 1. The value of each feature is binary: 1 if n serves as the direct or indirect object of the corresponding verb in the training data, and 0 otherwise.

Step 3: Clustering. To produce a verb clustering and a noun clustering, we cluster the verbs and the nouns/NPs separately using the single-link algorithm. Single-link is an agglomerative algorithm where each object to be clustered is initially in its own cluster. In each iteration, it merges the two most similar clusters and stops when the desired number of clusters is reached. Since we are using single-link clustering, the similarity between two clusters is the similarity between the two most similar objects in the two clusters. We compute the similarity between two objects by taking the dot product of their feature vectors.
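Single-link agglomerative clustering over dot-product similarities is available off the shelf; the sketch below uses scipy as one possible realization (not the authors' implementation), treating each row of a binary co-occurrence matrix as one verb or noun to be clustered.

```python
# Sketch: single-link clustering of verbs/nouns from binary co-occurrence vectors.
# Rows of 'vectors' are the objects to cluster; similarity is the dot product,
# which we turn into a distance for scipy's linkage routine.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def single_link_clusters(vectors, n_clusters):
    vectors = np.asarray(vectors, dtype=float)
    sims = vectors @ vectors.T
    dists = sims.max() - sims                 # larger similarity -> smaller distance
    # Condensed distance array (upper triangle) expected by linkage().
    iu = np.triu_indices(len(vectors), k=1)
    z = linkage(dists[iu], method="single")
    return fcluster(z, t=n_clusters, criterion="maxclust")
```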

Since we do not know the number of clusters to be produced a priori, for Pine we produce three noun clusterings and three verb clusterings (with 10, 15, and 20 clusters each).


For WorldVistA, given its larger size, we produce five noun clusterings and five verb clusterings (with 10, 20, 30, 40, and 50 clusters each). We then select the combination of noun clustering, verb clustering, and C value that maximizes F-score on the dev set, and apply the resulting combination on the test set.

To compare the usefulness of the hand-built and induced ontologies, in our evaluation we will perform separate experiments in which each ontology is used to derive the features from Section 7.1.

8 Evaluation

8.1 Experimental Setup

We employ as our evaluation measure F-score, which is the unweighted harmonic mean of recall and precision. Recall (R) is the percentage of links in the gold standard that are recovered by our system. Precision (P) is the percentage of links recovered by our system that are correct. We preprocess each document by removing stopwords and stemming the remaining words. All results are obtained via five-fold cross validation.
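For completeness, the evaluation measure can be computed directly from the sets of gold and predicted links; a minimal sketch (our own) follows.

```python
# Sketch: link-level precision, recall and F-score from sets of (high, low) pairs.
def prf(gold_links, predicted_links):
    gold, pred = set(gold_links), set(predicted_links)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```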

8.2 Results and Discussion

Results on Pine and WorldVistA are shown in Table 4(a) and Table 4(b), respectively.

8.2.1 No Pseudo-instances

The "No pseudo" column of Table 4 shows the results when the learner learns from only real training instances (i.e., no pseudo-instances). Specifically, rows 1 and 2 show the results of the two unsupervised baselines, tf-idf and LDA, respectively.

Recall from Section 5.1 that in both baselines, we compute the cosine similarity between a pair of requirements, positing them as having a traceability link if and only if their similarity score exceeds a threshold that is tuned based on the test set. By doing so, we are essentially giving both unsupervised baselines an unfair advantage in the evaluation. As we can see from rows 1 and 2 of the table, tf-idf achieves F-scores of 54.5% on Pine and 46.5% on WorldVistA. LDA performs significantly worse than tf-idf, achieving F-scores of 34.2% on Pine and 15.1% on WorldVistA.9

Row 3 shows the results of the supervised baseline described in Section 5.2. As we can see, this baseline achieves F-scores of 57.5% on Pine and 63.3% on WorldVistA, significantly outperforming the better unsupervised baseline (tf-idf) on both datasets.

9 All significance tests are paired t-tests (p < 0.05).

When this baseline is augmented with features derived from manual clusters (row 4), the resulting system achieves F-scores of 62.6% on Pine and 64.2% on WorldVistA, outperforming the supervised baseline by 5.1% and 0.9% in F-score on these datasets. These results represent significant improvements over the supervised baseline on both datasets, suggesting the usefulness of the features derived from manual clusters for traceability link prediction. When employing features derived from induced rather than manual clusters (row 5), the resulting system achieves F-scores of 61.7% on Pine and 64.6% on WorldVistA, outperforming the supervised baseline by 4.2% and 1.3% in F-score on these datasets. These results also represent significant improvements over the supervised baseline on both datasets. In addition, the results obtained using manual clusters (row 4) and induced clusters (row 5) are statistically indistinguishable. This result suggests that the ontologies we induced can potentially be used in lieu of the manually constructed ontologies for traceability link prediction.

8.2.2 Using Positive Pseudo-instances

The "Pseudo pos only" column of Table 4 shows the results when each of the systems is trained with additional positive pseudo-instances.

Comparing the first two columns, we can see that employing positive pseudo-instances increases performance on Pine (F-scores rise by 0.7–1.1%) but decreases performance on WorldVistA (F-scores drop by 0.3–2.1%). Nevertheless, the corresponding F-scores in all but one case (Pine, induced) are statistically indistinguishable. These results seem to suggest that the addition of positive pseudo-instances is not useful for traceability link prediction.

Note that the addition of features derived from manual/induced clusters to the supervised baseline no longer consistently improves its performance: while F-scores still rise significantly by 4.6–5.5% on Pine, they drop insignificantly by 0.1–0.5% on WorldVistA.

8.2.3 Using Positive and Negative Pseudo-instances

The "Pseudo pos+neg" column of Table 4 shows the results when each of the systems is trained with additional positive and negative pseudo-instances.



(a) Pine

System                | No pseudo (R / P / F) | Pseudo pos only (R / P / F) | Pseudo pos+neg (R / P / F) | Pseudo residual (R / P / F)
1 Tf-idf baseline     | 73.6 / 43.3 / 54.5    | –                           | –                          | –
2 LDA baseline        | 30.4 / 39.2 / 34.2    | –                           | –                          | –
3 Supervised baseline | 50.4 / 67.0 / 57.5    | 51.2 / 67.3 / 58.2          | 53.9 / 73.8 / 62.3         | 31.6 / 68.6 / 43.2
4 + manual clusters   | 54.4 / 73.9 / 62.6    | 55.6 / 74.7 / 63.7          | 57.6 / 77.0 / 65.9         | 30.0 / 72.1 / 42.3
5 + induced clusters  | 53.6 / 72.8 / 61.7    | 54.8 / 73.6 / 62.8          | 55.2 / 75.0 / 63.6         | 30.0 / 73.5 / 42.6

(b) WorldVistA

System                | No pseudo (R / P / F) | Pseudo pos only (R / P / F) | Pseudo pos+neg (R / P / F) | Pseudo residual (R / P / F)
1 Tf-idf baseline     | 60.4 / 37.8 / 46.5    | –                           | –                          | –
2 LDA baseline        | 25.9 / 10.6 / 15.1    | –                           | –                          | –
3 Supervised baseline | 52.5 / 79.9 / 63.3    | 52.2 / 79.2 / 63.0          | 55.9 / 80.6 / 66.0         | 49.2 / 71.5 / 58.3
4 + manual clusters   | 52.5 / 82.8 / 64.2    | 51.5 / 80.8 / 62.9          | 57.1 / 83.0 / 67.6         | 47.7 / 76.1 / 58.6
5 + induced clusters  | 52.8 / 83.2 / 64.6    | 51.0 / 80.7 / 62.5          | 57.1 / 82.1 / 67.4         | 47.7 / 76.4 / 58.7

Table 4: Results of supervised systems on the Pine and WorldVistA datasets.

Comparing these results with the corresponding "Pseudo pos only" results, we can see that additionally employing negative pseudo-instances consistently improves performance: F-scores rise by 0.8–4.1% on Pine and 3.0–4.9% on WorldVistA. In particular, the improvements in F-score in three of the six cases (Pine/Baseline, WorldVistA/manual, WorldVistA/induced) are statistically significant. These results suggest that the additional negative pseudo-instances provide useful information for traceability link prediction.

In addition, adding features derived from manual/induced clusters to the supervised baseline consistently improves its performance: F-scores rise significantly, by 1.3–3.6% on Pine and by 1.4–1.6% on WorldVistA.

Finally, the best results in our experiments are achieved when both positive and negative pseudo-instances are used in combination with manual/induced clusters: F-scores reach 63.6–65.9% on Pine and 67.4–67.6% on WorldVistA. These results translate to significant improvements in F-score over the supervised baseline by 6.1–8.4% on Pine and 4.1–4.3% on WorldVistA, or relative error reductions of 14.3–19.7% on Pine and 11.1–11.7% on WorldVistA.

8.2.4 Pseudo-instances from Residuals

Recall that Zaidan et al. (2007) created pseudo-instances from the text fragments that remain after the rationales are removed. In Section 6.2, we argued that their method of creating positive pseudo-instances for our requirements traceability task is problematic. In this subsection, we empirically verify the correctness of this claim.

Specifically, the "Pseudo residual" column of Table 4 shows the results when each of the "No pseudo" systems is additionally trained on the positive pseudo-instances created using Zaidan et al.'s method. Comparing these results with the corresponding "Pseudo pos+neg" results, we see that replacing our method of creating positive pseudo-instances with Zaidan et al.'s method causes the F-scores to drop significantly by 7.7–23.6% in all cases. In fact, comparing these results with the corresponding "No pseudo" results, we see that except for the baseline system, employing positive pseudo-instances created from Zaidan et al.'s method yields significantly worse results than not employing pseudo-instances at all. These results provide suggestive evidence for our claim.

9 Conclusion

We investigated a knowledge-rich approach to an important yet under-studied SE task that presents a lot of challenges to NLP researchers: traceability prediction. Experiments on two evaluation datasets showed that (1) in comparison to a supervised baseline, this method reduces relative error by 11.1–19.7%; and (2) results obtained using induced clusters were competitive with those obtained using manual clusters. To stimulate research on this task, we make our annotated resources publicly available.

Acknowledgments

We thank the three anonymous reviewers for their insightful comments on an earlier draft of the paper. This research was supported in part by the U.S. Department of Defense Systems Engineering Research Center (SERC) new project incubator fund RT-128 and the U.S. National Science Foundation (NSF Awards CNS-1126747 and IIS-1219142).


References

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022.

Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27.

Jane Cleland-Huang, Raffaella Settimi, Chuan Duan, and Xuchang Zou. 2005. Utilizing supporting evidence to improve dynamic requirements traceability. In Proceedings of the 13th IEEE International Conference on Requirements Engineering, pages 135–144.

Jane Cleland-Huang, Adam Czauderna, Marek Gibiec, and John Emenecker. 2010. A machine learning approach for tracing regulatory codes to product specific requirements. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering (Volume 1), pages 155–164.

Johan Natt Dag, Bjorn Regnell, Par Carlshamre, Michael Andersson, and Joachim Karlsson. 2002. A feasibility study of automated natural language requirements analysis in market-driven development. Requirements Engineering, 7(1):20–33.

Andrea De Lucia, Fausto Fasano, Rocco Oliveto, and Genoveffa Tortora. 2007. Recovering traceability links in software artifact management systems using information retrieval methods. ACM Transactions on Software Engineering and Methodology, 16(4):13.

Andrea De Lucia, Rocco Oliveto, and Genoveffa Tortora. 2009. Assessing IR-based traceability recovery tools through controlled experiments. Empirical Software Engineering, 14(1):57–92.

Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of the 5th International Conference on Language Resources and Evaluation, pages 449–454.

Justin Jackson. 1991. A keyphrase based traceability scheme. In IEEE Colloquium on Tools and Techniques for Maintaining Traceability During Design, pages 2/1–2/4. IET.

Marco Lormans and Arie Van Deursen. 2006. Can LSI help reconstructing requirements traceability in design and test? In Proceedings of the 10th European Conference on Software Maintenance and Reengineering, pages 47–56.

Robert A. Pierce. 1978. A requirements tracing tool. ACM SIGSOFT Software Engineering Notes, 3(5):53–60.

Daniel Port, Allen Nikora, Jane Huffman Hayes, and LiGuo Huang. 2011. Text mining support for software requirements: Traceability assurance. In Proceedings of the 44th Hawaii International Conference on System Sciences, pages 1–11.

Hakim Sultanov and Jane Huffman Hayes. 2010. Application of swarm techniques to requirements engineering: Requirements tracing. In Requirements Engineering Conference (RE), 2010 18th IEEE International, pages 211–220.

Senthil Karthikeyan Sundaram, Jane Huffman Hayes, and Alexander Dekhtyar. 2005. Baselines in requirements tracing. ACM SIGSOFT Software Engineering Notes, 30(4):1–6.

Omar Zaidan, Jason Eisner, and Christine Piatko. 2007. Using "annotator rationales" to improve machine learning for text categorization. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 260–267.


Proceedings of the 19th Conference on Computational Language Learning, pages 247–257, Beijing, China, July 30-31, 2015. ©2015 Association for Computational Linguistics

Structural and lexical factors in adjective placement in complex noun phrases across Romance languages

Kristina Gulordava
University of Geneva
[email protected]

Paola Merlo
University of Geneva
[email protected]

Abstract

One of the most common features across all known languages is their variability in word order. We show that differences in the prenominal and postnominal placement of adjectives in the noun phrase across five main Romance languages are not only subject to heaviness effects, as previously reported, but also to subtler structural interactions among dependencies that are better explained as effects of the principle of dependency length minimisation. These effects are almost purely structural and show lexical conditioning only in highly frequent collocations.

1 Introduction

One of the most widely observed characteristics of all languages is the variability in the linear order of their words, both across and within a single language. In this study, we concentrate on word order alternations where one structure can be linearised in two different ways. Consider, for example, the case when a phrasal verb (V + particle) has a direct object (NP), in English. Two alternative orders are possible: VP1 = V NP Prt, and VP2 = V Prt NP. If the NP is heavy, as defined in number of words or number of syllables, it will be frequently placed after the Prt, yielding the V-Prt-NP order. Compare, for instance, Call me up! to Call up the customer who called yesterday. This tendency is also formulated as a Principle of End Weight, where phrases are presented in order of increasing weight (Wasow, 2002). Cases of heavy NP-shift (Stallings et al., 1998), dative alternation (Bresnan et al., 2007) and other alternation preferences among verbal dependents are traditionally evoked to argue in favour of the "heaviness" effect.

In this work, we study the alternations in the noun-phrase domain, much less investigated in connection with the heaviness effect.

Figure 1: Percent of postnominal simple (green) and heavy (red) adjectives across seventeen languages.

Abeillé and Godard (2000) introduce the heaviness of adjective phrases as a principle explaining their postnominal placement compared to 'light' adjectives in French. Their observations have been recently confirmed in a corpus study by Thuilier (2012). Cross-linguistically, the data that we have collected across many languages and several families, presented in Figure 1, confirm the heaviness effect for adjectives.1 By extracting relevant statistics from gold dependency annotated corpora, we can observe that heavy adjectives (adjective phrases of at least two words) appear more frequently postnominally than simple adjectives.

While the effect of size or heaviness is well documented, this statistic is very coarse and it confounds various linguistic factors, such as types of adjectives, and annotation conventions of different corpora.

1 We use the following languages and treebanks: English, Czech, Spanish, Chinese, Catalan, German, Italian (Hajic et al., 2009), Danish, Dutch, Portuguese, Swedish (Buchholz and Marsi, 2006), Latin, Ancient Greek (Haug and Jøhndal, 2008), Hungarian (Csendes et al., 2005), Polish (Wolinski et al., 2011), Arabic (Zeman et al., 2012), French (McDonald et al., 2013). The extraction is based on the conversion to the universal part-of-speech tags (Petrov et al., 2012).


From a typological perspective, the formulation needs to be refined from a preference of end weight to a preference for all elements being closer to the governing head: languages with Verb-Object dominant order tend to put constituents in 'short before long' order, while Object-Verb languages, like Japanese or Korean, do the reverse (Hawkins, 1994; Wasow, 2002). A more general explanation for the weight effect has been sought in a general tendency to minimise the length of the dependency between two related words, called Dependency Length Minimisation (DLM, Temperley (2007), Gildea and Temperley (2007)).

In this paper, we look at the structural factors, such as DLM, and lexical factors that play a role in adjective-noun word order alternations in Romance languages and the predictions they make on prenominal or postnominal placement of adjectives. We concentrate on a smaller set of languages than those shown in Figure 1 to be able to study finer-grained effects than what can be observed at a very large scale and across many different corpus annotation schemes. We choose Romance languages because they show a good amount of variation in the word order of the noun phrase.

The DLM principle can be stated as follows: if there exist possible alternative orderings of a phrase, the one with the shortest overall dependency length (DL) is preferred.

Consider, again, the case when a phrasal verb (verb + particle) has a direct object (NP). Two alternative orders are possible: VP1 = V NP Prt, whose length is DL1, and VP2 = V Prt NP, whose length is DL2. DL1 is DL(V-NP) + DL(V-Prt) = |NP| + 1; DL2 is DL(V-NP) + DL(V-Prt) = |Prt| + 1. If DL1 is bigger than DL2, then VP2 is preferred over VP1. Unlike the principle of End Weight, this explanation applies also to languages with a different word order than English.
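As a worked instance of this computation (our own illustration, using the DL expressions just given), take the heavy object NP "the customer who called yesterday" (|NP| = 5) and the particle "up" (|Prt| = 1):

```latex
% Worked example under the DL convention stated above (illustrative only).
\begin{align*}
\text{VP}_1 = \text{V NP Prt}: \quad & DL_1 = |NP| + 1 = 5 + 1 = 6\\
\text{VP}_2 = \text{V Prt NP}: \quad & DL_2 = |Prt| + 1 = 1 + 1 = 2
\end{align*}
```

Since DL1 > DL2, DLM prefers VP2, i.e., Call up the customer who called yesterday, in line with the heaviness observation above.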

The observation that human languages appear to minimise the distance between related words is well documented in sentence processing (Gibson, 1998; Hawkins, 1994; Hawkins, 2004), in corpus properties of treebanks (Gildea and Temperley, 2007; Futrell et al., 2015), in diachronic language change (Tily, 2010). It is usually interpreted as a means to reduce memory load and support efficient communication. Dependency length minimisation has been demonstrated on a large scale in the verbal domain and at the sentence level, but has not yet been investigated in the more limited nominal domain, where dependencies are usually shorter and might create lighter processing loads that do not need to be minimised. In applying the general principle of DLM to the dependency structure of noun phrases, our goal is to test to what extent the DLM principle predicts the observed adjective-noun word order alternation patterns.

In this paper, we develop and discuss a more complex variant of a model described previously (Gulordava et al., 2015) and extend its analysis. First, we investigate whether the more complex DLM principle is necessary to explain our findings or if the simpler heaviness effect demonstrated for many languages in Figure 1 is sufficient. The answer is positive: the complexity introduced by DLM is necessary. Then, we develop a more detailed analysis of the only prediction of the model that is only weakly confirmed, showing that this result still holds under different definitions of dependency length. We also present an in-depth study to show that the DLM effect is structural, as assumed, and not lexical. While it is well-known that in French prenominal and post-nominal placement of adjectives is sometimes lexically-specific and meaning-dependent, this is not often the case in other languages like Italian, and does not explain the extent of the variation.

2 Dependency length minimisation in the noun phrase

In this section, we summarise the model in Gulordava et al. (2015). In the next section we propose a more complex model and study some factors in depth. Gulordava et al. (2015) consider a prototypical noun phrase with an adjective phrase as a modifier. They assume a simplified noun phrase with only one adjective modifier adjacent to the noun and two possible placements for an adjective phrase: post-nominal and prenominal. The adjective modifier can be a complex phrase with both left and right dependents (α and β, respectively). The noun phrase can have parents and right modifiers (X and Y, respectively). The structures for the possible cases are shown in Figure 2.

These structures correspond to examples like those shown in (1), in Italian (X='vede', Adj='bella', N='casa', Y='al mare').


(a) Left external dependent, prenominal adjective: DL1: X [NP [AP α Adj β ] N Y ]
(b) Left external dependent, postnominal adjective: DL2: X [NP N [AP α Adj β ] Y ]
(c) Right external dependent, prenominal adjective: DL1: [NP [AP α Adj β ] N Y ] X
(d) Right external dependent, postnominal adjective: DL2: [NP N [AP α Adj β ] Y ] X

Figure 2: Noun phrase structure variants and the dependencies relevant for the DLM calculation with right noun dependent Y.

        | RightNP=Yes | RightNP=No
X=Left  | |β| − |α|   | 2|β| + 1
X=Right | −3|α| − 2   | −2|α| − 1

Table 1: Dependency length difference for different types of noun phrases. By convention, we always calculate DL1 − DL2.

(1) a. ...vede la bella casa al mare.
       ('...sees the beautiful house at the sea')
    b. ...vede la casa bella al mare.
       ('...sees the house beautiful at the sea')
    c. La bella casa al mare è vuota.
       ('the beautiful house at the sea is empty.')
    d. La casa bella al mare è vuota.
       ('the house beautiful at the sea is empty.')

The differences in dependency lengths predicted by DLM are summarized in Table 1. DLM makes predictions on adjective placement with respect to the noun (prenominal or postnominal) given the dependents of the adjectives, α and β, and given the dependent of the noun Y.

The column RightNP=No shows the dependency length difference for the two cases where the noun does not have a right dependent Y. Given that the DL difference is always calculated as DL1 − DL2, the fact that the cell (X=Left, RightNP=No) holds a positive value indicates that DL1 > DL2 and that the difference in length depends only on β and not on α. Conversely, the negative value of (X=Right, RightNP=No) shows that DL1 < DL2 and that the difference in length does not depend on β, but only on α. This is not intuitive: intuitively, one would expect that whether the Adjective is left or right of the Noun depends on the relative lengths of α and β, but instead, if we look at all the dependencies that come into play for a noun phrase in a larger structure, the adjective position depends on only one of the two dependents. The table also shows that, on average, across all the cells, the weights of α are less than zero while the weights of β are greater than zero. This indicates that DL1 < DL2, which means that globally the prenominal adjective order is preferred.

DLM also makes predictions on adjective placement with respect to the noun given the dependents of the noun. Here the predictions of DLM are not intuitive. DLM predicts that when the external dependency is right (the dependency from the noun to its parent, X=right), then the adjective is prenominal, else it is postnominal. To spell this out, DLM predicts that, for example, we should find more prenominal adjectives in subject NPs than in NPs in object position. We discuss this odd prediction below.

Another prediction that will be investigated in detail is that when the noun has a right dependent, the prenominal adjective position is more preferred than when there is no right dependent, as evinced by the fact that each value in the RightNP=Yes column is always smaller than the corresponding value in the RightNP=No column.

Gulordava et al. (2015) develop a mixed-effects model to test which of the fine-grained predictions derived from DLM are confirmed by the data provided by the dependency-annotated corpora of five main Romance languages. The different elements in the DLM configuration are encoded as four factors, corresponding to the elements illustrated in Figure 2 and example (1) and represented as binary or real-valued variables: LeftAP - the cumulative length (in words) of all left dependents of the adjective, indicated as α in Figure 2; RightAP - the cumulative length (in words) of all right dependents of the adjective, indicated as β in Figure 2; ExtDep - the direction of the arc from the noun to its parent X, an indicator variable; RightNP - the indicator variable representing the presence or absence of the right dependent of the noun, indicated as Y in Figure 2.²

Their findings partly confirm the predictions about adjective placement with respect to the noun given the adjective dependents. The DLM predictions about the position of the noun with respect to its parent are instead not confirmed. Finally, the prediction relating the presence of a right dependent of the noun to the placement of the adjective is confirmed.

In the next two sections, we replicate these results and investigate them in more detail. First, we develop and discuss a more detailed model, where not only the factors but also their interactions are taken into account. Then, we compare the predictions of the DLM model to the predictions of a simpler heaviness account, and confirm that the complexity of DLM is needed, as a simpler model based on the heaviness of the adjective does not yield the same effects. Then, we discuss the external dependency factor, which, in the more complex model with interactions, is a significant factor. Finally, the RightNP factor is significant in the fitted model: the presence of a noun dependent on the right of the noun favours prenominal placement, as predicted by DLM. We investigate the lexical aspects of this result in a more detailed case study.

3 Analysis of Dependency Minimisation Factors

The analysis that we develop here is based on the assumption that DLM is exhibited by the dependencies in the available dependency-annotated corpora for the five Romance languages.

3.1 Materials: Dependency treebanks

The dependency-annotated corpora of five Romance languages are used: Catalan, Spanish, Italian (Hajič et al., 2009),

² In addition, to account for lexical variation, they include adjective tokens (or lemmas when available) as grouping variables introducing random effects. For example, the instances of adjective-noun order for a particular adjective will share the same weight value λ for the adjective variable, but across different adjectives this value can vary.

French (McDonald et al., 2013), and Portuguese (Buchholz and Marsi, 2006).

Noun phrases containing adjectives are extracted using part-of-speech information and dependency arcs from the gold annotation. Specifically, all treebanks are converted to coarse universal part-of-speech tags, using existing conventional mappings from the original tagset to the universal tagset (Petrov et al., 2012). All adjectives are identified using the universal PoS tag ‘ADJ’, whose dependency head is a noun, tagged using the universal PoS tag ‘NOUN’. All elements of the dependency subtree (the noun phrase) rooted in this noun are collected. For all languages where this information is available, we extract lemmas of adjective and noun tokens. The only treebank without lemma annotation is French, for which we extract token forms.³ A total of around 64'000 instances of adjectives in noun phrases is collected, ranging from 2'800 for Italian to 20'000 for Spanish.

3.2 Method: Mixed-Effects models

The interactions of several dependency factors are analysed using logit mixed-effects models (Bates et al., 2014). Mixed-effects logistic regression models (logit models) are a type of Generalized Linear Mixed Model with the logit link function and are designed for binomially distributed outcomes, such as Order in our case.

3.3 Factors and their interactions

While the original model in Gulordava et al. (2015) represents the four main factors involved in DLM in the noun phrase — α, β, RightNP and ExtDep — the predictions described above often mention interactions, which are not directly modelled in the original proposal. We introduce interactions, so that the model is more faithful to the DLM predictions, as shown in (2) and in Table 2. We do not directly represent the interaction between LeftAP and RightAP because in our corpora these two factors were both greater than zero in only 1% of the cases.

³ During preprocessing, we exclude all adjectives and nouns with non-lower-case and non-alphabetic symbols, which can include common names. Compounds (in the Spanish and Catalan treebanks) and English borrowings are also excluded. Neither do we include in our analysis noun phrases which have their elements separated by punctuation (for example, commas or parentheses), to ensure that the placement of the adjective is not affected by an unusual structure of the noun phrase.


Predictor                    β        SE      Z        p
Intercept                    -0.157   0.117   -1.33    0.182
LeftAP                       2.129    0.183   11.63    < .001
RightAP                      0.887    0.091   9.79     < .001
RightNP                      -0.794   0.056   -14.24   < .001
ExtDep                       -0.243   0.118   -2.06    0.039
RightNP:ExtDep               0.296    0.149   1.98     0.047
RightAP:RightNP:ExtDep       -0.639   0.353   -1.81    0.070

Random effects               Var
Adjective                    1.989
Language                     0.023

Table 2: Summary of the fixed and random effects in the mixed-effects logit model with interactions (N = 15842), shown in (2). Non-significant factors are not shown.

Model                  Df   AIC     BIC     logLik    deviance   χ²       Df   p
Without interactions   7    12137   12190   -6061.3   12123
With interactions      14   12134   12241   -6052.9   12106      16.847   7    0.018*

Table 3: Comparison of the fits of two models: the model with interactions (2) and a simpler model without any interactions between the factors RightAP, LeftAP, RightNP and ExtDep.

y_ij = (LeftAP + RightAP) · RightNP · ExtDep × β + λ_Adj_i + λ_Lang_j        (2)
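The paper fits model (2) with the lme4 package (Bates et al., 2014). As a rough, self-contained illustration in Python (not the authors' code), the sketch below fits only the fixed-effects structure of (2) on synthetic data; the column names, the coding of Order (1 = postnominal), and the simulated coefficients are assumptions, and the random intercepts for adjective and language used in the real model are omitted.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the extracted noun-phrase instances (one row per adjective-noun pair).
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "LeftAP":  rng.poisson(0.1, n),    # cumulative length of left AP dependents (alpha)
    "RightAP": rng.poisson(0.3, n),    # cumulative length of right AP dependents (beta)
    "RightNP": rng.integers(0, 2, n),  # right dependent of the noun present?
    "ExtDep":  rng.integers(0, 2, n),  # direction of the arc to the noun's parent
})
# Simulate the outcome with coefficients loosely mimicking the signs in Table 2.
eta = -0.2 + 2.0 * df.LeftAP + 0.9 * df.RightAP - 0.8 * df.RightNP - 0.25 * df.ExtDep
df["Order"] = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

# Fixed-effects structure of model (2); the paper's model additionally has random
# intercepts for adjective and language (lme4: glmer(..., family = binomial)).
fit = smf.logit("Order ~ (LeftAP + RightAP) * RightNP * ExtDep", data=df).fit(disp=False)
print(fit.summary())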

Contrary to the model without interactions (Gulordava et al., 2015), both the ExtDep factor and its interaction with the RightNP factor are significant. This interaction corresponds to the four different NP contexts presented in Table 1. Its significance, then, can be taken as preliminary confirming evidence for the distinction of these contexts, as predicted by DLM. A direct comparison of the two models, with and without interactions, shows, however, that the effects of these interactions are rather small (Table 3). The log-likelihood test shows that the model with interactions fits the data significantly better (χ² = 16.8, p = 0.02), but the comparison of the Bayesian Information Criterion scores of the two models, a criterion which strongly penalises the number of parameters, suggests that the model without interactions should be preferred.
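The comparison in Table 3 can be checked with a few lines of arithmetic (our own illustration): the likelihood-ratio statistic is the deviance difference, and the BIC penalty of log(N) per extra parameter, with N = 15842 instances (Table 2), outweighs that gain.

import math
from scipy.stats import chi2

dev_without, dev_with = 12123, 12106   # deviances from Table 3
extra = 14 - 7                         # parameters added by the interactions
lr = dev_without - dev_with            # ~16.8 (16.847 before rounding)
print(chi2.sf(lr, extra))              # ~0.017; Table 3 reports p = 0.018
print(extra * math.log(15842))         # ~67.7: BIC penalty exceeds the ~17-point deviance gain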

3.4 Comparison of DLM and heaviness model

Dependency length minimisation was introduced, as mentioned in the introduction, to better explain processing effects at the sentence level for which heaviness accounts were inadequate. However, noun phrases are small and relatively simple domains. We ask, then, whether a model is sufficient where the AP is not divided into LeftAP and RightAP, but is holistically represented by the size of the whole AP.

Specifically, a simpler heaviness model does not distinguish between left and right dependents of adjectives: all heavy adjectives are predicted to move postnominally. Heaviness would also not predict the interaction between placement and the existence of a noun dependent to the right.

We compare, then, two minimally different models. Since neither the external dependency factor nor the interactions were shown to be highly significant, we compare a simplified DLM model shown in (3) to a model where AP is represented only by its heaviness (number of words), as in (4).

y_ij = LeftAP · β_LAP + RightAP · β_RAP + RightNP · β_RNP + λ_Adj_i + λ_Lang_j        (3)

y_ij = SizeAP · β_HV + RightNP · β_RNP + λ_Adj_i + λ_Lang_j        (4)
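As a rough illustration of this comparison (not the authors' code), the sketch below fits the two fixed-effects structures of (3) and (4) on synthetic data; only the split versus collapsed AP length differs, and the random effects are again omitted.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({"LeftAP": rng.poisson(0.1, n),
                   "RightAP": rng.poisson(0.3, n),
                   "RightNP": rng.integers(0, 2, n)})
eta = -0.2 + 2.0 * df.LeftAP + 0.9 * df.RightAP - 0.8 * df.RightNP
df["Order"] = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
df["SizeAP"] = df.LeftAP + df.RightAP          # heaviness: total AP length, direction ignored

dlm  = smf.logit("Order ~ LeftAP + RightAP + RightNP", data=df).fit(disp=False)   # model (3)
heav = smf.logit("Order ~ SizeAP + RightNP", data=df).fit(disp=False)             # model (4)
print("AIC (3):", round(dlm.aic, 1), " AIC (4):", round(heav.aic, 1))  # Table 4 reports Delta-AIC = 146 on the real data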

The DLM model that distinguishes LeftAP from RightAP in (3) fits the data better than a model where AP is represented only by its heaviness, as in (4), as can be seen in Table 4 and from the difference in AIC values of the two models (ΔAIC = 146). This result confirms that the complexity introduced by DLM is needed, and confirms DLM as a property of language, also in noun phrases.


Model                         Df   AIC     BIC     logLik    deviance   χ²      Df   p
Model with SizeAP             5    12518   12557   -6254.1   12508
Model with LeftAP, RightAP    6    12372   12418   -6179.8   12360      148.5   1    < .001

Table 4: Comparison of the simplified DLM model in (3) and the heaviness model in (4).

The main conceptual difference between heaviness accounts and DLM resides in the fact that the former do not take into account the structure and the nature of the context of the heavy element, while DLM does. This model comparison shows that adjective placement is not independent of its context.

Prediction for External Dependencies. The expected effect of the external dependency of the noun predicted by DLM is borne out only marginally. This factor predicts a difference between noun phrases that precede their head, for example subjects, and noun phrases that follow their head, for example objects. The prediction is unexpected, while the result that the factor is not highly significant is less so, as it is not immediately clear why nouns preceding their heads should behave differently from nouns that follow them.

A possible explanation for this mismatch between the predictions and the observed data patterns lies in the assumptions underlying the DLM principle. We have assumed a definition of dependency length as the number of words between the head and the dependent, as found in the corpus annotation. Our data are annotated using a content-head rule, which assumes that the noun is the head of the noun phrase. Hawkins (1994), in his well-developed variant of DLM, postulates that minimisation occurs on the dependencies between the head and the edge of the dependent phrase. For noun phrases, the relevant dependencies will span between the determiner, which unambiguously defines the left edge of the noun phrase, and the head of the NP (e.g., a verb). The predictions of Hawkins' theory for adjective placement will therefore differ from the DLM predictions based on our definition. As can be observed from Figure 2, the d′1 and d″1 dependencies to the left edge of the NP will be of equal length in cases (a) and (b) (similarly to d′2 and d″2 in cases (c) and (d)). The external dependency is therefore predicted not to affect the resulting adjective placement, as observed in the data. This result lends weak support to a theory where, in this case, the relevant dependency is between the parent and the edge of the dependent.

A question remains of what dependencies are minimised when the noun phrase does not have a determiner and the left edge of the noun phrase is ambiguous.⁴ This issue is difficult to test in practice in our corpora. First, there are many more cases (84% versus 16%) with left ExtDep (X is on the right, e.g. for object NPs) than with right ExtDep (X is on the left, e.g. for subjects) in Romance languages. This is because all of them, except French, can optionally omit subjects. Moreover, the function of the NP, subject or object, and therefore the ExtDep variable, correlates with the definiteness of the NP. NPs in object position take an article 75% of the time, while NPs in subject position take an article 96% of the time. Consequently, NPs without articles and on the left of the head are observed only 135 times in our data sample (across all languages). This small number of cases did not allow us to develop a model.

4 In-depth study of the RightNP dependency factor

The most novel result of the model in Gulordava et al. (2015), extended here to the more complex model (2), concerns the interaction between the adjective position and the RightNP factor. This effect would not be predicted by a heaviness explanation, and even in the DLM framework it is surprising that minimisation should apply to such a short dependency. We investigate this interaction in more detail and ask two questions: is this effect evenly spread across different nouns and adjectives, or is it driven by some lexical outliers? And what are the lexical effects of the noun and its dependent? We analyse a large amount of data constructed to be a representative sample of adjective variation for several nouns (around thirty for each language) and very many adjectives, and investigate noun phrases with a right dependent introduced by the preposition ‘de/di’.⁵

⁴ In one of his analyses, Hawkins claims that adjectives define unambiguously the left edge of the NP, but this assumption is controversial.

⁵ For Italian, the preposition is ‘di’, while for the other three languages it is ‘de’. We do not consider complex prepositions such as ‘du’ in French or ‘do’ in Portuguese.


4.1 Data extracted from large PoS-tagged corpora

We extract the data by automatically querying a collection of corpora brought together by the SketchEngine project (Kilgarriff et al., 2014). This web-interface-based collection allows partially restricted access to billion-word corpora of Italian (4 billion words), French (11 billion), Spanish (8.7 billion) and Portuguese (4.5 billion). The corpora are collected by web-crawling and automatically PoS-tagged. A similar Catalan corpus was not available through this service.

First, we define the set of seed nouns that will be queried. For each language, we use our treebanks to find the list of the two hundred most frequent nouns which take the ‘di/de’ preposition as a complement. A noun has ‘di/de’ as its right dependent if there is a direct head-dependent link between these elements in the gold annotation. Nouns in the list which could be ambiguous between different parts of speech are replaced manually. We randomly sample around thirty nouns, based on the percentage of their co-occurrence with ‘di/de’. Given the list of seed nouns, we automatically queried the four corpora with simple regular patterns containing these nouns to extract cases of prenominal and postnominal noun-adjective co-occurrences.⁶

For each noun, we collected a maximum of 100'000 matches for each of the two patterns, which is the SketchEngine service limit. These matches include the left and right contexts of the pattern and allow us to extract the token following the pattern, which may or may not be ‘di/de’.
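The counting over the returned matches can be sketched as follows (our own illustration, not the SketchEngine query output format): the (token, PoS-tag) input and the toy French example are assumptions.

from collections import Counter

def count_orders(tagged, seed_noun, prep=("de", "di")):
    """Count prenominal ('ADJ noun') vs postnominal ('noun ADJ') hits for a seed noun,
    split by whether the match is followed by the preposition 'de'/'di'."""
    counts = Counter()
    for i in range(len(tagged) - 2):
        (w1, t1), (w2, t2), (w3, _) = tagged[i], tagged[i + 1], tagged[i + 2]
        if t1 == "ADJ" and w2 == seed_noun:
            order = "prenominal"
        elif w1 == seed_noun and t2 == "ADJ":
            order = "postnominal"
        else:
            continue
        counts[(order, w3.lower() in prep)] += 1
    return counts

tagged = [("la", "DET"), ("grande", "ADJ"), ("maison", "NOUN"), ("de", "ADP"),
          ("campagne", "NOUN"), ("est", "VERB"), ("vide", "ADJ")]
print(count_orders(tagged, "maison"))   # Counter({('prenominal', True): 1})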

We modelled the data using logit mixed-effects models, with Order as the response variable, one fixed effect (Di), and nouns and adjectives as random effects. We fit the maximal model with both slope and intercept parameters, as shown in model (5).

y = Di · (β_Di + λ_Adj_i + λ_Noun_j) + λ_Adj_i + λ_Noun_j        (5)

We fit our models on a sample of around 200'000 instances of adjective-noun alternations for each language, equally balanced between noun phrases with Di = True and Di = False.

⁶ Our patterns were of the type ‘[tag="ADJ"] noun’ and ‘noun [tag="ADJ"]’, where the tag field specifies the PoS tag of a token. In our case, ‘A.’ was the tag for adjectives in Portuguese, and ‘ADJ’ in Italian, French and Spanish.

Figure 3: Percent postnominal placement for the thirty most frequent adjectives in French. Noun phrase has a right ‘de’-complement (green) or it does not (red).

4.2 Results

The data show that the Di effect is small but highly significant for all languages. The resulting values are similar: for French β_Di = −0.84, Portuguese β_Di = −0.95, Italian β_Di = −1.14 and Spanish β_Di = −1.65 (all p < 0.001).

Figure 3 illustrates the Di effect for French (cumulative for all nouns). We observe that most of the adjectives appear more frequently prenominally in noun phrases with a ‘de’ complement than in noun phrases without a ‘de’ complement (green columns are smaller than the corresponding red columns). Importantly, we observe a very similar picture cross-linguistically for all four languages and for the adjective alternation across the majority of the nouns (if considered independently), as shown in Figure 4.

This result agrees with our predictions, and shows that DLM effects show up even in short spans, where they are not necessarily expected. If a postnominal adjective intervenes between the noun and the dependent, the dependency length increases only by one word (with respect to the noun phrase with the prenominal adjective). Our results nevertheless suggest that even such short dependencies are consistently minimised. This effect is confirmed in all languages.

4.3 Lexical effects on adjective placement


Figure 4: Percent postnominal placement for the thirty most frequent adjectives in Italian, Spanish, and Portuguese (in this order). Noun phrase has a right ‘de/di’-complement (green) or it does not (red).

One of the lexical factors that could play a confounding role for the prenominal placement of adjectives in Di constructions is the strength of the ‘Noun + di/de + Complement’ collocation. For example, in the French compound ‘taux de chômage’ (‘unemployment rate’), the placement of an adjective after the compound (‘taux de chômage important’) is preferred in our corpora to the adjacent postnominal placement (‘taux important de chômage’). In our analysis, we do not extract these types of post-NP adjectives. From this perspective, a drop in the percentage of postnominal adjectives in ‘di’ cases could indicate that adjectives prefer not to intervene between nouns and their complements. We hypothesize that this dependency is more strongly minimised than other dependencies in the noun phrase because of this strong lexical link.

We confirm that the Di effect is an interaction of the DLM principle and lexical properties of compounds by a further preliminary analysis of collocations. From the French data, we selected a subset with the most frequent ‘Noun + de + Complement’ sequences (10 for each seed noun) and a subset with infrequent sequences (100 random de-complements for each seed noun). We assume that the frequency of the sequence is an indicator of its collocational strength, so that highly frequent sequences are collocations while low-frequency sequences are non-collocational combinations. The first subset has a proportion of 78% prenominal and 22% postnominal adjectives, while the second subset has 61% prenominal and 39% postnominal adjectives. We confirm, then, that in the frequent collocations there is a substantial lexical effect on adjective placement. However, we also observe a preference for prenominal placement for the infrequent ‘Noun + de + Complement’ sequences that are not collocational combinations, since prenominal placement is still much higher than what is observed for French adjectives on average (46% prenominal and 54% postnominal in our sample of data). These numbers suggest that the Di effect reported in the previous section is not a result of mere lexical collocation effects and that, for low-frequency combinations at least, DLM is at play.

A different kind of lexical effect is shown in Figure 5. Here we plot the percent postnominal placement of the adjective when the noun has a complement introduced by di (of), che (that), or per (for), in Italian. We notice that adjective placement is no longer as predominantly prenominal for right dependents introduced by che and per as it is for di. The main difference between di (of) and che (that), per (for) is that the former introduces a PP that is inside the NP that selects it, while che and per usually do not: they introduce adjuncts, infinitivals or clauses. In the linguistic literature, this is the distinction between arguments and adjuncts of the noun, and it is represented structurally. This distinction is, then, a lexically-induced structural distinction, and not simply a collocation.

5 Related work

Our work occupies the middle ground between detailed linguistic investigations of weight effects in chosen constructions of well-studied languages and large-scale demonstrations of the dependency length minimisation principle.

Gildea and Temperley (2007) demonstrated that DLM applies to the dependency-annotated corpora of English and German. They calculate random and optimal dependency lengths for each sentence given its unordered dependency tree and compare these values to the actual dependency lengths. English lengths are shown to be close to optimal, but for German this tendency is not as clear.


Figure 5: Percent postnominal placement for the thirty most frequent adjectives in Italian, followed by the function word di, che, or per (in this order). Noun phrase has a right ‘di/per/che’-complement (green) or it does not (red).

A recent study by Futrell et al. (2015) applies this analysis on a large scale, to more than thirty languages that have dependency treebanks. Their results also confirm the correspondence between the dependency annotation and the experimental data, something that has been reported previously (Merlo, 1994; Roland and Jurafsky, 2002).

Much work in theoretical linguistics addresses the adjective-noun order in Romance languages. Such work typically concentrates on lexico-semantic aspects of adjective placement (Cinque, 2010; Alexiadou, 2001). In our work, we account for the strong lexical prenominal or postnominal preferences of adjectives by including them as random effects in our models.

Closest to our paper is the theoretical work of Abeillé and Godard (2000) on the placement of adjective phrases in French and recent corpus-based work by Fox and Thuilier (2012) and Thuilier (2012). Fox and Thuilier (2012) use a dependency-annotated corpus of French to extract cases of adjective-noun variation and their syntactic contexts. They model the placement of an adjective as a lexical, syntactic and semantic multifactorial variation. They find, for example, that phonologically heavy simple adjectives tend to be postnominal. This result highlights the distinction between phonological weight and syntactic weight, a topic which we do not address in the current work.

6 Conclusion

In this paper, we have shown that differences in the prenominal and postnominal placement of adjectives in the noun phrase across five main Romance languages are not only subject to heaviness effects, but also to subtler dependency length minimisation effects. These effects are almost purely structural and show lexical conditioning only in highly frequent collocations.

The subtle interactions found in this work raise questions about the exact definition of which dependencies are minimised and to what extent a given dependency annotation captures these distinctions. Future work will investigate more refined definitions of dependency length minimisation that distinguish different kinds of dependencies with different weights.

Acknowledgements

We gratefully acknowledge the partial funding of this work by the Swiss National Science Foundation, under grant 144362. Part of this work was conducted during the visit of the first author to the Labex EFL in Paris and the Alpage INRIA group. We thank Benoît Crabbé for the fruitful discussions during this period.

References

Anne Abeillé and Danièle Godard. 2000. French word order and lexical weight. In Robert D. Borsley, editor, The Nature and Function of Syntactic Categories, volume 32 of Syntax and Semantics, pages 325–360. BRILL.

Artemis Alexiadou. 2001. Adjective syntax and noun raising: word order asymmetries in the DP as the result of adjective distribution. Studia Linguistica, 55(3):217–248.

Douglas Bates, Martin Maechler, Ben Bolker, and Steven Walker. 2014. lme4: Linear mixed-effects models using Eigen and S4. R package version 1.1-7.

Joan Bresnan, Anna Cueni, Tatiana Nikitina, and Harald Baayen. 2007. Predicting the dative alternation. In G. Boume, I. Kraemer, and J. Zwarts, editors, Cognitive Foundations of Interpretation, pages 69–94. Royal Netherlands Academy of Science, Amsterdam.

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the Tenth Conference on Computational Natural Language Learning, CoNLL-X '06, pages 149–164, Stroudsburg, PA, USA. Association for Computational Linguistics.

Guglielmo Cinque. 2010. The Syntax of Adjectives: A Comparative Study. MIT Press.

Dóra Csendes, János Csirik, Tibor Gyimóthy, and András Kocsor. 2005. The Szeged treebank. In Text, Speech and Dialogue, pages 123–131. Springer.

Gwendoline Fox and Juliette Thuilier. 2012. Predicting the position of attributive adjectives in the French NP. In Daniel Lassiter and Marija Slavkovik, editors, New Directions in Logic, Language and Computation, Lecture Notes in Computer Science, pages 1–15. Springer, April.

Richard Futrell, Kyle Mahowald, and Edward Gibson. 2015. Large-scale evidence of dependency length minimization in 37 languages. (Submitted to Proceedings of the National Academy of Sciences of the United States of America).

Edward Gibson. 1998. Linguistic complexity: Locality of syntactic dependencies. Cognition, 68(1):1–76.

Daniel Gildea and David Temperley. 2007. Optimizing grammars for minimum dependency length. In Proceedings of the 45th Annual Conference of the Association for Computational Linguistics (ACL'07), pages 184–191, Prague, Czech Republic.

Kristina Gulordava, Paola Merlo, and Benoît Crabbé. 2015. Dependency length minimisation effects in short spans: a large-scale analysis of adjective placement in complex noun phrases. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics: Short Papers (ACL'15).

Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, Pavel Straňák, Mihai Surdeanu, Nianwen Xue, and Yi Zhang. 2009. The CoNLL-2009 shared task: syntactic and semantic dependencies in multiple languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning: Shared Task, CoNLL '09, pages 1–18, Stroudsburg, PA, USA. Association for Computational Linguistics.

Dag T. T. Haug and Marius L. Jøhndal. 2008. Creating a parallel treebank of the old Indo-European Bible translations. In Proceedings of the 2nd Workshop on Language Technology for Cultural Heritage Data, pages 27–34, Marrakech, Morocco.

John A. Hawkins. 1994. A Performance Theory of Order and Constituency. Cambridge University Press, Cambridge.

John A. Hawkins. 2004. Efficiency and Complexity in Grammars. Oxford Linguistics. Oxford University Press, Oxford, UK.

Adam Kilgarriff, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. 2014. The Sketch Engine: ten years on. Lexicography, 1(1):7–36.

Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang, Oscar Täckström, Claudia Bedini, Núria Bertomeu Castelló, and Jungmee Lee. 2013. Universal dependency annotation for multilingual parsing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 92–97. Association for Computational Linguistics.

Paola Merlo. 1994. A corpus-based analysis of verb continuation frequencies for syntactic processing. Journal of Psycholinguistic Research, 23(6):435–457.

Slav Petrov, Dipanjan Das, and Ryan T. McDonald. 2012. A universal part-of-speech tagset. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2089–2096, Istanbul, Turkey.

Douglas Roland and Daniel Jurafsky. 2002. Verb sense and verb subcategorization probabilities. In Paola Merlo and Suzanne Stevenson, editors, The Lexical Basis of Sentence Processing: Formal, Computational, and Experimental Issues. John Benjamins.

Lynne M. Stallings, Maryellen C. MacDonald, and Padraig G. O'Seaghdha. 1998. Phrasal ordering constraints in sentence production: Phrase length and verb disposition in heavy-NP shift. Journal of Memory and Language, 39(3):392–417.

David Temperley. 2007. Minimization of dependency length in written English. Cognition, 105(2):300–333.

Juliette Thuilier. 2012. Contraintes préférentielles et ordre des mots en français. Ph.D. thesis, Université Paris-Diderot - Paris VII, September.

Harry Joel Tily. 2010. The role of processing complexity in word order variation and change. Ph.D. thesis, Stanford University.

Thomas Wasow. 2002. Postverbal Behavior. CSLI Publications.

Marcin Woliński, Katarzyna Głowińska, and Marek Świdziński. 2011. A preliminary version of Składnica, a treebank of Polish. In Zygmunt Vetulani, editor, Proceedings of the 5th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, pages 299–303, Poznań, Poland.

Daniel Zeman, David Mareček, Martin Popel, Loganathan Ramasamy, Jan Štěpánek, Zdeněk Žabokrtský, and Jan Hajič. 2012. HamleDT: To parse or not to parse? In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 23–25, Istanbul, Turkey, May. European Language Resources Association (ELRA).

Proceedings of the 19th Conference on Computational Natural Language Learning, pages 258–267, Beijing, China, July 30-31, 2015. ©2015 Association for Computational Linguistics

Symmetric Pattern Based Word Embeddings for Improved Word Similarity Prediction

Roy Schwartz,¹ Roi Reichart,² Ari Rappoport¹
¹Institute of Computer Science, The Hebrew University
²Technion, IIT
{roys02|arir}@cs.huji.ac.il, [email protected]

Abstract

We present a novel word level vector representation based on symmetric patterns (SPs). For this aim we automatically acquire SPs (e.g., “X and Y”) from a large corpus of plain text, and generate vectors where each coordinate represents the co-occurrence in SPs of the represented word with another word of the vocabulary. Our representation has three advantages over existing alternatives: First, being based on symmetric word relationships, it is highly suitable for word similarity prediction. Particularly, on the SimLex999 word similarity dataset, our model achieves a Spearman's ρ score of 0.517, compared to 0.462 of the state-of-the-art word2vec model. Interestingly, our model performs exceptionally well on verbs, outperforming state-of-the-art baselines by 20.2–41.5%. Second, pattern features can be adapted to the needs of a target NLP application. For example, we show that we can easily control whether the embeddings derived from SPs deem antonym pairs (e.g. (big,small)) as similar or dissimilar, an important distinction for tasks such as word classification and sentiment analysis. Finally, we show that a simple combination of the word similarity scores generated by our method and by word2vec results in a superior predictive power over that of each individual model, scoring as high as 0.563 in Spearman's ρ on SimLex999. This emphasizes the differences between the signals captured by each of the models.

1 Introduction

In the last decade, vector space modeling (VSM) for word representation (a.k.a. word embedding) has become a key tool in NLP. Most approaches to word representation follow the distributional hypothesis (Harris, 1954), which states that words that co-occur in similar contexts are likely to have similar meanings.

VSMs differ in the way they exploit word co-occurrence statistics. Earlier works (see (Turney et al., 2010)) encode this information directly in the features of the word vector representation. More recently, neural networks have become prominent in word representation learning (Bengio et al., 2003; Collobert and Weston, 2008; Collobert et al., 2011; Mikolov et al., 2013a; Pennington et al., 2014, inter alia). Most of these models aim to learn word vectors that maximize a language model objective, thus capturing the tendencies of the represented words to co-occur in the training corpus. VSM approaches have resulted in highly useful word embeddings, obtaining high quality results on various semantic tasks (Baroni et al., 2014).

Interestingly, the impressive results of these models are achieved despite the shallow linguistic information most of them consider, which is limited to the tendency of words to co-occur together in a pre-specified context window. Particularly, very little information is encoded about the syntactic and semantic relations between the participating words, and, instead, a bag-of-words approach is taken.¹

This bag-of-words approach, however, comes with a cost. As recently shown by Hill et al. (2014), despite the impressive results VSMs that take this approach obtain on modeling word association, they are much less successful in modeling word similarity.

¹ A few recent VSMs go beyond the bag-of-words assumption and consider deeper linguistic information in word representation. We address this line of work in Section 2.


Indeed, when evaluating these VSMs with datasets such as wordsim353 (Finkelstein et al., 2001), where the word pair scores reflect association rather than similarity (and therefore the (cup,coffee) pair is scored higher than the (car,train) pair), the Spearman correlation between their scores and the human scores often crosses the 0.7 level. However, when evaluating with datasets such as SimLex999 (Hill et al., 2014), where the pair scores reflect similarity, the correlation of these models with human judgment is below 0.5 (Section 6).

In order to address the challenge of modeling word similarity, we propose an alternative, pattern-based, approach to word representation. In previous work, patterns were used to represent a variety of semantic relations, including hyponymy (Hearst, 1992), meronymy (Berland and Charniak, 1999) and antonymy (Lin et al., 2003). Here, in order to capture similarity between words, we use symmetric patterns (SPs), such as “X and Y” and “X as well as Y”, where each of the words in the pair can take either the X or the Y position. Symmetric patterns have proven useful for representing similarity between words in various NLP tasks, including lexical acquisition (Widdows and Dorow, 2002), word clustering (Davidov and Rappoport, 2006) and classification of words into semantic categories (Schwartz et al., 2014). However, to the best of our knowledge, they have not been applied to vector space word representation.

Our representation is constructed in the following way (Section 3). For each word w, we construct a vector v of size V, where V is the size of the lexicon. Each element in v represents the co-occurrence in SPs of w with another word in the lexicon, which results in a sparse word representation. Unlike most previous works that applied SPs to NLP tasks, we do not use a hard-coded set of patterns. Instead, we extract a set of SPs from plain text using an unsupervised algorithm (Davidov and Rappoport, 2006). This substantially reduces the human supervision our model requires and makes it applicable to practically every language for which a large corpus of text is available.

Our SP-based word representation is flexible. Particularly, by exploiting the semantics of the pattern-based features, our representation can be adapted to fit the specific needs of target NLP applications. In Section 4 we exemplify this property through the ability of our model to control whether its word representations will deem antonyms similar or dissimilar. Antonyms are words that have opposite semantic meanings (e.g., (small,big)); yet, due to their tendency to co-occur in the same context, they are often assigned similar vectors by co-occurrence based representation models (Section 6). Controlling the model judgment of antonym pairs is highly useful for NLP tasks: in some tasks, like word classification, antonym pairs such as (small,big) belong to the same class (size adjectives), while in other tasks, like sentiment analysis, identifying the difference between them is crucial. As discussed in Section 4, we believe that this flexibility holds for various other pattern types and for other lexical semantic relations (e.g. hypernymy, the is-a relation, which holds in word pairs such as (dog,animal)).

We experiment (Section 6) with the SimLex999 dataset (Hill et al., 2014), consisting of 999 pairs of words annotated by human subjects for similarity. When comparing the correlation between the similarity scores derived from our learned representation and the human scores, our representation receives a Spearman correlation coefficient score (ρ) of 0.517, outperforming six strong baselines, including the state-of-the-art word2vec (Mikolov et al., 2013a) embeddings, by 5.5–16.7%. Our model performs particularly well on the verb portion of SimLex999 (222 verb pairs), achieving a Spearman score of 0.578 compared to scores of 0.163–0.376 for the baseline models, an astonishing improvement of 20.2–41.5%. Our analysis reveals that the antonym adjustment capability of our model is vital for its success.

We further demonstrate that the word pair scores produced by our model can be combined with those of word2vec to obtain an improved predictive power for word similarity. The combined scores result in a Spearman's ρ correlation of 0.563, a further 4.6% improvement compared to our model, and a total of 10.1–21.3% improvement over the baseline models. This suggests that the models provide complementary information about word semantics.

2 Related Work

Vector Space Models for Lexical Semantics. Research on vector spaces for word representation dates back to the early 1970s (Salton, 1971). In traditional methods, a vector for each word w is generated, with each coordinate representing the co-occurrence of w and another context item of interest, most often a word but possibly also a sentence, a document or other items.


The feature representation generated by this basic construction is sometimes post-processed using techniques such as Positive Pointwise Mutual Information (PPMI) normalization and dimensionality reduction. For recent surveys, see (Turney et al., 2010; Clark, 2012; Erk, 2012).

Most VSM works share two important characteristics. First, they encode co-occurrence statistics from an input corpus directly into the word vector features. Second, they consider very little information on the syntactic and semantic relations between the represented word and its context items. Instead, a bag-of-words approach is taken.

Recently, there has been a surge of work focusing on Neural Network (NN) algorithms for word representation learning (Bengio et al., 2003; Collobert and Weston, 2008; Mnih and Hinton, 2009; Collobert et al., 2011; Dhillon et al., 2011; Mikolov et al., 2013a; Mnih and Kavukcuoglu, 2013; Lebret and Collobert, 2014; Pennington et al., 2014). Like the more traditional models, these works also take the bag-of-words approach, encoding only shallow co-occurrence information between linguistic items. However, they encode this information into their objective, often a language model, rather than directly into the features.

Consider, for example, the successful word2vec model (Mikolov et al., 2013a). Its continuous-bag-of-words architecture is designed to predict a word given its past and future context. The resulting objective function is:

max Σ_{t=1}^{T} log p(w_t | w_{t−c}, ..., w_{t−1}, w_{t+1}, ..., w_{t+c})

where T is the number of words in the corpus, and c is a pre-determined window size. Another word2vec architecture, skip-gram, aims to predict the past and future context given a word. Its objective is:

max Σ_{t=1}^{T} Σ_{−c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t)

In both cases the objective function relates to the co-occurrence of words within a context window.

A small number of works went beyond the bag-of-words assumption, considering deeper relationships between linguistic items. The Strudel system (Baroni et al., 2010) represents a word using the clusters of lexico-syntactic patterns in which it occurs. Murphy et al. (2012) represented words through their co-occurrence with other words in syntactic dependency relations, and then used the Non-Negative Sparse Embedding (NNSE) method to reduce the dimension of the resulting representation. Levy and Goldberg (2014) extended the skip-gram word2vec model with negative sampling (Mikolov et al., 2013b) by basing the word co-occurrence window on the dependency parse tree of the sentence. Bollegala et al. (2015) replaced bag-of-words contexts with various patterns (lexical, POS and dependency).

We introduce a symmetric pattern based approach to word representation which is particularly suitable for capturing word similarity. In experiments we show the superiority of our model over six models of the above three families: (a) bag-of-words models that encode co-occurrence statistics directly in features; (b) NN models that implement the bag-of-words approach in their objective; and (c) models that go beyond the bag-of-words assumption.

Similarity vs. Association. Most recent VSM research does not distinguish between association and similarity in a principled way, although notable exceptions exist. Turney (2012) constructed two VSMs with the explicit goal of capturing either similarity or association. A classifier that uses the output of these models was able to predict whether two concepts are associated, similar or both. Agirre et al. (2009) partitioned the wordsim353 dataset into two subsets, one focused on similarity and the other on association. They demonstrated the importance of the association/similarity distinction by showing that some VSMs perform relatively well on one subset while others perform comparatively better on the other.

Recently, Hill et al. (2014) presented the SimLex999 dataset, consisting of 999 word pairs judged by humans for similarity only. The participating words belong to a variety of POS tags and concreteness levels, arguably providing a more realistic sample of the English lexicon. Using their dataset, the authors show the tendency of VSMs that take the bag-of-words approach to capture association much better than similarity. This observation motivates our work.

Symmetric Patterns. Patterns (symmetric or not) were found useful in a variety of NLP tasks, including identification of word relations such as hyponymy (Hearst, 1992), meronymy (Berland and Charniak, 1999) and antonymy (Lin et al., 2003).


Patterns have also been applied to tackle sentence-level tasks such as identification of sarcasm (Tsur et al., 2010), sentiment analysis (Davidov et al., 2010) and authorship attribution (Schwartz et al., 2013).

Symmetric patterns (SPs) were employed in various NLP tasks to capture different aspects of word similarity. Widdows and Dorow (2002) used SPs for the task of lexical acquisition. Dorow et al. (2005) and Davidov and Rappoport (2006) used them to perform unsupervised clustering of words. Kozareva et al. (2008) used SPs to classify proper names (e.g., fish names, singer names). Feng et al. (2013) used SPs to build a connotation lexicon, and Schwartz et al. (2014) used SPs to perform minimally supervised classification of words into semantic categories.

While some of these works used a hand-crafted set of SPs (Widdows and Dorow, 2002; Dorow et al., 2005; Kozareva et al., 2008; Feng et al., 2013), Davidov and Rappoport (2006) introduced a fully unsupervised algorithm for the extraction of SPs. Here we apply their algorithm in order to reduce the required human supervision and demonstrate the language independence of our approach.

Antonyms. A useful property of our model is its ability to control the representation of antonym pairs. Outside the VSM literature, several works identified antonyms using word co-occurrence statistics, manually and automatically induced patterns, the WordNet lexicon and thesauri (Lin et al., 2003; Turney, 2008; Wang et al., 2010; Mohammad et al., 2013; Schulte im Walde and Köper, 2013; Roth and Schulte im Walde, 2014). Recently, Yih et al. (2012), Chang et al. (2013) and Ono et al. (2015) proposed word representation methods that assign dissimilar vectors to antonyms. Unlike our unsupervised model, which uses plain text only, these works used the WordNet lexicon and a thesaurus.

3 Model

In this section we describe our approach for generating pattern-based word embeddings. We start by describing symmetric patterns (SPs), continue by showing how SPs can be acquired automatically from text, and, finally, explain how these SPs are used for word embedding construction.

3.1 Symmetric Patterns

Lexico-syntactic patterns are sequences of words and wildcards (Hearst, 1992).

Candidate        Examples of Instances
“X of Y”         “point of view”, “years of age”
“X the Y”        “around the world”, “over the past”
“X to Y”         “nothing to do”, “like to see”
“X and Y”        “men and women”, “oil and gas”
“X in Y”         “keep in mind”, “put in place”
“X of the Y”     “rest of the world”, “end of the war”

Table 1: The six most frequent pattern candidates that contain exactly two wildcards and 1-3 words in our corpus.

Examples of patterns include “X such as Y”, “X or Y” and “X is a Y”. When patterns are instantiated in text, wildcards are replaced by words. For example, the pattern “X is a Y”, with the X and Y wildcards, can be instantiated in phrases like “Guffy is a dog”.

Symmetric patterns are a special type of pattern that contain exactly two wildcards and that tend to be instantiated by wildcard pairs such that each member of the pair can take the X or the Y position. For example, the symmetry of the pattern “X or Y” is exemplified by the semantically plausible expressions “cats or dogs” and “dogs or cats”.

Previous works have shown that words that co-occur in SPs are semantically similar (Section 2). In this work we use symmetric patterns to represent words. Our hypothesis is that such a representation would reflect word similarity (i.e., that similar vectors would represent similar words). Our experiments show that this is indeed the case.

Symmetric Pattern Extraction. Most works that used SPs manually constructed a set of such patterns. The most prominent patterns in these works are “X and Y” and “X or Y” (Widdows and Dorow, 2002; Feng et al., 2013). In this work we follow Davidov and Rappoport (2006) and apply an unsupervised algorithm for the automatic extraction of SPs from plain text.

This algorithm starts by defining an SP template to be a sequence of 3-5 tokens, consisting of exactly two wildcards and 1-3 words. It then traverses a corpus, looking for frequent pattern candidates that match this template. Table 1 shows the six most frequent pattern candidates, along with common instances of these patterns.

The algorithm continues by traversing the pattern candidates and selecting a pattern p if a large portion of the pairs of words wi, wj that co-occur in p co-occur both in the (X = wi, Y = wj) form and in the (X = wj, Y = wi) form. Consider, for example, the pattern candidate “X and Y” and the pair of words “cat”, “dog”.


Both pattern instances “cat and dog” and “dog and cat” are likely to be seen in a large corpus. If this property holds for a large portion² of the pairs of words that co-occur in this pattern, it is selected as symmetric. On the other hand, the pattern candidate “X of Y” is in fact asymmetric: pairs of words such as “point”, “view” tend to come only in the (X = “point”, Y = “view”) form and not the other way around. The reader is referred to (Davidov and Rappoport, 2006) for a more formal description of this algorithm. The resulting pattern set we use in this paper is “X and Y”, “X or Y”, “X and the Y”, “from X to Y”, “X or the Y”, “X as well as Y”, “X or a Y”, “X rather than Y”, “X nor Y”, “X and one Y”, “either X or Y”.
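A simplified sketch of the symmetry criterion (after Davidov and Rappoport (2006), not their exact algorithm): a candidate is kept if at least 15% of its word pairs (the threshold mentioned in footnote 2) also occur in the reversed order. The toy counts below are invented for illustration.

def is_symmetric(pair_counts, threshold=0.15):
    """pair_counts maps (x, y) -> number of times the pattern was seen with X=x, Y=y."""
    pairs = set(pair_counts)
    both_orders = sum(1 for (x, y) in pairs if (y, x) in pairs)
    return both_orders / len(pairs) >= threshold

and_counts = {("cat", "dog"): 40, ("dog", "cat"): 35, ("men", "women"): 90, ("women", "men"): 70}
of_counts  = {("point", "view"): 120, ("years", "age"): 80, ("rest", "world"): 60}
print(is_symmetric(and_counts))   # True  ("X and Y")
print(is_symmetric(of_counts))    # False ("X of Y")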

3.2 SP-based Word Embeddings

In order to generate word embeddings, our model requires a large corpus C and a set of SPs P. The model first computes a symmetric matrix M of size V × V (where V is the size of the lexicon). In this matrix, M_i,j is the co-occurrence count of both wi,wj and wj,wi in all patterns p ∈ P. For example, if wi,wj co-occur 1 time in p1 and 3 times in p5, while wj,wi co-occur 7 times in p9, then M_i,j = M_j,i = 1 + 3 + 7 = 11. We then compute the Positive Pointwise Mutual Information (PPMI) of M, denoted by M*.³ The vector representation of the word wi (denoted by vi) is the ith row in M*.
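A minimal sketch of this construction (our own illustration, with a toy vocabulary and toy pattern instances):

import numpy as np

vocab = ["cat", "dog", "big", "small", "point", "view"]
idx = {w: i for i, w in enumerate(vocab)}
# Each tuple is one SP instance (X-word, Y-word) found in the corpus.
sp_instances = [("cat", "dog"), ("dog", "cat"), ("big", "small"), ("small", "big"), ("cat", "dog")]

V = len(vocab)
M = np.zeros((V, V))
for x, y in sp_instances:                 # count both orders into a symmetric matrix
    M[idx[x], idx[y]] += 1
    M[idx[y], idx[x]] += 1

def ppmi(M):
    total = M.sum()
    p_ij = M / total
    p_i = p_ij.sum(axis=1, keepdims=True)
    p_j = p_ij.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_ij / (p_i * p_j))
    pmi = np.where(M > 0, pmi, 0.0)       # unseen pairs get zero
    return np.maximum(pmi, 0.0)           # positive PMI only

M_star = ppmi(M)                          # row i of M_star is the vector for vocab[i]
print(M_star[idx["cat"], idx["dog"]])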

Smoothing. In order to decrease the sparsity of our representation, we apply a simple smoothing technique. For each word wi, W^n_i denotes the top n vectors with the smallest cosine distance from vi. We define the word embedding of wi to be

v′_i = v_i + α · Σ_{v ∈ W^n_i} v

where α is a smoothing factor.⁴ This process reduces the sparsity of our vector representation. For example, when n = 0 (i.e., no smoothing), the average number of non-zero values per vector is only 0.3K (where the vector size is ~250K). When n = 250, this number reaches ~14K.
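A sketch of the smoothing step on a toy matrix (our own illustration; the paper tunes n and α, with typical values 250 and 7):

import numpy as np

def smooth(M_star, n, alpha):
    norms = np.linalg.norm(M_star, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                   # avoid division by zero for empty rows
    unit = M_star / norms
    sim = unit @ unit.T                       # cosine similarity between all word vectors
    np.fill_diagonal(sim, -np.inf)            # a word is not its own neighbour
    smoothed = M_star.astype(float).copy()
    for i in range(M_star.shape[0]):
        neighbours = np.argsort(-sim[i])[:n]  # indices of the n nearest vectors W_i^n
        smoothed[i] += alpha * M_star[neighbours].sum(axis=0)
    return smoothed

# Toy PPMI-like matrix for four words.
M_star = np.array([[0.0, 1.2, 0.5, 0.0],
                   [1.2, 0.0, 0.0, 0.3],
                   [0.5, 0.0, 0.0, 0.9],
                   [0.0, 0.3, 0.9, 0.0]])
print(smooth(M_star, n=1, alpha=7.0))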

² We use 15% of the pairs of words as a threshold.

³ PPMI was shown useful for various co-occurrence models (Baroni et al., 2014).

⁴ We tune n and α using a development set (Section 5). Typical values for n and α are 250 and 7, respectively.

4 Antonym Representation

In this section we show how our model allows us to adjust the representation of pairs of antonyms to the needs of a subsequent NLP task. This property will later be demonstrated to have a substantial impact on performance.

Antonyms are pairs of words with an opposite meaning (e.g., (tall,short)). As the members of an antonym pair tend to occur in the same context, their word embeddings are often similar. For example, in the skip-gram model (Mikolov et al., 2013a), the score of the (accept,reject) pair is 0.73, and the score of (long,short) is 0.71. Our SP-based word embeddings also exhibit a similar behavior.

The question of whether antonyms are similar or not is not a trivial one. On the one hand, some NLP tasks might benefit from representing antonyms as similar. For example, in word classification tasks, words such as “big” and “small” potentially belong to the same class (size adjectives), and thus representing them as similar is desired. On the other hand, antonyms are very dissimilar by definition. This distinction is crucial in tasks such as search, where a query such as “tall buildings” might be poorly processed if the representations of “tall” and “short” are similar.

In light of this, we construct our word embeddings so that their treatment of antonyms is controllable. That is, our model contains an antonym parameter that can be turned on in order to generate word embeddings that represent antonyms as dissimilar, and turned off to represent them as similar.

To implement this mechanism, we follow Lin et al. (2003), who showed that two patterns are particularly indicative of antonymy: “from X to Y” and “either X or Y” (e.g., “from bottom to top”, “either high or low”). As it turns out, these two patterns are also symmetric, and are discovered by our automatic algorithm. Henceforth, we refer to these two patterns as antonym patterns.

Based on this observation, we present a variant of our model which is designed to assign dissimilar vector representations to antonyms. We define two new matrices, M_SP and M_AP, which are computed similarly to M* (see Section 3.2), only with different SP sets. M_SP is computed using the original set of SPs, excluding the two antonym patterns, while M_AP is computed using the two antonym patterns only.

Then, we define an antonym-sensitive co-occurrence matrix M_{+AN} to be

M_{+AN} = M_SP − γ · M_AP

where γ is a weighting parameter.⁵ Similarly to M*, the antonym-sensitive word representation of the ith word is the ith row in M_{+AN}.
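In code, this combination is a single matrix subtraction (symbols as reconstructed above; the toy values are invented):

import numpy as np

gamma = 7.0                                  # weighting parameter (footnote 5: typical values 7 and 10)
M_SP = np.array([[0.0, 1.2], [1.2, 0.0]])    # toy PPMI matrix from the SPs minus the antonym patterns
M_AP = np.array([[0.0, 0.8], [0.8, 0.0]])    # toy PPMI matrix from the two antonym patterns only
M_AN = M_SP - gamma * M_AP                   # antonym-sensitive matrix: antonym pairs pushed apart
print(M_AN)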

Discussion. The case of antonyms presented in this paper is an example of one relation that a pattern-based representation model can control. This property can potentially be extended to additional word relations, as long as they can be identified using patterns. Consider, for example, the hypernymy relation (is-a, as in the (apple,fruit) pair). This relation can be accurately identified using patterns such as “X such as Y” and “X like Y” (Hearst, 1992). Consequently, it is likely that a pattern-based model can be adapted to control its predictions with respect to this relation using a method similar to the one we use to control antonym representation. We consider this a strong motivation for a deeper investigation of pattern-based VSMs in future work.

We next turn to empirically evaluating the performance of our model in estimating word similarity.

5 Experimental Setup

5.1 Datasets

Evaluation Dataset. We experiment with the SimLex999 dataset (Hill et al., 2014),⁶ consisting of 999 pairs of words. Each pair in this dataset was annotated by roughly 50 human subjects, who were asked to score the similarity between the pair members. SimLex999 has several appealing properties, including its size, part-of-speech diversity, and diversity in the level of concreteness of the participating words.

We follow a 10-fold cross-validation experimental protocol. In each fold, we randomly sample 25% of the SimLex999 word pairs (~250 pairs) and use them as a development set for parameter tuning. We use the remaining 75% of the pairs (~750 pairs) as a test set. We report the average of the results obtained over the 10 folds.

Training Corpus. We use an 8G-word corpus, constructed using the word2vec script.⁷

⁵ We tune γ using a development set (Section 5). Typical values are 7 and 10.

⁶ www.cl.cam.ac.uk/~fh295/simlex.html

⁷ code.google.com/p/word2vec/source/browse/trunk/demo-train-big-model-v1.sh

Through this script we also apply a pre-processing step which employs the word2phrase tool (Mikolov et al., 2013c) to merge common word pairs and triples into expression tokens. Our corpus consists of four datasets: (a) the 2012 and 2013 crawled news articles from the ACL 2014 workshop on statistical machine translation (Bojar et al., 2014);⁸ (b) the One Billion Word Benchmark of Chelba et al. (2013);⁹ (c) the UMBC corpus (Han et al., 2013);¹⁰ and (d) the September 2014 dump of the English Wikipedia.¹¹

5.2 Baselines

We compare our model against six baselines: one that encodes bag-of-words co-occurrence statistics into its features (model 1 below), three NN models that encode the same type of information into their objective function (models 2-4), and two models that go beyond the bag-of-words assumption (models 5-6). Unless stated otherwise, all models are trained on our training corpus.

1. BOW. A simple model where each coordinate corresponds to the co-occurrence count of the represented word with another word in the training corpus. The resulting features are re-weighted according to PPMI. The model's window size parameter is tuned on the development set.¹²

2-3. word2vec. The state-of-the-art word2vec toolkit (Mikolov et al., 2013a)¹³ offers two word embedding architectures: continuous-bag-of-words (CBOW) and skip-gram. We follow the recommendations of the word2vec script for setting the parameters of both models, and tune the window size on the development set.¹⁴

4. GloVe. GloVe (Pennington et al., 2014)¹⁵ is a global log-bilinear regression model for word embedding generation, which trains only on the nonzero elements in a co-occurrence matrix. We use the parameters suggested by the authors, and tune the window size on the development set.¹⁶

⁸ http://www.statmt.org/wmt14/training-monolingual-news-crawl/

⁹ http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz

¹⁰ http://ebiquity.umbc.edu/redirect/to/resource/id/351/UMBC-webbase-corpus

¹¹ dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

¹² The value 2 is almost constantly selected.

¹³ https://code.google.com/p/word2vec/

¹⁴ Window size 2 is generally selected for both models.

¹⁵ nlp.stanford.edu/projects/glove/

¹⁶ Window size 2 is generally selected.


5. NNSE. The NNSE model (Murphy et al., 2012). As no full implementation of this model is available online, we use the off-the-shelf embeddings available at the authors' website,¹⁷ taking the full document and dependency model with 2500 dimensions. These embeddings were computed using a dataset about twice as big as our corpus.

6. Dep. The modified, dependency-based skip-gram model (Levy and Goldberg, 2014). To generate dependency links, we use the Stanford POS Tagger (Toutanova et al., 2003)¹⁸ and the MALT parser (Nivre et al., 2006).¹⁹ We follow the parameters suggested by the authors.

5.3 Evaluation

For evaluation we follow the standard VSM literature: the score assigned to each pair of words by a model m is the cosine similarity between the vectors induced by m for the participating words. The quality of m is evaluated by computing the Spearman correlation coefficient (ρ) between the ranking derived from m's scores and the one derived from the human scores.
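A sketch of this protocol (our own illustration, with toy vectors and toy human ratings):

import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

vectors = {"cat": np.array([1.0, 0.2, 0.0]),
           "dog": np.array([0.9, 0.3, 0.1]),
           "car": np.array([0.0, 0.1, 1.0])}
pairs = [("cat", "dog"), ("cat", "car"), ("dog", "car")]
gold = [8.5, 1.0, 1.5]                                  # human similarity ratings for the same pairs

model_scores = [cosine(vectors[a], vectors[b]) for a, b in pairs]
rho, _ = spearmanr(model_scores, gold)                  # rank correlation with the human ranking
print(rho)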

6 Results

Main Result. Table 2 presents our results. Our model outperforms the baselines by a margin of 5.5–16.7% in Spearman's correlation coefficient (ρ). Note that the capability of our model to control antonym representation has a substantial impact, boosting its performance from ρ = 0.434 when the antonym parameter is turned off to ρ = 0.517 when it is turned on.

Model Combination. We next explore whether our pattern-based model and our best baseline, skip-gram, which implements a bag-of-words approach, can be combined to provide improved predictive power.

For each pair of words in the test set, we take a linear combination of the cosine similarity score computed using our embeddings and the score computed using the skip-gram (SG) embeddings:

f_{+}(w_i, w_j) = \alpha \cdot f_{SP}(w_i, w_j) + (1 - \alpha) \cdot f_{SG}(w_i, w_j)

In this equation f_m(w_i, w_j) is the cosine similarity between the vector representations of words w_i and w_j according to model m, and α is a weighting parameter tuned on the development set (a common value is 0.8).
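In code this is simply a convex combination of the two similarity functions; the function names f_sp and f_sg and the candidate grid below are illustrative assumptions, not part of the paper:

from scipy.stats import spearmanr

def combined_score(w1, w2, f_sp, f_sg, alpha=0.8):
    """f+(w1, w2) = alpha * f_SP(w1, w2) + (1 - alpha) * f_SG(w1, w2)."""
    return alpha * f_sp(w1, w2) + (1 - alpha) * f_sg(w1, w2)

def tune_alpha(dev_pairs, f_sp, f_sg):
    """Pick alpha by Spearman correlation with human scores on the development set."""
    def rho(alpha):
        preds = [combined_score(w1, w2, f_sp, f_sg, alpha) for w1, w2, _ in dev_pairs]
        golds = [h for _, _, h in dev_pairs]
        r, _ = spearmanr(preds, golds)
        return r
    return max((i / 20 for i in range(21)), key=rho)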

17 http://www.cs.cmu.edu/~bmurphy/NNSE/
18 nlp.stanford.edu/software/
19 http://www.maltparser.org/index.html

Model | Spearman's ρ
GloVe | 0.35
BOW | 0.423
CBOW | 0.43
Dep | 0.436
NNSE | 0.455
skip-gram | 0.462
SP(-) | 0.434
SP(+) | 0.517
Joint (SP(+), skip-gram) | 0.563
Average Human Score | 0.651

Table 2: Spearman's ρ scores of our SP-based model with the antonym parameter turned on (SP(+)) or off (SP(-)) and of the baselines described in Section 5.2. Joint (SP(+), skip-gram) is an interpolation of the scores produced by skip-gram and our SP(+) model. Average Human Score is the average correlation of a single annotator with the average score of all annotators, taken from (Hill et al., 2014).


As shown in Table 2, this combination forms the top-performing model on SimLex999, achieving a Spearman's ρ score of 0.563. This score is 4.6% higher than the score of our model, and a 10.1–21.3% improvement compared to the baselines.

wordsim353 Experiments. The wordsim353 dataset (Finkelstein et al., 2001) is frequently used for evaluating word representations. In order to be compatible with previous work, we experiment with this dataset as well. As our word embeddings are designed to support word similarity rather than relatedness, we focus on the similarity subset of this dataset, according to the division presented in (Agirre et al., 2009).

As noted by Hill et al. (2014), the word pair scores in both subsets of wordsim353 reflect word association. This is because the two subsets created by Agirre et al. (2009) keep the original wordsim353 scores, which were produced by human evaluators instructed to score according to association rather than similarity. Consequently, we expect our model to perform worse on this dataset than on a dataset, such as SimLex999, whose annotators were guided to score word pairs according to similarity.

Contrary to SimLex999, wordsim353 treats antonyms as similar. For example, the similarity scores of the (life, death) and (profit, loss) pairs are 7.88 and 7.63 respectively, on a 0-10 scale. Consequently, we turn the antonym parameter off for this experiment.

Table 3 presents the results. As expected, our model is not as successful on a dataset that does not reflect pure similarity. Yet, it still crosses the ρ = 0.7 mark, a quite high performance level.


Model | Spearman's ρ
GloVe | 0.677
Dep | 0.712
BOW | 0.729
CBOW | 0.734
NNSE | 0.78
skip-gram | 0.792
SP(-) | 0.728
Average Human Score | 0.756

Table 3: Spearman's ρ scores for the similarity portion of wordsim353 (Agirre et al., 2009). SP(-) is our model with the antonym parameter turned off. Other abbreviations are as in Table 2.

Model | Adj. | Nouns | Verbs
GloVe | 0.571 | 0.377 | 0.163
Dep | 0.54 | 0.449 | 0.376
BOW | 0.548 | 0.451 | 0.276
CBOW | 0.579 | 0.48 | 0.252
NNSE | 0.594 | 0.487 | 0.318
skip-gram | 0.604 | 0.501 | 0.307
SP(+) | 0.663 | 0.497 | 0.578

Table 4: A POS-based analysis of the various models. Numbers are the Spearman's ρ scores of each model on each of the respective portions of SimLex999.


Part-of-Speech Analysis. We next perform a POS-based evaluation of the participating models, using the three portions of SimLex999: 666 pairs of nouns, 222 pairs of verbs, and 111 pairs of adjectives. Table 4 indicates that our SP(+) model is exceptionally successful in predicting verb and adjective similarity. On verbs, SP(+) obtains a score of ρ = 0.578, a 20.2–41.5% improvement over the baselines. On adjectives, SP(+) performs even better (ρ = 0.663), an improvement of 5.9–12.3% over the baselines. On nouns, SP(+) is second only to skip-gram, though with a very small margin (0.497 vs. 0.501), and outperforms the other baselines by 1–12%. The lower performance of our model on nouns might partially explain its relatively low performance on wordsim353, which is composed exclusively of nouns.

Analysis of Antonyms. We now turn to a qualitative analysis, in order to understand the impact of our modeling decisions on the scores of antonym word pairs. Table 5 presents examples of antonym pairs taken from the SimLex999 dataset, along with their relative ranking among all pairs in the set, as judged by our model (SP(+) with the antonym parameter set to 10, or SP(-) with it set to -1) and by the best baseline representation (skip-gram).

Pair of Words | SP (+AN) | SP (-AN) | skip-gram
new - old | 1 | 6 | 6
narrow - wide | 1 | 7 | 8
necessary - unnecessary | 2 | 2 | 9
bottom - top | 3 | 8 | 10
absence - presence | 4 | 7 | 9
receive - send | 1 | 9 | 8
fail - succeed | 1 | 8 | 6

Table 5: Examples of antonym pairs and their decile in the similarity ranking of our SP model with the antonym parameter turned on (+AN, set to 10) or off (-AN, set to -1), and of the skip-gram model, the best baseline. All examples are judged in the lowest decile (1) by SimLex999's annotators.

Each pair of words is assigned a score between 1 and 10 by each model, where a score of M means that the pair is ranked at the M'th decile. The examples in the table are taken from the first (lowest) decile according to SimLex999's human evaluators. The table shows that when the antonym parameter is off, our model generally recognizes antonyms as similar. In contrast, when the parameter is on, the ranks of antonyms decrease substantially.

Antonymy as Word Analogy. One of the most notable features of the skip-gram model is that some geometric relations between its vectors translate to semantic relations between the represented words (Mikolov et al., 2013c), e.g.:

v_{woman} - v_{man} + v_{king} \approx v_{queen}

It is therefore possible that a similar method can be applied to capture antonymy – a useful property that our model was demonstrated to have.

To test this hypothesis, we generated a set of 200 analogy questions of the form "X - Y + Z = ?" where X and Y are antonyms, and Z is a word with an unknown antonym.20 Example questions include: "stupid - smart + life = ?" (death) and "huge - tiny + arrive = ?" (leave). We applied the standard word analogy evaluation (Mikolov et al., 2013c) on this dataset with the skip-gram embeddings, and found that the results are quite poor: 3.5% accuracy (compared to an average 56% accuracy this model obtains on a standard word analogy dataset (Mikolov et al., 2013a)). Given these results, the question of whether skip-gram is capable of accounting for antonyms remains open.
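For reference, the standard (3CosAdd-style) analogy evaluation used above can be sketched as follows, assuming embeddings is a dict from word to numpy vector and questions is a list of (X, Y, Z, gold) tuples; this is an illustrative sketch rather than the exact evaluation script:

import numpy as np

def answer_analogy(embeddings, x, y, z):
    """Return argmax_w cos(v_x - v_y + v_z, v_w), excluding the query words."""
    target = embeddings[x] - embeddings[y] + embeddings[z]
    target = target / np.linalg.norm(target)
    best_word, best_sim = None, -np.inf
    for w, v in embeddings.items():
        if w in (x, y, z):
            continue
        sim = float(np.dot(target, v) / np.linalg.norm(v))
        if sim > best_sim:
            best_word, best_sim = w, sim
    return best_word

def accuracy(embeddings, questions):
    """questions: e.g. [('stupid', 'smart', 'life', 'death'), ...]."""
    correct = sum(answer_analogy(embeddings, x, y, z) == gold
                  for x, y, z, gold in questions)
    return correct / len(questions)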

20 Two human annotators selected a list of potential antonym pairs from SimLex999 and wordsim353. We took the intersection of their selections (26 antonym pairs) and randomly generated 200 analogy questions, each containing two antonym pairs. The dataset can be found at www.cs.huji.ac.il/~roys02/papers/sp_embeddings/antonymy_analogy_questions.zip



7 Conclusions

We presented a symmetric pattern based model for word vector representation. On SimLex999, our model is superior to six strong baselines, including the state-of-the-art word2vec skip-gram model, by as much as 5.5–16.7% in Spearman's ρ score. We have shown that this gain is largely attributed to the remarkably high performance of our model on verbs, where it outperforms all baselines by 20.2–41.5%. We further demonstrated the adaptability of our model to antonym judgment specifications, and its complementary nature with respect to word2vec.

In future work we intend to extend our pattern-based word representation framework beyond symmetric patterns. As discussed in Section 4, other types of patterns have the potential to further improve the expressive power of word vectors. A particularly interesting challenge is to enhance our pattern-based approach with bag-of-words information, thus enjoying the provable advantages of both frameworks.

Acknowledgments

We would like to thank Elad Eban for his helpful advice. This research was funded (in part) by the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI) and the Israel Ministry of Science and Technology Center of Knowledge in Machine Learning and Artificial Intelligence (Grant number 3-9243).

References

Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and wordnet-based approaches. In Proc. of HLT-NAACL.
Marco Baroni, Brian Murphy, Eduard Barbu, and Massimo Poesio. 2010. Strudel: A corpus-based semantic model based on properties and types. Cognitive Science.
Marco Baroni, Georgiana Dinu, and German Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proc. of ACL.
Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. JMLR.
Matthew Berland and Eugene Charniak. 1999. Finding parts in very large corpora. In Proc. of ACL.
Ondrej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, and Lucia Specia, editors. 2014. Proc. of the Ninth Workshop on Statistical Machine Translation.
Danushka Bollegala, Takanori Maehara, Yuichi Yoshida, and Ken-ichi Kawarabayashi. 2015. Learning word representations from relational graphs. In Proc. of AAAI.
Kai-wei Chang, Wen-tau Yih, and Christopher Meek. 2013. Multi-Relational Latent Semantic Analysis. In Proc. of EMNLP.
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, and Phillipp Koehn. 2013. One billion word benchmark for measuring progress in statistical language modeling. CoRR.
Stephen Clark. 2012. Vector space models of lexical meaning. Handbook of Contemporary Semantics, second edition, pages 1–42.
Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proc. of ICML.
Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. JMLR, 12:2493–2537.
Dmitry Davidov and Ari Rappoport. 2006. Efficient unsupervised discovery of word categories using symmetric patterns and high frequency words. In Proc. of ACL-Coling.
Dmitry Davidov, Oren Tsur, and Ari Rappoport. 2010. Enhanced sentiment learning using twitter hashtags and smileys. In Proc. of Coling.
Paramveer Dhillon, Dean P Foster, and Lyle H Ungar. 2011. Multi-view learning of word embeddings via cca. In Proc. of NIPS.
Beate Dorow, Dominic Widdows, Katarina Ling, Jean-Pierre Eckmann, Danilo Sergi, and Elisha Moses. 2005. Using Curvature and Markov Clustering in Graphs for Lexical Acquisition and Word Sense Discrimination.
Katrin Erk. 2012. Vector Space Models of Word Meaning and Phrase Meaning: A Survey. Language and Linguistics Compass, 6(10):635–653.
Song Feng, Jun Seok Kang, Polina Kuznetsova, and Yejin Choi. 2013. Connotation lexicon: A dash of sentiment beneath the surface meaning. In Proc. of ACL.
Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: The concept revisited. In Proc. of WWW.
Lushan Han, Abhay L. Kashyap, Tim Finin, James Mayfield, and Jonathan Weese. 2013. Umbc ebiquity-core: Semantic textual similarity systems. In Proc. of *SEM.
Zellig Harris. 1954. Distributional structure. Word.
Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proc. of Coling – Volume 2.
Felix Hill, Roi Reichart, and Anna Korhonen. 2014. Simlex-999: Evaluating semantic models with (genuine) similarity estimation. arXiv:1408.3456 [cs.CL].
Zornitsa Kozareva, Ellen Riloff, and Eduard Hovy. 2008. Semantic class learning from the web with hyponym pattern linkage graphs. In Proc. of ACL-HLT.
Remi Lebret and Ronan Collobert. 2014. Word embeddings through hellinger pca. In Proc. of EACL.
Omer Levy and Yoav Goldberg. 2014. Dependency-based word embeddings. In Proc. of ACL (Volume 2: Short Papers).
Dekang Lin, Shaojun Zhao, Lijuan Qin, and Ming Zhou. 2003. Identifying synonyms among distributionally similar words. In Proc. of IJCAI.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Proc. of NIPS.
Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013c. Linguistic regularities in continuous space word representations. In Proc. of NAACL-HLT.
Andriy Mnih and Geoffrey E Hinton. 2009. A scalable hierarchical distributed language model. In Proc. of NIPS.
Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings efficiently with noise-contrastive estimation. In Proc. of NIPS.
Saif M. Mohammad, Bonnie J. Dorr, Graeme Hirst, and Peter D. Turney. 2013. Computing lexical contrast. Computational Linguistics, 39(3):555–590.
Brian Murphy, Partha Pratim Talukdar, and Tom Mitchell. 2012. Learning Effective and Interpretable Semantic Models using Non-Negative Sparse Embedding. In Proc. of Coling.
Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. Maltparser: A data-driven parser-generator for dependency parsing. In Proc. of LREC.
Masataka Ono, Makoto Miwa, and Yutaka Sasaki. 2015. Word embedding-based antonym detection using thesauri and distributional information. In Proc. of NAACL.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Proc. of EMNLP.
Michael Roth and Sabine Schulte im Walde. 2014. Combining Word Patterns and Discourse Markers for Paradigmatic Relation Classification. In Proc. of ACL.
Gerard Salton. 1971. The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, Inc., Upper Saddle River, NJ, USA.
Sabine Schulte im Walde and Maximilian Koper. 2013. Pattern-based distinction of paradigmatic relations for german nouns, verbs, adjectives. Language Processing and Knowledge in the Web, pages 184–198.
Roy Schwartz, Oren Tsur, Ari Rappoport, and Moshe Koppel. 2013. Authorship attribution of micro-messages. In Proc. of EMNLP.
Roy Schwartz, Roi Reichart, and Ari Rappoport. 2014. Minimally supervised classification to semantic categories using automatically acquired symmetric patterns. In Proc. of Coling.
Kristina Toutanova, Dan Klein, Christopher D Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proc. of NAACL.
Oren Tsur, Dmitry Davidov, and Ari Rappoport. 2010. Icwsm–a great catchy name: Semi-supervised recognition of sarcastic sentences in online product reviews. In Proc. of ICWSM.
Peter D. Turney, Patrick Pantel, et al. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research.
Peter D. Turney. 2008. A uniform approach to analogies, synonyms, antonyms, and associations. In Proc. of Coling.
Peter D. Turney. 2012. Domain and function: A dual-space model of semantic relations and compositions. Journal of Artificial Intelligence Research, pages 533–585.
Wenbo Wang, Christopher Thomas, Amit Sheth, and Victor Chan. 2010. Pattern-based synonym and antonym extraction. In Proc. of ACM.
Dominic Widdows and Beate Dorow. 2002. A graph model for unsupervised lexical acquisition. In Proc. of Coling.
Wen-tau Yih, Geoffrey Zweig, and John C. Platt. 2012. Polarity inducing latent semantic analysis. In Proc. of EMNLP-CoNLL.


Proceedings of the 19th Conference on Computational Natural Language Learning, pages 268–278, Beijing, China, July 30-31, 2015. ©2015 Association for Computational Linguistics

Task-Oriented Learning of Word Embeddings for Semantic Relation Classification

Kazuma Hashimoto†, Pontus Stenetorp‡, Makoto Miwa§, and Yoshimasa Tsuruoka†
†The University of Tokyo, 3-7-1 Hongo, Bunkyo-ku, Tokyo, Japan
{hassy,tsuruoka}@logos.t.u-tokyo.ac.jp
‡University College London, London, United Kingdom
[email protected]
§Toyota Technological Institute, 2-12-1 Hisakata, Tempaku-ku, Nagoya, Japan
[email protected]

Abstract

We present a novel learning method for word embeddings designed for relation classification. Our word embeddings are trained by predicting words between noun pairs using lexical relation-specific features on a large unlabeled corpus. This allows us to explicitly incorporate relation-specific information into the word embeddings. The learned word embeddings are then used to construct feature vectors for a relation classification model. On a well-established semantic relation classification task, our method significantly outperforms a baseline based on a previously introduced word embedding method, and compares favorably to previous state-of-the-art models that use syntactic information or manually constructed external resources.

1 Introduction

Automatic classification of semantic relations has a variety of applications, such as information extraction and the construction of semantic networks (Girju et al., 2007; Hendrickx et al., 2010). A traditional approach to relation classification is to train classifiers using various kinds of features with class labels annotated by humans. Carefully crafted features derived from lexical, syntactic, and semantic resources play a significant role in achieving high accuracy for semantic relation classification (Rink and Harabagiu, 2010).

In recent years there has been an increasing interest in using word embeddings as an alternative to traditional hand-crafted features. Word embeddings are represented as real-valued vectors and capture syntactic and semantic similarity between words. For example, word2vec1 (Mikolov et al., 2013b) is a well-established tool for learning word embeddings. Although word2vec has successfully been used to learn word embeddings, these kinds of word embeddings capture only co-occurrence relationships between words (Levy and Goldberg, 2014). While simply adding word embeddings trained using window-based contexts as additional features to existing systems has proven valuable (Turian et al., 2010), more recent studies have focused on how to tune and enhance word embeddings for specific tasks (Bansal et al., 2014; Boros et al., 2014; Chen et al., 2014; Guo et al., 2014; Nguyen and Grishman, 2014), and we continue this line of research for the task of relation classification.

In this work we present a learning method for word embeddings specifically designed to be useful for relation classification. The overview of our system and the embedding learning process are shown in Figure 1. First we train word embeddings by predicting each of the words between noun pairs using lexical relation-specific features on a large unlabeled corpus. We then use the word embeddings to construct lexical feature vectors for relation classification. Lastly, the feature vectors are used to train a relation classification model.

We evaluate our method on a well-established semantic relation classification task and compare it to a baseline based on word2vec embeddings and previous state-of-the-art models that rely on either manually crafted features, syntactic parses or external semantic resources. Our method significantly outperforms the word2vec-based baseline, and compares favorably with previous state-of-the-art models, despite relying only on lexical-level features and no external annotated resources. Furthermore, our qualitative analysis of the learned embeddings shows that n-grams of our embeddings capture salient syntactic patterns similar to semantic relation types.

1https://code.google.com/p/word2vec/.


Figure 1: The overview of our system (a) and the embedding learning method (b). In the example sentence, each of are, caused, and by is treated as a target word to be predicted during training.


2 Related Work

A traditional approach to relation classification is to train classifiers in a supervised fashion using a variety of features. These features include lexical bag-of-words features and features based on syntactic parse trees. For syntactic parse trees, the paths between the target entities on constituency and dependency trees have been demonstrated to be useful (Bunescu and Mooney, 2005; Zhang et al., 2006). On the shared task introduced by Hendrickx et al. (2010), Rink and Harabagiu (2010) achieved the best score using a variety of hand-crafted features which were then used to train a Support Vector Machine (SVM).

Recently, word embeddings have become popular as an alternative to hand-crafted features (Collobert et al., 2011). However, one of the limitations is that word embeddings are usually learned by predicting a target word in its context, leading to only local co-occurrence information being captured (Levy and Goldberg, 2014). Thus, several recent studies have focused on overcoming this limitation. Le and Mikolov (2014) integrated paragraph information into a word2vec-based model, which allowed them to capture paragraph-level information. For dependency parsing, Bansal et al. (2014) and Chen et al. (2014) found ways to improve performance by integrating dependency-based context information into their embeddings.

Bansal et al. (2014) trained embeddings by defining parent and child nodes in dependency trees as contexts. Chen et al. (2014) introduced the concept of feature embeddings induced by parsing a large unannotated corpus and then learning embeddings for the manually crafted features. For information extraction, Boros et al. (2014) trained word embeddings relevant for event role extraction, and Nguyen and Grishman (2014) employed word embeddings for domain adaptation of relation extraction. Another kind of task-specific word embeddings was proposed by Tang et al. (2014), which used sentiment labels on tweets to adapt word embeddings for a sentiment analysis task. However, such an approach is only feasible when a large amount of labeled data is available.

3 Relation Classification Using Word Embedding-based Features

We propose a novel method for learning word embeddings designed for relation classification. The word embeddings are trained by predicting each word between noun pairs, given the corresponding low-level features for relation classification. In general, to classify relations between pairs of nouns the most important features come from the pairs themselves and the words between and around the pairs (Hendrickx et al., 2010). For example, in the sentence in Figure 1 (b) there is a cause-effect relationship between the two nouns conflicts and players. To classify the relation, the most common features are the noun pair (conflicts, players), the words between the noun pair (are, caused, by), the words before the pair (the, external), and the words after the pair (playing, tiles, to, ...).


As shown by Rink and Harabagiu (2010), the words between the noun pairs are the most effective among these features. Our main idea is to treat the most important features (the words between the noun pairs) as the targets to be predicted and other lexical features (noun pairs, words outside them) as their contexts. Due to this, we expect our embeddings to capture relevant features for relation classification better than previous models which only use window-based contexts.

In this section we first describe the learning process for the word embeddings, focusing on lexical features for relation classification (Figure 1 (b)). We then propose a simple and powerful technique to construct features which serve as input for a softmax classifier. The overview of our proposed system is shown in Figure 1 (a).

3.1 Learning Word Embeddings

Assume that there is a noun pair n = (n1, n2) in a sentence, with M_in words between the pair and M_out words before and after the pair:

• w^in = (w^in_1, . . . , w^in_{M_in}),
• w^bef = (w^bef_1, . . . , w^bef_{M_out}), and
• w^aft = (w^aft_1, . . . , w^aft_{M_out}).

Our method predicts each target word w^in_i ∈ w^in using three kinds of information: n, words around w^in_i in w^in, and words in w^bef and w^aft. Words are embedded in a d-dimensional vector space and we refer to these vectors as word embeddings. To discriminate words in n from those in w^in, w^bef, and w^aft, we have two sets of word embeddings: N ∈ R^{d×|N|} and W ∈ R^{d×|W|}. W is a set of words and N is also a set of words but contains only nouns. Hence, the word cause has two embeddings: one in N and another in W. In general cause is used as a noun and a verb, and thus we expect the noun embeddings to capture the meanings focusing on their noun usage. This is inspired by some recent work on word representations that explicitly assigns an independent representation for each word usage according to its part-of-speech tag (Baroni and Zamparelli, 2010; Grefenstette and Sadrzadeh, 2011; Hashimoto et al., 2013; Hashimoto et al., 2014; Kartsaklis and Sadrzadeh, 2013).

A feature vector f ∈ R^{2d(2+c)×1} is constructed to predict w^in_i by concatenating word embeddings:

f = [N(n_1); N(n_2); W(w^{in}_{i-1}); \ldots; W(w^{in}_{i-c}); W(w^{in}_{i+1}); \ldots; W(w^{in}_{i+c}); \frac{1}{M_{out}} \sum_{j=1}^{M_{out}} W(w^{bef}_j); \frac{1}{M_{out}} \sum_{j=1}^{M_{out}} W(w^{aft}_j)] .   (1)

N(·) and W(·) ∈ R^{d×1} correspond to each word and c is the context size. A special NULL token is used if i − j is smaller than 1 or i + j is larger than M_in, for each j ∈ {1, 2, . . . , c}.
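A sketch of Eq. (1); here N and W are assumed to be dicts from word to d-dimensional numpy vectors (including an entry for the NULL token), and the variable names are illustrative rather than taken from the authors' code:

import numpy as np

NULL = "<NULL>"

def window(words, i, c, side):
    """Return c words to the left (side=-1) or right (side=+1) of position i, NULL-padded."""
    out = []
    for j in range(1, c + 1):
        k = i - j if side < 0 else i + j
        out.append(words[k] if 0 <= k < len(words) else NULL)
    return out

def feature_vector(N, W, n1, n2, w_in, i, w_bef, w_aft, c):
    """Concatenate noun, window, and averaged outside-context embeddings (Eq. 1)."""
    parts = [N[n1], N[n2]]
    parts += [W[w] for w in window(w_in, i, c, side=-1)]
    parts += [W[w] for w in window(w_in, i, c, side=+1)]
    parts.append(np.mean([W[w] for w in w_bef], axis=0) if w_bef else W[NULL])
    parts.append(np.mean([W[w] for w in w_aft], axis=0) if w_aft else W[NULL])
    return np.concatenate(parts)   # shape: (2d(2 + c),)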

Our method then estimates a conditional probability p(w|f) that the target word is a word w given the feature vector f, using a logistic regression model:

p(w | f) = \sigma(\tilde{W}(w) \cdot f + b(w)) ,   (2)

where W̃(w) ∈ R^{2d(2+c)×1} is a weight vector for w, b(w) ∈ R is a bias for w, and σ(x) = 1/(1 + e^{−x}) is the logistic function. Each column vector in W̃ ∈ R^{2d(2+c)×|W|} corresponds to a word. That is, we assign a logistic regression model to each word, and we can train the embeddings using the one-versus-rest approach to make p(w^in_i | f) larger than p(w' | f) for w' ≠ w^in_i. However, naively optimizing the parameters of those logistic regression models would lead to prohibitive computational cost, since it grows linearly with the size of the vocabulary.

When training we employ several procedures introduced by Mikolov et al. (2013b), namely, negative sampling, a modified unigram noise distribution, and subsampling. For negative sampling the model parameters N, W, W̃, and b are learned by maximizing the objective function J_unlabeled:

J_{unlabeled} = \sum_{n} \sum_{i=1}^{M_{in}} \Big( \log p(w^{in}_i | f) + \sum_{j=1}^{k} \log\big(1 - p(w'_j | f)\big) \Big) ,   (3)

where w'_j is a word randomly drawn from the unigram noise distribution weighted by an exponent of 0.75. Maximizing J_unlabeled means that our method can discriminate between each target word and k noise words given the target word's context. This approach is much less computationally expensive than the one-versus-rest approach and has proven effective in learning word embeddings.
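A sketch of Eqs. (2)-(3) for a single training example (parameter updates omitted); W_tilde is assumed to be a (|W| × 2d(2+c)) numpy array, b a bias vector, and noise_probs the unigram distribution raised to the power 0.75 and renormalized:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def score(W_tilde, b, word_id, f):
    """p(w | f) = sigmoid(W_tilde(w) . f + b(w))  (Eq. 2)."""
    return sigmoid(np.dot(W_tilde[word_id], f) + b[word_id])

def negative_sampling_objective(W_tilde, b, target_id, f, noise_probs, k, rng):
    """One term of J_unlabeled (Eq. 3): the target word against k sampled noise words."""
    obj = np.log(score(W_tilde, b, target_id, f))
    noise_ids = rng.choice(len(noise_probs), size=k, p=noise_probs)
    for nid in noise_ids:
        obj += np.log(1.0 - score(W_tilde, b, nid, f))
    return obj

# Example: rng = np.random.default_rng(0); the objective would be maximized by SGD.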


To reduce redundancy during training we use subsampling. A training sample, whose target word is w, is discarded with the probability P_d(w) = 1 - \sqrt{t / p(w)}, where t is a threshold which is set to 10^{-5} and p(w) is a probability corresponding to the frequency of w in the training corpus. The more frequent a target word is, the more likely it is to be discarded. To further emphasize infrequent words, we apply the subsampling approach not only to target words, but also to noun pairs; concretely, by drawing two random numbers r1 and r2, a training sample whose noun pair is (n1, n2) is discarded if P_d(n1) is larger than r1 or P_d(n2) is larger than r2.
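A sketch of the discard rule just described; unigram_prob is assumed to map each word to its relative frequency in the training corpus:

import math
import random

def discard_prob(word, unigram_prob, t=1e-5):
    """P_d(w) = 1 - sqrt(t / p(w)); frequent words are discarded more often."""
    return max(0.0, 1.0 - math.sqrt(t / unigram_prob[word]))

def keep_sample(target, n1, n2, unigram_prob, t=1e-5):
    """Drop a training sample based on its target word and on both nouns of the pair."""
    if random.random() < discard_prob(target, unigram_prob, t):
        return False
    r1, r2 = random.random(), random.random()
    return (discard_prob(n1, unigram_prob, t) <= r1 and
            discard_prob(n2, unigram_prob, t) <= r2)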

Since the feature vector f is constructed as defined in Eq. (1), at each training step W̃(w) is updated based on information about what pair of nouns surrounds w, what word n-grams appear in a small window around w, and what words appear outside the noun pair. Hence, the weight vector W̃(w) captures rich information regarding the target word w.

3.2 Constructing Feature Vectors

Once the word embeddings are trained, we can use them for relation classification. Given a noun pair n = (n1, n2) with its context words w^in, w^bef, and w^aft, we construct a feature vector to classify the relation between n1 and n2 by concatenating three kinds of feature vectors:

g_n   the word embeddings of the noun pair,
g_in  the averaged n-gram embeddings between the pair, and
g_out the concatenation of the averaged word embeddings in w^bef and w^aft.

The feature vector g_n ∈ R^{2d×1} is the concatenation of N(n1) and N(n2):

g_n = [N(n_1); N(n_2)] .   (4)

Words between the noun pair contribute to classifying the relation, and one of the most common ways to incorporate an arbitrary number of words is treating them as a bag of words. However, word order information is lost for bag-of-words features such as averaged word embeddings. To incorporate the word order information, we first define n-gram embeddings h_i ∈ R^{4d(1+c)×1} between the noun pair:

h_i = [W(w^{in}_{i-1}); \ldots; W(w^{in}_{i-c}); W(w^{in}_{i+1}); \ldots; W(w^{in}_{i+c}); \tilde{W}(w^{in}_i)] .   (5)

Note that W̃ can also be used, and that the value used for n is (2c+1). As described in Section 3.1, W̃ captures meaningful information about each word, and after the first embedding learning step we can treat the embeddings in W̃ as features for the words. Mnih and Kavukcuoglu (2013) have demonstrated that using embeddings like those in W̃ is useful in representing the words. We then compute the feature vector g_in by averaging h_i:

g_{in} = \frac{1}{M_{in}} \sum_{i=1}^{M_{in}} h_i .   (6)

We use the averaging approach since M_in depends on each instance. The feature vector g_in allows us to represent word sequences of arbitrary lengths as fixed-length feature vectors using the simple operations: concatenation and averaging.
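A sketch of Eqs. (5)-(6), mirroring the NULL padding of the Eq. (1) sketch; W and W_tilde_emb are assumed to be dicts from word to numpy vector, the latter holding the weight vectors of W̃ reused as embeddings:

import numpy as np

NULL = "<NULL>"

def ngram_embedding(W, W_tilde_emb, w_in, i, c):
    """h_i (Eq. 5): window embeddings around position i plus the W~ vector of w_in[i]."""
    left = [w_in[i - j] if i - j >= 0 else NULL for j in range(1, c + 1)]
    right = [w_in[i + j] if i + j < len(w_in) else NULL for j in range(1, c + 1)]
    parts = [W[w] for w in left] + [W[w] for w in right] + [W_tilde_emb[w_in[i]]]
    return np.concatenate(parts)   # shape: (4d(1 + c),) when dim(W~ vector) = 2d(2 + c)

def g_in(W, W_tilde_emb, w_in, c):
    """g_in (Eq. 6): n-gram embeddings averaged over all positions between the nouns."""
    return np.mean([ngram_embedding(W, W_tilde_emb, w_in, i, c)
                    for i in range(len(w_in))], axis=0)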

The words before and after the noun pair are sometimes important in classifying the relation. For example, in the phrase "pour n1 into n2", the word pour should be helpful in classifying the relation. As with Eq. (1), we use the concatenation of the averaged word embeddings of words before and after the noun pair to compute the feature vector g_out ∈ R^{2d×1}:

g_{out} = \frac{1}{M_{out}} \Big[ \sum_{j=1}^{M_{out}} W(w^{bef}_j); \sum_{j=1}^{M_{out}} W(w^{aft}_j) \Big] .   (7)

As described above, the overall feature vector e ∈ R^{4d(2+c)×1} is constructed by concatenating g_n, g_in, and g_out. We would like to emphasize that we only use simple operations: averaging and concatenating the learned word embeddings. The feature vector e is then used as input for a softmax classifier, without any complex transformation such as matrix multiplication with non-linear functions.

3.3 Supervised Learning

Given a relation classification task we train a softmax classifier using the feature vector e described in Section 3.2. For each k-th training sample with a corresponding label l_k among L predefined labels, we compute a conditional probability given its feature vector e_k:

p(l_k | e_k) = \frac{\exp(o(l_k))}{\sum_{i=1}^{L} \exp(o(i))} ,   (8)

where o ∈ R^{L×1} is defined as o = S e_k + s, and S ∈ R^{L×4d(2+c)} and s ∈ R^{L×1} are the softmax parameters. o(i) is the i-th element of o. We then define the objective function as:

J_{labeled} = \sum_{k=1}^{K} \log\big(p(l_k | e_k)\big) - \frac{\lambda}{2} \|\theta\|^2 .   (9)

K is the number of training samples and λ controls the L2 regularization. θ = (N, W, W̃, S, s) is the set of parameters and J_labeled is maximized using AdaGrad (Duchi et al., 2011). We have found that dropout (Hinton et al., 2012) is helpful in preventing our model from overfitting. Concretely, elements in e are randomly omitted with a probability of 0.5 at each training step. Recently dropout has been applied to deep neural network models for natural language processing tasks and proven effective (Irsoy and Cardie, 2014; Paulus et al., 2014).
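A sketch of the classifier side (Eqs. (8)-(9)) together with the dropout step on e; the helper names are illustrative, the regularizer below covers only the softmax parameters rather than all of θ, and the AdaGrad update itself is omitted:

import numpy as np

def softmax(o):
    o = o - np.max(o)
    e = np.exp(o)
    return e / e.sum()

def class_probs(S, s, e):
    """p(l | e) = softmax(S e + s)  (Eq. 8)."""
    return softmax(S.dot(e) + s)

def labeled_objective(S, s, features, labels, lam):
    """Log-likelihood minus (lam / 2) * squared norm of the softmax parameters (cf. Eq. 9)."""
    ll = sum(np.log(class_probs(S, s, e)[l]) for e, l in zip(features, labels))
    return ll - 0.5 * lam * (np.sum(S ** 2) + np.sum(s ** 2))

def dropout(e, p=0.5, rng=np.random):
    """Randomly zero elements of the feature vector e during training."""
    mask = rng.random(e.shape) >= p   # keep each element with probability 1 - p
    return e * mask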

In what follows, we refer to the above method as RelEmb. While RelEmb uses only low-level features, a variety of useful features have been proposed for relation classification. Among them, we use dependency path features (Bunescu and Mooney, 2005) based on the untyped binary dependencies of the Stanford parser to find the shortest path between target nouns. The dependency path features are computed by averaging word embeddings from W on the shortest path, and are then concatenated to the feature vector e. Furthermore, we directly incorporate semantic information using word-level semantic features from Named Entity (NE) tags and WordNet hypernyms, as used in previous work (Rink and Harabagiu, 2010; Socher et al., 2012; Yu et al., 2014). We refer to this extended method as RelEmbFULL. Concretely, RelEmbFULL uses the same binary features as in Socher et al. (2012). The features come from NE tags and WordNet hypernym tags of target nouns provided by a sense tagger (Ciaramita and Altun, 2006).

4 Experimental Settings

4.1 Training Data

For pre-training we used a snapshot of the English Wikipedia2 from November 2013. First, we extracted 80 million sentences from the original Wikipedia file, and then used Enju3 (Miyao and Tsujii, 2008) to automatically assign part-of-speech (POS) tags. From the POS tags we used NN, NNS, NNP, or NNPS to locate noun pairs in the corpus. We then collected training data by listing pairs of nouns and the words between, before, and after the noun pairs. A noun pair was omitted if the number of words between the pair was larger than 10, and we consequently collected 1.4 billion pairs of nouns and their contexts.4 We used the 300,000 most frequent words and the 300,000 most frequent nouns and treated out-of-vocabulary words as a special UNK token.

2http://dumps.wikimedia.org/enwiki/.


4.2 Initialization and Optimization

We initialized the embedding matrices N and W with zero-mean gaussian noise with a variance of 1/d. W̃ and b were zero-initialized. The model parameters were optimized by maximizing the objective function in Eq. (3) using stochastic gradient ascent. The learning rate was set to α and linearly decreased to 0 during training, as described in Mikolov et al. (2013a). The hyperparameters are the embedding dimensionality d, the context size c, the number of negative samples k, the initial learning rate α, and M_out, the number of words outside the noun pairs. For hyperparameter tuning, we first fixed α to 0.025 and M_out to 5, and then set d to {50, 100, 300}, c to {1, 2, 3}, and k to {5, 15, 25}.

At the supervised learning step, we initialized S and s with zeros. The hyperparameters, namely the learning rate for AdaGrad, λ, M_out, and the number of iterations, were determined via 10-fold cross validation on the training set for each setting. Note that M_out can be tuned at the supervised learning step, adapting to a specific dataset.

5 Evaluation

5.1 Evaluation Dataset

We evaluated our method on the SemEval 2010 Task 8 data set5 (Hendrickx et al., 2010), which involves predicting the semantic relations between noun pairs in their contexts.

3 Despite Enju being a syntactic parser, we only use the POS tagger component. The accuracy of the POS tagger is about 97.2% on the WSJ corpus.

4 The training data, the training code, and the learned model parameters used in this paper are publicly available at http://www.logos.t.u-tokyo.ac.jp/~hassy/publications/conll2015/

5http://docs.google.com/View?docid=dfvxd49s_36c28v9pmw.


The dataset, containing 8,000 training and 2,717 test samples, defines nine classes (Cause-Effect, Entity-Origin, etc.) for ordered relations and one class (Other) for other relations. Thus, the task can be treated as a 19-class classification task. Two examples from the training set are shown below.

(a) Financial [stress]E1 is one of the main causes of [divorce]E2

(b) The [burst]E1 has been caused by water hammer [pressure]E2

Training example (a) is classified as Cause-Effect(E1, E2), which denotes that E2 is an effect caused by E1, while training example (b) is classified as Cause-Effect(E2, E1), which is the inverse of Cause-Effect(E1, E2). We report the official macro-averaged F1 scores and accuracy.

5.2 Models

To empirically investigate the performance of our proposed method we compared it to several baselines and previously proposed models.

5.2.1 Random and word2vec Initialization

Rand-Init. The first baseline is RelEmb itself, but without applying the learning method on the unlabeled corpus. In other words, we train the softmax classifier from Section 3.3 on the labeled training data with randomly initialized model parameters.

W2V-Init. The second baseline is RelEmb using word embeddings learned by word2vec. More specifically, we initialize the embedding matrices N and W with the word2vec embeddings. Related to our method, word2vec has a set of weight vectors similar to W̃ when trained with negative sampling, and we use these weight vectors as a replacement for W̃. We trained the word2vec embeddings using the CBOW model with subsampling on the full Wikipedia corpus. As with our experimental settings, we fix the learning rate to 0.025, and investigate several hyperparameter settings. For hyperparameter tuning we set the embedding dimensionality d to {50, 100, 300}, the context size c to {1, 3, 9}, and the number of negative samples k to {5, 15, 25}.

5.2.2 SVM-Based Systems

A simple approach to the relation classification task is to use SVMs with standard binary bag-of-words features. The bag-of-words features included the noun pairs and words between, before, and after the pairs, and we used LIBLINEAR6 as our classifier.

5.2.3 Neural Network Models

Socher et al. (2012) used Recursive Neural Network (RNN) models to classify the relations. Subsequently, Ebrahimi and Dou (2015) and Hashimoto et al. (2013) proposed RNN models to better handle the relations. These methods rely on syntactic parse trees.

Yu et al. (2014) introduced their novel Factor-based Compositional Model (FCM) and presented results from several model variants, the best performing being FCMEMB and FCMFULL. The former only uses word embedding information and the latter relies on dependency paths and NE features, in addition to word embeddings.

Zeng et al. (2014) used a Convolutional Neural Network (CNN) with WordNet hypernyms. Noteworthy in relation to the RNN-based methods, the CNN model does not rely on parse trees. More recently, dos Santos et al. (2015) have introduced CR-CNN by extending the CNN model and achieved the best result to date. The key point of CR-CNN is that it improves the classification score by omitting the noisy class "Other" in the dataset described in Section 5.1. We call CR-CNN using the "Other" class CR-CNNOther and CR-CNN omitting the class CR-CNNBest.

5.3 Results and Discussion

The scores on the test set for SemEval 2010 Task 8 are shown in Table 1. RelEmb achieves an F1 score of 82.8%, which is better than those of almost all compared models and comparable to that of the previous state of the art, except for CR-CNNBest. Note that RelEmb does not rely on external semantic features or syntactic parse features.7 Furthermore, RelEmbFULL achieves an F1 score of 83.5%. We calculated a confidence interval of (82.0, 84.9) (p < 0.05) using bootstrap resampling (Noreen, 1989).

5.3.1 Comparison with the Baselines

RelEmb significantly outperforms not only the Rand-Init baseline, but also the W2V-Init baseline.

6http://www.csie.ntu.edu.tw/˜cjlin/liblinear/.

7 While we use a POS tagger to locate noun pairs, RelEmb does not explicitly use POS features at the supervised learning step.


Model | Features for classifiers | F1 / ACC (%)
RelEmbFULL | embeddings, dependency paths, WordNet, NE | 83.5 / 79.9
RelEmb | embeddings | 82.8 / 78.9
RelEmb (W2V-Init) | embeddings | 81.8 / 77.7
RelEmb (Rand-Init) | embeddings | 78.2 / 73.5
SVM | bag of words | 76.5 / 72.0
SVM (Rink and Harabagiu, 2010) | bag of words, POS, dependency paths, WordNet, paraphrases, TextRunner, Google n-grams, etc. | 82.2 / 77.9
CR-CNNBest (dos Santos et al., 2015) | embeddings, word position embeddings | 84.1 / n/a
FCMFULL (Yu et al., 2014) | embeddings, dependency paths, NE | 83.0 / n/a
CR-CNNOther (dos Santos et al., 2015) | embeddings, word position embeddings | 82.7 / n/a
CRNN (Ebrahimi and Dou, 2015) | embeddings, parse trees, WordNet, NE, POS | 82.7 / n/a
CNN (Zeng et al., 2014) | embeddings, WordNet | 82.7 / n/a
MVRNN (Socher et al., 2012) | embeddings, parse trees, WordNet, NE, POS | 82.4 / n/a
FCMEMB (Yu et al., 2014) | embeddings | 80.6 / n/a
RNN (Hashimoto et al., 2013) | embeddings, parse trees, phrase categories, etc. | 79.4 / n/a

Table 1: Scores on the test set for SemEval 2010 Task 8.

These results show that our task-specific word embeddings are more useful than those trained using window-based contexts. A point that we would like to emphasize is that the baselines are unexpectedly strong. As was noted by Wang and Manning (2012), we should carefully implement strong baselines and see whether complex models can outperform these baselines.

5.3.2 Comparison with SVM-Based Systems

RelEmb performs much better than the bag-of-words-based SVM. This is not surprising given that we use a large unannotated corpus and embeddings with a large number of parameters. RelEmb also outperforms the SVM system of Rink and Harabagiu (2010), which demonstrates the effectiveness of our task-specific word embeddings, despite our only requirement being a large unannotated corpus and a POS tagger.

5.3.3 Comparison with Neural Network Models

RelEmb outperforms the RNN models. In our preliminary experiments, we have found some undesirable parse trees when computing vector representations using RNN-based models, and such parsing errors might hamper the performance of the RNN models.

FCMFULL, which relies on dependency paths and NE features, achieves a better score than RelEmb. Without such features, RelEmb outperforms FCMEMB by a large margin. By incorporating external resources, RelEmbFULL outperforms FCMFULL.

RelEmb compares favorably to CR-CNNOther, despite our method being less computationally expensive than CR-CNNOther. When classifying an instance, the number of floating-point multiplications is 4d(2 + c)L in our method, since our method requires only one matrix-vector product for the softmax classifier as described in Section 3.3. Here c is the window size, d is the word embedding dimensionality, and L is the number of classes. In CR-CNNOther, the number is Dc(d + 2d')N + DL, where D is the dimensionality of the convolution layer, d' is the position embedding dimensionality, and N is the average length of the input sentences. Here, we omit the cost of the hyperbolic tangent function in CR-CNNOther for simplicity. Using the best hyperparameter settings, the number is roughly 3.8 × 10^4 in our method, and 1.6 × 10^7 in CR-CNNOther assuming N is 10. dos Santos et al. (2015) also boosted the score of CR-CNNOther by omitting the noisy class "Other" with a ranking-based classifier, and achieved the best score (CR-CNNBest). Our results might also be improved by the same technique, but since the technique is dataset-dependent, we did not incorporate it.
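As a quick sanity check of the first figure, plugging in the best setting reported later (d = 100, c = 3) and the L = 19 classes of the task gives:

4d(2+c)L = 4 \cdot 100 \cdot (2 + 3) \cdot 19 = 38{,}000 \approx 3.8 \times 10^{4}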

5.4 Analysis on Training Settings

We perform an analysis of the training procedure, focusing on RelEmb.

5.4.1 Effects of Tuning Hyperparameters

In Tables 2 and 3, we show how tuning the hyperparameters of our method and word2vec affects the classification results, using 10-fold cross validation on the training set. The same split is used for each setting, so all results are comparable to each other. The best settings from the cross validation are used to produce the results reported in Table 1.


c | d | k = 5 | k = 15 | k = 25
1 | 50 | 80.5 | 81.0 | 80.9
1 | 100 | 80.9 | 81.3 | 81.2
2 | 50 | 80.9 | 81.3 | 81.3
2 | 100 | 81.3 | 81.6 | 81.7
3 | 50 | 81.0 | 81.0 | 81.5
3 | 100 | 81.3 | 81.9 | 82.2
3 | 300 | - | - | 82.0

Table 2: Cross-validation results for RelEmb.

c | d | k = 5 | k = 15 | k = 25
1 | 50 | 80.5 | 80.7 | 80.9
1 | 100 | 81.1 | 81.2 | 81.0
1 | 300 | 81.2 | 81.3 | 81.2
3 | 50 | 80.4 | 80.7 | 80.8
3 | 100 | 81.0 | 81.0 | 80.9
9 | 50 | 80.0 | 79.8 | 80.2
9 | 100 | 80.3 | 80.4 | 80.1

Table 3: Cross-validation results for the W2V-Init.


Table 2 shows F1 scores obtained by RelEmb. The results for d = 50, 100 show that RelEmb benefits from relatively large context sizes. The n-gram embeddings in RelEmb capture richer information by setting c to 3 compared to setting c to 1. Relatively large numbers of negative samples also slightly boost the scores. As opposed to these trends, the score does not improve using d = 300. We use the best setting (c = 3, d = 100, k = 25) for the remaining analysis. We note that RelEmbFULL achieves an F1 score of 82.5.

We also performed similar experiments for the W2V-Init baseline, and the results are shown in Table 3. In this case, the number of negative samples does not affect the scores, and the best score is achieved by c = 1. As discussed in Bansal et al. (2014), a small context size captures the syntactic similarity between words rather than the topical similarity. This result indicates that syntactic similarity is more important than topical similarity for this task. Compared to the word2vec embeddings, our embeddings capture not only local context information using word order, but also long-range co-occurrence information, by being tailored for the specific task.

g_n | g_in | g'_in | g_n, g_in | g_n, g_in, g_out
61.8 | 70.2 | 68.2 | 81.1 | 82.2

Table 4: Cross-validation results for ablation tests.

Method | Score
RelEmb N | 0.690
RelEmb W | 0.599
W2V-Init | 0.687

Table 5: Evaluation on the WordSim-353 dataset.


5.4.2 Ablation Tests

As described in Section 3.2, we concatenate three kinds of feature vectors, g_n, g_in, and g_out, for supervised learning. Table 4 shows classification scores for ablation tests using 10-fold cross validation. We also provide a score using a simplified version of g_in, where the feature vector g'_in is computed by averaging the word embeddings [W(w^in_i); W̃(w^in_i)] of the words between the noun pairs. This feature vector g'_in then serves as a bag-of-words feature.

Table 4 clearly shows that the averaged n-gram embeddings contribute the most to the semantic relation classification performance. The difference between the scores of g_in and g'_in shows the effectiveness of our averaged n-gram embeddings.

5.4.3 Effects of Dropout

At the supervised learning step we use dropout to regularize our model. Without dropout, our performance drops from 82.2% to 81.3% F1 on the training set using 10-fold cross validation.

5.4.4 Performance on a Word Similarity Task

As described in Section 3.1, we have the noun-specific embeddings N as well as the standard word embeddings W. We evaluated the learned embeddings using a word-level semantic evaluation task called WordSim-353 (Finkelstein et al., 2001). This dataset consists of 353 pairs of nouns, and each pair has an averaged human rating which corresponds to a semantic similarity score. Evaluation is performed by measuring Spearman's rank correlation between the human ratings and the cosine similarity scores of the embeddings. Table 5 shows the evaluation results. We used the best settings reported in Tables 2 and 3, since our method is designed for relation classification and it is not clear how to tune the hyperparameters for the word similarity task.


Cause-Effect(E1,E2) — unigrams: resulted, caused, generated, cause, causes; trigrams: "poverty caused the", "stability caused the", "coast resulted in", "fire caused due", "that resulted in"
Content-Container(E1,E2) — unigrams: inside, in, hidden, was, stored; trigrams: "was inside a", "was in a", "hidden in a", "was inside the", "was hidden in"
Message-Topic(E1,E2) — unigrams: discuss, explaining, discussing, relating, describing; trigrams: "magazines relating to", "to discuss aspects", "concerned about NULL", "interview relates to", "to discuss the"
Cause-Effect(E2,E1) — unigrams: after, from, caused, triggered, due; trigrams: "caused by radiation", "caused by infection", "stomach caused by", "caused by genetic", "anger caused by"
Content-Container(E2,E1) — unigrams: full, included, contains, contained, stored; trigrams: "NULL full of", "was full of", "a full NULL", "a full and", "a full forty"
Message-Topic(E2,E1) — unigrams: subject, related, discussed, documented, received; trigrams: "were related in", "was related in", "been discussed in", "is related through", "the subject of"

Table 6: Top five unigrams and trigrams with the highest scores for six classes.

As shown in the result table, the noun-specific embeddings perform better than the standard embeddings in our method, which indicates that the noun-specific embeddings capture more useful information for measuring the semantic similarity between nouns. The performance of the noun-specific embeddings is roughly the same as that of the word2vec embeddings.

5.5 Qualitative Analysis on the Embeddings

Using the n-gram embeddings h_i in Eq. (5), we inspect which n-grams are relevant to each relation class after the supervised learning step of RelEmb. When the context size c is 3, we can use at most 7-grams. The learned weight matrix S from Section 3.3 is used to detect the most relevant n-grams for each class. More specifically, for each n-gram embedding (n = 1, 3) in the training set, we compute the dot product between the n-gram embedding and the corresponding components in S. We then select the pairs of n-grams and class labels with the highest scores. In Table 6 we show the top five n-grams for six classes. These results clearly show that the n-gram embeddings capture salient syntactic patterns which are useful for the relation classification task.
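A sketch of this inspection step, assuming S is the learned softmax weight matrix over e = [g_n; g_in; g_out], gn_dim is the length of the g_n block (2d), and ngram_embs maps each candidate n-gram string to its embedding h from Eq. (5); all names here are illustrative:

import numpy as np

def top_ngrams_per_class(S, gn_dim, ngram_embs, class_names, top_k=5):
    """Score each candidate n-gram embedding against the g_in block of each class row of S."""
    h_dim = len(next(iter(ngram_embs.values())))
    results = {}
    for l, name in enumerate(class_names):
        # Columns of S aligned with g_in inside e = [g_n; g_in; g_out].
        w_in_block = S[l, gn_dim:gn_dim + h_dim]
        scored = sorted(ngram_embs.items(),
                        key=lambda kv: -float(np.dot(w_in_block, kv[1])))
        results[name] = [ng for ng, _ in scored[:top_k]]
    return results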

6 Conclusions and Future Work

We have presented a method for learning word embeddings specifically designed for relation classification. The word embeddings are trained using large unlabeled corpora to capture lexical features for relation classification. On a well-established semantic relation classification task our method significantly outperforms the baseline based on word2vec. Our method also compares favorably to previous state-of-the-art models that rely on syntactic parsers and external semantic resources, despite our method requiring only access to an unannotated corpus and a POS tagger. For future work, we will investigate how well our method performs on other domains and datasets, and how relation labels can help when learning embeddings in a semi-supervised learning setting.

Acknowledgments

We thank the anonymous reviewers for their helpful comments and suggestions.

References

Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2014. Tailoring Continuous Word Representations for Dependency Parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 809–815.
Marco Baroni and Roberto Zamparelli. 2010. Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1183–1193.
Emanuela Boros, Romaric Besancon, Olivier Ferret, and Brigitte Grau. 2014. Event Role Extraction using Domain-Relevant Word Representations. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1852–1857.
Razvan Bunescu and Raymond Mooney. 2005. A Shortest Path Dependency Kernel for Relation Extraction. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 724–731.
Wenliang Chen, Yue Zhang, and Min Zhang. 2014. Feature Embedding for Dependency Parsing. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 816–826.
Massimiliano Ciaramita and Yasemin Altun. 2006. Broad-Coverage Sense Disambiguation and Information Extraction with a Supersense Sequence Tagger. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 594–602.
Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 12:2493–2537.
Cicero Nogueira dos Santos, Bing Xiang, and Bowen Zhou. 2015. Classifying Relations by Ranking with Convolutional Neural Networks. In Proceedings of the Joint Conference of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing. To appear.
John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12:2121–2159.
Javid Ebrahimi and Dejing Dou. 2015. Chain Based RNN for Relation Classification. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1244–1249.
Lev Finkelstein, Gabrilovich Evgenly, Matias Yossi, Rivlin Ehud, Solan Zach, Wolfman Gadi, and Ruppin Eytan. 2001. Placing Search in Context: The Concept Revisited. In Proceedings of the Tenth International World Wide Web Conference.
Roxana Girju, Preslav Nakov, Vivi Nastase, Stan Szpakowicz, Peter Turney, and Deniz Yuret. 2007. SemEval-2007 Task 04: Classification of Semantic Relations between Nominals. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 13–18.
Edward Grefenstette and Mehrnoosh Sadrzadeh. 2011. Experimental Support for a Categorical Compositional Distributional Model of Meaning. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1394–1404.
Jiang Guo, Wanxiang Che, Haifeng Wang, and Ting Liu. 2014. Revisiting Embedding Features for Simple Semi-supervised Learning. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 110–120.
Kazuma Hashimoto, Makoto Miwa, Yoshimasa Tsuruoka, and Takashi Chikayama. 2013. Simple Customization of Recursive Neural Networks for Semantic Relation Classification. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1372–1376.
Kazuma Hashimoto, Pontus Stenetorp, Makoto Miwa, and Yoshimasa Tsuruoka. 2014. Jointly Learning Word Representations and Composition Functions Using Predicate-Argument Structures. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1544–1555.
Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid O Seaghdha, Sebastian Pado, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2010. SemEval-2010 Task 8: Multi-Way Classification of Semantic Relations between Pairs of Nominals. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 33–38.
Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580.
Ozan Irsoy and Claire Cardie. 2014. Deep Recursive Neural Networks for Compositionality in Language. In Advances in Neural Information Processing Systems 27, pages 2096–2104.
Dimitri Kartsaklis and Mehrnoosh Sadrzadeh. 2013. Prior Disambiguation of Word Tensors for Constructing Sentence Vectors. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1590–1601.
Quoc Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1188–1196.
Omer Levy and Yoav Goldberg. 2014. Neural Word Embedding as Implicit Matrix Factorization. In Advances in Neural Information Processing Systems 27, pages 2177–2185.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at the International Conference on Learning Representations.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119.
Yusuke Miyao and Jun'ichi Tsujii. 2008. Feature Forest Models for Probabilistic HPSG Parsing. Computational Linguistics, 34(1):35–80, March.
Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems 26, pages 2265–2273.
Thien Huu Nguyen and Ralph Grishman. 2014. Employing Word Representations and Regularization for Domain Adaptation of Relation Extraction. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 68–74.
Eric W. Noreen. 1989. Computer-Intensive Methods for Testing Hypotheses: An Introduction. Wiley-Interscience.
Romain Paulus, Richard Socher, and Christopher D Manning. 2014. Global Belief Recursive Neural Networks. In Advances in Neural Information Processing Systems 27, pages 2888–2896.
Bryan Rink and Sanda Harabagiu. 2010. UTD: Classifying Semantic Relations by Combining Lexical and Semantic Resources. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 256–259.
Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic Compositionality through Recursive Matrix-Vector Spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1201–1211.
Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. 2014. Learning Sentiment-Specific Word Embedding for Twitter Sentiment Classification. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1555–1565.
Joseph Turian, Lev-Arie Ratinov, and Yoshua Bengio. 2010. Word Representations: A Simple and General Method for Semi-Supervised Learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394.
Sida Wang and Christopher Manning. 2012. Baselines and Bigrams: Simple, Good Sentiment and Topic Classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 90–94.
Mo Yu, Matthew R. Gormley, and Mark Dredze. 2014. Factor-based Compositional Embedding Models. In Proceedings of the Workshop on Learning Semantics at the 2014 Conference on Neural Information Processing Systems.
Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation Classification via Convolutional Deep Neural Network. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 2335–2344.
Min Zhang, Jie Zhang, Jian Su, and GuoDong Zhou. 2006. A Composite Kernel to Extract Relations between Entities with Both Flat and Structured Features. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 825–832.

278

Proceedings of the 19th Conference on Computational Natural Language Learning, pages 279–288, Beijing, China, July 30-31, 2015. ©2015 Association for Computational Linguistics

Temporal Information Extraction from Korean Texts

Young-Seob Jeong, Zae Myung Kim, Hyun-Woo Do, Chae-Gyun Lim, and Ho-Jin Choi
School of Computing
Korea Advanced Institute of Science and Technology
291 Daehak-ro, Yuseong-gu, Daejeon 305-701, South Korea
{pinode,zaemyung,realstorm103,rayote,hojinc}@kaist.ac.kr

Abstract

As documents tend to contain temporal information, extracting such information has recently attracted much research interest. In this paper, we propose a hybrid method that combines machine-learning models and hand-crafted rules for the task of extracting temporal information from unstructured Korean texts. We address Korean-specific research issues and propose a new probabilistic model to generate complementary features. The performance of our approach is demonstrated by experiments on the TempEval-2 dataset and on the Korean TimeBank dataset, which we built for this study.

1 Introduction

Due to the increasing number of unstructured documents available on the Web and from other sources, developing techniques that automatically extract knowledge from documents has become paramount. Among the many aspects of knowledge extraction, the extraction of temporal information has recently been drawing much attention, since documents usually incorporate temporal information that is useful for downstream applications such as Information Retrieval (IR) and Question Answering (QA) systems. Given a simple question such as "who was the president of the U.S. 8 years ago?", for example, a QA system may have difficulty finding the right answer without correct temporal information about when the question was posed and what '8 years ago' refers to.

There have been many studies on temporal information extraction, but most of them are applicable only to their target languages. The main reason for this limitation is that some parts of temporal information are difficult to predict without the use of language-specific processing. For example, the normalized value '1983-03-08' can be represented by 'March 8, 1983' in English, while it can be represented by '1983년 3월 8일' in Korean. The order of date representation in Korean is usually different from that of English, and digit expressions in Korean are more complex than in English. This implies that it is necessary to investigate language-specific difficulties when developing a method to extract temporal information.
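To make this normalization step concrete, the following is a minimal Python sketch (not the authors' implementation): it maps a fully specified Korean date written with Arabic digits and the markers 년/월/일 to its ISO-8601 value. The pattern and the function name are illustrative assumptions; the rules described later in this paper additionally cover Sino-Korean and native-Korean numerals, lunar dates, and underspecified expressions that need the temporal context.

import re

# Illustrative pattern only: year, month, day written with Arabic digits and
# the Korean markers 년 (year), 월 (month), 일 (day).
DATE_PATTERN = re.compile(r"(\d{4})\s*년\s*(\d{1,2})\s*월\s*(\d{1,2})\s*일")

def normalize_korean_date(text):
    match = DATE_PATTERN.search(text)
    if match is None:
        return None
    year, month, day = (int(g) for g in match.groups())
    return f"{year:04d}-{month:02d}-{day:02d}"

print(normalize_korean_date("1983년 3월 8일"))  # -> 1983-03-08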

In this paper, we propose a method for temporal information extraction from Korean texts. The contributions of this paper are as follows: we (1) show how Korean-specific issues (e.g., morpheme-level tagging, various ways of digit expression, use of the lunar calendar, and so on) are addressed, (2) propose a hybrid method that combines a set of hand-crafted rules and machine-learning models, (3) propose a data-driven probabilistic model to generate complementary features, and (4) create a new dataset, the Korean TimeBank, that consists of more than 3,700 manually annotated sentences.

The rest of the paper is organized as follows. Section 2 describes the background of the research. Section 3 presents the details of the proposed method, the Korean TimeBank dataset, and how we apply the probabilistic model for generating features. Section 4 shows experimental results, and Section 5 concludes the paper.

2 Background

TempEval is a series of shared tasks for temporal information extraction (Verhagen et al., 2009; Verhagen et al., 2010; UzZaman et al., 2013). There have been many studies related to the shared tasks (Chambers et al., 2007; Yoshikawa et al., 2009; UzZaman and Allen, 2010; Ling and Weld, 2010; Mirroshandel and Ghassem-Sani, 2012; Bethard, 2013b), which are based on the Time Mark-up Language (TimeML) (Pustejovsky et al., 2003).


The shared tasks can be summarized into three extraction tasks: (1) extraction of timex3 tags, (2) extraction of event and makeinstance tags, and (3) extraction of tlink tags. The timex3 tag is associated with expressions of temporal information such as 'May 1973' and 'today'. The event tag and makeinstance tag represent eventual expressions which can be related to temporal information; the makeinstance tag is an instance of the event tag. For example, the sentence "I go to school on Mondays and Tuesdays" contains one event tag on 'go' and two makeinstance tags, as the action 'go' occurs twice, on Mondays and on Tuesdays. The tlink tag represents a linkage between two tags. A tlink can be a linkage between two timex3 tags (TT tlink), two makeinstance tags (MM tlink), a timex3 tag and a makeinstance tag (TM tlink), or the Document Creation Time and a makeinstance tag (DM tlink). Note that the tlink takes makeinstance tags as arguments, not event tags, as the event tags are merely templates for them. For the above sentence, there will be two TM tlinks: go-Tuesdays and go-Mondays. The TT tlink is assumed to be easy to extract, so TempEval does not incorporate the TT tlink into the task of extracting tlink tags.

Among the many related studies, several stand out. HeidelTime was proposed for the extraction of timex3 tags (Strotgen and Gertz, 2010); it strongly depends on hand-crafted rules and showed the best performance in TempEval-2. Llorens et al. (2010) proposed TIPSem for all three extraction tasks. It employs Conditional Random Fields (CRF) (Lafferty et al., 2001) for capturing patterns of texts, and defines a set of hand-crafted rules for determining several attributes of the tags. ClearTK is another system proposed for all three extraction tasks (Bethard, 2013a); it utilizes machine-learning models such as Support Vector Machines (SVM) (Boser et al., 1992; Cortes and Vapnik, 1995) and Logistic Regression (LR), and showed the best performance in TempEval-3.

Although the existing approaches show good results, most of them are applicable only to their target languages. The first reason is that there are several attributes which are difficult to predict without the use of language-specific processing. For instance, the attribute value of the timex3 tag has a normalized form of time (e.g., 1999-04-12) following ISO-8601. This is not accurately predictable by relying solely on data-driven approaches. The second reason is that they depend on language-specific resources (e.g., WordNet) or tools. Unless the same quality of resources or tools is achieved for other languages, the existing systems are not available to those languages. To alleviate this limitation, a language-independent parser for extracting timex3 tags was proposed (Angeli and Uszkoreit, 2013). Its portability was demonstrated by experiments with the TempEval-2 datasets of six languages: English, Spanish, Italian, Chinese, Korean, and French. However, the performance on the English and Spanish datasets is about twice as high as on the other languages, since the method highly depends on the feature definition and language-specific preprocessing (e.g., morphological analysis). This implies that it is necessary to address language-specific difficulties in order to achieve high performance.

The Korean language has many subtle rules and exceptions for word spacing. Korean is an agglutinative language, where verbs are formed by attaching various endings to the stem. There are usually multiple morphemes per token, and empty elements appear very often because subjects and objects are dropped to avoid duplication. Temporal expressions often take the lunar calendar representation as a tradition. Moreover, the same temporal information can have various forms due to a complex system of digit expression. For instance, the digit 30 can be represented as '30', '삼십[sam-sib]', or '서른[seo-run]'. Most of these issues stem from the Chinese language, as a large number of Chinese words and letters have become an integral part of Korean vocabulary through historical contact. All these issues hinder the performance of existing approaches when applied to Korean documents.

In this paper, we show how these issues are addressed in our Korean-specific hybrid method. To the best of our knowledge, this is the first Korean-specific study which addresses all three extraction tasks.

3 Proposed Method

3.1 Korean TimeBank

Although there is a Korean dataset provided by TempEval-2, we chose not to use it because it is small in size and has many annotation errors. In the TempEval-2 Korean dataset, there are missing values of timex3 tags, as well as multiple tags that must be merged into one.


Figure 1: Example of extent representation by token indices and letter indices.

Please refer to the examples in the 11th sentence of the 2nd training document of the TempEval-2 Korean dataset. Thus, we constructed a new dataset called the Korean TimeBank.

The new dataset is based on TimeML but with several differences. The tags of the new dataset are represented using a stand-off scheme, keeping the original sentences unharmed. As there are often multiple morphemes within each token in Korean, the tags are annotated at the letter level. The letter-level annotation allows multiple annotations to appear within a single token. This also makes the dataset independent of morphological analysis, so the dataset does not have to be updated when the morphological analyzer is updated. To enable the letter-level annotation, we introduce several attributes for the timex3 tag and event tag: text, begin, end, e begin, and e end. The attributes e begin and e end indicate token indices, while begin and end indicate letter indices of the extent. The attribute text contains the string of the extent. For example, the sentence "I work today" in Fig. 1 contains one timex3 tag whose text is 'today', with e begin=2, e end=2, begin=0, and end=4.
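The following is a small illustrative sketch (not the released annotation tooling) of how these stand-off extent attributes can be assembled; the whitespace tokenization and the dictionary layout are assumptions made only for the example.

# Minimal sketch: build the stand-off extent attributes used in the Korean
# TimeBank for a tag covering letters [begin, end] inside tokens [e_begin, e_end].
def make_extent(sentence, e_begin, e_end, begin, end):
    tokens = sentence.split()                      # whitespace tokenization (assumption)
    covered = " ".join(tokens[e_begin:e_end + 1])  # letters are indexed inside this span
    return {
        "text": covered[begin:end + 1],            # string of the extent
        "e_begin": e_begin, "e_end": e_end,        # token indices
        "begin": begin, "end": end,                # letter indices (inclusive)
    }

# The example from Fig. 1: "I work today" with one timex3 tag on 'today'.
print(make_extent("I work today", e_begin=2, e_end=2, begin=0, end=4))
# {'text': 'today', 'e_begin': 2, 'e_end': 2, 'begin': 0, 'end': 4}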

Since temporal expressions following the lunar calendar representation appear often in Korean, we add an attribute calendar. The value of calendar can be LUNAR or another type of calendar, and its default value is GREGORIAN when it is not explicitly specified. We also add two values for the attribute mod of the timex3 tag: START OR MID and MID OR END, as these expressions appear often in Korean. For example, '초중반[cho-joong-ban]' represents the beginning or middle phase of a period, and '중후반[joong-hoo-ban]' represents the middle or ending phase of a period.

The source of the Korean TimeBank includes Wikipedia documents and hundreds of manually generated question-answer pairs. The domains of the Wikipedia documents are personage, music, university, and history. The documents were annotated by two trained annotators and a supervisor, all majoring in computer science. The annotated tags of each document are saved in an XML file.

Figure 2: Overall process of temporal information extraction from Korean texts.

3.2 Temporal Information Extraction from Korean Texts

Our proposed method addresses all three tasks: (1) extraction of timex3 tags, (2) extraction of event and makeinstance tags, and (3) extraction of tlink tags. The proposed method also extracts additional attributes of the timex3 tag, such as freq, beginPoint, endPoint, mod, and calendar. The overall process is depicted in Fig. 2, where the solid line represents the training process and the dotted line represents the testing process. The Korean analyzer at the center of the figure takes Korean texts as input and generates several raw features as output, such as results of morphological analysis, Part-Of-Speech (POS) tags, Named-Entity (NE) tags, and results of dependency parsing (Lim et al., 2006). The number of possible POS tags is 45, which follows the definition of the Sejong Treebank1. The number of possible NE tags is 178, where each of them belongs to one of 15 super NE tags.

The generated raw features are used to define a set of features for the machine-learning models and a set of hand-crafted rules. The rules are designed by examining the training dataset and the errors that the proposed method generates on the validation dataset. We employ several machine-learning methods that have shown the best performance in the TempEval shared tasks, such as the Maximum Entropy Model (MEM), Support Vector Machine (SVM), Conditional Random Fields (CRF), and Logistic Regression (LR).

Fig. 2 introduces the Temporal Information Extractor (TIE), which consists of four sub-extractors:

1Korean Language Institute, http://www.sejong.or.kr


timex3 extractor, event extractor, makeinstance extractor, and tlink extractor. The timex3 extractor and the event extractor work independently, and the makeinstance extractor uses the predicted event tags. The tlink extractor makes use of the predicted makeinstance tags and predicted timex3 tags. Thus, the performance of the timex3 extractor and event extractor strongly influences the performance of the makeinstance extractor and tlink extractor. These four sub-extractors as a whole give predicted tags as output, where the tags are represented at the morpheme level. The extent converter at the center of the figure changes the morpheme-level tags into letter-level tags and vice versa by checking the character values of each letter and each morpheme. In the training process, the annotated tags of the Korean TimeBank are converted into morpheme-level tags through the extent converter and used to train the TIE.
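As a rough illustration of what such an extent conversion involves, the sketch below maps a letter-level span to morpheme indices and back, given morpheme character offsets; the offsets, function names, and overlap convention are assumptions for the example rather than details of the actual converter.

# Minimal sketch of extent conversion between letter-level and morpheme-level
# spans. The morpheme offsets would come from the Korean analyzer; here they
# are supplied by hand for illustration.
def letters_to_morphemes(letter_begin, letter_end, morpheme_spans):
    """Return indices of morphemes whose character span overlaps [letter_begin, letter_end]."""
    selected = []
    for idx, (m_begin, m_end) in enumerate(morpheme_spans):
        if m_begin <= letter_end and letter_begin <= m_end:  # overlap test on inclusive spans
            selected.append(idx)
    return selected

def morphemes_to_letters(morpheme_indices, morpheme_spans):
    """Return the smallest letter span covering the given morphemes."""
    begins = [morpheme_spans[i][0] for i in morpheme_indices]
    ends = [morpheme_spans[i][1] for i in morpheme_indices]
    return min(begins), max(ends)

spans = [(0, 1), (2, 3), (4, 5)]            # hypothetical offsets of three morphemes
print(letters_to_morphemes(2, 5, spans))    # [1, 2]
print(morphemes_to_letters([1, 2], spans))  # (2, 5)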

3.2.1 Timex3 extractor

The goal of the timex3 extractor is to predict whether each morpheme belongs to the extent of a timex3 tag or not, and to find the appropriate attributes of the tag. There are five types of timex3 tag: DATE, TIME, DURATION, SET, and NONE. NONE represents that the corresponding morpheme does not belong to the extent of a timex3 tag, and the other four types follow the definitions of TimeML. This is essentially a morpheme-level classification over 5 classes.

We basically take two approaches: a set of 100 hand-crafted rules and machine-learning models. Examples of the rules for extent and type are listed in Table 1. In the second rule of the table, the first condition is satisfied when the sequence of two morphemes is a digit followed by the morpheme '월[wol]'(month), '일[il]'(day), or '주[joo]'(week). The various ways of expressing digits are also considered. The second condition is satisfied when the morpheme next to the extent is '에[eh]'(at) or '마다[ma-da]'(every) followed by '번[beon]'(times) or '회[hoi]'(times), and there must be no other tags between the two morphemes. If these two conditions are satisfied, then the sequence of morphemes becomes the extent of a timex3 tag whose type is SET. If one of the rules is satisfied, the remaining rules are skipped for the target morpheme.

We compare two machine-learning models, CRF and MEM, for the timex3 tag by experiments. We defined a set of features based on the raw features

Table 1: Examples of the rules for extent and type of timex3 tags.

  Type   Conditions
  DATE   Extent = (digit, 년[nyeon])
  SET    Extent = (digit, 월[wol]|일[il]|주[joo]),
         Next morps = (에[eh]|마다[ma-da], no other tags, 번[beon]|회[hoi])

Table 2: Examples of the rules for value of the timex3 tag.

  Operations            Conditions
  Month = digit,        Type = DATE_TIME,
  Context updated       Surrounding morps = (digit, 월[wol])
  Year = context        Type = DATE_TIME,
                        Surrounding morps = (,t_tàt)

obtained from the Korean analyzer. The defined features include morphemes, POS tags, NE tags, and morpheme-level features of dependency parsing, given a particular window size. The morpheme-level features of dependency parsing are generated following the approach of Sang and Buchholz (2000).

To predict the other attributes of each predicted timex3 tag, we also define sets of rules: 112 rules for value, 7 rules for beginPoint/endPoint, 9 rules for freq, 10 rules for mod, and 1 rule for calendar. In particular, the rules for value and freq take a temporal context into account. For instance, for the sentence "We go there tomorrow", it is hard to predict the value of 'tomorrow' without considering the temporal context. We assume that the temporal context of each sentence depends on the previous sentence. For each document, the temporal context is initialized with the Document Creation Time (DCT), and the context is updated when a normalized value appears under a certain condition.

Examples of the rules are described in Table 2. If the first rule in the table is satisfied, then the month of value is changed to the corresponding digit and the temporal context is updated.

As there can be multiple clues for determining value within an extent, all of the rules are checked for each timex3 tag. To avoid value being overwritten by multiple satisfied rules, the rules are listed in ascending order of temporal unit. That is, the rules for seconds or minutes are listed before the rules for hours or days. This allows value to be changed from a smaller temporal unit to a bigger one, thereby avoiding overwriting with a wrong value. The rules for different attributes are listed in separate files and are written in a systematic way similar to regular expressions. Such a format enables the rules to be easily manipulated.
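A minimal sketch of this rule-ordering idea is shown below; the rule patterns, the simplified context handling, and the function name are illustrative assumptions, not the actual rule files.

import re

# Illustrative value rules, listed from smaller to bigger temporal units so that
# larger-unit fields refine the partially built value last.
VALUE_RULES = [
    ("minute", re.compile(r"(\d{1,2})분"), "minute"),
    ("hour",   re.compile(r"(\d{1,2})시"), "hour"),
    ("day",    re.compile(r"(\d{1,2})일"), "day"),
    ("month",  re.compile(r"(\d{1,2})월"), "month"),
    ("year",   re.compile(r"(\d{4})년"),   "year"),
]

def apply_value_rules(extent_text, context):
    value = dict(context)                 # start from the temporal context (e.g., the DCT)
    for _unit, pattern, field in VALUE_RULES:
        match = pattern.search(extent_text)
        if match:                         # every matching rule is checked, in order
            value[field] = int(match.group(1))
    return value

dct = {"year": 2015, "month": 7, "day": 30}
print(apply_value_rules("8월 15일", dct))  # month and day overwritten, year kept from context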

3.2.2 Event extractor

The goal of the event extractor is to predict whether each morpheme belongs to the extent of an event tag or not, and to find the appropriate class of the tag. There are 7 classes of event tag: OCCURRENCE, PERCEPTION, REPORTING, STATE, I_STATE, I_ACTION, and NONE. NONE represents that the corresponding morpheme does not belong to the extent, and the other classes follow the definitions of TimeML. Similar to the timex3 extractor, we take two approaches: a set of 26 rules and machine-learning models (e.g., CRF and MEM), based on the set of features used in the timex3 extractor.

There are several verbs that often appear within the extents of event tags, although they do not carry any meaning. For example, in the sentence "나는 공부를 한다" (I study), the verb '하[ha]'(do) has no meaning while the noun '공부[gong-bu]'(study) has eventual meaning. We define a set of such verbal morphemes (e.g., '위하[wi-ha]'(for) and '통하[tong-ha]'(through)) to avoid generating meaningless event tags.

3.2.3 Makeinstance extractor

The goal of the makeinstance extractor is to generate at least one makeinstance tag for each event tag, and to find the appropriate attributes. As we observed that there is only one makeinstance tag for each event tag in most cases, the makeinstance extractor simply generates one makeinstance tag for each event tag. For the attribute POS, we simply take the POS tags obtained from the Korean analyzer. We define a set of 5 rules for the attribute tense and 2 rules for the attribute polarity.

3.2.4 Tlink extractor

The goal of the tlink extractor is to make a linkage between two tags and to find the appropriate type of the link. For each pair of tags, it determines whether there must be a linkage between them, and finds the most appropriate relType. There are 11 relTypes: BEFORE, AFTER, INCLUDES, DURING, DURING_INV, SIMULTANEOUS, IDENTITY, BEGINS, ENDS, OVERLAP, and NONE. NONE represents that there is no linkage between the two argument tags, and OVERLAP means that the temporal intervals of the two tags are overlapping. The other relTypes follow the definitions of TimeML. Thus, it is essentially a classification over 11 classes for each pair of argument tags.

Table 3: Features in the two kinds of classifiers.

  Features for TM tlink
    Surrounding morphemes of the argument tags
    Linear order of the argument tags
    Attribute type of timex3 tag
    Attribute class of event tag
    Attribute polarity of makeinstance tag
    Attribute tense of makeinstance tag
    Whether tlink tag is non-consuming tag or not
    Does timex3 tag exist between argument tags?
    Does event tag exist between argument tags?
    Is event tag an objective of other event tag in dependency tree?

  Features for MM tlink
    Surrounding morphemes of event tags
    Linear order of event tags
    Attribute class of event tags
    Attribute polarity of makeinstance tags
    Attribute tense of makeinstance tags
    Does timex3 tag exist between event tags?
    Does event tag exist between event tags?

The tlink extractor generates intra-sentence tlinks and inter-sentence tlinks. For the intra-sentence tlinks, we take two approaches: a set of 19 rules and machine-learning models (e.g., SVM and LR). Among the four kinds of tlinks (i.e., TT tlink, TM tlink, MM tlink, and DM tlink), our tlink extractor generates the first three kinds. The reason for excluding the DM tlink is that we maintain the temporal context initialized with the DCT in the timex3 extractor, so it is not necessary to generate DM tlinks. The TT tlinks are extracted by comparing the normalized values of two timex3 tags. Two models are independently trained for predicting TM tlinks and MM tlinks, respectively. We tried many possible combinations of features to reach better performance, and obtained the sets of features described in Table 3.

Given a pair of two makeinstance tags, it is straightforward to derive the relType when the two event tags are linked with timex3 tags. Thus, we first predict TT tlinks, and thereafter predict TM tlinks and MM tlinks. For the inter-sentence tlinks, we generate MM tlinks between adjacent sentences when there is a particular expression at the beginning of a sentence, such as '그 후[geu-hoo]'(afterward), '그 전[geu-jeon]'(beforehand), or '그다음[geu-da-eum]'(thereafter).
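As an illustration of how TT tlinks can be derived by comparing normalized values, here is a minimal sketch under the simplifying assumption that both values are plain ISO-8601 dates; the relType subset and the function name are illustrative, and the real extractor also has to handle durations, times, and underspecified values.

from datetime import date

def tt_reltype(value1, value2):
    # Compare two normalized timex3 values and return an illustrative relType.
    d1 = date.fromisoformat(value1)
    d2 = date.fromisoformat(value2)
    if d1 < d2:
        return "BEFORE"
    if d1 > d2:
        return "AFTER"
    return "IDENTITY"

print(tt_reltype("1983-03-08", "2015-07-30"))  # BEFORE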

3.3 Online LIFE

As the performance of the Korean analyzer is not stable, we need complementary features to build better classifiers. Jeong and Choi (2015) proposed the Language Independent Feature Extractor (LIFE), which generates a pair of a class label and a topic label for each Letter-Sequence (LS), where an LS is a frequently appearing letter sequence. The class labels can be used as syntactic features, while the topic labels can be used as semantic features. The concept of the LS makes it language independent, so it is basically applicable to any language. This is especially helpful for languages that have no stable feature extractors. Korean is one of such languages, so we employ LIFE to generate complementary features.

The temporal information extractor must work online because it usually takes a stream of documents as input. However, as LIFE is originally designed to work offline, we propose an extended version of LIFE, namely Online LIFE (O-LIFE), whose parameters are estimated incrementally. When designing O-LIFE, the LS concept of LIFE becomes a problem because the LS dictionary changes. For example, if the LS dictionary has only one LS goes and a new token go comes in, then the LS dictionary may contain go and es. Note that the LS goes does not exist in the new dictionary. This issue is addressed by our proposed algorithm, which basically distributes the values of previously estimated parameters to the new prior parameters of overlapping LSs. For the above example, β_{k,'goes'} will be distributed to β_{k,'go'} and β_{k,'es'}, where k is a topic index.
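The sketch below illustrates this redistribution idea under simplified assumptions (a flat dictionary of LSs, a scalar weight, and a naive overlap measure); it is not the O-LIFE implementation itself.

# Minimal sketch: the estimate attached to an old LS (e.g., 'goes') contributes
# to the priors of new, overlapping LSs ('go', 'es'), scaled by the relative
# size of the overlap, roughly mirroring r = |w_overlap| / |w_i| in Algorithm 1.
def redistribute_prior(old_params, new_dictionary, default_prior=0.1, weight=1.0):
    new_prior = {ls: default_prior for ls in new_dictionary}   # default initialization
    for old_ls, old_value in old_params.items():
        for new_ls in new_dictionary:
            overlap = _overlap_length(old_ls, new_ls)
            if overlap == 0:
                continue
            ratio = overlap / len(old_ls)
            new_prior[new_ls] += old_value * ratio * weight
    return new_prior

def _overlap_length(a, b):
    """Length of the longest prefix or suffix of b that occurs in a (crude overlap measure)."""
    best = 0
    for length in range(1, min(len(a), len(b)) + 1):
        if b[:length] in a or b[-length:] in a:
            best = length
    return best

old = {"goes": 2.0}                        # previously estimated parameter for the LS 'goes'
print(redistribute_prior(old, ["go", "es"]))   # {'go': 1.1, 'es': 1.1}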

The formal algorithm of O-LIFE is shown in Algorithm 1, where C is the number of classes and T is the number of topics. S_stream is the number of streams, and S_s is the s-th stream of D_s documents. The four parameters a, bt, g, and bc are default values of the priors α, β, γ, and δ. The three threshold parameters t1, t2, and t3 are used to generate the LS dictionary.

Algorithm 1 Online LIFE
 1: INPUT: a; bt; g; bc; wp; Ss; t1; t2; t3
 2: for s = 1 to S_stream do
 3:   Dic_s = DictionaryGenerator(t1, t2, t3)
 4:   if s = 1 then
 5:     β^s_t = bt,  1 ≤ t ≤ T
 6:     δ^s_c = bc,  1 ≤ c ≤ C
 7:   else
 8:     β^s_{t,w} = bt,  w ∈ Dic_s
 9:     δ^s_{c,w} = bc,  w ∈ Dic_s
10:     for each item w_i ∈ Dic_{s-1} do
11:       if w_i ∈ Dic_s then
12:         β^s_{t,w_i} += B^{s-1}_{t,w_i} · wp,  1 ≤ t ≤ T
13:         δ^s_{c,w_i} += D^{s-1}_{c,w_i} · wp,  1 ≤ c ≤ C
14:       end if
15:       for each w'_i ∈ Dic_s do
16:         β^s_{t,w'_i} += B^{s-1}_{t,w_i} · wp,  1 ≤ t ≤ T
17:         δ^s_{c,w'_i} += D^{s-1}_{c,w_i} · wp,  1 ≤ c ≤ C
18:       end for
19:       for each w''_i ∈ Dic_s do
20:         r = |w_overlap| / |w_i|
21:         β^s_{t,w''_i} += B^{s-1}_{t,w_i} · r · wp,  1 ≤ t ≤ T
22:         δ^s_{c,w''_i} += D^{s-1}_{c,w_i} · r · wp,  1 ≤ c ≤ C
23:       end for
24:     end for
25:   end if
26:   α^s_d = a,  1 ≤ d ≤ D_s
27:   γ^s_c = g,  1 ≤ c ≤ C
28:   initialize φ^s, η^s, π^s, and θ^s to zeros
29:   initialize class/topic assignments
30:   [φ^s, η^s, π^s, θ^s] =
31:     ParameterEstimation(S_s, α^s, β^s, γ^s, δ^s)
32:   B^s_t = B^{s-1}_t ⊕ φ^s_t,  1 ≤ t ≤ T
33:   D^s_c = D^{s-1}_c ⊕ η^s_c,  1 ≤ c ≤ C
34: end for

The LS dictionary of O-LIFE is updated as it reads data. That is, |Dic_prev| ≤ |Dic_cur| and Dic_prev ⊄ Dic_cur, where Dic_prev is the previous dictionary and Dic_cur is the current dictionary. To handle this change in the dictionary, the two parameters (i.e., β and δ) are updated in four steps, based on the assumption that every unique LS w_i of the previous dictionary should contribute to the new dictionary as much as possible. Firstly, as described in line 8 of the algorithm, the two parameters are initialized with default values. Secondly, if a particular LS w_i of the previous dictionary exists in the new dictionary, then the two parameters increase by B^{s-1}_{t,w_i}·wp and D^{s-1}_{c,w_i}·wp, respectively.

Table 4: The statistics of the Korean TimeBank, where the digits represent the number of corresponding items.

                Training  Validation  Test
  documents         536        131     173
  sentences        2357        466     879
  timex3           1245        253     494
  event            6594       1145    2609
  makeinstance     6615       1155    2613
  tlink            1295        374     674

B^{s-1}_t denotes an evolutionary matrix whose columns are the LS-topic distributions φ^{s-1}_t, and D^{s-1}_c denotes an evolutionary matrix whose columns are the LS-class distributions η^{s-1}_c. By multiplying them with the weight vector wp, the contribution to initializing the priors is determined for each time slice. We call this value the weighted contribution. Thirdly, for every w'_i which contains w_i, the two parameters increase by the weighted contribution. Lastly, for every w''_i which overlaps with w_i, the two parameters increase by r times the weighted contribution, where w_overlap is the overlapping part.

The class labels and topic labels are converted to morpheme-level features by concatenating the labels of the LSs overlapping with a given morpheme. We call these features LIFE features, and they are used to train the machine-learning models, together with the features based on the raw features.
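A minimal sketch of this feature construction is given below; the label format, the spans, and the overlap test are illustrative assumptions.

# Minimal sketch: concatenate the class/topic labels of all LSs that overlap a
# morpheme's character span into a single morpheme-level feature string.
def life_features(morpheme_span, ls_annotations):
    """ls_annotations: list of ((begin, end), class_label, topic_label) per LS."""
    m_begin, m_end = morpheme_span
    labels = []
    for (begin, end), cls, topic in ls_annotations:
        if begin <= m_end and m_begin <= end:   # LS overlaps the morpheme
            labels.append(f"C{cls}_T{topic}")
    return "+".join(labels)

annotations = [((0, 2), 3, 7), ((2, 5), 1, 4), ((6, 8), 0, 0)]   # hypothetical LS labels
print(life_features((1, 4), annotations))   # "C3_T7+C1_T4"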

4 Experiments

The Korean TimeBank is divided into a training dataset, a validation dataset, and a test dataset. The statistics of the dataset are described in Table 4. As shown in the table, only one makeinstance tag exists for each event tag in most cases, which follows the assumption of the makeinstance extractor.

4.1 Timex3 prediction

For the extent and type prediction, only exactly predicted extents and types are regarded as correct, and the results are summarized in Table 5, where MEM is trained only with the features generated from the raw features, and MEM_L is trained with both those features and the LIFE features. We employ the CRF++ library2 and the MEM toolkit3. The optimal parameter settings are found by a grid search on the validation set. The optimal setting for CRF is as follows: L1-regularization, c=0.6, and f=1. MEM shows its best performance without Gaussian prior smoothing. Both models give generally better performance when the window size is 2. The parameter setting for O-LIFE is as follows: a=0.1, bt=0.1, g=0.1, bc=0.1, wp=(1), C=10, T=10, t1=0.4, t2=0.4, t3=0.7, and the number of iterations for estimation is 1000.

2http://crfpp.googlecode.com/svn/trunk/doc/index.html
3http://homepages.inf.ed.ac.uk/lzhang10/maxent toolkit

As shown in the table, CRF gives generally better performance than MEM and the rules. We tried a combination of the rules and machine-learning models, and the combination led to an increase in performance. For example, the combination of CRF and rules gives better performance than using only CRF. This can be explained by the existence of some patterns that the machine-learning models could not capture, which the combination with the rules can deal with. Note that using the LIFE features dramatically increases the performance. We believe that this is because the raw features for Korean (e.g., from the POS tagger) are unstable; the LIFE features complement these unstable features by capturing syntactic/semantic patterns that are inherent in the given documents. Furthermore, we observed that combining the rules with machine-learning models trained with LIFE features does not further contribute to the performance. This implies that using LIFE features allows the machine-learning models to capture the patterns that could not be captured without them. CRF_L is found to be the best, and the remaining attributes are predicted using the rules. The performance is measured in a sequential manner, so the performance generally decreases from the top to the bottom of the table.

Another experiment on timex3 prediction using the TempEval-2 Korean dataset is conducted to compare with the existing method of Angeli and Uszkoreit (2013). The existing method makes use of a latent parse conveying a language-flexible representation of time, and extracts features over both the parse and the associated temporal semantics. The results of the comparison are shown in Table 6, where our method uses CRF_L for type and rules for value. For a fair comparison, both methods are trained and tested using only the TempEval-2 Korean dataset. As mentioned before, the TempEval-2 Korean dataset contained errors, so we corrected them and used the corrected dataset for this experiment.


Table 5: Timex3 prediction results, where P represents precision, R recall, and F the F1 score; points represents beginPoint/endPoint, and cal represents calendar.

  Attributes  Comb        P      R      F
  extent      Rules       74.79  70.95  72.82
              MEM         31.79  25.1   28.05
              CRF         80.34  65.42  72.11
              MEM,Rules   70.61  70.75  70.68
              CRF,Rules   73.05  72.33  72.69
              MEM_L       33.51  41.34  37.02
              CRF_L       81.15  72.33  76.49
  type        Rules       72.71  68.97  70.79
              MEM         30.26  23.89  26.7
              CRF         78.88  64.23  70.81
              MEM,Rules   68.64  68.77  68.71
              CRF,Rules   71.06  70.36  70.7
              MEM_L       31.88  39.16  35.15
              CRF_L       75.83  67.59  71.47
  value       Rules       71.18  63.44  67.08
  points      Rules       71.18  63.44  67.08
  freq        Rules       71.18  63.44  67.08
  mod         Rules       70.07  62.45  66.04
  cal         Rules       70.07  62.45  66.04

Table 6: Results of timex3 prediction with the TempEval-2 Korean dataset, where the digits represent accuracy.

  Attributes  Angeli's method  Our method
  type        82.0             100.0
  value       42.0             100.0

As shown in the table, our method gives much better accuracy than the existing method.

4.2 Event prediction

The results of event prediction are summarized in Table 7 and are similar to the results of timex3 prediction. By a grid search, the optimal parameter settings are found to be the same as those of the timex3 extractor. Employing the LIFE features again increases the performance, and CRF_L is found to be the best.

Table 7: Event prediction results.

  Attributes  Comb        P      R      F
  extent      Rules       75.62  44.0   55.63
              MEM         33.22  15.07  20.73
              CRF         45.33  35.72  39.96
              MEM,Rules   43.57  47.15  45.29
              CRF,Rules   72.67  64.32  68.24
              MEM_L       37.49  56.34  45.02
              CRF_L       86.49  78.5   82.3
  class       Rules       63.97  37.22  47.06
              MEM         31.11  14.12  19.41
              CRF         34.63  27.29  30.53
              MEM,Rules   36.91  39.94  38.37
              CRF,Rules   61.15  54.12  57.42
              MEM_L       34.74  51.46  41.48
              CRF_L       78.63  71.37  74.82

Table 8: Makeinstance prediction results.

  Attributes  P      R      F
  eventID     86.28  78.19  82.03
  polarity    93.59  73.17  82.13
  tense       71.03  51.97  60.02

4.3 Makeinstance prediction

All the attributes of the makeinstance tag are predicted through hand-crafted rules. The performance is summarized in Table 8. The measurement of the attribute POS is excluded because we simply take the results of the Korean analyzer.

4.4 Tlink prediction

By performing a grid search on the validation set, we found that a Support Vector Machine (SVM) of C-SVC type with a Radial Basis Function (RBF) kernel gives the best performance when the γ of the kernel function is 1/(number of features). It is also found that L1-regularized Logistic Regression (LR) with C=1 gives the best performance. We observed that both models give better performance when the window size is 1.

There are two cases of tlink prediction: (1) tlink prediction given correct other tags, and (2) tlink prediction given predicted other tags. The results of the first case are summarized in Table 9, where the performance is measured in a sequential manner.


Table 9: Tlink prediction results given correct timex3, event, and makeinstance tags.

  Attributes  Comb       P      R      F
  linkage     SVM        63.84  16.72  26.49
              LR         20.02  63.91  30.49
              Rules      39.11  59.76  47.28
              SVM,Rules  39.04  61.69  47.82
              LR,Rules   21.13  77.22  33.19
  relType     SVM        63.84  16.72  26.49
              LR         19.18  61.24  29.22
              Rules      37.27  56.95  45.06
              SVM,Rules  37.27  58.88  45.64
              LR,Rules   20.2   73.82  31.72

Table 10: Tlink prediction results of the combination of SVM and rules, given timex3, event, and makeinstance tags which are predicted using LIFE features.

  Attributes  P      R      F
  linkage     34.39  38.46  36.31
  relType     32.94  36.83  34.77

As shown in Table 9, we tried a combination of the rules and machine-learning models, and the combination of SVM and rules performed the best. Note that we do not use the LIFE features for tlink prediction because we observed that they do not contribute to the performance of tlink prediction. We believe that this is because the LIFE features represent only the syntactic/semantic patterns of the given terms, but not arbitrary relations between the terms.

The results of the second case are obtained using the best combination, and are shown in Table 10. Note that the tlink tags are predicted without the LIFE features, while the other tags (i.e., timex3, event, makeinstance) are obtained using the LIFE features. To measure the impact of the LIFE features, we also conduct the experiment for the second case given tags predicted without using the LIFE features; the results are shown in Table 11. As shown in Table 10 and Table 11, using the LIFE features increases the F1 score by about 6 points.

Table 11: Tlink prediction results of the combination of SVM and rules, given the tags predicted without using the LIFE features.

  Attributes  P      R      F
  linkage     25.41  36.69  30.02
  relType     24.18  34.91  28.57

5 Conclusion

We introduced a new method for extracting temporal information from unstructured Korean texts. The Korean language has a complex grammar, so there were many research issues to address before achieving our goal. We presented these issues and our solutions to them. Experimental results illustrated the effectiveness of our method, especially when we adopted the extended probabilistic model, Online LIFE (O-LIFE), to generate complementary features for training the machine-learning models. In addition, as there was not sufficient data for this study, we manually constructed the Korean TimeBank consisting of more than 3,700 annotated sentences. We will extend our study to interact with a knowledge base for achieving better prediction of temporal information.

Acknowledgments

This work was supported by the ICT R&D program of MSIP/IITP. [R0101-15-0062, Development of Knowledge Evolutionary WiseQA Platform Technology for Human Knowledge Augmented Services] Many thanks to my family (HoYoun, Youn-Seo), and my parents.


References

Gabor Angeli and Jakob Uszkoreit. 2013. Language-independent discriminative parsing of temporal expressions. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria.

Steven Bethard. 2013a. ClearTK-TimeML: A minimalist approach to TempEval 2013. In Proceedings of the Seventh International Workshop on Semantic Evaluation, pages 10–14, Atlanta, Georgia, USA.

Steven Bethard. 2013b. A synchronous context free grammar for time normalization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 821–826, Seattle, USA.

Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. 1992. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152, Pittsburgh, PA, USA.

Nathanael Chambers, Shan Wang, and Dan Jurafsky. 2007. Classifying temporal relations between events. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 173–176, Prague, Czech Republic.

Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning, 20(3):273–297.

Young-Seob Jeong and Ho-Jin Choi. 2015. Language independent feature extractor. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 4170–4171, Texas, USA.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289, Williamstown, USA.

Soojong Lim, ChangKi Lee, Jeong Hur, and Myoung-Gil Jang. 2006. Syntax analysis of enumeration type and parallel type using maximum entropy model. In Proceedings of the Korea Human-Computer Interaction Conference, pages 1240–1245.

Xiao Ling and Daniel S. Weld. 2010. Temporal information extraction. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, Atlanta, Georgia, USA.

Hector Llorens, Estela Saquete, and Borja Navarro. 2010. TIPSem (English and Spanish): Evaluating CRFs and semantic roles in TempEval-2. In Proceedings of the Fifth International Workshop on Semantic Evaluation, pages 284–291, Uppsala, Sweden.

Seyed Abolghasem Mirroshandel and Gholamreza Ghassem-Sani. 2012. Towards unsupervised learning of temporal relations between events. Journal of Artificial Intelligence Research, 45(1):125–163.

James Pustejovsky, Jose Castano, Robert Ingria, Roser Sauri, Robert Gaizauskas, Andrea Setzer, and Graham Katz. 2003. TimeML: Robust specification of event and temporal expressions in text. In New Directions in Question Answering, pages 28–34, Stanford, USA.

Erik F. Tjong Kim Sang and Sabine Buchholz. 2000. Introduction to the CoNLL-2000 shared task: chunking. In Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning, pages 127–132, Lisbon, Portugal.

Jannik Strotgen and Michael Gertz. 2010. HeidelTime: High quality rule-based extraction and normalization of temporal expressions. In Proceedings of the Fifth International Workshop on Semantic Evaluation, pages 321–324, Uppsala, Sweden.

Naushad UzZaman and James Allen. 2010. Event and temporal expression extraction from raw text: First step towards a temporally aware system. International Journal of Semantic Computing, 4(4):487–508.

Naushad UzZaman, Hector Llorens, Leon Derczynski, Marc Verhagen, James Allen, and James Pustejovsky. 2013. SemEval-2013 Task 1: TempEval-3: Evaluating time expressions, events, and temporal relations. In Proceedings of the Seventh International Workshop on Semantic Evaluation, pages 1–9, Atlanta, Georgia, USA.

Marc Verhagen, Robert J. Gaizauskas, Frank Schilder, Mark Hepple, Jessica Moszkowicz, and James Pustejovsky. 2009. The TempEval challenge: Identifying temporal relations in text. Language Resources and Evaluation, 43(2):161–179.

Marc Verhagen, Roser Sauri, Tommaso Caselli, and James Pustejovsky. 2010. SemEval-2010 Task 13: TempEval-2. In Proceedings of the Fifth International Workshop on Semantic Evaluation, pages 57–62, Uppsala, Sweden.

Katsumasa Yoshikawa, Sebastian Riedel, Masayuki Asahara, and Yuji Matsumoto. 2009. Jointly identifying temporal relations with Markov logic. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 405–413, Suntec, Singapore.


Proceedings of the 19th Conference on Computational Natural Language Learning, pages 289–299, Beijing, China, July 30-31, 2015. ©2015 Association for Computational Linguistics

Transition-based Spinal Parsing

Miguel Ballesteros (1,2)   Xavier Carreras (3)
(1) NLP Group, Pompeu Fabra University   (2) Carnegie Mellon University   (3) Xerox Research Centre Europe
[email protected] [email protected]

Abstract

We present a transition-based arc-eager model to parse spinal trees, a dependency-based representation that includes phrase-structure information in the form of constituent spines assigned to tokens. As a main advantage, the arc-eager model can use a rich set of features combining dependency and constituent information, while parsing in linear time. We describe a set of conditions for the arc-eager system to produce valid spinal structures. In experiments using beam search we show that the model obtains a good trade-off between speed and accuracy, and yields state-of-the-art performance for both dependency and constituent parsing measures.

1 Introduction

There are two main representations of the syntactic structure of sentences, namely constituent and dependency-based structures. In terms of statistical modeling, an advantage of dependency representations is that they are naturally lexicalized, and this allows the statistical model to capture a rich set of lexico-syntactic features. The recent literature has shown that such lexical features greatly favor the accuracy of statistical models for parsing (Collins, 1999; Nivre, 2003; McDonald et al., 2005). Constituent structure, on the other hand, might still provide valuable syntactic information that is not captured by standard dependencies.

In this work we investigate transition-based statistical models that produce spinal trees, a representation that combines dependency and constituent structures. Statistical models that use both representations jointly were pioneered by Collins (1999), who used constituent trees annotated with head-child information in order to define lexicalized PCFG models, i.e. extensions of the classic constituent-based PCFG that make central use of lexical dependencies.

An alternative approach is to view the combined representation as a dependency structure augmented with constituent information. This approach was first explored by Collins (1996), who defined a dependency-based probabilistic model that associates a triple of constituents with each dependency. In our case, we follow the representations proposed by Carreras et al. (2008), which we call spinal trees. In a spinal tree (see Figure 1 for an example), each token is associated with a spine of constituents, and head-modifier dependencies are attached to nodes in the spine, thus combining the two sources of information in a tight manner. Since spinal trees are inherently dependency-based, it is possible to extend dependency models to such representations, as shown by Carreras et al. (2008) using a so-called graph-based model. The main advantage of such models is that they allow a large family of rich features that include dependency features, constituent features, and conjunctions of the two. However, the consequence is that the additional spinal structure greatly increases the number of dependency relations. Even though a graph-based model remains parseable in cubic time, it is impractical unless some pruning strategy is used (Carreras et al., 2008).

In this paper we propose a transition-based parser for spinal parsing, based on the arc-eager strategy of Nivre (2003). Since transition-based parsers run in linear time, our aim is to speed up spinal parsing while taking advantage of the rich representation it provides. Thus, the research question underlying this paper is whether we can accurately learn to take greedy parsing decisions for rich but complex structures such as spinal trees. To control the trade-off, we use beam search for transition-based parsing, which has been shown to be successful (Zhang and Clark, 2011b). The main contributions of this paper are


the following:

• We define an arc-eager statistical model for spinal parsing that is based on the triplet relations of Collins (1996). Such relations, in conjunction with the partial spinal structure available in the stack of the parser, provide a very rich set of features.

• We describe a set of conditions that an arc-eager strategy must guarantee in order to produce valid spinal structures.

• In experiments using beam search we show that our method obtains a good trade-off between speed and accuracy for both dependency-based attachment scores and constituent measures.

2 Background

2.1 Spinal Trees

A spinal tree is a generalization of a dependency tree that adds constituent structure to the dependencies in the form of spines. In this section we describe the spinal trees used by Carreras et al. (2008). A spine is a sequence of constituent nodes associated with a word in the sentence. From a linguistic perspective, a spine corresponds to the projection of the word in the constituent tree. In other words, the spine of a word consists of the constituents whose head is the word. See Figure 1 for an example of a sentence and its constituent and spinal trees. In the example, the spine of each token is the vertical sequence on top of it.

Formally, a spinal tree for a sentence x_{1:n} is a pair (V, E), where V is a sequence of n spinal nodes and E is a set of n spinal dependencies. The i-th node in V is a pair (x_i, σ_i), where x_i is the i-th word of the sentence and σ_i is its spine.

A spine σ is a vertical sequence of constituent nodes. We denote by N the set of constituent nodes, and we use ⊥ ∉ N to denote a special terminal node. We denote by l(σ) the length of a spine. A spine σ is always non-empty, l(σ) ≥ 1, its first node is always ⊥, and for any 2 ≤ j ≤ l(σ) the j-th node of the spine is an element of N.

A spinal dependency is a tuple ⟨h, d, p⟩ that represents a directed dependency from the p-th node of σ_h to the d-th node of V. Thus, a spinal dependency is a regular dependency between a head token h and a dependent token d augmented with a position p in the head spine. It must be that 1 ≤ h, d ≤ n and that 1 < p ≤ l(σ_h).

The set of spinal dependencies E satisfies the standard conditions of forming a rooted directed projective tree (Kubler et al., 2009). In addition, E satisfies that the dependencies are correctly nested with respect to the constituent structure that the spines represent. Formally, let (h, d1, p1) and (h, d2, p2) be two spinal dependencies associated with the same head h. For left dependents, correct nesting means that if d1 < d2 < h then p1 ≥ p2. For right dependents, if h < d1 < d2 then p1 ≤ p2.

In practice, it is straightforward to obtain spinal trees from a treebank of constituent trees with head-child annotations in each constituent (Carreras et al., 2008): starting from a token, its spine consists of the non-terminal labels of the constituents whose head is the token; the parent node of the top of the spine gives information about the lexical head (obtained by following the head children of the parent) and the position where the spine attaches. Given a spinal tree it is trivial to recover the constituent and dependency trees.
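To make the representation concrete, here is a small sketch (plain Python containers, not the authors' data structures) of the spinal tree from Figure 1(b), with the spinal dependencies read off from the derivation shown later in Figure 3.

from collections import namedtuple

BOTTOM = "⊥"
Node = namedtuple("Node", ["word", "spine"])           # spine: list of labels, spine[0] == ⊥
Dep = namedtuple("Dep", ["head", "dependent", "pos"])  # (h, d, p): d attaches to node p of spine(h)

V = [
    Node("This",    [BOTTOM]),
    Node("market",  [BOTTOM, "NP"]),
    Node("has",     [BOTTOM, "VP", "S"]),
    Node("been",    [BOTTOM, "VP"]),
    Node("very",    [BOTTOM, "ADVP"]),
    Node("badly",   [BOTTOM]),
    Node("damaged", [BOTTOM, "VP"]),
    Node(".",       [BOTTOM]),
]

E = [
    Dep(head=2, dependent=1, pos=2),  # 'This' attaches to the NP node of 'market'
    Dep(head=3, dependent=2, pos=3),  # 'market' attaches to the S node of 'has'
    Dep(head=3, dependent=4, pos=2),  # 'been' attaches to the VP node of 'has'
    Dep(head=4, dependent=7, pos=2),  # 'damaged' attaches to the VP node of 'been'
    Dep(head=7, dependent=5, pos=2),  # 'very' attaches to the VP node of 'damaged'
    Dep(head=5, dependent=6, pos=2),  # 'badly' attaches to the ADVP node of 'very'
    Dep(head=3, dependent=8, pos=3),  # '.' attaches to the S node of 'has'
]

for h, d, p in E:
    print(f"{V[h-1].word} --[{V[h-1].spine[p-1]}]--> {V[d-1].word}")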

2.2 Arc-Eager Transition-Based Parsing

The arc-eager transition-based parser (Nivre, 2003) parses a sentence from left to right in linear time. It makes use of a stack that stores tokens that are already processed (partially built dependency structures), and it chooses the highest-scoring parsing action at each point. The arc-eager algorithm adds every arc at the earliest possible opportunity, and it can only parse projective trees.

The training process is performed with an oracle (a set of transitions producing the parse of a given sentence, see Figure 2), and the model learns the best transition given a configuration. The SHIFT transition removes the first node from the buffer and puts it on the stack. The REDUCE transition removes the top node from the stack. The LEFT-ARC_t transition introduces a labeled dependency edge, with label t, between the first element of the buffer and the top element of the stack; the top element is removed from the stack (a reduce transition). The RIGHT-ARC_t transition introduces a labeled dependency edge, with label t, between the top element of the stack and the first element of the buffer, and it performs a shift transition. Each action can have constraints (Nivre et al., 2014); Figure 2 and Section 3.2 describe the constraints of the spinal parser.
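The following is a minimal sketch of the arc-eager mechanics only, applying a given action sequence to a sentence; there is no scoring model, no beam, and no spinal constraints, and the example labels are illustrative rather than the triplet labels used in this paper.

def run_arc_eager(n_tokens, actions):
    stack, buffer, arcs = [0], list(range(1, n_tokens + 1)), []   # token 0 is the artificial root
    for action, *label in actions:
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "REDUCE":
            stack.pop()                                   # stack top is assumed to already have a head
        elif action == "LEFT-ARC":
            dependent = stack.pop()                       # arc from the buffer front to the stack top
            arcs.append((buffer[0], dependent, label[0]))
        elif action == "RIGHT-ARC":
            dependent = buffer.pop(0)                     # arc from the stack top to the buffer front
            arcs.append((stack[-1], dependent, label[0]))
            stack.append(dependent)                       # the dependent is pushed (shift)
    return arcs

# "This market has": determiner and subject attach leftwards, 'has' hangs off the root.
actions = [("SHIFT",), ("LEFT-ARC", "det"), ("SHIFT",), ("LEFT-ARC", "subj"), ("RIGHT-ARC", "root")]
print(run_arc_eager(3, actions))   # [(2, 1, 'det'), (3, 2, 'subj'), (0, 3, 'root')]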


(a) Constituent Tree with head-children annotations    (b) Spinal Tree

Figure 1: (a) A constituent tree for "This market has been very badly damaged." For each constituent, the underlined child annotates the head child of the constituent. (b) The corresponding spinal tree.

In this paper, we take the existing implementation of arc-eager parsing from ZPar1 (Zhang and Clark, 2009), a beam-search parser implemented in C++ with a focus on efficiency. ZPar gives competitive accuracies, yielding state-of-the-art results, and very fast parsing speeds for dependency parsing. In the case of ZPar, the parsing process starts with a root node at the top of the stack (see Figure 3), and the buffer contains the words/tokens to be parsed.

3 Transition-based Spinal Parsing

In this section we describe an arc-eager transition system that produces spinal trees. Figure 3 shows a parsing example. In essence, the strategy we propose builds the spine of a token in pieces, adding a piece of spine each time the parser produces a dependency involving that token.

We first describe a labeling of dependencies that encodes a triplet of constituent labels and is the basis for defining an arc-eager statistical model. Then we describe a set of constraints that guarantees that the arc-eager derivations we produce correspond to spinal trees. Finally we discuss how to map arc-eager derivations to spinal trees.

1http://sourceforge.net/projects/zpar/

3.1 Constituent Triplets

We follow Collins (1996) and define a labeling for dependencies based on constituent triplets.

Consider a spinal tree (V, E) for a sentence x_{1:n}. The constituent triplet of a spinal dependency (h, d, p) ∈ E is a tuple ⟨a, b, c⟩ where:

• a ∈ N is the node at position p of σ_h (parent label)

• b ∈ N ∪ {⊥} is the node at position p−1 of σ_h (head label)

• c ∈ N ∪ {⊥} is the top node of σ_d (dependent label)

For example, a dependency labeled with ⟨S, VP, NP⟩ is a subject relation, while the triplet ⟨VP, ⊥, NP⟩ represents an object relation. Note that a constituent triplet, in essence, corresponds to a context-free production in a head-driven PCFG (i.e. a → b c, where b is the head child of a).
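A minimal sketch of how the triplet of a spinal dependency can be read off the spines is shown below; the list-based spine encoding and the 1-based indexing are assumptions made for the example.

def triplet(spines, head, dependent, pos):
    head_spine = spines[head - 1]
    parent_label = head_spine[pos - 1]           # a: node at position p of the head spine
    head_label = head_spine[pos - 2]             # b: node at position p - 1
    dependent_label = spines[dependent - 1][-1]  # c: top node of the dependent spine
    return parent_label, head_label, dependent_label

spines = [["⊥"], ["⊥", "NP"], ["⊥", "VP", "S"]]      # This, market, has
print(triplet(spines, head=3, dependent=2, pos=3))   # ('S', 'VP', 'NP') -> a subject relation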


Initial configuration    C_i = ⟨[ ], [x_1 . . . x_n], ∅⟩
Terminal configuration   C_f ∈ {C | C = ⟨Σ, [ ], A⟩}

SHIFT                ⟨Σ, i|B, A⟩ ⇒ ⟨Σ|i, B, A⟩
REDUCE               ⟨Σ|i, B, A⟩ ⇒ ⟨Σ, B, A⟩
                     if ∃j, t : {j →_t i} ∈ A
LEFT-ARC(⟨a,b,c⟩)    ⟨Σ|i, j|B, A⟩ ⇒ ⟨Σ, j|B, A ∪ {j →_⟨a,b,c⟩ i}⟩
                     if ¬∃k, t : {i →_t k} ∈ A
                     (1) if (c ≠ ⊥) ∨ ¬(∃k : {i →_t k} ∈ A)
                     (3) if b = ⊥ ⇒ ∀{i →_⟨a′,b′,c′⟩ k} ∈ A, a = a′ ∧ b = b′
RIGHT-ARC(⟨a,b,c⟩)   ⟨Σ|i, j|B, A⟩ ⇒ ⟨Σ|i|j, B, A ∪ {i →_⟨a,b,c⟩ j}⟩
                     (1) if (c ≠ ⊥) ∨ ¬(∃k : {j →_t k} ∈ A)
                     (2) if ¬(∃k, ⟨a′,b′,c′⟩ : {k →_⟨a′,b′,c′⟩ i} ∈ A ∧ c′ = ⊥)
                     (4) if b = ⊥ ⇒ ∀{j →_⟨a′,b′,c′⟩ k} ∈ A such that j < k, a = a′ ∧ b = b′
                     (5) if b = ⊥ ⇒ ∀{j →_⟨a′,b′,c′⟩ k} ∈ A such that k < j ∧ b′ = ⊥, a = a′

Figure 2: Arc-eager transition system with spinal constraints. Σ represents the stack, B the buffer, and A the set of arcs; t represents a given triplet when its components are not relevant, ⟨a, b, c⟩ a given triplet when its components are relevant, and i, j, and k represent tokens of the sentence. The constraints labeled with (1) . . . (5) are described in Section 3.2. The constraints that are not labeled are standard constraints of the arc-eager parsing algorithm (Nivre, 2003).

In the literature, these triplets have been shown to provide very rich parameterizations of statistical models for parsing (Collins, 1996; Collins, 1999; Carreras et al., 2008).

For our purposes, we associate with each spinal dependency (h, d, p) ∈ E a triplet dependency (h, d, ⟨a, b, c⟩), where the triplet is defined as above. We then define a standard statistical model for arc-eager parsing that uses constituent triplets as dependency labels. An important advantage of this model is that left-arc and right-arc transitions can have feature descriptions that combine standard dependency features with phrase-structure information in the form of constituent triplets. As shown by Carreras et al. (2008), this rich set of features can obtain significant gains in parsing accuracy.

3.2 Spinal Arc-Eager Constraints

We now describe constraints that guarantee that any derivation produced by a triplet-based arc-eager model corresponds to a spinal structure.

Let us make explicit some properties that relate a derivation D with a token i, the arcs in D involving i, and its spine σ_i:

• D has at most a single arc (h, i, ⟨a, b, c⟩) where i is in the dependent position. The dependent label c of this triplet defines the top of σ_i. If c = ⊥ then σ_i = ⊥, and i cannot have dependents.

• Consider the subsequence of D of left arcs with head i, of the form (i, j, ⟨a, b, c⟩). In an arc-eager derivation this subsequence follows a head-outwards order. Each of these arcs has in its triplet a pair of contiguous nodes b–a of σ_i. We call such pairs spinal edges. The subsequence of spinal edges is ordered bottom-up, because arcs appear head-outwards. In addition, sibling arcs may attach to the same position in σ_i. Thus, the subsequence of left spinal edges of i in D is a subsequence with repeats of the sequence of edges of σ_i.

• Analogously, the subsequence of right spinal edges of i in D is a subsequence with repeats of the edges of σ_i. In D, right spinal edges appear after left spinal edges.


Transition        Stack                            Buffer                                              Added Arc
                  [Root]                           [This, market, has, been, very, badly, damaged, .]
SHIFT             [Root, This]                     [market, has, been, very, badly, damaged, .]
L-A(⟨NP,⊥,⊥⟩)     [Root]                           [market, has, been, very, badly, damaged, .]        market →⟨NP,⊥,⊥⟩ This
SHIFT             [Root, market]                   [has, been, very, badly, damaged, .]
L-A(⟨S,VP,NP⟩)    [Root]                           [has, been, very, badly, damaged, .]                has →⟨S,VP,NP⟩ market
R-A(⟨TOP,⊥,S⟩)    [Root, has]                      [been, very, badly, damaged, .]                     Root →⟨TOP,⊥,S⟩ has
R-A(⟨VP,⊥,VP⟩)    [Root, has, been]                [very, badly, damaged, .]                           has →⟨VP,⊥,VP⟩ been
SHIFT             [Root, has, been, very]          [badly, damaged, .]
R-A(⟨ADVP,⊥,⊥⟩)   [Root, has, been, very, badly]   [damaged, .]                                        very →⟨ADVP,⊥,⊥⟩ badly
REDUCE            [Root, has, been, very]          [damaged, .]
L-A(⟨VP,⊥,ADVP⟩)  [Root, has, been]                [damaged, .]                                        damaged →⟨VP,⊥,ADVP⟩ very
R-A(⟨VP,⊥,VP⟩)    [Root, has, been, damaged]       [.]                                                 been →⟨VP,⊥,VP⟩ damaged
REDUCE            [Root, has, been]                [.]
REDUCE            [Root, has]                      [.]
R-A(⟨S,VP,⊥⟩)     [Root, has, .]                   [ ]                                                 has →⟨S,VP,⊥⟩ .

Figure 3: Transition sequence for "This market has been very badly damaged."

edges of i in D is a subsequence with repeatsof the edges of �i. In D, right spinal edgesappear after left spinal edges.

We constrain the arc-eager transition processsuch that these properties hold. Recall that a well-formed spine starts with a terminal node ?, and sodoes the first edge of the spine and only the first.Let C be a configuration, i.e. a partial derivation.The constraints are:

(1) An arc (h, i, ha, b, ?i) is not valid if i has de-pendents in C.

(2) An arc (i, j, ha, b, ci) is not valid ifC contains a dependency of the form(h, i, ha0, b0, ?i).

(3) A left arc (i, j, ha, ?, ci) is only valid ifall sibling left arcs in C are of the form(i, j0, ha, ?, c0i).

(4) Analogous to (3) for right arcs.

(5) If C has a left arc (i, j, ha, ?, ci), then a rightarc (i, j0, ha0, ?, ci) is not valid if a 6= a0.

In essence, constraints 1-2 relate the top of a spine with the existence of descendants, while constraints 3-5 enforce that the bottom of the spine is well formed. We enforce no further constraints looking at edges in the middle of the spine. This means that left and right arc operations can add spinal edges in a free manner, without explicitly encoding how these edges relate to each other. In other words, we rely on the statistical model to correctly build a spine by adding left and right spinal edges along the transition process in a bottom-up fashion.
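As a rough illustration, the sketch below applies constraints 1-5 to a candidate arc given the arcs already added to a configuration. The representation (integer word positions, (head, dep, (a, b, c)) tuples, None standing for ⊥) and all helper names are our own assumptions, not the paper's implementation.

# Hypothetical sketch of the spinal arc-eager validity checks (constraints 1-5).
# A configuration's arcs are stored as (head, dep, (a, b, c)) tuples over integer
# word positions; None stands for the bottom symbol ⊥.

def head_triplet(arcs, token):
    # Return the triplet of the arc in which `token` is the dependent, if any.
    for h, d, t in arcs:
        if d == token:
            return t
    return None

def arc_is_valid(arcs, head, dep, triplet):
    a, b, c = triplet
    # (1) an arc with dependent label ⊥ is invalid if the dependent already has dependents
    if c is None and any(h == dep for h, d, t in arcs):
        return False
    # (2) a token attached to its own head with dependent label ⊥ cannot take dependents
    t = head_triplet(arcs, head)
    if t is not None and t[2] is None:
        return False
    if b is None:  # the new arc attaches at the bottom of the head's spine
        left = dep < head
        # (3)/(4) all sibling arcs on the same side must also attach at the bottom,
        # with the same constituent label a
        for h, d, (a2, b2, c2) in arcs:
            if h == head and (d < head) == left and (b2 is not None or a2 != a):
                return False
        # (5) a bottom right arc must agree on a with any bottom left arc already in C
        if not left:
            for h, d, (a2, b2, c2) in arcs:
                if h == head and d < head and b2 is None and a2 != a:
                    return False
    return True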

It is easy to see that these constraints do not prevent the transition process from ending. Specifically, even though the constraints invalidate arc operations, the arc-eager process can always finish by leaving tokens in the buffer without any head assigned, in which case the resulting derivation is a forest of several projective trees.

3.3 Mapping Derivations to Spinal Trees

The constrained arc-eager derivations correspond to spinal structures, but not necessarily to single spinal trees, for two reasons. First, from the derivation we can extract two subsequences of left and right spinal edges, but the derivation does not encode how these sequences should merge into a spine. Second, as in the basic arc-eager process, the derivation might be a forest rather than a single tree. Next we describe processes to turn a spinal arc-eager derivation into a tree.

Forming spines. For each token i we start from the top of the spine t, a sequence L of left spinal edges, and a sequence R of right spinal edges. The goal is to form a spine σ_i, such that its top is t, and that L and R are subsequences with repeats of the edges of σ_i. We look for the shortest spine satisfying these properties. For example, consider the derivation in Figure 3 and the third token has:

• Top t: S

• Left edges L: VP−S

• Right edges R: ⊥−VP, VP−S


In this case the shortest spine that is consistent with the edges and the top is ⊥−VP−S. Our method runs in two steps:

1. Collapse. Traverse each sequence of edges and replace any contiguous subsequence of identical edges by a single occurrence. The assumption is that identical contiguous edges correspond to sibling dependencies that attach to the same node in the spine.2

2. Merge the left L and right R sequences of edges, overlapping them as much as possible, i.e. looking for the shortest spine. We do this in O(nm), where n and m are the lengths of the two sequences. Whenever multiple shortest spines are compatible with the left and right edge sequences, we give preference to the spine that places left edges at the bottom.

The result of this process is a spine σ_i with left and right dependents attached to positions of the spine. Note that this strategy has some limitations: (a) it cannot recover non-terminal spinal nodes that do not participate in any triplet; and (b) it flattens spinal structures that involve contiguous identical spinal edges.3
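The sketch below illustrates the two steps under stated assumptions: edges are pairs of node labels, "BOT" stands in for ⊥, and the merge step computes a shortest common supersequence by dynamic programming; the helper names are ours, not the authors'.

from itertools import groupby

def collapse(edges):
    # Step 1: replace runs of identical contiguous edges by a single occurrence.
    return [e for e, _ in groupby(edges)]

def merge(left, right):
    # Step 2: shortest sequence containing both `left` and `right` as subsequences
    # (a shortest common supersequence, computed in O(nm)); on ties, left edges
    # are emitted first, which places them at the bottom of the spine.
    n, m = len(left), len(right)
    dp = [[0] * (m + 1) for _ in range(n + 1)]  # dp[i][j]: shortest for left[i:], right[j:]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n:
                dp[i][j] = m - j
            elif j == m:
                dp[i][j] = n - i
            elif left[i] == right[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = 1 + min(dp[i + 1][j], dp[i][j + 1])
    spine, i, j = [], 0, 0
    while i < n or j < m:
        if i < n and j < m and left[i] == right[j]:
            spine.append(left[i]); i += 1; j += 1
        elif j == m or (i < n and dp[i + 1][j] <= dp[i][j + 1]):
            spine.append(left[i]); i += 1
        else:
            spine.append(right[j]); j += 1
    return spine

# Example from Figure 3, token "has": L = [("VP", "S")], R = [("BOT", "VP"), ("VP", "S")]
# merge(collapse(L), collapse(R)) -> [("BOT", "VP"), ("VP", "S")], i.e. the spine ⊥-VP-S.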

Rooting Forests. The arc-eager transition system is not guaranteed to generate a single root in a derivation (though see (Nivre and Fernandez-Gonzalez, 2014) for a solution). Thus, after mapping a derivation to a spinal structure, we might get a forest of projective spinal trees. In this case, to produce a constituent tree from the spinal forest, we promote the last tree and place the rest of the trees as children of its top node.

4 Experiments

In this section we describe the performance of the transition-based spinal parser by running it with different beam sizes and by comparing it with the state of the art.

2However, this is not always the case. For example, in the Penn Treebank adjuncts create an additional constituent level in the verb-phrase structure, and this can result in a series of contiguous VP spinal nodes. The effect of flattening such structures is mild, see below.

3These limitations have relatively mild effects on recovering constituent trees in the style of the Penn Treebank. To measure the effect, we took the correct spinal trees of the development section and mapped them to the corresponding arc-eager derivation. Then we mapped the derivation back to a spinal tree using this process and recovered the constituent tree. This process obtained 98.4% of bracketing recall, 99.5% of bracketing precision, and 99.0 of F1 measure.

We used the ZPar implementation modified to incorporate the constraints for spinal arc-eager parsing. We used the exact same features as Zhang and Nivre (2011), which extract a rich set of features that encode higher-order interactions between the current action and elements of the stack. Since our dependency labels are constituent triplets, these features encode a mix of constituent and dependency structure.

4.1 Data

We use the WSJ portion of the Penn Treebank4, augmented with head-dependent information using the rules of Yamada and Matsumoto (2003). This results in a total of 974 different constituent triplets, which we use as dependency labels in the spinal arc-eager model. We use predicted part-of-speech tags5.

4.2 Results in the Development Set

Table 1 shows the results of our parser for dependency trees: unlabeled attachment score (UAS), triplet accuracy (TA, which corresponds to label accuracy, LA), triplet attachment score (TAS), and spinal accuracy (SA), i.e., the percentage of complete spines that the parser predicts correctly. To be fully comparable, for the dependency-based metrics we report results both including and excluding punctuation symbols in the evaluation. The table also shows parsing speed (sentences per second) on standard hardware. We trained the parser with different beam values, running iterations until the model converged, and we report the results of the best iteration.

As can be observed, the best model is the one trained with beam size 64: larger beam sizes improve the results, although they also make the parser slower. This result is expected, since the number of dependency labels, i.e. triplets, is 974, so a larger beam allows more of them to be tested when new actions are added to the agenda. This model already achieves over 92.34% UAS and predicts most of the triplets that label the dependency arcs (91.45 TA and 89.84 TAS, including punctuation symbols in the evaluation).

4We use the standard partition: sections 02-21 for training, section 22 for development, and section 23 for testing.

5We use the same setting as in (Carreras et al., 2008) by training over a treebank with predicted part-of-speech tags from mxpost (Ratnaparkhi, 1996) (accuracy: 96.5), and we test on the development set and test set with predicted part-of-speech tags of Collins (1997) (accuracy: 96.8).


            Dep. (incl punct)              Dep. (excl punct)       Const                Speed
Beam-size   UAS     TA      TAS     SA     UAS     TA      TAS     LR     LP     F1     Sent/Sec
8           91.39   90.47   88.78   95.60  92.32   91.21   89.73   88.6   88.4   88.5   7.8
16          91.81   90.95   89.28   95.84  92.70   91.65   90.21   89.0   89.1   89.1   3.9
32          92.08   91.14   89.52   95.96  92.91   91.77   90.38   89.4   89.5   89.5   1.7
64          92.34   91.45   89.84   96.13  93.12   92.04   90.65   89.5   89.7   89.7   0.8

Table 1: UAS with predicted part-of-speech tags for the development set, including and excluding punctuation symbols. Constituent results for the development set. Parsing speed in sentences per second (an estimate that varies depending on the machine). TA and TAS refer to label accuracy and labeled attachment score where the labels are the different constituent triplets described in Section 3. SA is the spinal accuracy.


Table 1 also shows the results of the parser on the development set after transforming the dependency trees following the method described in Section 3. The result even surpasses 89.5% F1, which is a competitive accuracy. As we can see, the parser also provides a good trade-off between parsing speed and accuracy.6

In order to test whether the number of dependency labels is an issue for the parser, we also trained a model on dependency trees labeled with the Yamada and Matsumoto (2003) rules, and the results are comparable to ours. For a beam of size 64, the best model with dependency labels provides 92.3% UAS on the development set including punctuation and 93.0% excluding punctuation, while our spinal parser for the same beam size provides 92.3% UAS including punctuation and 93.1% excluding punctuation. This means that the beam-search arc-eager parser is capable of coping with the dependency triplets, since it even provides slightly better results for unlabeled attachment scores. However, unlike (Carreras et al., 2008), the arc-eager parser does not substantially benefit from using the triplets during training.

4.3 Final Results and State-of-the-art Comparison

Our best model (obtained with beam=64) provides 92.14 UAS, 90.91 TA and 89.32 TAS on the test set including punctuation, and 92.78 UAS, 91.53 TA and 90.11 TAS excluding punctuation.

6However, in absolute terms, our running times are slower than typical shift-reduce parsers. Our purpose is to show the relation between speed and accuracy, and we opted for a simple implementation rather than an engineered one. As one example, our parser considers all dependency triplets (974) in all cases, which is somewhat wasteful since most of them can be ruled out given the parts of speech of the candidate dependency. Incorporating a filtering strategy of this kind would yield a constant speedup factor across all beam sizes.

Parser                           UAS
McDonald et al. (2005)           90.9
McDonald and Pereira (2006)      91.5
Huang and Sagae (2010)           92.1
Zhang and Nivre (2011)           92.9
Koo and Collins (2010)*          93.0
Bohnet and Nivre (2012)          93.0
Koo et al. (2008)†*              93.2
Martins et al. (2010)            93.3
Ballesteros and Bohnet (2014)    93.5
Carreras et al. (2008)†*         93.5
Suzuki et al. (2009)†*           93.8
this work (beam 64)†*            92.1
this work (beam 64)†             92.8

Table 2: State-of-the-art comparison of unlabeled attachment score for WSJ-PTB with Y&M rules. Results marked with † use other kinds of information, and are not directly comparable. Results marked with * include punctuation in the evaluation.

Table 2 compares our results with the state of the art. Our model obtains competitive dependency accuracies when compared to other systems.

In terms of constituent structure, our best model (beam=64) obtains 88.74 LR, 89.21 LP and 88.97 F1. Table 3 compares our model with other constituent parsers, including shift-reduce parsers like ours. Our best model is competitive with the rest.

5 Related Work

Collins (1996) defined a statistical model for dependency parsing based on using constituent triplets in the labels, which forms the basis of our arc-eager model. In that work, a chart-based algorithm was used for parsing, while here we use greedy transition-based parsing.


Parser                       LR     LP     F1
Sagae and Lavie (2005)*      86.1   86.0   86.0
Ratnaparkhi (1999)           86.3   87.5   86.9
Sagae and Lavie (2006)*      87.8   88.1   87.9
Collins (1999)               88.1   88.3   88.2
Charniak (2000)              89.5   89.9   89.5
Zhang and Clark (2009)*      90.0   89.9   89.9
Petrov and Klein (2007)      90.1   90.2   90.1
Zhu et al. (2013)-1*         90.2   90.7   90.4
Carreras et al. (2008)       90.7   91.4   91.1
Zhu et al. (2013)-2†*        91.1   91.5   91.3
Huang (2008)                 91.2   91.8   91.5
Charniak (2000)              91.2   91.8   91.5
Huang et al. (2010)          91.2   91.8   91.5
McClosky et al. (2006)       91.2   91.8   91.5
this work (beam 64)*         88.7   89.2   89.0

Table 3: State-of-the-art comparison on the test set for phrase-structure parsing. Results marked with † use additional information, such as semi-supervised models, and are not directly comparable to the others. Results marked with * are shift-reduce parsers.

Carreras et al. (2008) were the first to use spinal representations to define an arc-factored dependency parsing model based on the Eisner algorithm, which parses in cubic time. Our work can be seen as the transition-based counterpart of that, with a greedy parsing strategy that runs in linear time. Because of the extra complexity of spinal structures, they used three probabilistic non-spinal dependency models to prune the search space of the spinal model. In our work, we show that a single arc-eager model can obtain very competitive results, even though the accuracies of our model are lower than theirs.

In terms of parsing spinal structures, Rush et al. (2010) introduced a dual decomposition method that uses constituent and dependency parsing routines to parse a combined spinal structure.

In a similar style to our method, Hall et al. (2007), Hall and Nivre (2008) and Hall (2008) introduced an approach for parsing Swedish and German, in which MaltParser (Nivre et al., 2007) is used to predict dependency trees whose dependency labels are enriched with constituency labels. They used tuples that encode dependency labels, constituent labels, head relations and the attachment. The last step is to make the inverse transformation from a dependency graph to a constituent structure.

Recently, Kong et al. (2015) proposed a structured prediction model for mapping dependency trees to constituent trees, using the CKY algorithm. They assume a fixed dependency tree, which is used as a hard constraint. Also recently, Fernandez-Gonzalez and Martins (2015) proposed an arc-factored dependency model for constituent parsing. In that work dependency labels encode the constituent node where the dependency arises as well as the position index of that node in the head spine. In contrast, we use constituent triplets as dependency labels.

Our method is based on constraining a shift-reduce parser using the arc-eager strategy. Nivre (2003) and Nivre (2004) established the basis for the arc-eager and arc-standard parsing algorithms, which are central to most recent transition-based parsers (Zhang and Clark, 2011b; Zhang and Nivre, 2011; Bohnet and Nivre, 2012). These parsers are very fast, because the number of parsing actions is linear in the length of the sentence, and they obtain state-of-the-art performance, as shown in Section 4.3.

For shift-reduce constituent parsing, Sagae and Lavie (2005; 2006) presented a shift-reduce phrase-structure parser. The main difference from ours is that their models do not use lexical dependencies. Zhang and Clark (2011a) presented a shift-reduce parser based on CCG, which as such is lexicalized. Both spinal and CCG representations are very expressive. One difference is that spinal trees can be directly obtained from constituent treebanks with head-child information, while CCG derivations are harder to obtain.

More recently, Zhang and Clark (2009) and the subsequent work of Zhu et al. (2013) described beam-search shift-reduce parsers obtaining very high results. These models use dependency information via stacking, by running a dependency parser as a preprocessing step. In the literature, stacking is a common technique to improve accuracies by combining dependency and constituent information, in both directions (Wang and Zong, 2011; Farkas and Bohnet, 2012). Our model differs from stacking approaches in that it natively produces the two structures jointly, in such a way that a rich set of features is available.


6 Conclusions and Future Work

There are several lessons to learn from this paper. First, we show that a simple modification to the arc-eager strategy results in a competitive greedy spinal parser that is capable of predicting dependency and constituent structure jointly. In order to make it work, we introduce simple constraints to the arc-eager strategy that ensure well-formed spinal derivations. Second, by doing this, we provide a good trade-off between speed and accuracy, while at the same time providing a dependency structure that can be very useful for downstream applications. Even though the dependency model needs to cope with a large number of dependency labels (in the form of constituent triplets), the unlabeled attachment accuracy does not drop, and the labeling accuracy (for the triplets) is good enough to obtain a good phrase-structure parse. Overall, our work shows that greedy strategies for dependency parsing can be successfully augmented to include constituent structure.

In the future, we plan to explore spinal derivations in new transition-based dependency parsers (Chen and Manning, 2014; Dyer et al., 2015; Weiss et al., 2015; Zhou et al., 2015). This would allow us to explore spinal derivations in new ways and to test their potential.

Acknowledgements

We thank Noah A. Smith and the anonymous reviewers for their very useful comments. Miguel Ballesteros is supported by the European Commission under the contract numbers FP7-ICT-610411 (project MULTISENSOR) and H2020-RIA-645012 (project KRISTINA).

References

Miguel Ballesteros and Bernd Bohnet. 2014. Automatic feature selection for agenda-based dependency parsing. In Proc. COLING.

Bernd Bohnet and Joakim Nivre. 2012. A transition-based system for joint part-of-speech tagging and labeled non-projective dependency parsing. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1455–1465, Jeju Island, Korea, July. Association for Computational Linguistics.

Xavier Carreras, Michael Collins, and Terry Koo. 2008. TAG, Dynamic Programming, and the Perceptron for Efficient, Feature-Rich Parsing. In CoNLL 2008: Proceedings of the Twelfth Conference on Computational Natural Language Learning, pages 9–16, Manchester, England, August. Coling 2008 Organizing Committee.

Eugene Charniak. 2000. A maximum-entropy-inspired parser. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, NAACL 2000, pages 132–139, Stroudsburg, PA, USA. Association for Computational Linguistics.

Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In Proc. EMNLP.

Michael John Collins. 1996. A new statistical parser based on bigram lexical dependencies. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 184–191, Santa Cruz, California, USA, June. Association for Computational Linguistics.

Michael Collins. 1997. Three generative, lexicalised models for statistical parsing. In Proceedings of the Eighth Conference of the European Chapter of the Association for Computational Linguistics, pages 16–23. Association for Computational Linguistics.

Michael Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of ACL.

Richard Farkas and Bernd Bohnet. 2012. Stacking of dependency and phrase structure parsers. In COLING, pages 849–866.

Daniel Fernandez-Gonzalez and Andre F. T. Martins. 2015. Parsing as reduction. CoRR, abs/1503.00030.

Johan Hall and Joakim Nivre. 2008. A dependency-driven parser for German dependency and constituency representations. In Proceedings of the Workshop on Parsing German, PaGe '08, pages 47–54, Stroudsburg, PA, USA. Association for Computational Linguistics.

Johan Hall, Joakim Nivre, and Jens Nilsson. 2007. A Hybrid Constituency-Dependency Parser for Swedish. In Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA).

Johan Hall. 2008. Transition-based natural language parsing with dependency and constituency representations. Master's thesis, Vaxjo University.

Liang Huang and Kenji Sagae. 2010. Dynamic programming for linear-time incremental parsing. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1077–1086.


Zhongqiang Huang, Mary Harper, and Slav Petrov. 2010. Self-training with products of latent variable grammars. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 12–22. Association for Computational Linguistics.

Liang Huang. 2008. Forest reranking: Discriminative parsing with non-local features. In ACL, pages 586–594.

Lingpeng Kong, Alexander M. Rush, and Noah A. Smith. 2015. Transforming dependencies into phrase structures. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies. Association for Computational Linguistics.

Terry Koo and Michael Collins. 2010. Efficient third-order dependency parsers. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1–11.

Terry Koo, Xavier Carreras, and Michael Collins. 2008. Simple semi-supervised dependency parsing. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL), pages 595–603.

Sandra Kubler, Ryan McDonald, and Joakim Nivre. 2009. Dependency Parsing. Morgan and Claypool.

Andre Martins, Noah Smith, Eric Xing, Pedro Aguiar, and Mario Figueiredo. 2010. Turbo parsers: Dependency parsing by approximate variational inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 34–44.

David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, HLT-NAACL '06, pages 152–159, Stroudsburg, PA, USA. Association for Computational Linguistics.

Ryan McDonald and Fernando Pereira. 2006. Online learning of approximate dependency parsing algorithms. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 81–88.

Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005. Online large-margin training of dependency parsers. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 91–98.

Joakim Nivre and Daniel Fernandez-Gonzalez. 2014. Arc-eager parsing with the tree constraint. Computational Linguistics, 40(2):259–267.

Joakim Nivre, Johan Hall, Jens Nilsson, Atanas Chanev, Gulsen Eryigit, Sandra Kubler, Svetoslav Marinov, and Erwin Marsi. 2007. MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering, 13:95–135.

Joakim Nivre, Yoav Goldberg, and Ryan T. McDonald. 2014. Constrained arc-eager dependency parsing. Computational Linguistics, 40(2):249–527.

Joakim Nivre. 2003. An efficient algorithm for projective dependency parsing. In Proceedings of the 8th International Workshop on Parsing Technologies (IWPT), pages 149–160.

Joakim Nivre. 2004. Incrementality in deterministic dependency parsing. In Proceedings of the Workshop on Incremental Parsing: Bringing Engineering and Cognition Together (ACL), pages 50–57.

Slav Petrov and Dan Klein. 2007. Improved inference for unlexicalized parsing. In HLT-NAACL, volume 7, pages 404–411.

Adwait Ratnaparkhi. 1996. A maximum entropy part-of-speech tagger. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Adwait Ratnaparkhi. 1999. Learning to parse natural language with maximum entropy models. Machine Learning, 34(1-3):151–175.

Alexander M. Rush, David Sontag, Michael Collins, and Tommi Jaakkola. 2010. On dual decomposition and linear programming relaxations for natural language processing. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1–11. Association for Computational Linguistics.

Kenji Sagae and Alon Lavie. 2005. A classifier-based parser with linear run-time complexity. In Proceedings of the Ninth International Workshop on Parsing Technology, Parsing '05, pages 125–132, Stroudsburg, PA, USA. Association for Computational Linguistics.

Kenji Sagae and Alon Lavie. 2006. Parser combination by reparsing. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, NAACL-Short '06, pages 129–132, Stroudsburg, PA, USA. Association for Computational Linguistics.

Jun Suzuki, Hideki Isozaki, Xavier Carreras, and Michael Collins. 2009. An empirical study of semi-supervised structured conditional models for dependency parsing. In Proceedings of EMNLP, pages 551–560.

Zhiguo Wang and Chengqing Zong. 2011. Parse reranking based on higher-order lexical dependencies. In IJCNLP, pages 1251–1259.


David Weiss, Christopher Alberti, Michael Collins, and Slav Petrov. 2015. Structured training for neural network transition-based parsing. In Proc. ACL.

Hiroyasu Yamada and Yuji Matsumoto. 2003. Statistical dependency analysis with support vector machines. In Proceedings of the 8th International Workshop on Parsing Technologies (IWPT), pages 195–206.

Yue Zhang and Stephen Clark. 2009. Transition-based parsing of the Chinese treebank using a global discriminative model. In Proceedings of the 11th International Conference on Parsing Technologies, pages 162–171. Association for Computational Linguistics.

Yue Zhang and Stephen Clark. 2011a. Shift-reduce CCG parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 683–692. Association for Computational Linguistics.

Yue Zhang and Stephen Clark. 2011b. Syntactic processing using the generalized perceptron and beam search. Computational Linguistics, 37(1):105–151.

Yue Zhang and Joakim Nivre. 2011. Transition-based dependency parsing with rich non-local features. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL), pages 188–193, Portland, Oregon, USA.

Hao Zhou, Yue Zhang, Shujian Huang, and Jiajun Chen. 2015. A Neural Probabilistic Structured-Prediction Model for Transition-Based Dependency Parsing. In ACL.

Muhua Zhu, Yue Zhang, Wenliang Chen, Min Zhang, and Jingbo Zhu. 2013. Fast and accurate shift-reduce constituent parsing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 434–443. Association for Computational Linguistics.


Accurate Cross-lingual Projection between Count-based Word Vectors by Exploiting Translatable Context Pairs

Shonosuke Ishiwatari, Nobuhiro Kaji, Naoki Yoshinaga
Graduate School of Information Science and Technology, the University of Tokyo
Institute of Industrial Science, the University of Tokyo
National Institute of Information and Communications Technology, Japan
{ishiwatari, kaji, ynaga}@tkl.iis.u-tokyo.ac.jp

Masashi Toyoda, Masaru Kitsuregawa
Institute of Industrial Science, the University of Tokyo
National Institute of Informatics, Japan
{toyoda, kitsure}@tkl.iis.u-tokyo.ac.jp

Abstract

We propose a method that learns a cross-lingual projection of word representations from one language into another. Our method utilizes translatable context pairs as bonus terms of the objective function. In our experiments, our method outperformed existing methods on three language pairs, (English, Spanish), (Japanese, Chinese) and (English, Japanese), without using any additional supervision.

1 Introduction

Vector-based representations of word meanings, hereafter word vectors, have been widely used in a variety of NLP applications including synonym detection (Baroni et al., 2014), paraphrase detection (Erk and Pado, 2008), and dialogue analysis (Kalchbrenner and Blunsom, 2013). The basic idea behind those representation methods is the distributional hypothesis (Harris, 1954; Firth, 1957) that similar words are likely to co-occur with similar context words.

A problem with these word vectors is that they are not meant for capturing the similarity between words in different languages, i.e., translation pairs such as "gato" and "cat." The meaning representations of such word pairs are usually dissimilar, because the vast majority of the context words are from the same language as the target words (e.g., Spanish for "gato" and English for "cat"). This prevents using word vectors in multi-lingual applications such as cross-lingual information retrieval and machine translation.

Several approaches have been proposed to address this problem (Fung, 1998; Klementiev et al., 2012; Mikolov et al., 2013b). In particular, Mikolov et al. (2013b) recently explored learning a linear transformation between word vectors of different languages from a small amount of training data, i.e., a set of bilingual word pairs.

This study explores incorporating prior knowledge about the correspondence between dimensions of word vectors to learn a more accurate transformation, when using count-based word vectors (Baroni et al., 2014). Since the dimensions of count-based word vectors are explicitly associated with context words, we can be partially aware of the cross-lingual correspondence between the dimensions of word vectors by reusing the training data. Also, word surface forms present noisy yet useful clues on the correspondence when targeting language pairs that have exchanged vocabulary (e.g., "cocktail" in English and "coctel" in Spanish). Although apparently useful, how to exploit such knowledge within the learning framework has not been addressed so far.

We evaluated the proposed method on three language pairs. Compared with baselines including a method that uses vectors learned by neural networks, our method gave better results.

2 Related Work

Neural networks (Mikolov et al., 2013a; Bengio et al., 2003) have recently gained much attention as a way of inducing word vectors. Although the scope of our study is currently limited to count-based word vectors, our experiments demonstrated that the proposed method performs significantly better than strong baselines including neural networks. This suggests that count-based word vectors have a great advantage when learning a cross-lingual projection. As future work, we are also interested in extending the method presented here to word vectors learned by neural networks.



There are also methods that directly induce meaning representations shared by different languages (Klementiev et al., 2012; Lauly et al., 2014; Xiao and Guo, 2014; Hermann and Blunsom, 2014; Faruqui and Dyer, 2014; Gouws and Søgaard, 2015), rather than learning a transformation between different languages (Fung, 1998; Mikolov et al., 2013b; Dinu and Baroni, 2014). However, the former approach is unable to handle words not appearing in the training data, unlike the latter.

3 Proposed Method

3.1 Learning cross-lingual projection

We begin by introducing the previous method of learning a linear transformation from word vectors in one language into another, hereafter referred to as the source and target language.

Suppose we have training data of n examples {(x_1, z_1), (x_2, z_2), ..., (x_n, z_n)}, where x_i is the count-based vector representation of a word in the source language (e.g., "gato"), and z_i is the word vector of its translation in the target language (e.g., "cat"). Then we seek a translation matrix W such that W x_i approximates z_i, by solving the following optimization problem:

W^{*} = \arg\min_{W} \sum_{i=1}^{n} \| W x_i - z_i \|^{2} + \frac{\lambda}{2} \| W \|^{2} \quad (1)

The second term is the L2 regularizer (with weight λ). Although the regularization term does not appear in the original formalization (Mikolov et al., 2013b), we take this as a starting point of our investigation because the regularizer can prevent over-fitting and generally helps learn better models.

3.2 Exploiting translatable context pairs

Within the learning framework above, we propose exploiting the fact that dimensions of count-based word vectors are associated with context words, and some dimensions in the source language are translations of those in the target language.

For illustration purposes, consider count-based word vectors of Spanish and English. The Spanish word vectors would have dimensions associated with context words such as "amigo," "comer," and "importante," while the dimensions of the English word vectors are associated with "eat," "run," "small," "importance," and so on. Since, for example, "friend" is an English translation of "amigo," the Spanish dimension associated with "amigo" is likely to be mapped to the English dimension associated with "friend." Such knowledge about the cross-lingual correspondence between dimensions is considered beneficial for learning an accurate translation matrix.

We take two approaches to obtaining such correspondences. Firstly, since we have already assumed that a small amount of training data is available for training the translation matrix, it can also be used for finding the correspondence between dimensions (referred to as D_train). Note that it is natural that some words in a language have many translations in another language. Thus, for example, D_train may include ("amigo", "friend"), ("amigo", "fan") and ("amigo", "supporter").

Secondly, since languages have evolved over the years while often deriving or borrowing words (or concepts) from other languages, such words have similar or even identical spellings. We take advantage of this to find correspondences between dimensions. We specifically define a function DIST(r, s) that measures surface-level similarity, and regard all context word pairs (r, s) whose distance is smaller than a threshold1 as translatable (referred to as D_sim).

\mathrm{DIST}(r, s) = \frac{\mathrm{Levenshtein}(r, s)}{\min(\mathrm{len}(r), \mathrm{len}(s))}

where Levenshtein(r, s) is the Levenshtein distance between the two words, and len(r) is the length of word r.
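A small sketch of how D_sim could be gathered with this distance, assuming plain Python lists of context words; the helper names are hypothetical, not the authors' code.

def levenshtein(r, s):
    # Standard dynamic-programming edit distance.
    prev = list(range(len(s) + 1))
    for i, cr in enumerate(r, 1):
        cur = [i]
        for j, cs in enumerate(s, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cr != cs)))
        prev = cur
    return prev[-1]

def dist(r, s):
    return levenshtein(r, s) / min(len(r), len(s))

def build_d_sim(src_contexts, tgt_contexts, threshold=0.5):
    # Index pairs (j, k) of context words judged translatable by their spelling.
    return {(j, k)
            for k, s in enumerate(src_contexts)
            for j, t in enumerate(tgt_contexts)
            if dist(s, t) < threshold}

# e.g. dist("importante", "importance") = 1 / 10 = 0.1, well below the threshold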

3.3 New objective function

We incorporate the knowledge about the correspondence between the dimensions into the learning framework. Since the correspondence obtained by the methods presented above can be noisy, we want to treat it as a soft constraint. This consideration leads us to develop the following new objective function:

W^{*} = \arg\min_{W} \sum_{i=1}^{n} \| W x_i - z_i \|^{2} + \frac{\lambda}{2} \| W \|^{2} - \lambda_{\mathrm{train}} \sum_{(j,k) \in D_{\mathrm{train}}} w_{jk} - \lambda_{\mathrm{sim}} \sum_{(j,k) \in D_{\mathrm{sim}}} w_{jk}

The third and fourth terms are newly added to guide the learning process to strengthen w_jk when the k-th dimension in the source language corresponds to the j-th dimension in the target language. D_train and D_sim are the sets of dimension pairs found by the two methods. λ_train and λ_sim are parameters representing the strength of the new terms, and are tuned on held-out development data.

1The threshold was fixed to 0.5.



3.4 Optimization

We use the Pegasos algorithm (Shalev-Shwartz et al., 2011), an instance of stochastic gradient descent (Bottou, 2004), to optimize the new objective. Given the τ-th training sample (x_τ, z_τ), we update the translation matrix W as follows:

W \leftarrow W - \eta_\tau \nabla E_\tau(W)

where η_τ is the learning rate, set to η_τ = 1/(λτ), and ∇E_τ(W) is the gradient computed from the τ-th sample (x_τ, z_τ):

\nabla E_\tau(W) = 2\,(W x_\tau - z_\tau)\, x_\tau^{\top} - \lambda_{\mathrm{train}} A - \lambda_{\mathrm{sim}} B + \lambda W

A and B are the gradients corresponding to the two new terms: A is a matrix in which a_jk = 1 if (j, k) ∈ D_train and 0 otherwise; B is defined similarly.
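A minimal sketch of this update loop, assuming numpy arrays and placeholder hyperparameter values; the function and variable names are ours, not the authors' code, and single-sample updates are used here rather than the mini-batches described later.

import numpy as np

def train_translation_matrix(X, Z, D_train, D_sim,
                             lam=1e-3, lam_train=1.0, lam_sim=1.0, epochs=10):
    n, d_src = X.shape
    d_tgt = Z.shape[1]
    W = np.zeros((d_tgt, d_src))
    # A and B are constant 0/1 matrices marking translatable dimension pairs
    A = np.zeros_like(W)
    B = np.zeros_like(W)
    for j, k in D_train:
        A[j, k] = 1.0
    for j, k in D_sim:
        B[j, k] = 1.0
    t = 0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)                  # eta_t = 1 / (lambda * t)
            x, z = X[i], Z[i]
            grad = 2.0 * np.outer(W @ x - z, x)    # 2 (W x - z) x^T
            grad -= lam_train * A + lam_sim * B    # bonus terms
            grad += lam * W                        # L2 regularizer
            W -= eta * grad
    return W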

4 Experiments

We evaluate our method on translation among word vectors in four languages: English (En), Spanish (Es), Japanese (Jp) and Chinese (Cn). We have chosen three language pairs, (En, Es), (Jp, Cn) and (En, Jp), for the translation, so that we can examine the impact of each type of translatable context pairs integrated into the learning objective.

4.1 Setup

First, we prepared source text in the four languages from Wikipedia2 dumps, following (Baroni et al., 2014). We extracted plain text from the XML dumps using wp2txt.3 Since words are not delimited by whitespace in Japanese and Chinese, we used MeCab4 and the Stanford Word Segmenter5 to tokenize the text. Since inflection occurs in English, Spanish, and Japanese, we used the Stanford POS tagger,6 Pattern,7 and MeCab to lemmatize the text.

2 http://dumps.wikimedia.org/
3 https://github.com/yohasebe/wp2txt/
4 http://taku910.github.io/mecab/
5 http://nlp.stanford.edu/software/segmenter.shtml
6 http://nlp.stanford.edu/software/tagger.shtml
7 http://www.clips.ua.ac.be/pages/pattern

Next, we induced count-based word vectors from the obtained text. We considered context windows of five words on both sides of the target word. Function words were then excluded from the extracted context words. Since the count vectors are very high-dimensional and sparse, we selected the top-10k frequent words as context words (in other words, as the dimensions of the word vectors). We converted the counts into positive pointwise mutual information (Church and Hanks, 1990) and normalized the resulting vectors to remove the bias introduced by differences in word frequency.
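A rough sketch of this preprocessing step, assuming a numpy co-occurrence matrix; the choice of L2 normalization at the end is our assumption, since the paper does not state which normalization it uses.

import numpy as np

def ppmi_vectors(counts):
    # counts: (vocabulary x 10k contexts) array of raw co-occurrence counts
    total = counts.sum()
    p_wc = counts / total
    p_w = p_wc.sum(axis=1, keepdims=True)   # marginal over target words
    p_c = p_wc.sum(axis=0, keepdims=True)   # marginal over context words
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    ppmi = np.where(np.isfinite(pmi), np.maximum(pmi, 0.0), 0.0)
    # normalize each row to reduce the bias introduced by word frequency
    norms = np.linalg.norm(ppmi, axis=1, keepdims=True)
    return ppmi / np.maximum(norms, 1e-12)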

Then, we compiled a seed bilingual dictionary (a set of bilingual word pairs) for each language pair, which is used to learn and evaluate the translation matrix. We utilized cross-lingual synsets in the Open Multilingual Wordnet8 to obtain bilingual pairs.

Since our method aims to be used for expanding bilingual dictionaries, we designed the datasets assuming such a situation. Considering that more frequent words are more likely to be registered in a dictionary, we sorted the words in the source language by frequency and used the top-11k words and their translations in the target language as training/development data, and the subsequent 1k words and their translations as test data.

We compared our method with the following three methods:

Baseline learns a translation matrix using Eq. 1 for the same count-based word vectors as the proposed method. Comparison between the proposed method and this method reveals the impact of incorporating the cross-lingual correspondences between dimensions.

CBOW learns a translation matrix using Eq. 1 for word vectors learned by a neural network (specifically, continuous bag-of-words (CBOW)) (Mikolov et al., 2013b). Comparison between this method and the above baseline reveals the impact of the vector representation. Note that the CBOW-based word vectors take rare context words as well as the top-10k frequent words into account. We used word2vec9 to obtain the vectors for each language.10 Since Mikolov et al. (2013b) reported that accurate translation can be obtained when the vectors in the source language are 2-4x larger than those in the target language, we prepared m-dimensional (m = 100, 200, 300) vectors for the target language and n-dimensional (n = 2m, 3m, 4m) vectors for the source language, and optimized their combinations on the development data.

8 http://compling.hss.ntu.edu.sg/omw/
9 https://code.google.com/p/word2vec/

10 The threshold of sub-sampling of words was set to 1e-3 to reduce the effect of very frequent words, e.g., "a" or "the."


Table 1: Experimental results: the accuracy of the translation.

Testset   Baseline       CBOW           Direct Mapping   Proposed w/o surface   Proposed
          P@1    P@5     P@1    P@5     P@1     P@5      P@1     P@5            P@1     P@5
Es→En     0.1%   0.5%    7.5%   22.0%   45.7%   61.1%    46.6%   62.4%          54.7%   67.6%
Es←En     0.1%   0.6%    7.1%   18.9%   11.9%   26.1%    28.7%   45.7%          31.3%   49.6%
Jp→Cn     0.6%   1.6%    5.4%   13.8%   9.3%    22.2%    11.1%   26.2%          15.5%   34.0%
Jp←Cn     0.3%   1.2%    2.9%   11.3%   11.6%   26.8%    7.8%    21.6%          13.1%   27.9%
En→Jp     0.3%   1.1%    4.9%   13.3%   5.4%    13.9%    18.5%   36.4%          19.3%   37.1%
En←Jp     0.2%   1.0%    6.5%   19.1%   22.3%   37.4%    32.3%   51.0%          32.5%   51.9%


Direct Mapping exploits the training data to map each dimension of a word vector in the source language to the corresponding dimension of a word vector in the target language, referring to the bilingual pairs in the training data (Fung, 1998). To deal with words that have more than one translation, we weighted each translation by the reciprocal rank of its frequency among the translations in the target language, as in (Prochasson et al., 2009).

Note that all methods, including the proposed ones, use the same amount of supervision (training data) and are therefore directly comparable with each other.

Evaluation procedure. For each word vector in the source language, we translate it into the target language and evaluate the quality of the translation as in (Mikolov et al., 2013b): i) measure the cosine similarity between the resulting word vector and all the vectors in the test data (in the target language), ii) choose the top-n (n = 1, 5) word vectors that have the highest similarity to the resulting vector, and iii) examine whether the chosen vectors include the correct one.
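A small sketch of this P@n evaluation, assuming numpy arrays for the learned matrix and the test vectors; the names are hypothetical.

import numpy as np

def precision_at_n(W, src, tgt, gold, n=5):
    # src: test source vectors, tgt: test target vectors, gold[i]: index of the
    # correct translation of source word i among the target test vectors.
    proj = src @ W.T                                      # translate source vectors
    proj /= np.linalg.norm(proj, axis=1, keepdims=True)
    tgt_n = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = proj @ tgt_n.T                                 # cosine similarities
    topn = np.argsort(-sims, axis=1)[:, :n]
    return np.mean([gold[i] in topn[i] for i in range(len(gold))])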

4.2 Results

Table 1 shows the results of the translation between word vectors in each language pair. Proposed significantly improved the translation quality over Baseline, and performed the best among all of the methods. Although the use of CBOW-based word vectors (CBOW) also improved the translation quality over Baseline, the gain is smaller than that obtained by our new objective. Proposed w/o surface uses only the training data to find translatable context pairs, by setting λ_sim = 0. Thus, its advantage over Direct Mapping confirms the importance of learning a translation matrix. In addition, the greater advantage of Proposed over Proposed w/o surface in the translation between (En, Es) or (Jp, Cn) conforms to our expectation that surface-level similarity is more useful for language pairs which have often exchanged vocabulary.

Figure 1: The impact of the size of training data (Es → En).


Figure 1 shows P@1 (Es → En) plotted against the size of the training data. Remember that the training data is not only used to learn a translation matrix in the methods other than Direct Mapping, but is also used to map dimensions in Direct Mapping and the proposed methods. Proposed performs the best among all methods regardless of the size of the training data. Comparison between Direct Mapping and Proposed w/o surface reveals that learning a translation matrix is not always effective when the training data is small, since it may suffer from over-fitting (the size of the translation matrix is too large for the size of the training data). We can see that surface-level similarity is especially beneficial when the training data is small.


5 Conclusion

We have proposed the use of prior knowledge for accurately translating word vectors. We have specifically exploited two types of translatable context pairs, which are taken from the training data and guessed from surface-level similarity, to design a new objective function for learning the translation matrix. Experimental results confirmed that our method significantly improved the translation among word vectors in four languages, and the advantage was greater than that obtained by using word vectors learned by a neural network.

Acknowledgments

This work was partially supported by JSPS KAKENHI Grant Number 25280111.

References

Marco Baroni, Georgiana Dinu, and German Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of ACL, pages 238–247.

Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155.

Leon Bottou. 2004. Stochastic learning. In Advanced Lectures on Machine Learning, pages 146–168. Springer.

Kenneth Ward Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.

Georgiana Dinu and Marco Baroni. 2014. How to make words with vectors: Phrase generation in distributional semantics. In Proceedings of ACL, pages 624–633.

Katrin Erk and Sebastian Pado. 2008. A structured vector space model for word meaning in context. In Proceedings of EMNLP, pages 897–906.

Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proceedings of EACL, pages 462–471.

John R. Firth. 1957. A synopsis of linguistic theory. Studies in Linguistic Analysis, pages 1–32.

Pascale Fung. 1998. A statistical view on bilingual lexicon extraction: from parallel corpora to non-parallel corpora. In Machine Translation and the Information Soup, pages 1–17. Springer.

Stephan Gouws and Anders Søgaard. 2015. Simple task-specific bilingual word embeddings. In Proceedings of NAACL-HLT, pages 1386–1390.

Zellig S. Harris. 1954. Distributional structure. Word, 10:146–162.

Karl Moritz Hermann and Phil Blunsom. 2014. Multilingual models for compositional distributed semantics. In Proceedings of ACL, pages 58–68.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent convolutional neural networks for discourse compositionality. In Proceedings of the ACL Workshop on Continuous Vector Space Models and their Compositionality, pages 119–126.

Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. 2012. Inducing crosslingual distributed representations of words. In Proceedings of COLING, pages 1459–1474.

Stanislas Lauly, Hugo Larochelle, Mitesh Khapra, Balaraman Ravindran, Vikas C. Raykar, and Amrita Saha. 2014. An autoencoder approach to learning bilingual word representations. In Advances in NIPS, pages 1853–1861.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In Proceedings of Workshop at ICLR.

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013b. Exploiting similarities among languages for machine translation. arXiv preprint.

Emmanuel Prochasson, Emmanuel Morin, and Kyo Kageura. 2009. Anchor points for bilingual lexicon extraction from small comparable corpora. In Proceedings of MT Summit XII, pages 284–291.

Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. 2011. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30.

Min Xiao and Yuhong Guo. 2014. Distributed word representation learning for cross-lingual dependency parsing. In Proceedings of CoNLL, pages 119–129.


Deep Neural Language Models for Machine Translation

Minh-Thang Luong, Michael Kayser, Christopher D. Manning
Computer Science Department, Stanford University, Stanford, CA 94305

{lmthang, mkayser, manning}@stanford.edu

Abstract

Neural language models (NLMs) have been able to improve machine translation (MT) thanks to their ability to generalize well to long contexts. Despite recent successes of deep neural networks in speech and vision, the general practice in MT is to incorporate NLMs with only one or two hidden layers, and there have not been clear results on whether having more layers helps. In this paper, we demonstrate that deep NLMs with three or four layers outperform those with fewer layers in terms of both the perplexity and the translation quality. We combine various techniques to successfully train deep NLMs that jointly condition on both the source and target contexts. When reranking n-best lists of a strong web-forum baseline, our deep models yield an average boost of 0.5 TER / 0.5 BLEU points compared to using a shallow NLM. Additionally, we adapt our models to a new sms-chat domain and obtain a similar gain of 1.0 TER / 0.5 BLEU points.1

1 Introduction

Deep neural networks (DNNs) have been successful in learning more complex functions than shallow ones (Bengio, 2009) and have excelled in many challenging tasks such as speech (Hinton et al., 2012) and vision (Krizhevsky et al., 2012). These results have sparked interest in applying DNNs to natural language processing problems as well. Specifically, in machine translation (MT), there has been an active body of work recently in utilizing neural language models (NLMs) to improve translation quality.

1 Our code and related materials are publicly available at http://stanford.edu/~lmthang/nlm.

However, to the best of our knowledge, work in this direction only makes use of NLMs with either one or two hidden layers. For example, Schwenk (2010, 2012) and Son et al. (2012) used shallow NLMs with a single hidden layer for reranking. Vaswani et al. (2013) considered two-layer NLMs for decoding but provided no comparison among models of various depths. Devlin et al. (2014) reported only a small gain when decoding with a two-layer NLM over a single-layer one. There have not been clear results on whether adding more layers to NLMs helps.

In this paper, we demonstrate that deep NLMs with three or four layers are better than those with fewer layers in terms of perplexity and translation quality. We detail how we combine various techniques from past work to successfully train deep NLMs that condition on both the source and target contexts. When reranking n-best lists of a strong web-forum MT baseline, our deep models achieve an additional improvement of 0.5 TER / 0.5 BLEU compared to using a shallow NLM. Furthermore, by fine-tuning general in-domain NLMs with out-of-domain data, we obtain a similar boost of 1.0 TER / 0.5 BLEU points over a strong domain-adapted sms-chat baseline compared to utilizing a shallow NLM.

2 Neural Language Models

We briefly describe the NLM architecture and training objective used in this work, and compare our approach to other related work.

Architecture. Neural language models are fundamentally feed-forward networks as described in (Bengio et al., 2003), but not necessarily limited to only a single hidden layer.


Like any other language model, NLMs specify a distribution, p(w|c), to predict the next word w given a context c. The first step is to look up embeddings for the words in the context and concatenate them to form an input, h^(0), to the first hidden layer. We then repeatedly build up hidden representations as follows, for l = 1, ..., n:

h^{(l)} = f\left( W^{(l)} h^{(l-1)} + b^{(l)} \right) \quad (1)

where f is a non-linear function such as tanh. The predictive distribution, p(w|c), is then derived using the standard softmax:

s = W^{(sm)} h^{(n)} + b^{(sm)}, \qquad p(w|c) = \frac{\exp(s_w)}{\sum_{w' \in V} \exp(s_{w'})} \quad (2)

Objective. The typical way of training NLMs is to maximize the training data likelihood or, equivalently, to minimize the cross-entropy objective of the following form: \sum_{(c,w) \in T} -\log p(w|c).

Training NLMs can be prohibitively slow due to the computationally expensive softmax layer. As a result, past work has tried to use a more efficient version of the softmax such as the hierarchical softmax (Morin, 2005; Mnih and Hinton, 2007; Mnih and Hinton, 2009) or the class-based one (Mikolov et al., 2010; Mikolov et al., 2011). Recently, the noise-contrastive estimation (NCE) technique (Gutmann and Hyvarinen, 2012) has been applied to train NLMs in (Mnih and Teh, 2012; Vaswani et al., 2013) to avoid explicitly computing the normalization factors.

Devlin et al. (2014) used a modified version of the cross-entropy objective, the self-normalized one. The idea is to not only improve the prediction, p(w|c), but also to push the normalization factor per context, Z_c, close to 1:

J = \sum_{(c,w) \in T} \left[ -\log p(w|c) + \alpha \log^{2}(Z_c) \right] \quad (3)

While self-normalization does not speed up training, it allows trained models to be applied efficiently at test time without computing the normalization factors. This is similar in flavor to NCE but allows for flexibility (through α) in how hard we want to "squeeze" the normalization factors.

Training deep NLMs. We follow (Devlin et al., 2014) to train self-normalized NLMs, conditioning on both the source and target contexts.

Unlike (Devlin et al., 2014), we found that using the rectified linear function, max{0, x}, proposed in (Nair and Hinton, 2010), works better than tanh. The rectified linear function was used in (Vaswani et al., 2013) as well. Furthermore, while these works use a fixed learning rate throughout, we found that having a simple learning rate schedule is useful for training well-performing deep NLMs. This has also been demonstrated in (Sutskever et al., 2014; Luong et al., 2015) and is detailed in Section 3. We do not perform any gradient clipping, and we notice that learning is more stable when short sentences of length less than or equal to 2 are removed. Bias terms are used for all hidden layers as well as the softmax layer as described earlier, which is slightly different from other work such as (Vaswani et al., 2013). All these details contribute to our success in training deep NLMs.

For simplicity, the same vocabulary is used for both the embedding and the softmax matrices.2 In addition, we adopt the standard softmax to take advantage of GPUs in performing large matrix multiplications. All hyperparameters are given later.
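A compact sketch of this architecture and objective, under the assumption of plain numpy arrays; the names and shapes are ours, not the released code, and ReLU is used as the non-linearity as described above.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def nlm_loss(context_ids, target_id, emb, Ws, bs, W_sm, b_sm, alpha=0.1):
    # h(0): concatenation of the context word embeddings
    h = emb[context_ids].reshape(-1)
    # hidden layers: h(l) = f(W(l) h(l-1) + b(l)), Eq. (1)
    for W, b in zip(Ws, bs):
        h = relu(W @ h + b)
    # softmax scores and normalization factor Z_c, Eq. (2)
    s = W_sm @ h + b_sm
    m = s.max()
    log_Z = m + np.log(np.exp(s - m).sum())   # stable log of the normalization factor
    log_p = s[target_id] - log_Z
    # self-normalized cross-entropy, Eq. (3): -log p(w|c) + alpha * log^2(Z_c)
    return -log_p + alpha * log_Z ** 2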

3 Experiments

3.1 Data

We use the Chinese-English bitext in the DARPA BOLT (Broad Operational Language Translation) program, with 11.1M parallel sentences (281M Chinese words and 307M English words). We reserve 585 sentences for validation, i.e., choosing hyperparameters, and 1124 sentences for testing.3

3.2 NLM Training

We train our NLMs described in Section 2 with SGD, using: (a) a source window of size 5, i.e., 11-gram source context4, (b) a 4-word target history, i.e., 5-gram target LM, (c) a self-normalization weight α = 0.1, (d) a mini-batch of size 128, and (e) a learning rate of 0.1 (training costs are normalized by the mini-batch size). All weights are uniformly initialized in [−0.01, 0.01]. We train our models for 4 epochs (after 2 epochs, the learning rate is halved every 0.5 epoch). The vocabularies are limited to the top 40K frequent words for both Chinese and English. All words not in these vocabularies are replaced by a universal unknown symbol.
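One possible reading of the learning-rate schedule above, written as a hypothetical helper; the exact boundary behaviour (where each halving takes effect) is our assumption.

import math

def learning_rate(epoch, base_lr=0.1):
    # Fixed base rate for the first 2 epochs, then halved every 0.5 epoch.
    if epoch < 2.0:
        return base_lr
    halvings = math.floor((epoch - 2.0) / 0.5) + 1
    return base_lr * (0.5 ** halvings)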

2 Some work (Schwenk, 2010; Schwenk et al., 2012) utilizes a smaller softmax vocabulary, called a short-list.

3 The test set is from BOLT and labeled as p1r6 dev.
4 We used an alignment heuristic similar to Devlin et al. (2014) but applicable to our phrase-based MT system.


Models     Valid   Test   |log Z|
1 layer    9.39    8.99   0.51
2 layers   9.20    8.96   0.50
3 layers   8.64    8.13   0.43
4 layers   8.10    7.71   0.35

Table 1: Training NLMs – validation and test perplexities achieved by self-normalized NLMs of various depths. We report the |log Z| value (base e), similar to Devlin et al. (2014), to indicate how good each model is at pushing the log normalization factors towards 0. All perplexities are derived by explicitly computing the normalization factors.

Embeddings are of size 256 and all hidden layers have 512 units each. Our training speed on a single Tesla K40 GPU device is about 1000 target words per second, and it generally takes about 10-14 days to fully train a model.

We present the NLM training results in Table 1. With more layers, the model succeeds in learning more complex functions; the prediction, hence, becomes more accurate, as evidenced by smaller perplexities for both the validation and test sets. Interestingly, we observe that deeper nets can learn self-normalized NLMs better: the mean log normalization factor, |log Z| in Eq. (3), is driven towards 0 as the depth increases.5

3.3 MT Reranking with NLMs

Our MT models are built using the Phrasal MT toolkit (Cer et al., 2010). In addition to the standard dense feature set6, we include a variety of sparse features for rules, word pairs, and word classes, as described in (Green et al., 2014). Our decoder uses three language models.7 We use a tuning set of 396K words in the newswire and web domains and tune our systems using online expected error rate training as in (Green et al., 2014). Our tuning metric is (BLEU-TER)/2.

We run a discriminative reranker on the 1000-best output of a decoder with MERT. The features used in reranking include all the dense features, an aggregate decoder score, and an NLM score. We learn the reranker weights on a second tuning set, different from the decoder tuning set, to make the reranker less biased towards the dense features. This second tuning set consists of 33K words of web-forum text and is important for obtaining good improvements with reranking.

5 As a reference point, though not directly comparable, Devlin et al. (2014) achieved 0.68 for |log Z| on a different test set with the same self-normalization constant α = 0.1.

6 Consisting of forward and backward translation models, lexical weighting, linear distortion, word penalty, phrase penalty and language model.

7 One is trained on the English side of the bitext, one is trained on a 16.3-billion-word monolingual corpus taken from various domains, and one is a class-based language model trained on the same large monolingual corpus.

System         dev            test1          test2
               T↓     B↑      T↓     B↑      T↓     B↑
baseline       53.7   33.1    55.1   31.3    63.5   16.5
Reranking
1 layer        52.9   34.3    54.5   32.0    63.1   16.7
2 layers       52.7   34.1    54.3   31.9    63.0   16.8
3 layers       52.5   34.5    54.3   32.3    62.5   17.3
4 layers       52.6   34.7    54.1   32.4    62.7   17.2
vs. baseline   +1.2†  +1.6†   +1.0†  +1.1†   +1.0†  +0.8†
vs. 1 layer    +0.4†  +0.4†   +0.4†  +0.4†   +0.6†  +0.6†

Table 2: Web-forum Results – TER (T) and BLEU (B) scores on both the dev set (dev10wb dev), used to tune reranking weights, and the test sets (dev10wb syscomtune and p1r6 dev, respectively). Relative improvements between the best system and the baseline as well as the 1-layer model are highlighted. † marks improvements that are statistically significant (p<0.05).


3.4 Results

As shown in Table 2, it is not obvious whether the depth-2 model is better than the single-layer one, both of which are what past work used. In contrast, reranking with deep NLMs of three or four layers is clearly better, yielding average improvements of 1.0 TER / 1.0 BLEU points over the baseline and 0.5 TER / 0.5 BLEU points over the system reranked with the 1-layer model, all of which are statistically significant according to the test described in (Riezler and Maxwell, 2005).

The fact that the improvements in terms of the intrinsic metrics listed in Table 1 do translate into gains in translation quality is interesting. It reinforces the trend reported in (Luong et al., 2015) that better source-conditioned perplexities lead to better translation scores. This phenomenon is a useful result, as in the past many intrinsic metrics, e.g., alignment error rate, did not necessarily correlate with MT quality metrics.

3.5 Domain Adaptation

For the sms-chat domain, we use a tuning set of 260K words in the newswire, web, and sms-chat domains to tune the decoder weights, and a separate small set of 8K words to tune the reranking weights.


System                    dev             test
                          TER↓  BLEU↑     TER↓  BLEU↑
baseline                  62.2  18.7      57.3  23.3
Reranking
  1 layer (5.42, 0.51)    60.1  22.0      56.2  25.9
  2 layers (5.50, 0.51)   60.7  21.5      56.3  26.0
  3 layers (5.34, 0.43)   59.9  21.4      55.2  26.4
vs. baseline              +2.3‡ +3.3‡     +2.1‡ +3.1‡
vs. 1 layer               +0.2  -0.6      +1.0‡ +0.5

Table 3: Domain-adaptation Results – translation scores for the sms-chat domain, similar to Table 2. We use p2r2smscht dev for dev and p2r2smscht syscomtune for test. The test perplexities and the |log Z| values of our domain-adapted NLMs are shown in parentheses next to each model. ‡ marks improvements that are statistically significant (p<0.01).

To train adapted NLMs, we use models previously trained on general in-domain data and further fine-tune with out-of-domain data for about 4 hours.8

Similarly to the web-forum domain, for sms-chat, Table 3 shows that on the test set our deep NLM with three layers yields a significant gain of 2.1 TER / 3.1 BLEU points over the baseline and 1.0 TER / 0.5 BLEU points over the 1-layer reranked system. It is worth pointing out that for such a small amount of out-of-domain training data, depth becomes less effective, as exhibited by the insignificant BLEU gain on test and a drop on dev when comparing the 1- and 3-layer models. We exclude the 4-layer NLM as it seems to have overfitted the training data. Nevertheless, we still achieve decent gains when using NLMs for MT domain adaptation.

4 Analysis

4.1 NLM Training

We show in Figure 1 the learning curves for various NLMs, demonstrating that deep nets are better than the shallow NLM with a single hidden layer. Starting from minibatch 20K, the ranking is generally maintained: deeper NLMs have better cross-entropies. The gaps become less discernible from minibatch 30K onwards, but numerically, as the model becomes deeper, the average gaps, in perplexities, are consistently 40.1, 1.1, and 2.0.

8 Our sms-chat corpus consists of 146K sentences (1.6M Chinese and 1.9M English words). We randomly select 3000 sentences for validation and 3000 sentences for test. Models are trained for 8 iterations with the same hyperparameters.

[Figure 1: NLM Learning Curve – test cross-entropies (loge perplexities) for various NLMs (1–4 layers), plotted against the number of training mini-batches.]

4.2 Reranking Settings

In Table 4, we compare reranking using all dense features (All) to conditions which use only dense LM features (LM) and, optionally, include a word penalty (WP) feature. All these settings include an NLM score and an aggregate decoder score. As shown, it is best to include all dense features at reranking time.

            All    LMs + WP   LMs
1 layer     11.3   11.3       11.4
2 layers    11.2   11.4       11.5
3 layers    11.0   11.1       11.4
4 layers    10.9   11.2       11.3

Table 4: Reranking Conditions – (TER-BLEU)/2 scores when reranking the web-forum baseline.

5 Related Work

It is worth mentioning another active line of research on building end-to-end neural MT systems (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015; Luong et al., 2015; Jean et al., 2015). These methods have not yet demonstrated success on challenging language pairs such as English-Chinese. Arsoy et al. (2012) preliminarily examined deep NLMs for speech recognition; however, we believe this is the first work that puts deep NLMs into the context of MT.

6 Conclusion

In this paper, we have bridged a gap left open by past work: we have shown that neural language models with more than two layers can help improve translation quality. Our results confirm the trend reported in (Luong et al., 2015) that source-conditioned perplexity strongly correlates with MT performance. We have also demonstrated the


use of deep NLMs to obtain decent gains in out-of-domain conditions.

Acknowledgment

We gratefully acknowledge support from a gift from Bloomberg L.P. and from the Defense Advanced Research Projects Agency (DARPA) Broad Operational Language Translation (BOLT) program under contract HR0011-12-C-0015 through IBM. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the view of DARPA or the US government. We thank members of the Stanford NLP Group as well as the anonymous reviewers for their valuable comments and feedback.

References

Ebru Arsoy, Tara N. Sainath, Brian Kingsbury, and Bhuvana Ramabhadran. 2012. Deep neural network language models. In NAACL WLM Workshop.

D. Bahdanau, K. Cho, and Y. Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. 2003. A neural probabilistic language model. JMLR, 3:1137–1155.

Yoshua Bengio. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, January.

D. Cer, M. Galley, D. Jurafsky, and C. D. Manning. 2010. Phrasal: A statistical machine translation toolkit for exploring new model features. In ACL, Demonstration Session.

J. Devlin, R. Zbib, Z. Huang, T. Lamar, R. Schwartz, and J. Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In ACL.

S. Green, D. Cer, and C. D. Manning. 2014. An empirical comparison of features and tuning for phrase-based machine translation. In WMT.

Michael Gutmann and Aapo Hyvarinen. 2012. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. JMLR, 13:307–361.

G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury. 2012. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine.

Sebastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In ACL.

N. Kalchbrenner and P. Blunsom. 2013. Recurrent continuous translation models. In EMNLP.

A. Krizhevsky, I. Sutskever, and G. E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In NIPS.

M.-T. Luong, I. Sutskever, Q. V. Le, O. Vinyals, and W. Zaremba. 2015. Addressing the rare word problem in neural machine translation. In ACL.

Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Interspeech.

Tomas Mikolov, Stefan Kombrink, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2011. Extensions of recurrent neural network language model. In ICASSP.

Andriy Mnih and Geoffrey Hinton. 2007. Three new graphical models for statistical language modelling. In ICML.

Andriy Mnih and Geoffrey Hinton. 2009. A scalable hierarchical distributed language model. In NIPS.

Andriy Mnih and Yee Whye Teh. 2012. A fast and simple algorithm for training neural probabilistic language models. In ICML.

Frederic Morin. 2005. Hierarchical probabilistic neural network language model. In AISTATS.

Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In ICML.

Stefan Riezler and T. John Maxwell. 2005. On some pitfalls in automatic evaluation and significance testing for MT. In ACL Workshop on Intrinsic/Extrinsic Evaluation Measures for MT and Summarization.

H. Schwenk, A. Rousseau, and M. Attik. 2012. Large, pruned or continuous space language models on a GPU for statistical machine translation. In NAACL WLM Workshop.

H. Schwenk. 2010. Continuous space language models for statistical machine translation. The Prague Bulletin of Mathematical Linguistics, (93):137–146.

Le Hai Son, Alexandre Allauzen, and Francois Yvon. 2012. Continuous space translation models with neural networks. In NAACL-HLT.

I. Sutskever, O. Vinyals, and Q. V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.

Ashish Vaswani, Yinggong Zhao, Victoria Fossum, and David Chiang. 2013. Decoding with large-scale neural language models improves translation. In EMNLP.



Finding Opinion Manipulation Trolls in News Community Forums

Todor Mihaylov
FMI, Sofia University
[email protected]

Georgi D. Georgiev
Ontotext AD, Sofia
[email protected]

Preslav Nakov
Qatar Computing Research Institute, HBKU
[email protected]

Abstract

The emergence of user forums in electronic news media has given rise to the proliferation of opinion manipulation trolls. Finding such trolls automatically is a hard task, as there is no easy way to recognize or even to define what they are; this also makes it hard to get training and testing data. We solve this issue pragmatically: we assume that a user who is called a troll by several people is likely to be one. We experiment with different variations of this definition, and in each case we show that we can train a classifier to distinguish a likely troll from a non-troll with very high accuracy, 82–95%, thanks to our rich feature set.

1 Introduction

With the rise of social media, it became normal for people to read and follow other users' opinion. This created the opportunity for corporations, governments and others to distribute rumors, misinformation, speculation and to use other dishonest practices to manipulate user opinion (Derczynski and Bontcheva, 2014a). They could consistently use trolls (Cambria et al., 2010), write fake posts and comments in public forums, thus making veracity one of the challenges in digital social networking (Derczynski and Bontcheva, 2014b).

The practice of using opinion manipulation trolls has been a reality since the rise of the Internet and community forums. It has been shown that user opinions about products, companies and politics can be influenced by posts by other users in online forums and social networks (Dellarocas, 2006). This makes it easy for companies and political parties to gain popularity by paying for "reputation management" to people or companies that write fake opinions from fake profiles in discussion forums and social networks.

In Europe, the problem has emerged in the context of the crisis in Ukraine.1,2 There have been a number of publications in news media describing the behavior of organized trolls that try to manipulate other users' opinions.3,4,5 Still, it is hard for forum administrators to block them, as trolls try not to violate the forum rules.

2 Related Work

Troll detection and offensive language use are understudied problems (Xu and Zhu, 2010). They have been addressed using analysis of the semantics and sentiment in posts to filter out trolls (Cambria et al., 2010); there have also been studies of general troll behavior (Herring et al., 2002; Buckels et al., 2014). Another approach has been to use lexico-syntactic features about the user's writing style, structure and specific cyber-bullying content (Chen et al., 2012); cyber-bullying was detected using sentiment analysis (Xu et al., 2012); graph-based approaches over signed social networks have been used as well (Ortega et al., 2012; Kumar et al., 2014). A related problem is that of the trustworthiness of statements on the Web (Rowe and Butters, 2009). Yet another related problem is Web spam detection, which has been addressed as a text classification problem (Sebastiani, 2002), e.g., using spam keyword spotting (Dave et al., 2003), lexical affinity of arbitrary words to spam content (Hu and Liu, 2004), and frequency of punctuation and word co-occurrence (Li et al., 2006). See (Castillo and Davison, 2011) for an overview of adversarial web search.

1 http://www.forbes.com/sites/peterhimler/2014/05/06/russias-media-trolls/
2 http://www.theguardian.com/commentisfree/2014/may/04/pro-russia-trolls-ukraine-guardian-online
3 http://www.washingtonpost.com/news/the-intersect/wp/2014/06/04/hunting-for-paid-russian-trolls-in-the-washington-post-comments-section/
4 http://www.theguardian.com/world/2015/apr/02/putin-kremlin-inside-russian-troll-house
5 http://www.theguardian.com/commentisfree/2014/may/04/pro-russia-trolls-ukraine-guardian-online


Object                 Count
Publications           34,514
Comments               1,930,818
 - of which replies    897,806
User profiles          14,598
Topics                 232
Tags                   13,575

Table 1: Statistics about our dataset.

3 Data

We crawled the largest Internet community forum of a Bulgarian media outlet, that of Dnevnik.bg,6 a daily newspaper that requires users to be signed in in order to comment, which makes it easy to track them. The platform allows users to comment on news, to reply to other users' comments and to vote on them with thumbs up or thumbs down. In the forum, the official language is Bulgarian and all comments are written in Bulgarian.

Each publication has a category, a subcategory, and a list of manually selected tags (keywords). We crawled all publications in the Bulgaria, Europe, and World categories, which turned out to be mostly about politics, for the period 01-Jan-2013 to 01-Apr-2015, together with the comments and the corresponding user profiles, as seen in Table 1.

We considered as trolls users who were called such by at least n distinct users, and as non-trolls users who have never been called so. Requiring that a user should have at least 100 comments in order to be interesting for our experiments left us with 317 trolls and 964 non-trolls. Here are two examples (translated):

“To comment from 'Historama': Murzi,7 you know that you cannot manipulate public opinion, right?”

“To comment from 'Rozalina': You, trolls, are so funny :) I saw the same signature under other comments :)”

6 http://dnevnik.bg
7 Murzi is short for murzilka. According to a series of recent publications in Bulgarian media, Russian Internet users reportedly use the term murzilka to refer to Internet trolls. As a result, this term was adopted by some pro-Western Bulgarian forum users as a way to refer to users that they perceive as pro-Russian opinion manipulation trolls. Despite the term being now in circulation in Bulgaria, it is not really in use in Russia. In fact, the vast majority of Russian Internet users have never heard that murzilka, the name of a cute monkey-like children's toy and of a popular Soviet-time children's journal, could possibly be used to refer to Internet trolls.
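Under the definition above (a troll is a user called one by at least n distinct users, a non-troll is a user never called one, and only users with at least 100 comments are kept), the labeling could be sketched as follows; the data structures are assumed for illustration.

    def label_users(accusers_by_user, comment_counts, n=5, min_comments=100):
        """accusers_by_user: {user: set of distinct users who called them a troll};
        comment_counts: {user: number of comments}. Both are assumed inputs."""
        labels = {}
        for user, count in comment_counts.items():
            if count < min_comments:
                continue                          # too little history to be interesting
            accusers = accusers_by_user.get(user, set())
            if len(accusers) >= n:
                labels[user] = "troll"
            elif not accusers:
                labels[user] = "non-troll"        # users accused 1..n-1 times are left out
        return labels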

4 Method

Our features are motivated by several publications about troll behavior mentioned above.

For each user, we extract statistics such as the number of comments posted, the number of days in the forum, the number of days with at least one comment, and the number of publications commented on. All (other) features are scaled with respect to these statistics, which makes it possible to handle users that registered only recently. Our features can be divided into the following groups:

Vote-based features. We calculate the number of comments with positive and negative votes for each user. This is useful as we assume that non-trolls are likely to disagree with trolls, and to give them negative votes. We use the sum over all comments as a feature. We also count separately the comments with high, low and medium positive-to-negative ratio. Here are some example features: the number of comments where (positive/negative) < 0.25, and the number of comments where (positive/negative) < 0.50.
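A minimal sketch of these vote-based features is given below; the feature names and the exact set of ratio thresholds are illustrative assumptions.

    def vote_features(comments):
        """comments: list of (positive_votes, negative_votes) pairs for one user's comments."""
        feats = {
            "pos_votes_total": sum(p for p, _ in comments),
            "neg_votes_total": sum(n for _, n in comments),
            "ratio_below_0.25": 0,
            "ratio_below_0.50": 0,
        }
        for pos, neg in comments:
            if neg > 0 and pos / neg < 0.25:
                feats["ratio_below_0.25"] += 1
            if neg > 0 and pos / neg < 0.50:
                feats["ratio_below_0.50"] += 1
        return feats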

Comment-to-publication similarity. These features measure the similarity between comments and publications. We use cosine similarity over TF.IDF-weighted vectors for the comment and for the publication. The idea is that trolls might try to change or blur the topic of the publication if it differs from their views or agenda.
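The sketch below illustrates one way to compute such a similarity with TF.IDF-weighted vectors and cosine similarity; scikit-learn is used here purely for illustration, since the paper does not specify the implementation (in practice the vectorizer would be fit on the whole corpus rather than on a single pair of texts).

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def comment_publication_similarity(comment_text, publication_text):
        """Cosine similarity between the TF.IDF vectors of a comment and its publication."""
        vectors = TfidfVectorizer().fit_transform([comment_text, publication_text])
        return cosine_similarity(vectors[0], vectors[1])[0, 0]

    print(comment_publication_similarity("the budget debate was heated",
                                         "parliament debates the new budget"))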

Comment order-based features. We count how many comments the user has among the first k in a thread. The idea is that trolls might try to be among the first to comment in order to achieve higher impact.

Top loved/hated comments. We calculate the number of times the user's comments were among the top 1, 3, 5, 10 most loved/hated comments in some thread. The idea is that in the comment threads below many publications there are some trolls that oppose all other users, and usually their comments are among the most hated.

Comment replies-based features. These are features that count how many of the user's comments are replies to other comments, how many are replies to replies, and so on. The assumption here is that trolls try to post the most comments and want to dominate the conversation, especially when defending a specific cause. We further generate complex features that mix comment reply features and vote count-based features, thus generating even more features that model the relationship between replies and user agreement/disagreement.


Time-based features. We generate features from the number of comments posted during different time periods on a daily or on a weekly basis. We assume that users who write comments on purpose could be paid, or could be activists of political parties, and they probably have some usual times to post, e.g., maybe they do it as a full-time job. On the other hand, most non-trolls work from 9:00 to 18:00, and thus we could expect that they should probably post fewer comments during this part of the day. We have time-based features that count the number of comments from 9:00 to 9:59, from 12:00 to 12:59, during working hours 9:00-18:00, etc.

All the above features are scaled, i.e., divided by the number of comments, the number of days in the forum, or the number of days with more than one comment. Overall, we have a total of 338 such scaled features. In addition, we define a new set of features, which are non-scaled.

Non-scaled features. The non-scaled features are based on the same statistics as above, but are not divided by the number of comments, the number of days in the forum, the number of days with more than one comment, etc. For example, one non-scaled feature is the number of times a comment by the target user was voted negatively, i.e., as thumbs down, by other users. As a non-scaled feature, we would use this number directly, while above we would scale it by dividing it by the total number of the user's comments, by the total number of publications the user has commented on, etc. Obviously, there is a danger in using non-scaled features: older users are likely to have higher values for them compared to recently registered users.
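The relation between the scaled and the non-scaled variants of a feature can be sketched as follows; the denominators are the statistics listed above, and the function name is purely illustrative.

    def scaled_variants(raw_value, n_comments, n_days_in_forum, n_days_with_comments):
        """Turn one raw (non-scaled) user statistic into its scaled variants."""
        return {
            "raw": raw_value,                                   # the non-scaled feature
            "per_comment": raw_value / max(n_comments, 1),
            "per_day_in_forum": raw_value / max(n_days_in_forum, 1),
            "per_day_with_comments": raw_value / max(n_days_with_comments, 1),
        }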

5 Experiments and Evaluation

As we mentioned above, in our experiments we focus on users with at least 100 comments. This includes 317 trolls and 964 non-trolls. For each user, we extract the above-described features, scaled and non-scaled, and we normalize them to the [-1, 1] interval. We then use a support vector machine (SVM) classifier (Chang and Lin, 2011) with an RBF kernel with C=32 and g=0.0078125, as this was the best-performing configuration. In order to avoid overfitting, we used 5-fold cross-validation. The results are shown in Tables 2 and 3, where the Accuracy column shows the cross-validation accuracy and the Diff column shows the improvement over the majority class baseline.
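A minimal sketch of this experimental setup is shown below, with scikit-learn standing in for the LIBSVM tools and with randomly generated placeholder features and labels (the real experiments use the 338 scaled features plus the non-scaled ones).

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    X = np.random.rand(1281, 338)             # placeholder features (317 trolls + 964 non-trolls)
    y = np.array([1] * 317 + [0] * 964)       # placeholder labels

    X_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)
    clf = SVC(kernel="rbf", C=32, gamma=0.0078125)
    print(cross_val_score(clf, X_scaled, y, cv=5).mean())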

Features                  Accuracy        Diff
AS + Non-scaled           94.37 (+3.74)   19.13
AS − total comments       91.17 (+0.54)   15.93
AS − comment order        91.10 (+0.46)   15.85
AS − similarity           91.02 (+0.39)   15.77
AS − time day of week     90.78 (+0.15)   15.53
AS − trigg rep range      90.78 (+0.15)   15.53
AS − time all             90.71 (+0.07)   15.46
All scaled (AS)           90.63           15.38
AS − top loved/hated      90.55 (-0.07)   15.30
AS − time hours           90.47 (-0.15)   15.22
AS − vote u/down rep      90.47 (-0.15)   15.22
AS − similarity top       90.32 (-0.31)   15.07
AS − triggered cmnts      90.32 (-0.31)   15.07
AS − is rep to has rep    90.08 (-0.54)   14.83
AS − vote up/down all     89.69 (-0.93)   14.44
AS − is reply             89.61 (-1.01)   14.36
AS − up/down votes        88.29 (-2.34)   13.04

Table 2: Results for classifying 317 mentioned trolls vs. 964 non-trolls for All Scaled (AS) '−' (minus) some scaled feature group. The Accuracy column shows the cross-validation accuracy, and the Diff column shows the improvement over the majority class baseline.

Table 2 presents the results when using all features, as well as when using all features but excluding/adding one feature group. Here, All scaled (AS) refers to the features from all groups except for those in the non-scaled features group described last in the previous section.

We can see that the best feature set is the one that includes all features, including the Non-scaled features group: adding this group contributes +3.74 to accuracy. We further see that excluding features based on time, e.g., AS − time day of week and AS − time all, improves the accuracy, which means that time of posting is not so important as a feature. Similarly, we see that it hurts accuracy to use as features the total number or the order of comments. Finally, the most important features turn out to be those based on replies and on thumbs up/down votes.

Next, Table 3 shows the results of experiments when using different feature groups in isolation. As expected, the features that hurt most when excluded from the All scaled feature set perform best when used alone. Here, the similarity features perform worst, which suggests that trolls tend not to change the topic.


Features                    Accuracy   Diff
All Non-scaled              93.21      17.95
Only vote up/down           87.67      12.41
Only vote up/down totals    87.20      11.94
Only reply up/down voted    86.10      10.85
Only time hours             84.93      9.68
Only time all               84.31      9.06
Only is reply with rep      82.83      7.57
Only triggered rep range    82.83      7.57
Only day of week            82.28      7.03
Only total comments         82.28      7.03
Only reply status           80.72      5.46
Only triggered replies      80.33      5.07
Only comment order          80.09      4.84
Only top loved/hated        79.39      4.14
Only pub similarity top     75.25      0.00
Only pub similarity         75.25      0.00

Table 3: Results for classifying 317 mentioned trolls vs. 964 non-trolls for individual feature groups (all scaled, except for line 1). The Accuracy column shows the cross-validation accuracy, and the Diff column shows the improvement over the majority class baseline.

6 Discussion

We considered as trolls people who try to manipulate other users' opinion. Our positive and negative examples are based on trolls having been called such by at least n other users (we used n = 5).

However, this is somewhat of a witch hunt, and despite our good overall results, the data needs some manual checking in future work. We are also aware that some trolls can actually accuse non-trolls of being trolls, and we cannot be sure whether this is true or not unless we have someone check it manually. In fact, we do have a list of trolls that are known to have been paid (as exposed in Bulgarian media), but there are only 15 of them, and we could not build a good classifier using only them due to the severe class imbalance.

As the choice of the minimum number of accusations of being a troll that we used to define a troll, namely n = 5, might be seen as arbitrary, we also experimented with n = 3, 4, 5, 6, while keeping the required minimum number of comments per user at 100 as before. The results are shown in Table 4. We can see that as the number of troll mentions/accusations increases, so does the cross-validation accuracy.

min mentions   3        4        5        6
trolls         545      419      317      260
non-trolls     964      964      964      964
Accuracy       85.49    87.85    90.87    92.32
Diff           +21.60   +18.15   +15.61   +13.56

Table 4: Results for classifying mentioned trolls vs. non-trolls, using different numbers of minimum troll accusations to define a troll (users with 100 comments or more only). The Accuracy column shows the cross-validation accuracy, and the Diff column shows the improvement over the majority class baseline.

However, this is partly due to the increased class imbalance of trolls vs. non-trolls, which can be seen by the decrease in the improvement of our classifier compared to the majority class baseline.

We also ran experiments with a fixed number of minimum mentions for the trolls (namely 5, as before), but with varying minimum numbers of comments per user: 10, 25, 50, 100. The results are shown in Figure 1. We can see that as the minimum number of comments increases, the cross-validation accuracy for both the baseline and for our classifier decreases (as the troll-vs-non-troll ratio becomes more balanced); yet, the improvement of our classifier over the baseline increases, which means that the more we know about a user, the better we can predict whether s/he will be seen as a troll by other users.

Figure 1: Results for classifying mentioned trolls vs. non-trolls for users with a different minimal number of comments (trolls were accused of being such by 5 or more different users). Shown are results for our classifier and for the majority class baseline.


7 Conclusion and Future Work

We have presented experiments in trying to distinguish trolls from non-trolls in news community forums. We have experimented with a large number of features, both scaled and non-scaled, and we have achieved very strong overall results using statistics such as the number of comments, of positive and negative votes, of posted replies, activity over time, etc. The nature of our features means that our troll detection works best for "elder trolls" with at least 100 comments in the forum. In future work, we plan to add content features such as keywords, topics, named entities, and part of speech, which should help detect "fresh" trolls. Our ultimate objective is to be able to find and expose paid opinion manipulation trolls.

Acknowledgments

We would like to thank the anonymous reviewers for their constructive comments, which have helped us to improve the paper.

References

Erin E Buckels, Paul D Trapnell, and Delroy L Paulhus. 2014. Trolls just want to have fun. Personality and Individual Differences, 67:97–102.

Erik Cambria, Praphul Chandra, Avinash Sharma, and Amir Hussain. 2010. Do not feel the trolls. In Proceedings of the 3rd International Workshop on Social Data on the Web, SDoW '10, Shanghai, China.

Carlos Castillo and Brian D. Davison. 2011. Adversarial web search. Found. Trends Inf. Retr., 4(5):377–486, May.

Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27.

Ying Chen, Yilu Zhou, Sencun Zhu, and Heng Xu. 2012. Detecting offensive language in social media to protect adolescent online safety. In Proceedings of the 2012 International Conference on Privacy, Security, Risk and Trust and of the 2012 International Conference on Social Computing, PASSAT/SocialCom '12, pages 71–80, Amsterdam, Netherlands.

Kushal Dave, Steve Lawrence, and David M Pennock. 2003. Mining the peanut gallery: Opinion extraction and semantic classification of product reviews. In Proceedings of the 12th International World Wide Web Conference, WWW '03, pages 519–528, Budapest, Hungary.

Chrysanthos Dellarocas. 2006. Strategic manipulation of internet opinion forums: Implications for consumers and firms. Management Science, 52(10):1577–1593.

Leon Derczynski and Kalina Bontcheva. 2014a. Pheme: Veracity in digital social networks. In Proceedings of the UMAP Project Synergy workshop.

Leon Derczynski and Kalina Bontcheva. 2014b. Spatio-temporal grounding of claims made on the web, in Pheme. In Proceedings of the 10th Joint ISO-ACL SIGSEM Workshop on Interoperable Semantic Annotation, ISA '14, page 65, Reykjavik, Iceland.

Susan Herring, Kirk Job-Sluder, Rebecca Scheckler, and Sasha Barab. 2002. Searching for safety online: Managing "trolling" in a feminist forum. The Information Society, 18(5):371–384.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, pages 168–177, Seattle, WA, USA.

Srijan Kumar, Francesca Spezzano, and VS Subrahmanian. 2014. Accurately detecting trolls in Slashdot Zoo via decluttering. In Proceedings of the 2014 IEEE/ACM International Conference on Advances in Social Network Analysis and Mining, ASONAM '14, pages 188–195, Beijing, China.

Wenbin Li, Ning Zhong, and Chunnian Liu. 2006. Combining multiple email filters based on multivariate statistical analysis. In Foundations of Intelligent Systems, pages 729–738. Springer.

F. Javier Ortega, Jose A. Troyano, Fermín L. Cruz, Carlos G. Vallejo, and Fernando Enríquez. 2012. Propagation of trust and distrust for the detection of trolls in a social network. Computer Networks, 56(12):2884–2895.

Matthew Rowe and Jonathan Butters. 2009. Assessing Trust: Contextual Accountability. In Proceedings of the First Workshop on Trust and Privacy on the Social and Semantic Web, SPOT '09, Heraklion, Greece.

Fabrizio Sebastiani. 2002. Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1):1–47.

Zhi Xu and Sencun Zhu. 2010. Filtering offensive language in online communities using grammatical relations. In Proceedings of the Seventh Annual Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference.

Jun-Ming Xu, Xiaojin Zhu, and Amy Bellmore. 2012. Fast learning for sentiment analysis on bullying. In Proceedings of the First International Workshop on Issues of Sentiment Discovery and Opinion Mining, WISDOM '12, pages 10:1–10:6, New York, NY, USA.



Do dependency parsing metrics correlate with human judgments?

Barbara Plank,* Héctor Martínez Alonso,* Željko Agić,* Danijela Merkler,† Anders Søgaard*

*Center for Language Technology, University of Copenhagen, Denmark
†Department of Linguistics, University of Zagreb, Croatia

[email protected]

Abstract

Using automatic measures such as labeled and unlabeled attachment scores is common practice in dependency parser evaluation. In this paper, we examine whether these measures correlate with human judgments of overall parse quality. We ask linguists with experience in dependency annotation to judge system outputs. We measure the correlation between their judgments and a range of parse evaluation metrics across five languages. The human-metric correlation is lower for dependency parsing than for other NLP tasks. Also, inter-annotator agreement is sometimes higher than the agreement between judgments and metrics, indicating that the standard metrics fail to capture certain aspects of parse quality, such as the relevance of root attachment or the relative importance of the different parts of speech.

1 Introduction

In dependency parser evaluation, the standard accuracy metrics—labeled and unlabeled attachment scores—are defined simply as averages over correct attachment decisions. Several authors have pointed out problems with these metrics; they are both sensitive to annotation guidelines (Schwartz et al., 2012; Tsarfaty et al., 2011), and they fail to say anything about how parsers fare on rare but important linguistic constructions (Nivre et al., 2010). Both criticisms rely on the intuition that some parsing errors are more important than others, and that our metrics should somehow reflect that. There are sentences that are hard to annotate because they are ambiguous, or because they contain phenomena peripheral to linguistic theory, such as punctuation, clitics, or fragments. Manning (2011) discusses similar issues for part-of-speech tagging.

To measure the variable relevance of parsing errors, we present experiments with human judgment of parse output quality across five languages: Croatian, Danish, English, German, and Spanish. For the human judgments, we asked professional linguists with dependency annotation experience to judge which of two parsers produced the better parse. Our stance here is that, insofar as experts are able to annotate dependency trees, they are also able to determine the quality of a predicted syntactic structure, which we can in turn use to evaluate parser evaluation metrics. Even though downstream evaluation is critical in assessing the usefulness of parses, it also presents non-trivial challenges in choosing the appropriate downstream tasks (Elming et al., 2013); we therefore see human judgments as an important supplement to extrinsic evaluation.

To the best of our knowledge, no prior study has analyzed the correlation between dependency parsing metrics and human judgments. For a range of other NLP tasks, metrics have been evaluated by how well they correlate with human judgments. For instance, the standard automatic metrics for certain tasks—such as BLEU in machine translation, or ROUGE-N and NIST in summarization or natural language generation—were evaluated in this way, reaching correlation coefficients well above .80 (Papineni et al., 2002; Lin, 2004; Belz and Reiter, 2006; Callison-Burch et al., 2007).

We find that correlations between evaluation metrics and human judgments are weaker for dependency parsing than for other NLP tasks—our correlation coefficients are typically between .35 and .55—and that inter-annotator agreement is sometimes higher than human-metric agreement. Moreover, our analysis (§5) reveals that humans have a preference for attachment over labeling decisions, and that attachments closer to the root are more important. Our findings suggest that the currently employed metrics are not fully adequate.

315

Contributions We present i) a systematic comparison between a range of available dependency parsing metrics and their correlation with human judgments; and ii) a novel dataset1 of 984 sentences (up to 200 sentences for each of the 5 languages) annotated with human judgments for the preferred automatically parsed dependency tree, enabling further research in this direction.

2 Metrics

We evaluate seven dependency parsing metrics, described in this section.

Given a labeled gold tree G = ⟨V, E_G, l_G(·)⟩ and a labeled predicted tree P = ⟨V, E_P, l_P(·)⟩, let E ⊂ V × V be the set of directed edges from dependents to heads, and let l : V × V → L be the edge labeling function, with L the set of dependency labels.

The three most commonly used metrics are those from the CoNLL 2006–7 shared tasks (Buchholz and Marsi, 2006): unlabeled attachment score (UAS), label accuracy (LA), both introduced by Eisner (1996), and labeled attachment score (LAS), the pivotal dependency parsing metric introduced by Nivre et al. (2004).

$$\mathrm{UAS} = \frac{|\{e \mid e \in E_G \cap E_P\}|}{|V|}$$

$$\mathrm{LAS} = \frac{|\{e \mid l_G(e) = l_P(e),\ e \in E_G \cap E_P\}|}{|V|}$$

$$\mathrm{LA} = \frac{|\{v \mid v \in V,\ l_G(v, \cdot) = l_P(v, \cdot)\}|}{|V|}$$
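A minimal sketch of these three metrics is given below; representing a tree as a dictionary from dependent to (head, label) is an implementation choice made here for illustration, not something prescribed by the paper.

    def attachment_scores(gold, pred):
        """gold, pred: {dependent: (head, label)} over the same set of dependents."""
        n = len(gold)
        uas = sum(pred[d][0] == h for d, (h, _) in gold.items()) / n
        las = sum(pred[d] == (h, l) for d, (h, l) in gold.items()) / n
        la = sum(pred[d][1] == l for d, (_, l) in gold.items()) / n
        return uas, las, la

    gold = {1: (2, "nsubj"), 2: (0, "root"), 3: (2, "obj")}
    pred = {1: (2, "nsubj"), 2: (0, "root"), 3: (2, "iobj")}
    print(attachment_scores(gold, pred))       # (1.0, 0.666..., 0.666...)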

We include two further metrics—namely, labeled (LCP) and unlabeled (UCP) complete predications—to account for the relevance of correct predicate prediction for parsing quality.

LCP is inspired by the complete predicates metric from the SemEval 2015 shared task on semantic parsing (Oepen et al., 2015).2 LCP is triggered by a verb (i.e., the set of nodes V_verb) and checks whether all its core arguments match, i.e., all outgoing dependency edges except for punctuation. Since LCP is a very strict metric, we also evaluate UCP, its unlabeled variant. Given a function c_X(v)

that retrieves the set of child nodes of a node v from a tree X, we first define UCP as follows, and then incorporate the label matching for LCP:

1 The dataset is publicly available at https://bitbucket.org/lowlands/release

2 http://alt.qcri.org/semeval2015/

$$\mathrm{UCP} = \frac{|\{v \mid v \in V_{\mathrm{verb}},\ c_G(v) = c_P(v)\}|}{|V_{\mathrm{verb}}|}$$

$$\mathrm{LCP} = \frac{|\{v \mid v \in V_{\mathrm{verb}},\ c_G(v) = c_P(v) \wedge l_G(v, \cdot) = l_P(v, \cdot)\}|}{|V_{\mathrm{verb}}|}$$
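These two metrics can be sketched in the same dictionary representation used in the previous sketch; the punctuation filtering mentioned above is omitted here for brevity.

    def children(tree, head):
        """Set of dependents attached to the given head."""
        return {d for d, (h, _) in tree.items() if h == head}

    def labeled_children(tree, head):
        """Same, but keeping the dependency labels."""
        return {(d, l) for d, (h, l) in tree.items() if h == head}

    def complete_predications(gold, pred, verb_nodes):
        """UCP and LCP over the verb nodes of a sentence, as defined above."""
        ucp = sum(children(gold, v) == children(pred, v) for v in verb_nodes)
        lcp = sum(labeled_children(gold, v) == labeled_children(pred, v) for v in verb_nodes)
        return ucp / len(verb_nodes), lcp / len(verb_nodes)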

To reach the final figure of seven different parsing metrics, on top of the previous five, in our experiments we also include the neutral edge direction metric (NED) (Schwartz et al., 2011) and tree edit distance (TED) (Tsarfaty et al., 2011; Tsarfaty et al., 2012).3

3 Experiment

In our analysis, we compare the metrics with human judgments. We examine how well the automatic metrics correlate with each other, as well as with human judgments, and whether inter-annotator agreement exceeds annotator-metric agreement.

LANG  TYPE  SENT  SL    TD   ANN  α    RAW

da    CDT   200   22.7  8.1  2-3  .77  .53
de    UD    200   18.0  4.4  2    .67  .33
en    UD    200   23.4  5.4  4    .73  .45
es    UD    184   32.5  6.7  4    .60  .20
hr    PDT   200   28.5  7.8  2    .80  .59

Table 1: Data characteristics and agreement statistics. TD: tree depth; SL: sentence length.

Data In our experiments we use data from five languages: the English (en), German (de) and Spanish (es) treebanks from the Universal Dependencies (UD v1.0) project (Nivre et al., 2015), the Copenhagen Dependency Treebank (da) (Buch-Kromann, 2003), and the Croatian Dependency Treebank (hr) (Agic and Merkler, 2013). We keep the original POS tags for all datasets (17 tags in the case of UD, 13 tags for Croatian, and 23 for Danish). Data characteristics are in Table 1.

For the parsing systems, we follow McDonald and Nivre (2007) and use the second-order MST parser (McDonald et al., 2005), as well as the Malt parser with pseudo-projectivization (Nivre and Nilsson, 2005) and default parameters. For each language, we train the parsers on the canonical training section. We randomly select 200 sentences from the test sections, where our two de-

3 http://www.tsarfaty.com/unipar/


LANG  PARSER  LAS    UAS    LA     NED    TED    LCP    UCP

en    Malt    79.17  82.31  87.88  84.34  85.20  41.27  47.17
      MST     78.30  82.91  86.80  84.72  83.49  36.05  45.58
es    Malt    78.72  82.85  87.34  82.90  84.20  34.00  43.00
      MST     79.51  84.97  86.95  85.00  83.16  31.83  44.00
da    Malt    79.28  83.40  85.92  83.39  77.50  47.69  55.23
      MST     82.75  87.00  88.42  87.01  78.39  52.31  62.31
de    Malt    69.09  75.70  82.05  75.54  80.37  19.72  30.45
      MST     72.07  80.29  82.22  80.13  78.94  19.38  33.22
hr    Malt    63.21  72.34  76.66  71.94  71.64  23.18  31.03
      MST     65.98  76.20  79.01  75.89  72.82  24.71  34.29
Avg   Malt    73.84  79.32  83.97  76.62  79.78  33.17  43.18
      MST     75.72  82.27  84.68  82.55  79.36  32.86  44.08

Table 2: Parsing performance of Malt and MST.

pendency parsers do not agree on the correct analysis, after removing punctuation.4 We do not control for predicted trees matching the gold standard.

Annotation task A total of 7 annotators were involved in the annotation task. All the annotators are either native or fluent speakers, and well-versed in dependency syntax analysis.

For each language, we present the selected 200 sentences with their two predicted dependency structures to 2–4 annotators and ask them to rank which of the two parses is better. They see graphical representations of the two dependency structures, visualized with the What's Wrong With My NLP? tool.5 The annotators were not informed of which parser produced which tree, nor did they have access to the gold standard. The dataset of 984 sentences is available at: https://bitbucket.org/lowlands/release (folder CoNLL2015).

4 Results

First, we perform a standard evaluation in order to see how the parsers fare, using our range of dependency evaluation measures. In addition, we compute correlations between metrics to assess their similarity. Finally, we correlate the measures with human judgments, and compare average annotator and human-system agreements.

Table 2 presents the parsing performance with respect to the set of metrics. We see that, using LAS, Malt performs better on English, while MST performs better on the remaining four languages.

Table 3 presents Spearman's ρ between metrics across the 5 languages. Some metrics are strongly

4 For Spanish, we had fewer analyses where the two parsers disagreed, i.e., 184.

5 https://code.google.com/p/whatswrong/

ρ     UAS    LA     NED    TED    LCP    UCP

LAS   .755   .622   .743   .556   .236   .286
UAS   –      .436   .869   .512   .211   .342
LA    –      –      .436   .419   .206   .154
NED   –      –      –      .499   .216   .339
TED   –      –      –      –      .175   .219
LCP   –      –      –      –      –      .352

Table 3: Correlations between metrics.

ρ     en      es      da     de      hr      All

LAS   .547    .478    .297   .466    .540    .457
UAS   .541    .437    .331   .453    .397    .425
LA    .387*   .250*   .232   .310    .467    .324*
NED   .541    .469    .318   .501    .446    .448
TED   .372*   .404    .323   .331    .405*   .361*
LCP   .022*   .230*   .171   .120*   .120*   .126*
UCP   .249*   .195*   .223   .190*   .143*   .195*

Table 4: Correlations between human judgments and metrics (micro avg). * means significantly different from the LAS ρ using Fisher's z-transform. Bold: highest correlation per language.

correlated, e.g., LAS and LA, and UAS and NED, but some exhibit very low correlation coefficients.

Next, we study correlations with human judgments (Table 4). In order to aggregate over the annotations, we use an item-response model (Hovy et al., 2013). The correlations are relatively weak compared to similar findings for other NLP tasks. For instance, ROUGE-1 (Lin, 2004) correlates strongly with perceived summary quality, with a coefficient of 0.99. The same holds for BLEU and human judgments of machine translation quality (Papineni et al., 2002).
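As an illustration (with made-up numbers, and without the item-response aggregation step itself), the correlation between aggregated human preferences and a metric could be computed per sentence roughly as follows.

    from scipy.stats import spearmanr

    # +1: annotators preferred the first parse, -1: the second (aggregated per sentence)
    human_pref = [1, -1, 1, 1, -1, 1]
    # metric(first parse) - metric(second parse) for the same sentences
    metric_diff = [0.05, -0.10, 0.02, 0.00, -0.03, 0.08]

    rho, p_value = spearmanr(human_pref, metric_diff)
    print(rho, p_value)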

We find that, overall, LAS is the metric that correlates best with human judgments. It is closely followed by UAS, which does not differ significantly from LAS, albeit the correlations for UAS are slightly lower on average. NED is in turn highly correlated with UAS. The correlations for the predicate-based measures (LCP, UCP) are the lowest, as they are presumably too strict, and very different from LAS.

Motivated by the fact that people prefer the parse that gets the overall structure right (§5), we experimented with weighting edges proportionally to their log-distance to the root. However, the signal was fairly weak; the correlations were only slightly higher for English and Danish: .552 and .338, respectively.
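The paper does not spell out the exact weighting function, so the sketch below is only one assumed instantiation of the idea, with edge weights decreasing in the log of the dependent's distance from the root, in the same tree representation as before.

    import math

    def depth(tree, node):
        """Number of edges from the node up to the root (head id 0)."""
        d = 0
        while node != 0:
            node = tree[node][0]
            d += 1
        return d

    def root_weighted_uas(gold, pred):
        """UAS variant where edges near the root count more (assumed weighting)."""
        weights = {d: 1.0 / math.log1p(depth(gold, d) + 1) for d in gold}
        total = sum(weights.values())
        correct = sum(w for d, w in weights.items() if pred[d][0] == gold[d][0])
        return correct / total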

Finally, we compare the mean agreement be-


      ANN   LAS   UAS   LA    NED   TED   LCP   UCP

da    .768  .838  .848  .808  .828  .828  .745  .765
de    .670  .710  .690  .635  .710  .630  .575  .565
en    .728  .715  .705  .660  .700  .658  .525  .600
es    .601  .663  .644  .603  .652  .635  .581  .554
hr    .800  .755  .700  .735  .730  .705  .570  .580

Table 5: Average mean agreement between annotators, and between annotators and metrics.

tween humans with the mean agreement between humans and the standard metrics, cf. Table 5. For two languages (English and Croatian), humans agree more with each other than with the standard metrics, suggesting that the metrics are not fully adequate. The mean agreement between humans is .728 for English, with slightly lower scores for the metrics (LAS: .715, UAS: .705, NED: .660). The difference between the mean agreement of annotators and the human-metric agreement was higher for Croatian: .80 vs .755. For Danish, German and Spanish, however, average agreement between metrics and human judgments is higher than our inter-annotator agreement.

5 Analysis

In sum, our experiments show that metrics correlate relatively weakly with human judgments, suggesting that some errors are more important to humans than others, and that the relevance of these errors is not captured by the metrics.

To better understand this, we first consider the POS-wise correlations between human judgments and LAS, cf. Table 6. In English, for example, the correlation between judgments and LAS is significantly stronger for content words6 (ρc = 0.522) than for function words (ρf = 0.175). This also holds for the other UD languages, namely German (ρc = 0.423 vs ρf = 0.263) and Spanish (ρc = 0.403 vs ρf = 0.228). This is not the case for the non-UD languages, Croatian and Danish, where the difference between content-POS and function-POS correlations is not significant. In Danish, function words head nouns, and are thus more important than in UD, where content-content word relations are annotated and function words are leaves in the dependency tree. This difference in dependency formalism is shown by the higher correlation for ρf for Danish.

The greater correlation for content words for English, German and Spanish suggests that errors

6 Tagged as ADJ, NOUN, PROPN, VERB.

ρ     content  function

en    .522     .175
de    .423     .263
es    .403     .228
da    .148     .173
hr    .340     .306

Table 6: Correlations between human judgments and POS-wise LAS (content ρc vs. function ρf POS-wise LAS correlations).

in attaching or labeling content words mean more to human judges than errors in attaching or labeling function words. We also observe that longer sentences do not compromise annotation quality, with a ρ between −0.07 and 0.08 across languages regarding sentence length and agreement.

For the languages for which we had 4 annotators, we analyzed the subset of trees where humans and system (by LAS) disagreed, but where there was a majority vote for one tree. We obtained 35 dependency instances for English and 27 for Spanish (cf. Table 7). Two of the authors determined whether humans preferred labeling over attachment, or otherwise.

      attachment  labeling  items

en    86%         14%       35
es    67%         33%       27

Table 7: Preference for attachment or labeling for items where humans and system disagreed and human agreement ≥ 0.75.

Table 7 shows that there is a prevalent preference for attachment over labeling for both languages. For Spanish, there is a proportionally higher label preference. Of the attachment preferences, 36% and 28% were related to root/main predicate attachments, for English and Spanish respectively. The relevance of the root-attachment preference indicates that attachment is more important than labeling for our annotators.

Figure 1 provides three examples from the data where human and system disagree. Parse i) involves a coordination as well as a (local) adverbial, where humans voted for the correct coordination (red) and thus unanimously preferred attachment over labeling. Yet, LAS was higher for the analysis in blue because "certainly" is attached to "Europeans" in the gold standard. Parse ii) is another


Figure 1: Examples where human and system (LAS) disagree. Human choice: i) red; ii) red; iii) blue.

example where humans preferred attachment (in this case root attachment), while iii) shows a Spanish example ("waiter is needed") where the subject label (nsubj) of "camarero" ("waiter") was the decisive trait.

6 Related Work

Parsing metrics are sensitive to the choice of annotation scheme (Schwartz et al., 2012; Tsarfaty et al., 2011) and fail to capture how parsers fare on important linguistic constructions (Nivre et al., 2010). In other NLP tasks, several studies have examined how metrics correlate with human judgments, including machine translation, summarization and natural language generation (Papineni et al., 2002; Lin, 2004; Belz and Reiter, 2006; Callison-Burch et al., 2007). Our study is the first to assess the correlation of human judgments and dependency parsing metrics. While previous studies reached correlation coefficients over 0.80, this is not the case for dependency parsing, where we observe much lower coefficients.

7 Conclusions

We have shown that out of seven metrics, LAS correlates best with human judgments. Nevertheless, our study shows that there is an amount of human preference that is not captured by LAS. Our analysis of human versus system disagreement indicates that attachment is more important than labeling, and that humans prefer a parse that gets the overall structure right. For some languages, inter-annotator agreement is higher than annotator-metric (LAS) agreement, and content-POS is more important than function-POS, indicating there is an amount of human preference that

is not captured by our current metrics. These observations raise the important question of how to incorporate our observations into parsing metrics that provide a better fit to human judgments. We do not propose a better metric here, but simply show that while LAS seems to be the most adequate metric, there is still a need for better metrics to complement downstream evaluation.

We outline a number of extensions for future research. Among those, we would aim at augmenting the annotations by obtaining more detailed judgments from human annotators. The current evaluation would ideally encompass more (diverse) domains and languages, as well as the many diverse annotation schemes implemented in various publicly available dependency treebanks that were not included in our experiment.

Acknowledgments

We thank Muntsa Padró and Miguel Ballesteros for their help and the three anonymous reviewers for their valuable feedback.

References

Željko Agić and Danijela Merkler. 2013. Three Syntactic Formalisms for Data-Driven Dependency Parsing of Croatian. In Text, Speech, and Dialogue. Springer.

Anja Belz and Ehud Reiter. 2006. Comparing Automatic and Human Evaluation of NLG Systems. In EACL.

Matthias Buch-Kromann. 2003. The Danish Dependency Treebank and the DTAG Treebank Tool. In TLT.

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X Shared Task on Multilingual Dependency Parsing. In CoNLL.

Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. 2007. (Meta-) evaluation of machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation.

Jason Eisner. 1996. Three new probabilistic models for dependency parsing. In COLING.

Jakob Elming, Anders Johannsen, Sigrid Klerke, Emanuele Lapponi, Héctor Martínez Alonso, and Anders Søgaard. 2013. Down-stream effects of tree-to-dependency conversions. In NAACL.

Dirk Hovy, Taylor Berg-Kirkpatrick, Ashish Vaswani, and Eduard Hovy. 2013. Learning whom to trust with MACE. In NAACL.

Chin-Yew Lin. 2004. ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop.

Christopher D Manning. 2011. Part-of-speech tagging from 97% to 100%: is it time for some linguistics? In Computational Linguistics and Intelligent Text Processing. Springer.

Ryan McDonald and Joakim Nivre. 2007. Characterizing the errors of data-driven dependency parsers. In EMNLP-CoNLL.

Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005. Online large-margin training of dependency parsers. In ACL.

Joakim Nivre and Jens Nilsson. 2005. Pseudo-projective dependency parsing. In ACL.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2004. Memory-based dependency parsing. In CoNLL.

Joakim Nivre, Laura Rimell, Ryan McDonald, and Carlos Gomez-Rodriguez. 2010. Evaluation of dependency parsers on unbounded dependencies. In COLING.

Joakim Nivre, Cristina Bosco, Jinho Choi, Marie-Catherine de Marneffe, Timothy Dozat, Richard Farkas, Jennifer Foster, Filip Ginter, Yoav Goldberg, Jan Hajic, Jenna Kanerva, Veronika Laippala, Alessandro Lenci, Teresa Lynn, Christopher Manning, Ryan McDonald, Anna Missila, Simonetta Montemagni, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Maria Simi, Aaron Smith, Reut Tsarfaty, Veronika Vincze, and Daniel Zeman. 2015. Universal Dependencies 1.0.

Stephan Oepen, Marco Kuhlmann, Yusuke Miyao, Daniel Zeman, Silvie Cinkova, Dan Flickinger, Jan Hajic, and Zdenka Uresova. 2015. SemEval 2015 Task 18: Broad-coverage semantic dependency parsing. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015).

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL, Philadelphia, Pennsylvania.

Roy Schwartz, Omri Abend, Roi Reichart, and Ari Rappoport. 2011. Neutralizing linguistically problematic annotations in unsupervised dependency parsing evaluation. In ACL.

Roy Schwartz, Omri Abend, and Ari Rappoport. 2012. Learnability-based syntactic annotation design. In COLING.

Reut Tsarfaty, Joakim Nivre, and Evelina Andersson. 2011. Evaluating dependency parsing: robust and heuristics-free cross-annotation evaluation. In EMNLP.

Reut Tsarfaty, Joakim Nivre, and Evelina Andersson. 2012. Cross-framework evaluation for statistical parsing. In EACL.



Feature Selection for Short Text Classification using Wavelet Packet Transform

Anuj Mahajan
Indian Institute of Technology, Delhi
Hauz Khas, New Delhi 110016, India
[email protected]

Sharmistha, Shourya Roy
Xerox Research Centre India
Bangalore - 560103, India
{sharmistha.jat,shourya.roy}@xerox.com

Abstract

Text classification tasks suffer from the curse of dimensionality due to a large feature space. Short text data further exacerbate the problem due to their sparse and noisy nature. Feature selection thus becomes an important step in improving classification performance. In this paper, we propose a novel feature selection method using the Wavelet Packet Transform. The Wavelet Packet Transform (WPT) has been used widely in various fields due to its efficiency in encoding transient signals. We demonstrate how the short text classification task can benefit from feature selection using WPT due to the sparse nature of the data. Our technique chooses the most discriminating features by computing inter-class distances in the transformed space. We experimented extensively with several short text datasets. Compared to well-known techniques, our approach reduces the feature space size and improves the overall classification performance significantly on all the datasets.

1 Introduction

The text classification task consists of assigning a document to one or more classes. This can be done using machine learning techniques by training a model with labelled documents. Documents are usually represented as vectors using a variety of techniques like bag-of-words (unigram, bigram), TF-IDF representation, etc. Typically, text corpora have a very high dimensional document representation, equal to the size of the vocabulary. This leads to the curse of dimensionality1 in machine learning models, thereby degrading the performance.

Short text corpora, like SMS, tweets, etc., in particular suffer from a sparse high dimensional feature space, due to a large vocabulary and short document length. To give an idea as to how these factors affect the size of the feature space, we compare Reuters with a Twitter data corpus. In the Reuters-21578 corpus there are approximately 2.5 million words in total and 14,506 unique vocabulary entries after standard preprocessing steps

1 https://en.wikipedia.org/wiki/Curse_of_dimensionality

(which is the dimensionality of the feature space). However, the Twitter 1 corpus we used for experiments has approximately 15,000 words in total and a feature space size of 7,423 words. Additionally, the average length of an English tweet is 10–12 words, whereas the average length of a document in the Reuters-21578 news classification corpus is 200 words. Therefore, the dimensionality is extremely high even for small corpora with short texts. In addition, the average number of words in a document is significantly lower in short text data, leading to higher sparsity of the feature space representation of documents.

Owing to this high dimensionality problem, one of the important steps in text classification workflows is feature selection. Feature selection techniques for traditional documents have been aplenty, and a few seminal survey articles have been written on this topic (Blitzer, 2008). In contrast, for short text there is much less work on statistical feature selection, and more focus has gone to feature engineering towards word normalization, canonicalization, etc. (Han and Baldwin, 2011).

In this paper, we propose a dimensionality reduction technique for short text using the wavelet packet transform, called Improvised Adaptive Discriminant Wavelet Packet Transform (IADWPT). IADWPT performs dimensionality reduction by selecting discriminative features (wavelet coefficients) from the Wavelet Packet Transform (WPT) representation. Short text data resemble transient signals in vector representation, and WPT encodes transient signals (signals lasting for a very short duration) well (Learned and Willsky, 1995), using very few coefficients. This leads to a considerable decrease in the dimensionality of the feature space along with an increase in classification accuracy. Additionally, we optimise the procedure to select the most discriminative features from the WPT representation. To the best of our knowledge, this is the first attempt to apply an algorithm based on the wavelet packet transform to feature selection in short text classification.

2 Related Work

Feature selection has been widely adopted for dimensionality reduction of text datasets in the past. Yiming Yang et al. (Yang and Pedersen, 1997) performed a comparative study of some of these methods, including document frequency, information gain (IG), mutual

321

information (MI), χ2-test (CHI) and term strength (TS). They concluded that IG and CHI are the most effective for aggressive dimensionality reduction. The mRMR technique proposed by (Peng et al., 2005) selects the best feature subset by increasing the relevancy of features with the target class and reducing redundancy between chosen features.

The wavelet transform provides a time-frequency representation of a given signal. The time-frequency representation is useful for describing signals with time-varying frequency content. A detailed explanation of wavelet transform theory is beyond the scope of this paper. For detailed theory, refer to Daubechies (Daubechies, 2006; Daubechies, 1992; Coifman and Wickerhauser, 2006) and Robi Polikar (Polikar, ). The first use of the wavelet transform for compression was proposed by Ronald R Coifman et al. (Coifman et al., 1994). Hammad Qureshi et al. (Qureshi et al., 2008) proposed an adaptive discriminant wavelet packet transform (ADWPT) for feature reduction.

The wavelet transform has also been applied to natural language processing tasks in the past. A survey of wavelet applications in data mining (Li et al., 2002) discusses the basics and properties of wavelets that make them effective for data mining. Aggarwal (2002) uses wavelets for string classification, noting that the wavelet technique creates a hierarchical decomposition of the data which captures trends at varying levels of granularity and thus helps the classification task through the new representation. Xexeo et al. (2008) used the wavelet transform to represent documents for classification.

3 Wavelet Packet Transform for Short-text Dimensionality Reduction

Feature selection compresses the feature space while preserving the maximum discriminative power of the features for classification. We use this analogy to compress the document feature space using the Wavelet Packet Transform. A vector representation of a document (e.g. a dictionary-encoded vector) is equivalent to a digital signal. This vector can therefore be processed with the wavelet transform to obtain a compressed representation of the document in terms of wavelet coefficients. The wavelet coefficients are ranked and selected based on their discrimination power between classes, and the classification model is trained on these highly informative coefficients. Results show a considerable improvement in model accuracy using our dimensionality reduction technique.

Typically, the vector representation of a short text has very few non-zero entries due to the short length of the documents. If we plot the count of each word in the dictionary on the y-axis against the distinct words on the x-axis, the resulting graph has very few spikes, just like a transient signal. Transient signals are active for only a small fraction of the observation period. Learned and Willsky (1995) show the efficacy of the wavelet packet transform in representing transient signals. This motivates our use of the Wavelet Packet Transform to encode short text.

The wavelet transform is a popular choice for feature representation in image processing. Our approach is inspired by the work of Qureshi et al. (2008), who propose an Adaptive Discriminant Wavelet Packet Transform (ADWPT) based representation for meningioma subtype classification. ADWPT obtains a wavelet-based representation by optimising the discrimination power of the various features. The proposed technique, IADWPT, differs from ADWPT in the way discriminative features are selected. The next section provides details of the proposed approach.

4 IADWPT - Improvised Adaptive Discriminant Wavelet Packet Transform

This section presents the proposed short text feature selection technique, IADWPT. IADWPT uses the wavelet packet transform of the data to extract useful discriminative features from the sub-bands at various depths.

Natural language processing tasks usually represent documents in a dictionary-encoded bag-of-words representation. This numerical vector representation of a document is equivalent to a signal representation. To obtain the IADWPT representation of a document, the following steps are computed (a sketch is given after the list):

1) Compute the full wavelet packet transform of the document vector representation.

2) Compute the discrimination power of each coefficient in the wavelet packet transform representation.

3) Select the most discriminative coefficients to represent all the documents in the corpus.
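As a rough illustration of step 1, the following is a minimal sketch using the PyWavelets library; the function name, the choice of wavelet ('coif2', matching the Coiflets of order 2 used later in the experiments) and the decomposition level are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
import pywt  # PyWavelets


def full_wpt_coefficients(doc_vector, wavelet="coif2", level=3):
    """Step 1 sketch: full wavelet packet transform of a document vector.

    doc_vector is assumed to be a dictionary-encoded bag-of-words vector.
    Returns a dict mapping sub-band paths at the given level (e.g. 'aad')
    to their coefficient arrays.
    """
    wp = pywt.WaveletPacket(data=np.asarray(doc_vector, dtype=float),
                            wavelet=wavelet, mode="symmetric",
                            maxlevel=level)
    return {node.path: node.data
            for node in wp.get_level(level, order="natural")}
```

Steps 2 and 3 operate on the coefficients returned here; they are sketched after the definitions below.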

Once the 1-D wavelet transform is computed to a desired level $l$, the wavelet packet transform (WPT) produces $2^l$ different sets of coefficients (nodes in the WPT tree). These coefficients represent the magnitude of the various frequencies present in the signal at a given time. We select the most discriminative coefficients to represent all the documents in the corpus by calculating the discriminative power of each coefficient.

The classification task consists of $c$ classes with $d$ documents. The 1-D Wavelet Packet Transform of the $k$-th document $d_k$ yields $l$ levels with $f$ sub-bands consisting of $m$ coefficients in each sub-band. $x_{m,f,l}$ represents the coefficients of the Wavelet Packet Transform. The following terms are defined for Algorithm 1.

• The probability density estimate ($S^{k}_{m,f,l}$) of a particular sub-band at level $l$ for a training sample document $d_k$ of a given class $c_i$ is given by:

$$S^{k}_{m,f,l} = \frac{(x^{k}_{m,f,l})^2}{\sum_j (x^{k}_{j,f,l})^2}$$

Here, $x^{k}_{m,f,l}$ is the $m$-th coefficient in the $f$-th sub-band of the $l$-th level of document $d_k$, and $j$ varies over the length of the sub-band. Wavelet supports vary across bands; this normalization ensures that feature selection gives uniform weight to features from different bands. This step calculates the normalised value of the coefficients in a sub-band.

Algorithm 1: IADWPT algorithm for best discriminative feature selection

1: for all classes $c_i \in C$ do
2:   Calculate the Wavelet Packet Transform for all documents $d_k$ in class $c_i$
3:   for all documents $d_k$ do
4:     Calculate probability density estimates $S^{k}_{m,f,l}$
5:   end for
6:   for all levels $l$ of the WPT and their sub-bands $f$ do
7:     for all wavelet packet transform coefficients $m$ in sub-band $f$ do
8:       Calculate average probability density $A^{c_i}_{m,f,l}$
9:     end for
10:  end for
11: end for
12: for all class pairs $c_a$, $c_b$ do
13:   Calculate discriminative power $D^{a,b}_{m,f,l}$
14: end for
15: Select the top $m_0$ coefficients for representing the documents in the corpus

• The average probability density estimate ($A^{c_i}_{m,f,l}$) is derived using all the training samples $d_k$ in a given class $c_i$:

$$A^{c_i}_{m,f,l} = \frac{\sum_k S^{k}_{m,f,l}}{d}$$

for coefficient $m$ in sub-band $f$ of class $c_i$, where $d$ is the total number of documents in the class and $k$ varies over those documents. It measures the average value of a coefficient across all the documents belonging to a class.

• The discriminative power ($D^{a,b}_{m,f,l}$) of the $m$-th coefficient in the $f$-th sub-band of the $l$-th level, between classes $a$ and $b$, is defined as follows:

$$D^{a,b}_{m,f,l} = \left|\sqrt{A^{a}_{m,f,l}} - \sqrt{A^{b}_{m,f,l}}\right|$$

The discriminative power is the Hellinger distance between the average probability density estimates of a coefficient for the two classes. It quantifies the difference in the average value of a coefficient between a pair of classes: the larger the difference, the better the discriminative power of the coefficient. Discriminative features thus tend to have a higher average probability density in one of the classes, whereas redundant features cancel out when taking the difference in the distance computation. Rajpoot (2003) has shown the efficacy of the Hellinger distance applied to ADWPT.

Selecting coefficients with greater discriminative power helps the classifier perform well. The full procedure is given in Algorithm 1.
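The scoring above (steps 2 and 3) can be sketched in a few lines of NumPy. The sketch below assumes the per-document WPT coefficients have already been arranged into an array of sub-bands (e.g. from the earlier `full_wpt_coefficients` helper, stacked in a fixed order); aggregating the pairwise $D^{a,b}$ values by taking their maximum is our own simplification, since the paper handles multiple classes with one-vs-one classifiers.

```python
import numpy as np


def iadwpt_scores(coeffs_by_class):
    """Score WPT coefficients by discriminative power (Algorithm 1 sketch).

    coeffs_by_class maps a class label to an array of shape
    (num_docs, num_subbands, band_len) holding each document's WPT
    coefficients. Returns an array of the same sub-band shape with the
    maximum pairwise discriminative power D^{a,b} per coefficient.
    """
    avg_density = {}
    for label, coeffs in coeffs_by_class.items():
        squared = coeffs ** 2
        # S^k_{m,f,l}: normalise within each sub-band of each document
        s = squared / (squared.sum(axis=2, keepdims=True) + 1e-12)
        # A^{c}_{m,f,l}: average over the documents of the class
        avg_density[label] = s.mean(axis=0)

    labels = list(avg_density)
    disc = np.zeros_like(avg_density[labels[0]])
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            # D^{a,b}_{m,f,l}: Hellinger-style distance between class averages
            d = np.abs(np.sqrt(avg_density[a]) - np.sqrt(avg_density[b]))
            disc = np.maximum(disc, d)
    return disc


def select_top(disc, n_select):
    """Flattened indices of the n_select most discriminative coefficients."""
    return np.argsort(disc.ravel())[::-1][:n_select]
```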

Multi-class classification can then be handled in this framework using one-vs-one classification. We select the top $m_0$ features from the wavelet representation for representing the data in the classification task. The time complexity of the algorithm is polynomial. The method is based on the adaptive discriminant wavelet packet transform (ADWPT) (Qureshi et al., 2008); we therefore name it the improvised adaptive discriminant wavelet packet transform (IADWPT). ADWPT uses a best basis for classification, a union of selected sub-bands that spans the transformed space, so noise is still retained in the signal, whereas IADWPT selects coefficients from the sub-bands with maximal discriminative power, thus improving classification results. As opposed to ADWPT, IADWPT is a one-way transform: the original signal cannot be recovered from the transform domain. Experimental results confirm that IADWPT performs better than ADWPT on short text datasets.

4.1 IADWPT Example

Figure 1 gives an intuition of the workings of the IADWPT transform. The uppermost graph displays the average probability density of the positive class samples, and the middle graph shows the average probability density of the negative class samples. These two energy values are then subtracted; the resulting values are shown in the bottommost graph of Figure 1. The peak positive and negative values in the bottommost graph represent the most discriminant features. The absolute value of the discriminative power can then be used to select the most discriminative features to represent each document in the corpus.

5 Experiments and Results

We used multiple short text datasets to evaluate the efficacy of the proposed algorithm against state-of-the-art feature selection algorithms.

1) Twitter 1: This dataset is part of the SemEval-2013 Task B dataset (Nakov et al.) for two-class sentiment classification. We gathered 624 examples each for the positive and negative classes for our experiments.


Figure 1: The x-axis represents the wavelet packet transform coefficients, the y-axis represents amplitude (energy). From top to bottom: 1) average probability density $A^{a}_{m,f,l}$ of the coefficients in the positive class (a); 2) average probability density $A^{b}_{m,f,l}$ of the coefficients in the negative class (b); 3) difference between $A^{a}_{m,f,l}$ and $A^{b}_{m,f,l}$.

Table 1: Classification results - Support Vector Machine (SVM) and Logistic Regression (LR)

Dataset                       Baseline  MI-avg  χ²     PCA    ADWPT  IADWPT

Classification accuracy - SVM
Twitter 1                     47.04     47.57   45.62  59.28  46     63.44
SMS Spam 1 - HAM Accuracy     99.82     99.71   99.77  99.87  99.63  99.94
SMS Spam 1 - SPAM Accuracy    83.10     83.32   82.96  83.81  83.57  83.92
SMS Spam 2 - HAM Accuracy     55.2      56.31   55.62  81.12  54.1   87.7
SMS Spam 2 - SPAM Accuracy    46.6      46.7    46.49  92.39  47.3   99.42

Total dimensions in best classification accuracy result - SVM
Twitter 1                     7423      2065    515    540    7423   23
SMS Spam 1                    9394      3540    550    750    9394   815
SMS Spam 2                    10681     2985    490    855    1068   250

Classification accuracy - Logistic Regression
Twitter 1                     75.8      74.97   75.21  76.28  68.2   76.72
SMS Spam 1 - HAM Accuracy     97.91     94.67   95.28  98.71  98.03  99.61
SMS Spam 1 - SPAM Accuracy    95.48     85.34   86.37  91.37  82.2   87.54
SMS Spam 2 - HAM Accuracy     96.02     89.54   92.76  71.21  95.09  98.5
SMS Spam 2 - SPAM Accuracy    91.2      88.37   91.38  89.15  92.2   94.51

Total dimensions in best classification accuracy result - Logistic Regression
Twitter 1                     7423      5600    3250   3575   7423   2749
SMS Spam 1                    9394      7545    1755   2350   9394   1680
SMS Spam 2                    10681     6550    3000   3050   10681  9981

2) SMS Spam 1: The UCI spam dataset (Almeida) consists of 5,574 SMS instances classified into SPAM and HAM classes, where SPAM denotes messages that are not useful and HAM denotes useful messages. We compare our results with those published by Almeida et al. (2013), and therefore followed the same experimental procedure as in that paper: the first 30% of the samples were used for training and the rest for testing.

3) SMS Spam 2: This dataset was published by Yadav et al. (2011). It consists of 2,000 SMS messages, 1,000 SPAM and 1,000 HAM. The experimental settings are the same as for SMS Spam 1.

The goal of our experiments is to examine the effectiveness of the proposed algorithm for feature selection on short text datasets. We measure the effectiveness of a feature selection technique by the increase in accuracy on the final machine learning task. Our method does not depend on the specific classifier used in the final classification.

Figure 2: Spam classification accuracy comparison across various feature selection algorithms for the SMS Spam 2 dataset (x-axis: number of dimensions; y-axis: spam accuracy; methods: mRMR, IADWPT, PCA, Chi, MI).


Figure 3: Ham classification accuracy comparison across various feature selection algorithms for the SMS Spam 2 dataset (x-axis: number of dimensions; y-axis: ham accuracy; methods: mRMR, IADWPT, PCA, Chi, MI).

Therefore, we used popular classifiers, namely Support Vector Machines (RBF kernel with grid search) (Cortes and Vapnik, 1995) and Logistic Regression, with and without dimensionality reduction of the unigram representation, to benchmark performance. All experiments used 10-fold cross validation and a grid search on the C parameter. We report results in terms of classification accuracy, measured as the number of correctly classified data points divided by the total number of data points.

We conducted detailed experiments comparing IADWPT, using Coiflets of order 2, with other feature selection techniques such as PCA, Mutual Information, χ², mRMR (Peng et al., 2005) and ADWPT (Qureshi et al., 2008). Results are reported in Table 1, which shows the best accuracy values and the corresponding feature set size selected by each technique. It can be observed that IADWPT gives the best accuracy in most cases with very few features.
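The evaluation protocol above can be reproduced with a few lines of scikit-learn. The following is a minimal sketch, not the authors' code: the synthetic X and y stand in for the selected IADWPT (or unigram) features and class labels, and the grid of C values is illustrative.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative stand-ins: X would hold the selected features, y the labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 23))       # e.g. 23 IADWPT-selected coefficients
y = rng.integers(0, 2, size=200)

# 10-fold cross-validated grid search on C for an RBF-kernel SVM,
# scored by classification accuracy, as in the protocol described above.
search = GridSearchCV(SVC(kernel="rbf"), {"C": [0.1, 1, 10, 100]},
                      cv=10, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```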

We also compared the performance of our algorithm with mRMR. Results for the SMS Spam 2 dataset are shown in Figure 2 and Figure 3; the plots demonstrate the efficacy of our algorithm against the state-of-the-art mRMR algorithm. The mRMR technique could not finish execution on the remaining datasets. It can also be observed from the results in Table 1 and Figures 2 and 3 that the performance of the feature selection algorithms follows a consistent pattern on short text; in decreasing order of performance: IADWPT, mRMR, PCA, Chi Square, MI. Further, IADWPT performs well at feature selection without losing discriminative information, even when the dimensionality of the feature space is reduced to as little as 1/40th of the original, and it steadily maintains accuracy as the dimensionality is reduced, which makes it a suitable technique for aggressive dimensionality reduction. This also speeds up the learning of machine learning models due to the reduced dimensionality. We plotted the discrimination power of the coefficients in each dataset; the plots suggest that very few coefficients contain most of the discriminative power, and therefore working with just these coefficients yields good accuracy with aggressive dimensionality reduction. The results establish the effectiveness of IADWPT for compressing short text feature representations and reducing noise to improve classification accuracy.

Figure 4: Plot of discriminative power ($D^{a,b}_{m,f,l}$) values arranged in descending order.

5.1 IADWPT Effectiveness

Short text data is noisy and contains many features that are irrelevant to the classification task. IADWPT effectively removes the noise in the approximations (as the signal strength is greater than the noise); the feature selection step at the sub-band level, as described in Algorithm 1, enforces the selection of good discriminative features and thus improves classifier accuracy while reducing the feature space dimensionality at the same time. Because features from sub-bands of the signal are chosen based on their discriminative power, the original signal information is lost and the transform is not reversible.

IADWPT gives good compression of the data without losing discriminative information, even when the dimensionality is reduced to as little as 1/40th of the original feature space, and it steadily maintains accuracy as the dimensionality is reduced, which makes it a suitable technique for dimensionality reduction. This also allows machine learning models to be learned faster. Figure 4 shows the plot of discriminative power $D^{a,b}_{m,f,l}$ values for the coefficients arranged in descending order for the SMS Spam 2 dataset; the other datasets displayed similar curves. From the figure it can be observed that a few coefficients hold most of the discriminative power, so aggressive dimensionality reduction is possible with the IADWPT algorithm. The results establish the effectiveness of IADWPT for compressing short text feature representations and reducing noise.

6 Conclusion and Future Work

In this paper, we have proposed the IADWPT algorithm for effective dimensionality reduction of short text corpora. The algorithm can be used in a number of scenarios where high dimensionality and sparsity pose a challenge. Experiments demonstrate the efficacy of IADWPT-based dimensionality reduction for short text data, and the technique can prove useful for a number of social media data analysis applications. In future work, we would like to explore theoretical bounds on the best number of dimensions to choose from the wavelet representation.


References

Charu C. Aggarwal. 2002. On effective classification of strings with wavelets.
Tiago A. Almeida. SMS Spam Collection v.1. http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/, [Online; accessed 10-September-2014].
T. Almeida, J. M. G. Hidalgo, and T. P. Silva. 2013. Towards SMS spam filtering: Results under a new dataset. International Journal of Information Security Science.
John Blitzer. 2008. A survey of dimensionality reduction techniques for natural language. http://john.blitzer.com/papers/wpe2.pdf, [Online; accessed 10-September-2014].
R. R. Coifman and M. V. Wickerhauser. 2006. Entropy-based algorithms for best basis selection. IEEE Trans. Inf. Theor., 38(2):713–718, September.
Ronald R. Coifman, Yves Meyer, Steven Quake, and M. Victor Wickerhauser. 1994. Signal processing and compression with wavelet packets. In J. S. Byrnes, Jennifer L. Byrnes, Kathryn A. Hargreaves, and Karl Berry, editors, Wavelets and Their Applications, volume 442 of NATO ASI Series, pages 363–379. Springer Netherlands.
Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Mach. Learn., 20(3):273–297, September.
Ingrid Daubechies. 1992. Ten Lectures on Wavelets. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA.
I. Daubechies. 2006. The wavelet transform, time-frequency localization and signal analysis. IEEE Trans. Inf. Theor., 36(5):961–1005, September.
Bo Han and Timothy Baldwin. 2011. Lexical normalisation of short text messages. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 368–378, Stroudsburg, PA, USA. Association for Computational Linguistics.
Rachel E. Learned and Alan S. Willsky. 1995. A wavelet packet approach to transient signal classification. Applied and Computational Harmonic Analysis, 2(3):265–278.
Tao Li, Qi Li, Shenghuo Zhu, and Mitsunori Ogihara. 2002. A survey on wavelet applications in data mining. SIGKDD Explor. Newsl., 4(2):49–68, December.
Preslav Nakov, Zornitsa Kozareva, Alan Ritter, Sara Rosenthal, Veselin Stoyanov, and Theresa Wilson. SemEval-2013 Task 2: Sentiment analysis in Twitter.
Hanchuan Peng, Fuhui Long, and Chris Ding. 2005. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27:1226–1238.
Robi Polikar. Wavelet Tutorial. http://users.rowan.edu/~polikar/wavelets/wttutorial.html, [Online; accessed 10-January-2015].
Hammad Qureshi, Olcay Sertel, Nasir Rajpoot, Roland Wilson, and Metin Gurcan. 2008. Adaptive discriminant wavelet packet transform and local binary patterns for meningioma subtype classification. In Dimitris N. Metaxas, Leon Axel, Gabor Fichtinger, and Gábor Székely, editors, MICCAI (2), volume 5242 of Lecture Notes in Computer Science, pages 196–204. Springer.
N. M. Rajpoot. 2003. Local discriminant wavelet packet basis for texture classification. In SPIE Wavelets X, pages 774–783. SPIE.
Geraldo Xexeo, Jano de Souza, Patricia F. Castro, and Wallace A. Pinheiro. 2008. Using wavelets to classify documents. Web Intelligence and Intelligent Agent Technology, IEEE/WIC/ACM International Conference on, 1:272–278.
Kuldeep Yadav, Ponnurangam Kumaraguru, Atul Goyal, Ashish Gupta, and Vinayak Naik. 2011. SMSAssassin: Crowdsourcing driven mobile-based system for SMS spam filtering. In Proceedings of the 12th Workshop on Mobile Computing Systems and Applications, HotMobile '11, pages 1–6, New York, NY, USA. ACM.
Yiming Yang and Jan O. Pedersen. 1997. A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning, ICML '97, pages 412–420, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.



Learning Adjective Meanings with a Tensor-Based Skip-Gram Model

Jean Maillard
University of Cambridge
Computer [email protected]

Stephen Clark
University of Cambridge
Computer [email protected]

Abstract

We present a compositional distributional semantic model which is an implementation of the tensor-based framework of Coecke et al. (2011). It is an extended skip-gram model (Mikolov et al., 2013) which we apply to adjective-noun combinations, learning nouns as vectors and adjectives as matrices. We also propose a novel measure of adjective similarity, and show that adjective matrix representations lead to improved performance in adjective and adjective-noun similarity tasks, as well as in the detection of semantically anomalous adjective-noun pairs.

1 Introduction

A number of approaches have emerged for combining compositional and distributional semantics. Some approaches assume that all words and phrases are represented by vectors living in the same semantic space, and use mathematical operations such as vector addition and element-wise multiplication to combine the constituent vectors (Mitchell and Lapata, 2008). In these relatively simple methods, the composition function does not typically depend on its arguments or their syntactic role in the sentence.

An alternative which makes more use of grammatical structure is the recursive neural network approach of Socher et al. (2010). Constituent vectors in a phrase are combined using a matrix and non-linearity, with the resulting vector living in the same vector space as the inputs. The matrices can be parameterised by the syntactic type of the combining words or phrases (Socher et al., 2013; Hermann and Blunsom, 2013). Socher et al. (2012) extend this idea by representing the meanings of words and phrases as both a vector and a matrix, introducing a form of lexicalisation into the model.

A further extension, which moves us closer to formal semantics (Dowty et al., 1981), is to build a semantic representation in step with the syntactic derivation, and have the embeddings of words be determined by their syntactic type. Coecke et al. (2011) achieve this by treating relational words such as verbs and adjectives as functions in the semantic space. The functions are assumed to be multilinear maps, and are therefore realised as tensors, with composition being achieved through tensor contraction.1 While the framework specifies the "shape" or semantic type of these tensors, it makes no assumption about how the values of these tensors should be interpreted (nor how they can be learned).

A proposal for the case of adjective-noun combinations is given by Baroni and Zamparelli (2010) (and also Guevara (2010)). Their model represents adjectives as matrices over noun space, trained via linear regression to approximate the "holistic" adjective-noun vectors from the corpus.

In this paper we propose a new solution to the problem of learning adjective meaning representations. The model is an implementation of the tensor framework of Coecke et al. (2011), here applied to adjective-noun combinations as a starting point. Like Baroni and Zamparelli (2010), our model also learns nouns as vectors and adjectives as matrices, but uses a skip-gram approach with negative sampling (Mikolov et al., 2013), extended to learn matrices.

We also propose a new way of quantifying adjective similarity, based on the action of adjectives on nouns (consistent with the view that adjectives are functions). We use this new measure instead of the naive cosine similarity function applied to matrices, and obtain competitive performance compared to the baseline skip-gram vectors (Mikolov et al., 2013) on an adjective similarity task. We also perform competitively on the adjective-noun similarity dataset from Mitchell and Lapata (2010).

1Baroni et al. (2014) have developed a similar approach.


Figure 1: Learning the vector for apple in the context pick one ripe apple from the tree. The vector for apple is updated in order to increase the inner product with green vectors and decrease it with red ones, which are negatively sampled.

Finally, the tensor-based skip-gram model also leads to improved performance in the detection of semantically anomalous adjective-noun phrases, compared to previous work.

2 A tensor-based skip-gram model

Our model treats adjectives as linear maps over the vector space of noun meanings, encoded as matrices. The algorithm works in two stages: the first stage learns the noun vectors, as in a standard skip-gram model, and the second stage learns the adjective matrices, given fixed noun vectors.

2.1 Training of nouns

To learn noun vectors, we use a skip-gram model with negative sampling (Mikolov et al., 2013). Each noun n in the vocabulary is assigned two d-dimensional vectors: a content vector $\mathbf{n}$, which constitutes the embedding, and a context vector $\mathbf{n}'$. For every occurrence of a noun n in the corpus, the embeddings are updated in order to maximise the objective function

$$\sum_{\mathbf{c}' \in C} \log \sigma(\mathbf{n} \cdot \mathbf{c}') + \sum_{\bar{\mathbf{c}}' \in \bar{C}} \log \sigma(-\mathbf{n} \cdot \bar{\mathbf{c}}'), \qquad (1)$$

where $C$ is the set of contexts for the current noun and $\bar{C}$ is a set of negative contexts. The contexts are taken to be the vectors of words in a fixed window around the noun, while the negative contexts are vectors for $k$ words sampled from a unigram distribution raised to the power of 3/4 (Goldberg, 2014). In our experiments, we have set $k = 5$.

Figure 2: Learning the matrix for unripe in the context the green small unripe apple tasted very sour. The matrix for unripe is updated to increase the inner product of the vector for unripe apple with green vectors and decrease it with red ones.

After each step, both content and context vectors are updated via back-propagation. This procedure leads to noun embeddings (content vectors) which have a high inner product with the vectors of words in the context of the noun, and a low inner product with vectors of negatively sampled words. Fig. 1 shows this intuition.
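As a rough illustration of the update implied by Equation 1, the following is a minimal NumPy sketch of one stochastic update for a single noun occurrence. The arrays `content` and `context`, the learning rate, and the pre-sampled positive and negative context indices are assumptions for illustration; this is not the authors' implementation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def update_noun(content, context, noun_id, pos_ids, neg_ids, lr=0.025):
    """One negative-sampling update for a single noun occurrence (Eq. 1 sketch).

    content, context: (vocab_size, d) arrays of content and context vectors.
    pos_ids: indices of words in the context window; neg_ids: sampled negatives.
    """
    n = content[noun_id]
    grad_n = np.zeros_like(n)
    for c_id, label in [(i, 1.0) for i in pos_ids] + [(i, 0.0) for i in neg_ids]:
        c = context[c_id]
        # gradient of log sigma(+/- n.c) with respect to n and c
        g = label - sigmoid(np.dot(n, c))
        grad_n += g * c
        context[c_id] += lr * g * n
    content[noun_id] += lr * grad_n
```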

2.2 Training of adjectives

Each adjective a in the vocabulary is assigned a matrix $A$, initialised to the identity plus uniform noise. First, all adjective-noun phrases $(a, n)$ are extracted from the corpus. For each $(a, n)$ pair, the corresponding adjective matrix $A$ and noun vector $\mathbf{n}$ are multiplied to compute the adjective-noun vector $A\mathbf{n}$. The matrix $A$ is then updated to maximise the objective function

$$\sum_{\mathbf{c}' \in C} \log \sigma(A\mathbf{n} \cdot \mathbf{c}') + \sum_{\bar{\mathbf{c}}' \in \bar{C}} \log \sigma(-A\mathbf{n} \cdot \bar{\mathbf{c}}'). \qquad (2)$$

The contexts $C$ are taken to be the vectors of words in a window around the adjective-noun phrase, while the negative contexts $\bar{C}$ are again vectors of randomly sampled words. Matrices are initialised to the identity, while the context vectors are the results of Section 2.1.

Finally, the matrix $A$ is updated via back-propagation. Equation 2 means that the induced matrices will have the following property: when multiplying the matrix with a noun vector, the resulting adjective-noun vector will have a high inner product with words in the context window of the adjective-noun phrase, and a low inner product with negatively sampled words. This is exemplified in Figure 2.
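Mirroring the noun sketch above, a minimal sketch of one update of an adjective matrix under Equation 2 looks as follows; here the noun vector and the context vectors are held fixed, and all array names and the learning rate are illustrative assumptions.

```python
import numpy as np


def update_adjective(A, noun_vec, context, pos_ids, neg_ids, lr=0.025):
    """One negative-sampling update of an adjective matrix A (Eq. 2 sketch).

    A: (d, d) adjective matrix; noun_vec: (d,) noun vector (kept fixed);
    context: (vocab_size, d) context vectors from the noun-training stage.
    """
    an = A @ noun_vec  # adjective-noun vector An
    grad_A = np.zeros_like(A)
    for c_id, label in [(i, 1.0) for i in pos_ids] + [(i, 0.0) for i in neg_ids]:
        c = context[c_id]
        g = label - 1.0 / (1.0 + np.exp(-np.dot(an, c)))
        # d/dA of log sigma(+/- (An).c) is the outer product of c and n
        grad_A += g * np.outer(c, noun_vec)
    A += lr * grad_A
```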


2.3 Similarity measure

The similarity of two vectors $\mathbf{n}$ and $\mathbf{m}$ is generally measured using the cosine similarity function (Turney and Pantel, 2010; Baroni et al., 2014),

$$\mathrm{vecsim}(\mathbf{n},\mathbf{m}) = \frac{\mathbf{n} \cdot \mathbf{m}}{\|\mathbf{n}\|\,\|\mathbf{m}\|}.$$

Based on tests using a development set, we found that using cosine to measure the similarity of adjective matrices leads to no correlation with gold-standard similarity judgements. Cosine similarity, while suitable for vectors, does not capture any information about the function of matrices as linear maps. We postulate that a suitable measure of the similarity of two adjectives should be related to how similarly they transform nouns.

Consider two adjective matrices $A$ and $B$. If $A\mathbf{n}$ and $B\mathbf{n}$ are similar vectors for every noun vector $\mathbf{n}$, then we deem the adjectives to be similar. Therefore, one possible measure involves calculating the cosine distance between the images of all nouns under the two adjectives, and taking the average or median of these distances. Rather than working with every noun in the vocabulary, which is expensive, we instead take the most frequent nouns, cluster them, and use the cluster centroids (obtained in our case using k-means). The resulting similarity function is given by

$$\mathrm{matsim}(A,B) = \underset{\mathbf{n} \in N}{\mathrm{median}}\ \mathrm{vecsim}(A\mathbf{n}, B\mathbf{n}), \qquad (3)$$

where the median is taken over the set of cluster centroids $N$.²

² We chose the median instead of the average as it is more resistant to outliers in the data.
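Equation 3 is straightforward to compute once the noun clusters are available. The sketch below uses scikit-learn's k-means; the number of clusters is an illustrative assumption, as the paper does not fix it in this passage.

```python
import numpy as np
from sklearn.cluster import KMeans


def matsim(A, B, noun_vectors, n_clusters=100):
    """Sketch of the matsim measure in Eq. 3: median cosine similarity of the
    images of k-means cluster centroids under the two adjective matrices.

    noun_vectors: (num_nouns, d) array of frequent-noun embeddings.
    """
    centroids = KMeans(n_clusters=n_clusters, n_init=10).fit(
        noun_vectors).cluster_centers_
    imgs_a = centroids @ A.T   # A n for every centroid n
    imgs_b = centroids @ B.T   # B n for every centroid n
    cos = np.sum(imgs_a * imgs_b, axis=1) / (
        np.linalg.norm(imgs_a, axis=1) * np.linalg.norm(imgs_b, axis=1))
    return float(np.median(cos))
```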

3 Evaluation

The model is trained on a dump of the English Wikipedia, automatically parsed with the C&C parser (Clark and Curran, 2007). The corpus contains around 200 million noun examples, and 30 million adjective-noun examples. For every context word in the corpus, 5 negative words are sampled from the unigram distribution. Subsampling is used to decrease the number of frequent words (Mikolov et al., 2013). We train 100-dimensional noun vectors and 100×100-dimensional adjective matrices.

3.1 Word Similarity

First we test word, rather than phrase, similarity on the MEN test collection (Bruni et al., 2014), which contains a set of POS-tagged word pairs together with gold-standard human similarity judgements. We use the POS tags to select all noun-noun and adjective-adjective pairs, leaving us with a set of 643 noun-noun pairs and 96 adjective-adjective pairs. For the noun-noun dataset, we are testing the quality of the 100-dimensional noun vectors from the first stage of the tensor-based skip-gram model (TBSG), which is essentially word2vec applied to just learning noun vectors. These are compared to the 300-dimensional SKIPGRAM vectors available from the word2vec page (which have been trained on a very large news corpus).³

The adjective-adjective pairs are used to test the 100×100 matrices obtained from our TBSG model, again compared to the 300-dimensional SKIPGRAM vectors. The Spearman correlations between human judgements and the similarity of vectors are reported in Tables 1 and 2. Note that for adjectives we used the similarity measure described in Section 2.3. Table 1 shows that the noun vectors we use are of high quality, performing comparably to the SKIPGRAM noun vectors on the noun-noun similarity data. Table 2 shows that our TBSG adjective matrices, plus the new similarity measure, also perform comparably to the SKIPGRAM adjective vectors on the adjective-adjective similarity data.

Table 1: Spearman rank correlation on the noun similarity task.
MODEL          CORRELATION
SKIPGRAM-300   0.776
TBSG-100       0.769

Table 2: Spearman rank correlation on the adjective similarity task.
MODEL          CORRELATION
TBSG-100×100   0.645
SKIPGRAM-300   0.638

3.2 Phrase Similarity

The TBSG model aims to learn matrices that act in a compositional manner. Therefore, a more interesting evaluation of its performance is to test how well the matrices combine with noun vectors.

3http://word2vec.googlecode.com/


Table 3: Spearman rank correlation on the adjective-noun similarity task.
MODEL                    CORRELATION
TBSG-100                 0.50
SKIPGRAM-300 (add)       0.48
SKIPGRAM-300 (N only)    0.43
TBSG-100 (N only)        0.42
REG-600                  0.37
humans                   0.52

We use the Mitchell and Lapata (2010) adjective-noun similarity dataset, which contains pairs of adjective-noun phrases such as last number – vast majority together with gold-standard human similarity judgements. For the evaluation, we calculate the Spearman correlation between non-averaged human similarity judgements and the cosine similarity of the vectors produced by various compositional models.

The results in Table 3 show that TBSG has the best correlation with human judgements among the models tested. It outperforms SKIPGRAM vectors with both addition and element-wise multiplication as composition functions (the latter not shown in the table, as it is worse than addition). Also reported is the baseline performance of SKIPGRAM and TBSG when using only nouns to compute similarity (ignoring the adjectives). It is interesting to note that TBSG also outperforms the result of the matrix-vector linear regression method (REG-600) of Baroni and Zamparelli (2010) as reported by Vecchi et al. (2015) on the same dataset. Their method trains a matrix for every adjective via linear regression to approximate corpus-extracted "holistic" adjective-noun vectors, and is therefore similar in spirit to TBSG.

3.3 Semantic Anomaly

Finally, we use the model to distinguish between semantically acceptable and anomalous adjective-noun phrases, using the data from Vecchi et al. (2011). The data consists of two sets: a set of unobserved acceptable phrases (e.g. ethical statute) and one of deviant phrases (e.g. cultural acne). Following Vecchi et al. (2011) we use two indices of semantic anomaly. The first, denoted COSINE, is the cosine between the adjective-noun vector and the noun vector; this is based on the hypothesis that deviant adjective-noun vectors will form a wider angle with the noun vector. The second index, denoted DENSITY, is the average cosine distance between the adjective-noun vector and its 10 nearest noun neighbours; this measure is based on the hypothesis that nonsensical adjective-nouns should not have many neighbours in the space of (meaningful) nouns.⁴ These two measures are computed for the acceptable and deviant sets, and compared using a two-tailed Welch's t-test.

Table 4: Correlation on test data for semantic anomalies. Significance levels are marked *** for p < 0.001, ** for p < 0.01.
            COSINE           DENSITY
MODEL       t       sig.     t      sig.
TBSG-100    5.16    ***      5.72   ***
ADD-300     0.31             2.63   **
MUL-300     -0.56            2.68   **
REG-300     0.48             3.12   **

Table 4 compares the performance of TBSG with the results for count-based vectors using addition (ADD) and element-wise multiplication (MUL) reported by Vecchi et al. (2011), as well as the matrix-vector linear regression method (REG-300) of Baroni and Zamparelli (2010). TBSG obtains the highest scores with both measures.
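For concreteness, the two anomaly indices described above can be sketched as follows; the function and argument names are illustrative, and the noun space over which neighbours are searched is assumed to be given as a matrix of noun vectors.

```python
import numpy as np


def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def anomaly_indices(A, noun_vec, noun_space, k=10):
    """Sketch of the COSINE and DENSITY indices for one adjective-noun phrase.

    A: adjective matrix; noun_vec: noun vector; noun_space: (N, d) array of
    noun vectors used to find the nearest neighbours of the composed phrase.
    """
    an = A @ noun_vec
    cos_index = cosine(an, noun_vec)
    # DENSITY: average cosine distance to the k nearest noun neighbours of An
    sims = noun_space @ an / (
        np.linalg.norm(noun_space, axis=1) * np.linalg.norm(an))
    nearest = np.sort(sims)[-k:]                    # k most similar nouns
    density_index = float(np.mean(1.0 - nearest))   # average cosine distance
    return cos_index, density_index
```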

4 Conclusions

In this paper we have implemented the tensor-based framework of Coecke et al. (2011) in the form of a skip-gram model extended to learn higher-order embeddings, in this case adjectives as matrices. While adjectives and nouns are learned separately in this study, an obvious extension is to learn the embeddings jointly. We find the tensor-based skip-gram model particularly attractive for the obvious ways in which it can be extended to other parts of speech (Maillard et al., 2014). For example, in this framework transitive verbs are third-order tensors which yield a sentence vector when contracted with subject and object vectors. Assuming contextual representations of sentences, these could be learned by the tensor-based skip-gram as a straightforward extension from second-order (matrices) to third-order tensors (and potentially beyond for words requiring even higher-order tensors).

4 Vecchi et al. (2011) also use a third index of semantic anomaly, based on the length of adjective-noun vectors. We omit this measure as we deem it unsuitable for models not based on context counts and elementwise vector operations.


Acknowledgements

Jean Maillard is supported by an EPSRC Doctoral Training Grant and a St John's Scholarship. Stephen Clark is supported by ERC Starting Grant DisCoTex (306920) and EPSRC grant EP/I037512/1. We would like to thank Tamara Polajnar, Laura Rimell, and Eva Vecchi for useful discussion.

References

Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: representing adjective-noun constructions in semantic space. Proceedings of EMNLP, 1183–1193.
Marco Baroni, Raffaela Bernardi, and Roberto Zamparelli. 2014. Frege in space: A program for compositional distributional semantics. Linguistic Issues in Language Technologies, 9(6), 5–110.
Marco Baroni, Georgiana Dinu, and German Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. Proceedings of ACL, 238–247.
Elia Bruni, Nam Khanh Tran, and Marco Baroni. 2014. Multimodal Distributional Semantics. Journal of Artificial Intelligence Research, 49, 1–47.
Stephen Clark and James R. Curran. 2007. Wide-coverage efficient statistical parsing with CCG and log-linear models. Computational Linguistics, 33(4), 493–552.
Bob Coecke, Mehrnoosh Sadrzadeh, and Stephen Clark. 2011. Mathematical Foundations for a Compositional Distributional Model of Meaning. Linguistic Analysis, 36 (Lambek Festschrift), 345–384.
David R. Dowty, Robert E. Wall, and Stanley Peters. 1981. Introduction to Montague Semantics. Dordrecht.
Yoav Goldberg and Omer Levy. 2014. word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint 1402.3722.
Emiliano Guevara. 2010. A regression model of adjective-noun compositionality in distributional semantics. Proceedings of the ACL GEMS workshop, 33–37.
Karl Moritz Hermann and Phil Blunsom. 2013. The role of syntax in vector space models of compositional semantics. Proceedings of ACL, 894–904.
Jean Maillard, Stephen Clark, and Edward Grefenstette. 2014. A Type-Driven Tensor-Based Semantics for CCG. Proceedings of the EACL 2014 Type Theory and Natural Language Semantics Workshop, 46–54.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. Proceedings of NIPS, 3111–3119.
Jeff Mitchell and Mirella Lapata. 2008. Vector-based models of semantic composition. Proceedings of ACL-08, 263–244.
Jeff Mitchell and Mirella Lapata. 2010. Composition in Distributional Models of Semantics. Cognitive Science, 34(8), 1388–1439.
Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2010. Learning continuous phrase representations and syntactic parsing with recursive neural networks. Proceedings of the NIPS Deep Learning and Unsupervised Feature Learning Workshop, 1–9.
Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. Proceedings of EMNLP, 1201–1211.
Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013. Parsing with Compositional Vector Grammars. Proceedings of ACL, 455–465.
Mark Steedman. 2000. The Syntactic Process. MIT Press.
Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: vector space models of semantics. Journal of Artificial Intelligence Research, 37, 141–188.
Eva M. Vecchi, Marco Baroni, and Roberto Zamparelli. 2011. (Linear) maps of the impossible: Capturing semantic anomalies in distributional space. Proceedings of the ACL DISCo Workshop, 1–9.
Eva M. Vecchi, Marco Marelli, Roberto Zamparelli, and Marco Baroni. 2015. Spicy adjectives and nominal donkeys: Capturing semantic deviance using compositionality in distributional spaces. Accepted for publication.



Model Selection for Type-Supervised Learning with Application to POS Tagging

Kristina Toutanova
Microsoft Research
Redmond, WA, USA

Waleed Ammar*
School of Computer Science
Carnegie Mellon University

Pallavi Chourdhury and Hoifung Poon
Microsoft Research
Redmond, WA, USA

Abstract

Model selection (picking, for example, the feature set and the regularization strength) is crucial for building high-accuracy NLP models. In supervised learning, we can estimate the accuracy of a model on a subset of the labeled data and choose the model with the highest accuracy. In contrast, here we focus on type-supervised learning, which uses constraints over the possible labels for word types for supervision, and labeled data is either not available or very small. For the setting where no labeled data is available, we perform a comparative study of previously proposed and one novel model selection criterion on type-supervised POS-tagging in nine languages. For the setting where a small labeled set is available, we show that the set should be used for semi-supervised learning rather than for model selection only – using it for model selection reduces the error by less than 5%, whereas using it for semi-supervised learning reduces the error by 44%.

1 Introduction

Fully supervised training of NLP models (e.g., part-of-speech taggers, named entity recognizers, relation extractors) works well when plenty of labeled examples are available. However, manually labeled corpora are expensive to construct in many languages and domains, whereas an alternative, if weaker, supervision is often readily available. For example, corpora labeled with POS tags at the token level are only available for around 35 languages, while tag dictionaries of the

* This research was conducted during the author's internship at Microsoft Research.

form displayed in Fig. 1 are available for many more languages, either in commercial dictionaries or community created resources such as Wiktionary. Tag dictionaries provide type-level supervision for word types in the lexicon. Similarly, while sentences labeled with named entities are scarce, gazetteers and databases are more readily available (Bollacker et al., 2008).

There has been substantial research on how best to build models using such type-level supervision, for POS tagging, super sense tagging, NER, and relation extraction (Craven et al., 1999; Smith and Eisner, 2005; Carlson et al., 2009; Mintz et al., 2009; Johannsen et al., 2014), inter alia, focussing on parametric forms and loss functions for model training. However, there has been little research on the practically important aspect of model selection for type-supervised learning. While some previous work used criteria based on the type-level supervision only (Smith and Eisner, 2005; Goldwater and Griffiths, 2007), much prior work used a labeled set for model selection (Vaswani et al., 2010; Soderland and Weld, 2014). We are not aware of any prior work aiming to compare or improve existing type-supervised model selection criteria.

For POS tagging, there is also work on using both type-level supervision from lexicons, and projection from another language (Tackstrom et al., 2013). Methods for training with a small labeled set have also been developed (Søgaard, 2011; Garrette and Baldridge, 2013; Duong et al., 2014), but there have not been studies on the utility of a small labeled set for model selection versus model training. Our contributions are: 1) a simple and generally applicable model selection criterion for type-supervised learning, 2) the first multi-lingual systematic evaluation of model selection criteria for type-supervised models, 3) empirical evidence that, if a small labeled set is available, the set should be used for semi-supervised learning and not only for model selection.


Figure 1: A tag dictionary (lexicon) for Greek, listing the possible tags for each word type (e.g. σταθερά: adj., adv., noun). The splits into lex_train and lex_dev are discussed in §2.

2 Model selection and training for type-supervised learning

Notation. In type-supervised learning, we have unlabeled text $\mathcal{T} = \{x\}$ of token sequences, and a lexicon lex which lists the possible labels for word types. Model training finds the model parameters $\theta$ which minimize a training loss function $L(\theta; lex, \mathcal{T}, \mathbf{h})$. We use $\mathbf{h}$ to represent the configurations and modeling decisions (also known as hyperparameters); examples include the dependency structure between variables, feature templates, and regularization strengths. Given a set of fully-specified hyperparameter configurations $\{\mathbf{h}_1, \ldots, \mathbf{h}_M\}$, model selection aims to find the configuration $\mathbf{h}_m$ that maximizes the expected performance of the corresponding model $\theta_m$ according to a suitable accuracy measure. $x$ is a token sequence, $y$ is a label sequence, and $lex[x]$ is the set of label sequences compatible with token sequence $x$ according to lex.

Task. For the application in this paper, the task is type-supervised POS tagging, and the parametric model family we consider is that of featurized first-order HMMs (Berg-Kirkpatrick et al., 2010). The hyperparameters specify the feature set used and the strength of an L2 regularizer on the parameters.

Evaluation function. The evaluation function used in model selection is the main focus of this work. We use a function $\mathrm{eval}(m, \mathcal{T}_{dev})$ to estimate the performance of the model trained with hyperparameters $\mathbf{h}_m$ on a development set $\mathcal{T}_{dev}$. In the following subsections, we discuss definitions of eval when the development set $\mathcal{T}_{dev}$ is labeled and when it is unlabeled, respectively.

2.1 Tdev is labeled

When the development set $\mathcal{T}_{dev}$ is labeled, a natural choice of eval is token-level prediction accuracy:

$$\mathrm{eval}_{sup}(m, \mathcal{T}_{dev}) = \frac{\sum_{i=1}^{|\mathcal{T}_{dev}|} \mathbf{1}(y_m[i] = y_{gold}[i])}{|\mathcal{T}_{dev}|}$$

Here, we use $i$ to index all tokens in $\mathcal{T}_{dev}$; $y_{gold}[i]$ denotes the correct POS tag, and $y_m[i]$ denotes the predicted POS tag of the $i$-th token obtained with hyperparameters $\mathbf{h}_m$.

2.2 $\mathcal{T}_{dev}$ is unlabeled

When token supervision is not available, we cannot compute $\mathrm{eval}_{sup}$. Instead, previous work on POS tagging with type supervision (Smith and Eisner, 2005) used:

$$\mathrm{eval}_{cond}(m, \mathcal{T}_{dev}) = \sum_{x \in \mathcal{T}_{dev}} \log \sum_{y \in lex[x]} p_{\theta_m}(y \mid x),$$

$$\mathrm{eval}_{joint}(m, \mathcal{T}_{dev}) = \sum_{x \in \mathcal{T}_{dev}} \log \sum_{y \in lex[x]} p_{\theta_m}(x, y)$$

$\mathrm{eval}_{cond}$ estimates the conditional log-likelihood of "lex-compatible" labels given token sequences, while $\mathrm{eval}_{joint}$ estimates the joint log-likelihood of lex-compatible labels and token sequences.
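For a first-order HMM, the inner sum over lex-compatible tag sequences can be computed with a forward pass restricted, at each position, to the tags the lexicon allows. The following is a minimal sketch rather than the authors' code; the data structures (log_trans, log_emit, lexicon) and the uniform start distribution are illustrative assumptions.

```python
import numpy as np
from scipy.special import logsumexp


def joint_lex_loglik(log_trans, log_emit, sent, lexicon):
    """Sketch of the inner sum of eval_joint for one sentence and an HMM:
    log of the sum over lex-compatible tag sequences y of p(x, y).

    log_trans: (T, T) array of log p(tag_j | tag_i); log_emit[t]: dict mapping
    a word to log p(word | tag t); sent: list of word types; lexicon: dict
    mapping a word to its set of allowed tags (all tags if the word is absent).
    """
    num_tags = log_trans.shape[0]

    def allowed(word):
        return lexicon.get(word, range(num_tags))

    # forward pass restricted to lexicon-compatible tags
    alpha = np.full(num_tags, -np.inf)
    for t in allowed(sent[0]):
        alpha[t] = -np.log(num_tags) + log_emit[t].get(sent[0], -np.inf)
    for word in sent[1:]:
        new_alpha = np.full(num_tags, -np.inf)
        for t in allowed(word):
            new_alpha[t] = logsumexp(alpha + log_trans[:, t]) + \
                           log_emit[t].get(word, -np.inf)
        alpha = new_alpha
    return logsumexp(alpha)
```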

The held-out lexicon criterion. We propose a new model selection criterion which estimates prediction accuracy more directly and is applicable to any model type, without requiring that the model define conditional or joint probabilities of label sequences. The idea behind the proposed criterion is simple: we hold out a portion of the lexicon entries and use it to estimate model performance as follows:

$$\mathrm{eval}_{devlex}(m, \mathcal{T}) = \sum_{i=1:\, x_i \in lex_{dev}}^{|\mathcal{T}|} \frac{\mathbf{1}(y_m[i] \in lex[x_i])}{|lex[x_i]| \times |\mathcal{T}_{x \in lex_{dev}}|}$$

where $lex_{dev}$ is the held-out portion of the lexicon entries, and $x_i$ is the $i$-th token in $\mathcal{T}$.
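The criterion is cheap to compute once the model trained with the reduced lexicon has tagged the unlabeled text. A minimal sketch follows; the argument names are illustrative, and it assumes the predictions were produced by a model trained with lex_train (the lexicon minus the held-out entries), as described in the next paragraphs.

```python
def eval_devlex(tokens, predicted_tags, lex, lex_dev):
    """Sketch of the held-out lexicon criterion defined above.

    tokens: word types of every token in the unlabeled text T;
    predicted_tags: tags predicted by the model trained with lex_train;
    lex: full lexicon (word -> set of allowed tags); lex_dev: set of word
    types held out into the development lexicon.
    """
    covered = [i for i, w in enumerate(tokens) if w in lex_dev]
    if not covered:
        return 0.0
    score = 0.0
    for i in covered:
        word, tag = tokens[i], predicted_tags[i]
        # credit 1/|lex[x_i]| when the predicted tag is lexicon-compatible
        score += (tag in lex[word]) / len(lex[word])
    return score / len(covered)
```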

The remainder of this section details the theory behind this criterion. The expected token-level accuracy of a model trained with hyperparameters $\mathbf{h}_m$ is defined as $\mathbb{E}_{(x, y_{gold}, y_m) \sim D}[\mathbf{1}(y_m = y_{gold})]$, where $D$ is a joint distribution over tokens $x$ (in context), their gold labels $y_{gold}$, and the predicted labels $y_m$ (for the configuration $\mathbf{h}_m$). Since, when no labeled data is available, we do not have access to samples from $D$, we derive an approximation to this distribution using lex and $\mathcal{T}$.

We first split the full lexicon into a training lexicon $lex_{train}$ and a held-out (or development) lexicon $lex_{dev}$ (see Fig. 1), by sampling words according to their token frequency $c(x)$ in $\mathcal{T}$, and placing them in the development or training portions of the lexicon such that $lex_{dev}$ covers 25% of the tokens in $\mathcal{T}$. The goal of the sampling process is to make the distribution of word tags for words in the development lexicon representative of the tag distribution for all words.

Given $\mathbf{h}_m$, we train a tagging model using $lex_{train}$ and use it to predict labels $y_m$ for all tokens in $\mathcal{T}$. We then use the word tokens covered by the development lexicon and their predicted tags $y_m$ to approximate $D$ by letting $P(y_{gold} \mid x)$ be a uniform distribution over gold labels consistent with the lexicon for $x$, resulting in the following approximation:

$$P_D(x, y_{gold}, y_m) \propto \frac{c(x, y_m) \times \mathbf{1}(x \in lex_{dev},\, y_{gold} \in lex_{dev}[x])}{|lex_{dev}[x]|}$$

We then compute the expected accuracy as $\mathrm{eval}_{devlex} = \mathbb{E}_D[\mathbf{1}(y_{gold} = y_m)]$, and select the hyperparameter configuration $m$ which maximizes this criterion, then re-train the model with the full lexicon lex.¹

3 How to best use a small labeled set TL?

Several prior works used a labeled set for supervised hyper-parameter selection even when only type-level supervision is assumed to be available for training (Vaswani et al., 2010; Soderland and Weld, 2014). In this section, we want to answer the question: if a small labeled set is available, what are the potential gains from using it for model selection only, versus using it for both model training and model selection?

A simple way to use a small labeled set for parameter training together with a larger unlabeled set in our type-supervised learning setting is to do semi-supervised model training as follows (Nigam et al., 2000). Starting with our training loss function $L(\theta; lex, \mathcal{T}_U, \mathbf{h}_m)$, defined using a lexicon lex and unlabeled set $\mathcal{T}_U$, we define a combined loss function using both the unlabeled set $\mathcal{T}_U$ and the labeled set $\mathcal{T}_L$:

$$L(\theta; lex, \mathcal{T}_U, \mathbf{h}_m) + \lambda\, L(\theta; lex, \mathcal{T}_L, \mathbf{h}_m).$$

We then select parameters $\theta_m$ to minimize the new loss function, where $\lambda$ is now an additional hyperparameter that usually gives more weight to the labeled set. An advantage of this method is that it can be applied to any type-supervised model using less than 100 lines of code.² We implement this method for semi-supervised training, and we use the labeled set both for semi-supervised model training and for hyper-parameter selection using a standard five-fold cross-validation approach.³

¹ Note that this criterion underestimates the performance of all models in consideration, by virtue of evaluating model versions trained using a subset of the full lexicon, but it can still be useful for ranking the models.
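The hyper-parameter selection loop described above (including the choice of the labeled-set weight $\lambda$) can be sketched as follows. This is only an outline under stated assumptions: `train_fn` and `accuracy_fn` are placeholders for the model-specific training objective and token-accuracy computation, not functions defined in the paper.

```python
import numpy as np
from sklearn.model_selection import KFold


def select_hyperparameters(labeled, unlabeled, lex, configs, train_fn, accuracy_fn):
    """Five-fold cross-validation selection sketch for the semi-supervised setup.

    configs: candidate hyperparameter settings (including the labeled-set
    weight lambda); train_fn(lex, unlabeled, labeled_subset, h) returns a
    trained model; accuracy_fn(model, held_out) returns token accuracy.
    """
    best_h, best_score = None, -np.inf
    for h in configs:
        fold_scores = []
        for train_idx, dev_idx in KFold(n_splits=5, shuffle=True).split(labeled):
            model = train_fn(lex, unlabeled, [labeled[i] for i in train_idx], h)
            fold_scores.append(accuracy_fn(model, [labeled[i] for i in dev_idx]))
        score = np.mean(fold_scores)
        if score > best_score:
            best_h, best_score = h, score
    # retrain on all available data with the selected configuration
    return train_fn(lex, unlabeled, labeled, best_h), best_h
```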

4 Experiments

We evaluate the introduced methods for model selection and training with type supervision in two type-supervised settings: when no labeled examples are available, and when a small number of labeled examples are available.

We use a feature-rich first-order HMM model (Berg-Kirkpatrick et al., 2010) with an L2 prior on feature weights.⁴ Instead of using a multinomial distribution for the local emissions and transitions, this model uses a log-linear distribution (i.e., $p(x_i \mid y_i) \propto \exp(\lambda^{\top} f(x_i, y_i))$) with a feature vector $f$ and a weight vector $\lambda$. We use the feature set described in (Li et al., 2012): transition features, word-tag features $\langle y_i, x_i \rangle$ (lowercased words with frequency greater than a threshold), whether the word contains a hyphen and/or starts with a capital letter, character suffixes, and whether the word contains a digit. We initialize the transition and emission distributions of the HMM using unambiguous words as proposed by (Zhang and DeNero, 2014).

Data. We use the Danish, Dutch, German, Greek, English, Italian, Portuguese, Spanish and Swedish datasets from the CoNLL-X and CoNLL-2007 shared tasks (Buchholz and Marsi, 2006; Nivre et al., 2007). We map the POS labels in the CoNLL datasets to the universal POS tagset (Petrov et al., 2012). We use the tag dictionaries provided by Li et al. (2012).

² The labeled data loss is the negative joint log-likelihood of the observed token sequences and labels: $-\sum_{x,y \in \mathcal{T}_L} \log p_{\theta}(x, y)$.

³ We split the labeled set into five folds, and for each setting of the hyper-parameters train five different models on 4/5-ths of the data, estimating accuracy on the remaining 1/5-th. We average the accuracy estimates from the different folds and use this as a combined estimate of the accuracy of a model trained using the full labeled and unlabeled set, given these hyperparameters. After selecting a configuration of hyperparameters using cross-validation, we then use this configuration and retrain the model on all available data.

⁴ We used a first-order HMM for simplicity, but it is possible to obtain better results using a second-order HMM (Li et al., 2012).


Figure 2: Token-level accuracy when doing training and model selection with no labeled data. Model selection with different criteria (left to right): conditional log-likelihood, joint log-likelihood, held-out lexicon, and English-labeled.

Figure 3: Token-level accuracy with (left to right): no labeled data, model selection on labeled data (300 sentences), semi-supervised training + model selection on labeled data (300 sentences).

Model configurations. For each language, we consider M = 15 configurations of the hyperparameters, which vary in the L2 regularization strength (5 values) and the minimum word frequency for word-tag features (3 values). When a small labeled set is available we additionally choose one of 3 values for the weight of the labeled set (see Section 3). We report the final performance of models selected using different criteria using token-level accuracy on an unseen test set.

No labeled examples. When no labeled examples are available, we do model training and selection using only unlabeled text $\mathcal{T}$ and a tagging lexicon lex. We compare three type-supervised model selection criteria: conditional likelihood, joint likelihood, and the held-out lexicon criterion $\mathrm{eval}_{devlex}$. Additionally, we include the performance of a method which selects the hyperparameters using labeled data in English and uses these (same) hyperparameters for English and all other languages (we call this method "English Labeled"). Fig. 2 shows the accuracy of the models chosen by each of the four criteria on nine languages, as well as the average accuracy across languages. The lower (upper) bound on average performance obtained by always choosing the worst (best) hyperparameters is 82.77 (85.83). $\mathrm{eval}_{joint}$ outperformed $\mathrm{eval}_{cond}$ on eight out of the nine languages and achieved a significantly higher average accuracy (85.05 vs 83.70). $\mathrm{eval}_{devlex}$ outperformed $\mathrm{eval}_{joint}$ on six out of nine languages, but did significantly worse on one language (German), which resulted in a slightly lower average accuracy. Choosing the hyper-parameters using English labeled data and using the same hyper-parameters for all languages performed comparably to $\mathrm{eval}_{joint}$, with slightly higher average accuracy even when limited to the non-English languages (85.0 vs 84.9). Overall the results showed that the conditional log-likelihood criterion was dominated by the other three, which were comparable in average accuracy. Looking at the eight languages excluding English (since one criterion uses labeled data for English), the newly proposed held-out lexicon criterion was the winning method on five out of eight languages, $\mathrm{eval}_{cond}$ was best on one language, $\mathrm{eval}_{joint}$ was best (or tied for best) on two, and English-labeled was tied for best on one language.

Few labeled examples. We consider two ways of leveraging the labeled examples: (i) type-supervised model training + supervised model selection: only use unlabeled examples to optimize model parameters, then use the labeled examples for supervised model selection with $\mathrm{eval}_{sup}$, and (ii) semi-supervised model training + supervised model selection (see Section 3 for details). Fig. 3 shows how much we can improve on the method with the highest average accuracy from Figure 2 ($\mathrm{eval}_{joint}$) when a small number of examples is available. Using the 300 labeled sentences for semi-supervised training and model selection reduced the error by 44.6% (compared to the model with the best average accuracy using only type-level supervision, with average performance of 85.05, the semi-supervised average is 91.8). In contrast, using the 300 sentences to select hyper-parameters only reduced the error by less than 5% (the average accuracy was 85.75). Even when only 50 labeled sentences are used for semi-supervised training and supervised model selection, we still see average accuracy boosted to 89% (results not shown in the Figure).

5 Conclusion

We presented the first comparative evaluation of model selection criteria for type-supervised POS-tagging on many languages. We introduced a novel, generally applicable model selection criterion which outperformed previously proposed ones for a majority of languages. We evaluated the utility of a small labeled set for model selection versus model training, and showed that when such a labeled set is available, it should not be used solely for supervised model selection, because using it additionally for model parameter training provides strikingly larger accuracy gains.

Acknowledgments

We thank Nathan Schneider and the anonymous reviewers for helpful suggestions.


Proceedings of the 19th Conference on Computational Natural Language Learning, pages 338–344, Beijing, China, July 30-31, 2015. ©2015 Association for Computational Linguistics

One Million Sense-Tagged Instances for Word Sense Disambiguation and Induction

Kaveh Taghipour
Department of Computer Science
National University of Singapore
13 Computing Drive
Singapore 117417
[email protected]

Hwee Tou Ng
Department of Computer Science
National University of Singapore
13 Computing Drive
Singapore 117417
[email protected]

Abstract

Supervised word sense disambiguation (WSD) systems are usually the best performing systems when evaluated on standard benchmarks. However, these systems need annotated training data to function properly. While there are some publicly available open source WSD systems, very few large annotated datasets are available to the research community. The two main goals of this paper are to extract and annotate a large number of samples and release them for public use, and to evaluate this dataset on some word sense disambiguation and induction tasks. We show that the open source IMS WSD system trained on our dataset achieves state-of-the-art results in standard disambiguation tasks and a recent word sense induction task, outperforming several task submissions and strong baselines.

1 Introduction

Identifying the meaning of a word automatically has been an interesting research topic for a few decades. The approaches used to solve this problem can be roughly categorized into two main classes: Word Sense Disambiguation (WSD) and Word Sense Induction (WSI) (Navigli, 2009). For word sense disambiguation, some systems are based on supervised machine learning algorithms (Lee et al., 2004; Zhong and Ng, 2010), while others use ontologies and other structured knowledge sources (Ponzetto and Navigli, 2010; Agirre et al., 2014; Moro et al., 2014).

There are several sense-annotated datasets for WSD (Miller et al., 1993; Ng and Lee, 1996; Passonneau et al., 2012). However, these datasets either include few samples per word sense or only cover a small set of polysemous words.

To overcome these limitations, automatic methods have been developed for annotating training samples. For example, Ng et al. (2003), Chan and Ng (2005), and Zhong and Ng (2009) used Chinese-English parallel corpora to extract samples for training their supervised WSD system. Diab (2004) proposed an unsupervised bootstrapping method to automatically generate a sense-annotated dataset. Another example of automatically created datasets is the semi-supervised method used in (Kubler and Zhekova, 2009), which employed a supervised classifier to label instances.

The two main contributions of this paper are as follows. First, we employ the same method used in (Ng et al., 2003; Chan and Ng, 2005) to semi-automatically annotate one million training samples based on the WordNet sense inventory (Miller, 1995) and release the annotated corpus for public use. To our knowledge, this annotated set of sense-tagged samples is the largest publicly available dataset for word sense disambiguation. Second, we train an open source supervised WSD system, IMS (Zhong and Ng, 2010), using our data and evaluate it against standard WSD and WSI benchmarks. We show that our system outperforms other state-of-the-art systems in most cases.

As any WSD system is also a WSI system when we treat the pre-defined sense inventory of the WSD system as the induced word senses, a WSD system can also be evaluated and used for WSI. Some researchers believe that, in some cases, WSI methods may perform better than WSD systems (Jurgens and Klapaftis, 2013; Wang et al., 2015). However, we argue that WSI systems have few advantages compared to WSD methods, and according to our results, disambiguation systems consistently outperform induction systems. Although there are some cases where WSI systems can be useful (e.g., for resource-poor languages), in most cases WSD systems are preferable because of higher accuracy and better interpretability of output.

The rest of this paper is organized as follows. Section 2 explains our methodology for creating the training data. We evaluate the extracted data in Section 3, and finally, we conclude the paper in Section 4.

2 Training Data

In order to train a supervised word sense disambiguation system, we extract and sense-tag data from a freely available parallel corpus in a semi-automatic manner. To increase the coverage and therefore the ultimate performance of our WSD system, we also make use of existing sense-tagged datasets. This section explains each step in detail.

Since the main purpose of this paper is to build and release a publicly available training set for word sense disambiguation systems, we selected the MultiUN corpus (MUN) (Eisele and Chen, 2010) produced in the EuroMatrixPlus project.1 This corpus is freely available via the project website and includes seven languages. An automatically sentence-aligned version of this dataset can be downloaded from the OPUS website,2 and therefore we decided to extract samples from this sentence-aligned version.

To extract training data from the MultiUN parallel corpus, we follow the approach described in (Chan and Ng, 2005) and select the Chinese-English part of the MultiUN corpus. The extraction method has the following steps:

1. Tokenization and word segmentation: The English side of the corpus is tokenized using the Penn TreeBank tokenizer,3 while the Chinese side of the corpus is segmented using the Chinese word segmenter of (Low et al., 2005).

2. Word alignment: After tokenizing the texts, GIZA++ (Och and Ney, 2000) is used to align English and Chinese words.

3. Part-of-speech (POS) tagging and lemmatization: After running GIZA++, we use the OpenNLP POS tagger4 and then the WordNet lemmatizer to obtain POS tags and word lemmas of the English sentence.

4. Annotation: In order to assign a WordNet sense tag to an English word w_e in a sentence, we make use of the aligned Chinese translation w_c of w_e, based on the automatic word alignment produced by GIZA++. For each sense i of w_e in the WordNet sense inventory (WordNet 1.7.1), a list of Chinese translations of sense i of w_e has been manually created. If w_c matches one of these Chinese translations of sense i, then w_e is tagged with sense i (sketched below).

1 http://www.euromatrixplus.eu/multi-un
2 http://opus.lingfil.uu.se/MultiUN.php
3 http://www.cis.upenn.edu/~treebank/tokenization.html
4 http://opennlp.apache.org
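As an illustration of the matching in step 4 only (a minimal sketch, not the released tooling; the alignment representation and the per-sense translation lists are assumed formats):

```python
def sense_tag_instance(w_e, w_c, sense_translations):
    """Return the WordNet senses of English word w_e whose manually listed
    Chinese translations include the aligned Chinese word w_c.
    sense_translations: dict mapping sense id -> set of Chinese translations
    for that sense of w_e (an assumed representation of the manual lists)."""
    return [sense for sense, translations in sense_translations.items()
            if w_c in translations]

def tag_corpus(aligned_pairs, lexicon):
    """aligned_pairs: iterable of (w_e, w_c) word-aligned token pairs from GIZA++.
    lexicon: dict mapping an English lemma to its per-sense Chinese translation
    lists. Yields (w_e, sense) for each tagged instance."""
    for w_e, w_c in aligned_pairs:
        if w_e not in lexicon:
            continue
        for sense in sense_tag_instance(w_e, w_c, lexicon[w_e]):
            # As the error analysis below notes, a Chinese word that is a valid
            # translation of more than one sense can yield multiple (noisy) tags.
            yield w_e, sense
```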

The average time needed to manually assign Chinese translations to the word senses of one word type is 20 minutes for nouns, 25 for adjectives, and 40 for verbs (Chan, 2008). The above procedure annotates the top 60% most frequent word types (nouns, verbs, and adjectives) in English, selected based on their frequency in the Brown corpus. This set of selected word types includes 649 nouns, 190 verbs, and 319 adjectives.

Since automatic sentence and word alignment can be noisy, and a Chinese word w_c can occasionally be a valid translation of more than one sense of an English word w_e, the senses tagged using the above procedure may be erroneous. To get an idea of the accuracy of the senses tagged with this procedure, we manually evaluated a subset of 1,000 randomly selected sense-tagged instances. Although the sense inventory is fine-grained (WordNet 1.7.1), the sense-tag accuracy achieved is 83.7%. We also performed an error analysis to identify the sources of errors. We found that only 4% of errors are caused by wrong sentence or word alignment. However, 69% of erroneous sense-tagged instances are the result of a Chinese word associated with multiple senses of a target English word. In such cases, the Chinese word is linked to multiple sense tags and therefore, errors in sense-tagged data are introduced. Our results are similar to those reported in (Chan, 2008).

To speed up the training process, we perform random sampling on the sense tags with more than 500 samples and limit the number of samples per sense to 500. However, all samples of senses with fewer than 500 samples are included in the training data. This sampling method ensures that rare sense tags also have training samples during the selection process.
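A minimal sketch of this per-sense cap (our own illustration; function and variable names are hypothetical):

```python
import random

def cap_samples_per_sense(instances_by_sense, cap=500, seed=0):
    """Randomly downsample senses with more than `cap` instances;
    senses with `cap` or fewer instances are kept in full."""
    rng = random.Random(seed)
    capped = {}
    for sense, instances in instances_by_sense.items():
        if len(instances) > cap:
            capped[sense] = rng.sample(instances, cap)
        else:
            capped[sense] = list(instances)
    return capped
```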

In order to improve the coverage of the training set, we augment it by adding samples from SEMCOR (SC) (Miller et al., 1993) and the DSO corpus (Ng and Lee, 1996). We only add the 28 most frequent adverbs from SEMCOR because we observe almost no improvement when adding all adverbs. We notice that the DSO corpus generally improves the performance of our system. However, since the annotated DSO corpus is copyrighted, we are unable to release a dataset including the DSO corpus. Therefore, we experiment with two different configurations, one with the DSO corpus and one without, although the released dataset will not include the DSO corpus.

                        Avg. # samples per word type
MUN (before sampling)   19,837.6
MUN                     852.5
MUN+SC                  55.4
MUN+SC+DSO              63.7

Table 3: Average number of samples per word type (WordNet 1.7.1)

Since some shared tasks use newer WordNet versions, we convert the training set sense labels using the sense mapping files provided by WordNet.5 As replicating our results requires WordNet versions 1.7.1, 2.1, and 3.0, we release our sense-tagged dataset in all three versions. Some statistics about the sense-tagged training set can be found in Table 1 to Table 3.
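A sketch of this label conversion (ours, not the released tooling; we assume the WordNet sense mapping file has already been parsed into a dictionary from old sense keys to new ones):

```python
def convert_sense_labels(instances, sense_map):
    """Map each (word, old_sense_key, features) training instance to the newer
    WordNet inventory; instances whose sense has no mapping are dropped."""
    converted, dropped = [], 0
    for word, old_key, features in instances:
        new_key = sense_map.get(old_key)
        if new_key is None:
            dropped += 1
            continue
        converted.append((word, new_key, features))
    return converted, dropped
```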

3 Evaluation

For the WSD system, we use IMS (Zhong and Ng, 2010) in our experiments. IMS is a supervised WSD system based on support vector machines (SVM). This WSD system comes with out-of-the-box pre-trained models. However, since the original training set is not released, we use our own training set (see Section 2) to train IMS and then evaluate it on standard WSD and WSI benchmarks. This section presents the results obtained on four WSD and one WSI shared tasks. The four all-words WSD shared tasks are SensEval-2 (Edmonds and Cotton, 2001), SensEval-3 task 1 (Snyder and Palmer, 2004), and both the fine-grained task 17 and coarse-grained task 7 of SemEval-2007 (Pradhan et al., 2007; Navigli et al., 2007). These all-words WSD shared tasks provide no training data to the participants. The selected word sense induction task in our experiments is SemEval-2013 task 13 (Jurgens and Klapaftis, 2013).

5 http://wordnet.princeton.edu/wordnet/download/current-version/

3.1 WSD All-Words Tasks

The results of our experiments on the WSD tasks are presented in Table 4. For the SensEval-2 and SensEval-3 test sets, we use the training set with the WordNet 1.7.1 sense inventory, and for the SemEval-2007 test sets, we use training data with the WordNet 2.1 sense inventory.

In Table 4, IMS (original) refers to the IMS system trained with the original training instances as reported in (Zhong and Ng, 2010). We also compare our systems with two other configurations obtained by training IMS on SEMCOR, and on SEMCOR plus DSO. In Table 4, these two settings are shown as IMS (SC) and IMS (SC+DSO), respectively. Finally, Rank 1 and Rank 2 are the top two participating systems in the respective all-words tasks.

As shown in Table 4, our systems (both with and without the DSO corpus as training instances) perform competitively with, and in some cases even better than, the original IMS and also the best shared task submissions. This shows that our training set is of high quality and that training a supervised WSD system using our training data achieves state-of-the-art results on the all-words tasks. Since the MUN dataset does not cover all target word types in the all-words shared tasks, the accuracy achieved with MUN alone is lower than the SC and SC+DSO settings. However, the evaluation results show that IMS trained on MUN alone often performs better than or is competitive with the WordNet Sense 1 baseline. Finally, it can be seen that adding the training instances from MUN (that is, IMS (MUN+SC) and IMS (MUN+SC+DSO)) often achieves higher accuracy than without MUN instances (IMS (SC) and IMS (SC+DSO)).

3.2 SemEval-2013 Word Sense Induction Task

In order to evaluate our system on a word sense induction task, we selected SemEval-2013 task 13, the latest WSI shared task. Unlike most other tasks that assume a single sense is sufficient for representing word senses, this task allows each instance to be associated with multiple sense labels with their applicability weights. This WSI task considers 50 lemmas, including 20 nouns, 20 verbs, and 10 adjectives, annotated with the WordNet 3.1 sense inventory. We use WordNet 3.0 in our experiments on this task.

                       noun        verb       adjective   adverb   total
MUN (before sampling)  649         190        319         0        1,158
MUN                    649         190        319         0        1,158
MUN+SC                 11,446      4,705      5,129       28       21,308
MUN+SC+DSO             11,446      4,705      5,129       28       21,308

Table 1: Number of word types in each part-of-speech (WordNet 1.7.1)

                       number of training samples
                       noun        verb       adjective   adverb   total        size
MUN (before sampling)  14,492,639  4,400,813  4,078,543   0        22,971,995   17.7 GB
MUN                    503,408     265,785    218,046     0        987,239      745 MB
MUN+SC                 582,028     341,141    251,362     6,207    1,180,738    872 MB
MUN+SC+DSO             687,871     412,482    251,362     6,207    1,357,922    939 MB

Table 2: Number of training samples in each part-of-speech (WordNet 1.7.1). The size column shows the total size of each dataset in megabytes or gigabytes.

We evaluated our system using all measures used in the shared task. The results are presented in Table 5. The columns in this table denote the scores of the various systems according to the different evaluation metrics used in the WSI shared task, which are Jaccard Index, Kδ sim, WNDCG, Fuzzy NMI, and Fuzzy B-Cubed. See (Jurgens and Klapaftis, 2013) for details of the evaluation metrics.

This table also includes the top two systems in the shared task, AI-KU (Baskaya et al., 2013) and Unimelb (Lau et al., 2013), as well as Wang-15 (Wang et al., 2015). AI-KU uses a language model to find the most likely substitutes for a target word to represent the context. The clustering method used in AI-KU is K-means, and the system gives good performance in the shared task. Unimelb relies on the Hierarchical Dirichlet Process (Teh et al., 2006) to identify the sense of a target word using positional word features. Finally, Wang-15 uses Latent Dirichlet Allocation (LDA) (Blei et al., 2003) to model the word sense and topic jointly. This system obtains high scores according to the Fuzzy B-Cubed and Fuzzy NMI measures. The last three rows are baseline systems: grouping all instances into one cluster, grouping each instance into a cluster of its own, and assigning the most frequent sense in SEMCOR to all instances. As shown in Table 5, training IMS on our training data outperforms all other systems on three out of five evaluation metrics, and performs competitively on the remaining two metrics.

IMS trained on MUN alone (IMS (MUN)) outperforms IMS (SC) and IMS (SC+DSO) in terms of the first three evaluation measures, and achieves comparable Fuzzy NMI and Fuzzy B-Cubed scores. Moreover, the evaluation results show that IMS (MUN) often performs better than the SEMCOR most frequent sense baseline. Finally, it can be observed that in most cases, adding training instances from MUN significantly improves IMS (SC) and IMS (SC+DSO).

4 Conclusion

One of the major problems in building supervised word sense disambiguation systems is the training data acquisition bottleneck. In this paper, we semi-automatically extracted and sense-tagged an English corpus containing one million sense-tagged instances. This large sense-tagged corpus can be used for training any supervised WSD system. We then evaluated the performance of IMS trained on our sense-tagged corpus in several WSD and WSI shared tasks. Our sense-tagged dataset has been released publicly.6 We believe our dataset is the largest publicly available annotated dataset for WSD at present.

After training a supervised WSD system using our training set, we evaluated the system using standard benchmarks. The evaluation results show that our sense-tagged corpus can be used to build a WSD system that performs competitively with the top performing WSD systems in the SensEval-2, SensEval-3, and SemEval-2007 fine-grained and coarse-grained all-words tasks, as well as the top systems in the SemEval-2013 WSI task.

6 http://www.comp.nus.edu.sg/~nlp/corpora.html

                      SensEval-2     SensEval-3     SemEval-2007
                      Fine-grained   Fine-grained   Fine-grained   Coarse-grained
IMS (MUN)             64.5           60.6           52.7           78.7
IMS (MUN+SC)          68.2           67.4           58.5           81.6
IMS (MUN+SC+DSO)      68.0           66.6           58.9           82.3
IMS (original)        68.2           67.6           58.3           82.6
IMS (SC)              66.1           67.0           58.7           81.9
IMS (SC+DSO)          66.5           67.0           57.8           81.6
Rank 1                69.0           65.2           59.1           82.5
Rank 2                63.6           64.6           58.7           81.6
WordNet Sense 1       61.9           62.4           51.4           78.9

Table 4: Accuracy (in %) on all-words word sense disambiguation tasks

                          Jac. Ind.   Kδ sim   WNDCG   Fuzzy NMI   Fuzzy B-Cubed
IMS (MUN)                 24.6        64.9     33.0    6.9         57.1
IMS (MUN+SC)              25.0        65.4     34.2    9.1         55.9
IMS (MUN+SC+DSO)          25.5        65.4     35.1    9.7         55.4
IMS (original)            23.4        64.5     34.0    8.6         59.0
IMS (SC)                  22.9        63.5     32.4    6.8         57.3
IMS (SC+DSO)              23.4        63.6     32.9    7.1         57.6
Wang-15 (ukWac)           -           -        -       9.7         54.5
Wang-15 (actual)          -           -        -       9.4         59.1
AI-KU (base)              19.7        62.0     38.7    6.5         39.0
AI-KU (add1000)           19.7        60.6     21.5    3.5         32.0
AI-KU (remove5-add1000)   24.4        64.2     33.2    3.9         45.1
Unimelb (5p)              21.8        61.4     36.5    5.6         45.9
Unimelb (50k)             21.3        62.0     37.1    6.0         48.3
all-instances-1cluster    19.2        60.9     28.8    0.0         62.3
each-instance-1cluster    0.0         0.0      0.0     7.1         0.0
SEMCOR most freq sense    19.2        60.9     28.8    0.0         62.3

Table 5: Supervised and unsupervised evaluation results (in %) on SemEval-2013 word sense induction task

Acknowledgments

This research is supported by the Singapore National Research Foundation under its International Research Centre @ Singapore Funding Initiative and administered by the IDM Programme Office. We are also grateful to Christian Hadiwinoto and Benjamin Yap for assistance with performing the error analysis, and to the anonymous reviewers for their helpful comments.

References

Eneko Agirre, Oier Lopez de Lacalle, and Aitor Soroa. 2014. Random walks for knowledge-based word sense disambiguation. Computational Linguistics, 40(1):57–84.

Osman Baskaya, Enis Sert, Volkan Cirik, and Deniz Yuret. 2013. AI-KU: Using substitute vectors and co-occurrence modeling for word sense induction and disambiguation. In Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 300–306.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Yee Seng Chan and Hwee Tou Ng. 2005. Scaling up word sense disambiguation via parallel texts. In Proceedings of the 20th National Conference on Artificial Intelligence, pages 1037–1042.

Yee Seng Chan. 2008. Word Sense Disambiguation: Scaling Up, Domain Adaptation, and Application to Machine Translation. Ph.D. thesis, National University of Singapore.

Mona Diab. 2004. Relieving the data acquisition bottleneck in word sense disambiguation. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 303–310.

Philip Edmonds and Scott Cotton. 2001. SENSEVAL-2: Overview. In Proceedings of the Second International Workshop on Evaluating Word Sense Disambiguation Systems, pages 1–5.

Andreas Eisele and Yu Chen. 2010. MultiUN: A multilingual corpus from United Nation documents. In Proceedings of the Seventh International Conference on Language Resources and Evaluation, pages 2868–2872.

David Jurgens and Ioannis Klapaftis. 2013. SemEval-2013 task 13: Word sense induction for graded and non-graded senses. In Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 290–299.

Sandra Kubler and Desislava Zhekova. 2009. Semi-supervised learning for word sense disambiguation: Quality vs. quantity. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, pages 197–202.

Jey Han Lau, Paul Cook, and Timothy Baldwin. 2013. unimelb: Topic modelling-based word sense induction. In Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 307–311.

Yoong Keok Lee, Hwee Tou Ng, and Tee Kiah Chia. 2004. Supervised word sense disambiguation with support vector machines and multiple knowledge sources. In Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pages 137–140.

Jin Kiat Low, Hwee Tou Ng, and Wenyuan Guo. 2005. A maximum entropy approach to Chinese word segmentation. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, pages 161–164.

George A. Miller, Claudia Leacock, Randee Tengi, and Ross T. Bunker. 1993. A semantic concordance. In Proceedings of the Workshop on Human Language Technology, pages 303–308.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

Andrea Moro, Alessandro Raganato, and Roberto Navigli. 2014. Entity linking meets word sense disambiguation: A unified approach. Transactions of the Association for Computational Linguistics, 2:231–244.

Roberto Navigli, Kenneth C. Litkowski, and Orin Hargraves. 2007. SemEval-2007 task 07: Coarse-grained English all-words task. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 30–35.

Roberto Navigli. 2009. Word sense disambiguation: A survey. ACM Computing Surveys, 41(2):10:1–10:69.

Hwee Tou Ng and Hian Beng Lee. 1996. Integrating multiple knowledge sources to disambiguate word sense: An exemplar-based approach. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 40–47.

Hwee Tou Ng, Bin Wang, and Yee Seng Chan. 2003. Exploiting parallel texts for word sense disambiguation: An empirical study. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 455–462.

Franz Josef Och and Hermann Ney. 2000. Improved statistical alignment models. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 440–447.

Rebecca Passonneau, Collin Baker, Christiane Fellbaum, and Nancy Ide. 2012. The MASC word sense sentence corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, pages 3025–3030.

Simone Paolo Ponzetto and Roberto Navigli. 2010. Knowledge-rich word sense disambiguation rivaling supervised systems. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1522–1531.

Sameer Pradhan, Edward Loper, Dmitriy Dligach, and Martha Palmer. 2007. SemEval-2007 task-17: English lexical sample, SRL and all words. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 87–92.

Benjamin Snyder and Martha Palmer. 2004. The English all-words task. In Proceedings of the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pages 41–43.

Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. 2006. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.

Jing Wang, Mohit Bansal, Kevin Gimpel, Brian D. Ziebart, and Clement T. Yu. 2015. A sense-topic model for word sense induction with unsupervised data enrichment. Transactions of the Association for Computational Linguistics, 3:59–71.

Zhi Zhong and Hwee Tou Ng. 2009. Word sense disambiguation for all words without hard labor. In Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence, pages 1616–1621.

Zhi Zhong and Hwee Tou Ng. 2010. It Makes Sense: A wide-coverage word sense disambiguation system for free text. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics System Demonstrations, pages 78–83.


Proceedings of the 19th Conference on Computational Natural Language Learning, pages 345–349, Beijing, China, July 30-31, 2015. ©2015 Association for Computational Linguistics

Reading behavior predicts syntactic categories

Maria Barrett and Anders Søgaard
University of Copenhagen
Njalsgade 140
DK-2300 Copenhagen S

{dtq912,soegaard}@hum.ku.dk

Abstract

It is well-known that readers are less likely to fixate their gaze on closed class syntactic categories such as prepositions and pronouns. This paper investigates to what extent the syntactic category of a word in context can be predicted from gaze features obtained using eye-tracking equipment. If syntax can be reliably predicted from eye movements of readers, it can speed up linguistic annotation substantially, since reading is considerably faster than doing linguistic annotation by hand. Our results show that gaze features do discriminate between most pairs of syntactic categories, and we show how we can use this to annotate words with part of speech across domains, when tag dictionaries enable us to narrow down the set of potential categories.

1 Introduction

Eye movements during reading are a well-established proxy for cognitive processing, and it is well-known that readers are more likely to fixate on words from open syntactic categories (verbs, nouns, adjectives) than on closed category items like prepositions and conjunctions (Rayner, 1998; Nilsson and Nivre, 2009). Generally, readers seem to be most likely to fixate and re-fixate on nouns (Furtner et al., 2009). If reading behavior is affected by syntactic category, maybe reading behavior can, conversely, also tell us about the syntax of words in context.

This paper investigates to what extent gaze data can be used to predict syntactic categories. We show that gaze data can effectively be used to discriminate between a wide range of part of speech (POS) pairs, and gaze data can therefore be used to significantly improve type-constrained POS taggers. This is potentially useful, since eye-tracking data becomes more and more readily available with the emergence of eye trackers in mainstream consumer products (San Agustin et al., 2010). With the development of robust eye-tracking in laptops, it is easy to imagine digital text providers storing gaze data, which could then be used to improve automated analysis of their publications.

Contributions We are, to the best of our knowledge, the first to study reading behavior of syntactically annotated, natural text across domains, and how gaze correlates with a complete set of syntactic categories. We use logistic regression to show that gaze features discriminate between POS pairs, even across domains. We then show how gaze features can improve a cross-domain supervised POS tagger. We show that gaze-based predictions are robust, not only across domains, but also across subjects.

2 Experiment

In our experiment, 10 subjects read syntactically annotated sentences from five domains.

Data The data consists of 250 sentences: 50 sentences (min. 3 tokens, max. 120 characters), randomly sampled from each of five different, manually annotated corpora: Wall Street Journal articles (WSJ), Wall Street Journal headlines (HDL), emails (MAI), weblogs (WBL), and Twitter (TWI). WSJ and HDL syntactically annotated sentences come from the OntoNotes 4.0 release of the English Penn Treebank.1 The MAI and WBL sections come from the English Web Treebank.2

1 catalog.ldc.upenn.edu/LDC2011T03
2 catalog.ldc.upenn.edu/LDC2012T13


[Figure 1: Fixation probability boxplots across five domains. Panels: WSJ (n = 858), HDL (n = 461), TWI (n = 610), MAI (n = 575), WBL (n = 716); y-axis: fixation probability (0.0–1.0); x-axis: POS categories (., ADJ, ADP, ADV, CONJ, DET, NOUN, NUM, PRON, PRT, VERB).]

The TWI data comes from the work of Foster et al. (2011). We mapped the gold labels to the 12 Universal POS tags (Petrov et al., 2011), but discarded the category X due to data sparsity.

Experimental design The 250 items were read by all 10 participants, but participants read the items in one of five randomized orders. Neither the source domain for the sentence, nor the POS tags were revealed to the participant at any time. One sentence was presented at a time in black on a light gray background. Font face was Verdana and font size was 25 pixels. Sentences were centered vertically, and all sentences could fit into one line. All sentences were preceded by a fixation cross. The experiment was self-paced. To switch to a new sentence and to ensure that the sentence was actually processed by the participant, participants rated their immediate interest in the sentence on a scale from 1-6 by pressing the corresponding number on the numeric keypad. Participants were instructed to read and continue to the next sentence as quickly as possible. The actual experiment was preceded by 25 practice sentences to familiarize the participant with the experimental setup.

Our apparatus was a Tobii X120 eye tracker with a 15-inch monitor. The sampling rate was 120 Hz binocular. Participants were seated on a chair approximately 65 cm from the display. We recruited 10 participants (7 male, mean age 31.30 ±4.74) from campus. All were native English speakers. Their vision was normal or corrected to normal, and none were diagnosed with dyslexia. All were skilled readers. Minimum educational level was an ongoing MA. Each session lasted around 40 minutes. One participant had no fixations on a few sentences. We believe that erroneous key strokes caused the participant to skip a few sentences.

Features There are many different features for exploring cognitive load during reading (Rayner, 1998). We extracted a broad selection of cognitive effort features from the raw eye-tracking data in order to determine which are more fit for the task. The features are inspired by Salojarvi et al. (2003), who used a similarly exploratory approach. We wanted to cover both oculomotor features, such as fixations on previous and subsequent words, and measures relating to early (e.g. first fixation duration) and late processing (e.g. regression destinations / departure points and total fixation time). We also included reading speed and reading depth features, such as fixation probability and total fixation time per word. In total, we have 32 gaze features, some of which are highly correlated (such as number of fixations on a word and total fixation time per sentence).
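To make this kind of feature concrete, here is a minimal sketch of aggregating raw fixation records into a few of the per-word measures named above (our own illustration; the fixation record format is an assumption, not the Tobii export format):

```python
def word_gaze_features(fixations, n_tokens):
    """Aggregate raw fixation records into a few per-word gaze features:
    number of fixations, first fixation duration, total fixation time,
    and a binary fixated flag (the basis of fixation probability).

    fixations: list of (token_index, duration_ms) pairs in reading order.
    """
    feats = {i: {"n_fixations": 0, "first_fixation_duration": 0.0,
                 "total_fixation_time": 0.0, "fixated": 0}
             for i in range(n_tokens)}
    for token_index, duration in fixations:
        f = feats[token_index]
        if f["n_fixations"] == 0:
            f["first_fixation_duration"] = duration
        f["n_fixations"] += 1
        f["total_fixation_time"] += duration
        f["fixated"] = 1
    return [feats[i] for i in range(n_tokens)]
```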

Dundee Corpus The main weakness of the experiment is the small dataset. As future work, we plan to replicate the experiment with a $99 eye tracker for subjects to use at home. This will make it easy to collect thousands of sentences, leading to more robust gaze-based POS models. Here, instead, we include an experiment with the Dundee corpus (Kennedy and Pynte, 2005). The Dundee corpus is a widely used dataset in research on reading and consists of gaze data for 10 subjects reading 20 newswire articles (about 51,000 words). We extracted the same word-based features as above, except the probability for 1st and 2nd fixation and the sentence-level features (in the Dundee corpus, subjects are exposed to multiple sentences per screen window), and used them as features in our POS tagging experiments (§3).

Learning experiments In our experiments, we used type-constrained logistic regression with L2-regularization and a type-constrained (averaged) structured perceptron (Collins, 2002; Tackstrom et al., 2013). In all experiments, unless otherwise stated, we trained our models on four domains and evaluated on the fifth to avoid over-fitting to the characteristics of a specific domain. Our tag dictionary is from Wiktionary3 and covers 95% of all tokens.

3 https://code.google.com/p/wikily-supervised-pos-tagger/downloads/list

Rank   Feature                                  % of votes
0      Fixation prob                            19.0
1      Previous word fixated binary             13.7
2      Next word fixated binary                 13.2
3      nFixations                               12.2
4      First fixation duration on every word    9.1
5      Previous fixation duration               7.0
6      Mean fixation duration per word          6.6
7      Re-read prob                             5.7
8      Next fixation duration                   2.0
9      Total fixation duration per word         2.0

Table 1: 10 most used features by stability selection from logistic regression classification of all POS pairs on all domains, 5-fold cross validation.

(a) Content words (b) Function words

Figure 2: Scatter plot of frequency and fixation probability for content words (NOUN, VERB, ADJ, NUM) and function words (PRON, CONJ, ADP, DET, PRT)

Figure 3: Error reduction of logistic regression over a majority baseline. All domains

3 Results

Domain differences Our first observation is that the gaze characteristics differ slightly across domains, but more across POS. Figure 1 presents the fixation probabilities across the 11 parts of speech. While the overall pattern is similar across the five domains (open category items are more likely to be fixated), we see domain differences. For example, pronouns are more likely to be fixated in headlines. The explanation could lie in the different distributions of function words and content words. It is established and unchallenged that function words are fixated about 35% of the time and content words are fixated about 85% of the time (Rayner and Duffy, 1988). In our data, these numbers vary among the domains according to the frequency of that word class; see Figure 2. Figure 2a shows that there is a strong linear correlation between content word frequency and content word fixation probability among the different domains: Pearson's ρ = 0.909. From Figure 2b, there is a negative correlation between function word frequency and function word fixation probability: Pearson's ρ = −0.702.

Predictive gaze features To investigate which gaze features were more predictive of part of speech, we used stability selection (Meinshausen and Buhlmann, 2010) with logistic regression classification on all binary POS classifications. Fixation probability was the most informative feature, but whether the words around the target word are fixated is also important, along with the number of fixations. In our binary discrimination and POS tagging experiments, using L2-regularization or averaging with all features was superior (on Twitter data) to using stability selection for feature selection. We also asked a psycholinguist to select a small set of relatively independent gaze features fit for the task (first fixation duration, fixation probability and re-read probability), but again, using all features with L2-regularization led to better performance on the Twitter data.

Binary discrimination First, we trained L2-regularized logistic regression models to discriminate between all pairs of POS tags using only gaze features. In other words, for example, we selected all words annotated as NOUN or VERB, and trained a logistic regression model to discriminate between the two in a five-fold cross validation setup. We report the error reduction (acc − baseline) / (1 − baseline) in Figure 3.
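A minimal sketch of one such pairwise experiment (our illustration, using scikit-learn rather than necessarily the authors' implementation; X is a gaze-feature matrix and y the binary labels for one POS pair):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def pairwise_error_reduction(X, y, folds=5):
    """Five-fold cross-validated accuracy of an L2-regularized logistic
    regression on gaze features for one POS pair, reported as error
    reduction over the majority baseline: (acc - baseline) / (1 - baseline)."""
    clf = LogisticRegression(penalty="l2", max_iter=1000)
    acc = cross_val_score(clf, X, y, cv=folds, scoring="accuracy").mean()
    baseline = max(np.mean(y), 1.0 - np.mean(y))  # majority-class accuracy
    return (acc - baseline) / (1.0 - baseline)
```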

4 https://github.com/coastalcph/rungsted

           SP      +GAZE   +DGAZE   +FREQLEN   +DGAZE+FREQLEN
HDL        0.807   0.822   0.822    0.826      0.843
MAI        0.791   0.831   0.834    0.795      0.831
TWI        0.771   0.787   0.800    0.772      0.793
WBL        0.836   0.854   0.858    0.850      0.861
WSJ        0.831   0.837   0.838    0.831      0.859
Macro-av   0.807   0.826   0.830    0.815      0.837

Table 2: POS tagging results on different test sets using 200 out-of-domain sentences for training. DGAZE uses gaze features from Dundee. Best result for each row in bold face.

POS tagging We also tried evaluating our gaze features directly in a supervised POS tagger.4 We trained a type-constrained (averaged) perceptron model with drop-out and a standard feature model (from Owoputi et al. (2013)) augmented with the above gaze features. The POS tagger was trained on a very small seed of data (200 sentences), doing 20 passes over the data, and evaluated on out-of-domain test data; training on four domains, testing on one. For the gaze features, instead of using token gaze features, we first built a lexicon with average word type statistics from the training data. We normalize the gaze matrix by dividing by its standard deviation. This is the normalizer in Turian et al. (2010) with σ = 1.0. We condition on the gaze features of the current word only. We compare performance using gaze features to using only word frequency, estimated from the (unlabeled) English Web Treebank corpus, and word length (FREQLEN).
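A sketch of the type-level gaze lexicon and the scaling step as we read the description above (the data structures are assumptions, and taking the standard deviation over the whole matrix rather than per column is our interpretation of "its standard deviation"):

```python
import numpy as np
from collections import defaultdict

def build_gaze_lexicon(tokens_with_gaze, sigma=1.0):
    """Average token-level gaze feature vectors per word type, then scale
    the resulting matrix by sigma / its standard deviation (a Turian-style
    normalizer, here with sigma = 1.0)."""
    by_type = defaultdict(list)
    for word, gaze_vector in tokens_with_gaze:
        by_type[word].append(gaze_vector)
    words = sorted(by_type)
    matrix = np.array([np.mean(by_type[w], axis=0) for w in words])
    matrix *= sigma / matrix.std()
    return dict(zip(words, matrix))
```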

The first three columns in Table 2 show that gaze features help POS tagging, at least when trained on very small seeds of data. The error reduction using gaze features from the Dundee corpus (DGAZE) is 12%. We know that gaze features correlate with word frequency and word length, but using these features directly leads to much smaller performance gains. Concatenating the two feature sets leads to the best performance, with an error reduction of 16%.

In follow-up experiments, we observe that averaging over 10 subjects when collecting gaze features does not seem as important as we expected. Tagging accuracies on raw (non-averaged) data are only about 1% lower. Finally, we also tried running logistic regression experiments across subjects rather than domains. Here, tagging accuracies were again comparable to our set-up, suggesting that gaze features are also robust across subjects.

4 Related work

Matthies and Søgaard (2013) present results suggesting that individual variation among (academically trained) subjects' reading behavior was not a greater source of error than variation within subjects, showing that it is possible to predict fixations across readers. Our work relates to such work, studying the robustness of reading models across domains and readers, but it also relates in spirit to research on using weak supervision in NLP, e.g., work on using HTML markup to improve dependency parsers (Spitkovsky, 2013) or using click-through data to improve POS taggers (Ganchev et al., 2012).

5 Conclusions

We have shown that it is possible to use gaze features to discriminate between many POS pairs across domains, even with only a small dataset and a small set of subjects. We also showed that gaze features can improve the performance of a POS tagger trained on small seeds of data.

References

Michael Collins. 2002. Discriminative training methods for Hidden Markov Models. In EMNLP.

Jennifer Foster, Ozlem Cetinoglu, Joachim Wagner, Josef Le Roux, Joakim Nivre, Deirde Hogan, and Josef van Genabith. 2011. From news to comments: Resources and benchmarks for parsing the language of Web 2.0. In IJCNLP.

Marco R. Furtner, John F. Rauthmann, and Pierre Sachse. 2009. Nomen est omen: Investigating the dominance of nouns in word comprehension with eye movement analyses. Advances in Cognitive Psychology, 5:91.

Kuzman Ganchev, Keith Hall, Ryan McDonald, and Slav Petrov. 2012. Using search-logs to improve query tagging. In ACL.

Alan Kennedy and Joel Pynte. 2005. Parafoveal-on-foveal effects in normal reading. Vision Research, 45(2):153–168.

Franz Matthies and Anders Søgaard. 2013. With blinkers on: Robust prediction of eye movements across readers. In EMNLP, Seattle, Washington, USA.

Nicolai Meinshausen and Peter Buhlmann. 2010. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–473.

Matthias Nilsson and Joakim Nivre. 2009. Learning where to look: Modeling eye movements in reading. In CoNLL.

Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. 2013. Improved part-of-speech tagging for online conversational text with word clusters. In NAACL.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2011. A universal part-of-speech tagset. arXiv preprint arXiv:1104.2086.

K. Rayner and S. A. Duffy. 1988. On-line comprehension processes and eye movements in reading. In G. E. MacKinnon, M. Daneman, and T. G. Waller, editors, Reading research: Advances in theory and practice, pages 13–66. Academic Press, New York.

Keith Rayner. 1998. Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124(3):372.

Jarkko Salojarvi, Ilpo Kojo, Jaana Simola, and Samuel Kaski. 2003. Can relevance be inferred from eye movements in information retrieval. In Proceedings of WSOM, volume 3, pages 261–266.

Javier San Agustin, Henrik Skovsgaard, Emilie Mollenbach, Maria Barret, Martin Tall, Dan Witzner Hansen, and John Paulin Hansen. 2010. Evaluation of a low-cost open-source gaze tracker. In Proceedings of the 2010 Symposium on Eye-Tracking Research & Applications, pages 77–80. ACM.

Valentin Ilyich Spitkovsky. 2013. Grammar Induction and Parsing with Dependency-and-Boundary Models. Ph.D. thesis, Stanford University.

Oscar Tackstrom, Dipanjan Das, Slav Petrov, Ryan McDonald, and Joakim Nivre. 2013. Token and type constraints for cross-lingual part-of-speech tagging. TACL, 1:1–12.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In ACL.


Author Index

Agic, Željko, 315
Al-Badrashiny, Mohamed, 42
Allen, James, 226
Ammar, Waleed, 332
Åström, Kalle, 185

Baldridge, Jason, 22
Baldwin, Timothy, 83
Ballesteros, Miguel, 289
Barbosa, Luciano, 123
Barrett, Maria, 345
Berzak, Yevgeni, 94
Bhatt, Himanshu Sharad, 52
Bird, Steven, 113
Bogdanova, Dasha, 123
Boyd-Graber, Jordan, 194

Carreras, Xavier, 289
Chang, Kai-Wei, 12
Chen, Mingrui, 237
Choi, Ho-Jin, 279
Choi, Key-Sun, 132
Choudhury, Pallavi, 332
Cimiano, Philipp, 153
Clark, Stephen, 327
Cohen, Shay B., 1
Cohn, Trevor, 113
Cook, Paul, 113
Cotterell, Ryan, 164

Dagan, Ido, 175
Diab, Mona, 42
Do, Hyun-Woo, 279
dos Santos, Cicero, 123
Duong, Long, 113
Dyer, Chris, 22

Elfardy, Heba, 42

Felt, Paul, 194
Ferraro, Gabriela, 83
Fraser, Alexander, 164

Garrette, Dan, 22
Georgiev, Georgi, 310
Gildea, Daniel, 32
Goldberger, Jacob, 175
Gulordava, Kristina, 247
Guo, Yuhong, 73
Guzmán, Francisco, 62

Hashimoto, Kazuma, 268
Henderson, James, 142
Hou, Weiwei, 83
Hovy, Dirk, 103
Huang, LiGuo, 237

Ishiwatari, Shonosuke, 300

Jat, sharmistha, 321
Jeong, Young-Seob, 279
Johannsen, Anders, 103

Kaji, Nobuhiro, 300
Katz, Boris, 94
Kayser, Michael, 305
Kim, Youngsik, 132
Kim, Zae Myung, 279
Kitsuregawa, Masaru, 300
Klinger, Roman, 153

Levy, Omer, 175
Li, Zeheng, 237
Lim, Chae-Gyun, 279
Luong, Thang, 305

Mahajan, Anuj, 321
Maillard, Jean, 327
Manning, Christopher D., 305
Martínez Alonso, Héctor, 315
Merkler, Danijela, 315
Merlo, Paola, 247
Mihaylov, Todor, 310
Miwa, Makoto, 268
Müller, Thomas, 164

Nakov, Preslav, 62, 310
Ng, Hwee Tou, 338
Ng, Vincent, 237
Nugues, Pierre, 185

Peng, Haoruo, 12
Peng, Xiaochang, 32
Perera, Ian, 226
Plank, Barbara, 315
Poon, Hoifung, 332

Qu, Lizhen, 83

Rappoport, Ari, 258
Reichart, Roi, 94, 258
Riezler, Stefan, 1
Ringger, Eric, 194
Roth, Dan, 12
Roy, Shourya, 52, 321
Ruppenhofer, Josef, 215

Schneider, Nathan, 83
Schütze, Hinrich, 164, 204
Schwartz, Roy, 258
Semwal, Deepali, 52
Seppi, Kevin, 194
Søgaard, Anders, 103, 315, 345
Shwartz, Vered, 175
Smith, Noah A., 22
Sokolov, Artem, 1
Song, Linfeng, 32
Stenetorp, Pontus, 268

Taghipour, Kaveh, 338
Toutanova, Kristina, 332
Toyoda, Masashi, 300
Tsuruoka, Yoshimasa, 268

Vogel, Stephan, 62

Weegar, Rebecka, 185
Wiegand, Michael, 215

XIAO, MIN, 73

Yazdani, Majid, 142
Yin, Wenpeng, 204
Yoshinaga, Naoki, 300

Zadrozny, Bianca, 123
Zhou, Liyuan, 83