arXiv:2112.02721v1 [cs.CL] 6 Dec 2021


NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation

December 5, 2021

Kaustubh D. Dhole3,18†, Varun Gangal7†, Sebastian Gehrmann23†, Aadesh Gupta3†, Zhenhao Li32†, Saad Mahamood90†, Abinaya Mahendiran45†, Simon Mille53†, Ashish Srivastava2†, Samson Tan48,91†, Tongshuang Wu81†, Jascha Sohl-Dickstein22†, Jinho D. Choi18†, Eduard Hovy7†, Ondřej Dušek10†, Sebastian Ruder13†, Sajant Anand68, Nagender Aneja74, Rabin Banjade77, Lisa Barthe19, Hanna Behnke32, Ian Berlot-Attwell80, Connor Boyle81, Caroline Brun49, Marco Antonio Sobrevilla Cabezudo79, Samuel Cahyawijaya26, Emile Chapuis52, Wanxiang Che24, Mukund Choudhary37, Christian Clauss33, Pierre Colombo52, Filip Cornell41, Gautier Dagan84, Mayukh Das63, Tanay Dixit30, Thomas Dopierre39, Paul-Alexis Dray89, Suchitra Dubey1, Tatiana Ekeinhor86, Marco Di Giovanni51, Rishabh Gupta28, Rishabh Gupta29, Louanes Hamla19, Sang Han73, Fabrice Harel-Canada70, Antoine Honoré86, Ishan Jindal27, Przemyslaw K. Joniak66, Denis Kleyko75, Venelin Kovatchev65, Kalpesh Krishna71, Ashutosh Kumar34, Stefan Langer59, Seungjae Ryan Lee55, Corey James Levinson33, Hualou Liang15, Kaizhao Liang76, Zhexiong Liu78, Andrey Lukyanenko43, Vukosi Marivate14, Gerard de Melo25, Simon Meoni33, Maxime Meyer86, Afnan Mir4, Nafise Sadat Moosavi62, Niklas Muennighoff50, Timothy Sum Hon Mun64, Kenton Murray40, Marcin Namysl20, Maria Obedkova33, Priti Oli77, Nivranshu Pasricha46, Jan Pfister83, Richard Plant17, Vinay Prabhu73, Vasile Păiș57, Libo Qin24, Shahab Raji58, Pawan Kumar Rajpoot56, Vikas Raunak44, Roy Rinberg11, Nicolas Roberts82, Juan Diego Rodriguez72, Claude Roux49, Vasconcellos P. H. S.54, Ananya B. Sai30, Robin M. Schmidt16, Thomas Scialom89, Tshephisho Sefara12, Saqib N. Shamsi88, Xudong Shen48, Yiwen Shi15, Haoyue Shi67, Anna Shvets19, Nick Siegel4, Damien Sileo42, Jamie Simon68, Chandan Singh68, Roman Sitelew33, Priyank Soni3, Taylor Sorensen6, William Soto61, Aman Srivastava85, KV Aditya Srivatsa37, Tony Sun69, Mukund Varma T30, A Tabassum47, Fiona Anting Tan36, Ryan Teehan9, Mo Tiwari60, Marie Tolkiehn8, Athena Wang4, Zijian Wang33, Zijie J. Wang21, Gloria Wang31, Fuxuan Wei24, Bryan Wilie35, Genta Indra Winata5, Xinyi Wu81, Witold Wydmański38, Tianbao Xie24, Usama Yaseen59, M. Yee33, Jing Zhang18, Yue Zhang87

1ACKO, 2Agara Labs, 3Amelia R&D, New York, 4Applied Research Laboratories, The University of Texas at Austin, 5Bloomberg, 6Brigham Young University, 7Carnegie Mellon University, 8Center for Data and Computing in Natural Sciences, Universität Hamburg, 9Charles River Analytics, 10Charles University, Prague, 11Columbia University, 12Council for Scientific and Industrial Research, 13DeepMind, 14Department of Computer Science, University of Pretoria, 15Drexel University, 16Eberhard Karls University of Tübingen, 17Edinburgh Napier University, 18Emory University, 19Fablab by Inetum in Paris, 20Fraunhofer IAIS, 21Georgia Tech, 22Google Brain, 23Google Research, 24Harbin Institute of Technology, 25Hasso Plattner Institute / University of Potsdam, 26Hong Kong University of Science and Technology, 27IBM Research, 28IIIT Delhi, 29IIT Delhi, 30IIT Madras, 31Illinois Mathematics and Science Academy, 32Imperial College, London, 33Independent, 34Indian Institute of Science, Bangalore, 35Institut Teknologi Bandung, 36Institute of Data Science, National University of Singapore, 37International Institute of Information Technology, Hyderabad, 38Jagiellonian University, Poland, 39Jean Monnet University, 40Johns Hopkins University, 41KTH Royal Institute of Technology, 42KU Leuven, 43MTS AI, France, 44Microsoft, Redmond, WA, 45Mphasis NEXT Labs, 46National University of Ireland Galway, 47National University of Science and Technology, Pakistan, 48National University of Singapore, 49Naver Labs Europe, 50Peking University, 51Politecnico di Milano and University of Bologna, 52Polytechnic Institute of Paris, 53Pompeu Fabra University, 54Pontifical Catholic University of Minas Gerais, Brazil, 55Princeton University, 56Rakuten India, 57Research Institute for Artificial Intelligence Mihai Drăgănescu, Romanian Academy, 58Rutgers University, 59Siemens AG, 60Stanford University, 61SyNaLP, LORIA, 62TU Darmstadt, 63Technical University of Braunschweig, 64The Alan Turing Institute, 65The University of Texas at Austin; (University of Barcelona, University of Birmingham), 66The University of Tokyo, 67Toyota Technological Institute at Chicago, 68UC Berkeley, 69UC Santa Barbara / Google, 70UCLA, 71UMass Amherst, 72UT Austin, 73UnifyID, 74Universiti Brunei Darussalam, 75University of California, Berkeley and Research Institutes of Sweden, 76University of Illinois, Urbana Champaign, 77University of Memphis, 78University of Pittsburgh, 79University of São Paulo, 80University of Toronto, 81University of Washington, 82University of Wisconsin–Madison, 83University of Würzburg, 84University of Edinburgh, 85VMware, 86Vade, 87Westlake Institute for Advanced Study, 88Whirlpool Corporation, 89reciTAL, 90trivago N.V., 91Salesforce Research


Abstract

Data augmentation is an important component in the robustness evaluation of models in natural language processing (NLP) and in enhancing the diversity of the data they are trained on. In this paper, we present NL-Augmenter, a new participatory Python-based natural language augmentation framework which supports the creation of both transformations (modifications to the data) and filters (data splits according to specific features). We describe the framework and an initial set of 117 transformations and 23 filters for a variety of natural language tasks. We demonstrate the efficacy of NL-Augmenter by using several of its transformations to analyze the robustness of popular natural language models. The infrastructure, datacards and robustness analysis results are available publicly on the NL-Augmenter repository (https://github.com/GEM-benchmark/NL-Augmenter).

1 Introduction

Data augmentation, the act of creating new datapoints by slightly modifying copies or creating synthetic data based on existing data, is an important component in the robustness evaluation of models in natural language processing (NLP) and in enhancing the diversity of their training data. Most data augmentation techniques create examples through transformations of existing examples which are based on prior task-specific knowledge (Feng et al., 2021; Chen et al., 2021). Such transformations seek to disrupt model predictions or can be used as training candidates for improving regularization and denoising models, for example, through consistency training (Xie et al., 2020); Figure 1 shows a few possible transformations for a sample sentence.
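As a concrete illustration of this kind of transformation, the toy perturbation below injects typing-style noise by swapping adjacent letters. It is a generic sketch only; the function name and defaults are ours and do not correspond to any NL-Augmenter implementation.

```python
import random

def swap_adjacent_chars(sentence: str, prob: float = 0.1, seed: int = 0) -> str:
    """Toy character-level perturbation that swaps adjacent letters to
    mimic typing mistakes. Generic sketch, not NL-Augmenter's code."""
    rng = random.Random(seed)  # seeded for reproducible augmentation
    chars = list(sentence)
    for i in range(len(chars) - 1):
        # Only swap inside words, and only with probability `prob`.
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < prob:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

noisy = swap_adjacent_chars("John likes expensive Italian pizzas.", prob=0.3)
```

Such a perturbation preserves the multiset of characters (and, for humans, usually the meaning) while producing surface forms a model may not have seen in training.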

However, a vast majority of transformations do not alter the structure of examples in drastic and meaningful ways, rendering them qualitatively less effective as potential training or test examples. Moreover, different NLP tasks may benefit from transforming different linguistic properties. For example, changing the word “happy” to “very happy” in an input is more relevant for sentiment analysis than for summarization (Mille et al., 2021). Despite

† Organizers & Steering Committee
* Please send requests to the correspondence email: [email protected].

this, many transformations may be universally useful, for example changing places to ones from different geographic regions, or changing names to those from different cultures. As such, having a single place to collect both task-specific and task-independent augmentations will ease the barrier to creating appropriate suites of augmentations that should be applied to different tasks.

Natural language and its long-tailed nature (Bamman, 2017) lead to a very high diversity of possible surface forms. If only a handful of ways to paraphrase a text are available, it can be hard to generalize across radically different surface forms. Besides, data drawn i.i.d. from such a long-tailed distribution represents itself exactly in proportion to its occurrence in the dataset. Evaluating NLP systems on such data means that the head of the distribution is emphasized even in the test dataset and that rare phenomena are implicitly ignored. However, informed transformations or the identification of such tail examples may require a wide range of domain knowledge or specific cultural backgrounds. We thus argue that collecting the transformations to be applied to NLP datasets should be done by capitalizing on the “wisdom-of-researchers”.

To enable more diverse and better characterized data during testing and training, we create a Python-based natural language augmentation framework, NL-Augmenter.1 With the help of researchers across subfields in computational linguistics and NLP, we collect many creative ways to augment data for natural language tasks. To encourage task-specific implementations, we tie each transformation to a widely-used data format (e.g. a text pair, a question-answer pair, etc.) along with various task types (e.g. entailment, tagging, etc.) that they intend to benefit. We demonstrate the efficacy of NL-Augmenter by using some of its transformations to analyze the robustness of popular natural language models.

A majority of the augmentations that the framework supports are transformations of single sentences that aim to paraphrase these sentences in various ways. NL-Augmenter loosens the definition of “transformations” from the logic-centric view of strict equivalence to the more descriptive view of linguistics, closely resembling Bhagat and Hovy (2013)’s “quasi-paraphrases”. We extend

1https://github.com/GEM-benchmark/NL-Augmenter

[Figure 1 shows the original sentence John likes expensive Italian pizzas. surrounded by example outputs of NL-Augmenter transformations, e.g. typographical noise, paraphrases, emoji insertion, word-order changes, and visual character attacks.]

Figure 1: A few randomly chosen transformations of NL-Augmenter for the original sentence John likes expensive pizzas. While the meaning (almost) always remains the same and identifiable by humans, models can have a much harder time representing the transformed sentences.

this to accommodate noise, intentional and accidental human mistakes, sociolinguistic variation, semantically-valid style and syntax changes, as well as artificial constructs that are unambiguous to humans (Tan et al., 2021b). Some transformations vary the socio-linguistic perspective, permitting a crucial source of variation wherein language goals span beyond conveying ideas and content.

In addition to transformations, NL-Augmenter also provides a variety of filters, which can be used to filter data and create subpopulations of given inputs, according to features such as input complexity, input size, etc. Unlike a transformation, the output of a filter is a boolean value, indicating whether the input meets the filter criterion, e.g. whether the input text is toxic. The filters allow splitting existing datasets and hence evaluating models on subsets with specific linguistic properties.
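The boolean contract of a filter can be sketched in a few lines. The class below is hypothetical and only illustrates the idea of selecting a subpopulation; NL-Augmenter's real filters implement this contract on the framework's operation interfaces.

```python
class LengthFilter:
    """Hypothetical filter sketch (not NL-Augmenter's actual interface):
    a filter maps an input to a boolean rather than to new text."""

    def __init__(self, max_tokens: int = 10):
        self.max_tokens = max_tokens

    def filter(self, sentence: str) -> bool:
        # True iff the input belongs to the "short inputs" subpopulation.
        return len(sentence.split()) <= self.max_tokens

# Splitting a dataset into a subpopulation with the filter:
data = ["Pizza.", "John likes expensive Italian pizzas bought in Rome."]
short = [s for s in data if LengthFilter(max_tokens=4).filter(s)]
```

Because a filter never rewrites its input, it can be applied to any split of any dataset without changing gold labels, which is what makes subpopulation-level evaluation cheap.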

In this paper, we apply the collected transformations and filters to several datasets and show to what extent the different types of perturbations affect some models. The paper is organized as follows. We first discuss the rise of participatory benchmarks in Section 2. In Section 3, we introduce the participatory workshop and the repository of NL-Augmenter. In Section 4, we present the robustness analysis performed on the participants' submissions, followed by a broader impact discussion. All filters and transformations are listed and described in detail in the Appendix.

2 Related Work

NL-Augmenter enables both data augmentation and robustness testing by constructing the library in a participatory fashion. We provide an overview of the related work in these lines of research.

2.1 Evolving Participatory Benchmarks

To address the problem of under-resourced African languages in machine translation, Masakhane adopted a bottom-up, participatory approach to construct machine translation benchmarks for over thirty languages (Nekoto et al., 2020). This collaborative approach is increasingly adopted in the NLP community to create evolving benchmarks in response to the rapid pace of NLP progress. The Generation Evaluation and Metrics benchmark (Gehrmann et al., 2021), which started the development of NL-Augmenter, is a participatory project to document and improve tasks and their evaluation in natural language generation. BIG-Bench2 proposes a collaborative framework to collect a large suite of few-shot tasks to gauge the

2https://github.com/google/BIG-bench

abilities of large, pretrained language models. DynaBench (Kiela et al., 2021) proposes to iteratively evaluate models in a human-in-the-loop fashion by constructing more challenging examples after each round of model evaluation. SyntaxGym (Gauthier et al., 2020) provides a standardized platform for researchers to contribute and use evaluation sets, and focuses on targeted syntactic evaluation of Language Models (LMs), particularly psycholinguistically motivated ones. In contrast, our transformations operate on a larger variety of tasks and model types; they are not required to be syntactically or even linguistically motivated.

2.2 Wisdom-of-Researchers

There are many ways to introduce variation in a sentence without altering its meaning; the lived experiences of a diverse group of individuals could help with identifying and codifying the myriad dimensions of variation as executable transformations (Tan et al., 2021b). Leveraging the wisdom-of-the-crowd (Galton, 1907; Yi et al., 2010) is common in our field of natural language processing, with the use of Amazon Mechanical Turk to generate and annotate data in exchange for monetary returns. The aforementioned BIG-Bench project, hosted on GitHub, offers co-authorship in exchange for task contribution. We can think of this as a sort of wisdom-of-researchers. Similarly, we crowdsource transformations, in the form of Python code snippets, in return for co-authorship.

2.3 Robustness Evaluation Tools

There are many projects with similar goals that inspired NL-Augmenter. For example, Gardner et al. (2020) proposed creating “contrast” sets of perturbed test examples. In their approach, each example is manually perturbed, which may lead to higher-quality results but is costly to replicate for each new task due to scale and annotator cost. TextAttack (Morris et al., 2020) is a library enabling the adversarial evaluation of English NLP models. Partially overcoming this limitation, TextFlint (Wang et al., 2021a) supports robustness evaluation in English and Chinese. It covers linguistic and task-specific transformations, adversarial attacks, and subpopulation analyses. In contrast, while the majority of such tools are focused on English, NL-Augmenter comprises transformations and filters that work for many different languages, and each contribution can specify a set of supported languages.

Robustness Gym (Goel et al., 2021) unifies four different types of robustness tests (subpopulations, transformations, adversarial attacks, and evaluation sets) under a single interface. However, it depends on existing libraries for its transformations. In contrast, NL-Augmenter focuses on compiling a set of transformations in an open source and collaborative fashion, which is reflected in its size and diversity. CheckList (Ribeiro et al., 2020) argues for the need to go beyond simple accuracy and evaluate the model on basic linguistic capabilities, for example its response to negations. Polyjuice (Wu et al., 2021) perturbs examples using GPT-2; though this is automatic and scalable, it offers limited control over the type of challenging examples generated, making fine-grained analysis beyond the global challenge-set level difficult. In contrast, our method offers a richer taxonomy with 117 (and growing) transformations for extensive analysis and comparison.

Tan et al. (2021b) propose decomposing each real-world environment into a set of dimensions before using randomly sampled and adversarially optimized transformations to measure the model's average- and worst-case performance along each dimension. NL-Augmenter can be used, out-of-the-box, to measure average-case performance, and we plan to extend it to support worst-case evaluation.
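The distinction between the two summary statistics amounts to a few lines of arithmetic. The helper and scores below are made up for illustration; they are not results from the paper or Tan et al.'s procedure.

```python
def summarize_robustness(scores):
    """Summarize per-dimension (e.g. per-transformation) accuracies of one
    model as average- and worst-case performance. Illustrative only;
    the dimension names and values are invented."""
    avg = sum(scores.values()) / len(scores)  # average-case performance
    worst = min(scores.values())              # worst-case performance
    return avg, worst

scores = {"typos": 0.81, "backtranslation": 0.88, "negation": 0.62}
avg, worst = summarize_robustness(scores)  # avg 0.77, worst 0.62
```

The average hides the weakest dimension (here, hypothetically, negation), which is exactly why worst-case evaluation is a useful complement.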

3 NL-Augmenter

NL-Augmenter is a crowd-sourced suite to facilitate rapid augmentation of data for NLP tasks to assist in training and evaluating models. NL-Augmenter was introduced in Mille et al. (2021) in the context of the creation of evaluation suites for the GEM benchmark (Gehrmann et al., 2021); three types of evaluation sets were proposed: (i) transformations, i.e. original test sets are perturbed in different ways (e.g. backtranslation, introduction of typographical errors, etc.); (ii) subpopulations, i.e. test subsets filtered according to features such as input complexity, input size, etc.; and (iii) data shifts, i.e. new test sets that do not contain any of the original test set material.

In this paper, we present a participant-driven repository for creating and testing transformations and filters, and for applying them to all dataset splits (training, development, evaluation) and to all NLP tasks (NLG, labeling, question answering, etc.). As shown by Mille et al. (2021), applying filters and transformations to development/evaluation data splits allows for testing the robustness of models and for identifying possible biases; on the other hand, applying transformations and filters to training data (data augmentation) allows for possibly mitigating the detected robustness and bias issues (Wang et al., 2021b; Pruksachatkun et al., 2021; Si et al., 2021).

Format of a Transformation

    # The name of the transformation, ReplaceFinancialAmount, followed by
    # the interface SentenceOperation.
    class ReplaceFinancialAmount(SentenceOperation):
        # The tasks that the transformation is applicable to, the languages
        # for which transformations are generated, and the relevant
        # keywords which categorise the transformation.
        tasks = [
            TaskType.TEXT_CLASSIFICATION,
            TaskType.TEXT_TO_TEXT_GENERATION,
        ]
        languages = ["en"]
        keywords = [
            "lexical",
            "rule-based",
            "external-knowledge-based",
            "possible-meaning-alteration",
            "high-precision",
        ]

        def __init__(self, seed: int = 0, max_outputs: int = 1):
            super().__init__(seed=seed, max_outputs=max_outputs)

        def generate(self, sentence: str) -> List[str]:
            """
            The actual logic of the transformation. The
            `generate` method takes in a sentence and returns
            multiple transformed sentences.
            """
            return transformed_sentences

Figure 2: Participants were expected to write their Python class adhering to the above format.
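Given a class following the format of Figure 2, a caller only needs to instantiate it and invoke generate. The sketch below uses a stand-in base class and a deliberately trivial transformation; it illustrates the call pattern only, not the framework's actual base-class implementation.

```python
from typing import List

class SentenceOperation:
    """Stand-in base class for illustration only; NL-Augmenter ships its
    own SentenceOperation interface with additional machinery."""
    def __init__(self, seed: int = 0, max_outputs: int = 1):
        self.seed = seed
        self.max_outputs = max_outputs

class LowercaseSentence(SentenceOperation):
    """Deliberately trivial transformation showing the call pattern:
    `generate` takes one sentence and returns a list of outputs."""
    languages = ["en"]

    def generate(self, sentence: str) -> List[str]:
        # Cap the number of returned variants at max_outputs.
        return [sentence.lower()][: self.max_outputs]

t = LowercaseSentence()
print(t.generate("John likes expensive Italian pizzas."))
# → ['john likes expensive italian pizzas.']
```

Because every transformation exposes the same generate signature, evaluation scripts can iterate over arbitrary submissions without knowing their internals.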

In this section, we provide organizational details, list the transformations and filters that the repository currently contains, and present the list of tags we associated with transformations and filters and how we introduced them.

3.1 Participatory Workshop on GitHub

A workshop was organized towards constructing this full-fledged participant-driven repository. Unlike a traditional workshop wherein people submit papers, participants were asked to submit Python implementations of transformations to the GitHub repository. Organizers of this workshop created a base repository extending Mille et al. (2021)'s NLG evaluation suite and incorporated a set of interfaces, each of which catered to popular NL example formats. This formed the backbone of the repository. A sample set of transformations and filters along with evaluation scripts was provided as starter code. Figure 2 shows an annotated code snippet of a submission. Following the format of BIG-Bench's review process, multiple review criteria were designed for accepting contributions. The review criteria (see Appendix A) guided participants to follow a style guide, incorporate test cases in JSON format, and encouraged novelty and specificity. Apart from the general software development advantages of test cases, they made reviewing simpler by providing an overview of the transformation's capability and scope of generations.

3.2 Transformations and filters

Tables 1 and 2 list respectively the 117 transformations and 23 filters that are currently found in the NL-Augmenter repository (alphabetically ordered according to the submission name in the repository). For each transformation/filter, a link to the corresponding Appendix subsection is provided, where a detailed description, illustrations and an external link to the implementation in the NL-Augmenter repository can be found.

3.3 Tags for the classification of perturbations

We defined a list of tags which are useful for efficient navigation in the pool of existing perturbations and for understanding the performance characteristics of the contributed transformations and filters (see e.g. the robustness analysis presented in Section 4.2). There are three main categories of tags: (i) General properties tags, (ii) Output properties tags, and (iii) Processing properties tags.

General properties tags are shown in Table 3.

Transformation App. | Transformation App.
Abbreviation Transformation B.1 | Mix transliteration B.60
Add Hash-Tags B.2 | MR Value Replacement B.61
Adjectives Antonyms Switch B.3 | Multilingual Back Translation B.62
AmericanizeBritishizeEnglish B.4 | Multilingual Dictionary Based Code Switch B.63
AntonymsSubstitute B.5 | Multilingual Lexicon Perturbation B.64
Auxiliary Negation Removal B.6 | Causal Negation and Strengthening B.65
AzertyQwertyCharsSwap B.7 | Question Rephrasing transformation B.66
BackTranslation B.8 | English Noun Compound Paraphraser [N+N] B.67
BackTranslation for Named Entity Recognition B.9 | Number to Word B.68
Butter Fingers Perturbation B.10 | Numeric to Word B.69
Butter Fingers Perturbation For Indian Languages B.11 | OCR Perturbation B.70
Change Character Case B.12 | Add Noun Definition B.71
Change Date Format B.13 | Pig Latin Cipher B.72
Change Person Named Entities B.14 | Pinyin Chinese Character Transcription B.73
Change Two Way Named Entities B.15 | SRL Argument Exchange B.74
Chinese Antonym and Synonym Substitution B.16 | ProtAugment Diverse Paraphrasing B.75
Chinese Pinyin Butter Fingers Perturbation B.17 | Punctuation B.76
Chinese Person NE and Gender Perturbation B.18 | Question-Question Paraphraser for QA B.77
Chinese (Simplified and Traditional) Perturbation B.19 | Question in CAPS B.78
City Names Transformation B.20 | Random Word Deletion B.79
Close Homophones Swap B.21 | Random Upper-Case Transformation B.80
Color Transformation B.22 | Double Context QA B.81
Concatenate Two Random Sentences (Bilingual) B.23 | Replace Abbreviations and Acronyms B.82
Concatenate Two Random Sentences (Monolingual) B.24 | Replace Financial Amounts B.83
Concept2Sentence B.25 | Replace Numerical Values B.84
Contextual Meaning Perturbation B.26 | Replace Spelling B.85
Contractions and Expansions Perturbation B.27 | Replace nouns with hyponyms or hypernyms B.86
Correct Common Misspellings B.28 | Sampled Sentence Additions B.87
Country/State Abbreviation B.29 | Sentence Reordering B.88
Decontextualisation of the main Event B.30 | Emoji Addition for Sentiment Data B.89
Diacritic Removal B.31 | Shuffle Within Segments B.90
Disability/Differently Abled Transformation B.32 | Simple Ciphers B.91
Discourse Marker Substitution B.33 | Slangificator B.92
Diverse Paraphrase Generation B.34 | Spanish Gender Swap B.93
Dislexia Words Swap B.35 | Speech Disfluency Perturbation B.94
Emoji Icon Transformation B.36 | Paraphrasing through Style Transfer B.95
Emojify B.37 | Subject Object Switch B.96
English Inflectional Variation B.38 | Sentence Summarization B.97
English Mention Replacement for NER B.39 | Suspecting Paraphraser for QA B.98
Filler Word Augmentation B.40 | Swap Characters Perturbation B.99
Style Transfer from Informal to Formal B.41 | Synonym Insertion B.100
French Conjugation Substitution B.42 | Synonym Substitution B.101
Gender And Culture Diversity Name Changer B.43 | Syntactically Diverse Paraphrasing B.102
Neopronoun Substitution B.44 | Subsequence Substitution for Seq. Tagging B.103
Gender Neutral Rewrite B.45 | Tense B.104
GenderSwapper B.46 | Token Replacement Based on Lookup Tables B.105
GeoNames Transformation B.47 | Transformer Fill B.106
German Gender Swap B.48 | Added Underscore Trick B.107
Grapheme to Phoneme Substitution B.49 | Unit converter B.108
Greetings and Farewells B.50 | Urban Thesaurus Swap B.109
Hashtagify B.51 | Use Acronyms B.110
Insert English and French Abbreviations B.52 | Visual Attack Letter B.111
Leet Transformation B.53 | Weekday Month Abbreviation B.112
Lexical Counterfactual Generator B.54 | Whitespace Perturbation B.113
Longer Location for NER B.55 | Context Noise for QA B.114
Longer Location Names for testing NER B.56 | Writing System Replacement B.115
Longer Names for NER B.57 | Yes-No Question Perturbation B.116
Lost in Translation B.58 | Yoda Transformation B.117
Mixed Language Perturbation B.59 |

Table 1: List of transformations and link to their detailed descriptions in Appendix

Filter App. | Filter App.
Code-Mixing Filter C.1 | Polarity Filter C.13
Diacritics Filter C.2 | Quantitative Question Filter C.14
Encoding Filter C.3 | Question type filter C.15
Englishness Filter C.4 | Repetitions Filter C.16
Gender Bias Filter C.5 | Phonetic Match Filter C.17
Group Inequity Filter C.6 | Special Casing Filter C.18
Keyword Filter C.7 | Speech-Tag Filter C.19
Language Filter C.8 | Token-Amount filter C.20
Length Filter C.9 | Toxicity Filter C.21
Named-entity-count Filter C.10 | Universal Bias Filter C.22
Numeric Filter C.11 | Yes/no question filter C.23
Oscillatory Hallucinations Filter C.12 |

Table 2: List of filters and link to their detailed descriptions in Appendix

Property | Definition | Tags
Augmented set type | Transformation or Filter (Subpopulation)? | Filter, Transformation, Multiple (specify), Unclear, N/A
General purpose | What will the data be used for? Augmenting training data? Testing robustness? Finding and fixing biases? Etc. | Augmentation, Bias, Robustness, Other (specify), Multiple (specify), Unclear, N/A
Task type | For which NLP task(s) will the perturbation be beneficial? | Quality estimation, Question answering, Question generation, RDF-to-text generation, Sentiment analysis, Table-to-text generation, Text classification, Text tagging, Text-to-text generation
Language(s) | To which language(s) is the perturbation applied? | *
Linguistic level | On which linguistic level does the perturbation operate? | Discourse, Semantic, Style, Lexical, Syntactic, Word-order, Morphological, Character, Other (specify), Multiple (specify), Unclear, N/A

Table 3: Criteria and possible tags for General Properties of perturbations

and cover the type of the augmentation, i.e. whether it is a transformation or a filter (Augmented set type), its general purpose, i.e. whether it is intended for augmentation, robustness, etc. (General purpose), for which NLP tasks the created data will be useful (Task type), to which languages it has been applied (Language(s)), and on which linguistic level of representation it operates, i.e. semantic, syntactic, lexical, etc. (Linguistic level).

Output properties tags, shown in Table 4, apply to transformations only; they provide indications about how the data was affected during the respective transformations. There are currently six properties in this category: one to capture the number of different outputs that a transformation can produce (Output/Input ratio), one to capture in which aspect the input and the output are alike (Input/Output similarity), and four to capture intrinsic qualities of the produced text or structured data, namely how the meaning, grammaticality, readability and naturalness are affected by the transformation (respectively Meaning preservation, Grammaticality preservation, Readability preservation and Naturalness preservation). Note that apart from Output/Input ratio, the output properties tags need to be specified manually for each transformation/filter (see Section 3.4), and are thus subject to the interpretation of the annotator.

Processing properties tags, shown in Table 5, capture information related to the type of processing applied on the input (Input data processing), the type of algorithm used (Algorithm type), how it is implemented (Implementation), its estimated precision and recall (Precision/recall) and computational complexity (Computational complexity / Time), and whether an accelerator is required to apply the transformation/filter (GPU required?).
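Taken together, the three tag groups amount to structured metadata attached to each perturbation. A minimal sketch of such a data card follows; the class and field names are our own illustration, not the exact NL-Augmenter interface:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PerturbationCard:
    """Illustrative bundle of the General (Table 3), Output (Table 4)
    and Processing (Table 5) property tags of one perturbation."""
    name: str
    augmented_set_type: str        # Filter / Transformation / ...
    general_purpose: List[str]     # e.g. ["Augmentation", "Robustness"]
    task_types: List[str]          # e.g. ["Text classification"]
    languages: List[str]           # e.g. ["en"]
    linguistic_levels: List[str]   # e.g. ["Character"]
    # Output property (transformations only).
    meaning_preservation: str = "Unclear"
    # Processing properties.
    implementation: str = "Rule-based"
    gpu_required: bool = False

# Hypothetical card for a character-level transformation.
card = PerturbationCard(
    name="whitespace_perturbation",
    augmented_set_type="Transformation",
    general_purpose=["Robustness"],
    task_types=["Text classification"],
    languages=["en"],
    linguistic_levels=["Character"],
    meaning_preservation="Always-preserved",
)
```

Defaulting unannotated properties to "Unclear" mirrors the tag sets above, which reserve Unclear and N/A for exactly this case.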

3.4 Tag retrieval and assignment

Transformations and filters are assigned tags for each of the properties listed in Tables 3-5. There are two sources for the tags: (i) assigning them

Property | Definition | Tags
Output/input ratio | Does the transformation generate one single output for each input, or a few, or many? | =1, >1 (Low), >1 (High), Multiple (specify), Unclear, N/A
Input/output similarity | On which level are the input and output similar (if applicable)? | Aural, Meaning, Visual, Other (specify), Multiple (specify), Unclear, N/A
Meaning preservation | If you compare the output with the input, how is the meaning affected by the transformation? | Always-preserved, Possibly-changed, Always-changed, Possibly-added, Always-added, Possibly-removed, Always-removed, Multiple (specify), Unclear, N/A
Grammaticality preservation | If you compare the output with the input, how is the grammatical correctness affected by the transformation? | Always-preserved, Possibly-impaired, Always-impaired, Possibly-improved, Always-improved, Multiple (specify), Unclear, N/A
Readability preservation | If you compare the output with the input, how is the ease of reading affected by the transformation? | Always-preserved, Possibly-impaired, Always-impaired, Possibly-improved, Always-improved, Multiple (specify), Unclear, N/A
Naturalness preservation | If you compare the output with the input, how is the naturalness of the text affected by the transformation? | Always-preserved, Possibly-impaired, Always-impaired, Possibly-improved, Always-improved, Multiple (specify), Unclear, N/A

Table 4: Criteria and possible tags for Output Properties of perturbations (applicable to transformations only)

Property | Definition | Tags
Input data processing | What kind of NL processing is applied to the input? | Addition, Chunking, Paraphrasing, Parsing, PoS-Tagging, Removal, Segmentation, Simplification, Stemming, Substitution, Tokenisation, Translation, Other (specify), Multiple (specify), Unclear, N/A
Implementation | Is the perturbation implemented as rule-based or model-based? | Model-based, Rule-based, Both, Unclear, N/A
Algorithm type | What type of algorithm is used to implement the perturbation? | API-based, External-knowledge-based, LSTM-based, Transformer-based, Other (specify), Multiple (specify), Unclear, N/A
Precision/recall | To what extent does the perturbation generate what it intends to generate (precision)? To what extent does the perturbation return an output for any input (recall)? | High-precision-High-recall, High-precision-Low-recall, Low-precision-High-recall, Low-precision-Low-recall, Unclear, N/A
GPU required? | Is a GPU needed to run the perturbation? | No, Yes, Unclear, N/A
Computational complexity / Time | How would you assess the computational complexity of running the perturbation? Does it need a lot of time to run? | High, Medium, Low

Table 5: Criteria and possible tags for Processing Properties of perturbations

manually, and (ii) using existing metadata embedded in the respective source code implementations of each given transformation and filter. The in-code metadata (see e.g. the Keywords field in Figure 2) provides descriptions for readily identifiable aspects such as the language(s) supported, the type of task that the transformation or filter is applicable for, and other characteristic keywords. The specification and type of this metadata was predefined as a requirement for all contributors to the NL-Augmenter project to enable identification of the type of transformation or filter being written by their respective author(s).

This metadata was initially collected through an automated script which programmatically iterated through each transformation and filter and gathered all stated metadata. The metadata was then mapped by the script into the discrete property groups defined in Tables 3-5. All contributing authors were invited to review the initially collected metadata and, where possible, add additional data.
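The collection step can be approximated by a short harvesting routine. The sketch below is our own illustration (the class and attribute names `keywords`, `languages`, and `tasks` are assumptions standing in for the repository's actual in-code metadata fields):

```python
def harvest_metadata(operations):
    """Map each operation class to the in-code metadata it declares,
    mirroring how the collection script groups tags per Tables 3-5."""
    return {
        op.__name__: {
            "keywords": getattr(op, "keywords", []),
            "languages": getattr(op, "languages", []),
            "tasks": getattr(op, "tasks", []),
        }
        for op in operations
    }

class WhitespacePerturbation:
    """Illustrative stand-in for one contributed transformation class."""
    keywords = ["noise", "rule-based", "high-precision"]
    languages = ["en"]
    tasks = ["TEXT_CLASSIFICATION"]

meta = harvest_metadata([WhitespacePerturbation])
```

In the actual pipeline the list of classes would be discovered by walking the repository's transformation and filter packages rather than being passed in by hand.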

4 Robustness Analysis

All authors of the accepted perturbations were asked to provide the task performance scores for each of their respective transformations or filters. In Section 4.1 we provide details on how the scores were obtained, and in Section 4.2 we provide a first analysis of these scores.

4.1 Experiment

The perturbations are currently split into three groups, according to the task(s) they will be evaluated on: text classification tasks, tagging tasks, and question-answering tasks. For the experiments in this paper, we focus on text classification and on the relevant perturbations. We compare the models' performance on the original data and on the perturbed data. The percentage of sentences being changed by a transformation (transformation rate) and the percentage of performance drop on the perturbed data compared to the performance on the original data (score variation) are reported.

Tasks. We choose four evaluation datasets among three English NLP tasks: (1) sentiment analysis on both short sentences (SST-2 (Socher et al., 2013)) and full paragraphs (IMDB Movie Review (Maas et al., 2011)), (2) duplicate question detection (QQP) (Wang et al., 2019a), and (3) natural language inference (MNLI) (Williams et al., 2017). These tasks cover both classifications on single sentences as well as pairwise comparisons, and have been widely used in various counterfactual analysis and augmentation experiments (Wu et al., 2021; Kaushik et al., 2019; Gardner et al., 2020; Ribeiro et al., 2020).

Evaluation models. We represent each dataset/task with its corresponding most downloaded large model hosted on Huggingface (Wolf et al., 2020), resulting in four models for evaluation: roberta-base-SST-2, roberta-base-imdb, roberta-large-mnli, and bert-base-uncased-QQP.3

Perturbation strategy. For each task, we perturb a random sample of 20% of the validation set. Since all the transformations are on single text snippets, for datasets with sentence pairs, i.e., QQP

3 huggingface.co/{textattack/roberta-base-SST-2, textattack/roberta-base-imdb, textattack/bert-base-uncased-QQP, roberta-large-mnli}

and MNLI, we perturb the first question and thepremise sentence, respectively.
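The two reported quantities, transformation rate and score variation, can be computed roughly as in the sketch below. Function names are ours, and `model_predict` stands in for any of the four classifiers; this is not the repository's actual evaluation engine:

```python
import random

def evaluate_perturbation(transform, model_predict, dataset,
                          sample_frac=0.2, seed=0):
    """Perturb a random sample of the validation set and report the
    transformation rate (fraction of texts actually changed) and the
    score variation (accuracy on perturbed minus accuracy on original)."""
    random.seed(seed)
    sample = random.sample(dataset, max(1, int(sample_frac * len(dataset))))
    perturbed = [(transform(text), label) for text, label in sample]
    changed = sum(new != old for (new, _), (old, _) in zip(perturbed, sample))
    transformation_rate = changed / len(sample)

    def accuracy(data):
        return sum(model_predict(text) == label for text, label in data) / len(data)

    # Negative values indicate a performance drop on the perturbed data.
    score_variation = accuracy(perturbed) - accuracy(sample)
    return transformation_rate, score_variation
```

A transformation that leaves a sentence unchanged (e.g. because none of its trigger patterns match) lowers the transformation rate, which is why RT is reported alongside VarS in Tables 6-16.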

4.2 Results and Analysis

In this section, Tables 6 to 16 show the results of the robustness analysis performed on the four datasets described in Section 4.1 and presented according to the tags introduced in Section 3.3.

General purpose (Table 6): Transformations designed with a "robustness testing" objective displayed mean performance drops between 9% and 13.7% across models. Interestingly, 34 sentence transformations designed for "augmentation" tasks showed similar mean robustness drops ranging between 4% and 13%, emphasizing the need to draw on the paraphrasing literature to improve robustness testing.

Task type (Table 7): The results table shows that there is not necessarily a correlation between which task a transformation is marked to be relevant for and which task it actually challenges the robustness of the models on.

Linguistic level (Table 8): Transformations making character-level and morphological changes showed drastic drops in performance compared to those making lexical or syntactic changes. These drops in performance were consistent across all four models. roberta-large finetuned on the MNLI dataset was the most brittle: character-level transformations dropped performance by over 31% on average, and morphological changes dropped it by 28%, while those which made lexical changes displayed a mean drop of 4.4%. The visual_attack_letters (B.111) transformation, which replaces characters with similar-looking ones (like y and v), shows a large accuracy drop from 94% to 56% on the roberta-base model fine-tuned on SST. bert-base-uncased fine-tuned on the QQP dataset drops from 92 to 69, and roberta-large-mnli drops from 91 to 47. In the case of visual_attack_letters, one can easily conceive a scenario in which a model is applied to OCR text, which likely exhibits similar properties. In this case, one may expect similarly poor performance, arguably attributable to the narrow set of characters that the models have been exposed to.
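The flavor of such an attack can be reproduced with a toy character-for-lookalike substitution; the mapping below is our own minimal example, not the transformation's actual table:

```python
# Map Latin characters to visually similar Cyrillic glyphs. To a human
# reader the perturbed string looks unchanged, but the byte sequence
# (and hence the model's token sequence) is entirely different.
LOOKALIKES = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "c": "\u0441"}

def visual_attack(text):
    """Replace every mapped character with its look-alike glyph."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in text)

perturbed = visual_attack("a nice romance")
```

Because subword tokenizers have rarely seen these glyphs in training, the perturbed input falls back to unusual token sequences, which is consistent with the large drops reported for character-level transformations above.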

Meaning preservation (Table 10): 22 transformations which were marked as highly meaning-preserving surprisingly showed a larger average performance drop as compared to 20 of those which were marked as possibly meaning-changing. Not discounting the possibility of the noisiness of the

Tag | #All | SST-2 Roberta-base: #Evl RT VarS | QQP BERT-base-unc.: #Evl RT VarS | MNLI Roberta-large: #Evl RT VarS | IMDB Roberta-base: #Evl RT VarS
Augmentation | 34 | 20 0.63 -13.25 | 20 0.75 -6 | 18 0.74 -8.89 | 17 0.73 -4.41
Bias | 3 | 1 0.5 -5 | 2 0.52 -11.5 | 2 0.53 -16 | 1 0.71 0
Robustness | 15 | 8 0.82 -9.38 | 7 0.59 -8.14 | 7 0.65 -12.14 | 7 0.88 -13.71
Other* | 1 | 1 0.5 -38 | 1 0.5 -23 | 1 0.5 -44 | 1 0.6 1
Multiple* | 21 | 13 0.72 -4.15 | 13 0.64 -5.08 | 12 0.68 -4.08 | 11 0.92 -5.64
Total | 74 | 43 | 43 | 40 | 37

Table 6: Results of the robustness evaluation from the perspective of the General purpose criterion (#All = Total number of tags, #Evl = Total number of evaluations collected, RT = Transformation rate, VarS = Score variation)

Tag | #All | SST-2 Roberta-base: #Evl RT VarS | QQP BERT-base-unc.: #Evl RT VarS | MNLI Roberta-large: #Evl RT VarS | IMDB Roberta-base: #Evl RT VarS
Qual. estim. | 2 | 2 0.52 -2.5 | 2 0.51 -6 | 2 0.53 -6.5 | 1 0.56 0
Question ans. | 3 | 2 0.7 -0.5 | 2 0.89 -1.5 | 2 0.77 -1 | 2 0.98 -4
Question gen. | 2 | 1 0.41 0 | 1 0.77 -1 | 1 0.54 -2 | 1 0.97 -5
RDF to text | 1 | 1 0.01 0 | 1 0.02 0 | 1 0.04 0 | 1 0.21 0
Sentiment ana. | 4 | 1 0.99 -12 | 1 0.99 -14 | 1 0.93 -18 | 1 1 -15
Table to text | 1 | 1 0.01 0 | 1 0.02 0 | 1 0.04 0 | 1 0.21 0
Text class. | 95 | 52 0.71 -9.27 | 52 0.68 -6.21 | 49 0.69 -8.33 | 43 0.83 -5.74
Text tagging | 25 | 17 0.79 -10.94 | 17 0.64 -6.82 | 16 0.66 -9.75 | 13 0.84 -9.23
Text to text gen. | 92 | 49 0.69 -8.86 | 49 0.66 -5.86 | 46 0.68 -7.57 | 40 0.79 -5.62
Total | 231 | 126 | 126 | 119 | 103

Table 7: Results of the robustness evaluation from the perspective of the Task type criterion (#All = Total number of tags, #Evl = Total number of evaluations collected, RT = Transformation rate, VarS = Score variation)

Tag | #All | SST-2 Roberta-base: #Evl RT VarS | QQP BERT-base-unc.: #Evl RT VarS | MNLI Roberta-large: #Evl RT VarS | IMDB Roberta-base: #Evl RT VarS
Semantic | 3 | 1 1 -35 | 1 1 -20 | 1 1.0 -42 | 1 1 -3
Lexical | 44 | 30 0.67 -5.83 | 30 0.61 -5 | 30 0.64 -4.4 | 25 0.73 -2.44
Syntactic | 3 | 1 1 -8 | 1 0.74 -7 | 1 0.85 -15 | 1 1 0
Word-order | 2 | 2 0.6 -1.5 | 2 0.61 -1 | 2 0.63 -2 | 1 1 0
Morphological | 3 | 2 0.75 -25.5 | 2 0.75 -21.5 | 2 0.75 -28.5 | 2 0.8 -4.5
Character | 6 | 2 1 -16.5 | 2 1.0 -12.5 | 1 0.95 -31 | 2 1 -26
Other* | 1 | 1 0 0 | 1 0.7 -4 | 0 | 1 1 -1
Multiple* | 25 | 9 0.74 -11.22 | 9 0.71 -7 | 9 0.74 -12.56 | 8 0.8 -14.5
Unclear | 1 | 1 1 -46 | 1 0.79 -2 | 0 | 0
Total | 92 | 49 | 49 | 46 | 41

Table 8: Results of the robustness evaluation from the perspective of the Linguistic level criterion (#All = Total number of tags, #Evl = Total number of evaluations collected, RT = Transformation rate, VarS = Score variation)

Tag | #All | SST-2 Roberta-base: #Evl RT VarS | QQP BERT-base-unc.: #Evl RT VarS | MNLI Roberta-large: #Evl RT VarS | IMDB Roberta-base: #Evl RT VarS
Aural | 5 | 3 1 -4.33 | 3 0.7 -6.67 | 2 0.7 -6.5 | 3 0.85 -3.67
Meaning | 51 | 31 0.6 -8.58 | 32 0.64 -5.72 | 31 0.64 -7.52 | 28 0.74 -5.75
Visual | 12 | 7 0.86 -15.29 | 6 0.8 -10.17 | 5 0.8 -12.8 | 5 0.92 -1
Other* | 5 | 1 0.83 0 | 1 0.55 -4 | 1 0.69 -2 | 0
Multiple* | 2 | 1 1 -34 | 1 1 -20 | 1 1.0 -38 | 2 1 -23
N/A | 2 | 2 0.92 -1 | 2 0.67 -6 | 2 0.77 -5 | 0
Total | 77 | 45 | 45 | 42 | 38

Table 9: Results of the robustness evaluation from the perspective of the Input/output similarity criterion (#All = Total number of tags, #Evl = Total number of evaluations collected, RT = Transformation rate, VarS = Score variation)

Tag | #All | SST-2 Roberta-base: #Evl RT VarS | QQP BERT-base-unc.: #Evl RT VarS | MNLI Roberta-large: #Evl RT VarS | IMDB Roberta-base: #Evl RT VarS
Alw. preserved | 40 | 22 0.65 -9.77 | 22 0.63 -7.36 | 22 0.61 -11.23 | 19 0.72 -9.89
Poss. changed | 33 | 20 0.78 -5.45 | 20 0.73 -5.15 | 17 0.75 -4.76 | 18 0.87 -1.5
Alw. changed | 12 | 5 0.7 -4 | 5 0.54 -5.4 | 5 0.61 -6.8 | 3 0.78 -7.33
Alw. added | 2 | 1 0 -94 | 1 0.7 -4 | 1 0.78 0 | 1 0.99 -1
Poss. removed | 2 | 2 1 -18 | 2 1 -13 | 2 0.88 -23.5 | 1 1 -3
Total | 89 | 50 | 50 | 47 | 42

Table 10: Results of the robustness evaluation from the perspective of the Meaning preservation criterion (#All = Total number of tags, #Evl = Total number of evaluations collected, RT = Transformation rate, VarS = Score variation)

Tag | #All | SST-2 Roberta-base: #Evl RT VarS | QQP BERT-base-unc.: #Evl RT VarS | MNLI Roberta-large: #Evl RT VarS | IMDB Roberta-base: #Evl RT VarS
Alw. preserved | 31 | 19 0.59 -10.58 | 19 0.52 -4.63 | 18 0.53 -8.11 | 17 0.76 -4.94
Poss. impaired | 36 | 20 0.69 -3.15 | 20 0.69 -4.55 | 19 0.72 -4.21 | 18 0.81 -2.11
Alw. impaired | 2 | 1 0.93 -7 | 1 0.94 -20 | 1 0.92 -16 | 1 1 -1
Poss. improved | 6 | 6 0.83 -16.33 | 6 0.8 -8.17 | 5 0.79 -14.8 | 2 0.52 -1.5
Unclear | 1 | 1 1 -34 | 1 1 -20 | 1 1.0 -38 | 1 1 -45
N/A | 2 | 2 1 -23.5 | 2 1 -22 | 2 1 -27 | 2 1 -36.5
Total | 79 | 49 | 49 | 46 | 41

Table 11: Results of the robustness evaluation from the perspective of the Grammaticality preservation criterion (#All = Total number of tags, #Evl = Total number of evaluations collected, RT = Transformation rate, VarS = Score variation)

Tag | #All | SST-2 Roberta-base: #Evl RT VarS | QQP BERT-base-unc.: #Evl RT VarS | MNLI Roberta-large: #Evl RT VarS | IMDB Roberta-base: #Evl RT VarS
Alw. preserved | 25 | 15 0.66 -3 | 15 0.54 -3.47 | 15 0.56 -5.53 | 12 0.83 -2.33
Poss. impaired | 38 | 24 0.64 -10.67 | 24 0.69 -6.25 | 22 0.69 -6.59 | 22 0.79 -2.41
Alw. impaired | 9 | 4 1 -25.25 | 4 1.0 -17.25 | 3 0.98 -36.67 | 4 1 -40
Poss. improved | 4 | 4 0.75 -11.75 | 4 0.75 -8.75 | 4 0.75 -16.25 | 2 0.52 -1.5
Alw. improved | 2 | 1 -1 | 1 -6 | 1 0.77 -5 | 0
Unclear | 1 | 1 1 0 | 1 0.06 0 | 1 0.15 0 | 1 0.32 0
Total | 79 | 49 | 49 | 46 | 41

Table 12: Results of the robustness evaluation from the perspective of the Readability preservation criterion (#All = Total number of tags, #Evl = Total number of evaluations collected, RT = Transformation rate, VarS = Score variation)

Tag | #All | SST-2 Roberta-base: #Evl RT VarS | QQP BERT-base-unc.: #Evl RT VarS | MNLI Roberta-large: #Evl RT VarS | IMDB Roberta-base: #Evl RT VarS
Alw. preserved | 18 | 9 0.59 -3.33 | 10 0.52 -3.5 | 9 0.51 -7.44 | 9 0.75 -2.56
Poss. impaired | 45 | 29 0.66 -8.48 | 29 0.64 -5.38 | 27 0.67 -5.15 | 24 0.79 -1.75
Alw. impaired | 8 | 4 1.0 -20.5 | 4 1.0 -16.25 | 4 0.97 -23.25 | 4 1 -32.25
Poss. improved | 4 | 4 0.75 -11.75 | 4 0.75 -8.75 | 4 0.75 -16.25 | 2 0.52 -1.5
Unclear | 1 | 1 1 -34 | 1 1 -20 | 1 1.0 -38 | 1 1 -45
Total | 77 | 47 | 48 | 45 | 40

Table 13: Results of the robustness evaluation from the perspective of the Naturalness preservation criterion (#All = Total number of tags, #Evl = Total number of evaluations collected, RT = Transformation rate, VarS = Score variation)

Tag | #All | SST-2 Roberta-base: #Evl RT VarS | QQP BERT-base-unc.: #Evl RT VarS | MNLI Roberta-large: #Evl RT VarS | IMDB Roberta-base: #Evl RT VarS
Addition | 1 | 1 0 -94 | 1 0.7 -4 | 1 0.78 0 | 1 0.99 -1
Paraphrasing | 5 | 5 0.79 -1.8 | 5 0.74 -5.6 | 4 0.77 -6.25 | 3 0.77 -0.67
Parsing | 1 | 1 0.02 0 | 1 0.16 -1 | 1 0.15 0 | 1 0.59 0
PoS-Tagging | 5 | 3 0.44 -11.67 | 3 0.54 -6.67 | 3 0.54 -14.33 | 2 0.98 -1.5
Removal | 2 | 2 1 -4.5 | 2 0.74 -6.5 | 2 0.81 -10 | 1 1 0
Segmentation | 3 | 1 1 -4 | 1 0.93 -6 | 1 0.94 -5 | 1 1 -4
Substitution | 17 | 13 0.63 -8.08 | 14 0.61 -8 | 14 0.64 -9.36 | 13 0.67 -5
Tokenisation | 23 | 9 0.67 -4.89 | 9 0.5 -4.22 | 9 0.54 -4.56 | 10 0.76 -3.8
Translation | 3 | 2 0.99 -11 | 2 0.99 -13.5 | 2 0.97 -18.5 | 1 1 -15
Other* | 3 | 2 1 -17 | 2 1.0 -10 | 1 0.95 -38 | 2 1 -23
Multiple* | 13 | 6 0.69 -1.33 | 5 0.6 -2.2 | 5 0.58 -4.8 | 3 0.72 -2
Unclear | 1 | 1 1 -46 | 1 0.79 -2 | 0 | 0
N/A | 3 | 2 0.85 -18.5 | 2 0.9 -14 | 2 0.89 -20.5 | 2 1 -32
Total | 81 | 48 | 48 | 45 | 40

Table 14: Results of the robustness evaluation from the perspective of the Input data processing criterion (#All = Total number of tags, #Evl = Total number of evaluations collected, RT = Transformation rate, VarS = Score variation)

Tag | #All | SST-2 Roberta-base: #Evl RT VarS | QQP BERT-base-unc.: #Evl RT VarS | MNLI Roberta-large: #Evl RT VarS | IMDB Roberta-base: #Evl RT VarS
Model-based | 19 | 11 0.95 -11.27 | 11 0.93 -7.64 | 9 0.93 -11.78 | 7 0.81 -2.43
Rule-based | 66 | 38 0.65 -9.24 | 38 0.61 -6.26 | 37 0.64 -8.14 | 34 0.79 -6.5
Both | 6 | 2 0.31 0 | 2 0.5 -0.5 | 2 0.42 -1.5 | 1 0.97 -5
Unclear | 1 | 1 1 -7 | 1 0.84 -4 | 1 0.9 -2 | 1 1 -1
Total | 103 | 52 | 52 | 49 | 43

Table 15: Results of the robustness evaluation from the perspective of the Implementation criterion (#All = Total number of tags, #Evl = Total number of evaluations collected, RT = Transformation rate, VarS = Score variation)

Tag | #All | SST-2 Roberta-base: #Evl RT VarS | QQP BERT-base-unc.: #Evl RT VarS | MNLI Roberta-large: #Evl RT VarS | IMDB Roberta-base: #Evl RT VarS
API-based | 22 | 14 0.78 -7.86 | 14 0.67 -7 | 13 0.73 -9.23 | 11 0.88 -11.45
Ext. K.-based | 33 | 19 0.47 -11 | 19 0.55 -6.95 | 19 0.55 -7.89 | 20 0.68 -4.45
LSTM-based | 1 | 1 1 0 | 1 1.0 0 | 0 0.9 | 1 1 -1
Transf.-based | 15 | 7 0.89 -9.57 | 7 0.85 -5.29 | 6 0.87 -7.17 | 1 1 -4
Multiple* | 3 | 1 0.41 0 | 1 0.77 -1 | 1 0.54 -2 | 1 0.97 -5
Unclear | 1 | 0 | 0 | 0 | 1 -1
N/A | 24 | 4 1.0 -13.25 | 4 0.77 -8.5 | 4 0.75 -18.75 | 3 0.89 -6
Total | 103 | 46 | 46 | 43 | 38

Table 16: Results of the robustness evaluation from the perspective of the Algorithm type criterion (#All = Total number of tags, #Evl = Total number of evaluations collected, RT = Transformation rate, VarS = Score variation)

transformation's logic, we believe further investigation could help understand whether models focus on the meaning of words or sentences, or take shortcuts by focusing on commonly occurring surface forms associated with a particular prediction, as was already shown for some phenomena by McCoy et al. (2019), among others.

Grammaticality preservation (Table 11): Preserving grammaticality did not correlate with high robustness. Transformations marked as grammaticality always-preserved showed significant average drops of 10.6%, 8.1% and 4.6% across roberta-base-SST-2, roberta-large-mnli and bert-base-uncased-QQP respectively. For example, the grapheme_to_phoneme transformation showed drastic drops in performance: 13%, 20% and 13%, respectively.

Readability and Naturalness (Tables 12-13): In general, as expected, the transformations tagged as modifying the readability or naturalness show large drops across all tasks and models, in particular the ones tagged as "always impairing" the input.

Unsurprisingly, many of the injected perturbations, despite being artificial, would not distract human readers from the actual meaning and intent of the text (e.g. the simple_ciphers transformation (B.91)). Character-level perturbations might not distract human readers as much as word-level perturbations, but the above language models behaved contrarily. Such

departure from learning meaningful abstractions is further validated by the low correlation of grammaticality preservation and robustness. These results further question how we can expand these models from being pure statistical learners to ones which can incorporate meaning and surface-level abstraction, across both natural and artificial constructs. The large drops in performance under such perturbations necessitate expanding training sets with even artificial data sources, as well as expanding our definitions of text similarity from purely linguistic ones to ones which abstract over morphological, visual and other errors which can be unambiguous to humans.

Tables 9, 14, 15 and 16 show the robustness scores for Input/output similarity, Input data processing, Implementation and Algorithm type respectively. The score drops for these criteria may not be easily interpretable; e.g., that model-based implementations showed comparatively larger average drops than rule-based implementations may not be due to the difference in implementation, but rather to which transformations were implemented that way.

5 Discussion and Broader Impact

Limitations In Section 4.2, we analyze the results of applying some of the transformations on existing datasets and running models on the perturbed data. Even though it was not possible to test all of the currently existing perturbations (mostly due to time constraints), the overall results show that the tested perturbations do pose a challenge to different models on different tasks, with quasi-systematic score drops. However, with so many transformations applied to four different datasets, the presented robustness analysis can only be shallow, and a separate analysis of each transformation would be needed in order to get more informative insights. Second, our superficial analysis above relies on tags which were in many cases annotated by hand, and some of the surprising results (e.g. meaning-preserving transformations are more challenging than non-meaning-preserving ones) may reflect a lack of consistency in the annotations. We believe that assessing the quality of the tag assignment so as to ensure high inter-annotator agreement will be needed for reliable analyses in the future. Finally, the current robustness analysis only shows that the perturbations are effective for detecting a possible weakness in a model; further experiments are needed to demonstrate that the perturbations can also help mitigate the weaknesses they bring to light.

Dilution of Contributions While this is not our intent, there is a risk in large-scale collections of work like this that individual contributions are appreciated less than if they had been released as standalone projects. This risk is a tradeoff with the advantage that it becomes much easier to switch between different transformations, which can lead to better adoption of the introduced methods. To proactively give appropriate credit, each transformation has a data card mentioning the contributors, and all participants are listed as co-authors of this paper. We further encourage all users of our repository to cite the work that a specific implementation builds on, if appropriate. The relevant citations are listed on the respective data cards and in the descriptions in the appendix. In the same vein, there is a risk that NL-Augmenter as a whole monopolizes the augmentation space due to its large scope, leading to less usage of related work which may cover additional transformations or filters. While this is not our intention, and we actively worked with contributors to related repositories to integrate their work, we encourage researchers to try other solutions as well.

Participatory Setup Conducting research in environments with a shared mission, a low barrier of entry, and direct involvement of affected communities was popularized by Nekoto et al. (2020). This kind of participatory work has many advantages, most notably that it changes the typically prescriptive research workflow toward a more inclusive one. Another advantage is that through open science, anyone can help shape the overall mission and improve the end result. Following the related BIG-bench project, we aimed to design NL-Augmenter in a similar spirit: by providing the infrastructure, the participation barrier is reduced to filling in a templated interface and providing test examples. By making the interface as flexible as possible, the contributions range from filters for subpopulations with specific protected attributes to transformations via neural style transfer. Through this wide range, we hope that researchers can apply a wider range of augmentation and evaluation strategies to their data and models.

6 Conclusion

In this paper, we introduced NL-Augmenter, a framework for text transformations and filters with the goal of assisting in robustness testing and other data augmentation tasks. We demonstrated that through an open participation strategy, NL-Augmenter can cover a much wider set of languages, tasks, transformations, and filters than related work without a loss of focus. In total, our repository provides more than 117 transformations and more than 23 filters which are all documented and tested, and which will contribute toward more robust NLP models and their evaluation. As we point out in our analysis, there is much room to improve NL-Augmenter. We welcome future contributions to improve its coverage of the potential augmentation space and to address its current shortcomings. Future work may further include data augmentation experiments at a larger scale to investigate the effect on model robustness.

7 Organization

NL-Augmenter is a large effort organized by researchers and developers ranging across different niches of NLP. To acknowledge everyone's contributions, we list the contribution statements below for all.

Steering Committee: Kaustubh Dhole, Varun Gangal, Sebastian Gehrmann, Aadesh Gupta, Zhenhao Li, Saad Mahamood, Simon Mille, Jascha Sohl-Dickstein, Ashish Srivastava, Samson Tan, Tongshuang Wu and Abinaya Mahendiran make up the steering committee. Jinho D. Choi, Eduard Hovy & Sebastian Ruder provided guidance and feedback. Kaustubh Dhole coordinates and leads the NL-Augmenter effort. All others provide feedback and discuss larger decisions regarding the direction of NL-Augmenter and act as organizers and reviewers.

Repository: Kaustubh, Aadesh, Zhenhao, Tongshuang, Ashish, Saad, Varun & Abinaya created the interfaces and the base repository NL-Augmenter for participants to contribute. This was also a continuation of the repository developed for creating challenge sets (Mille et al., 2021) for GEM (Gehrmann et al., 2021). All the other authors expanded this repository with their implementations.

Reviewers: Kaustubh, Simon, Zhenhao, Sebastian, Varun, Samson, Abinaya, Saad, Tongshuang, Aadesh, Ondrej were involved in reviewing the submissions of participants of the first phase. In the second phase, all other authors performed a cross-review, in which participants were paired with 3 other participants. This was followed by a meta review by the organizers.

Robustness Evaluation: Ashish, Tongshuang, Kaustubh & Zhenhao created the evaluation engine. Simon, Kaustubh, Saad, Abinaya & Tongshuang performed the robustness analysis.

Website: Aadesh and Sebastian created the webpages for the project.

References

2006. Respectful Disability Language: Here's What's Up! https://www.aucd.org/docs/add/sa_summits/Language%20Doc.pdf.

David Bamman. 2017. Natural language processing for the long tail. In DH.

Alexandre Berard, Ioan Calapodescu, and Claude Roux. 2019. Naver Labs Europe's systems for the WMT19 machine translation robustness task. arXiv preprint arXiv:1907.06488.

Rahul Bhagat and Eduard Hovy. 2013. What is a paraphrase? Computational Linguistics, 39(3):463-472.

Abhinav Bhatt and Kaustubh D. Dhole. 2020. Benchmarking BioRelEx for entity tagging and relation extraction.

Steven Bird. 2006. NLTK: the Natural Language Toolkit. In Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions, pages 69-72.

Smorga's Board. 2021. Frequently misspelled word list for dyslexia.

Claire Bonial, Jena Hwang, Julia Bonn, Kathryn Conger, Olga Babko-Malaya, and Martha Palmer. 2012. English PropBank annotation guidelines. Center for Computational Language and Education Research, Institute of Cognitive Science, University of Colorado at Boulder, 48.

Jiaao Chen, Derek Tam, Colin Raffel, Mohit Bansal, and Diyi Yang. 2021. An empirical survey of data augmentation for limited data learning in NLP.

Xiang Dai and Heike Adel. 2020. An analysis of simple data augmentation for named entity recognition. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3861-3867, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Prithiviraj Damodaran. Styleformer.

Sebastian Deorowicz and Marcin G Ciura. 2005. Correcting spelling errors by modelling their causes. International Journal of Applied Mathematics and Computer Science, 15:275-285.

Kaustubh D. Dhole. 2020. Resolving intent ambiguities by retrieving discriminative clarifying questions.

Emily Dinan, Angela Fan, Adina Williams, Jack Urbanek, Douwe Kiela, and Jason Weston. 2020. Queens are powerful too: Mitigating gender bias in dialogue generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8173-8188, Online. Association for Computational Linguistics.

William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

Thomas Dopierre, Christophe Gravier, and Wilfried Logerais. 2021. ProtAugment: Unsupervised diverse short-texts paraphrasing for intent detection meta-learning. CoRR, abs/2105.12995.

Steffen Eger and Yannik Benz. 2020. From hero to zéroe: A benchmark of low-level adversarial attacks.

Steffen Eger, Gözde Gül Sahin, Andreas Rücklé, Ji-Ung Lee, Claudia Schulz, Mohsen Mesgar, Krishnkant Swarnkar, Edwin Simpson, and Iryna Gurevych. 2019a. Text processing like humans do: Visually attacking and shielding NLP systems. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1634-1647, Minneapolis, Minnesota. Association for Computational Linguistics.

Steffen Eger, Gözde Gül Sahin, Andreas Rücklé, Ji-Ung Lee, Claudia Schulz, Mohsen Mesgar, Krishnkant Swarnkar, Edwin Simpson, and Iryna Gurevych. 2019b. Text processing like humans do: Visually attacking and shielding NLP systems. CoRR, abs/1903.11508.

Ben Eisner, Tim Rocktäschel, Isabelle Augenstein,Matko Bošnjak, and Sebastian Riedel. 2016.emoji2vec: Learning emoji representations fromtheir description. In Proceedings of The FourthInternational Workshop on Natural Language Pro-cessing for Social Media, pages 48–54, Austin, TX,USA. Association for Computational Linguistics.

Marzieh Fadaee, Arianna Bisazza, and Christof Monz.2017. Data augmentation for low-resource neuralmachine translation. In Proceedings of the 55th An-nual Meeting of the Association for ComputationalLinguistics (Volume 2: Short Papers), pages 567–573, Vancouver, Canada. Association for Computa-tional Linguistics.

Angela Fan, Shruti Bhosale, Holger Schwenk, ZhiyiMa, Ahmed El-Kishky, Siddharth Goyal, MandeepBaines, Onur Celebi, Guillaume Wenzek, VishravChaudhary, et al. 2021. Beyond english-centric mul-tilingual machine translation. Journal of MachineLearning Research, 22(107):1–48.

Angela Fan, Shruti Bhosale, Holger Schwenk, ZhiyiMa, Ahmed El-Kishky, Siddharth Goyal, Man-deep Baines, Onur Çelebi, Guillaume Wenzek,Vishrav Chaudhary, Naman Goyal, Tom Birch, Vi-taliy Liptchinsky, Sergey Edunov, Edouard Grave,Michael Auli, and Armand Joulin. 2020. Be-yond english-centric multilingual machine transla-tion. ArXiv, abs/2010.11125.

Steven Y Feng, Varun Gangal, Jason Wei, Sarath Chan-dar, Soroush Vosoughi, Teruko Mitamura, and Ed-uard Hovy. 2021. A survey of data augmentation ap-proaches for nlp. arXiv preprint arXiv:2105.03075.

Francis Galton. 1907. Vox populi (the wisdom ofcrowds). Nature, 75(7):450–451.

Varun Gangal, Steven Y Feng, Eduard Hovy, and Teruko Mitamura. 2021. NAREOR: The narrative reordering problem. arXiv preprint arXiv:2104.06669.

Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, et al. 2020. Evaluating models' local decision boundaries via contrast sets. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 1307–1323.

Jon Gauthier, Jennifer Hu, Ethan Wilcox, Peng Qian, and Roger Levy. 2020. SyntaxGym: An online platform for targeted evaluation of language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 70–76.

Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Anuoluwapo Aremu, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna-Adriana Clinciu, Dipanjan Das, Kaustubh Dhole, Wanyu Du, Esin Durmus, Ondrej Dušek, Chris Chinenye Emezue, Varun Gangal, Cristina Garbacea, Tatsunori Hashimoto, Yufang Hou, Yacine Jernite, Harsh Jhamtani, Yangfeng Ji, Shailza Jolly, Mihir Kale, Dhruv Kumar, Faisal Ladhak, Aman Madaan, Mounica Maddela, Khyati Mahajan, Saad Mahamood, Bodhisattwa Prasad Majumder, Pedro Henrique Martins, Angelina McMillan-Major, Simon Mille, Emiel van Miltenburg, Moin Nadeem, Shashi Narayan, Vitaly Nikolaev, Andre Niyongabo Rubungo, Salomey Osei, Ankur Parikh, Laura Perez-Beltrachini, Niranjan Ramesh Rao, Vikas Raunak, Juan Diego Rodriguez, Sashank Santhanam, João Sedoc, Thibault Sellam, Samira Shaikh, Anastasia Shimorina, Marco Antonio Sobrevilla Cabezudo, Hendrik Strobelt, Nishant Subramani, Wei Xu, Diyi Yang, Akhila Yerukola, and Jiawei Zhou. 2021. The GEM benchmark: Natural language generation, its evaluation and metrics. In Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021), pages 96–120, Online. Association for Computational Linguistics.

Daniel Gildea and Martha Stone Palmer. 2002. The necessity of parsing for predicate argument recognition. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA, pages 239–246. ACL.

Karan Goel, Nazneen Rajani, Jesse Vig, Samson Tan, Jason Wu, Stephan Zheng, Caiming Xiong, Mohit Bansal, and Christopher Ré. 2021. Robustness Gym: Unifying the NLP evaluation landscape. arXiv preprint arXiv:2101.04840.

Yoav Goldberg. 2017. Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies, 10(1):1–309.

Tanya Goyal and Greg Durrett. 2020. Neural syntactic preordering for controlled paraphrase generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 238–252, Online. Association for Computational Linguistics.

Sharath Chandra Guntuku, Mingyang Li, Louis Tay, and Lyle H Ungar. 2019. Studying cultural differences in emoji usage across the east and the west. In Proceedings of the International AAAI Conference on Web and Social Media, volume 13, pages 226–235.

Aadesh Gupta, Kaustubh D. Dhole, Rahul Tarway, Swetha Prabhakar, and Ashish Shrivastava. 2021. CANDLE: Decomposing conditional and conjunctive queries for task-oriented dialogue systems.

Fabrice Harel-Canada. 2021. Sibyl.

Iris Hendrickx, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Stan Szpakowicz, and Tony Veale. 2013. SemEval-2013 task 4: Free paraphrases of noun compounds. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 138–143.

hyperreality@GitHub. American British English Translator. https://github.com/hyperreality/American-British-English-Translator.

Hamid Jalalzai, Pierre Colombo, Chloé Clavel, Eric Gaussier, Giovanna Varni, Emmanuel Vignon, and Anne Sabourin. 2020. Heavy-tailed representations, text polarity classification & data augmentation. In Advances in Neural Information Processing Systems, volume 33, pages 4295–4307. Curran Associates, Inc.

Robin Jia and Percy Liang. 2017a. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 2021–2031. Association for Computational Linguistics.

Robin Jia and Percy Liang. 2017b. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031, Copenhagen, Denmark. Association for Computational Linguistics.

Ishan Jindal, Ranit Aharonov, Siddhartha Brahma, Huaiyu Zhu, and Yunyao Li. 2020. Improved semantic role labeling using parameterized neighborhood memory adaptation. arXiv preprint arXiv:2011.14459.

Divyansh Kaushik, Eduard Hovy, and Zachary C Lipton. 2019. Learning the difference that makes a difference with counterfactually-augmented data. arXiv preprint arXiv:1909.12434.

Hrant Khachatrian, Lilit Nersisyan, Karen Hambardzumyan, Tigran Galstyan, Anna Hakobyan, Arsen Arakelyan, A. Rzhetsky, and A. G. Galstyan. 2019. BioRelEx 1.0: Biological relation extraction benchmark. In BioNLP@ACL.

Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, Zhiyi Ma, Tristan Thrush, Sebastian Riedel, Zeerak Waseem, Pontus Stenetorp, Robin Jia, Mohit Bansal, Christopher Potts, and Adina Williams. 2021. Dynabench: Rethinking benchmarking in NLP. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4110–4124, Online. Association for Computational Linguistics.

Paul R. Kingsbury and Martha Palmer. 2002. From TreeBank to PropBank. In Proceedings of the Third International Conference on Language Resources and Evaluation, LREC 2002, May 29-31, 2002, Las Palmas, Canary Islands, Spain. European Language Resources Association.

Tomáš Kocisky, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328.

Venelin Kovatchev, Phillip Smith, Mark Lee, and Rory Devine. 2021. Can vectors read minds better than experts? comparing data augmentation strategies for the automated scoring of children's mindreading ability. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1196–1206, Online. Association for Computational Linguistics.

Kalpesh Krishna, John Wieting, and Mohit Iyyer. 2020. Reformulating unsupervised style transfer as paraphrase generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 737–762, Online. Association for Computational Linguistics.

Ashutosh Kumar, Satwik Bhattamishra, Manik Bhandari, and Partha Talukdar. 2019. Submodular optimization-based diverse paraphrasing and its effectiveness in data augmentation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3609–3619, Minneapolis, Minnesota. Association for Computational Linguistics.

Guillaume Lample, Alexis Conneau, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. In International Conference on Learning Representations.

Charlyn M Laserna, Yi-Tai Seih, and James W Pennebaker. 2014. Um... who like says you know: Filler word use as a function of age, gender, and personality. Journal of Language and Social Psychology, 33(3):328–338.

Mark Lauer. 1995. Designing Statistical Language Learners: Experiments on Noun Compounds. Ph.D. thesis.

Kenton Lee, Luheng He, and Luke Zettlemoyer. 2018. Higher-order coreference resolution with coarse-to-fine inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 687–692.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880.

Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, Joe Davison, Mario Šaško, Gunjan Chhablani, Bhavitvya Malik, Simon Brandeis, Teven Le Scao, Victor Sanh, Canwen Xu, Nicolas Patry, Angelina McMillan-Major, Philipp Schmid, Sylvain Gugger, Clément Delangue, Théo Matussière, Lysandre Debut, Stas Bekman, Pierric Cistac, Thibault Goehringer, Victor Mustar, François Lagunas, Alexander M. Rush, and Thomas Wolf. 2021a. Datasets: A community library for natural language processing.

Quentin Lhoest, Albert Villanova del Moral, Patrick von Platen, Thomas Wolf, Mario Šaško, Yacine Jernite, Abhishek Thakur, Lewis Tunstall, Suraj Patil, Mariama Drame, Julien Chaumond, Julien Plu, Joe Davison, Simon Brandeis, Victor Sanh, Teven Le Scao, Kevin Canwen Xu, Nicolas Patry, Steven Liu, Angelina McMillan-Major, Philipp Schmid, Sylvain Gugger, Nathan Raw, Sylvain Lesage, Anton Lozhkov, Matthew Carrigan, Théo Matussière, Leandro von Werra, Lysandre Debut, Stas Bekman, and Clément Delangue. 2021b. huggingface/datasets: 1.14.0.

Dianqi Li, Yizhe Zhang, Hao Peng, Liqun Chen, Chris Brockett, Ming-Ting Sun, and Bill Dolan. 2020a. Contextualized perturbation for textual adversarial attack. arXiv preprint arXiv:2009.07502.

Dianqi Li, Yizhe Zhang, Hao Peng, Liqun Chen, Chris Brockett, Ming-Ting Sun, and Bill Dolan. 2020b. Contextualized perturbation for textual adversarial attack. CoRR, abs/2009.07502.

Zhenhao Li and Lucia Specia. 2019. Improving neural machine translation robustness via data augmentation: Beyond back-translation. Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019).

Bill Yuchen Lin, Wangchunshu Zhou, Ming Shen, Pei Zhou, Chandra Bhagavatula, Yejin Choi, and Xiang Ren. 2019. CommonGen: A constrained text generation challenge for generative commonsense reasoning. arXiv preprint arXiv:1911.03705.

Zihan Liu, Genta Indra Winata, and Pascale Fung. 2021. Continual mixed-language pre-training for extremely low-resource neural machine translation. arXiv preprint arXiv:2105.03953.

Zihan Liu, Genta Indra Winata, Zhaojiang Lin, Peng Xu, and Pascale Fung. 2020. Attention-informed mixed-language training for zero-shot cross-lingual task-oriented dialogue systems. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8433–8440.

Lajanugen Logeswaran, Honglak Lee, and Samy Bengio. 2018. Content preserving text generation with attribute controls. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 5108–5118.

Kaiji Lu, Piotr Mardziel, Fangjing Wu, Preetam Amancharla, and Anupam Datta. 2019. Gender bias in neural natural language processing.

Edward Ma. 2019. NLP augmentation. https://github.com/makcedward/nlpaug.

Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Vukosi Marivate and Tshephisho Sefara. 2020. Improving short text classification through global augmentation methods. In International Cross-Domain Conference for Machine Learning and Knowledge Extraction, pages 385–399. Springer.

Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy. Association for Computational Linguistics.

Merriam-Webster. What is a diacritic, anyway?

Simon Mille, Kaustubh D. Dhole, Saad Mahamood, Laura Perez-Beltrachini, Varun Gangal, Mihir Kale, Emiel van Miltenburg, and Sebastian Gehrmann. 2021. Automatic construction of evaluation suites for natural language generation datasets.

George A Miller. 1998. WordNet: An electronic lexical database. MIT Press.

Shubhanshu Mishra, Sijun He, and Luca Belli. 2020. Assessing demographic bias in named entity recognition. CoRR, abs/2008.03415.

John Morris, Eli Lifland, Jin Yong Yoo, Jake Grigsby, Di Jin, and Yanjun Qi. 2020. TextAttack: A framework for adversarial attacks, data augmentation, and adversarial training in NLP. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 119–126.

Marcin Namysl, Sven Behnke, and Joachim Köhler. 2020. NAT: Noise-aware training for robust neural sequence labeling. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1501–1517, Online. Association for Computational Linguistics.

Marcin Namysl, Sven Behnke, and Joachim Köhler. 2021. Empirical error modeling improves robustness of noisy neural sequence labeling. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 314–329, Online. Association for Computational Linguistics.

Wilhelmina Nekoto, Vukosi Marivate, Tshinondiwa Matsila, Timi Fasubaa, Taiwo Fagbohungbe, Solomon Oluwole Akinola, Shamsuddeen Muhammad, Salomon Kabongo Kabenamualu, Salomey Osei, Freshia Sackey, Rubungo Andre Niyongabo, Ricky Macharm, Perez Ogayo, Orevaoghene Ahia, Musie Meressa Berhe, Mofetoluwa Adeyemi, Masabata Mokgesi-Selinga, Lawrence Okegbemi, Laura Martinus, Kolawole Tajudeen, Kevin Degila, Kelechi Ogueji, Kathleen Siminyu, Julia Kreutzer, Jason Webster, Jamiil Toure Ali, Jade Abbott, Iroro Orife, Ignatius Ezeani, Idris Abdulkadir Dangana, Herman Kamper, Hady Elsahar, Goodness Duru, Ghollah Kioko, Murhabazi Espoir, Elan van Biljon, Daniel Whitenack, Christopher Onyefuluchi, Chris Chinenye Emezue, Bonaventure F. P. Dossou, Blessing Sibanda, Blessing Bassey, Ayodele Olabiyi, Arshath Ramkilowan, Alp Öktem, Adewale Akinfaderin, and Abdallah Bashir. 2020. Participatory research for low-resourced machine translation: A case study in African languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2144–2160, Online. Association for Computational Linguistics.

Toan Q Nguyen, Kenton Murray, and David Chiang. 2021. Data augmentation by concatenation for low-resource translation: A mystery and a solution. In Proceedings of the International Workshop on Spoken Language Translation, Online. Association for Computational Linguistics.

Vasile Florian Pais. 2019. Contributions to semantic processing of texts; Identification of entities and relations between textual units; Case study on Romanian language. Ph.D. thesis.

Martha Palmer, Paul R. Kingsbury, and Daniel Gildea. 2005. The proposition bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.

Soham Parikh, Ananya B. Sai, Preksha Nema, and Mitesh M. Khapra. 2019. ElimiNet: A model for eliminating options for reading comprehension with multiple choice questions. CoRR, abs/1904.02651.

Kyubyong Park and Seanie Lee. 2020. g2pM: A neural grapheme-to-phoneme conversion package for Mandarin Chinese based on a new open benchmark dataset. CoRR, abs/2004.03136.

Charles Pierse. 2021. Transformers Interpret.

Aleksandra Piktus, Necati Bora Edizel, Piotr Bojanowski, Edouard Grave, Rui Ferreira, and Fabrizio Silvestri. 2019. Misspelling oblivious word embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3226–3234, Minneapolis, Minnesota. Association for Computational Linguistics.

Emily Pitler, Mridhula Raghupathy, Hena Mehta, Ani Nenkova, Alan Lee, and Aravind Joshi. 2008. Easily identifiable discourse relations. In Coling 2008: Companion volume: Posters, pages 87–90, Manchester, UK. Coling 2008 Organizing Committee.

Girishkumar Ponkiya, Rudra Murthy, Pushpak Bhattacharyya, and Girish Palshikar. 2020. Looking inside noun compounds: Unsupervised prepositional and free paraphrasing using language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 4313–4323.

Girishkumar Ponkiya, Kevin Patel, Pushpak Bhattacharyya, and Girish Palshikar. 2018. Treat us like the sequences we are: Prepositional paraphrasing of noun compounds using LSTM. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1827–1836.

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi, and Bonnie Webber. 2008. The Penn Discourse TreeBank 2.0. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco. European Language Resources Association (ELRA).

Yada Pruksachatkun, Satyapriya Krishna, Jwala Dhamala, Rahul Gupta, and Kai-Wei Chang. 2021. Does robustness improve fairness? approaching fairness with word substitution robustness methods for text classification. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3320–3331, Online. Association for Computational Linguistics.

Libo Qin, Minheng Ni, Yue Zhang, and Wanxiang Che. 2020. CoSDA-ML: Multi-lingual code-switching data augmentation for zero-shot cross-lingual NLP. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pages 3853–3860. International Joint Conferences on Artificial Intelligence Organization. Main track.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer.

Julio Raffo. 2021. WGND 2.0.

Vikas Raunak, Arul Menezes, and Marcin Junczys-Dowmunt. 2021. The curious case of hallucinations in neural machine translation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1172–1183, Online. Association for Computational Linguistics.

Abhilasha Ravichander, Siddharth Dalmia, Maria Ryskina, Florian Metze, Eduard Hovy, and Alan W Black. 2021. NoiseQA: Challenge set evaluation for user-centric question answering. In Conference of the European Chapter of the Association for Computational Linguistics (EACL), Online.

Mehdi Regina, Maxime Meyer, and Sébastien Goutal. 2020. Text data augmentation: Towards better detection of spear-phishing emails. CoRR, abs/2007.02033.

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, Online. Association for Computational Linguistics.

Haoyue Shi, Karen Livescu, and Kevin Gimpel. 2021. Substructure substitution: Structured data augmentation for NLP. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3494–3508, Online. Association for Computational Linguistics.

Peng Shi and Jimmy Lin. 2019a. Simple BERT models for relation extraction and semantic role labeling. arXiv preprint arXiv:1904.05255.

Peng Shi and Jimmy Lin. 2019b. Simple BERT models for relation extraction and semantic role labeling. CoRR, abs/1904.05255.

Ashish Shrivastava, Kaustubh Dhole, Abhinav Bhatt, and Sharvani Raghunath. 2021. Saying no is an art: Contextualized fallback responses for unanswerable dialogue queries. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 87–92, Online. Association for Computational Linguistics.

Vered Shwartz and Ido Dagan. 2018. Paraphrase to explicate: Revealing implicit noun-compound relations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1200–1211.

Chenglei Si, Zhengyan Zhang, Fanchao Qi, Zhiyuan Liu, Yasheng Wang, Qun Liu, and Maosong Sun. 2021. Better robustness by more coverage: Adversarial and mixup data augmentation for robust fine-tuning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1569–1576, Online. Association for Computational Linguistics.

R. Smith. 2007. An overview of the Tesseract OCR engine. In 9th International Conference on Document Analysis and Recognition (ICDAR 2007), 23-26 September, Curitiba, Paraná, Brazil, pages 629–633. IEEE Computer Society.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.

Amane Sugiyama and Naoki Yoshinaga. 2019. Data augmentation using back-translation for context-aware neural machine translation. In Proceedings of the Fourth Workshop on Discourse in Machine Translation (DiscoMT 2019), pages 35–44, Hong Kong, China. Association for Computational Linguistics.

Tony Sun, Kellie Webster, Apurva Shah, William Yang Wang, and Melvin Johnson. 2021. They, them, theirs: Rewriting with gender-neutral English. CoRR, abs/2102.06788.

Fiona Anting Tan, Devamanyu Hazarika, See-Kiong Ng, Soujanya Poria, and Roger Zimmermann. 2021a. Causal augmentation for causal sentence classification. In Proceedings of the First Workshop on Causal Inference and NLP, pages 1–20, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Samson Tan and Shafiq Joty. 2021. Code-mixing on sesame street: Dawn of the adversarial polyglots. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3596–3616, Online. Association for Computational Linguistics.

Samson Tan, Shafiq Joty, Kathy Baxter, Araz Taeihagh, Gregory A. Bennett, and Min-Yen Kan. 2021b. Reliability testing for natural language processing systems. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4153–4169, Online. Association for Computational Linguistics.

Samson Tan, Shafiq Joty, Min-Yen Kan, and Richard Socher. 2020. It's morphin' time! Combating linguistic discrimination with inflectional perturbations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2920–2935, Online. Association for Computational Linguistics.

Jörg Tiedemann and Santhosh Thottingal. 2020. OPUS-MT – Building open translation services for the World. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT), Lisbon, Portugal.

Ashwin Vijayakumar, Michael Cogswell, Ramprasaath Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. 2018. Diverse beam search for improved description of complex scenes.

Ashwin K. Vijayakumar, Michael Cogswell, Ramprasaath R. Selvaraju, Qing Sun, Stefan Lee, David J. Crandall, and Dhruv Batra. 2016. Diverse beam search: Decoding diverse solutions from neural sequence models. CoRR, abs/1610.02424.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019a. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.

Xiao Wang, Qin Liu, Tao Gui, Qi Zhang, Yicheng Zou, Xin Zhou, Jiacheng Ye, Yongxin Zhang, Rui Zheng, Zexiong Pang, Qinzhuo Wu, Zhengyan Li, Chong Zhang, Ruotian Ma, Zichu Fei, Ruijian Cai, Jun Zhao, Xingwu Hu, Zhiheng Yan, Yiding Tan, Yuan Hu, Qiyuan Bian, Zhihua Liu, Shan Qin, Bolin Zhu, Xiaoyu Xing, Jinlan Fu, Yue Zhang, Minlong Peng, Xiaoqing Zheng, Yaqian Zhou, Zhongyu Wei, Xipeng Qiu, and Xuanjing Huang. 2021a. TextFlint: Unified multilingual robustness evaluation toolkit for natural language processing. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pages 347–355, Online. Association for Computational Linguistics.

Yuxuan Wang, Wanxiang Che, Jiang Guo, Yijia Liu, and Ting Liu. 2019b. Cross-lingual BERT transformation for zero-shot dependency parsing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5721–5727.

Yuxuan Wang, Wanxiang Che, Ivan Titov, Shay B. Cohen, Zhilin Lei, and Ting Liu. 2021b. A closer look into the robustness of neural dependency parsers using better adversarial examples. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2344–2354, Online. Association for Computational Linguistics.

Jason W. Wei and Kai Zou. 2019. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 6381–6387. Association for Computational Linguistics.

John Wieting and Kevin Gimpel. 2017. Pushing the limits of paraphrastic sentence embeddings with millions of machine translations. arXiv preprint arXiv:1711.05732.

John Wieting, Jonathan Mallinson, and Kevin Gimpel. 2017. Learning paraphrastic sentence embeddings from back-translated bitext. In Proceedings of Empirical Methods in Natural Language Processing.

Adina Williams, Nikita Nangia, and Samuel R Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.

Steven Wilson, Walid Magdy, Barbara McGillivray, Kiran Garimella, and Gareth Tyson. 2020. Urban dictionary embeddings for slang NLP applications. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4764–4773, Marseille, France. European Language Resources Association.

Sam Wiseman and Alexander M. Rush. 2016. Sequence-to-sequence learning as beam-search optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1296–1306, Austin, Texas. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Tongshuang Wu, Marco Tulio Ribeiro, Jeffrey Heer, and Daniel S Weld. 2021. Polyjuice: Generating counterfactuals for explaining, evaluating, and improving models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics.

Qizhe Xie, Zihang Dai, Eduard Hovy, Thang Luong, and Quoc Le. 2020. Unsupervised data augmentation for consistency training. Advances in Neural Information Processing Systems, 33.

Liang Xu, Qianqian Dong, Cong Yu, Yin Tian, Weitang Liu, Lu Li, and Xuanwei Zhang. 2020. CLUENER2020: Fine-grained named entity recognition for Chinese. arXiv preprint arXiv:2001.04351.

Usama Yaseen and Stefan Langer. 2021. Data augmentation for low-resource named entity recognition using backtranslation. CoRR, abs/2108.11703.

Sheng Kung Yi, Mark Steyvers, Michael Lee, and Matthew Dry. 2010. Wisdom of the crowds in minimum spanning tree problems. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 32.

Alex Yorke. butter-fingers. https://github.com/alexyorke/butter-fingers.

Yunfei. Chinese-Names-Corpus. https://github.com/wainshine/Chinese-Names-Corpus.

Jing Zhang, Bonggun Shin, Jinho D Choi, and Joyce C Ho. 2021. SMAT: An attention-based deep learning solution to the automation of schema matching. In European Conference on Advances in Databases and Information Systems, pages 260–274. Springer.

Wei Emma Zhang, Quan Z. Sheng, and Ahoud Abdulrahmn F. Alhazmi. 2019a. Generating textual adversarial examples for deep learning models: A survey. CoRR, abs/1901.06796.

Yuan Zhang, Jason Baldridge, and Luheng He. 2019b. PAWS: Paraphrase adversaries from word scrambling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 1298–1308. Association for Computational Linguistics.

Zhe Zhao, Hui Chen, Jinbin Zhang, Xin Zhao, Tao Liu, Wei Lu, Xi Chen, Haotang Deng, Qi Ju, and Xiaoyong Du. 2019. UER: An open-source toolkit for pre-training models. EMNLP-IJCNLP 2019, page 241.

A Review criteria for submission evaluation

Figure 3 shows the detailed review criteria used for evaluating the transformation and filter submissions.

B Transformations

The following is the list of all transformations accepted into the NL-Augmenter project. Many of the transformations tokenize sentences using the SpaCy⁴ or NLTK (Bird, 2006) tokenizers. We discuss the implementation of each transformation along with its limitations. The title of each transformation subsection is clickable and redirects to the actual Python implementation. Many of the transformations use external libraries, and we urge readers to look at each implementation and its corresponding `requirements.txt` file.

B.1 Abbreviation Transformation

This transformation replaces a word or phrase with its abbreviated counterpart ("homework" -> "hwk") using a web-scraped slang dictionary.⁵

Example: You → yu driving at 80 miles per hour → mph is why insurance is → tis so freaking → friggin expensive.
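The substitution can be sketched as a longest-match-first dictionary replacement. The table below is a tiny hand-written stand-in; the actual transformation uses a much larger lexicon scraped from noslang.com.

```python
import re

# Illustrative slang table; the real transformation loads a web-scraped dictionary.
SLANG = {
    "homework": "hwk",
    "miles per hour": "mph",
    "you": "yu",
}

def abbreviate(text, table=SLANG):
    # Try longer phrases first so "miles per hour" wins over any
    # shorter entry it might contain.
    for phrase in sorted(table, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        text = re.sub(pattern, table[phrase], text, flags=re.IGNORECASE)
    return text
```

Word boundaries (`\b`) keep the replacement from firing inside longer words, which is one of the pitfalls a naive `str.replace` would hit.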

B.2 Add Hash-Tags

This transformation uses words in the text to generate hashtags, which are then appended to the original text. Using the same words appearing

4 https://spacy.io/
5 Scraped from https://www.noslang.com/dictionary

Figure 3: Participants and reviewers were provided with a set of review criteria.

in the sentence to generate the hashtags acts as redundant noise that models should learn to ignore. Hashtags are widespread in social media channels and are used to draw attention to the source text and also as a quick stylistic device.

Example: I love domino’s pizza. → #LovePizza #Love #I #Pizza
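A minimal sketch of hashtag generation from the sentence's own words; the keyword-selection heuristic here (keep words longer than three characters) is an assumption, not the project's exact rule:

```python
import string

def add_hashtags(text: str, max_tags: int = 3) -> str:
    """Append hashtags built from the sentence's own words."""
    words = [w.strip(string.punctuation) for w in text.split()]
    keywords = [w.capitalize() for w in words if len(w) > 3][:max_tags]
    tags = ["#" + w for w in keywords]
    if len(keywords) > 1:
        # One combined hashtag, mirroring "#LovePizza" in the example.
        tags.insert(0, "#" + "".join(keywords))
    return text + " " + " ".join(tags)

print(add_hashtags("I love eating pizza."))
# "I love eating pizza. #LoveEatingPizza #Love #Eating #Pizza"
```

Because the hashtags repeat words already present in the sentence, they add no new information — exactly the redundant noise described above.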

B.3 Adjectives Antonyms Switch

This transformation switches English adjectives in a sentence with their WordNet (Miller, 1998) antonyms to generate new sentences with possibly different meanings. It can be useful for tasks like Paraphrase Detection, Paraphrase Generation, Semantic Similarity, and Recognizing Textual Entailment.

Example: Amanda’s mother was very beautiful → ugly.
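The antonym swap can be sketched with a toy lexicon standing in for the WordNet lookup that the actual transformation performs via NLTK:

```python
# Toy antonym lexicon; a stand-in for WordNet antonym queries.
ANTONYMS = {"beautiful": "ugly", "happy": "sad", "hot": "cold"}

def swap_antonyms(sentence: str) -> str:
    """Swap known adjectives for their antonyms, keeping punctuation."""
    out = []
    for token in sentence.split():
        word = token.rstrip(".,!?")
        tail = token[len(word):]
        # Note: a capitalized adjective would lose its casing here;
        # a fuller implementation would restore it.
        out.append(ANTONYMS.get(word.lower(), word) + tail)
    return " ".join(out)

print(swap_antonyms("Amanda's mother was very beautiful."))
# "Amanda's mother was very ugly."
```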

B.4 AmericanizeBritishizeEnglish

This transformation takes a sentence and tries to convert it from British English to American English and vice versa. A select set of words has been taken from hyperreality@GitHub.

Example: I love the pastel colours → colors

B.5 AntonymsSubstitute

This transformation introduces semantic diversity by replacing an even number of adjective/adverb antonyms in a given text. We assume that an even number of antonym substitutions preserves the overall sentence semantics, whereas an odd number of substitutions inverts them. Thus, the transform only applies to sentences that have an even number of revertible adjectives or adverbs. We call this mechanism double negation.

Example: Steve is able → unable to recommend movies that depicts the lives of beautiful → ugly minds.

B.6 Auxiliary Negation Removal

This is a low-coverage transformation that targets sentences containing negations. It removes negations in English auxiliaries and attempts to generate new sentences with the opposite meaning.

Example: Ujjal Dev Dosanjh was not → Ujjal Dev Dosanjh was the 1st Premier of British Columbia from 1871 to 1872.

B.7 AzertyQwertyCharsSwap

This perturbation replaces characters with the ones they would map to if the text were typed on an AZERTY keyboard layout instead of QWERTY (or vice versa).

Example: Preferably use the above download link, as the release tarballs are generated deterministically → qre generqted deterministicqlly whereas GitHub’s are not.

B.8 BackTranslation

This transformation translates a given English sentence into German and back to English, acting as a light paraphraser. Multiple variations can easily be created by changing parameters like the language as well as the translation model, which are available in plenty. Backtranslation has become quite popular and is a quick way to augment examples (Li and Specia, 2019; Sugiyama and Yoshinaga, 2019).

Example: Andrew finally returned → eventually gave Chris the French book the French book I bought last week.

B.9 BackTranslation for Named Entity Recognition

This transformation splits the token sequences into segments of entity mention(s) and “contexts” around the entity mention(s). Backtranslation is used to paraphrase the contexts around the entity mention(s), resulting in a surface form different from the original token sequence. The resultant tokens are also assigned new tags. Exploiting this transformation has been empirically shown to benefit named entity tagging (Yaseen and Langer, 2021) and hence could arguably benefit other low-resource tagging tasks (Bhatt and Dhole, 2020; Khachatrian et al., 2019; Gupta et al., 2021).

B.10 Butter Fingers Perturbation

This perturbation adds noise to all types of text sources (sentence, paragraph, etc.), mimicking the noise arising from keyboard typos that cause common spelling errors. A few letters picked at random are replaced with letters at nearby keyboard positions. The implementation has been borrowed from here (Yorke) as used in (Mille et al., 2021). There has also been recent work in NoiseQA (Ravichander et al., 2021) to mimic keyboard typos.

Example: Sentences → Senhences with gapping, such as Paul likes coffee → coffwe and Mary tea, lack an overt predicate to indicate → indicatx the relation → relauion between two or more arguments → argumentd.
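A minimal sketch of the keyboard-typo perturbation; the `NEIGHBOURS` map below covers only a handful of keys, whereas the real implementation maps every letter to its QWERTY neighbours:

```python
import random

# Assumed subset of QWERTY adjacency; the full implementation covers
# the whole keyboard.
NEIGHBOURS = {
    "e": "wsdr", "n": "bhjm", "o": "iklp", "s": "awedxz", "t": "rfgy",
}

def butter_fingers(text: str, prob: float = 0.1, seed: int = 0) -> str:
    """Randomly replace letters with keyboard-adjacent ones."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        keys = NEIGHBOURS.get(ch.lower())
        if keys and rng.random() < prob:
            out.append(rng.choice(keys))
        else:
            out.append(ch)
    return "".join(out)

print(butter_fingers("Sentences with gapping lack an overt predicate."))
```

Seeding the generator makes the perturbation reproducible, which is convenient when building evaluation sets.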

B.11 Butter Fingers Perturbation For IndianLanguages

This implements the butter fingers perturbation described above for nine Indian languages: Bangla, Gujarati, Hindi, Kannada, Malayalam, Oriya, Punjabi, Tamil, and Telugu. The implementation considers the InScript keyboard,6 which is the decreed standard for Indian scripts.

B.12 Change Character Case

This transformation acts like a perturbation andrandomly swaps the casing of some of the letters.

6 https://en.wikipedia.org/wiki/InScript_keyboard

The transformation’s outputs will not work with uncased models or languages without casing.

Example: Alice in Wonderland is a 2010 American live-action → actIon / animated → anImated dark fantasy → faNtasy adventure film.
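The perturbation amounts to flipping the case of randomly chosen letters; a seeded sketch:

```python
import random

def random_case_swap(text: str, prob: float = 0.15, seed: int = 42) -> str:
    """Randomly flip the case of letters to perturb the text."""
    rng = random.Random(seed)
    return "".join(
        ch.swapcase() if ch.isalpha() and rng.random() < prob else ch
        for ch in text
    )

print(random_case_swap("Alice in Wonderland is a 2010 American film."))
```

Lowercasing both input and output recovers the original sentence, which is why this perturbation is harmless to uncased models, as noted above.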

B.13 Change Date Format

This transformation changes the format of dates.

Example: The first known case of COVID-19 was identified in Wuhan, China in December → Dec 2019.
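One of the format changes, abbreviating a month name, can be sketched with the standard library's `datetime` format codes (the "Month YYYY" input pattern is assumed here for illustration):

```python
from datetime import datetime

def shorten_month(date_str: str) -> str:
    """Rewrite a 'Month YYYY' date into its abbreviated 'Mon YYYY' form."""
    dt = datetime.strptime(date_str, "%B %Y")  # %B = full month name
    return dt.strftime("%b %Y")               # %b = abbreviated month name

print(shorten_month("December 2019"))  # "Dec 2019"
```

Parsing into a `datetime` first (rather than string-munging) makes it easy to emit any other target format, e.g. `"%d/%m/%Y"`.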

B.14 Change Person Named Entities

This perturbation changes the name of the person from one name to another by making use of the lexicon of person names in Ribeiro et al. (2020).

Example: Andrew → Nathaniel finally returned the French book to Chris that I bought last week

B.15 Change Two Way Named Entities

This perturbation changes the name of a person but also makes a parallel change in the label or reference text with the same name, making it useful for text-to-text generation tasks.

Example: He finally returned the French book to Chris → Austin that I bought last week

B.16 Chinese Antonym and Synonym Substitution

This transformation substitutes Chinese words with their synonyms or antonyms by using the Chinese dictionary7 and the NLP Chinese Data Augmentation dictionary.8

B.17 Chinese Pinyin Butter Fingers Perturbation

This transformation implements the Butter Fingers Perturbation for Chinese characters. A few Chinese words and characters picked at random are substituted with others that have similar pinyin (based on the default Pinyin keyboards in Windows and macOS). It uses a database of 16,142 Chinese characters9 and their associated pinyins to generate the perturbations for Chinese characters. A smaller

7 Chinese Dictionary: https://github.com/guotong1988/chinese_dictionary
8 NLP Chinese Data Augmentation: https://github.com/425776024/nlpcda
9 https://github.com/pwxcoo/chinese-xinhua

database of 3,500 more frequently seen Chinese characters10 is also used, with these characters having a higher probability of being used in perturbations than less frequently seen Chinese characters. It also uses a database of 575,173 words11 combined from several sources12 in order to generate perturbations for Chinese words.

B.18 Chinese Person Named Entities andGender Perturbation

This perturbation adds noise to all types of text sources containing Chinese names (sentence, paragraph, etc.) by swapping a Chinese name with another Chinese name whilst also allowing the possibility of a gender swap. CLUENER (Xu et al., 2020; Zhao et al., 2019) is used for tagging named entities in Chinese. The list of names is taken from the Chinese Names Corpus (Yunfei). The perturbation can assist in detecting biases present in language models and their ability to infer implicit gender information when presented with gender-specific names. It can also be useful in mitigating representation biases in the input text.

B.19 Chinese (Simplified & Traditional) Perturbation

This perturbation adds noise to all types of text sources containing Chinese words and characters (sentence, paragraph, etc.) by converting between Simplified and Traditional Chinese, as well as other variants of Chinese characters such as Japanese Kanji. It supports character-level and phrase-level conversion, character variant conversion, and regional idioms among Mainland China, Taiwan, and Hong Kong, all available as configurations originally in the OpenChineseConvert project.13

B.20 City Names Transformation

This transformation replaces instances of populous and well-known cities in Spanish and English sentences with instances of less populous and less well-known cities to help reveal demographic biases (Mishra et al., 2020) prevalent in named entity recognition models. The choice of cities has been taken from the World Cities Dataset.14

10 https://github.com/elephantnose/characters
11 http://thuocl.thunlp.org/
12 https://github.com/fighting41love/Chinese_from_dongxiexidian
13 https://github.com/BYVoid/OpenCC
14 https://www.kaggle.com/juanmah/world-cities

Example: The team was established in Dallas → Viera West in 1898 and was a charter member of the NFL in 1920.

B.21 Close Homophones Swap

Humans are generally guided by their senses and are unconsciously robust against phonetic attacks. Such attacks are highly popular in languages like English, which has an irregular mapping between pronunciation and spelling (Eger and Benz, 2020). This transformation mimics writing behaviors where users swap words with similar homophones, either intentionally or by accident, and acts as a perturbation to test robustness. A few words picked at random are replaced with homophones that sound or look similar. Some of the word choices might not be completely natural, since humans "prefer" some words over others even when they sound exactly the same, so the transformation might not fully reflect the natural distribution of intentional or unintentional word swaps.

Example: Sentences with gapping, such as Paul likes coffee and Mary tea → Tee, lack an overt predicate to indicate the → Thee relation between two or more → Morr arguments.

B.22 Color Transformation

This transformation augments the input sentence by randomly replacing mentioned colors with different ones from the 147 extended color keywords specified by the World Wide Web Consortium (W3C).15 Some of the colors include “dark sea green”, “misty rose”, and “burly wood”.

Example: Tom bought 3 apples, 1 orange → misty rose, and 4 bananas and paid $10.

B.23 Concatenate Two Random Sentences (Bilingual)

Given a dataset, this transformation concatenates a sentence with a previously occurring sentence, as explained in Nguyen et al. (2021). A monolingual version is described in the subsequent subsection. This concatenation would benefit all text tasks that use a transformer (and likely other sequence-to-sequence architectures). Previously published work (Nguyen et al., 2021) has shown a large gain in performance for low-resource machine translation using this method. In particular, the learned model is stronger due to seeing training data with context diversity, length diversity, and (to a lesser extent) position shifting.

15 https://www.w3.org/TR/2021/REC-css-color-3-20210805/

B.24 Concatenate Two Random Sentences (Monolingual)

This is the monolingual counterpart of the above.

Example: I am just generating a very very very long sentence to make sure that the method is able to handle it. It does not even need to be a sentence. Right? This is not splitting on punctuation... I am just generating a very very very long sentence to make sure that the method is able to handle it. It does not even need to be a sentence. Right? This is not splitting on punctuation...

B.25 Concept2Sentence

This transformation takes a sentence, its associated integer label, and (optionally) a dataset name that is supported by huggingface/datasets (Lhoest et al., 2021a,b). It works by extracting keyword concepts from the original sentence and passing them into a BART (Lewis et al., 2020) transformer trained on CommonGen (Lin et al., 2019) to generate a new, related sentence which reflects the extracted concepts. Providing a dataset allows the function to use transformers-interpret (Pierse, 2021) to identify the most critical concepts for use in the generative step. Under the hood, this transform makes use of the Sibyl tool (Harel-Canada, 2021), which is capable of transforming the label as well. However, this particular implementation of C2S generates new text that is invariant (INV) with respect to the label. Since the model is trained on CommonGen, which is focused on image captioning, the style of the output sentence is geared towards scenic descriptions and might not necessarily adhere to the syntax of the original sentence. Besides, it can be hard to argue that a handful of keywords could provide a complete description of the original sentence.

B.26 Contextual Meaning Perturbation

This transformation was designed to model the "Chinese Whispers" or "Telephone" children’s game: the transformed sentence appears fluent

and somewhat logical, but the meaning of the original sentence might not be preserved. To achieve logical coherence, a pre-trained language model is used to replace words with alternatives that match the context of the sentence. Grammar mistakes are reduced by limiting the type of words considered for changes (based on POS tagging) and replacing adjectives with adjectives, nouns with nouns, etc. where possible.

This transformation benefits users who seek perturbations that preserve fluency but not the meaning of the sentence. For instance, it can be used in scenarios where the meaning is relevant to the task, but the model shows a tendency to over-rely on simpler features such as the grammatical correctness and general coherence of the sentence. A real-world example would be the training of quality estimation models for machine translation (does the translation maintain the meaning of the source?) or for text summarisation (does the summary capture the content of the source?).

Word substitution with pre-trained language models has been explored in different settings. For example, the augmentation library nlpaug (Ma, 2019) and the adversarial attack library TextAttack (Morris et al., 2020) include contextual perturbation methods. However, their implementations do not offer control over the type of words that should be perturbed and introduce a large number of grammar mistakes. If the aim is to change the sentence’s meaning while preserving its fluency, this transformation can help achieve the same effect with significantly fewer grammatical errors. Li et al. (2020a) propose an alternative approach to achieve a similar objective.

B.27 Contractions and Expansions Perturbation

This perturbation substitutes the text with popular expansions and contractions, e.g., “I’m” is changed to “I am” and vice versa. The list of commonly used contractions & expansions and the implementation of the perturbation have been taken from Checklist (Ribeiro et al., 2020).

Example: He often does n’t → not come to school.
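A sketch of one direction of the perturbation, using a three-entry table assumed to mirror a small part of the Checklist contraction list; reversing the map gives the expansion-to-contraction direction:

```python
import re

# Assumed subset of a contraction/expansion table.
EXPANSIONS = {"doesn't": "does not", "I'm": "I am", "can't": "cannot"}

def expand_contractions(text: str) -> str:
    """Replace contractions with their expanded forms."""
    for short, long in EXPANSIONS.items():
        text = re.sub(re.escape(short), long, text)
    return text

print(expand_contractions("He often doesn't come to school."))
# "He often does not come to school."
```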

B.28 Correct Common Misspellings

This transformation acts like a lightweight spell-checker and corrects common misspellings appearing in text by looking for words in Wikipedia’s Lists of Common Misspellings.

Example: Andrew andd → and Alice finally returnd → returned the French book that I bought lastr → last week

B.29 Country/State Abbreviation

This transformation replaces country and state names with their common abbreviations.16 Abbreviations can be shared across different locations: “MH” can refer to County Meath in Ireland as well as the state of Maharashtra in India, and hence this transformation might result in a slight loss of information, especially if the surrounding context doesn’t have enough signals.

Example: One health officer and one epidemiologist have boarded the ship in San Diego, CA → California on April 13, 2015 to conduct an environmental health assessment.

B.30 Decontextualisation of the main Event

Semantic Role Labelling (SRL) is a powerful shallow semantic representation that determines who did what to whom, when, and where (and why, how, etc.). The core arguments generally describe the participants involved in the event, while contextual arguments provide more specific information about the event. After tagging a sentence with appropriate semantic role labels using an SRL labeller (Jindal et al., 2020; Shi and Lin, 2019a), this transformation crops out contextual arguments to create a new sentence with a minimal description of the event, helping to generate textual pairs for entailment.

B.31 Diacritic Removal

“Diacritics are marks placed above or below (or sometimes next to) a letter in a word to indicate a particular pronunciation — in regard to accent, tone, or stress — as well as meaning, especially when a homograph exists without the marked letter or letters.” (Merriam-Webster). This transformation removes these diacritics or accented characters and replaces them with their non-accented versions. It can be common for non-native or inexperienced speakers to miss accents and write the non-accented versions.

Example: She lookèd → looked east an she lookèd → looked west.
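Diacritic removal is commonly implemented via Unicode NFD decomposition followed by dropping combining marks; a sketch along those lines:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD) and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("She lookèd east an she lookèd west."))
# "She looked east an she looked west."
```

NFD splits "è" into a base "e" plus a combining grave accent (U+0300); filtering out characters with a nonzero combining class leaves the bare letter.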

16 Countries States Cities Database: https://github.com/dr5hn/countries-states-cities-database

B.32 Disability/Differently Abled Transformation

Disrespectful language can make people feel excluded and represent an obstacle towards their full participation in society (Res, 2006). This low-coverage transformation substitutes outdated references to disabilities with more appropriate and respectful ones that avoid negative connotations. A small list of inclusive words and phrases has been taken from a public article on inclusive communication, Wikipedia’s list of disability-related terms with negative connotations, and terms to avoid while writing about disability.

Example: They are deaf → person or people with a hearing disability.

B.33 Discourse Marker Substitution

This perturbation replaces a discourse marker in a sentence with a semantically equivalent marker. Previous work has identified discourse markers that have low ambiguity (Pitler et al., 2008). This transformation uses the corpus analysis on PDTB 2.0 (Prasad et al., 2008) to identify discourse markers that are associated with a discourse relation with a chance of at least 0.5. Then, a marker is replaced with a different marker associated with the same semantic class.

Example: It has plunged 13% since → inasmuch as July to around 26 cents a pound. A year ago ethylene sold for 33 cents

B.34 Diverse Paraphrase Generation Using SubModular Optimization and Diverse Beam Search

This transformation generates multiple paraphrases of a sentence by employing four candidate selection methods on top of a base set of backtranslation models: 1) DiPS (Kumar et al., 2019), 2) Diverse Beam Search (Vijayakumar et al., 2018), 3) Beam Search (Wiseman and Rush, 2016), and 4) Random. Unlike beam search, which generally focuses on the top-k candidates, DiPS introduces a novel formulation using submodular optimisation to generate more diverse paraphrases and has proven to be an effective data augmenter for tasks like intent recognition and paraphrase detection (Kumar et al., 2019). Diverse Beam Search attempts to generate diverse sequences by employing a diversity-promoting alternative to classical beam search (Wiseman and Rush, 2016).

B.35 Dyslexia Words Swap

This transformation acts like a perturbation by altering some words of the sentence with aberrations (Board, 2021) that are likely to occur in the context of dyslexia.

Example: Biden hails your → you’re relationship with Australia just days after new partnership drew ire from France.

B.36 Emoji Icon Transformation

This transformation converts emojis into their equivalent keyboard emoticon format (e.g., a smiling-face emoji → “:)”) and vice versa.
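A sketch of the two-way conversion with a hypothetical two-entry emoticon/emoji map (the real transformation uses a much larger table):

```python
# Hypothetical two-way map between emoticons and emoji codepoints.
EMOTICON_TO_EMOJI = {":)": "\U0001F642", ":(": "\U0001F641"}
EMOJI_TO_EMOTICON = {v: k for k, v in EMOTICON_TO_EMOJI.items()}

def emoticons_to_emoji(text: str) -> str:
    """Replace keyboard emoticons with emoji characters."""
    for emoticon, emoji in EMOTICON_TO_EMOJI.items():
        text = text.replace(emoticon, emoji)
    return text

def emoji_to_emoticons(text: str) -> str:
    """Replace emoji characters with keyboard emoticons."""
    for emoji, emoticon in EMOJI_TO_EMOTICON.items():
        text = text.replace(emoji, emoticon)
    return text

s = "Nice to meet you :)"
assert emoji_to_emoticons(emoticons_to_emoji(s)) == s
```

Because the two maps are exact inverses, a round trip through both functions restores the original text.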

B.37 Emojify

This transformation augments the input sentence by swapping words with emojis of similar meanings. Emojis, introduced in 1997 as a set of pictograms used in digital messaging, have become deeply integrated into our daily communication. More than 10% of tweets17 and more than 35% of Instagram posts18 included one or more emojis in 2015. Given the ubiquity of emojis, there is a growing body of work researching the linguistic and cultural aspects of emojis (Guntuku et al., 2019) and how the use of emojis can be leveraged to help solve NLP tasks (Eisner et al., 2016).

Example: Apple is looking at buying U.K. startup for $132 billion. → the same sentence with several words replaced by emojis (the emoji glyphs are not reproduced here).

B.38 English Inflectional Variation

This transformation adds inflectional variation to English words and can be used to test the robustness of models against inflectional variations. In English, each inflection generally maps to a Part-Of-Speech tag19 in the Penn Treebank (Marcus et al., 1993). Each content word in the sentence is first lemmatised before randomly sampling a valid POS category and reinflecting the word according to the new category. The sampling process for each word is constrained using its POS tag to maintain the original sense for polysemous words.

17 https://blog.twitter.com/en_us/a/2015/emoji-usage-in-tv-conversation
18 https://instagram-engineering.com/
19 Penn TreeBank POS

This has been adapted from the Morpheus (Tan et al., 2020) adversarial attack.

Example: Ujjal Dev Dosanjh served → serve as 33rd Premier → Premiers of British Columbia from 2000 to 2001

B.39 English Mention Replacement for NER

This transformation randomly swaps an entity mention with another entity mention of the same entity type. Exploiting this transformation as a data augmentation strategy has been empirically shown to improve the performance of underlying NER models (Dai and Adel, 2020).

B.40 Filler Word Augmentation

This augmentation adds noise in the form of colloquial filler phrases. 23 different phrases are chosen across 3 different categories: general filler words and phrases ("uhm", "err", "actually", "like", "you know"...), phrases emphasizing speaker opinion/mental state ("I think/believe/mean", "I would say"...), and phrases indicating uncertainty ("maybe", "perhaps", "probably", "possibly", "most likely"). The latter two categories had shown promising results in Kovatchev et al. (2021) when they were concatenated at the beginning of the sentence, unlike this implementation, which performs insertions at random positions. Filler words are based on the work of Laserna et al. (2014) but have not been explored in the context of data augmentation.

B.41 Style Transfer from Informal to Formal

This transformation transfers the style of text from formal to informal and vice versa. It uses the implementation of Styleformer (Damodaran).

Example: What you upto → currently doing ?

B.42 French Conjugation Substitution

This transformation changes the conjugation of verbs in simple French sentences to a specified tense. It detects the pronouns used in the sentence in order to conjugate accordingly whenever a sentence contains different verbs. This version only works for indicative tenses and only for simple direct sentences (subject, verb, direct/indirect object) that contain a pronoun as subject (il, elle, je, etc.). It does not detect when the subject is a noun phrase (“les enfants” or “la jeune femme”).

B.43 Gender And Culture Diversity Name Changer (1-way and 2-way)

Corpora exhibit many representational biases, and this transformation focuses on one particular mediator: personal names. It diversifies names in the corpora along two critical dimensions, gender and cultural background. Technically, the transformation samples a (country, gender) pair and then randomly draws a name from that pair to replace the original name. We collected 42,812 distinct names from 141 countries, primarily from the World Gender Name Dictionary (Raffo, 2021).

Common name augmentations do not consider gender and cultural implications. Thus, they do not necessarily mitigate biases or promote minority representation, because the augmented name may be from the same gender and cultural background. This is the case, for example, in CheckList’s (Ribeiro et al., 2020) name augmentation. Intersecting the names therein with ours, 34.0%, 33.5%, 31.9%, and 30.8% of them are popular names in the US, Canada, Australia, and the UK, respectively. Only 0.4%, 0.4%, 0.5%, and 2.1% of them are from India, Korea, China, and Kazakhstan.

Example: Rachel → Charity Green, a sheltered but friendly woman, flees her wedding day and wealthy yet unfulfilling life.

B.44 Neopronoun Substitution

This transformation performs grammatically correct substitution, from English to English, of the gendered pronouns he/she in a given sentence with their neopronoun counterparts, based on a list compiled by UNC Greensboro and LGBTA WIKI.20

NLP models, such as those for neural machine translation, often fail to recognize neopronouns and treat them as proper nouns. This transformation seeks to render the training data used in NLP pipelines more neopronoun-aware to reduce the risk of trans erasure. The reason why a simple look-up-table approach might not work is that the grammatical case may differ depending on the context.

Example: She → They had her → their friends tell her → them about the event.

20 https://intercultural.uncg.edu/wp-content/uploads/Neopronouns-Explained-UNCG-Intercultural-Engagement.pdf

B.45 Gender Neutral Rewrite

This transformation involves rewriting an English sentence containing a single gendered entity with its gender-neutral variant. One application is machine translation, when translating from a language with gender-neutral pronouns (e.g., Turkish) to a language with gendered pronouns (e.g., English). This transformation is based on the algorithm proposed by Sun et al. (2021).

Example: His → Their dream is to be a fireman → firefighter when he → they grows → grow up.

B.46 GenderSwapper

This transformation introduces gender diversity to the given data. If used as data augmentation for training, the transformation might mitigate gender bias, as shown in Dinan et al. (2020). It also might be used to create a gender-balanced evaluation dataset to expose the gender bias of pre-trained models. This transformation performs lexical substitution with the opposite gender. The list of gender pairs (shepherd <–> shepherdess) is taken from Lu et al. (2019). Gendered names from Ribeiro et al. (2020) are also randomly swapped.

B.47 GeoNames Transformation

This transformation augments the input sentence with information based on location entities (specifically cities and countries) available in the GeoNames database.21 For example, if a country name is found, it is appended with information about the country, like its capital city, its neighbouring countries, its continent, etc. Some initial ideas of this nature were explored in Pais (2019).

B.48 German Gender Swap

This transformation replaces masculine nouns and pronouns with their female counterparts in German sentences, drawing from a total of 2,226 common German names.22

Example: Er → Sie ist ein Arzt → eine Ärztin und mein Vater → meine Mutter.

B.49 Grapheme to Phoneme Substitution

This transformation adds noise to a sentence by randomly converting words to their phonemes.

21 http://download.geonames.org/export/dump/
22 https://de.wiktionary.org/wiki/Verzeichnis:Deutsch/Namen

Grapheme-to-phoneme substitution is useful in NLP systems operating on speech. An example of grapheme-to-phoneme substitution is “permit” → “P ER0 M IH1 T”.

B.50 Greetings and Farewells

This transformation replaces greetings (e.g., "Hi", "Howdy") and farewells (e.g., "See you", "Good night") with their synonymous equivalents.

Example: Hey → Hi everyone. It’s nice → Pleased to meet you. How have → are you been ?

B.51 Hashtagify

This transformation modifies an input sentence by identifying named entities and other common words and turning them into hashtags, as often used in social media.

B.52 Insert English and French Abbreviations

This perturbation replaces some well-known English and French words or expressions in texts with (one of) their abbreviations. Many of the abbreviations covered here are quite common on social media platforms, even though some of them are quite generic. This implementation is partly inspired by recent work in Machine Translation (Berard et al., 2019).

B.53 Leet Transformation

Visual perturbations are often used to disguise offensive comments on social media (e.g., “!d10t”) or as a distinct writing style (“1337” in “leetspeak”) (Eger et al., 2019a), and are especially common in scenarios like video gaming. Humans are unconsciously robust to such visually similar texts. This perturbation replaces letters with their visually similar “leet” counterparts.23

Example: Ujjal Dev Dosanjh served → U7jal 0ev D0san74 serv3d as 33rd Premier of British Columbia from → Pr33i3r 0f 8ritis4 00lu36ia fr0m 2000 to → t0 2001
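A sketch of the letter-to-leet substitution; the `LEET` map below is a small assumed subset of the full substitution table:

```python
# Assumed subset of a leetspeak substitution table.
LEET = {"a": "4", "e": "3", "i": "1", "o": "0", "t": "7", "s": "5"}

def to_leet(text: str) -> str:
    """Replace letters with visually similar 'leet' characters."""
    return "".join(LEET.get(ch.lower(), ch) for ch in text)

print(to_leet("Premier of British Columbia"))
# "Pr3m13r 0f Br171sh C0lumb14"
```

A robustness test would then compare model predictions on the original and leet-transformed inputs.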

B.54 Lexical Counterfactual Generator

This transformation generates counterfactuals by simply substituting negative words like “not” or “neither” in one sentence of a semantically similar sentence pair. The substituted sentence is then backtranslated in an attempt to correct for grammaticality. This transformation would be useful for tasks like entailment and paraphrase detection.

23 https://simple.wikipedia.org/wiki/Leet

B.55 Longer Location for NER

This transformation augments data for Named Entity Recognition (NER) tasks by augmenting examples that have a Location tag. Names of locations are expanded by appending cardinal directions like “south”, “N”, “northwest”, etc. The transformation ensures that the tags of the new sentence are modified accordingly.

B.56 Longer Location Names for Testing NER

This transformation augments data for Named Entity Recognition (NER) tasks by augmenting examples that have a Location (LOC) tag. Names of locations are expanded by inserting random prefix or postfix word(s). The transformation also ensures that the labels of the new tags are modified accordingly.

B.57 Longer Names for NER

This transformation augments data for Named Entity Recognition (NER) tasks by augmenting examples that have a Person tag. Names of people are expanded by inserting random characters as initials. The transformation also ensures that the labels of the new tags are modified accordingly.

B.58 Lost in Translation

This transformation is a generalization of the BackTranslation transformation to any sequence of languages supported by the Helsinki-NLP OpusMT models (Tiedemann and Thottingal, 2020).

Example: Andrew finally returned → brought Chris back the French book the French book I bought last week I bought last week

B.59 Mixed Language Perturbation

Mixed language training has been effective for cross-lingual tasks (Liu et al., 2020), for generating data in low-resource scenarios (Liu et al., 2021), and for multilingual translation (Fan et al., 2021). Two transformations translate randomly picked words in the text from English to other languages (e.g., German). They can be used to test the robustness of a model in a multilingual setting.

Example: Andrew finally returned the → die Comic book to Chris that I bought last week → woche

B.60 Mix Transliteration

This transformation transliterates randomly picked words from the input sentence (in a given source language script) to a target language script. It can be used to train/test multilingual models to improve/evaluate their ability to understand completely or partially transliterated text.

B.61 MR Value Replacement

This perturbation adds noise to a key-value meaning representation (MR) (and its corresponding sentence) by randomly substituting values/words with their synonyms (or related words). This transformation uses a simple strategy to align values of an MR and tokens in the corresponding sentence, inspired by how synonyms are substituted for tasks like machine translation (Fadaee et al., 2017). As a result, there could be some problems in complex sentences. Besides, the transformation might introduce non-grammatical segments.

B.62 Multilingual Back Translation

This transformation translates a given sentence from a given language into a pivot language and then back to the original language. It is a simple paraphraser that works for 100 different languages. Back translation has become quite popular and is a quick way to augment data (Li and Specia, 2019; Sugiyama and Yoshinaga, 2019; Fan et al., 2020).

Example: Being honest → Honesty should be one of our most important character traits → characteristics

B.63 Multilingual Dictionary Based Code Switch

This transformation generates multilingual code-switching data to fine-tune encoders of large language models (Qin et al., 2020; Tan and Joty, 2021; Wang et al., 2019b) by making use of the bilingual dictionaries of MUSE (Lample et al., 2018).

B.64 Multilingual Lexicon Perturbation
This perturbation creates code-mixed sentences for both high-resource and low-resource languages by randomly translating words with a specified probability from any supported language (e.g., English) to other supported languages (e.g., Chinese) using a multilingual lexicon. Thus, it can be used to generate code-mixed training data to improve models for multilingual and cross-lingual settings. As of now, 100 languages and the 3000 common English words listed on ef.com24 are supported. The lexicon implementation is also 160x faster than its model-based counterpart.
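The lexicon-based word replacement described above can be sketched as follows. This is a minimal illustration, not the actual NL-Augmenter implementation; the tiny English-to-German lexicon and the function name are hypothetical.

```python
import random

# Toy English -> German lexicon; the real transformation uses large
# multilingual lexicons covering ~100 languages.
LEXICON = {"book": "Buch", "week": "Woche", "house": "Haus"}

def lexicon_perturb(sentence, lexicon, prob=0.5, seed=0):
    """Translate each word found in the lexicon with probability `prob`."""
    rng = random.Random(seed)
    out = []
    for tok in sentence.split():
        if tok.lower() in lexicon and rng.random() < prob:
            out.append(lexicon[tok.lower()])  # swap in the translation
        else:
            out.append(tok)                   # keep the original token
    return " ".join(out)
```

With `prob=1.0` every lexicon word is replaced, producing a fully code-mixed sentence; lower probabilities control the degree of mixing.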

B.65 Causal Negation & Strengthening
This transformation augments causal relations in text and adapts the code from the paper Causal Augmentation for Causal Sentence Classification (Tan et al., 2021a). There are two operations: 1. Causal Negation: negative words like "not, no, did not" are introduced into sentences to unlink the causal relation. 2. Causal Strengthening: causal meaning is strengthened by converting weaker modal words into stronger ones, like "may" to "will", to assert causal strength.

The implementation provides users with the option to amend causal meaning automatically from the root word of the sentence, or by explicitly highlighting the index of the word they wish to amend. Additionally, we include WordNet (Miller, 1998) synonyms and tense matching to allow for more natural augmentations.

� The rs7044343 polymorphism could be → was involved in regulating the production of IL-33.

B.66 Question Rephrasing transformation
This implementation rephrases questions for sentence tasks by using the T5 model used in B.75 for Question Answering tasks.

B.67 English Noun Compound Paraphraser [N+N]

This transformation replaces two-word noun compounds with a paraphrase, based on the compound paraphrase dataset from SemEval 2013 Task 4 (Hendrickx et al., 2013). It currently only works for English. Any two-word compound that appears in a dataset of noun compound paraphrases will be replaced by a paraphrase. If more than one two-word compound appears, then all combinations of compound paraphrases (including no paraphrase at all) will be returned. For example, the paraphrases of “club house” include “house for club activities”, “house for club members”, “house in which a club meets”, etc. We start with replacing paraphrases with the highest score (the specified frequency in the annotated dataset), and paraphrases with the same score (ties) are sorted randomly. This transformation currently only checks for noun compounds from Hendrickx et al. (2013) and therefore has low coverage. To improve it, other datasets could be added, e.g., from Ponkiya et al. (2018) or Lauer (1995). To attain even wider coverage (at the expense of lower precision), machine learning approaches such as Shwartz and Dagan (2018) or Ponkiya et al. (2020) could be considered. In addition, some of the paraphrases in Hendrickx et al. (2013) sound a little odd (e.g., "blood cell" -> "cell of blood") and may not fit well in context.

24https://www.ef.com/wwen/english-resources/english-vocabulary/top-3000-words/

B.68 Number to Word
This transformation acts like a perturbation to improve robustness in processing numerical values. The perturbed sentence contains the same information as the initial sentence but with a different representation of numbers.

B.69 Numeric to Word
This transformation translates numbers in numeric form to their textual representations. This includes general numbers, long numbers, basic math characters, currency, dates, times, phone numbers, etc.

B.70 OCR Perturbation
This transformation directly induces Optical Character Recognition (OCR) errors into the input text. It renders the input sentence as an image and recognizes the rendered text using the OCR engine Tesseract 4 (Smith, 2007). It works with text in English, French, Spanish, and German. The implementation follows previous work by Namysl et al. (2021).

B.71 Add Noun Definition
This transformation appends noun definitions onto the original nouns in a sentence. Definitions of nouns are collected from Wikidata.25

B.72 Pig Latin Cipher
This transformation translates the original text into Pig Latin. Pig Latin is a well-known deterministic transformation of English words, and can be viewed as a cipher which can be deciphered by a human with relative ease. The resulting sentences are completely unlike examples typically used in language model training. As such, this augmentation changes inputs into ones which are difficult for a language model to interpret, while remaining relatively easy for a human to interpret.
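One common Pig Latin scheme can be sketched as below. This is a simplified illustration only (it lower-cases input and does not treat clusters like "qu" specially); the actual transformation's rules may differ.

```python
def pig_latin_word(word):
    """Deterministic Pig Latin rendering of a single lowercase word."""
    vowels = "aeiou"
    if word[0] in vowels:
        return word + "way"                    # vowel-initial: append "way"
    for i, ch in enumerate(word):
        if ch in vowels:
            return word[i:] + word[:i] + "ay"  # move onset consonants to the end
    return word + "ay"                         # word with no vowels

def pig_latin(text):
    """Apply the word-level rule across a whitespace-tokenized text."""
    return " ".join(pig_latin_word(w) for w in text.lower().split())
```

For example, `pig_latin("the brown fox")` yields "ethay ownbray oxfay", which is easy for a human to decode but far outside typical training distributions.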

25https://www.wikidata.org/wiki/Wikidata:Main_Page

B.73 Pinyin Chinese Character Transcription

This transformation transcribes Chinese characters into their Mandarin pronunciation using the Pinyin romanization scheme. The Character-to-Pinyin converter at the core of this transformation is a neural model by Park and Lee (2020).

B.74 SRL Argument Exchange
This perturbation adds noise to all types of English text sources (sentence, paragraph, etc.) proportional to the number of arguments identified by SRL BERT (Shi and Lin, 2019b). Different rules are applied to deterministically modify the sentence in a meaning-preserving manner, e.g.: if ARGM-LOC and ARGM-TMP are both present, exchange them.
Example: [ARG0: Alex] [V: left] [ARG2: for Delhi] [ARGM-COM: with his wife] [ARGM-TMP: at 5 pm] . → Alex left for Delhi at 5 pm with his wife.
The transformation relies on PropBank annotations (Bonial et al., 2012; Kingsbury and Palmer, 2002; Palmer et al., 2005; Gildea and Palmer, 2002).

B.75 ProtAugment Diverse Paraphrasing
This transformation utilizes the ProtAugment method by Dopierre et al. (2021). The paraphrase generation model is a BART model (Lewis et al., 2020), fine-tuned on the paraphrase generation task using 3 datasets: Google-PAWS (Zhang et al., 2019b), MSR (Dolan and Brockett, 2005), and Quora.26

When paraphrasing a sentence, the transformation uses Diverse Beam Search (Vijayakumar et al., 2016) to generate diverse outputs. The diversity penalty term is set to 0.5 by default but can be set to custom values. Additionally, the transformation can use the following generation constraints: (1) A fraction of the words in the input sentence are forbidden in the paraphrase (default 0.7). (2) All bi-grams in the input sentence are forbidden in the paraphrase, i.e., the paraphrase cannot contain any bi-gram that appears in the input sentence. This constraint forces the paraphrase generation model to change the sentence structure.

B.76 Punctuation
This transformation removes/adds punctuation from an English sentence. It was first introduced by Mille et al. (2021) and used as an example implementation for NL-Augmenter.

26https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs

B.77 Question-Question Paraphraser for QA
This transformation creates new QA pairs by generating question paraphrases from a T5 model fine-tuned on Quora Question Pairs.27 Generated questions can have a very different surface form from the original question, making it a strong paraphrase generator. A T5 model (Raffel et al., 2019; Wolf et al., 2020) fine-tuned28 on the Quora Question Pairs dataset is used to generate question paraphrases. This transformation would benefit Question Answering and Question Generation, as well as other tasks which could benefit indirectly, e.g., dialog tasks (Shrivastava et al., 2021; Dhole, 2020).

B.78 Question in CAPS
This transformation upper-cases the context of a question answering example. It also adds upper-cased versions of the original answers to the set of acceptable model responses.

B.79 Random Word Deletion
This transformation randomly removes a word with a given probability p (by default 0.25). The transformation relies on whitespace tokenization and thus only works for English and other languages that are segmented via whitespace. Due to the destructive nature of the transformation, the meaning of a sequence may be changed as a result. A similar transformation was suggested by Wei and Zou (2019). Word dropout (Goldberg, 2017) is commonly used to help models handle unknown words encountered during evaluation by exposing them to this unknown-word condition during training itself.
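The deletion procedure above amounts to a few lines; this sketch (with a hypothetical guard to keep at least one token) illustrates the idea rather than reproducing the exact implementation:

```python
import random

def random_word_deletion(text, p=0.25, seed=0):
    """Drop each whitespace-separated token with probability p,
    falling back to a single random token if everything is dropped."""
    rng = random.Random(seed)
    tokens = text.split()
    kept = [t for t in tokens if rng.random() >= p]
    return " ".join(kept) if kept else rng.choice(tokens)
```

Because tokens are dropped independently, the expected output length is (1 - p) times the input length, but any given output may lose semantically crucial words.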

B.80 Random Upper-Case Transformation
This perturbation adds noise to all types of text sources (sentence, paragraph, etc.) by randomly adding upper-cased letters. With a default probability of 0.1, each character in a sequence is upper-cased. This transformation does not rely on a tokenizer and thus works with all languages that have upper- and lower-case letters. One limitation of this transformation is that it will not affect a tokenizer that lower-cases all input. A similar transformation was suggested by Wei and Zou (2019). Further improvement of this transformation is possible by relying on extreme value theory (Jalalzai et al., 2020).

27Quora Question Pairs
28https://huggingface.co/ramsrigouthamg/t5_paraphraser
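The per-character upper-casing can be sketched as follows; a minimal illustration, not the exact NL-Augmenter code:

```python
import random

def random_upper(text, p=0.1, seed=0):
    """Upper-case each character independently with probability p."""
    rng = random.Random(seed)
    return "".join(c.upper() if rng.random() < p else c for c in text)
```

Since the decision is made per character, the perturbation works on any script with case distinctions and leaves caseless characters (digits, punctuation) unchanged by `str.upper`.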

B.81 Double Context QA

This transformation repeats the context of a question answering example. This should not change the result in any way.

B.82 Replace Abbreviations and Acronyms

This transformation changes abbreviations and acronyms appearing in an English text to their expanded form and, respectively, changes expanded abbreviations and acronyms appearing in a text to their shorter form. For example, “send this file asap to human resources” might be changed to “send this file as soon as possible to HR”. The list of abbreviations and acronyms used in this transformation was manually gathered, focusing on common abbreviations present in business communications. When abbreviations are context-dependent or highly specific, the induced change may change the meaning of a text, or an abbreviation may not be available in the lookup. The transformation was first introduced by Regina et al. (2020).

B.83 Replace Financial Amounts

This transformation replaces financial amounts throughout a text with the same value in a different currency. The replacement changes the amount, the writing format, as well as the currency of the financial amount. For example, the sentence “I owe Fred € 20 and I need € 10 for the bus.” might be changed to “I owe Fred 2 906.37 Yen and I need 1 453.19 Yen for the bus.” The transformation was first introduced by Regina et al. (2020).

B.84 Replace Numerical Values

This transformation looks for numerical values in an English text and replaces them with other random values of the same cardinality. For example, “6.9” may be replaced by “4.2”, or “333” by “789”. The transformation was first introduced by Mille et al. (2021).

B.85 Replace Spelling

This transformation adds noise to all types of English text sources (sentence, paragraph, etc.) using corpora of common spelling errors introduced by Deorowicz and Ciura (2005). Each word with a common misspelling is replaced by the misspelled version with a probability p, which by default is set to 0.2.

B.86 Replace nouns with hyponyms or hypernyms

This transformation replaces common nouns with other related words that are either hyponyms or hypernyms. Hyponyms of a word are more specific in meaning (such as a sub-class of the word), e.g., ’spoon’ is a hyponym of ’cutlery’. Hypernyms are related words with a broader meaning (such as a generic category/super-class of the word), e.g., ’colour’ is a hypernym of ’red’. Not every word has a hypernym or hyponym.

B.87 Sampled Sentence Additions

This transformation adds generated sentences to all types of English text sources (sentence, paragraph, etc.) by passing the input text to a GPT-2 model (Radford et al., 2019). By default, GPT-2 XL is used, together with the prompt “paraphrase:” appended to the original text, after which up to 75 tokens are sampled. Since the additional text is sampled from a model, the model may introduce harmful language or generate text that contradicts the earlier text or changes its meaning. The idea to sample one or more additional sentences was first introduced by Jia and Liang (2017a).

B.88 Sentence Reordering

This perturbation adds noise to all types of text sources (paragraph, document, etc.) by randomly shuffling the order of sentences in the input text (Lewis et al., 2020). Sentences are first partially decontextualized by resolving coreference (Lee et al., 2018).

This transformation is limited to input text that has more than one sentence. There are still cases where coreference resolution is not enough for decontextualization. For example, there could be occurrences of ellipsis as demonstrated by Gangal et al. (2021), or events could be mentioned in a narrative style which makes it difficult to perform re-ordering or shuffling (Kocisky et al., 2018) while keeping the context of the discourse intact.

B.89 Emoji Addition for Sentiment Data

This transformation adds positive emojis and smileys to positive sentiment data and negative emojis to negative sentiment data. For non-labelled data, it adds neutral smileys.

B.90 Shuffle Within Segments

In this transformation, a token sequence, for example a BIO-tagged one, is split into coherent segments, such that each segment corresponds to either a mention or a sequence of out-of-mention tokens. For example, the sentence “She did not complain of headache or any other neurological symptoms .” with tags O O O O O B-problem O B-problem I-problem I-problem I-problem O is split into five segments: [She did not complain of], [headache], [or], [any other neurological symptoms], [.]. Then, for each segment, a binomial distribution (p=0.5) is used to decide whether it should be shuffled. If yes, the order of the tokens within the segment is shuffled while the label order is kept unchanged. This transformation is inspired by Dai and Adel (2020).
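The segment-splitting and shuffling steps can be sketched as below; a simplified illustration of the idea, with hypothetical helper names, not the reference implementation:

```python
import random

def split_segments(tokens, tags):
    """Split a BIO-tagged sequence into mention / non-mention segments."""
    segments, cur = [], []
    for tok, tag in zip(tokens, tags):
        # A new segment starts at every B- tag, or when we cross an
        # O / non-O boundary.
        if cur and (tag.startswith("B-") or (tag == "O") != (cur[-1][1] == "O")):
            segments.append(cur)
            cur = []
        cur.append((tok, tag))
    if cur:
        segments.append(cur)
    return segments

def shuffle_within_segments(tokens, tags, p=0.5, seed=0):
    """Shuffle tokens inside each segment with probability p; tags stay put."""
    rng = random.Random(seed)
    out_toks, out_tags = [], []
    for seg in split_segments(tokens, tags):
        toks = [t for t, _ in seg]
        if rng.random() < p:
            rng.shuffle(toks)       # reorder tokens, keep the tag order
        out_toks += toks
        out_tags += [g for _, g in seg]
    return out_toks, out_tags
```

On the example above, `split_segments` yields exactly the five segments listed, and the output tag sequence is always identical to the input one.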

B.91 Simple Ciphers

This transformation modifies the text in ways that a human could rapidly decipher, but which make the input sequences almost completely unlike the typical input sequences used during language model training. It includes the following text modifications: doubling the characters, doubling the words, adding spaces between the characters, reversing all characters in the text, reversing the characters within each word, reversing the order of the words in the text, substituting homoglyphs, and the rot13 cipher.
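A few of the listed ciphers reduce to one-liners; this sketch covers four of them (homoglyph substitution, which needs a character table, is omitted):

```python
import codecs

def reverse_words(text):
    """Reverse the order of the words in the text."""
    return " ".join(reversed(text.split()))

def reverse_chars_in_words(text):
    """Reverse the characters within each word."""
    return " ".join(w[::-1] for w in text.split())

def space_chars(text):
    """Add spaces between the characters."""
    return " ".join(text)

def rot13(text):
    """Apply the rot13 cipher via the standard codecs transform."""
    return codecs.encode(text, "rot_13")
```

Each of these is trivially invertible by a human reader, which is exactly what makes them useful probes of model robustness.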

B.92 Slangificator

This transformation replaces some of the words (in particular, nouns, adjectives, and adverbs) of an English text with their corresponding slang. The replacement is done with a subset of the "Dictionary of English Slang & Colloquialisms".29 The amount of replacement is proportional to the corresponding probabilities of replacement (by default, 0.5 each for nouns, adjectives, and adverbs).

B.93 Spanish Gender Swap

This transformation changes the gender of all animate entities (mostly referring to people, and some animals) in a given Spanish sentence from masculine to feminine. This includes masculine nouns with feminine equivalents (e.g., doctor → doctora), nouns with a common gender (“sustantivos comunes en cuanto al género”, e.g., el violinista → la violinista), personal pronouns, and (optionally) given names often used with a given gender (e.g., Pedro → Alicia). Epicene nouns are excluded. In addition, the gender of adjectives, determiners, pronouns, and participles is modified in order to maintain grammatical agreement.

29http://www.peevish.co.uk/slang/index.htm

B.94 Speech Disfluency Perturbation
This perturbation randomly inserts speech disfluencies in the form of filler words into English texts. With a given probability (0.2 by default), a speech disfluency is inserted between words. The default disfluencies are "um", "uh", "erm", "ah", and "er". At least one filler word is always inserted by this transformation.

B.95 Paraphrasing through Style Transfer
This transformation provides a range of possible styles of writing English language. The following styles can be chosen:

• Shakespeare - Trained on written works by Shakespeare.

• Switchboard - Trained on a collection of conversational speech transcripts.

• Tweets - Trained on 5.2M English tweets.

• Bible - Trained on texts from the Bible.

• Romantic poetry - Trained on romantic poetry.

• Basic - A light, basic paraphraser with no specific style.

The transformation follows the models and formulations by Krishna et al. (2020).

B.96 Subject Object Switch
This transformation switches the subject and object of English sentences to generate new sentences with a very high surface similarity but very different meaning. This can be used, for example, to augment data for models that assess semantic similarity.

B.97 Sentence Summarization
This transformation compresses English sentences by extracting subjects, verbs, and objects of the sentence. It also retains any negations. For example, “Stillwater is not a 2010 American live-action/animated dark fantasy adventure film” turns into “Stillwater !is film”. Zhang et al. (2021) used a similar idea to this transformation.

B.98 Suspecting Paraphraser for QA
This paraphraser transforms a yes/no question into a declarative sentence with a question tag,30 which helps to add more question-specific informality to the dataset. Example: ”Did the American National Shipment company really break its own fleet?” -> ”The American National Shipment company really broke its own fleet, didn’t it?”

30https://www.englishclub.com/grammar/tag-questions.htm

B.99 Swap Characters Perturbation
This perturbation randomly swaps two adjacent characters in a sentence or a paragraph with a default probability (Zhang et al., 2019a).
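The adjacent-character swap can be sketched as below; a minimal illustration (the step over swapped pairs is an assumption, to avoid immediately undoing a swap), not the exact implementation:

```python
import random

def swap_characters(text, p=0.05, seed=0):
    """Swap each pair of adjacent characters with probability p."""
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if rng.random() < p:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2          # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)
```

The perturbation preserves the multiset of characters, so sequence length and vocabulary are unchanged; only local order is corrupted.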

B.100 Synonym Insertion
This perturbation adds noise to all types of text sources (sentence, paragraph, etc.) by randomly inserting synonyms of randomly selected words, excluding punctuation and stopwords (Marivate and Sefara, 2020).

B.101 Synonym Substitution
This perturbation randomly substitutes some words in an English text with their WordNet (Miller, 1998) synonyms.

B.102 Syntactically Diverse Paraphrasing using Sow Reap models

This transformation is capable of generating multiple syntactically diverse paraphrases for a given sentence based on the work of Goyal and Durrett (2020). The model paraphrases inputs using a two-step framework: 1) SOW (Source Order reWriting): this step enumerates multiple feasible syntactic transformations of the input sentence. 2) REAP (REarrangement Aware Paraphrasing): this step conditions on the multiple reorderings/rearrangements produced by SOW and outputs diverse paraphrases corresponding to these reorderings. The transformation is designed to work only on single-sentence inputs. Multi-sentence inputs result in an empty string/no transformation. The model is trained on the ParaNMT-50M dataset (Wieting and Gimpel, 2017; Wieting et al., 2017), which can be argued to be a bit noisy.

B.103 Subsequence Substitution for Sequence Tagging

This transformation performs same-label subsequence substitution for the task of sequence tagging, which replaces a subsequence of the input tokens with another one that has the same sequence of tags (Shi et al., 2021). This is done as follows: (1) Draw a subsequence A from the input (tokens, tags) tuple. (2) Draw a subsequence B from the whole dataset, with the same tag subsequence. (3) Substitute A with B in the input example.

B.104 Change English Tense
This transformation converts English sentences from one tense to another, for example simple present to simple past. This transformation was introduced by Logeswaran et al. (2018).

B.105 Token Replacement Based on Lookup Tables

This transformation replaces input tokens with their perturbed versions sampled from a given lookup table of replacement candidates. Lookup tables containing OCR errors and misspellings from prior work are given as examples. Thus, by default, the transformation induces plausible OCR errors and human typos in the input sentence.

The transformation is an adapted and improved version of the lookup table-based noise induction method from Namysl et al. (2020). The OCR lookup table is from Namysl et al. (2021) and the misspellings are from Piktus et al. (2019).

B.106 Transformer Fill
This perturbation replaces words based on recommendations from a masked language model. The transformation can limit replacements to certain POS tags (all enabled by default). Many previous papers have used this technique for data augmentation (Ribeiro et al., 2020; Li et al., 2020b, inter alia).

B.107 Underscore Trick
This perturbation adds noise to text sources such as sentences, paragraphs, etc., to test robustness. It replaces some random spaces with underscores (or other selected symbols). This perturbation would benefit all tasks which take a sentence/paragraph/document as input, like text classification and text generation, especially tasks related to understanding/generating scripts.

B.108 Unit converter
This transformation converts length and weight measures to different units (e.g., kilometers to miles), picking the new unit at random but converting the quantity accurately. The transformation preserves the format of the original quantity: "100 pounds" is converted to "1600 ounces" but "one-hundred pounds" is converted to "one thousand, six hundred ounces". Generated transformations display high similarity to the source sentences.
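The conversion logic for digit-form quantities can be sketched as below. The two-unit table and function names are illustrative assumptions; the actual transformation covers many more length and weight units and also handles spelled-out numbers.

```python
import re

# Factors relative to a base unit (ounces); illustrative subset only.
TO_OUNCES = {"pounds": 16, "ounces": 1}

def convert_weight(quantity, unit, target):
    """Convert `quantity` of `unit` into `target` weight units."""
    return quantity * TO_OUNCES[unit] / TO_OUNCES[target]

def convert_in_text(text, target="ounces"):
    """Rewrite every '<number> <unit>' span in the text into `target` units."""
    def repl(match):
        qty, unit = float(match.group(1)), match.group(2)
        return f"{convert_weight(qty, unit, target):g} {target}"
    return re.sub(r"(\d+(?:\.\d+)?)\s+(pounds|ounces)", repl, text)
```

This reproduces the document's example: 100 pounds × 16 oz/lb = 1600 ounces.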

B.109 Urban Thesaurus Swap

This perturbation randomly picks nouns from the input source to convert to related terms drawn from the Urban Dictionary31 resource. It can be applied to an input text to produce semantically similar output texts in order to generate more robust test sets. We first select nouns at random, then query the Urban Thesaurus website32 to obtain a list of related terms to swap in (Wilson et al., 2020).

B.110 Use Acronyms

This transformation substitutes groups of words with their equivalent acronyms. It helps to increase the size of the dataset as well as to improve the understanding of acronyms of models trained on data augmented with this transformation. It can increase the data for any task that has input texts, and is especially interesting for tasks on semantic similarity, where models should be aware of the equivalence between a set of words and their acronym. The quality of the transformation depends on the list of acronyms. As of now, this list was scraped from Wikipedia's List of Acronyms33 and naively filtered, which leaves space for improvement.

B.111 Visual Attack Letter

This perturbation replaces letters with visually similar, but different, letters. Every letter is embedded into 576 dimensions: to obtain the embedding, the letter is resized into a 24x24 image, then flattened and scaled. The nearest neighbors are obtained through cosine distance. This follows the Image Based Character Embedding (ICES) approach (Eger et al., 2019a).

The top neighbors of each letter are chosen. Some neighbors which did not qualify from the image embedding were removed by judgment (e.g., the nearest neighbors for ’v’ include many variations of the letter ’y’) (Eger et al., 2019b).

B.112 Weekday Month Abbreviation

This transformation abbreviates or expands the names of months and weekdays, e.g., Mon. -> Monday. Generated transformations display high similarity to the source sentences and do not alter the meaning and semantics of the original texts. The transformation does not abbreviate plural names, e.g., Sundays, and does not affect texts without names of weekdays or months.

31https://www.urbandictionary.com/
32https://urbanthesaurus.org/
33https://en.wikipedia.org/wiki/Lists_of_acronyms

B.113 Whitespace Perturbation

This perturbation adds noise to text by randomly removing or adding whitespaces.
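One way to realize this, sketched below with hypothetical default probabilities (the actual defaults may differ):

```python
import random

def whitespace_perturbation(text, add_p=0.05, remove_p=0.05, seed=0):
    """Randomly drop existing spaces and insert spurious ones."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch == " " and rng.random() < remove_p:
            continue                 # drop this whitespace
        out.append(ch)
        if ch != " " and rng.random() < add_p:
            out.append(" ")          # insert a spurious space
    return "".join(out)
```

Removed spaces merge neighboring words while inserted spaces split words apart, both of which stress whitespace-based tokenizers.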

B.114 Context Noise for QA

This transformation chooses a set of words at random from the context and the question and forms a sentence out of them. The sentence is then prepended or appended to the context to create a new QA pair. The transformation is inspired by the AddAny method described in Adversarial SQuAD (Jia and Liang, 2017b). However, instead of probing the model to generate adversaries, random words from the context and the question are simply selected and joined together into a sentence, ignoring grammaticality. The transformation attempts to create novel QA pairs, assuming that the introduction of random words to the context is less likely to change the answer to the asked question.

B.115 Writing System Replacement

This transformation replaces the writing system of the input with another writing system. We use CJK Unified Ideographs34 as the source of characters for the generated writing systems. The transformation would benefit text classification tasks, especially in cases where the input writing system is undeciphered.

B.116 Yes-No Question Perturbation

This transformation turns English non-compound statements into yes-no questions. The generated questions can be answered by the statements that were used to generate them. The text is left largely unchanged other than the fronted/modified/added auxiliaries and be-verbs.

The transformation works by getting a dependency parse and POS tags from a machine learning model and applying human-engineered, rule-based transformations to those parses/tags. This transformation would particularly benefit question-answering and question-generation tasks, as well as providing surplus legal text for language modeling and masked language modeling.

34https://en.wikipedia.org/wiki/CJK_Unified_Ideographs

B.117 Yoda Transformation

This perturbation modifies sentences to flip the clauses such that they read like "Yoda-speak", for example, "Much to learn, you still have". This form of construction is sometimes called "XSV", where the "X" is a stand-in for whatever chunk of the sentence goes with the verb; it normally appears very rarely in English. The rarity of this construction in ordinary language makes it particularly well suited for NL augmentation and serves as a relatively easy but potentially powerful test of robustness.

C Filters

The following is the list of all filters submitted to NL-Augmenter. Filters are used to filter data and create subpopulations of given inputs, according to features such as input complexity, input size, etc. Therefore, the output of a filter is a boolean value, indicating whether the input meets the filter criterion. We discuss the implementations of each filter along with their limitations. The title of each filter subsection is clickable and redirects to the actual Python implementation.

C.1 Code-Mixing Filter

This filter identifies whether the input text is code-mixed. It checks that there is at least one sentence in the text where there are tokens representing at least ‘k’ unique languages (with at least a ‘threshold‘ level of confidence that each token is of that language). It is useful for collecting code-mixed data to test a model’s performance on multilingual tasks. The filter relies on ftlid35 for language detection; therefore, it might be limited by the performance of the language detection tool.

� (containing code-mixing) Yo estaba con Esteban yesterday, he was telling me about lo que su esposa vio en los Estados Unidos. → True

C.2 Diacritics Filter

This filter checks whether any character in the sentence has a diacritic. It can be used to create splits of the dataset where the sentences have diacritics. Accented characters are typically among the rarer characters, and checking the model performance on such a split might help investigate model robustness.

35https://pypi.org/project/ftlid/

� (containing diacritics) She lookèd east an she lookèd west. → True

C.3 Encoding Filter

This filter filters examples which contain characters outside a given encoding. It can be used to find examples containing, e.g., non-ASCII Unicode characters. Filtering out and testing examples that contain these characters can provide feedback on how to improve the models accordingly, since most models are trained with plain English text, which contains mostly ASCII characters. Sometimes non-ASCII characters are even explicitly stripped away.

� (containing non-ASCII characters) That souvenir sure was expensive at 60£... or was it 60€? → True

C.4 Englishness Filter

This filter identifies texts that contain uniquely British spellings, vocabulary, or slang. The filter uses a vocabulary of common British words/phrases and checks the number of occurrences of British words in the given texts. The text is selected if the number exceeds a pre-defined threshold.

� (containing British spellings) Colour is an attribute of light that is perceived by the human eye. → True

C.5 Gender Bias Filter

This filter filters a text corpus to measure gender fairness with respect to female gender representation. It supports four languages (English, French, Polish, and Russian) and can be used to determine whether the female gender is sufficiently represented in a tested subset of sentences. The filter uses a list of lexical items, which includes filter categories such as personal pronouns, words defining relations, titles, and names, corresponding to the female and male genders accordingly.

� (texts with unbalanced representation) "He went home", "He drives a car", "She has returned" → True

C.6 Group Inequity Filter

This is a bilingual filter (for the English and French languages), which helps to discover potential group inequity issues in a text corpus. It is a topic-agnostic filter which accepts user-defined parameters, consisting of keywords inherent to a minor group (which potentially might suffer from discrimination), a major group, a minor factor, and a major factor. The filter first flags the sentences as belonging to the minor and the major groups, and then the sentences from each of the groups are used to define the intersection with both factors. The filter then compares whether the percentage of major factors exceeds that of the minor factors to determine if the sentences have group inequity issues.

� (containing group inequity issues) "He is a doctor", "She is a nurse", "She works at the hospital" → True

C.7 Keyword Filter
This is a simple filter, which filters examples based on a pre-defined set of keywords. It can be useful in creating splits for a specific domain.

� (containing keyword "at") Andrew played cricket in India → True

C.8 Language Filter
This filter selects texts that match any of a given set of ISO 639-1 language codes (the default language being English). Language matching is performed using a pre-trained langid.py model instance. The model provides normalized confidence scores. A minimum threshold score needs to be set, and all sentences with confidence scores above this threshold are accepted by the filter.

� (is English text) Mein Luftkissenfahrzeug ist voller Aale → False

C.9 Length Filter
This filter filters data with the input text length matching a specified threshold. It can be useful in creating data with different length distributions.

� (containing more than 3 words) Andrew played cricket in India → True
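A length filter matching the example above reduces to a single comparison; a minimal sketch, assuming whitespace tokenization and an illustrative function name:

```python
def length_filter(text, threshold=3):
    """True iff the input contains more than `threshold` whitespace tokens."""
    return len(text.split()) > threshold
```

In practice the comparison operator and the tokenizer would be configurable, so the same filter can select short, long, or exact-length inputs.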

C.10 Named-entity-count Filter
This filter filters data where the number of named entities in the input matches a specified threshold (based on the supported conditions).

� (containing more than 1 named entity) Novak Djokovic is the greatest tennis player of all time. → True

C.11 Numeric Filter
This filter filters examples which contain a numeric value. In tasks like textual entailment, question answering, etc., a quantity (number) can directly affect the final label/response. This filter can be used to create splits to measure performance separately on texts containing numeric values.

� (containing numbers in texts) John bought a car worth dollar twenty five thousand. → True

C.12 Oscillatory Hallucinations Filter

This filter is designed to operate on the outputs of text generation systems, with the purpose of extracting oscillatory hallucinations. Oscillatory hallucinations are one class of hallucinations characterized by a repeating bigram structure in the output (Raunak et al., 2021). Typically, these behaviors are observed in models trained on noisy corpora. The filter counts the frequency of bigrams in both source and target texts, and compares the frequency difference with a pre-set threshold to determine whether the texts include oscillatory hallucinations.

Example (containing hallucinations in target text): Source: "Community, European Parliament common regional policy, Mediterranean region", Target: "Arbeitsbedingungen, berufliche Bildung, berufliche Bildung, berufliche Bildung" → True
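The bigram-frequency heuristic can be sketched as follows. This is an illustrative reading of the description, not the paper's exact scoring rule: it compares the most-repeated bigram count in the target against the source.

```python
from collections import Counter

def bigrams(tokens):
    """Adjacent token pairs of a token list."""
    return list(zip(tokens, tokens[1:]))

def has_oscillatory_hallucination(source: str, target: str,
                                  threshold: int = 1) -> bool:
    """Flag the target if some bigram repeats in it notably more often
    than any bigram repeats in the source (frequency-difference heuristic;
    the threshold value here is an assumption)."""
    max_src = max(Counter(bigrams(source.split())).values(), default=0)
    max_tgt = max(Counter(bigrams(target.split())).values(), default=0)
    return (max_tgt - max_src) >= threshold
```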

C.13 Polarity Filter

This filter filters out a transformed text if it does not retain the same polarity as the original text. It helps avoid distorting training data during augmentation for sentiment analysis-related tasks: while generating new data for a sentiment analysis task, it is important to make sure that the generated data is labelled correctly.

Example (texts retaining polarity): "Hotel is terrible", "Hotel is great" → False

C.14 Quantitative Question Filter

This is a simple rule-based filter that can be used to identify quantitative questions. It can help to analyse models' performance on questions which require numerical understanding. It is also useful to study possible biases in question generation.

Example (being a quantitative question): How long does the journey take? → True
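A rule-based check of this kind can be sketched with a pattern over common quantitative question openers. The pattern below is illustrative only; the actual filter's rule set is likely broader.

```python
import re

# Openers that typically start a quantitative question (illustrative,
# not the exhaustive rule set used by the real filter).
QUANT_PATTERN = re.compile(
    r"^(how (many|much|long|old|far|tall|often|big)"
    r"|what (percentage|fraction|number))\b",
    re.IGNORECASE,
)

def is_quantitative_question(question: str) -> bool:
    """True if the question opens with a quantitative pattern."""
    return bool(QUANT_PATTERN.match(question.strip()))
```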

C.15 Question Type Filter

This filter helps identify the question category of a question answering example based on the question word or the named entity type of the answer. Knowledge of the question type can help in the development of question answering systems (Parikh et al., 2019) as well as in assessing performance on individual splits.

Example (being a "where" question): Where is Delhi located? → True

C.16 Repetitions Filter

This filter finds texts with repetitions using simple heuristic rules. It might be helpful in finding repetitions that frequently occur in spoken language data.

Example (containing repetitions in text): I I want to sleep → True
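One such heuristic is checking for an immediately repeated token, a common disfluency in spoken-language transcripts. This sketch is an assumption about the rule; the actual filter may use additional heuristics.

```python
def has_repetition(text: str) -> bool:
    """True if any token is immediately repeated (case-insensitive)."""
    tokens = text.lower().split()
    return any(a == b for a, b in zip(tokens, tokens[1:]))
```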

C.17 Phonetic Match Filter

This filter selects texts that contain entries matching a list of supplied keywords. It first transforms the input sentence and the keywords into phonetic units and then checks whether the two phonetic unit sets overlap.

Example (containing homophones of keyword "trombone"): I left my trombno on the train → True
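One way to realize such phonetic matching is with Soundex, a classic phonetic encoding; this is only one possible scheme, and the actual filter may use a different one. The sketch below encodes each token and checks the encodings for overlap.

```python
def soundex(word: str) -> str:
    """Compact Soundex encoding (one of several phonetic schemes)."""
    codes = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3",
             "l": "4", "mn": "5", "r": "6"}

    def code(ch: str) -> str:
        for group, digit in codes.items():
            if ch in group:
                return digit
        return ""

    word = "".join(c for c in word.lower() if c.isalpha())
    if not word:
        return ""
    encoded = word[0].upper()
    prev = code(word[0])
    for ch in word[1:]:
        digit = code(ch)
        if digit and digit != prev:
            encoded += digit
        if ch not in "hw":       # 'h' and 'w' do not reset the previous code
            prev = digit
    return (encoded + "000")[:4]  # pad/truncate to the standard 4 characters

def phonetic_match(sentence: str, keywords: list) -> bool:
    """True if any sentence token sounds like any supplied keyword."""
    sentence_codes = {soundex(tok) for tok in sentence.split()}
    return any(soundex(kw) in sentence_codes for kw in keywords)
```

Under Soundex, "trombone" and the misspelling "trombno" both encode to the same code, so the paper's example matches.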

C.18 Special Casing Filter

This filter checks if the input sentence has special casing, i.e., the string is either all lowercase, all uppercase, or in title case. It might be useful for creating splits that contain texts with unusual casing, e.g., misspellings.

Example (text being uppercased/lowercased/title-cased): let's go to chipotle → True
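The three casing checks map directly onto Python's built-in string predicates. A minimal sketch (the function name is an assumption):

```python
def has_special_casing(text: str) -> bool:
    """True if the text is all lowercase, all uppercase, or title-cased."""
    return text.islower() or text.isupper() or text.istitle()
```

Note that mixed-case text such as "Let's go to Chipotle" passes none of the three predicates and is therefore rejected.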

C.19 Speech-Tag Filter

This filter filters an example text based on a set of speech tags, identifying whether the counts of the selected POS tags meet pre-defined conditions (e.g., above a threshold).

Example (containing 1 verb and 2 numbers in text): It all happened between November 2007 and November 2008. → True

C.20 Token-Amount Filter

This filter filters an example text based on whether certain keywords are present a specified number of times.

Example (containing 2 occurrences of "in"): Andrew played cricket in a soccer stadium in India at 9pm → True
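Counting keyword occurrences over whitespace tokens gives a minimal version of this filter. The exact-match condition and the function name are illustrative assumptions.

```python
def token_amount_filter(text: str, keyword: str, count: int) -> bool:
    """True if the keyword appears exactly `count` times among the
    lowercased whitespace-separated tokens (so "India" does not
    spuriously match "in")."""
    tokens = text.lower().split()
    return tokens.count(keyword.lower()) == count
```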

C.21 Toxicity Filter

This filter filters an example text which has a toxicity value matching a particular threshold. It uses a pre-trained toxicity detector which provides 7 toxicity scores; all 7 types of toxicity scores can be used as criteria for the filtering.

Example (text being toxic): I disagree. It is not supposed to work that way. → False

C.22 Universal Bias Filter

This filter works the same way as the Gender Bias Filter, but measures balance of representation for more categories (religion, race, ethnicity, gender, sexual orientation, age, appearance, disability, experience, education, economic status). The lexical seeds representing these categories are currently available in English only; however, the pool of languages can be extended by simply adding lexical seeds in a desired language to the lexicals.json file.

Example (texts being biased): "He is going to make a cake.", "She is going to program", "Nobody likes washing dishes", "She agreed to help him" → False

C.23 Yes/No Question Filter

This filter allows selecting questions that can be correctly answered with either 'yes' or 'no'. Since it is rule-based, its limitation is that ambiguous questions might not be recognized.

Example (text being a yes/no question): Wasn't she angry when you told her about the accident? → True
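A typical rule for this check is that yes/no questions open with an auxiliary or modal verb, including negative contractions as in the example above. The sketch below is an illustrative guess at the rule; the word list is not exhaustive and the actual filter may use different criteria.

```python
# Auxiliaries/modals that can open a yes/no question (illustrative list).
AUXILIARIES = {
    "is", "are", "was", "were", "am", "do", "does", "did",
    "has", "have", "had", "can", "could", "will", "would",
    "shall", "should", "may", "might", "must",
    "isn't", "aren't", "wasn't", "weren't", "don't", "doesn't",
    "didn't", "hasn't", "haven't", "can't", "couldn't", "won't",
    "wouldn't", "shouldn't",
}

def is_yes_no_question(question: str) -> bool:
    """Rule-based check: the text ends with '?' and starts with an
    auxiliary or modal verb."""
    q = question.strip()
    if not q.endswith("?"):
        return False
    return q.split()[0].lower() in AUXILIARIES
```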