BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla

Anonymous ACL submission

Abstract

In this paper, we introduce ‘BanglaBERT’, a BERT-based Natural Language Understanding (NLU) model pretrained in Bangla, a widely spoken yet low-resource language in the NLP literature. To pretrain BanglaBERT, we collect 27.5 GB of Bangla pretraining data (dubbed ‘Bangla2B+’) by crawling 110 popular Bangla sites. We introduce a new downstream task dataset on Natural Language Inference (NLI) and benchmark on four diverse NLU tasks covering text classification, sequence labeling, and span prediction. In the process, we bring them under the first-ever Bangla Language Understanding Evaluation (BLUE) benchmark. BanglaBERT achieves state-of-the-art results, outperforming multilingual and monolingual models. We will make the BanglaBERT model, the new datasets, and a leaderboard publicly available to advance Bangla NLP.

1 Introduction

Despite being the sixth most spoken language in the world, with over 300 million native speakers constituting 4% of the world’s total population,¹ Bangla is considered a resource-scarce language. Joshi et al. (2020b) categorized Bangla in the language group that lacks efforts in labeled data collection and relies on self-supervised pretraining (Devlin et al., 2019; Radford et al., 2019; Liu et al., 2019) to boost natural language understanding (NLU) task performance. To date, the Bangla language has continued to rely on fine-tuning multilingual pretrained language models (PLMs) (Ashrafi et al., 2020; Das et al., 2021; Islam et al., 2021). However, since multilingual PLMs cover a wide range of languages (Conneau and Lample, 2019; Conneau et al., 2020), they are large (hundreds of millions of parameters) and require substantial computational resources for fine-tuning. They also tend to show degraded performance on downstream NLU tasks for low-resource languages (Wu and Dredze, 2020). Motivated by the triumph of language-specific models (Martin et al. (2020); Polignano et al. (2019); Canete et al. (2020); Antoun et al. (2020), inter alia) over multilingual models in many other languages, in this work we present BanglaBERT, a BERT-based (Devlin et al., 2019) Bangla NLU model pretrained on 27.5 GB of data (which we name ‘Bangla2B+’) that we meticulously crawled from 110 popular Bangla websites, to facilitate NLU applications in Bangla.

¹ https://w.wiki/Psq

We also introduce a Bangla Natural Language Inference (NLI) dataset, a task previously unexplored in Bangla, and evaluate BanglaBERT on four diverse downstream tasks: sentiment classification, NLI, named entity recognition, and question answering. We bring these tasks together to establish the first-ever Bangla Language Understanding Evaluation (BLUE) benchmark. We compare two widely used multilingual models to BanglaBERT using the BLUE benchmark and find that BanglaBERT excels on all the tasks. We summarize our contributions as follows:

1. We present BanglaBERT, a pretrained BERT model for Bangla, and introduce a new Bangla natural language inference (NLI) dataset.

2. We introduce the Bangla Language Understanding Evaluation (BLUE) benchmark and provide a set of strong baselines.

3. We release code and provide a leaderboard to spur future research on Bangla NLU.²

² https://github.com/hidden/hidden

2 BanglaBERT

2.1 Pretraining Data

A high volume of good-quality text data is a prerequisite for pretraining large language models. For instance, BERT (Devlin et al., 2019) is pretrained on the English Wikipedia and the Books corpus (Zhu et al., 2015), containing 3.3 billion tokens. Subsequent works like RoBERTa (Liu et al., 2019) and XLNet (Yang et al., 2019) used more extensive web-crawled data with heavy filtering and cleaning.

Bangla is a rather resource-constrained language in the web domain; for example, the Bangla Wikipedia dump from July 2021 is only 650 MB, two orders of magnitude smaller than the English Wikipedia. As a result, we had to crawl the web extensively to collect our pretraining data. We selected 110 Bangla websites based on their Amazon Alexa rankings³ and on the volume and quality of extractable text, which we assessed by inspecting each website. The contents included encyclopedias, news, blogs, e-books, stories, social media/forums, etc.⁴ The amount of data totaled around 35 GB.

³ www.alexa.com/topsites/countries/BD
⁴ The complete list of sources can be found in the Appendix.

There are also noisy sources of Bangla data dumps, a couple of prominent ones being OSCAR (Suárez et al., 2019) and CCNet (Wenzek et al., 2020). However, they contained a substantial amount of offensive text that we found infeasible to clean thoroughly. Fearing potential harmful impacts (Luccioni and Viviano, 2021), we opted not to use them.⁵

⁵ We cover other ethical considerations in the Appendix.

2.2 Pre-processing

We performed thorough deduplication on the pretraining data, removed non-textual content (e.g., HTML/JavaScript tags), and filtered out non-Bangla pages using a language classifier (Joulin et al., 2017). After this processing, the dataset was reduced to 27.5 GB, containing 5.25M documents with 306.66 words on average.
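For illustration, the language-filtering step could be implemented with fastText’s public lid.176.bin language-identification model, as sketched below. The classifier family follows Joulin et al. (2017), but the specific model file, confidence threshold, and helper function are our illustrative assumptions, not details from the paper:

```python
# Hedged sketch of the non-Bangla page filter, assuming fastText's public
# lid.176.bin language-ID model; the 0.8 threshold is illustrative.
import fasttext

lid = fasttext.load_model("lid.176.bin")  # hypothetical local path to the model

def is_bangla(text: str, threshold: float = 0.8) -> bool:
    """Return True if the classifier is confident the page is Bangla."""
    # fastText's predict() rejects newlines, so collapse them first.
    labels, probs = lid.predict(text.replace("\n", " "))
    return labels[0] == "__label__bn" and probs[0] >= threshold

pages = ["আমি বাংলায় গান গাই।", "This page is entirely in English."]
bangla_pages = [p for p in pages if is_bangla(p)]
```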

We trained a WordPiece (Wu et al., 2016) vocabulary of 32k subword tokens on the resulting corpus with a 400-character alphabet, kept larger than the native Bangla alphabet to capture code-switching (Poplack, 1980) and to allow romanized Bangla content for better generalization. We limited the length of a training sample to 512 tokens and did not cross document boundaries (Liu et al., 2019) while creating a data point. After tokenization, we had 7.18M samples with an average length of 304.14 tokens, containing 2.18B tokens in total; hence, we named the dataset ‘Bangla2B+’.
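For concreteness, a 32k WordPiece vocabulary with a 400-character alphabet could be trained with the HuggingFace `tokenizers` library roughly as follows. This is a sketch: the paper does not name a training library, and the corpus file name is hypothetical:

```python
# Sketch of training the 32k WordPiece vocabulary described above;
# library choice and file names are assumptions, not the authors' setup.
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=32_000,
    limit_alphabet=400,  # kept larger than the native Bangla alphabet
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["bangla2b_plus.txt"], trainer=trainer)  # hypothetical corpus dump
tokenizer.save("banglabert-wordpiece.json")
```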

2.3 Pretraining Objective

Self-supervised pretraining objectives leverage unlabeled data. For example, BERT (Devlin et al., 2019) was pretrained with masked language modeling (MLM) and next sentence prediction (NSP). Several works built on top of this: RoBERTa (Liu et al., 2019) removed NSP and pretrained with longer sequences, SpanBERT (Joshi et al., 2020a) masked contiguous spans of tokens, while works like XLNet (Yang et al., 2019) introduced objectives like factorized language modeling.

We pretrained BanglaBERT using ELECTRA (Clark et al., 2020b), which uses the Replaced Token Detection (RTD) objective, where a generator and a discriminator model are trained jointly. The generator is fed as input a sequence with a portion of the tokens masked (15% in our case) and is asked to predict them using the rest of the input (i.e., standard MLM). The masked tokens are then replaced by tokens sampled from the generator’s output distribution for the corresponding masks, and the discriminator then has to predict whether each token is from the original sequence or not. After pretraining, the discriminator is used for fine-tuning. Clark et al. (2020b) argued that RTD back-propagates loss from all tokens of a sequence, in contrast to the 15% of tokens used by the MLM objective, giving the model more signal to learn from. Empirically, ELECTRA achieves downstream performance comparable to RoBERTa or XLNet with only a quarter of their training time. This computational efficiency motivated us to use ELECTRA for our implementation of BanglaBERT.
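The RTD objective can be summarized in a short PyTorch sketch. The toy training step below follows the description above (15% masking, joint generator and discriminator losses) but is our illustration rather than the actual ELECTRA or BanglaBERT code; the discriminator loss weight of 50 follows the original ELECTRA paper:

```python
# Toy sketch of Replaced Token Detection: `generator` maps token IDs to
# (B, T, vocab) logits, `discriminator` maps token IDs to (B, T) logits.
import torch
import torch.nn.functional as F

def rtd_step(generator, discriminator, input_ids, mask_token_id, mask_rate=0.15):
    # 1) Mask 15% of the tokens and train the generator with standard MLM.
    mask = torch.rand(input_ids.shape, device=input_ids.device) < mask_rate
    masked = input_ids.masked_fill(mask, mask_token_id)
    gen_logits = generator(masked)
    mlm_loss = F.cross_entropy(gen_logits[mask], input_ids[mask])

    # 2) Replace the masked positions with samples from the generator's
    #    output distribution (sampling is non-differentiable, hence detach).
    samples = torch.distributions.Categorical(logits=gen_logits[mask].detach()).sample()
    corrupted = input_ids.clone()
    corrupted[mask] = samples

    # 3) The discriminator predicts, for every token, whether it was replaced,
    #    so the loss covers all positions rather than only the 15% masked ones.
    is_replaced = (corrupted != input_ids).float()
    rtd_loss = F.binary_cross_entropy_with_logits(discriminator(corrupted), is_replaced)

    # ELECTRA weights the discriminator loss by lambda = 50.
    return mlm_loss + 50.0 * rtd_loss
```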

2.4 Model Architecture & Hyperparameters

We pretrained the base ELECTRA model (a 12-layer Transformer encoder with 768 embedding size, 768 hidden size, 12 attention heads, 3072 feed-forward size, a generator-to-discriminator size ratio of 1/3, and 110M parameters) with a batch size of 256 for 2.5M steps on a v3-8 TPU instance on GCP. We used the Adam (Kingma and Ba, 2015) optimizer with a 2e-4 learning rate and a linear warmup of 10k steps.
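For readers who want the discriminator configuration in code, it roughly corresponds to the following HuggingFace `ElectraConfig` (an illustrative mapping on our part; the exact fields of the codebase used for TPU pretraining may differ):

```python
# Approximate HuggingFace config for the base discriminator described above;
# the field mapping is an assumption, not the authors' released config.
from transformers import ElectraConfig

config = ElectraConfig(
    vocab_size=32_000,          # WordPiece vocabulary from Section 2.2
    embedding_size=768,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,     # feed-forward size
    max_position_embeddings=512,
)
```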

3 The Bangla Language Understanding Evaluation (BLUE) Benchmark

Many works have studied different Bangla NLU tasks in isolation, e.g., sentiment classification (Das and Bandyopadhyay, 2010; Sharfuddin et al., 2018; Tripto and Ali, 2018), semantic textual similarity (Shajalal and Aono, 2018), parts-of-speech (PoS) tagging (Alam et al., 2016), and named entity recognition (NER) (Ashrafi et al., 2020). However, Bangla NLU has not yet had a comprehensive unified study. Motivated by the surge of NLU research brought about by benchmarks in other languages, e.g., English (Wang et al., 2018), French (Le et al., 2020), and Korean (Park et al., 2021), we establish the first-ever Bangla Language Understanding Evaluation (BLUE) benchmark.

Task | Corpus | |Train| | |Dev| | |Test| | Metric | Domain
Sentiment Classification | SentNoB | 12,575 | 1,567 | 1,567 | Macro-F1 | Social Media
Natural Language Inference | BNLI | 381,449 | 2,419 | 4,895 | Acc. | Misc.
Named Entity Recognition | MultiCoNER | 14,500 | 800 | 800 | Micro-F1 | Misc.
Question Answering | TyDiQA | 4,542 | 238 | 226 | EM/F1 | Wikipedia

Table 1: Statistics of the Bangla Language Understanding Evaluation (BLUE) benchmark.

NLU generally comprises three types of tasks: text classification, sequence labeling, and text span prediction. Text classification tasks can further be sub-divided into single-sequence and sequence-pair classification. Therefore, we consider a total of four tasks for BLUE. For each task type, we carefully select one downstream task dataset, emphasizing the quality and open availability of the datasets while making the selection. We briefly describe them below:

1. Single-Sequence Classification: Sentiment classification is perhaps the most-studied Bangla NLU task, with some of the earlier works dating back over a decade (Das and Bandyopadhyay, 2010). Hence, we chose this as the single-sequence classification task. However, most Bangla sentiment classification datasets are not publicly available. We could only find two public datasets: BYSA (Tripto and Ali, 2018) and SentNoB (Islam et al., 2021). We found BYSA to have many duplications; even worse, many duplicates had different labels. SentNoB had better quality and covered a broader set of domains, making the classification task more challenging. Hence, we opted to use the latter.

2. Sequence-Pair Classification: In contrast to single-sequence classification, there has been a dearth of sequence-pair classification work in Bangla. We found work on semantic textual similarity (Shajalal and Aono, 2018), but the dataset is not publicly available. As such, we curated a new Bangla Natural Language Inference (BNLI) dataset for sequence-pair classification. We chose NLI as the representative task due to its fundamental importance in NLU. Given two sentences, a premise and a hypothesis, as input, a model is tasked to predict whether the hypothesis is an entailment of, a contradiction of, or neutral to the premise. We used the same curation procedure as the XNLI (Conneau et al., 2018) dataset: we translated the MultiNLI (Williams et al., 2018) training data using the English-to-Bangla translation model by Hasan et al. (2020) and had the evaluation sets translated by expert human translators.⁶ Since automatic translation can introduce errors, we computed the similarity between the Language-Agnostic BERT Sentence Embeddings (LaBSE) (Feng et al., 2020) of each translation and its original sentence and discarded all sentences below a similarity threshold of 0.70. Moreover, to ensure good-quality translation, we used quality assurance strategies similar to those of Guzmán et al. (2019).

⁶ We present more details in the Appendix.
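As an illustration of this filtering step, a sketch using the `sentence-transformers` LaBSE wrapper might look as follows (the wrapper and model name are our choices; only the 0.70 threshold comes from the text):

```python
# Sketch of filtering machine-translated MultiNLI sentences by LaBSE
# cosine similarity; library and model identifiers are illustrative.
from sentence_transformers import SentenceTransformer, util

labse = SentenceTransformer("sentence-transformers/LaBSE")

def keep_pair(english: str, bangla: str, threshold: float = 0.70) -> bool:
    """Keep a translation only if it is semantically close to the source."""
    emb = labse.encode([english, bangla], convert_to_tensor=True,
                       normalize_embeddings=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold

keep_pair("The cat sat on the mat.", "বিড়ালটি মাদুরের উপর বসে ছিল।")
```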

3. Sequence Labeling: In this task, all words of a text sequence have to be classified. Named Entity Recognition (NER) and Parts-of-Speech (PoS) tagging are two of the most prominent sequence labeling tasks. We chose the Bangla portion of the SemEval 2022 MultiCoNER⁷ dataset for BLUE.

⁷ The test set of MultiCoNER will be released in December. We used the dev set for testing and a portion of the training set for dev. We will redo all evaluations with the released test set.

4. Span Prediction: Extractive question answering is a standard choice for text span prediction. We used the Bangla portion of the TyDiQA⁸ (Clark et al., 2020a) dataset for this task. We posed the task analogously to SQuAD 2.0 (Rajpurkar et al., 2018): presented with a text passage and a question, a model has to predict whether or not the question is answerable; if answerable, the model has to find the minimal text span that answers it.

⁸ The test set of TyDiQA is not publicly available. Like MultiCoNER, we used the validation set for test purposes and a portion of the training set for validation. We removed the Yes/No questions and subsampled the unanswerable questions to have the same frequency as the answerable ones.

We present detailed statistics of the BLUE benchmark in Table 1.
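For reference, the EM/F1 metric for this task is the usual SQuAD-style pair of exact match and token-overlap F1; the minimal sketch below is our illustration and omits the punctuation/casing normalization of the official evaluation scripts:

```python
# Minimal SQuAD-style EM and token-overlap F1; official scripts additionally
# normalize punctuation, articles, whitespace, and casing.
from collections import Counter

def exact_match(pred: str, gold: str) -> float:
    return float(pred.strip() == gold.strip())

def token_f1(pred: str, gold: str) -> float:
    p, g = pred.split(), gold.split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```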

4 Experiments & Results

We fine-tuned BanglaBERT on the downstream tasks and compared it with two widely used multilingual models, mBERT (Devlin et al., 2019) and XLM-R base and large (Conneau et al., 2020), and with sahajBERT (Diskin et al., 2021), an ALBERT-based (Lan et al., 2020) monolingual pretrained Bangla model. We also report zero-shot cross-lingual transfer results, obtained by fine-tuning on the English counterpart of each dataset (except for SentNoB, which has no English equivalent). All pretrained models were fine-tuned for 3-20 epochs with a batch size of 32, and the learning rate was tuned from {2e-5, 3e-5, 4e-5, 5e-5}. We performed fine-tuning with three random seeds and report their averages in Table 2. On all tasks, BanglaBERT outperformed the multilingual models and the monolingual sahajBERT, achieving a BLUE score (the average score over all tasks) of 80.79, even coming head-to-head with XLM-R (large).

Models | |Params.| | SC | NLI | NER | QA | BLUE Score

Zero-shot cross-lingual transfer learning
mBERT | 180M | – | 62.22 | 39.27 | 59.44/65.00 | –
XLM-R (base) | 270M | – | 72.18 | 45.37 | 57.66/63.47 | –
XLM-R (large) | 550M | – | 78.16 | 57.74 | 70.35/77.22 | –

Supervised fine-tuning
mBERT | 180M | 67.59 | 75.13 | 68.97 | 75.51/80.12 | 73.46
XLM-R (base) | 270M | 69.54 | 78.46 | 73.32 | 75.81/79.98 | 75.42
XLM-R (large) | 550M | 70.97 | 82.40 | 78.39 | 83.92/87.87 | 80.71
sahajBERT | 18M | 71.12 | 76.92 | 70.94 | 76.40/81.44 | 75.36
BanglaBERT | 110M | 72.89 | 82.80 | 77.78 | 82.89/87.60 | 80.79

Table 2: Performance comparison of baselines and pretrained models on different downstream tasks. Scores in bold have a statistically significant (p < 0.05) difference from the others under bootstrap sampling (Koehn, 2004).
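One concrete reading of the BLUE score: it is the plain average over all reported scores, with the QA EM and F1 numbers each counted once. This reading reproduces the published BanglaBERT and XLM-R (large) rows of Table 2:

```python
# Reproducing the BLUE score column of Table 2 as a plain average over
# SC, NLI, NER, QA-EM, and QA-F1 (one reasonable reading of the table).
banglabert = [72.89, 82.80, 77.78, 82.89, 87.60]
xlmr_large = [70.97, 82.40, 78.39, 83.92, 87.87]

for name, scores in [("BanglaBERT", banglabert), ("XLM-R (large)", xlmr_large)]:
    print(name, round(sum(scores) / len(scores), 2))  # 80.79 and 80.71
```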

BanglaBERT is not only superior in performance, but it is also substantially more compute- and memory-efficient. For instance, it may seem that sahajBERT is more efficient than BanglaBERT due to its smaller size, but it takes 2.40-3.33x the time and 2.07-3.54x the memory of BanglaBERT to fine-tune (see Table 3).⁹

⁹ A detailed comparison can be found in the Appendix.

Sample efficiency. It is often challenging to annotate training samples in real-world scenarios, especially for low-resource languages like Bangla. So, in addition to compute- and memory-efficiency, sample-efficiency (Howard and Ruder, 2018) is another necessity for PLMs. To assess the sample efficiency of BanglaBERT, we limit the number of training samples and see how it fares against other models. We compare it with XLM-R (large) and plot their performance on the SC and NLI tasks¹⁰ for different sample sizes in Figure 1.

Results show that when we have fewer samples (≤ 1k), BanglaBERT performs substantially better (by 2-9% on SC and 6-10% on NLI) than XLM-R (large), making it more practically applicable for resource-scarce downstream tasks.

¹⁰ Results for the other tasks can be found in the Appendix.

[Figure 1: Sample-efficiency tests with SC and NLI. Two panels: Sentiment Classification (Macro-F1 % vs. number of training samples) and Natural Language Inference (Accuracy % vs. number of training samples), each comparing BanglaBERT with XLM-R (large).]

5 Conclusion & Future Work

Creating language-specific models is often infeasible for low-resource languages lacking ample data. Hence, researchers are compelled to use multilingual models for languages that do not have strong pretrained models. To this end, we introduced BanglaBERT, an NLU model for Bangla, a widely spoken yet low-resource language. We presented a new downstream dataset on NLI and established the BLUE benchmark, setting new state-of-the-art results with BanglaBERT. In the future, we will include other Bangla NLU tasks (e.g., dependency parsing (de Marneffe et al., 2021)) in BLUE and investigate the benefits of initializing Bangla NLG models from BanglaBERT.

References

Firoj Alam, Shammur Absar Chowdhury, and Sheak Rashed Haider Noori. 2016. Bidirectional LSTMs-CRFs networks for Bangla PoS tagging. In 2016 19th International Conference on Computer and Information Technology (ICCIT), pages 377-382. IEEE.

Wissam Antoun, Fady Baly, and Hazem Hajj. 2020. AraBERT: Transformer-based model for Arabic language understanding. In Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, pages 9-15, Marseille, France. European Language Resource Association.

I. Ashrafi, M. Mohammad, A. S. Mauree, G. M. A. Nijhum, R. Karim, N. Mohammed, and S. Momen. 2020. Banner: A cost-sensitive contextualized model for Bangla named entity recognition. IEEE Access, 8:58206-58226.

José Canete, Gabriel Chaperon, Rodrigo Fuentes, and Jorge Pérez. 2020. Spanish pre-trained BERT model and evaluation data. In Proceedings of the Practical ML for Developing Countries Workshop at ICLR 2020, PML4DC.

Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020a. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 8:454-470.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020b. ELECTRA: Pre-training text encoders as discriminators rather than generators. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440-8451, Online. Association for Computational Linguistics.

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems, volume 32, pages 7059-7069. Curran Associates, Inc.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475-2485, Brussels, Belgium. Association for Computational Linguistics.

Amitava Das and Sivaji Bandyopadhyay. 2010. Phrase-level polarity identification for Bangla. Int. J. Comput. Linguist. Appl. (IJCLA), 1(1-2):169-182.

Avishek Das, Omar Sharif, Mohammed Moshiul Hoque, and Iqbal H. Sarker. 2021. Emotion classification in a resource constrained language using transformer-based approach. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 150-158, Online. Association for Computational Linguistics.

Marie-Catherine de Marneffe, Christopher D. Manning, Joakim Nivre, and Daniel Zeman. 2021. Universal Dependencies. Computational Linguistics, 47(2):255-308.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Michael Diskin, Alexey Bukhtiyarov, Max Ryabinin, Lucile Saulnier, Quentin Lhoest, Anton Sinitsin, Dmitry Popov, Dmitry Pyrkin, Maxim Kashirin, Alexander Borzunov, Albert Villanova del Moral, Denis Mazur, Ilia Kobelev, Yacine Jernite, Thomas Wolf, and Gennady Pekhimenko. 2021. Distributed deep learning in open collaborations.

Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2020. Language-agnostic BERT sentence embedding. arXiv preprint arXiv:2007.01852.

Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc'Aurelio Ranzato. 2019. The FLORES evaluation datasets for low-resource machine translation: Nepali-English and Sinhala-English. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6098-6111, Hong Kong, China. Association for Computational Linguistics.

Tahmid Hasan, Abhik Bhattacharjee, Kazi Samin, Masum Hasan, Madhusudan Basak, M. Sohel Rahman, and Rifat Shahriyar. 2020. Not low-resource anymore: Aligner ensembling, batch filtering, and new datasets for Bengali-English machine translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2612-2623, Online. Association for Computational Linguistics.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328-339, Melbourne, Australia. Association for Computational Linguistics.

Khondoker Ittehadul Islam, Sudipta Kar, Md Saiful Islam, and Mohammad Ruhul Amin. 2021. SentNoB: A dataset for analysing sentiment on noisy Bangla texts. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3265-3271, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020a. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64-77.

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020b. The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282-6293, Online. Association for Computational Linguistics.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427-431, Valencia, Spain. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 388-395, Barcelona, Spain. Association for Computational Linguistics.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations.

Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, and Didier Schwab. 2020. FlauBERT: Unsupervised language model pre-training for French. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2479-2490, Marseille, France. European Language Resources Association.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Alexandra Luccioni and Joseph Viviano. 2021. What's in the box? An analysis of undesirable content in the Common Crawl corpus. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 182-189, Online. Association for Computational Linguistics.

Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric de la Clergerie, Djamé Seddah, and Benoît Sagot. 2020. CamemBERT: a tasty French language model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7203-7219, Online. Association for Computational Linguistics.

Sungjoon Park, Jihyung Moon, Sungdong Kim, Won Ik Cho, Ji Yoon Han, Jangwon Park, Chisung Song, Junseong Kim, Youngsook Song, Taehwan Oh, Joohong Lee, Juhyun Oh, Sungwon Lyu, Younghoon Jeong, Inkwon Lee, Sangwoo Seo, Dongjun Lee, Hyunwoo Kim, Myeonghwa Lee, Seongbo Jang, Seungwon Do, Sunkyoung Kim, Kyungtae Lim, Jongwon Lee, Kyumin Park, Jamin Shin, Seonghyun Kim, Lucy Park, Alice Oh, Jung-Woo Ha, and Kyunghyun Cho. 2021. KLUE: Korean language understanding evaluation. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).

Marco Polignano, Pierpaolo Basile, Marco de Gemmis, Giovanni Semeraro, and Valerio Basile. 2019. AlBERTo: Italian BERT language understanding model for NLP challenging tasks based on tweets. In Proceedings of the Sixth Italian Conference on Computational Linguistics, Bari, Italy, November 13-15, 2019, CEUR Workshop Proceedings.

Shana Poplack. 1980. Sometimes I'll start a sentence in Spanish Y TERMINO EN ESPAÑOL: toward a typology of code-switching. Linguistics, 18(7-8):581-618.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8).

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784-789, Melbourne, Australia. Association for Computational Linguistics.

Md Shajalal and Masaki Aono. 2018. Semantic textual similarity in Bengali text. In 2018 International Conference on Bangla Speech and Language Processing (ICBSLP), pages 1-5. IEEE.

Abdullah Aziz Sharfuddin, Md Nafis Tihami, and Md Saiful Islam. 2018. A deep recurrent neural network with BiLSTM model for sentiment classification. In 2018 International Conference on Bangla Speech and Language Processing (ICBSLP), pages 1-4. IEEE.

Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. 2019. Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures. In 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7), Cardiff, United Kingdom. Leibniz-Institut für Deutsche Sprache.

Nafis Irtiza Tripto and Mohammed Eunus Ali. 2018. Detecting multilabel sentiment and emotions from Bangla YouTube comments. In 2018 International Conference on Bangla Speech and Language Processing (ICBSLP), pages 1-6.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353-355, Brussels, Belgium. Association for Computational Linguistics.

Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. 2020. CCNet: Extracting high quality monolingual datasets from web crawl data. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4003-4012, Marseille, France. European Language Resources Association.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112-1122, New Orleans, Louisiana. Association for Computational Linguistics.

Shijie Wu and Mark Dredze. 2020. Are all languages created equal in multilingual BERT? In Proceedings of the 5th Workshop on Representation Learning for NLP, pages 120-130, Online. Association for Computational Linguistics.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, volume 32, pages 5753-5763. Curran Associates, Inc.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV '15, pages 19-27, USA. IEEE Computer Society.

Appendix

Pretraining Data Sources

We used the following sites for data collection. We categorize the sites into six types:

Encyclopedia:

• bn.banglapedia.org
• bn.wikipedia.org
• songramernotebook.com

News:

• anandabazar.com
• arthoniteerkagoj.com
• bangla.24livenewspaper.com
• bangla.bdnews24.com
• bangla.dhakatribune.com
• bangla.hindustantimes.com
• bangladesherkhela.com
• banglanews24.com
• banglatribune.com
• bbc.com
• bd-journal.com
• bd-pratidin.com
• bd24live.com
• bengali.indianexpress.com
• bigganprojukti.com
• bonikbarta.net
• chakarianews.com
• channelionline.com
• ctgtimes.com
• ctn24.com
• daily-bangladesh.com
• dailyagnishikha.com
• dainikazadi.net
• dainikdinkal.net
• dailyfulki.com
• dailyinqilab.com
• dailynayadiganta.com
• dailysangram.com
• dailysylhet.com
• dainikamadershomoy.com
• dainikshiksha.com
• dhakardak-bd.com
• dmpnews.org
• dw.com
• eisamay.indiatimes.com
• ittefaq.com.bd
• jagonews24.com
• jugantor.com
• kalerkantho.com
• manobkantha.com.bd
• mzamin.com
• ntvbd.com
• onnodristy.com
• pavilion.com.bd
• prothomalo.com
• protidinersangbad.com
• risingbd.com
• rtvonline.com
• samakal.com
• sangbadpratidin.in
• somoyerkonthosor.com
• somoynews.tv
• tbsnews.net
• teknafnews.com
• thedailystar.net
• voabangla.com
• zeenews.india.com
• zoombangla.com

Blogs:

• amrabondhu.com
• banglablog.in
• bigganblog.org
• biggani.org
• bigyan.org.in
• bishorgo.com
• cadetcollegeblog.com
• choturmatrik.com
• horoppa.wordpress.com
• muktangon.blog
• roar.media/bangla
• sachalayatan.com
• shodalap.org
• shopnobaz.net
• somewhereinblog.net
• subeen.com
• tunerpage.com
• tutobd.com

E-books/Stories:

• banglaepub.github.io
• bengali.pratilipi.com
• bn.wikisource.org
• ebanglalibrary.com
• eboipotro.github.io
• golpokobita.com
• kaliokalam.com
• shirisherdalpala.net
• tagoreweb.in

Social Media/Forums:

• banglacricket.com
• bn.globalvoices.org
• helpfulhub.com
• nirbik.com
• pchelplinebd.com
• techtunes.io


Miscellaneous:

• banglasonglyric.com
• bdlaws.minlaw.gov.bd
• bdup24.com
• bengalisongslyrics.com
• dakghar.org
• gdn8.com
• gunijan.org.bd
• hrw.org
• jakir.me
• jhankarmahbub.com
• jw.org
• lyricsbangla.com
• neonaloy.com
• porjotonlipi.com
• sasthabangla.com
• tanzil.net

We wrote custom crawlers for each site above (except the Wikipedia dumps).

Quality Control in Human Translation

Translations were done by expert translators who provide translation services for renowned Bangla newspapers. Each translated sentence was further assessed for quality by another expert. If found to be of low quality, it was translated again by the original translator. The sample was then discarded altogether if found to be of low quality again. Fewer than 100 samples were discarded in this process.

Compute and Memory Efficiency Tests

To validate that BanglaBERT is more efficient in terms of memory and compute, we measured each model's training time and memory usage during the fine-tuning of each task. All tests were done on a desktop machine with an 8-core Intel Core i7-11700K CPU and an NVIDIA RTX 3090 GPU. We used the same batch size, gradient accumulation steps, and sequence length for all models and tasks for a fair comparison. We report relative time and memory (GPU VRAM) usage, taking those of BanglaBERT as the units. The results are shown in Table 3, where we report the lower and upper values across the different tasks for each model.

Additional Sample Efficiency Tests

Due to space restrictions, we moved the sample-efficiency results for the NER and QA tasks to the appendix. We plot the results in Figure 2. Similar trends are observed for the NER task, where BanglaBERT is more sample-efficient when we have ≤ 1k training samples. In the QA task, however, both models have identical performance for all sample counts.

Model | Time | Memory Usage
mBERT | 1.14x-1.92x | 1.12x-2.04x
XLM-R (base) | 1.29x-1.81x | 1.04x-1.63x
XLM-R (large) | 3.81x-4.49x | 4.44x-5.55x
sahajBERT | 2.40x-3.33x | 2.07x-3.54x
BanglaBERT | 1.00x | 1.00x

Table 3: Compute and memory efficiency tests.

[Figure 2: Sample-efficiency tests with NER and QA. Two panels: Named Entity Recognition (Micro-F1 % vs. number of training samples) and Question Answering (F1 % vs. number of training samples), each comparing BanglaBERT with XLM-R (large).]

Ethical Considerations

Dataset and Model Release: The Copyright Act, 2000¹¹ of Bangladesh allows the reproduction and public release of copyrighted materials for non-commercial research purposes. As a transformative research work, we will release BanglaBERT under a non-commercial license. Furthermore, we will release only the pretraining data for which we know that distribution will not cause any copyright infringement. The downstream task datasets can all be made publicly available under a similar non-commercial license.

¹¹ http://bdlaws.minlaw.gov.bd/act-details-846.html

Human Translation: Human translators were paid as per standard rates in local currencies.

Text Content: We tried to minimize offensive text by explicitly crawling sites where such content would be nominal. However, we cannot guarantee that there is absolutely no objectionable content, and we recommend using the model carefully, especially for text generation purposes. Furthermore, we removed the personal information of the content writers by not considering the author fields while collecting the data.