BanglaBERT: Language Model Pretraining and Benchmarks ...
-
Upload
khangminh22 -
Category
Documents
-
view
0 -
download
0
Transcript of BanglaBERT: Language Model Pretraining and Benchmarks ...
BanglaBERT: Language Model Pretraining and Benchmarks forLow-Resource Language Understanding Evaluation in Bangla
Anonymous ACL submission
AbstractIn this paper, we introduce ‘BanglaBERT’, a001BERT-based Natural Language Understanding002(NLU) model pretrained in Bangla, a widely003spoken yet low-resource language in the NLP004literature. To pretrain BanglaBERT, we collect00527.5 GB of Bangla pretraining data (dubbed006‘Bangla2B+’) by crawling 110 popular Bangla007sites. We introduce a new downstream task008dataset on Natural Language Inference (NLI)009and benchmark on four diverse NLU tasks010covering text classification, sequence labeling,011and span prediction. In the process, we bring012them under the first-ever Bangla Language Un-013derstanding Evaluation (BLUE) benchmark.014BanglaBERT achieves state-of-the-art results015outperforming multilingual and monolingual016models. We will make the BanglaBERT017model, the new datasets, and a leaderboard018publicly available to advance Bangla NLP.019
1 Introduction020
Despite being the sixth most spoken language in021
the world with over 300 million native speakers022
constituting 4% of the world’s total population,1023
Bangla is considered a resource-scarce language.024
Joshi et al. (2020b) categorized Bangla in the lan-025
guage group that lacks efforts in labeled data col-026
lection and relies on self-supervised pretraining027
(Devlin et al., 2019; Radford et al., 2019; Liu et al.,028
2019) to boost the natural language understanding029
(NLU) task performances. To date, the Bangla lan-030
guage has been continuing to rely on fine-tuning031
multilingual pretrained language models (PLMs)032
(Ashrafi et al., 2020; Das et al., 2021; Islam et al.,033
2021). However, since multilingual PLMs cover034
a wide range of languages (Conneau and Lam-035
ple, 2019; Conneau et al., 2020), they are large036
(have hundreds of millions of parameters) and re-037
quire substantial computational resources for fine-038
tuning. They also tend to show degraded perfor-039
mance for low-resource languages (Wu and Dredze,040
1https://w.wiki/Psq
2020) on downstream NLU tasks. Motivated by 041
the triumph of language-specific models (Martin 042
et al. (2020); Polignano et al. (2019); Canete et al. 043
(2020); Antoun et al. (2020), inter alia) over mul- 044
tilingual models in many other languages, in this 045
work, we present BanglaBERT – a BERT-based 046
(Devlin et al., 2019) Bangla NLU model pretrained 047
on 27.5 GB data (which we name ‘Bangla2B+’) 048
we meticulously crawled 110 popular Bangla web- 049
sites to facilitate NLU applications in Bangla. 050
We also introduce a Bangla Natural Language 051
Inference (NLI) dataset, a task previously unex- 052
plored in Bangla, and evaluate BanglaBERT on 053
four diverse downstream tasks on sentiment clas- 054
sification, NLI, named entity recognition, and 055
question answering. We bring these tasks to- 056
gether to establish the first-ever Bangla Language 057
Understanding Evaluation (BLUE) benchmark. 058
We compare two widely used multilingual mod- 059
els to BanglaBERT using the BLUE benchmark 060
and find that BanglaBERT excels on all the tasks. 061
We summarize our contributions as follows: 062
1. We present BanglaBERT, a pretrained BERT 063
model for Bangla, and introduce a new Bangla 064
natural language inference (NLI) dataset. 065
2. We introduce Bangla Language Understand- 066
ing Evaluation (BLUE) benchmark and pro- 067
vide a set of strong baselines. 068
3. We release code and provide a leaderboard to 069
spur future research on Bangla NLU.2 070
2 BanglaBERT 071
2.1 Pretraining Data 072
A high volume of good quality text data is a prereq- 073
uisite for pretraining large language models. For 074
instance, BERT (Devlin et al., 2019) is pretrained 075
on the English Wikipedia and the Books corpus 076
(Zhu et al., 2015) containing 3.3 billion tokens. 077
Subsequent works like RoBERTa (Liu et al., 2019) 078
2https://github.com/hidden/hidden
1
and XLNet (Yang et al., 2019) used more extensive079
web-crawled data with heavy filtering and cleaning.080
Bangla is a rather resource-constrained language081
in the web domain; for example, the Bangla082
Wikipedia dump from July 2021 is only 650 MB,083
two orders of magnitudes smaller than the English084
Wikipedia. As a result, we had to crawl the web085
extensively to collect our pretraining data. We se-086
lected 110 Bangla websites by their Amazon Alexa087
rankings3 and the volume and quality of extractable088
texts by inspecting each website. The contents in-089
cluded encyclopedias, news, blogs, e-books, sto-090
ries, social media/forums, etc.4 The amount of data091
totaled around 35 GB.092
There are also noisy sources of Bangla data093
dumps, a couple of prominent ones being OSCAR094
(Suárez et al., 2019) and CCNet (Wenzek et al.,095
2020). However, they contained lots of offensive096
texts; we found them infeasible to clean thoroughly.097
Fearing potential harmful impacts (Luccioni and098
Viviano, 2021), we opted not to use them.5099
2.2 Pre-processing100
We performed thorough deduplication on the pre-101
training data, removed non-textual contents (e.g.,102
HTML/JavaScript tags), and filtered out non-103
Bangla pages using a language classifier (Joulin104
et al., 2017). After the processing, the dataset was105
reduced to 27.5 GB in size containing 5.25M docu-106
ments having 306.66 words on average.107
We trained a Wordpiece (Wu et al., 2016) vo-108
cabulary of 32k subword tokens on the resulting109
corpus with a 400 character alphabet, kept larger110
than the native Bangla alphabet to capture code-111
switching (Poplack, 1980) and allow romanized112
Bangla contents for better generalization. We lim-113
ited the length of a training sample to 512 tokens114
and did not cross document boundaries (Liu et al.,115
2019) while creating a data point. After tokeniza-116
tion, we had 7.18M samples with an average length117
of 304.14 tokens and containing 2.18B tokens in118
total; hence we named the dataset ‘Bangla2B+’.119
2.3 Pretraining Objective120
Self-supervised pretraining objectives leverage un-121
labeled data. For example, BERT (Devlin et al.,122
2019) was pretrained with masked language mod-123
eling (MLM) and next sentence prediction (NSP).124
3www.alexa.com/topsites/countries/BD4The completely list of the sources can be found in the
Appendix.5We cover other ethical considerations in the Appendix.
Several works built on top of this, e.g., RoBERTa 125
(Liu et al., 2019) removed NSP and pretrained with 126
longer sequences, SpanBERT (Joshi et al., 2020a) 127
masked contiguous spans of tokens, while works 128
like XLNet (Yang et al., 2019) introduced objec- 129
tives like factorized language modeling. 130
We pretrained BanglaBERT using ELECTRA 131
(Clark et al., 2020b), pretrained with the Replaced 132
Token Detection (RTD) objective, where a gener- 133
ator and a discriminator model are trained jointly. 134
The generator is fed as input a sequence with a 135
portion of the tokens masked (15% in our case) 136
and is asked to predict them using the rest of the 137
input (i.e., standard MLM). The masked tokens 138
are then replaced by tokens sampled from the gen- 139
erator’s output distribution for the corresponding 140
masks, and the discriminator then has to predict 141
whether each token is from the original sequence 142
or not. After pretraining, the discriminator is used 143
for fine-tuning. Clark et al. (2020b) argued that 144
RTD back-propagates loss from all tokens of a se- 145
quence, in contrast to 15% tokens of the MLM ob- 146
jective, giving the model more signals to learn from. 147
Evidently, ELECTRA achieves comparable down- 148
stream performance to RoBERTa or XLNet with 149
only a quarter of their training time. This compu- 150
tational efficiency motivated us to use ELECTRA 151
for our implementation of BanglaBERT. 152
2.4 Model Architecture & Hyperparameters 153
We pretrained the base ELECTRA model (a 12- 154
layer Transformer encoder with 768 embedding 155
size, 768 hidden size, 12 attention heads, 3072 156
feed-forward size, generator-to-discriminator ratio 15713 , 110M parameters) with 256 batch size for 2.5M 158
steps on a v3-8 TPU instance on GCP. We used 159
the Adam (Kingma and Ba, 2015) optimizer with a 160
2e-4 learning rate and linear warmup of 10k steps. 161
162
3 The Bangla Language Understanding 163
Evaluation (BLUE) Benchmark 164
Many works have studied different Bangla NLU 165
tasks in isolation, e.g., sentiment classification (Das 166
and Bandyopadhyay, 2010; Sharfuddin et al., 2018; 167
Tripto and Ali, 2018), semantic textual similarity 168
(Shajalal and Aono, 2018), parts-of-speech (PoS) 169
tagging (Alam et al., 2016), named entity recogni- 170
tion (NER) (Ashrafi et al., 2020). However, Bangla 171
NLU has not yet had a comprehensive unified study. 172
Motivated by the surge of NLU research brought 173
2
Task Corpus |Train| |Dev| |Test| Metric DomainSentiment Classification SentNoB 12,575 1,567 1,567 Macro-F1 Social MediaNatural Language Inference BNLI 381,449 2,419 4,895 Acc. Misc.Named Entity Recognition MultiCoNER 14,500 800 800 Micro-F1 Misc.Question Answering TyDiQA 4,542 238 226 EM/F1 Wikipedia
Table 1: Statistics of the Bangla Language Understanding Evaluation (BLUE) benchmark.
about by benchmarks in other languages, e.g., En-174
glish (Wang et al., 2018), French (Le et al., 2020),175
Korean (Park et al., 2021), we establish the first-176
ever Bangla Language Understanding Evaluation177
(BLUE) benchmark.178
NLU generally comprises three types of tasks:179
text classification, sequence labeling, and text span180
prediction. Text classification tasks can further181
be sub-divided into single-sequence and sequence-182
pair classification. Therefore, we consider a total183
of four tasks for BLUE. For each task type, we184
carefully select one downstream task dataset. We185
emphasize the quality and open availability of the186
datasets while making the selection. We briefly187
mention them below:188
1. Single-Sequence Classification: Sentiment189
classification is perhaps the most-studied Bangla190
NLU task, with some of the earlier works dat-191
ing back over a decade (Das and Bandyopadhyay,192
2010). Hence, we chose this as the single-sequence193
classification task. However, most Bangla senti-194
ment classification datasets are not publicly avail-195
able. We could only find two public datasets: BYSA196
(Tripto and Ali, 2018) and SentNoB (Islam et al.,197
2021). We found BYSA to have many duplications.198
Even worse, many duplicates had different labels.199
SentNoB had better quality and covered a broader200
set of domains, making the classification task more201
challenging. Hence, we opted to use the latter.202
2. Sequence-pair Classification: In contrast203
to single-sequence classification, there has been204
a dearth of sequence-pair classification works in205
Bangla. We found work on semantic textual simi-206
larity (Shajalal and Aono, 2018), but the dataset is207
not publicly available. As such, we curated a new208
Bangla Natural Language Inference (BNLI) dataset209
for sequence-pair classification. We chose NLI as210
the representative task due to its fundamental im-211
portance in NLU. Given two sentences, a premise212
and a hypothesis as input, a model is tasked to pre-213
dict whether or not the hypothesis is entailment,214
contradiction, or neutral to the premise. We used215
the same curation procedure as the XNLI (Conneau216
et al., 2018) dataset: we translated the MultiNLI217
(Williams et al., 2018) training data using the En- 218
glish to Bangla translation model by Hasan et al. 219
(2020) and had the evaluation sets translated by 220
expert human translators.6 Due to the possibility 221
of the incursion of errors during automatic transla- 222
tion, we used the Language-Agnostic BERT Sen- 223
tence Embeddings (LaBSE) (Feng et al., 2020) of 224
the translations and original sentences to compute 225
their similarity and discarded all sentences below a 226
similarity threshold of 0.70. Moreover, to ensure 227
good-quality translation, we used similar quality 228
assurance strategies as Guzmán et al. (2019). 229
3. Sequence Labeling: In this task, all words of 230
a text sequence have to be classified. Named En- 231
tity Recognition (NER) and Parts-of-Speech (PoS) 232
tagging are two of the most prominent sequence 233
labeling tasks. We chose the Bangla portion of 234
SemEval 2022 MultiCoNER7 dataset for BLUB. 235
4. Span Prediction: Extractive question answer- 236
ing is a standard choice for text span prediction. 237
We used the Bangla portion of the TyDiQA8 (Clark 238
et al., 2020a) dataset for this task. We posed the 239
task analogous to SQuAD 2.0 (Rajpurkar et al., 240
2018): presented with a text passage and a ques- 241
tion, a model has to predict whether or not it is 242
answerable. If answerable, the model has to find 243
the minimal text span that answers the question. 244
We present detailed statistics of the BLUE bench- 245
mark in Table 1. 246
4 Experiments & Results 247
We fine-tuned BanglaBERT on the downstream 248
tasks and compared with three multilingual mod- 249
els: mBERT (Devlin et al., 2019), XLM-R base 250
and large (Conneau et al., 2020), and sahajBERT 251
(Diskin et al., 2021) (an ALBERT-based (Lan et al., 252
6We present more details in the Appendix.7The test set of MultiCoNER will be released in December.
We used the dev set for test and a portion of the training setfor dev. We will redo all evaluations with the released test set.
8The test set of TyDiQA is not publicly available. LikeMultiCoNER, we used the validation set for test purposes anda portion of the training set for validation. We removed theYes/No questions and subsampled the unanswerable questionsto have the same frequency as the answerable ones.
3
Models |Params.| SC NLI NER QA BLUE ScoreZero-shot cross-lingual transfer learningmBERT 180M – 62.22 39.27 59.44/65.00 –XLM-R (base) 270M – 72.18 45.37 57.66/63.47 –XLM-R (large) 550M – 78.16 57.74 70.35/77.22 –Supervised fine-tuningmBERT 180M 67.59 75.13 68.97 75.51/80.12 73.46XLM-R (base) 270M 69.54 78.46 73.32 75.81/79.98 75.42XLM-R (large) 550M 70.97 82.40 78.39 83.92/87.87 80.71sahajBERT 18M 71.12 76.92 70.94 76.40/81.44 75.36BanglaBERT 110M 72.89 82.80 77.78 82.89/87.60 80.79
Table 2: Performance comparison of baselines and pretrained models on different downstream tasks. Scores inbold texts have statistically significant (p < 0.05) difference from others with bootstrap sampling (Koehn, 2004).
2020) pretrained Bangla model). We also show the253
zero-shot cross-lingual transfer results fine-tuned254
on the English counterpart of each dataset (except255
for SentNoB, which had no English equivalent).256
All pretrained models were fine-tuned for 3-20257
epochs with batch size 32, and the learning rate258
was tuned from {2e-5, 3e-5, 4e-5, 5e-5}. We per-259
formed fine-tuning with three random seeds and260
reported their average in Table 2. In all the tasks,261
BanglaBERT outperformed multilingual models262
and monolingual sahajBERT, achieving a BLUE263
score (the average score of all tasks) of 80.79, even264
coming head-to-head with XLM-R (large).265
BanglaBERT is not only superior in perfor-266
mance, but it is also substantially compute- and267
memory-efficient. For instance, it may seem that268
sahajBERT is more efficient than BanglaBERT due269
to its smaller size, but it takes 2-3.5x time and 2.4-270
3.33x memory as BanglaBERT to fine-tune.9271
Sample efficiency It is often challenging to an-272
notate training samples in real-world scenarios, es-273
pecially for low-resource languages like Bangla.274
So, in addition to compute- and memory-efficiency,275
sample-efficiency (Howard and Ruder, 2018) is an-276
other necessity of PLMs. To assess the sample277
efficiency of BanglaBERT, we limit the number of278
training samples and see how it fares against other279
models. We compare it with XLM-R (large) and280
plot their performances on the SC and NLI tasks10281
for different sample size in Figure 1.282
Results show that when we have fewer number283
of samples (≤ 1k), BanglaBERT has substantially284
better performance (2-9% on SC, 6-10% on NLI)285
over XLM-R (large), making it more practically286
applicable for resource-scarce downstream tasks.287
9Detailed comparison can be found in the Appendix.10Results for the other tasks can be found in the Appendix.
100 500 1000 5000 Full Data
35
45
55
65
75
BanglaBERT
XLM-R(Large)
Sentiment Classification
Training Samples
Macr
o-F
1 (
%)
0.1k 1k 10k 100k Full Data30
40
50
60
70
80
BanglaBERT
XLM-R(Large)
Natural Language Inference
Training Samples
Acc
ura
cy (
%)
Figure 1: Sample-efficiency tests with SC and NLI.
5 Conclusion & Future Works 288
Creating language-specific models is often infeasi- 289
ble for low-resource languages lacking ample data. 290
Hence, researchers are compelled to use multilin- 291
gual models for languages that do not have strong 292
pretrained models. To this end, we introduced 293
BanglaBERT, an NLU model in Bangla, a widely 294
spoken yet low-resource language. We presented 295
a new downstream dataset on NLI and established 296
the BLUE Benchmark, setting new state-of-the-art 297
results with BanglaBERT. In future, we will include 298
other Bangla NLU benchmarks (e.g., dependency 299
parsing (de Marneffe et al., 2021)) in BLUE and 300
investigate the benefits of initializing Bangla NLG 301
models from BanglaBERT. 302
4
References303
Firoj Alam, Shammur Absar Chowdhury, and Sheak304Rashed Haider Noori. 2016. Bidirectional305lstms—crfs networks for bangla pos tagging. In3062016 19th International Conference on Computer307and Information Technology (ICCIT), pages 377–308382. IEEE.309
Wissam Antoun, Fady Baly, and Hazem Hajj. 2020.310AraBERT: Transformer-based model for Arabic lan-311guage understanding. In Proceedings of the 4th312Workshop on Open-Source Arabic Corpora and Pro-313cessing Tools, with a Shared Task on Offensive Lan-314guage Detection, pages 9–15, Marseille, France. Eu-315ropean Language Resource Association.316
I. Ashrafi, M. Mohammad, A. S. Mauree, G. M. A.317Nijhum, R. Karim, N. Mohammed, and S. Mo-318men. 2020. Banner: A cost-sensitive contextualized319model for bangla named entity recognition. IEEE320Access, 8:58206–58226.321
José Canete, Gabriel Chaperon, Rodrigo Fuentes, and322Jorge Pérez. 2020. Spanish pre-trained BERT model323and evaluation data. In Proceedings of the Practi-324cal ML for Developing Countries Workshop at ICLR3252020, PML4DC.326
Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan327Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and328Jennimaria Palomaki. 2020a. TyDi QA: A bench-329mark for information-seeking question answering in330typologically diverse languages. Transactions of the331Association for Computational Linguistics, 8:454–332470.333
Kevin Clark, Minh-Thang Luong, Quoc V Le, and334Christopher D Manning. 2020b. ELECTRA: pre-335training text encoders as discriminators rather than336generators. In 8th International Conference on337Learning Representations, ICLR 2020, Addis Ababa,338Ethiopia, April 26-30, 2020. OpenReview.net.339
Alexis Conneau, Kartikay Khandelwal, Naman Goyal,340Vishrav Chaudhary, Guillaume Wenzek, Francisco341Guzmán, Edouard Grave, Myle Ott, Luke Zettle-342moyer, and Veselin Stoyanov. 2020. Unsupervised343cross-lingual representation learning at scale. In344Proceedings of the 58th Annual Meeting of the Asso-345ciation for Computational Linguistics, pages 8440–3468451, Online. Association for Computational Lin-347guistics.348
Alexis Conneau and Guillaume Lample. 2019. Cross-349lingual language model pretraining. In Advances in350Neural Information Processing Systems, volume 32,351pages 7059–7069. Curran Associates, Inc.352
Alexis Conneau, Ruty Rinott, Guillaume Lample, Ad-353ina Williams, Samuel Bowman, Holger Schwenk,354and Veselin Stoyanov. 2018. XNLI: Evaluating355cross-lingual sentence representations. In Proceed-356ings of the 2018 Conference on Empirical Methods357in Natural Language Processing, pages 2475–2485,358
Brussels, Belgium. Association for Computational 359Linguistics. 360
Amitava Das and Sivaji Bandyopadhyay. 2010. Phrase- 361level polarity identification for bangla. Int. J. Com- 362put. Linguist. Appl.(IJCLA), 1(1-2):169–182. 363
Avishek Das, Omar Sharif, Mohammed Moshiul 364Hoque, and Iqbal H. Sarker. 2021. Emotion clas- 365sification in a resource constrained language using 366transformer-based approach. In Proceedings of the 3672021 Conference of the North American Chapter of 368the Association for Computational Linguistics: Stu- 369dent Research Workshop, pages 150–158, Online. 370Association for Computational Linguistics. 371
Marie-Catherine de Marneffe, Christopher D. Man- 372ning, Joakim Nivre, and Daniel Zeman. 2021. Uni- 373versal Dependencies. Computational Linguistics, 37447(2):255–308. 375
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and 376Kristina Toutanova. 2019. BERT: Pre-training of 377deep bidirectional transformers for language under- 378standing. In Proceedings of the 2019 Conference 379of the North American Chapter of the Association 380for Computational Linguistics: Human Language 381Technologies, Volume 1 (Long and Short Papers), 382pages 4171–4186, Minneapolis, Minnesota. Associ- 383ation for Computational Linguistics. 384
Michael Diskin, Alexey Bukhtiyarov, Max Ryabinin, 385Lucile Saulnier, Quentin Lhoest, Anton Sinitsin, 386Dmitry Popov, Dmitry Pyrkin, Maxim Kashirin, 387Alexander Borzunov, Albert Villanova del Moral, 388Denis Mazur, Ilia Kobelev, Yacine Jernite, Thomas 389Wolf, and Gennady Pekhimenko. 2021. Distributed 390deep learning in open collaborations. 391
Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen 392Arivazhagan, and Wei Wang. 2020. Language- 393agnostic bert sentence embedding. arXiv preprint 394arXiv:2007.01852. 395
Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan 396Pino, Guillaume Lample, Philipp Koehn, Vishrav 397Chaudhary, and Marc’Aurelio Ranzato. 2019. The 398FLORES evaluation datasets for low-resource ma- 399chine translation: Nepali–English and Sinhala– 400English. In Proceedings of the 2019 Conference on 401Empirical Methods in Natural Language Processing 402and the 9th International Joint Conference on Natu- 403ral Language Processing (EMNLP-IJCNLP), pages 4046098–6111, Hong Kong, China. Association for 405Computational Linguistics. 406
Tahmid Hasan, Abhik Bhattacharjee, Kazi Samin, Ma- 407sum Hasan, Madhusudan Basak, M. Sohel Rahman, 408and Rifat Shahriyar. 2020. Not low-resource any- 409more: Aligner ensembling, batch filtering, and new 410datasets for Bengali-English machine translation. In 411Proceedings of the 2020 Conference on Empirical 412Methods in Natural Language Processing (EMNLP), 413pages 2612–2623, Online. Association for Computa- 414tional Linguistics. 415
5
Jeremy Howard and Sebastian Ruder. 2018. Universal416language model fine-tuning for text classification. In417Proceedings of the 56th Annual Meeting of the As-418sociation for Computational Linguistics (Volume 1:419Long Papers), pages 328–339, Melbourne, Australia.420Association for Computational Linguistics.421
Khondoker Ittehadul Islam, Sudipta Kar, Md Saiful Is-422lam, and Mohammad Ruhul Amin. 2021. SentNoB:423A dataset for analysing sentiment on noisy Bangla424texts. In Findings of the Association for Computa-425tional Linguistics: EMNLP 2021, pages 3265–3271,426Punta Cana, Dominican Republic. Association for427Computational Linguistics.428
Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S.429Weld, Luke Zettlemoyer, and Omer Levy. 2020a.430SpanBERT: Improving pre-training by representing431and predicting spans. Transactions of the Associa-432tion for Computational Linguistics, 8:64–77.433
Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika434Bali, and Monojit Choudhury. 2020b. The state and435fate of linguistic diversity and inclusion in the NLP436world. In Proceedings of the 58th Annual Meet-437ing of the Association for Computational Linguistics,438pages 6282–6293, Online. Association for Computa-439tional Linguistics.440
Armand Joulin, Edouard Grave, Piotr Bojanowski, and441Tomas Mikolov. 2017. Bag of tricks for efficient442text classification. In Proceedings of the 15th Con-443ference of the European Chapter of the Association444for Computational Linguistics: Volume 2, Short Pa-445pers, pages 427–431, Valencia, Spain. Association446for Computational Linguistics.447
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A448method for stochastic optimization. In 3rd Inter-449national Conference on Learning Representations,450ICLR 2015, San Diego, CA, USA, May 7-9, 2015,451Conference Track Proceedings.452
Philipp Koehn. 2004. Statistical significance tests453for machine translation evaluation. In Proceed-454ings of the 2004 Conference on Empirical Meth-455ods in Natural Language Processing, pages 388–456395, Barcelona, Spain. Association for Computa-457tional Linguistics.458
Zhenzhong Lan, Mingda Chen, Sebastian Goodman,459Kevin Gimpel, Piyush Sharma, and Radu Soricut.4602020. Albert: A lite bert for self-supervised learning461of language representations. In International Con-462ference on Learning Representations.463
Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Max-464imin Coavoux, Benjamin Lecouteux, Alexandre Al-465lauzen, Benoît Crabbé, Laurent Besacier, and Didier466Schwab. 2020. Flaubert: Unsupervised language467model pre-training for french. In Proceedings of468The 12th Language Resources and Evaluation Con-469ference, pages 2479–2490, Marseille, France. Euro-470pean Language Resources Association.471
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- 472dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, 473Luke Zettlemoyer, and Veselin Stoyanov. 2019. 474Roberta: A robustly optimized bert pretraining ap- 475proach. arXiv preprint arXiv:1907.11692. 476
Alexandra Luccioni and Joseph Viviano. 2021. What’s 477in the box? an analysis of undesirable content in 478the Common Crawl corpus. In Proceedings of the 47959th Annual Meeting of the Association for Compu- 480tational Linguistics and the 11th International Joint 481Conference on Natural Language Processing (Vol- 482ume 2: Short Papers), pages 182–189, Online. As- 483sociation for Computational Linguistics. 484
Louis Martin, Benjamin Muller, Pedro Javier Or- 485tiz Suárez, Yoann Dupont, Laurent Romary, Éric 486de la Clergerie, Djamé Seddah, and Benoît Sagot. 4872020. CamemBERT: a tasty French language model. 488In Proceedings of the 58th Annual Meeting of the 489Association for Computational Linguistics, pages 4907203–7219, Online. Association for Computational 491Linguistics. 492
Sungjoon Park, Jihyung Moon, Sungdong Kim, 493Won Ik Cho, Ji Yoon Han, Jangwon Park, Chisung 494Song, Junseong Kim, Youngsook Song, Taehwan 495Oh, Joohong Lee, Juhyun Oh, Sungwon Lyu, 496Younghoon Jeong, Inkwon Lee, Sangwoo Seo, 497Dongjun Lee, Hyunwoo Kim, Myeonghwa Lee, 498Seongbo Jang, Seungwon Do, Sunkyoung Kim, 499Kyungtae Lim, Jongwon Lee, Kyumin Park, Jamin 500Shin, Seonghyun Kim, Lucy Park, Alice Oh, Jung- 501Woo Ha, and Kyunghyun Cho. 2021. KLUE: Ko- 502rean language understanding evaluation. In Thirty- 503fifth Conference on Neural Information Processing 504Systems Datasets and Benchmarks Track (Round 2). 505
Marco Polignano, Pierpaolo Basile, Marco de Gemmis, 506Giovanni Semeraro, and Valerio Basile. 2019. Al- 507berto: Italian BERT language understanding model 508for NLP challenging tasks based on tweets. In Pro- 509ceedings of the Sixth Italian Conference on Com- 510putational Linguistics, Bari, Italy, November 13-15, 5112019, CEUR Workshop Proceedings. 512
Shana Poplack. 1980. Sometimes I’ll start a sentence 513in Spanish Y TERMINO EN ESPAÑOL: toward 514a typology of code-switching. Linguistics, 18(7- 5158):581–618. 516
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, 517Dario Amodei, and Ilya Sutskever. 2019. Language 518models are unsupervised multitask learners. OpenAI 519blog, 1(8). 520
Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. 521Know what you don’t know: Unanswerable ques- 522tions for SQuAD. In Proceedings of the 56th An- 523nual Meeting of the Association for Computational 524Linguistics (Volume 2: Short Papers), pages 784– 525789, Melbourne, Australia. Association for Compu- 526tational Linguistics. 527
6
Md Shajalal and Masaki Aono. 2018. Semantic textual528similarity in bengali text. In 2018 International Con-529ference on Bangla Speech and Language Processing530(ICBSLP), pages 1–5. IEEE.531
Abdullah Aziz Sharfuddin, Md Nafis Tihami, and532Md Saiful Islam. 2018. A deep recurrent neural533network with bilstm model for sentiment classifica-534tion. In 2018 International Conference on Bangla535Speech and Language Processing (ICBSLP), pages5361–4. IEEE.537
Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent538Romary. 2019. Asynchronous Pipeline for Process-539ing Huge Corpora on Medium to Low Resource In-540frastructures. In 7th Workshop on the Challenges541in the Management of Large Corpora (CMLC-5427), Cardiff, United Kingdom. Leibniz-Institut für543Deutsche Sprache.544
Nafis Irtiza Tripto and Mohammed Eunus Ali. 2018.545Detecting multilabel sentiment and emotions from546bangla youtube comments. In 2018 International547Conference on Bangla Speech and Language Pro-548cessing (ICBSLP), pages 1–6.549
Alex Wang, Amanpreet Singh, Julian Michael, Fe-550lix Hill, Omer Levy, and Samuel Bowman. 2018.551GLUE: A multi-task benchmark and analysis plat-552form for natural language understanding. In Pro-553ceedings of the 2018 EMNLP Workshop Black-554boxNLP: Analyzing and Interpreting Neural Net-555works for NLP, pages 353–355, Brussels, Belgium.556Association for Computational Linguistics.557
Guillaume Wenzek, Marie-Anne Lachaux, Alexis Con-558neau, Vishrav Chaudhary, Francisco Guzmán, Ar-559mand Joulin, and Edouard Grave. 2020. CCNet:560Extracting high quality monolingual datasets from561web crawl data. In Proceedings of the 12th Lan-562guage Resources and Evaluation Conference, pages5634003–4012, Marseille, France. European Language564Resources Association.565
Adina Williams, Nikita Nangia, and Samuel Bowman.5662018. A broad-coverage challenge corpus for sen-567tence understanding through inference. In Proceed-568ings of the 2018 Conference of the North American569Chapter of the Association for Computational Lin-570guistics: Human Language Technologies, Volume5711 (Long Papers), pages 1112–1122, New Orleans,572Louisiana. Association for Computational Linguis-573tics.574
Shijie Wu and Mark Dredze. 2020. Are all languages575created equal in multilingual BERT? In Proceedings576of the 5th Workshop on Representation Learning for577NLP, pages 120–130, Online. Association for Com-578putational Linguistics.579
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V580Le, Mohammad Norouzi, Wolfgang Macherey,581Maxim Krikun, Yuan Cao, Qin Gao, Klaus582Macherey, et al. 2016. Google’s neural machine583translation system: Bridging the gap between human584and machine translation. arXiv:1609.08144.585
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Car- 586bonell, Russ R Salakhutdinov, and Quoc V Le. 2019. 587Xlnet: Generalized autoregressive pretraining for 588language understanding. In Advances in Neural 589Information Processing Systems, volume 32, pages 5905753–5763. Curran Associates, Inc. 591
Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhut- 592dinov, Raquel Urtasun, Antonio Torralba, and Sanja 593Fidler. 2015. Aligning books and movies: Towards 594story-like visual explanations by watching movies 595and reading books. In Proceedings of the 2015 596IEEE International Conference on Computer Vision 597(ICCV), ICCV ’15, page 19–27, USA. IEEE Com- 598puter Society. 599
7
Appendix600
Pretraining Data Sources601
We used the following sites for data collection. We602
categorize the sites into six types:603
604
Encyclopedia:605
• bn.banglapedia.org606• bn.wikipedia.org607• songramernotebook.com608
News:609
• anandabazar.com610• arthoniteerkagoj.com611• bangla.24livenewspaper.com612• bangla.bdnews24.com613• bangla.dhakatribune.com614• bangla.hindustantimes.com615• bangladesherkhela.com616• banglanews24.com617• banglatribune.com618• bbc.com619• bd-journal.com620• bd-pratidin.com621• bd24live.com622• bengali.indianexpress.com623• bigganprojukti.com624• bonikbarta.net625• chakarianews.com626• channelionline.com627• ctgtimes.com628• ctn24.com629• daily-bangladesh.com630• dailyagnishikha.com631• dainikazadi.net632• dainikdinkal.net633• dailyfulki.com634• dailyinqilab.com635• dailynayadiganta.com636• dailysangram.com637• dailysylhet.com638• dainikamadershomoy.com639• dainikshiksha.com640• dhakardak-bd.com641• dmpnews.org642• dw.com643• eisamay.indiatimes.com644• ittefaq.com.bd645• jagonews24.com646• jugantor.com647• kalerkantho.com648• manobkantha.com.bd649• mzamin.com650• ntvbd.com651• onnodristy.com652
• pavilion.com.bd 653• prothomalo.com 654• protidinersangbad.com 655• risingbd.com 656• rtvonline.com 657• samakal.com 658• sangbadpratidin.in 659• somoyerkonthosor.com 660• somoynews.tv 661• tbsnews.net 662• teknafnews.com 663• thedailystar.net 664• voabangla.com 665• zeenews.india.com 666• zoombangla.com 667
Blogs: 668
• amrabondhu.com 669• banglablog.in 670• bigganblog.org 671• biggani.org 672• bigyan.org.in 673• bishorgo.com 674• cadetcollegeblog.com 675• choturmatrik.com 676• horoppa.wordpress.com 677• muktangon.blog 678• roar.media/bangla 679• sachalayatan.com 680• shodalap.org 681• shopnobaz.net 682• somewhereinblog.net 683• subeen.com 684• tunerpage.com 685• tutobd.com 686
E-books/Stories: 687
• banglaepub.github.io 688• bengali.pratilipi.com 689• bn.wikisource.org 690• ebanglalibrary.com 691• eboipotro.github.io 692• golpokobita.com 693• kaliokalam.com 694• shirisherdalpala.net 695• tagoreweb.in 696
Social Media/Forums: 697
• banglacricket.com 698• bn.globalvoices.org 699• helpfulhub.com 700• nirbik.com 701• pchelplinebd.com 702• techtunes.io 703
8
Miscellaneous:704
• banglasonglyric.com705• bdlaws.minlaw.gov.bd706• bdup24.com707• bengalisongslyrics.com708• dakghar.org709• gdn8.com710• gunijan.org.bd711• hrw.org712• jakir.me713• jhankarmahbub.com714• jw.org715• lyricsbangla.com716• neonaloy.com717• porjotonlipi.com718• sasthabangla.com719• tanzil.net720
We wrote custom crawlers for each site above721
(except the Wikipedia dumps).722
Quality Control in Human Translation:723
Translations were done by expert translators who724
provide translation services for renowned Bangla725
newspapers. Each translated sentence was further726
assessed for quality by another expert. If found to727
be of low quality, it was again translated by the orig-728
inal translator. The sample was then discarded alto-729
gether if found to be of low quality again. Fewer730
than 100 samples were discarded in this process.731
Compute and Memory Efficiency Tests732
To validate that BanglaBERT is more efficient in733
terms of memory and compute, we measured each734
model’s training time and memory usage during735
the fine-tuning of each task. All tests were done736
on a desktop machine with an 8-core Intel Core-i7737
11700k CPU and NVIDIA RTX 3090 GPU. We738
used the same batch size, gradient accumulation739
steps, and sequence length for all models and tasks740
for a fair comparison. We use relative time and741
memory (GPU VRAM) usage considering those742
of BanglaBERT as units. The results are shown in743
Table 3. (We mention the upper and lower values744
of the different tasks for each model)745
Additional Sample Efficiency Tests746
Due to space restrictions, we move the sample ef-747
ficiency results of the NER and QA tasks to the748
appendix. We plot the results in Figure 2.749
Similar results are also observed here for the750
NER task, where BanglaBERT is more sample-751
efficient when we have ≤ 1k training samples. In752
the QA task however, both models have identical753
performance for all sample counts.754
Model Time Memory UsagemBERT 1.14x-1.92x 1.12x-2.04xXLM-R (base) 1.29-1.81x 1.04-1.63xXLM-R (large) 3.81-4.49x 4.44-5.55xSahajBERT 2.40-3.33x 2.07-3.54xBanglaBERT 1.00x 1.00x
Table 3: Compute and memory efficiency tests
100 500 1000 5000 Full Data
0
20
40
60
80
BanglaBERT
XLM-R(Large)
Named Entity Recognition
Training Samples
Mic
ro-F
1 (
%)
100 500 1000 2000 Full Data
50
60
70
80
90
BanglaBERT
XLM-R(Large)
Question Answering
Training Samples
F1 (
%)
Figure 2: Sample-efficiency tests with NER and QA.
Ethical Considerations 755
Dataset and Model Release: The Copy Right 756
Act, 200011 of Bangladesh allows reproduction 757
and public release of copy-right materials for non- 758
commercial research purposes. As a transformative 759
research work, we will release BanglaBERT un- 760
der a non-commercial license. Furthermore, we 761
will release only the pretraining data for which we 762
know the distribution will not cause any copyright 763
infringement. The downstream task datasets can 764
all be made publicly available under a similar non- 765
commercial license. 766
Human Translation: Human translators were 767
paid as per standard rates in local currencies. 768
Text Content: We tried to minimize offensive texts 769
by explicitly crawling the sites where such contents 770
would be nominal. However, we can guarantee ab- 771
solutely no objectionable content and recommend 772
using the model carefully, especially for text gen- 773
11http://bdlaws.minlaw.gov.bd/act-details-846.html
9