
NAACL-HLT 2021

Teaching NLP

Proceedings of the Fifth Workshop

June 10-11, 2021
Mexico City (virtual)

©2021 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
Fax: [email protected]

ISBN 978-1-954085-36-7


Introduction

Welcome to the Fifth Workshop on Teaching Natural Language Processing (NLP). This online workshop featured an exciting mix of papers, teaching material submissions, panels, talks, and participatory activities.

The field of NLP is growing rapidly, with new state-of-the-art methods emerging every year, if not sooner. As educators in NLP, we struggle to keep up. We need to make decisions about what to teach and how to teach it with every offering of a course, sometimes even as a course is being offered. The fast-paced nature of NLP brings unique challenges for curriculum design, and the immense growth of the field has led not just to core NLP courses, but also to more specialized classes and seminars in subareas such as Natural Language Understanding, Computational Social Science, Machine Translation, and many more. We also have an increasing number of students interested in NLP, bringing with them a wide range of backgrounds and experiences.

We were happy to accept 13 long papers and 13 short papers on teaching materials. The latter were accompanied by exercises and assignments as Jupyter notebooks, software, slides, and teaching guidelines that will be made available via a repository created as a result of the workshop. Both types of papers cover many topics: curriculum selection, teaching strategies, adapting to different student audiences, resources for assignments, and course or program design.

Our workshop also featured two panels, one on "What should we be teaching?" and another on "What does industry need?". The first panel featured Isabelle Augenstein (University of Copenhagen), Emily M. Bender (University of Washington), Yoav Goldberg (Bar-Ilan University), and Dan Jurafsky (Stanford University). The second panel featured Lenny Bronner (The Washington Post), Delip Rao (Allen Institute for AI), Frank Rudzicz (University of Toronto), and Rachael Tatman (Rasa). We also had two amazing invited speakers, Ines Montani (Explosion) and Jason Eisner (Johns Hopkins University).

We thank the Program Committee, who thoughtfully reviewed these papers this year. We also appreciate the sponsorship funding we received from Google and Duolingo. Finally, we thank the workshop participants, whose interest in teaching allows us to establish and grow the next generation of NLP researchers and practitioners.

David Jurgens, Varada Kolhatkar, Lucy Li, Margot Mieskes, and Ted Pedersen (the co-organizers)


Organizing Committee

David Jurgens, University of Michigan

Varada Kolhatkar, University of British Columbia

Lucy Li, University of California, Berkeley

Margot Mieskes, Darmstadt University of Applied Sciences

Ted Pedersen, University of Minnesota, Duluth

Keynote Speakers

Jason Eisner, Johns Hopkins University

Ines Montani, Explosion

Panelists

Isabelle Augenstein, University of Copenhagen

Emily M. Bender, University of Washington

Lenny Bronner, The Washington Post

Yoav Goldberg, Bar-Ilan University

Dan Jurafsky, Stanford University

Delip Rao, Allen Institute for AI

Frank Rudzicz, University of Toronto

Rachael Tatman, Rasa

Program Committee

Benedikt Adelmann, University of Hamburg

Thomas Arnold, Technische Universität Darmstadt

Denilson Barbosa, University of Alberta

Austin Blodgett, Georgetown University

Su Lin Blodgett, Microsoft Research

Brendon Boldt, Carnegie Mellon University

Chris Brew, LivePerson

Julian Brooke, University of British Columbia

Paul Cook, University of New Brunswick

Carrie Demmans Epp, University of Alberta

Ivan Derzhanski, Bulgarian Academy of Sciences

Anna Feldman, Montclair State University

Alvin Grissom II, Haverford College

Bradley Hauer, University of Alberta


Marti A. Hearst, University of California, Berkeley

Katherine Keith, University of Massachusetts, Amherst

Grzegorz Kondrak, University of Alberta

Valia Kordoni, Humboldt University Berlin

Sandra Kübler, Indiana University

Vladislav Kubon, Charles University in Prague

James Kunz, Google / University of California, Berkeley

Lori Levin, Carnegie Mellon University

Deryle Lonsdale, Brigham Young University

Olga Lyashevskaya, HSE University

Christian M. Meyer, Technische Universität Darmstadt

Bridget McInnes, Virginia Commonwealth University

Julie Medero, Harvey Mudd College

Gerald Penn, University of Toronto

Alexander Piperski, HSE University

Christopher Potts, Stanford University

Anoop Sarkar, Simon Fraser University

Nathan Schneider, Georgetown University

Alexandra Schofield, Harvey Mudd College

Melanie Siegel, Hochschule Darmstadt

Sowmya Vajjala, National Research Council

Gerhard Van Huyssteen, North-West University

Shira Wein, Georgetown University

Richard Wicentowski, Swarthmore College

Shuly Wintner, University of Haifa

Torsten Zesch, University of Duisburg-Essen

Heike Zinsmeister, University of Hamburg


Table of Contents

Pedagogical Principles in the Online Teaching of Text Mining: A Retrospection
Rajkumar Saini, György Kovács, Mohamadreza Faridghasemnia, Hamam Mokayed, Oluwatosin Adewumi, Pedro Alonso, Sumit Rakesh and Marcus Liwicki . . . . . 1

Teaching a Massive Open Online Course on Natural Language Processing
Ekaterina Artemova, Murat Apishev, Denis Kirianov, Veronica Sarkisyan, Sergey Aksenov and Oleg Serikov . . . . . 13

Natural Language Processing 4 All (NLP4All): A New Online Platform for Teaching and Learning NLP Concepts
Rebekah Baglini and Hermes Hjorth . . . . . 28

A New Broad NLP Training from Speech to Knowledge
Maxime Amblard and Miguel Couceiro . . . . . 34

Applied Language Technology: NLP for the Humanities
Tuomo Hiippala . . . . . 46

A Crash Course on Ethics for Natural Language Processing
Annemarie Friedrich and Torsten Zesch . . . . . 49

A dissemination workshop for introducing young Italian students to NLP
Lucio Messina, Lucia Busso, Claudia Roberta Combei, Alessio Miaschi, Ludovica Pannitto, Gabriele Sarti and Malvina Nissim . . . . . 52

MiniVQA - A resource to build your tailored VQA competition
Jean-Benoit Delbrouck . . . . . 55

From back to the roots into the gated woods: Deep learning for NLP
Barbara Plank . . . . . 59

Learning PyTorch Through A Neural Dependency Parsing Exercise
David Jurgens . . . . . 62

A Balanced and Broadly Targeted Computational Linguistics Curriculum
Emma Manning, Nathan Schneider and Amir Zeldes . . . . . 65

Gaining Experience with Structured Data: Using the Resources of Dialog State Tracking Challenge 2
Ronnie Smith . . . . . 70

The Flipped Classroom model for teaching Conditional Random Fields in an NLP course
Manex Agirrezabal . . . . . 80

Flamingos and Hedgehogs in the Croquet-Ground: Teaching Evaluation of NLP Systems for Undergraduate Students
Brielen Madureira . . . . . 87

An Immersive Computational Text Analysis Course for Non-Computer Science Students at Barnard College
Adam Poliak and Jalisha Jenifer . . . . . 92

Introducing Information Retrieval for Biomedical Informatics Students
Sanya Taneja, Richard Boyce, William Reynolds and Denis Newman-Griffis . . . . . 96

Contemporary NLP Modeling in Six Comprehensive Programming Assignments
Greg Durrett, Jifan Chen, Shrey Desai, Tanya Goyal, Lucas Kabela, Yasumasa Onoe and Jiacheng Xu . . . . . 99

Interactive Assignments for Teaching Structured Neural NLP
David Gaddy, Daniel Fried, Nikita Kitaev, Mitchell Stern, Rodolfo Corona, John DeNero and Dan Klein . . . . . 104

Learning about Word Vector Representations and Deep Learning through Implementing Word2vec
David Jurgens . . . . . 108

Naive Bayes versus BERT: Jupyter notebook assignments for an introductory NLP course
Jennifer Foster and Joachim Wagner . . . . . 112

Natural Language Processing for Computer Scientists and Data Scientists at a Large State University
Casey Kennington . . . . . 115

On Writing a Textbook on Natural Language Processing
Jacob Eisenstein . . . . . 125

Learning How To Learn NLP: Developing Introductory Concepts Through Scaffolded Discovery
Alexandra Schofield, Richard Wicentowski and Julie Medero . . . . . 131

The Online Pivot: Lessons Learned from Teaching a Text and Data Mining Course in Lockdown, Enhancing online Teaching with Pair Programming and Digital Badges
Beatrice Alex, Clare Llewellyn, Pawel Orzechowski and Maria Boutchkova . . . . . 138

Teaching NLP outside Linguistics and Computer Science classrooms: Some challenges and some opportunities
Sowmya Vajjala . . . . . 149

Teaching NLP with Bracelets and Restaurant Menus: An Interactive Workshop for Italian Students
Ludovica Pannitto, Lucia Busso, Claudia Roberta Combei, Lucio Messina, Alessio Miaschi, Gabriele Sarti and Malvina Nissim . . . . . 160


Workshop Program

Thursday, June 10, 2021

Poster Session 1

+ Long Papers : Courses and Curricula

Pedagogical Principles in the Online Teaching of Text Mining: A Retrospection
Rajkumar Saini, György Kovács, Mohamadreza Faridghasemnia, Hamam Mokayed, Oluwatosin Adewumi, Pedro Alonso, Sumit Rakesh and Marcus Liwicki

Teaching a Massive Open Online Course on Natural Language Processing
Ekaterina Artemova, Murat Apishev, Denis Kirianov, Veronica Sarkisyan, Sergey Aksenov and Oleg Serikov

Natural Language Processing 4 All (NLP4All): A New Online Platform for Teaching and Learning NLP Concepts
Rebekah Baglini and Hermes Hjorth

A New Broad NLP Training from Speech to Knowledge
Maxime Amblard and Miguel Couceiro

+ Teaching Materials Short Papers : Courses and Curricula

Applied Language Technology: NLP for the Humanities
Tuomo Hiippala

A Crash Course on Ethics for Natural Language Processing
Annemarie Friedrich and Torsten Zesch

A dissemination workshop for introducing young Italian students to NLP
Lucio Messina, Lucia Busso, Claudia Roberta Combei, Alessio Miaschi, Ludovica Pannitto, Gabriele Sarti and Malvina Nissim

+ Teaching Materials Short Papers: Tools and Assignments

MiniVQA - A resource to build your tailored VQA competition
Jean-Benoit Delbrouck



From back to the roots into the gated woods: Deep learning for NLP
Barbara Plank

Learning PyTorch Through A Neural Dependency Parsing Exercise
David Jurgens

Panel 1: "What Should We Be Teaching?" Panelists: Isabelle Augenstein (U of Copenhagen), Emily M. Bender (U of Washington), Yoav Goldberg (Bar-Ilan U), and Dan Jurafsky (Stanford U)

Poster Session 2

+ Long Papers : Courses and Curricula

A Balanced and Broadly Targeted Computational Linguistics Curriculum
Emma Manning, Nathan Schneider and Amir Zeldes

Gaining Experience with Structured Data: Using the Resources of Dialog State Tracking Challenge 2
Ronnie Smith

The Flipped Classroom model for teaching Conditional Random Fields in an NLP course
Manex Agirrezabal

+ Teaching Materials Short Papers: Courses and Curricula

Flamingos and Hedgehogs in the Croquet-Ground: Teaching Evaluation of NLP Systems for Undergraduate Students
Brielen Madureira

An Immersive Computational Text Analysis Course for Non-Computer Science Students at Barnard College
Adam Poliak and Jalisha Jenifer

Introducing Information Retrieval for Biomedical Informatics Students
Sanya Taneja, Richard Boyce, William Reynolds and Denis Newman-Griffis



+ Teaching Materials Short Papers : Tools and Assignments

Contemporary NLP Modeling in Six Comprehensive Programming Assignments
Greg Durrett, Jifan Chen, Shrey Desai, Tanya Goyal, Lucas Kabela, Yasumasa Onoe and Jiacheng Xu

Interactive Assignments for Teaching Structured Neural NLP
David Gaddy, Daniel Fried, Nikita Kitaev, Mitchell Stern, Rodolfo Corona, John DeNero and Dan Klein

Learning about Word Vector Representations and Deep Learning through Implementing Word2vec
David Jurgens

Naive Bayes versus BERT: Jupyter notebook assignments for an introductory NLP course
Jennifer Foster and Joachim Wagner

Oral Presentations 1

+ Long Papers : Courses and Curricula

Natural Language Processing for Computer Scientists and Data Scientists at a Large State University
Casey Kennington

On Writing a Textbook on Natural Language Processing
Jacob Eisenstein

Learning How To Learn NLP: Developing Introductory Concepts Through Scaffolded Discovery
Alexandra Schofield, Richard Wicentowski and Julie Medero



Keynote Address : Ines Montani (Explosion)

Friday, June 11, 2021

Oral Presentations 2

+ Long Papers : Courses and Curricula

The Online Pivot: Lessons Learned from Teaching a Text and Data Mining Course in Lockdown, Enhancing online Teaching with Pair Programming and Digital Badges
Beatrice Alex, Clare Llewellyn, Pawel Orzechowski and Maria Boutchkova

Teaching NLP outside Linguistics and Computer Science classrooms: Some challenges and some opportunities
Sowmya Vajjala

Teaching NLP with Bracelets and Restaurant Menus: An Interactive Workshop for Italian Students
Ludovica Pannitto, Lucia Busso, Claudia Roberta Combei, Lucio Messina, Alessio Miaschi, Gabriele Sarti and Malvina Nissim

Panel 2: "What Does Industry Need?" Panelists: Lenny Bronner (Washington Post), Delip Rao (AI2), Frank Rudzicz (U of Toronto), and Rachael Tatman (Rasa)

Keynote Address : Jason Eisner (Johns Hopkins University)


Proceedings of the Fifth Workshop on Teaching NLP, pages 1–12, June 10–11, 2021. ©2021 Association for Computational Linguistics

Pedagogical Principles in the Online Teaching of NLP: A Retrospection

György Kovács¹, Rajkumar Saini¹, Mohamadreza Faridghasemnia², Hamam Mokayed¹, Tosin Adewumi¹, Pedro Alonso¹, Sumit Rakesh¹ and Marcus Liwicki¹

¹Luleå Tekniska Universitet / Luleå, Sweden-97187
²Örebro Universitet / Örebro, Sweden-70182

{gyorgy.kovacs, rajkumar.saini}@ltu.se
[email protected]
{hamam.mokayed, oluwatosin.adewumi, pedro.alonso}@ltu.se
[email protected], [email protected]

Abstract

The ongoing COVID-19 pandemic has brought online education to the forefront of pedagogical discussions. To make this increased interest sustainable in a post-pandemic era, online courses must be built on strong pedagogical foundations. With a long history of pedagogic research, there are many principles, frameworks, and models available to help teachers in doing so. These models cover different teaching perspectives, such as constructive alignment, feedback, and the learning environment. In this paper, we discuss how we designed and implemented our online Natural Language Processing (NLP) course following constructive alignment and adhering to the pedagogical principles of LTU. By examining our course and analyzing student evaluation forms, we show that we have met our goal and successfully delivered the course. Furthermore, we discuss the additional benefits resulting from the current mode of delivery, including the increased reusability of course content and increased potential for collaboration between universities. Lastly, we also discuss where we can and will further improve the current course design.

Keywords: NLP, Constructive Alignment, LTU's Pedagogical Principles, Student Activation, Online Pedagogy, COVID-19, Canvas, Blackboard, Zoom

1 Introduction

With the COVID-19 pandemic, academic institutions were pushed to move their education online (Gallagher and Palmer, 2020; Dick et al., 2020). Furthermore, even institutes that still allowed students on campus were more open to online alternatives. However, in order to solidify the results attained in online education (Ossiannilsson, 2021) under these extraordinary circumstances, it is crucial to demonstrate that this mode of education can be a viable alternative to presential education in terms of the underlying pedagogical fundamentals and the success of practical implementation. For this reason, in setting up our Natural Language Processing course, our goal was to base it on constructive alignment and LTU's pedagogical principles. We hypothesized that adherence to both sets of principles would be possible in an online version too. In this paper, by examining the design and implementation of the course and analyzing the student feedback, we demonstrate the viability of our hypothesis.

The rest of the paper is organized as follows. First, we discuss the related literature in Section 1.1, followed by LTU's pedagogical principles in Section 1.2. Then, in Section 2, we present the course design. Section 3 discusses the teaching and learning activities. Next, in Section 4, we present the course evaluation results and the perspective of a volunteering student of the course. Lastly, we conclude the paper and outline planned improvements in Section 5.

1.1 Related work

The teaching paradigm has been moving from a teacher-centered view to a more student-centered perspective (Kaymakamoglu, 2018). This means that instead of focusing on the role of the teacher, the focus is more and more on what the student should do, that is, process the material through deliberate practice, collaboration, and active reflection. To effectively support this process, teaching is planned and conducted with the student's disposition in mind, considering their prior knowledge, expectations, study skills, and other conditions. With the proper planning, design, and implementation of the course, active learning can then be achieved.

This active learning or student activation (Cook and Babon, 2017) is achieved when students go into lectures and tutorials prepared to engage in the learning process, rather than just passively trying to absorb information. Active learning encourages active cognitive processing of information, and the concept is not a new one: Confucius is often quoted as saying, "Tell me and I will forget, show me and I may remember, involve me and I will learn." The use of student activation or active learning is well suited to presential learning. In an online setting, this type of learning has to be adapted so that the same knowledge is acquired from online activities. The main goal is for students to learn as well online as in presential lectures.

The following levels from (Center for Teaching and Learning, 1990) might help to trigger student activation:

• To test the students' memory by asking questions related to specific facts, terms, principles, or theories.

• To utilize the students' knowledge to solve problems or analyze a situation.

• To exercise the informed judgment of the students.

Many pedagogical theories and frameworks covering different aspects of teaching have been developed to facilitate effective teaching (Kandlbinder, 2014; Chi and Wylie, 2014; Cook and Babon, 2017; Rust, 2002). With the advancement of technology and globalization, however, the traditional pedagogical models have evolved to make distance learning possible. Students can sit anywhere, learn online through the internet, and connect with other students in the physical classroom or online.

Our course follows these pedagogical principles to enhance learning outcomes. The course is segmented into 83 videos, given by eight lecturers with different expertise, kept coherent and harmonized. This well-balanced theoretical-practical course covers a broad range of applications and approaches.

1.2 LTU's Pedagogical Principles

To support the development of teaching practices and help students develop the skills required to become independent actors in their respective fields, LTU developed its pedagogical idea centering around nine important principles (Luleå University of Technology, University Pedagogy Center). These principles are compatible with Örebro University's pedagogical principles (Örebro University), and we believe courses at other universities could follow them as well. Here, we briefly introduce these principles (along with feedback regarding the course) as a barometer for our course design.

1. Emphasize relevant knowledge and skill. It is an important goal of university education to emphasize relevant skills and knowledge; that is to say, students should acquire the relevant theoretical knowledge, as well as the skills necessary to apply that knowledge.

2. Encourage active cognitive processing, so that students engage with the subject matter deliberately, contributing to long-term learning (Chi and Wylie, 2014).

3. Choose forms of assessment that enhance learning. The design of assessment has a great impact on student learning (Snyder, 1971; Wiiand, 1998; Ramsden, 2003), and thus it is important to choose the methods ("If you want to change student learning then change the methods of assessment" (Brown G., 1999)), frequency, and content of assessment in a way that contributes to student learning. Naturally, active learning and student activation, discussed in more detail in Section 1.1, are crucial components of adherence to this principle.

4. Ensure clarity in the course and task design, an important principle to make sure students have a clear picture of the course and its requirements. One approach to achieving this is constructive alignment (Kandlbinder, 2014): making clear connections between the goals of the program, the intended learning outcomes of the course, the assessment methods, and the activities carried out throughout the course.

5. Promote knowledge-enhancing feedback. It is important that students receive feedback on their work besides the final assessment, which can guide their progress and inform them how close they are to the learning outcomes (Elmgren and Henriksson, 2018). The impact of feedback is highest when it is timely, personalized, and specific.

6. Foster independence and self-awareness, an important principle when we consider the current need for life-long learning and the goal of university education as enabling students to become independent actors in their respective fields.


7. Be aware of your students' starting points, as learning is more effective when the new information is either integrated into existing mental frameworks or said frameworks are reshaped by the new information. For this, however, it is crucial that we are aware of the current mental framework of students. Furthermore, being aware of the starting point of learners is also crucial to identify skill or knowledge gaps that should be filled.

8. Communicate high, positive expectations, so as to stimulate students toward higher performance; a phenomenon widely known as the Pygmalion effect (Goddard, 1995).

9. Create a supportive learning environment, a principle supporting all the preceding ones, since the proper environment is key to communicating not only high but positive expectations in a manner that encourages rather than discourages students. Similarly, in a supportive environment, students are more open to discussing their experiences and receiving feedback.

As we consider adherence to them especially difficult in an online setting, we give particular consideration to Principles 2, 5, and 9. For example, in the absence of in-presence teaching, when in-person interaction between students and teacher and among students is not possible, it is especially challenging to create a supportive learning environment, and the opportunities to give feedback also diminish. Online learning management systems like Canvas and Blackboard provide the infrastructure to conduct courses online. However, similar to classrooms, it is up to teachers and course developers to fill them with content and utilize them effectively. We discuss the learning management systems and how we used them to support the pedagogical principles in Section 3.3.

2 Course Design

In this section, we examine the course design based on LTU's pedagogical principles (see Section 1.2), the principle of constructive alignment, and other pedagogical considerations. We should note here that one may arrive at similar design patterns based solely on practical considerations as well. However, we consider it an integral part of our contribution that the design of the NLP course discussed here is rooted not only in pedagogical practices but also in pedagogical theories and research.

2.1 Objectives/Intended Learning Outcomes

In accordance with constructive alignment (Kandlbinder, 2014), course-level objectives and Intended Learning Outcomes (ILOs) were set at the beginning of the course design process. For this, we followed the pedagogical principles, as well as the ABCD and SMART techniques (Tullu et al., 2015; Smaldino et al., 2005; Doran et al., 1981). Based on these factors, the ILOs for the NLP course were as follows: After passing the course, the student should be able to. . .

• . . . explain and use text preprocessing techniques

• . . . describe a text analytics system together with its components, optional and mandatory ones

• . . . explain how text could be analyzed

• . . . evaluate results of text analytics

• . . . analyze and reflect on the various techniques used in text analytics and the parameters needed, as well as the problem solved

• . . . plan and execute a text analytics experiment

2.2 Course description

The course contents were designed according to the ILOs discussed in Section 2.1. The syllabus and content of the course were designed after examining NLP courses and other similar courses offered by different universities, as well as reviewing relevant literature (Agarwal, 2013; Hearst, 2005; Liddy and McCracken, 2005). Our NLP course consists of seven modules that provide motivation, tools, and techniques for solving NLP problems. This section briefly explains the contents of the modules to give the reader a better understanding of how the ILOs are reflected in the proposed contents.

The course starts in the first module with NLP applications, providing links to online APIs, chatbots, and dialog systems to motivate students. Then we discuss basic definitions and concepts in NLP, such as what a language is. The second module of this course is devoted to the structural analysis of texts. It covers morphological analysis like n-grams and filtering (such as identifying missing values, tokenization, stemming & lemmatization, and handling URLs, accents, contractions, typos, digits, etc.). Moreover, Part Of Speech (POS) is described, including its applications and a more in-depth view into it. This module also describes the syntax and syntactic analysis of a text, followed by feature extraction and text representation; it starts with commonly used features in NLP and describes different encodings that can represent text to machines.
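To make this concrete, a minimal sketch of such a preprocessing pipeline is shown below. The library choice (NLTK) and the exact cleanup steps are our illustrative assumptions; the course itself does not prescribe a specific implementation.

```python
# A minimal text-preprocessing sketch of the kind covered in module 2.
# NLTK is an illustrative choice; any comparable library would do.
import re

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

def preprocess(text: str) -> list[tuple[str, str]]:
    # Handle URLs and digits before tokenizing.
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"\d+", " ", text)           # drop digits
    tokens = nltk.word_tokenize(text.lower())
    stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
    # Stemming and lemmatization are alternatives; both are shown side by side.
    return [(stemmer.stem(t), lemmatizer.lemmatize(t))
            for t in tokens if t.isalpha()]

print(preprocess("Tokenizing examples from https://example.com since 2021."))
```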

The third module of this course delivers the theory and practice of neural networks. It starts with the basics of neurons and network architectures, then training using backpropagation, followed by vanishing gradients and over/underfitting problems. RNN, LSTM, GRU, and CNN layers are covered in depth. This module finishes by debugging neural networks, discussing regularizers, and covering common problems of neural networks.

Having covered the essential tools of NLP, the fourth module is geared more towards application, text classification, and how NLP problems can be solved using neural networks. This module starts with learning taxonomies and a step-by-step design of an end-to-end neural network. It also discusses common choices for simple network architectures, and multi-class, multi-label, and multi-task problems. After that, text classification is grounded in the problem of tagging in NLP, followed by transfer learning methods. At the end of this module, standard metrics in NLP are discussed.
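For illustration, an end-to-end text classifier of the kind designed in this module might look as follows in PyTorch. The layer sizes, vocabulary size, and random input are illustrative assumptions, not the course's actual assignment code.

```python
# A minimal sketch of an end-to-end neural text classifier (module 4).
import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 64,
                 hidden_dim: int = 128, num_classes: int = 2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(token_ids)   # (batch, seq, embed_dim)
        _, (hidden, _) = self.lstm(embedded)   # final hidden state
        return self.classifier(hidden[-1])     # (batch, num_classes)

model = TextClassifier(vocab_size=10_000)
logits = model(torch.randint(0, 10_000, (8, 20)))  # batch of 8 sequences
print(logits.shape)  # torch.Size([8, 2])
```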

The fifth module is dedicated to semantics, introducing semantics and symbols in NLP and possible ways to represent them. The theory of frames and synsets is discussed as the formal representation of semantics. Then, vector representations of semantics are discussed alongside different approaches to obtain them. Approaches like the distributional hypothesis, statistical methods, and neural methods such as Word2vec are covered. At the end of this module, some of the state-of-the-art word embeddings are reviewed.
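A minimal sketch of training Word2vec embeddings on a toy corpus, using gensim (an assumed library choice; the module covers the method, not this specific API), could look like this:

```python
# Illustrative Word2vec training sketch (module 5); the toy corpus is ours.
from gensim.models import Word2Vec

corpus = [
    ["natural", "language", "processing", "is", "fun"],
    ["teaching", "natural", "language", "processing", "online"],
    ["word", "vectors", "capture", "distributional", "semantics"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=3,
                 min_count=1, sg=1)  # sg=1 selects the skip-gram variant
print(model.wv["language"].shape)            # (50,)
print(model.wv.most_similar("language", topn=2))
```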

Sequence-to-sequence architectures, the encoder-decoder architecture, attention, transformers, and the BERT model are discussed in the next module. At the end, different topics such as structured prediction, arc-standard transition parsing, and insights into dialog and chatbots, image captioning, and gender bias in NLP are discussed.
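As an illustration of the models covered here (our sketch, not course material), a pretrained BERT encoder can be loaded and queried in a few lines with the Hugging Face transformers library:

```python
# Illustrative sketch: loading a pretrained BERT encoder (module 6).
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Teaching NLP online", return_tensors="pt")
outputs = model(**inputs)
# One 768-dimensional contextual vector per input token.
print(outputs.last_hidden_state.shape)  # (1, num_tokens, 768)
```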

2.3 Video Clips

Our main challenge here is to deliver the material in a manner that still adheres to Principle 2 and encourages active cognitive processing. To achieve this goal, the lecture format was one of the first design decisions: we split our material into short videos with a length of at most ten minutes (McKeachie et al., 2006; Benjamin, 2002; Ozan and Ozarslan, 2016; Bordes et al., 2021). Delivering the course in short videos has the benefit of encouraging students' active cognitive processing between videos, as discussed in Section 1.

Another benefit of presenting our course content in short videos is the potential for the reusability of videos (Crame, 2016). Lastly, splitting all subjects into smaller topics facilitates sharing the workload among many presenters. It allowed us to organize the course in a joint setting, as discussed in more detail in Section 2.4.

2.4 Joint Course

Designing and developing a course for multiple constituencies requires deep knowledge of the field, a broad range of knowledge within it, and a deeper understanding of each topic. This requirement is not easily fulfilled by one lecturer. Thus, we made it a joint course between Örebro University (ORU) and LTU. This allows us to benefit from multiple lecturers, each delivering a specialized lecture. Load sharing is another benefit of a joint course. However, designing and developing a joint course with multiple lecturers brought some difficulties, such as:

• Keeping harmony between lectures.

• Segmenting sub-topics to keep the lengths short.

• Distributing each subtopic among lecturers.

• Avoiding repetition in lectures.

To overcome these difficulties, lecturers from the two universities held weekly meetings to design the course and distribute sub-topics among themselves. Lecturers created material by themselves, and then the material and the narratives were reviewed weekly. Moreover, after the course's first run, all materials were reviewed again to keep the outcome coherent and harmonized. This revision and harmonization of material should contribute to the clarity of course content (Principle 4).


2.5 Assessment and Assignments

To achieve the desired student activation, we designed the initial assignments and project tasks around ready-to-use web APIs for various applications like hate speech detection, profanity filtering, and image captioning. Later, the focus shifts to how these applications are developed using NLP. Therefore, we cover theoretical, numerical, and programming tasks in assignments and projects.

The course's assessment design is based on a quote by David Boud: "students can escape bad teaching, but they cannot escape bad assessment". The forms of assessment in the joint NLP course were based on guiding the students towards achieving their learning outcomes. We avoided designing the assessment as a written exam at the end of the course, as we were looking to engage the students with the course contents all the way to the end of the course (Principles 2, 3) (Luleå University of Technology, University Pedagogy Center). The assessment was done in two stages: a practical project and an oral discussion. The practical project is designed to follow the principle of constructive alignment (Kandlbinder, 2014). The project's main objective is to lead the student to develop an actual NLP application (Principle 8) (Luleå University of Technology, University Pedagogy Center), in the form of five cumulative tasks. The tasks guide the students through the different levels of Bloom's cognitive domain (Bloom et al., 1956), namely Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating.

Each task is designed to match the weekly delivered contents, so it gives real feedback on how the students understand and implement the concepts in practice (Principle 5) (Luleå University of Technology, University Pedagogy Center), and on whether they can transfer the knowledge (theory) from one context to another (practical implementation). Students have the flexibility to choose the programming language (however, Python is preferred), framework, or tool that they would like to use to reach the target. The provided flexibility gives a chance to find multiple approaches proposed by different students to solve the problem (Principle 6) (Luleå University of Technology, University Pedagogy Center). The sharing of different ideas among the class adds significant value to the course outcome. The final exam is an oral exam for each student. It is more like a discussion, and we assess each student based on that. An open discussion is conducted, trying to mimic the understanding of a real problem related to NLP. The discussion starts with a question about the general understanding of one of the well-known NLP applications, such as machine translation. All the consecutive questions are asked to evaluate the practical understanding of the problem and to test what is needed at each stage to implement the solution. The examiner and examinee use Zoom's whiteboard to convey the questions and answers for better clarity.

3 Teaching and Learning Activities

The course is designed to maximize the intended learning outcomes. We refer students to some Python programming tutorials available online to get familiarized with Python. This section describes this course's teaching and learning activities, which include delivering lectures as short videos, live sessions, lab sessions, and our learning management system.

3.1 Live Sessions

The lectures are delivered through recorded videos for each module of the course. However, it is essential to take the students' questions and reflections and address their potential confusion after they have watched the video lectures. For this, we conducted weekly live sessions for each module. Students learned from the video lectures and discussed with the instructors in the live sessions, thus aligning with flipped classroom learning. These live sessions, where students had the opportunity to directly ask the instructors, contributed to a supportive learning environment (Principle 9) and gave us the opportunity to provide the students with feedback (Principle 5).

The live sessions were delivered to address queries regarding the theoretical concepts, practicals, contents, specific assignments, and other organizational matters. We addressed the immediate queries, as well as the queries asked in discussion threads in Canvas for the modules covered up to a specific live session. However, there was also scope for queries related to the upcoming part of the course.

Live sessions were conducted online through Zoom every week during the course, with at least two instructors and one teaching assistant present. First, we took the queries asked in the learning management system and had a follow-up discussion regarding them. Next, we took other questions related to the theory and practicals.

We encourage students to ask questions and raise points of confusion freely to enhance their learning (Gibbs, 2005). At the end, we ask for regular feedback (Principle 5) (Luleå University of Technology, University Pedagogy Center) for each module, either in the live sessions or in the learning management system.

3.2 Lab Sessions

Practical (lab) sessions make up student-centered teaching strategies for experiential courses (Moate and Cox, 2015). The lab sessions provide practical programming experience to the students, following Kolb's cycle and the ICAP framework (Kolb, 2014; Chi and Wylie, 2014). Students watch or read a given experience (concrete experience) and reflect on it (reflective observation) before conceptualizing how to go about the implementation through a flowchart or pseudocode, and finally engage in active experimentation by writing and running the code. Furthermore, in a lab session, students get to ask questions and get possible answers from their colleagues or the instructor. They also get formative feedback after the lab session (Principle 5) through the online channels identified in this work, making it possible for them to improve their practical skills. They can work together in groups (using online collaborative tools), which is sometimes preferred. This makes it an interactive session, thereby achieving the interacting stage of the ICAP framework (Chi and Wylie, 2014). Assessment at the end of each lab session provides a good context and affects learning (Rust, 2002).

Technique is a vital element of pedagogy, the science of teaching (Lea, 2004). This approach follows the emerging trend of the flipped classroom (Europass Teacher Academy, 2020), where students are responsible for their own learning before attending classes (Ng, 2018). The sessions help to bridge the gap in knowledge between students and the instructors. It was essential to be aware that not all students had the same level of programming experience. Hence, it was essential to provide the practical sessions in stages that could address beginner, intermediate, and experienced developers. Students were given tasks that progressed from introductory programming with Python for text preprocessing, to an introduction to the PyTorch framework and, finally, an advanced machine translation (MT) task.

3.3 Learning Management Systems

The lack of in-person interaction between students and teachers emphasized the framework through which the students were interacting with the course material and the teachers. The two universities offering the NLP course used different frameworks, or Learning Management Systems (LMSs), for this task; however, the principles to bear in mind were the same. Most importantly, the course material had to be presented in the LMS in a way that makes it clear to the student what their task is and how to proceed with their learning (Principle 4: Ensure clarity in the course and task design). Another critical aspect of the LMS was to facilitate interaction between teachers and students, providing an open channel of communication (Principle 9: Create a supportive learning environment). Additionally, it was required that the LMS facilitate feedback (Principle 5: Promote knowledge-enhancing feedback) and support the assessment methods to be used for the course (Principle 3: Choose forms of assessment that enhance learning). In the following sections, we discuss how each LMS was used to achieve these goals.

3.3.1 Canvas and Blackboard

Both Canvas and Blackboard are cloud-based LMSs used for courses at many educational institutes. Here, we discuss (in numerical order) how we set up the course in them to support the above-listed pedagogical principles.

• Principle 3: Both Canvas and Blackboard enable the posting and evaluation of the assignment types we used (for more details, see Section 2.5). Moreover, they also enable peer feedback tasks to be assigned to students automatically. This feature contributes to the adherence to Principle 6 (Foster independence and self-awareness) by encouraging students to reflect on others' learning and the requirements of each assignment.

• Principle 4: On the starting page (the entry point for students to the course room), the course syllabus is linked, and students can also find the course plan here, along with a weekly, thematic breakdown of the course. In this breakdown, we also provide links that allow students to access all materials allotted for the given weeks. Moreover, to further enhance the clarity in course design, the material was added weekly to direct students' attention to the current tasks. Another tool provided by Canvas and Blackboard that serves clarity is the assignments page, where students can access all assignments in one place, enabling them to see what assignments they have already fulfilled, what assignments are still due, and when due dates are coming up.

• Principle 5: The opportunity to provide knowledge-enhancing feedback was two-fold. For one, teachers could rate and comment on student assignments. Moreover, teachers could provide feedback through the various forums on the course's discussion page, also hosted on Canvas and Blackboard.

• Principle 9: The discussion page of the LMS also served as an open communication channel among students and between students and teachers. This contributes to the high degree of interaction that is important for a supportive learning environment.

4 Results

The NLP course (7.5 credits) was part of the one-year Data Science Masters program at LTU. There were 24 students enrolled in the NLP course, coming from both academia and industry. Most students had a background in computer science or engineering, and thus had taken at least an introductory course in programming. Some students, however, had no programming background.

4.1 A student’s perspective of the course

This section is the work of a student who took the course; his comments are discussed below.

Students need to be prepared for both the industry and research fronts, and thus need to focus on using the tools and methods that are most widely used. When there are so many tools to deal with, students get overburdened. Here, the instructor needs to follow a suitable pathway to let students understand the concepts and implement an executable project for demonstration. The class contains students from both programming and non-programming backgrounds. Python was used in the project assignments. Students without a programming background may not feel confident using Python; therefore, there should be an alternative to it.

Text preprocessing is an essential step in NLP, and various libraries exist for it. The lecturers' main attention was on applying preprocessing (such as identifying missing values, tokenization, lemmatization, etc.) to a text file. It did not seem easy to apply those steps to Pandas dataframes; especially applying regular expressions to a Pandas dataframe seemed daunting for a beginner. Though the instructors did their job on both the concepts and the practicalities, students felt less confident carrying the instructors' work forward and got stuck using libraries. So, one needs to develop a clear strategy, or steps, for being selective in using the libraries that are most helpful for real project development. This helps form the proper foundations for the student, so that students feel more confident carrying forward the concepts taught by the lecturers in class. The same applies when using libraries such as Tensorflow and Pytorch.
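To illustrate the pain point raised here (this snippet is our illustration, not part of the student's report or the course materials), regex-based preprocessing over a whole Pandas column reduces to a few vectorized string operations once those methods are known:

```python
# Illustrative sketch: applying regex-based preprocessing to a Pandas column.
import pandas as pd

df = pd.DataFrame({"text": ["Visit https://example.com NOW!!",
                            "NLP is fun :)"]})

# Vectorized string methods apply the regex to every row of the column.
df["clean"] = (df["text"]
               .str.lower()
               .str.replace(r"https?://\S+", " ", regex=True)  # drop URLs
               .str.replace(r"[^a-z\s]", " ", regex=True))     # keep letters
print(df)
```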

Some students had high expectations of the instructors, namely that they should have supported students in building a web-based or desktop NLP application, i.e., taking the NLP course project to technology readiness level 7. However, this was beyond the scope of this course.

Another difficulty students faced was understanding the speech patterns in the videos, recorded by lecturers coming from different backgrounds and having different accents. Moreover, there were many technical terms and idioms in the course that are hard to catch at first hearing. A neat solution to this is subtitles: adding subtitles to recorded lectures improves speech understanding and helps students pick up new technical terms.

This course came with various interesting points from the students' perspective, such as:

• Efficient knowledge transfer.

• Making the student feel confident at the end of the course by applying the theoretical and practical aspects learned in the course.

• Removing the barrier for students coming from different cultures and different backgrounds.

• In the initial stages, sticking to the most widely used programming language, libraries, and ecosystem in NLP, while at the same time being selective about the libraries that are most useful in day-to-day project/coding implementations.

• Motivating students to build their own models and customize them.

4.2 Course evaluation by students

The department surveys each course. Students are asked to give a rating on a scale of 1-6 (1: strongly disagree, 6: strongly agree). Students are encouraged to provide feedback; however, it is up to them whether they give it or not. Here, we discuss the students' reflections we received after the course. In total, eight of the twenty-four students taking the course gave their responses in the course evaluation report. The survey consists of six sections: self-assessment, course aims and content, quality of teaching, course materials, examination, and overall assessment.

Table 1 shows the statistics of the students' self-assessment. It can be noticed that many students did not spend sufficient effort; in a way, we failed to motivate students to put their effort into the course. This may be due to students' other commitments. However, the fact remains that we could not get students to put enough effort into the course. Thus, we need to think about how we can improve in this regard.

The questions in Table 2 were asked regarding the ILOs of the course, to ensure that the course design and implementation adhere to one aspect of the principle of constructive alignment (i.e., that the teaching and learning activities support the ILOs), and that Principle 1 (Emphasize relevant knowledge and skill) and Principle 4 (Ensure clarity in course design) were at least partially fulfilled, while the last question partially reflects Principle 9 (Create a supportive learning environment) as well. The average scores for these statements are between 4.25 and 4.9, suggesting that students on average agree with the given statements, and thus that we have managed to achieve these goals in the design and delivery of the course. The lowest agreement was given for the third statement; thus, our main objective for the upcoming installment of the course will be to ensure that the study guide provides better guidance for the students.

Another set of questions was asked for the evaluation of the course delivery and the exam conducted (see Table 3). For one, these questions measure another aspect of constructive alignment (i.e., the alignment of assessment with the ILOs; Question 3.5). Moreover, how well adherence to some of LTU's principles was implemented is also measured here. It can once again be noted that all average values are above 3.5; thus, students are more inclined to agree with the given statements than to disagree with them. In particular, they rate the alignment between the examination and the ILOs (and thus the adherence to constructive alignment) quite high. The second highest score was given for technical support and communication, suggesting that this aspect of creating a supportive learning environment (Principle 9) was rather successful. The score given for the rewarding nature of theoretical teaching and learning activities was only one decimal lower, which suggests that in this aspect we were successful in emphasizing relevant knowledge and skill (Principle 1). However, the score concerning the practical aspects of teaching and learning activities was considerably lower (though still closer to six than to one), suggesting that there is more room for improvement in terms of practical tasks and assignments, as some comments also confirmed. Another way to address this score is to communicate more clearly to students that building a complete web-based or desktop NLP application is beyond the scope of this introductory course. Another area with room for improvement is the input of instructors supporting student learning (Question 3.1).

Lastly, the overall assessment of the course by students is shown in Table 4. While the overall scores are encouraging for us (in particular the question regarding the overall impression of students about the course; Question 4.3), there is still scope for improvement in all sections, and our goal for future installments of the course is to achieve even higher student satisfaction scores.

5 Conclusion and Improvements planned for the future versions of the course

Here, we discuss the course and the improvements we plan for the future course.

5.1 Conclusion

This paper discusses how the course was designed, organized, and delivered online at the university. We followed the pedagogical principles in all these phases of the course. We argue that we can deliver the course fully online even after the pandemic, as we delivered the course to a satisfactory level.


Table 1: Students' self-assessment regarding the NLP course

No.   Question                                                                    Average score
1.1   How many hours of study have you on average dedicated to this course
      per week, including both scheduled and non-scheduled time?
1.2   I am satisfied with my efforts during the course.                           4.5
1.3   I have participated in all the teaching and learning activities in the
      course.                                                                     4.0
1.4   I have prepared myself prior to all teaching and learning activities.       3.3

Table 2: Evaluation of achieving the ILOs and aim of the conducted NLP course

No.   Question                                                                    Average score
2.1   The intended learning outcomes of the course have been clear.               4.9
2.2   The contents of the course have helped me to achieve the ILOs of the
      course.                                                                     4.5
2.3   The course planning and the study guide have provided good guidance.        4.25

Table 3: Evaluation of the course delivery and the exam of the conducted NLP course

No.   Question                                                                    Average score
3.1   The teacher's input has supported my learning.                              4.0
3.2   The teaching and learning activities of a theoretical nature have been
      rewarding.                                                                  4.6
3.3   The practical/creative teaching and learning activities of the course,
      e.g. labs, field trips, teaching practice, placements/internships,
      project work, have been rewarding.                                          4.0
3.4   The technical support for communication, e.g. learning platform,
      e-learning resources, has been satisfactory.                                4.7
3.5   The examination was in accordance with the ILOs of the course.              5.0

Table 4: Overall assessment of the NLP course by the students

No.   Question                                                                    Average score
4.1   The workload of the course is appropriate for the number of credits given.  4.3
4.2   Given the aims of the course, the level of work required has been
      appropriate.                                                                4.1
4.3   My overall impression is that this has been a good course.                  4.6


To be precise, we hypothesized that our course would adhere to LTU's pedagogical principles and the other pedagogical theories referenced in this paper while being delivered fully online. This is supported by the students' response report: the average scores in Table 4 being greater than 4 support our hypothesis, and the scores in Tables 2 and 3 do so as well. However, there is always room for improvement. We identified many things to improve even before the course had finished, and the students' response report gave us a clear idea of where to put more energy into improving the course; e.g., as observed from Table 3, we will improve on teachers' input and project-related activities. The planned improvements are listed below.

5.2 Improvements planned

Here we discuss the improvements planned for future iterations of the course based on student feedback and pedagogical principles.

5.2.1 Two-layered course

Our course design started out targeting multiple constituencies: people from industry and people from academia, people with no background in AI and maths, and people with a strong background. This led to designing a general course that can serve people with different backgrounds and goals: one track of 3 credits for industrial students and one track of 7.5 credits for academic students. Although this idea never came to life, we would like to mention our final thoughts on it as one possible direction for future work.

Students from industry most often have different backgrounds, limited knowledge of mathematics, and stronger motivation, looking for specific applications. On the other hand, academic students have a better knowledge of mathematics. They have sufficient background in the field and are interested in learning a broad range of applications and topics.

Thus, we designed the course to cover theories, concepts, applications, and their implementation. For example, for a topic like neural networks, the materials should include the mathematical background, practical usage, and possible tweaks and configurations. The choice of two tracks would assign the theoretical subtopics only to the academic track, and the practicalities to both. In the end, all who finish the course have a broad understanding of various applications in NLP that should satisfy their interests.

5.2.2 Other Improvements

• A better naming convention for the videos, for clarity.

• Adding subtitles to the videos for better understanding (the video lectures are delivered in English, but none of the lecturers are native speakers).

• Adding quizzes between videos for better student activation and learning.

• Removing handwritten notes from the lecture videos and slides, as in some cases students found them difficult to read.

• Providing multiple practical tasks with different levels of difficulty to cater to the students' different levels of programming experience, so each student can pick the task applicable to them.

• Splitting the project into subtasks aligned with the NLP pipeline.

• More live sessions to support students' learning.

• Giving more to-the-point references to the content related to the theory and the project implementation.

• A tutorial on the specific libraries to use during the course and their setup.

• An additional tutorial on building a usable web app, e.g., for hate speech detection, where anyone can feed in a text and classify it.

We hope that future versions of the course will be better in all aspects (planning, design, organization, and delivery) and from all perspectives.

References

Apoorv Agarwal. 2013. Teaching the basics of NLP and ML in an introductory course to information science. In Proceedings of the Fourth Workshop on Teaching NLP and CL, pages 77–84.

L.T. Benjamin. 2002. Lecturing. In S.F. Davis and W. Buskist, editors, The Teaching of Psychology: Essays in Honor of Wilbert J. McKeachie and Charles L. Brewer, pages 57–67.


B. S. Bloom, M. B. Engelhart, E. J. Furst, W. H. Hill, and D. R. Krathwohl. 1956. Taxonomy of Educational Objectives: The Classification of Educational Goals. Handbook 1: Cognitive Domain. Longmans Green, New York.

Stephen J. Bordes, Donna Walker, Louis Jonathan Modica, Joanne Buckland, and Andrew K. Sobering. 2021. Towards the optimal use of video recordings to support the flipped classroom in medical school basic sciences education. Medical Education Online, 26(1):1841406. PMID: 33119431.

G. Brown and M. Atkins. 1999. Effective Teaching in Higher Education. Routledge.

Center for Teaching and Learning. 1990. Improving multiple choice questions. For Your Consideration... Suggestions and Reflections on Teaching and Learning, (8).

Michelene T.H. Chi and Ruth Wylie. 2014. The ICAP framework: Linking cognitive engagement to active learning outcomes. Educational Psychologist, 49(4):219–243.

Brian Robert Cook and Andrea Babon. 2017. Active learning through online quizzes: better learning and less (busy) work. Journal of Geography in Higher Education, 41(1):24–38.

Cynthia J. Brame. 2016. Effective educational videos: Principles and guidelines for maximizing student learning from video content. CBE Life Sciences Education, 15(4).

Geoffrey Dick, Asli Yagmur Akbulut, and Vic Matta. 2020. Teaching and learning transformation in the time of the coronavirus crisis. Journal of Information Technology Case and Application Research, 22(4):243–255.

George T. Doran et al. 1981. There's a SMART way to write management's goals and objectives. Management Review, 70(11):35–36.

Maya Elmgren and Ann-Sofie Henriksson. 2018. Academic Teaching. Studentlitteratur AB.

Europass Teacher Academy. 2020. Flipped classroom. Online. Accessed on: March 08, 2021.

Sean Gallagher and Jason Palmer. 2020. The pandemic pushed universities online. The change was long overdue.

Graham Gibbs. 2005. Improving the Quality of Student Learning. University of South Wales (United Kingdom).

R. W. Goddard. 1995. The Pygmalion effect. Personnel Journal, 64(6):10–16.

Marti A. Hearst. 2005. Teaching applied natural language processing: Triumphs and tribulations. In Proceedings of the Second ACL Workshop on Effective Tools and Methodologies for Teaching NLP and CL, pages 1–8.

Peter Kandlbinder. 2014. Constructive alignment in university teaching. HERDSA News, 36(3):5–6.

Sibel Ersel Kaymakamoglu. 2018. Teachers' beliefs, perceived practice and actual classroom practice in relation to traditional (teacher-centered) and constructivist (learner-centered) teaching (Note 1). Journal of Education and Learning, 7(1):29–37.

David A. Kolb. 2014. Experiential Learning: Experience as the Source of Learning and Development. FT Press.

Mary R. Lea. 2004. Academic literacies: A pedagogy for course design. Studies in Higher Education, 29(6):739–756.

Elizabeth D. Liddy and Nancy J. McCracken. 2005. Hands-on NLP for an interdisciplinary audience.

Luleå University of Technology, University Pedagogy Center. LTU's pedagogical principles. https://ltu.instructure.com/courses/8692/files/1364701/download?wrap=1.

W.J. McKeachie, M. Svinicki, M.D. Svinicki, B.K. Hofer, and R.M. Suinn. 2006. McKeachie's Teaching Tips: Strategies, Research, and Theory for College and University Teachers. College Teaching Series. Houghton Mifflin.

Randall M. Moate and Jane A. Cox. 2015. Learner-centered pedagogy: Considerations for application in a didactic course. Professional Counselor, 5(3):379–389.

Eugenia M.W. Ng. 2018. Integrating self-regulation principles with flipped classroom pedagogy for first year university students. Computers & Education, 126:65–74.

Ledningsstaben, Örebro University. Örebro University basic pedagogical view. https://www.oru.se/om-universitetet/vision-strategi-och-regelverk/pedagogisk-grundsyn/.

Ebba Ossiannilsson. 2021. The new normal: Post Covid-19 is about change and sustainability. Near East University Online Journal of Education, 4(1):72–77.

Ozlem Ozan and Yasin Ozarslan. 2016. Video lecture watching behaviors of learners in online courses. Educational Media International, 53(1):27–41.

P. Ramsden. 2003. Learning to Teach in Higher Education. RoutledgeFalmer.

Chris Rust. 2002. The impact of assessment on student learning: how can the research literature practically help to inform the development of departmental assessment strategies and learner-centred assessment practices? Active Learning in Higher Education, 3(2):145–158.


S.E. Smaldino, J.D. Russell, and R. Heinich. 2005. Instructional Technology and Media for Learning. Pearson/Merrill/Prentice Hall.

B.R. Snyder. 1971. The Hidden Curriculum. Borzoi Book. Knopf.

Milind S. Tullu, Sandeep B. Bavdekar, and Nirmala N. Rege. 2015. Educational (learning) objectives: Guide to teaching and learning. The Art of Teaching Medical Students E-Book, page 89.

Towe Wiiand. 1998. Examinationen i fokus: högskolestudenters lärande och examination: en litteraturöversikt [Examination in focus: university students' learning and examination: a literature review].



Teaching a Massive Open Online Course on Natural Language Processing

Ekaterina Artemova1,2∗, Murat Apishev1, Veronika Sarkisyan1, Sergey Aksenov1, Denis Kirjanov3, Oleg Serikov1,4

1 HSE University, Moscow, Russia
2 Huawei Noah's Ark lab, Moscow, Russia
3 SberDevices, Sberbank, Moscow, Russia
4 DeepPavlov, MIPT, Dolgoprudny, Russia

https://openedu.ru/course/hse/TEXT/

Abstract

This paper presents a new Massive Open Online Course on Natural Language Processing, targeted at non-English speaking students. The course lasts 12 weeks; every week consists of lectures, practical sessions, and quiz assignments. Three weeks out of 12 are followed by Kaggle-style coding assignments.

Our course intends to serve multiple purposes: (i) familiarize students with the core concepts and methods in NLP, such as language modeling or word and sentence representations; (ii) show that recent advances, including pre-trained Transformer-based models, are built upon these concepts; (iii) introduce architectures for the most in-demand real-life applications; (iv) develop practical skills to process texts in multiple languages. The course was prepared and recorded during 2020, launched by the end of the year, and received positive feedback in early 2021.

1 Introduction

The vast majority of recently developed online courses on Artificial Intelligence (AI), Natural Language Processing (NLP) included, are oriented towards English-speaking audiences. In non-English speaking countries, the audience of such courses is unfortunately quite limited, mainly due to the language barrier. Students who are not fluent in English find it difficult to cope with language issues and study at the same time; they thus face serious learning difficulties and lack the motivation to complete an online course. While creating new online courses in languages other than English may seem redundant and unprofitable, there are multiple reasons to support it. First, students may find it easier to comprehend new concepts and problems in their native language. Second, it may be easier to build a strong online learning community if students can express themselves fluently. Finally, and

∗Corresponding author, email: [email protected]

more specifically to NLP, an NLP course aimed at building practical skills should include language-specific tools and applications. Knowing how to use tools for English is essential to understand the core principles of the NLP pipeline, but it is of little use if the students go on to work on real-life applications in a non-English industry.

In this paper, we present an overview of an online course aimed at Russian-speaking students. This course was developed and run for the first time in 2020, achieving positive feedback. Our course is part of HSE University's online specialization on AI and builds upon previous courses in the specialization, which introduced core concepts in calculus, probability theory, and programming in Python. Outside of the specialization, the course can be used for additional training of students majoring in computer science or software engineering, and others who fulfill the prerequisites.

The main contributions of this paper are:

• We present the syllabus of a recent wide-scope massive open online course on NLP, aimed at a broad audience;

• We describe methodological choices made for teaching NLP to non-English speaking students;

• In this course, we combine recent deep learning trends with other best practices, such as topic modeling.

The remainder of the paper is organized as follows: Section 2 introduces the methodological choices made in the course design. Section 3 presents the course structure and topics in more detail. Section 4 lists the home assignments. Section 5 describes the hosting platform and its functionality.

2 Course overview

The course presented in this paper is split into two main parts of six weeks each, which cover (i) core


NLP concepts and approaches and (ii) main applications and more sophisticated problem formulations. The main goal of the first six weeks is to present different word and sentence representation methods, starting from bag-of-words, moving to word and sentence embeddings, and reaching contextualized word embeddings and pre-trained language models. Simultaneously, we introduce basic problem definitions: text classification, sequence labeling, and sequence-to-sequence transformation. The first part of the course roughly follows Yoav Goldberg's textbook (Goldberg, 2017), albeit extended with pre-training approaches and recent Transformer-based architectures.

The second part of the course introduces BERT-based models and NLP applications such as question answering, text summarization, and information extraction. This part adopts some of the explanations from the recent draft of "Speech and Language Processing" (Jurafsky and Martin, 2000). An entire week is devoted to topic modeling and BigARTM (Vorontsov et al., 2015), a tool for topic modeling developed at MIPT, one of the top Russian universities, and widely used in real-life applications. Overall, the practical sessions are aimed at developing text processing skills and practical coding skills.

Every week comprises both a lecture and a practical session. Lectures follow a "talking head" format, presenting slides and pre-recorded demos, while practical sessions are real-time coding sessions: the instructor writes code snippets in Jupyter notebooks and explains them at the same time. Overall, every week has 3-5 lecture videos and 2-3 practical session videos. Weeks 3, 5, and 9 are extended with coding assignments.

Weeks 7 and 9 are followed by interviews, in which one of the instructors talks to a leading specialist in the area. Tatyana Shavrina, one of the guests interviewed, leads an R&D team at Sber, one of the leading IT companies. The second guest, Konstantin Vorontsov, is a professor at one of the top universities. The guests are asked about their current projects and interests, their career paths, what keeps them inspired and motivated, and what kind of advice they can give.

The final mark is calculated according to the formula:

final mark = (# of accepted coding assignments) + 0.7 · mean(quiz assignment marks)

Coding assignments are evaluated on a binary scale (accepted or rejected), and quiz assignments are evaluated on a 10-point scale. To earn a certificate, a student has to earn at least 4 points.

In the practical sessions, we made a special effort to introduce tools developed for processing texts in Russian. The vast majority of examples used in lectures, problems attempted during practical sessions, and coding assignments utilize datasets in Russian. The same choice was made by Pavel Braslavski, who was the first to create an NLP course in Russian in 2017 (Braslavski, 2017). We utilize datasets in English only when Russian lacks freely available, non-commercial datasets of high quality for the same task.

Some topics are intentionally not covered in the course. We focus on written texts and do not approach text-to-speech or speech-to-text transformation. Low-resource languages spoken in Russia are out of scope, too. Besides, we almost entirely left out potentially controversial topics, such as AI ethics and green AI. Although we briefly touch upon potential biases in pre-trained language models, we have to leave out a large body of research in the area, mainly oriented towards the English language and US or European social problems. Besides, little has been explored about how neural models are affected by such biases and problems in Russia.

The team of instructors includes specialists with different backgrounds in computer science and theoretical linguistics. Three instructors worked on lectures, two instructors taught practical sessions, and three teaching assistants prepared home assignments and conducted question-answering sessions in the course forum.

3 Syllabus

Week 1. Introduction. The first introductory lecture consists of two parts. The first part overviews the core tasks and problems in NLP, presents the main industrial applications, such as search engines, Business Intelligence tools, and conversational engines, and draws a comparison between broadly-defined linguistics and NLP. To conclude this part, we touch upon recent trends that can be grasped easily without going deep into details, such as multi-modal applications (Zhou et al., 2020), cross-lingual methods (Feng et al., 2020; Conneau et al., 2020), and computational humor (Braslavski et al., 2018; West and


Horvitz, 2019). Throughout this part of the lecture, we try to show the duality of NLP systems: those aimed at understanding language (or speech) and those aimed at generating it. The most complex systems, used for machine translation, for example, aim at both. The second part of the lecture introduces basic concepts such as bag-of-words, count-based document vector representations, and tf-idf weighting. Finally, we explore bigram association measures, PMI and t-score. We point out that these techniques can be used to conduct an exploratory analysis of a given collection of texts and to prepare input for machine learning methods.

The practical session gives an overview of text preprocessing techniques and simple count-based text representation models. We emphasize how preprocessing pipelines can differ between languages such as English and Russian (for example, whether stemming or lemmatization is preferable) and give examples of Python frameworks designed to work with the Russian language (pymystem3 (Segalovich), pymorphy2 (Korobov, 2015)). We also include an introduction to regular expressions, because we find this knowledge instrumental both within and outside NLP tasks.
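To make the lemmatization step concrete, here is a minimal sketch of Russian lemmatization with pymorphy2, one of the frameworks named above (an illustrative usage, not the session's actual notebook code):

```python
# Minimal sketch: lemmatizing Russian tokens with pymorphy2.
# morph.parse() returns analyses ranked by probability; we take the top one.
import pymorphy2

morph = pymorphy2.MorphAnalyzer()

def lemmatize(tokens):
    return [morph.parse(token)[0].normal_form for token in tokens]

print(lemmatize(["столы", "бежали"]))  # lemmas, e.g. ['стол', 'бежать']
```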

During the first weeks, most participants are highly motivated, so we can afford to give them more practical material, but we still need to end with clear, close-to-life examples. We use a simple sentiment analysis task on Twitter data to demonstrate that even the first week's knowledge (together with an understanding of basic machine learning) allows participants to solve real-world problems, as sketched below. At the same time, we illustrate how particular text preprocessing steps can have a crucial impact on the model's outcome.
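A compressed sketch of such a first-week classifier: tf-idf features feeding a logistic regression, in the spirit of the session's sentiment example (the toy texts and labels here are ours, not the session's Twitter dataset):

```python
# Sketch: a first-week sentiment baseline with tf-idf + logistic regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["I love this!", "worst day ever", "so happy now", "terrible service"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["what a great day"]))  # expected: [1]
```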

Week 2. Word embeddings. The lecture introduces the concepts of distributional semantics and word vector representations. We familiarize the students with early models, which utilized singular value decomposition (SVD), and move towards more advanced word embedding models, such as word2vec (Mikolov et al., 2013) and fasttext (Bojanowski et al., 2017). We briefly touch upon the hierarchical softmax and the hashing trick and draw attention to negative sampling techniques. We show ways to compute word distance, including Euclidean distance and cosine similarity.

We discuss the difference between word2vec

Figure 1: One slide from Lecture 2. Difference between raw texts (top line), bag-of-words (middle line), and bag-of-vectors (bottom line). Background words: text, words, vectors.

and GloVe (Pennington et al., 2014) models and emphasize main issues, such as dealing with out-of-vocabulary (OOV) words and disregarding rich morphology. fasttext is then claimed to address these issues. To conclude, we present approaches for intrinsic and extrinsic evaluation of word embeddings. Fig. 1 explains the difference between bag-of-words and bag-of-vectors.

In the practical session, we explore only advanced word embedding models (word2vec, fasttext, and GloVe) and cover the three most common scenarios for working with such models: using pre-trained models, training models from scratch, and tuning pre-trained models. Through a few examples, we show that fasttext, as a character-level model, serves as a better word representation model for Russian and copes better with its rich morphology. We also demonstrate approaches to intrinsic evaluation of model quality, such as solving analogy tasks (like the well-known "king - man + woman = queen") and evaluating semantic similarity, as well as some useful techniques for visualizing the word embedding space.
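As an illustration of the analogy-style intrinsic evaluation mentioned above, the sketch below uses gensim's downloader with a pre-trained vector set (the model name is our assumption for the example; the session works with Russian models as well):

```python
# Sketch: the "king - man + woman ≈ queen" analogy with pre-trained vectors.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # downloads on first call
print(vectors.most_similar(positive=["king", "woman"],
                           negative=["man"], topn=1))
# e.g. [('queen', 0.77...)]
```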

This topic can be fascinating for students when supplemented with illustrative examples. Exploring visualizations of word clusters or solving analogies is a memorable part of the "classic" NLP portion of the course for most students.

Week 3. Text classification. The lecture covers core concepts of supervised learning. We begin by providing examples of text classification applications, such as sentiment classification and spam filtering. Multiple problem statements, such as binary, multi-class, and multi-label classification, are presented. To introduce ML algorithms, we start with logistic regression and move towards


neural methods for text classification. To this end, we introduce fasttext as an easy, out-of-the-box solution. We introduce the concept of sentence (paragraph) embeddings by presenting the doc2vec model (Le and Mikolov, 2014) and show how such embeddings can be used as input to the classification model. Next, we move towards more sophisticated techniques, including convolutional models for sentence classification (Kim, 2014). We do not discuss the backpropagation algorithm but refer students to the DL course of the specialization to refresh their understanding of neural network training. We show ways to collect annotated data on crowdsourcing platforms and to speed up the process using active learning (Esuli and Sebastiani, 2009). Finally, we conclude with text augmentation techniques, including SMOTE (Chawla et al., 2002) and EDA (Wei and Zou, 2019).

In the practical session, we continue working on text classification with the IMDb movie reviews dataset. We demonstrate several approaches to creating classification models with different word embeddings. We compare two ways to obtain a sentence embedding from any word embedding model: by averaging word vectors and by using tf-idf weights in a linear combination of word vectors (see the sketch below). We showcase the fasttext tool for text classification using its built-in classification algorithm.
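The two sentence-embedding strategies compared above can be sketched as a single helper (assumptions: word_vectors is a gensim KeyedVectors instance and idf_weights maps tokens to idf values from a fitted vectorizer; this is an illustration, not the session's exact code):

```python
# Sketch: sentence embedding as a (possibly tf-idf weighted) average of word vectors.
import numpy as np

def sentence_embedding(tokens, word_vectors, idf_weights=None):
    vecs, weights = [], []
    for token in tokens:
        if token in word_vectors:  # skip OOV tokens
            vecs.append(word_vectors[token])
            weights.append(idf_weights.get(token, 1.0) if idf_weights else 1.0)
    if not vecs:
        return np.zeros(word_vectors.vector_size)
    return np.average(np.array(vecs), axis=0, weights=weights)
```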

Additionally, we use the GloVe word embedding model to build a simple Convolutional Neural Network for text classification. In this week and all of the following, we use PyTorch1 as the framework for deep learning.

Week 4. Language modeling. The lecture focuses on the concept of language modeling. We start with early count-based models (Song and Croft, 1999) and create a link to Markov chains. We refer to the problem of OOV words and show the add-one smoothing method, avoiding more sophisticated techniques, such as Kneser-Ney smoothing (Kneser and Ney, 1995), for the sake of time. Next, we introduce neural language models. To this end, we first approach Bengio's language model (Bengio et al., 2003), which utilizes fully connected layers. Second, we present recurrent neural networks and show how they can be used for language modeling. Again, we remind the students of backpropagation through time and gradient vanishing or explosion, introduced earlier in

1https://pytorch.org/

the DL course. We claim that LSTM (Hochreiter and Schmidhuber, 1997) and GRU (Chung et al., 2014) cope with these problems. As a brief revision of the LSTM architecture is necessary, we utilize Christopher Olah's tutorial (Olah, 2015). We pay extra attention to the inner workings of the LSTM, following Andrej Karpathy's tutorial (Karpathy, 2015). To add some research flavor to the lecture, we talk about text generation (Sutskever et al., 2011), its applications, and different decoding strategies (Holtzman et al., 2019), including beam search and nucleus sampling. Lastly, we introduce the sequence labeling task (Ma and Hovy, 2016) for part-of-speech (POS) tagging and named entity recognition (NER) and show how RNNs can be utilized as sequence models for these tasks.

The practical session this week is divided into two parts. The first part is dedicated to language models for text generation. We experiment with count-based probabilistic models and RNNs to generate dinosaur names and get familiar with perplexity calculation (the task and the data were introduced in the Sequence Models course from DeepLearning.AI2). To bring things together, students are asked to make minor changes in the code and run it to answer some questions in the week's quiz assignment.
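For reference, a count-based bigram model with add-one smoothing and a perplexity computation can be sketched in a few lines (a toy illustration of the session's first part, not its actual code):

```python
# Sketch: bigram language model with add-one (Laplace) smoothing + perplexity.
import math
from collections import Counter

def train(corpus):
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def perplexity(sent, unigrams, bigrams):
    tokens = ["<s>"] + sent + ["</s>"]
    vocab_size = len(unigrams)
    log_prob = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        # Smoothed conditional probability P(word | prev).
        p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(tokens) - 1))

unigrams, bigrams = train([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(perplexity(["the", "cat", "sat"], unigrams, bigrams))
```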

The second part of the session demonstrates the application of RNNs to named entity recognition. We first introduce the BIO and BIOES annotation schemes and show frameworks with pre-trained NER models for the English (spaCy3) and Russian (Natasha4) languages. Further, we move on to the CNN-biLSTM-CRF architecture described in the lecture and test it on the CoNLL 2003 shared task data (Sang and De Meulder, 2003).
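A minimal example of the pre-trained NER usage shown for English might look as follows (assuming the en_core_web_sm model has been downloaded; this illustrates spaCy's documented API, not the session's exact code):

```python
# Sketch: tagging named entities with a pre-trained spaCy model.
# Requires: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple was founded by Steve Jobs in California.")
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('Apple', 'ORG'), ('Steve Jobs', 'PERSON'), ('California', 'GPE')]
```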

Week 5. Machine Translation. This lecture starts by referring to the common experience of using machine translation tools and gives a historical overview of the area. Next, the idea of the encoder-decoder (seq2seq) architecture opens the technical part of the lecture. We start with RNN-based seq2seq models (Sutskever et al., 2014) and introduce the concept of attention (Bahdanau et al., 2015). We show how attention maps can be used for "black box" interpretation. Next, we reveal the core architecture of modern NLP, namely, the

2https://www.coursera.org/learn/nlp-sequence-models

3https://spacy.io
4https://natasha.github.io


Transformer model (Vaswani et al., 2017), and explicitly ask the students to take this part seriously. Following Jay Alammar's tutorial (Alammar, 2015), we decompose the Transformer architecture and go through it step by step. In the last part of the lecture, we return to machine translation and introduce quality measures, such as WER and BLEU (Papineni et al., 2002), and touch upon human evaluation and the fact that BLEU correlates well with human judgments. Finally, we briefly discuss more advanced techniques, such as non-autoregressive models (Gu et al., 2017) and back translation (Hoang et al., 2018). Although we do not expect the students to comprehend these techniques immediately, we want to broaden their horizons so that they can think outside the box of supervised learning and autoregressive decoding.

In the first part of the practical session, we solve the following task: given a date in an arbitrary format, transform it to the standard format "dd-mm-yyyy" (for example, "18 Feb 2018", "18.02.2018", "18/02/2018" → "18-02-2018"). We adapt the code from the PyTorch machine translation tutorial5 to our task: we use the same RNN encoder and RNN decoder, as well as a modified decoder with an attention mechanism, and compare the quality of the two decoders. We also demonstrate how to visualize attention weights.

The second part is dedicated to the Transformer model and is based on the Harvard NLP tutorial (Klein et al., 2017) that decomposes the paper "Attention is All You Need" (Vaswani et al., 2017). Step by step, as in the lecture, we go through the Transformer code, trying to draw parallels with the simple encoder-decoder model from the first part. We describe and comment on every layer, paying special attention to the implementation of the attention layer, to masking, and to the shapes of embeddings and layers.
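The heart of that walkthrough, scaled dot-product attention, fits in a few lines of PyTorch; the sketch below is a simplified version (no multi-head splitting or dropout) in the spirit of the tutorial:

```python
# Sketch: scaled dot-product attention, the core of the Transformer layer.
import math
import torch
import torch.nn.functional as F

def attention(query, key, value, mask=None):
    # query, key, value: (batch, seq_len, d_k)
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)  # the attention map one can visualize
    return torch.matmul(weights, value), weights

q = k = v = torch.randn(1, 5, 64)
out, attn = attention(q, k, v)
print(out.shape, attn.shape)  # torch.Size([1, 5, 64]) torch.Size([1, 5, 5])
```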

Week 6. Sesame Street I. The sixth lecture and the next one are the most intense in the course. The paradigm of pre-trained language models is introduced in these two weeks. The first model discussed in detail is ELMo (Peters et al., 2018). Next, we move to BERT (Devlin et al., 2019) and introduce the masked language modeling and next sentence prediction objectives. While presenting BERT, we briefly revise the inner workings of

5https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html

Figure 2: To spice up the lectures, the lecturer is dressed in an ELMo costume.

Transformer blocks. We showcase three scenarios for fine-tuning BERT: (i) text classification using different pooling strategies ([CLS], max, or mean); (ii) sentence pair classification for paraphrase identification and natural language inference; (iii) named entity recognition. SQuAD-style question answering, at which BERT is aimed too, is avoided here, as another week is devoted to QA systems. Next, we move towards GPT-2 (Radford et al.) and elaborate on how high-quality text generation can be potentially harmful. To make the difference between BERT's and GPT-2's objectives clearer, we draw parallels with the Transformer architecture for machine translation and show that BERT is an encoder-style model, while GPT-2 is a decoder-style model. We show AllenNLP (Gardner et al., 2018) demos of how GPT-2 generates texts and how attention scores implicitly resolve coreference.

In this week, we rely heavily on Jay Alammar's (Alammar, 2015) tutorial and adopt some of its brilliant illustrations. One of the main problems arising this week, though, is the lack of Russian terminology, as the Russian-speaking community has not agreed on proper translations of terms such as "contextualized encoder" or "fine-tuning". To spice up this week, we dressed in Sesame Street kigurumis (see Fig. 2).

The main idea of the practical session is to demonstrate the ELMo and BERT models considered


earlier in the lecture. The session is divided into two parts, and in both parts we consider text classification, using the ELMo and BERT models, respectively.

In the first part, we demonstrate how to use ELMo word embeddings for text classification on the IMDb dataset used in previous sessions. We use pre-trained ELMo embeddings from the AllenNLP (Gardner et al., 2018) library and implement a simple recurrent neural network with a GRU layer on top for text classification. In the end, we compare the performance of this model with the scores obtained in previous sessions on the same dataset and demonstrate that using ELMo embeddings can improve model performance.

The second part of the session is focused on models based on the Transformer architecture. We use the huggingface-transformers library (Wolf et al., 2020) and a pre-trained BERT model to build a classification algorithm for Google Play application reviews written in English. We implement an entire pipeline of data preparation, use a pre-trained model, and demonstrate how to fine-tune it on the downstream task. Besides, we implement a wrapper for the BERT classification model to get predictions on new texts.
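Condensed to its essentials, the fine-tuning setup from this session looks roughly as follows (the checkpoint name and the toy batch are illustrative assumptions; the session works with Google Play reviews and a full training loop):

```python
# Sketch: one fine-tuning step of BERT for sequence classification.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

batch = tokenizer(["great app", "constant crashes"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

outputs = model(**batch, labels=labels)
outputs.loss.backward()  # in practice, wrap this in an optimizer loop
print(outputs.logits.shape)  # torch.Size([2, 2])
```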

Week 7. Sesame Street II. To continue diving into the pre-trained language model paradigm, the lecture first asks how to evaluate such models. We discuss some methods to interpret BERT's inner workings, sometimes referred to as BERTology (Rogers et al., 2021). We introduce a few common findings: BERT's lower layers account for surface features, lower-to-middle layers are responsible for morphology, while the upper-middle layers represent syntax best (Conneau and Kiela, 2018). We talk about ethical issues (May et al., 2019) caused by pre-training on raw web texts. We move towards the extrinsic evaluation of pre-trained models and familiarize the students with GLUE-style evaluations (Wang et al., 2019b,a). The next part of the lecture covers different improvements of BERT-like models. We show how different design choices may affect the model's performance in different tasks and present RoBERTa (Liu et al., 2019) and ALBERT (Lan et al., 2019) as members of the BERT-based family. We touch upon the computational inefficiency of pre-trained models and introduce lighter models, including DistilBERT (Sanh et al., 2019). To be thorough, we touch upon other techniques for compressing

pre-trained models, including pruning (Sajjad et al., 2020) and quantization (Zafrir et al., 2019), but do not expect the students to be able to implement these techniques immediately. We present the concept of language transfer and introduce multilingual Transformers, such as XLM-R (Conneau et al., 2020). Language transfer is becoming more and more crucial for non-English applications, and thus we draw more attention to it. Finally, we cover some basic multi-modal models aimed at image captioning and visual question answering, such as the unified Vision-Language Pre-training (VLP) model (Zhou et al., 2020).

In the practical session, we continue discussing the BERT-based models shown in the lectures. The session's main idea is to consider different tasks that can be solved by BERT-based models and to demonstrate different tools and approaches for solving them, so the practical session is divided into two parts. The first part is devoted to named entity recognition: we consider a pre-trained cross-lingual BERT-based NER model from the DeepPavlov library (Burtsev et al., 2018) and demonstrate how it can be used to extract named entities from Russian and English texts. The second part is focused on multilingual zero-shot classification: we consider the pre-trained XLM-based model from HuggingFace, discuss the approach's key ideas, and demonstrate how the model works by classifying short texts in English, Russian, Spanish, and French.
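A zero-shot setup of this kind can be sketched with the HuggingFace pipeline API (the XLM-R NLI checkpoint named below is an assumption for illustration; the session's exact model may differ):

```python
# Sketch: multilingual zero-shot classification via an NLI-trained XLM-R model.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="joeddav/xlm-roberta-large-xnli")
print(classifier("¿Quién ganará las elecciones?",  # Spanish input text
                 candidate_labels=["politics", "sports", "cooking"]))
```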

Week 8. Syntax parsing. The lecture is devoted to computational approaches to syntactic parsing and is structured as follows. After a brief introduction to the matter and its possible applications (both as an auxiliary task and an independent one), we consider the syntactic frameworks developed in linguistics: dependency grammar (Tesnière, 2015) and constituency grammar (Bloomfield, 1936). We then discuss only algorithms that deal with dependency parsing (mainly because there are no constituency parsers for Russian), so we turn to graph-based (McDonald et al., 2005) and transition-based (Aho and Ullman, 1972) dependency parsers and consider their logic, structure, variants, advantages, and drawbacks. Afterward, we familiarize students with the practical side of parsing: we introduce syntactically annotated corpora, the Universal Dependencies project (Nivre et al., 2016b), and some parsers that perform well for Russian (UDPipe (Straka and Straková, 2017),


DeepPavlov (Burtsev et al., 2018)). The last part of the lecture gives a brief overview of problems not covered in the previous parts: BERTology, some issues of parsing web texts, and the latest advances in computational syntax (like enhanced dependencies (Schuster and Manning, 2016)).

The practical session starts with a quick overview of the CoNLL-U annotation format (Nivre et al., 2016a): we show how to load, parse, and visualize such data using an example from the SynTagRus corpus6. Next, we learn to parse data with pre-trained UDPipe models (Straka et al., 2016) and the Russian-language framework Natasha. To demonstrate practical usage of syntax parsing, we first show how to extract subject-verb-object (SVO) triples and then design a simple template-based text summarization model.
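A compact illustration of both steps, reading CoNLL-U and pulling out an SVO triple, is sketched below with the conllu library (a deliberate simplification: real SVO extraction must handle passives, compounds, and non-verbal roots):

```python
# Sketch: parse a CoNLL-U sentence and extract a subject-verb-object triple.
from conllu import parse

rows = [  # ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
    ["1", "Мама", "мама", "NOUN", "_", "_", "2", "nsubj", "_", "_"],
    ["2", "мыла", "мыть", "VERB", "_", "_", "0", "root", "_", "_"],
    ["3", "раму", "рама", "NOUN", "_", "_", "2", "obj", "_", "_"],
]
data = "\n".join("\t".join(row) for row in rows) + "\n\n"

for sentence in parse(data):
    root = next(tok for tok in sentence if tok["deprel"] == "root")
    subj = next((t["form"] for t in sentence
                 if t["head"] == root["id"] and t["deprel"] == "nsubj"), None)
    obj = next((t["form"] for t in sentence
                if t["head"] == root["id"] and t["deprel"] == "obj"), None)
    print(subj, root["form"], obj)  # Мама мыла раму
```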

Week 9. Topic modeling. The focus of this lecture is topic modeling. First, we formulate the topic modeling problem and the ways it can be used to cluster texts or extract topics. We explain the basic probabilistic latent semantic analysis (PLSA) model (Hofmann, 1999), which modifies earlier approaches based on SVD (Dumais, 2004). We approach the PLSA problem using the Expectation-Maximization (EM) algorithm and introduce basic performance metrics, such as perplexity and topic coherence.

As the PLSA problem is ill-posed, we familiarize students with regularization techniques, using the Additive Regularization for Topic Modeling (ARTM) model (Vorontsov and Potapenko, 2015) as an example. We describe the general EM algorithm for ARTM and some basic regularizers. Then we move towards the Latent Dirichlet Allocation (LDA) model (Blei et al., 2003) and show that the maximum a posteriori estimation for LDA is a special case of the ARTM model with a smoothing or sparsing regularizer (see Fig. 3 for the explanation snippet). We conclude the lecture with a brief introduction to multi-modal ARTM models and show how to generalize different LDA-based Bayesian topic models. We showcase classification, word translation, and trend detection tasks as multi-modal models.

In the practical session, we consider the models discussed in the lecture in a slightly different order. First, we take a closer look at the Gensim implementation

6https://universaldependencies.org/treebanks/ru_syntagrus/index.html

Figure 3: One slide from Lecture 9. Sparsification of an ARTM model explained.

of the LDA model (Rehurek and Sojka, 2010), pick the model's optimal parameters in terms of perplexity and topic coherence, and visualize the model with the pyLDAvis library. Next, we explore the BigARTM (Vorontsov et al., 2015) library, particularly its LDA, PLSA, and multi-modal models, and the impact of different regularizers. For all experiments, we use a corpus of Russian-language news from Lenta.ru7, which allows us to compare the models to each other.
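The core of the Gensim workflow used here fits in a few lines; the sketch below builds a dictionary, a bag-of-words corpus, and an LDA model on a toy collection (the session itself uses the Lenta.ru news corpus):

```python
# Sketch: training and inspecting a small LDA model with Gensim.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["economy", "bank", "ruble"],
         ["match", "goal", "team"],
         ["bank", "credit", "economy"]]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)
for topic_id, top_words in lda.print_topics():
    print(topic_id, top_words)
```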

Week 10. In this lecture, we discuss monolingual seq2seq problems: text summarization and sentence simplification. We start with extractive summarization techniques. The first approach introduced is TextRank (Mihalcea and Tarau, 2004). We present each step of this approach and explain that any sentence or keyword embeddings can be used to construct the text graph required by the method; thus we refer the students back to earlier lectures, where sentence embeddings were discussed. Next, we move to abstractive summarization techniques. To this end, we present performance metrics, such as ROUGE (Lin, 2004) and METEOR (Banerjee and Lavie, 2005), and briefly overview pre-Transformer architectures, including Pointer networks (See et al., 2017). Next, we show recent pre-trained Transformer-based models, which aim at multi-task learning, including summarization. To this end, we discuss the pre-training approaches of T5 (Raffel et al., 2020) and BART (Lewis et al., 2020), and how they help to improve the performance of monolingual seq2seq tasks. Unfortunately, when this lecture was created, multilingual versions of these models were not available, so they are left out of scope. Finally, we talk about the sentence simplification task (Coster and Kauchak, 2011; Alva-Manchego et al., 2020) and its social impact. We present SARI (Xu et al., 2016) as a

7https://github.com/yutkin/Lenta.Ru-News-Dataset


metric for sentence simplification performance and explain how T5 or BART can be utilized for the task.

The practical session is devoted to extractive summarization and the TextRank algorithm. We are urged to stick to extractive summarization because Russian lacks annotated datasets, while, at the same time, the task is in demand in industry: extractive summarization is a compromise between the need for summarization techniques and the absence of training datasets. Nevertheless, we use annotated English datasets to show how performance metrics can be applied to the task. The CNN/DailyMail articles are used as an example of a summarization dataset; as there is no standard benchmark for text summarization in Russian, we have to use English to measure different models' performance. We implement the TextRank algorithm and compare it with the implementation from the NetworkX library (Hagberg et al., 2008). We also demonstrate how to estimate summarization performance by calculating the ROUGE metric for the resulting algorithm using the PyRouge library8. This practical session allows us to refer the students back to sentence embedding models and showcase another application of sentence vectors.
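The TextRank idea itself reduces to PageRank over a sentence-similarity graph; the sketch below uses tf-idf cosine similarity as a stand-in for the sentence embeddings discussed earlier (an illustration, not the session's implementation):

```python
# Sketch: TextRank-style extractive summarization with NetworkX PageRank.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["The cat sat on the mat.",
             "Dogs chase cats in the yard.",
             "The mat was red."]

tfidf = TfidfVectorizer().fit_transform(sentences)
similarity = cosine_similarity(tfidf)  # sentence-to-sentence edge weights

graph = nx.from_numpy_array(similarity)
scores = nx.pagerank(graph)

best = max(scores, key=scores.get)     # top-ranked sentence
print(sentences[best])                 # one-line extractive "summary"
```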

Week 11. The penultimate lecture approaches Question Answering (QA) systems and chat-bot technologies. We present multiple real-life industrial applications where chat-bot and QA technologies are used, ranging from simple task-oriented chat-bots for food ordering to help desk or hotline automation. Next, we formulate the core problems of task-oriented chat-bots, which are intent classification and slot filling (Liu and Lane, 2016), and revise methods to approach them. After that, we introduce the concept of a dialog scenario graph and show how such a graph can guide users to complete their requests. Without going deep into technical details, we show how ready-made solutions, such as Google Dialogflow9, can be used to create task-oriented chat-bots. Next, we move towards QA models, of which we pay most attention to information retrieval-based (IR-based) approaches and SQuAD-style (Rajpurkar et al., 2016) approaches. Since natural language generation models are not mature enough (at least for Russian) to be used in free dialog, we explain

8https://github.com/andersjo/pyrouge
9https://cloud.google.com/dialogflow

how IR-based techniques imitate a conversation with a user. Finally, we show how BERT can be used to tackle the SQuAD problem. The lecture concludes by comparing industrial dialog assistants created by Russian companies, such as Yandex.Alisa and Mail.ru Marusya.

In the practical session, we demonstrate several examples of using Transformer-based models for the QA task. First, we fine-tune the ELECTRA model (Clark et al., 2020) on a COVID-19 questions dataset10 and BERT on SQuAD 2.0 (Rajpurkar et al., 2018) (we use code from the HuggingFace tutorial11

for the latter). Next, we show an example of using a pre-trained model for Russian-language data from the DeepPavlov project. Finally, we explore how to use BERT for the joint intent classification and slot filling task (Chen et al., 2019).
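For comparison, once such a model is fine-tuned, SQuAD-style extraction is a few lines with the HuggingFace pipeline (shown here with the library's default QA checkpoint as an assumption; the session fine-tunes ELECTRA and BERT directly):

```python
# Sketch: extractive question answering with a ready-made pipeline.
from transformers import pipeline

qa = pipeline("question-answering")  # downloads a default SQuAD-tuned model
answer = qa(question="Where is the course hosted?",
            context="The course is hosted on the OpenEdu platform.")
print(answer["answer"])  # e.g. "the OpenEdu platform"
```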

Week 12. The last lecture wraps up the course by discussing knowledge graphs (KGs) and some of their applications to QA systems. We revise core information extraction problems, such as NER and relation detection, and show how they can be used to extract a knowledge graph from unstructured texts (Paulheim, 2017). We touch upon the entity linking problem but do not go deep into detail. To offer the students an alternative view of information extraction, we present machine reading comprehension approaches for NER (Li et al., 2019a) and relation detection (Li et al., 2019b), referring to the previous lecture. Finally, we close the course by revising all the topics covered. We recite the evolution of text representation models from bag-of-words to BERT. We show that all the problems discussed throughout the course fall into one of three categories: (i) text classification or sentence pair classification, (ii) sequence tagging, (iii) sequence-to-sequence transformation. We draw attention to the fact that the most recent models can tackle all of these problem categories. Last but not least, we revise how all of these problem statements are utilized in real-life applications.

The practical session this week is dedicated to information extraction tasks with the Stanford CoreNLP library (Manning et al., 2014). The session's main idea is to demonstrate using the tool to construct knowledge graphs from natural text. We consider different ways of using the

10https://github.com/xhlulu/covid-qa
11https://huggingface.co/transformers/custom_datasets.html#question-answering-with-squad-2-0


library and experiment with using it to solve different NLP tasks already considered in the course: tokenization, lemmatization, POS tagging, and dependency parsing. The library includes models for 53 languages, so we consider examples of solving these tasks for English and Russian texts. Besides, relation extraction is considered using the Open Information Extraction (OpenIE) module from the CoreNLP library.

4 Home assignments

The course consists of multiple ungraded quiz assignments, 11 graded quiz assignments, and three graded coding assignments. Grading is performed automatically in a Kaggle-like fashion.

4.1 Quiz Assignments

Every video lecture is followed by an ungraded quiz consisting of 1-2 questions. A typical question addresses the core concepts introduced:

• What kind of vectors are more common for word embedding models?
A1: dense (true), A2: sparse (false)

• What kind of layers are essential for the GPT-2 model?
A1: transformer stacks (true), A2: recurrent layers (false), A3: convolutional layers (false), A4: dense layers (false)

A graded test is conducted every week except the very last one. It consists of 12-15 questions, which we try to split into three parts of roughly equal complexity. The first part asks about the main concepts and ideas introduced during the week. These questions are a bit more complicated than the ones following each video:

• What part of an encoder-decoder model solves the language modeling problem, i.e., next word prediction?
A1: encoder (false), A2: decoder (true)

• What are the units of the BPE algorithm?
A1: syllables (false), A2: morphemes (false), A3: n-grams (true), A4: words (false)

The second part of the quiz asks the students to conduct simple computations by hand:

• Given a detailed description of a neural architecture, compute the number of parameters;

• Given a gold-standard NER annotation and a system output, compute token-based and span-based micro F1.

The third part of the quiz asks students to complete a simple programming assignment or asks about the code presented in practical sessions:

• Given a pre-trained language model, compute the perplexity of a test sentence;

• Does the DeepPavlov cross-lingual NER model require announcing the language of the input text?

For convenience and to avoid format ambiguity, all questions are in multiple-choice format. For questions requiring a numerical answer, we provide answer options in the form of intervals, with one of the endpoints excluded.

Each quiz is graded on a 10-point scale. All questions have equal weight.

The final week is followed by a comprehensive quiz covering all the topics studied. This quiz is obligatory for students who wish to earn a certificate.

4.2 Coding assignments

There are three coding assignments, covering the following topics: (i) text classification, (ii) sequence labeling, (iii) topic modeling. Assignment grading is binary. The text classification and sequence labeling assignments require students to beat the score of a provided baseline submission. The topic modeling assignment is evaluated differently.

All the coding tasks provide students with starter code and sample submission bundles. The number of submissions per student is limited. Sample submission bundles illustrate the required submission format and can serve as a random baseline for each task. Submissions are evaluated using the Moodle12 (Dougiamas and Taylor, 2003) CodeRunner13 (Lobb and Harlow, 2016) plugin.

4.2.1 Text classification and sequence labeling coding assignments

The text classification assignment is based on the Harry Potter and the Action Prediction Challenge from Natural Language dataset (Vilares and Gómez-Rodríguez, 2019), which uses fantasy fiction texts. Here, the task is the following: given

12https://moodle.org/
13https://coderunner.org.nz/


some text preceding a spell occurrence in the text, predict the spell's name. Students are provided with starter code in Jupyter notebooks (Pérez and Granger, 2007). The starter code implements all the needed data pre-processing, shows how to implement the baseline Logistic Regression model, and provides the code needed to generate the submission.

The students' goal is to build three different models performing better than the baseline. The first one should differ from the baseline model only in hyperparameter values. The second one should be a Gradient Boosting model. The third model is a CNN. All three models' predictions on the provided test dataset should then be submitted to the scoring system. Submissions where all the models beat the baseline's classification F1-score are graded positively.

Sequence labeling. The sequence labeling assignment is based on the LitBank data (Bamman et al., 2019). Here, the task is, given fiction texts, to perform NER labeling. Students are provided with starter code for data pre-processing and submission packaging. The starter code also illustrates building a recurrent neural model with the PyTorch framework, showing how to compose a single-layer unidirectional RNN model.

The students' goal is to build a bidirectional LSTM model that outperforms the baseline. Submissions are evaluated on a held-out test subset provided by the course team.

4.2.2 Topic modeling assignment

The motivation of the topic modeling assignment is to give students practical experience with the LDA (Blei et al., 2003) algorithm. The assignment is organized as follows: first, students have to download and pre-process Wikipedia texts.

Then the following experiment should be conducted: training and exploring an LDA model for the given collection of texts. The task is to build several LDA models for the given data, differing only in the configured number of topics. Students are asked to explore the obtained models using the pyLDAvis (Sievert and Shirley, 2014) tool; this stage is not evaluated. Finally, students are asked to submit the topic labels that the LDA models assign to words provided by the course team. Such a prediction should be performed for each of the obtained models.

5 Platform description

The course is hosted on OpenEdu14, an educational platform created by the Association "National Platform for Open Education", established by leading Russian universities. Our course, like all courses on the platform, is available free of charge, so everyone can access all materials (including videos, practical Jupyter notebooks, tests, and coding assessments). The platform also provides a forum where course participants can ask questions or discuss the material with each other and with the lecturers.

6 Expected outcomes

First of all, we expect the students to understand the basic formulations of NLP tasks, such as text classification, sentence pair modeling, sequence tagging, and sequence-to-sequence transformation. We expect the students to be able to recall core terminology and use it fluently. In some weeks, we provide links to extra materials, mainly in English, so that the students can learn more about the topic themselves; we hope that after completing the course, the students become able to read those materials. Secondly, we anticipate that after completing the course, the students are comfortable using popular Python tools to process texts in Russian and English and to utilize pre-trained models. Thirdly, we hope that the students can state and approach their own NLP-related tasks, using the knowledge acquired, conducting experiments, and evaluating the results correctly.

7 Feedback

The early feedback we have received so far is positive. Although the course has only recently been advertised to a broader audience, we know of two groups interested in it. First, some students come to study of their own will. Secondly, selected topics were used in offline courses in an inverted classroom format or as additional materials. The students note that our course is a good starting point for studying NLP and helps them navigate a broad range of topics and learn the terminology. Some of the students note that it was easy for them to learn in Russian, and now, as they feel more comfortable with the core concepts, they can turn to more detailed and more recent sources. Unfortunately, programming assignments turn out to be our weak

14https://npoed.ru/


Figure 4: The results of a survey among course participants. Left: current educational level. Right: professional area.

spot, as they are challenging to complete and little feedback on them can be provided.

We ask all participants to fill in a short survey after they enroll in the course. So far, we have received about 100 responses. According to the results, most students (78%) have previously taken online courses, but only 24% of them have experience with courses from foreign universities. The average age of course participants is 32 years; most of them already have or are completing a higher education degree (see Fig. 4 for more details). Almost half of the students work in the Computer Science area, 20% have a background in the Humanities, followed by Engineering Science (16%).

We also ask students about their motivation in the form of a multiple-choice question: almost half of them (46%) stated that they want to improve their qualifications, either to get better at their current job (33%) or to change their occupation (13%), and 20% answered that they enrolled in the course for research and academic purposes. For the vast majority of the students, the reputation of HSE University was the key factor in selecting this course among the others available.

8 Conclusion

This paper introduced and described a new massive open online course on Natural Language Processing targeted at Russian-speaking students. This twelve-week course was designed and recorded during 2020 and launched by the end of the year. In the lectures and practical sessions, we managed to document a paradigm shift caused by the discovery and widespread use of pre-trained Transformer-based language models. We inherited the best of two worlds, showing how to utilize both static word embeddings in a more traditional machine learning setup and contextualized word embeddings in the

most recent fashion. The course's theoretical outcome is understanding and knowing core concepts and problem formulations, while the practical outcome covers knowing how to use tools to process texts in Russian and English.

The early feedback we got from the students is positive. As every week was devoted to a new topic, they did not find it difficult to stay engaged. The way we introduce the core problem formulations and showcase different tools to process texts in Russian earned approval. What is more, to the best of our knowledge, the presented course is now used as supplementary material in a few offline educational programs.

Further improvements and adjustments that could be made to the course include new home assignments related to machine translation or monolingual sequence-to-sequence tasks, and the development of additional materials in written form to support the mathematical calculations avoided in the video lectures for the sake of time.

References

Alfred V. Aho and Jeffrey D. Ullman. 1972. The Theory of Parsing, Translation, and Compiling.

Jay Alammar. 2015. The illustrated transformer. Jay Alammar blog.

Fernando Alva-Manchego, Carolina Scarton, and Lucia Specia. 2020. Data-driven sentence simplification: Survey and benchmark. Computational Linguistics, 46(1):135–187.

Dzmitry Bahdanau, Kyung Hyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015.


David Bamman, Sejal Popat, and Sheng Shen. 2019. An annotated dataset of literary entities. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2138–2144.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022.

Leonard Bloomfield. 1936. Language or ideas? Language, pages 89–95.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Pavel Braslavski. 2017. NLP – how will it be in Russian? Habr blog.

Pavel Braslavski, Vladislav Blinov, Valeria Bolotova, and Katya Pertsova. 2018. How to evaluate humorous response generation, seriously? In Proceedings of the 2018 Conference on Human Information Interaction & Retrieval, pages 225–228.

Mikhail Burtsev, Alexander Seliverstov, Rafael Airapetyan, Mikhail Arkhipov, Dilyara Baymurzina, Nickolay Bushkov, Olga Gureenkova, Taras Khakhulin, Yurii Kuratov, Denis Kuznetsov, et al. 2018. DeepPavlov: Open-source library for dialogue systems. In Proceedings of ACL 2018, System Demonstrations, pages 122–127.

Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357.

Qian Chen, Zhu Zhuo, and Wen Wang. 2019. BERT for joint intent classification and slot filling. CoRR, abs/1902.10909.

Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In ICLR.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Édouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451.

Alexis Conneau and Douwe Kiela. 2018. SentEval: An evaluation toolkit for universal sentence representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

William Coster and David Kauchak. 2011. Simple English Wikipedia: A new text simplification task. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 665–669.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Martin Dougiamas and Peter Taylor. 2003. Moodle: Using learning communities to create an open source course management system.

Susan T. Dumais. 2004. Latent semantic analysis. Annual Review of Information Science and Technology, 38(1):188–230.

Andrea Esuli and Fabrizio Sebastiani. 2009. Active learning strategies for multi-label text classification. In European Conference on Information Retrieval, pages 102–113. Springer.

Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2020. Language-agnostic BERT sentence embedding. arXiv preprint arXiv:2007.01852.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1–6.

Yoav Goldberg. 2017. Neural Network Methods for Natural Language Processing. Synthesis Lectures on Human Language Technologies, 10(1):1–309.

Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, and Richard Socher. 2017. Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281.


Aric Hagberg, Pieter Swart, and Daniel S. Chult. 2008. Exploring network structure, dynamics, and function using NetworkX. Technical report, Los Alamos National Lab. (LANL), Los Alamos, NM (United States).

Vu Cong Duy Hoang, Philipp Koehn, Gholamreza Haffari, and Trevor Cohn. 2018. Iterative back-translation for neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 18–24.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

T. Hofmann. 1999. Probabilistic latent semantic analysis. In Proc. Conf. on Uncertainty in Artificial Intelligence (UAI), 1999, pages 289–296.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. In International Conference on Learning Representations.

Daniel Jurafsky and James H. Martin. 2000. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition.

Andrej Karpathy. 2015. The unreasonable effectiveness of recurrent neural networks. Andrej Karpathy blog, 21:23.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. CoRR, abs/1408.5882.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proc. ACL.

Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In 1995 International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 181–184. IEEE.

Mikhail Korobov. 2015. Morphological analyzer and generator for Russian and Ukrainian languages. In International Conference on Analysis of Images, Social Networks and Texts, pages 320–332. Springer.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196. PMLR.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880.

Xiaoya Li, Jingrong Feng, Yuxian Meng, Qinghong Han, Fei Wu, and Jiwei Li. 2019a. A unified MRC framework for named entity recognition. arXiv preprint arXiv:1910.11476.

Xiaoya Li, Fan Yin, Zijun Sun, Xiayu Li, Arianna Yuan, Duo Chai, Mingxin Zhou, and Jiwei Li. 2019b. Entity-relation extraction as multi-turn question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1340–1350.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.

Bing Liu and Ian Lane. 2016. Attention-based recurrent neural network models for joint intent detection and slot filling. Interspeech 2016, pages 685–689.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Richard Lobb and Jenny Harlow. 2016. CodeRunner: A tool for assessing computer programming skills. ACM Inroads, 7(1):47–51.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074.

Christopher D Manning, Mihai Surdeanu, John Bauer,Jenny Rose Finkel, Steven Bethard, and David Mc-Closky. 2014. The stanford corenlp natural languageprocessing toolkit. In Proceedings of 52nd annualmeeting of the association for computational linguis-tics: system demonstrations, pages 55–60.

Chandler May, Alex Wang, Shikha Bordia, SamuelBowman, and Rachel Rudinger. 2019. On measur-ing social biases in sentence encoders. In Proceed-ings of the 2019 Conference of the North AmericanChapter of the Association for Computational Lin-guistics: Human Language Technologies, Volume 1(Long and Short Papers), pages 622–628.

Ryan McDonald, Koby Crammer, and FernandoPereira. 2005. Online large-margin training of de-pendency parsers. In Proceedings of the 43rd An-nual Meeting of the Association for ComputationalLinguistics (ACL’05), pages 91–98.

Rada Mihalcea and Paul Tarau. 2004. Textrank: Bring-ing order into text. In Proceedings of the 2004 con-ference on empirical methods in natural languageprocessing, pages 404–411.

25

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Cor-rado, and Jeffrey Dean. 2013. Distributed represen-tations of words and phrases and their composition-ality. In Proceedings of the 26th International Con-ference on Neural Information Processing Systems-Volume 2, pages 3111–3119.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Gin-ter, Yoav Goldberg, Jan Hajic, Christopher D. Man-ning, Ryan McDonald, Slav Petrov, Sampo Pyysalo,Natalia Silveira, Reut Tsarfaty, and Daniel Zeman.2016a. Universal Dependencies v1: A multilingualtreebank collection. In Proceedings of the Tenth In-ternational Conference on Language Resources andEvaluation (LREC’16), pages 1659–1666, Portorož,Slovenia. European Language Resources Associa-tion (ELRA).

Joakim Nivre, Marie-Catherine De Marneffe, Filip Gin-ter, Yoav Goldberg, Jan Hajic, Christopher D Man-ning, Ryan McDonald, Slav Petrov, Sampo Pyysalo,Natalia Silveira, et al. 2016b. Universal dependen-cies v1: A multilingual treebank collection. InProceedings of the Tenth International Conferenceon Language Resources and Evaluation (LREC’16),pages 1659–1666.

Christopher Olah. 2015. Understanding lstm networks.Christopher Olah blog.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic eval-uation of machine translation. In Proceedings of the40th annual meeting of the Association for Compu-tational Linguistics, pages 311–318.

Heiko Paulheim. 2017. Knowledge graph refinement:A survey of approaches and evaluation methods. Se-mantic web, 8(3):489–508.

Jeffrey Pennington, Richard Socher, and Christopher DManning. 2014. Glove: Global vectors for word rep-resentation. In Proceedings of the 2014 conferenceon empirical methods in natural language process-ing (EMNLP), pages 1532–1543.

Fernando Pérez and Brian E. Granger. 2007. IPython:a system for interactive scientific computing. Com-puting in Science and Engineering, 9(3):21–29.

Matthew E Peters, Mark Neumann, Mohit Iyyer, MattGardner, Christopher Clark, Kenton Lee, and LukeZettlemoyer. 2018. Deep contextualized word rep-resentations. In Proceedings of NAACL-HLT, pages2227–2237.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan,Dario Amodei, and Ilya Sutskever. Language mod-els are unsupervised multitask learners.

Colin Raffel, Noam Shazeer, Adam Roberts, KatherineLee, Sharan Narang, Michael Matena, Yanqi Zhou,Wei Li, and Peter J Liu. 2020. Exploring the lim-its of transfer learning with a unified text-to-texttransformer. Journal of Machine Learning Research,21:1–67.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018.Know what you don’t know: Unanswerable ques-tions for squad. CoRR, abs/1806.03822.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, andPercy Liang. 2016. Squad: 100,000+ questions formachine comprehension of text. In Proceedings ofthe 2016 Conference on Empirical Methods in Natu-ral Language Processing, pages 2383–2392.

Radim Rehurek and Petr Sojka. 2010. Software Frame-work for Topic Modelling with Large Corpora. InProceedings of the LREC 2010 Workshop on NewChallenges for NLP Frameworks, pages 45–50, Val-letta, Malta. ELRA. http://is.muni.cz/publication/884893/en.

Anna Rogers, Olga Kovaleva, and Anna Rumshisky.2021. A primer in bertology: What we know abouthow bert works. Transactions of the Association forComputational Linguistics, 8:842–866.

Hassan Sajjad, Fahim Dalvi, Nadir Durrani, andPreslav Nakov. 2020. Poor man’s bert: Smallerand faster transformer models. arXiv preprintarXiv:2004.03844.

Erik Tjong Kim Sang and Fien De Meulder. 2003. In-troduction to the conll-2003 shared task: Language-independent named entity recognition. In Proceed-ings of the Seventh Conference on Natural LanguageLearning at HLT-NAACL 2003, pages 142–147.

Victor Sanh, Lysandre Debut, Julien Chaumond, andThomas Wolf. 2019. Distilbert, a distilled versionof bert: smaller, faster, cheaper and lighter. arXivpreprint arXiv:1910.01108.

Sebastian Schuster and Christopher D Manning. 2016.Enhanced english universal dependencies: An im-proved representation for natural language under-standing tasks. In Proceedings of the Tenth Interna-tional Conference on Language Resources and Eval-uation (LREC’16), pages 2371–2378.

Abigail See, Peter J Liu, and Christopher D Manning.2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th An-nual Meeting of the Association for ComputationalLinguistics (Volume 1: Long Papers), pages 1073–1083.

Ilya Segalovich. A fast morphological algorithm withunknown word guessing induced by a dictionary fora web search engine.

Carson Sievert and Kenneth Shirley. 2014. Ldavis:A method for visualizing and interpreting topics.In Proceedings of the workshop on interactive lan-guage learning, visualization, and interfaces, pages63–70.

Fei Song and W Bruce Croft. 1999. A general languagemodel for information retrieval. In Proceedings ofthe eighth international conference on Informationand knowledge management, pages 316–321.

26

Milan Straka, Jan Hajic, and Jana Straková. 2016. UD-Pipe: Trainable pipeline for processing CoNLL-Ufiles performing tokenization, morphological analy-sis, POS tagging and parsing. In Proceedings ofthe Tenth International Conference on Language Re-sources and Evaluation (LREC’16), pages 4290–4297, Portorož, Slovenia. European Language Re-sources Association (ELRA).

Milan Straka and Jana Straková. 2017. Tokenizing,pos tagging, lemmatizing and parsing ud 2.0 withudpipe. In Proceedings of the CoNLL 2017 SharedTask: Multilingual Parsing from Raw Text to Univer-sal Dependencies, pages 88–99.

Ilya Sutskever, James Martens, and Geoffrey E Hin-ton. 2011. Generating text with recurrent neural net-works. In ICML.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014.Sequence to sequence learning with neural networks.Advances in Neural Information Processing Systems,27:3104–3112.

Lucien Tesnière. 2015. Elements of Structural Syntax.John Benjamins Publishing Company.

Ashish Vaswani, Noam Shazeer, Niki Parmar, JakobUszkoreit, Llion Jones, Aidan N Gomez, ŁukaszKaiser, and Illia Polosukhin. 2017. Attention is allyou need. In Proceedings of the 31st InternationalConference on Neural Information Processing Sys-tems, pages 6000–6010.

David Vilares and Carlos Gómez-Rodríguez. 2019.Harry Potter and the action prediction challengefrom natural language. In Proceedings of the 2019Conference of the North American Chapter of theAssociation for Computational Linguistics: HumanLanguage Technologies, Volume 1 (Long and ShortPapers), pages 2124–2130, Minneapolis, Minnesota.Association for Computational Linguistics.

Konstantin Vorontsov, Oleksandr Frei, Murat Apishev,Peter Romov, and Marina Dudarenko. 2015. Bi-gartm: Open source library for regularized multi-modal topic modeling of large collections. In Inter-national Conference on Analysis of Images, SocialNetworks and Texts, pages 370–381. Springer.

Konstantin Vorontsov and Anna Potapenko. 2015. Ad-ditive regularization of topic models. MachineLearning, 101(1):303–323.

Alex Wang, Yada Pruksachatkun, Nikita Nangia,Amanpreet Singh, Julian Michael, Felix Hill, OmerLevy, and Samuel R Bowman. 2019a. Superglue:A stickier benchmark for general-purpose languageunderstanding systems. Advances in Neural Infor-mation Processing Systems, 32.

Alex Wang, Amanpreet Singh, Julian Michael, FelixHill, Omer Levy, and Samuel R Bowman. 2019b.Glue: A multi-task benchmark and analysis platformfor natural language understanding. In 7th Inter-national Conference on Learning Representations,ICLR 2019.

Jason Wei and Kai Zou. 2019. Eda: Easy data augmen-tation techniques for boosting performance on textclassification tasks. In Proceedings of the 2019 Con-ference on Empirical Methods in Natural LanguageProcessing and the 9th International Joint Confer-ence on Natural Language Processing (EMNLP-IJCNLP), pages 6383–6389.

Robert West and Eric Horvitz. 2019. Reverse-engineering satire, or “paper on computational hu-mor accepted despite making serious advances”. InProceedings of the AAAI Conference on Artificial In-telligence, volume 33, pages 7265–7272.

Thomas Wolf, Lysandre Debut, Victor Sanh, JulienChaumond, Clement Delangue, Anthony Moi, Pier-ric Cistac, Tim Rault, Rémi Louf, Morgan Funtow-icz, Joe Davison, Sam Shleifer, Patrick von Platen,Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu,Teven Le Scao, Sylvain Gugger, Mariama Drame,Quentin Lhoest, and Alexander M. Rush. 2020.Transformers: State-of-the-art natural language pro-cessing. In Proceedings of the 2020 Conference onEmpirical Methods in Natural Language Processing:System Demonstrations, pages 38–45, Online. Asso-ciation for Computational Linguistics.

Wei Xu, Courtney Napoles, Ellie Pavlick, QuanzeChen, and Chris Callison-Burch. 2016. Optimizingstatistical machine translation for text simplification.Transactions of the Association for ComputationalLinguistics, 4:401–415.

Ofir Zafrir, Guy Boudoukh, Peter Izsak, and MosheWasserblat. 2019. Q8bert: Quantized 8bit bert.arXiv preprint arXiv:1910.06188.

Luowei Zhou, Hamid Palangi, Lei Zhang, HoudongHu, Jason Corso, and Jianfeng Gao. 2020. UnifiedVision-language Pre-training for Image Captioningand VQA. In Proceedings of the AAAI Conferenceon Artificial Intelligence, volume 34, pages 13041–13049.

27

Proceedings of the Fifth Workshop on Teaching NLP, pages 28–33, June 10–11, 2021. ©2021 Association for Computational Linguistics

Natural Language Processing 4 All (NLP4All): A New Online Platform for Teaching and Learning NLP Concepts

Anonymous NAACL-HLT 2021 submission

Abstract

Natural Language Processing offers new insights into language data across almost all disciplines and domains, and allows us to corroborate and/or challenge existing knowledge. The primary hurdles to widening participation in and use of these new research tools are, first, a lack of coding skills in students across K-16 and in the population at large, and, second, a lack of knowledge of how NLP methods can be used to answer questions of disciplinary interest outside of linguistics and/or computer science. To broaden participation in NLP and improve NLP literacy, we introduce a new web-based tool called Natural Language Processing 4 All (NLP4All). The intended purpose of NLP4All is to help teachers facilitate learning with and about NLP, by providing easy-to-use interfaces to NLP methods, data, and analyses, making it possible for non- and novice-programmers to learn NLP concepts interactively.

1 Introduction

An emerging body of work has explored ways of lowering the threshold for people to work with AI and ML technologies, specifically in educational contexts. Much of this work has focused on making AI "explainable" (Gunning, 2017) by visualizing the underlying math, or visualizing how machines make decisions. However, NLP has been largely absent from these efforts so far. To address this gap, we developed a new educational tool called Natural Language Processing 4 All (NLP4All), which allows teachers to interactively introduce applications of statistical NLP to students without any coding skills.

NLP4All¹ is a web-based interface for teaching and learning NLP concepts, designed with flexibility and accessibility in mind. It is an open source application built in Python on top of the Flask framework, and can therefore be easily extended with existing Python-based NLP and ML packages. The first prototype of NLP4All is designed to work with tweets only, but we are currently expanding it to work with any kind of text, bringing NLP tools to a wider array of disciplines and student populations.

We first present the design of the tool and its current capabilities, and then briefly describe two different real-world settings in which we have used NLP4All.

¹ A demo version of NLP4All can be accessed here: http://86.52.121.12:5000/, pre-loaded with the data and analysis described in this paper.

2 Teaching text classification with NLP4All

NLP4All provides two different user types: teachers and students. Whereas teachers can see data from all students and can create new projects, students' activities are more limited in the system.

The system is organized into user groups, projects, and analyses. To better describe how the system works, we will briefly outline each of these.

2.1 User Groups

User groups provide an easy way to organize groups of students. A group consists of students who are doing the same activities, and simply acts as an easy way to add students to projects. Groups will typically consist of students in one class. User groups can be created by a teacher and associated with a unique sign-up link to be distributed to the intended recipient group.

2.2 Projects

Projects in NLP4All offer teachers a way to organize a lesson by selecting some texts of interest and tying them to a user group. A project consists of a title, a description, a user group (of students), and a set of corpora that will be included in the project. Teachers create projects with associated datasets prior to classroom sessions; they can either choose to use several pre-loaded datasets (tweets from the accounts of different American and Danish politicians or political parties) or upload their own texts in .csv or .json format.

Figure 1: Teacher view displaying Projects interface.
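As a concrete illustration of an upload, here is a plausible minimal .json dataset. The field names ("text", "category") are our assumption for illustration only; the paper does not document NLP4All's upload schema.

    # Hypothetical upload format: "text" and "category" are assumed
    # field names, not a documented NLP4All schema.
    import json

    records = [
        {"text": "Let's win this together!", "category": "Joe Biden"},
        {"text": "Hold the crooks on Wall Street accountable.", "category": "Bernie Sanders"},
    ]
    with open("tweets.json", "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)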

2.3 Analyses

Inside a project, both teachers and students can create a new analysis. An analysis is NLP4All's name for an untrained model and all data associated with training it. Students can create a new analysis if they think that they have trained their old model poorly and want to start from scratch. Students can only create personal analyses, i.e. analyses that are unique to their account and not shared with other members of the user group. Teachers, in contrast, can create analyses that are shared between all members of a user group.

There are two different ways in which an analysis can be shared (Fig. 2): the teacher can choose to share just the texts that students hand-label, or they can choose to share an underlying model. In the former case, the teacher can specify a number of texts from each category, and NLP4All will create a mini-corpus of just those texts for students to work with. In the latter case, students work with the whole corpus of texts present in the project, but all train the same underlying model as they hand-label texts.

NLP4All also supports text annotation, but we do not discuss this functionality here.

3 Example: Teaching classification with NLP4All

The initial version of NLP4All features interactive tools for a curriculum module on text classification with Naïve Bayes and Logistic Regression models. In this section, we walk through an example of how NLP4All could be used in the classroom to introduce Naïve Bayes text classification, using a corpus of posts from the Twitter accounts of Joe Biden and Bernie Sanders collected in the run-up to the 2020 Democratic Primaries.

Figure 2: New Analysis interface.

Upon logging in, students see a landing page which lists all projects that the student is currently part of, as determined by the instructor (Fig. 3).

Figure 3: Landing page for students.

3.1 Hand labeling: The Tweet View

The current implementation of NLP4All has a special view called the Tweet View, where students hand-label Twitter data, as seen in Fig. 4. At the bottom of the page, it shows the tweet currently being labeled. Students label each tweet by dragging the Twitter bird to whichever side of the circle represents the category that they think it belongs to. For example, by dragging the bird to the green part of the circle, a student would label the tweet in Fig. 4 as having been written by Bernie Sanders. All tweets in this dataset were pre-processed so that mentions, hashtags, and links were replaced with mention, #hashtag, and http://link, respectively.
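This normalization can be reproduced with a few regular expressions. The following is a minimal sketch of the pre-processing described above, written by us for illustration; it is not NLP4All's actual code:

    import re

    def normalize_tweet(text: str) -> str:
        """Replace URLs, @-mentions, and hashtags with the generic
        placeholders described above (a sketch, not NLP4All's code)."""
        text = re.sub(r"https?://\S+", "http://link", text)  # links first, so URL
        text = re.sub(r"@\w+", "mention", text)              # fragments are not
        text = re.sub(r"#\w+", "#hashtag", text)             # matched as hashtags
        return text

    print(normalize_tweet("Thanks @TeamJoe! Register at https://example.org #Vote"))
    # -> "Thanks mention! Register at http://link #hashtag"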

Figure 4: The Tweet View interface.

In the top right corner of Fig. 4, the Tweet View also shows the model's best estimate of who wrote the tweet, based on the data that it has been trained on so far. For the current tweet, the model estimates a probability of around .63 that Biden wrote it, and the remaining .37 that Sanders did. By making the classification explicit to students, we hope to achieve two different goals. First, we want students to understand how each word contributes to the overall classification. Second, we want students to critically reflect on whether they agree with the model's assessment. It is, of course, important when working with ML not to always trust your model. By giving students a clear idea of how the model reaches its conclusions, we hope that students can learn not only to be skeptical of ML models in a general sense, but also to begin to understand why particular features or combinations of features may confuse a model, and through this get a better sense of what can go wrong when ML makes classifications. Finally, at the top of the page, students are shown how many more tweets they need to hand-label before they are done with all tweets.
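To make the word-by-word arithmetic behind such estimates concrete, here is a minimal Naïve Bayes sketch of our own. The add-one smoothing is an illustrative assumption; the paper does not specify NLP4All's exact estimator:

    import math
    from collections import Counter

    def train(tweets_by_class):
        """tweets_by_class: {class name: list of token lists}."""
        counts = {c: Counter(w for t in ts for w in t)
                  for c, ts in tweets_by_class.items()}
        total = sum(len(ts) for ts in tweets_by_class.values())
        priors = {c: len(ts) / total for c, ts in tweets_by_class.items()}
        vocab = {w for cnt in counts.values() for w in cnt}
        return counts, priors, vocab

    def posterior(tokens, counts, priors, vocab):
        """P(class | tokens) under Naive Bayes with add-one smoothing."""
        log_scores = {}
        for c, cnt in counts.items():
            denom = sum(cnt.values()) + len(vocab)
            s = math.log(priors[c])
            for w in tokens:
                s += math.log((cnt[w] + 1) / denom)  # each word's contribution
            log_scores[c] = s
        m = max(log_scores.values())
        exp = {c: math.exp(s - m) for c, s in log_scores.items()}
        z = sum(exp.values())
        return {c: e / z for c, e in exp.items()}  # e.g. {"Biden": .63, "Sanders": .37}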

4 Using label and word statistics to facilitate learning

Clicking the Label & Word Statistics link at the top of the page gives an overview of all tweets that students in this shared analysis have labeled, how many labeled them correctly, and how many labeled them incorrectly.

The purpose of this view is for the teacher to be able to discuss with students which tweets are hard to classify (i.e. the ones that many hand-label incorrectly) and which ones are easy, and to foster classroom discussion about what we know about these data. The screenshot below shows this list, sorted by correct-% in ascending order. Here, we see that 22 out of 23 students mislabeled the tweet with the text, "FLORIDA: Today is the LAST DAY to register for the Democratic primary. You must register as a Democrat to vote in the March 17th primary at http://link Let's win this together! http://link". This is not surprising, given how generic this tweet is for a choice between two Democratic candidates competing in the same primaries.

Figure 5: Label & Word Statistics view. After students participate in hand labeling data, this view can be used to facilitate discussion.

Sorting the list in descending order of correct-%, we can see the tweets that all or most students labeled correctly. For instance, the tweet "We will not defeat Donald Trump with a candidate who, instead of holding the crooks on Wall Street accountable, blamed the end of racist policies such as redlining for the financial crisis." was seemingly easily recognizable as a Bernie tweet (23/23 students labeled it correctly). Similarly, "We need a leader who will be ready on day one to pick up the pieces of Donald Trump's broken foreign policy and repair the damage he has caused around the world. http://link" was easily recognized by 23 students as having been written by the Biden team. (The one wrong guess was from the teacher demonstrating the system to students.)

Figure 6: Label & Word Statistics view, sorted by % correctly labeled.

The See all words link brings up a table of all words present in tweets that have been labeled so far. This list shows how many times each word has appeared in the set of labeled tweets, and to what extent each word predicts each of the categories in the project, here Joe Biden and Bernie Sanders. For instructional purposes, this list can be used to discuss a variety of questions. In the screenshot above, the list is sorted by how many times a word appears. Since the texts have not been filtered, this can act as a point of entry to a discussion of why only some words are meaningful when it comes to distinguishing between different classes. The teacher may choose this moment to introduce students to the concept of stop words, or even to the notion of statistical power laws behind word frequency (Zipf's law).

Figure 7: Word statistics, sorted by frequency.
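A table like the one in Fig. 7 can be derived directly from the labeled tweets. Here is a sketch of our own (not NLP4All's implementation) that counts each word's frequency and the share of its occurrences falling in each category:

    from collections import Counter, defaultdict

    def word_statistics(labeled_tweets):
        """labeled_tweets: iterable of (tokens, category) pairs.
        Returns [(word, total count, {category: share of occurrences})],
        sorted by frequency as in Fig. 7."""
        totals, by_cat = Counter(), defaultdict(Counter)
        for tokens, cat in labeled_tweets:
            for w in tokens:
                totals[w] += 1
                by_cat[cat][w] += 1
        return [(w, n, {c: by_cat[c][w] / n for c in by_cat})
                for w, n in totals.most_common()]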

4.1 Model training and evaluation

NLP4All also lets students create their own Naïve Bayes models by specifying a set of search terms to train their model on. Asterisks work as wildcards, and can be placed anywhere in a word, including at the front or back. Importantly, for the purpose of reflection and classroom discussions, students are prompted to also state a reason why they think a term would be a good word feature for distinguishing between the categories of text in the project. In the screenshot in Fig. 8 we see how one student has added four different terms and their reasons for inclusion.

Figure 8: Student-defined word features

By clicking the Run Model button at the bottom, NLP4All finds all words that match the search terms (including wildcards) and trains a Naïve Bayes model based on them. It then returns the screen shown in Fig. 9.
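Expanding wildcard search terms against the labeled vocabulary is straightforward. A hedged sketch using Python's fnmatch, which is our choice for illustration and not necessarily what NLP4All uses:

    from fnmatch import fnmatch

    def expand_search_terms(terms, vocabulary):
        """Map each wildcard term to the vocabulary words it matches;
        the matched words become the features of the trained model."""
        return {t: sorted(w for w in vocabulary if fnmatch(w, t)) for t in terms}

    vocab = {"billionaire", "billionaires", "healthcare", "medicare"}
    print(expand_search_terms(["billionaire*", "*care"], vocab))
    # {'billionaire*': ['billionaire', 'billionaires'], '*care': ['healthcare', 'medicare']}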

Here, the user is shown a table with information on each word found from their set of search terms: which category the model predicts based on the training set, how accurate that word was at predicting tweets in the test set, and how many tweets contain the word ('targeted') in the test set.

In this particular case, we see that the word 'billionaire' is the most predictive term: based on the training set, the model predicts that a tweet containing 'billionaire' is written by Bernie Sanders, and this was the case in every single one of the 54 tweets containing this word in the test set. Finally, each search term earns a score. This provides a "gamified" element of NLP4All that can be ignored, but that many students have found motivating and fun. Students can iteratively improve their models by adding or removing search terms and re-running these analyses until they are happy with the terms they have found.

Figure 9: Feedback on model performance

NLP4All users can also view confusion matrices for any of their trained models, as illustrated in Fig. 10.

Figure 10: Viewing a confusion matrix
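For completeness, the arithmetic behind Fig. 10 and the evaluation topics listed in Section 7 fits in a few lines; this is our own illustration, not NLP4All's code:

    from collections import Counter

    def confusion_matrix(gold, predicted):
        """Count (gold label, predicted label) pairs."""
        return Counter(zip(gold, predicted))

    def precision_recall_f1(cm, cls):
        tp = cm[(cls, cls)]
        fp = sum(n for (g, p), n in cm.items() if p == cls and g != cls)
        fn = sum(n for (g, p), n in cm.items() if g == cls and p != cls)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        return prec, rec, f1

    cm = confusion_matrix(["Biden", "Sanders", "Sanders", "Biden"],
                          ["Biden", "Sanders", "Biden", "Biden"])
    print(precision_recall_f1(cm, "Sanders"))  # (1.0, 0.5, 0.666...)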

5 NLP4All in the classroom: Case studies

5.1 Introducing text classification to MA students in the Humanities

Recent revisions to the national study regulations for humanities students in Denmark place an emphasis on digitization and digital literacy. This poses a challenge, however, as students in these programs typically have little background or interest in math, statistics, or programming, and some lack even basic computer skills.

Working with a faculty member at a major Danish university, we developed a classroom module on Classification as part of an introductory course on Computational Linguistics. The students in the class were second-semester master's students in Linguistics and Cognitive Semiotics. Only one out of 22 students had any background in programming, and none had taken a specialization in math or science in high school. Students prepared for the two-week module by reading an introductory textbook chapter on Document Classification (Dickinson et al., 2012).

In a post-classroom evaluation of students (n=20), 100% agreed that 'the in-class exercises using NLP4All were effective for learning' and that the exercises 'improved my understanding of text classification'. In additional comments, several students reported that they enjoyed the gamified and competitive aspect of NLP4All, while others mentioned that they liked the opportunity to work with real-world social media data.

5.2 Facilitating social studies discussion in a Danish high school

We tested NLP4All in a Social Studies high school classroom. In collaboration with a social studies teacher, we developed a 6-hour learning unit on language, ideology, and political parties. The unit was designed to address one of the learning goals of our national learning standards, specifying that students should learn about the different policy positions of political parties (we have 13 in our national parliament). In other words, the purpose was not to teach NLP, but to teach with NLP, and to offer NLP methods as a way of analyzing larger amounts of text than is otherwise possible.

24 2nd-year (sophomore) Danish high school students participated, with roughly equal numbers of girls and boys. In a survey sent out to students prior to our test, none of these students self-reported as having any programming experience, and 20 out of 24 reported no or low interest in computer science or machine learning. All had self-selected into "A-level" Social Studies, a 3-year elective class. About one third of the students had immigrant backgrounds, slightly above the national average.

The teacher and students used NLP4All to discuss tweets from pairs of Danish political parties. First, students had to label tweets and train a model to tell a socialist and a nationalist party apart. Then, students did the same with the same socialist party and a libertarian party.

We cannot report on more concrete findings or analyses of learning data at present, as these results are currently under review at another venue. However, in evaluations the students reported enjoying being able to provide concrete evidence for their analyses. To them, purely qualitative analyses sometimes feel fluffy, but showing that their analyses were backed up by hundreds or thousands of tweets made them feel more comfortable making claims during classroom discussions.

6 Comparison to Prior Work

We have found two systems that are similar to NLP4All in certain ways, though also different in others. GATE (Cunningham, 2002) is a combined Java API and graphical interface that makes it easy to create NLP pipelines without writing all code from scratch. While it was originally made for researchers, it has also been used in teaching contexts (Bontcheva et al., 2002), because it makes it easy for novice programmers to implement more sophisticated NLP methods than they could on their own. However, presumably because GATE was made for researchers, it is not made for classroom contexts, and it does not offer views on data that would be useful to teachers during the teaching situation. Light et al. (2005) present a web interface that lets novices process text with common models and methods like NLTK's PoS tagger and grammar parser. The web interface lets novices combine these models when processing text and visualizes the output. However, similar to GATE, this interface does not provide views on data that are relevant to the teaching context. Additionally, the modules are black-boxed to the students and do not provide any information on how the models work, how they are trained, or how they make predictions.

7 Conclusion

At present, NLP4All provides support for teaching the following technical topics, without requiring any programming on the part of teachers or students:

• Classification algorithms
  – Naive Bayes
  – Logistic regression

• Feature selection

• Supervised machine learning, test and train sets

• Model evaluation
  – Precision, recall, f-measure
  – Confusion matrices

With a grant received in Spring 2021, the platform will be extended to support new learning modules on tf-idf, vector-based representations of texts, topic modeling, and word embeddings.

References

Kalina Bontcheva, Hamish Cunningham, Valentin Tablan, Diana Maynard, and Oana Hamza. 2002. Using GATE as an Environment for Teaching NLP. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, pages 54–62.

Hamish Cunningham. 2002. GATE: A framework and graphical development environment for robust NLP tools and applications. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), pages 168–175.

Markus Dickinson, Chris Brew, and Detmar Meurers. 2012. Language and Computers. John Wiley & Sons.

David Gunning. 2017. Explainable artificial intelligence (XAI). Defense Advanced Research Projects Agency (DARPA), nd Web, 2(2).

Marc Light, Robert Arens, and Xin Lu. 2005. Web-based interfaces for natural language processing tools. In Proceedings of the Second ACL Workshop on Effective Tools and Methodologies for Teaching NLP and CL, pages 28–31.


Proceedings of the Fifth Workshop on Teaching NLP, pages 34–45, June 10–11, 2021. ©2021 Association for Computational Linguistics

A New Broad NLP Training from Speech to Knowledge

Maxime Amblard and Miguel Couceiro
LORIA, UMR 7503, Université de Lorraine, Inria and CNRS

{maxime.amblard, miguel.couceiro}@univ-lorraine.fr

Abstract

In 2018, the Master of Science in NLP¹ opened at the IDMC², Université de Lorraine, Nancy, France. Far from being a creation ex nihilo, it is the product of a history and of many reflections on the field and its teaching. This article proposes epistemological and critical elements on the opening and maintenance of this still-new master's program in NLP.

1 Introduction

This article discusses the epistemological background of the creation of the Master of Science program in natural language processing (NLP) at the Université de Lorraine in 2018. It puts forward a critical analysis of the environment and of the methodology chosen to produce this program, and it highlights the salient elements for the teaching of NLP.

Currently, the master's degree is taught at the IDMC of the Université de Lorraine. It is accessible to students trained within the undergraduate program of the institute, or to students who successfully apply to the program. A jury from the pedagogical team evaluates the adequacy of each candidate's profile with the requirements and expectations of the training. For a detailed description of the master's degree in NLP, please visit our dedicated website³.

We propose a training that is singular at the French level, and probably at the international level. It is a master's degree that provides the tools and methods to carry out language data processing. While NLP is at the heart of the training, we have opened it up to speech and knowledge processing. The objective is to train tomorrow's professionals in language data processing, for both the economic and academic fields.

¹ Natural Language Processing
² Institut des Sciences du Digital, du Management et de la Cognition https://idmc.univ-lorraine.fr/
³ https://idmc.univ-lorraine.fr/idmc-master-degree-in-natural-language-processing/

In this article we present the Master's degree in NLP at the IDMC. We start by reviewing the historical background that is important to the current NLP Master's program, before presenting the approach that explains the constitution of the program. We then return to the program itself and discuss some structural aspects. Finally, we put forward some elements of analysis.

2 History

The Department of Mathematics and Computer Science at the University of Nancy 2 pursued a policy of opening up both of these disciplines to the human and social sciences. At the beginning of the 2000s, the department proposed a training program based on these aspects by integrating economics and business management. Following this interdisciplinary view, the pedagogical team created a bachelor's degree program in cognitive sciences. Several thematic openings were made, in particular towards psychology, biology, linguistics, and neurosciences. The dynamics were successful and paved the way to the opening of a complete two-year master's program in Cognitive Sciences. The core of the training has clearly been cognitive sciences.

2.1 Research environment

The organization of French higher education and research implies that research and teaching are carried out in different structures; teachers thus carry out their research in a separate research laboratory. In Nancy (France), two laboratories are particularly concerned: LORIA⁴ and ATILF⁵. These two laboratories are mixed research units, i.e., university laboratories co-accredited by a research center. This co-accreditation makes it possible to integrate staff who have only research duties, and it acts as a quality label for the research carried out. ATILF is co-accredited by the CNRS⁶, and LORIA is co-accredited by the CNRS and Inria⁷. The two laboratories ensure a sustained research activity in the field in Nancy. ATILF is organized around three themes: lexicons; corpora and knowledge; and dynamic aspects of language. ATILF is known for the Trésor de la Langue Française Informatisé, a large online dictionary of French.

⁴ https://www.loria.fr/en/
⁵ https://www.atilf.fr//

Research in computer science is carried out jointly by LORIA and the Inria Nancy Grand-Est center. Beyond institutional issues, the researchers work jointly on all computer science domains. In particular, LORIA is organised into research departments, the largest of which is entirely devoted to NLP research. The department gathers about fifty permanent staff members in eight teams of different sizes (Cello, Team K, Multispeech, Orpailleur, Read, SMarT, Sémagramme, Synalp).

2.2 Creation of a training program in NLP

The dynamics of teaching mathematics and computer science open to other disciplines, and the strong research activity in Nancy, have proven to be favorable ground for the creation of a training program in NLP. In the early 2000s, the Master's degree in Cognitive Science integrated language-related issues into some of its courses. A DESS, an applied training program at the master's level, was also opened for a few years under the direction of Fiammetta Nammer and Yannick Toussaint. However, as the people in charge were not in the teaching component and the theme was not yet explicitly recognized by French industry, this attempt had to be stopped quickly. It nevertheless established the possibility of developing a curriculum in NLP.

It was in 2005 that the main turning point took place. At that time, a new dynamic was set up, embodied by Patrick Blackburn and Guy Perrier. The latter was a university professor in computer science who set up a training program integrating students from computer science programs with students from linguistics programs within the Master's degree in Cognitive Science. During this time, Patrick Blackburn was working on the creation of the Erasmus Mundus Language and Communication Technologies consortium, of which Nancy would be the French partner.

The Erasmus Mundus master's program on Language and Communication Technologies⁸ is a joint master's program between seven European universities. It is an excellence program supported by the European Union, which offers appealing grants to students each year. The recipients are selected at the international level, based on selection criteria⁹ that take into account the quality of their training and their motivation. Some students join the program self-funded. Each student of the program spends a full year at each of two partner universities of the consortium. The students also share activities and events together during the year.

⁶ Centre National pour la Recherche Scientifique https://www.cnrs.fr/
⁷ Institut National de la Recherche en Informatique et Automatique https://inria.fr/

In 2009, Maxime Amblard took over the responsibility of the Master's program in Cognitive Sciences (SC), institutionalizing the existence of the NLP theme in the university. From that moment on, he has been in charge of the NLP theme, first as head of the SC Master's program from 2009 to 2012¹⁰, and then as head of the M2 TAL¹¹ speciality from 2010 to 2012 and from 2015 to 2018. He took two years of sabbatical leave¹² fully dedicated to research. He was then the project leader for the creation of the Master's program in NLP that opened in 2018. He has been the creator and organizer of the teaching from 2009 to today, both in Cognitive Sciences and in NLP. The responsibility for the 2nd year of the NLP program was assumed by Fabienne Venant in 2009-2010, Laure Buhry in 2013-14, and then Miguel Couceiro from 2018. The coordination of the Erasmus Mundus LCT at the Université de Lorraine has been carried out by Miguel Couceiro since 2015.

2.3 Other contextual elements

In addition to the synergy between the cognitive science and NLP courses, it is worth noting the presence of other important training factors. On the one hand, there is a Master's degree in Language Sciences, as well as a second Erasmus Mundus program specialized in lexicology: EMLEX¹³. This program works very differently, because the classes share semesters together on a given campus; the students are therefore not systematically present in Nancy. In all cases, they too are international students selected for their results and motivation.

⁸ https://lct-master.org/
⁹ See https://lct-master.org/
¹⁰ The responsibility for the SC master has since been assumed by Manuel Rebuschi.
¹¹ "Traitement automatique des langues", which then became NLP.
¹² délégation
¹³ https://www.emlex.phil.fau.eu/


In addition, the Université de Lorraine integrates engineering schools into its structure, in particular the Ecole des Mines de Nancy. This engineering training is recognized as being of high quality in France. However, students from this type of training rarely go on to complete a PhD, despite their qualities. For those who are interested in this type of opening, it is traditional to offer double curricula with Master's programs. We have thus set up an exchange protocol that allows students from the Ecole des Mines to join the Master's program during its last year.

All these elements made the Université de Lorraine a suitable environment for developing the master's program in NLP.

3 Definition of the NLP program

The first question was to decide for whom the program is intended. To answer it, we went back to the ambition we wanted to give to the program. Once the objective was clarified, we were able to work on defining the content of the program.

3.1 Program’s objectives

The first observation, made in 2016, was that there was an effective increase in the need for NLP, both for industrial issues and for the development of research. At that time, a very strong dynamic was already in place, driven by AI and deep learning.

We realized that the training we had developed until then was attached to a more traditional definition of NLP, rooted in computer science and linguistics. Since its objectives were mainly to accompany students towards research, this was not very surprising, but it proved to be out of step with scientific and societal evolutions.

The project took shape around the idea that students should leave the Master's program for industry as much as for research. We already considered that a significant part of the companies concerned also deal with research issues. This shift towards more applicative issues also seemed necessary to us in order to integrate more students into the training.

Moreover, our experience with the second year of the Master's program showed us that the students who successfully completed the program could have very heterogeneous profiles. The challenge for us was rather to define the type of profile that we wanted to find at the end of the training, and to propose paths to bring the students there. This being the case, in view of the size of the classes, it was not possible to multiply the paths, and above all it seemed important to us that the profiles be built in interaction with each other. To achieve this, we needed a precise vision of the types of prior training we can accept.

We have therefore chosen to build a training program for NLP centered on computer science and mathematics, one that is as open to research issues as it is to applications.

3.2 Methodology

To build the program, a project manager was appointed in 2015 by the teaching component, in the person of Maxime Amblard, with the task of designing a project to be presented to the university in June 2016. Some examples were used, such as (Bender et al., 2008).

The project manager made the choice to start from scratch, considering that the contents present at that time needed to be renovated. He set up a working group gathering members of the LORIA and ATILF teams working on language data who wanted to invest in the new training. The aim was to integrate new colleagues, new views, and possibly new themes and issues.

A shared space was set up online, allowing all parties to access the working material as it became available. The working group started with 13 people and ended up with 25. It was decided to meet roughly once a month until July 2016. After each meeting, the leader wrote a report that was sent to all members, together with a Doodle poll to choose the date and agenda of the next meeting. The participants thus had the agenda well in advance, which allowed them to prepare each meeting as well as possible, or to send their points of view and their work in case of absence. This methodology ensured that no information was lost and that the work dynamic was maintained throughout the year. In addition, meeting times were strictly adhered to. In the end, this procedure worked well in a complicated context. All the proposals made were heard and discussed. Some less committed participants naturally withdrew from the project; the vast majority stayed until the end, and others were really committed to the project. We did not need to change the methodology during the process. The most difficult element to manage was maintaining the group's focus and synthesizing all the proposals as we went along. We return to the major orientations in the following section.

Higher education and research are undergoing numerous transformations, and this is obviously the case in France. In particular, at that time, major programs were being launched, and it was important to position the training in this ecosystem. In the end, we quickly dismissed this issue, considering that the constitution of a scientifically solid program would always be easier to defend.

After several discussion sessions, we came to some important conclusions:

• the environment has many competences, in data processing as well as in linguistics, along with shared working habits;

• the Erasmus Mundus LCT obliged us to teach in English, which is not always easy in France, but this should be a strength for the recruitment of international students;

• our scientific positioning was neither purely in computer science nor purely in linguistics, but clearly based on mathematics and computer science;

• it is now possible to award a French diploma in NLP;

• the needs in both the industrial and academic worlds are important.

We also settled two hypotheses on the organization of the training: the training would not open to distance learning, because that requires another type of organization that we cannot propose for now, and it would offer the possibility of being carried out in alternation. The latter is a French specificity which allows students to study while being employed by a company: the training alternates (in the literal sense) periods of training at the institute and periods of work in a company.

One important aspect is to have concluded that, while we have many strengths in the field of NLP, we have the specificity of covering the wider field of language data processing. Even if we are not the only ones in France, we wanted to make our mathematics and computer science positioning in the field of language data processing a specificity. We have therefore decided to offer a curriculum entitled "Computer Science, Speech, Text and Knowledge". Thus, the new program deals with speech processing, NLP, and knowledge. It is obvious that the reconciliation of formal and statistical tools that has taken place over the last ten years facilitates this closeness.

3.3 Broad Program Directions

We have therefore chosen to integrate students whose initial training is either in computer science or in linguistics, with an appetite for formalization, programming, and mathematics. A single pathway is proposed through the training, bringing these different profiles together. The objective is to bring all students to the same exit point. Beyond this declaration of intent, it is obvious that we cannot, in two years, transform linguists who have never done programming into operational computer scientists, just as we cannot transform computer scientists into linguists in the same time. On the other hand, students must be comfortable enough to overcome the paralysis of working in a new field, to go beyond naive approaches, and above all to be able to discuss the different aspects precisely with specialists in the other field. To achieve this, there is nothing better than to mix these profiles throughout the training.

This mixing adds entropy to the construction of individual paths. To compensate for this, we have proposed a very legible and regular architecture in the training. This apparently very rigid organization allows students to build their paths together.

Moreover, as we have already mentioned, the aim is to offer a course with a strong research component, while at the same time offering numerous applications.

It was therefore decided that the first year would serve as a substrate to establish the background, opening up to long-term prospects. The second year would focus on NLP topics, with many more applications and, especially, a significant amount of time devoted to an internship, either in a laboratory or in a company. The training is given over two years, i.e. 4 semesters:

• two semesters in the first year of 260 effective teaching hours each - 30 credits each

• one long semester in the second year of 350 effective teaching hours - 30 credits

• one internship of at least 5 months (one semester) - 30 credits

The architecture of each of the three taught semesters is the same, with 5 teaching units of equal importance in the validation of the training. The organization of studies in France implies that each semester validates 30 credits (ECTS).

37

3.4 Renewal of Teaching Practices

As we have already mentioned, we did not want to transform our teaching practices by switching to distance learning, especially because we thought it was important to have different profiles working together; it is difficult to measure this at a distance.

However, we questioned these practices in order to propose more applied teaching, with effective data manipulation, as well as the implementation of group work, in particular in the form of projects.

Concerning the discovery of research, we propose a project in groups of 2 to 4 students, called a "supervised project". This is a project carried out during the first year, at the initiative of a researcher, which leads to the writing of two reports. The first part of the year is a bibliographic work: it consists of taking the time to read and understand scientific articles on the state of the art, and to put them in perspective with the proposed project. This work is synthesized in a bibliographic report, which opens onto the second part of the work, a more classical realization part. The students produce a report, which is a first experience of long-form writing before their final master's report, as well as a 20-minute defense in front of a jury and the presentation of a poster. The objective is to have them carry out a research project over a long period of time, with links to the literature on the one hand and actual achievements on the other, and also to master the codes of scientific presentation. This work often leads to publications in international conference workshops.

As an example, here are some titles of projects carried out in recent years on different themes: Speaker Adaptation Techniques for Automatic Speech Recognition; Testing Greenberg's Linguistic Universals on the Universal Dependencies Corpora using a Graph Rewriting tool; Does my question answer your answer; Anomaly detection with deep learning models; ...

The counterpart of this work is the realization of an applied group project throughout the first semester of the second year of the program. It is carried out by groups of 3 to 4 students. The initiative is left to the students, who are nevertheless supervised by two teachers. Once the application is identified, the students implement the concepts they have encountered in their training to produce an application. The functional and deployable applications are put online to showcase the work done. The students also produce a report explaining their approach and their achievement, and they give a defense. Here are some examples of realization subjects:

• GECko+ is built on top of two Artificial Intelligence models in order to correct spelling and grammar mistakes, and to tackle discourse fallacies, with BERT (Devlin et al., 2018).

• Askme is a Question Answering model built on the Stanford Question Answering Dataset (SQuAD) with a fine-tuned BERT. Askme is able to automatically answer factual questions without being aware of the context.

• IGrander Essay: Automated Essay Grading systems provide an easy and time-efficient solution to essay scoring.

• Multilingual multispeaker expressive text-to-speech system: the main goal of this work is to generate expressive speech from text input for multiple languages, currently French and English, with an end-to-end multilingual text-to-speech (TTS) system.

The first experiences have shown that both formats are very formative. It is common for groups of students to go from very enthusiastic phases to overwhelmed ones. Thanks to the mentoring process, all projects are completed by the end of the semester. In both cases, although the formats are very different from each other, the situation allows students to better understand where the interests of NLP lie and, especially, where the difficulties are. Working over a long period of time shows them the importance of anticipating problems in both research and development. We make sure that the groups are mixed in terms of profiles, to avoid the pitfall of groups stuck on IT developments as well as groups missing out on linguistic issues. We note that an additional exercise is paper writing in the format of the main conferences in the field.

3.5 Support for internationalization

As part of the development of French universities, the government has set up major projects under the name Project Investment of the Future (PIA). The Université de Lorraine benefits from a major project of this type, called Lorraine University of Excellence (LUE), which covers several themes around systems engineering. The training program has benefited from financial support from this program to support internationalization. For this purpose, we are translating the presentation documents into several languages, in particular Russian, Persian, Greek, and Turkish. The objective is to accompany as effectively as possible the arrival of students in the training.

In addition, we have made a promotional filmfor international students to highlight the necessarycomplementarity between computer science andlinguistics that is achieved within our training. Theproduction was made in a professional way by re-lying on the students and their profile. Once again,the aim is to highlight the diversity and quality ofthe students’ profiles.

4 Program Design

4.1 Education

We do not want this to read like an advertising brochure for the program, so we do not explicitly give the titles of each teaching sub-unit. That said, the program is designed to put forward a continuum between semesters, as much as possible across the whole program, and it is this dynamic that we emphasize. We invite readers to refer to the program descriptions for more details. There are three semesters of regular teaching. Units of the first semester are numbered 70X, those of the second semester 80X, and those of the last semester 90X, where X takes its value in {1, 2, 3, 4, 5}. In each semester, students follow all the units. For now, the program has no optional courses.

The first units, described in Table 1, gather the fundamental background in mathematics and computer science. For these units, the continuum is natural: on one side probabilities and statistics - machine learning - neural networks, and on the other programming - semantic web - data mining/recommendation.

The second units (see Table 2) gather the teachings around corpora and formal tools.14

The third units (see Table 3) gather the teaching on software engineering and data science.

The linguistics units are the ones ending in 4 (see Table 4).

Finally, the fifth units (see Table 5) comprise the project-based teaching detailed above, to which we add language teaching (French for non-French speakers, English for the others). In the second year, we add opening lessons, in particular around ethical issues.

14 For units two and three in the second year, we make sure to offer state-of-the-art teaching applied to language data.

Students are evaluated on the acquisition of targeted skills identified for the M.Sc. degree.

To compensate for differences in level, the technical courses of the first semester are accompanied by a refresher course. After several years of courses, we have noticed that the difference in level between the different audiences is not caught up. We have therefore decided to change the organization of some courses as of the next academic year. For example, for Python programming, two courses will be offered: one for beginners and an advanced course shared with cognitive science students.

Teachers are free to choose the teaching methods they follow, within the general perspective of the program. We clearly share the objectives for the end of the program: students should be autonomous in dealing with the scientific literature and in participating in the production of new results.

4.2 Example of a course

It is obviously not possible to describe all the courses of the training, and that is not the object of this article. We want to highlight one course in particular, which allows us to put forward the principle of project-based training by mixing profiles.

This is Methods4NLP, given in unit 704. It is one of the very first courses introducing NLP to students. The teaching is divided into several sequences: a first one establishing a common culture around NLP, a second, academic one presenting the basic formalizations that students will develop throughout the two years of training, and the setting up of a project.

The first phase can be divided into three parts:

• a seminar to set the spectrum of NLP, from linguistic aspects to technical developments. This teaching gives a common culture to the students at the beginning of the program by positioning NLP in relation to the challenges of AI.

• apprehension of the concepts through experimentation with unplugged activities (Bell et al., 2009; Romero et al., 2018): one simulating machine learning with a machine playing the game of Nim, the other on Huffman coding with groundhogs. The students thus meet the two paradigms of NLP in an intuitive way.

39

701 Probabilities, Statistics and Algorithms for AI: The course deals with fundamental mathematical tools, particularly statistical and algorithmic tools, which are necessary to define and solve an artificial intelligence problem. The course unit stresses a case-study approach in order to ensure the acquisition of theoretical aspects and practical application (elementary mathematical tools, probabilities and statistics; and Python programming, both for beginners and advanced users).

801 Machine Learning and Semantic Web: This course takes on the fundamental principles of machine learning, data mining and knowledge extraction. All notions are illustrated through practical applications on real data (machine learning theory and Semantic Web).

901 Deep Learning and Data Mining: The objective of this course unit is to acquire machine learning tools, mainly deep neural networks and matrix factorisation, and to be able to manipulate these tools in practical applications (tweets, traces of e-learning), as well as to expand the students' knowledge of the semantic web and of extensions of formal concept analysis for textual and relational data processing (neural networks, deep neural networks; data mining (structured data and text); and collaborative filtering).

Table 1: Description of the teaching units of block 1

702 Design and Acquisition of Corpora: This course introduces techniques for the construction, structuring, annotation and archival of textual, oral or multimodal corpora, which play an essential role on the one hand in the analysis of the structure of spoken and written language, and on the other hand in the training and evaluation of NLP algorithms. This subject is complex, as it is necessary that (1) the corpora be restricted to a reasonable size in order to guarantee the proper collection of the corresponding data and that (2) this data sufficiently represent the phenomena studied (Written Corpora; and Spoken Corpora).

802 Formal Tools: This course unit is dedicated to introducing the theoretical frameworks and logic used in symbolic approaches to the modeling of language. It covers mathematical logic and formal languages. The objective is to familiarize students with these formal models, their properties, the proof techniques associated with them, and the notions of computability and complexity.

902 Text and Speech Processing: The automatic processing of text and speech involves different machine learning methods. This class introduces these methods and illustrates their use through examples and practical applications using existing tools (Speech Processing; Processing Textual Data; Terminology and Ontology).

Table 2: Description of the teaching units of block 2

703 Software Engineering: Collection, analysis and formalization of customers' needs (Software design; Functional analysis; Specifications; and Project management).

803 Data Science: This course unit introduces fundamental techniques for the extraction, storage, cleaning, visualisation and analysis of data. We give a practical introduction to the tools and software libraries that allow the processing of data, combining theoretical sessions with programming exercises that let students put into practice the software and concepts taught during the course.

903 Natural Language and Discourse: This course unit gathers courses dealing with the discursive and semantic processing of language (Application to Texts; Computational Semantics; and Discourse and Dialogue Modelling).

Table 3: Description of the teaching units of block 3

40

704 Linguistics for NLP-1: This course covers on the one hand the fundamental elements of NLP and on the other hand the phonological and morphological elements, which are studied through a language-sciences-based approach (Methods for Natural Language Processing; Phonology; and Morphology).

804 Linguistics for NLP-2: This course takes up where the previous linguistics teachings left off, although the content and methodologies of the courses are independent from those prior teachings. The courses concentrate on syntax and semantics, as well as lexicology. In this context, the focus is put on the question of the formalisation of linguistic rules (Lexicology: lexical units and phraseology; Syntax; and Semantics).

904 Lexicon and Grammars for NLP: This course introduces advanced tools for the computational modeling of different types of linguistic information, describing either lexical units (lexical resources) or the rules of organisation of larger units (grammar) (Diachronic and synchronic lexicology; Lexical resources; and Syntactic frameworks).

Table 4: Description of the teaching units of block 4

705 This course unit is composed of the first part of the year-long supervised project (which will be finalized in the second semester), as well as language classes. The project consists of group work (in pairs), supervised by researchers, in which the students carry out the bibliographic part of the final report. Language classes allow non-anglophone students to become more familiar with scientific English, the language in which all courses are conducted, whereas the French classes facilitate non-francophone students' social and cultural integration. The course allows students to test the first skills acquired during the semester while synthesizing the different research issues raised by a more open research topic. Students are evaluated on the acquisition of targeted skills identified for the M.Sc. degree.

805 This course unit is made up of language classes as well as the second part of the year-long supervised project (UE 705), which is finalized during the second semester. The second part of the project follows through with the procedure introduced in the first-semester group work and leads up to the presentation at the end of the year explaining the implementation of the project, which puts to the test all of the skills the students have acquired during the first year of the program. The language classes allow students to become more comfortable with scientific English.

905 Projects and Foreign Language: This course unit gathers several teachings, including the project and language classes (Software project; Law and ethics; Research methods; Professional integration; Foreign language courses (French or English)).

Table 5: Description of the teaching units of block 5

41

• first experiments to highlight the possibility of getting results with little development effort. Two labs are proposed: one on FastText (Bojanowski et al., 2016; Joulin et al., 2016) and one on scikit-learn (Pedregosa et al., 2011). The objective is to make all students aware of the ease of manipulating data without understanding what is behind it, and of the difficulty of working out how to improve the results (see the sketch below).
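To give a flavour of such a lab, here is a minimal sketch in the spirit of the scikit-learn session; the 20 newsgroups data and the choice of classifier are illustrative assumptions on our part, not the actual lab content:

    # Hedged sketch: a two-class text classifier in a handful of lines.
    # The dataset and model choices are illustrative, not the lab's own.
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    cats = ["rec.autos", "sci.space"]
    train = fetch_20newsgroups(subset="train", categories=cats)
    test = fetch_20newsgroups(subset="test", categories=cats)

    # tf-idf features feed a linear classifier; no linguistic insight needed
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(train.data, train.target)
    print("accuracy:", clf.score(test.data, test.target))

The point made in class is precisely that a score obtained this easily says little about what the model has understood, and that improving on it is much harder than producing it.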

The second phase is carried out in two parallelparts: project and lectures.

For the project part, the students form groups of 4, balanced between the profiles. Students of the same nationality are asked not to stay together and, above all, every group must explicitly mix different profiles (linguists and computer scientists). The groups choose the subject of their development. They are only required to explicitly include NLP aspects in their project (POS tagging, named entity recognition, machine learning, etc.).

This project is shared with another course that deals with written corpora. For that course, the project must include the constitution of a corpus, its normalization, as much as possible its annotation with a study of the annotation quality, and thus an implementation in a defined context. Consequently, the themes of the projects are more naturally related to NLP than to speech processing, or even to knowledge processing.

For information, here are a few topics chosen by the students: Gender bias in literature targeted at young readers: 19th century vs. early 21st century; Author Identification for Philosophers; Song Lyrics Generator; Autonomous Vocabulary Assistant; ...

It is interesting to let the students choose a topic that interests them. Sometimes we see students with very specific subjects; letting them work on these allows us to build on their engagement. If a subject is not relevant, it gives us an opportunity to steer them back toward a more appropriate training perspective. There is no such thing as a bad topic. The variety of topics makes students aware of the difference between the desire to achieve a development and the possibility of achieving it. In general, we see that some groups prefer to stay close to a very classical topic and often regret not having explored a greater diversity of topics. Moreover, a recurrent element appears around the question of evaluating the development. This practice makes them aware of the significance of

anticipating the implementation of the evaluationin order to measure the quality of the final project.

Students are monitored weekly. They have to give a one-minute presentation of the progress of their developments in front of the whole class. They may not have made any progress; what is important is that all students follow the progress of all groups. During the Covid-19 period, the follow-up was done by video-conference, and it was difficult to share this experience with the whole class. During the semester, students submit three progress reports, the first ones being only a few pages long. These are used to document problems, issues and progress. The final report is due before the exam period, during which a defense is organized.

Finally, the last part consists of more academic teaching that takes up the two main paradigms of NLP and defines them explicitly (Agarwal, 2013). The first part deals with context-free grammars and automata. These concepts are often known by students, more or less well. This allows us, on the one hand, to bring everyone up to speed and, on the other hand, to show how basic concepts in computer science connect with linguistics. This is done by looking at Turing's work on abstract machines, or through the Chomsky hierarchy. The course then highlights the use of statistical techniques by explaining Bayesian models and automatic classification as presented by Jurafsky and Martin (2018). Finally, the last part explains dynamic programming by focusing on syntax with the CKY algorithm (Kozen, 1977). This part of the course is carried out in a traditional way, with plenary teaching and tutorials.
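To make the last step concrete, here is a minimal CKY recognizer for a toy grammar in Chomsky Normal Form; the grammar and its encoding are illustrative choices, not the actual course material:

    from itertools import product

    def cky_recognize(words, grammar, start="S"):
        # grammar maps a terminal word, or a (B, C) pair of non-terminals,
        # to the set of left-hand sides that can produce it
        n = len(words)
        # table[i][j] holds the non-terminals spanning words[i:j]
        table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
        for i, w in enumerate(words):
            table[i][i + 1] = set(grammar.get(w, ()))
        for span in range(2, n + 1):        # dynamic programming over spans
            for i in range(n - span + 1):
                j = i + span
                for k in range(i + 1, j):   # every binary split point
                    for B, C in product(table[i][k], table[k][j]):
                        table[i][j] |= set(grammar.get((B, C), ()))
        return start in table[0][n]

    # Toy grammar: S -> NP VP, NP -> Det N, VP -> V NP
    grammar = {
        ("NP", "VP"): {"S"}, ("Det", "N"): {"NP"}, ("V", "NP"): {"VP"},
        "the": {"Det"}, "dog": {"N"}, "garden": {"N"}, "sees": {"V"},
    }
    print(cky_recognize("the dog sees the garden".split(), grammar))  # True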

This course is the one that launches the training. Indeed, while it begins with a phase requiring little prior knowledge, it continues with a phase that requires mobilizing the technical knowledge delivered in the other courses, as well as linguistic knowledge. Leaving the choice of theme to the students motivates them, while showing them the need to be autonomous and proactive in the program. In addition, a very traditional part of the course allows students to establish common ground by building epistemological bridges between computer science and linguistics.

5 Analysis of student profiles

After having presented the structure, the organization and the stakes of the training, we now return to quantitative elements concerning the disciplinary profile of the students, the diversity of their origins, and their success.


Year      Cohort  # appl.  # stud.  Comp.  Mixed  Ling.  Fr     Eu     non-EU
2015-16   M2      35       17       41     24     35     24     12     65
2016-17   M2      20       6        67     17     17     67     0      33
2017-18   M2      33       17       47     18     29     41     0      59
Mean              29.33    13.33    51.66  19.66  27     44     4      52.33
2018-19   M1      47       16       25     38     38     50     13     38
          M2      21       7        57     29     14     29     14     57
2019-20   M1      65       20       30     25     45     35     30     35
          M2      32       20       45     20     35     30     0      70
2020-21   M1      117      33       45     18     36     36     12     52
          M2      33       27       48     19     33     26     15     59
Mean      M1      76.33    23       33.33  27     39.66  40.33  18.33  41.66
Mean      M2      28.66    18       50     22.66  27.33  28.33  9.66   62
Mean              52.5     20.5     41.66  24.83  33.5   34.33  14     51.83

Table 6: Distribution of students in the training over the last 6 years: number of applicants, number of students, then distribution of students in percentage (%) between the three profiles (computer science, mixed and linguistics), then according to their nationality (France, Europe and non-EU).

To give a little more perspective to these elements, we rely on the data of the Master's program since 2018, that is, three academic years, as well as on the M2 TAL speciality over the three previous years.

Before the definition of the program, we had carried out a study of the 2008-2017 classes. We noted that 36% of the students were French, 45% came from another European country and 19% from a non-European country, which attests to a good diversity of origins. Of those who continued after the course, 40% joined a company, 33% were pursuing a PhD and 27% held an academic post.

Table 6 presents the evolution of the number of candidates and students in the training before the creation of the Master's (3 years) and for the first 3 years of the training. It also contains the distribution of students according to the profiles of Fosler-Lussier (2008): computer science, mixed or linguistics. Figure 1a shows the evolution of the data over time. We observe that the distribution is homogeneous over the different years of training, with an average of 41.6% computer scientists over the first three years of the Master's program versus 33.5% linguists. This good distribution makes it possible to build balanced work groups. Moreover, we observe that the opening as a Master's has significantly increased the average share of the linguist profile, from 27% to 33.5%. This implies reinforcing the courses for this profile and increasing the resources devoted to their programming skills.

Another interesting element that appears in this table, and is put forward in Figure 1b, is the increase in the number of candidates, with 117 candidates for the M1 year in 2020-21. It should be noted that this number of applicants does not include students trained within the IDMC or Erasmus Mundus LCT students, who are selected through another process. For the M2 year, only a few students are selected directly, the majority coming from the first year of the training. Our experience leads us to limit even further the number of students entering the second year directly: on the one hand, these students have gaps that they are unable to fill, and on the other hand, they have difficulties integrating into the class. These elements must of course be put into perspective, because the effects of the health crisis must also be taken into account.

Finally, the other axis of analysis is the nationality of the students. The average share of French students has decreased significantly, from 44% to 34.5%. This decrease did not affect the rate of overseas students, whose share remains slightly above 50%.

Finally, an important element is to relate the profiles at entry to success at the end of the Master's degree. The first element is that students who follow the two years of the training obtain better results and rankings. This supports the idea that the training has singular characteristics that are difficult to catch up on. This is understandable, because it is necessary to master concepts in both computer science and linguistics, and moreover to master the tools and methods of AI, which are specific and rather abstract.


(a) Origin and background of the students.


(b) Number of applicants and students.

Figure 1: Data of the program from 2015 to 2020.


One should not believe that any particular entry profile is privileged in this type of training. The top students of each class do not have a priori determined profiles: they may come from traditional computer science or from linguistics. What seems to make the difference, in addition to their aptitude at entry, is their investment in the training throughout the two years. This reassures us that the right aim is to bring the profiles together, while respecting their original specificities.

6 Conclusion and perspectives

We have highlighted the process of creating the Master's degree in NLP at the IDMC. This training is unique in that it delivers a French diploma in NLP. Students who enter the program have a background either in computer science or in linguistics. The training aims to give them strong skills, so as to train future professionals in language data processing.

We have integrated collaborations with several partners in our environment (Erasmus Mundus LCT, EMLEX, École des Mines, etc.). The training is attractive and has good results. Students trained in this way are currently choosing between different possible careers, rather than being affected by the economic and sanitary situation due to the health crisis, which has negatively impacted the job market.

Over the different years observed, we have noticed that the number of internship proposals on the one hand, and the number of job offers on the other, are constantly increasing, with explicit demand for NLP profiles. Moreover, our opening towards more applied profiles has not come at the expense of the relationship with the academic world. Out of the first class that followed the two years of the training, 7 students are pursuing a PhD, which is very positive. These two dynamics reinforce our commitment to the development of the program.

Since the beginning of the program, we have asked for oral and written evaluations from the students on both the teaching and the organization of the training. This feedback has allowed us to improve the content while remaining within the same general framework. The integration of students at the end of the training confirms our belief in the value of the program. The accreditation we have obtained is coming to an end, and we must now prepare a proposal for a general evolution for 2023. The Master's is now considered mature and can benefit from more autonomy.

For the future, we want to consolidate the training by slightly increasing the intake of students until we reach 40 students per year, especially with students from the European area. In addition, we are working on developing opportunities for research, which would allow us to offer very high-quality training paths towards the PhD.

Acknowledgments

We truly want to thank the anonymous reviewers for their helpful comments and insights. This work was supported in part by the French PIA project "Lorraine Université d'Excellence", reference ANR-15-IDEX-04-LUE.

44

References

Apoorv Agarwal. 2013. Teaching the basics of NLP and ML in an introductory course to information science. In Proceedings of the Fourth Workshop on Teaching NLP and CL, pages 77–84, Sofia, Bulgaria. Association for Computational Linguistics.

Tim Bell, Jason Alexander, Isaac Freeman, and Mick Grimley. 2009. Computer science unplugged: School students doing real computing without computers. The New Zealand Journal of Applied Computing and Information Technology, 13(1):20–29.

Emily M. Bender, Fei Xia, and Erik Bansleben. 2008. Building a flexible, collaborative, intensive master's program in computational linguistics. In Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, pages 10–18.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Eric Fosler-Lussier. 2008. Strategies for teaching "mixed" computational linguistics classes. In Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, pages 36–44.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.

Daniel Jurafsky and James H. Martin. 2018. Speech and Language Processing (draft). Chapter A: Hidden Markov Models (draft of September 11, 2018). Retrieved March 19, 2019.

Dexter C. Kozen. 1977. The Cocke–Kasami–Younger algorithm. In Automata and Computability, pages 191–197. Springer.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Margarida Romero, Benjamin Lille, Thierry Viéville, Marie Duflot-Kremer, Cindy de Smet, and David Belhassein. 2018. Analyse comparative d'une activité d'apprentissage de la programmation en mode branché et débranché [Comparative analysis of a programming learning activity in plugged and unplugged modes]. In Educode - Conférence internationale sur l'enseignement au numérique et par le numérique.


Proceedings of the Fifth Workshop on Teaching NLP, pages 46–48, June 10–11, 2021. ©2021 Association for Computational Linguistics

Applied Language Technology: NLP for the Humanities

Tuomo Hiippala
Department of Languages, University of Helsinki
P.O. Box 24, Unioninkatu 40 B, 00014 University of Helsinki
[email protected]

Abstract

This contribution describes a two-course module that seeks to provide humanities majors with a basic understanding of language technology and its applications using Python. The learning materials consist of interactive Jupyter Notebooks and accompanying YouTube videos, which are openly available with a Creative Commons licence.

1 Introduction

Language technology is increasingly applied in the humanities (Hinrichs et al., 2019). This contribution describes a two-course module named Applied Language Technology, which seeks to provide humanities majors with a basic understanding of language technology and the practical skills needed to apply language technology using Python. The module is intended to empower the students by showing that language technology is both accessible and applicable to research in the humanities.

2 Pedagogical Approach

The learning materials seek to address two major pedagogical challenges. The first challenge concerns terminology: in my experience, the greatest hurdle in teaching language technology to humanities majors is not 'technophobia' (Öhman, 2019, 480), but the technical jargon that acts as a gatekeeper to knowledge in the field (cf. Maton, 2014). This issue is fundamental to teaching students with no previous experience of programming. To exemplify, beginners in my class occasionally interpret the term 'code' in phrases such as "Write your code here" as a numerical code needed to unlock an exercise, as opposed to a command written in a programming language. For this reason, the learning materials introduce concepts in Python and language technology in layperson terms and gradually build up the vocabulary needed to advance beyond the learning materials.

The second challenge involves the diversity of the humanities, which cover a broad range of disciplines with different epistemological and methodological standpoints. This results in considerable differences in previous knowledge among the students: linguistics majors may be more likely to have been exposed to computational methods and tools than their counterparts majoring in philosophy or art history. Some students may have taken an introductory course in Java or Python, whereas others have never used a command line interface before. To address this issue, the learning materials are based on Jupyter Notebooks, which provide an environment familiar to most students – a web browser – for interactive programming. The command line is used for interacting with GitHub, which is used to distribute the learning materials and exercises.

The module also emphasises peer and collaborative learning: 20% of the course grade is awarded for activity on the course discussion forum hosted on GitHub. All activity – both asking and answering questions – counts positively towards the final grade. This allows students with previous knowledge to help onboard newcomers. According to student feedback, this also fosters a sense of community. The discussion forum is also used to discuss weekly readings, which focus on ethics (e.g. Hovy and Spruit, 2016; Bird, 2020) and the relationship between language technology and the humanities (e.g. Kuhn, 2019; Nguyen et al., 2020). These discussions are guided by questions that encourage the students to draw on their disciplinary backgrounds, which exposes them to a wide range of perspectives on language technology and the humanities.

3 Learning Materials

The learning materials cover two seven-week courses.

The first course starts by introducing rich, plain and structured text and character encodings, followed by file input/output in Python, common data structures for manipulating textual data, and regular expressions. The course then exemplifies basic NLP tasks, such as tokenisation, part-of-speech tagging, syntactic parsing and sentence segmentation, by using the spaCy 3.0 natural language processing library (Honnibal et al., 2020) to process examples in the English language. This is followed by an introduction to basic metrics for evaluating the performance of language models. The course concludes with a brief tour of the pandas library for storing and manipulating data (McKinney, 2010).
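To make this concrete, here is a minimal sketch of the kind of spaCy 3.0 usage the course builds up to; the example sentence is ours, and the sketch assumes the small English model has been installed with python -m spacy download en_core_web_sm:

    import spacy

    # tokenisation, part-of-speech tagging and dependency parsing in one call
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Language technology is increasingly applied in the humanities.")

    for token in doc:
        print(token.text, token.pos_, token.dep_)

    # sentence segmentation is derived from the dependency parse
    print([sent.text for sent in doc.sents])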

The second course begins with an introduction to processing diverse languages using the Stanza library (Qi et al., 2020), shows how Stanza can be interfaced with spaCy, and how the resulting annotations can be searched for linguistic patterns using spaCy. The course then introduces word embeddings to provide the students with a general understanding of this technique and its role in modern NLP, which is also increasingly applied in research in the humanities. The course finishes with an exploration of discourse-level annotations in the Georgetown University Multilayer Corpus (Zeldes, 2017), which showcases the CoNLL-U annotation schema.
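A minimal sketch of this Stanza–spaCy interface, assuming the spacy-stanza wrapper package; the choice of Finnish is ours for illustration, and the course may use other languages and models:

    import stanza
    import spacy_stanza

    stanza.download("fi")                   # fetch a Finnish model once
    nlp = spacy_stanza.load_pipeline("fi")  # wrap Stanza in a spaCy pipeline

    doc = nlp("Kielten laitos sijaitsee Helsingissä.")
    for token in doc:
        # Stanza's annotations are exposed through the familiar spaCy API
        print(token.text, token.lemma_, token.pos_)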

The extent to which the students meet the learning objectives is measured in weekly exercises. The weekly assignments are distributed through GitHub Classroom and automatically graded using nbgrader1, which allows generating feedback files with comments that are then pushed back to the student repositories on GitHub. The exercises are also revisited in weekly walkthrough sessions to allow the students to ask questions about the assignments. The students are also required to complete a final assignment for both courses: the first course concludes with a group project that involves preparing a set of data for further analysis, whereas the second course finishes with a longer individual assignment.

All learning materials are openly available under a Creative Commons CC-BY 4.0 licence at the addresses provided in the following section. Access to the weekly exercises is available on request.

4 Technical Stack

The learning materials are based on Jupyter Notebooks (Kluyver et al., 2016) hosted in their own GitHub repository.2

1 https://nbgrader.readthedocs.io

This repository constitutes a submodule of a separate repository for the website, which is hosted on ReadTheDocs.3 The notebooks containing the learning materials are rendered into HTML using the Myst-NB parser from the Executable Books project.4 This allows keeping the learning materials synchronised and enables users to clone the notebooks without the source code for the website. Myst-NB also adds links to Binder (Project Jupyter et al., 2018) to each notebook on the ReadTheDocs website, which enables anyone to execute and explore the code.
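As an illustration of how such a set-up can be wired together, here is a hypothetical minimal Sphinx conf.py; the option names follow recent MyST-NB releases and the actual configuration of the course website may well differ:

    # Hypothetical sketch of a Sphinx configuration for a MyST-NB site.
    # Option names follow recent MyST-NB releases and may differ from
    # the actual course configuration.
    extensions = ["myst_nb"]

    # notebooks are committed with their outputs, so skip re-execution
    nb_execution_mode = "off"

    project = "Applied Language Technology"
    html_theme = "sphinx_rtd_theme"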

The Jupyter Notebooks provide a familiar environment for interactively exploring Python and the various libraries used, whereas the ReadTheDocs website is meant to be used as a reference work. Both media embed videos from a YouTube channel associated with the courses.5 These short explanatory videos exploit the features of the underlying audiovisual medium, such as overlaid arrows, animations and other modes of presentation, to explain the topics.

5 Conclusion

This contribution has introduced a two-course module that aims to teach humanities majors to apply language technology using Python. Targeted at a student population with diverse disciplinary backgrounds and levels of previous experience, the learning materials use multiple media and layperson terms to build up the vocabulary needed to engage with Python and language technology, complemented by the use of a familiar environment – a web browser – for interactive programming using Jupyter Notebooks.

References

Steven Bird. 2020. Decolonising speech and language technology. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3504–3519.

Erhard Hinrichs, Marie Hinrichs, Sandra Kübler, and Thorsten Trippel. 2019. Language technology for digital humanities: introduction to the special issue. Language Resources and Evaluation, 53(4):559–563.

2 https://github.com/Applied-Language-Technology/notebooks
3 https://applied-language-technology.readthedocs.io
4 https://executablebooks.org
5 https://www.youtube.com/c/AppliedLanguageTechnology

Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. spaCy: Industrial-strength Natural Language Processing in Python. DOI: 10.5281/zenodo.1212303.

Dirk Hovy and Shannon L. Spruit. 2016. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 591–598. Association for Computational Linguistics.

Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Brian Granger, Matthias Bussonnier, Jonathan Frederic, Kyle Kelley, Jessica Hamrick, Jason Grout, Sylvain Corlay, Paul Ivanov, Damián Avila, Safia Abdalla, Carol Willing, and the Jupyter development team. 2016. Jupyter Notebooks – a publishing format for reproducible computational workflows. In Positioning and Power in Academic Publishing: Players, Agents and Agendas, pages 87–90, Netherlands. IOS Press.

Jonas Kuhn. 2019. Computational text analysis within the humanities: How to combine working practices from the contributing fields? Language Resources and Evaluation, 53(4):565–602.

Karl Maton. 2014. Knowledge and Knowers: Towards a Realist Sociology of Education. Routledge, New York and London.

Wes McKinney. 2010. Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference, pages 51–56.

Dong Nguyen, Maria Liakata, Simon DeDeo, Jacob Eisenstein, David Mimno, Rebekah Tromble, and Jane Winters. 2020. How we do things with words: Analyzing text as social and cultural data. Frontiers in Artificial Intelligence, 3.

Emily Öhman. 2019. Teaching computational methods to humanities students. In Proceedings of the Digital Humanities in the Nordic Countries 4th Conference, volume 2364 of CEUR Workshop Proceedings, pages 479–494.

Project Jupyter, Matthias Bussonnier, Jessica Forde, Jeremy Freeman, Brian Granger, Tim Head, Chris Holdgraf, Kyle Kelley, Gladys Nalvarte, Andrew Osheroff, M Pacer, Yuvi Panda, Fernando Perez, Benjamin Ragan Kelley, and Carol Willing. 2018. Binder 2.0 – Reproducible, interactive, sharable environments for science at scale. In Proceedings of the 17th Python in Science Conference, pages 113–120.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 101–108. Association for Computational Linguistics.

Amir Zeldes. 2017. The GUM corpus: Creating multilayer resources in the classroom. Language Resources and Evaluation, 51(3):581–612.


Proceedings of the Fifth Workshop on Teaching NLP, pages 49–51, June 10–11, 2021. ©2021 Association for Computational Linguistics

A Crash Course on Ethics for Natural Language Processing

Annemarie Friedrich1  Torsten Zesch2

1 Bosch Center for Artificial Intelligence, Renningen, Germany
2 Language Technology Lab, University of Duisburg-Essen, Germany

[email protected]@uni-due.de

Abstract

It is generally agreed upon in the natural language processing (NLP) community that ethics should be integrated into any curriculum. Being aware of and understanding the relevant core concepts is a prerequisite for following and participating in the discourse on ethical NLP. We here present ready-made teaching material, in the form of slides and practical exercises, on ethical issues in NLP, which is primarily intended to be integrated into introductory NLP or computational linguistics courses. By making this material freely available, we aim at lowering the threshold to adding ethics to the curriculum. We hope that increased awareness will enable students to identify potentially unethical behavior.

1 Motivation and Overview

The recent sharp rise in the capabilities of natural language processing (NLP) methods has led to a wide application of language technology, influencing many aspects of our daily lives. As a result, NLP technology can have considerable real-world consequences (Vitak et al., 2016; Hovy and Spruit, 2016). While language technology has the aim of supporting humans, misuse of data or abuse of subjects, misrepresentation, or direct harm are some of numerous potentially critical problems. Hence, it is crucial for NLP researchers, developers, and deciders to be aware of the social implications of deploying a particular piece of language technology. They should be able to analyze the degree to which a setup conforms with ethical principles, which guide what is considered 'right' or 'wrong' (Deigh, 2010), and be aware of the potentially harmful side even of components developed with the aim of supporting humans.

As a consequence, it is crucial to start early and embed ethics into the NLP curriculum, to be taught along with the technological and linguistic material in an interactive manner (Bender et al., 2020).

While many teachers are very open to the topic area, as demonstrated by the lively participation in events such as the 2020 Workshop on Integrating Ethics into the NLP Curriculum (Bender et al., 2020), ethics for NLP also constitutes a very broad topic area in which even teachers might not feel knowledgeable enough. We argue that many more lecturers would include ethics if there existed freely available, ready-made teaching material that could be easily integrated into existing courses, or that could serve as a starting point for designing a more in-depth course. In this paper, we describe such material, which is primarily intended to be integrated into introductory NLP courses.

Our "crash course" does not claim exhaustiveness and aims at breadth rather than depth, intending to give a good overview of the field and highlighting potential issues. The main focus is to enable students to behave ethically in their future research and work careers, and to continue their own research and reading. To be able to participate in the discourse on ethical issues in NLP, the first step is to learn the important concepts and terminology. The material also provides numerous suggestions for in-class discussions and exercises with the aim of gaining a deeper understanding.

The material consists of extensively commented slides and suggestions for practical exercises. The comments and lists of references serve both for teacher preparation and as a script for students. The teaching material is freely available under CC-BY-SA from our website.1 Finally, both ethics and NLP are dynamic fields, which may require updating views, beliefs, opinions, and theories. We therefore welcome continuous feedback regarding, and contributions to, the teaching material.

Benefit of this Course. The main difference versus existing freely available material is that our crash course is short, self-explanatory in the context of the provided comments (such that other lecturers can easily work with it), and contains many linked exercises.

1 https://gscl.org/en/resources/ethics-crash-course


The Markkula Center for Applied Ethics at Santa Clara University offers a slide set with a crash course on ethics, along with materials for discussions focusing on ethics in the general area of technology.2 Our intention is very similar, but our focus is on NLP-specific issues. We found most existing freely available courses on ethics in NLP to go into depth and usually span an entire semester.3 In addition, we are aware of several recent tutorials at ACL venues. Most similar to our materials is the tutorial on socially responsible NLP by Tsvetkov et al. (2018), which is a condensed version of a full-semester class.4 Our course differs from this tutorial in length and by mostly concentrating on NLP-specific examples. Chang et al. (2019) focus on sub-topics of ethical NLP such as fairness and mitigating bias.5 Bender et al. (2020) have started a collection of pointers to existing materials, as well as general pointers for teaching ethics for NLP.6 This collection has been a very valuable starting point for our own research.

Ethical Considerations. Admittedly, a potential issue is that teachers could just "check off" the topic by using our material without deeper engagement, individual reading or reasoning. However, we believe that having a good starting point will actually lead to both teachers and students doing more research on this subject, and our companion material emphasizes the benefit of digging deeper.

2 Course Description

Format. The crash course consists of an interactive lecture accompanied by practical exercises. The slides are available as extensively commented Google slides, which can be easily adapted and exported into a number of common formats. Exercises consist of reading assignments, examples and case studies that serve as the basis for pair or group discussions, or essays if submitting written material is essential, e.g., for grading purposes.

Learning Goals. After the crash course, students will have acquired a basic understanding of the relevant terminology and concepts.

2 https://www.scu.edu/ethics-in-technology-practice
3 See, e.g., http://demo.clab.cs.cmu.edu/ethical_nlp2020/#readings, http://faculty.washington.edu/ebender/2017_575
4 Materials are also publicly available at https://sites.google.com/view/srnlp/.
5 http://web.cs.ucla.edu/~kwchang/talks/emnlp19-fairnlp
6 https://aclweb.org/aclwiki/Notes_on_Teaching_Ethics_in_NLP, https://aclweb.org/aclwiki/Ethics_in_NLP

They should understand that different ethical theories are at interplay with the field of NLP, which is currently developing best practices for ethical conduct and systems (Prabhumoye et al., 2019). As a result, they should be able to critically reflect on the on-going discourse in the community and, last but not least, have acquired the basis for behaving ethically in their own work. It is not our goal to provide 'ultimate' definitions of the concepts, and we strongly advise lecturers not to pose exam questions aimed at memorizing definitions. Instead, the learning goals could be tested in the form of group presentations or written essays.

The predominant principle behind the design of our crash course is activation. Besides giving clear descriptions of relevant concepts and terminology, we believe that a sensitivity for ethical issues can only be achieved by actively thinking about problems. We hence provide discussion questions for most topics covered in the crash course. We strongly recommend taking the time for short pair or small-group discussions on these questions before further discussion in the plenum.

Topics Covered. Our course aims at providing a good overview of the field, offering references as starting points for deeper research. We consider our teaching materials to be a 'living document' that will be updated or extended continuously. Topics covered in our first version include, among others, bias, fairness, privacy, and analyzing NLP use cases or methods from different ethical perspectives. For an up-to-date overview of the course content, please refer directly to the material.

3 Discussion

Our stated goal is to inform a broader public about the on-going discourse on ethics in NLP, and to educate future NLP researchers, developers and deciders about an ethical approach to NLP research and technology. We hope that our materials will be of benefit not only in university classrooms, but also in other settings such as reading groups or industrial meet-ups. We hence publish our resources under the CC-BY-SA 4.0 license,7 which, under the conditions of stating the source and redistributing under the same license, allows copying, redistributing, adapting and mixing the material in any medium or format.

7 https://creativecommons.org/licenses/by-sa/4.0/


Acknowledgments

We thank Dirk Hovy for sharing his materials on ethics and for his feedback on this crash course. We also thank Ronja Laarman-Quante, Sophie Henning, Heike Adel, Andrea Horbach, and the anonymous reviewers for their valuable feedback.

References

Emily M. Bender, Dirk Hovy, and Alexandra Schofield. 2020. Integrating ethics into the NLP curriculum. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pages 6–9, Online. Association for Computational Linguistics.

Kai-Wei Chang, Vinod Prabhakaran, and Vicente Ordonez. 2019. Bias and fairness in natural language processing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): Tutorial Abstracts, Hong Kong, China. Association for Computational Linguistics.

John Deigh. 2010. An Introduction to Ethics. Cambridge University Press.

Dirk Hovy and Shannon L. Spruit. 2016. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 591–598, Berlin, Germany. Association for Computational Linguistics.

Shrimai Prabhumoye, Elijah Mayfield, and Alan W Black. 2019. Principled frameworks for evaluating ethics in NLP systems. arXiv preprint arXiv:1906.06425.

Yulia Tsvetkov, Vinodkumar Prabhakaran, and Rob Voigt. 2018. Socially responsible NLP. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorial Abstracts, pages 24–26, New Orleans, Louisiana. Association for Computational Linguistics.

Jessica Vitak, Katie Shilton, and Zahra Ashktorab. 2016. Beyond the Belmont principles: Ethical challenges, practices, and beliefs in the online data research community. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing, CSCW '16, pages 941–953, New York, NY, USA. Association for Computing Machinery.


Proceedings of the Fifth Workshop on Teaching NLP, pages 52–54, June 10–11, 2021. ©2021 Association for Computational Linguistics

A dissemination workshop for introducing young Italian students to NLP

Lucio Messina
Independent Researcher
[email protected]

Lucia Busso
Aston University
[email protected]

Claudia Roberta Combei
University of Bologna
[email protected]

Ludovica Pannitto
University of Trento
[email protected]

Alessio Miaschi
University of Pisa
[email protected]

Gabriele Sarti
University of [email protected]

Malvina Nissim
University of [email protected]

Abstract

We describe and make available the game-based material developed for a laboratory run at several Italian science festivals to popularize NLP among young students.

1 Introduction

The present paper aims at describing in detail the teaching materials developed and used for a series of interactive dissemination workshops on NLP and computational linguistics.1 These workshops were designed and delivered by the authors on behalf of the Italian Association for Computational Linguistics (AILC, www.ai-lc.it), with the aim of popularizing Natural Language Processing (NLP) among young Italian students (13+) and the general public. The workshops were run in the context of nationwide popular science festivals and open-day events, both onsite (at BergamoScienza and at the Scuola Internazionale Superiore di Studi Avanzati [SISSA], Trieste) and online (at the Festival della Scienza di Genova, the BRIGHT European Researchers' Night, the high school ITS Tullio Buzzi in Prato, and the second edition of the Science Web Festival), engaging over 700 participants in Central and Northern Italy from 2019 to 2021.2

The core approach of the workshop remained the same throughout all the events. However, the materials and activities were adapted to a variety of formats and time slots, ranging from 30 to 90 minutes. We find that this workshop – thanks to its modular nature – can fit different target audiences and different time slots, depending on the level of interactive engagement required from participants and on the level of granularity of the presentation itself. Beyond the level of engagement expected of participants, the time required can also vary depending on the participants' background and metalinguistic awareness.

1 In this discussion, and throughout the paper, we conflate the terms Natural Language Processing and Computational Linguistics and use them interchangeably.
2 Links to the events are in the repository's README file.

Our interactive workshops took the form of modular games in which participants, guided by trained tutors, acted as if they were computers that had to recognize speech and text, as well as generate written sentences in a mysterious language they knew nothing about.

The present contribution only describes the teaching materials and provides a general outline of the activities composing the workshop. For a detailed discussion and reflection on the workshop's genesis and goals, and on how it was received by the participants, see Pannitto et al. (2021).

The teaching support consists of an interactive presentation plus hands-on material, either in hard copy or digital form. We share a sample presentation3 and an open-access repository4 containing both printable materials to download and scripts to reproduce them on different input data.

2 Workshop and materials

The activity contains both more theoretical andhands-on parts, which are cast as games.

3 https://docs.google.com/presentation/d/1ebES_K8o3I2ND_1iMyBlQmrH733eQ6m_K8a0tON8reo/edit?usp=sharing
4 https://bitbucket.org/melfnt/malvisindi


Awareness The first part consists of a brief introduction to (computational) linguistics, focusing on some common misconceptions (slides 3-5) and on examples of linguistic questions (slides 6-12). Due to their increasing popularity, we chose vocal assistants as practical examples of NLP technologies, accounting for how humans and machines differ in processing speech in particular and language in general (slides 20-39).

Games The core of the activity is inspired by the word salad puzzle (Radev and Pustejovsky, 2013) and is organized as a game revolving around a fundamental problem in NLP: given a set of words, participants are asked to determine the most likely ordering for a sentence containing those words. This is a trivial problem when approached in a known language (i.e., consider reordering the tokens garden, my, is, the, in, dog), but an apparently impossible task when semantics is not accessible, which is the most common situation for simple NLP algorithms.

To make participants deal with language as a computer would, we asked them to compose sentences using tokens obtained by transliterating and annotating 60 sentences from the well-known fairy tale "Snow White" into a set of symbols. We produced two possible versions of the masked materials: either replacing each word with a random sequence of DINGs (e.g. d©Ü= for the word morning) or replacing it with a corresponding non-word (for example croto for the word morning). The grammatical structure of each sentence is represented by horizontal lines on top of it representing phrases (such as noun or verb phrases), while the parts of speech are indicated by numbers from 0 to 9 placed as superscripts on each word (Figure 1).

Figure 1: The first sentence of the corpus "on a snowy day a queen was sewing by her window", translated using DINGs (above) and using non-words (below).

Participants were divided into two teams: one team worked on masked Italian and the other on masked English. Both teams were given the corpus in A3 format and were told that the texts were written in a fictional language.

Two activities were then run, focusing on two different algorithms for sentence generation.

Figure 2: Example cards, both showing a word. Left: a card for the first activity, with a button loop to thread it into a sentence. Right: a card for the second activity, with the part of speech (number) at the bottom.

Figure 3: Possible rules extracted from the corpus. Each rule is made of felt strips for phrases, cards with numbers for parts of speech, and "=" cards.

In the first, participants received a deck of cards, each equipped with a button loop (Figure 2) and showing a token from the corpus. Participants had to create new valid sentences by rearranging the cards according to the bigram distribution in the corpus. Using the bracelet method (slides 52-61), they could physically thread tokens into sentences.
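In computational terms, this first activity amounts to sampling from a bigram model. Here is a minimal sketch, with a toy one-sentence corpus standing in for the sixty annotated "Snow White" sentences:

    import random
    from collections import defaultdict

    def train_bigrams(sentences):
        # record, for every token, the tokens observed right after it
        follows = defaultdict(list)
        for sentence in sentences:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            for left, right in zip(tokens, tokens[1:]):
                follows[left].append(right)
        return follows

    def thread_sentence(follows, max_len=12):
        # start at the sentence boundary and keep threading attested successors
        token, out = "<s>", []
        while len(out) < max_len:
            token = random.choice(follows[token])
            if token == "</s>":
                break
            out.append(token)
        return " ".join(out)

    corpus = ["on a snowy day a queen was sewing by her window"]
    print(thread_sentence(train_bigrams(corpus)))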

In the second activity (slides 63-92), the participants extracted grammatical rules from the corpus and used them to generate new sentences. In order to write the grammar, participants were given felt strips reproducing the colors of the annotation, a deck of cards with numbers (identifying parts of speech) and a deck of "=" symbols (Figure 3). With a new deck of words (Figure 2), not all of which were present in the corpus, participants had to generate a sentence using the previously composed rules.

Reflection and Outlook By superimposing a plexiglass frame on the A3 corpus pages (Figure 4), the true nature of the corpora was eventually revealed. The participants could see the original texts (in Italian and English) and translate the sentences they had created previously.

The activity ended with a discussion of recent NLP technologies and their commercial applications (slides 93-96), and of what it takes to become a computational linguist today (slides 97-99).


Figure 4: A session of the workshop: the original language of the corpus has just been revealed by superimposing plexiglass supports on the corpus tables.

3 Activity Preparation

The preparation of the activity consists of several steps: (1) creating and tagging the corpora with morpho-syntactic and syntactic categories, as described in the repository; (2) choosing the words to include in the card decks: these must be manually selected, but scripts are provided to generate possible sentences based on bigram co-occurrences and to extract all the possible grammar rules present in the annotation; (3) when the produced sentences and grammar are satisfactory, scripts are provided to generate (i) the printable formats of corpora and decks of cards, (ii) a dictionary to support the translation of sentences in the last part of the workshop, and (iii) clear-text corpora; (4) sentences from the clear-text corpora have to be manually cut and glued on a transparent support that can be superimposed on the printed corpora to reveal the sentences; (5) finally, some manual work is necessary: producing strips of felt or any material with the same colors used in the corpus, cutting threads, attaching a button loop to the relevant cards, etc.

4 Reusability

In the spirit of open science and to encourage the popularization of NLP, the teaching materials and source code are freely available in our repository (see footnotes 3-4 for the links). The print-ready material is released under CC BY-NC; the source code is distributed under the GNU GPL license version 3. All scripts work with Python versions 3.6 or above, and the overall process requires python3, lualatex, pdftk and pdfnup, as detailed in the README.md file in the repository, which contains all necessary instructions.

Acknowledgements

We would like to thank the board of the Italian Association for Computational Linguistics (AILC) for the support given to the workshop development and delivery. We are also grateful to BergamoScienza, the Festival della Scienza di Genova, and the Science Web Festival for hosting the activity during the festivals; to ILC-CNR "Antonio Zampolli" and ColingLab (University of Pisa) for hosting us during the European Night of Research; and to the Scuola Internazionale Superiore di Studi Avanzati (SISSA) and ITI Tullio Buzzi for hosting our activities with their students. We also thank Dr. Mirko Lai, who collaborated on the development of the web interface for the online versions of our activity.

References

Ludovica Pannitto, Lucia Busso, Claudia Roberta Combei, Lucio Messina, Alessio Miaschi, Gabriele Sarti, and Malvina Nissim. 2021. Teaching NLP with bracelets and restaurant menus: An interactive workshop for Italian students. In Proceedings of the Fifth Workshop on Teaching NLP and CL, Online event. Association for Computational Linguistics.

Dragomir Radev and James Pustejovsky. 2013. Puzzles in Logic, Languages and Computation: The Red Book. Springer.



MiniVQA - A resource to build your tailored VQA competition

Jean-Benoit Delbrouck
Stanford University

[email protected]

Abstract

MiniVQA1 is a Jupyter notebook to build a tailored VQA competition for your students. The resource creates everything needed to run a classroom competition that engages and inspires your students on the free, self-service Kaggle platform. "InClass competitions make machine learning fun!"2

1 Task overview

Beyond simply recognizing what objects are present, vision-and-language tasks, such as captioning (Chen et al., 2015) or visual question answering (Antol et al., 2015), challenge systems to understand a wide range of detailed semantics of an image, including objects, attributes, spatial relationships, actions and intentions, and how all these concepts are referred to and grounded in natural language. VQA is therefore viewed as a suitable way to evaluate a system's reading comprehension.

"Since questions can be devised to query any aspect of text comprehension, the ability to answer questions is the strongest possible demonstration of understanding" (Lehnert, 1977)

2 To teach NLP

A trained model cannot perform satisfactorily on the Visual Question Answering task without language understanding abilities. A visual-only system (i.e., one processing only the image) will not have the reasoning skills required to provide correct answers, as it does not process the questions as input. Two NLP challenges arise when building a multimodal model:

• The questions must be processed by the system as input. This is the opportunity to teach different feature extraction techniques for sentences. The techniques can range from unsupervised methods (bag of words, tf-idf, or Word2Vec/Doc2Vec (Mikolov et al., 2013) models) to supervised methods (Recurrent Neural Network (Hochreiter and Schmidhuber, 1997) or Transformer (Vaswani et al., 2017) networks).

1https://github.com/jbdel/miniVQA
2https://www.kaggle.com/

• The extracted linguistic features must be incorporated into the visual processing pipeline. This challenge lies at the intersection of vision and language research. Different approaches, such as early and late fusion techniques, can be introduced.

All implemented techniques can be evaluated on the difficult task that is VQA. Because answering questions about a visual context is a hard cognitive task, using different NLP approaches can lead to significant performance differences.
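As a concrete starting point at the unsupervised end of this spectrum, questions can be vectorized with tf-idf in a few lines. This is illustrative classroom code with made-up example questions, not part of the MiniVQA notebooks themselves.

```python
# Illustrative classroom code: tf-idf features for VQA questions with
# scikit-learn. The example questions are made up; MiniVQA itself does
# not ship this snippet.
from sklearn.feature_extraction.text import TfidfVectorizer

questions = [
    "what color is the umbrella",
    "how many people are in the photo",
    "is the umbrella open",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(questions)  # sparse matrix: questions x vocabulary
print(X.shape)
```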

3 Resource overview

The source code is split into several sections, each section containing tunable parameters to modulate the dataset to match your needs. For example, you can choose the number of possible answers, the sample size per answer, or the balance of labels between splits.

MiniVQA is built upon two datasets: the VQA v2 dataset (Goyal et al., 2017) and the VQA-Med dataset (Ben Abacha et al., 2020). You can choose to create a competition based on natural images or medical images. MiniVQA proposes 443k questions on 82.7k images and 4,547 unique answers for the former, and 7.4k unique image-question pairs and 332 answers for the latter.3

Finally, MiniVQA provides a second Jupyter notebook that trains and evaluates a baseline VQA model on any dataset you create. You are free to share it or not amongst your class.

3As of 2021

4 Resource presentation

The following sections present the features of MiniVQA. We illustrate these features using the VQA v2.0 dataset by way of example; the same can be applied to the VQA-Med dataset.

4.1 Automatic download

The resource automatically downloads annotations (questions, image_ids, and answers) from the official dataset websites. Any pre-processing that must be done is carried out. The numbers of possible questions and unique answers are printed to the user, along with random examples.

4.2 Decide the volume of your dataset

Using MiniVQA, you can choose the size of your dataset according to several settings.

num_answers the number of possible different answers (i.e., how many classes).

sampling_type how to select samples; choose between "top" and "random". Value "top" gets the num_answers most common answers in the dataset.

sampling_exclude_top and sampling_exclude_bottom you can choose to exclude the n most popular or least popular answers (the most popular answer is "no", with 80,000 examples).

min_samples and max_samples if sampling_type is "random", you can choose a minimum and maximum number of samples.

This section outputs a bar graph containing the label distributions and some additional information. Figure 1 shows two examples.

Figure 1: sampling_type random (left) and sampling_type top (right).
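The "top" mode can be pictured with a few lines of Python. The parameter names below follow the notebook; the implementation and the sample format are a plausible reconstruction, not the notebook's actual code.

```python
# Plausible reconstruction of the "top" sampling_type, not the
# notebook's actual code. Each sample is assumed to be a dict with an
# "answer" key.
from collections import Counter

def select_top(samples, num_answers, sampling_exclude_top=0):
    counts = Counter(s["answer"] for s in samples)
    ranked = [answer for answer, _ in counts.most_common()]
    # Skip the n most popular answers, then keep the next num_answers.
    keep = set(ranked[sampling_exclude_top:sampling_exclude_top + num_answers])
    return [s for s in samples if s["answer"] in keep]
```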

4.3 Create dataset files

This section writes the chosen samples in a JSON format. Two more parameters are available.

sample_clipping set to select at most n samples per answer. This setting is particularly handy if you chose the "top" sampling_type in the previous section.

im_download you can choose to download the images of the selected samples directly through HTTP requests. Though rather slow, this allows the user not to download the images of the full dataset.

resize if im_download is set to True, images are square-resized to n pixels. For your mini-VQA project, you might want to use a lower resolution for your images (faster training); n = 128 is a good choice.

4.4 Create splits

As in any competition, participants are provided a train and validation set with ground-truth answers, and a test set without these answers.

train_size and valid_size fractions of the total examples selected to populate the splits. 0.8 and 0.1 are usually good values; the rest (0.1) goes in the test set.

balanced whether or not labels are homogeneously distributed across splits.

Figure 2 shows an example of the effect of the balanced setting:


Figure 2: balanced set to True (left) and to False (right).

Regardless of the balanced value, at least one sample of each label is put in each split.
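A balanced split can be approximated with stratified sampling. The following is a minimal sketch, assuming scikit-learn and the parameter names above; the notebook's own implementation may differ.

```python
# Minimal sketch of a balanced split via stratification, assuming
# scikit-learn; the notebook's own implementation may differ.
from sklearn.model_selection import train_test_split

def make_splits(samples, labels, train_size=0.8, valid_size=0.1):
    rest = 1.0 - train_size                  # fraction left for valid + test
    X_train, X_rest, y_train, y_rest = train_test_split(
        samples, labels, train_size=train_size, stratify=labels)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, train_size=valid_size / rest, stratify=y_rest)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```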

4.5 Explore the question embedding space

It is possible to compute question embeddings using pre-trained transformer models. Each representation is then reduced to a two-dimensional point using the t-SNE algorithm (van der Maaten and Hinton, 2008). These embeddings can also be used for section 5. MiniVQA plots two projections, one for the 5 most popular question types and one with randomly chosen question types.

Figure 3: Question embeddings using a pretrained bert-base-nli-mean-tokens model.
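The projection step can be reconstructed roughly as follows; the model name comes from Figure 3, while the surrounding calls (sentence-transformers plus scikit-learn's TSNE) and the toy questions are assumptions about the implementation.

```python
# Assumed reconstruction of the projection step: embed questions with a
# pre-trained sentence transformer (model name from Figure 3) and reduce
# them to 2-D with t-SNE.
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

questions = [
    "what color is the car",
    "how many dogs are there",
    "is it raining",
]
model = SentenceTransformer("bert-base-nli-mean-tokens")
embeddings = model.encode(questions)                  # (n_questions, 768)
points = TSNE(n_components=2, perplexity=2).fit_transform(embeddings)
```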

4.6 Download files

Finally, you can download the dataset file in JSON, the splits, and, optionally, the images.

{train, val, sample_submission}.csv csv files containing question_id, label.

test.csv must be given to students. They must fill it with their systems' predictions, formatted like sample_submission.csv, which contains random predictions.

answer_key.csv is the ground-truth file that has to be stored on Kaggle (see Appendix A).

answer_list.csv maps the label to the answer in natural language (i.e., label 0 is the answer at line 1 in answer_list, etc.).

image_question.json maps an image_id to a list of questions (that concern image_id). Each question is a tuple (question_id, question).

5 Baseline Model

The provided baseline consists of a dataloader that opens images, resizes them to 112 × 112 pixels, and embeds questions into a feature vector of size 768 from a pre-trained DistilBERT (Sanh et al., 2019).

The network consists of a modified ResNet (He et al., 2016) that takes as input an RGB image of size 112 × 112 and outputs a feature map of size 512 × 4 × 4, which is flattened and then concatenated with the question representation of size 768. Finally, a classification layer projects this representation to probabilities over answers. The network is trained until no improvement is recorded on the validation set (early stopping).
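For readers who want a concrete picture, the following is a simplified PyTorch sketch of the described fusion architecture. The convolutional stand-in for the modified ResNet is an assumption, and the notebook's actual baseline may differ.

```python
# Simplified PyTorch sketch of the described fusion baseline. The CNN
# stand-in is an assumption; the notebook's actual modified ResNet and
# training code may differ.
import torch
import torch.nn as nn

class VQABaseline(nn.Module):
    def __init__(self, num_answers, q_dim=768):
        super().__init__()
        # Stand-in for the modified ResNet yielding a 512 x 4 x 4 map.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 512, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Linear(512 * 4 * 4 + q_dim, num_answers)

    def forward(self, image, question_vec):
        feat = self.cnn(image).flatten(start_dim=1)     # (B, 512*4*4)
        fused = torch.cat([feat, question_vec], dim=1)  # late fusion
        return self.classifier(fused)                   # logits over answers

model = VQABaseline(num_answers=332)
logits = model(torch.randn(2, 3, 112, 112), torch.randn(2, 768))
```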

References

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual Question Answering. In International Conference on Computer Vision (ICCV).

Asma Ben Abacha, Vivek V. Datla, Sadid A. Hasan, Dina Demner-Fushman, and Henning Müller. 2020. Overview of the VQA-Med task at ImageCLEF 2020: Visual question answering and generation in the medical domain. In CLEF 2020 Working Notes, CEUR Workshop Proceedings, Thessaloniki, Greece. CEUR-WS.org.

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR).

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Wendy Lehnert. 1977. Human and computational question answering. Cognitive Science, 1(1):47–73.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

A Create a competition on Kaggle

Navigate to http://www.kaggle.com/inclass. Follow the instructions to set up an InClass competition. Upload files train.csv, val.csv, test.csv, answer_key.csv, and sample_submission.csv when prompted.

B image_question.json structure

Figure 4: Maps an image_id to a list of questions (that concern image_id). Each question is a tuple (question_id, question).



From back to the roots into the gated woods: Deep learning for NLP

Barbara Plank
IT University of Copenhagen
[email protected]

Abstract

Deep neural networks have revolutionized many fields, including Natural Language Processing. This paper outlines teaching materials for an introductory lecture on deep learning in Natural Language Processing (NLP). The main submitted material covers a summer school lecture on encoder-decoder models. Complementary to this is a set of Jupyter notebook slides from earlier teaching, on which parts of the lecture were based. The main goal of this teaching material is to provide an overview of neural network approaches to natural language processing, while linking modern concepts back to the roots and showing their traditional essential counterparts. The lecture departs from count-based statistical methods and spans up to gated recurrent networks and attention, which is ubiquitous in today's NLP.

1 Introduction

In 2015, the "deep learning tsunami" hit our field (Manning, 2015). In 2011, neural networks were not really "a thing" yet in NLP; they were hardly even taught. I remember when in 2011 Andrew Ng introduced the free online course on machine learning, which soon after led Daphne Koller and him to start massive online education: "Many scientists think the best way to make progress [..] is through learning algorithms called neural networks".1 Ten years later, it is hard to imagine any NLP education not touching upon neural networks and, in particular, representation learning. While neural networks have undoubtedly pushed the field forward, they have led to a more homogenized field; some lecturers even question whether to include pre-neural methods. I believe it is essential to teach statistical foundations besides modern DL, i.e., to go back to the roots, to be better equipped for the future.2

1Accessed March 15, 2021: https://bit.ly/2OZH2MZ

2Disclaimer: This submitted teaching material is limited as it spans a single lecture. However, I believe it to be essential to teach traditional methods besides modern neural approaches (e.g., count-based vs prediction-based word representations; statistical n-grams vs neural LMs; naive Bayes and logistic regression vs neural classifiers, to name a few examples).

2 Structure of the summer school lecture

This paper outlines teaching material (Keynote slides, Jupyter notebooks), which I would like to provide to the community for reuse when teaching an introductory course on deep learning for NLP.3

The main material outlined in this paper is a Keynote presentation slide deck for a 3-hour lecture on encoder-decoder models. An overview of the lecture (in non-temporal form), from foundations, to representations, to neural networks (NNs) beyond vanilla NNs, is given in Figure 1. Moreover, I provide complementary teaching material in the form of Jupyter notebooks (convertible to slides). I have used these notebooks (slides and exercises) during earlier teaching, and they in part formed the basis of the summer school lecture (see Section 3).

Figure 1: Overview of the core concepts covered.

The lecture covers NLP methods broadly, from introductory concepts of more traditional NLP methods to recent deep learning methods. A particular focus is to contrast traditional approaches from the statistical NLP literature (sparse representations, n-grams) with their deep learning-based counterparts (dense representations, 'soft' n-grams in CNNs, and RNN-based encoders). Consequently, the following topics are covered in the lecture:


3Available at: https://github.com/bplank/teaching-dl4nlp



• Introduction to NLP and DL (deep learning), what makes language so difficult; traditional versus neural approaches

• N-gram Language Models

• Feedforward Neural Networks (FFNNs)

• What's the input? Sparse traditional vs dense representations; bag of words (BOW) vs continuous BOW (CBOW)

• Neural Language Models (LMs)

• Convolutional Neural Networks (CNNs) for Text Classification ('soft n-grams')

• Recurrent Neural Networks (RNNs) for Text Processing, RNNs as LMs

• Bidirectional RNNs, stacking, character representations

• Gated RNNs (GRU, LSTMs)

• Deep contextualized embeddings (embeddings as LMs, ELMo)

• Attention

The lecture was held as a 3-hour lecture at the first Athens NLP (AthNLP) summer school in 2019. In the overall schedule of the summer school, this lecture was the 3rd in a row of six. It was scheduled after an introduction to classification (by Ryan McDonald) and a lecture on structured prediction (by Xavier Carreras). The outlined encoder-decoder lecture was then followed by a lecture on machine translation (by Arianna Bisazza; that lecture built upon this one and included the transformer), machine reading (by Sebastian Riedel), and dialogue (by Vivian Chen).4 Each lecture was enhanced by unified lab exercises.5

4Videos of all lectures are available at: http://athnlp2019.iit.demokritos.gr/schedule/index.html

5Exercises were kindly provided as a unified framework by the AthNLP team, and hence are not provided here. They included: lab 1: Part-of-Speech tagging with the perceptron; lab 2: POS tagging with the structured perceptron; lab 3: neural encoding for text classification; lab 4: neural language modeling; lab 5: machine translation; lab 6: question answering.

As key textbook references, I would like to refer the reader to chapters 3-9 of Jurafsky and Martin's 3rd edition (under development) (Jurafsky and Martin, 2020),6 and Yoav Goldberg's NLP primer (Goldberg, 2015). Besides these textbooks, key papers include Kim (2014) for CNNs on texts, attention (Luong et al., 2015), and ELMo (Peters et al., 2018).

3 Complementary notebooks of earlier material

This lecture evolved from a series of lectures given earlier, amongst which a short course given in Malta in 2019 and an MSc-level course I taught at the University of Groningen (Language Technology Project). To complement the Keynote slides of the summer school lecture provided here, earlier Jupyter notebooks can be found at the website. These cover a subset of the material above.

4 Conclusions

This short paper outlines teaching material for an introductory lecture on deep learning for NLP. By releasing this teaching material, I hope to contribute material that fellow researchers find useful when teaching introductory courses on DL for NLP. For comments and suggestions to improve upon this material, please reach out to me.

Acknowledgements

I would like to thank Arianna Bisazza for many fruitful discussions in preparing this summer school lecture, and the AthNLP 2019 organizers for the invitation, with special thanks to Andreas Vlachos, Yannis Konstas, and Ion Androutsopoulos. I would like to thank the many colleagues who inspired me throughout the years in so many ways, including those who provide excellent teaching material online. Thanks goes to Abigail See, Anders Johannsen, Alexander Koller, Chris Manning, Dan Jurafsky, Dirk Hovy, Graham Neubig, Greg Durrett, Malvina Nissim, Richard Johansson, Ryan McDonald, and Philipp Koehn. Special thanks to those who inspired me in one way or another for teaching the beauty and challenges of computational linguistics, NLP, or deep learning (in chronological order): Raffaella Bernardi, Reut Tsarfaty and Khalil Sima'an, Gertjan van Noord, Ryan McDonald, and Yoav Goldberg.

6https://web.stanford.edu/~jurafsky/slp3/


References

Yoav Goldberg. 2015. A primer on neural network models for natural language processing.

Dan Jurafsky and James H. Martin. 2020. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River, N.J.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar. Association for Computational Linguistics.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.

Christopher D. Manning. 2015. Computational linguistics and deep learning. Computational Linguistics, 41(4):701–707.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.



Learning PyTorch Through A Neural Dependency Parsing Exercise

David Jurgens
School of Information
University of Michigan
[email protected]

Abstract

Dependency parsing is an increasingly popular parsing formalism in practice. This assignment provides a practice exercise in implementing the shift-reduce dependency parser of Chen and Manning (2014). This parser is a two-layer feed-forward neural network, which students implement in PyTorch, providing practice in developing deep learning models and exposure to developing parser models.

1 Introduction

Deep learning methods are ubiquitous in nearly all areas of NLP. However, some applications of these models require extensive training or data that make implementing the model in a classroom infeasible without additional computational support. This homework introduces a simple-yet-powerful network for performing dependency parsing using the shift-reduce parser of Chen and Manning (2014). This network is efficient to train, making it suitable for development by all students, exposes students to dependency parsing, and helps connect how a neural network can be used in practice. Through the assignment, students implement the network and the training procedure, and explore how the parser operates.

2 Design and Learning Goals

This exercise is designed for an advanced undergraduate or graduate-level course in NLP. The material is appropriate for students who have previously covered simple neural networks and are learning about dependency parsing. The exercise contains extensive scaffolding code that simplifies the training process by turning the CoNLL treebank data (Nivre et al., 2007) into training examples for the students and computing the unlabeled attachment score (UAS) for evaluating the model. The technical depth of the material is likely too involved for an Applied NLP or Linguistics-focused setting; the material could be re-purposed for an early exercise in a machine-learning-focused course with the addition of more content on network design. Students typically have three weeks to complete the assignment, with the majority finishing the main tasks within a week. Through doing the exercise, students practice building neural models and understanding how the core training procedure is implemented in PyTorch (Paszke et al., 2019), e.g., what a loss function and an optimizer are, which enables them to design, extend, or modify a variety of PyTorch implementations for later assignments and course projects.

The assignment has the following three broad learning objectives. First, the exercise is an introduction to developing neural models using the PyTorch library. The relatively simple nature of the network reduces the scope of design (e.g., compared with implementing an RNN or LSTM), which allows students to understand how to construct a basic network and use layers, embeddings, and loss functions. Further, because the model can be trained efficiently, students can complete the entire assignment on a laptop that is several years old, which reduces the overhead of needing students to gain access to GPUs or more advanced computing. Through the exercise, students experiment with changing different network hyperparameters and designs and measuring their effect on performance. These simple modifications allow students to build confidence and gain intuition on which kinds of modifications may improve performance, or dramatically slow training time.

Second, students gain familiarity with how to use pre-trained embeddings in downstream applications. The dependency parser makes use of these embeddings for its initial word representations, and the assignment provides an optional exercise to have students try embeddings from different sources (e.g., Twitter-based embeddings) or no pre-training at all, in order to see how these affect performance and convergence time. This learning objective helps bridge the conceptual material to later pre-trained language models like BERT if they have not been introduced earlier.

Third, students should gain a basic familiarity with dependency parsing and how a shift-reduce parser works. Shift-reduce parsing is a classic technique (Aho and Ullman, 1973) and has been widely adopted for multiple parsing tasks beyond syntax, such as semantic parsing (Misra and Artzi, 2016) or discourse parsing (Ji and Eisenstein, 2014). This assignment helps students understand the basic structures for parsing (e.g., the stack and buffer) and see how neural approaches can be used in practice. The concepts in the homework are connected to textbook material in Chapter 14 of Speech & Language Processing (Jurafsky and Martin, 2021, 3rd ed.), which provides additional examples and definitions.

3 Homework Description

This homework has students implement two key components of the Chen and Manning (2014) parser in PyTorch, using extensive scaffolding code to handle the parsing preparation. First, students implement the feed-forward neural network, which requires using three common elements of neural networks in NLP: multiple layers, embeddings, and a custom activation function. These elements provide conceptual scaffolding for later implementations of attention and recurrent layers in networks. Second, students implement the main training loop, which requires students to understand how the loss is computed and gradient descent is applied. This second step is found in nearly all code using networks built from scratch (or built upon pre-trained parameters) and enables students to learn this process in a simplified setting for use later. These two components are broken into multiple discrete steps to help students figure out where to start in the code.
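As a concrete picture of what students produce, the sketch below pairs a Chen-and-Manning-style network (embeddings, one hidden layer with the cube activation, a transition classifier) with the core of a training loop. Dimensions, names, and the random toy batch are illustrative; this is not the assignment's actual scaffolding.

```python
# Illustrative sketch, not the assignment's scaffolding: a Chen-and-
# Manning-style network plus the core training step. Dimensions follow
# the original paper; the batch here is random toy data.
import torch
import torch.nn as nn

class ParserNet(nn.Module):
    def __init__(self, vocab_size, embed_dim=50, n_features=48,
                 hidden_dim=200, n_transitions=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(n_features * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, n_transitions)

    def forward(self, feature_ids):              # (B, n_features)
        x = self.embed(feature_ids).flatten(1)   # concatenated embeddings
        h = self.hidden(x) ** 3                  # the cube activation
        return self.out(h)                       # transition logits

model = ParserNet(vocab_size=10000)
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()

feats = torch.randint(0, 10000, (32, 48))        # toy feature batch
gold = torch.randint(0, 3, (32,))                # toy gold transitions
optimizer.zero_grad()
loss = loss_fn(model(feats), gold)
loss.backward()                                  # backprop, then update
optimizer.step()
```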

The second part of the homework has students explore the parser in two ways. First, students examine the parsing data structures and report the full parse of a sentence of their choice; this exploration has students consider the steps required for a successful parse. As a part of this exploration, students are required to report a parse that is incorrect; this latter task requires students to understand what a correct dependency parse is and to diagnose what steps the parser has taken to introduce the error.

In some iterations, we have asked the students to report on an error introduced from a non-projective parse of their choice, though some students found it difficult to come up with an example of their own. Second, students are asked to extend the network in some way of their choice (e.g., add a layer or add dropout). This extension helps introduce additional design components and build intuition.

4 Potential Extensions

Prior parts of the course examine word vectors, and this parsing exercise includes an optional extension to allow students to see their effect in practice. Here, we provide students with multiple pre-trained vectors from different domains (e.g., Twitter and Wikipedia) and ask them to report on convergence times and accuracy. Students may also optionally freeze these embeddings to test how well their information can generalize. Since training data are known to contain biases that affect downstream performance (e.g., Garimella et al., 2019), one additional extension could be to test how particular embeddings perform better or worse on parsing text from specific groups.

After implementing the model, the assignment has students explore the parsing outputs and intermediate state, which provides some grounding for how shift-reduce works in practice. However, due to the focus on learning PyTorch, the parsing component of the exercise is less in-depth; a more parsing-focused variant of this assignment could have students perform the CoNLL-to-training-data conversion in order to see how dependency trees can be turned into a sequence of shift-reduce operations and how such a process introduces errors for non-projective parses.

5 Reflection on Student Experiences

In two iterations of the homework, students have reported the exercise helped demystify deep learning and make the concepts accessible. Once the model was implemented and working, some students were surprisingly active in trying different model designs; these students reported feeling excited that making these simple extensions could be so easy, which encouraged them to try implementing deep learning models in their course projects.

The initial version of this homework did not include any parsing introspection. Students expressed feeling like the homework was more about building a network than learning about parsing. As a result, the second iteration added more diagnostics for students to see what the parser is doing and the exploration component. This addition was intentionally simple to avoid substantially expanding the scope of the assignment. However, adding more parsing-oriented tasks remains an active area of development for this homework.

References

Alfred V. Aho and Jeffrey D. Ullman. 1973. The Theory of Parsing, Translation, and Compiling, volume 1. Prentice-Hall, Englewood Cliffs, NJ.

Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750.

Aparna Garimella, Carmen Banea, Dirk Hovy, and Rada Mihalcea. 2019. Women's syntactic resilience and men's grammatical luck: Gender-bias in part-of-speech tagging and dependency parsing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3493–3498.

Yangfeng Ji and Jacob Eisenstein. 2014. Representation learning for text-level discourse parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13–24.

Dan Jurafsky and James H. Martin. 2021. Speech & Language Processing, 3rd edition. Prentice Hall.

Dipendra Misra and Yoav Artzi. 2016. Neural shift-reduce CCG semantic parsing. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1775–1786.

Joakim Nivre, Johan Hall, Sandra Kübler, Ryan McDonald, Jens Nilsson, Sebastian Riedel, and Deniz Yuret. 2007. The CoNLL 2007 shared task on dependency parsing. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 915–932.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703.



A Balanced and Broadly Targeted Computational Linguistics Curriculum

Emma Manning, Nathan Schneider, Amir Zeldes
Georgetown University

{esm76, nathan.schneider, amir.zeldes}@georgetown.edu

Abstract

This paper describes the primarily-graduate computational linguistics and NLP curriculum at Georgetown University, a U.S. research university that has seen significant growth in these areas in recent years. We discuss the principles behind our curriculum choices, including recognizing the various academic backgrounds and goals of our students; teaching a variety of skills with an emphasis on working directly with data; encouraging collaboration and interdisciplinary work; and including languages beyond English. We reflect on challenges we have encountered, such as the difficulty of teaching programming skills alongside NLP fundamentals, and discuss areas for future growth.

1 Introduction

This paper describes the computational linguistics and NLP curriculum at Georgetown University, a private research university whose computational linguistics program has grown significantly over the past 7 years. This curriculum has been developed with a high degree of collaboration between the Linguistics and Computer Science departments, as well as involvement with related programs such as the Master's in Data Science and Analytics. We reflect on several principles that underlie our curriculum, and discuss opportunities for further expansion.

2 Course Offerings

Table 1 summarizes our main graduate-level courses focusing on computational linguistics and NLP.1 These include:

1All are standard fall or spring semester courses for 3 credits, with 150 minutes of instruction time per week. The total number of 3-credit courses a student takes varies considerably by graduate program: 8–10 for the CS MS; 10 for the CS Ph.D.; 8–12 for the Linguistics MS; and 18 for the Linguistics Ph.D. This includes departmental core requirements and other course options beyond computational linguistics and NLP.

• A 3-course NLP sequence for novice programmers, which can be shortened to 1 or 2 courses for students proficient in Python (discussed in §3).

• A group of courses targeting methods for computational linguistics research: corpus design and use (§5), statistical analysis with R, and machine learning techniques.

• A selection of application-oriented speech and language technology courses encompassing speech processing, dialogue systems, and machine translation.

• Special topics courses addressing issues such as social factors and ethics in NLP, discourse parsing, grammar formalisms, and meaning representation design and parsing. These tend to be reading- and research-oriented courses, whereas the other courses place more emphasis on implementation and theory learning.

Advanced students of NLP can also take a number of related courses in the CS and Analytics departments on topics like information retrieval and machine learning for general (and not only language-oriented) purposes.2

Syllabi for courses taught by Nathan Schneider (instructor S in the table), including detailed schedules with course materials, can be found via http://people.cs.georgetown.edu/nschneid/teaching.html. Recent syllabi for courses taught by Amir Zeldes (instructor Z) can be found at https://corpling.uis.georgetown.edu/amir/pdf/syllabi/.

3 Interdisciplinarity

Our students include many from both the Linguistics and Computer Science departments, as well as some from other programs, such as Data Science & Analytics and language departments. We have developed a sequence of NLP courses designed

2A full list of CL-relevant courses is described at: http://gucl.georgetown.edu/gu-cl-curriculum.pdf


Course | Target audience | Frequency | Instructor

NLP
  Intro NLP (INLP) | any except CS | Annual | Z
  Advanced Python for CL | Ling+Analytics | Annual | A
  Empirical Methods in NLP (ENLP) | Ling+CS | Annual | S

CL METHODS
  Computational Corpus Linguistics | any | Annual | Z
  Analyzing Language Data with R | Ling | 2 Years | Z
  Machine Learning for Linguistics | Ling | 2 Years | Z

APPLICATIONS
  Speech Processing | Ling | 2 Years | A
  Dialogue Systems | any | 2 Years | A
  Statistical/Neural Machine Translation | any | 2 Years | A

SPECIAL TOPICS
  Social Factors in CL/AI | any | 2 Years | A
  Discourse Modeling | Ling+CS | 2 Years | Z
  Grammar Formalisms | Ling | 3–4 Years | P
  Meaning Representations | Ling+CS | 2 Years | S

Table 1: Courses oriented specifically at computational linguistics or NLP and targeting graduate students (many are also open to undergraduates in their third and fourth years). The first group is the main NLP sequence that includes Python programming and fundamental algorithms, representations, and tasks; fluent Python programmers can start with ENLP. The second group focuses on computational linguistic methods. The third group focuses on application areas and associated tools. The last group consists of special topics. Instructors: Courses designated S or Z are taught by dedicated computational linguistics faculty, Nathan Schneider and Amir Zeldes. Grammar Formalisms (P) is taught by Paul Portner, a Linguistics professor. Other courses, designated A, are taught by adjunct professors (different for each course).

to accommodate these various backgrounds. Linguistics students with little or no prior programming experience are introduced to basic Python and NLP foundations in an Introduction to NLP (INLP) course;3 they can then further develop their programming skills with Computational Linguistics with Advanced Python before taking Empirical Methods in NLP (ENLP). Students who already have strong programming skills, such as Computer Science graduate students, can begin their NLP journey in this same ENLP course, which has projects emphasizing collaboration between students of different backgrounds;4 as discussed in Fosler-Lussier (2008), cross-disciplinary collaborations are helpful to establish respect between students from different fields and mitigate the challenges of disparate backgrounds. Many other NLP courses, such as those focusing on Dialogue Systems and Machine Translation, are also cross-listed between the Linguistics and CS departments, which, as noted in e.g. Baldridge and Erk (2008), helps these courses reach a wider audience.

3Introducing these together allows linguistics students who are unsure how interested they are in NLP to get a taste of it in just one class, without requiring them to spend time on a non-language-related programming class first.

4Depending on class makeup, there are sometimes requirements for the composition of project groups to enforce this, e.g. that each group needs to contain at least one linguist.

Teaching NLP concepts alongside basic programming skills has been a significant challenge. INLP requires no prior programming experience, but students who enter the course with none sometimes struggle to grasp programming concepts at the speed they are taught, and many students rely on significant support from teaching assistants to successfully complete the course's programming assignments. Our experience has taught us that frequent contact and check-ins initiated by teaching assistants are very important for catching students who may fall behind before assignment submissions make problems more obvious. IDEs with syntax validation and auto-complete facilities, which are freely available for academic purposes, are also very useful in this respect, and in recent years students have used PyCharm (https://www.jetbrains.com/pycharm/) as their first Python IDE for this purpose.

Previously, linguistics students who completed INLP were encouraged to enroll in ENLP immediately afterward. However, we found that INLP alone did not adequately prepare students for the more advanced programming assignments in ENLP: INLP assignments tend to involve making fairly limited modifications to provided starter code, while ENLP expects independent implementation of more substantial algorithms. Thus, the Advanced Python course was introduced to give students more practice implementing algorithms for linguistic tasks as code. This bridges the gap between the introductory and more advanced NLP courses; however, it does mean that linguistics students who enter the program with little or no programming experience may need to take a sequence of 3 courses to gain a thorough understanding of NLP fundamentals, while students with a CS background only need to take one course. ENLP's Assignment 0 is a diagnostic of Python proficiency to help students choose the appropriate course level.5

4 Balancing Skills Taught

Along with coming from different academic backgrounds, we acknowledge that students studying NLP have a variety of goals: for example, they may wish to pursue NLP in academia or industry, or they may be interested in using computational methods for linguistics or other Digital Humanities or Social Science fields. To support these varying goals, we endeavor to teach a balance of different skills and perspectives on NLP. While some courses emphasize algorithms, others focus more on computational representations of language, on creating and using resources such as corpora, or on using existing NLP tools. We are also careful to consider that not all NLP applications are realized in a Big Data context, and we therefore include units targeting low-resource settings across our course offerings.

5 Focus on Data

In all courses, we emphasize working directly with language data. This is perhaps best exemplified in the Computational Corpus Linguistics course, which teaches corpus design and construction methods along with analytical Corpus Linguistics methodology and relevant readings on data and its potential pitfalls. As part of its assignment structure, the course integrates a set of five annotation projects in which each student chooses a single document from a selection of genres to annotate throughout the semester with a variety of annotation layers, including structural markup (which teaches XML basics), part-of-speech tagging, dependency treebanking, entity and coreference resolution, and finally discourse parsing. A unit on inter-annotator agreement evaluates and compares

5http://people.cs.georgetown.edu/cosc572/s21/a0/

the students' own work, underscoring the subjective nature of 'ground-truth' data, the range of linguistic variation across genres, and the importance of consistency and validation. At its end, the course engages students in a 'real-world' research project, which produces valuable linguistically annotated data that can be released to the research community under an open license as part of the Georgetown University Multilayer (GUM) Corpus (Zeldes, 2017) if students so wish.6

Several other courses also include practice annotation of data, including a POS tagging in-class exercise in ENLP, and annotation in three different semantic representations in Meaning Representations. Others include error analysis of NLP systems, such as a comparison of the output from statistical and neural translation systems in Machine Translation. The Discourse Modeling course teaches discourse parsing frameworks and algorithms, including introducing students to topics in annotating Rhetorical Structure Theory (Mann and Thompson, 1988), Segmented Discourse Representation Theory (SDRT; Asher and Lascarides, 2003), and the Penn Discourse Treebank framework (PDTB; Prasad et al., 2014).

6 Collaboration

Our coursework emphasizes frequent collaboration among students. This includes in-class group activities, such as practicing part-of-speech tagging in small groups in ENLP, or working together as a class to create a morphological analyzer for a low-resource language in INLP (an activity which literally runs in a simultaneous collaborative online code editing format). On a larger scale, students work in groups on final projects in courses such as ENLP, and have collaborated on an entry to a shared task in our discourse parsing course, with the resulting system winning some shared task categories. For this latter project, students attempted to tackle the same task in small groups, and finally submitted an ensemble system fed by each group's model to the competition.

Some classes use wikis to maintain information about course content, such as annotation guidelines for Computational Corpus Linguistics and some seminars tackling specific topics; this allows students to collaborate not only with their classmates, but with past and future students of the same course, which also increases the sense of relevance

6https://corpling.uis.georgetown.edu/gum/


of course work, as students can see that their work may live on long after they complete the course.

7 Including Languages Beyond English

In response to an unfortunate tendency of NLP teaching and research to focus primarily on English, we try to include data and examples from other languages when possible, while keeping in mind that students cannot be expected to know these languages in detail. In ENLP, for example, each student gives a short presentation on a different language of their choosing to develop awareness of the diversity of the world's languages and the challenges of NLP on different languages. Other assignments integrating data from other languages include a finite-state transducer for Japanese morphology in INLP, as well as a unit on a 'surprise' low-resource language, work on multilingual discourse treebanks in our discourse parsing course, statistical analysis of non-English data in our stats-centered R course for language data, and analysis of data from other languages in Lexical Functional Grammar (LFG) and Head-driven Phrase Structure Grammar (HPSG) in Grammar Formalisms. In the past, we have also offered a dedicated course on parallel and other types of multilingual corpora, which we hope to be able to offer again, depending on the availability of resources.

8 Teaching Research Skills

While many of the courses in Table 1 cover textbook-style fundamentals, the Special Topics courses expose students to the scientific literature in particular areas. For Ph.D. students in particular, this provides the opportunity to engage with research ideas by reading critically and developing original ideas as term papers or projects. Two courses include a mock reviewing activity simulating an ACL reviewing experience. Final projects in several courses, including Corpus Linguistics, Machine Learning for Linguistics, ENLP, and Meaning Representations, consist of an open-ended research project with an ACL-style writeup for the final report.

9 Directions for Growth

The current curriculum caters primarily to graduate students, though many of the courses are also available to advanced undergraduates, who sometimes continue on into our regular Computational Linguistics MS or Accelerated MS programs. While

we do offer a few undergrad-specific classes, such as 'Algorithms for NLP,' 'Languages and Computers,' and 'Multilingual and Parallel Corpora,' these are taught on an occasional basis; in the future, resources allowing, we would like to develop a more consistent NLP curriculum aimed at undergraduates.

We have recently introduced a course on Social Factors in NLP to address a major gap in our curriculum: the lack of material focusing on the impact of real-world NLP applications on society, and the ways in which models reflect demographic and other types of bias. While this is a step toward teaching a better understanding of the relationship of NLP to society, we believe it is worthwhile to integrate more content on societal impact and ethical considerations into the core NLP courses as well, and we are working to do so for coming years. We would also like to continue to expand our curriculum to address other topics that currently receive little coverage, such as grammar engineering and computational psycholinguistics.

Acknowledgments

We would like to acknowledge the other faculty who teach the courses described in this paper: Matthew Marge, Corey Miller, Elizabeth Merkhofer, Paul Portner, Achim Ruopp, and Shabnam Tafreshi. We thank the anonymous reviewers and members of the NERT lab for feedback on this paper, as well as the organizers of the workshop.

References

Nicholas Asher and Alex Lascarides. 2003. Logics of Conversation. Studies in Natural Language Processing. Cambridge University Press, Cambridge.

Jason Baldridge and Katrin Erk. 2008. Teaching computational linguistics to a large, diverse student body: Courses, tools, and interdepartmental interaction. In Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, pages 1–9, Columbus, Ohio. Association for Computational Linguistics.

Eric Fosler-Lussier. 2008. Strategies for teaching "mixed" computational linguistics classes. In Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, pages 36–44, Columbus, Ohio. Association for Computational Linguistics.

William C. Mann and Sandra A. Thompson. 1988. Rhetorical Structure Theory: Toward a functional theory of text organization. Text, 8(3):243–281.

Rashmi Prasad, Bonnie Webber, and Aravind Joshi. 2014. Reflections on the Penn Discourse TreeBank, comparable corpora, and complementary annotation. Computational Linguistics, 40(4):921–950.

Amir Zeldes. 2017. The GUM corpus: Creating multilayer resources in the classroom. Language Resources and Evaluation, 51(3):581–612.



Gaining Experience with Structured Data: Using the Resources of Dialog State Tracking Challenge 2

Ronnie W. Smith
Department of Computer Science
East Carolina University
Greenville, NC 27858 USA
[email protected]

Abstract

This paper describes a class project for a recently introduced undergraduate NLP course that gives computer science students the opportunity to explore the data of Dialog State Tracking Challenge 2 (DSTC 2). Student background, curriculum choices, and project details are discussed. The paper concludes with some instructor advice and final reflections.

1 Introduction

One of the consequences of the data explosion of the past twenty-five years is the likelihood that a large number of computer scientists and computing professionals will have the opportunity to develop analytical tools for processing large, structured datasets during their careers. This is of course especially true in the domain of NLP. A variety of NLP application domains can provide undergraduate students with a chance to gain experience with the processing of structured data in order to solve an interesting "real world" problem. This paper describes a class project for an undergraduate NLP course that gives students the opportunity to explore the data of Dialog State Tracking Challenge 2 (DSTC 2). Student background, curriculum choices, and project details are discussed. The paper concludes with some advice for instructors who might be interested in incorporating a DSTC 2-based project into their course, and final reflections about student feedback.

2 Background

At East Carolina University, a Natural Language Processing course was added to our curriculum during academic year 2015-2016. It is a junior/senior-level course for which the prerequisites are data structures and introductory statistics. Significant factors that influenced the decisions about curriculum and pedagogy during the initial offerings of the course (spring semesters 2017 through 2019) include the following.

• While we also have a machine learning course, it is not a prerequisite. Consequently, the NLP instructor cannot assume all students have completed the machine learning course.

• While Python is a rapidly growing programming language of choice, our students have not had significant exposure to the language prior to the NLP course. We use Java and C++ in our introductory and data structures courses.

• Only a small percentage of our students choose graduate study, and almost all of those who do seek only a terminal master's degree. Nevertheless, it is important to expose undergraduate computer science majors to the tools of research, if for no other reason than to give them an awareness of how research advancements are made.

• It is this author's belief that an undergraduate degree in computer science is merely a "license to learn." It is essential that our students understand the necessity of and gain experience with self-directed learning before they graduate.

Based on these factors, Natural Language Processing with Python (Bird et al., 2009) was chosen as the textbook for the course. It had been the basis for a successful independent study course with a Duke University undergraduate during spring 2013. Notable advantages of this choice include the following.

1. Gives students the opportunity to engage in some self-directed learning of a new programming language (Python) and a new API (the Natural Language Toolkit (Bird et al., 2008)).

2. Provides a gentle introduction to machine learning techniques suitable for our students who had not yet taken the machine learning course.

3. Offers a low cost to students, as it is available free of charge online.

4. Provides a rich collection of exercises by which students can begin to gain proficiency and confidence in working with collections of structured data.

A main disadvantage of this text is its limited discussion of linguistic theory, but that can be addressed by other assigned readings.

With the cooperation of the department chair, the class size was restricted (15 in 2017, 25 in 2018, and 29 in 2019). This enabled the opportunity to provide more individualized learning opportunities. The primary such opportunity was the class project using the DSTC 2 data that was assigned for the spring 2018 and spring 2019 offerings of the course. Discussion of this project will be the focus of the remainder of the paper.

3 DSTC 2 Overview

For DSTC 2, a general discussion of the challenge and the challenge results is provided in (Henderson et al., 2014a). The ground rules for the challenge are specified in (Henderson et al., 2013). A summary of the details most relevant to its use for a class project is provided below.

3.1 Problem Domain

The dialog environment was a telephone-based dialog system where the user task was to obtain restaurant information. During data collection, each system user was given a dialog scenario to follow. Example scenario descriptions extracted from two of the log files are given below.

• Task 09825: You want to find a cheap restaurant and it should be in the south part of town. Make sure you get the address and phone number.

• Task 11937: You want to find an expensive restaurant and it should serve portuguese food. If there is no such venue how about north american type of food. You want to know the phone number and postcode of the venue.

The basic structure of the dialogs has the following pattern.

1. Acquire from the user a set of constraints about the type of restaurant desired. Users may supply constraint information about area, food, name, and price range. This phase may require multiple iterations as user goals change (such as from portuguese food to north american food in task 11937).

2. Once the constraints have been acquired, provide information about one or more restaurants that satisfy the constraints. Users may request that additional attributes about a restaurant be provided (such as address and phone number).

An example transcript of an actual dialog for completing task 11937 is provided in Appendix A.

As described in the challenge handbook (Henderson et al., 2013), during each call the dialog system logged detailed information that provides the needed input for a separate module for handling dialog state tracking. Further details about the data collection process are described next.

3.2 Data Collection Process

There were two different speech recognition (SR) models and 3 different dialog management models, for a total of six different dialog systems that were used in the data collection process. Approximately 500 dialogs were collected using each of the six systems. There were a total of 184 unique users, recruited using Amazon Mechanical Turk. Data collected using two of the dialog managers across both SR models (i.e., four of the six dialog systems) were used for training and development, while data collected using the third dialog manager (1117 dialogs) was used as the test set for evaluation.1

4 Possible Activities with the DSTC 2 Data

For a group of students with sufficient technical background in Python and/or machine learning, a class project that allows students to develop their own dialog state tracking system is certainly feasible. Students could start from scratch and enhance

1The total number of dialogs was 3235. They were subdivided by the challenge organizers into a training set of 1612 dialogs, a dev set of 506 dialogs, and then the test set of 1117 dialogs.


one of the rule-based baseline trackers that are provided in the DSTC 2 dataset, or else develop or enhance a machine learning approach for tracking (e.g., Williams, 2014; Henderson et al., 2014b).2 In either case, students should base their approach on a data-driven analysis of the nature of the dialogs.
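As one illustration of the rule-based end of this spectrum, a miniature tracker might simply overwrite each goal slot whenever the top SLU hypothesis informs it. The hypothesis format below is an assumption for illustration, not the DSTC 2 schema.

```python
# Miniature rule-based tracker for illustration: overwrite a goal slot
# whenever the top SLU hypothesis informs it. The hypothesis format is
# an assumption, not the DSTC 2 schema.
def track(turns):
    goals = {}                                   # slot -> latest value
    for turn in turns:
        top_hyp = turn["slu_hyps"][0]            # highest-scored hypothesis
        for act in top_hyp:
            if act["act"] == "inform":
                for slot, value in act["slots"]:
                    goals[slot] = value
        yield dict(goals)                        # state after this turn

turns = [
    {"slu_hyps": [[{"act": "inform", "slots": [["food", "portuguese"]]}]]},
    {"slu_hyps": [[{"act": "inform", "slots": [["food", "north american"]]}]]},
]
for state in track(turns):
    print(state)
```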

However, this is not the only possible use of the data. In a circumstance where students do not have the necessary background to develop their own tracker, they can still engage with the data by developing their own analysis tools to glean information useful for studying other aspects of dialog processing, such as miscommunication handling. Given the instructor's personal research interests and the students' background, this seemed to be the best way to proceed with a project, as described next.

5 In-class Project Activities

About midway through the semester, after completing the first five chapters of Natural Language Processing with Python, a week of class is taken to introduce students to dialog state tracking and the project. The goals for that week of class include the following.

• Introduce students to the natural language dialog problem. Resources used include a video of the Circuit Fix-it Shoppe dialog system (Hipp and Smith, 1993) and an introductory paper on natural language interfaces (Smith, 2005).

• Introduce students to the dialog state tracking problem using selected information from the first four sections of the DSTC 2 handbook (Henderson et al., 2013).

• Introduce students to the project requirements (see section 6).

• Provide a brief introduction to the structure ofthe raw data used in the project. The data isrepresented using JavaScript Object Notation(JSON). This introduction includes a templatePython program that can be used as the basisfor the software development activities of thestudents. Detailed study of appendix A in the

2There are also several other papers related to DSTC 2 in the SIGDIAL 2014 proceedings (Georgila et al., 2014). The two specifically cited discuss the two best performing trackers in the original challenge.

Detailed study of appendix A in the challenge handbook is essential for successful completion of the project. The template program can access all dialogs based on a list of dialog IDs specified in a file whose name is given as a command-line argument. For each accessed dialog, the template program extracts and displays the dialog acts of the system and speaker. A minimal sketch of such a template appears after this list.

• Introduce students to other data resources available for the project. Besides the raw data, a supplemental resource provided is a set of annotated transcripts of the dialogs. To keep students from being overwhelmed, each student is assigned the dialogs of a specific speaker. Several of the speakers interacted with all six of the dialog systems. Each student is assigned a different speaker.3 Each set contains between 15 and 20 dialogs. Besides the transcribed system output and Automatic Speech Recognition (ASR) input, other information provided in the annotated transcript includes the following.4

1. The formal dialog act of the system utterance.

2. The list of hypotheses provided by the Spoken Language Understanding (SLU) module, including scoring.

3. The actual transcription of what was spoken by the user.

4. The chosen hypothesis of the SLU, along with a comparison of that hypothesis to what was actually spoken.

5. The current state of the dialog tracking process with respect to the informable attributes (area, food, name, and pricerange) for which information has been provided at some point in the dialog. Note that, as in the sample dialog for task 11937, the information for a specific goal can change (such as the food preference changing from “portuguese” to “north american”).
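The following is a minimal sketch of what such a template program might look like. The file names (log.json, label.json) and JSON keys (turns, output, dialog-acts, transcription) reflect one reading of the DSTC 2 distribution described in the handbook; they are assumptions here, not the instructor’s actual template, and should be verified against appendix A of the handbook before use.

import json
import sys
from pathlib import Path

DATA_ROOT = Path("data")  # root of the extracted DSTC 2 data (assumed layout)

def load_dialog(dialog_id):
    """Load the system log and the user-side annotations for one dialog."""
    d = DATA_ROOT / dialog_id
    with open(d / "log.json") as f:
        log = json.load(f)
    with open(d / "label.json") as f:
        label = json.load(f)
    return log, label

def show_dialog_acts(dialog_id):
    """Print system dialog acts and user transcriptions, turn by turn."""
    log, label = load_dialog(dialog_id)
    for sys_turn, usr_turn in zip(log["turns"], label["turns"]):
        print("SYS acts:", sys_turn["output"]["dialog-acts"])
        print("USR said:", usr_turn["transcription"])

if __name__ == "__main__":
    # The file named on the command line lists one dialog ID per line,
    # e.g. Mar13_S2A0/voip-...
    with open(sys.argv[1]) as flist:
        for line in flist:
            if line.strip():
                show_dialog_acts(line.strip())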

There are a few other in-class activities that relate more specifically to the project requirements.

3Besides acting as a limit on the amount of data a student had to consider, another reason for this was to generate data that could be used to study whether speakers behaved differently with the different dialog systems.

4Details about the formal notation are provided in Henderson et al. (2013).


They will be mentioned in the following section, which discusses the project requirements.

6 Project Requirements

Students are required to submit a separate Python program for carrying out each of the following tasks.

1. Basic performance analysis of the speech quality.

2. Automatic annotation of dialog state change.

3. Automatic generation of dialog feature information for miscommunication detection.

More detailed discussion about each of these activities will follow.

6.1 Basic performance analysis of the speech quality

The intent of this requirement was to give students a gentle introduction to modifying the Python template program to access other elements in the raw dialog data. Their program was required to produce the following information for each user utterance (a sketch of these computations follows the list).

• Number of words actually spoken by the human speaker.

• Number of words in the highest scored live speech recognition hypothesis.

• Total number of unique words found in the union of all the live speech recognition hypotheses.

• A label describing whether or not the utterance was understood.5
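A sketch of the word counts is below; the understood/not-understood label is omitted because it relies on the instructor-provided module. The key names (input, live, asr-hyps, asr-hyp, score, transcription) are assumptions based on the DSTC 2 format and should be checked against the handbook.

def speech_quality_stats(log_turn, label_turn):
    """Per-utterance word counts for the speech-quality analysis."""
    spoken = label_turn["transcription"].split()
    hyps = log_turn["input"]["live"]["asr-hyps"]  # list of scored ASR hypotheses
    best = max(hyps, key=lambda h: h["score"])["asr-hyp"].split()
    vocab = {w for h in hyps for w in h["asr-hyp"].split()}
    return {
        "words_spoken": len(spoken),
        "words_in_best_hyp": len(best),
        "unique_words_all_hyps": len(vocab),
    }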

6.2 Automatic annotation of dialog state change

Dialog state change occurs when the user either supplies constraint information for possible restaurant recommendations (area, food, name, or price range) or else requests that additional attributes about a restaurant be provided (such as address and phone number). In this task students must implement a program that tracks the changes in these supplied values as the dialog proceeds. The algorithm for carrying out this task was discussed as part of the in-class activities. For each user

5This required using a function call to an instructor-provided Python module.

utterance, the program had to specify the set of attributes for which (1) a new value had been supplied; (2) a modified value had been supplied; and (3) a value had been removed.
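The core of such a tracker is a comparison of the attribute/value state before and after each utterance. A minimal sketch of that comparison (my own illustration, not the algorithm presented in class) might look like this:

def state_change(prev, curr):
    """Classify changes between two dialog states, each a dict mapping
    informable attributes (area, food, name, pricerange) to values."""
    new = {a for a in curr if a not in prev}
    modified = {a for a in curr if a in prev and curr[a] != prev[a]}
    removed = {a for a in prev if a not in curr}
    return new, modified, removed

# Example mirroring the task 11937 dialog, where the food goal changes:
prev = {"food": "portuguese", "pricerange": "expensive"}
curr = {"food": "north american", "pricerange": "expensive"}
print(state_change(prev, curr))  # food is modified; nothing new or removed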

6.3 Automatic generation of dialog feature information for miscommunication detection

This part of the project gave students a chance to conduct their own analysis and offer their own insight into what readily computable features of the dialogs might be helpful to a dialog manager in determining that miscommunication occurred. Each student was required to propose three possibilities for feature detection, and in a one-on-one meeting we would discuss the options and settle on a particular choice.6 Details about the chosen features and the results are presented in section 9.3.

This particular project requirement was used in spring semester 2018 but not in spring semester 2019. This was not due to any problem with the requirement itself, but rather to the amount of time required of the instructor to meet with the students and then evaluate the work. Due to workload constraints in spring semester 2019, the instructor did not have sufficient time to oversee and evaluate that requirement. Instead, a standardized third requirement was used that asked students to conduct a performance analysis of the SLU module, examining its performance as a function of the presence or absence of a “high-scoring” NULL semantic hypothesis, where the definition of “high-scoring” was specified as a run-time parameter.
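A sketch of that check follows. Treating an SLU hypothesis with an empty act list as the NULL hypothesis, along with the key names used, is an assumption about the DSTC 2 format rather than the actual assignment code; the threshold arrives as a run-time parameter, as in the requirement.

def has_high_scoring_null(log_turn, threshold):
    """True if this turn's SLU output contains a NULL (empty) hypothesis
    whose score meets the threshold."""
    for hyp in log_turn["input"]["live"]["slu-hyps"]:
        if not hyp["slu-hyp"] and hyp["score"] >= threshold:
            return True
    return False

# e.g., threshold = float(sys.argv[2]) when run from the command line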

6.4 Project Report Requirements

Detailed requirements are provided in appendix B. The intent of the report requirement was to give students a chance to reflect on the dialog state tracking problem and its relationship to the detection of dialog miscommunication. In earlier course assignments, students had been asked to reflect and write about the domain problem at hand. One such assignment was from the second chapter of the Natural Language Processing with Python textbook, where students were asked to calculate word frequencies for words of their choice for five different genres in the Brown corpus. Students were asked to come up with words whose presence or

6Students were required to submit their proposal for advance review. Meetings were scheduled at 10-minute intervals.


absence might be typical for that genre. In their reflection, students were asked to explain their rationale for the genres they chose and to discuss the sequence of insights and lessons learned as different sets of words were tested. Students were specifically challenged to provide evidence of thoughtful inquiry by demonstrating a sequence of cycling through hypothesis, test, result, and refinement. It was hoped that prior experience with this mode of activity would be helpful while working on the project.

7 Project Assessment

7.1 Weightings

The three parts of the project were weighted as follows.

1. Speech quality performance (30%).

2. Dialog state change annotation (40%).

3. Generation of dialog feature information (2018) / SLU performance (2019) (30%).

For each part, code correctness/quality was weighted at 70%, while report quality was weighted at 30%.

7.2 Evaluating code correctness/quality

For the first two parts of the project, as well as the 2019 part 3 requirement to analyze SLU performance, it is certainly possible to base code correctness on automated testing. Unfortunately, automated testing is likely to penalize minor logic errors excessively, since even small errors can lead to large discrepancies in output results. Given the limited scope of each program, a checklist of features can be manually inspected reasonably quickly; a time estimate would be 20 to 30 minutes per student. As would be expected, correct solutions do not take as long to check.

Checking correctness of the 2018 part 3 requirement, where students implemented their own feature generation algorithm, is not as straightforward. While it was possible to steer the students towards a total of about five different dialog features rather than more than 20 different programs, it was still not a simple task.

7.3 Evaluating report quality

Two main questions were examined.

1. Did the student address each of the report requirements?

2. Did the report exhibit evidence that the student seriously looked at the results of running the software and based their observations on those results?

As might be expected, there was a wide disparity in the quality of the efforts. Some excerpts from the reports follow.

• “Part 3 of this assignment seems to work when it wants to.”

• “. . . much of our language is fluff.”

In contrast, some students wrote multiple paragraphs analyzing details and making tangible proposals for how to use the results to help with handling dialog miscommunication. One student, who was concurrently enrolled in an information retrieval course and had learned about Naive Bayes classifiers, chose to explore using the data produced from part 1 (speech quality) and implemented a classifier to see how it would perform (unsurprisingly, not well).

Fortunately, many students engaged in meaningful self-discovery of principles for effective human-computer dialog interaction, such as the usefulness of careful design in the phrasing of questions to the user. Several students also noted that excessive user terseness is not always helpful.

8 Classroom reinforcement of their research efforts

Given the challenging nature of this project for our students, it would have been unwise to spend the final weeks of the semester on excessive presentation of new material. To reinforce the goal of student exposure to the process of research, and to connect their work to the research community, two 50-minute class periods were used to look at four papers from the 2014 SIGDIAL conference, where the work from DSTC 2 was presented. Class time was spent watching the videos of the original conference presentations of these papers (Henderson et al., 2014a; Smith, 2014; Williams, 2014; Henderson et al., 2014b). These are available at https://www.superlectures.com/sigdial2014/. This classroom activity was conducted two weeks prior to the end of the


semester. The final week was spent being available for additional consultation about projects, as well as spending class time answering questions about NLP that were posed by students at the beginning of the semester. This was a chance to help them see what they had learned during the semester, and to understand better what remains an unsolved problem.

9 Student Outcomes

9.1 Part 1: Speech quality performance

This requirement served its purpose beautifully. One problem in undergraduate computer science study is the pervasive belief that almost any programming question is answerable by an Internet search: you just have to submit the magic words to get the answer to appear. To meet this requirement and do the rest of the project, students really had to study the handbook and apply the information it contained to the provided template program. This was mentioned several times in student project reports in response to the “what was your biggest challenge and how did you overcome it?” question.

9.2 Part 2: Dialog state change annotation

The algorithm presentation in class did not go into the details of how to access the needed data. Again, students had to apply the lessons learned from part 1, as well as further investigate the handbook, to understand how to access the needed data. Between the data access and the data structures required for their implementation, students became much more capable of identifying the differences in use between Python lists and dictionaries. This part of the project tended to be the most challenging for the students.

9.3 Part 3: Generation of dialog feature information (2018)

Besides setting a flag to indicate repetition of a system response, the primary choice of students was to drill deeper into the ASR and SLU data. The most common approaches are given below.

• Counting the occurrences of an <attribute,value> pair (e.g., <food,italian>) in the different SLU hypotheses.

• Calculating cumulative confidence in an <attribute,value> pair over the various SLU hypotheses.

• Combining SLU score with utterance length in some fashion.

• Detecting the presence or absence of the NULL (no interpretation) hypothesis for the SLU.

• Detecting the presence or absence of <attribute,value> information in the SLU that appeared in the ASR.

Given the technical challenges involved, the final item was attempted by only a few students.
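As an illustration of the second approach above, a cumulative-confidence feature can be sketched as follows. The representation of SLU hypotheses (an act list with [slot, value] pairs under "slots") is an assumption about the DSTC 2 JSON format; the students’ actual programs are not reproduced in this paper.

def cumulative_confidence(slu_hyps, attribute, value):
    """Sum the scores of all SLU hypotheses asserting <attribute,value>,
    e.g. <food,italian>."""
    total = 0.0
    for hyp in slu_hyps:  # each: {"slu-hyp": [acts], "score": float}
        for act in hyp["slu-hyp"]:
            for slot, val in act.get("slots", []):
                if slot == attribute and val == value:
                    total += hyp["score"]
    return total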

A key misconception that arose in several students’ initial proposals for feature generation was confusion over what is available at the time the dialog is occurring. Several students tried to propose using information only available after the dialogs had completed (i.e., the correctness annotations indicating what was actually spoken and what the correct SLU hypothesis and updated dialog state tracking values should have been). While this information can be helpful in reassessing the performance of dialog system modules, it cannot be directly used in detecting miscommunication during an ongoing dialog. I believe that their work on part 2 (dialog state change annotation) led to this confusion, since they were using the post-dialog annotations to complete that task. An extra 15 minutes of class time addressing this issue prior to the student proposals should have cleared up this misconception.

9.4 Part 3: SLU performance (2019)

Most students implemented this correctly. This was one place where it might have been helpful to have the students also run their solution on the entire DSTC 2 dataset, as well as on the dialogs of their assigned speaker, to see whether the SLU performance on their speaker was representative of overall performance.

10 Advice for Instructors

Beyond an introductory undergraduate NLP class for computer science majors with limited background in machine learning and Python, I believe a project based on the DSTC 2 data can be used in a variety of contexts.

• In an advanced NLP class where students already have a machine learning background.

• In an advanced undergraduate data structures class. The project could be a means to get students interested in NLP based on a common


activity: having a conversation with someone else.

• In an accelerated summer camp environment for talented undergraduates.

If you are interested in using the DSTC 2 data for a class project, I would suggest the following steps.

1. Go to https://github.com/matthen/dstc and download the following files at a minimum.

• handbook.pdf

• dstc2_traindev.tar.gz - Train and development datasets for DSTC 2.

• dstc2_test.tar.gz - Test dataset for DSTC 2.

• dstc2_scripts.tar.gz - Evaluation scripts, baseline tracker and other tools.

• HWU_baseline.zip - Alternative baseline tracker, provided by Zhuoran Wang.

2. Install the downloaded files and make sure you can run one of the supplied trackers on the entire dataset.

3. Streamline the baseline tracker code and/or scoring code to access and process a subset of the features from the datasets. Alternatively, you may wish to consult the teaching materials web site for this workshop to access the demo scripts that should be made available.

4. Explore the /scripts/config subdirectory within the installation to understand the use of the flist files for listing the specific dialogs to be processed during the execution of a script. A small sketch of reading an flist follows.

Once you have successfully completed these tasks, you should be well on your way to developing your own project with this data. Feel free to consult the author of this paper as needed.

11 Lessons Learned

Student feedback was largely anecdotal and quite favorable. While it is certainly the case that, in response to “What would you change to improve the course?”, the most common answer was to have more time spent on classifiers and machine learning, students did enjoy working on the project. The

most interesting piece of feedback came from a student who originally submitted an incomplete project during the first offering of the assignment (spring 2018). The student was allowed an extra couple of weeks to complete the project. At some point during fall semester 2018, the student took the time to let me know that the project had been used as part of a job interview, when the student was asked to give a technical presentation about a previously completed project. The student said, “. . . the final project in your NLP class was by far the most interesting one that I’ve been assigned in college . . . ”.

While this was gratifying feedback, the more important outcome is the belief that the course and project did achieve the goals of giving students experience with self-directed learning and engagement with the tools of research. We are fortunate in NLP to have a wide variety of interesting problems available to us that will naturally intrigue our students, regardless of their original interest in NLP. If one wishes to use dialog processing as the interesting problem, the DSTC 2 data is a valuable resource for the classroom. Regardless, I would encourage instructors to pick a task that excites them. Our teaching is much stronger when we are ourselves demonstrating the need for self-directed learning. Our research provides such a means.

12 Acknowledgements

I would like to thank my computer science department chair, Dr. Venkat Gudivada, for allowing me to teach the initial offerings of the course with a restricted enrollment. This enabled me to offer more individualized exploratory research opportunities to our undergraduates.

None of this would have been possible without the existence of the DSTC 2 challenge. A special thanks goes to the DSTC 2 organizers, Matthew Henderson, Blaise Thomson, and Jason Williams, as well as to the University of Cambridge, Microsoft Research, and SIGDIAL (Special Interest Group on Discourse and Dialogue) for their sponsorship and support of the challenge.

A final thanks goes to the anonymous reviewers, who helped me better understand what would be of most value to the readers of this paper.


References

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly, Beijing. Current version of the book is available at http://www.nltk.org/book/.

Steven Bird, Ewan Klein, Edward Loper, and Jason Baldridge. 2008. Multidisciplinary instruction with the natural language toolkit. In Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, pages 62–70, Columbus, Ohio. Association for Computational Linguistics.

Kallirroi Georgila, Matthew Stone, Helen Hastie, and Ani Nenkova, editors. 2014. Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL). Association for Computational Linguistics, Philadelphia, PA, U.S.A.

M. Henderson, B. Thomson, and J. Williams. 2013. Dialog State Tracking Challenge 2 & 3. https://github.com/matthen/dstc/blob/master/handbook.pdf.

M. Henderson, B. Thomson, and J. Williams. 2014a. The second dialog state tracking challenge. In Proceedings of the SIGDIAL 2014 Conference, pages 263–272, Philadelphia, U.S.A. Association for Computational Linguistics.

Matthew Henderson, Blaise Thomson, and Steve Young. 2014b. Word-based dialog state tracking with recurrent neural networks. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 292–299, Philadelphia, PA, U.S.A. Association for Computational Linguistics.

D. R. Hipp and R. W. Smith. 1993. A demonstration of the “circuit fix-it shoppe”. In Video Proceedings of the AAAI Conference.

Ronnie Smith. 2014. Comparative error analysis of dialog state tracking. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 300–309, Philadelphia, PA, U.S.A. Association for Computational Linguistics.

R. W. Smith. 2005. Natural language interfaces. In Encyclopedia of Language and Linguistics, 2nd edition, pages 496–503. Elsevier Limited, Oxford.

Jason D. Williams. 2014. Web-style ranking and SLU combination for dialog state tracking. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 282–291, Philadelphia, PA, U.S.A. Association for Computational Linguistics.

A Sample Dialog Transcript

Figure 1 is the transcript of an actual dialog for completing Task 11937 (description listed in section 3.1). SYS denotes utterances by the computer system, and ASR is the speech returned by the speech recognizer. This sample dialog was used during the class presentation about DSTC 2 in order to make clear to the students that what ASR returns is not always what was actually said; in this example, though, it is close enough that the intent is still understood. A more detailed transcript, including what was actually said, is also made available to the students.

B Sample Project Report Requirements

This information comes from the second offering of the dialog state tracking project, during spring semester 2019. The report requirements for the previous year (where part three of the project allowed students to choose a dialog miscommunication feature to extract) are identical, except that the final paragraph was omitted.

B.1 Report Organization

Your report should have three sections, one section for each required investigation.

Section 1: Basic performance analysis of the speech quality

Section 2: Automatic annotation of dialog state change

Section 3: Basic performance analysis of the SLU module

Each section should contain the following.

• A list of any known deficiencies in the software. If you have no known deficiencies, explicitly say so.

• A description of the biggest challenge you faced in successfully completing the software for the given investigation. Did you overcome the challenge, and if so, how?

• A discussion of your most interesting observation based on the data produced by the software. A quality discussion will not only present the results, but also present some thoughtful analysis that relates to the challenge of dialog state tracking. Possible contexts in which to frame the discussion include the following.


1. An observation based on differences as a function of the dialog system being used (recall that each S[0-2]A[0-1] combination represents a different dialog system).

2. An observation based on differences as a function of dialog type (goal change vs. non-goal change).

3. An observation that provides insight about one or more areas of exploration that could lead to improved performance by dialog state trackers, especially in regard to detecting possible miscommunication.

NOTE: You are not required to use the same context (or any of these contexts, for that matter) for each section. Use whatever seems appropriate based on your observations as they relate to the specific topic.

IMPORTANT: A quality discussion of your most interesting observation for Section 3 should also include a proposal for either extending the program you wrote or proposing a completely new software tool (i.e., program) for deeper investigation of the relevant issues you have identified for improving performance in dialog state tracking.


Session: data/Mar13_S2A0/voip-72e50baa85-20130326_053416

Turn 0
SYS > Hello , welcome to the Cambridge restaurant system. You can ask for restaurants by area , price range or food type . How may I help you?
ASR > i don’t mind an expensive restaurant that serves portuguese food

Turn 1
SYS > I’m sorry but there is no restaurant serving expensive portuguese food
ASR > what about north american food

Turn 2
SYS > I’m sorry but there is no restaurant serving expensive portuguese food
ASR > north american food

Turn 3
SYS > Could you please repeat that?
ASR > what about north american food

Turn 4
SYS > gourmet burger kitchen serves north american food in the expensive price range
ASR > what is the phone number

Turn 5
SYS > The phone number of gourmet burger kitchen is 01223 312598 .
ASR > what is the post code

Turn 6
SYS > The post code of gourmet burger kitchen is C.B 2, 1 A.B
ASR > bad

Turn 7
SYS > I am sorry but there is no other north american restaurant in the expensive price range
ASR > thank you been bad

Turn 8
SYS > gourmet burger kitchen serves north american food in the expensive price range
ASR > can you get back

Turn 9
SYS > Can I help you with anything else?
ASR > thank you good bye

Figure 1: Sample dialog for completing Task 11937



The Flipped Classroom model for teaching Conditional Random Fields in an NLP course

Manex Agirrezabal

Centre for Language Technology (CST) - Department of Nordic Studies and Linguistics
University of Copenhagen / Københavns Universitet

Emil Holms Kanal 2, 2300 Copenhagen (Denmark)

[email protected]

Abstract

In this article, we show and discuss our experience in applying the flipped classroom method for teaching Conditional Random Fields in a Natural Language Processing course. We present the activities that we developed, together with their relationship to a cognitive complexity model (Bloom’s taxonomy). After this, we provide our own reflections and expectations of the model itself. Based on the evaluation received from students, it seems that students learn about the topic and also that the method is rewarding for some students. Additionally, we discuss some shortcomings and propose possible solutions to them. We conclude the paper with some possible future work.

1 Introduction

In this article we would like to share our experience in developing a flipped classroom environment in a Natural Language Processing course. The most common approach to teaching involves a teacher “pouring” knowledge into their students, as if students were plants and knowledge was water. There is a tendency, though, to make classes more active and interactive, in which communication does not only happen from the teacher to the students, but also from student to student.

Bloom’s taxonomy1 (Bloom et al., 1956; Anderson et al., 2001) is a model that describes the cognitive load of different types of work or activities. On the one hand, some activities, such as remembering or understanding specific characteristics of, for instance, a model of any kind, would be considered to require a low cognitive load. On the other hand, evaluating which model is best by considering the characteristics of each could be considered to be at a higher level of complexity in Bloom’s taxonomy.

1https://www.flickr.com/photos/vandycft/29428436431

If we want our students to be more knowledgeable and reflective, we need to go beyond the lower levels of Bloom’s taxonomy. The traditional approach of introducing topics in a lecture would make students remember and understand the covered topics. We believe, though, that with these teaching practices there is not enough time to go beyond the first two levels of Bloom’s taxonomy.

A more efficient approach could be to ask students to work on a set of topics beforehand, get prepared, and then work on activities that involve a higher cognitive load. These activities could involve applying the acquired knowledge and analyzing other aspects. In this article, we present our experience in teaching Conditional Random Fields as a flipped classroom. We teach this in a Master-level course on Natural Language Processing (NLP).

The article is structured as follows. First, we introduce the flipped classroom method and some of its challenges. After that, we present the program, course and lecture in which the new teaching method is used. Then, we discuss the characteristics of the implied student and the evaluation method that we use. Later, we describe the lecture structure and the different activities, and we classify them according to Bloom’s taxonomy. We then include some discussion based on the experience. Finally, we conclude the work and suggest some possible future directions.

2 Flipped Classroom: Are we going to turn around the desks?

Flipped Classroom (Lage et al., 2000; Brame, 2013) is a teaching approach in which students get first exposure to the material of a lecture outside of class, and the in-class activities involve applying the learned content. Provision of lecture material can be done in a number of ways, such as reading material, video lectures as slideshows, podcasts, and so on.


Students are expected to do the homework, and the majority of the in-class activities fully depend on that. Because of this, the teacher should make sure that students do their homework: even though we name it in various ways, watching lectures or reading articles is still homework (Nielsen, 2012), and the challenge of getting students to complete it is still there. A possible way of making sure that students do the reading homework could be to ask them to take a quiz or to reward them somehow.

Apart from challenges regarding students, we must not forget that all the activities, homework, readings, and so on have to be retrieved, selected or produced. Furthermore, as the content covered in class is more complex, the teacher has to be better prepared. All this results in a larger workload in the preparation of the class and its activities.

3 Background about program, course and lecture

In this section, we briefly introduce the M.Sc. program, the course and the specific lecture on which we will be focusing.

The IT & Cognition program

The IT & Cognition program at the University of Copenhagen is an international and interdisciplinary program2 that accepts a small group of students every year. The program covers three main areas: Natural Language Processing, Image Processing and Cognitive Science.

In the first semester, the students acquire the basic skills required for their further development in the specialization area in which they are interested. Scientific Programming, Language Processing I, Cognitive Science I and Vision and Image Processing are mandatory subjects that cover these basic skills.

In the second semester, students continue learning about Language Processing and Cognitive Science in further specialized courses, and they also get an introduction to Data Science. Besides, they start their specialization process with an elective course.

The third and fourth semesters are devoted to a third course about Cognitive Science and three different electives; after that, students work on their thesis project.

2https://studies.ku.dk/masters/it-and-cognition/

Language Processing 1 and 2 (LP1 and LP2)

These courses are taught over a whole academic year (two semesters). There is one lecture per week, which lasts for two hours. Each course, LP1 and LP2, runs for fourteen weeks in the fall and spring semesters, respectively.

The Intended Learning Outcomes (ILOs) of both courses are quite similar. The main differences are in the depth in which topics are covered. On the one hand, the first course offers basic knowledge about different tasks relevant to Natural Language Processing and their relationship to current society. On the other hand, the second course is more geared towards the development of more advanced algorithms and their application in more specific tasks.

Assessment method

We assess students by asking them to work on a predefined topic. The common procedure is to perform experiments for a relevant NLP task; afterwards, they should write a scientific article reporting on these experiments.

Lecture: Conditional Random Fields

One of the topics covered in the second Language Processing course is sequence tagging. We cover Hidden Markov Models, Maximum Entropy Markov Models and Conditional Random Fields as sequential tagging models. This last model is covered in one and a half sessions. In the first session (2 hours) we cover theoretical questions, which students are expected to understand. In the following week, there is one hour in which students get hands-on practice with Conditional Random Fields.

The lecture itself has the goal of giving the students an understanding of how Maximum Entropy Markov Models (MEMMs) (McCallum et al., 2000) and Conditional Random Fields (CRFs) (Lafferty et al., 2001) make predictions, compared to hard classification methods. Additionally, students should understand a common problem of MEMMs (the Label Bias problem (Bottou, 1991; Lafferty et al., 2001)) and how CRFs solve this limitation. Finally, students should also be able to use CRFs for their own research after the lectures.

4 The implied student

The background of our student group is very heterogeneous. Some people may have a Computer


Science-related background and, therefore, sufficient experience in programming. Other students do not have the same background, but they are strong in other aspects, such as linguistics, neuroscience or psychology. Besides, by the time the analyzed lecture takes place, students have already received lectures about programming (first semester), so this level gap should be significantly smaller.

5 Evaluation method

In recent years, the usual teaching practice in the Language Processing series has been lecturing. As the goal of this experiment is to check whether the flipped classroom can support students in their learning, we implement this teaching method for the section about Conditional Random Fields and analyze how students feel about it.

We evaluate this teaching practice by asking students to fill in a survey. In the survey, we ask students about their general knowledge of some topics (MEMMs, CRFs, the Label Bias problem), but also about whether they would be able to use CRFs for their own work. The questions asked in the questionnaire are listed below:

1. Do you know what a MEMM is (Maximum Entropy Markov Model)?

2. Do you know what a CRF is (Conditional Random Field)?

3. Do you know what the Label Bias problem is?

4. Do you feel capable of using a CRF for your own problems, such as developing a Named Entity Recognition system?

5. Do you feel that this structure (teaching style) is more rewarding?

6. Do you feel that this structure is more demanding (mentally)?

The response to these questions could be “Yes”, “Roughly” or “No”. The last two questions relate to the specific teaching method: we ask students whether the teaching practice is more demanding (mentally) and whether it is more rewarding, compared to the lecturing approach. The survey was administered one week before the lecture day and again after the lecture was done.

6 Activities

In this section we describe the activities that we prepared for students before, during and after the lectures. The activities will be made public, in the hope that they will be useful for other NLP teachers and/or researchers.

6.1 Before lecture

Before the lecture, students were asked to watch two video lectures. The two videos were available on the university learning platform and were made by ourselves. The first video covers Maximum Entropy Markov Models, which we introduce by showing their relationship to Logistic Regression and Hidden Markov Models. These last models (HMMs) were introduced two weeks earlier in the same course. We use Maximum Entropy Markov Models as a middle step towards understanding Conditional Random Fields, which are discussed in the second video. We talk about the Label Bias problem and show how it is solved by using global normalization in Conditional Random Fields.

Students should also read the paper that introduces Conditional Random Fields (Lafferty et al., 2001).3

6.2 In classroom (online)

As mentioned before, CRFs will be covered in one and a half lecture sessions. These sessions are held online because of the current situation, following our health authorities’ requirements.

Before we start the lecture, students are asked whether there are questions regarding the reading and watching activities. We briefly recapitulate some aspects, such as where sequential models like HMMs, MEMMs or CRFs could be useful: Part-of-Speech (POS) tagging, Named Entity Recognition (NER) and, besides, any other task that requires producing tags for a given sequence of elements.

Then, the goal of the remaining time in the session is threefold: (1) to revise and get an understanding of why sequential models such as HMMs, MEMMs or CRFs are more powerful than hard classification models, e.g., Maximum Entropy, Support Vector Machines (SVMs), and so on; (2) to make clear what the Label Bias problem is; and (3) to show how CRFs solve the Label Bias problem by using global normalization. After this session, there is time to show CRFs working in practice, so that students get hands-on experience.

We describe below four different exercises that students have to do in small groups (3-4 people). These exercises have an increasing level of

3We believe that including an additional article about Maximum Entropy Markov Models (McCallum et al., 2000) is relevant and very helpful for students. Unfortunately, we did not include it this year.


t         0      1           2       3      4
WORDS     My     smartphone  worked  very   well
POS tags  PRP    N           VP      MOD    RB
Features  1,2,0  0,10,0      0,6,1   0,4,0  0,4,0

complexity, as will be seen.

Exercise 1

Understand how prediction is made in a Maximum Entropy POS tagger. Students are given one sentence and the POS tags for each word in that sentence. They are told that the model is trained using three very simple features (isUpperCase, length_chars, endsInEd), and they have to simulate by hand how predictions are made for each word. We also provide a list of 10 possible POS tags. In order to make sure that the concepts about the weight matrix are understood, we ask some checkpoint questions, such as the shape of the weight matrix, given that the input matrix has a size of 1x3 and the output matrix has a size of 1x10.

Each group of students should provide two outputs to the teacher. Given an input and a weight matrix,4 they should be able to see which POS tag would be returned by the model. We also ask them to describe, in one sentence in their own words, how the model produces the output for each word.

Considering Bloom’s taxonomy (Bloom et al., 1956), this exercise could be considered an exercise in remembering, understanding and applying concepts, and is thus in the three lower levels.
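A compact sketch of the computation the students simulate by hand is shown below. The weight values and the full tag list are made up for illustration; only the three features and the 1x3-by-3x10 shapes come from the exercise.

import numpy as np

# Toy weights: 3 features x 10 POS tags (random stand-ins for trained values).
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 10))
tags = ["PRP", "N", "VP", "MOD", "RB", "DT", "IN", "CC", "CD", "UH"]

def predict(features):
    """features: [isUpperCase, length_chars, endsInEd] for one word."""
    scores = features @ W                          # shape (10,): one score per tag
    probs = np.exp(scores) / np.exp(scores).sum()  # softmax normalization
    return tags[int(np.argmax(probs))]

print(predict(np.array([1, 2, 0])))                # features of "My" in the table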

Exercise 2

In this exercise, students are given the exact same sentence as before. The difference is that the POS tagger with which the students work is a Maximum Entropy Markov Model (MEMM), and therefore there are no independent predictions.

Students have already learned about the Viterbi algorithm for Hidden Markov Models (HMMs). The goal of this exercise is to remember how this algorithm works and also to become aware of the difference between HMMs and MEMMs, i.e., the use of an extended set of features beyond single words and tags.

In order to raise that awareness, the exercise is to go through the pseudo-code of the Viterbi algorithm for HMMs (Jurafsky and Martin, 2008, p. 220), understand it and find out which are the

4We also provide the dot product output to save time.

specific elements that have to be changed so that it works with MEMMs. As a hint for guessing what should go there, we provide students with a trellis of the example above. We highlight one node in that trellis and discuss what its outgoing arcs represent: a probability distribution P_{S'}(S | O) = [P_1, P_2, ..., P_k]. The value of k depends on the number of states, i.e., output classes.

This exercise requires understanding the Viterbi algorithm for HMMs and understanding what should be modified to make it work with MEMMs. As students have to draw connections between these two models, we consider that the cognitive load, based on Bloom’s taxonomy, is higher than in exercise 1.
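For reference, one possible answer is sketched below: the HMM’s transition-times-emission product is replaced by the MEMM’s locally normalized conditional P(s | s', o_t). The function signature prob(prev_state, features) is an assumption made for the sketch, not the pseudo-code used in class.

import numpy as np

def viterbi_memm(obs_feats, states, prob):
    """Viterbi decoding for an MEMM. `prob(prev, feats)` returns a locally
    normalized distribution over `states` (prev=None means the start state)."""
    T, S = len(obs_feats), len(states)
    v = np.zeros((T, S))                # best path score per (time, state)
    back = np.zeros((T, S), dtype=int)  # back-pointers
    v[0] = prob(None, obs_feats[0])
    for t in range(1, T):
        for s in range(S):
            cand = [v[t - 1][sp] * prob(states[sp], obs_feats[t])[s]
                    for sp in range(S)]
            back[t][s] = int(np.argmax(cand))
            v[t][s] = max(cand)
    path = [int(np.argmax(v[-1]))]      # follow back-pointers from the end
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return [states[i] for i in reversed(path)]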

Exercise 3

In the previous exercise, when we mention the probability distribution P_{S'}(S | O), we note that some of those probabilities could be zero. Because of that, in this exercise we ask students to reflect on what would happen if we had a node with many zero probabilities. For example, what happens if a node has 2 non-zero probabilities? What if it has 10 non-zero probabilities? We ask this question to make students think about those probabilities and become aware of the Label Bias problem.

In Bloom’s taxonomy, this exercise could be seen as evaluating and/or analyzing, as students should be able to find the problem by themselves. We do not ask them to apply what they know, but to think about how having more or fewer arcs could affect the probabilities of nodes, and thus the final sequence probabilities.
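A tiny numeric illustration of the effect (my own, not part of the exercise sheet): under local normalization, a state’s outgoing probabilities must sum to one, so a state with few arcs effectively ignores the observation.

import numpy as np

def local_softmax(raw_scores):
    """Locally normalize a node's outgoing arc scores."""
    e = np.exp(raw_scores)
    return e / e.sum()

# A node with a single outgoing arc assigns it probability 1,
# no matter how badly that arc fits the observation:
print(local_softmax(np.array([-5.0])))  # -> [1.]

# With ten equally scored arcs, each only gets probability 0.1:
print(local_softmax(np.zeros(10)))      # -> [0.1 0.1 ... 0.1]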

Exercise 4

Since they have already identified the problem, the last exercise is to suggest a solution to it. Students should analyze the trellis given in exercise 2 and think about a possible solution. They are given 5-10 minutes to discuss the topic. Afterwards, we discuss their possible solutions and, if nobody reaches the actual solution, we introduce global normalization (per observation sequence), which is used in Conditional Random Fields.
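For completeness, the global normalization that CRFs introduce can be written down as in Lafferty et al. (2001); the following is a standard linear-chain formulation in my notation, not a reproduction of the lecture slides:

p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}\exp\Big(\sum_{t}\sum_{k}\lambda_k f_k(y_{t-1}, y_t, \mathbf{x}, t)\Big),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'}\exp\Big(\sum_{t}\sum_{k}\lambda_k f_k(y'_{t-1}, y'_t, \mathbf{x}, t)\Big)

Because Z(x) sums over whole label sequences, no single state is forced to spread probability mass over only its own outgoing arcs, which is exactly what removes the Label Bias problem.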

In this case, the students are in a similar scenario to the researchers who discovered the Label Bias problem.


Figure 1: The left plot shows the responses to the survey administered one week before the flipped classroom session started. The right plot shows the responses to the same questions after the session was over. Based on the responses, we can say that students felt that they understood these four relevant aspects. It seems, though, that in the future we should put more emphasis on the practical aspects of the taught models (see the right column of the second plot).

As such, we consider that this exercise is at the highest level of Bloom’s taxonomy, in which students discuss and conjecture about a possible solution to the problem.

Practical applications

After covering the previous exercises, in the next session we cover Conditional Random Fields from a practical perspective: we show students how to process text to use it with Conditional Random Fields. We also discuss how to extract relevant features.

6.3 After classroom (exam project)

With regard to the final project, there is a practical session in which different methods, including CRFs, are shown. The students are then able to apply the obtained knowledge.

7 Discussion and evaluation

Our expectation is that students will learn more, and in a better way, with a similar effort on their side. We think that, on the one hand, the reward (for students) of this teaching method is high. On

the other hand, the required preparation time (for teachers) is higher.

The increase in preparation time is directly related to the fact that the topics covered are expected to be more complex. As students have already received the lecturing as homework, the in-class activities become more complicated, and because of this the teacher must have a deeper understanding of the topic in question. Besides, in our specific case the video lectures were made by us. We could have saved time by using the resources available on online learning platforms, but we decided to make the videos ourselves, which made the preparation more time-consuming.

The evaluation, based on a survey that students had to answer, seems positive, although there are issues. Considering how much students learned, we can say that the learning goals were satisfied, as can be seen in Figure 1. Each column in the plots represents the answers of our students to a question regarding knowledge of different concepts. They had to answer either Yes, Roughly or No. As we can see, students did not have much expertise on the topic before the lecture and preparation. Afterwards, the responses show that students got the


required understanding. Besides, we also asked whether this teaching

method is more rewarding. 7 out of 15 answered yes; one answered no. Finally, the other 7 out of 15 did not take a stance, responding “I don’t know”. From these results we cannot strongly confirm that the method, as we implemented it, is rewarding. It does seem quite rewarding, though.

Together with these answers, we gave the students the option of writing further comments. There were some positive comments, while others raised possible issues with suggestions for improvement. One student mentioned that it is hard not to be able to ask questions while watching the lecture, and that it is hard to remember the context of the question in class. A possible solution to this issue could be to include a discussion forum for each video lecture, so that students could post their questions immediately. Another student pointed out that the homework distribution was far from optimal, as all homework was given in the first week and in the second week there was nothing. This should definitely be rethought so that the homework load is more balanced. On the bright side, students found it positive to have the chance to rewatch the lecture videos. Also, some felt that group work was better and that it was nice to have more time to understand the code and exercises.

8 Conclusion and Future directions

In this paper, we presented a possible class structure for teaching Conditional Random Fields in almost two lectures. The class is formulated as a flipped classroom, in which there is a strong workload in the students’ preparation; this allows students to gain a deeper understanding of the topic, compared to traditional lecturing.

We regard the flipped classroom as a relevant method for teaching complex topics. We believe that asking students to do part of the work beforehand has allowed us to go one step further in the understanding of CRFs. Furthermore, the exercise types that we covered were in the highest orders of Bloom’s taxonomy, showing the efficiency of the method.

The feedback that we got from students seems to show that, in general, they learn about the topic in question. It further seems that the method is rewarding for some students. We believe that the methodology has advantages, e.g., students get a

deeper understanding of the topic, and disadvantages, for instance a higher preparation time. Considering both aspects, a possibility could be to find a balance by flipping only a portion of the whole lecture and keeping the other portion as a more traditional lecture.

In our prepared lectures, we decided to emphasize a problem that Conditional Random Fields fix: the Label Bias problem. In the future, it would be interesting to include a discussion about the observation bias (Klein and Manning, 2002), which happens when predictions are made by totally ignoring the labels.

As students have very varied backgrounds, we could already observe differences in the time needed to complete the exercises. A possible solution to this is to apply differentiated teaching (Rock et al., 2008), where the teaching content is adjusted to groups of students, making the activities more tailored to those student clusters.

Acknowledgements

First of all, I would like to acknowledge the whole Teaching and Learning in Higher Education (TLHE) program taught at the University of Copenhagen, especially my pedagogical and academic supervisors, Lis Lak Risager (TEACH centre) and Patrizia Paggio (Centre for Language Technology), respectively. I would also like to thank the students of Language Processing 2 (Spring semester, course 2020/2021, IT & Cognition program) for being so helpful in the development of this model. Last, but not least, I thank the anonymous reviewers for their valuable comments and revisions. These reviews are useful not only for this article, but also for the development of the whole Language Processing course.

References

L.W. Anderson, B.S. Bloom, D.R. Krathwohl, P. Airasian, K. Cruikshank, R. Mayer, P. Pintrich, J. Raths, and M. Wittrock. 2001. A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom’s Taxonomy of Educational Objectives. Longman.

Benjamin S. Bloom et al. 1956. Taxonomy of Educational Objectives. Vol. 1: Cognitive Domain. New York: McKay, 20:24.

Léon Bottou. 1991. Une Approche théorique de l’Apprentissage Connexionniste: Applications à la Reconnaissance de la Parole. Ph.D. thesis, Université de Paris XI, Orsay, France.


Cynthia Brame. 2013. Flipping the classroom (retrieved last 2021-03-15).

Daniel Jurafsky and James H. Martin. 2008. Speech and Language Processing: An Introduction to Speech Recognition, Computational Linguistics and Natural Language Processing. Upper Saddle River, NJ: Prentice Hall.

Dan Klein and Christopher D. Manning. 2002. Conditional structure versus conditional estimation in NLP models. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), pages 9–16.

John Lafferty, Andrew McCallum, and Fernando C.N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data.

Maureen J. Lage, Glenn J. Platt, and Michael Treglia. 2000. Inverting the classroom: A gateway to creating an inclusive learning environment. The Journal of Economic Education, 31(1):30–43.

Andrew McCallum, Dayne Freitag, and Fernando C.N. Pereira. 2000. Maximum entropy Markov models for information extraction and segmentation. In ICML, volume 17, pages 591–598.

Lisa Nielsen. 2012. Five reasons I’m not flipping over the flipped classroom. Technology & Learning, 32(10):46–46.

Marcia L. Rock, Madeleine Gregg, Edwin Ellis, and Robert A. Gable. 2008. REACH: A framework for differentiating classroom instruction. Preventing School Failure: Alternative Education for Children and Youth, 52(2):31–47.



Flamingos and Hedgehogs in the Croquet-Ground: Teaching Evaluation of NLP Systems for Undergraduate Students

Brielen Madureira
Computational Linguistics
Department of Linguistics
University of Potsdam
[email protected]

Abstract

This report describes the course Evaluation of NLP Systems, taught for Computational Linguistics undergraduate students during the winter semester 20/21 at the University of Potsdam, Germany. It was a discussion-based seminar that covered different aspects of evaluation in NLP, namely paradigms, common procedures, data annotation, metrics and measurements, statistical significance testing, best practices and common approaches in specific NLP tasks and applications.

1 Motivation

“Alice soon came to the conclusion that it was a very difficult game indeed.” 1

When the Queen of Hearts invited Alice to her croquet-ground, Alice had no idea how to play that strange game with flamingos and hedgehogs. NLP newcomers may be as puzzled as her when they enter the Wonderland of NLP and encounter a myriad of strange new concepts: Baseline, F1 score, glass box, ablation, diagnostic, extrinsic and intrinsic, performance, annotation, metrics, human-based, test suite, shared task. . .

Although experienced researchers and practitioners may easily relate them to the evaluation of NLP

1Alice in Wonderland by Lewis Carroll, public domain. Illustration by John Tenniel, public domain, via Wikimedia Commons.

models and systems, for newcomers like undergraduate students it is not simply a matter of looking up their definitions. It is necessary to show them the big picture of what and how we play in the croquet-ground of evaluation in NLP.

The NLP community clearly cares about doing proper evaluation. From earlier works like the book by Karen Spärck Jones and Julia R. Galliers (1995) to the winner of the ACL 2020 best paper award (Ribeiro et al., 2020) and recent dedicated workshops, e.g., Eger et al. (2020), the formulation of evaluation methodologies has been a prominent topic in the field.

Despite its importance, evaluation is usually covered very briefly in NLP courses due to tight schedules. Teachers barely have time to discuss dataset splits, simple metrics like accuracy, precision, recall and F1 score, and some techniques like cross-validation. As a result, students end up learning about evaluation on the fly as they begin their careers in NLP. The lack of structured knowledge may leave them unacquainted with the multifaceted metrics and procedures, which can render them partially unable to evaluate models critically and responsibly. The leap from that one lecture to what is expected in good NLP papers and software should not be underestimated.

The course Evaluation of NLP Systems, which I taught for undergraduate Computational Linguistics students in the winter semester of 20/21 at the University of Potsdam, Germany, followed a reading- and discussion-based learning approach with three main goals: i) helping participants become aware of the importance of evaluation in NLP; ii) discussing different evaluation methods, metrics and techniques; and iii) showing how evaluation is being done for different NLP tasks.

The following sections provide an overview of the course content and structure. With some adaptation, this course can also be suitable for more advanced students.


Paradigms: Kinds of evaluation and main steps, e.g. intrinsic and extrinsic, manual and automatic, black box and glass box.

Common Procedures: Overview about the use of measurements, baselines, dataset splits, cross validation, error analysis, ablation, human evaluation and comparisons.

Annotation: How to annotate linguistic data, evaluate the annotation and how the annotation scheme can affect the evaluation of a system’s performance.

Metrics and Measurements: Outline of the different metrics commonly used in NLP, what they aim to quantify and how to interpret them.

Statistical Significance Testing: Hypothesis testing for comparing the performance of two systems on the same dataset.

Best Practices: The linguistic aspect of NLP, reproducibility and the social impact of NLP.

NLP Case Studies: Group presentations about specific approaches in four NLP tasks/applications (machine translation, natural language generation, dialogue and speech synthesis) and related themes (the history of evaluation, shared tasks, ethics and ACL’s code of conduct and replication crisis).

Table 1: Overview of the course content.

2 Course Content and Format

Table 1 presents an overview of the topics discussed in the course. Details about the weekly reading lists are available at the course’s website.2

The course happened 100% online due to the pandemic. It was divided into two parts. In the first half of the semester, students learned about the evaluation methods used in NLP in general and, to some extent, in machine learning. After each meeting, I posted a pre-recorded short lecture, slides and a reading list covering the next week’s content. The participants thus had one week to work through the material anytime before the next meeting slot. I provided diverse sources like papers, blog posts, tutorials, slides and videos.

I started the online meetings with a wrap-up and feedback about the previous week’s content. Then, I randomly split the students into groups of 3 or 4 participants in breakout sessions, so that they could discuss a worksheet together for about 45 minutes. I encouraged them to use this occasion to profit from the interaction and brainstorming with their

2https://briemadu.github.io/evalNLP/schedule

peers and exchange arguments and thoughts. After the meeting, they had one week to write down their solutions individually and submit them.

In the second half of the semester, the participants divided into 4 groups to analyze how evaluation is being done in specific NLP tasks. For larger groups, other NLP tasks can be added. They prepared group presentations and discussion topics according to general guidelines and an initial bibliography that they could expand. Students provided anonymous feedback about each other’s presentations to me, and I then shared it with the presenters, so as to have the chance to filter out abusive or offensive comments.

The last lecture was a tutorial about useful metrics available in the scikit-learn and nltk Python libraries, using Jupyter Notebook (Kluyver et al., 2016).
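The tutorial notebook itself is not reproduced here, but a minimal example of the kind of scikit-learn call it covered might look as follows (the labels are invented for illustration):

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["N", "V", "N", "DET", "N", "V"]   # gold labels (toy example)
y_pred = ["N", "N", "N", "DET", "V", "V"]   # system predictions

print("accuracy:", accuracy_score(y_true, y_pred))
p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"macro P={p:.2f} R={r:.2f} F1={f1:.2f}")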

Finally, they had six weeks to work on a final project. Students could select one of the following three options: i) a critical essay on the development and current state of evaluation in NLP, discussing the positive and negative aspects and where to go from here; ii) a hands-on detailed evaluation of an NLP system of their choice, which could be,


for example, an algorithm they implemented for another course; or iii) a summary of the course in the format of a small newspaper.

3 Participants

Seventeen bachelor students of Computational Linguistics attended the course. At the University of Potsdam, this seminar falls into the category of a module called Methods of Computational Linguistics, which is intended for students in the 5th

semester of their bachelor course. Still, one student in the 3rd semester and many students in higher semesters also took part.

By the 5th semester, students are expected to have completed introductory courses on linguistics (phonetics and phonology, syntax, morphology, semantics and psycho- and neurolinguistics), computational linguistics techniques, computer science and programming (finite state automata, advanced Python and other courses of their choice), introduction to statistics and empirical methods and foundations of mathematics and logic, as well as varying seminars related to computational linguistics.

Although there were no formal requirements for taking this course, students should preferably be familiar with some common tasks and practices in NLP and the basics of statistics.

4 Outcomes

I believe this course successfully introduced students to several fundamental principles of evaluation in NLP. The quality of their submissions, especially the final project, was, in general, very high. By knowing how to properly manage flamingos and hedgehogs, they will hopefully be spared the sentence "off with their heads!" as they continue their careers in NLP. The game is not very difficult when one learns the rules.

Students gave very positive feedback at the end of the semester about the content, the literature and the format. They particularly enjoyed the opportunity to discuss with each other, saying it was good to exchange what they recalled from the reading. They also stated that what they learned contributed to their understanding in other courses and improved their ability to document and evaluate models they implement. The course was also useful for them to start reading more scientific literature.

In terms of improvements, they mentioned that the weekly workload could be reduced. They also reported that the reading for the week when we covered statistical significance testing was too advanced. Still, they could do the worksheet since it did not dive deep into the theory.
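For instructors replicating the course, the worksheet-level content corresponds to simple paired tests rather than the full theory; the sketch below shows one common choice, a paired bootstrap in the style of Berg-Kirkpatrick et al. (2012), with randomly generated per-example correctness arrays standing in for real system outputs.

```python
# Paired bootstrap test: is system A's accuracy gain over system B
# significant on the same test set? The data here is synthetic.
import numpy as np

rng = np.random.default_rng(0)
system_a = rng.binomial(1, 0.70, size=200)  # 1 = correct prediction
system_b = rng.binomial(1, 0.60, size=200)

observed_delta = system_a.mean() - system_b.mean()
n_boot, count = 10_000, 0
for _ in range(n_boot):
    idx = rng.integers(0, len(system_a), size=len(system_a))  # resample with replacement
    delta = system_a[idx].mean() - system_b[idx].mean()
    if delta > 2 * observed_delta:  # criterion from Berg-Kirkpatrick et al. (2012)
        count += 1
print("p ≈", count / n_boot)
```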

The syllabus, slides and suggested readings are available on the course's website (https://briemadu.github.io/evalNLP/). The references here list the papers and books used to put together the course and have no ambition of being exhaustive. In case this course is replicated, the references should be updated with the most recent papers. I can share the worksheets and guidelines for the group presentation and the project upon request. Feedback from readers is very welcome.

Acknowledgments

In this course, I was inspired by and used material made available online by many people, to whom I am thankful. I also thank the students, who were very engaged during the semester and made it a rewarding experience for me. Moreover, I am grateful to the anonymous reviewers for their detailed and encouraging feedback.

References

Valerie Barr and Judith L. Klavans. 2001. Verification and validation of language processing systems: Is it evaluation? In Proceedings of the ACL 2001 Workshop on Evaluation Methodologies for Language and Dialogue Systems.

Yonatan Belinkov and James Glass. 2019. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72.

Anja Belz. 2009. That's nice... what can you do with it? Computational Linguistics, 35(1):111–118.

Emily M. Bender and Batya Friedman. 2018. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587–604.

Taylor Berg-Kirkpatrick, David Burkett, and Dan Klein. 2012. An empirical investigation of statistical significance in NLP. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 995–1005, Jeju Island, Korea. Association for Computational Linguistics.

Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. 2007. (Meta-) evaluation of machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 136–158, Prague, Czech Republic. Association for Computational Linguistics.

Asli Celikyilmaz, Elizabeth Clark, and Jianfeng Gao. 2020. Evaluation of text generation: A survey. arXiv preprint arXiv:2006.14799.

Eirini Chatzikoumi. 2020. How to evaluate machine translation: A review of automated and human metrics. Natural Language Engineering, 26(2):137–161.

Jan Deriu, Alvaro Rodrigo, Arantxa Otegi, Guillermo Echegoyen, Sophie Rosset, Eneko Agirre, and Mark Cieliebak. 2021. Survey on evaluation methods for dialogue systems. Artificial Intelligence Review, 54(1):755–810.

Rotem Dror, Gili Baumer, Marina Bogomolov, and Roi Reichart. 2017. Replicability analysis for natural language processing: Testing significance with multiple datasets. Transactions of the Association for Computational Linguistics, 5:471–486.

Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. 2018. The hitchhiker's guide to testing statistical significance in natural language processing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1383–1392, Melbourne, Australia. Association for Computational Linguistics.

Rotem Dror, Lotem Peled-Cohen, Segev Shlomov, and Roi Reichart. 2020. Statistical significance testing for natural language processing. Synthesis Lectures on Human Language Technologies, 13(2):1–116.

Steffen Eger, Yang Gao, Maxime Peyrard, Wei Zhao, and Eduard Hovy, editors. 2020. Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems. Association for Computational Linguistics, Online.

Karën Fort, Gilles Adda, and K. Bretonnel Cohen. 2011. Last words: Amazon Mechanical Turk: Gold mine or coal mine? Computational Linguistics, 37(2):413–420.

J.R. Galliers, K.S. Jones, and University of Cambridge Computer Laboratory. 1993. Evaluating Natural Language Processing Systems. Computer Laboratory Cambridge: Technical report. University of Cambridge, Computer Laboratory.

Albert Gatt and Emiel Krahmer. 2018. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research, 61:65–170.

Kyle Gorman and Steven Bedrick. 2019. We need to talk about standard splits. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2786–2791, Florence, Italy. Association for Computational Linguistics.

Lynette Hirschman and Henry S. Thompson. 1997. Overview of Evaluation in Speech and Natural Language Processing, pages 409–414. Cambridge University Press, USA.

Dirk Hovy and Shannon L. Spruit. 2016. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 591–598, Berlin, Germany. Association for Computational Linguistics.

David M. Howcroft, Anya Belz, Miruna-Adriana Clinciu, Dimitra Gkatzia, Sadid A. Hasan, Saad Mahamood, Simon Mille, Emiel van Miltenburg, Sashank Santhanam, and Verena Rieser. 2020. Twenty years of confusion in human evaluation: NLG needs evaluation sheets and standardised definitions. In Proceedings of the 13th International Conference on Natural Language Generation, pages 169–182, Dublin, Ireland. Association for Computational Linguistics.

Karen Sparck Jones and Julia R. Galliers. 1995. Evaluating natural language processing systems: An analysis and review, volume 1083 of Lecture Notes in Artificial Intelligence. Springer-Verlag Berlin Heidelberg.

Margaret King. 1996. Evaluating natural language processing systems. Communications of the ACM, 39(1):73–79.

Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Brian Granger, Matthias Bussonnier, Jonathan Frederic, Kyle Kelley, Jessica Hamrick, Jason Grout, Sylvain Corlay, Paul Ivanov, Damián Avila, Safia Abdalla, and Carol Willing. 2016. Jupyter notebooks – a publishing format for reproducible computational workflows. In Positioning and Power in Academic Publishing: Players, Agents and Agendas, pages 87–90. IOS Press.

Zachary C. Lipton and Jacob Steinhardt. 2019. Troubling trends in machine learning scholarship: Some ML papers suffer from flaws that could mislead the public and stymie future research. Queue, 17(1):45–77.

Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2122–2132, Austin, Texas. Association for Computational Linguistics.

Jekaterina Novikova, Ondrej Dušek, Amanda Cercas Curry, and Verena Rieser. 2017. Why we need new evaluation metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2241–2252, Copenhagen, Denmark. Association for Computational Linguistics.


Patrick Paroubek, Stéphane Chaudiron, and Lynette Hirschman. 2007. Principles of evaluation in natural language processing. Traitement Automatique des Langues, 48(1):7–31.

Carla Parra Escartín, Wessel Reijers, Teresa Lynn, Joss Moorkens, Andy Way, and Chao-Hong Liu. 2017. Ethical considerations in NLP shared tasks. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pages 66–73, Valencia, Spain. Association for Computational Linguistics.

Ehud Reiter and Anja Belz. 2009. An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, 35(4):529–558.

Philip Resnik and Jimmy Lin. 2010. Evaluation of NLP systems. In The Handbook of Computational Linguistics and Natural Language Processing, chapter 11.

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, Online. Association for Computational Linguistics.

Stefan Riezler and John T. Maxwell. 2005. On some pitfalls in automatic evaluation and significance testing for MT. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 57–64, Ann Arbor, Michigan. Association for Computational Linguistics.

Nathan Schneider. 2015. What I've learned about annotating informal text (and why you shouldn't take my word for it). In Proceedings of The 9th Linguistic Annotation Workshop, pages 152–157, Denver, Colorado, USA. Association for Computational Linguistics.

Noah A. Smith. 2011. Linguistic structure prediction. Synthesis Lectures on Human Language Technologies, 4(2):1–274.

Anders Søgaard, Anders Johannsen, Barbara Plank, Dirk Hovy, and Hector Martínez Alonso. 2014. What's in a p-value in NLP? In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 1–10, Ann Arbor, Michigan. Association for Computational Linguistics.

Karen Sparck Jones. 1994. Towards better NLP system evaluation. In Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994.

Chris van der Lee, Albert Gatt, Emiel van Miltenburg, Sander Wubben, and Emiel Krahmer. 2019. Best practices for the human evaluation of automatically generated text. In Proceedings of the 12th International Conference on Natural Language Generation, pages 355–368, Tokyo, Japan. Association for Computational Linguistics.

Petra Wagner, Jonas Beskow, Simon Betz, Jens Edlund, Joakim Gustafson, Gustav Eje Henter, Sébastien Le Maguer, Zofia Malisz, Éva Székely, Christina Tånnander, et al. 2019. Speech synthesis evaluation—state-of-the-art assessment and suggestion for a novel research program. In Proceedings of the 10th Speech Synthesis Workshop (SSW10).


Proceedings of the Fifth Workshop on Teaching NLP, pages 92–95, June 10–11, 2021. ©2021 Association for Computational Linguistics

An Immersive Computational Text Analysis Course for Non-Computer Science Students at Barnard College

Adam Poliak, Jalisha Jenifer
Barnard College, Columbia University

{apoliak,jjenifer}@barnard.edu

Abstract

We provide an overview of a new Computational Text Analysis course that will be taught at Barnard College over a six-week period in May and June 2021. The course is targeted to non-Computer Science students at a U.S. liberal arts college who wish to incorporate fundamental Natural Language Processing tools in their research and studies. During the course, students will complete daily programming tutorials, read and review contemporary research papers, and propose and develop independent research projects.

1 Introduction

As computing has become foundational to 21st century literacy (Guzdial, 2019), students across many disciplines are increasingly interested in gaining computing literacy and skills. Barnard College is meeting this increased demand through its Thinking Quantitatively and Empirically and Thinking Technologically and Digitally requirements, which respectively aim for students "to develop basic competence in the use of one or more mathematical, statistical, or deductive methods" and to "foster students' abilities to use advanced technologies for creative productions, scholarly projects, scientific analysis or experimentation."1

This new 3-credit Computational Text Analysis course will enable roughly 25 students to leverage advanced NLP technologies in their work.2 The goal of the course is to introduce students to the tools and quantitative methods to discover, measure, and infer phenomena from large amounts of text. Unlike a traditional Natural Language Processing class, students will not implement key algorithms from scratch.

1http://catalog.barnard.edu/barnard-college/curriculum/requirements-liberal-arts-degree/foundations/

2https://coms2710.barnard.edu/

Week  Topic
1     Bash, Python, & Math bootcamp
2     Words, Words, Words
3     Topic Modeling
4     Data Collection
5     Machine Learning
6     Advanced Topics & Final Projects

Table 1: Theme for each of the six weeks.

Rather, students will learn how to use existing Python libraries like nltk (Loper and Bird, 2002), gensim (Rehurek and Sojka, 2011), spacy (Honnibal et al., 2020), and sklearn (Pedregosa et al., 2011) to analyze large amounts of textual data. Students will also gain familiarity with web scraping tools and APIs often used to collect textual data.
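As a rough illustration of this off-the-shelf style of work (our sketch, not taken from the course materials), the snippet below tokenizes a pair of placeholder documents with nltk and vectorizes them with sklearn:

```python
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("punkt", quiet=True)  # tokenizer models used by word_tokenize

docs = [
    "Students will analyze large amounts of text.",
    "The course introduces off-the-shelf NLP libraries.",
]

# Tokenize with nltk rather than writing a tokenizer from scratch.
print(nltk.word_tokenize(docs[0].lower()))

# Vectorize with sklearn to compare documents quantitatively.
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)
print(matrix.shape, list(vectorizer.get_feature_names_out())[:5])
```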

2 Course Design

Due to COVID-19 schedule changes, the course will meet remotely for 95 minutes, four days a week, over 6 weeks. Each week will revolve around a class of methods used in computational text analysis, shown in Table 1. Course meetings will begin with a 30-minute lecture motivating and explaining a specific concept or method. Corresponding readings introducing these methods will be provided before class. In the subsequent 30-minute interval, we will collaboratively work through a Jupyter Notebook demonstrating the day's concept using real-world textual data.3 Both the lecture and collaborative demo will be recorded and available to students after class. Students can use the remaining 35 minutes to begin and complete daily homework tutorials where they will further experiment with applying methods to provided real-world text.

3We will continue this practice from a previous online Introductory Data Science class (https://coms1016.barnard.edu/) where students found live coding demos to be helpful.


Assignment       Grade Percent
Daily Tutorials  20
Weekly Homework  30
Paper Reviews    15
Final Project    35

Table 2: Assignments and corresponding grade weights.

2.1 Assignments

Students will complete four types of assignments during the six-week course to gain familiarity, comfort, and confidence applying computational textual analysis methods to large corpora (Table 2). With an emphasis on learning by doing, completing daily tutorials independently will help students solidify their understanding of the day's material while working through examples where these methods have been successfully deployed. Daily tutorials will primarily be graded using an autograder. Many of the daily tutorials are adaptations of notebooks from similar courses, e.g. Jonathan Reeve's Computational Literary Analysis course4 at Columbia University, Matthew Wilkens's and David Mimno's Text Mining for History and Literature course5 at Cornell, and Christopher Hench's and Claudia von Vacano's Rediscovering Text as Data course6 at Berkeley.

Each week, students will be given a corpus and a research question and will be tasked with using the previous week's methods to try answering the specific research question. These weekly homeworks will give students freedom to experiment with different methods and see how different design choices can have significant impacts. Table 3 outlines the theme and corpora for each weekly homework. Students will be allowed to work on these assignments in pairs. Completed notebooks containing students' code, figures, and a brief write-up will be graded manually. To help prepare students for final projects, feedback will include questions and comments related to both the technical methods used and presentation.7

4https://icla2020b.jonreeve.com/
5https://github.com/wilkens-teaching/info3350-f20
6https://github.com/henchc/Rediscovering-Text-as-Data
7Daily tutorials and weekly homeworks will be publicly available at https://github.com/BC-COMS-2710/summer21-material. Solutions and autograders will be available to other instructors wishing to adopt this course.

Theme                Corpus
Readability          U.S. Presidential Inaugural Addresses
Document Similarity  NYTimes Obituaries
Web Scraping         Columbia Course Reviews
Machine Learning     Political Tweets

Table 3: Themes and corresponding corpus for the four week-long homework assignments.

Students will complete five brief paper reviews during the course. Each week, students will choose from a set of research papers and write a brief review of the paper. Each review will include a short summary and a few sentences reflecting on their opinion on the paper's methods and findings. This will help students appreciate the wide-ranging applicability of these methods and see more examples where researchers in other fields have successfully leveraged computational text analysis methods.

For their final projects, students will develop their own research question based on a corpus that they will collect. Students will apply at least two methods covered in the course to answer their research question. By the end of the fourth week, students will propose their research question and will begin to collect a corpus to study. Students will present a brief progress update to the rest of the class during the last scheduled course meeting (Monday June 14th) and provide feedback to each other. Students will use the official finals period to incorporate feedback and submit a final write-up describing their research.

2.2 Computational Infrastructure

All assignments will be completed using JupyterLabs hosted on a JupyterHub server maintained by Columbia University IT at no cost to students. This enables equal access for students who might not have resources to store and process large amounts of text. Additionally, this will ensure students use the same library and tool versions.

2.3 Prerequisites

We strongly recommend students to have taken at least one prior programming course. While we make no assumptions about students' prior programming experience and we cover bash and Python basics, we believe having one prior programming course will help students hit the ground running. Furthermore, this allows us to cover more advanced topics in depth. In future versions of the course taking place during a traditional 13-week semester, we will consider removing this recommendation.

Figure 1: Statistics from 18 registered students who completed a pre-course interest form. Breakdown of students by major (left) and year (right).

2.4 Textbooks

We will primarily use the recent Text Analysis in Python for Social Scientists: Discovery and Exploration textbook (Hovy, 2021) because of its assertion that "using off-the-shelf libraries . . . strips away some of the complexity of the task, which makes it a lot more approachable," thus enabling "programming novices to use efficient code to try out a research idea." We will use select readings from the NLTK book (Bird et al., 2009), the most recent edition of Speech and Language Processing (Jurafsky and Martin, 2000),8 An Introduction to Text Mining: Research Design, Data Collection, and Analysis (Ignatow and Mihalcea, 2017), and Melanie Walsh's online Cultural Analytics & Python textbook.9

3 Prospective Students

With summer course registration underway, 25 students are currently registered for the course. Since the course is under active development and students come from a range of academic backgrounds, we have asked students to complete a brief survey to help us prepare and tailor the class as much as possible. Figure 1 shows results from the 12 students who have completed the survey so far. While we see interest from Computer Science students, most of the students are programming novices – half of the survey participants indicated they have no prior programming experience outside of one or two programming courses.10

8Specifically Chapters 4 and 5 for Machine Learning.
9https://melaniewalsh.github.io/Intro-Cultural-Analytics/welcome.html
10The larger number of interested computer science students might be due to the course being offered by the Computer Science department.

Many students indicated they have corpora in mind that they are interested in studying or that they plan on using these methods to enhance their undergraduate theses in other fields.

4 Measuring STEM-Related Attitudes & Beliefs

The beliefs and attitudes students possess about learning play a significant role in their academic performance (Elliot and Dweck, 2013). Given the fact that our course is targeting non-computer science majors, we are interested in seeing how the academic attitudes held by participating students relate to and change throughout their participation in the course. We are particularly interested in measuring students' beliefs about the malleability of their own intelligence, as these beliefs have been shown to predict student motivation, engagement, and performance in academic courses (De Castella and Byrne, 2015). To measure the effect of such beliefs on students in this course, we will administer the self-theory implicit theories of intelligence scale (De Castella and Byrne, 2015). We will also measure students' motivation for computer science using an adapted version of the Science Motivation Questionnaire (Glynn et al., 2007). We will have students complete these surveys at the beginning and end of the course, which will enable us to a) measure whether students' initial attitudes significantly relate to their performance in the course, and b) measure whether participation in the course leads to changes in student attitudes over time.

5 Conclusion

We described a new Computational Text Analysis course that will be taught at Barnard College in May-June 2021. The broad goal of the course is for students to gain technical skills allowing them to successfully incorporate Natural Language Processing methods into their respective fields and research. We will rely on proven methods to measure the effect the course has on students' perception of their abilities and skills. All material developed for the course (slides, assignments, autograders) will be available for instructors to adopt at their institutions. An updated version of this paper will be available on the first author's webpage with reflections and lessons learned once the course is completed.

Acknowledgements

We thank Maya Bickel for providing feedback and debugging assignments. Additionally, we thank Michael Weisner and his team at Columbia University Information Technology for helping maintain the computational infrastructure for the course.

References

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural language processing with Python: Analyzing text with the natural language toolkit. O'Reilly Media, Inc.

Krista De Castella and Donald Byrne. 2015. My intelligence may be more malleable than yours: The revised implicit theories of intelligence (self-theory) scale is a better predictor of achievement, motivation, and student disengagement. European Journal of Psychology of Education, 30(3):245–267.

Andrew J. Elliot and Carol S. Dweck. 2013. Handbook of competence and motivation. Guilford Publications.

Shawn M. Glynn, Gita Taasoobshirazi, and Peggy Brickman. 2007. Nonscience majors learning science: A theoretical model of motivation. Journal of Research in Science Teaching: The Official Journal of the National Association for Research in Science Teaching, 44(8):1088–1107.

Mark Guzdial. 2019. Computing education as a foundation for 21st century literacy. In Proceedings of the 50th ACM Technical Symposium on Computer Science Education, pages 502–503.

Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. spaCy: Industrial-strength Natural Language Processing in Python.

Dirk Hovy. 2021. Text Analysis in Python for Social Scientists: Discovery and Exploration. Elements in Quantitative and Computational Methods for the Social Sciences. Cambridge University Press.

Gabe Ignatow and Rada Mihalcea. 2017. An introduction to text mining: Research design, data collection, and analysis. Sage Publications.

Daniel Jurafsky and James H. Martin. 2000. Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition.

Edward Loper and Steven Bird. 2002. NLTK: The Natural Language Toolkit. In Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics. Philadelphia: Association for Computational Linguistics.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Radim Rehurek and Petr Sojka. 2011. Gensim–Python framework for vector space modelling. NLP Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic, 3(2).


Proceedings of the Fifth Workshop on Teaching NLP, pages 96–98, June 10–11, 2021. ©2021 Association for Computational Linguistics

Introducing Information Retrieval for Biomedical Informatics Students

Sanya B. Taneja, Richard D. Boyce, William T. Reynolds, Denis Newman-Griffis
University of Pittsburgh
Pittsburgh, PA, USA
{sbt12, rdb20, wtr8, dnewmangriffis}@pitt.edu

Abstract

Introducing biomedical informatics (BMI) students to natural language processing (NLP) requires balancing technical depth with practical know-how to address application-focused needs. We developed a set of three activities introducing introductory BMI students to information retrieval with NLP, covering document representation strategies and language models from TF-IDF to BERT. These activities provide students with hands-on experience targeted towards common use cases, and introduce fundamental components of NLP workflows for a wide variety of applications.

1 Introduction

Natural language processing (NLP) technologies have become a fundamental tool for biomedical informatics (BMI) research, with uses ranging from mining new protein-protein interactions from scientific literature to recommending medications in clinical care. In most cases, however, the research question at hand is a biological or clinical one, and NLP tools are used off-the-shelf or lightly adapted rather than being the focus of research and development. When introducing BMI students to NLP, instructors must therefore navigate a balance between the computational and linguistic insights behind why NLP technologies work the way they do, and practical know-how for the application-focused settings where most students will go on to use NLP.

We developed a set of three activities designed to expose introductory BMI students to the fundamentals of NLP and provide hands-on experience with NLP techniques. These activities were designed for use in the Foundations of Biomedical Informatics I course at the University of Pittsburgh, a survey course which introduces students to core methods and topics in biomedical informatics. The course is required for all students in the Biomedical Informatics Training Program; students have a range of experience in computer science, and no background in artificial intelligence or NLP is required. The sequence of activities, implemented as Jupyter notebooks, comprises a single assignment focused on information retrieval, a common use case for NLP in all areas of BMI research. The assignment followed lectures focused on information retrieval, word embeddings, and language models (presented online in Fall 2020). §2 gives an overview of the three activities, and we note directions for further refinement of the activities in §3.

2 Assignment Details

Our set of three activities was designed with two primary learning goals in mind:

• Expose introductory BMI students to fundamental strategies for text representation and language models, geared towards information retrieval in biomedical contexts; and

• Provide students with hands-on experience creating NLP workflows using pre-built tools.

Our Jupyter notebooks provide a sequence of code samples to analyze and execute, combined with background questions to assess understanding of the computational and linguistic insights behind the NLP technologies students are using.

Notebook 1: Fundamentals of document analysis. In this notebook, students were first introduced to basic preprocessing tasks in NLP workflows such as tokenization, stemming, casing, and stop-word removal. Using a corpus from the Natural Language Toolkit (NLTK) (Bird et al., 2009), the notebook demonstrated two indexing techniques: inverted indexing and creation of a weighted document-term matrix using term frequency-inverse document frequency (TF-IDF). Students then implemented a synthetic information retrieval task involving a collection of 12 documents mapped to 20 queries as a reference set. The students evaluated the information retrieval system with synthetic results for two queries comprising numeric values for each document in the query. Students measured system performance using TREC evaluation measures including recall, precision, interpolated precision-recall average, and mean average precision. The evaluation measures were implemented using the pytrec_eval library (Van Gysel and de Rijke, 2018).
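As a rough sketch of the two indexing techniques the notebook demonstrates, on placeholder documents rather than the NLTK corpus used in class:

```python
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["aspirin reduces fever",
        "fever and cough",
        "aspirin interacts with warfarin"]

# Inverted index: term -> ids of the documents containing it.
inverted = defaultdict(list)
for doc_id, doc in enumerate(docs):
    for term in set(doc.split()):
        inverted[term].append(doc_id)
print(sorted(inverted["aspirin"]))  # [0, 2]

# Weighted document-term matrix via TF-IDF.
vectorizer = TfidfVectorizer()
dt_matrix = vectorizer.fit_transform(docs)  # shape: (n_docs, n_terms)
print(dt_matrix.shape)
```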

Notebook 2: Introduction to word embeddings. In this notebook, students were introduced to word embeddings as a text representation tool for NLP. The students first created embeddings using singular value decomposition (SVD) of a co-occurrence matrix built from the corpus of Notebook 1. This established the idea of capturing semantic similarity in texts, as opposed to a traditional bag-of-words model. The notebook then used pretrained word2vec embeddings (Mikolov et al., 2013) to demonstrate a more refined approach than SVD. The students were able to visualize both embedding approaches in the notebook. This was particularly important as most students did not have prior experience with embeddings, and plotting the word embeddings can lead to greater insight into the variable semantic similarity that embedding representations provide over lexical features.
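A minimal sketch of the SVD-based construction, using a placeholder two-sentence corpus rather than the notebook's actual data:

```python
import numpy as np

corpus = [["fever", "and", "cough"], ["aspirin", "reduces", "fever"]]
vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts within each sentence.
cooc = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j, v in enumerate(sent):
            if i != j:
                cooc[index[w], index[v]] += 1

# Keep the top-k left singular vectors, scaled, as word embeddings.
U, S, _ = np.linalg.svd(cooc)
k = 2
embeddings = U[:, :k] * S[:k]  # one k-dimensional vector per word
print(vocab)
print(embeddings.round(2))
```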

Notebook 3: Introduction to BERT and clinicalBERT. As all the activities are designed for introductory students, YouTube tutorials were used to provide background on neural networks and design decisions in language models for students with minimal background in NLP. Students were introduced to NLP workflows and language models using the Transformers library. Transformers is an open-source library developed by Hugging Face that provides a collection of pretrained models and general-purpose architectures for natural language processing (Wolf et al., 2020), based on the Transformer architecture (Vaswani et al., 2017). This includes BERT (Devlin et al., 2019) and clinicalBERT (Alsentzer et al., 2019), which are used in this notebook to implement Named Entity Recognition and Medical Language Inference tasks.

The notebook guided the students through a Medical Language Inference task to infer knowledge from clinical texts. Language inference in medicine is an important application of NLP, as clinical notes such as those containing the past medical history of a patient contain vital information that is utilized by clinicians to draw useful inferences (Romanov and Shivade, 2018). The task also introduced BMI students to challenges unique to applying NLP on clinical texts, such as domain-specific vocabulary, diversity in abbreviations, content structure, and named entity recognition with clinical jargon. The students used the MedNLI (Romanov and Shivade, 2018) dataset created from MIMIC-III (Johnson et al., 2016) clinical notes for this task. Building on knowledge from the previous notebooks, they implemented workflows to compare the performance of BERT and clinicalBERT models for prediction of labels in clinical texts. The students were encouraged to understand the importance of domain representation in pretraining data and the process of fine-tuning NLP models for domain-specific language.
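A minimal sketch of loading both models through the Transformers library; the model names are the publicly released Hugging Face checkpoints for BERT and clinicalBERT, and the input sentence is a placeholder:

```python
from transformers import AutoModel, AutoTokenizer

for name in ["bert-base-uncased", "emilyalsentzer/Bio_ClinicalBERT"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    inputs = tokenizer("Patient denies chest pain.", return_tensors="pt")
    outputs = model(**inputs)
    # Contextual embeddings: (batch, tokens, hidden size)
    print(name, outputs.last_hidden_state.shape)
```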

3 Discussion

We designed three activities to demonstrate fundamental concepts and workflows for application of NLP to introductory BMI students. While the scope of NLP in the biomedical field is much larger than one assignment, we developed the activities to provide students with a modular workflow of components that are applicable to other NLP applications besides information retrieval; i.e., text preprocessing, indexing, execution, and evaluation.

One practical challenge for clinically-focused exercises is the limited availability of benchmark datasets. Most clinical datasets require Data Use Agreements and individual training requirements that are cumbersome for a classroom setting. Thus, the first two notebooks use the NLTK corpus, which is a popular dataset for introducing NLP concepts without the challenges present in clinical datasets. While MIMIC (Johnson et al., 2016) is a valuable, relatively accessible source for clinical text, available annotations for it are limited. Students can thus be introduced to fine-tuning of language models for the specifics of medical language, but instructors must anticipate challenges in providing train/test splits for supervised machine learning.

We further take advantage of popular pre-built libraries (like Transformers) so that students can focus on the application rather than constructing neural networks. In an application-focused setting, technical knowledge of neural NLP systems is less necessary, but users of those systems still need to understand what kinds of regularities they rely on and when they may be unreliable. The activities are thus designed to reflect the perspective of the practical challenges that students will face when working in biomedical NLP.

Finally, one important area we are investigating as we refine and extend these teaching materials is the ethical considerations of biomedical AI technologies. The intersection of medical and AI ethics poses several challenging questions for designing, training, and applying AI technologies in the biomedical setting (Char et al., 2018; Keskinbora, 2019), and sensitive information often described in medical text presents further ethical questions for NLP systems (Lehman et al., 2021). Thus, determining what BMI students have a responsibility to understand about the NLP tools they use, and how we most effectively teach that information in limited course time, is key to broadening the responsible use of NLP in BMI research.

All materials presented in this paper are available from https://github.com/dbmi-pitt/bioinf_teachingNLP.

Acknowledgments

This work was supported in part by the National Library of Medicine of the National Institutes of Health under award number T15 LM007059.

References

Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, and Matthew McDermott. 2019. Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72–78, Minneapolis, Minnesota, USA. Association for Computational Linguistics.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural language processing with Python: Analyzing text with the natural language toolkit. O'Reilly Media, Inc.

Danton S. Char, Nigam H. Shah, and David Magnus. 2018. Implementing machine learning in health care - addressing ethical challenges. The New England Journal of Medicine, 378(11):981–983.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Alistair E.W. Johnson, Tom J. Pollard, Lu Shen, H. Lehman Li-Wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data, 3(1):1–9.

Kadircan H. Keskinbora. 2019. Medical ethics considerations on artificial intelligence. Journal of Clinical Neuroscience, 64:277–282.

Eric Lehman, Sarthak Jain, Karl Pichotta, Yoav Goldberg, and Byron C. Wallace. 2021. Does BERT pretrained on clinical notes reveal sensitive data? arXiv preprint arXiv:2104.07762.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546.

Alexey Romanov and Chaitanya Shivade. 2018. Lessons from natural language inference in the clinical domain. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1586–1596, Brussels, Belgium. Association for Computational Linguistics.

Christophe Van Gysel and Maarten de Rijke. 2018. Pytrec_eval: An extremely fast Python interface to trec_eval. In SIGIR. ACM.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.


Proceedings of the Fifth Workshop on Teaching NLP, pages 99–103, June 10–11, 2021. ©2021 Association for Computational Linguistics

Contemporary NLP Modeling in Six Comprehensive Programming Assignments

Greg Durrett∗, Jifan Chen, Shrey Desai, Tanya Goyal, Lucas Kabela, Yasumasa Onoe, Jiacheng Xu
Department of Computer Science
The University of Texas at Austin
[email protected]

Abstract

We present a series of programming assignments, adaptable to a range of experience levels from advanced undergraduate to PhD, to teach students design and implementation of modern NLP systems. These assignments build from the ground up and emphasize full-stack understanding of machine learning models: initially, students implement inference and gradient computation by hand, then use PyTorch to build nearly state-of-the-art neural networks using current best practices. Topics are chosen to cover a wide range of modeling and inference techniques that one might encounter, ranging from linear models suitable for industry applications to state-of-the-art deep learning models used in NLP research. The assignments are customizable, with constrained options to guide less experienced students or open-ended options giving advanced students freedom to explore. All of them can be deployed in a fully autogradable fashion, and have collectively been tested on over 300 students across several semesters.1

1 Introduction

This paper presents a series of assignments designed to give a survey of modern NLP through the lens of system-building. These assignments provide hands-on experience with concepts and implementation practices that we consider critical for students to master, ranging from linear feature-based models to cutting-edge deep learning approaches. The assignments are as follows:

A1. Sentiment analysis with linear models (Pang et al., 2002) on the Stanford Sentiment Treebank (Socher et al., 2013).

A2. Sentiment analysis with feedforward "deep averaging" networks (Iyyer et al., 2015) using GloVe embeddings (Pennington et al., 2014).

A3. Hidden Markov Models and linear-chain conditional random fields (CRFs) for named entity recognition (NER) (Tjong Kim Sang and De Meulder, 2003), using features similar to those from Zhang and Johnson (2003).

A4. Character-level RNN language modeling (Mikolov et al., 2010).

A5. Semantic parsing with seq2seq models (Jia and Liang, 2016) on the GeoQuery dataset (Zelle and Mooney, 1996).

A6. Reading comprehension on SQuAD (Rajpurkar et al., 2016) using a simplified version of the DrQA model (Chen et al., 2017), similar to BiDAF (Seo et al., 2016).

∗Corresponding author. Subsequent authors listed alphabetically.
1See https://cs.utexas.edu/~gdurrett for past offerings and static versions of these assignments; contact Greg Durrett for access to the repository with instructor solutions.

A1-A5 come with autograders. These train each student's model from scratch and evaluate performance on the development set of each task, verifying whether their code behaves as intended. The autograders are bundled to be deployable on Gradescope using their Docker framework.2 These coding assignments can also be supplemented with conceptual questions for hybrid assignments, though we do not distribute those as part of this release.
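The grading pattern can be summarized in a short sketch; the module and function names here are hypothetical placeholders, not the actual autograder interface:

```python
# Train the student's model from scratch, then gate on dev-set accuracy.
import importlib

THRESHOLD = 0.75  # illustrative passing bar

def grade(submission_module: str) -> bool:
    student = importlib.import_module(submission_module)
    model = student.train_model("data/train.txt")       # hypothetical API
    accuracy = student.evaluate(model, "data/dev.txt")  # hypothetical API
    print(f"dev accuracy = {accuracy:.3f}")
    return accuracy >= THRESHOLD

if __name__ == "__main__":
    print("PASS" if grade("student_submission") else "FAIL")
```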

Other Courses and Materials Several other widely-publicized courses like Stanford CS224N and CMU CS 11-747 are much more "neural-first" views of NLP: their assignments delve more heavily into word embeddings and low-level neural implementation like backpropagation.

2For the CRF and seq2seq modeling assignments, a custom framework must be used, as Gradescope autograders cannot handle these. We grade these in a batch fashion on a single instructional machine, which poses some logistical challenges.


Assignment                    Main Concepts                      Output
A1: Sentiment (Linear)        Classification, SGD, bag-of-words  Binary
A2: Sentiment (FFNNs)         PyTorch, word embeddings           Binary
A3: HMMs and CRFs for NER     Structured prediction, dyn. prog.  Tags
A4: Language Modeling         Neural sequence modeling           Token seq
A5: Seq2seq Semantic Parsing  Encoder-decoder, attention         Token seq
A6: Reading Comprehension     QA, domain adaptation              Span

Table 1: Breakdown of assignments. The concepts and model components in each are designed to build on one another. A gray square indicates partial engagement with a concept, typically when students are already given the needed component or it isn't a focus of the assignment.

By contrast, this course is designed to be a survey that also covers topics like linear classification, generative modeling (HMMs), and structured inference. Other hands-on courses discussed in prior Teaching NLP papers (Klein, 2005; Madnani and Dorr, 2008; Baldridge and Erk, 2008) make some similar choices about how to blend linguistics and CS concepts, but our desire to integrate deep learning as a primary (but not the sole) focus area guides us towards a different set of assignment topics.

2 Design Principles

This set of assignments was designed after we asked ourselves: what should a student taking NLP know how to build? NLP draws on principles from machine learning, statistics, linguistics, algorithms, and more, and we set out to expose students to a range of ideas from these disciplines through the lens of implementation. This choice follows the "text processing first" (Bird, 2008) or "core tools" (Klein, 2005) views of the field, with the idea that students can undertake additional study of particular topic areas and quickly get up to speed on modeling approaches given the building blocks presented here.

2.1 Covering Model Types

There are far too many NLP tasks and models to cover in a single course. Rather than focus on exposing students to the most important applications, we instead designed these assignments to feature a range of models along the following typological dimensions.

Output space The prediction spaces of models considered here include binary/multiclass (A1, A2), structured (sequence in A3, span in A6), and natural language (sequence of words in A4, executable query in A5). While structured models have fallen out of favor with the advent of neural networks, we view tagging and parsing as fundamental pedagogical tools for getting students to think about linguistic structure and ambiguity, and these are emphasized in our courses.

Modeling framework We cover generative models with categorical distributions (A3), linear feature-based models including logistic regression (A1) and CRFs (A3), and neural networks (A2, A4, A5, A6). These particularly highlight differences in training, optimization, and inference required for these different techniques.

Neural architectures We cover feedforward networks (A2), recurrent neural encoders (A4, A5, A6), seq2seq models (A5), and attention (A5, A6). From these, Transformers (Vaswani et al., 2017) naturally emerge even though they are not explicitly implemented in an assignment.

2.2 Other Desiderata

A major consideration in designing these assignments was to enable understanding without large-scale computational resources. Maintaining simplicity and tractability is the major reason we do not feature more exploration of pre-trained models (Devlin et al., 2019). These factors are also why we choose character-level language modeling (rather than word-level) and seq2seq semantic parsing (rather than translation): training large autoregressive models to perform well when output vocabularies are in the tens of thousands requires significant engineering expertise. While we teach students skills like debugging and testing models on simplified settings, we still found it less painful to build our projects around these more tractable tasks where students can iterate quickly.
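For concreteness, here is a minimal character-level language model in PyTorch of the sort A4 targets; this is our illustration on a toy corpus, not the assignment's starter code.

```python
import torch
import torch.nn as nn

class CharLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, chars):  # chars: (batch, seq_len) of character ids
        hidden, _ = self.rnn(self.embed(chars))
        return self.out(hidden)  # next-character logits at each position

text = "abababab"  # placeholder corpus
vocab = sorted(set(text))
ids = torch.tensor([[vocab.index(c) for c in text]])

model = CharLM(len(vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
logits = model(ids[:, :-1])  # predict each next character from its prefix
loss = nn.functional.cross_entropy(
    logits.reshape(-1, len(vocab)), ids[:, 1:].reshape(-1))
loss.backward()
optimizer.step()
print(float(loss))
```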

Another core goal was to allow students to build systems from the ground up using simple, understandable code. We build on PyTorch primitives (Paszke et al., 2019), but otherwise avoid using frameworks like Keras, Huggingface, or AllenNLP. The code is also somewhat "underengineered": we avoid an overly heavy reliance on Pythonic constructs like list comprehensions or generators, as not all students come in with a high level of familiarity with Python.

What's missing Parsing is notably absent from these assignments; we judged that both chart parsers and transition-based parsers involved too many engineering details specific to these settings. All of our classes do cover parsing and in some cases have other hands-on components that engage with parsing, but students do not actually build a parser. Instead, sequence models are taken as an example of structured inference, and other classification tasks are used instead of transition systems.

From a system-building perspective, the biggest omissions are pre-training and Transformers. These can be explored in the context of final projects, as we describe in the next section.

Finally, our courses integrate additional discussion around ethics, with specific discussions surrounding bias in word embeddings (Bolukbasi et al., 2016; Gonen and Goldberg, 2019) and ethical considerations of pre-trained models (Bender et al., 2021), as well as an open-ended discussion surrounding social impact and ethical considerations of NLP, deep learning, and machine learning. These are not formally assessed at present, but we are considering this for future iterations of the course given these topics' importance.

3 Deployment

These assignments have been used in four different versions of an NLP survey course: an upper-level undergraduate course, a master's-level course (delivered online), and two PhD-level courses. In the online MS course, these constitute the only assessment. For courses delivered in a traditional classroom format, we recommend choosing a subset of the assignments and supplementing with additional written assignments testing conceptual understanding.

Our undergrad courses use A1, A2, A4, and a final project based on A6. We use additional written assignments covering word embedding techniques, syntactic parsing, machine translation, and pre-trained models. Our PhD-level courses use A1, A2, A3, A5, and an independent final project.

Assignment  Eisenstein  Jurafsky + Martin
A1          2, 4        4, 5
A2          3           7
A3          7           8
A4          6           7, 9
A5          12, 18      11, 15
A6          17.5        23.2

Table 2: Book chapters associated with each assignment; gray indicates an imperfect match. Our courses use a combination of Eisenstein, ad hoc lecture notes on certain topics, and academic papers.

The assignments also support further "extension" options: for example, in A3, beam search is presented as optional, and students can also explore parallel decoding for the CRF or features to make NER work better on German. For the seq2seq model, they could experiment with Transformers or implement constrained decoding to always produce valid logical forms.

We believe that A1 and A2 could be adapted for use in a wide range of courses, but A3-A6 are most appropriate for advanced undergraduates or graduate students.

Syllabus Table 2 pairs these assignments with readings in texts by Jurafsky and Martin (2021) and Eisenstein (2019). See Greg Durrett's course pages for complete sets of readings.

Logistics We typically provide students around 2 weeks per assignment. Their submission either consists of just the code or code with a brief report, depending on the course format. Students collaborate on assignments through a discussion board on Piazza as well as in person. We have relatively low incidence of students copying code, assessed using Moss over several semesters.

Pain Points Especially on A3, A4, and A5, we come across students who find debugging to be a major challenge. In the assignments, we suggest strategies to verify parts of inference code independently of training, as well as simplified tasks to test models on, but some students find it challenging or are unwilling to pursue these avenues. On a similar note, students often do not have a prior on what the system should do. It might not raise a red flag that their code takes an hour per epoch, or gets 3% accuracy on the development set, and they end up getting stuck as a result. Understanding what these failures mean is something we emphasize. Finally, students sometimes have a (real or perceived) lack of background on either coding or the mathematical fundamentals of the course; however, many such students end up doing well in these courses as their first ML/NLP courses.


Acknowledgments

We would like to acknowledge the additional graduate and undergraduate TAs for various offerings of these courses: Christopher Crabtree, Uday Kusupati, Shivangi Mahto, Abhilash Potluri, Shivang Singh, Xi Ye, and Didi Zhou. Our thanks also go out to all of the students who have taken these courses, whose comments and experiences have helped make them stronger. Thanks as well to the anonymous reviewers for their helpful comments.

In the development of these materials, we consulted courses and teaching materials by Emily Bender, Sam Bowman, Chris Dyer, Mohit Iyyer, Vivek Srikumar, and many others. We would also like to thank Jacob Eisenstein, Dan Jurafsky, James H. Martin, and Yoav Goldberg for their helpful textbooks.

References

Jason Baldridge and Katrin Erk. 2008. Teaching computational linguistics to a large, diverse student body: Courses, tools, and interdepartmental interaction. In Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, pages 1–9, Columbus, Ohio. Association for Computational Linguistics.

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, pages 610–623, New York, NY, USA. Association for Computing Machinery.

Steven Bird. 2008. Defining a core body of knowledge for the introductory computational linguistics curriculum. In Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, pages 27–35, Columbus, Ohio. Association for Computational Linguistics.

Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 4356–4364, Red Hook, NY, USA. Curran Associates Inc.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver, Canada. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Jacob Eisenstein. 2019. Introduction to Natural Language Processing. MIT Press.

Hila Gonen and Yoav Goldberg. 2019. Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 609–614, Minneapolis, Minnesota. Association for Computational Linguistics.

Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1681–1691, Beijing, China. Association for Computational Linguistics.

Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12–22, Berlin, Germany. Association for Computational Linguistics.

Dan Jurafsky and James H. Martin. 2021. Speech and Language Processing, 3rd Ed. Online.

Dan Klein. 2005. A core-tools statistical NLP course. In Proceedings of the Second ACL Workshop on Effective Tools and Methodologies for Teaching NLP and CL, pages 23–27, Ann Arbor, Michigan. Association for Computational Linguistics.

Nitin Madnani and Bonnie J. Dorr. 2008. Combining open-source with research to re-engineer a hands-on introductory NLP course. In Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, pages 71–79, Columbus, Ohio. Association for Computational Linguistics.

Tomas Mikolov, M. Karafiát, L. Burget, J. Cernocký, and S. Khudanpur. 2010. Recurrent neural network based language model. In Interspeech.

Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), pages 79–86. Association for Computational Linguistics.


Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. arXiv cs.CL 1912.01703.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. arXiv cs.CL 1611.01603.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS).

John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2 (AAAI).

Tong Zhang and David Johnson. 2003. A robust risk minimization based named entity recognition system. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 204–207.


Proceedings of the Fifth Workshop on Teaching NLP, pages 104–107, June 10–11, 2021. ©2021 Association for Computational Linguistics

Interactive Assignments for Teaching Structured Neural NLP

David Gaddy, Daniel Fried, Nikita Kitaev, Mitchell Stern, Rodolfo Corona, John DeNero and Dan Klein

University of California, Berkeley
{dgaddy,denero,klein}@berkeley.edu

Abstract

We present a set of assignments for a graduate-level NLP course. Assignments are designed to be interactive, easily gradable, and to give students hands-on experience with several key types of structure (sequences, tags, parse trees, and logical forms), modern neural architectures (LSTMs and Transformers), inference algorithms (dynamic programs and approximate search) and training methods (full and weak supervision). We designed assignments to build incrementally both within each assignment and across assignments, with the goal of enabling students to undertake graduate-level research in NLP by the end of the course.

1 Overview

Our course contains five implementation projects focusing on neural methods for structured prediction tasks in NLP. Over a range of tasks from language modeling to machine translation to syntactic and semantic parsing, the projects cover methods such as LSTMs and Transformers, dynamic programming, beam search, and weak supervision (learning from denotations). Our aim was to let students incrementally and interactively develop models and see their effectiveness on real NLP datasets. Section 2 describes the tasks and objectives of the projects in more detail. Links to assignments are available at https://sites.google.com/view/nlp-assignments.

1.1 Target Audience

Our course is designed for early-stage graduate students in computer science. We expect students to have a good background in machine learning, including prior experience with implementing neural networks. Many of our students will go on to conduct research in NLP or related machine learning disciplines. Some advanced undergraduates may also join the course if they have sufficient background and an interest in NLP research.

1.2 Course Structure

Our course is 14 weeks long. The projects fill nearly the entire semester, with roughly 2 weeks given to complete each project. We used lectures to cover the projects' problems and methods at a high level; however, the projects require students to be relatively skilled at independent implementation.

1.3 Design Strategy

The content of our projects was chosen to cover key topics along three primary dimensions: application tasks, neural components, and inference mechanisms. Our projects introduce students to some of the core tasks in NLP, including language modeling, machine translation, syntactic parsing, and semantic parsing. Key neural network models for NLP are also introduced, covering recurrent networks, attention mechanisms, and the Transformer architecture (Vaswani et al., 2017). Finally, the projects cover inference mechanisms for NLP, including beam search and dynamic programming methods like CKY.

All projects are implemented as interactive Python notebooks designed for use on Google's Colab infrastructure.1 This setup allows students to use GPUs for free and with minimal setup. The notebooks consist of instructions interleaved with code blocks for students to fill in. We provide scaffolding code with less pedagogically-central components like data loading already filled in, so that students can focus on the learning objectives for the projects. Students implement neural network components using the PyTorch framework (Paszke et al., 2019).

Each project is broken down into a series of modules that can be verified for correctness before moving on. For example, when implementing a neural machine translation system, the students first implement and verify a basic sequence-to-sequence model, then attention, and finally beam search. This setup allows students to debug each component individually and allows instructors to give partial credit for each module. The modules are designed to validate student code without waiting for long training runs, with a total model training time of less than one hour per project.

1 colab.research.google.com

Our projects are graded primarily with scripted autograders hosted on Gradescope,2 allowing a class of hundreds of students to be administered by a small course staff. Grades are generally based on accuracy on a held-out test set, where students are given inputs for this set and submit their model's predictions to the grader. While students cannot see their results on the held-out set until after the due date, the assignments include specific targets for validation set accuracies that can be used by students to verify the correctness of their solutions.

Each project concludes with an open-ended section where the students experiment with modifications or ablations to the models implemented and submit a 1-page report describing the motivation behind their contribution and an analysis of their results. This section gives students more of a chance to explore their own ideas and can also help distinguish students who are putting in extra effort on the projects.

2 Assignments

2.1 Project 0: Intro to PyTorch Mini-Project

This project serves primarily as an introduction to the project infrastructure and to the PyTorch framework. Students implement a classifier to predict the most common part-of-speech tag for English word types from the words' characters. Students first implement a simple neural model based on pooling character embeddings, then a slightly more complex model with character n-gram representations. This project provides much more detailed instructions than later projects to help students who are less familiar with deep learning implementation, walking them through each step of the training and modeling code.

2.2 Project 1: Language Modeling

This project introduces students to sequential output prediction, using classical statistical methods and auto-regressive neural modeling. Students implement a series of language models of increasing complexity and train them on English text. First they implement a basic n-gram model, then add backoff and Kneser-Ney smoothing (Ney et al., 1994). Next, they implement a feed-forward neural n-gram model, and an LSTM language model (Hochreiter and Schmidhuber, 1997). The last section of this project is an open-ended exploration where students can try any method to further improve results, either from a list of ideas we provided or an idea of their own.

2 www.gradescope.com
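As a concrete illustration of the kind of count-based baseline students build first, the following is a minimal sketch of a bigram model with add-k smoothing; the class name, smoothing constant, and toy corpus are our own choices for illustration, not the assignment's actual scaffolding.

from collections import Counter

# A minimal count-based bigram language model with add-k smoothing,
# illustrating the baseline students build before Kneser-Ney.
class BigramLM:
    def __init__(self, k=0.1):
        self.k = k
        self.unigrams = Counter()
        self.bigrams = Counter()

    def train(self, sentences):
        for sent in sentences:
            tokens = ["<s>"] + sent + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def prob(self, prev, word):
        # P(word | prev) with add-k smoothing over the vocabulary
        num = self.bigrams[(prev, word)] + self.k
        den = self.unigrams[prev] + self.k * self.vocab_size
        return num / den

lm = BigramLM()
lm.train([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(lm.prob("the", "cat"))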

2.3 Project 2: Neural Machine Translation

This project covers conditional language modeling, using neural sequence-to-sequence models with attention. Students incrementally implement a neural machine translation model to translate from German to English on the Multi30K dataset (Elliott et al., 2016). This dataset is simpler than standard translation benchmarks and affords training and evaluating an effective model in a matter of minutes rather than days, allowing students to interactively develop and debug models. Students first implement a baseline LSTM-based sequence-to-sequence model (Sutskever et al., 2014) without attention, view the model's predictions, and evaluate performance using greedy decoding. Then, students incrementally add an attention mechanism (Bahdanau et al., 2015) and beam search decoding. Finally, students visualize the model's attention distributions.
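The attention step students add is, at its core, a softmax over similarity scores. The PyTorch sketch below shows one dot-product attention step for a single decoder position; the function name and dimensions are illustrative, not taken from the assignment.

import torch

# One attention step for a seq2seq decoder: score the encoder states
# against the current decoder state, then form a weighted context vector.
def dot_product_attention(decoder_state, encoder_states):
    # decoder_state: (hidden,); encoder_states: (src_len, hidden)
    scores = encoder_states @ decoder_state       # (src_len,)
    weights = torch.softmax(scores, dim=0)        # attention distribution
    context = weights @ encoder_states            # (hidden,)
    return context, weights

encoder_states = torch.randn(7, 256)   # 7 source positions, hidden size 256
decoder_state = torch.randn(256)
context, weights = dot_product_attention(decoder_state, encoder_states)
print(weights.sum())                   # sums to 1: a proper distribution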

2.4 Project 3: Constituency Parsing and Transformers

This project covers constituency parsing, the Transformer neural network architecture (Vaswani et al., 2017), and structured decoding via dynamic programming. Students first implement a Transformer encoder and validate it using a part-of-speech tagging task on the English Penn Treebank (Marcus et al., 1993). Then, students incrementally build a Transformer-based parser by first constructing a model that makes constituency and labeling decisions for each span in a sentence, then implementing CKY decoding (Cocke, 1970; Kasami, 1966; Younger, 1967) to ensure the resulting output is a tree. The resulting model, which is a small version of the parser of Kitaev and Klein (2018), achieves reasonable performance on the English Penn Treebank in under half an hour of training.
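The decoding step can be sketched independently of the neural model: given a score for every span, a CKY-style dynamic program finds the best-scoring binary bracketing. The code below is a minimal unlabeled version under that assumption; the real assignment also handles constituent labels.

import numpy as np

# CKY-style dynamic program over span scores: best[i][j] is the score of
# the best binary tree covering words i..j, built bottom-up over lengths.
def cky_best_tree(span_scores, n):
    best = np.zeros((n + 1, n + 1))
    split = {}
    for i in range(n):
        best[i][i + 1] = span_scores[(i, i + 1)]
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            # choose the split point maximizing the two subtrees
            k = max(range(i + 1, j), key=lambda m: best[i][m] + best[m][j])
            best[i][j] = span_scores[(i, j)] + best[i][k] + best[k][j]
            split[(i, j)] = k
    return best[0][n], split

n = 4
rng = np.random.default_rng(0)
scores = {(i, j): rng.normal() for i in range(n) for j in range(i + 1, n + 1)}
total, splits = cky_best_tree(scores, n)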

2.5 Project 4: Semantic Parsing

This project introduces students to predicting executable logical forms and to training with weak supervision. Students implement a neural semantic parser for the GEOQA geographical question answering dataset of Krishnamurthy and Kollar (2013). This dataset contains English questions about simple relational databases. To familiarize themselves with the syntax and semantics of the dataset, students first implement a simple execution method which evaluates a logical form on a database to produce an answer. To produce logical forms from questions, students then implement a sequence-to-sequence architecture with a constrained decoder and a copy mechanism (Jia and Liang, 2016). Students verify their model by training first in a supervised setting with known logical forms, then finally train it only from question-answer pairs by searching over latent logical forms.
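To give a flavor of the execution warm-up, the sketch below evaluates nested (predicate, argument) logical forms against a toy relational table; the database, predicate names, and question are invented for illustration and do not reflect GEOQA's actual format.

# Toy logical-form executor: evaluate nested (predicate, argument)
# expressions against a small relational database.
DB = {
    "capital": {"idaho": "boise", "texas": "austin"},
    "state_of": {"boise": "idaho", "austin": "texas"},
}

def execute(form):
    # form is either a constant string or a (predicate, argument) pair
    if isinstance(form, str):
        return form
    predicate, argument = form
    return DB[predicate][execute(argument)]

# "What is the capital of the state that Boise is in?"
print(execute(("capital", ("state_of", "boise"))))   # -> boise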

3 Findings in Initial Course Offerings

An initial iteration of the course was taught to 60 students, and an offering for over 100 students is in progress. Overall, we have found the projects to be a great success.

In the first iteration of the course, 81% of students completed the course and submitted all five projects. From a mid-semester survey, students reported taking 19.88 hours on average to complete Project 1. We observed that students with no prior experience programming with deep learning frameworks took significantly longer on the projects and required more assistance. In future semesters, we intend to strengthen the deep learning prerequisites required for the course to ensure adequate background for success.

The online project infrastructure worked with minimal issues. While the Colab platform does have the downside of timeouts due to inactivity, we believe the use of free GPU resources outweighed this cost. By encouraging students to download checkpoints after training runs, they were able to avoid re-training after each timeout. A handful of students who spent very long stretches of time on the projects in a single day reported having temporary limits placed on GPU usage (e.g. after 8+ hours of continuous use), but these limits could be circumvented by logging in with a separate account.

Most students were able to successfully complete the majority of each assignment, but there remained some point spread to distinguish performance for grading (mean 94%, standard deviation 12%). Due to the use of autograding, instructor code-grading effort totaled only a few hours per project. One aspect of grading that we are continuing to improve is the open-ended exploration report, and we plan to better define and communicate expectations in future semesters by introducing a clear rubric for the report section.

Overall, these projects proved a valuable resource for the success of our course and for preparing students for NLP research. At the end of last year's course, one student submitted an extension of their exploration work on one project to EMNLP and presented at the conference.

References

Dzmitry Bahdanau, Kyung Hyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015.

John Cocke. 1970. Programming languages and their compilers: Preliminary notes.

Desmond Elliott, Stella Frank, Khalil Sima'an, and Lucia Specia. 2016. Multi30K: Multilingual English-German image descriptions. In Proceedings of the 5th Workshop on Vision and Language, pages 70–74, Berlin, Germany. Association for Computational Linguistics.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12–22, Berlin, Germany. Association for Computational Linguistics.

Tadao Kasami. 1966. An efficient recognition and syntax-analysis algorithm for context-free languages. Coordinated Science Laboratory Report no. R-257.

Nikita Kitaev and Dan Klein. 2018. Constituency parsing with a self-attentive encoder. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2676–2686.

Jayant Krishnamurthy and Thomas Kollar. 2013. Jointly learning to parse and perceive: Connecting natural language to the physical world. Transactions of the Association for Computational Linguistics, 1:193–206.

Mitch Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.


Hermann Ney, Ute Essen, and Reinhard Kneser. 1994. On structuring probabilistic dependences in stochastic language modelling. Computer Speech & Language, 8(1):1–38.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32:8026–8037.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 27:3104–3112.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30:5998–6008.

Daniel H. Younger. 1967. Recognition and parsing of context-free languages in time n^3. Information and Control, 10(2):189–208.


Proceedings of the Fifth Workshop on Teaching NLP, pages 108–111, June 10–11, 2021. ©2021 Association for Computational Linguistics

Learning about Word Vector Representations and Deep Learning through Implementing Word2vec

David Jurgens
School of Information
University of Michigan
[email protected]

Abstract

Word vector representations are an essential part of an NLP curriculum. Here, we describe a homework that has students implement a popular method for learning word vectors, word2vec. Students implement the core parts of the method, including text preprocessing, negative sampling, and gradient descent. Starter code provides guidance and handles basic operations, which allows students to focus on the conceptually challenging aspects. After generating their vectors, students evaluate them using qualitative and quantitative tests.

1 Introduction

NLP curricula typically include content on word semantics, how semantics can be learned computationally through word vectors, and how those vectors are used. This document describes an assignment for having students implement word2vec (Mikolov et al., 2013a,b), a popular method that relies on a single-layer neural network. This homework is designed to introduce students to word vectors and simple neural networks by having them implement the network from scratch, without the use of deep-learning libraries. The assignment is appropriate for upper-division undergraduates or graduate students who are familiar with Python programming, have some experience with the numpy library (Harris et al., 2020), and have been exposed to concepts around gradients and neural networks. Through implementing major portions of the word2vec software and using the learned vectors, students will gain a deeper understanding of how networks are trained, how to learn word vectors, and their uses in downstream tasks.

2 Design and Learning Goals

This homework is designed to take place just before the middle stretch of the class, after lexical semantics and machine learning concepts have been introduced. The content is designed at the level of an NLP student who (1) has some technical background and at least one advanced course in statistics and (2) will implement or adapt new NLP methods. This level is deeper than what is needed for a purely Applied NLP setting but too shallow for a more Machine Learning focused NLP class, which would likely benefit from additional derivations and proofs around the gradient descent to solidify understanding. The homework has typically been assigned over a three to four week period; many students complete the homework in the course of a week, but the longer time frame enables students with less background or programming experience to work through the steps. The material prepares students for advanced NLP concepts around deep learning and pre-trained language models, as well as provides intuition for what steps modern deep learning libraries perform.

The homework has three broad learning goals. First, the training portion of the homework helps deepen students' understanding of machine learning concepts and gradient descent, and builds their ability to develop complex NLP software. Central to this design is having students turn the equations in the homework and formal descriptions of word2vec into software operations. This step helps students understand how to ground equations found in papers into the more-familiar language of programming, while also building more intuition for how gradient descent and backpropagation work in practice.

Second, the process of software development aids students in developing larger NLP software methods that involve end-to-end development. This goal includes seeing how different algorithmic software designs work and are implemented. The speed of training requires that students be moderately efficient in how they implement their software. For example, the use of for loops instead of vectorized numpy operations will lead to a significant slowdown in performance. In-class instruction and tutorials detail how to write the relevant efficient numerical operations, which helps guide students to identify where and how to selectively optimize. However, slow code will still finish correctly, allowing students to debug their initial implementations for correctness. This need for performant code creates opportunities for students to practice their performance-optimization skills.

Third, the lexical semantics portion of the homework exposes students to the uses and limitations of word vectors. Through training the vectors, students understand how statistical regularities in co-occurrence can be used to learn meaning. Qualitative and quantitative evaluations show students what their model has learned (e.g., using vector analogies) and introduce them to concepts of polysemy, fostering a larger discussion on what can be captured in a vector representation.

3 Homework Description

The homework has students implement two core aspects of the word2vec algorithm using numpy for the numeric portions, and then evaluate with two downstream tasks. The first aspect has students perform the commonly-used text preprocessing steps that turn a raw text corpus into self-supervised training examples. This step includes removing low-frequency tokens and subsampling tokens based on their frequency. The second aspect focuses on the core training procedure, including (i) negative sampling for generating negative examples of context words, (ii) performing gradient descent to update the two word vector matrices, and (iii) computing the negative log-likelihood. These tasks are broken into eight discrete steps that guide students in how to do each aspect. The assignment document includes links to more in-depth descriptions of the method, including the extensive description of Rong (2014) and the recent chapter of Jurafsky and Martin (2021, ch. 6), to help students understand the math behind the training procedure.
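To make the equations-to-numpy translation concrete, here is a minimal sketch of a single skip-gram negative-sampling update; the matrix names, learning rate, and toy shapes are ours, not the assignment's starter code.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One skip-gram negative-sampling step. W holds center-word vectors and C
# context-word vectors; `center` and `context` are word IDs, `negatives`
# an array of sampled negative word IDs.
def sgns_step(W, C, center, context, negatives, lr=0.025):
    v = W[center]                        # (d,)
    u_pos = C[context]                   # (d,)
    u_neg = C[negatives]                 # (k, d)
    s_pos = sigmoid(u_pos @ v)           # should move toward 1
    s_neg = sigmoid(u_neg @ v)           # should move toward 0
    nll = -np.log(s_pos) - np.log(1.0 - s_neg).sum()
    grad_v = (s_pos - 1.0) * u_pos + s_neg @ u_neg
    C[context] -= lr * (s_pos - 1.0) * v
    C[negatives] -= lr * s_neg[:, None] * v   # one outer-product row per negative
    W[center] -= lr * grad_v
    return nll

rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (5000, 100))
C = rng.normal(0, 0.1, (5000, 100))
print(sgns_step(W, C, center=42, context=7, negatives=np.array([13, 99, 256])))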

In the second part of the homework, students evaluate the learned vectors in two downstream tasks. The first task has students load these vectors using the Gensim package (Rehurek and Sojka, 2010) and perform vector arithmetic operations to find word-pair analogies and examine the nearest neighbors of words; this qualitative evaluation exposes students to what is or is not learned by the model. The second task is a quantitative evaluation that has students generate word-pair similarity scores for the subset of SimLex-999 (Hill et al., 2015) present in their training corpus, which is uploaded to Kaggle InClass1 to see how their vectors compare with others; this leaderboard helps students identify a bug in their code (via a low-scoring submission) and occasionally prompts students to think about how to improve/extend their code to attain a higher score.
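In sketch form, the qualitative checks look like the following with Gensim's KeyedVectors; the vectors file path is a placeholder for wherever students save their trained model, and the probe words assume they occur in the training corpus.

from gensim.models import KeyedVectors

# Load student-trained vectors saved in word2vec's text format
# ("vectors.txt" is a placeholder path).
kv = KeyedVectors.load_word2vec_format("vectors.txt", binary=False)

# Nearest neighbors: should be thematically coherent for common words
print(kv.most_similar("january", topn=5))

# Vector arithmetic for analogies: king - man + woman ~ queen
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Word-pair similarity of the kind scored against SimLex-999
print(kv.similarity("old", "new"))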

Potential Extensions   The word2vec method has been extended in numerous ways in NLP to improve its vectors (e.g., Ling et al., 2015; Yu and Dredze, 2014; Tissier et al., 2017). This assignment includes descriptions of other possible extensions that students can explore, such as implementing dropout, adding learning rate decay, or making use of external knowledge during training. Typically, a single extension to word2vec is included as a part of the homework to help ground the concept in code but without increasing the difficulty of the assignment. Students who are interested in deepening their understanding can use these as starting points to see how to develop their own NLP methods as a part of a course project.

This assignment also provides multiple possibilities for examining the latent biases learned in word vectors. Prior work has established that pretrained vectors often encode gender and racial biases based on the corpora they are trained on (e.g., Caliskan et al., 2017; Manzini et al., 2019). In a future extension, this assignment could be adapted to use Wikipedia biographies as a base corpus and have students identify how occupations become more associated with gendered words during training (Garg et al., 2018). Once this bias is discovered, students can discuss various methods for mitigating it (e.g., Bolukbasi et al., 2016; Zhao et al., 2017) and how their method might be adapted to avoid other forms of bias. This extension can help students critically think about what is and is not being captured in pretrained vectors and models.

4 Reflection on Student Experiences

Student experiences on this homework have been very positive, with multiple students expressing a strong sense of satisfaction at completing the homework and being able to understand the algorithm and software backing word2vec. Several students reported feeling like completing this assignment was a great confidence boost and that they were now more confident in their ability to understand NLP papers and connect algorithms, equations, and code. The majority of student difficulties arise from two sources. First, the vast majority of bugs happen when implementing the gradient descent and calculating the negative log-likelihood (NLL). While only a few lines of code in total, this step requires translating the loss function for word2vec (in the negative sampling case) into numpy code. This translation task appeared daunting at first for many students, though they found creating the eventual solution rewarding for being able to ground similar equations in NLP papers. Two key components for mitigating early frustration were (1) including built-in periodic reports of the NLL, which help students quickly spot whether there are numeric errors that lead to infinity or NaN values, and (2) adding early stopping and printing nearest neighbors of instructor-provided words (e.g., "January") which should be thematically coherent after only a few minutes of training. These components help students quickly identify the presence of a bug in the gradient descent.

1 https://www.kaggle.com/c/about/inclass

The second student difficulty comes from the text preprocessing steps. The removal of low-frequency words and frequency-based subsampling steps require students to have a solid grasp of the type-versus-token distinction in practice, in order to subsample tokens (versus types). I suspect that because many of these routine preprocessing steps are done for the student by common libraries (e.g., the CountVectorizer of scikit-learn (Pedregosa et al., 2011)), these steps feel unfamiliar. Common errors in this theme were subsampling types or producing a sequence of word types (rather than tokens) to use for training.
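The distinction can be made concrete in a few lines: frequencies are computed once per type, but the keep/discard decision is made once per token. The sketch below uses the subsampling formula from the original word2vec paper with a toy corpus; the threshold t follows the paper's suggested default.

import numpy as np
from collections import Counter

# Frequency-based subsampling as in word2vec: frequencies are per *type*,
# but each *token* gets its own keep/discard draw.
def subsample(tokens, t=1e-5, rng=np.random.default_rng(0)):
    counts = Counter(tokens)
    total = len(tokens)
    freq = {w: c / total for w, c in counts.items()}
    kept = []
    for tok in tokens:                              # per-token decision
        p_keep = min(1.0, np.sqrt(t / freq[tok]))
        if rng.random() < p_keep:
            kept.append(tok)
    return kept

corpus = ["the", "cat", "sat", "on", "the", "mat"] * 1000
print(len(subsample(corpus)))   # high-frequency tokens are thinned out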

References

Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Proceedings of the 30th Conference on Neural Information Processing Systems (NeurIPS).

Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186.

Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16):E3635–E3644.

Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, et al. 2020. Array programming with NumPy. Nature, 585(7825):357–362.

Felix Hill, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695.

Dan Jurafsky and James H. Martin. 2021. Speech & Language Processing, 3rd edition. Prentice Hall.

Wang Ling, Chris Dyer, Alan W. Black, and Isabel Trancoso. 2015. Two/too simple adaptations of word2vec for syntax problems. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1299–1304.

Thomas Manzini, Lim Yao Chong, Alan W. Black, and Yulia Tsvetkov. 2019. Black is to criminal as caucasian is to police: Detecting and removing multiclass bias in word embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 615–621.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013b. Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Radim Rehurek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks.

Xin Rong. 2014. word2vec parameter learning explained. arXiv preprint arXiv:1411.2738.

Julien Tissier, Christophe Gravier, and Amaury Habrard. 2017. Dict2vec: Learning word embeddings using lexical dictionaries. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 254–263.


Mo Yu and Mark Dredze. 2014. Improving lexical embeddings with semantic knowledge. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 545–550.

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2017. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2979–2989.


Proceedings of the Fifth Workshop on Teaching NLP, pages 112–114, June 10–11, 2021. ©2021 Association for Computational Linguistics

Naive Bayes versus BERT: Jupyter notebook assignments for an introductory NLP course

Jennifer Foster and Joachim Wagner
School of Computing

Dublin City University
jfoster|[email protected]

Abstract

We describe two Jupyter notebooks that form the basis of two assignments in an introductory Natural Language Processing (NLP) module taught to final year undergraduate students at Dublin City University. The notebooks show the students how to train a bag-of-words polarity classifier using multinomial Naive Bayes, and how to fine-tune a polarity classifier using BERT. The students take the code as a starting point for their own experiments.

1 Introduction

We describe two Jupyter1 notebooks that form the basis of two assignments in a new introductory Natural Language Processing (NLP) module taught to final year students on the B.Sc. in Data Science programme at Dublin City University. As part of a prior module on this programme, the students have some experience with the NLP problem of quality estimation for machine translation. They have also studied machine learning and are competent Python programmers. Since this is the first Data Science cohort, there are only seven students. Four graduate students are also taking the module.

The course textbook is the draft 3rd edition of Jurafsky and Martin (2009).2 It is impossible to teach the entire book in a twelve-week module and so we concentrate on the first ten chapters. The following topics are covered:

1. Pre-processing

2. N-gram Language Modelling

3. Text Classification using Naive Bayes and Logistic Regression

4. Sequence Labelling using Hidden Markov Models and Conditional Random Fields

5. Word Vectors

1 https://jupyter.org/
2 https://web.stanford.edu/~jurafsky/slp3/

6. Neural Net Architectures (feed-forward, recurrent, transformer)

7. Ethical Issues in NLP

The course is fully online for the 2020/2021 academic year. Lectures are pre-recorded and there are weekly live sessions where students anonymously answer comprehension questions via Zoom polls and spend 20–30 minutes in breakout rooms working on exercises. These involve working out toy examples, or using online tools such as the AllenNLP online demo3 (Gardner et al., 2018) to examine the behaviour of neural NLP systems.

Assessment takes the form of an online end-of-semester open-book exam worth 60% and three assignments worth 40%. The first assignment is worth 10% and involves coding a bigram language model from scratch. The second and third assignments are worth 15% each and involve experimentation, using Google Colab4 as a platform. For both assignments, a Jupyter notebook is provided to the students which they are invited to use as a basis for their experiments. We describe both of these in turn.

2 Notebooks

We describe the assignment objectives, the notebooks we provide to the students5 and the experiments they carried out.

3 https://demo.allennlp.org/
4 https://colab.research.google.com
5 An updated version of the notebooks will be made available in the materials repository of the Teaching-NLP 2021 workshop.

2.1 Notebook One: Sentiment Polarity with Naive Bayes

The assignment   The aim of this assignment is to help students feel comfortable carrying out text classification experiments using scikit-learn (Pedregosa et al., 2011). Sentiment analysis of movie reviews is chosen as the application since it is a familiar and easily understood task and domain, requiring little linguistic expertise. We use the dataset of Pang and Lee (2004) because its relatively small size (2,000 documents) makes it quicker to train on. The documents are provided in tokenised form and have been split into ten cross-validation folds. We provide a Jupyter notebook implementing a baseline bag-of-words Naive Bayes classifier which assigns a label positive or negative to a review. The students are asked to experiment with this baseline model and to attempt to improve its accuracy by experimenting with

1. different learning algorithms, e.g. logistic regression, decision trees, support vector machines, etc.

2. different feature sets, such as handling negation, including bigrams and trigrams, using sentiment lexicons and performing linguistic analysis of the input

They are asked to use the same cross-validation set-up as the baseline system. Marks are awarded for the breadth of experimentation, the experiment descriptions, code clarity, average 10-fold cross-validation accuracy and accuracy on a 'hidden' test set (also movie reviews).

The notebook   We implement document-level sentiment polarity prediction for movie reviews with multinomial Naive Bayes and bag-of-words features. We first build and test the functionality to load the dataset into a nested list of documents, sentences and tokens, each document annotated with its polarity label. Then we show code to collect the training data vocabulary and assign a unique ID to each entry. Documents are then encoded as bag-of-words feature vectors in NumPy (Harris et al., 2020), optionally clipped at frequency one to produce binary vectors. Finally, we show how to train a multinomial Naive Bayes model with scikit-learn, obtain a confusion matrix, measure accuracy and report cross-validation results. The functionality is demonstrated using a series of increasingly specific Python classes.
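For orientation, the overall shape of that pipeline can be sketched as below with scikit-learn; the four-document corpus is a stand-in for the Pang and Lee reviews, and the notebook itself builds the vocabulary and vectors by hand rather than with CountVectorizer.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Bag-of-words counts into multinomial Naive Bayes, scored with
# cross-validation (the course uses ten folds on 2,000 reviews).
docs = ["a gripping , moving film", "dull and lifeless",
        "moving and funny", "a dull plot"]
labels = np.array([1, 0, 1, 0])            # 1 = positive, 0 = negative

vectorizer = CountVectorizer(binary=True)  # clip counts to binary features
X = vectorizer.fit_transform(docs)

scores = cross_val_score(MultinomialNB(), X, labels, cv=2)
print(scores.mean())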

What the students did   Most students carried out an extensive range of experiments, for the most part following the suggestions we provided at the assignment briefing and the strategies outlined in the lectures. The baseline accuracy of 83% was improved in most projects by about 3–5 points. The algorithm which gave the best results was logistic regression, whose default hyper-parameters worked well. The students who reported the highest accuracy scores used a combination of token unigrams, bigrams and trigrams, whereas most students directly compared each n-gram order. The students were free to change the code structure, and indeed some of them took the opportunity to refactor the code to a style that better suited them.

2.2 Notebook Two: Sentiment Polarity with BERT

The assignment   The aim of this second assignment is to help students feel comfortable using BERT (Devlin et al., 2019). We provide a sample notebook which shows how to fine-tune BERT on the same task and dataset as in the previous assignment. The students are asked to do one of three things:

1. Perform a comparative error analysis of the output of the BERT system(s) and the systems from the previous assignment. The aim here is to get the students thinking about interpreting system output.

2. Using the code in this notebook and online resources as examples, fine-tune BERT on a different task. The aim here is to 1) allow the students to experiment with something other than movie review polarity classification and explore their own interests, and 2) test their research and problem-solving skills.

3. Attempt to improve the BERT-based system provided in the notebook by experimenting with different ways of overcoming the input length restriction.

The notebook   We exemplify how to fine-tune BERT on the Pang and Lee (2004) dataset, using Hugging Face Transformers (Wolf et al., 2020) and PyTorch Lightning (Falcon, 2019). We introduce the concept of subword units, showing BERT's token IDs for sample text input, the matching vocabulary entries, the mapping to the original input tokens and BERT's special [CLS] and [SEP] tokens. Then, we show the length distribution of documents in the data set and sketch strategies to address the limited sequence length of BERT. We implement taking 1) a slice from the start or 2) the end of each document, or 3) combining a slice from the start with a slice from the end of each document. In doing so, we show the students how a dataset can be read from a custom file format into the data loader objects expected by the framework.
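The third slicing strategy can be sketched in a few lines with the Hugging Face tokenizer; the head/tail budget split below is our own illustrative choice, not necessarily the notebook's.

from transformers import AutoTokenizer

# Keep a slice from the start and a slice from the end of a long document
# so that it fits within BERT's 512-token limit.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def head_tail_ids(text, max_len=512, head=384):
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    tail = max_len - 2 - head            # reserve room for [CLS] and [SEP]
    if len(ids) > max_len - 2:
        ids = ids[:head] + ids[-tail:]
    return [tokenizer.cls_token_id] + ids + [tokenizer.sep_token_id]

ids = head_tail_ids("a long , rambling movie review " * 200)
print(len(ids))                          # at most 512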

What the students did   Of the ten students who completed the assignment, three chose the first option of analysing system output and seven chose the second option of fine-tuning BERT on a task of their choosing. These included detection of hate speech in tweets, sentence-level acceptability judgements, document-level human rights violation detection, and sentiment polarity classification applied to tweets instead of movie reviews. No student opted for the third option of examining ways to overcome the input length limit in BERT for the Pang and Lee (2004) dataset.

3 Future Improvements

We surveyed the students to see what they thought of the assignments. On the positive side, they found them challenging and interesting, and they appreciated the flexibility provided in the third assignment. On the negative side, they felt that they involved more effort than the marks warranted, and they found the code in the notebooks to be unnecessarily complicated. The object-oriented nature of the code was also highlighted as a negative by some. For next year, we plan to 1) streamline the code, hiding some of the messy details, 2) reduce the scope of the assignments, and 3) provide more BERT fine-tuning example notebooks.

Acknowledgements

The second author's contribution to this work was funded by Science Foundation Ireland through the SFI Frontiers for the Future programme (19/FFP/6942). We thank the reviewers for their helpful suggestions, and the DCU CA4023 students for their hard work and patience!

References

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

W.A. Falcon et al. 2019. PyTorch Lightning. GitHub repository.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1–6, Melbourne, Australia. Association for Computational Linguistics.

Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. 2020. Array programming with NumPy. Nature, 585(7825):357–362.

Dan Jurafsky and James H. Martin. 2009. Speech and Language Processing. Pearson Prentice Hall, Upper Saddle River, N.J.

Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pages 271–278, Barcelona, Spain.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.


Proceedings of the Fifth Workshop on Teaching NLP, pages 115–124, June 10–11, 2021. ©2021 Association for Computational Linguistics

Natural Language Processing for Computer Scientists and Data Scientists at a Large State University

Casey Kennington
Department of Computer Science

Boise State University
[email protected]

Abstract

The field of Natural Language Processing (NLP) changes rapidly, requiring course offerings to adjust with those changes, and NLP is not just for computer scientists; it's a field that should be accessible to anyone who has a sufficient background. In this paper, I explain how students with Computer Science and Data Science backgrounds can be well-prepared for an upper-division NLP course at a large state university. The course covers probability and information theory, elementary linguistics, machine and deep learning, with an attempt to balance theoretical ideas and concepts with practical applications. I explain the course objectives, topics and assignments, and reflect on adjustments to the course over the last four years, as well as feedback from students.

1 Introduction

Thanks in part to access to large datasets, increases in compute power, and easy-to-use programming libraries that leverage neural architectures, the field of Natural Language Processing (NLP) has become more popular and has seen more widespread adoption in research and in commercial products. On the research side, the Association for Computational Linguistics (ACL) conference (the flagship NLP conference) and related annual conferences have seen dramatic increases in paper submissions. For example, in 2020 ACL had 3,429 paper submissions, whereas 2019 had 2,905, and this upward trend has been happening for several years. Certainly, lowering barriers to access of NLP methods and tools for researchers and practitioners is a welcome direction for the field, enabling researchers and practitioners from many disciplines to make use of NLP.

It is therefore becoming more important to better equip students with an understanding of NLP to prepare them for careers either directly related to NLP, or which leverage NLP skills. In this paper, I reflect on my experience setting up and maintaining a class in NLP at Boise State University, a large state university, how to prepare students for research and industry careers, and how the class has changed over four years to fit the needs of students.

The next section explains the course objectives. I then explain challenges that are likely common to many university student populations, how the NLP course is designed for students with Data Science and Computer Science backgrounds, and then course content, including lecture topics and assignments that are designed to fulfill the course objectives. I then offer a reflection on the three times I taught this course over the past four years, and future plans for the course.

2 Course Objectives

Most students who take an NLP course will not pursue a career in NLP proper; rather, they take the course to learn skills that will help them find employment or do research in areas that make use of language, largely focusing on the medium of text (e.g., anthropology, information retrieval, artificial intelligence, data mining, social media network analysis, provided they have sufficient data science training). My goal for the students is that they can identify aspects of natural language (phonetics, syntax, semantics, etc.) and how each can be processed by a computer, explain the difference between classification models and approaches, be able to map from (basic) formalisms to functional code, and use existing tools, libraries, and data sets for learning while attempting to strike a balance between theory and practice. In my view, there are several aspects of NLP that anyone needs to grasp, along with how to apply NLP techniques in novel circumstances. Those aspects are illustrated in Figure 1.

No single NLP course can possibly account for a level of depth in all of the aspects in Figure 1, but a student who has taken courses or has experience in at least two of the areas (e.g., they have taken a statistics course and have experience with Python, or they have taken linguistics courses and have used some data science or machine learning libraries) will find success in the course more easily than those with no experience in any aspect.

Figure 1: Areas of content that are important for a course in Natural Language Processing.

This introduces a challenge that has been explored in prior work on teaching NLP (Fosler-Lussier, 2008): the diversity of the student population. NLP is a discipline that is not just for computer science students, but it is challenging to prepare students for the technical skills required in an NLP course. Moreover, similar to the student population in Fosler-Lussier (2008), there should be course offerings for both graduate and undergraduate students. In my case, which is fairly common in academia, as the sole NLP researcher at the university I can only offer one course once every four semesters for both graduate and undergraduate students, but also students with varied backgrounds, not only computer science. As a result, this is not a research methods course; rather, it is more geared towards learning the important concepts and technical skills surrounding recent advances in NLP. Others have attempted to gear the course content and delivery towards research (Freedman, 2008), giving the students the opportunity to have open-ended assignments. I may consider this for future offerings, but for now the final project acts as an open-ended assignment, though I don't require students to read and understand recent research papers.

In the following section, I explain how we prepare students of diverse backgrounds to succeed in an NLP course for upper-division undergraduate and graduate students.

3 Preparing Students with Diverse Academic and Technical Backgrounds

Boise State University is the largest university in Idaho, situated in the capital of the State of Idaho. The university has a high number of non-traditional students (e.g., students outside the traditional student age range, or second-degree seeking students). Moreover, the university has a high acceptance rate (over 80%) for incoming first-year students. As is the case with many universities and organizations, a greater need for "computational thinking" among students of many disciplines has been an important driver of recent changes in course offerings across many departments. Moreover, certain departments have responded to the need and student interest in machine learning course offerings. In this section, we discuss how we altered the Data Science and Computer Science curricula to meet these needs and the implications these changes have had on the NLP course.1

Data Science   The Data Science offerings begin with a foundational course (based on Berkeley's data8 content) that has only a very basic math prerequisite.2 It introduces and allows students to practice Python, Jupyter notebooks, data analysis and visualization, and basic statistics (including the bootstrap method of statistical significance). Several courses follow this course that are more domain specific, giving the students options for gaining practical experience in Data Science skills relative to their abilities and career goals. One path, more geared towards students of STEM-related majors (though not targeting Computer Science majors) as well as some majors in the Humanities, is a certificate program that includes the foundational course, a follow-on course that gives students experience with more data analysis as well as probability and information theory, an introductory machine learning course, and a course on databases. The courses largely use Python as the programming language of choice.

Computer Science   In parallel to the changes in Data Science-related courses, the Department of Computer Science has seen increased enrollment and increased requests for machine learning-related courses. The department offers several courses, though they focus on upper-division students (e.g., artificial intelligence, applied deep learning, information retrieval and recommender systems). This is a challenge because the main Computer Science curriculum focuses on procedural languages such as Java with little or no exposure to Python (similar to the student population reported in Freedman (2008)), forcing the upper-division machine learning-related courses to spend time teaching Python to the students. This should not be underestimated: programming languages may not be natural languages, but they are languages in that it takes time to learn them, particularly relevant Python libraries. To help meet this challenge, the department has recently introduced a Machine Learning Emphasis that makes use of the Data Science courses mentioned above, along with additional requirements for math and upper-division Computer Science courses.

1 We do not have a Data Science department or major degree program; the Data Science courses are housed in different departments across campus.
2 http://data8.org/

Figure 2: Two course paths that prepare students to succeed in the NLP course: Data Science and Computer Science. Solid arrows denote pre-requisites, dashed arrows denote co-requisites. Solid outlines denote Data Science and Computer Science courses, dashed outlines denote Math courses.

Prerequisites   Unlike Agarwal (2013), which attempted an introductory course for all levels of students, this course is designed for upper-division students or graduate students. Students of either discipline, Computer Science or Data Science, are prepared for this NLP course, albeit in different ways. Computer Scientists can think computationally, have experience with a syntactic formalism, and have experience with statistics. Data Science students likewise have experience with statistics as well as Python, including data analysis skills. Though the backgrounds can be quite diverse, my NLP course allows two prerequisite paths: all students must take a statistics course, but Computer Science students must take a Programming Languages course (which covers context free grammars for parsing computer languages and now covers some Python programming), and the Data Science students must have taken the introductory machine learning course. Figure 2 depicts the two course paths visually.

4 NLP Course Content

In this section, I discuss course content including topics and assignments that are designed to meet the course objectives listed above. Woven into the topics and assignments are the themes of ambiguity and limitations, explained below.

4.1 Topics & Assignments

Theme of Ambiguity   Figure 3 shows the topics (solid outlines) that roughly translate to a single lecture, though some topics require multiple lectures. One main theme that is repeated throughout the course, but is not a specific lecture topic, is ambiguity. This helps the students understand differences between natural human languages and programming languages. The Introduction to Linguistics topic, for example, gives a (very) high-level overview of phonetics, morphology, syntax, semantics, and pragmatics, with examples of ambiguity for each area of linguistics (e.g., phonetic ambiguity is illustrated by hearing someone say it's hard to recognize speech but it could be heard as it's hard to wreck a nice beach, and syntactic ambiguity is illustrated by the sentence I saw the person with the glasses having more than one syntactic parse).

Probability and Information Theory   This course does not focus only on deep learning, though many university NLP offerings seem to be moving to deep-learning only courses. There are several reasons not to focus on deep learning for a university like Boise State. First, students will not have a depth of background in probability and information theory, nor will they have a deep understanding of optimization (both convex and non-convex) or error functions in neural networks (e.g., cross entropy). I take time early in the course to explain discrete and continuous probability, and information theory. Discrete probability theory is straightforward as it requires counting, something that is intuitive when working with language data represented as text strings. Continuous probability theory, I have found, is more difficult for students to grasp as it relates to machine learning or NLP, but building on students' understanding of discrete probability theory seems to work pedagogically. For example, if we use continuous data and try somehow to count values in that data, it's not clear what should be counted (e.g., using binning), highlighting the importance of continuous probability functions that fit around the data, and the importance of estimating parameters for those continuous functions, an important concept for understanding classifiers later in the course. To illustrate both discrete and continuous probability, I show students how to program a discrete Naive Bayes classifier (using ham/spam email classification as a task) and a continuous Gaussian Naive Bayes classifier (using the well-known iris data set) from scratch.
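A minimal sketch of the continuous case, assuming synthetic data in place of the iris set: "learning" amounts to estimating a mean, variance, and prior per class, which makes the parameter-estimation idea concrete.

import numpy as np

# Gaussian Naive Bayes from scratch: fit estimates per-class means,
# variances, and priors; predict takes an argmax over log posteriors.
def fit(X, y):
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), Xc.var(axis=0) + 1e-9, len(Xc) / len(X))
    return params

def log_gaussian(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def predict(params, x):
    scores = {c: np.log(prior) + log_gaussian(x, mean, var).sum()
              for c, (mean, var, prior) in params.items()}
    return max(scores, key=scores.get)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(predict(fit(X, y), np.array([2.8, 3.1])))   # -> 1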

Sequential Thinking   Students experience probability and information theory in a targeted and highly-scaffolded assignment. They then extend their knowledge and program, from scratch, a part-of-speech tagger using counting to estimate probabilities modeled as Hidden Markov Models. These models seem old-fashioned, but they help students gain experience beyond the standard machine learning workflow of mapping many-to-one (i.e., features to a distribution over classes) because this is a many-to-many sequential task (i.e., many words to many parts-of-speech), an important concept to understand when working with sequential data like language. It also helps students understand that NLP often goes beyond just fitting "models" because it requires things like building a trellis and decoding a trellis (undergraduate students are required to program a greedy decoder; graduates are required to program a Viterbi decoder). This is a challenging assignment for most students, irrespective of their technical background, but grasping the concepts of this assignment helps them grasp more difficult concepts that follow.
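To show what the undergraduate version of the decoding step amounts to, here is a greedy decoder over a toy HMM; the tag set, probabilities, and sentence are invented for illustration.

import numpy as np

# Greedy HMM decoding: at each word, pick the tag maximizing
# transition * emission given the previous greedy choice.
tags = ["DET", "NOUN", "VERB"]
trans = np.array([[0.1, 0.8, 0.1],   # rows: previous tag; columns: next tag
                  [0.1, 0.2, 0.7],
                  [0.5, 0.4, 0.1]])
emit = {"the": [0.9, 0.05, 0.05],
        "dog": [0.05, 0.8, 0.15],
        "runs": [0.05, 0.2, 0.75]}

def greedy_decode(words, start=(0.6, 0.2, 0.2)):
    scores = np.array(start) * np.array(emit[words[0]])
    path = [int(scores.argmax())]
    for w in words[1:]:
        scores = trans[path[-1]] * np.array(emit[w])
        path.append(int(scores.argmax()))
    return [tags[i] for i in path]

print(greedy_decode(["the", "dog", "runs"]))   # -> ['DET', 'NOUN', 'VERB']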

Syntax   The Syntax & Parsing assignment also deserves mention. The students use any parser in NLTK to parse a context free grammar of a fictional language with a limited vocabulary.3 This helps the students think about the structure of language, and while there are other important ways to think about syntax such as dependencies (which we discuss in the course), another reason for this assignment is to have the students write grammars for a small vocabulary of words in a language they don't know, but also to create a non-lexicalized version of the grammar based on parts of speech, which helps them understand coverage and syntactic ambiguity more concretely.4 There is no machine learning or estimating a probabilistic grammar here, just parsing.
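A sketch of the mechanics with NLTK follows; the toy grammar below is invented here rather than being the course's actual fictional-language grammar.

import nltk

# Write a small CFG, then parse with one of NLTK's parsers; pretty_print
# shows the tree(s), making any syntactic ambiguity visible.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Name | Name PP
VP -> V PP
PP -> P NP
Name -> 'darmok' | 'jalad' | 'tanagra'
V -> 'sailed'
P -> 'at' | 'with'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("darmok with jalad sailed at tanagra".split()):
    tree.pretty_print()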

Semantics An important aspect of my NLP class is semantics. I introduce the students briefly to formal semantics (e.g., first-order logic), WordNet (Miller, 1995), distributional semantics, and grounded semantics. We discuss the merits of representing language "meaning" as embeddings, the limitations of meaning representations trained only on text, and how they might be missing important semantic knowledge (Bender and Koller, 2020). The Topic Modeling assignment uses word-level embeddings (e.g., word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014)) to represent texts and gives the students an opportunity to begin using a deep learning library (tensorflow & keras, or pytorch).
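One plausible version of the representation step, assuming gensim's downloadable GloVe vectors rather than whatever distribution the assignment actually ships with: average a text's word vectors into a single document vector and compare documents by cosine similarity.

```python
import numpy as np
import gensim.downloader

# Downloads a small pre-trained GloVe model on first use.
vectors = gensim.downloader.load("glove-wiki-gigaword-50")

def doc_vector(text):
    # Average the embeddings of in-vocabulary words (a deliberately crude
    # document representation that still captures topical similarity).
    words = [w for w in text.lower().split() if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

a = doc_vector("the senate passed the budget bill")
b = doc_vector("congress votes on new legislation")
c = doc_vector("the striker scored a late goal")
print(cosine(a, b), cosine(a, c))  # the politics pair should score higher
```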

We then consider how semantic knowledge draws on modalities beyond text, i.e., images (e.g., what is the meaning of the word red?), and how recent work is moving in this direction. Two assignments give the students a deeper understanding of these ideas. The Transfer Learning assignment requires the students to use convolutional neural networks pre-trained on image data to represent objects in images and to train classifiers that identify simple object types, tying images to words. This is extended in the Grounded Semantics assignment, where a binary classifier (based on the words-as-classifiers model introduced in Kennington and Schlangen (2015), then extended to work with images of "real" objects in Schlangen et al. (2016)) is trained for all words in referring expressions to objects in images, using the MSCOCO dataset annotated with referring expressions (Mao et al., 2016). Both assignments require ample scaffolding to help guide the students in using the libraries and datasets, and the MSCOCO dataset is much bigger than they are used to, giving them more real-world experience with a larger dataset.
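In Keras, the transfer-learning half might be sketched as follows; the choice of MobileNetV2 and the single-layer head are my assumptions for illustration, not necessarily the assignment's architecture.

```python
import tensorflow as tf

# Reuse a CNN pre-trained on ImageNet as a frozen feature extractor and
# train only a small binary classification head on top of its features.
base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet")
base.trainable = False  # keep the pre-trained image features fixed

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # one object type vs. rest
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(images, labels, epochs=5)  # images: (N, 160, 160, 3), labels: 0/1
```

The Grounded Semantics extension then trains one such binary classifier per word, feeding it the CNN features of a candidate object, following the words-as-classifiers idea.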

3 I opt for Tamarian: https://en.wikipedia.org/wiki/Darmok

4 I have received feedback from multiple students that this was their favorite assignment.


Figure 3: Course Topics and Assignments. Colors cluster topics (green = probability and information theory, blue = NLP libraries, red = neural networks, purple = linguistics, yellow = assignments). Courses have solid lines; topic dependencies are illustrated using arrows. Assignments have dashed lines. Assignments are indexed with the topics on which they depend.


Deep Learning An understanding of deep learning is obviously important for recent NLP researchers and practitioners. One constant challenge is determining the level of abstraction at which to present neural networks (should students know what is happening at the level of the underlying linear algebra, or is a conceptual understanding of parameter fitting in the neurons enough?). Furthermore, deep learning as a topic requires an understanding of its limitations and, at least to some degree, of how it works "under the hood" (learning just how to use deep learning libraries without understanding how they work and how they "learn" from the data is akin to giving someone a car to drive without teaching them how to use it safely). This also means explaining some common misconceptions, like the idea that neurons in neural networks "mimic" real neurons in human brains, something that is very far from true, though certainly the idea of neural networks is inspired by human biology. For my students, we progress from linear regression to logistic regression (illustrating how parameters are being fit and how gradient descent differs from directly estimating parameters in continuous probability functions, i.e., maximum likelihood estimation vs. convex and non-convex optimization), building towards small neural architectures and feed-forward networks. We also cover convolutional neural networks (for transfer learning and grounded semantics), attention (Vaswani et al., 2017), and transformers, including transformer-based language models like BERT (Devlin et al., 2018), and how to make use of them: understanding how they are trained, but then only assigning fine-tuning for students to experience directly. I focus on smaller datasets and fine-tuning so students can train and tune models on their own machines.
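The fine-tuning students experience directly might look roughly like the huggingface sketch below; IMDB stands in for whatever dataset a student actually uses, and the truncated training set and hyperparameters are placeholders sized for a personal machine.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # fresh classification head on BERT

dataset = load_dataset("imdb")  # placeholder for the student's own dataset

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-out", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=dataset["train"].shuffle(seed=0).select(range(2000)),
    eval_dataset=dataset["test"].select(range(500)),
)
trainer.train()
print(trainer.evaluate())
```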

Final Project There is a "final project" requirement. Students can work solo or in a group of up to three students. The project can be anything NLP-related; most projects use an NLP or machine/deep learning library to train a model on some specific task, but others include methods for data collection (a topic we don't cover in class specifically, but some students have interest in the data collection process for certain settings like second language acquisition), as well as interfaces that they evaluate with real human users. Scoping the projects is always the biggest challenge, as many students initially envision very ambitious projects (e.g., building an end-to-end chatbot from scratch). I ask students to consider how much effort it would take to do three assignments and use that as a point of comparison. Throughout the first half of the semester students can ask for feedback on project ideas, and at the halfway point in the semester, students are required to submit a short proposal that outlines the scope and timeline for their project. They have the second half of the semester to work through the project (with a "checkpoint" partway through to inform me of progress and needed adjustments), then they write a project report on their work at the end of the semester with evaluations and analyses of their work. Graduate students must write a longer report than the undergraduate students, and graduate students are required to give a 10-minute presentation on their project. The timeline here is critical: the midway point for beginning the project allows students to have experience with classification and NLP tasks, but enough time to make adjustments as they work on their project. For example, students attempt to apply BERT fine-tuning after the BERT assignment even though it wasn't in their original project proposal.

Theme of Limitations As is the case with ambiguity, limitations is a theme in the course: limitations of using probability theory on language phenomena, limitations of datasets, and limitations of machine learning models. The theme of limitations ties into an overarching ethical discussion that happens at intervals throughout the semester about what can reasonably be expected from NLP technology and whom it affects as more practical models are deployed commercially.

The final assignment, critical reading of the popular press, is based on a course under the same title taught by Emily Bender at the University of Washington.5 The goal of the assignment is to learn to critically read popular articles about NLP. Given an article, students need to summarize the article, then scrutinize its sources using the following as a guide: can they (1) access the primary source, such as the original published paper, (2) assess whether the claims in the article relate to what's claimed by the primary source, (3) determine whether experimental work was involved or whether the article is simply offering conjecture based on current trends, and (4) if the article did not carry out an evaluation, offer ideas on what kind of evaluation would be appropriate to substantiate any claims made by the article's author(s). Then students should relate the headline of the article to the main text and determine if reading the headline provides an abstract understanding of the article's contents, and determine to what extent the author identified limitations of the NLP technology they were reporting on, what someone without training in NLP might take away from the article, and whether the authors identified the people who might be affected (negatively or positively) by the NLP technology. This assignment gives students experience in recognizing the gap between the reality of NLP technology, how it is perceived by others, whom it affects, and its limitations.

5 The original assignment definition appears to no longer be public.

We dedicate an entire lecture to ethics, and students are also asked to consider the implications of their final projects, what their work can and cannot reasonably do, and who might be affected by their work.6

Discussion Striking a balance between content on probability and information theory, linguistics, and machine learning is challenging for a single course, but given the diverse student population at a public state school, this approach seems to work for the students. An NLP class should have at least some content about linguistics, and framing aspects of linguistics in terms of ambiguity gives students the tools to think about how much they experience ambiguity on a daily basis, and the fact that if language were not ambiguous, data-driven NLP would be much easier (or even unnecessary). The discussions about syntax and semantics are especially important, as many students (particularly those who have not learned a foreign language) have not considered how much they take for granted when it comes to understanding and producing language, both speech and written text. The discussions on how to represent meaning computationally (symbolic strings? classifiers? embeddings? graphs?) and how a model should arrive at those representations (using speech? text? images?) are rewarding for the students. While most of the assignments and examples focus on English, examples of linguistic phenomena are often shown from other languages (e.g., Japanese morphology and German declension), and the students are encouraged to work on other languages for their final project.

6 Approaching ethics from a framework of limitations, I think, helps students who might otherwise be put off by a discussion on ethics, because it can clearly be demonstrated that NLP models have not just technical limitations but limitations with implications for real people; an important message from this class is that all NLP models have limitations.

Assignments vary in scope and scaffolding. For the probability and information theory and BERT assignments, I provide a fairly well-scaffolded template that the students fill in, whereas most other assignments are more open-ended, each with a set of reflection and analysis questions.

4.2 Content Delivery

Class sizes vary between 35 and 45 students. Class content is presented largely either as presentation slides or as live programming using Jupyter notebooks. Slides introduce concepts and explain things outside of code (e.g., linguistics and ambiguity, or graphical models), but most concepts have concrete examples using working code. The students see code for Naive Bayes classifiers (both discrete and continuous); Python code that explains probability and information theory; classification tasks such as spam filtering, name classification, and topic modeling; parsing; loading and preprocessing datasets; linear and logistic regression; sentiment classification; and an implementation of neural networks from scratch as well as with popular libraries.

While we use NLTK for much of the instruction, following in some ways what is outlined in Bird et al. (2008), we also look at supported NLP Python libraries including textblob, flair (Akbik et al., 2019), spacy, stanza (Qi et al., 2020), scikit-learn (Pedregosa et al., 2011), tensorflow (Abadi et al., 2016) and keras (Chollet et al., 2015), pytorch (Paszke et al., 2019), and huggingface (Wolf et al., 2020). Others are useful as well, but most of these libraries help students use existing tools for standard NLP pre-processing like tokenization, sentence segmentation, stemming or lemmatization, and part-of-speech tagging, and many have existing models for common NLP tasks like sentiment classification and machine translation. The stanza library has models for many languages. All code I write or show in class is accessible to the students throughout the semester so they can refer back to the code examples for assignments and projects. This of course means that students only obtain a fairly shallow experience with any library; the goal is to show them enough examples and give them enough experience in assignments to make sense of public documentation and other code examples that they might encounter.
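As a small illustration of that shallow, comparative exposure, here is the same sentence tokenized with two of the libraries (this assumes the usual one-time downloads: nltk's punkt data and spacy's small English model).

```python
import nltk    # requires: nltk.download('punkt') on first use
import spacy   # requires: python -m spacy download en_core_web_sm

sentence = "Dr. Smith didn't arrive until 5 p.m."

# NLTK's Penn Treebank-style tokenizer splits clitics like "n't".
print(nltk.word_tokenize(sentence))

# spaCy tokenizes as part of a full pipeline (tagging, parsing, etc.).
nlp = spacy.load("en_core_web_sm")
print([token.text for token in nlp(sentence)])
```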

The course uses two books, both of which are available free online: the NLTK book,7 and an ongoing draft of Jurafsky and Martin's upcoming 3rd edition.8 The first assignment (Python & Jupyter in Figure 3) is an easy but important assignment: I ask the students to go through Chapter 1 and parts of Chapter 4 of the NLTK book and, for all code examples, write them by hand into a Jupyter notebook (i.e., no copying and pasting). This ensures that their programming environments are set up, steps them through how NLTK works, gives them immediate exposure to common NLP tasks like concordance and stemming, and gives them a way to practice Python syntax in the context of a Jupyter notebook. Another part of the assignment asks them to look at some Jupyter notebooks that use tokenization, counters, stop words, and n-grams, and asks them questions about best practices for authoring notebooks (including formatted comments).9
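For readers who haven't seen the NLTK book's opening, the flavor of what students type out by hand is roughly this (text1 is the book's pre-loaded Moby Dick corpus; a one-time nltk.download('book') is assumed).

```python
import nltk
# nltk.download('book')  # one-time download of the NLTK book's corpora
from nltk.book import text1  # Moby Dick, as loaded in Chapter 1

# Concordance: every occurrence of a word in its surrounding context.
text1.concordance("monstrous")

# Stemming: collapsing inflected forms to a shared stem.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["whales", "whaling", "whaled"]])
```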

Students can use cloud-based Jupyter servers for doing their assignments (e.g., Google Colab), but all must be able to run notebooks on a local machine and spend time learning about Python environments (i.e., Anaconda). Assignments are submitted and graded using okpy, which renders notebooks and allows instructors to assign grading to themselves or to teaching assistants; students can see their grades and written feedback for each assignment.10

4.3 Adjustments for Remote Learning

This course was relatively straightforward to adjust for remote delivery. The course website and okpy (for assignment submissions) are available to the students at all times. I decided to record lectures live (using Zoom) and then make them available with transcripts to the students. This course has one midterm, a programming assignment that is similar in structure to the regular assignments. During an in-person semester there would normally be a written final, but I opted to make the final part of the final project grade.

7 http://www.nltk.org/book_1ed/
8 https://web.stanford.edu/~jurafsky/slp3/
9 I use the notebooks listed here for this part of the assignment: https://github.com/bonzanini/nlp-tutorial
10 https://okpy.org/

Semester      Python  Ling  ML   DL
Spring 2017   none    none  10%  none
Spring 2019   50%     30%   30%  10%
Spring 2021   80%     30%   60%  20%

Table 1: Impressions of preparedness for Python, Linguistics (Ling), Machine Learning (ML), and Deep Learning (DL).

5 Reflection on Three Offerings over 4 Years

Due to department constraints on offering required courses vs. elective courses (NLP is elective), I am only able to offer the NLP course in the Spring semester of odd years, i.e., every fourth semester. The course is very popular, as enrollment is always over the standard class size (35 students). Below I reflect on changes that have taken place in the course due to the constant and rapid change in the field of NLP and in our undergraduate curriculum, and the implications those changes had on the course. These reflections are summarized in Table 1. As I am, to my knowledge, the first NLP researcher at Boise State University, I had to largely develop the contents of the course on my own, requiring adjustments over time as I came to better understand student preparedness. At this point, despite the two paths into the course, most students who take the course are still Computer Science students.

Spring 2017 The first time I taught the course, only a small percentage of the students had experience with Python. The only developed Python library for NLP was NLTK, so that and scikit-learn were the focus of practical instruction. I spent the first three weeks of the course helping students gain experience with Python (including Jupyter, numpy, and pandas), then used Python as a means to help them understand probability and information theory. The course focused on generative classification, including statistical n-gram language modeling, with some exposure to discriminative models but no exposure to neural networks.

Spring 2019 Between 2017 and 2019, several important papers showing how transformer networks can be used for robust language modeling were gaining momentum, resulting in a shift towards altering these models and understanding their limitations (so-called BERTology; see Rogers et al. (2020) for a primer). This, along with the fact that changes in the curriculum gave students better experience with Python, caused me to shift focus from generative models to neural architectures in NLP and to cover word-level embeddings more rigorously. I spent the second half of the semester introducing neural networks (including multi-layer perceptrons and convolutional and recurrent architectures) and giving students assignments to give them practice in tensorflow and keras. After the 2017 course, I changed the prerequisite structure to require our programming languages course instead of data structures. This led to greater preparedness in at least the syntax aspect of linguistics.

Spring 2021 In this iteration, I shifted focus from recurrent to attention/transformer-based models and assigned a BERT fine-tuning assignment on a novel dataset using huggingface. I also introduced pytorch as another option for a neural network library (I also spend time on tensorflow and keras). This shift reflects a shift in my own research and understanding of the larger field, though exposure to each library is only partial and somewhat abstract. I note that students who have a data science background will likely appreciate tensorflow and keras more, as they are not as object-oriented as pytorch, which seems to be more geared towards students with Computer Science backgrounds. Students can choose which one they will use (if any) for their final projects. More students are gaining interest in machine learning and deep learning and are turning to MOOCs or online tutorials, which has led to some degree of better preparation for the NLP course, but often students have little understanding of the limitations of machine learning and deep learning after completing those courses and tutorials. Moreover, students from our university have started an Artificial Intelligence Club (the club started in 2019; I am the faculty advisor), which has given the students guidance on courses, topics, and skills that are required for practical machine learning. Many of the NLP class students are already members of the AI Club, and the club has members from many academic disciplines.


5.1 Student Feedback

I reached out to former students who took the class to ask for feedback on the course. Specifically, I asked if they use the skills and concepts from the NLP class directly for their work, or if the skills and concepts transferred in any way to their work. Student responses varied, but some answered that they use NLP directly (e.g., to analyze customer feedback or error logs), while most responded that they use many of the Python libraries we covered in class for other things that aren't necessarily NLP-related, but more geared towards Data Science. For several students, using NLP tools helped them in research projects that led to publications.

6 Conclusions & Open Questions

With each offering, the NLP course at Boise State University is better suited pedagogically for students with some Data Science or Computer Science training, and the content reflects ongoing changes in the field of NLP to ensure their preparation. The topics and assignments cover a wide range, but as students have become better prepared with Python (by the introduction of new prerequisite courses that cover Python, as well as by changing some courses to include assignments in Python), more focus is spent on topics that are more directly related to NLP.

Though I feel it important to stay abreast of the ongoing changes in NLP and help students gain the knowledge and skills needed to be successful in NLP, an open question is what changes need to be made, and a related question is how soon. For example, I think at this point it is clear that neural networks are essential for NLP, though it isn't always clear which architectures should be taught (e.g., should we still cover recurrent neural networks or jump directly to transformers?). It seems important to cover even new topics sooner rather than later, though a course that is focused on research methods might be more concerned with staying up-to-date with the field, whereas a course that is more focused on general concepts and skills should wait for accessible implementations (e.g., huggingface for transformers) before covering those topics.

With recordings and updated content, I hope to flip the classroom in the future by assigning readings and lecture videos before class, then using class time for working on assignments.11

11 This worked well for the Foundations of Data Science course that I introduced to the university; the second time I taught the course, I flipped it and used class time for students to work on assignments, which gave me a better indication of how students were understanding the material, time for one-on-one interactions, and fewer issues with potential cheating.

Much of my course material, including notebooks, slides, topics, and assignments, can be found on a public Trello board.12

12 https://trello.com/b/mRcVsOvI/boise-state-nlp

Acknowledgements

I am very thankful to the anonymous reviewers for giving excellent feedback and suggestions on how to improve this document. I hope that I have followed those suggestions adequately.

References

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283.

Apoorv Agarwal. 2013. Teaching the basics of NLP and ML in an introductory course to information science. In Proceedings of the Fourth Workshop on Teaching NLP and CL, pages 77–84, Sofia, Bulgaria. Association for Computational Linguistics.

Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. 2019. Flair: An easy-to-use framework for state-of-the-art NLP. In NAACL 2019, 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 54–59.

Emily M. Bender and Alexander Koller. 2020. Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5185–5198.

Steven Bird, Ewan Klein, Edward Loper, and Jason Baldridge. 2008. Multidisciplinary instruction with the Natural Language Toolkit. In Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, pages 62–70, Columbus, Ohio. Association for Computational Linguistics.

François Chollet et al. 2015. Keras. https://keras.io.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv.

Eric Fosler-Lussier. 2008. Strategies for teaching "mixed" computational linguistics classes. In Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, pages 36–44, Columbus, Ohio. Association for Computational Linguistics.

Reva Freedman. 2008. Teaching NLP to computer science majors via applications and experiments. In Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, pages 114–119, Columbus, Ohio. Association for Computational Linguistics.

C. Kennington and D. Schlangen. 2015. Simple learning and compositional application of perceptually grounded word meanings for incremental reference resolution. In ACL-IJCNLP 2015 - 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, Proceedings of the Conference, volume 1.

Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L. Yuille, and Kevin Murphy. 2016. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11–20.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. arXiv preprint arXiv:2003.07082.

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. arXiv.

David Schlangen, Sina Zarriess, and Casey Kennington. 2016. Resolving references to objects in photographs using the words-as-classifiers model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1213–1223.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.



On Writing a Textbook on Natural Language Processing

Jacob Eisenstein
Google Research

[email protected]

Abstract

There are thousands of papers about natural language processing and computational linguistics, but very few textbooks. I describe the motivation and process for writing a college textbook on natural language processing, and offer advice and encouragement for readers who may be interested in writing a textbook of their own.

1 Introduction

As natural language processing reaches ever-greater heights of popularity, its students can learn from blogs and tutorials, videos and online courses, podcasts, social media, open source software projects, competitions, and more. In this environment, is there still any room for textbooks? This paper describes why you might write a textbook about natural language processing, how to do it, and what I learned from writing one.

Summary of the book. This paper will not focus on the details of my textbook (Eisenstein, 2019), but I offer a brief summary for context. My main goal was to create a text with a formal and coherent mathematical foundation in machine learning, which would explain a broad range of techniques and applications in natural language processing. The first section of the book builds up the mathematical foundation from linear classification through neural networks and unsupervised learning. The second section extends this foundation to structure prediction, with classical algorithms for search and marginalization in sequences and trees, while also introducing some ideas from morphology and syntax. The third section treats the special problem of semantics, which distinguishes natural language processing from other applications of machine learning. This section is more methodologically diverse, ranging from logical to distributional semantics. The final section treats three of the primary application areas: machine translation, information extraction, and text generation. Altogether this comprises nineteen chapters, which is more than could be taught in a single semester. Rather, the teacher or student can select subsets of chapters depending on whether they wish to emphasize machine learning, linguistics, or applications. The preface sketches out a few paths through the book for various types of courses.

2 Motivation and related work

In this section, I offer some reasons for writing a textbook, compare textbooks with alternative educational formats, and provide a few words of encouragement for prospective authors.

2.1 Why you might want to write a textbook

The first requirement is that you expect to enjoy the type of work involved: reading the most impactful papers in the field, synthesizing and curating the ideas these papers contain, and presenting them in a way that is accessible and engaging for students. One of the main contributions of a textbook over the original research material is the unification of terminology and mathematical notation, so it will help if you have strong opinions about this and an impulse toward consistency. Finally, writing a good textbook requires reading great textbooks to understand what makes them work, and I enjoyed having a reason to spend more time with some of my favorites (e.g., MacKay, 2003; Blackburn and Bos, 2005; Cover and Thomas, 2012; Sipser, 2012; Murphy, 2012).

A more respectable reason to write a textbook is to clarify and amplify your vision for the field. The writing process forces you to try to understand things from multiple perspectives and to identify connections across diverse methods, problems, and concepts. If you are an opinionated researcher or teacher, there are probably ideas that you think haven't gotten the credit they deserve or haven't been presented in the right way. Maybe you think students should know more about some method or set of problems: for example, I felt that learning to think about NLP by doing paper-and-pencil exercises could help students avoid wasting a lot of time writing code that was conceptually flawed. A textbook is the perfect vehicle for grinding such axes, as long as you don't take it too far and you keep the focus on what will benefit the reader.

One more reason to write a textbook is that we really do need them: only a small handful of NLP textbooks have ever been written. It is true that the textbook market is somewhat "winner-take-all": it is easiest to build a course around a textbook that is already widely in use, and hard to get teachers to change their materials. But different types of courses and students have different needs, and mature fields have dozens of books that target each of these audiences. Compared with the difficulty of finding a niche among the thousands of research papers written each year, a well-written NLP textbook is almost guaranteed to offer something valuable to a large number of readers.

2.2 Why I did it

Honesty requires some additional introspection about my real motivations. The project started because I felt unprepared to teach many topics in natural language processing, and could think of no better preparation than writing out some notes and derivations in my own words. I find it hard to focus on lectures that are based on slides, and I have noticed that many students seem to have the same difficulty. So I tried to write notes that would enable me to teach from a whiteboard.1

A second motivation was to create a resource for my students. When I started teaching in 2012, there was really only one textbook that was sufficiently complete and contemporary to offer in a college-level NLP course: Jurafsky and Martin (2008, J&M).2 But as an incoming faculty member, I was particularly eager to train graduate students as potential research assistants, and J&M was less mathematical than I would have liked for this purpose. My first approach was to have students read contemporary research papers and surveys, but this requires training, and students struggled with inconsistencies in notation and terminology across papers. I needed something that would give students a bridge to contemporary research, and decided I would have to write it myself.

1 A specific inspiration for my early notes was the teaching materials from Michael Collins, e.g., http://www.cs.columbia.edu/~mcollins/loglinear.pdf.

2 There are also some outstanding books that were either too old (Manning and Schütze, 1999) or not quite aligned with my goals for the course (e.g., Bender, 2013; Bird et al., 2009; Goldberg, 2017; Smith, 2011).


These reasons added up to a set of course notes that I posted on GitHub, but not a textbook. After periodic nudges from editors over a period of several years (see Table 1), and some experience reviewing books and book proposals, I finally decided to submit a proposal of my own in 2017. At this time I was close to submitting my tenure materials, and writing a book seemed like a welcome change of pace. I had become friends with a group of professors in the humanities and social sciences who were sweating over their own book projects at the time, and I envied their focus on solo long-term work, which seemed so different from my life of bouncing from one student-led project to the next. And finally, I flattered myself to think that I would be able to write the book quickly from the material that I had amassed in five years of teaching (read on to learn whether this prediction was accurate). Overall, the book arose from a combination of impostor syndrome and irrational optimism, a recipe that may be at the heart of many writing projects.

2.3 Why not do something else?

When people find out that you are writing a textbook, you may receive suggestions for all sorts of better ways to communicate the same information. In the 2010s, there was great interest in online courses (particularly at Georgia Tech, which was then my home university), and I was urged to produce videos for such a course on natural language processing. Another possibility would have been to write a blog, which would be easier to keep current than a textbook, and would permit readers to post comments and questions (e.g., Ruder, 2021). Going further, tools like Jupyter notebooks (Kluyver et al., 2016) offer exciting new ways to combine writing, math, and code. Some intrepid authors have even written entire textbooks as collections of these interactive documents (e.g., VanderPlas, 2016). With all these alternatives, why write a traditional textbook on "dead trees" (as one of my students put it)? Some reasons are more personal and others are practical. Here are three:

Longevity. Although much of the textbook will be obsolete in a few years, some parts may stand the test of time; there are topics for which I still turn to my copy of Manning and Schütze (1999). Even if my book does not offer the best explanation of anything that anyone cares about in twenty years, I am glad to know that people will probably be able to read it if they want to. With more innovative online media, there is no such guarantee. Course videos may be available far into the future, but they are difficult to produce well, requiring an entirely different set of skills than the amateur typesetting capabilities that most academics acquire in the course of their studies.

Quality. The publication process brings in several people who help you write the best possible book: an editor who helps you choose the material and the high-level approach, reviewers who make sure the presentation is clear and correct, and a copy editor who finds writing errors. Perhaps because textbooks are rare, I also found that colleagues were very generous when asked to lend their expertise.

Finality. The field of natural language processing will surely continue to grow and evolve, and online media offers the temptation to try to keep pace with these changes. But if you agree to be bound by the conventional publishing process, there will come a day when you send a file to the publisher and are unable to make any further changes. While some authors seem to be happy (or at least willing) to continually revise through many editions over several decades (e.g., Russell and Norvig, 2020), I wanted the option to move on to other things.

While textbooks can be expensive, open access online editions are increasingly typical. In my case, I was able to negotiate a free online edition in exchange for a small portion of the royalties.

2.4 Yes, you

Before committing, I confessed to my prospective editor one of my deepest fears about the project: the best-known textbooks on natural language processing (Manning and Schütze, 1999; Jurafsky and Martin, 2008) were written by true luminaries. Who was I to try to compete with them? Being a crafty and experienced editor, she replied that perhaps those authors were not so luminous before their textbooks, and wouldn't I like to write one and join them in the firmament? Although I am not so crafty, even I could see through this ploy. What ultimately gave me the courage to proceed was the realization that if I didn't write this particular textbook, then no one else would. AI summer was then coming into full bloom, and the true luminaries had plenty of other things to keep them occupied. In any case, there is no minimum amount of luminosity required for writing a textbook: publishers will ask that you give some evidence that you know what you're talking about, but the main criterion is to have a compelling vision for a book that hasn't been written yet.

3 Methodology

Publishers seem keenly aware of the need for more textbooks in natural language processing and in AI more generally, and I found several editors who were eager to talk at conferences. I selected MIT Press because of their track record in publishing some of my favorite computer science textbooks. Other factors that you may wish to consider are the length of the review and production process, and the publisher's position towards open access. I was lucky to get feedback on the contract from another editor and from colleagues who have written books in other fields, but I did not think of negotiating with regard to electronic editions and translations. Fortunately the publisher was generous on these points, as they turned out to be a significant fraction of the revenue for the book. In any case, in the current environment of high demand for AI expertise, the financial compensation is not competitive with other uses of the same amount of time. You may find that it makes more sense to negotiate on aspects of the book and publishing process, such as length, open access, and support.

The publisher requires four main inputs from the author: a proposal, a complete draft for review, a "finished version" for copy editing and composition, and markup of page proofs. In the rest of this section, I'll describe how I approached each of these inputs. A timeline is given in Table 1.

3.1 Proposal

The publisher required a proposal with two complete chapters (which were entirely rewritten later), a detailed table of contents for the rest of the book, and a discussion of the imagined readership and the books that readers currently have to choose from. You will also give an estimate for some factors that affect the price: how long the book will be, how many figures to include, and whether color is required; and you will be asked to provide a timeline, which no one takes too seriously.3 As with anything else, it helps to see other proposals that have been successful, and you may ask your editor for positive examples. I spent a significant amount of time on the example chapters, and relatively little on the proposal itself, although it did help me to identify the overall structure of the book.


Fall 2012          Started teaching natural language processing and writing lecture notes.
July 2014          First contact with an editor.
2014–2017          Periodic nudges from the editor to please finish my book proposal someday.
March 2017         Book proposal done and sent out for review.
May 2017           Book proposal reviewed and accepted.
June 2017          Signed agreement with publisher.
Summer 2017–2018   Did most of the writing.
Early summer 2018  Solicited informal reviews of chapters from subject experts.
June 2018          Manuscript draft sent out for formal reviews.
Summer 2018        Wrote most of the exercises while awaiting reviews.
July 2018          Received reviews, started revisions.
November 2018      Revised manuscript sent out for production.
Winter 2019        Received and reviewed copy edits.
May 2019           Received and reviewed page proofs.
Summer 2019        I was supposed to make slide decks while waiting for the book to come out.
October 2019       Book is published.

Table 1: A rough timeline. Key inputs from section 3 are highlighted in bold.


3.2 Draft

If the proposal is accepted, it's time to start writing. The purpose of this stage is to produce something that can be sent to the reviewers. In my case, the editor did not require the exercises or figures to be done at this stage; I have heard that other presses will solicit reviews on a chapter-by-chapter basis. After getting to a complete draft of each chapter, I also solicited informal reviews from friends and colleagues, which both improved the content and gave me far more confidence about the chapters that did not align with my expertise.

At first I tried to schedule the writing to align with teaching (for example, writing the chapter on parsing while teaching the same unit) but I wasn't able to keep up, and several chapters had to be left to the following summer. I hesitate to offer much writing advice to this audience, but I will pass along one thing I learned from Mark Liberman, when I asked how he was such a prolific blogger:4 it's possible to learn to write well if you constrain yourself to write quickly, but it's much more difficult to learn to write quickly while constraining yourself to write well. So write quickly, and eventually the quality will catch up.

3 As my editor put it, "if missing deadlines was a crime, the prisons would be full of authors."

4 https://languagelog.ldc.upenn.edu


One regret about this stage is that I did not adopt the publisher's formatting templates. I had already written many pages of course notes, and when I couldn't immediately get them to compile against the publisher's format, I decided to put this off until later. Naturally that only made things much more difficult in the end, and I didn't use all that much of my original material anyway.

There are several reasons why my estimate of the completeness of the original course notes was too optimistic. While teaching, you are likely to emphasize the aspects of the subject that you know best. This means that the remaining parts to write are exactly those that are most difficult for you. In the classroom, you can rely on interactive techniques such as dialog and demonstrations to overcome weaknesses in the exposition of technically challenging material, but the textbook must stand alone. Finally, the requirements for consistency, clarity, and accuracy of attribution in a textbook are much higher than the standard that I had reached in my course notes, and although the difference may seem small to many readers, it represents quite a lot of work for the writer. In total, I kept hardly any of the original text, although I was able to reuse the high-level structure of roughly half of the chapters.


3.3 Revision(s)

The reviews were generally positive, but one reviewer was quite critical of the early chapters; although the publisher didn't require it, I made substantial changes based on this feedback. At this point I also tried to add a few notes about very recent work, such as BERT (Devlin et al., 2019), which appeared on arXiv while I was doing the revisions. Had the original reviews been more negative, the publisher might have required another round before accepting my revisions, but luckily this wasn't necessary in my case. The reviewers were very helpful, but I am skeptical that any of them read the whole thing, and I recommend seeking external reviews, especially for the later chapters that the reviewers are likely to skip or skim.

3.4 Proofs

The remaining steps involve details of the writing style and typesetting. At this stage I handed over the source documents to the production team, and could only communicate by adding notes to a PDF. This may be less of a technical requirement and more an incentive to prevent authors from introducing significant new content. The copy editing stage identified many writing problems, but the copy editor was unable to check the math. The publisher offered to pay for a math editor, but I was unable to find someone willing to do it. Fortunately, many of the mathematical errors had already been identified by students of my course. This stage also involved a bit of haggling about minor issues, like whether it would be necessary for citations to include page numbers from conference proceedings, and when it was appropriate to use a term of art that violated the house style, such as "coreference" instead of "co-reference."

Once the copy edits were complete, a LaTeX professional compiled the document using the publisher's format. This created a number of problems with the typesetting of the math, which were somewhat painstaking to check and resolve, and which could have been avoided if I had used the publisher's templates from the beginning. At this point the publisher is highly resistant to any changes to the content, but I did get them to fix some glaring errors that I found at the last minute.

4 Evaluation

What worked. It is difficult to be objective about a project of this scope. I am always happy to learn that the book is being used in a course or for self-study, and was thrilled about translations into Chinese and Korean. The text seems best suited to classes that are similar to mine, where the primary goal is to train future researchers, who need a mathematical foundation in the discipline. I have been told that the exercises are particularly helpful, and I have received many requests for solutions, which I am happy to provide to teachers. There have been fewer reports of errors than I had expected, which I attribute to the careful reading of several classes of students while I was teaching from the unpublished notes.5 Offering the PDF online seems to have been an essential factor in the adoption of the textbook, especially given that the most popular alternative (J&M) is also freely available.

What could have been better. By the time the book appeared in print, there had been a number of significant changes in both the theory and practice of natural language processing. While this was expected, it is nonetheless hard not to be disappointed that I did not put more emphasis on the topics that increased in importance; I'll say a bit more on this in the final section. Some readers feel that the term "introduction" in the title is misleading with regard to the amount of mathematical background that is expected. While the text assumes only multivariate calculus (and attempts to be clear about this expectation), the pace of the opening chapters is difficult for students who are out of practice. My editor was probably correct that adoption would be greater if I provided slides that professors could teach from, but I couldn't bring myself to make time for this tedious task after finishing the book.

5 Future work

As the field of natural language processing continues to progress, it is tempting to update the textbook with the latest research developments. For example, multilinguality and multimodality would deserve significantly more emphasis in a second edition, and a revision would have to reflect the maturation of applications such as question answering and dialog. But while some changes could be addressed by adding or modifying a few chapters, others (particularly the shift from conventional supervised learning to more complex methodologies like pretraining, multi-task learning, distillation, and prompt-based learning) seem to require a more fundamental rethinking of the book's underlying structure, particularly in a textbook that emphasizes a coherent mathematical foundation. Any such revisions would have to grow out of classroom teaching experience, which did so much to determine the shape of the first edition.

5 I maintain an errata page at https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/errata.md

Acknowledgments. Thanks to William Cohen, Slav Petrov, and the anonymous reviewers for feedback.

References

Emily M. Bender. 2013. Linguistic fundamentals for natural language processing: 100 essentials from morphology and syntax, volume 6 of Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural language processing with Python: Analyzing text with the natural language toolkit. O'Reilly Media, Inc.

Patrick Blackburn and Johan Bos. 2005. Representation and inference for natural language: A first course in computational semantics. CSLI Publications.

Thomas M. Cover and Joy A. Thomas. 2012. Elements of Information Theory. John Wiley & Sons.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Jacob Eisenstein. 2019. Introduction to natural language processing. MIT Press.

Yoav Goldberg. 2017. Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies, 10(1):1–309.

Dan Jurafsky and James H. Martin. 2008. Speech and language processing, 2nd edition. Prentice Hall.

Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Brian Granger, Matthias Bussonnier, Jonathan Frederic, Kyle Kelley, Jessica Hamrick, Jason Grout, Sylvain Corlay, Paul Ivanov, Damián Avila, Safia Abdalla, Carol Willing, and the Jupyter development team. 2016. Jupyter notebooks: a publishing format for reproducible computational workflows. In Positioning and Power in Academic Publishing: Players, Agents and Agendas, pages 87–90. IOS Press.

David J. C. MacKay. 2003. Information theory, inference and learning algorithms. Cambridge University Press.

Christopher Manning and Hinrich Schütze. 1999. Foundations of statistical natural language processing. MIT Press.

Kevin P. Murphy. 2012. Machine learning: a probabilistic perspective. MIT Press.

Sebastian Ruder. 2021. Recent advances in language model fine-tuning. https://ruder.io/recent-advances-lm-fine-tuning/.

Stuart Russell and Peter Norvig. 2020. Artificial intelligence: a modern approach, 4th edition. Pearson.

Michael Sipser. 2012. Introduction to the Theory of Computation. Cengage Learning.

Noah A. Smith. 2011. Linguistic structure prediction. Synthesis Lectures on Human Language Technologies, 4(2):1–274.

Jake VanderPlas. 2016. Python data science handbook: Essential tools for working with data. O'Reilly Media, Inc.



Learning How To Learn NLP: Developing Introductory Concepts Through Scaffolded Discovery

Alexandra Schofield
Harvey Mudd College
[email protected]

Richard Wicentowski
Swarthmore College

[email protected]

Julie Medero
Harvey Mudd College
[email protected]

Abstract

We present a scaffolded discovery learning approach to introducing concepts in a Natural Language Processing course aimed at computer science students at liberal arts institutions. We describe some of the objectives of this approach, as well as presenting specific ways that four of our discovery-based assignments combine specific natural language processing concepts with broader analytic skills. We argue this approach helps prepare students for many possible future paths involving both application and innovation of NLP technology by emphasizing experimental data navigation, experiment design, and awareness of the complexities and challenges of analysis.

1 Introduction

Discovery learning describes a pedagogical framing where, instead of introducing students to a skill and then using assessments to explicitly practice that skill, students are given a broad objective and "discover" pertinent concepts and skills in pursuit of that objective. Though pedagogical research yields unimpressive results for pure discovery learning (Mayer, 2004), several works support the effectiveness of guided discovery learning (Alfieri et al., 2011), where students are provided scaffolding for how to develop their solutions and regular opportunities for feedback or validation. This type of instructional approach offers students the opportunity to exercise creativity and to take more ownership of their own sensemaking process.

In this paper, we present a proof of concept of the merits of scaffolded discovery learning by describing our implementation of four guided discovery learning exercises1 to anchor the first four weeks of an undergraduate natural language processing course. We also detail why we have found this scaffolded discovery learning approach especially well-suited to an undergraduate NLP course.

1 Materials for these four assignments are available for use at https://github.com/DiscoverNLP.


2 Motivation

There are several strategic benefits to a discovery learning approach for NLP. First, the traditional computer science approach to NLP instruction tends to emphasize programming and/or core algorithms first (Bird, 2008), which requires students to jump immediately into implementing widely accepted algorithms for particular tasks. While this may lead to a more satisfying performance outcome, it may miss the opportunity for students to engage with why particular choices, such as how to tokenize text or how to smooth an n-gram model, actually make sense from a linguistic perspective. These skills are important when transferring to new domains, where the optimal choices might change and require further assessment.

Second, the choice of which model is the most "widely accepted" has been shown to change rapidly. Examining the curricular recommendations made by the ACM/IEEE Joint Task Force for Computing Curricula over time, we see speech recognition and parsing as emphasized topics in 2001, but no specific discussion of n-gram models (ACM/IEEE, 2001). In contrast, probabilistic models and n-grams appear in the 2013 version, while speech recognition disappears as an emphasized topic and parsing remains on the syllabus in vaguer terms (ACM/IEEE, 2013). While familiarity with these different problem spaces is helpful for building context and approaches to new problem domains, it is hard to predict if the depth of knowledge required to fully implement a conditional random field (CRF) or a deep neural net for machine translation will be useful beyond the next few years. In contrast, our discovery-learning labs focus on how to set up experimental data and protocols, select and interpret appropriate metrics, use those metrics to investigate what textual phenomena they actually embody, and react to that knowledge with possible design interventions. We have found these skills not only generalize better across different models and problem domains, but also dominate the actual time spent doing our NLP work as researchers.

Third, this strategy of teaching is more compatible with the structure of the curricula at the undergraduate-only institutions where we teach. Upper-level courses such as natural language processing have shallow prerequisites, and students have uneven backgrounds in math, computer science, and linguistics. This challenge is shared by both computer science and linguistics undergraduate programs (Bird, 2008) and graduate courses aimed at an information studies audience (Hearst, 2005; Liddy and McCracken, 2005). Emphasizing skills of evaluation and model discovery over detailed model analyses helps ensure that students comfortable with data structures and some probability will be able to succeed and develop meaningful skills in the course. In our context, it has the additional benefit of inviting a liberal arts approach to critique the nature of what we evaluate in NLP tasks, as well as to consider the implications of the deployment of language technologies.

3 Learning Objectives

We present four labs as a central element of the first four weeks of a natural language processing course. In developing these exercises, we emphasize three skills we hope students develop through these assignments: analysis of quantitative results, text exploration, and generalizing concepts to new models. While some of these fall into the classic "text processing first" paradigm offered as an applied linguistics approach by Bird (2008), they do not do so at the exclusion of understanding the "why" of related algorithms and models. Notably, none of these skills specify particular models in natural language processing, though the labs we present do have some alignment with different classic NLP course topics. This is a deliberate alternative approach: by de-emphasizing outcomes focused on "knowing" particular algorithms and models, we provide more opportunity to practice skills to approach and interpret new models that students may choose to explore later in the course.

Analysis of quantitative results. Our work aims to support a progression not in terms of complexity of models, but in the levels of critique students apply to computational work on text. By building from some of the basic challenges of what it means to read, process, and write text in a computer through the development of increasingly complex metrics, students naturally have the opportunity to progress from making comments based on initially observed phenomena to making the case for whether results are surprising using typical metrics of the field. Unlike work that primarily privileges meeting certain minimum accuracy thresholds, these exercises intentionally include analyses of unimpressive results, helping to emphasize important lessons about how seemingly large quantitative differences don't always indicate significance in the vast, sparse universe of language.

Exploration of the text. Natural language processing benchmarks often present results in ways that obscure the contents of the data being evaluated, especially in cases where test data is totally hidden. Students new to the field of NLP often do not have strong intuitions around how much variation exists even in a limited problem domain. Examining system inputs and outputs helps them develop intuitions about language data and what it means to be "unusual" in a highly sparse space. Our discovery approach helps reinforce the importance of including both quantitative experimental results and qualitative discussions using examples about model behavior with respect to the text, as both contribute to an understanding of what a model has (or hasn't) done.

Thinking beyond individual models. An important component of practicing natural language processing is implementing models and comparing their performance. However, current "state of the art" models often are built using a combination of different fairly detailed neural architectural choices. Engaging with implementing these can require either many prerequisite courses, many days of in-class time to establish working knowledge of neural networks for sequential data, or a strategy of teaching how to code the models that eschews why to use these pieces. Further, with little confidence that these models won't be obsolete in two years, it can be an expensive investment to set up curricular materials for these topics without knowing whether they will still merit reuse the next time the course is offered. Because our approach de-emphasizes the specific models implemented in favor of understanding broader skills around evaluation and comparison, we believe it will better prepare students to teach themselves and to ask meaningful research questions when approaching new work.

4 Course Format

The two courses using these labs include both a lecture-focused larger class format and a smaller discussion-based format. In both, students are expected to read some text from the textbook, Jurafsky and Martin (2020), supplemented by papers on relevant NLP topics. Outside of course readings, the lab work constitutes the vast majority of the work completed outside of class for the first half of the semester.

In both formats, we reserve one day a week specifically for setting up and starting lab assignments. Students attend "lab day" in order to start work on the lab exercise for that week. The lab relies on the concepts discussed in class. Though the lab assignment is started during the lab period, the expectation is that the lab work will be completed outside of class and is due the following week.

5 Lab Assignment Progression

The following subsections describe the progression of four laboratory assignments intended for the first four weeks of the course. Each lab assignment has a learning outcome related to analytical skills in natural language processing. Each of the four labs presented also focuses on introducing content from Jurafsky and Martin (2020).

These assignments are implemented in Python, with starter code provided through GitHub. The assignments can be coupled with autograders for the code managed through Gradescope.2 When used, the autograders (which account for less than half the points in the student's final write-up) can serve as extra scaffolding, as they provide a useful on-demand tool for students to check whether their code is correct before writing their analysis.

Each lab is accompanied by an extensive lab write-up, which breaks the exploration topic into individual implementation goals, check-ins, and questions for students to address as they venture through the assignment. These are designed to be self-contained: while references to concepts or definitions described in the textbook reading may appear, most formulas and terms specific to the questions in the lab are re-introduced as part of the write-up. The text alternates between providing textual definitions and descriptions, coding instructions, and analysis questions. As the lab progresses, the questions grow more open-ended, culminating in prompts to make sense of results and what they signal about how well a model fits the text.

2https://www.gradescope.com/

5.1 Lab 1: Regex Chatbot

In this lab, students develop a chatbot using regular expressions, using either Slack or Discord in our class offerings.3 While the requirements of the assignment are fairly basic (for instance, the regular expressions must use groups and quantifiers), students are primarily graded on a write-up describing the chatbot and its performance. Students are expected to document with screenshots examples of conversations that worked well or poorly and to analyze what sorts of features might be missing.
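To make the requirements concrete, the following is a minimal sketch of the kind of rule such a chatbot might contain; the pattern, reply template, and function name are our own illustration rather than the course's starter code.

import re

# One illustrative rule: a non-capturing optional group handles politeness,
# and a capturing group grabs the text of the reminder.
PATTERN = re.compile(r"(?:please\s+)?remind me to (.+)", re.IGNORECASE)

def respond(message):
    match = PATTERN.search(message)
    if match:
        return "Okay, I'll remind you to {}.".format(match.group(1))
    return "Sorry, I didn't understand that."

print(respond("Please remind me to submit the write-up"))
# -> Okay, I'll remind you to submit the write-up.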

Preparation Students are expected to read the regular expressions subsection of the J&M textbook (§2.1). Students must also complete the first portion and at least one challenge puzzle from a regular expressions puzzle game called Regex Golf.4

Outcomes Having completed this assignment, students should be able to develop and test Python regular expressions, as well as to outline and illustrate a qualitative analysis of the capabilities and limitations of a new model. This assignment also encourages students to focus their time on thoughtful analysis instead of optimal performance prior to introducing clear numerical metrics of performance. For instance, the sample of a student write-up in Figure 1 shows introspection related to a relatively simple pig Latin bot on how punctuation adds wrinkles to the program. These observations about what cases can lead to additional complexity help set up how we attend to variation in case, punctuation, Unicode, and more general human language variation in datasets for future labs. Though regular expressions are technically covered in another required course in the CS major as a theory topic, we find it important to solidify the topic in an applied context, as tokenization is a necessary part of processing pipelines for the remainder of the course, a conclusion we share with Hearst (2005).

3A side benefit of starting with this assignment is that, for courses that use one of these systems for Q&A, it encourages participation early among students.

4From Erling Ellinsen: https://alf.nu/RegexGolf


Figure 1: A sample student write-up for Lab 1 describing a case where their bot failed to generate formulaic pig Latin and proposing improvements to address that case.

5.2 Lab 2: Tokenization and Segmentation

In this lab, students are first guided step-by-step through the basics of making a tokenizer in order to compute word frequencies in English texts from Project Gutenberg. Reflections include discussion of the effect of punctuation and lower-casing on which words appear most frequently. From there, students move on to improve a rule-based sentence segmenter for several portions of the Brown corpus using simple regular expressions. In this exercise, students report their precision, recall, and F1 scores, and reflect on which rules worked well, which did not, and why. Throughout both pieces, students explore the text, focusing on which text does not behave as expected.
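As a rough illustration of the second half of this lab, the sketch below scores a deliberately naive segmentation rule against hand-marked gold boundaries; the rule, the toy sentence, and the character offsets are our own assumptions, not the lab's starter code.

import re

def segment(text):
    # Naive rule: a sentence boundary follows ., ! or ? plus a space.
    return set(m.end() for m in re.finditer(r"[.!?]\s", text))

text = "Dr. Smith arrived. She spoke briefly! Then she left."
gold = {19, 38}        # offsets of the two true boundaries
pred = segment(text)   # the naive rule also fires after "Dr. "

tp = len(pred & gold)
precision = tp / len(pred)   # 2/3
recall = tp / len(gold)      # 2/2
f1 = 2 * precision * recall / (precision + recall)
print("P={:.2f} R={:.2f} F1={:.2f}".format(precision, recall, f1))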

Preparation Students are expected to prepare with textbook reading on text normalization (§2.2-2.4) as well as classification metrics (§4.7).

Outcomes Having completed this assignment, students should be able to perform standard string manipulations in Python. Students also will identify behaviors in text collections that fall outside typical prescriptive grammar rules, e.g. that sentences in the Brown corpus may start with a lowercase letter or end with a colon. Students are often tempted to add rules to their system for every sentence boundary, even those that are due to annotation errors. The analysis portion of this lab leads them to consider issues of overfitting for the first time, since adding rules for those edge cases leads to worse performance on held-out data. To complete the lab, students will need to combine these observations with quantitative results regarding word frequencies and classification metrics to describe what they have observed about their methods. Students will additionally reinforce Lab 1 skills of regular expression creation and identifying strengths and weaknesses in new systems.

5.3 Lab 3: Zipf’s Law and N-gram Models

In this lab, students first look through the same Project Gutenberg texts to assess how closely the unigram frequency patterns match those projected by Zipf's law. Students then develop functions to extract unigram, bigram and trigram frequencies from the texts. They use these functions to compare features found in an unrelated data set. In the last offering of this course, we utilized the Hyperpartisan news dataset (Kiesel et al., 2019) in order to understand how many types and tokens are unique to hyperpartisan or non-hyperpartisan-labeled news. Students explored these same evaluations on a random reshuffling of the data to understand the magnitude of "unique" attributes that can arise even when comparing texts from the same source.
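A minimal sketch of the counting functions at the heart of this lab might look as follows; the whitespace tokenisation and the sample sentence are illustrative stand-ins for the Project Gutenberg and Hyperpartisan data.

from collections import Counter

def ngram_counts(tokens, n):
    # Count every contiguous n-token window as a tuple.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "the cat sat on the mat".split()
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
print(unigrams.most_common(2))   # [(('the',), 2), (('cat',), 1)]
print(bigrams[("the", "cat")])   # 1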

Preparation Students are expected to prepare with textbook reading on n-gram language models (§3.1-3.4).

Figure 2: Two examples of alternate analogy tasks developed by students for Lab 4: (a) properties of objects; (b) creators and creations.

Outcomes Having completed this assignment, students should be able to develop code to extract n-gram frequencies from text and predict unigram frequency distributions from Zipf's law. This requires interpreting relative and absolute frequencies and their reasonable values: a common error scenario as students attempt to compare empirical versus theoretical behavior is to plot absolute theoretical word frequencies against relative empirical word frequencies, as displayed in Figure 3. Further, they should be able to provide qualitative and quantitative analysis of rare and common features and their effect on different frequency statistics.
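The fix for this error fits in a few lines: both the empirical counts and the Zipf prediction must be put on the same relative scale before plotting. The sketch below uses made-up counts for illustration.

import numpy as np

counts = np.array([500.0, 230.0, 150.0, 90.0, 60.0, 40.0])  # made-up, sorted by rank
ranks = np.arange(1, len(counts) + 1)

empirical = counts / counts.sum()            # relative empirical frequencies
zipf = (1.0 / ranks) / (1.0 / ranks).sum()   # Zipf's 1/rank prediction, normalised

# Plotting `counts` against `zipf` mixes absolute and relative scales
# (the error shown in Figure 3); `empirical` against `zipf` is the
# like-for-like comparison.
print(np.round(empirical, 3))
print(np.round(zipf, 3))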

The datasets used for this lab are too large for naïve file-reading solutions that load the whole file into memory at once. Students learn to iteratively process the data, which prepares them for working with large files in later labs and course projects. As they work with these files, students gain some practice leveraging existing libraries (such as lxml5, spaCy (Honnibal et al., 2020), and matplotlib (Hunter, 2007)) to process text and plot data.
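For illustration, the streaming pattern students converge on might look like the sketch below; the file path and helper name are placeholders of our own.

from collections import Counter

def stream_token_counts(path):
    # Accumulate counts one line at a time so the corpus never has to
    # fit in memory at once.
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.lower().split())
    return counts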

5.4 Lab 4: Word Embeddings

In this lab, students use GloVe embeddings (Pennington et al., 2014) to experiment with relationships between word vectors, including word similarities and word analogies. Students first practice loading in text vectors and saving them as a compressed numpy matrix. Subsequently, they develop code to rank similar words for a selection of related words, as well as to test whether certain relations (e.g. gender, number, comparative vs. superlative) have consistent vector differences as rendered in 2D plots using PCA. Finally, students develop their own relations and word lists to test the consistency of this method.
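A minimal sketch of the underlying vector operations is shown below; vectors stands in for a dictionary mapping words to loaded GloVe arrays, and the analogy function is our own illustration of the b - a + c recipe rather than the lab's starter code.

import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(vectors, a, b, c, k=1):
    # Rank words by similarity to b - a + c (e.g. "king" - "man" + "woman"),
    # excluding the three query words themselves.
    target = vectors[b] - vectors[a] + vectors[c]
    scored = ((w, cosine(target, vec)) for w, vec in vectors.items()
              if w not in (a, b, c))
    return sorted(scored, key=lambda pair: -pair[1])[:k]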

Preparation Students are expected to prepare with textbook reading on vector space models (§6.1-6.5) and their evaluation (§6.10-6.12). Section 6.8 on word2vec or an introduction to GloVe may boost confidence.

5https://lxml.de/

Figure 3: An incorrect plot for Lab 3, caused by a student conflating relative and absolute frequencies. Autograders and lab check-in points allow students to engage with these points of confusion while still catching errors before students write their analyses.

Outcomes Having completed this assignment, students should be able to perform mathematical operations common to vector space models, such as computing cosine similarity, vector norms, and average vector differences. Students will know how to use numpy operations to speed up computation (Harris et al., 2020) and dimensionality reduction to visualize higher-dimensional behavior. Additionally, students should know how to develop and test a hypothesis about word vector geometries using appropriate metrics, visualizations, and qualitative insights. These hypotheses can lead to broader understanding of what types of relationships work well as word embedding analogies, too, including how relationships that are not one-to-one may fail (Figure 2a) and how polysemy adds wrinkles to word analogies (Figure 2b).

6 Observations

In many ways, this format has helped us to feel liberated in how we approach topics later in the semester. In the most recent offering of the course, for instance, the latter half focuses on student projects and presentations on student-determined topics covering a range of popular tasks like machine translation and question answering and more wide-ranging topics like text-to-speech systems and computational social science. In other semesters, the course has continued with structured weekly (or bi-weekly) lab assignments, often culminating in a lab assignment that introduces a shared task that students then work on as a final project. Overall, we find that the foundational skills introduced in the first four weeks prepare students well for all of these paths through the rest of the course.

This approach admittedly gives up the opportunity to do a deeper technical dive with the class into modern models; while students tend to discuss LSTMs, BERT, and other popular modern models, they are not expected to fully present these topics in their presentations. While we had initially worried about whether this would provide too little infrastructure to make sense of neural models, we have been happy with how students use the skills we emphasized earlier to uncover known controversies in the field of NLP. We have witnessed advanced discussions initiated by the students about topics such as how strong baselines can be in question-answering datasets, the possible shortcomings of metrics such as BLEU, ROUGE, and PYRAMID, and what it would mean to reach human parity for machine translation.

Taking a discovery approach is not without risks, especially when students may get stuck or confused. We have learned some lessons about where more guidance may be needed for students. First, unsurprisingly, file I/O often can be remarkably picky, and we have found that the I/O portions of these labs can be fairly brittle when students make choices that may have been reasonable in previous classes with fairly rudimentary file processing. Further illustration around ways to load and save data, manipulate numpy and spaCy objects, and otherwise use libraries can greatly reduce time students spend stuck on smaller bugs instead of NLP questions. Additionally, students often ask a number of questions about how much analysis is sufficient and which aspects to focus on, often not finding engagement with the text intuitive. Sample write-ups from the first lab or two may help illustrate this, as well as more development on the prompts for the write-up to elicit thoughtful analyses.

7 Conclusions

We hope that the description of this class format and assignment progression helps demonstrate the feasibility of a discovery-based approach to teaching natural language processing. We are delighted to share the initial assignment instructions, lab files, and student Gradescope image publicly on GitHub at https://github.com/DiscoverNLP, with additional files for autograding through Gradescope available on request.

8 Acknowledgements

We would like to thank the students at both Swarthmore College and Harvey Mudd College who participated in and gave feedback on this class, including the students who gave permission for their work to be shared in this paper.

References

ACM/IEEE Joint Task Force on Computing Curricula. 2001. Computing curricula 2001. https://www.acm.org/education/curricula-recommendations.

ACM/IEEE Joint Task Force on Computing Curricula. 2013. Computing curricula 2013. https://www.acm.org/education/curricula-recommendations.

Louis Alfieri, Patricia J Brooks, Naomi J Aldrich, and Harriet R Tenenbaum. 2011. Does discovery-based instruction enhance learning? Journal of Educational Psychology, 103(1):1.

Steven Bird. 2008. Defining a core body of knowledge for the introductory computational linguistics curriculum. In Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, pages 27–35, Columbus, Ohio. Association for Computational Linguistics.

Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. 2020. Array programming with NumPy. Nature, 585(7825):357–362.

Marti Hearst. 2005. Teaching applied natural language processing: Triumphs and tribulations. In Proceedings of the Second ACL Workshop on Effective Tools and Methodologies for Teaching NLP and CL, pages 1–8, Ann Arbor, Michigan. Association for Computational Linguistics.

Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. spaCy: Industrial-strength Natural Language Processing in Python. Zenodo.

J. D. Hunter. 2007. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95.

Dan Jurafsky and James H Martin. 2020. Speech and Language Processing, 3rd (draft) edition. Prentice Hall.

Johannes Kiesel, Maria Mestre, Rishabh Shukla, Emmanuel Vincent, Payam Adineh, David Corney, Benno Stein, and Martin Potthast. 2019. SemEval-2019 task 4: Hyperpartisan news detection. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 829–839, Minneapolis, Minnesota, USA. Association for Computational Linguistics.

Elizabeth Liddy and Nancy McCracken. 2005. Hands-on NLP for an interdisciplinary audience. In Proceedings of the Second ACL Workshop on Effective Tools and Methodologies for Teaching NLP and CL, pages 62–68, Ann Arbor, Michigan. Association for Computational Linguistics.

Richard E Mayer. 2004. Should there be a three-strikes rule against pure discovery learning? American Psychologist, 59(1):14.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.


Proceedings of the Fifth Workshop on Teaching NLP, pages 138–148, June 10–11, 2021. ©2021 Association for Computational Linguistics

The Online Pivot: Lessons Learned from Teaching a Text and Data Mining Course in Lockdown, Enhancing Online Teaching with Pair Programming and Digital Badges

Beatrice Alex
Literatures, Languages and Cultures
University of Edinburgh
Edinburgh, Scotland, UK
[email protected]

Clare Llewellyn
Social and Political Science
University of Edinburgh
Edinburgh, Scotland, UK
[email protected]

Pawel Michal Orzechowski
University of Edinburgh Business School
University of Edinburgh
Scotland, UK
[email protected]

Maria Boutchkova
University of Edinburgh Business School
University of Edinburgh
Scotland, UK
[email protected]

Abstract

In this paper we provide an account of how we ported a text and data mining course online in summer 2020 as a result of the COVID-19 pandemic and how we improved it in a second pilot run. We describe the course, how we adapted it over the two pilot runs and what teaching techniques we used to improve students' learning and community building online. We also provide information on the relentless feedback collected during the course which helped us to adapt our teaching from one session to the next and one pilot to the next. We discuss the lessons learned and promote the use of innovative teaching techniques applied to the digital medium, such as digital badges and pair programming in break-out rooms, for teaching Natural Language Processing courses to beginners and students with different backgrounds.

1 Introduction

It was spring 2020 and it felt like we were in crisis mode. We wanted to teach a text and data mining (TDM) pilot course but because of social distancing measures we could not do it in a physical classroom. We had to learn new ways of interacting online and using a multitude of different technologies, and we needed to do it fast. We had been planning this course for a while before COVID-19 hit. We were designing a TDM course for Humanities and Social Science students but because of the situation we had to adapt the way we delivered it. Rather than hybrid teaching as intended, accommodating in-classroom, online synchronous and online asynchronous students, we had to fully commit to online methods in a matter of a few weeks. We decided to plunge headlong into the digital teaching world.

The easiest way would have been to post videos of traditional-style lectures – it is very tempting to take this approach. We felt, however, that it was important that we maintained what is good about teaching when everyone is in the same room: the collaboration, its social aspects, the feedback, all of which you lose when a student sits on their own in a room watching a pre-recorded lecture.

We decided to run a TDM boot camp to virtually test our new course which we were planning as part of the Edinburgh Futures Institute (EFI) postgraduate programme.1 We wanted to not only teach fundamental methods for text mining corpora to programming novices but also teach ourselves how to become better practitioners in teaching in an online world.

In this paper, we will describe our methods and experience for porting an in-person TDM course into the online world. In the next section we will present related publications on teaching Natural Language Processing or TDM courses. We then describe the academic backgrounds of the teaching team (Section 3.1) and provide an overview of our course (Section 3.2). Sections 3.3 and 3.4 explain how we taught and adapted it in two online pilot runs delivered in June and September 2020. We provide information on how we collected relentless feedback during and after each course and include a detailed account of one participant of the first pilot and how it has affected her teaching (Section 4). Finally, we summarise what we learned from these experiences (Section 5) and lay out future plans for our TDM course (Section 6).

1EFI is a new institute at the University of Edinburgh which will support interdisciplinary research and teaching for the whole institution. https://efi.ed.ac.uk

2 Related Work

There are two aspects that we consider of importance in relation to this work: the course content, Natural Language Processing (NLP), and the environment, teaching online during a pandemic. In this section we explore both topics.

NLP educators choose which aspects to teach based on multiple constraints such as class length, student experience, recent advancements, program focus, and even personal interest.

Our TDM course is fundamentally designed to be cross-disciplinary as we are teaching NLP and coding to students from multiple schools and backgrounds including linguistics, social sciences and business. Jurgens and Li (2018) point out that NLP courses are designed to reflect, amongst other things, the background and experience of the students. Agarwal (2013) explains that in courses such as these the majority of students, who he calls "newbies in Computer Science", have never programmed before. He highlights that we can increase experience through homework tasks, which we did both before the course and in-between each session. Hearst (2005) states that in these circumstances it is not important to place too much emphasis on the theoretical underpinnings of NLP but to focus on providing instructions for students on what is possible and how they can use it on their own in the future. We based our approach on the NLTK2 and spaCy3 Python libraries as well as on examples inspired by Bird et al. (2009, 2005). We aim to explain how text analysis works step-by-step using clear and simple examples. We thereby aspire to develop and broaden humanities and social science students' data-driven training and give them an understanding of how things work inside the box, something for which there is still a significant need in their core disciplines (McGillivray et al., 2020).

Teaching text analysis to non-computer scientists has been explored in texts such as Hovy (2020). For our course we had to consider the variety of backgrounds and experiences that this would encompass and needed to use a pre-course learning task and office hours to provide a more level knowledge starting point. We also had to design the course to keep more advanced students engaged while not intimidating learners who may find it more challenging. We used core material to explain principal concepts (such as tokens, tokenisation, and part-of-speech (POS) tagging) but with a hands-on approach. We avoided too much technical detail and put the material in the context of projects we have worked on ourselves to demonstrate how each analysis step becomes useful in practice.

2https://www.nltk.org
3https://spacy.io

As we taught our TDM course online in the context of a worldwide pandemic, we also report on related work in the area of online teaching with respect to the challenging circumstances in which we are teaching. Massive open online courses (MOOCs) generally focus on providing online access to learning resources to a large number and wide range of participants. This has led to a desire to automate teaching and innovate digital interaction techniques in order to engage with large numbers of students. Whilst our intention was to teach a limited number of students, we hoped to use and draw upon innovation in this area in order to improve the experience for our students. E-learning and technology should not be seen as an attempt to replace or automate human teaching, although this can often be a fear articulated by teachers. In a discussion of automation within teaching, Bayne (2015) argues that we can design online teaching and still place human communication at the centre, with technology enhancing the learning of the student. Bayne suggests that the human teacher, the student and the technology can be intertwined. We asked students to engage with digital objects and the technology to enhance their learning journey. As teachers we do not merely support the digital learner but we remain at the centre of teaching the course.

Fawns et al. (2019) point out that online learning is a key growth area in higher education, which is even more true since the pandemic started, but that it is harder to form relationships in online courses. Therefore, we saw it as important to develop online dialogue between students in order to form communities which can improve these relationships. Building a community online can be harder but it is possible. We tried to achieve this through using a combination of traditional learning such as lectures and task-based learning such as pair programming exercises. Online learning tends to be interrupted as we are in our homes or elsewhere and have responsibilities that can take us away from the online space: bandwidth issues, dropping children at school, flatmates interrupting, phone calls, even the doorbell ringing. Our teaching practices needed to be accepting of and adapted to this context.

Ross et al. (2013) discuss the issues of presence and distance in online learning. Interruptions in students' concentration are a common event when learning online and we must use resilience strategies to maintain a 'nearness' to our students. This includes recognising that these events are normal and that engaging is an effort, identifying affinities and creating a socialness, valuing that distraction can change our perspective and this is helpful, and designing openings, events that allow and encourage students to come together and engage. Whilst designing the course we kept these ideas in focus in order to allow us to develop and enhance our online relationships and our students' learning.

3 A Virtual Learning Experience

3.1 The Team

Our team is made up of three early career academics at the University of Edinburgh. Two teaching fellows have a background in Natural Language Processing with PhDs in Computational Linguistics. The third teaching fellow has a PhD in Computer Science and frequently teaches programming to different types of audiences, including business students as well as students outside of higher education. The author list of this paper also includes a fourth (last) author who was a participant of our first pilot, is a lecturer herself, and who has provided us with useful feedback for future iterations of this course (see Section 4.2).

3.2 Course Overview

In our data-driven society, it is increasingly essential for people throughout the private, public and third sectors to know how to analyse the wealth of information society creates each day. Our TDM course gives participants who have no or very limited coding experience the tools they need to interrogate data. This course is designed to teach non-coders how to analyse textual data using Python as the main programming language. It takes them through the steps required to analyse and visualise information in large sets of textual document collections, or corpora.

The course takes place over three three-hour sessions and each session introduces participants to a new topic through a short lecture. The topics build on the previous sessions and at the end of each session there is time for discussion and feedback. In the first session we start with Python for reading in and processing text and teach how individual documents are loaded and tokenised. We work with plain text files but do raise the issue that textual data can be stored in different formats. However, to keep things simple we do not cover other formats in detail in the practical sessions.
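As a rough illustration of these first steps (not the course notebooks themselves), loading and tokenising a single plain text document with NLTK might look as follows; the filename is a placeholder.

import nltk

nltk.download("punkt", quiet=True)  # tokeniser model, fetched once

with open("document.txt", encoding="utf-8") as f:
    text = f.read()

tokens = nltk.word_tokenize(text)
print(len(tokens), tokens[:10])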

In the second session we show how this is done using much larger sets of text and add in visualisations. We used two data sets as examples: the Medical History of British India (National Library of Scotland, 2019) made available by the National Library of Scotland4 and the inaugural addresses of all American Presidents from 1789 to 2017. We show how participants can create concordance lists, token frequency distributions in a corpus and over time, as well as lexical dispersion plots, and how they can perform regular expression searches using Python. In this session we also explain that textual data can be messy and that a lot of time can be spent on cleaning and preparing data in a way that is most useful for further analysis. For example, we point students to stop words and punctuation in the results and explain how to filter them when creating frequency-based visualisations.
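A sketch of these explorations with NLTK is shown below; the query words are illustrative, and tokens is assumed to come from the loading step in the first session.

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

text = nltk.Text(tokens)              # `tokens` from the earlier session
text.concordance("plague", lines=5)   # keyword-in-context lines

stops = set(stopwords.words("english"))
fdist = nltk.FreqDist(t.lower() for t in tokens
                      if t.isalpha() and t.lower() not in stops)
print(fdist.most_common(10))          # frequent words, stop words filtered

text.dispersion_plot(["plague", "cholera"])  # requires matplotlib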

During the third session we cover POS-tagging and named entity recognition. This last session concludes with a lesson on visualisations of text and derived data by means of text highlighting, frequency graphs, word clouds and networks (see some examples in Figure 1). The underlying NLP tools used for this course are NLTK 3 and spaCy, which are widely used for NLP research and development. This is also where we put some of the course material in the context of our own research to show how it can be applied in practice in a real project. For example, we mentioned our previous work on collecting topic-specific Twitter datasets for further analysis (Llewellyn et al., 2015), on geoparsing historical and literary text (Clifford et al., 2016; Alex et al., 2019a) and on named entity recognition for radiology reports (Alex et al., 2019b; Gorinski et al., 2019).
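For illustration, the POS tagging and named entity recognition steps with spaCy might look as follows; the example sentence is our own, and the small English model is assumed to have been installed beforehand (python -m spacy download en_core_web_sm).

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("President Washington gave his first inaugural address in 1789.")

for token in doc:
    print(token.text, token.pos_)   # part-of-speech tag for each token

for ent in doc.ents:
    print(ent.text, ent.label_)     # named entities, e.g. "1789" as DATE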

4https://data.nls.uk/data/digitised-collections/a-medical-history-of-british-india/


Figure 1: Visualisations of text explorations created by the students.

In the two pilots, we ran this course over three afternoon sessions on Monday, Wednesday and Friday, with an office hour on the days in-between to sort out any potential technical issues and answer questions. The main learning outcome is that by the end of the course the participants will have acquired initial TDM skills which they can use in their own research and build on by taking more advanced NLP courses or tutorials. A main goal of this course is to teach the material in a clear step-by-step way, so all Python code and the examples are specific to each task but do not go in-depth into complicated programming concepts which we believe would confuse complete novices.

3.3 Pilot 1

In the first pilot we wanted to test the content of this course but also different methods for teaching online. We are all likely to be teaching virtually more often in the future even once the pandemic subsides. For example, EFI was planning to run hybrid courses for students across the world even prior to COVID-19. In this new world, we believe that online and hybrid teaching is here to stay alongside teaching students in the classroom. Higher education will need to determine what different experiences it offers students, be they on site or participating online synchronously or asynchronously.

We limited the first pilot to 25 participants. The backgrounds of students who signed up for our course were mixed, coming from Law, Linguistics and Business. Everyone was either a student or a member of staff at the University of Edinburgh, where we had advertised the course, including every level from professor to undergraduate, joining from around the world. Some students even participated from different time zones.

On each day we started with a short presentation discussing the TDM theory of what was being taught in the practical session that followed. In the first pilot this was a live lecture, not recorded, allowing us to adapt the content to questions that came up during the course. When one teacher spoke, the other two managed the video chat, answering questions or dealing with specific problems from students, and raising questions to the speaker. This was something we found was essential as it was very easy to lose flow and get distracted without this help. We learned then that it would have been extremely challenging to teach this course live online single-handedly, and after each session expressed appreciation that there were three of us helping each other.

We used a variety of technologies provided by the university. Learn,5 our in-house virtual learning environment (VLE), was used to provide access to course materials. We met with students virtually using the Blackboard Collaborate software6 which is accessible through Learn. Aside from the video itself, we used text chat, the virtual whiteboard, polls, the ability to raise a hand, breakout groups, file sharing, and screen sharing, all functionalities which have become second nature after a year of pandemic but which when we ran the first pilot were for the most part still fairly unfamiliar to many participants. We also used Noteable,7 the University of Edinburgh's in-house notebook platform, to provide a virtual programming environment (VPE) with Jupyter Notebooks,8 and used GitHub9 to provide students access to the course material and code. We note that the students did not have to learn how to use GitHub, which would be a big ask for coding novices, but merely had to paste the GitHub link of the corresponding material into Noteable, which then automatically loaded the material in the form of a notebook.

Each day the students were given two sets of worked-through problems using the VPE, which they accessed directly in their own browser. We found this to be a really important tool for everyone as it reduced the need for students to download and set up software on different operating systems and spared us a lot of the technical support needed to get students set up and running for all the practical parts of the course.

During the sessions the students were given a link to a GitHub repository from which they could pull new notebooks onto the VPE at the beginning of each session. The notebooks include a combination of explanations, code to run, and mini or extended programming tasks. For each approximately hour-long coding session students were assigned a random buddy with whom they were put in a break-out room within the Collaborate video call. By now we are used to teaching and/or learning online and have likely experienced joining break-out rooms, but at the time when we ran the first pilot most of our participants had never been in a break-out room before, so that experience took some getting used to. We described it as feeling like being put in a separate room with your buddy: you can chat and share screens without being overheard by other people. If the students got stuck on a particular coding problem or line of code and could not solve the issue together, they could raise a virtual hand and an instructor would drop into the room to help and answer questions or resolve programming issues. We also regularly popped into the rooms to see how everyone was doing, something which was well received by the students.

5https://www.learn.ed.ac.uk
6https://help.blackboard.com/Learn/Instructor/Interact/Blackboard_Collaborate
7https://noteable.edina.ac.uk
8https://jupyter.org
9https://github.com

One of our team members is a strong proponent of pair programming (Williams et al., 2000; Hanks et al., 2011), where two students work together on a single machine to solve problems. This allows each pair of students to learn from each other as well as from their teacher(s) and thereby helps to broaden participation and to dispel the myth that programmers work on their own (Williams, 2006). We wanted to see if it was possible to take this approach into a virtual teaching environment. In addition to the students learning TDM skills, it also provided an opportunity for social interaction which was particularly welcome when we first piloted our course at the tail end of the first wave of COVID-19 in the UK and after weeks of strict lockdown with no or little opportunity to meet and interact with people outside one's own household.

One advantage of Blackboard Collaborate is that instructors are able to see visually when the people in break-out rooms are chatting to each other. This helped us to gauge if students embraced our pair programming experiment or if they preferred to work quietly "side-by-side" but connected virtually.

After each practical session we pulled everyone back into the shared room and asked participants to fill in a quick survey to give us feedback. We answered any questions, had a quick break, and then moved on to the next notebook with a new buddy. We wrapped up each session with a short Q&A and another round of feedback.


3.4 Pilot 2

By the time of the second pilot in September 2020, we had gotten a lot more used to online meetings and two members of the teaching team had trained in a summer course on hybrid teaching called An Edinburgh Model for Teaching Online. This time we allowed 30 participants to sign up, with over half of them from the Scottish Government and the commercial sector, alongside university students and staff.

The main change we made to our first pilot, without altering the course content, is that we restructured the course material into teaching with digital badges (Gibson et al., 2015; Muilenburg and Berge, 2016) which are used in gamification of education (Dicheva et al., 2015; Ostashewski and Reid, 2015). The principles that guided us were: flexibility, compartmentalisation and empowering the learner. Each badge is built around a Threshold concept (Land et al., 2005), a core step or skill (a 'eureka' moment) that opens the doors to further learning. Using a clear name and symbol, each badge signposts students' takeaways and how it fits within the top-level learning journey (see Figure 2).

The macro-structure in which badges form our course is complemented by a micro-structure of each badge: background theory and instructional content, code-along videos, notebooks with worked examples, exercises of increasing difficulty, relentless feedback, pair work and mini coding problems (with solutions). Badges build on top of each other, forming branches and enabling optional, further learning. Additionally, the modular micro-structure enables easier switching between platforms or teaching modes (e.g. videos versus slides) and multiplies the benefits of improvements. Badges proved to be a promising format for delivering teaching of this course, especially in times of change, disruption and pivoting.

We wanted to give ourselves and our course participants more flexibility, so we recorded all of the short lectures presented at the start of each badge and situated before each coding session in the course. This allowed students to come back to the recorded lecture materials later on. It also gave us more flexibility answering questions in the chat, solving technical issues in the background and discussing the running of a given badge in a teaching team break-out room while participants were watching the video lecture.

Figure 2: The badges used in our TDM course. We created them using Android Material Design Icons which are open source under Apache License 2.0.

4 Feedback

4.1 Relentless Feedback

In both pilots we collected relentless feedback. This feedback loop helped us to address questions raised and go over things that were unclear. We found it was really important to be flexible and adapt to what the students wanted. The twice-a-session mini-feedback form was really helpful for that, and we made it very clear which parts of the course on days 2 and 3 were in response to participants' feedback (see feedback analysis in Figure 3).

For example, a comment we received in the first pilot was that the students would prefer a quick recap of the previous session, which we then started doing; it was a great way to link sessions and keep the course material fresh in everyone's minds. Given the feedback, we also worked through the first section of a notebook together, so everyone had a clear idea of what to do.

The relentless feedback and our responses to it are among the reasons we believe we had such a high participant retention rate, which we were very pleased about. The pilots were free of charge, non-compulsory and ran over three afternoons. At least two thirds of the students who joined at the start of the week completed the last session on Friday.

We received constructive criticism but overall had very positive feedback on the course which, especially after the first pilot, made us feel very motivated having just completed teaching our first online course. One participant thought it was "Fantastic!" in our final feedback survey. Another wrote "The pair learning is excellent! Jupiter [sic] notebooks are a great tool. The real-time interactivity is super rewarding." Others reported that the lecturers and the "humour and playfulness of the examples" made the course "really great, especially for someone completely new to coding." Yet another person commented that they would use the skills they learned to gather data for their undergraduate dissertation research project.

"Fantastic! The pair learning is excellent! Jupiter notebooks are a great tool. The real time interactivity is super rewarding."

Figure 3: Feedback analysis for all surveys over the course of the boot camp and a quote from one student (with permission to share). We asked students to record the difficulty of the course, their progress and learning, their mood and how they felt about their collaboration in pairs as key performance indicators (KPIs) throughout the course.

4.2 Detailed Student Feedback

The following account is a more detailed reaction to our course provided by one of the students who participated in the first pilot of the TDM course and whom we include as an author on this paper:

I was one of the mature students on the first pilot of the TDM Workshop – an academic myself with quantitative methods and coding experience in Stata and MatLab but not in Python, nor any previous experience with text mining or natural language processing.

I appreciated the feedback requests at the end of each session via Microsoft Office forms and the immediate showcasing of the results for the whole class. Whenever there were bandwidth issues, the teaching team coordinated instantaneously and took over from each other.

The part of the course that taught me the most was the pair breakout rooms where we worked through computational Jupyter notebooks. The annotation of the exercises was invaluable, as were the videos showing one of the instructors working through a notebook themselves and, importantly, running into an error and explaining how we use the error message as guidance to fix the code. During the breakout sessions, having the three instructors drop in and answer any questions was an excellent balance of allowing the students independence while also feeling supported. Working with different partners every time was also very valuable. When taking an active role and talking through the lines of code and my understanding of the outcome, I was able to check in with my partner and be exposed to their style and approach to learning. Similarly, when taking the passive role and witnessing their way of working through a computational notebook, I could take away ideas of how to explain my thinking and understanding of the code in different ways.

Figure 4: Whiteboard with feedback generated by students in the course.

The distribution of new material via GitHub was very efficient. The interactions via the virtual whiteboard created playfulness and joy in the learning process. Although I did not participate in the second pilot and was not exposed to the Badges, I see them as another element of enhancing the playfulness of the process.

The TDM Workshop I participated in took place relatively early in the pandemic before "Zoom fatigue" had set in and participants were excited to engage. A year later, full-time students appear to have become more resistant to engaging in voice and/or visual participation.

There were some points that required improvement, for example typos in the annotation of the computational notebooks or some time being eaten up by technical troubleshooting. However, even these created an atmosphere of immediacy, flexibility and a sense of "We are all in this together".

Overall, I benefited immensely from taking the first pilot. Not only do I now have an idea of text mining tools and how to use them but I was also inspired by and adopted the computational notebooks in my own teaching of Investments in the Autumn of 2020. I also implemented regular feedback, which I felt provided the element of playfulness and joy, in an even more interactive platform with gifs, wordclouds and animations (using Mentimeter10).

5 Lessons Learned

Despite on-the-whole positive comments, we still found teaching in an online environment quite odd. We felt that we lost the sense of whether the students were engaged, learning and enjoying the experience because most participants had their cameras switched off, so we could not see their faces or body language. The feedback did help, even simply asking students to 'raise your hand if you can hear me', but it still remains odd to us to talk to a blank screen without seeing everyone.

We did not get everything right. The technology did not always work, but luckily one teaching team member is quite experienced in fixing software-related issues. We would have struggled without it. Initially we also did not give enough thought to accessibility; we just assumed the software would deal with that – it did not. We learned that we have to ask all students before the course if they might have issues in accessing course materials or video calls and make time to deal with any technical issues that could arise as a result.

10https://www.mentimeter.com


We learned that students can be shy when it comes to talking to each other and putting on their webcams. We found ice breaker questions upfront, which can be answered playfully on a whiteboard, very helpful for putting students at ease and having some fun. We used some simple things that made a lot of difference. We played music in the room before the class so when students joined, they knew we were there and that their speakers were on. We made extensive use of the virtual whiteboard to gather anonymous feedback really fast, in addition to frequent short surveys (see Figure 4). We also included questions in the notebooks that buddies had to work on together to encourage discussion. The notebooks contained essential TDM coding tasks and more complex tasks for the curious. This allowed some students to extend their learning without others feeling they were left behind.

We also found that the amount of content that we could cover grew as the course went on. There were initial issues with the technology which needed fixing, and we were all getting used to the new way of teaching. The conversation also became more natural as time went on. At first it was quite odd to drop into the break-out rooms, but by the second session this became easier and we were all chatting a lot more. The majority of students really liked the pair programming; they liked the flexibility and the content. They really felt they were part of the course in a way that is not always experienced online.

As instructors, we found that teaching in this way, switching between modes, lecturing, answering the chat, live coding and responding to issues, is really cognitively challenging. It is hard work and cannot easily be done by one individual. The technologies we use are complex and can fail, but they are for the most part intuitive and provide a wide range of ways to teach and interact. We learned that online teaching is exhausting but done right it can still be really rewarding. We all enjoyed the interactions and felt part of a little community. After the course we did a debriefing and each wrote down three things we liked about the course and something we wished we could have achieved (see Appendix A).

We, the TDM course teachers on this paper, have, in the same way as the author who participated in the course, benefited immensely from what we learned through these pilots before delving into our online teaching in the first term of 2020/21.

6 Summary and Future Work

In this paper, we have reflected on how we ported a TDM course online as a result of the global pandemic caused by COVID-19. We described the content of the course and how we adapted it over two pilot runs. We found different features of Blackboard Collaborate particularly useful for teaching, especially the use of a virtual whiteboard and dividing the class up into break-out rooms. Students responded positively to learning in pairs and to course materials broken down into digital badges. Finally, the relentless feedback we collected throughout each session and after the course helped us as teachers to improve the course and how we teach it. To make a course like this a good learning experience, it is really important to build community and get students to talk not just to the teachers but to each other as they would in a classroom.

Being caught in lockdown encouraged us to innovate, and our experience demonstrates what it is possible to achieve virtually despite the limitations. The experience of learning in a classroom is difficult to replicate online; however, we are confident that these types of virtual environments will play a role in education beyond this pandemic, complementing and enhancing traditional learning.

Going forward we would like to experiment with teaching this course in different ways: asynchronously to students joining from different time zones, to much larger groups to understand where the limits are in terms of number of participants given staff capacity, or in a writer-retreat type setup where the instructors touch base with students several times during the day. We will also look at how this course can be pivoted back to on-campus teaching for students who can join in person, once the current pandemic slows down, lockdown restrictions are relaxed and on-campus teaching resumes. We are pleased to announce that this course will be part of the post-graduate programme taught at EFI.

Acknowledgements

We thank Professor Laura Cram for supporting us in the long-term development of this course as part of EFI's PGT training programme. Our course was built on and expands material developed for a Library Carpentries course11 with the support of the Centre for Data, Culture and Society.12 We would also like to thank Siobhan Dunn and her colleagues at EFI for managing the registration for our courses and Marco Rossi at the University of Edinburgh Business School's Student Development Team for offering the course to their students.

11http://librarycarpentry.org/lc-tdm/
12https://www.cdcs.ed.ac.uk

References

Apoorv Agarwal. 2013. Teaching the basics of NLP and ML in an introductory course to information science. In Proceedings of the Fourth Workshop on Teaching NLP and CL, pages 77–84, Sofia, Bulgaria. Association for Computational Linguistics.

Beatrice Alex, Claire Grover, Richard Tobin, and Jon Oberlander. 2019a. Geoparsing historical and contemporary literary text set in the City of Edinburgh. Language Resources and Evaluation, 53(4):651–675.

Beatrice Alex, Claire Grover, Richard Tobin, Cathie Sudlow, Grant Mair, and William Whiteley. 2019b. Text mining brain imaging reports. Journal of Biomedical Semantics, 10(1):1–11.

Sian Bayne. 2015. Teacherbot: interventions in automated teaching. Teaching in Higher Education, 20(4):455–467.

Steven Bird, Ewan Klein, and Edward Loper. 2005. NLTK tutorial: Introduction to natural language processing. Creative Commons Attribution.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural language processing with Python: Analyzing text with the natural language toolkit. O'Reilly Media, Inc.

Jim Clifford, Beatrice Alex, Colin M Coates, Ewan Klein, and Andrew Watson. 2016. Geoparsing history: Locating commodities in ten million pages of nineteenth-century sources. Historical Methods: A Journal of Quantitative and Interdisciplinary History, 49(3):115–131.

Darina Dicheva, Christo Dichev, Gennady Agre, and Galia Angelova. 2015. Gamification in education: A systematic mapping study. Journal of Educational Technology & Society, 18(3):75–88.

Tim Fawns, Gill Aitken, and Derek Jones. 2019. Online learning as embodied, socially meaningful experience. Postdigital Science and Education, 1(2):293–297.

David Gibson, Nathaniel Ostashewski, Kim Flintoff, Sheryl Grant, and Erin Knight. 2015. Digital badges in education. Education and Information Technologies, 20(2):403–410.

Philip John Gorinski, Honghan Wu, Claire Grover, Richard Tobin, Conn Talbot, Heather Whalley, Cathie Sudlow, William Whiteley, and Beatrice Alex. 2019. Named entity recognition for electronic health records: A comparison of rule-based and machine learning approaches. ArXiv.

Brian Hanks, Sue Fitzgerald, Renée McCauley, Laurie Murphy, and Carol Zander. 2011. Pair programming in education: A literature review. Computer Science Education, 21(2):135–173.

Marti A Hearst. 2005. Teaching applied natural language processing: Triumphs and tribulations. In Proceedings of the Second ACL Workshop on Effective Tools and Methodologies for Teaching NLP and CL, pages 1–8.

Dirk Hovy. 2020. Text Analysis in Python for Social Scientists: Discovery and Exploration. Cambridge University Press.

David Jurgens and Lucy Li. 2018. A Look Inside the Pedagogy of Natural Language Processing. 25/09/2018. Accessed on 23/04/2021.

Ray Land, Glynis Cousin, Jan HF Meyer, and Peter Davies. 2005. Threshold concepts and troublesome knowledge (3): implications for course design and evaluation. Improving Student Learning Diversity and Inclusivity, 4:53–64.

Clare Llewellyn, Claire Grover, Beatrice Alex, Jon Oberlander, and Richard Tobin. 2015. Extracting a topic specific dataset from a Twitter archive. In International Conference on Theory and Practice of Digital Libraries, pages 364–367. Springer.

Barbara McGillivray, Beatrice Alex, Sarah Ames, Guyda Armstrong, David Beavan, Arianna Ciula, Giovanni Colavizza, James Cummings, David De Roure, Adam Farquhar, et al. 2020. The challenges and prospects of the intersection of humanities and data science: A white paper from The Alan Turing Institute.

Lin Y Muilenburg and Zane L Berge. 2016. Digital badges in education: Trends, issues, and cases. Routledge.

National Library of Scotland. 2019. A Medical History of British India. Accessed on 23/04/2021.

Nathaniel Ostashewski and Doug Reid. 2015. A history and frameworks of digital badges in education. In Gamification in Education and Business, pages 187–200. Springer.

Jen Ross, Michael Sean Gallagher, and Hamish Macleod. 2013. Making distance visible: Assembling nearness in an online distance learning programme. International Review of Research in Open and Distributed Learning, 14(4):51–67.

Laurie Williams. 2006. Debunking the nerd stereotype with pair programming. Computer, 39(5):83–85.

Laurie Williams, Robert R Kessler, Ward Cunningham, and Ron Jeffries. 2000. Strengthening the case for pair programming. IEEE Software, 17(4):19–25.


A Appendix: Three Stars and a Wish

Clare:

★ I liked that we were all willing to try anything and go beyond our comfort zone and fail

★ I liked the way we naturally supported each other and took different roles, and swapped those roles

★ I enjoyed learning the technology, something I thought I'd hate!

✩ I wish we could find a way to make the students more interactive, chat, turn their cameras on.

Pawel:

★ Relentless feedback: asking 30s pulse-checking questionnaires twice a day; we used Office Forms.

★ Staff room: an easy and persistent internal comms channel within the course development team; we used Teams.

★ Poetic licence: the person who created the final version of the notes was allowed to make adjustments to the content, so they fit the format.

✩ Our greatest ally was the tech that just worked. I wish we could also provide a hybrid experience of being in the same room.

Beatrice:

★ Great team of people, great combination of skills

★ Brilliant to see feedback from students coming in

★ I learned a lot about teaching online

✩ I wish I could witness the learning better online.


Proceedings of the Fifth Workshop on Teaching NLP, pages 149–159, June 10–11, 2021. ©2021 Association for Computational Linguistics

Teaching NLP outside Linguistics and Computer Science classrooms: Some challenges and some opportunities

Sowmya Vajjala
National Research Council, Canada

[email protected]

Abstract

NLP's sphere of influence has extended well beyond computer science research and the development of software applications in the past decade. We see people using NLP methods in a range of academic disciplines from Asian Studies to Clinical Oncology. We also notice the presence of NLP as a module in most of the data science curricula within and outside of regular university setups. These courses are taken by students from very diverse backgrounds. This paper takes a closer look at some issues related to teaching NLP to these diverse audiences based on my classroom experiences, and identifies some challenges the instructors face, particularly when there is no ecosystem of related courses for the students. In this process, it also identifies a few challenge areas for both NLP researchers and tool developers.

1 Introduction

Until a few years ago, it was common to see NLP courses predominantly taught in either Computer Science or Linguistics departments, attended primarily by students from those two departments. In the past few years, there has been an increasing interest in NLP across disciplines. This is also reflected in the arrival of focused NLP courses such as "Clinical NLP"1 and "Introduction to Computational Literary Analysis"2.

This trend is by no means specific to NLP alone. There has been a growing interest in including courses related to computing/programming in many liberal arts and sciences departments over the past few years. Guzdial (2021) lists three reasons why many departments want such courses for their students: to support new discoveries through computational methods; to use new modes of interactive communication through apps, simulated environments, etc.; and to study the cultural, social and political influence of models and how to improve them. This need resulted in the creation of data science programs open to all students from various colleges/departments. Naturally, the introduction of such programs started discussions around what should be included in such generic data science introductory curricula (Krishnamurthi and Fisler, 2020). An introductory course in NLP is commonly offered either as an elective or as a part of the main coursework in most such data science course/minor/certificate programs in universities. Such a course is also a common component in industry-facing certification programs offered outside of university settings.

1https://www.coursera.org/learn/clinical-natural-language-processing
2https://icla2020.jonreeve.com/

However, popular textbooks and course materials on NLP are not created taking these diverse audiences and their motivations into consideration. They cover a range of topics in depth, requiring a deeper technical background that students coming from diverse academic backgrounds may not possess. While there are some recent efforts to write books focused on specific groups of students (Jockers, 2014; Hovy, 2020), they tend to focus on a smaller subset of topics within NLP.

Further, most courses and textbooks focus on the algorithms, with much less attention given to other essentials such as data collection, text extraction, pre-processing and other practical issues one will encounter when working on new research problems (and datasets), or while deploying NLP systems in some application scenario. Considering that many students take these courses with goals unrelated to performing NLP research later (e.g., using NLP in their disciplinary research, working for a software company, etc.), this lack of appropriate materials can be seen as a potential gap between current NLP teaching and what is required. Additionally, while research progress feeds into the development of course syllabi, we don't see the opposite, i.e., teaching experiences informing NLP research.


Against this background, this paper summarizes my experiences with teaching multiple NLP courses to a diverse student body (STEM, humanities, social sciences) at both undergraduate and graduate levels, and identifies some challenge areas for classroom instruction. At the same time, it also identifies some relatively understudied problems in NLP research and some issues to consider for tool developers. In short, the contributions of the paper can be summarized as follows:

• It identifies the challenges of teaching a range of student audiences outside of linguistics and computer science departments.

• It identifies a few challenge areas for NLP research and practice, which should be addressed to further the use of NLP methods and techniques beyond computer science and related disciplines.

The rest of this paper is organized as follows: Section 2 gives a brief overview of existing work on teaching NLP to put this paper's contributions in context. Section 3 describes in detail the NLP courses I taught, with diverse goals and for diverse audiences. The next section then identifies some challenges in teaching in all these different contexts (Section 4). Section 5 elaborates on how these teaching experiences help us identify some challenge areas for NLP researchers as well as tool developers. Section 6 summarizes and concludes the paper.

2 Teaching NLP - A short review

Since the first Teaching NLP workshop almost two decades ago, there has been some discussion in the community on various issues related to NLP pedagogy through the TeachCL/TeachingNLP workshop series,3 and other events such as Teach4DH.4 Published work on teaching NLP can be broadly classified into four groups:

1. Sharing insights about building new CL programs in various intra- and inter-disciplinary contexts (e.g., Lonsdale, 2002; Dale et al., 2002; Koit et al., 2002; Baldridge and Erk, 2008; Zinsmeister, 2008; Reiter et al., 2017)

3https://www.aclweb.org/anthology/venues/teachingnlp/

4https://teach4dh.github.io/

2. Focused discussion about the challenges in the design of single courses, sometimes targeting a broader/diverse audience (e.g., Liddy and McCracken, 2005; Xia, 2008; Madnani and Dorr, 2008; Agarwal, 2013; Navarro-Colorado, 2017)

3. Development and usage of specific tools/games/strategies for teaching NLP (e.g., Bontcheva et al., 2002; Lin, 2008; Bird et al., 2008; Hockey and Christian, 2008; Levow, 2008; Barteld and Flick, 2017)

4. Discussion around the design of linguistic problems for the North American Computational Linguistics Open Competition (NACLO)5 and other events (e.g., Bozhanov and Derzhanski, 2013; Littell et al., 2013).

However, to my knowledge, there hasn't been much discussion on the gap between what students learn in the classroom versus how they use it later (whether in research or practice), on how we should adapt the syllabus depending on the audience for the course, and on how teaching experiences can potentially inform NLP research and practice. I address these issues in this paper, based on my experiences with teaching NLP.

3 Description of NLP Courses

This section describes the teaching scenarios in which I taught NLP between 2016 and 2021, which form the basis for the observations discussed in this paper. The full-semester courses described below were taught at an American university during 2016-18, and the online, compact courses were taught at two German universities during 2020-21. All classrooms were small (< 30 students) and there were no teaching assistants. Inspired by Bird (2008)'s classification of students' backgrounds and goals for using NLTK6, and Fosler-Lussier (2008)'s grouping of students taking NLP courses, a grouping of my NLP teaching contexts is shown in Table 1, taking student background and course goals into consideration. The "General" courses, as the name indicates, target a broader audience, while the "Focused" courses were created to address specific student needs.

The rest of this section presents a detailed overview of the courses taught under these groups.

5https://nacloweb.org/
6Table 3.1 in http://www.nltk.org/book/ch00.html


Undergrad, General: A non-technical course intended to give an overview of language and computers, open for all disciplines.

Undergrad, Focused: An introductory course taught as a part of a data science curriculum for Liberal Arts and Sciences (LAS) majors.

Grad/Advanced undergrad, General: Two general NLP courses with a focus on various algorithms and applications.

Grad/Advanced undergrad, Focused: Two introductory NLP courses for specific graduate groups (applied linguists and economists).

Table 1: NLP Teaching Contexts in terms of student groups

I hope to achieve the following goals through this long overview:

1. share some insights about what to include/exclude in the syllabus/exercises when developing a new NLP course for a non-traditional audience

2. provide a useful context for the challenges that will be discussed later in the paper.

3.1 Undergrad-General (U-Gen)

I taught one course7 which could be classified as a general undergraduate course that is open for all. It is based on the textbook by Dickinson et al. (2012), which is used to teach several such "Language and Computers" courses around the world. Students in this course came from all years of the undergraduate curriculum, but were dominated by freshmen and sophomores from Sciences and Engineering. The course had an opt-in programming component for enthusiastic students, but was otherwise generic enough that freshman undergraduate students from any discipline could follow it.

Syllabus: The topics covered followed the book's structure, with more contemporary information. For example, the topic "tutoring systems" included a discussion of software such as DuoLingo, and the topic "dialog systems" included a discussion of virtual assistants such as Siri, Cortana, etc. The focus in using these tools in this classroom was to explain how such systems work and to evaluate their performance in real-world contexts, rather than to teach how to build such systems. Two sessions were spent on giving a non-technical overview of NLP, with some discussion of recent trends and of the broader impact of NLP on other areas of study.

7https://github.com/nishkalavallabhi/LING120-Fall2017/

Evaluation: The assignments in this course required students to explore a few existing NLP tools/demos (e.g., corenlp.run) or browse existing corpora (e.g., corpus.byu.edu). The students also did a group presentation that involved using readily available day-to-day software which has some NLP component, and summarizing its performance with concrete examples and qualitative measures. The final exam for this course had two parts: the first part required the students to write a report performing an error analysis of two commercial systems doing the same task (e.g., machine translation, speech recognition, etc.), and the second part asked them to write a perspective essay on the impact of language technologies on society.

3.2 Undergrad-Focused (U-Focused)

I taught one course8 which was one of the electives in an undergraduate data science minor program. This was open to students from all departments under the school of Liberal Arts & Sciences. It was taught twice and attracted students from a wide range of disciplines including, but not limited to: Literature, Sociology, Journalism, Management, World Languages, Linguistics and Computer Science.

Syllabus: The goal of this course was to introduce students to methods of discovering language patterns in text documents and applying them to solve practical text analysis problems in their disciplines. The course required students to write a few programs, although knowledge of programming was not a pre-requisite. Since many of these students were getting familiarized with using R as a part of their curricula (for statistics courses), and I viewed it as a language in which they could make some progress without having to gain expertise in programming simultaneously, the course was taught in R. Relevant programming concepts and data structures were introduced on the fly, in the context of the topics of the course. Jockers (2014) was used as the main textbook for this course, since it assumed no programming background from its target audience (Literature students), and it used R.

8https://github.com/nishkalavallabhi/LING410X-Spring18

In terms of course organization, a few sessions were spent on introducing students to the basics of installing and using R, specifically with the goal of working with textual data in mind. The next topic discussed methods to collect, extract and clean texts/corpora. This was followed by teaching basic exploratory corpus analysis, along with keyword and key phrase extraction approaches. The next topics focused on text classification and topic modeling, two of the most commonly used methods with textual data across disciplines. The final topic of the course discussed various means of visualizing textual data.
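To give a flavor of what such an exploratory session might involve (a minimal sketch; the course itself used R and its libraries, so this Python version, with invented toy data, is purely illustrative):

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep alphabetic word tokens.
    return re.findall(r"[a-z']+", text.lower())

# Invented toy corpus standing in for the students' own datasets.
corpus = [
    "Text analysis helps literature students find patterns in novels.",
    "Patterns in text can be counted, compared, and visualized.",
]

counts = Counter(tok for doc in corpus for tok in tokenize(doc))

# A crude keyword list: frequent tokens that are not function words.
stopwords = {"in", "and", "can", "be", "the", "a", "of"}
keywords = [w for w, c in counts.most_common(10) if w not in stopwords]
print(keywords)
```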

Evaluation: The course included assignments that followed the topic structure and required students to write small R programs to extract text patterns, scrape data from different forms of documents (e.g., webpages, Twitter), extract keywords, n-grams, etc., follow a step-by-step process making alterations to pre-written code for training a text classifier and a topic model, and build basic visualizations of textual data (e.g., word clouds, dispersion plots, etc.). All assignments relied on learning to use existing R libraries instead of focusing on building everything from scratch. The students did a group presentation which involved visualizing textual data. The final exam consisted of doing a small project and submitting a term paper. The students were required to pick an interesting dataset for a problem they encountered in their own discipline and use one of the methods they learnt in the course.

3.3 Grad/Advanced undergrad-General (G-Gen)

I taught two courses that are the closest to a typical Natural Language Processing course taught in universities across the world: the first course followed the standard structure of traditional textbooks, and the second course focused specifically on how to do NLP in the absence of large annotated datasets. Students in both courses primarily consisted of advanced undergraduate or graduate students coming from linguistics and computer science, with varying degrees of background in programming, linguistics, and computer science. All students had passed at least one programming course prior to attending these courses.

Statistical NLP: The course's objective was to teach some of the common algorithms and techniques that form the foundation of modern-day NLP. Accordingly, starting with regular expressions and language models, the topics covered included part-of-speech tagging, parsing, discourse, information extraction, text classification and other NLP applications, and ended with an introduction to text embedding representations along with an overview of neural network architectures. Jurafsky and Martin (2008) and Goldberg (2017) were used as the prescribed textbooks. Assignments focused on implementing some of the popular NLP algorithms from scratch, and the final project involved modeling one of the common tasks, such as text classification or information extraction, using existing datasets and producing a technical report describing the same.
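As an illustration of the "from scratch" style of these assignments (a hypothetical sketch, not an actual course assignment), a bare-bones bigram language model fits in a dozen lines:

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    # Count how often each word follows another, with sentence markers.
    counts = defaultdict(Counter)
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    return counts

def prob(counts, prev, cur):
    # Maximum-likelihood estimate of P(cur | prev).
    total = sum(counts[prev].values())
    return counts[prev][cur] / total if total else 0.0

model = train_bigrams(["the dog is in the school", "my dog is small"])
print(prob(model, "dog", "is"))  # 1.0: "dog" is always followed by "is" here
```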

NLP without data9: The objective of this course was to address a real-world scenario that is not typically covered in NLP courses: how do we apply NLP methods in the absence of annotated training data? Hence, after giving a broad overview of NLP and its uses both in the real world and in other disciplines, and discussing in detail the NLP system development pipeline and text representation, the course focused on the following topics (the second of which is sketched in code after the list):

1. Corpus collection, text extraction and exploratory analysis

2. Automatically labeling data and performing data augmentation

3. Working with small datasets and transfer learning
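To make the second topic concrete, the sketch below shows automatic labeling with simple keyword rules, in the spirit of weak supervision; it is a hypothetical illustration, not actual course material:

```python
# Hypothetical keyword-rule labeler in the spirit of weak supervision;
# labels and rules are invented for illustration.
RULES = {
    "sports": {"match", "goal", "team"},
    "finance": {"stock", "market", "bank"},
}

def auto_label(text):
    tokens = set(text.lower().split())
    scores = {label: len(tokens & kws) for label, kws in RULES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None  # abstain when no rule fires

print(auto_label("the team scored a late goal"))     # sports
print(auto_label("bank stocks fell on the market"))  # finance
print(auto_label("no rule matches this sentence"))   # None
```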

This course was taught remotely, as a compact, intensive one-month online course in January 2021 (9 hours per week, for 3 weeks) due to the pandemic situation. It primarily relied on a collection of blog posts, research papers, code tutorials and Python notebooks from various sources, as no available textbook specifically addressed this topic. It had two assignments, which required students to explore the usage of existing NLP tools to perform given tasks and to "generate" labeled data for information extraction, and to write up a report evaluating their approaches with some error analysis. There was a group presentation, where students had to pick from a selected collection of recent research articles on the course's topics. Finally, there was an optional term paper (for extra credit), where the students could pick a resource-scarce NLP scenario and apply what they learnt in this course to address it.

9https://github.com/nishkalavallabhi/SfSCourseJan2021

3.4 Grad/Advanced undergrad-Focused (G-Focused)

I taught two introductory NLP courses that were specific to graduate students: one for applied linguistics Masters/PhD students and the other for Economics Masters/PhD students.

NLP for Applied Linguists: The applied linguistics graduate program at the university where this course was taught consisted of students studying corpus linguistics, computer-assisted language teaching/learning, and technical communication. Since the use of language processing tools in these areas is increasing day by day, the goal of this course was to teach some text processing methods that are directly applicable in their dissertation research. Accordingly, the course introduced various NLP techniques, from regular expressions to using parsers, in the context of technologies for language learning and corpus analysis, e.g., spelling/grammar checkers, pattern extractors for corpus analysis, etc. Bird et al. (2008) and Jurafsky and Martin (2008) were used as the textbooks for this course, along with Church (1994).
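As a rough illustration of the pattern-extraction side of this material (a hedged sketch, not the actual course exercises), a few lines of Python yield a simple regex-based concordancer of the keyword-in-context kind used in corpus analysis:

```python
import re

def concordance(text, pattern, window=25):
    # Print each regex match with a window of surrounding context,
    # in the style of a corpus-linguistics keyword-in-context view.
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - window):m.start()]
        right = text[m.end():m.end() + window]
        print(f"...{left}[{m.group()}]{right}...")

sample = "Learners often confuse affect and effect; the effect is visible."
concordance(sample, r"\b[ae]ffect\b")
```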

The course had five assignments which involved writing small programs covering the topics of the course, and a group project, followed by a one-to-one oral exam to assess student understanding of the relevance of the course to their area of study.

NLP for Economists10: This was originally planned as a one-week intensive course, but was taught online over three weeks due to the pandemic situation in Fall 2020. It also included instruction on Python fundamentals. The syllabus comprised four broad topics:

10https://econnlpcourse.github.io/

1. Introduction: An overview of NLP and its usefulness in Economics; Python fundamentals

2. Python & textual data, which focused on data collection, text extraction, pre-processing and text representation

3. NLP methods for economics, covering exploratory corpus analysis, text classification, topic modeling, and giving an overview of others such as information extraction and text summarization

4. NLP in Economics: research paper readings and discussion

The students had to do three assignments covering the first three topics, which involved writing small Python programs to use existing tools and writing an analysis of how they work for their domain data. They also did a group presentation picking a paper that used NLP methods to address research questions in economics, chosen most often from their own disciplinary journals instead of NLP conferences. Students also had to submit a term paper which involved the creation of a problem statement describing a new economics problem that can be addressed using NLP methods. No specific textbook was used for the course, although several recommendations were listed in the syllabus. The course videos and slides, and links to publicly available online content, were the primary materials for the course, as there was no appropriate textbook available to suit the needs of this audience.
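As an indication of what the topic-modeling portion might look like in practice (a sketch assuming scikit-learn and invented toy documents; the actual exercises may have used other libraries and the students' own data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Invented stand-ins for economics texts.
docs = [
    "central bank raises interest rates to curb inflation",
    "inflation expectations drive bond market yields higher",
    "labor market shows strong employment growth this quarter",
    "employment and wages rise as firms keep hiring workers",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Show the top words per topic.
terms = vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-4:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")
```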

As can be seen from all the above course descriptions, the courses addressed a range of audiences and, accordingly, differed in the way they were organized as well as in the topics that were covered. Most of these courses can be called "non-traditional" NLP courses, considering their contents and intended audience. In the following section, I will elaborate on some of the general and course-specific challenges I faced in the design and delivery of these courses.

4 Challenges for Teaching

While all the courses received generally good student feedback towards the end, there are several issues that could have been managed better. Some of these issues arise because I was teaching a diverse, non-conventional NLP audience, and hence we don't have readily available solutions yet. Each course, of course, comes with its own challenges. In this section, I will discuss some of the more general teaching challenges I faced across all courses, and how I addressed them.

Student goals and Course Contents: For many of the courses described in the previous section, the students did not have an ecosystem of related courses in their curriculum because there were no NLP-focused research groups or teaching programs at the university. For all except the G-Gen courses, any NLP course is potentially a one-off course that covers programming, NLP, and anything remotely connected with computing for the students. The student goals, accordingly, are related to just fulfilling a course requirement, gaining some quick actionable insights that they can use right away, or demonstrating relevant skills when they soon apply for a job. In this context, I found it particularly challenging to adapt the syllabi such that students get readily usable practical skills, along with a solid foundation to explore the topics further on their own if needed.

An approach that seemed to work was to tie each concept of the course to a practical use case (either a software application or some research problem in another discipline) and give assignments that are closer to real-world scenarios. In the U-Gen course, group activities involving apps the students regularly use, such as Duolingo, Google Search, Siri, etc., generated a lot of interesting questions in class. In the G-Focused courses, providing an overview of relevant disciplinary research that uses NLP, and encouraging discussions about how the students can use NLP in their research, turned out to be useful. The NACLO exercises helped to introduce the challenges of working with textual data in a way that held students' attention and generated interesting discussions, in all four cells of Table 1.

Faculty goals and Course Contents: As mentioned above, many students lacked the ecosystem of related courses (all except G-Gen). Thus, a lot of surrounding topics that are not typically a part of NLP courses had to be covered in these classes. Specifically, these topics revolved around concepts of software programming and the mathematics needed to understand some of the basic NLP methods. Two questions I constantly grappled with, in terms of my own teaching goals for the U-Focused and G-Focused courses, were: how much mathematics/programming/linguistics should be included? And how can I encourage good programming/software engineering practices while still focusing on teaching the students to solve NLP-related problems?

For the first question, keeping mathematics and linguistics to the bare minimum, situating all programming-related exercises in the context of NLP/textual data, and providing additional resources/tutorials for those interested in knowing more about the math/linguistics behind NLP helped. For the second question, I tried to write clean, well-commented code for classroom examples as much as possible, and showed variations of writing the same piece of code, on top of the exercises in the prescribed textbook, discussing why we may choose one over the other. I also encouraged students to review each other's submissions and post discussions about programming in the forum for some of these courses. Both these teaching strategies created some awareness about programming practices among the students.

However, there were always students who wanted more or less math/linguistics/programming during the classroom sessions themselves, or in the form of assignments, especially in the G-Gen and G-Focused courses. Some students felt sufficiently challenged, while others were overwhelmed. This issue is by no means specific to my experiences, and has been documented in past work on teaching NLP too (Brew et al., 2005; Koit et al., 2002). I addressed this by providing optional, additional (ungraded) programming exercises and reading materials in all the courses.

In terms of programming practices, I found it particularly challenging to introduce the idea of version control for code to students in the U-Focused and G-Focused classrooms. Students without prior background in any form of programming found it difficult to understand Git. I emphasized its importance, spent a small amount of time discussing why version control is useful and how to do it, and provided a few tutorial references. While none of the students used version control during the courses, I hope that as they gain more experience, they will understand its relevance and adopt it in practice.

Textbooks and the real world: If we ask ourselves what the students will do with what they learn in these courses, provided they manage to stick with NLP, we can think of three options: a) pursue further studies/research focused on NLP; b) pursue further studies/research in their own discipline, using NLP methods in their work; c) work in a company on NLP projects. Among these, we can safely assume that the last two are the most likely scenarios for the audience I taught.

In my experience as a software engineer and as a data scientist in industry, and while collaborating with researchers from other disciplines, some of the most common issues I encountered are:

• How do we collect and label the data needed to train NLP models?

• How can we do a clean and accurate extraction of text from various file formats?

• What are efficient ways of selecting models, going beyond standard intrinsic evaluation measures (e.g., considering external measures, deployment costs, maintenance issues, etc.)?

Most of these issues are also commonly experienced by NLP researchers when they work on a new problem or on creating a new corpus resource. Yet, none of these issues are discussed even cursorily in available/standard NLP textbooks, which means that students have to rely on a lot of online blogs and other resources.
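To illustrate the text-extraction issue, here is a hedged sketch of pulling clean text out of HTML with the widely used BeautifulSoup library; real pipelines face many more formats (PDF, scanned documents, etc.):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = """
<html><body>
  <nav>Home | About</nav>
  <p>Central banks watch inflation closely.</p>
  <script>trackUser();</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Drop elements that carry no document content.
for tag in soup(["script", "style", "nav"]):
    tag.decompose()
text = " ".join(soup.get_text().split())
print(text)  # Central banks watch inflation closely.
```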

I addressed these issues by including topics such as "what does an NLP system development pipeline look like?" and "how is NLP used beyond CS research and software development scenarios?" in the introductory "NLP overview" sessions themselves in the G-Gen classrooms, introducing the students to the challenges one faces at each step. In the rest of the course too, I addressed these issues in more detail as needed, providing code examples and including real-world case studies. In particular, one course described in this paper, "NLP without annotated data", focused entirely on the first issue mentioned earlier.

In the U-Focused and G-Focused classrooms, I added a detailed overview of how NLP is used in various disciplines as a research method, and organized student group discussions around contemporary research papers in these disciplines. Even in the relatively simpler U-Gen course scenario, we ran into the issue of the existing textbook being outdated, which required me to look for new materials on the topic. As mentioned earlier, I addressed this issue by introducing discussions and assignments on more contemporary developments related to the textbook's topics.

I found it particularly challenging to cover all of these aspects in one course while also giving enough background in core NLP topics, though. In future, it would be useful to develop some structured reading materials/tutorials/workbooks that can serve as go-to sources on these topics, rather than asking the students to just explore on their own from a wide range of available material on the web.

Graded Resources: Another recurring challenge I ran into relates to the availability of appropriate textbooks. While I found introductory textbooks that suited some of the classes, the students in the G-Focused group repeatedly mentioned the lack of a progressively difficult self-learning path. As they rightly pointed out, we either have introductory books which give a basic idea of selected topics (e.g., Jockers, 2014) or very advanced textbooks (e.g., Jurafsky and Martin, 2008), but nothing in between. The issue of a lack of resources was also highlighted by Hearst (2005) in the context of teaching an applied NLP course. Although this has to some extent been addressed by some of the more practically oriented books published by O'Reilly Media11 and Manning Publications12, those still address industry-facing scenarios, which did not always account for these students' needs.

Teaching a diverse audience: All except the G-Gen courses described in this paper had either students coming from diverse disciplines, or a homogeneous group belonging to a non-STEM discipline. In such cases, it is important that the students understand the relevance of the course to their own discipline. This can be challenging if we, as instructors, don't know the interesting challenges in those various disciplines. The need to target courses to the student background was also discussed in the past, in Baldridge and Erk (2008). To this end, I had several informational chats with faculty members from these disciplines to understand the use cases for NLP in their research, apart from reading research that used NLP methods in the respective disciplines. This was definitely useful in making NLP more relevant to the student groups. Co-teaching such courses with faculty from other disciplines is an idea to explore in future, provided the class is homogeneous and both instructors are willing to invest time and energy into learning the methods of the other discipline.

11https://www.oreilly.com/
12https://www.manning.com/


These are some of the challenges I faced in teaching NLP, and how I addressed them. However, all these issues are by no means completely resolved, and more discussion is needed in this direction. Teaching NLP workshops and the sharing of teaching resources (and anecdotes) can be a starting point towards addressing these issues.

5 Challenges beyond the Teaching context

Generally, we see some discussion about how research informs teaching and the choice of topics in a course. We may notice the evolution of the standard Natural Language Processing course over the past two decades in terms of the topics covered13. We also see a lot of discussion around challenges encountered in teaching itself, as seen in past Teaching NLP papers. However, when teaching an outside audience, we can uncover hitherto under-studied research questions that are potentially more relevant when doing NLP outside of computer science and linguistics. I will focus on such issues in this section, splitting them into two groups: NLP research and tool development.

5.1 NLP Research

The questions raised by the students in the G-Focused courses identified some under-studied issues in contemporary NLP research, which are described below:

Predictions and Causality: In NLP, we generally work with various representations of textual data and build predictive models. Though there is an increasing body of research on understanding the predictions, it is not common to see a causal analysis. However, in some disciplines, such a causal relationship is important for any prediction. While teaching an economics classroom, we repeatedly ran into discussions about drawing causal inferences from a text classifier or a topic modeling decision. These led a student to the question: will NLP ever be really useful in economics research, beyond being a fancy new technique?

There is some existing work that uses causal analysis along with NLP methods in the economics literature (Gentzkow et al., 2019), but there is not much work in that direction within NLP research focusing on specific applications. Although there is some recent interest among NLP researchers in causal inference (Veitch et al., 2020; Keith et al., 2020), we do not yet have off-the-shelf teaching resources to incorporate such aspects into the classroom, to my knowledge. More NLP research and inter-disciplinary collaborations in this direction may lead to the creation of the much-needed teaching and software resources to address this important issue in future.

13NLP Pedagogy interviews by David Jurgens and Lucy Li: shorturl.at/bntR5

Text as secondary data: Unlike typical NLP-focused problems, text is a form of secondary data for many other disciplines (e.g., economics, again). Thus, their expectation is that the primary source of information for model predictions should come from the primary data sources. While discussing feature representations for text, we repeatedly ran into the issue of how to effectively combine these different feature representations coming from primary and secondary sources. It is certainly possible to concatenate representations, or create multimodal representations, and let the model figure out feature importance, as we commonly do in NLP research. However, the students questioned this approach and asked for methods to develop models such that their primary feature representations were given importance over the textual representations.
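The concatenation approach mentioned above can be written in a few lines (a minimal sketch with invented data, assuming scikit-learn and SciPy; note that it shows the common NLP default, not the primary-features-first method the students asked for):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented example: "primary" tabular features alongside "secondary" text.
texts = ["rates rise amid inflation fears", "employment grows steadily"]
primary = np.array([
    [3.2, 0.5],  # e.g., hypothetical GDP growth and rate change
    [2.9, 0.0],
])

text_feats = TfidfVectorizer().fit_transform(texts)
X = hstack([csr_matrix(primary), text_feats])  # one combined feature matrix
print(X.shape)  # (2, 2 + vocabulary size)
```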

I am not aware of a commonly used approach for this that can be incorporated directly into a course through theory or practical exercises. However, I believe we should acknowledge that the students are more familiar with research methods from their own disciplines and want to use NLP within that framework, rather than completely switching to "the NLP way" of building models. New research on using text representations as a secondary feature vector alongside primary features may be a step towards addressing this issue.

Construct validity of text representations: Construct validity refers to whether a feature actually measures what it is supposed to measure. In the language testing literature, construct validity is frequently studied in the context of automated language assessment software. In the Applied Linguistics graduate course, we briefly discussed automatic essay scoring and spelling/grammar correction systems as use cases of NLP in language assessment and teaching. During these discussions, the students, who typically came from a language assessment background, questioned the lack of traditional steps such as exploratory analyses to understand the corpus and evaluating the construct validity of the feature representations we use in NLP. While we see some discussion around these issues in the context of real-world NLP system development (e.g., Sheehan, 2017; Beigman Klebanov and Madnani, 2020), we don't see much discussion of such issues in research describing the development of new NLP methods, or in NLP textbooks. As we see NLP being used more and more outside its typical contexts, perhaps it is time for us as NLP researchers to consider these issues while analyzing models and their performance.

All of these are important problems beyond teaching NLP, for NLP researchers in general and in the inter-disciplinary research context in particular. As Connolly (2020) suggests, supplementing computing-related courses with methods and perspectives from the social sciences may give NLP researchers new insights into addressing these issues. This will benefit teaching NLP contexts beyond the traditional audience, and also enrich NLP research in future.

5.2 Tool Development

In all the courses taught outside of classrooms dominated by STEM students (i.e., all except G-Gen), I frequently ran into the issue of insufficient documentation for the NLP tools the students wanted to use. From my personal experience, this situation is constantly improving. Yet, it is far from ideal. Despite being actively involved with using various NLP tools in daily life, it is not uncommon for many of us to face some challenges with software usage.

This is even harder for those without that background with DIY software. In particular, tools that work across all commonly used operating systems are not very easy to find, but are essential when we want to use them in classrooms. It was particularly challenging, even as an instructor, while teaching the data science minor course, which used R. Since the students preferred to use their own machines, it was not possible to enforce a common operating system/library version setup. So, I addressed this issue by guiding the students with links to videos on the installation of various tools on all common operating systems where possible, and by using alternative libraries where this did not work. Other issues the students raised were about the lack of proper graphical user interfaces and visualization methods for NLP tools. While I could not offer any clear solutions for these issues, tool developers should perhaps keep broader, potentially non-technical users in mind when they release and document their tools in future.

What is described in this section is merely a snapshot of some of the teaching NLP challenges that go beyond the teaching context and need the attention of other members of the NLP community: researchers and tool developers. The list mentioned in this section is by no means exhaustive, and more discussion is needed in this direction, especially in the current situation where NLP is taught and used by many disciplines beyond linguistics and computer science.

6 Conclusion

In this paper, I summarized my experiences with teaching NLP courses to diverse groups of undergraduate and graduate students and identified a few challenge areas in teaching such courses. I also discussed a few challenge areas for NLP researchers and tool developers; addressing these can help improve both the teaching of NLP and the ease of applying NLP in research areas beyond linguistics and computer science. It should be acknowledged, though, that this discussion is more qualitative than quantitative in nature. Perhaps surveying the students of such courses a couple of years later, about how they are using NLP compared to what they learnt in the classroom, could be one way to validate these observations in a more quantitative manner. I hope that this paper will contribute to the growing body of work on teaching NLP, and also lead to further discussion of, and an increased focus on, the inter-disciplinarily relevant aspects of NLP teaching, research and practice.

Acknowledgements

Firstly, I thank the workshop committee for organizing this workshop. Comments from the three anonymous reviewers and my colleagues Rebecca Knowles, Gabriel Bernier-Colborne and Taraka Rama were immensely useful in bringing the paper from the first draft to the final version; I thank them all for their time and thoughts on this paper. Finally, I thank the students in all these courses, and the three universities (Iowa State University, USA; Ludwig Maximilian University of Munich, Germany; Eberhard Karls University of Tübingen, Germany) that gave me the opportunities to teach them.


References

Apoorv Agarwal. 2013. Teaching the basics of NLP and ML in an introductory course to information science. In Proceedings of the Fourth Workshop on Teaching NLP and CL, pages 77–84.

Jason Baldridge and Katrin Erk. 2008. Teaching computational linguistics to a large, diverse student body: courses, tools, and interdepartmental interaction. In Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, pages 1–9.

Fabian Barteld and Johanna Flick. 2017. LEA: Linguistic exercises with annotation tools. In Teach4DH@GSCL, pages 11–16.

Beata Beigman Klebanov and Nitin Madnani. 2020. Automated evaluation of writing – 50 years and counting. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7796–7810, Online. Association for Computational Linguistics.

Steven Bird. 2008. Defining a core body of knowledge for the introductory computational linguistics curriculum. In Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, pages 27–35.

Steven Bird, Ewan Klein, Edward Loper, and Jason Baldridge. 2008. Multidisciplinary instruction with the Natural Language Toolkit. In Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, pages 62–70, Columbus, Ohio. Association for Computational Linguistics.

Kalina Bontcheva, Hamish Cunningham, Valentin Tablan, Diana Maynard, and Oana Hamza. 2002. Using GATE as an environment for teaching NLP. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, pages 54–62, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Bozhidar Bozhanov and Ivan Derzhanski. 2013. Rosetta stone linguistic problems. In Proceedings of the Fourth Workshop on Teaching NLP and CL, pages 1–8, Sofia, Bulgaria. Association for Computational Linguistics.

Chris Brew, Markus Dickinson, and Detmar Meurers. 2005. "Language and computers": Creating an introduction for a general undergraduate audience. In Proceedings of the Second ACL Workshop on Effective Tools and Methodologies for Teaching NLP and CL, pages 15–22.

Kenneth Ward Church. 1994. Unix™ for poets. Notes of a course from the European Summer School on Language and Speech Communication, Corpus Based Methods.

Randy Connolly. 2020. Why computing belongs within the social sciences. Communications of the ACM, 63(8):54–59.

Robert Dale, Diego Mollá Aliod, and Rolf Schwitter. 2002. Evangelising language technology: A practically-focussed undergraduate program. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, pages 27–32, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Markus Dickinson, Chris Brew, and Detmar Meurers. 2012. Language and computers. John Wiley & Sons.

Eric Fosler-Lussier. 2008. Strategies for teaching "mixed" computational linguistics classes. In Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, pages 36–44.

Matthew Gentzkow, Bryan Kelly, and Matt Taddy. 2019. Text as data. Journal of Economic Literature, 57(3):535–74.

Yoav Goldberg. 2017. Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies, 10(1):1–309.

Mark Guzdial. 2021. What liberal arts and sciences students need to know about computing.

Marti A Hearst. 2005. Teaching applied natural language processing: Triumphs and tribulations. In Proceedings of the Second ACL Workshop on Effective Tools and Methodologies for Teaching NLP and CL, pages 1–8.

Beth Ann Hockey and Gwen Christian. 2008. Zero to spoken dialogue system in one quarter: Teaching computational linguistics to linguists using Regulus. In Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, pages 80–86, Columbus, Ohio. Association for Computational Linguistics.

Dirk Hovy. 2020. Text Analysis in Python for Social Scientists: Discovery and Exploration. Cambridge University Press.

Matthew L Jockers. 2014. Text Analysis with R for Students of Literature. Springer.

Dan Jurafsky and James Martin. 2008. Speech & Language Processing. Pearson Education.

Katherine Keith, David Jensen, and Brendan O'Connor. 2020. Text and causal inference: A review of using text to remove confounding from causal estimates. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5332–5344, Online. Association for Computational Linguistics.

Mare Koit, Tiit Roosmaa, and Haldur Õim. 2002. Teaching computational linguistics at the University of Tartu: Experience, perspectives and challenges. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, pages 85–90, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Shriram Krishnamurthi and Kathi Fisler. 2020. Data-centricity: a challenge and opportunity for computing education. Communications of the ACM, 63(8):24–26.

Gina-Anne Levow. 2008. Studying discourse and dialogue with SIDGrid. In Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, pages 106–113, Columbus, Ohio. Association for Computational Linguistics.

Elizabeth Liddy and Nancy McCracken. 2005. Hands-on NLP for an interdisciplinary audience. In Proceedings of the Second ACL Workshop on Effective Tools and Methodologies for Teaching NLP and CL, pages 62–68, Ann Arbor, Michigan. Association for Computational Linguistics.

Jimmy Lin. 2008. Exploring large-data issues in the curriculum: A case study with MapReduce. In Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, pages 54–61, Columbus, Ohio. Association for Computational Linguistics.

Patrick Littell, Lori Levin, Jason Eisner, and Dragomir Radev. 2013. Introducing computational concepts in a linguistics olympiad. In Proceedings of the Fourth Workshop on Teaching NLP and CL, pages 18–26, Sofia, Bulgaria. Association for Computational Linguistics.

Deryle Lonsdale. 2002. A niche at the nexus: situating an NLP curriculum interdisciplinarily. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, pages 46–53, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Nitin Madnani and Bonnie J. Dorr. 2008. Combining open-source with research to re-engineer a hands-on introductory NLP course. In Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, pages 71–79, Columbus, Ohio. Association for Computational Linguistics.

Borja Navarro-Colorado. 2017. A quick intensive course on natural language processing applied to literary studies. In Teach4DH@GSCL, pages 37–42.

Nils Reiter, Sarah Schulz, Gerhard Kremer, Roman Klinger, Gabriel Viehhauser, and Jonas Kuhn. 2017. Teaching computational aspects in the digital humanities program at University of Stuttgart – intentions and experiences. Humanities, 1:6.

Kathleen M Sheehan. 2017. Validating automated measures of text complexity. Educational Measurement: Issues and Practice, 36(4):35–43.

Victor Veitch, Dhanya Sridhar, and David Blei. 2020. Adapting text embeddings for causal inference. In Conference on Uncertainty in Artificial Intelligence, pages 919–928. PMLR.

Fei Xia. 2008. The evolution of a statistical NLP course. In Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, pages 45–53, Columbus, Ohio. Association for Computational Linguistics.

Heike Zinsmeister. 2008. Freshmen's CL curriculum: the benefits of redundancy. In Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, pages 19–26.


Proceedings of the Fifth Workshop on Teaching NLP, pages 160–170, June 10–11, 2021. ©2021 Association for Computational Linguistics

Teaching NLP with Bracelets and Restaurant Menus: An Interactive Workshop for Italian Students

Ludovica Pannitto
University of Trento

[email protected]

Lucia Busso
Aston University

[email protected]

Claudia Roberta Combei
University of Bologna

[email protected]

Lucio Messina
Independent Researcher

[email protected]

Alessio Miaschi
University of Pisa

[email protected]

Gabriele Sarti
University of Trieste
[email protected]

Malvina Nissim
University of Groningen
[email protected]

Abstract

Although Natural Language Processing (NLP) is at the core of many tools young people use in their everyday life, high school curricula (in Italy) do not include any computational linguistics education. This lack of exposure makes the use of such tools less responsible than it could be, and makes choosing computational linguistics as a university degree unlikely. To raise awareness, curiosity, and longer-term interest in young people, we have developed an interactive workshop designed to illustrate the basic principles of NLP and computational linguistics to high school Italian students aged between 13 and 18 years. The workshop takes the form of a game in which participants play the role of machines needing to solve some of the most common problems a computer faces in understanding language: from voice recognition to Markov chains to syntactic parsing. Participants are guided through the workshop with the help of instructors, who present the activities and explain core concepts from computational linguistics. The workshop was presented at numerous outlets in Italy between 2019 and 2021, both face-to-face and online.

1 Introduction

Have you used Google this week? This question would kick off the activity that we describe in this paper every time we delivered it. And a number of follow-up comments would generally appear. What for? Translating, getting some help for homework, looking for info, writing collaboratively, and getting spelling correction!

In our workshops, we talk to groups of teenagers: even if someone has not personally used any of those tools on a daily basis, it is highly unlikely that they have never interacted with a vocal assistant, wondered how their email spam filter works, used text predictions, or spoken to a chat-bot. Also, applications that do not require a proactive role of the user are growing: most of us, for example, are subject to targeted advertising, profiled on the content we produce and share on social media.

Natural Language Processing (NLP) has grown at an incredibly fast pace, and it is at the core of many of the tools we use every day.1 At the same time, though, awareness of its underlying mechanisms and of the scientific discussion that has led to such innovations, and even of NLP's very existence as a scientific discipline, is generally much less widespread, and it is basically unknown to the general public (Grandi and Masini, 2018).

A concurrent cause of this lack of awareness resides in the fact that in more traditional high-school formal education systems, such as the Italian one, "young disciplines" such as Linguistics and Computer Science tend to be overlooked. Grammar, which in a high-school setting is the closest field to Linguistics, is rarely taught as a descriptive discipline; oftentimes, it is presented as a set of norms that one should follow in order to speak and write correctly in a given language. While this approach has its benefits, it is particularly misleading when it comes to what actual linguistic research is about. Similarly, Computer Science is often misread by the general public as an activity that deals with computers, while aspects concerning information technology and language processing are often neglected. This often leads to two important consequences. First, despite the overwhelming number of NLP applications, students and citizens at large lack the basic notions that would allow them to fully understand technology and interact with it in a responsible and critical way. Second, high-school students might not be aware of Computational Linguistics as an option for their university degree. Oftentimes, students who enrol in Humanities degrees are mainly interested in literature, and they only get acquainted with linguistics as a discipline at university. At the same time, most Computer Science curricula in Italian higher education rarely focus on natural language-based applications. As a result, Computational Linguistics as such is practically never taught before graduate studies.

1In this discussion, and throughout the paper, we conflate the terms Natural Language Processing and Computational Linguistics and use them interchangeably.

As members of the Italian Association for Computational Linguistics (AILC, www.ai-lc.it), we have long felt the need to bridge this knowledge gap, and have made dissemination a core goal of the Association. As a step in this direction, we have developed a dissemination activity that covers the basic aspects of what it means to process and analyze language computationally. This is the first activity of its kind developed and promoted by AILC and, to the best of our knowledge, among the first in Italy at large.

This contribution describes the activity itself, the way it was implemented as a workshop for high school students in the context of several dissemination events, and how it can serve as a blueprint for developing similar activities for new languages.

2 Genesis and Goals

We set out to develop an activity whose main aim would be to provide a broad overview of language modeling and, most importantly, to highlight the open challenges in language understanding and generation.

Without any ambition to present and explain the actual NLP techniques to students, we rather focused on showing how language, which is usually conceptualized by the layperson as a simple and monolithic object, is instead a complex stratification of interconnected layers that need to be disentangled in order to provide a suitable formalization.

In developing our activity, we took inspiration from the word salad Linguistic Puzzle, as published in Radev and Pustejovsky (2013):

Charlie and Jane had been passing notes in class, when suddenly their teacher Mr. Johnson saw what was going on. He rushed to the back of the class, took the note Charlie had just passed Jane, and ripped it up, dropping the pieces on the floor. Jane noticed that he had managed to rip each word of the message onto a separate piece of paper. The pieces of paper were, in alphabetical order, as follows: dog, in, is, my, school, the. Most likely, what did Charlie's note originally say?

The problem includes a number of follow-up questions that encourage the student to reflect upon the boundaries of sentence structure. In particular, we found that the word salad puzzle would give us the opportunity to introduce some of the core aspects of Computational Linguistics research. Approaching the problem with no previous knowledge helps raise some crucial questions, such as: what are the building blocks of our linguistic ability that allow us to perform such a task? How much knowledge can we extract from text alone? What does linguistic knowledge look like?
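One way to see the puzzle from the machine's side (an illustrative sketch with invented bigram counts, not part of the workshop materials) is to score every ordering of the six words with bigram statistics and keep the most plausible one:

```python
from itertools import permutations

# Hypothetical bigram counts, as might be gathered from a small corpus;
# the numbers are invented for illustration.
bigram_counts = {
    ("the", "dog"): 5, ("dog", "is"): 4, ("is", "in"): 4,
    ("in", "the"): 6, ("the", "school"): 3, ("my", "dog"): 4,
}

def score(seq):
    # Sum of bigram counts over adjacent word pairs.
    return sum(bigram_counts.get(pair, 0) for pair in zip(seq, seq[1:]))

words = ["dog", "in", "is", "my", "school", "the"]
best = max(permutations(words), key=score)
print(" ".join(best))  # "my dog is in the school" under these counts
```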

Since the workshop presented here is the first activity of this kind in the Italian context, we took inspiration from games and problems such as those outlined in Radev and Pustejovsky (2013) and used for the North American Computational Linguistics Olympiad, similar to the ones described in Van Halteren (2002) and Iomdin et al. (2013). In particular, we were inspired by the institution of (Computational) Linguistic Olympiads in making our workshop a problem-solving game with different activities, each related to a different aspect of computational language processing. Linguistic Olympiads have been an established annual event in many parts of the world since they first took place in Moscow in 1965. In these competitions, students (generally of high-school age) are faced with linguistic problems of varying nature that require participants to use problem-solving abilities to uncover underlying patterns or rules in the data. For an in-depth discussion of the history and diffusion of Linguistic Olympiads in the world, see Derzhanski and Payne (2010) and Littell et al. (2013).

In choosing the algorithms to include in our dissemination activity, we decided to leave aside neural networks and instead focus on traditional statistical approaches, both for historical reasons and because these convey more clearly the distinction between different layers of linguistic information and their roles in language modeling.


Activity | Duration | Aim
1. Get to know a (computational) linguist | 10' | collaboratively build a definition for linguistics as a study field
2. Are computers able to hear? | 15' | familiarize with the concept of simulation of humans' speech perception abilities
3. Are computers able to read? | 30' | introduce corpora as sources of linguistic knowledge and statistical patterns as structural aspects of language
4. Do computers know grammar? | 30' | introduce human annotation and meta-linguistic knowledge as powerful research tools
5. Becoming a computational linguist | 5' | evaluate pros and cons of the two presented approaches and future directions, and discuss what's needed to become a computational linguist

Table 1: Sections of the activity, with their planned duration and a broad aim for each of them.

A brief discussion of the most recent NLP technologies, including the application of neural networks, is included in the final part of the workshop (Sec. 3.5).

The activity is targeted at students in their last year of middle school (13 years of age) or older. While we believe 13 is a good entry point, there isn't an actual upper bound, since the activity can be enjoyed by people of any older age (though in practice participation was mainly offered to schools, with the oldest students being 18-19). We thought this would be the appropriate target audience for this workshop for two main reasons. On the one hand, we believe that coming to the activity with a richer metalinguistic background, typically acquired during the first Italian school cycle, would help attendees better grasp the differences between the scientific approach to language and the more prescriptive approach they are exposed to in school. On the other hand, we also conceived the activity as a way of helping students with their future study choices: we therefore included both students attending their last year of middle school, who are about to choose a high school curriculum, and high school students, the latter in order to provide them with more options for their university choices.

3 The activity

We planned our dissemination activity for a 90-minute time slot, divided into five parts, as detailed in Table 1.

3.1 Get to know a (computational) linguist

The first 15 minutes of the workshop are organized both as an ice-breaker activity for the attendees and as a brief introduction to linguistics and computational linguistics more specifically.

During the introduction we tried to demystify some common misconceptions about linguistics (e.g., that a linguist knows many languages, that linguists sometimes get confused with speech therapists, that a linguist will correct my grammar, etc.): we presented participants with a list of possible definitions, and they had to identify the appropriate ones. We broadly defined linguistics as the study of language as a biological, psychological, and cultural object. Computational Linguistics was then introduced both as a commercial, engineering-oriented field and as a purely scientific research discipline.2

We also briefly discussed linguistic questions with participants as an example of the kind of problems that a linguist tries to approach during their research activity. These included: "How many ways of pronouncing n do we have in Italian?", "Do numbers from one to ten exist in all languages?".

For the following parts of the activity, the introduction to each sub-part was dedicated to a reflection upon what it means for us humans to hear, read, and understand grammar, and whether there is a difference when the same tasks are performed by computers.

3.2 Are computers able to hear?

As vocal assistants such as Alexa, Siri, and Google Home have become increasingly popular, we chose them as tangible examples of NLP technologies to kick off the games.

2 We relied on the definition reported in https://www.thebritishacademy.ac.uk/blog/what-computational-linguistics/.


The aim of this section of the workshop was to demonstrate two points:

• computers do not necessarily solve linguistic tasks the way we solve them; they are therefore simulating our abilities without replicating them;

• consequently, the concepts of easy and difficult tasks have to be carefully revised when applied to language models.

We introduced the McGurk effect (McGurk and MacDonald, 1976) to clarify how hearing language is a complex task in itself, involving a broad set of aspects beyond simple sound perception, such as the visual system as well as expectations regarding the upcoming input. Computers, on the other hand, hear on the basis of an audio signal that is processed (Figure 1), at least in the most traditional and well-established architectures, without access to higher-level linguistic knowledge or information from the communicative context.

We then briefly presented speech recognition as a direct optimization task (i.e., given an audio signal, find the word in a given database that maximises the probability of being associated with that signal) and introduced one of the major challenges that speech recognition models are still facing, namely the ability to adapt to different speech styles (i.e., speakers of different language varieties and dialects, non-native speakers, speakers with impairments, etc.).
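To give a concrete sense of this optimization framing, the following minimal Python sketch casts recognition as picking the best-scoring entry in a word database. It illustrates the framing only, and is not the software used in the workshop: the feature vectors, the cosine-similarity score, and the two-word vocabulary are all toy assumptions.

    import math

    def similarity(signal, template):
        # Cosine similarity between two feature vectors: a toy
        # stand-in for a real acoustic score.
        dot = sum(s * t for s, t in zip(signal, template))
        norm = (math.sqrt(sum(s * s for s in signal))
                * math.sqrt(sum(t * t for t in template)))
        return dot / norm if norm else 0.0

    def recognize(signal, database):
        # Direct optimization: return the word whose stored template
        # best matches the incoming signal.
        return max(database, key=lambda w: similarity(signal, database[w]))

    # Hypothetical acoustic templates for two Italian words.
    database = {"destra": [0.9, 0.1, 0.4], "sinistra": [0.2, 0.8, 0.5]}
    print(recognize([0.85, 0.15, 0.35], database))  # -> destra

A system built this way has no access to context or expectations, which is precisely the contrast with human hearing that the McGurk demonstration sets up.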

To demonstrate this further, we tested the attendees' ability to adapt their hearing skills to different speech varieties by having them listen to conversations in various Italian regional accents.3

While we asked attendees to guess the name of the region the speakers came from, the actual aim was to show how easily we adapt to understand speech, unlike speech recognition systems.

3.3 Are computers able to read?

From this moment on, the attendees worked on written text. The activity described in this section was aimed at showing how salient aspects of language structure can be derived from the statistical properties of language.

Following the "Word Salad" puzzle (Radev and Pustejovsky, 2013), the ability to read was presented as follows: given a set of words, are we able to rearrange them into a plausible sentence-like order?

3 Conversations were extracted from the CLIPS corpus (Albano Leoni et al., 2007).

Figure 1: Representation of the audio signal for the Italian word destra (en. right). This was used to show how information coming from audio signals can be represented in a way that easily allows the computer to perform a pattern matching task, but would be impossible for humans to process.

We demonstrated how this is an easy task for human beings when one deals with a language they are familiar with (Figure 2), while, generally, one may not be able to perform the same task in the case of unknown languages (Figure 3), where each possible ordering seems equally plausible.

Figure 2: A set of Italian words (from the top left corner, en. is, garden, in, my, the, dog): when asked to rearrange them into a sentence, participants would first come up with the most likely ordering (i.e., il mio cane è nel giardino, en. my dog is in the garden) and, if prompted, they would then produce more creative sentences (i.e., il cane nel giardino è mio, en. the dog in the garden is mine). They would however never consider ungrammatical sequences as possible sentences.

We therefore gave the attendees a deck of unknown, mysterious tokens (left card in Figure 4) and asked them to come up with the most plausible sentence that contained all of them. The cards represented either Italian or English words (participants were divided into two teams, each one dealing with a different masked language) which had however been transliterated into an unknown alphabet. While this was obviously an impossible task to solve, it gave us the opportunity to introduce a notion of probability in the linguistic realm.


Figure 3: The figure depicts the same situation as Figure 2, this time with non-words.

Figure 4: Example cards, both showing a word. Left: a card for the first activity, with a button loop to thread it into a sentence. Right: a card for the second activity, with its part of speech (a number) at the bottom.

We cast it as the expectation that we humans bear on the appearance of a specific linguistic sequence, and the consequent need for a source of linguistic knowledge to implement the same notion.

When asked to perform the same task on the words of Figure 2, participants produced sentences in a quite consistent order, and the most prototypical sentences (e.g., il mio cane è nel giardino, en. my dog is in the garden) were usually elicited before less typical ones (e.g., il cane nel giardino è mio, en. the dog in the garden is mine), while ungrammatical sentences (i.e., random permutations) were never produced.

We explained their responses by pointing out that humans accumulate a great amount of linguistic knowledge throughout their lifetime that helps them refine these expectations, while machines are in principle unbiased towards any specific preference. This observation allowed us to introduce participants to the notion of a corpus as a large collection of linguistic data that mimics the amount of data we are exposed to as humans.

Each team was then provided with a corpus written in an unknown language (approximately 60 sentences hand-crafted by transliterating a portion of the "Snow White" tale into a mysterious alphabet, Figure 5). Concurrently, we introduced a simple algorithm to tell apart sentence-like orderings of the provided tokens from random ones.

The algorithm, which we called the bracelet method (Figure 6), is based on the Markov Chain procedure.

Figure 5: The figure shows one of the corpora that was given to participants: five A3 tables containing approx. 60 transliterated sentences from the "Snow White" fairy tale, and tokens with buttonholes that had to be searched for in the corpus and threaded into sentences.

In a scenario similar to that of lining up pearls to form a bracelet, participants could decompose the task of forming a sentence into smaller tasks. To make the operation more concrete, we equipped each card with a button loop, as shown in Figure 4: this way, cards could be physically threaded together to form a sentence.

The first step consisted in choosing the first word, for the beginning of the sentence. Since participants were facing a language they didn't know, they were not aware of language structures nor of the meaning of the tokens. In such a situation, they could decide whether it was plausible to use a given word at the beginning of a sentence just by looking up sentences in the corpus that began the same way. If they found at least one sentence that began with the same word, it meant that that was a licit position and they could use the word to start the sentence.

The activity continued as follows: sticking to the bracelet metaphor, participants needed to select and insert the following "pearl" onto the thread; ideally the pearl should pair well with the previous one, as we might want to avoid colour mismatches (e.g., it is well known that blue and green do not go well together). The metaphor therefore highlights a core aspect that holds true for language as well, namely that we can condition our choice on a variable number of previous choices, and this will influence the complexity of the obtained pattern.
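For readers who want the bracelet method spelled out, here is a minimal Python sketch of the bigram version played in the workshop. The toy corpus and deck are illustrative stand-ins for the transliterated "Snow White" materials; the greedy selection and the stop-on-dead-end behaviour are our simplifying assumptions.

    import random

    corpus = [
        "il mio cane è nel giardino",
        "il cane è mio",
        "il gatto è nel giardino",
    ]

    # Attested sentence-initial words and attested bigrams.
    starts = {s.split()[0] for s in corpus}
    bigrams = {(w1, w2) for s in corpus
               for w1, w2 in zip(s.split(), s.split()[1:])}

    def thread_bracelet(deck):
        # Thread the deck into a word sequence, accepting a word only
        # if the resulting bigram is attested in the corpus. Assumes
        # the deck contains at least one attested sentence-initial word.
        deck = list(deck)
        random.shuffle(deck)
        first = next(w for w in deck if w in starts)  # licit first "pearl"
        sentence = [first]
        deck.remove(first)
        while deck:
            nxt = next((w for w in deck if (sentence[-1], w) in bigrams), None)
            if nxt is None:
                break  # no attested continuation: stop threading
            sentence.append(nxt)
            deck.remove(nxt)
        return " ".join(sentence)

    print(thread_bracelet(["mio", "il", "giardino", "è", "nel", "cane"]))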

The activity resulted in a number of sentence-bracelets, as shown in Figure 7, which were then kept aside to be translated at the end of the workshop.

3.4 Do computers know grammar?

While the previous game highlighted the importance of statistical information in NLP, we also wanted, in accordance with the overall aim of the activity, to introduce some of the algorithms that are more deeply rooted in the linguistic tradition.


Figure 6: The bracelet method applied to the Italian words è, mio, giardino, nel (en. is, my, garden, in). The algorithm is based on bigram co-occurrences, so the choice of each word is based solely on the previously chosen one. Colors indicate probabilities, which are computed based only on the previously chosen word cane (en. dog). The first token, il (en. the), appears grayed out as it is ignored for the choice.

Figure 7: The figure shows the result of a bracelet sequence: tokens are threaded together based on co-occurrences in the corpus.


In order to do so, we introduced the notion of grammar as a descriptive abstraction over a set of examples. To exemplify this notion of grammar metaphorically, outside the linguistic context, we presented participants with a set of possible restaurant menus (Figure 8) and encouraged them to come up with the general rule that the restaurant owner must have had in mind when choosing those combinations. All menus were built as a traditional Italian full meal, composed of two main dishes and a dessert. We extended the metaphor by showing how, once a set of rules is defined, these can be used both to decide whether a new menu is likely to be part of the same set (Figure 9) and also to generate new meals (Figure 10).
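Read formally, the menu rules behave like a small context-free grammar in which the same rule set supports both reduction (Figure 9) and generation (Figure 10). The sketch below illustrates this with made-up dishes; it assumes, for simplicity, a flat grammar in which every category expands either to a fixed sequence of categories or to a single dish.

    import random

    rules = {
        "pasto":   [["primo", "secondo", "dolce"]],
        "primo":   [["fusilli al pomodoro"], ["penne ai funghi"]],
        "secondo": [["pollo agli asparagi"], ["arrosto di vitello"]],
        "dolce":   [["pannacotta"], ["tiramisù"]],
    }

    def generate(symbol="pasto"):
        # Expand a category down to terminal dishes (cf. Figure 10).
        if symbol not in rules:  # terminal: an actual dish
            return [symbol]
        expansion = random.choice(rules[symbol])
        return [dish for part in expansion for dish in generate(part)]

    def is_valid(menu, symbol="pasto"):
        # Check whether a list of dishes reduces to `symbol` (cf. Figure 9).
        for expansion in rules[symbol]:
            if len(expansion) == 1 and expansion[0] not in rules:
                if menu == expansion:
                    return True
            elif len(expansion) == len(menu) and all(
                is_valid([dish], part) for dish, part in zip(menu, expansion)
            ):
                return True
        return False

    print(generate())
    print(is_valid(["penne ai funghi", "pollo agli asparagi", "tiramisù"]))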

This metaphor, which was readily grasped by most participants, was useful to show how different components can be combined in a meaningful way, as happens in grammar.

Figure 8: Each block in the image corresponds to a possible Italian full meal, consisting of: a first course (e.g. fusilli al pomodoro), a second course (e.g. pollo agli asparagi), and a dessert (e.g. pannacotta).

Figure 9: Step-by-step process to assess the validity of a given menu. In the top-right corner, the rules for creating a full meal are shown. Categories are defined recursively until each course that constitutes the full menu is obtained and therefore reduced to the initial category of pasto (en. meal).

Figure 10: Process flow for creating a new meal from the initial category pasto (en. meal) down to the leaves containing terminal symbols such as penne, funghi, pollo, etc. (en., a type of pasta, mushrooms, chicken). The rules used for generation are the same as those used for the reduction process, reported in Figure 9.

Before moving back to the corpus, we showed them what a formal grammar of the menus could look like.

We had previously annotated the corpus with syntactic and morpho-syntactic information (i.e., part of speech), as shown in Figures 5 and 12. We therefore asked participants to extract a possible grammar from the corpus, namely a set of attested rules, and to use them to generate a new sentence.

In order to write the grammar, participants were given several physical materials: felt strips reproducing the colors of the annotation, a deck of cards with numbers (identifying parts of speech), and a deck of "=" symbols (Figure 11).

Figure 11: A set of rules extracted from the corpus during the activity. Each rule is made of felt strips for phrases, cards with numbers indicating parts of speech, and "=" cards.

With a new deck of words (Figure 4, right panel), not all of which were present in the corpus, participants had to generate a sentence using the previously composed rules.
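As a rough computational analogue of this game, the sketch below reads attested patterns off a POS-annotated toy corpus and reuses them to generate a sentence from a new deck. The tags, the two annotated sentences, and the flattening of phrase-level rules into whole-sentence POS patterns are all simplifying assumptions of ours; the workshop's actual rules were phrase-structured, as in Figure 11.

    import random

    # Each annotated sentence is a list of (word, POS) pairs.
    annotated_corpus = [
        [("lo", "DET"), ("specchio", "NOUN"), ("rispose", "VERB")],
        [("la", "DET"), ("regina", "NOUN"), ("parlò", "VERB")],
    ]

    # "Rules" here are simply the attested POS sequences
    # (both sentences attest DET NOUN VERB).
    attested_patterns = {tuple(pos for _, pos in sent)
                         for sent in annotated_corpus}

    def generate(deck):
        # Fill an attested POS pattern with words from the deck
        # (assumes the deck covers every tag in the chosen pattern).
        pattern = random.choice(list(attested_patterns))
        by_pos = {}
        for word, pos in deck:
            by_pos.setdefault(pos, []).append(word)
        return " ".join(random.choice(by_pos[pos]) for pos in pattern)

    deck = [("il", "DET"), ("cane", "NOUN"), ("dormì", "VERB")]
    print(generate(deck))  # e.g. "il cane dormì"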

3.5 Becoming a computational linguist

At this point, participants had created a number of sentences by means of the two techniques described above. It was now time to discover that the mysterious languages they had worked on were actually English and Italian. This was achieved in practice by superimposing a Plexiglas frame on the A3 corpus pages (Figure 13): the true nature of the corpora was revealed this way, as the participants could see the original texts (in Italian and English) and translate the sentences they had created previously.

The outcome of the activity stimulated discussion amongst the participants, highlighting the pros and cons of each approach and how they could be integrated into real-life technologies. Our workshop ended with a brief description of more recent NLP technologies and their commercial applications, such as recommender systems, automatic translation, text completion, etc.

Since the target audience consisted mostly of middle- and high-school students, we offered an overview of what it takes to become a computational linguist and where one can study computational linguistics in Italy.

Figure 12: Example of categories (it. Categorie), e.g. phrases and POS tags, and rules (it. Regole) for the sentence "lo specchio rispose a la regina" (en. the mirror answered the queen).

Figure 13: A trial session of the workshop (the picture shows some of the tutors explaining the game to AILC members): the original language of the corpus has just been revealed by superimposing Plexiglas supports on the corpus tables.


4 The workshop in action

The activity, here outlined in its complete and original form, was presented in various venues over the last year and a half.

It was initially designed for the 2019 edition of the "Bergamo Scienza" Science Festival4, where it was presented live to over 450 participants over the course of two weeks. A simplified "print-and-play" version of the workshop was also presented at the 2020 edition of the SISSA5 Student Day.

Due to the Covid-19 pandemic, all other presentations of the activity had to be moved online. Transposing the workshop crucially meant striving to maintain the interactive nature of the activities without the possibility of meeting face to face.

4 https://festival.bergamoscienza.it/
5 https://www.sissa.it/


To do so, we integrated our original presentation on Google Slides with the interactive presentation software Mentimeter6, which we used for questions, polling, and quizzes. The corpus and tokens were presented via a web interface created especially for this purpose7.

The interface presented the masked corpus, complete with POS tags and syntactic annotations. For the bracelet activity, participants were automatically assigned a number of tokens which they could use to build a sentence by simply dragging and dropping them (Figures 14 and 15).

This online version was crafted in the first place to be presented at "Festival della Scienza"8 (Figure 16), a science festival primarily aimed at school students held each year in Genoa, where multiple 45' sessions of the workshop were run over the course of two days. The fourth activity (Section 3.4) involved "bootstrapping" syntactic rules based only on our color-based annotation. To simplify online interaction, we only used the unmasked Italian version of the corpus, and participants played collectively, helping each other to build sentences and grammatical rules: rules were collected through a Mentimeter poll, while a sentence was generated in a guided demonstration by the presenter of the workshop.

Overall, the workshop transposed really well online, and was extremely well received in this version too. The online modality also allowed us to present it to a larger and more varied audience than just students: a version for the general public was presented at the European Researchers' Night (Bright Night9) at the University of Pisa in November 2020, thanks to a collaboration with the Computational Linguistics Laboratory10 and ILC-CNR11; a dedicated session was run for the high school ITS Buzzi12 (Prato, Italy) in December 2020; and another was run during the second edition of the Science Web Festival13 in April 2021.

5 Reusability: this activity as a blueprint

As mentioned in Section 1, our activity was inspired by a collection of English-language problems created for students participating in the Computational Linguistics Olympiad (Radev and Pustejovsky, 2013).

6 https://www.menti.com/
7 A demo of the interface can be found at https://donde.altervista.org/
8 http://www.festivalscienza.it
9 https://www.bright-night.it/
10 http://colinglab.humnet.unipi.it/
11 http://www.ilc.cnr.it/
12 https://www.tulliobuzzi.edu.it/
13 https://www.sciencewebfestival.it/


We adapted the original activity to the Italian language and context. While we tried to choose widely shared linguistic principles to communicate, adapting the game to a different language obviously requires language-specific choices and details, which would have to be re-evaluated when porting the activity to further languages. In particular, since the materials have been developed for Italian, transposition might not be straightforward for some languages. However, we believe that the general structure of our workshop can serve as a blueprint for extension to new languages. For this reason, all the relevant materials are made available in a dedicated repository. The repository includes both scripts to reproduce our activity and a general set of insights and recommendations regarding the structure and principles of the workshop.

All of the scripts necessary to produce the materials used in the game's workflow in a different language are available in our open-access repository14. To put the activity into production using the provided scripts, one only needs to create an annotated corpus in the target language.

Our specific choice of "Snow White" as a corpus is motivated by the fact that the story can be phrased in a fairly repetitive formulation, with two advantages. One is that enough bigrams recur to enable the generation of new sentences with the bracelet method. The other is that the story contains recognizable characters (e.g., the dwarfs, the evil queen, etc.), so that, when the unmasked text is revealed, the process is intuitively transparent. Such characteristics are desirable for the activity, and should be kept in mind when choosing a new text for a new language.
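When porting the activity, one might vet a candidate text for this property by measuring how often its bigrams recur: the more bigram tokens per bigram type, the more attested continuations the bracelet game has to offer. The following sketch, with an illustrative sample, is one possible such check of our own devising, not a script from our repository.

    from collections import Counter

    def bigram_stats(sentences):
        # Count bigram tokens and types; a high tokens-per-type ratio
        # signals a repetitive text, favourable for the bracelet game.
        counts = Counter(
            (w1, w2)
            for s in sentences
            for w1, w2 in zip(s.split(), s.split()[1:])
        )
        tokens = sum(counts.values())
        types = len(counts)
        return tokens, types, (tokens / types if types else 0.0)

    sample = [
        "la regina guardò lo specchio",
        "lo specchio rispose a la regina",
        "la regina parlò a lo specchio",
    ]
    tokens, types, ratio = bigram_stats(sample)
    print(f"{tokens} bigram tokens, {types} types, {ratio:.2f} tokens/type")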

In addition to sharing scripts, we also describe here the core structure and principles the workshop relies on, which can be reproduced when replicating the activity for a different language.

Parts 1 & 2 At the beginning of the workshop (see Section 3.2) we show the limits of current technologies, in particular in terms of adaptability. To this end, we exploited diatopic variation, such as Italian regional accents, since this is an aspect readily accessible to Italian students.

14 https://bitbucket.org/melfnt/malvisindi


Figure 14: Interface of the online website used for the "bracelet" game (Section 3.3) during the online workshops. Players can collaboratively drag and drop their cards from the bottom-left panel to form sentences in the top-left panel. The corpus is shown in the right panel.

Figure 15: After the games, the website shows the translation of the corpus (see for example line 1) and cards.

In the context of other languages, the same concept could however be shown by different means, such as the influence of jargon or minority languages, or simply differences in accuracy depending on age or other socio-demographic and socio-cultural variables.

Part 3 The next part of the activity (Section 3.3) is the most easily reproducible in a different language, as it only exploits statistical co-occurrences as a cue for linguistic structure. The only prerequisite here is the availability of a sufficiently standardized tokenization for the language of interest. As described above, we masked the language through a transliteration system: this was achieved either by substituting words with sequences of symbols, or with non-words. In the case of non-words, these were generated by manually defining a series of phonotactic constraints for Italian, which would have to be adapted to the target language.
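As an illustration of the non-word option, the sketch below masks a corpus with freshly generated non-words built from simple consonant-vowel syllables. The segment inventory, the CV-only syllable shape, and the word-length heuristic are deliberately crude assumptions; the constraints we actually wrote for Italian were richer, and this sketch does not even check for collisions between word types.

    import random

    ONSETS = ["b", "d", "f", "g", "l", "m", "n", "p", "r", "s", "t", "v"]
    VOWELS = ["a", "e", "i", "o", "u"]

    def non_word(n_syllables):
        # Build a non-word out of consonant-vowel syllables.
        return "".join(random.choice(ONSETS) + random.choice(VOWELS)
                       for _ in range(n_syllables))

    def mask_corpus(sentences, seed=0):
        # Replace each word type with a fixed, freshly generated
        # non-word, so every occurrence is masked the same way.
        random.seed(seed)
        mapping, masked = {}, []
        for s in sentences:
            out = []
            for w in s.split():
                if w not in mapping:
                    mapping[w] = non_word(max(2, len(w) // 3))
                out.append(mapping[w])
            masked.append(" ".join(out))
        return masked, mapping

    masked, mapping = mask_corpus(["lo specchio rispose a la regina"])
    print(masked[0])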

The main message to be conveyed through this activity is that language is a complex system; we did this by disentangling semantics from the purely symbolic tier, which is the one computers most commonly manipulate.

Part 4 The following part of the activity (Section 3.4) focuses on the expressive power residing in the definition of auxiliary categories as descriptors of linguistic evidence. We achieved this by simplifying a constituent-based annotation for our corpus.


Figure 16: The picture was taken during a workshop session at "Festival della Scienza" in Genoa, in October 2020. Due to the COVID-19 pandemic, students were participating remotely. The left panel shows the tutor handling grammatical categories (colored cards for syntactic phrases and letters for parts of speech); the right panel shows a screenshot of the slides employed during the activity. Students could see both panels, as the slides were streamed while a webcam was capturing the tutor's movements.

Having continuous constituents easily allowed for the physical implementation of rewriting rules (i.e., participants had physical constituents and tokens that could be used to simulate the generation process). We built categories in order to extract a simple regular grammar from the text, and we were especially careful to ensure that both the Italian and the English corpora showed a similar structure and therefore a similar complexity level. More specifically, we restricted ourselves to five types of higher-level categories that acted as phrases (i.e., sentence, noun phrase, verb phrase, prepositional phrase, subordinate clause): this is of course a huge simplification and, for the sake of the activity, we overlooked some relevant linguistic phenomena.

We reckon that this approach might not be portable to languages that exhibit flexible word order, so alternative solutions should be sought.

Part 5 The last section of the activity was dedicated to a discussion of the presented methods for language generation.

After that, and depending on the audience, we presented some options for pursuing studies in Computational Linguistics in Italy, which would of course have to be adapted to the target social context. Lay publications concerned with Computational Linguistics are unfortunately also not very common in Italy, so we took the opportunity to provide participants with some suggestions for further reading (Masini and Grandi, 2017).

6 Looking back and ahead

In the previous sections we described an interactive workshop designed to illustrate the basic principles of Computational Linguistics to young Italian students. It is the first activity of its kind to be designed by the Italian Association for Computational Linguistics, and among the first dissemination activities for Computational Linguistics in Italy directed at young students.

The activity had the broad aim of increasing awareness of applications based on language technology, and of introducing students to the study of language as a scientific discipline.

We ran the activity both face-to-face and, due to COVID-19, online: generally speaking, we received enthusiastic feedback both from younger participants and from the more general public. We adapted the activity to a variety of formats and time slots, ranging from 30 to 90 minutes: the amount of time required to approach the game and get acquainted with the concepts is of course variable, and depends on the participants' background and on the level of engagement expected of them. Generally speaking, 45 minutes are enough for a presentation including some interaction with the audience, especially in the online setting, but at least 90 minutes are needed for participants to fully experiment with the hands-on activity.

We want to specifically stress how time is a central ingredient of the activity. While the game-related aspects remained engaging and fun even in the shortened, online versions, longer sessions would be needed to fully grasp the mechanisms underlying the presented algorithms. In fact, we often felt that more time would be beneficial for a deeper discussion about language as an object of study in itself, and about language as data on which theories can be built.


In particular, shorter time slots or less guided activities increase the risk of participants approaching the challenge as a puzzle that they can solve regardless of linguistic knowledge. This is because the approach to language we are presenting is entirely new to our audience, and not only to the younger students.

Although we did not have a formal system in place to collect systematic feedback, the overall response across venues and conditions has been extremely positive.15 Curiosity and engagement remained high both onsite and online, and we received many questions on several aspects of Natural Language Processing and neural networks, as well as on their role in Artificial Intelligence at large.

The participants' enthusiastic questions gave us many ideas for future dissemination activities. In fact, the technological world is advancing fast, and we firmly believe that it is necessary to spread awareness of the inner workings of AI-based technologies, so that society at large can approach them in a critical way.

The activity described in this paper was targeted at middle- and high-school students as well as the general public. It would be interesting to engage a younger audience too, as communicating the study and (computational) modelling of language to them would raise awareness of language studies as a scientific discipline.

For our activity, we took inspiration from one of the problems proposed in Radev and Pustejovsky (2013). Puzzles such as the ones presented at the Computational Linguistics Competitions are a great way to introduce both important challenges and the methodology to solve them, as they stimulate students to investigate linguistic aspects in a bottom-up fashion. Organizing the competition in Italy would represent a bigger project for our association, to be addressed in the coming years.

Acknowledgements

The workshop has been developed with help and support from many parties, and we would like to thank them all here. First and foremost, all the participants who enthusiastically played along and made the activity a success. Along with them, four tutors and science animators helped us greatly in putting our ideas into practice.

15 Following one reviewer's suggestion, we implemented such a system for our latest presentation at the Science Web Festival 2021, as a Google form that we circulated among participants at the end of the presentation.

We are also grateful to BergamoScienza, Festival della Scienza, and the Science Web Festival for hosting the activity during the festivals; to ILC-CNR "Antonio Zampolli" and ColingLab (University of Pisa) for hosting us during the European Researchers' Night; and to the Scuola Internazionale Superiore di Studi Avanzati (SISSA) and ITI Tullio Buzzi. We are also grateful to Dr. Mirko Lai, who collaborated on the development of the web interface for the online versions of our activity. We would like to thank AILC (Italian Association of Computational Linguistics) for encouraging us to design the activity and supporting us throughout the process. Last but not least, we thank the true hero of our workshop: GingerCat.16

References

Federico Albano Leoni, Francesco Cutugno, and Renata Savy. 2007. CLIPS (Corpora e Lessici di Italiano Parlato e Scritto). Linguistica computazionale: ricerche monolingui e multilingui.

Ivan A. Derzhanski and Thomas Payne. 2010. The linguistics olympiads: Academic competitions in linguistics for secondary school students. Linguistics at School: Language Awareness in Primary and Secondary Education, pages 213–226.

Nicola Grandi and Francesca Masini. 2018. Perché la linguistica ha bisogno di divulgazione (e viceversa). In N. Grandi and F. Masini, editors, La linguistica della divulgazione, la divulgazione della linguistica. Atti del IV Convegno Interannuale SLI nuova serie, pages 5–12. SLI, Roma.

Boris Iomdin, Alexander Piperski, and Anton Somin. 2013. Linguistic problems based on text corpora. In Proceedings of the Fourth Workshop on Teaching NLP and CL, pages 9–17.

Patrick Littell, Lori Levin, Jason Eisner, and Dragomir Radev. 2013. Introducing computational concepts in a linguistics olympiad. In Proceedings of the Fourth Workshop on Teaching NLP and CL, pages 18–26.

Francesca Masini and Nicola Grandi. 2017. Tutto ciò che hai sempre voluto sapere sul linguaggio e sulle lingue. Caissa Italia.

Harry McGurk and John MacDonald. 1976. Hearing lips and seeing voices. Nature, 264(5588):746–748.

Dragomir Radev and James Pustejovsky. 2013. Puzzles in Logic, Languages and Computation: The Red Book. Springer.

Hans Van Halteren. 2002. Teaching NLP/CL through games: The case of parsing. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, pages 1–9.

16 https://icons8.com/illustrations/style--ginger-cat-1


Author Index

Adewumi, Oluwatosin, 1
Agirrezabal, Manex, 80
Aksenov, Sergey, 13
Alex, Beatrice, 138
Alonso, Pedro, 1
Amblard, Maxime, 34
Apishev, Murat, 13
Artemova, Ekaterina, 13

Baglini, Rebekah, 28
Boutchkova, Maria, 138
Boyce, Richard, 96
Busso, Lucia, 52, 160

Chen, Jifan, 99
Combei, Claudia Roberta, 52, 160
Corona, Rodolfo, 104
Couceiro, Miguel, 34

Delbrouck, Jean-Benoit, 55
DeNero, John, 104
Desai, Shrey, 99
Durrett, Greg, 99

Eisenstein, Jacob, 125

Faridghasemnia, Mohamadreza, 1
Foster, Jennifer, 112
Fried, Daniel, 104
Friedrich, Annemarie, 49

Gaddy, David, 104
Goyal, Tanya, 99

Hiippala, Tuomo, 46
Hjorth, Hermes, 28

Jenifer, Jalisha, 92
Jurgens, David, 62, 108

Kabela, Lucas, 99
Kennington, Casey, 115
Kirianov, Denis, 13
Kitaev, Nikita, 104
Klein, Dan, 104
Kovács, György, 1

Liwicki, Marcus, 1
Llewellyn, Clare, 138

Madureira, Brielen, 87
Manning, Emma, 65
Medero, Julie, 131
Messina, Lucio, 52, 160
Miaschi, Alessio, 52, 160
Mokayed, Hamam, 1

Newman-Griffis, Denis, 96
Nissim, Malvina, 52, 160

Onoe, Yasumasa, 99
Orzechowski, Pawel, 138

Pannitto, Ludovica, 52, 160
Plank, Barbara, 59
Poliak, Adam, 92

Rakesh, Sumit, 1
Reynolds, William, 96

Saini, Rajkumar, 1
Sarkisyan, Veronica, 13
Sarti, Gabriele, 52, 160
Schneider, Nathan, 65
Schofield, Alexandra, 131
Serikov, Oleg, 13
Smith, Ronnie, 70
Stern, Mitchell, 104

Taneja, Sanya, 96

Vajjala, Sowmya, 149

Wagner, Joachim, 112
Wicentowski, Richard, 131

Xu, Jiacheng, 99

Zeldes, Amir, 65
Zesch, Torsten, 49
