Machine Learning, Libraries, and Cross-Disciplinary Research


Machine Learning, Libraries, and Cross-Disciplinary Research:
Possibilities and Provocations

General Editor

Daniel Johnson

Editors

Don Brower

Mark Dehmlow

Eric Morgan

Alex Papson

John Wang

Hesburgh Libraries, University of Notre Dame

Notre Dame, Indiana

2020

Contents

Preface

I Machine Learning in Academia

1 Artificial Intelligence in the Humanities: Wolf in Disguise, or Digital Revolution?

Arend Hintze and Jorden Schossau

2 Generative Machine Learning

Charlie Harper

3 Humanities and Social Science Reading through Machine Learning

Marisa Plumb

4 Machine Learning in Digital Scholarship

Andrew Janco

5 Cultures of Innovation: Machine Learning as a Library Service

Sue Wiegand

6 Cross-Disciplinary ML Research is like Happy Marriages: Five Strengths and Two Examples

Meng Jiang

7 AI and Its Moral Concerns

Bohyun Kim

II How-to’s and Case Studies: Machine Learning in Library Practice

8 Building a Machine Learning Pipeline

Audrey Altman


9 Fragility and Intelligibility of Deep Learning for Libraries

Michael Lesk

10 Bringing Algorithms and Machine Learning Into Library Collections and Services

Eric Lease Morgan

11 Taking a Leap Forward: Machine Learning for New Limits

Max Prud’homme

12 Machine Learning + Data Creation in a Community Partnership for Archival Research

Jason Cohen and Mario Nakazawa

13 Towards a Chicago place name dataset: From back-of-the-book index to a labeled dataset

Ana Lucic and John Shanahan

14 Can a Hammer Categorize Highly Technical Articles?

Samuel Hansen

Preface

This collection of essays is the unexpected culmination of a 2018–2020 grant from the Institute of Museum and Library Services to the Hesburgh Libraries at the University of Notre Dame.1

The plan called for a survey and a series of workshops hosted across the country to explore, originally, “the national need for library based topic modeling tools in support of cross-disciplinary discovery systems.” As the project developed, however, it became apparent that the scope of research should expand beyond topic modeling and that the scope of output might expand beyond a white paper. The end of the 2010s, we found, was swelling with library-centered investigations of broader machine learning applications across the disciplines, and our workshops demonstrated such a compelling mixture of perspectives on this development that we felt an edited collection of essays from our participants would be an essential witness to the moment in history. With remaining grant funds, we hosted one last workshop at Notre Dame to kick-start writing.

The resulting essays cover a wide ground. Some present a practical, “how-to” approach to the machine learning process for those who wish to explore it at their own institutions. Others present individual projects, examining not just technical components or research findings, but also the social, financial, and political factors involved in working across departments (and in some cases, across the town/gown divide). Others still take a larger panoramic view of the ethics and opportunities of integrating machine learning with cross-disciplinary higher education, veering between optimistic and wary viewpoints.

The multi-disciplinarity of the essayists and the diversity of their research give each chapter a sui generis flavor, though several shared concerns thread through the collection. Most significantly, the authors suggest that while the technical aspects of machine learning are a challenge, especially when working with collaborators from different backgrounds, many of their key concerns are actually about the ethical and social dimensions of the work. In this sense, the collection is very much of the moment. Two large projects on machine learning, cross-disciplinarity, and libraries ran concurrently with our grant — Cordell 2020 and Padilla 2019, which were commissioned by major players in the field, the Library of Congress and OCLC, respectively — and both took pains to foreground the wider potential effects of machine learning. As Ryan Cordell puts it, “current cultural attention to ML may make it seem necessary for libraries to implement ML quickly. However, it is more important for libraries to implement ML through their existing commitments to responsibility and care” (1).

The voices represented here exhibit a thorough commitment to Cordell’s call for responsibility and care, and they are only a subset of the larger chorus that sounded at the workshops. We editors therefore encourage readers interested in this bigger picture to examine the meta-themes and detailed information that emerged in the course of the workshops and the original survey through the grant’s final report.2 All of these pieces together capture a fascinating snapshot of an interdisciplinary field in motion.

1 LG-72-18-0221-18: “Investigating the National Need for Library Based Topic Modeling Discovery Systems.” See https://www.imls.gov/grants/awarded/lg-72-18-0221-18.

2 See https://doi.org/10.7274/r0-320z-kn58.

We should note that the working methods of the collection’s editorial team were an attempt to extend the grant’s spirit of collaboration. Through several stages of development, content editors Don Brower, Mark Dehmlow, Eric Morgan, Alex Papson, and John Wang reviewed assigned essays and provided commentary before notifying general editor Daniel Johnson for prose editing, who in turn shared the updated manuscripts with the authors so the cycle could begin again. The submissions, written variously in Microsoft Word or Google Docs format, were ushered through these stages of life in team Google Drive folders and tracked by spreadsheet before eventual conversion by Don Brower into a series of TeX files, provisioned in a version-controlled GitHub repository, for more fine-tuned final editing. Like working with diverse teams in the pursuit of machine learning, editing essays together in this fashion, for publication by the Hesburgh Libraries, was a novel way of collaborating, and we editors thought candor about this book-making process might prove insightful to readers.

Attending to the social dimensions of the work ourselves, we must note that this collection would not have been possible without the generous support of many people and organizations. We would like to thank the IMLS for providing essential funding support for the grant and the Hesburgh Libraries’ Edward H. Arnold University Librarian, Diane Parr Walker, for her organizational support. Thank you to the members of the Notre Dame IMLS grant team who, at its various stages, provided critical support in managing logistics, conducting research, facilitating workshops, and analyzing results. These individuals include John Wang (grant project director), Don Brower, Mark Dehmlow, Nastia Guimaraes, Melissa Harden, Helen Hockx-Yu, Daniel Johnson, Christina Leblang, Rebecca Leneway, Laurie McGowan, Eric Lease Morgan, and Alex Papson. The University of Notre Dame Office of General Counsel provided key publication advice, and the University of Notre Dame Office of Research provided critical support in administering the grant. Again, many thanks.

We would also like to thank the co-signatories of the IMLS Grant Application for supporting the project’s goals: Mark Graves (then Visiting Research Assistant Professor, Center for Theology, Science, and Human Flourishing, University of Notre Dame), Pamela Graham (Director of Global Studies and Director of the Center for Human Rights Documentation and Research, Columbia University Libraries), and Ed Fox (Professor of Computer Science and Director of the Digital Library Research Laboratory, Virginia Polytechnic Institute and State University). And of course, thanks to the 95 participants in our 2019 IMLS Grant Workshops (too many to enumerate here) and to the essay authors for sharing their expertise and perspectives in growing our collective knowledge of machine learning and its use in research, scholarship, and cultural heritage organizations. Your active engagement continues to shape the field, and we look forward to your next achievements.

References

Cordell, Ryan. 2020. “Machine Learning + Libraries: A Report on the State of the Field.” Commissioned by LC Labs, Library of Congress. https://labs.loc.gov/static/labs/work/reports/Cordell-LOC-ML-report.pdf.


Padilla, Thomas. 2019. “Responsible Operations: Data Science, Machine Learning, and AI in Libraries.” Dublin, Ohio: OCLC Research. https://www.oclc.org/research/publications/2019/oclcresearch-responsible-operations-data-science-machine-learning-ai.html.

Part I

Machine Learning in Academia


Chapter 1

Artificial Intelligence in the Humanities: Wolf in Disguise, or Digital Revolution?

Arend Hintze
Dalarna University

Jorden Schossau
Michigan State University

Introduction

Artificial Intelligence, with its ability to machine learn coupled to an almost human-like understanding, sounds like the ideal tool for the humanities. Instead of using primitive quantitative methods to count words or catalogue books, current advancements promise to reveal insights that otherwise could only be obtained by years of dedicated scholarship. But are these technologies imbued with intuition or understanding, and do they learn like humans? Are they capable of developing their own perspective, and can they aid in qualitative research?

In the 80s and 90s, as home computers were becoming more common, Hollywood was sensationalizing the idea of smart or human-like artificially intelligent machines (AI) through movies such as Terminator, Blade Runner, Short Circuit, and Bicentennial Man. At the same time, the home experience of personal computing highlighted the difference between Hollywood intelligent machines and the reality of how “dumb” machines really were. Neither home nor industry machines could answer natural language questions of anything but the simplest complexity. Instead, users or programmers needed to painstakingly implement an algorithm to address their question. Then, the user was required to wait for the machine to slavishly follow each instruction that was programmed while hoping that whoever entered the instructions did not make a mistake. Despite the sensation of Hollywood’s intelligent machines, people understood that computers did not and could not think like humans, but that they excel at performing repetitive tasks with extreme speed and fidelity. This shaped the expectations for interacting with computers. Computers became efficient tools that required specific instruction in order to achieve a desired outcome.

Computational technology and user experience drastically changed over the next 20 years. Technology became much more intuitive to use while it also became much more powerful at handling large data sets. For instance, Google can return search results for websites as a response to even the silliest or sparsest request, with a decent chance that the results are relevant to the question asked. Did you read a manual before you used your smartphone, or did you, like everyone else, just “figure it out”? Or consider how, as a consequence of modern-day media and its on-demand services, children ask to skip a song playing on broadcast radio. The older technologies quickly feel archaic.

These technological advancements go hand in hand with the developments in the field of machine learning and artificial intelligence. The automotive industry is on the cusp of fully self-driving cars. Electronic assistants are not only keeping track of our dates and responding to spoken language, they will also soon start making our appointments by speaking to other humans on our behalf. Databases are getting new voice-controlled intuitive interfaces, changing a typical incomprehensible “SELECT AVG(salary) FROM employeeList WHERE yearHired > 2012;” to a spoken “Average salary of our employees hired after 2012?”

Another phenomenon is the trend in many disciplines to go from “qualitative” to “quantitative” research, or to think about the “system” rather than the “components.” The field that probably experienced this trend first was biology. While obviously descriptive about species of organisms, biologists also always wanted to understand the mechanisms that drive life on earth, spanning micro to macro scales. Consequently, a lot is known about the individual chemical components that constitute our metabolism, the components that drive cell division and DNA replication, and which genes are involved in, for example, developmental processes. However, in many cases, our scientific knowledge only covers single functions of single components. In the context of the cell, the state of the organism and how other components interact matters a lot. Cancer, for example, cannot be explained by a single mutation on a single gene but involves many complex interactions (Hanahan and Weinberg 2011). Ecosystems don’t collapse because a single insect dies, but because indirect changes in the food chain interact in complex ways (for a review of the different theories, see Tilman 1996). As a result, systems biology emerged. Systems biologists use large data sets and are often dependent on computer models to understand phenomena on the systems level.

Bioinformatics is one example of an entire field that emerged from using computers to study systems that were otherwise humanly intractable. The human genome project to sequence the complete human genome finished in 2003, a time when consumer data storage was limited by the amount of data that fit on a DVD (4.7 GB). While the human genome fits on a DVD, the data that came from the sequencing machines was much larger. Short repetitive sequences first needed assembly, which at that time was a high-performance computing task.

Other fields have since undergone their own computational revolutions, and now the humanities are beginning theirs. Computers have been a part of core library infrastructure and experience for some time now, by cataloging entries in a database and allowing intuitive user exploration of that database. However, the digital humanities go beyond this (Fitzpatrick 2012). The ability to analyze (crawl) extremely large corpora of different sources, monitor the internet using the Internet of Things as large sensor arrays, and detect patterns by using sophisticated algorithms can each produce a treasure trove of quantitative data. Until this point these tasks could only be described or analyzed qualitatively.

Additionally, artificial intelligence promises models of the human mind (Yampolskiy and Fox 2012). Machine learning allows us to learn from these data sets in ways that exceed human capabilities, while an artificial brain will eventually allow us to objectively describe a subjective experience (through quantifying neural activations or positively and negatively associated memories). This would ultimately close the gap between quantitative and qualitative approaches by allowing an inspection of experience.

However, this bridging between quantitative and qualitative methods causes a possible tension for the humanities, which historically defines itself by qualitative methodologies. When qualitative experiences or responses can be finely quantified, such as the sadness caused by reading a particular passage, or the curiosity caused by viewing certain works of art, then the field will undergo a revolution. When this happens, we will be able to quantify and discuss how sadness was learned by reading, or how much surprise was generated by viewing an artwork.

This is exactly the point where the metaphors break down. Current computational models of the mind are not sophisticated enough to allow these kinds of inferences. Machine learning algorithms work well for what they do but have nothing to do with what a person would call learning. Artificial intelligence is a broad, encompassing field. It includes methods that might have appeared to be magic only a couple of years ago (such as generative adversarial networks). Algorithmic finesse resulting from these advances is capable of beating humans in chess (Campbell, Hoane Jr, and Hsu 2002), but it is only a very specialized algorithm that has nothing to do with the way humans play or learn chess. This means we are back to the problem we had in the 80s. Instead of being disappointed by the difference between modern technology and Hollywood technology, we are disappointed by the difference between modern technology and the experience implied by the labels given to those technologies. Applying misnomers such as “smart,” “intelligent,” “search,” and “learning” to modern technologies that have little to do with those terms is misleading. It is possible that such technology was deliberately branded with these terms for improved marketing and sales, effectively redefining them and obscuring their original meaning. Consequently, we again are disappointed by the mismatch between the expectations of our computing infrastructure and the reality of our experiences.

The following paragraphs will explore current machine learning and artificial intelligence technologies, explain how quantitative or qualitative they really are, and consider what the possible implications for the future digital humanities could be.

Learning: Phenomenon versus Mechanism

Learning is an electrochemical process that involves cells, their genetic makeup, and how they are interconnected. Some interplay between external stimuli and receptor proteins in specialized sensor neurons leads to electrochemical signals propagating over a network of interconnected cells, which themselves respond with physical and genetic changes to said stimuli, probably also dependent on previous stimuli (Kandel, Schwartz, and Jessell 2000). This concoction of elaborate terms might suggest that we know in principle which parts are involved and where they are, but we are far from an understanding of the learning mechanism. The description above is as generic as saying that a city functions because cars drive on streets. Even though we might know a lot about long-term potentiation or the mechanism of neurons that fire together wiring together (aka Hebbian learning), neither of these processes actually mechanistically explains how learning works. Neuroscience, neurophysiology, and cognitive science have not been able to discover this complete process in such a way that we can replicate it, though some inroads are being made (El-Boustani et al. 2018). Similarly, we find promising new interdisciplinary efforts like “cognitive computational neuroscience” that try to bridge the gap between neuro- and cognitive science and computation (Kriegeskorte and Douglas 2018). So, unfortunately, while the components involved can be identified, the question of “how learning works” cannot be answered mechanistically.

However, a lot is known about the phenomenon of learning. It happens during the lifetime of an organism. What happens between the lifetimes of related organisms is an adaptive process called evolution: inheritance, variation, and natural selection over many generations, up to 3.5 billion years here on Earth, enabled populations of organisms to succeed in their environments in any way they could. Evolutionary forces found ways for organisms to adapt to their environment during their own lifetimes. While this can take many forms, such as storing energy, seeking shelter, or having a fight-or-flight response, it has led to the phenomenon we now call learning. Instead of discussing the diversity of learning in the animal kingdom, we will discuss the richest example: human learning.

Here, learning is defined as the cognitive adaptation to external stimulus. The phenomenon of learning can be observed as an increase in performance over time. Learning makes the organism better at doing something. In humans, because we have language and a much higher degree of abstract thinking, an improvement in performance can be facilitated very quickly. While it takes time to learn how to juggle, the ability to find the mean of a series of samples can be quickly communicated by reading Wikipedia. Both types of lifetime adaptations are called learning. However, these lifetime adaptations are facilitated by two different cognitive processes: explicit or implicit learning.1 Explicit learning—or episodic memory—is fact-based memory. What you did yesterday, what happened in your childhood, or the list of things you should buy when you go shopping, are all memories. Currently, the engram theory best explains this mechanism (Poo et al. 2016 elaborates on the origins of the term). Explicit memory can be retrieved relatively easily and then used to inform future decisions: “Press the green button if the capital of Italy is Paris, otherwise press the red.” The rate of learning for explicit memory can be much higher than for implicit memory, and it can also be communicated more quickly. Abstract communication, such as “I saw a wolf,” allows us to transfer the experience of seeing a wolf quickly to other individuals, even though their evoked explicit memory might not be identical to ours.

Learning by using implicit memory—sometimes called procedural memory—is facilitated by much slower processes (Schacter, Chiu, and Ochsner 1993). It is generally based on the idea that learning is a combination of expectation, observation or action, and internal model changes. For example, a recovering hospital patient who has suffered a stroke is handed an apple. In this exchange, the patient forms an expectation of where his hand will be to accept the apple. He engages his muscles to move his forearm and hand to accept the apple, which is his action. Then the patient observes that his arm did not arrive at the correct position (due to neurological damage). This discrepancy between expectation and action-outcome drives internal changes so that the patient’s brain learns how to adequately control the arm. Presumably, everything considered a skill is based on this process. While very flexible, this form of memory is not easily communicated nor fast to acquire. For instance, while juggling can be described, it cannot be communicated in such a way that it enables the recipient to juggle without additional training.

This description of explicit and implicit learning is an amalgamation of many different hypotheses and observations. Also, these processes are not as well segregated in practice as outlined here. What is important is what these two learning mechanisms are based on: observations lead to memory, and internal predictions together with exploration lead to improved models about the world. Lastly, these learning processes only exist in organisms because they previously conferred an evolutionary advantage: organisms that could memorize and then act on those memories had more offspring than those that did not. This interaction of learning and evolution is called the Baldwin effect (Weber and Depew 2003). Organisms that could explore the environment, make predictions about it, and use observations to optimize their internal models were similarly more capable than organisms that could not.

1 There are more than these two mechanisms, but these are the two major ones.

Machines do not Learn; They are Trained

Now prepared with a proper intuition about learning, we can turn our attention to machine learning. After all, our intuitions should be meaningful in the computational domain as well, because learning always follows the same pattern. One might be disappointed to look over the table of contents of a machine learning book and find only methods for creating static transformation functions (see Russell and Norvig 2016, one of the putative foundations of machine learning and AI). There will typically be a distinction between supervised and unsupervised learning, between categorical and continuous data, and maybe a section about other “smart” algorithms. You will not find a discussion of implicit and explicit memory, let alone methods for implementing these concepts. So, if these important sections in our imaginary machine learning book do not discuss the mechanisms of learning, then what are they discussing?

Unsupervised learning describes algorithms that report information based on associations within the data. Clustering algorithms are a popular example of unsupervised learning. These use similarity between data points to form and report on distinct groups of data. Clustering is a very important method but is only a well-designed algorithm that is not adaptive.
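
To make this concrete, here is a minimal sketch of k-means, the archetypal clustering algorithm, in Python. The two-dimensional points and the choice of k are illustrative assumptions, not taken from the essay; the point is that the grouping rule is fixed by design, not learned:

    # A minimal k-means sketch: group 2-D points purely by similarity.
    # The points and k below are illustrative assumptions.
    import random

    def kmeans(points, k, iterations=50):
        centers = random.sample(points, k)  # start from k random points
        clusters = []
        for _ in range(iterations):
            # Assignment step: attach each point to its nearest center.
            clusters = [[] for _ in range(k)]
            for x, y in points:
                dists = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centers]
                clusters[dists.index(min(dists))].append((x, y))
            # Update step: move each center to the mean of its cluster.
            for i, c in enumerate(clusters):
                if c:
                    centers[i] = (sum(p[0] for p in c) / len(c),
                                  sum(p[1] for p in c) / len(c))
        return centers, clusters

    points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1),
              (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
    centers, clusters = kmeans(points, k=2)
    print(centers)  # one center near (1, 1), the other near (5, 5)

Nothing in the loop adapts once it converges; the “groups” are a fixed consequence of the distance rule and the data.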

Supervised learning describes algorithms that refine a transformation function to convert from a certain input to a certain output. The idea is to balance specific and general refining such that the transformation function correctly transforms all known examples but generalizes enough to work well on new variations. For example, we would like the machine to transform image data into textual labels, such as “house” or “car.” The input is an image and the output is a label. The input image data are provided to the machine, and small adjustments to the machine’s function are made depending on how well it provided the correct output. Many iterations later, this will ideally result in a machine that can transform all image data to correct labels, and even operate correctly on new variations of images not provided before. Supervised learning is extremely powerful and has yet to be fully explored. However, supervised learning is quite dissimilar to actual learning.
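
The “small adjustments” can likewise be sketched in a few lines. The following toy supervised learner (an illustration under assumed data, not any system discussed here) refines a one-weight transformation function until it maps known inputs to known outputs and generalizes to new ones:

    # Toy supervised learning: refine the transformation y = w * x so it
    # reproduces labeled examples. Data and learning rate are assumptions.
    examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, desired output)

    w = 0.0  # the machine's adjustable "function"
    for epoch in range(200):
        for x, target in examples:
            error = target - w * x   # how wrong was the output?
            w += 0.01 * error * x    # small adjustment toward correctness

    print(round(w, 2))       # approaches 2.0
    print(round(w * 10, 1))  # generalizes: unseen input 10 maps to about 20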

A common argument is that supervised learning uses feedback in a “student-teacher” paradigm of making changes with feedback until proper behavior is achieved, so it could be considered learning. But this feedback is external, objective, and not at all similar to our prediction-and-comparison model, which, for instance, operates without an all-knowing oracle whispering “good” or “bad” into our ears. Humans and other organisms instead compare predictions with outcomes, and the choices are driven by an intersection of desire and prediction.

What seems astonishing is the diverse and specialized capabilities that these two rather simple types of computation, clustering and classification, can produce. Their economic impact is enormous, and we are still finding new ways to combine neural networks and exploit deep learning techniques to create amazing data transformations, such as deep fake videos. But so far, each astounding example of AI, through machine learning or some other method, is not showcasing all these capabilities as one machine; instead, each is an independently achieved computational marvel. Each of these examples does only exactly what it was trained to do in a narrow domain and no more. Siri, or any other voice assistant for that matter, does not drive a car (López, Quesada, and Guerrero 2017), Watson does not play chess (Ferrucci et al. 2013), and Google AlphaGo cannot understand spoken language (Gibney 2016). Even hybrid approaches, such as combining speech recognition, chess playing, and autonomous driving, would only be a combination of specialty strategies, not an entity trained from the ground up.

Modern machine learning gives us an amazing collection of very applicable, but extremely specialized, computational tools that may be customized to particular data sets, but the resulting machines do not learn autonomously as you or I do. There are cutting-edge technologies, such as so-called neuromorphic chips (Nawrocki, Voyles, and Shaheen 2016) and other computational brain models that more closely mimic brain function, but they are not what has been sensationalized in the media as machine learning or AI, and they have yet to showcase competence on difficult problems competitive with standard supervised learning.

Curiously, many people in the machine learning community defend the term “learning,” arguing there is no difference between learning and training. In traditional machine learning, the trained algorithm is deployed as a service, after which it no longer improves. If the data set ever changes, then a new training set including correct labels needs to be generated and a new training phase initiated. However, if the teacher can be forever bundled with the learner and training continued during the deployment phase, even on new, never-before-seen data, then indeed the delineation between learning and training is far less clear. This is the objective of Continuous Delivery for machine learning. Approaches to such lifelong learning exist, but they struggle with what is called catastrophic forgetting—the phenomenon that only the most recent experiences are learned, at the expense of older ones (French 1999). Unfortunately, creating a new training set is typically the most expensive endeavor for standard supervised machine learning development. Adequate training then becomes difficult or impossible without involving thousands or millions of human inputs to keep up with training and using the online machine on an ever-evolving data set. Some have tried to use such “human-in-the-loop” methods, but the resulting machine then becomes only a slight extension of the humans who are forever caught in the loop. Is it an intelligent machine, or a human trapped in a machine?

To combat this problem of generating the training set, researchers altered the standard supervised learning paradigm of flexible learner and rigid teacher to make the teacher likewise flexible to generate new data, continually probing the bounds of the student machine. This is the method of Generative Adversarial Networks, or GANs (Goodfellow et al. 2014). The teacher generates training examples and the student discerns between those generated examples and the original labeled training data. After many iterations, the teacher is improved to better fool the student, and the student is improved to better discern generated training data. As amazing as they are, GANs only partially mitigate the problematic requirement for human-labeled training data, because GANs can only mimic a known labeled distribution. If that distribution ever changes, then new labeled data must be generated, and again we have the same problem as before. Unfortunately, GANs have been sensationalized as magic, and public and hobbyist expectation is that GANs are a way toward much better artificial intelligence. Disappointment is inevitable because GANs only allow us to explore what it would be like to have more training data from the same data sets we were using before.


These expectations are important for machine learning and AI. We are very familiar with learning, to the point where our whole identity as humans could be generously defined as the result of being a monkey with an exceptional proclivity for learning. If we now approach AI and machine learning with expectations that these technologies learn as we do, or are an equally general-purpose intelligence, then we will be bitterly disappointed. The best example of such discrepancy is how easily neural networks trained by deep learning can be fooled. Images that are seemingly identical and differ only by a few pixels are grossly misclassified, a mistake no human would make (Nguyen, Yosinski, and Clune 2015). Fortunately, we know about these biases and the possible shortcomings of these methods. As long as we have the right expectations, we can take their flaws into account and still enjoy the prospects they provide.

Trained Machines: Tool or Provocation?

On one side we have the natural sciences, characterized by hypothesis-driven experimentation reducing reality to an abstract model of causal interactions. This approach can inform us about the consequences of our possible actions, but only as far in the future as the model can adequately predict. With machine learning and AI, we can move this temporal horizon of prediction farther into the future. While weather models might still struggle to predict precipitation 7 days in advance, global climate models predict in detail the effects of global warming in 100 years. But these models are nihilist, void of values, and cannot themselves answer the question of whether humans would prefer to live in one possible future or another. Is sunshine better than rain? The humanities, on the other hand, are home to exactly these problems. What are our values? How do we understand what is essential? Now that we know the facts, how should we choose? Do we speak for everyone? The questions seem to be endless, but they are what makes our human experience so special, and what separates the humanities from the sciences.

Labels—such as learning or intelligence—are too easily anthropomorphized. A technology branded in this way suggests human-like properties: intelligence, common sense, or even subjective opinion. From a name like “deep learning” we expect a system that develops a deep and intuitive understanding, with insights more profound than our own. However, these systems do not provide an alternative perspective but, as explained above, are only as good or as biased as the scientist selecting their training data. Just because humans and machine learning are both black boxes, in the sense that their inner workings are opaque, does not mean they share other qualities. For instance, having labeled the ML training process as “learning” does not imply that ML algorithms are curious and learn from observations. While these new computerized quantitative measures might be welcomed by some scholars, there will be others who view them as an existential threat to the very nature of the humanities. Are these quantitative methods sneaking into the humanities disguised by anthropomorphic terms, like a wolf shrouded in a sheep’s fleece? From this viewpoint, having the wrong expectations not only provokes disappointment but also floods the humanities with sophisticated technologies that dilute and muddy the nature of the qualitative research that makes the humanities special.

However, this imminent clash between quantitative and qualitative research also provides a unique opportunity. Suppose there is a question that can only be answered subjectively and qualitatively. If so, it would define a hard boundary against the aforementioned reductionism of the purely causal quantitative approach. At the same time, such a boundary presents the perfect target for an artificially intelligent system to prove its utility. If a computational human analog can be created, then it must be capable of performing the same tasks as a humanities researcher. In other words, it must be able to answer subjective and qualitative questions, regardless of its computational and quantitative construction. Failing at such a task would be equivalent to failing the famous Turing test, thereby proving the AI is not yet human-like enough. In this way, the qualitative nature of the humanities poses a challenge—and maybe a threat—to artificially intelligent systems. While some might say the threat is mutual, past successes of interdisciplinary research suggest otherwise: the digital humanities could become the forefront of AI research.

Beyond machine training, towards general-purpose intelligence

Currently, machines do not learn but must be trained, typically with human-labeled data. ML algorithms are not smart as we are, but they can solve specific tasks in sophisticated ways. Perhaps sentience will only be a product of enough time and training data, but the path to sentience probably requires more than time and data. The process that gave rise to human intelligence was evolution. This opportunistic process optimized brains over endless generations to perform ever-changing tasks, and it is the only known example of a process that resulted in such complex intelligence. None of the earlier described computational methods even remotely follow this paradigm: researchers designed ad hoc algorithms that solved well-defined problems. The next iteration of these methods is either an incremental improvement of existing code, a new methodological invention, or an application to a new data set. These improvements do not compound to make AI tools better generalists, but instead contribute to the diversity of the existing tools.

One approach that does not suffer from these shortcomings is neuro-evolution (Floreano, Dürr, and Mattiussi 2008). Currently, the field of neuroevolution is in its infancy, but finding new and creative solutions to otherwise unsolved problems, such as controlling robots or driving cars, is a popular area of focus (Lehman et al. 2020). At the same time, memory formation (Marstaller, Hintze, and Adami 2013), information integration in the brain (Tononi 2004), and how systems evolve the ability to learn (Sheneman, Schossau, and Hintze 2019) are also being researched, as they are building blocks of general-purpose intelligence. While it is not clear how thinking machines will ultimately emerge, they are on the horizon. The dualism of a quantitative system that can be subjective and understand the qualitative nature of existence makes such a machine a strange artifact that cannot be ignored.

References

Campbell, Murray, A Joseph Hoane Jr, and Feng-hsiung Hsu. 2002. “Deep Blue.” Artificial Intelligence 134 (1–2): 57–83.

El-Boustani, Sami, Jacque P K Ip, Vincent Breton-Provencher, Graham W Knott, Hiroyuki Okuno, Haruhiko Bito, and Mriganka Sur. 2018. “Locally Coordinated Synaptic Plasticity of Visual Cortex Neurons in Vivo.” Science 360 (6395): 1349–54.

Ferrucci, David, Anthony Levas, Sugato Bagchi, David Gondek, and Erik T Mueller. 2013. “Watson: Beyond Jeopardy!” Artificial Intelligence 199: 93–105.

Fitzpatrick, Kathleen. 2012. “The Humanities, Done Digitally.” In Debates in the Digital Humanities, edited by Matthew K. Gold, 12–15. Minneapolis: University of Minnesota Press.


Floreano, Dario, Peter Dürr, and Claudio Mattiussi. 2008. “Neuroevolution: From Architectures to Learning.” Evolutionary Intelligence 1 (1): 47–62.

French, Robert M. 1999. “Catastrophic Forgetting in Connectionist Networks.” Trends in Cognitive Sciences 3 (4): 128–35.

Gibney, Elizabeth. 2016. “Google AI Algorithm Masters Ancient Game of Go.” Nature News 529 (7587): 445.

Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. “Generative Adversarial Nets.” In Advances in Neural Information Processing Systems 27 (NIPS 2014), edited by Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger, 2672–80. N.p.: Neural Information Processing Systems Foundation.

Hanahan, Douglas, and Robert A Weinberg. 2011. “Hallmarks of Cancer: The Next Generation.” Cell 144 (5): 646–74.

Kandel, Eric R, James H Schwartz, and Thomas M Jessell. 2000. Principles of Neural Science. 4th ed. New York: McGraw-Hill.

Kriegeskorte, Nikolaus, and Pamela K Douglas. 2018. “Cognitive Computational Neuroscience.” Nature Neuroscience 21: 1148–60.

Lehman, Joel et al. 2020. “The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities.” Artificial Life 26 (2): 274–306.

López, Gustavo, Luis Quesada, and Luis A Guerrero. 2017. “Alexa vs. Siri vs. Cortana vs. Google Assistant: A Comparison of Speech-Based Natural User Interfaces.” In International Conference on Applied Human Factors and Ergonomics, edited by Isabel L. Nunes, 241–50. Cham: Springer.

Marstaller, Lars, Arend Hintze, and Christoph Adami. 2013. “The Evolution of Representation in Simple Cognitive Networks.” Neural Computation 25 (8): 2079–2107.

Nawrocki, Robert A, Richard M Voyles, and Sean E Shaheen. 2016. “A Mini Review of Neuromorphic Architectures and Implementations.” IEEE Transactions on Electron Devices 63 (10): 3819–29.

Nguyen, Anh, Jason Yosinski, and Jeff Clune. 2015. “Deep Neural Networks Are Easily Fooled: High Confidence Predictions for Unrecognizable Images.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 427–36. N.p.: IEEE.

Poo, Mu-ming et al. 2016. “What Is Memory? The Present State of the Engram.” BMC Biology 14: 1–18.

Russell, Stuart J, and Peter Norvig. 2016. Artificial Intelligence: A Modern Approach. Malaysia: Pearson Education Limited.

Schacter, Daniel L, C-Y Peter Chiu, and Kevin N Ochsner. 1993. “Implicit Memory: A Selective Review.” Annual Review of Neuroscience 16 (1): 159–82.

Sheneman, Leigh, Jory Schossau, and Arend Hintze. 2019. “The Evolution of Neuroplasticity and the Effect on Integrated Information.” Entropy 21 (5): 1–15.

Tilman, David. 1996. “Biodiversity: Population versus Ecosystem Stability.” Ecology 77 (2): 350–63.

Tononi, Giulio. 2004. “An Information Integration Theory of Consciousness.” BMC Neuroscience 5: 1–22.

Weber, Bruce H, and David J Depew. 2003. Evolution and Learning: The Baldwin Effect Reconsidered. Cambridge, MA: MIT Press.


Yampolskiy, Roman V, and Joshua Fox. 2012. “Artificial General Intelligence and the Human Mental Model.” In Singularity Hypotheses: A Scientific and Philosophical Assessment, edited by Ammon H. Eden, James H. Moor, Johnny H. Søraker, and Erik Steinhart, 129–45. Heidelberg: Springer.

Chapter 2

Generative Machine Learning

Charlie Harper, PhD
Case Western Reserve University

Introduction

Generative machine learning is a hot topic. With the 2020 election approaching, Facebook and Reddit have each issued their own bans on the category of machine-generated or -altered content that is commonly termed “deep fakes” (Cohen 2020; Romm, Harwell, and Stanley-Becker 2020). Calls for regulation of the broader, and very nebulous, category of fake news are now part of US political debates, too. Although well known and often discussed in newspapers and on TV because of their dystopian implications, deep fakes are just one application of generative machine learning. There is a remarkable need for others, especially humanists and social scientists, to become involved in discussions about the future uses of this technology, but this first requires a broader awareness of generative machine learning’s functioning and power. Many articles on the subject of generative machine learning exist in specialized, highly technical literature, but there is little that covers this topic for a broader audience while retaining important high-level information on how the technology actually operates.

This chapter presents an overview of generative machine learning with particular focus on generative adversarial networks (GANs). GANs are largely responsible for the revolution in machine-generated content that has occurred in the past few years, and their impact on our future extends well beyond that of producing purposefully-deceptive fakes. After covering generative learning and the working of GANs, this chapter touches on some interesting and significant applications of GANs that are not likely to be familiar to the reader. The hope is that this will serve as the start of a larger discussion on generative learning outside of the confines of technical literature and sensational news stories.



Figure 2.1: The three most common letters following “F” in two Markov chains trained on an English and an Italian dictionary. Three examples of generated words are given for each Markov chain, showing how the Markov chain captures high-level information about letter arrangements in the different languages.

What is Generative Machine Learning?

Machine learning, which is a subdomain of Artificial Intelligence, is roughly divided into three paradigms that rely on different methods of learning: supervised, unsupervised, and reinforcement learning (Murphy 2012, 1–15; Burkov 2019, 1–8). These differ in the types of datasets used for learning and the desired applications. Supervised and unsupervised machine learning use labeled and unlabeled datasets, respectively, to assign unseen data to human-generated labels or statistically-constructed groups. Both supervised and unsupervised approaches are commonly used for classification and regression problems, where we wish to predict categorical or quantitative information about new data. A combined form of these two paradigms, called semi-supervised learning, that mixes labeled and unlabeled data also exists. Reinforcement learning, on the other hand, is a paradigm in which an agent learns how to function in a specific environment by being rewarded or penalized for its behavior. For example, reinforcement learning can be used to train a robot to successfully navigate around obstacles in a physical space.

Generative machine learning, rather than being a specific learning paradigm, encompasses an ever-growing variety of techniques that are capable of generating new data based on learned patterns. The process of learning these patterns can engage both supervised and unsupervised learning. A simple, statistical example of one type of generative learning is a Markov chain. From a given set of data, a Markov chain calculates and stores the probabilities of a following state based on a current state. For example, a Markov chain can be trained on a list of English words to store the probabilities of any one letter occurring after another letter. These probabilities chain together to represent the chance of moving from the current letter state (e.g. the letter q) to a succeeding letter state (e.g. the letter u) based on the data from which it has learned.
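
Such a letter-level Markov chain is only a few lines of code. The sketch below, in Python, trains on a tiny illustrative word list (a stand-in for a full dictionary, which is my own assumption) and then samples new “words” letter by letter:

    # Letter-level Markov chain: learn P(next letter | current letter)
    # from a word list, then sample new "words." The word list is an
    # illustrative stand-in for a full English dictionary.
    import random
    from collections import defaultdict

    words = ["fabric", "face", "fact", "fair", "fall", "false", "fame"]

    # Training: count transitions, with a start (^) and end ($) state.
    counts = defaultdict(lambda: defaultdict(int))
    for w in words:
        chain = "^" + w + "$"
        for a, b in zip(chain, chain[1:]):
            counts[a][b] += 1

    def sample_word():
        state, out = "^", ""
        while True:
            letters, weights = zip(*counts[state].items())
            state = random.choices(letters, weights=weights)[0]
            if state == "$":
                return out
            out += state

    print([sample_word() for _ in range(3)])  # statistically random "words"

Swapping in an Italian word list would change the counts, and therefore the flavor of the generated words, which is all the difference between the two chains in Figure 2.1.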

If another Markov chain were trained on Italian words instead of English, the probabilities would change, and for this reason, Markov chains can capture important high-level information about datasets (Figure 2.1). They can then be sampled to generate new data by starting from a random state and probabilistically moving to succeeding states. In Figure 2.1, you can see the probability that the letter “F” transitions to the three most common succeeding letters in English and Italian. A few examples of “words” generated by two Markov chains trained on an English and an Italian dictionary are also given. The example words are generated by sampling the probability distributions of the Markov chain, letter by letter, so that the generated words are statistically random, but guided by the learned probability of one letter following another. The different probabilities of letter combinations in English and Italian result in distinctly different generated words. This exemplifies how a generative model can capture specific aspects of a dataset to create new data.

Figure 2.2: Images generated with a simple statistical model appear as noise, as the model is insufficient to capture the structure of the real data (Markov chains trained using wine bottles and circles from Google’s QuickDraw dataset).

The letter combinations are nonsense, but they still reflect the high-level structure of Italian and English words in the way letters join together, such as the different utilization of vowels in each language. These basic Markov chains demonstrate the essence of generative learning: a generative approach learns a distribution over a dataset, or in other words, a mathematical representation of a dataset, which can then be sampled to generate new data that exists within the learned structure of that dataset. How convincing the generated data appears to a human observer depends on the type and tuning of the machine learning model chosen and the data upon which the model has been trained. So, what happens if we build a comparable Markov chain with image data1 instead of words, and then sample, pixel by pixel, from it to generate new images? The results are just noise, and the generated images reveal no hint of a wine bottle or circle to the human eye (Figure 2.2).

The very simple generative statistical model we have chosen to use is incapable of capturing the distribution of the underlying images sufficiently enough to produce realistic new images. Other types of generative statistical models, like Naive Bayes or a higher-order Markov chain,2 could perhaps capture a bit more information about the training data, but they would still be insufficient for real-world applications like this.3

1 In many examples, I have used the Google QuickDraw Dataset to highlight features of generative machine learning. The dataset is freely available (https://github.com/googlecreativelab/quickdraw-dataset) and licensed under CC BY 4.0.

2 The order of a Markov chain reflects how many preceding states are taken into account. For example, a 2nd-order Markov chain would look at the preceding two letters to calculate the probability of a succeeding letter. Rudimentary autocomplete is a good example of Markov chains in application.


Image, video, and audio are complicated; it is hard to reduce them to their essence with basic statistical rules in the way we were able to with the ordering of letters in English and Italian. Capturing the intricate and often-inscrutable distributions that underlie real-world media, like full-sized photographs of people, is where deep (i.e. using neural networks) generative learning shines and where generative adversarial networks have revolutionized machine-generated content.

Generative Adversarial Networks

The problem of capturing the complexity of an image so that a computer can generate new images leads directly to the emergence of Generative Adversarial Networks, which are a neural-network-based model architecture within the broader sphere of generative machine learning. Although prior deep learning approaches to generating data, particularly variational autoencoders, already existed, it was a breakthrough in 2014 that changed the fabric and power of generative machine learning. Like every big development, it has an origin story that has moved into legend with its many retellings. According to the handed-down tale (Giles 2018), in 2014 doctoral student Ian Goodfellow was at a bar with friends when the topic of generating photos arose. His friends were working out a method to create realistic images by using complex statistical analyses of existing images. Goodfellow countered that it would not work; there were too many variables at play within such data. Instead, he put forth the idea of pairing two neural networks against each other in a type of zero-sum game where the goal was to generate believable fake images. According to the story, he developed this idea into working code that night, and his paired neural network architecture produced results the very first time. This was the birth of Generative Adversarial Networks, or GANs. Goodfellow’s work was quickly disseminated in what is one of the most influential papers in the recent history of machine learning (Goodfellow et al. 2014).

GANs have progressed in almost miraculous ways since 2014, but the crux of their architecture remains the coupling of two neural networks. Each neural network has a specific function in the pairing. The first network, called the generator, is tasked with generating fake examples of some dataset. To produce this data it randomly samples from an n-dimensional latent space often labeled Z. In simple terms, the generator takes random noise (really a random list of n numbers, where n is the dimensionality of the latent space) as its input and outputs its attempt at a fake piece of data, such as an image, clip of audio, or row of tabular information. The second neural network, called the discriminator, takes both fake and real data as input. Its role is to correctly discriminate between fake and real examples.4 The generator and discriminator networks are then coupled together as adversaries, hence “adversarial” in the name. The output from the generator flows into the discriminator, and information on the success or failure of the discriminator to identify fakes (i.e. the discriminator’s loss) flows back through the network so that the generator and discriminator each knows how well it is performing compared to the other. All of this happens automatically, without any need for human supervision. When the generator finds it is doing poorly, it learns to produce better examples by updating its weights and biases through traditional backpropagation (see especially Langr and Bok 2019, 3–16 for a more detailed summary of this). As backpropagation updates the generator network’s weights and biases, the generator inherently begins to map regions of the randomly sampled Z space to characteristics found in the real dataset.

3 This is not to imply that these models do not have immense practical applications in other areas of machine learning.

4 Its function is exactly that of any other binary classifier found in machine learning.


Figure 2.3: At the heart of a GAN are two neural networks, the generator and the discriminator. As the generator learns to produce fake data, the discriminator learns to separate it out. The pairing of the two in an adversarial structure forces each to improve at its given task.

Figure 2.4: A GAN being trained on wine bottle sketches from Google’s QuickDraw dataset (https://github.com/googlecreativelab/quickdraw-dataset) shows the generator learning how to produce better sketches over time. Moving from left to right, the generator begins by outputting random noise and progressively generates better sketches as it tries to trick the discriminator.

Contrarily, as the discriminator finds that it is not accurately identifying better fakes, it learns to separate these out in new ways.

At first, the generator outputs random data and the discriminator easily catches these fakes (Figure 2.4). As the results of the discriminator feed back into the generator, however, the generator learns to trick its foe by creating more convincing fakes. The discriminator consecutively learns to better separate out these more convincing fakes. Turn after turn, the two networks drive one another to become better at their specialized tasks and the generated data becomes increasingly like the real data.5 At the end of training, ideally, it will not be possible to distinguish between real and fake (Figure 2.5).
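
In code, this adversarial turn-taking reduces to two loss terms and two optimizers. The following PyTorch sketch of a single training step is an illustration under assumed sizes; the latent dimension, layer widths, and the stand-in “real” data are my own placeholders, not details from the chapter or the original paper:

    # Minimal GAN training step in PyTorch. Network sizes, batch size,
    # and the real-data source are illustrative assumptions.
    import torch
    import torch.nn as nn

    z_dim, data_dim, batch = 16, 64, 32

    generator = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(),
                              nn.Linear(128, data_dim), nn.Tanh())
    discriminator = nn.Sequential(nn.Linear(data_dim, 128), nn.LeakyReLU(0.2),
                                  nn.Linear(128, 1), nn.Sigmoid())

    loss = nn.BCELoss()
    opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

    real = torch.rand(batch, data_dim) * 2 - 1  # stand-in for real samples

    # 1) Train the discriminator to tell real from fake.
    z = torch.randn(batch, z_dim)               # sample the latent space Z
    fake = generator(z).detach()                # freeze the generator here
    d_loss = (loss(discriminator(real), torch.ones(batch, 1)) +
              loss(discriminator(fake), torch.zeros(batch, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Train the generator to fool the discriminator.
    z = torch.randn(batch, z_dim)
    g_loss = loss(discriminator(generator(z)), torch.ones(batch, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

Run in a loop over real batches, step 1 sharpens the discriminator while step 2 pushes the generator to produce data the discriminator scores as real, which is the dynamic shown in Figure 2.4.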

In the original publication, the first GANs were trained on sets of small images, like the Toronto Face Dataset, which contains 32 × 32 pixel grayscale photos of faces and facial expressions (Goodfellow et al. 2014). Although the generator’s results were convincing when compared to the originals, the fake images were still small, colorless, and pixelated. Since then, an explosion of research into GANs and increased computational power have led to strikingly realistic images.

5 See https://poloclub.github.io/ganlab/ (accessed Jan 17, 2020) (Kahng et al. 2019).


Figure 2.5: The fully trained generator from Figure 2.4 produces examples that are not readily distinguishable from real-world data. The top row of sketches was produced by the GAN and the bottom row was drawn by humans.

The most recent milestone was reached in 2019 by researchers with NVIDIA, who built a GAN that generates high-quality photo-realistic images of people (Karras, Laine, and Aila 2019). When contrasted with the results of 2014 (Figure 2.6), the stunning progression of GANs is self-evident, and it is difficult to believe that the person on the right does not exist.

Some Applications of Generative Adversarial Networks

Over the past five years, many papers on implementations of GANs have been released by researchers (Alqahtani, Kavakli-Thorne, and Kumar 2019; Wang, She, and Ward 2019). The list of applications is extensive and ever growing, but it is worth pointing out some of the major examples as of 2019 and why they are significant. These examples highlight the vast power of GANs and underscore the importance of understanding and carefully scrutinizing this type of machine learning.

Data Augmentation

One major problem in machine learning has always been the lack of labeled datasets, which are required by supervised learning approaches. Labeling data is time-consuming and expensive. Without good labeled data, trained models are limited in their power to learn and in their ability to generalize to real-world problems. Services, such as Amazon’s Mechanical Turk, have attempted to crowdsource the tedious process of manually assigning labels to data, but labeling has remained a bottleneck in machine learning. GANs are helping to alleviate this bottleneck by generating new labeled data that is indistinguishable from the real data. This process can grow a small labeled dataset into one that is larger and more useful for training purposes. In the area of medical imaging and diagnostics this may have profound effects (Yi, Walia, and Babyn 2019). For example, GANs can produce photorealistic images of skin lesions that expert dermatologists are able to separate from real images only slightly over 50% of the time (Baur, Albarqouni, and Navab 2018), and they can synthesize high-resolution mammograms for training better cancer detection algorithms (Korkinof et al. 2018).

A corollary effect of these developments in medical imaging is the potential to publicly release


Figure 2.6: An image of a generated face from the original GAN publication (left) and the 2019 milestone (right) shows how the ability of GANs to produce photo-realistic images has evolved since 2014.

large medical datasets and thereby expand researchers' access to important data. Whereas the dissemination of traditional medical images is constrained by strict health privacy laws, generated images may not be governed by such rules. I qualify this statement with "may," because any restrictions or ethical guidelines for the use of medical data that is generated from real patient data require extensive discussion and legal reviews that have not yet happened. Under certain conditions, it may also be possible to infer original data from a GAN (Mukherjee et al. 2019). How institutional review boards, professional medical organizations, and courts weigh in on this topic will be seen in the coming years.

In addition to generating entirely new data, a GAN can augment datasets by expanding their coverage to new domains. For example, autonomous vehicles must cope with an array of road and weather conditions that are unpredictable. Training a model to identify pedestrians, street signs, road lines, and so on with images taken on a sunny day will not translate well to variable real-world conditions. Using one dataset, in a process known as style transfer, GANs can translate one image to other domains (Figure 2.7). This can include creating night road scenes from day scenes (Romera et al. 2019) and producing images of street signs under varying lighting conditions (Chowdhury et al. 2019). This added data permits models to account for greater variability under operating conditions without the high cost of photographing all possible conditions and manually labeling them. Beyond medicine and autonomous vehicles, generative data augmentation will progressively impact other imaging-heavy fields (Shorten and Khoshgoftaar 2019) like remote sensing (L. Ma et al. 2019; D. Ma, Tang, and Zhao 2019).

Creativity and Design

The question of whether machines can possess creativity or artistic ability is philosophically difficult to answer (Mazzone and Elgammal 2019; McCormack, Gifford, and Hutchings 2019). Still, in 2018, Christie's auctioned off its first piece of GAN art for $432,500 (Cohn 2018), and GANs


Figure 2.7: The images on the left are originals and the images on the right have been modified by a GAN with the ability to translate images between the domains of "dirty lens" and "clean lens" on a vehicle (from Uřičář et al. 2019, fig. 11).


Figure 2.8: This example of GauGAN in action shows a sketched-out scene on the left turned into a photo-realistic landscape on the right. *If any representatives of Christie's are reading, the author would be happy to auction this piece.

are increasingly assisting humans in the creative process for all forms of media. Simple models, like CycleGAN, are already able to stylize images in the manner of Van Gogh or Monet (Zhu et al. 2017), and more varied stylistic GANs are emerging.

GauGAN, a beta tool released by NVIDIA, is a great example of GAN-assisted creativity in action. GauGAN allows you to rough out a scene using a paint brush for different categories, like clouds, flowers, and houses (Figure 2.8). It then converts this into a photo reflecting what you have drawn. The online demo6 remains limited, but the underlying model is powerful and has massive potential (Park et al. 2019). Recently, Martin Scorsese's The Irishman made headlines for its digital de-aging of Robert De Niro and other actors. Although this process did not involve GANs, it is highly likely that in the future GANs will become a major part of cinematic post-production (Giardina 2019) through assistive tools like GauGAN.

Fashion and product design are also being impacted by the use of GANs. Text-to-image synthesis, which can take free text or categories as input to generate a photo-realistic image, has promising potential (Rostamzadeh et al. 2018). By accepting text as input, GANs can let designers rapidly generate new ideas or visualize concepts for products at the start of the design process. For example, a recently published GAN for clothing design accepts basic text and outputs modeled images of the described clothing (Banerjee et al. 2019; Figure 2.9). In an example of automotive design, a single sketch can be used to generate realistic photos of multiple perspectives of a vehicle (Radhakrishnan et al. 2018). The many fields that rely on quick sketching or visual prototyping, such as architecture or web design, are likely to be influenced by the use of GAN-assisted design software in coming years.

In a similar vein, GANs have an upcoming role in the creation of new medicines, chemicals, and materials (Zhavoronkov 2018). By training a GAN on existing chemical and material structures, research is showing that novel chemicals and materials can be designed with particular properties (Gómez-Bombarelli et al. 2018; Sanchez-Lengeling and Aspuru-Guzik 2018). This is facilitated by how information is encoded in the GAN's latent space (the n-dimensional space from which the generator samples; see "Z" in Figure 2.3). As the generator learns to produce realistic examples, certain aspects of the original data become encoded in regions of the latent

6 See http://nvidia-research-mingyuliu.com/gaugan/ (last accessed January 12, 2019).


Figure 2.9: Text-to-image synthesis can generate images of new fashions based on a description. From the input "maroon round neck mini print a-line bodycon short sleeves" a GAN has produced these three photos (from Banerjee et al. 2019, fig. 11).

Figure 2.10: Two examples of linearly-spaced mappings across the latent space between generated images A and B. Note that by taking one image and moving closer to another, you can alter properties in the image, such as adding steam, removing a cup handle, or changing the angle of view. These characteristics of the dataset are learned by the generator during training and encoded in the latent space. (GAN built on coffee cup sketches from Google's QuickDraw dataset)

space. By moving through this latent space or sampling particular areas, new data with desired properties can then be generated. This can be seen by periodically sampling the latent space and generating an image as one moves between two generated images (Figure 2.10). In the same way, by moving in certain directions or sampling from particular areas of the latent space, new chemicals or medicines with specific properties can be generated.7
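A sketch of the sampling just described, assuming a trained generator `G` that takes 8-dimensional latent vectors (as in the earlier training sketch): linearly spaced points between two latent vectors decode into examples that blend the properties of the two endpoints, as in Figure 2.10.

```python
# A sketch of latent-space interpolation; `G` is assumed to be a trained
# generator like the one in the earlier sketch, not a specific model.
import torch

latent_dim = 8
z_a = torch.randn(1, latent_dim)   # latent vector behind generated example A
z_b = torch.randn(1, latent_dim)   # latent vector behind generated example B

with torch.no_grad():
    for t in torch.linspace(0.0, 1.0, steps=8):
        z = (1 - t) * z_a + t * z_b   # a linearly spaced point between A and B
        sample = G(z)                 # decode it into a new generated example
```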

Impersonation and the Invisible

I have reserved some of the more dystopian and likely more well-heard-of applications of GANs for last. This is the area where GANs' ability to generate convincing media is challenging our perceptions of reality and raising extreme ethical questions (Harper 2018). Deep fakes are, of course, the most well known of these. This can include the creation of fake images, videos, and audio of an individual or the modification of any media to alter what someone appears to be doing or saying. In images and video in particular, GANs make it possible to swap the identity of an individual and manipulate facial attributes or expressions (Tolosana et al. 2020). A large portion

7 This is also relevant to facial manipulation discussed below.


Figure 2.11: GANs are providing a method to reconstruct hidden images of people and objects. Images 1–3 show reconstructions as compared to an input occluded image (OCC) and a ground truth image (GT) (from Fulgeri et al. 2019, fig. 6).

of technical literature is, in fact, now devoted to detecting faked and altered media (see Tolosana et al. 2020, Tables IV and V). It remains to be seen how successful any approaches will be. From a theoretical perspective, anything that can detect fakes can also be used to train a better generator, since the training process of a GAN is founded on outsmarting a detector (i.e., the discriminator network).

One shocking extension of deep fakes that has emerged is transcript-to-video creation, which generates a video of someone speaking from a written text. If you want to see this at work, you can view clips of Nixon giving the speech written in the case of an Apollo 11 disaster.8 As of now, deep fakes like this remain choppy and are largely limited to politicians and celebrities because they require large datasets and additional manipulation, but this limitation is not likely to last. If the evolution of GANs for images is any predictor, the entire emerging field of video generation is likely to progress rapidly. One can imagine the incorporation of text-to-image and deep fakes enabling someone to produce an image of, say, "politician X doing action Y," simply by typing it.

An application of GANs that parallels deep fakes and is likely more menacing in the short term is the infilling or adding of hidden, invisible, or predicted information to existing media. One nascent use is video prediction from an image. For example, in 2017, researchers were able to build a GAN that produced 1-second video clips from a single starting frame (Vondrick and Torralba 2017). This may not seem impressive, but video is notoriously difficult to work with because the content of a succeeding frame can vary so drastically from the preceding frame (for other examples of ongoing research into video prediction, see Cai et al. 2018; Wen et al. 2019).

For still images, occluded object reconstruction, in which a GAN is trained to produce a full image of a person or object that is partially hidden behind something else, is progressing (Fulgeri et al. 2019; see Figure 2.11). For some applications, like autonomous driving, this could save lives, as it would help to pick out when a partially-occluded pedestrian is about to emerge from

8 See http://news.mit.edu/2019/mit-apollo-deepfake-art-installation-aims-to-empower-more-discerning-public-1125.


behind a parked car. On the other hand, for surveillance technology, it can further undermine anonymity. Indeed, such GANs are already being explicitly studied for surveillance purposes (Fabbri, Calderara, and Cucchiara 2017). Lastly, I would be remiss if I did not mention that researchers have designed a GAN that can generate an image of what you are thinking about, using EEG signals (Tirupattur et al. 2018).

GANs and the Future

The tension between the creation of more realistic generated data and the technology to detect maliciously generated information is only beginning. The machine learning and data science platform Kaggle is replete with publicly-accessible Python code for building GANs and detecting fake data. Money, too, is freely flowing in this domain of research; the 2019 Deepfake Detection Challenge sponsored by Facebook, AWS, and Microsoft boasted one million dollars in prizes (https://www.kaggle.com/c/deepfake-detection-challenge, accessed April 20, 2020). Meanwhile, industry leaders, such as NVIDIA, continue to fund the training of better and more convincing GANs. The structure of a GAN, with its generator and detector paired adversarially, is now being mirrored in society as groups of researchers competitively work to create and discern generated data. The path that this machine-learning arms race will take is unpredictable, and, therefore, it is all the more important to scrutinize it and make it comprehensible to the broader publics whom it will affect.

References

Alqahtani, Hamed, Manolya Kavakli-Thorne, and Gulshan Kumar. 2019. "Applications of Generative Adversarial Networks (GANs): An Updated Review." Archives of Computational Methods in Engineering, December. https://doi.org/10.1007/s11831-019-09388-y.

Banerjee, Rajdeep H., Anoop Rajagopal, Nilpa Jha, Arun Patro, and Aruna Rajan. 2019. "Let AI Clothe You: Diversified Fashion Generation." In Computer Vision — ACCV 2018 Workshops, edited by Gustavo Carneiro and Shaodi You, 75–87. Cham: Springer International Publishing.

Baur, Christoph, Shadi Albarqouni, and Nassir Navab. 2018. "Generating Highly Realistic Images of Skin Lesions with GANs." Preprint, September 2018. https://arxiv.org/abs/1809.01410.

Burkov, Andriy. 2019. The Hundred-Page Machine Learning Book. Self-published, Amazon.

Cai, Haoye, Chunyan Bai, Yu-Wing Tai, and Chi-Keung Tang. 2018. "Deep Video Generation, Prediction and Completion of Human Action Sequences." In Computer Vision — ECCV 2018, edited by Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, 374–90. Lecture Notes in Computer Science. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-01216-8_23.

Chowdhury, Sohini Roy, et al. 2019. "Automated Augmentation with Reinforcement Learning and GANs for Robust Identification of Traffic Signs Using Front Camera Images." In 53rd Asilomar Conference on Signals, Systems & Computers, 79–83. N.p.: IEEE. https://doi.org/10.1109/IEEECONF44664.2019.9049005.

Cohen, Libby. 2020. "Reddit Bans Deepfakes with 'Malicious' Intent." The Daily Dot, January 10, 2020. https://www.dailydot.com/layer8/reddit-deepfakes-ban/.


Cohn, Gabe. 2018. "AI Art at Christie's Sells for $432,500." The New York Times, October 25, 2018. https://www.nytimes.com/2018/10/25/arts/design/ai-art-sold-christies.html.

Fabbri, Matteo, Simone Calderara, and Rita Cucchiara. 2017. "Generative Adversarial Models for People Attribute Recognition in Surveillance." In 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). N.p.: IEEE. https://doi.org/10.1109/AVSS.2017.8078521.

Fulgeri, Federico, Matteo Fabbri, Stefano Alletto, Simone Calderara, and Rita Cucchiara. 2019. "Can Adversarial Networks Hallucinate Occluded People With a Plausible Aspect?" Computer Vision and Image Understanding 182 (May): 71–80.

Giardina, Carolyn. 2019. "Will Smith, Robert De Niro and the Rise of the All-Digital Actor." The Hollywood Reporter, August 10, 2019. https://www.hollywoodreporter.com/behind-screen/rise-all-digital-actor-1229783.

Giles, Martin. 2018. "The GANfather: The Man Who's Given Machines the Gift of Imagination." MIT Technology Review 121, no. 2 (March/April): 48–53.

Gómez-Bombarelli, Rafael, Jennifer N. Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D. Hirzel, Ryan P. Adams, and Alán Aspuru-Guzik. 2018. "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules." ACS Central Science 4, no. 2 (February): 268–76. https://doi.org/10.1021/acscentsci.7b00572.

Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. "Generative Adversarial Nets." In Advances in Neural Information Processing Systems, edited by Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger, 27:2672–2680. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf.

Harper, Charlie. 2018. "Machine Learning and the Library or: How I Learned to Stop Worrying and Love My Robot Overlords." Code4Lib Journal, no. 41 (August). https://journal.code4lib.org/articles/13671.

Kahng, Minsuk, Nikhil Thorat, Duen Horng Polo Chau, Fernanda B. Viegas, and Martin Wattenberg. 2019. "GAN Lab: Understanding Complex Deep Generative Models Using Interactive Visual Experimentation." IEEE Transactions on Visualization and Computer Graphics 25, no. 1 (January): 310–320. https://doi.org/10.1109/tvcg.2018.2864500.

Karras, Tero, Samuli Laine, and Timo Aila. 2019. "A Style-Based Generator Architecture for Generative Adversarial Networks." In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 4396–4405. N.p.: IEEE. https://doi.org/10.1109/CVPR.2019.00453.

Korkinof, Dimitrios, Tobias Rijken, Michael O'Neill, Joseph Yearsley, Hugh Harvey, and Ben Glocker. 2018. "High-Resolution Mammogram Synthesis Using Progressive Generative Adversarial Networks." Preprint, submitted July 9, 2018. https://arxiv.org/abs/1807.03401.

Langr, Jakub, and Vladimir Bok. 2019. GANs in Action: Deep Learning with Generative Adversarial Networks. Shelter Island, NY: Manning Publications.

Ma, Dongao, Ping Tang, and Lijun Zhao. 2019. "SiftingGAN: Generating and Sifting Labeled Samples to Improve the Remote Sensing Image Scene Classification Baseline In Vitro." IEEE Geoscience and Remote Sensing Letters 16, no. 7 (July): 1046–1050. https://doi.org/10.1109/lgrs.2018.2890413.

Ma, Lei, Yu Liu, Xueliang Zhang, Yuanxin Ye, Gaofei Yin, and Brian Alan Johnson. 2019. "Deep Learning in Remote Sensing Applications: A Meta-Analysis and Review." ISPRS Journal of Photogrammetry and Remote Sensing 152 (June): 166–77. https://doi.org/10.1016/j.isprsjprs.2019.04.015.

Mazzone, Marian, and Ahmed Elgammal. 2019. "Art, Creativity, and the Potential of Artificial Intelligence." Arts 8, no. 1 (March): 1–9. https://doi.org/10.3390/arts8010026.

McCormack, Jon, Toby Gifford, and Patrick Hutchings. 2019. "Autonomy, Authenticity, Authorship and Intention in Computer Generated Art." In Computational Intelligence in Music, Sound, Art and Design, edited by Anikó Ekárt, Antonios Liapis, and María Luz Castro Pena, 35–50. Cham: Springer International Publishing.

Mukherjee, Sumit, Yixi Xu, Anusua Trivedi, and Juan Lavista Ferres. 2019. "Protecting GANs against Privacy Attacks by Preventing Overfitting." Preprint, submitted December 31, 2019. https://arxiv.org/abs/2001.00071v1.

Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. Adaptive Computation and Machine Learning Series. Cambridge, MA: MIT Press.

Park, Taesung, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. 2019. "Semantic Image Synthesis with Spatially-Adaptive Normalization." In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2332–2341. N.p.: IEEE. https://doi.org/10.1109/CVPR.2019.00244.

Radhakrishnan, Sreedhar, Varun Bharadwaj, Varun Manjunath, and Ramamoorthy Srinath. 2018. "Creative Intelligence – Automating Car Design Studio with Generative Adversarial Networks (GAN)." In Machine Learning and Knowledge Extraction, edited by Andreas Holzinger, Peter Kieseberg, A Min Tjoa, and Edgar Weippl, 160–75. Cham: Springer International Publishing.

Romera, Eduardo, Luis M. Bergasa, Kailun Yang, Jose M. Alvarez, and Rafael Barea. 2019. "Bridging the Day and Night Domain Gap for Semantic Segmentation." In 2019 IEEE Intelligent Vehicles Symposium (IV), 1312–18. N.p.: IEEE. https://doi.org/10.1109/IVS.2019.8813888.

Romm, Tony, Drew Harwell, and Isaac Stanley-Becker. 2020. "Facebook Bans Deepfakes, but New Policy May Not Cover Controversial Pelosi Video." The Washington Post, January 7, 2020. https://www.washingtonpost.com/technology/2020/01/06/facebook-ban-deepfakes-sources-say-new-policy-may-not-cover-controversial-pelosi-video/.

Rostamzadeh, Negar, Seyedarian Hosseini, Thomas Boquet, Wojciech Stokowiec, Ying Zhang, Christian Jauvin, and Chris Pal. 2018. "Fashion-Gen: The Generative Fashion Dataset and Challenge." Preprint, submitted June 21, 2018. https://arxiv.org/abs/1806.08317.

Sanchez-Lengeling, Benjamin, and Alán Aspuru-Guzik. 2018. "Inverse Molecular Design Using Machine Learning: Generative Models for Matter Engineering." Science 361, no. 6400 (July): 360–365. https://doi.org/10.1126/science.aat2663.

Shorten, Connor, and Taghi M. Khoshgoftaar. 2019. "A Survey on Image Data Augmentation for Deep Learning." Journal of Big Data 6 (60): 1–48. https://doi.org/10.1186/s40537-019-0197-0.


Tirupattur, Praveen, Yogesh Singh Rawat, Concetto Spampinato, and Mubarak Shah. 2018. "ThoughtViz: Visualizing Human Thoughts Using Generative Adversarial Network." In Proceedings of the 26th ACM International Conference on Multimedia, 950–958. New York: Association for Computing Machinery. https://doi.org/10.1145/3240508.3240641.

Tolosana, Ruben, Ruben Vera-Rodriguez, Julian Fierrez, Aythami Morales, and Javier Ortega-Garcia. 2020. "DeepFakes and Beyond: A Survey of Face Manipulation and Fake Detection." Preprint, submitted January 1, 2020. https://arxiv.org/abs/2001.00179.

Uřičář, Michal, Pavel Křížek, David Hurych, Ibrahim Sobh, Senthil Yogamani, and Patrick Denny. 2019. "Yes, We GAN: Applying Adversarial Techniques for Autonomous Driving." In IS&T International Symposium on Electronic Imaging, 1–16. Springfield, VA: Society for Imaging Science and Technology. https://doi.org/10.2352/ISSN.2470-1173.2019.15.AVM-048.

Vondrick, Carl, and Antonio Torralba. 2017. "Generating the Future with Adversarial Transformers." In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2992–3000. N.p.: IEEE. https://doi.org/10.1109/CVPR.2017.319.

Wang, Zhengwei, Qi She, and Tomas E. Ward. 2019. "Generative Adversarial Networks: A Survey and Taxonomy." Preprint, submitted June 4, 2019. https://arxiv.org/abs/1906.01529.

Wen, Shiping, Weiwei Liu, Yin Yang, Tingwen Huang, and Zhigang Zeng. 2019. "Generating Realistic Videos From Keyframes With Concatenated GANs." IEEE Transactions on Circuits and Systems for Video Technology 29 (8): 2337–48. https://doi.org/10.1109/TCSVT.2018.2867934.

Yi, Xin, Ekta Walia, and Paul Babyn. 2019. "Generative Adversarial Network in Medical Imaging: A Review." Medical Image Analysis 58 (December): 1–20. https://doi.org/10.1016/j.media.2019.101552.

Zhavoronkov, Alex. 2018. "Artificial Intelligence for Drug Discovery, Biomarker Development, and Generation of Novel Chemistry." Molecular Pharmaceutics 15, no. 10 (October): 4311–13. https://doi.org/10.1021/acs.molpharmaceut.8b00930.

Zhu, Jun-Yan, Taesung Park, Phillip Isola, and Alexei A. Efros. 2017. "Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks." In 2017 IEEE International Conference on Computer Vision (ICCV), 2242–2251. N.p.: IEEE. https://doi.org/10.1109/ICCV.2017.244.

Chapter 3

Humanities and Social Science Reading through Machine Learning

Marisa Plumb
San Jose State University

Introduction

The purposes of computational literary studies have evolved and diversified a great deal over the last half century. Within this dynamic and often contentious space, a set of fundamental questions deserves our collective attention: does the computation and digitization of language recast the ways we read, value, and receive words? In what ways can research and scholarship on literature become a more meaningful part of the future development of computer systems? As the theory and practice of computational literary studies evolve, their potential to play a direct role in revising historical narratives and framing new research questions carries cross-disciplinary implications.

It's worthwhile to anchor these questions in the origin stories that today's digital humanists tell, from the work of Josephine Miles at Berkeley in the 1930s (Buurma and Heffernan 2018), to Roberto Busa's work in the 1940s, to work that links Structuralism and Russian Formalism at the turn of the twentieth century (Algee-Hewitt 2015), to today's systemized explorations of texts. The sciences and humanities have a shared history in their desire to uncover the patterns and systems that make language functional and impactful, and there have long been linguistic and computational tools that help advance this work. What's more challenging to unravel and articulate from these origin stories are the mathematical concepts behind the tools that humanists wield. Ideally one would navigate this historical landscape when assessing the fitness of any given computational technique for addressing a specific humanities research question, but often researchers choose



tools because they are powerful and popular, without a robust understanding of the conceptual assumptions they embody, which are defined by the mathematical and statistical principles they are based on. This can make it difficult to generate reproducible results that contribute to a tool's methodological development.

This is related to a set of issues that drive debates among computationally-minded scholars, which regularly appear in digital humanities forums. In 2019, for instance, Nan Da issued a harsh critique of humanists' implementation of statistical methods in their research.1 Her claim is that computational methods are not a good match for literary research, and she systematically shows how the results from several computational humanities studies are not only difficult to reproduce, but can be easily skewed with minor changes to how an algorithm is implemented. Although this debate about digital methods points to a necessary evolution in the field (in which researchers become more accountable to the computational laws that they are utilizing), her essay's broader mission is to question the appropriateness of using computational tools to investigate literary objects and ideas.

Refutations to this claim were swift and abundant (Critical Inquiry 2019), and highlight a number of concepts central to my concern here with future intersections of machine learning and literary research. Respondents such as Mark Algee-Hewitt pointed out that literary scholars employ computational statistical models in order to reveal something about texts that human readers could not. In doing so, literary scholars are at liberty to note where computation reaches its useful limit2 and take up more traditional forms of literary analysis (Algee-Hewitt 2019). Katherine Bode explores the promise and pitfalls of this hybrid "close and distant reading" approach in her 2020 article on the intersection of topic modeling and bias. Imperfect as the hybrid method is, stressing the value of familiar interpretive methods remains important, politically and practically, when bringing computation into humanities departments.

This essay extends the argument that computational tools do more than turn big data into novel close reading opportunities. Machine learning, and word embedding algorithms in particular, may have a unique ability to shift this conversation into new territory, where scholars begin to ask how historical research can contribute more sophisticated approaches to treating words as data. With historically-minded approaches to dataset creation for machine learning, issues emerge that engender new theoretical frameworks for evaluating the ability of statistical models of information to reveal cultural and artistic dimensions of language. I will first contextualize what word embeddings do, and then show a few of the mathematical concepts that have driven their development.

Of the many available machine learning algorithms, word embedding algorithms have shown particular promise in capturing contextual meanings (of words or other units of textual data) more accurately than previous techniques in natural language processing. Word embeddings encompass a set of language modeling techniques where words or phrases from a large set of texts (i.e., a "corpus") are analyzed through the use of a neural network architecture. For each vocabulary term in the corpus, the neural network algorithm uses the term's proximity to other words to assign it values that become a vector of real numbers—one high-dimensional vector is generated for each word. (The term "embedding" refers to the mathematics that turns a space with many

1 Da's critique of statistical model usage in computational humanities work sparked a forum of responses in Critical Inquiry.

2 This limit typically exists for a combination of three reasons: computer programs can only generate models based on the data we give them, a tool isn't fully understood and so not robustly explored, and many algorithms and tools are being used in experimental ways.


dimensions per word into a continuous vector space with a much lower dimension.)3 Word embeddings raise three issues critical to this essay: How do word embeddings reflect the contexts of words in order to capture their relative meanings? If word embeddings approximate word meanings, do they also reflect culture? How can literary history and cultural studies inform how scholars use them?

Word embeddings are powerful because they calculate semantic similarities between words based on their distributional properties in large samples of language data. As computational linguist Jussi Karlgren puts it:

Language is a general-purpose representation of human knowledge, and models to process it vary in the degree they are bound to some task or some specific usage. Currently, the trend is to learn regularities and representations with as little explicit knowledge-based linguistic processing as possible, and recent advances in such general models for end-to-end learning to address linguistics tasks have been quite successful. Most of those approaches make little use of information beyond the occurrence or co-occurrence of words in the linguistic signal and take the single word to be the atomary unit.

This is notable because it highlights the power of word embeddings to assign values to words in order to represent their relative meanings, simply based on unstructured language data, without a system of linguistic rules or a labelling system. It also highlights the fact that a word embedding model's success is based on the parameters of the task it is designed to address. So while the accuracy and power of word vector algorithms might be recognizable in general-purpose applications that improve with larger training corpora (for instance, Google News and Wikipedia), they can be equally powerful representation learning systems for specific historical research tasks that use different benchmarks for success. Humanists using these machine learning methods are learning to think differently about corpora size, corpora content, and the utility of a successfully-trained model for analysis and interpretation.
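As an illustration of how little scaffolding such a model requires, the following sketch trains word vectors with gensim's word2vec implementation. The two-sentence corpus and the parameter values are placeholders; a real study would tokenize a full set of texts:

```python
# A minimal sketch of training word embeddings on a tokenized corpus
# using gensim's word2vec implementation. The corpus here is a toy
# placeholder standing in for a real, much larger set of texts.
from gensim.models import Word2Vec

corpus = [
    ["the", "annual", "included", "poetry", "prose", "and", "engravings"],
    ["contributors", "were", "typically", "paid", "for", "their", "poetry"],
]

model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1)
vector = model.wv["poetry"]   # one 100-dimensional vector per vocabulary term
```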

No matter the application, the success of machine learning is predicated on creating good datasets. As a recent paper in IEEE Transactions on Knowledge and Data Engineering notes, "the majority of the time for running machine learning end-to-end is spent on preparing the data, which includes collecting, cleaning, analyzing, visualizing, and feature engineering" (Roh et al. 2019, 1). Acknowledging this helps contextualize machine learning algorithms for text analysis tasks in the humanities, but also highlights data curation challenges that can be taken up in new ways by humanists. This naturally raises questions about how machine learning algorithms like word embeddings are implemented for text analysis, and how they should be modified for historical research—they require different computational priorities and frameworks.

In parallel to the corpora considerations that computational humanities scholars ponder, there is an abundance of work, across disciplines such as cognitive science and psychology (Griffiths et al. 2007), that attempts to refine the problems and limits of using large collections of text for training embeddings. These large collections tend to reflect the biases that exist in society and history, and in turn, systems based on these datasets can make troubling inferences, now well documented as algorithmic bias.4 Computer science researchers need to evaluate the social dimensions of their applications in diverse societies and find ways to fairly represent all populations.

3 See Koehrsen 2018 for a fuller explanation of the process.
4 As investigated, for instance, in Noble 2018.


Digital humanities practices can implicitly help address these issues. Literary studies, as it evolves towards multivocality and canon expansion, makes explicit a link between methods of literary analysis and digital practices that are deliberately inclusive, less-biased, and diachronic (rather than ahistorical). Emerging literary scholarship uses computational methods to question hegemonic practices in the history of the field, through the now-familiar practice of data curation (Poole 2013). But this work can also help combat algorithmic bias more broadly, and expand beyond corpus development into algorithmic design. As digital literary scholarship continues to deepen its exchanges with Sociology, History, and Information Science, stronger methodologies for using fair and representative data will become pervasive throughout these disciplines, as well as in commercial applications. Interdisciplinary methodologies are foundational to future computational literary research that can make sophisticated contributions to text analysis.

The Bengal Annual: A Case Study

Complex relationships between words cannot be fully assessed with one flat application of a powerful tool to a set of texts. But this does not mean that the usefulness of machine learning for literature is limited: rather, scholars can wield it to control how machines learn sets of relationships between concepts. Choosing which texts to include in a corpus is coupled to decisions about whether and how to label its contents, and how to tune the parameters of the algorithms. For the purposes of literary analysis, these should be embraced as interpretive, biased acts—ones that deepen understanding of commonly-employed computational methods—and folded into emerging methodologies. Because humanities scholars are not generating models to serve applications with thousands of end-users who primarily expect accuracy, they can exploit the fallacies of machine learning in order to improve how dataset management and feature engineering are conducted. Working with big data in order to generate models isn't valuable because it reveals history's "true" cultural patterns, but because it demonstrates how machines already circulate those "truths." A scholar's deep knowledge of the historical content and formalities of language can determine how corpora are compared, how we experiment with known biases, and how we move towards a future landscape of literary analysis that is inclusive of marginalized texts and the latest cultural theory.

Roopika Risam, for instance, advocates for both a theoretical and practice-based decolonization of the digital humanities, noting ways that postcolonial digital archives can intervene in knowledge production in society (2018, 79). Corpora created from periods of revolution, then, might reveal especially useful vector relationships and lead to better understanding of semantic changes during those times. Those word embeddings might be useful for teaching computers racialized language over timelines, so that machine learning applications do not only "read" history as a flat set of relationships, and inevitably reflect the worst of its biases.

To begin to unpack this process, I will present a case study on the 1830 Bengal Annual and a corpus of similarly-situated texts. Our team, made up of students in Katherine D. Harris's graduate seminar on decolonizing Romantic Literature at San Jose State University, asked: can we operationalize questions that arise from close readings of texts to turn problematic quantitative evaluations of words into more complex methods of interpretation? A computer cannot interpret complex cultural concepts, but it can be instructed to weigh time period, narrative perspective, and publication venue, much as a literary scholar would.

With the explosion of print culture in England in the first half of the nineteenth century, publishers began introducing new forms of serialized print materials, which included serialized


publications known as literary annuals (Harris 2015). These multi-author texts were commonly produced as high-quality volumes that could be purchased as gifts in the months leading up to the holiday season. As a genre, the annual included poetry, prose, and engravings, among other varieties of content, very often from well-known authors. Literary annuals represent a significant shift in the economics surrounding the production of print materials for mass consumption—for instance, contributors were typically paid. And annuals, though a luxury item, were more affordable than books sold before the mechanization of the printing press (Harris 2015, 1-29).

Literary annuals and other periodicals are interesting sites of literary study because they can be read as reinforcing or resisting the British Empire. London-based periodicals were eventually distributed to all of Britain's colonial holdings, including India (Harris 2019). As The Bengal Annual was written in India and contains a small representation of Indian authors, our project investigates it as a variation on British-centric reading materials of the time, which perhaps offered a provisional voice to a wider community of writers (though not without claims of superiority over the colonized territory it exploits).

Some of the contents invoke themes that are affiliated with major Romantic writers such as William Wordsworth and Samuel T. Coleridge, but editor D.L. Richardson included short stories and fiction, which were not held in the same regard as poetry. He also employed local native Indian engravers and writers. To explore the thesis that the concepts and genres typically associated with British Romantic Literature are represented differently in a text that was written and produced in a different space with a set of contributors who were not exclusively British natives, we experimented with word embeddings on semantic similarity tasks, comparing the annual to texts like Lyrical Ballads. Such a task is within the scope of traditional literary analysis, but my agenda was to probe the idea that we need large-scale representations of marginalized voices in order to show real differences from the ideas of the dominant race, class, and gender.5

The project team first used statistical tools to find out if the Annual's poetry, non-fiction, and fiction contained interesting relationships between vocabularies about body parts, social class, and gender. We gathered information about terms that might reveal how different parts of the body were referenced depending on sex. These differences were validated by traditional close-reading knowledge about British Romantic Literature and its historical contexts,6 and signaled the need to read and analyze the Annual's passages about body parts, especially ones by writers of different genders and social backgrounds. These simple methods allowed us to take a streamlined approach to confirming that an author's perspective indeed altered his or her word choices and other aspects of their references to male vs. female bodies.

Collecting and mapping those references, however, was not enough to build a larger argument about how discourse on bodies might be different in non-canonical British Romantic Literature. Based on the potential for word embeddings to model semantic spaces for different corpora and compare the distribution of terms, the next step was to build a corpus of non-canonical texts of similar scope to a corpus of canonical works, so that models for each could be legitimately compared. This work, currently in progress, faces challenges that are becoming more familiar to digital historians: the digitization of rare texts, the review of digitization processes for accuracy, and the cleaning of data.

The primary challenge is to find the correct works to include: this requires historical expertise,

5 Such textual repositories are important outside of literature departments, too. We need data to represent all voices in training machines to represent any social arena.

6 Some of these findings are illustrated in the project's Scalar site: http://scalar.usc.edu/works/the-bengal-annual/bodies-in-the-annual.


but also raises the question of how to uncover unknown authors. Manu Chander's Brown Romantics calls for a global assessment of Romantic Literature's impact by "calling attention to its genuinely unacknowledged legislators" (Chander 2017, 11). But he contends that even the authors he was able to study were already individuals who aspired to assimilate with British culture and ideologies in some ways, and perhaps don't represent political resistance or views entirely antithetical to the British Empire.

Guided by Chander's questions about how to locate dissent in contexts of colonization, we documented instances in the text that highlight the dynamics of colonialism, race, and nationalism, and compared them to a set of statistical explorations of the text's vocabulary (particularly terms related to national identity, gender, and bodies). Chander's call for a more globally-comprehensive study of Romanticism speaks to the politics of corpora curation discussed above, but also suggests that corpus comparison can benefit from formal methodological guidelines. Puzzling out how to best combine traditional close readings with quantitative inquiries, and then map that work to a machine-learning research framework, revealed several shortcomings in methodological standardization. It also revealed several opportunities for rethinking the way algorithms could be implemented, by adopting and systematizing familiar comparative research practices. Ideas about such methodologies are emerging in many disciplines, which I highlight later in this essay.

Disciplinary directions for word vector research

The potential of word embedding techniques for projects such as our Bengal Annual analysis can be seen in the new computational research directions that have emerged in humanities research.7

Vector-space representations are based on high-dimensional vectors8 of real numbers.9 Those vectors' values are assigned using a word's relationship to the words near it in a text, based on the likelihood that a word will appear in proximity to other words it is told to "look" at. For example, this visualization demonstrates an embedding space for a historical corpus (1640-1699) using the values assigned to word vectors (figure 3.1).

In a visualized space (with reduced dimensions) such as the one in figure 3.1, distances among vectors can be assessed, for example, to articulate the forty words most similar to wit. This particular model (trained using the word2vec algorithm), published in the 2019 Debates in the Digital Humanities,10 allowed the authors to visualize the term wit with synonyms on the left side, and terms related to argumentation on the right, such as indeed, argues, and consequently. This initial exploration prompted Gavin and his co-authors to look at a vector space model for a single author (John Dryden), in order to both validate the model against their subject matter expertise and explore the model's results. Although word vectors are often employed for machine translation tasks11 or to project analogistic relationships between concepts,12 they can also be used to

7 See Kirschenbaum 2007 and Argamon and Olsen 2009.
8 A word vector may have hundreds or even thousands of dimensions.
9 Word embedding algorithms are modelled on the linguistic concept that context is a primary way that word meanings are produced. Their usefulness is dependent on the breadth and domain-relevance of the corpus they are trained on, meaning that a corpus of medical research vs. a corpus of 1980s television guides vs. a corpus of family law proceedings would generate models that show different relationships between words like "family," "health," "heart," etc.

10 See Goldstone 2019.
11 Software used to translate text or speech from one language to a target language. Machine translation is a subfield of computational linguistics that can now allow for domain-based (i.e., specialized subject matter) customizations of translations, making translated word choices more context-specific.

12 Although word embeddings aren't explicitly trained to learn analogies, the vectors exhibit seemingly linear behavior (such as "woman is to queen as man is to king") that approximately describes a parallelogram. This phenomenon is explored in Allen and Hospedales 2019.


Figure 3.1: A visualized space with reduced dimensions of a neighborhood around wit (Gavin et al. 2019, Figure 21.2).

question concepts that are traditionally associated with particular literary periods and evaluate those associations with new kinds of evidence.
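The kind of nearest-neighbor query behind figure 3.1 can be reproduced in a few lines. The sketch below uses a general-purpose pretrained model from gensim's downloader rather than Gavin et al.'s historical corpus, which is not redistributed here, so its neighbors for wit will differ from theirs:

```python
# Querying an embedding space for the words nearest to "wit" by cosine
# similarity. The pretrained general-purpose model is a stand-in for the
# historical (1640-1699) model discussed above.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")
for word, cosine in vectors.most_similar("wit", topn=10):
    print(f"{word}\t{cosine:.3f}")
```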

What this type of study suggests is that we can look at cultural concepts like wit in new ways. These results can also facilitate a comparison of historical models of wit to contemporary ones—to show how its meaning may have shifted, using its changing relationship to other words as evidence. This is a growing area of research in the social sciences, computational linguistics, and other disciplines (Kutuzov et al. 2019). In a survey paper on current work in diachronic word embeddings and semantic shifts, Kutuzov et al. note that the surge of interest points to its importance for natural language processing, but that it currently lacks "cohesion, common terminology and shared practices."

Some of this cohesion might be generated by putting the usefulness of word vectors in the context of the history of information retrieval and the history of distributed representation. The vector space models underlying word embeddings emerged in the 1960s, with data modeled as a matrix, and a user's query of a database represented as a vector. Simple vector operations could be used to locate relevant data or documents. Gerald Salton is generally credited as one of the first to do this, based on the idea that he could represent a document as a vector of keywords and use measures like cosine similarity and dimensionality reduction to compare documents.13 Since the 1990s, vector space models have been

13 Algorithms like word2vec take as input the linguistic context of words in a given corpus of text, and output an N-dimensional space of those words—each word is represented as a vector of dimension N in that Euclidean space. Word vectors with thousands of values are transformed to lower-dimensional spaces in which the directionality of two vectors can be measured using cosine similarity—words that exist in similar contexts would be expected to have a similar cosine measurement and map to like clusters in the distributed space.


used in distributional semantics. In a paper on the history of vector space models, which examines the trajectory of Gerald Salton's work, David Dubin notes that these mathematical models can be defined as "a consistent mathematical structure designed to correspond to some physical, biological, social, psychological, or conceptual entity" (2004). In the case of word vectors, word context and collocations give us quantifiable information about a word's meaning.
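A small worked example of the Salton-style comparison described above: each document becomes a vector of term counts, and cosine similarity measures whether two documents point in the same direction in that keyword space. The documents and vocabulary here are invented for illustration:

```python
# Salton-style document comparison: documents as term-count vectors,
# compared by the cosine of the angle between them. Toy data throughout.
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# columns count occurrences of the terms ["wit", "argue", "belt"]
doc_a = np.array([4, 2, 0])
doc_b = np.array([3, 1, 0])
doc_c = np.array([0, 0, 5])

print(cosine(doc_a, doc_b))   # close to 1: similar keyword profiles
print(cosine(doc_a, doc_c))   # 0.0: no terms in common
```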

But research in cognitive science has long questioned the property of linguistic similarity in spatial representations because they don't align with important aspects of human semantic processing (Tversky 1977). Tversky shows, for example, that people's interpretation of semantic similarity does not always obey the triangle inequality, i.e., the words w1 and w3 are not necessarily similar when both pairs of (w1, w2) and (w2, w3) are similar. While "asteroid" is very similar to "belt" and "belt" is very similar to "buckle," "asteroid" and "buckle" are not similar (Griffiths et al. 2007). One reason this violation arises is because a word is represented as a single vector even when it has multiple meanings. This has led to research that attempts new methods to capture different senses of words in embedding applications. In a paper surveying techniques for differentiating words at the "sense" level, Jose Camacho-Collados and Mohammad Taher Pilehvar show that these efforts fall into two camps: "Unsupervised models directly learn word senses from text corpora, while knowledge-based techniques exploit the sense inventories of lexical resources as their main source for representing meanings" (2018, 744).

The first method, an unsupervised model, induces different meanings of a word—it is trained to analyze and represent each word sense based on statistical knowledge derived from the contexts within a corpus. The second method for disambiguation relies on information contained in other databases or sources. WordNet, for instance, associates multiple words with concepts, providing a sense inventory for terms. It is made up of synsets, which represent unique concepts that can be expressed through nouns, verbs, adjectives, or adverbs. The synset of a concept such as "a business where patrons can purchase coffee and use WiFi" might be "cafe, coffeeshop, internet cafe," etc. Camacho-Collados and Pilehvar review different ways to process word embedding results using WordNet and similar resources, which essentially provide synonyms that share a common meaning.
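To make the idea of a sense inventory concrete, the sketch below looks up the synsets for "belt" in WordNet through the NLTK interface; this is the kind of lexical resource that knowledge-based disambiguation techniques exploit:

```python
# A minimal sketch of consulting WordNet's sense inventory via NLTK.
# Each synset is one distinct sense of "belt", with a gloss definition.
import nltk
nltk.download("wordnet", quiet=True)   # fetch the WordNet data if absent
from nltk.corpus import wordnet as wn

for synset in wn.synsets("belt"):
    print(synset.name(), "-", synset.definition())
```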

There exists a relationship between work that addresses word disambiguation and work that addresses the biases that word vector algorithms produce. Just as researchers can modify general word embedding models to capture a word's multiple meanings, they can also modify them according to a word's usage over time. These evolving methods begin to account for the social, historical, and psychological dimensions of language. If one can show that applying word embedding algorithms to diachronic corpora or corpora of different domains produces different biases, this would suggest that nuanced shifts in vocabulary and word usage can be used to impact data curation practices that seek to isolate and remove historical bias from other word embedding models.

Biases, one might say, persist despite contextual changes. Or, one might say that the shortcomings of word embeddings don't account for changes in bias that are present in context. This is where the domain expertise of literary scholars also becomes essential. Historians' domain expertise and natural interest in comparative corpora (from different time periods or containing different types of documents) situates their ability to curate datasets that tend to both data ethics and computational innovation. Such work could have impact beyond historical research, and result in data-level corrections to biases that emerge in more general-purpose embedding applications. This could be more effective and reproducible than correcting them superficially (Gonen and Goldberg 2019). For instance, if novel cultural biases can be traced to an origin period, texts


from that period could constitute a sub-corpus. Embedding models specific to that corpus might be subtracted from the vectors generated from a broader dataset.
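One hedged sketch of what that subtraction might look like in practice: estimate a bias direction from the period-specific model and project it out of the general-purpose vectors. Whether this genuinely removes bias, rather than hiding it, is exactly the open question Gonen and Goldberg (2019) raise, and the arrays below are random stand-ins for real embedding matrices:

```python
# A speculative sketch, not an established method: remove each general
# vector's component along a bias direction estimated from a sub-corpus.
import numpy as np

rng = np.random.default_rng(0)
general = rng.random((1000, 100))   # vectors from the broad corpus (toy data)
bias_dir = rng.random(100)          # direction estimated from the sub-corpus (toy data)
bias_dir /= np.linalg.norm(bias_dir)

# subtract the projection of every vector onto the bias direction
debiased = general - np.outer(general @ bias_dir, bias_dir)
```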

Examining a methodology's history is an essential way in which scholars can strengthen the validity of computationally-driven research and its integration into literary departments—this type of scholarship reconstitutes literary insights after the risky move of flattening literary texts with the rigor of machines. But as Lauren Klein (2019) and others reveal, scholars have begun to apply interpretation and imagination in both the computational and the "close reading" aspects of their research. This reinforces that computational shifts in the study of literature are more than just the adoption of useful tools for the sake of locating a novel pattern in data. An increasingly important branch of digital literary research demonstrates the efficacy of engaging the interdisciplinary complexity of computational tools in relation to the complexity of literary analysis.

New ideas for close readings and analysis can serve as windows into defining secondary computational research questions that emerge from an initial statistical exploration. As in the work reviewed by Camacho-Collados and Pilehvar, outside knowledge of word senses can be used for post-processing word embeddings that address theoretical issues. Implementing this type of process for humanities research, one might begin with the question: can I generate word vector models that attend to both author gender and word context if I train them in innovative ways? Does this require a corpus of male authors and one of female authors? Or would this be better accomplished with an outside lexical source that has already associated word senses with genders?

Multi-disciplinary scholars are experimenting with a variety of methods to use word vector algorithms to track semantic complexities, and humanities researchers need an awareness of the technical innovations across a range of these disciplines because they are in a position to bring important domain knowledge to these efforts. Ideally, the questions that unite these disciplinary efforts might be: how do we make word contexts and distributional semantics more useful for both historians, who need reproducible results that lead to new interpretation, and technologists, who need historical interpretation to play a larger role in language generalization? Modeling language histories depends on how deeply humanists can understand word embedding models, so that they can augment their inherent shortcomings. Cross-disciplinary collaborations help scholars return to fundamental issues that arise when we treat words as data, and help bring more cohesive methodological standards to language modeling.

New directions in cross-disciplinary machine learning frameworks

Literary scholars set up computational inquiries with attention to cultural complexity, and seek out instances of language that convey historical context. So while they aren't likely to lead the charge in correcting fundamental shortcomings of language representation algorithms, they can increasingly impact social assessments of those algorithms, provide methodologies for those algorithms to locate anomalies in language usage, and assess whether those algorithms embody socially just practices (D'Ignazio and Klein 2020). Some literary scholars also critique the non-neutral ideologies that are in place in both computing and the humanities (Rhody 2017, 660). These efforts not only make the field of literary studies (and its history) more relevant to a digitally and computationally-driven future, but also help literary scholars create meaningful intersections between their computational tools and theoretical training. That training includes frameworks


for reading and analysis that computers cannot yet perform, but should aspire to—from close reading, Semiotic Criticism, and Formalism to Post-structuralism, Cultural Studies, and Feminist Theory. The varied systems literary scholars have developed for thinking about signs, words, and symbols should not be seen as irreconcilable with computational tools for text analysis. Instead, they should become the foundation for new methodologies that tackle the shortcomings of machine learning algorithms and project future directions for text analysis.

Linguists and scientists interested in natural language processing have often looked to the humanities for methods that assign rules to the production of meaning. Such methods exist within the history of literary criticism, some of which are being newly explored as concepts for language modeling algorithms. For instance, data curation takes inspiration from cultural studies, which empowers literary scholars to correct for bias and underrepresentation in literature by expanding the canon. Subsequent literary findings from that research need not only be literary ones: they have the potential to serve as models for best practices for computational tools and datasets more broadly. While the rift between society's most progressive ideas and its technological advancement is not unique to the rise of machine learning, practical opportunities exist to repair the rift with a blend of literary criticism and computational skills, and there are many recent examples14 of the growing importance of combining rich technical explanations, interdisciplinary theories, and original computational work in corpus linguistics and beyond. A desire to wield social and computational concerns simultaneously is evident also in recent work in Linguistics,15 Sociology,16 and History.17

Studies in computational Sociology by Laura K. Nelson, Austin C. Kozlowski, Matt Taddy, James A. Evans, Peter McMahan, and Kenneth Benoit contain important parallels for machine learning-driven text analysis. Nelson, for instance, calls for a new three-step methodology for computational sociology, one that "combines expert human knowledge and hermeneutic skills with the processing power and pattern recognition of computers, producing a more methodologically rigorous but interpretive approach to content analysis" (2020, 1). She describes a framework that can aid in reproducibility, which was noted as a problem by Da. Kozlowski, Taddy, and Evans, who study relationships between attention and knowledge, in a September 2019 paper on the "geometry of culture" use a vector space model to analyze a century of books. They show "that the markers of class continuously shifted amidst the economic transformations of the twentieth century, yet the basic cultural dimensions of class remained remarkably stable. The notable exception is education, which became tightly linked to affluence independent of its association with cultivated taste" (1). This implies that disciplinary expertise can be used to isolate sub-corpora for use in secondary word embedding research problems. Resulting word similarity findings could aid in both validating the initial research finding and defining domain-specific datasets that are reusable for future research.


14See Whitt 2018 for a state-of-the-art overview of the intersecting fields of corpus linguistics, historical linguistics, and genre-based studies of language usage.

15A special issue of the journal Language, from the Linguistic Society of America, published responses to a call to reconcile the unproductive rift between generative linguistics and neural network models. Christopher Potts's response (2019) argues that integrating deep learning with traditional linguistic semantics is imperative.

16Sociologist Laura K. Nelson (2020) calls for a three-step methodological framework called computational grounded theory.

17Another special issue, this one from Isis, a journal from the History of Science Society, suggests that "the history of knowledge can act as a bridge between the world of the humanities, with its tradition of close reading and detailed understanding of individual cases, and the world of big data and computational analysis" (Laubichler, Maienschein, and Renn 2019, 502).


The idea of using humanities methodologies to inform model architectures for machine learning is part of a wider history of computational scientists drawing inspiration from other fields to make AI systems better. Designing humanities research with novel word embedding models stands to widen the territory where machine learning engineers look for concepts to inspire strategies for improving the performance of artificial language understanding. Many computer scientists are investigating the figurative (Gagliano et al. 2019) and the metaphorical (Mao et al. 2018) in language. As machines get better at reading and interpreting texts, literary studies and theories will become more applicable to how those machines are programmed to look at multiple layers and dimensions of language. Ted Underwood, Andrew Piper, Katherine Bode, James Dobson, and others make connections between computational literary research and social dimensions of the history of vector space model research. Since vector models are based on the 1950s linguistic notion of similarity (Firth 1957), researchers working to show superior algorithmic performance focus on different aspects of why similarity is important than do researchers seeking cultural insights within their data. But Underwood points out that a word vector can also be seen as a way to quantitatively account for more aspects of meaning (2019). Already, cross-disciplinary scholarship draws on computational linguistics,18 information science,19 and semantic linguistics, and the imperative to understand concepts from all of these fields is growing. As better methods are developed for using word embeddings to better understand texts from different domains and time periods, more sophisticated tools and paradigms emerge that echo the complexity of traditional literary and historical interpretation.

Systematic data curation, combined with word embedding algorithms, represents a new interpretive system for literary scholars. The potential of machine learning methods for text analysis goes beyond historical literary text analysis, and the methods for literary text analysis using machine learning also go beyond literature departments. The corpora literary scholars model and the way they frame their research questions reframe the potential to use systems like word vectors to understand aspects of historical language, and could have broader ramifications for how other applications model word meanings. Because such literary research generates novel frameworks for using machine learning to represent language, it is imperative to explore the question: Are there ways that humanities methodologies and research goals can exert greater influence in the computational sciences, make the history of literary studies more relevant in the evolution of machine learning techniques, and better serve our shared social values?

References

Algee-Hewitt, Mark. 2015. "The Order of Poetry: Information, Aesthetics and Jakobson's Theory of Literary Communication." Presented at the Russian Formalism & the Digital Humanities Conference, April 13, Stanford University, Palo Alto, CA. https://digitalhumanities.stanford.edu/russian-formalism-digital-humanities.

Algee-Hewitt, Mark. 2019. "Criticism, Augmented." In the Moment (blog). April 1, 2019. https://critinq.wordpress.com/2019/04/01/computational-literary-studies-participant-forum-responses/.

Allen, Carl, and Timothy Hospedales. 2019. "Analogies Explained: Towards Understanding Word Embeddings." In International Conference on Machine Learning, 223–31. PMLR. http://proceedings.mlr.press/v97/allen19a.html.

18Linguistics scholars are also adopting computational models to make progress with theories related to semantic similarity. For instance, see Potts 2019.

19See Lin 1998, for example.


Argamon, Shlomo, and Mark Olsen. 2009. "Words, Patterns and Documents: Experiments in Machine Learning and Text Analysis." Digital Humanities Quarterly 3 (2). http://www.digitalhumanities.org/dhq/vol/3/2/000041/000041.html.

Bode, Katherine. 2020. "Why You Can't Model Away Bias." Modern Language Quarterly 81 (1): 95–124. https://doi.org/10.1215/00267929-7933102.

Buurma, Rachel Sagner, and Laura Heffernan. 2018. "Search and Replace: Josephine Miles and the Origins of Distant Reading." Modernism/Modernity Print+ 3, Cycle 1 (April). https://modernismmodernity.org/forums/posts/search-and-replace.

Camacho-Collados, Jose, and Mohammad Taher Pilehvar. 2018. "From Word To Sense Embeddings: A Survey on Vector Representations of Meaning." Journal of Artificial Intelligence Research 63 (December): 743–88. https://doi.org/10.1613/jair.1.11259.

Chander, Manu Samriti. 2017. Brown Romantics: Poetry and Nationalism in the Global Nineteenth Century. Lewisburg, PA: Bucknell University Press.

Critical Inquiry. 2019. "Computational Literary Studies: A Critical Inquiry Online Forum." In the Moment (blog). March 31, 2019. https://critinq.wordpress.com/2019/03/31/computational-literary-studies-a-critical-inquiry-online-forum/.

Da, Nan Z. 2019. “The Computational Case against Computational Literary Studies.” CriticalInquiry 45 (3): 601–39. https://doi.org/10.1086/702594.

D’Ignazio, Catherine, and Lauren Klein. 2020. Data Feminism. Cambridge: MIT Press.

Douglas, Samantha, Dan Dirilo, Taylor-Dawn Francis, Keith Giles, and Marisa Plumb. n.d. "The Bengal Annual: A Digital Exploration of Non-Canonical British Romantic Literature." https://scalar.usc.edu/works/the-bengal-annual/index.

Dubin, David. 2004. "The Most Influential Paper Gerard Salton Never Wrote." Library Trends 52 (4): 748–64. https://www.ideals.illinois.edu/bitstream/handle/2142/1697/Dubin748764.pdf?sequence=2.

Firth, J.R. 1957. "A Synopsis of Linguistic Theory." In Studies in Linguistic Analysis, 1–32. Oxford: Blackwell.

Gagliano, Andrea, Emily Paul, Kyle Booten, and Marti A. Hearst. 2019. "Intersecting Word Vectors to Take Figurative Language to New Heights." In Proceedings of the Fifth Workshop on Computational Linguistics for Literature, 20–31. San Diego, California: Association for Computational Linguistics. https://doi.org/10.18653/v1/W16-0203.

Gavin, Michael, Collin Jennings, Lauren Kersey, and Brad Pasanek. 2019. "Spaces of Meaning: Conceptual History, Vector Semantics, and Close Reading." In Debates in the Digital Humanities 2019, edited by Matthew K. Gold and Lauren F. Klein, 243–267. Minneapolis: University of Minnesota Press.

Goldstone, Andrew. 2019. "Teaching Quantitative Methods: What Makes It Hard (in Literary Studies)." In Debates in the Digital Humanities 2019, edited by Matthew K. Gold and Lauren F. Klein. Minneapolis: University of Minnesota Press. https://dhdebates.gc.cuny.edu/read/untitled-f2acf72c-a469-49d8-be35-67f9ac1e3a60/section/620caf9f-08a8-485e-a496-51400296ebcd#ch19.

Gonen, Hila, and Yoav Goldberg. 2019. "Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them." ArXiv:1903.03862, September. https://arxiv.org/abs/1903.03862.

Griffiths, Thomas L., Mark Steyvers, and Joshua B. Tenenbaum. 2007. "Topics in Semantic Representation." Psychological Review 114 (2): 211–44. https://doi.org/10.1037/0033-295X.114.2.211.


Harris, Katherine D. 2015. Forget Me Not: The Rise of the British Literary Annual, 1823–1835. Athens: Ohio University Press.

Harris, Katherine D. 2019. "The Bengal Annual and #bigger6." Keats-Shelley Journal 68: 117–18. https://muse.jhu.edu/article/771132.

Kirschenbaum, Matthew. 2007. "The Remaking of Reading: Data Mining and the Digital Humanities." Presented at the National Science Foundation Symposium on Next Generation of Data Mining and Cyber-Enabled Discovery for Innovation, Baltimore, MD, October 11. https://www.csee.umbc.edu/~hillol/NGDM07/abstracts/talks/MKirschenbaum.pdf.

Klein, Lauren F. 2019. "What the New Computational Rigor Should Be." In the Moment (blog). April 1, 2019. https://critinq.wordpress.com/2019/04/01/computational-literary-studies-participant-forum-responses-5/.

Koehrsen, Will. 2018. "Neural Network Embeddings Explained." Towards Data Science, October 2, 2018. https://towardsdatascience.com/neural-network-embeddings-explained-4d028e6f0526.

Kozlowski, Austin C., Matt Taddy, and James A. Evans. 2019. "The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings." American Sociological Review 84 (5): 905–949. https://doi.org/10.1177/0003122419877135.

Kutuzov, Andrey, Lilja Øvrelid, Terrence Szymanski, and Erik Velldal. 2018. "Diachronic word embeddings and semantic shifts: a survey." In Proceedings of the 27th International Conference on Computational Linguistics, 1384–1397. Santa Fe, New Mexico: Association for Computational Linguistics. https://www.aclweb.org/anthology/C18-1117.

Laubichler, Manfred D., Jane Maienschein, and Jürgen Renn. 2019. "Computational History of Knowledge: Challenges and Opportunities." Isis 110 (3): 502–512.

Lin, Dekang. 1998. "An Information-Theoretic Definition of Similarity." In Proceedings of the Fifteenth International Conference on Machine Learning, 296–304. San Francisco, California: Morgan Kaufmann Publishers Inc.

Mao, Rui, Chenghua Lin, and Frank Guerin. 2018. "Word Embedding and WordNet Based Metaphor Identification and Interpretation." In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1222–31. Melbourne, Australia: Association for Computational Linguistics. https://doi.org/10.18653/v1/P18-1113.

Nelson, Laura K. 2020. "Computational Grounded Theory: A Methodological Framework." Sociological Methods & Research 49 (1): 3–42. https://doi.org/10.1177/0049124117729703.

Noble, Safiya Umoja. 2018. Algorithms of Oppression: How Search Engines Reinforce Racism.New York: New York University Press.

Poole, Alex H. 2013. "Now Is the Future Now? The Urgency of Digital Curation in the Digital Humanities." Digital Humanities Quarterly 7 (2). http://www.digitalhumanities.org/dhq/vol/7/2/000163/000163.html.

Potts, Christopher. 2019. "A Case for Deep Learning in Semantics: Response to Pater." Language 95 (1): e115–24. https://doi.org/10.1353/lan.2019.0019.

Rhody, Lisa. 2017. "Beyond Darwinian Distance: Situating Distant Reading in a Feminist Ut Pictura Poesis Tradition." PMLA 132 (3): 659–667.

Risam, Roopika. 2018. "Decolonizing the Digital Humanities in Theory and Practice." In The Routledge Companion to Media Studies and Digital Humanities, edited by Jentery Sayers, 78–86. New York: Routledge.



Roh, Yuji, Geon Heo, and Steven Euijong Whang. 2019. "A Survey on Data Collection for Machine Learning: A Big Data–AI Integration Perspective." IEEE Transactions on Knowledge and Data Engineering Early Access: 1–20. https://doi.org/10.1109/TKDE.2019.2946162.

Tversky, Amos. 1977. "Features of Similarity." Psychological Review 84 (4): 327–52. https://doi.org/10.1037/0033-295X.84.4.327.

Underwood, Ted. 2019. Distant Horizons: Digital Evidence and Literary Change. Chicago:University of Chicago Press.

Whitt, Richard J., ed. 2018. Diachronic Corpora, Genre, and Language Change. John Benjamins Publishing Company.

Chapter 4

Machine Learning in Digital Scholarship

Andrew Janco
Haverford College

Introduction

We are entering an exciting time when research on machine learning and innovation no longer requires background knowledge in programming, mathematics, or data science. Tools like RunwayML, the Teachable Machine, and Google AutoML allow researchers to train project-specific classification and object detection models. Other tools such as Prodigy or INCEpTION provide the means to train custom named entity recognition and named entity linking models. Yet without a clear way to communicate the value and potential of these solutions to humanities scholars, they are unlikely to incorporate them into their research practices.

Since 2014, dramatic innovations in machine learning have occurred, providing new capabilities in computer vision, natural language processing, and other areas of applied artificial intelligence. Scholars in the humanities, however, are often skeptical. They are eager to realize the potential of these new methods in their research and scholarship, but they do not yet have the means to do so. They need to make connections between machine capabilities, research in the sciences, and tangible outcomes for humanities scholarship, but very often, drawing these connections is more a matter of chance than deliberate action. Is it possible to make such connections deliberately and identify how machine learning methods can benefit a scholar's research?

This article outlines a method for connecting the technical possibilities of machine learning with the intellectual goals of academic researchers in the humanities. It argues for a reframing of the problem. Rather than appropriating innovations from computer science and artificial intelligence, this approach starts from humanities-based methods and practices. This shift allows us to work from the needs of humanities scholars in terms that are familiar and have recognized value to their peers. Machines can augment scholars' tasks with greater scale, precision, and reproducibility than are possible for a single scholar alone.



However, only relatively basic and repetitive tasks can presently be delegated to machines.

This article argues that John Unsworth's concept of "scholarly primitives" is an effective tool for identifying basic tasks that can be completed by computers in ways that advance humanities research (2000). As Unsworth writes, primitives are "basic functions common to scholarly activity across disciplines, over time, and independent of theoretical orientation." They are the building blocks of research and analysis. As the roots and foundations of our work, "primitives" provide an effective starting point for the augmentation of scholarly tasks.

Here it is important to note that the end goal is not the automation of scholarship, but ratherthe delegation of appropriate tasks to machines. As François Chollet recently noted,

Our field isn't quite "artificial intelligence" — it's "cognitive automation": the encoding and operationalization of human-generated abstractions / behaviors / skills. The "intelligence" label is a category error. (2020)

This view shifts our focus from the potential intelligence of machines towards their ability to complete useful tasks for human ends. Specifically, they can augment scholars' work by performing repetitive tasks at scale with superhuman speed and precision. I proceed from this understanding to argue for an experimental and interpretive approach to machine learning that highlights the value of the interaction between the scholar and machine rather than what machines can produce.

***

Unsworth's notion of the "scholarly primitive" takes its meaning from programming and refers to the most basic operations and data types of a programming language. Primitives form the building blocks for all other components and operations of the language. This borrowing of terminology also suggests that primitives are not universal. A sequence of characters called a string is a primitive in Python, but not in Java or C. The architecture of a language's primitives changes over time and evolves with community needs. The Python and C communities, for example, have embraced Unicode as a standard to allow strings in every human language (including emojis). Other communities continue to use a range of character encodings, which grants greater flexibility to the individual programmer and avoids the notion that there should be a common standard.

For scholarship, the term offers a metaphor and point of departure. It poses a question: What are the most basic elements of scholarly research and analysis? Unsworth offers several initial examples of primitives to illustrate their value without a claim that they are comprehensive, including discovering, annotating, comparing, referring, sampling, illustrating, and representing. These terms offer a "list of functions (recursive functions) that could be the basis for a manageable but also useful tool-building enterprise in humanities computing." Primitives can thus guide us in the creation of computational tools for scholarship.

For example, with the primitive of comparison, a scholar might study different editions of a text, searching for similarities and differences that often lead to new insights or highlight ideas that would otherwise be taken for granted. As a tool, comparison can (but does not always) reveal new information. For an assignment in graduate school, I compared a historical calendar that showed the days of the week against entries in Stalin's appointment book. The simple juxtaposition revealed that none of Stalin's appointments were on a Sunday. This example raises questions for further investigation and interpretation.


If Stalin was an atheist who worked at all times of the day and night, why wouldn't he schedule meetings on Sundays? Perhaps it was a legacy from Stalin's youth spent in seminary? Is there a similar pattern in other periods of Stalin's life? The craft of humanities research relies on many such simple initial queries. It should be noted that these little experiments are just the beginning of a research project. Nonetheless, the utility of comparison is clear. If anything, it seems so basic as to go unnoticed. This particular comparison offered an insight and new knowledge that led to further research questions.

Such beginnings are often a matter of luck. However, machine learning offers an opportunity to increase the dimensionality of comparisons. The similarities and differences between two editions of a text can easily be quantified using Levenshtein distance.1 However, that will only capture the differences at the level of characters on a page. With machine learning, we can train embeddings that account for semantics, authors, time periods, genders, and other features of a text and its contents simultaneously. We can quantify similarity in new ways that facilitate new forms of comparison. This approach builds on the original meaning and purpose of comparison as a form of "scholarly primitive," but opens additional directions for research and opportunities for insights. Rather than relying on happenstance or intuition to find productive comparisons, we can systematically search and compare research materials.
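As a point of reference, Levenshtein distance itself is simple to compute; the following is a minimal sketch (the sample strings are invented for illustration):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insert, delete,
    substitute) needed to turn string a into string b."""
    prev = list(range(len(b) + 1))            # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                            # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                  # deletion
                curr[j - 1] + 1,              # insertion
                prev[j - 1] + (ca != cb),     # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]

# Two "editions" of a line differ by two character edits:
print(levenshtein("To be, or not to be", "To be or not to bee"))  # -> 2
```

A character-level count like this says nothing about meaning, which is exactly the gap that learned embeddings are meant to close.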

The second "scholarly primitive" that lends itself well to augmentation is annotation. This activity takes different forms across disciplines. A literary scholar might underline notable sections of a text or write a note in the margins. A historian transcribes information from an archival source into a notebook. At their core, these actions add observations and associations to the original materials. Those steps in the research process are the first, most basic step that connects information in a source to a larger set of research materials. We add context and meaning to materials that make them part of a larger collection.

When working with texts or images, machine learning models are presently capable of making simple annotations and associations. For example, named entity recognition (NER) models are able to recognize person names, place names, and other key words in text. Each label is an annotation that makes a claim about the content of the text. "Steamboat Springs" or "New York City" are linked to an entity called PLACE. Once again, we are speaking about the most basic first steps that scholars perform during research. I know that Steamboat Springs is a place. It's where I grew up. However, another scholar, one less versed in small mountain towns in Colorado, might not recognize the town name. They might identify it as a spring or a ski resort; perhaps a volcanic field in Nevada. The idea of "scholarly primitives" forces us to confront the importance of domain knowledge and the role that it plays in the interpretation of materials. To teach a machine to find entities, we must first explain everything in very specific terms. We can train the machine to use surrounding contextual information in order to predict — correctly — that "Steamboat Springs" refers to a town, a spring, or a ski resort.

As part of a project with Philip Gleissner, I trained a model that correctly identifies Soviet journal names in diary entries. For instance, the machine uses contextual clues to identify when the term Volga refers to the journal by that name and not to the river or the automobile. Where is the mention of "October" a journal name and not a month, a factory name, or the revolution? The trained model makes it possible to identify references to journals in a corpus of over 400,000 diary entries. This in turn makes it possible to research the diaries with a focus on reader reception. Normally, this would be a laborious and time-consuming task. Each time the machine predicts an entity in the text, it adds annotations. What was simply text is now marked as an entity.

1Named after the Soviet mathematician Vladimir Levenshtein, Levenshtein distance uses the number of changes that would be needed to make two objects identical as a measure of their similarity.


As part of this project, we had to define the relevant entities, create training data, and train the model to accomplish a specific task. This process has tangible value for scholarship because it forces us to break down complicated research processes into their most basic tasks and processes.
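As a rough illustration of this workflow, the sketch below trains a toy entity recognizer with spaCy; the library choice, the English-language examples, and the tiny training set are all assumptions for demonstration, not the project's actual code or data:

```python
import spacy
from spacy.training import Example

# Hypothetical labeled sentences: "Volga" as a journal vs. "Volga" as a river.
TRAIN_DATA = [
    ("She published the story in Volga last spring.",
     {"entities": [(27, 32, "JOURNAL")]}),
    ("We swam across the Volga near Saratov.",
     {"entities": []}),
]

nlp = spacy.blank("en")                 # the real project worked on Russian diaries
ner = nlp.add_pipe("ner")
ner.add_label("JOURNAL")

optimizer = nlp.initialize()
for epoch in range(20):                 # toy loop; real training needs far more data
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer)

doc = nlp("A new review appeared in Volga in 1987.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

Annotation tools such as Prodigy or INCEpTION, mentioned in the introduction, exist precisely to make producing the labeled examples less laborious.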

As noted before, annotation can be an act of association and linking. Natural language processing is capable of not only recognizing entities in a text, but also associating that text with a record in a knowledge base. This capability is called named entity linking. Using embeddings, a statistical language model can not only predict that "Steamboat Springs" is a town, but that it is a specific town with the record Q984721 in Wikidata. This association opens a wealth of contextual information about the place, including its population, latitude and longitude, and elevation. A scholar might have ample knowledge and experience reading literature — specifically, Milton. A machine does not, but it has access to contextual information that enriches analysis and permits associations. The result is a reading of a literary work that accounts for contextual knowledge. To be sure, named entity linking is not a replacement for domain knowledge. However, it is able to augment a scholar's contextual knowledge of materials and make that information available for study during research.
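Once an entity is linked to an identifier, that contextual information can be fetched mechanically. Here is a minimal sketch using the public Wikidata SPARQL endpoint (the property IDs are Wikidata's: P1082 population, P625 coordinates, P2044 elevation):

```python
import requests

QID = "Q984721"  # the identifier cited above for Steamboat Springs
query = f"""
SELECT ?population ?coord ?elevation WHERE {{
  OPTIONAL {{ wd:{QID} wdt:P1082 ?population. }}
  OPTIONAL {{ wd:{QID} wdt:P625  ?coord. }}
  OPTIONAL {{ wd:{QID} wdt:P2044 ?elevation. }}
}}
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "entity-linking-demo/0.1"},  # polite self-identification
)
for row in resp.json()["results"]["bindings"]:
    print({name: cell["value"] for name, cell in row.items()})
```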

At this point, we are asking the machine not only to sort or filter data, but to reason actively about its contents. Machine learning offers the potential to automate humanities annotation tasks at scale. This is true of basic tasks, such as recognizing that a given text is a letter. It is also true of object recognition tasks, such as identifying a state seal in a letterhead or other visual attributes. A Haverford College student was doing research on documents in a digital archive that we are building with the Grupo de Apoyo Mutuo (GAM), of more than three thousand case investigations of disappeared persons during the Guatemalan Civil War. They noticed that many of the documents were signed with a thumbprint. The student and I trained an image classification model to identify those documents, thus providing the capability to search the entire collection of documents for this visual attribute. The thumbprints provided a proxy for literacy and allowed the student to study the collection in new ways. Similarly, documents containing the state seal of Guatemala are typically letters from the government in reply to GAM's requests for information about disappeared persons.

At present, several excellent tools exist to facilitate machine annotation of images and texts. Google's Teachable Machine offers an intuitive web application that humanities faculty and students can use to train classification models for images, sounds, and poses. To take the example above, the user would upload images of correspondence. They would then upload images of documents that are not letters.2 Once training begins, a base model is loaded and trained on the new categories. Because the model already has existing training on image categories, it is able to learn the new category with only a few examples. This process is called transfer learning. For more advanced tasks, Google offers AutoML Vision and Natural Language, which are able to process large collections of text or images and to deploy trained models using Google cloud infrastructure. Similar products are available from Amazon, IBM, and other companies. Runway ML offers a locally installed program with more advanced capabilities than the Teachable Machine. Runway ML works with a wide range of machine learning models and is an excellent way for scholars to explore their capabilities without having to write code.3

2In the Google Cloud Terms of Service there is specific assurance that your data will not be shared or used for any other purpose than the training of the model. More expert analysis may find concerns, and caution is always warranted. At present, there seems to be no more risk in using cloud services for ML tasks than there is for using cloud services more generally. See https://cloud.google.com/terms/.

3Teachable Machine, https://teachablemachine.withgoogle.com/; Google AutoML, https://cloud.google.com/automl/; Runway ML, https://runwayml.com/.


The accessibility of tools like Runway allows for low-stakes experimentation and exploration. It is also a particularly good way for scholars to explore new methods and discover new materials.
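For readers who want to see the same transfer-learning pattern in code, here is a minimal sketch in Keras; the directory name and the two-class setup (documents with and without a thumbprint, as in the GAM example above) are illustrative assumptions:

```python
import tensorflow as tf

# Assumes a folder with one subdirectory per class, e.g.
# gam_documents/thumbprint/ and gam_documents/no_thumbprint/.
train = tf.keras.utils.image_dataset_from_directory(
    "gam_documents", image_size=(224, 224), batch_size=16)

# Load a base model pretrained on ImageNet and freeze its weights.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # scale pixels to [-1, 1]
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),     # thumbprint: yes or no
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train, epochs=5)  # few examples can suffice because the base is pretrained
```

The frozen base supplies general visual features; only the small classification head is learned from the scholar's examples, which is why a handful of labeled documents can be enough.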

For Unsworth, discovery is largely the process of identifying new resources. We can find new sources in a library catalog, on the shelf, or in a conversation. These activities require a human in the loop because it is the person's incomplete knowledge of a source that makes it a "discovery" when found. Given that machines reason about the content of text and images in ways that are quite unlike those of humans, machine learning opens new possibilities for discovery. When it comes to the differences in our own habits of mind and the computational processes of artificial networks, we may speak of "neurodiversity." Scholars can benefit from these differences, since the strengths of machine thinking complement our needs.

Machine learning models offer a variety of ways to identify similarity and difference within research materials. Yale's PixPlot, for example, uses a convolutional network to train image embeddings, which are then plotted relative to one another in two-dimensional space with a stochastic neighbor embedding algorithm (t-SNE) (Duhaime n.d.).4 PixPlot creates a striking visualization of hundreds or thousands of images, which are organized and clustered by their relative visual similarity. As a research tool, PixPlot and similar projects offer a quick means to identify statistically relevant similarities and clusters. This visualization reveals what patterns are most evident to the machine and provides a discovery tool for associations that might not be evident to a human researcher. Ben Schmidt has applied a comparable process to "machine read" and visualize fourteen million texts in the HathiTrust (n.d., 2018).5 Using the relative co-occurrence of words in a book, Schmidt is able to train book embeddings. Schmidt's vectors provide an original way to organize and label texts based purely on the machine's "reading" of a book. These machine-generated labels and clusters can be compared against human-generated metadata. The value of this work is the human investigation of what machine models find significant in a collection of research materials. For example, with topic modeling, a scholar must interpret what a particular algorithm has identified as a statistically significant topic by interpreting a cryptic chain of words. The topic "menu, platter, coffee, ashtray" is likely related to a diner. In these efforts, Scattertext offers an effective tool to visualize which terms are most distinctive of a text category. In a given corpus of text, I can identify which words are most exemplary of poetry and which words are most exemplary of prose. Scattertext creates a striking and useful visualization, or it can be used in the terminal to process large collections of text.
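A compressed sketch of a PixPlot-style pipeline (pretrained CNN features, then t-SNE) might look like this; the directory of scans and the model choice are assumptions, not PixPlot's actual implementation:

```python
from pathlib import Path

import numpy as np
import tensorflow as tf
from sklearn.manifold import TSNE

# Embed each image with a pretrained CNN (one 1280-dim vector per image).
model = tf.keras.applications.MobileNetV2(
    include_top=False, pooling="avg", weights="imagenet")

paths = sorted(Path("scans").glob("*.jpg"))   # assumes a folder of many images
imgs = np.stack([
    tf.keras.utils.img_to_array(
        tf.keras.utils.load_img(p, target_size=(224, 224)))
    for p in paths
])
features = model.predict(tf.keras.applications.mobilenet_v2.preprocess_input(imgs))

# Project the embeddings to 2-D; nearby points are visually similar images.
xy = TSNE(n_components=2, perplexity=min(30, len(paths) - 1)).fit_transform(features)
print(xy[:5])  # coordinates ready for plotting or an interactive viewer
```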

Conclusion

As a conceptual tool, "scholarly primitives" has considerable promise to connect the intellectual goals of academic researchers in the humanities with the technical possibilities of machine learning. Rather than focusing on the capabilities of machine learning methods and the priorities of machine learning researchers, this method offers a means to build from the existing research practices of humanities scholars. It allows us to identify what kinds of tasks would benefit from being augmented. Using "primitives" shifts the focus away from large abstract goals, such as research findings and interpretive methods, to the micro-methods and actions of humanities research. By augmenting these activities, we are able to benefit from the scale and precision afforded by computational methods, as well as the valuable interplay between scholars and machines as humanities research practices are made explicit and reproducible.

4See also https://artsexperiments.withgoogle.com/tsnemap/.

5At time of writing, Schmidt's digital monograph Creating Data (n.d.) is a work in progress, with most sections empty until the official publication.



References

Chollet, François. 2020. "Our Field Isn't Quite 'Artificial Intelligence' — It's 'Cognitive Automation': The Encoding and Operationalization of Human-Generated Abstractions / Behaviors / Skills. The 'Intelligence' Label Is a Category Error." Twitter, January 6, 2020, 10:45 p.m. https://twitter.com/fchollet/status/1214392496375025664.

Duhaime, Douglas. n.d. “PixPlot.” Yale DHLab. Accessed July 12, 2020. https://dhlab.yale.edu/projects/pixplot/.

Schmidt, Benjamin. n.d. "A Guided Tour of the Digital Library." In Creating Data: The Invention of Information in the American State, 1850–1950. http://creatingdata.us/datasets/hathi-features/.

———. 2018. "Stable Random Projection: Lightweight, General-Purpose Dimensionality Reduction for Digitized Libraries." Journal of Cultural Analytics, October. https://doi.org/10.22148/16.025.

Unsworth, John. 2000. "Scholarly Primitives: What Methods Do Humanities Researchers Have in Common, and How Might Our Tools Reflect This?" Paper presented at the Symposium on Humanities Computing: Formal Methods, Experimental Practice, King's College, London, May 2000. http://www.people.virginia.edu/~jmu2m/Kings.5-00/primitives.html.

Chapter 5

Cultures of Innovation: Machine Learning as a Library Service

Sue Wiegand
Saint Mary's College

Introduction

Libraries and librarians have always been concerned with the preservation of knowledge. To this traditional role, librarians in the 20th century added a new function—discovery—teaching people to find and use the library's collected scholarship. Information Literacy, now considered the signature pedagogy in library instruction, evolved from the previous Bibliographic Instruction. As Digital Literacy, the next stage, develops, students can come to the library to learn how to leverage the greatest strengths of Machine Learning. Machines excel at recognizing patterns; researchers at all levels can experiment with innovative digital tools and strategies, and build 21st-century skill sets. Librarian expertise in preservation, metadata, and sustainability through standards can be leveraged as a value-added service. Leading-edge librarians now invite all the curious to benefit from the knowledge contained in the scholarly canon, accessible through libraries as curated living collections in multiple formats at distributed locations, transformed into new knowledge using new ways to visualize and analyze scholarship.

Library collections themselves, including digitized, unique local collections, can provide the data for new insights and ways of knowing produced by Machine Learning. The library could also be viewed as a technology sandbox, a place to create knowledge, connect researchers, and bring together people, ideas, and new technologies. Many libraries are already rising to this challenge, working with other cultural institutions in creating a culture of innovation as a new learning paradigm, exemplified by Machine Learning instruction and technology tool exploration.



Library Practice

The role of the library in preserving, discovering, and creating knowledge continues to evolve. Originally, libraries came into being as collections to be preserved, managed, and disseminated, a central repository of knowledge, possibly for political reasons (Ryholt and Barjamovic 2019, 1–2). Libraries founded by scholars and devoted to learning came later, during the Middle Ages (Casson 2001, 145). In more recent times, librarians began "[c]ollecting, organizing, and making information accessible to scholars and to citizens of a democratic republic" based on values developed during the Enlightenment (Bivens-Tatum 2012, 186).

Bibliographic Instruction in libraries, and later Information Literacy, embodied the idea of learning in the library as the next step beyond collecting, with librarians instructing on information infrastructure with the goal of empowering library users to find, evaluate, and use scholarly information in print and digital formats, with an emphasis on privacy and intellectual freedom as core library values. Now, librarians are also contributing to and participating in the learning enterprise by partnering with the disciplines to produce new knowledge. This final step of knowledge creation in the library completes the scholarly communications cycle of building on previous scholarship—"standing on the shoulders of giants."

One way to cultivate innovation in libraries is to include Machine Learning in the library's array of tools, resources, and services, both behind-the-scenes and public-facing. Librarians are expert at developing standards, preserving the scholarly record, and refining metadata to enhance interdisciplinary discovery of research, scholarship, and creative works. Librarian expertise could go far beyond local library collections to a global perspective and a normative practice of participation at scale in innovative emerging technologies such as Machine Learning.

For instance, citation analysis of prospective collections for the library to collect, and of the institution's research outputs, would provide valuable information both for further collection development and for developing researchers' toolkits. Machine Learning, with its predilection for finding patterns, would reveal gaps in the literature and open up new questions to be answered, solving problems and leading to innovation. As one example, Yewno, a multi-disciplinary platform that uses Machine Learning to help combat "Information Overload," advertises that it "helps researchers, students, and educators to deeply explore knowledge across interdisciplinary fields, sparking new ideas along the way…" and "makes [government] information accessible by breaking open silos and comprehending the complicated interconnections across agencies and organizations," among other applications to improve discovery (Yewno n.d.). Also, in 2019, the Library of Congress hosted a Summit as "part of a larger effort to learn about machine learning and the role it could play in helping the Library of Congress reach its strategic goals, such as enhancing discoverability of the Library's collections, building connections between users and the Library's digital holdings, and leveraging technology to serve creative communities and the general public" (Jakeway 2020). Integration of Machine Learning technologies is already starting at high levels in the library world.

New Services

A focus on Machine Learning can inspire new library services to enhance teaching and learning. Connecting people with ideas and with technology enables library virtual spaces to be used as a learning service by networking researchers at all levels in the enterprise of knowledge creation. Finding gaps in the literature would be a helpful first step in new library discovery tools.


A way this could be done is through a "Researchers' Workstation," an end-to-end toolkit that might start by using Machine Learning tools to automate alerts of new content in a narrow area of interest and help researchers at all levels find and focus on problem-solving. A Researchers' Workstation could contain a collection of analytic tools and learning modules to guide users through the phases of discovery. Then, managing citations would be an important step in the process—storing, annotating, and sorting out the most relevant. Starting research reports, keeping lab notebooks, finding datasets, and preserving the researcher's own data are all relevant to the final results. A collaboration tool would enable researchers to find others with similar interests and share data or work collaboratively from anywhere, asynchronously. Having all these tools in one serendipitous virtual place is an extension of the concept of the library as the physical place to start research and scholarship. It is merely the containers of knowledge that are different.

Some of this functionality exists already, both in Open Source software such as Zotero for citation management, and in proprietary tools that combine multiple functions, such as Mendeley from Elsevier.1 Other commercial publishers are developing tools to enable researchers to work within their proprietary platforms, from the point of searching for ideas and finding research gaps through the process of writing and submitting finished papers for publication. The Confederation of Open Access Repositories (COAR) is similarly developing "next generation repositories" software integrating end-to-end tools for the Open Access literature archived in repositories, to "facilitate the development of new services on top of the collective network, including social networking, peer review, notifications, and usage assessment" (Rodrigues et al. 2017, 5).

What else might a researcher want to do that the library could include in a Researchers' Workstation? Finding, writing, and keeping track of grants could be incorporated at some level. Generating a timeline might be helpful, and infographics and data visualizations could improve research communication and even help make the case for the importance of the study with others, especially the public and funders. Project management tools might be welcomed by some researchers, too.

Finally, when it's time to submit the idea (whether at the preliminary or preprint stage) to an arXiv-like repository or an institutional repository, as well as to journals of interest (also identified through Machine Learning tools), the process of submission, peer review, revision, and re-submission could be done seamlessly. The tools and functions in the Workstation would ideally be modular, interoperable, and easy to learn and use, as well as continuously updated. The Workstation would be a complete ecosystem for the research cycle—saving time in the Scholarly Communications process and providing one place to go for discovery, literature review, data management, collaboration, preprint posting, peer review, publication, and post-print commenting.2

Collections as Data, Collections as Resources

Collections provide the greatest scope for library Machine Learning innovation to date, both applied and basic/theoretical, as exemplified by the literature search, which now includes a myriad of Open content on a global basis. Especially if the pathway to using the expanded collections is clear and coherent, and the library provides instruction on why and how to use the various tools to save time and increase the impact of research, researchers at all levels will benefit from partnering with librarians for a more comprehensive view of current knowledge in an area.

1See https://www.zotero.org and https://www.mendeley.com.

2In 2013, I wrote a blog post that mentions the idea (Wiegand).


The Always Already Computational: Collections as Data final report and project deliverables and the Collections as Data: Part to Whole project were designed to "develop models that support collections as data implementation and holistic reconceptualization of services and roles that support scholarly use…." The project specifically seeks "to create a framework and set of resources that guide libraries and other cultural heritage organizations in the development, description, and dissemination of collections that are readily amenable to computational analysis" (Padilla et al. 2019).

As a more holistic approach to data-driven scholarship, these resources aim to provide access to large collections to enable computational use on the national level. Some current library databases have already built this kind of functionality. JSTOR, for example, will provide up to 25,000 documents (or more by special request) in a dataset for analysis.3 Clarivate's Content as a Service provides Web of Science data to accommodate multiple purposes.4 Besides the many freely available bibliodata sources, researchers can sign up for developer accounts in databases such as Scopus to work with datasets for text mining and computational analysis.5 Using library-licensed collections as data could allow researchers to save time in reading a large corpus, stay updated on a topic of interest, analyze the most important topics in a given time period, confirm gaps in the research literature for investigation, and increase the efficiency of sifting through massive amounts of research in, for instance, the race to develop a COVID-19 vaccine (Ong 2020; Vamathevan 2019).

Learning Spaces

Machine Learning is a concept that calls out for educating library users through all avenues, including library spaces. Taking a cue from other GLAM (Galleries, Libraries, Archives, and Museums) cultural institutions, especially galleries and museums, libraries and archives could mount exhibits and incorporate learning into library spaces as a form of outreach to teach how and why using innovative tools will save time and improve efficiency. Inspirational, continuously-updating dashboards and exhibits could show progress and possibilities, while physical and virtual tutorials might provide a game-like interface to spark creativity. Showcasing scholarship and incorporating events and speakers help create a new culture of ideas and exploration. Events bring people together in library spaces to network for collaborative endeavors. As an example, the Cleveland Museum of Art is analyzing visitor experiences using an ArtLens app to promote its collections.6 The Library of Congress, as mentioned, hosted a summit that explored such topics as building Machine Learning literacy, attracting interest in GLAM datasets, operationalizing Machine Learning, crowdsourcing, and copyright implications for the use of content. As another example, in 2017 the United Kingdom's National Archives attempted to demystify Machine Learning and explore ethics and applications such as topic modeling, which

was used to find key phrases in Discovery record descriptions and enable innovative exploration of the catalogue; and it was also deployed to identify the subjects being discussed across Cabinet Papers. Other projects included the development

3See https://www.jstor.org/dfr/about/dataset-services.

4See https://clarivate.com/search/?search=computational%20datasets.

5See https://dev.elsevier.com/ and https://guides.lib.berkeley.edu/text-mining.

6See https://www.clevelandart.org/art-museums-and-technology-developing-new-metrics-measure-visitor-engagement and https://www.clevelandart.org/artlens-gallery/artlens-app.


of a system that found the most important sentence in a news article to generate automated tweeting, while another team built a system to recognise computer code written in different programming languages — this is a major challenge for digital preservation. (Bell 2018)

Finally, the HG Contemporary Gallery in Chelsea, in 2019, mounted an exhibit that utilized a "machine-learning algorithm that did most of the work" (Bogost 2019).

Sustainable Innovation

Diversity, equity, and inclusion (DEI) concerns with the scholarly record, and increasingly with recognized biases implicit in algorithms, can be addressed by a very intentional focus on the value of differing perspectives in solving problems. Kat Holmes, an inclusive design expert previously at Microsoft and now a leading user experience designer at Google, urges a framework for inclusivity that counteracts bias with different points of view by recognizing exclusion, learning from human diversity, and bringing in new perspectives (Bedrossian 2018). Making more data available, and more diverse data, will significantly reduce the imbalance perpetuated by a traditional-only corpus. In sustainability terms, Machine Learning tools must be designed to continuously seek to incorporate diverse perspectives that go beyond the traditional definitions of the scholarly canon if they are to be useful in combating bias. Collections used as data in Machine Learning might undergo analysis by researchers, including librarian researchers, to determine the balance of content. Library subject headings should be improved to better reflect the diversity of human thought, cultures, and global perspectives.

Streamlining procedures is to everyone's benefit, and saving time is universally desired. Efficiency won't fix the time crunch everyone faces, but with too much to do and too much to read, information overload is a very real threat to advancing the research agenda and confronting a multitude of escalating global problems. Machine Learning techniques, applied at scale to large corpora of textual data, could help researchers pinpoint areas where the human researcher should delve more deeply, eliminating irrelevant sources and honing in on possible solutions to problems. One instance: a new service, Scite.ai, "can automatically tell readers whether papers have been supported or contradicted by later academic work" (Khamsi 2020). WHO (World Health Organization) is providing a Global Research Database that can be searched or downloaded.7 In research on self-driving vehicles, a systematic literature review found more than 10,000 articles, an estimated year's worth of reading for an individual. A tool called Iris.ai allowed groupings of this archive by topic and is one of several "targeted navigation" tools in development (Extance 2020). Working together as efficiently as possible is the only way to move ahead, and Machine Learning concepts, tools, and techniques, along with training, can be applied to increasingly large textual datasets to accelerate discovery.

Machine Learning, like any other technology, augments human capacities; it does not replace them. If 10% of library resources (measured in whatever way works for each particular library), including both the time of expert librarians and staff and financial resources, were utilized for innovation, libraries would develop a virtuous, self-sustaining cycle. Technologies that are not as useful can be assessed and dropped in an agile library, the useful can be incorporated into the 90% of existing services, and the resources (people and money) repurposed.

7See https://www.who.int/emergencies/diseases/novel-coronavirus-2019/global-research-on-novel-coronavirus-2019-ncov.


In the same way, that 10% of library resources invested into innovations such as Machine Learning, whether in library practice or instruction and other services, will keep the program and the library fresh.

Creativity is key and will be the hallmark of successful libraries in the future. Stewardship of resources such as people's skills and expertise, and strategic use of the collections budget, are already library strengths. By building out new services and tools, and instructing at all levels, libraries can reinvent themselves continuously by investing in creative and sustainable innovation, from digital and data literacy to assembling modules for a library-based customized Researchers' Workstation that uses Machine Learning to enhance the efficiency of the scholars' research cycle.

Results and more questions

A library that adopted Machine Learning as an innovation technology would improve its practices; add new services; choose, use, and license collections differently; utilize all spaces for learning; and model innovative leadership. What is a library in rapidly changing times? How can librarians reconcile past identity, add value, and leverage hard-won expertise in a new environment? Change management is a topic that all institutions will have to confront as the digital age continues, as we reinvent ourselves and our institutions in a fast-paced technological world.

Value-added, distinctive, unique—these are all words that will be part of the conversation. Not only does the library add value, but librarians will have to demonstrate and quantify that value while preparing to pivot at any time in response to crises and innovative opportunities. Distinctive library resources and services that speak to the institution's academic mission and purpose will be a key feature. What does the library do that no other entity on campus can do? At each particular inflection point, how best to communicate with stakeholders about the value of the distinctive library mission? Can the library work with other cultural heritage institutions to highlight the unique contributions of all?

One possible approach: develop a library science/library studies pedagogy, as well as outreach, that encompasses the Scholarship of Teaching and Learning (SoTL) and pervades everything the library does in providing resources, services, and spaces. Emphasize that library resources help people solve multi-dimensional, complex problems, and then work on new ideas to save the time of researchers, improve discovery systems, and advocate for and facilitate Open Access and Open Source alternatives while enabling, empowering, and, yes, inspiring all users to participate in and contribute to the record of human knowledge. Librarians, as the traditional keepers of the scholarly canon in written form, have standing to do this as part of our legacy and as part of our envisioned future.

From the library users' point of view, librarians should think like the audience we are trying to reach to answer the question: why come into the library or use the library website instead of more familiar alternatives? In an era of increasing surveillance, library tools could be better known for an emphasis on privacy and confidentiality, for instance. This may require thinking more deeply about how we use our metrics and finding other ways to show how use of the library contributes to student success. It is also important to gather quantitative and qualitative evidence from library users themselves, and apply the feedback in an agile improvement loop.

In the case of Open Access vs. proprietary information, librarians should make the case for Open Access (OA) by advocating, explaining, and instructing library users from the first time they do literature searches to the time they are graduate students, post-docs, and faculty. Librarians should produce Open Educational Resources (OER) as well as encourage classroom faculty to adopt these tools of affordable education.


Libraries also need to facilitate Open Access content from discovery to preservation by developing search tools that privilege OA, using Open Source software whenever possible. Librarians could lead the way to changing the Scholarly Communications system by emphasizing change at the citation level—encouraging researchers to insist on being able to obtain author-archived citations in a seamless way, and facilitating that through the development of new discovery tools using Machine Learning. Improving discovery of Open Access, as well as embarking on expanded library publishing programs and advancing academic research, might be the most important endeavors that librarians could undertake at this point in time, to prevent a repeat of the "serials crisis" that commoditized scholarly information and to build a more diverse, equitable, and inclusive scholarly record. Well-funded commercial publishers are already engaging scholars and researchers in new proprietary platforms that could lock in academia more thoroughly than "Big Deals" did, even as the paradigm shifts away from large, expensive publishers' platforms and library subscription cancellations mount due to budget cuts and the desire to optimize value for money.

The concept of the "inside-out library" (Dempsey 2016) provides a way of thinking about opening local collections to discovery and use in order to create new knowledge through digitization and semantic linking, with cross-disciplinary technologies to augment traditional research and scholarship. Because these ideas are so new but fast-moving, librarians need to spread the word on possibilities in library publishing. Making local collections accessible for computational research helps to diversify findings and focuses attention on larger patterns and new ideas. In 2019, for instance, the Library of Congress sought to "Maximize the Use of its Digital Collection" by launching a program "to understand the technical capabilities and tools that are required to support the discovery and use of digital collections material," developing ethical and technological standards to automate in supporting emerging research techniques and "to preprocess text material in a way that would make that content more discoverable" (Price 2019). Scholarly Communication, dissemination, and discovery of research results will continue to be an important function of the library if trusted research results are to be available to all, not just the privileged. The so-called Digital Divide isolates and marginalizes some groups and regions; libraries can be a unifying force.

An important librarian role might be to identify gaps, in research or in dissemination, and work to overcome barriers to improving highly distributed access to knowledge. Libraries specialize in connecting disparate groups. Here is what libraries can do now: instruct new researchers (including undergraduate researchers and up) in theories, skills, and techniques to find, use, populate, preserve, and cite datasets; provide server space and/or Data Management services; introduce Machine Learning and text analysis tools and techniques; and provide Machine Learning and text analysis tools and/or services to researchers at all levels. Researchers are now expected or even required to provide public scholarship, i.e., to bring their research into the public realm beyond obscure research journals, and to explain and illuminate their work, connecting it to the public good, especially in the case of publicly-funded research. Librarians can and should partner in the public dissemination of research findings through explaining, promoting, and providing innovative new tools across siloed departments to catalyze cross-disciplinary research. Scholarly Communications began with books and journals shared by scholars over time; then libraries were assembled and built to contain the written record; librarians should ensure that the Scholarly Communications and information landscape continues into the future with widely-shared, available resources in all formats, now including interactive, web-based software, embedded data analysis tools, and technical support of emerging Open Source platforms.

In addition, the flow of research should be smooth and seamless to the researcher, whether in a Researchers' Workstation or other library tools. The research cycle should be both clearly explained and embedded in systems and tools. The library, as a central place that cuts across narrowly-defined research areas, could provide a systemic place of collaboration. Librarians, seeing the bigger picture, could facilitate research as well as disseminate and preserve the resulting data in journals and datasets. Further investigation into how researchers work, how students learn, best practices in pedagogy, and life-long learning in the library could mark a new era in librarianship, one that involves teaching, learning, and research as a self-reinforcing cycle. Beyond being a purchaser of journals and books, libraries can expand their role in the learning process itself into a cycle of continuous change and exploration, augmented by Machine Learning.

Library Science, Research, and Pedagogy

In Library and Information Science (LIS), graduate library schools should teach about Machine Learning as a way of innovating and emphasize pervasive innovation as the new normal. Creating a culture of innovation and creativity in LIS classes and in libraries will pay off for society as a whole, if librarians promote the advantages of a culture of innovation in themselves and in library users. Subverting the stereotypes of tradition-bound libraries and librarians will revitalize the profession and our workplaces, replacing fear of change and an existential identity crisis with a spirit of creative, agile reinvention that rises to challenges rather than seeking solace in denial, whether the seemingly impossible problem is preparedness in dealing with a pandemic or creatively addressing climate change.

Academic libraries must transition from a space of transactional (one-time) actions into a transformational, learning-centered user space, both physical and virtual, that offers an enhanced experience with teaching, learning, and research: a way to re-center the library as the place to get answers that go beyond the Internet. Libraries add value: do faculty, students, and other patrons know, for instance, that when they find the perfect book on a library shelf through browsing (or on the library website with virtual browsing), it is because a librarian somewhere assigned it a call number to group similar books together? The next step in that process is to use Machine Learning to generate subject headings, and to show librarians accomplishing that. This approach is being investigated for different types of works, from fiction to scientific literature (Golub 2006, Joorabchi 2011, Wang 2009, Short 2019). Cataloging, metadata, and enabling access through shared standards and Knowledge Bases are all things librarians do that add value for library users overwhelmed with Google hits, and they are worthy of further development, including in an Open environment.
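To make the subject-heading idea concrete, here is a minimal sketch, not a production cataloging workflow, of subject assignment framed as multi-label text classification with scikit-learn, in the spirit of the studies cited above. The sample records, headings, and parameters are invented placeholders.

    # Hypothetical toy example: suggest subject headings from record text.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MultiLabelBinarizer

    records = [
        "an introduction to marine biology and ocean ecosystems",
        "statistical methods for the social sciences",
        "deep learning approaches to natural language processing",
        "coastal ecology and the conservation of marine habitats",
    ]
    headings = [
        {"Marine biology"},
        {"Statistics", "Social sciences"},
        {"Machine learning", "Computational linguistics"},
        {"Marine biology", "Conservation"},
    ]

    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(headings)  # one binary column per heading

    model = make_pipeline(
        TfidfVectorizer(),
        OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    )
    model.fit(records, Y)

    # Suggest headings for a new record; a cataloger reviews the output.
    pred = model.predict(["field guide to ocean life and marine ecosystems"])
    print(mlb.inverse_transform(pred))  # may print [('Marine biology',)]

In practice, such suggestions would go to a cataloger for review rather than being applied automatically, which also keeps the librarian's expertise visible in the process.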

Preservation is another traditional library function, and it now includes born-digital items and the digitization of special collections and archives, increasing the library's role. Discovery will be enhanced by Artificial/Augmented Intelligence and Machine Learning techniques. All of this should be taught in library schools, to build a new library culture of innovation and problem-solving beyond just providing collections and information literacy instruction. The new learning paradigm is immersive in all senses, and the future, as reflected in library transformation and partnerships with researchers, galleries, archives, museums, citizen scientists, hobbyists, and life-long learners re-tooling their careers and lives, is bright. LIS programs need to reflect that.

To promote learning in libraries, librarians could design a "You belong in the Library" campaign to highlight our diverse resources and new ways of working with technology, inviting participation in innovative technologies such as Machine Learning in an increasingly rare public, non-commercial space: telling why, showing how. In many ways, libraries could model ways to achieve academic and life success, updating a traditional role in educating, instructing, preparing for the future, explaining, promoting understanding, and inspiring.

Discussion

The larger questions now are: who is heard, and who contributes? How are the gaps identified in needs analysis reduced? What are the sources of funding for libraries to develop this important work and not leave it to commercial services? Library leadership and innovative thinking must converge to devise ways for libraries to bring people together, producing more diverse, ethical, innovative, inclusive, practical, transformative, and novel library services and physical and virtual spaces for the public good.

Libraries could start with analyses of needs: what problems could be solved with more effective literature searches? What research could fill gaps and inform solutions to those needs? What kind of teaching could help build citizens and critical thinkers, rather than simply encouraging consumption of content? Another need is to diversify the collections used in Machine Learning, gathering cultural perspectives that reflect true diversity of thought through inclusion. All voices should be heard and empowered. Librarians can help with that.

A Researchers' Workstation could bring together an array of tools and content to support not only the organization, discovery, and preservation of knowledge, but also the creation of new knowledge through the sustainable library, beyond the literature search.

The world is converging toward networking and collaborative research all in one place. I would like the library to be the free platform that brings all the others together.

Coming full circle, my vision is that when researchers want to work on their research, they will log on to the library and find all they need…. The library is the one place … to get your scholarly work done. (Wiegand 2013)

The library as a platform should be a shared resource: the truest library value.

Here is a scenario. Suppose, for example, scholars wish to analyze the timeline of the beginning of the Coronavirus crisis. Logging on to the library's Researchers' Workstation, they start with the Discovery module to generate a corpus of research papers from, say, December 2019 to June 2020. Using the Machine Learning function, they search for articles and books, looking for gaps and ideas that have not yet been examined in the literature. They access and download full text, save citations, annotate and take notes, and prepare a draft outline of their research using a word processing function, writing and citing seamlessly. A Methods (protocols) section could help determine the most effective path of the prospective research.
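As a sketch of what the Discovery step might look like under the hood, suppose the Workstation can draw on an open scholarly index. Here the public Crossref REST API stands in for the hypothetical Discovery module; the query term and date window come from the scenario above.

    # Assemble a date-bounded corpus of metadata records via Crossref.
    import requests

    resp = requests.get(
        "https://api.crossref.org/works",
        params={
            "query": "coronavirus",
            "filter": "from-pub-date:2019-12-01,until-pub-date:2020-06-30",
            "rows": 20,
        },
        timeout=30,
    )
    resp.raise_for_status()

    corpus = [
        {"doi": item.get("DOI"), "title": (item.get("title") or ["(untitled)"])[0]}
        for item in resp.json()["message"]["items"]
    ]
    for record in corpus[:5]:
        print(record["doi"], "-", record["title"])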

Then, they might search for the authors of the preprints and articles they find interesting, check the authors' profiles, and contact some of them through the platform to discern interest in collaborating. The profile system would list areas of interest, current projects, availability for new projects, etc. Using the Project Management function, scholars might open a new workspace where preliminary thoughts could be shared, with attribution and acknowledgement as appropriate, and a peer review timeline chosen to invite comments while the authors can still claim the idea as their own.

If the preprint is successful, and the investigation shows promise after the results are in, the scholars could search for an appropriate journal for publication, the version of record. The author, with a researcher ID (also contained in the profile), has the article added to the final published section of the profile, with a DOI. The journal showcases the article and sends out table-of-contents alerts and press releases, where it can be picked up by news services and the authors invited to comment publicly. Each institution would celebrate its authors' accomplishments, use the Scholars' Workstation to determine impact and metrics, and promote the institution's research progress.

Finally, the article would be preserved through the library repository and also through initiatives such as LOCKSS. Future scholars would find it still available and continue to discover and build on the findings presented. All of this and more would be done through the library.

Conclusion

Machine Learning as a library service can inspire new stages of innovation, energizing and providing a blueprint for the library future: teaching, learning, and scholarship for all. The teaching part of the equation invokes the faculty audience perspective: how can librarians help classroom faculty to integrate both library instruction and library research resources (collections, expertise, spaces) into the educational enterprise (Wiegand and Kominkiewicz 2016)? How can librarians best teach skills, foster engagement, and create knowledge to make a distinctive contribution to the institution? Our answers will determine the library's future at each academic institution. Machine Learning skills, engagement, and knowledge should fit well with the library's array of services.

Learning is another traditional aspect of library services, this time from the student point of view. The library provides collections: multimedia or print on paper, digital and digitized, proprietary and open, local, redundant, rare, unique. The use of collections is taught by both librarians and disciplinary faculty in the service of learning, including life-long learning for non-academic, everyday knowledge. Students need to know more about Machine Learning, from data literacy to digital competencies, including concerns about privacy, security, and fake news across the curriculum, while learning skills associated with Machine Learning. In addition, through Open Access, library "collections" now encompass the world beyond the library's physical and virtual spaces.

Then, as libraries, like all digitally-inflected institutions, develop "change management" strategies, they need to double down on these unique affordances and communicate them to stakeholders. The most critical strategy is embedding the Scholarship of Teaching and Learning (SoTL) in all aspects of the library workflow. Instead of simply advertising new electronic resources or describing Open Access versus proprietary resources, libraries should broadly embed the lessons of copyright, surveillance, and reproducibility into patron interactions, from the first undergraduate literature search to the faculty research consultation. They should then reinforce those lessons by emphasizing open access and data mining permissions in their discovery tools. These are aspects of the scholarly research cycle over which libraries have some control. By exerting that control, libraries will promote a culture that positions Machine Learning and other creative digital uses of library data as normal, achievable parts of the scholarly process.

To complete the Scholarly Communications lifecycle, support for research, scholarship, and creative works is increasingly provided by libraries as a springboard to the creation of knowledge, the library's newest role. This is where Machine Learning as a new paradigm fits in most compellingly as an innovative practice. Libraries can provide not only associated services such as Data Management of the datasets resulting from analyzing huge textual corpora, but also databases of proprietary and locally-produced content from inter-connected, cooperating libraries on a global scale. Researchers (faculty, students, and citizens, including alumni) will benefit from crowdsourcing and citizen science while gaining knowledge and contributing to scholarship. But perhaps the largest benefit will be learning by doing, escaping the "black box" of blind consumerism to see how algorithms work and thus developing a more nuanced view of reality in the Machine Age.

References

Bedrossian, Rebecca. 2018. "Recognizing Exclusion is the Key to Inclusive Design: In Conversation with Kat Holmes." Campaign (blog). July 25, 2018. https://www.campaignlive.com/article/recognizing-exclusion-key-inclusive-design-conversation-kat-holmes/1488872.

Bell, Mark. 2018. "Machine Learning in the Archives." National Archives (blog). November 8, 2020. https://blog.nationalarchives.gov.uk/machine-learning-archives/.

Bivens-Tatum, Wayne. 2012. Libraries and the Enlightenment. Los Angeles: Library Juice Press. Accessed January 6, 2020. ProQuest Ebook Central.

Bogost, Ian. 2019. "The AI-Art Gold Rush is Here." The Atlantic. March 6, 2019. https://www.theatlantic.com/technology/archive/2019/03/ai-created-art-invades-chelsea-galler.

Casson, Lionel. 2001. Libraries in the Ancient World. New Haven: Yale University Press. Accessed January 6, 2020. ProQuest Ebook Central.

Dempsey, Lorcan. 2016. "Library Collections in the Life of the User: Two Directions." LIBER Quarterly 26: 338–359. https://doi.org/10.18352/lq.10170.

Extance, Andy. 2018. "How AI Technology Can Tame the Scientific Literature." Nature 561: 273–274. https://doi.org/10.1038/d41586-018-06617-5.

Golub, K. 2006. "Automated Subject Classification of Textual Web Documents." Journal of Documentation 62: 350–371. https://doi.org/10.1108/00220410610666501.

Jakeway, Eileen. 2020. "Machine Learning + Libraries Summit: Event Summary now live!" The Signal (blog), Library of Congress. February 12, 2020. https://blogs.loc.gov/thesignal/2020/02/machine-learning-libraries-summit-event-summary-now-live/.

Joorabchi, Arash and Abdulhussin E. Mahdi. 2011. "An Unsupervised Approach to Automatic Classification of Scientific Literature Utilising Bibliographic Metadata." Journal of Information Science. https://doi.org/10.1177/016555150000000.

Khamsi, Roxanne. 2020. "Coronavirus in context: Scite.ai Tracks Positive and Negative Citations for COVID-19 Literature." Nature. https://doi.org/10.1038/d41586-020-01324-6.

Padilla, Thomas, Laurie Allen, Hannah Frost, et al. 2019. "Final Report — Always Already Computational: Collections as Data." Zenodo. May 22, 2019. https://doi.org/10.5281/zenodo.3152935.

Price, Gary. 2019. "The Library of Congress Posts Solicitation For a Machine Learning/Deep Learning Pilot Program to 'Maximize the Use of its Digital Collection.'" Library Journal. June 13, 2019.

Rodrigues, Eloy et al. 2017. "Next Generation Repositories: Behaviours and Technical Recommendations of the COAR Next Generation Repositories Working Group." Zenodo. November 28, 2017. https://doi.org/10.5281/zenodo.1215014.

Ryholt, K. S. B., and Gojko Barjamovic, eds. 2019. Libraries Before Alexandria: Ancient Near Eastern Traditions. Oxford: Oxford University Press.

Vamathevan, Jessica, Dominic Clark, Paul Czodrowski, Ian Dunham, Edgardo Ferran, George Lee, Bin Lee, Anant Madabhushi, Parantu Shah, Michaela Spitzer, and Shanrong Zhao. 2019. "Applications of Machine Learning in Drug Discovery and Development." Nature Reviews Drug Discovery 18: 463–477. https://doi.org/10.1038/s41573-019-0024-5.

Wang, Jun. 2009. "An Extensive Study on Automated Dewey Decimal Classification." Journal of the American Society for Information Science & Technology 60: 2269–86. https://doi.org/10.1002/asi.21147.

Wiegand, Sue. 2013. "ACS Solutions: The Sturm und Drang." ACRLog (blog), Association of College and Research Libraries. November 8, 2020. https://acrlog.org/2013/04/06/acs-solutions-the-sturm-und-drang/.

Wiegand, Sue and Frances Kominkiewicz. 2016. "Integration of Student Learning through Library and Classroom Instruction." Unpublished manuscript.

Yewno. n.d. "Yewno—Transforming Information into Knowledge." Accessed January 6, 2020. https://www.yewno.com/.

Further Reading

Abbattista, Fabio, Luciana Bordoni, and Giovanni Semeraro. 2003. "Artificial Intelligence for Cultural Heritage and Digital Libraries." Applied Artificial Intelligence 17, no. 8/9: 681. https://doi.org/10.1080/713827258.

Ard, Constance. 2017. "Advanced Analytics Meets Information Services." Online Searcher 41, no. 6: 21–24.

"Artificial Intelligence and Machine Learning in Libraries." 2019. Library Technology Reports 55, no. 1: 1–29.

Badke, William. 2015. "Infolit Land. The Effect of Artificial Intelligence on the Future of Information Literacy." Online Searcher 39, no. 4: 71–73.

Boman, Craig. 2019. "Chapter 4: An Exploration of Machine Learning in Libraries." Library Technology Reports 55: 21–25.

Breeding, Marshall. 2018. "Chapter 6: Possible Future Trends." Library Technology Reports 54, no. 8: 31–32.

Dempsey, Lorcan, Constance Malpas, and Brian Lavoie. 2014. "Collection Directions: The Evolution of Library Collections and Collecting." portal: Libraries and the Academy 14, no. 3 (July): 393–423. https://doi.org/10.1353/pla.2014.0013.

Enis, Matt. 2019. "Labs in the Library." Library Journal 144, no. 3: 18–21.

Finley, Thomas. 2019. "The Democratization of Artificial Intelligence: One Library's Approach." Information Technology & Libraries 38, no. 1: 8–13. https://doi.org/10.6017/ital.v38i1.10974.

Frank, Eibe and Gordon W. Paynter. 2004. "Predicting Library of Congress Classifications From Library of Congress Subject Headings." Journal of the American Society for Information Science and Technology 55, no. 3. https://doi.org/10.1002/asi.10360.

Geary, Daniel. 2019. "How to Bring AI into Your Library." Computers in Libraries 39, no. 7: 32–35.

Griffey, Jason. 2019. "Chapter 5: Conclusion." Library Technology Reports 55, no. 1: 26–28.

Inayatullah, Sohail. 2014. "Library Futures: From Knowledge Keepers to Creators." Futurist 48, no. 6: 24–28.

Johnson, Ben. 2018. "Libraries in the Age of Artificial Intelligence." Computers in Libraries 38, no. 1: 14–16.

Kuhlman, C., L. Jackson, and R. Chunara. 2020. "No Computation without Representation: Avoiding Data and Algorithm Biases through Diversity." ArXiv:2002.11836v1 [cs.CY], February. http://arxiv.org/abs/2002.11836.

Lane, David C. and Claire Goode. 2019. "OERu's Delivery Model for Changing Times: An Open Source NGDLE." Paper presented at the 28th ICDE World Conference on Online Learning, Dublin, Ireland, November 2019. https://oeru.org/assets/Marcoms/OERu-NGDLE-paper-FINAL-PDF-version.pdf.

Liu, Xiaozhong, Chun Guo, and Lin Zhang. 2014. "Scholar Metadata and Knowledge Generation with Human and Artificial Intelligence." Journal of the Association for Information Science & Technology 65, no. 6: 1187–1201. https://doi.org/10.1002/asi.23013.

Mitchell, Steve. 2006. "Machine Assistance in Collection Building: New Tools, Research, Issues, and Reflections." Information Technology & Libraries 25, no. 4: 190–216. https://doi.org/10.6017/ital.v25i4.3353.

Ojala, Marydee. 2019. "ProQuest's New Approach to Streamlining Selection and Acquisitions." Information Today 36, no. 1: 16–17.

Ong, Edison, Mei U. Wong, Anthony Huffman, and Yongqun He. 2020. "COVID-19 Coronavirus Vaccine Design Using Reverse Vaccinology and Machine Learning." Frontiers in Immunology 11. https://doi.org/10.3389/fimmu.2020.01581.

Orlowitz, Jake. 2017. "You're a Researcher Without a Library: What Do You Do?" A Wikipedia Librarian (blog), Medium. November 15, 2017. https://medium.com/a-wikipedia-librarian/youre-a-researcher-without-a-library-what-do-you-do-6811a30373cd.

Padilla, Thomas. 2019. Responsible Operations: Data Science, Machine Learning, and AI in Libraries. Dublin, OH: OCLC Research. https://doi.org/10.25333/xk7z-9g97.

Plosker, George. 2018. "Artificial Intelligence Tools for Information Discovery." Online Searcher 42, no. 3: 31–35. https://www.infotoday.com/OnlineSearcher/Articles/Features/Artificial-Intelligence-Tools-for-Information-Discovery-124721.shtml.

Rak, Rafal, Andrew Rowley, William Black, and Sophie Ananiadou. 2012. "Argo: an Integrative, Interactive, Text Mining-based Workbench Supporting Curation." Database: The Journal of Biological Databases and Curation. https://doi.org/10.1093/database/bas010.

Schmidt, Lena, Babatunde Kazeem Olorisade, Julian Higgins, and Luke A. McGuinness. 2020. "Data Extraction Methods for Systematic Review (Semi)automation: A Living Review Protocol." F1000Research 9: 210. https://doi.org/10.12688/f1000research.22781.2.

Schonfeld, Roger C. 2018. "Big Deal: Should Universities Outsource More Core Research Infrastructure?" Ithaka S+R. https://doi.org/10.18665/sr.306032.

Schockey, Nick. 2013. "How Open Access Empowered a 16-year-old to Make Cancer Breakthrough." June 12, 2013. http://www.openaccessweek.org/video/video/show?id=5385115%3AVideo%3A90442.

Short, Matthew. 2019. "Text Mining and Subject Analysis for Fiction; or, Using Machine Learning and Information Extraction to Assign Subject Headings to Dime Novels." Cataloging & Classification Quarterly 57, no. 5: 315–336. https://doi.org/10.1080/01639374.2019.1653413.

Thompson, Paul, Riza Theresa Batista-Navarro, and Georgio Kontonatsios. 2016. "Text Mining the History of Medicine." PloS One 11, no. 1: e0144717. https://doi.org/10.1371/journal.pone.0144717.

White, Philip. 2019. "Using Data Mining for Citation Analysis." College & Research Libraries 80, no. 1. https://scholar.colorado.edu/concern/parent/cr56n1673/file_sets/9019s3164.

Witbrock, Michael J. and Alexander G. Hauptmann. 1998. "Speech Recognition for a Digital Video Library." Journal of the American Society for Information Science 49, no. 7: 619–32. https://doi.org/10.1002/(SICI)1097-4571(19980515)49:7<619::AID-ASI4>3.0.CO;2-A.

Zuccala, Alesia, Maarten Someren, and Maurits Bellen. 2014. "A Machine-Learning Approach to Coding Book Reviews as Quality Indicators: Toward a Theory of Megacitation." Journal of the Association for Information Science & Technology 65, no. 11: 2248–60. https://doi.org/10.1002/asi.23104.

Chapter 6

Cross-Disciplinary ML Research is like Happy Marriages: Five Strengths and Two Examples

Meng Jiang
University of Notre Dame

Top Strengths in ML + X Collaboration

Cross-disciplinary research refers to research and creative practices that involve two or more academic disciplines (Jeffrey 2003; Karniouchina, Victorino, and Verma 2006). These activities may range from those that simply place disciplinary insights side by side to much more integrative or transformative approaches (Aagaard-Hansen 2007; Muratovski 2011). Cross-disciplinary research matters because (1) it provides an understanding of complex problems that require a multifaceted approach to solve; (2) it combines disciplinary breadth with the ability to collaborate and synthesize varying expertise; (3) it enables researchers to reach a wider audience and communicate diverse viewpoints; (4) it encourages researchers to confront questions that traditional disciplines do not ask while opening up new areas of research; and (5) it promotes disciplinary self-awareness about methods and creative practices (Urquhart et al. 2013; O'Rourke, Crowley, and Gonnerman 2016; Miller and Leffert 2018).

One of the most popular cross-disciplinary research topics/programs is Machine Learning + X (or Data Science + X). Machine learning (ML) is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns, and make decisions with minimal human intervention. ML has been used in a variety of applications (Murthy 1998), such as email filtering and computer vision; however, most applications still fall within the domain of computer science and engineering. Recently, the power of ML + X, where X can be any other discipline (such as physics, chemistry, biology, sociology, or psychology), has become well recognized. ML tools can reveal profound insights hiding in ballooning datasets (Kohavi et al. 1994; Pedregosa et al. 2011; Kotsiantis 2012; Mullainathan and Spiess 2017).
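As a minimal illustration of this "learning from data" idea, the following sketch uses scikit-learn (Pedregosa et al. 2011) to train a classifier that is given no hand-written rules; it induces a decision procedure from labeled examples and is then scored on held-out data.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Labeled examples: flower measurements paired with species labels.
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # The model learns its own decision rules from the training data.
    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")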

However, cross-disciplinary research, of which ML + X is a part, is challenging. Collaborating with investigators outside one's own field requires more than just adding a co-author to a paper or proposal. True collaborations will not always be without conflict; lack of information leads to misunderstandings. For example, ML experts may have little domain knowledge in the field of X, and researchers in X might not understand ML either. This knowledge gap limits the progress of collaborative research.

So how can we start and manage successful cross-disciplinary research? What can we do to facilitate collaborative behaviors? In this essay, I will compare cross-disciplinary ML research to "happy marriages," discussing some characteristics they share. Specifically, I will present the top strengths of conducting cross-disciplinary ML research and give two examples based on my experience of collaborating with historians and psychologists.

Marriage is one of the most common "collaborative" behaviors. Couples expect to have happy marriages, just like collaborators expect to have successful project outcomes (Robinson and Blanton 1993; Pettigrew 2000; Xu et al. 2007). Extensive studies have revealed the top strengths of happy marriages (DeFrain and Asay 2007; Gordon and Baucom 2009; Prepare/Enrich, n.d.), which can be reflected in cross-disciplinary ML research. Here I focus on five of them:

1. Collaborators (“partners” in the language of marriage) are satisfied with communication.

2. Collaborators feel very close to each other.

3. Collaborators discuss their problems well.

4. Collaborators handle their differences creatively.

5. There is a good balance of time alone (i.e., individual research work) and together (meetings, discussions, etc.).

First of all, communication is the exchange of information to achieve a better understanding, and collaboration is the process of working together with another person to achieve an end goal. Effective collaboration is about sharing information, knowledge, and resources through satisfactory communication. Ineffective or absent communication is one of the biggest challenges in ML + X collaboration.

Second, researchers in different disciplines meet different challenges through the process of collaboration. Making those challenges clear and finding solutions together is the core of effective collaboration.

Third, researchers in different disciplines can collaborate only when they recognize mutual interest and feel that the research topics they have studied in depth are very close to each other. Collaborators must be interested in solving the same big problem.

Fourth, collaborators must embrace their differences in concepts and methods and take advantage of them. For example, one researcher can introduce a complementary method to the mix of methods that the collaborator has been using for a long time; or one can offer a new, impactful dataset and evaluation method to test the techniques proposed by the other.

Fifth, in strong collaboration, there is a balance between separateness and togetherness. Meetings are an excellent use of time for integrating perspectives and having productive discourse around difficult decisions. However, excessive collaboration happens when researchers are depleted by too many meetings and emails; it can lead to inefficient, unproductive meetings. So it is important to find a balance.

Next, I, as a computer scientist and ML expert, will discuss two ML + X collaborative projects. ML experts bring mathematical modeling and computational methods for mining knowledge from data. These solutions usually generalize well; however, they still need to be tailored for specialized domains or disciplines.

Example 1: ML + History

The history professor Liang Cai and I have collaborated on an international research project titled "Digital Empires: Structured Biographical and Social Network Analysis of Early Chinese Empires." Dr. Cai is well known for her contributions to the fields of early Chinese empires, classical Chinese thought (in particular, Confucianism and Daoism), digital humanities, and the material culture and archaeological texts of early China (Cai 2014). Our collaboration explores how digital humanities expand the horizon of historical research and help visualize the research landscape of Chinese history. Historical research is often constrained by sources and the human cognitive capacity for processing them. ML techniques may enhance historians' abilities to organize and access sources as they like. ML techniques can even create new kinds of sources at scale for historians to interpret.

"The historians pose the research questions and visualize the project," said Cai. "The computer scientists can help provide new tools to process primary sources and expand the research horizon."

We conducted a structured biographical analysis that leverages recent developments in machine learning techniques, such as neural sequence labeling and textual pattern mining, which allow the classical sources of the Chinese empires to be represented in an encoded way. The project aims to build a digital biographical database that sorts out the different attributes of all recorded historical actors in the available sources. Breaking with traditional formats, ML + History creates new opportunities and augments our way of understanding history.

First, it helps scholars, especially historians, change their research paradigm, allowing them to generalize their arguments with sufficient examples. ML techniques can find all examples in the data, where manual investigation may miss some; abnormal cases can also indicate a new discovery. As far as early Chinese empires are concerned, ML promises to automate the mining and encoding of all available biographical data, which allows scholars to shift their perspective from one person to a group of persons with shared characteristics, and from analyzing examples to relating a comprehensive history. Scholars can therefore identify general trends efficiently and present an information-rich picture of historical reality using ML techniques.

Second, the structured data produced by ML techniques revolutionize the questions researchers ask, thereby changing the research landscape. For lack of efficient tools, there are numerous interesting questions scholars would like to ask but cannot. For example, the geographical mobility of historical actors is an intriguing question for early China, the answer to which would show how diversified regions were integrated into a unified empire. Nevertheless, an individual historian cannot efficiently process the massive amount of information preserved in the sources. With ML techniques, we can generate fact tuples that record the geographical origins of all available historical actors and provide comprehensive data for historians to analyze.


Figure 6.1: The graph presents a visual of the social network of officials who served in the government about 2,000 years ago in China. The network describes their relationships and personal attributes.


Pattern mined by ML tech: $PER_X …從 $PER_Y 受 $KLG
Extracted relation: $PER_X was taught by $PER_Y on $KLG (knowledge)
Examples: (張禹, 施讎, 易), (施讎, 田王孫, 易), (眭弘, 嬴公, 春秋)

Pattern mined by ML tech: $PER_X …事 $PER_Y
Extracted relation: $PER_X was taught/mentored by $PER_Y
Examples: (司馬相如, 孝景帝), (尹齊, 張湯)

Pattern mined by ML tech: $PER_X …授 $PER_Y
Extracted relation: $PER_X taught $PER_Y
Examples: (孟喜, 后蒼、疏廣), (王式, 龔舍)

Pattern mined by ML tech: $PER … $LOC人也
Extracted relation: $PER place_of_birth $LOC
Examples: (張敞, 河東平陽), (彭越, 昌邑)

Pattern mined by ML tech: $PER遷 $TIT
Extracted relation: $PER job_title $TIT
Examples: (朱邑, 北海太守), (甘延壽, 遼東太守)

Pattern mined by ML tech: $PER至 $TIT
Extracted relation: $PER job_title $TIT
Examples: (歐陽生, 御史大夫), (孟卿, 中山中尉)

Pattern mined by ML tech: $PER為 $TIT
Extracted relation: $PER job_title $TIT
Examples: (伏生, 秦博士), (司馬相如, 武騎常侍)

Table 6.1: Examples of Chinese Text Extraction Patterns
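To suggest how mined patterns like those in Table 6.1 can be applied, here is a minimal sketch, not the project's actual pipeline (which uses neural sequence labeling to find entity spans first). It hard-codes two of the patterns as regular expressions and uses a crude character-count proxy for person names; everything in it is an illustrative assumption.

    import re

    NAME = r"[\u4e00-\u9fff]{2,3}"   # rough stand-in for a $PER span
    SPAN = r"[\u4e00-\u9fff]{2,4}"   # rough stand-in for $LOC or $TIT

    PATTERNS = [
        # $PER … $LOC 人也  ->  ($PER, place_of_birth, $LOC)
        (re.compile(rf"(?P<subj>{NAME})，?(?P<obj>{SPAN})人也"), "place_of_birth"),
        # $PER 為 $TIT      ->  ($PER, job_title, $TIT)
        (re.compile(rf"(?P<subj>{NAME})為(?P<obj>{SPAN})"), "job_title"),
    ]

    def extract(text):
        """Yield (subject, relation, object) fact tuples matched in text."""
        for regex, relation in PATTERNS:
            for m in regex.finditer(text):
                yield (m.group("subj"), relation, m.group("obj"))

    # 張敞，河東平陽人也: "Zhang Chang was a native of Pingyang, Hedong."
    for triple in extract("張敞，河東平陽人也。"):
        print(triple)  # ('張敞', 'place_of_birth', '河東平陽')

Even this toy version shows why the patterns matter: once a relation is encoded as a tuple, it can be aggregated across an entire corpus rather than read one passage at a time.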

Third, the project revolutionizes our reading habits. Large datasets mined from primary sources will allow scholars to combine distant reading with the original texts. The macro picture generated from the data will aid in-depth analysis of an event against its immediate context. Furthermore, graphics of social networks and of the common attributes of historical figures will change our reading habits, transforming linear storytelling to accommodate multiple narratives (see Figure 6.1 above).

Researchers from the two sides develop collaboration through the project step by step, just like developing a relationship toward marriage. Ours started at a faculty gathering, from some random chat about our research. Because the historian is open-minded toward ML technologies and the ML expert is willing to create broader impact, we brainstormed ideas that would not have developed without taking care of the five important points:

1. Communication: With our research groups, we started to meet frequently at the beginning. We set up clear goals at the early stage, including expected outcomes, publication venues, and joint proposals for funding agencies, such as the National Endowment for the Humanities (NEH) and Notre Dame seed grant funding. Our research groups met almost twice a week for as long as three weeks.

2. Feel very close to each other: Besides holding meetings, we exchanged our instant messenger accounts so we could communicate faster than by email. We created a Google Drive space to share readings, documents, and presentation slides. We found many tools to create "tight relationships" between the groups at the beginning.

3. Discuss their problems well: Whenever we had misunderstandings, we discussed our problems. Historians learned about what a machine does, what a machine can do, and generally how a machine works toward the task. ML people learned what is interesting to historians and what kind of information is valuable. We hold the principle that problems, as long as they exist, make sense; any problem anyone encounters is worth a discussion. We needed to solve problems together from the moment they became our problems.

4. Handle their differences creatively: Historians are among the few who can read and write in classical Chinese. Classical Chinese was used as the written language from over 3,000 years ago to the early 20th century. Since then, mainland China has used either Mandarin (simplified Chinese) or Cantonese, while Taiwan has used traditional Chinese; none is similar to classical Chinese at all. In other words, historians work on a language that none of the ML experts here, even those who speak modern Chinese, can understand. So we handle our language differences "creatively" by using the translated version as an intermediate medium. Historians have translated history books from classical Chinese into simplified Chinese so we can read the simplified version. Here, the idea is to let the machine learning algorithms read both versions. We find that information extraction (i.e., finding relations in text) and machine translation (i.e., from classical Chinese to modern Chinese) can mutually enhance each other, which turns out to be one of our novel technical contributions to the field of natural language processing.

5. Good balance of time alone and together: After the first month, once the project goal, datasets, background knowledge, and many other aspects were clear in both sides' minds, we held regular meetings in a less intensive manner. We met two or three times a month so that the computer science students could focus on developing machine learning algorithms, and only when significant progress was made or expert evaluation was needed would we schedule a quick appointment with Prof. Liang Cai.

So far, we have published peer-reviewed papers on the topic of information extraction and entity retrieval in classical Chinese history books using ML (Ma et al. 2019; Zeng et al. 2019). We have also submitted joint proposals to NEH with the above work as preliminary results.

Example 2: ML + Psychology

I am working with Drs. Ross Jacobucci and Brooke Ammerman in psychology to apply ML to understand mental health problems and suicidal intentions. Suicide is a serious public health problem; however, suicides are preventable with timely, evidence-based interventions. Social media platforms have been serving users who are experiencing real-time suicidal crises with hopes of receiving peer support. To better understand the helpfulness of peer support occurring online, we characterize the content of both a user's post and the corresponding peer comments occurring on a social media platform and present an empirical example for comparison. We have designed a new topic-model-based approach to finding the topics of user and peer posts in social media forum data. Its key advantages include: (i) modeling both the generative process of each type of corpus (i.e., user posts and peer comments) and the associations between them, and (ii) using phrases, which are more informative and less ambiguous than words alone, to represent social media posts and topics. We evaluated the method using data from Reddit's r/SuicideWatch community.
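A minimal sketch of the general idea, not the model from our paper: represent posts and comments with unigram and bigram "phrases," fit one topic model per corpus, and estimate a topic-association matrix from the paired documents. The toy texts are invented, and the library calls are standard scikit-learn.

    import numpy as np
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    posts = [
        "cannot stop these thoughts and i feel alone tonight",
        "lost my job and everything feels hopeless",
        "no one would notice if i was gone",
    ]
    comments = [  # comments[i] replies to posts[i]
        "please call a crisis line talking to a professional helps",
        "i went through a layoff too it does get better",
        "you matter to me please reach out to a counselor",
    ]

    def doc_topics(docs, n_topics=3):
        # Unigram+bigram counts stand in for the paper's phrase mining.
        X = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs)
        lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
        return lda.fit_transform(X)  # per-document topic distributions

    theta_posts = doc_topics(posts)
    theta_comments = doc_topics(comments)

    # Association between post topic i and comment topic j, from the pairs.
    assoc = theta_posts.T @ theta_comments
    print(np.round(assoc / assoc.sum(), 2))

The actual model fits the two corpora jointly rather than separately, but the association matrix conveys the core intuition: which kinds of posts attract which kinds of responses.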


Figure 6.2: Screenshot of r/SuicideWatch on Reddit.

We examined how the topics of user and peer posts were associated and how this information influenced the perceived helpfulness of peer support. Then, we applied structural topic modeling to data collected from individuals with a history of suicidal crisis as a means to validate the findings. Our observations suggest that effective modeling of the association between the two lines of topics can uncover helpful peer responses to online suicidal crises, notably the suggestion of pursuing professional help. Our technology can be applied to "paired" corpora in many applications, such as tech support forums and question-answering sites.

This project started from a talk I gave at the psychology graduate seminar. The fun thing is that Dr. Jacobucci was not able to attend the talk. Another psychology professor who attended my talk asked constructive questions and mentioned my research to Dr. Jacobucci when they met later. So Dr. Jacobucci dropped me an email, and we had coffee together. Cross-disciplinary research often starts from something that sounds like developing a relationship. Because, again, the psychologists are open-minded toward ML technologies and the ML expert is willing to create broader impact, we successfully brainstormed ideas over coffee, but this would not have developed into a long-term collaboration without the following efforts: (1) Communicate intensively between research groups at the early stage; we had multiple meetings a week to make the goals clear. (2) Get students involved in the process; as my graduate student received more and more advice from the psychology professors and students, the connections between the two groups became stronger. (3) Discuss the challenges in our fields very well; we analyzed together whether machine learning would be capable of addressing the challenges in mental health, and whether domain experts could be involved in the loop of machine learning algorithms. (4) Handle our differences; we separately presented our research and then found times to work together to merge our sets of slides based on one common vision and goal. (5) After the first month, hold meetings only when discussion is needed or there is an approaching deadline for either a paper or a proposal.

We have enjoyed our collaboration and the power of cross-disciplinary research. Our joint work is under review at Nature Palgrave Communications. We have also submitted joint proposals to NIH with this work as preliminary results (Jiang et al. 2020).

Conclusions

In this essay, I used a metaphor comparing cross-disciplinary ML research to "happy marriages," and I discussed five characteristics they share. Specifically, I presented the top strengths of producing successful cross-disciplinary ML research: (1) Partners are satisfied with communication. (2) Partners feel very close to each other. (3) Partners discuss their problems well. (4) Partners handle their differences creatively. (5) There is a good balance of time alone (i.e., individual research work) and together (meetings, discussions, etc.). While every project is different and will produce its own challenges, my experience of collaborating with historians and psychologists according to the happy marriage metaphor suggests that it is a simple and strong paradigm that could help other interdisciplinary projects develop into successful, long-term collaborations.

References

Aagaard-Hansen, Jens. 2007. "The Challenges of Cross-Disciplinary Research." Social Epistemology 21, no. 4 (October–December): 425–38. https://doi.org/10.1080/02691720701746540.

Cai, Liang. 2014. Witchcraft and the Rise of the First Confucian Empire. Albany: SUNY Press.

DeFrain, John, and Sylvia M. Asay. 2007. "Strong Families Around the World: An Introduction to the Family Strengths Perspective." Marriage & Family Review 41, no. 1–2 (August): 1–10. https://doi.org/10.1300/J002v41n01_01.

Gordon, Cameron L., and Donald H. Baucom. 2009. "Examining the Individual Within Marriage: Personal Strengths and Relationship Satisfaction." Personal Relationships 16, no. 3 (September): 421–435. https://doi.org/10.1111/j.1475-6811.2009.01231.x.

Jeffrey, Paul. 2003. "Smoothing the Waters: Observations on the Process of Cross-Disciplinary Research Collaboration." Social Studies of Science 33, no. 4 (August): 539–62.

Jiang, Meng, Brooke A. Ammerman, Qingkai Zeng, Ross Jacobucci, and Alex Brodersen. 2020. "Phrase-Level Pairwise Topic Modeling to Uncover Helpful Peer Responses to Online Suicidal Crises." Humanities and Social Sciences Communications 7: 1–13.

Karniouchina, Ekaterina V., Liana Victorino, and Rohit Verma. 2006. "Product and Service Innovation: Ideas for Future Cross-Disciplinary Research." The Journal of Product Innovation Management 23, no. 3 (May): 274–80.

Kohavi, Ron, George John, Richard Long, David Manley, and Karl Pfleger. 1994. "MLC++: A Machine Learning Library in C++." In Proceedings of the Sixth International Conference on Tools with Artificial Intelligence, 740–3. N.p.: IEEE. https://doi.org/10.1109/TAI.1994.346412.

Kotsiantis, S. B. 2012. "Use of Machine Learning Techniques for Educational Proposes [sic]: a Decision Support System for Forecasting Students' Grades." Artificial Intelligence Review 37, no. 4 (May): 331–44. https://doi.org/10.1007/s10462-011-9234-x.

Ma, Yihong, Qingkai Zeng, Tianwen Jiang, Liang Cai, and Meng Jiang. 2019. "A Study of Person Entity Extraction and Profiling from Classical Chinese Historiography." In Proceedings of the 2nd International Workshop on EntitY REtrieval, edited by Gong Cheng, Kalpa Gunaratna, and Jun Wang, 8–15. N.p.: International Workshop on EntitY REtrieval. http://ceur-ws.org/Vol-2446/.

Miller, Eliza C. and Lisa Leffert. 2018. "Building Cross-Disciplinary Research Collaborations." Stroke 49, no. 3 (March): e43–e45. https://doi.org/10.1161/strokeaha.117.020437.

Mullainathan, Sendhil, and Jann Spiess. 2017. "Machine Learning: An Applied Econometric Approach." Journal of Economic Perspectives 31, no. 2 (spring): 87–106. https://doi.org/10.1257/jep.31.2.87.

Muratovski, Gjoko. 2011. "Challenges and Opportunities of Cross-Disciplinary Design Education and Research." In Proceedings from the Australian Council of University Art and Design Schools (ACUADS) Conference: Creativity: Brain—Mind—Body, edited by Gordon Bull. Canberra, Australia: ACUADS Conference. https://acuads.com.au/conference/article/challenges-and-opportunities-of-cross-disciplinary-design-education-and-research/.

Murthy, Sreerama K. 1998. "Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey." Data Mining and Knowledge Discovery 2, no. 4 (December): 345–89. https://doi.org/10.1023/A:1009744630224.

O'Rourke, Michael, Stephen Crowley, and Chad Gonnerman. 2016. "On the Nature of Cross-Disciplinary Integration: A Philosophical Framework." Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences 56 (April): 62–70. https://doi.org/10.1016/j.shpsc.2015.10.003.

Pedregosa, Fabian et al. 2011. "Scikit-learn: Machine Learning in Python." The Journal of Machine Learning Research 12: 2825–30. http://www.jmlr.org/papers/v12/pedregosa11a.html.

Pettigrew, Simone F. 2000. "Ethnography and Grounded Theory: a Happy Marriage?" In Association for Consumer Research Conference Proceedings, edited by Stephen J. Hoch and Robert J. Meyer, 256–60. Provo, UT: Association for Consumer Research. https://www.acrwebsite.org/volumes/8400/volumes/v27/.

Prepare/Enrich. N.d. "National Survey of Marital Strengths." Prepare/Enrich (website). Accessed January 17, 2020. https://www.prepare-enrich.com/pe_main_site_content/pdf/research/national_survey.pdf.

Robinson, Linda C. and Priscilla W. Blanton. 1993. "Marital Strengths in Enduring Marriages." Family Relations: An Interdisciplinary Journal of Applied Family Studies 42, no. 1 (January): 38–45. https://doi.org/10.2307/584919.

Urquhart, R., E. Grunfeld, L. Jackson, J. Sargeant, and G. A. Porter. 2013. "Cross-Disciplinary Research in Cancer: an Opportunity to Narrow the Knowledge–Practice Gap." Current Oncology 20, no. 6 (December): e512–e521. https://doi.org/10.3747/co.20.1487.

Xu, Anqi, Xiaolin Xie, Wenli Liu, Yan Xia, and Dalin Liu. 2007. "Chinese Family Strengths and Resiliency." Marriage & Family Review 41, no. 1–2 (August): 143–64. https://doi.org/10.1300/J002v41n01_08.

Zeng, Qingkai, Mengxia Yu, Wenhao Yu, Jinjun Xiong, Yiyu Shi, and Meng Jiang. 2019. "Faceted Hierarchy: A New Graph Type to Organize Scientific Concepts and a Construction Method." In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), edited by Dmitry Ustalov, Swapna Somasundaran, Peter Jansen, Goran Glavaš, Martin Riedl, Mihai Surdeanu, and Michalis Vazirgiannis, 140–50. Hong Kong: Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-5317.

Chapter 7

AI and Its Moral Concerns

Bohyun Kim
University of Rhode Island

Automating Decisions and Actions

The goal of artificial intelligence (AI) as a discipline is to create an artificial system, whether a piece of software or a machine with a physical body, that is as intelligent as a human in its performance, either broadly in all areas of human activities or narrowly in a specific activity, such as playing chess or driving.1 The actual capability of most AI systems remained far below this ambitious goal for a long time. But with recent successes in machine learning and deep learning, the performance of some AI programs has started surpassing that of humans. In 2016, AlphaGo, an AI program developed with the deep learning method, astonished even its creators by winning four out of five Go matches against the eighteen-time world champion, Sedol Lee.2 In 2020, Google's DeepMind unveiled Agent57, a deep reinforcement learning agent that reached superhuman levels of play across all 57 classic games of the Atari57 benchmark.3

Early symbolic AI systems determined their outputs based upon given rules and logical inference. AI algorithms in these rule-based systems, also known as good old-fashioned AI (GOFAI), are pre-determined, predictable, and transparent. On the other hand, machine learning, another approach in AI, enables an AI algorithm to evolve to identify a pattern through the so-called 'training' process, which relies on a large amount of data and statistics. Deep learning, one of the widely-used techniques in machine learning, further refines this training process using a 'neural network.'4 Machine learning and deep learning have brought significant improvements to the performance of AI systems in areas such as translation, speech recognition, and detecting objects and predicting their movements. Some people assume that machine learning completely replaced GOFAI, but this is a misunderstanding. Symbolic reasoning and machine learning are two distinct but not mutually exclusive approaches in AI, and they can be used together (Knight 2019a).

1 Note that by 'as intelligent as a human,' I only mean AI at human-level performance in achieving a particular goal, not general (strong) AI. General AI, also known as 'artificial general intelligence (AGI)' and 'strong AI,' refers to AI with the ability to adapt to achieve any goals. By contrast, an AI system developed to perform only one or some activities in a specific domain is called a 'narrow (weak) AI' system.

2 AlphaGo can be said to be "as intelligent as humans," but only in playing Go, where it exceeds human capability. So it does not qualify as general/strong AI in spite of its human-level intelligence in Go-playing. It is to be noted that general (strong) AI and narrow (weak) AI signify the difference in the scope of AI capability. General (strong) AI is also a broader concept than human-like intelligence, whether with its carbon-based substrate or with human-like understanding that relies on what we regard as uniquely human cognitive states such as consciousness, qualia, emotions, and so on. For more helpful descriptions of common terms in AI, see (Tegmark 2017, 39). For more on the match between AlphaGo and Sedol Lee, see (Koch 2016).

3 Deep reinforcement learning is a type of deep learning that is goal-oriented and reward-based. See (Heaven 2020).

With their limited intelligence and fully deterministic nature, early rule-based symbolic AI systems raised few ethical concerns.5 AI systems that near or surpass human capability, on the other hand, are likely to be given the autonomy to make their own decisions without humans, even when their workings are not entirely transparent, and some of those decisions are distinctively moral in character. As humans, we are trained to recognize situations that demand moral decision-making. But how would an AI system be able to do so? Or should it be? With self-driving cars and autonomous weapons systems under active development and testing, these are no longer idle questions.

The Trolley Problem

Recent advances in AI, such as autonomous cars, have brought new interest to the trolley problem, a thought experiment introduced by the British philosopher Philippa Foot in 1967. In the standard version of this problem, a runaway trolley barrels down a track where five unsuspecting people are standing. You happen to be standing next to a lever that switches the trolley onto a different track, where there is only one person. Those who are on either track will be killed if the trolley heads their way. Should you pull the lever, so that the runaway trolley kills one person instead of five? Unlike a person, a machine does not panic or freeze; it simply follows and executes the given instruction. This means that an AI-powered trolley may act morally as long as it is programmed properly.6 The question itself remains, however: should the AI-powered trolley be programmed to swerve or stay on course?

Different moral theories, such as virtue ethics, contractarianism, and moral relativism, take different positions. Here, I will consider utilitarianism and deontology. Since their tenets are relatively straightforward, most AI developers are likely to look to those two moral theories for guidance and insight. Utilitarianism argues that the utility of an action is what makes an action moral. In this view, what generates the greatest amount of good is the most moral thing to do. If one regards five human lives as a greater good than one, then one acts morally by pulling the lever and diverting the trolley to the other track. By contrast, deontology claims that what determines whether an action is morally right or wrong is not its utility but moral rules. If an action is in accordance with those rules, then the action is morally right. Otherwise, it is morally wrong. If not killing another human being is one of those moral rules, then killing someone is morally wrong even if it is to save more lives.

4 Machine learning and deep learning have gained momentum because the cost of high-performance computing has significantly decreased and large data sets have become more widely available. For example, ImageNet contains more than 14 million hand-annotated images. The ImageNet data have been used for the well-known annual AI competition for object detection and image classification at large scale from 2010 to 2017. See http://www.image-net.org/challenges/LSVRC/.

5 For an excellent history of AI research, see chapter 1, "What is Artificial Intelligence," of Boden 2016, 1–20.

6 Programming here does not exclusively refer to a deep learning or machine learning approach.

Note that these are highly simplified accounts of utilitarianism and deontology. The good in utilitarianism can be interpreted in many different ways, and the issue of conflicting moral rules is a perennial problem that deontological ethics grapples with.7 For our purpose, however, these simplified accounts are sufficient to highlight the ways in which the utilitarian and the deontological positions appeal to and go against our moral intuition at the same time.

If a trolley cannot be stopped, saving five lives over one seems to be the right thing to do. Utilitarianism appears to get things right in this respect. However, it is hard to dispute that killing people is wrong. If killing is morally wrong no matter what, deontology seems to make more sense. With moral theories, things seem to get more confusing. Furthermore, consider the case in which one freezes and fails to pull the lever. According to utilitarianism, this would be morally wrong because it fails to maximize the greatest good, i.e. human lives. But how far should one go to maximize the good? Suppose there is a very large person on a footbridge over the trolley track, and one pushes that person off the footbridge onto the track, thus stopping the trolley and saving the five people. Would this count as a right thing to do? Utilitarianism may argue that it would. But in real life, many would consider throwing a person morally wrong but pulling the lever morally permissible.8

The problem with utilitarianism is that it treats the good as something inherently quantifiable, comparable, calculable, and additive. But not all considerations that we have to factor into moral decision-making are measurable in numbers. What if the five people on the track are helpless babies, or murderers who just escaped from prison? Would or should that affect our decision? Some of us would surely hesitate to save the lives of five murderers by sacrificing one innocent baby. But what if things were different and we were comparing five school children versus one baby, or five babies versus one school child? No one can say for sure what is the morally right action in those cases.9

While the utilitarian position appears less persuasive in light of these considerations, deontology doesn't fare too well, either. Deontology emphasizes one's duty to observe moral rules. But what if those moral rules conflict with one another? Between the two moral rules "do not kill a person" and "save lives," which one should trump the other? Conflict among values is common in life, and deontology faces difficulty in guiding how an intelligent agent is to act in a tricky situation such as the trolley problem.10

Understanding What Ethics Has to Offer

Now, let us consider AI-powered military robots and autonomous weapons systems, since they present the moral dilemma in the trolley problem more convincingly due to the high stakes involved. Suppose that some engineers, following utilitarianism and interpreting victory as the ultimate good/utility, wish to program an unmanned aerial vehicle (UAV) to autonomously drop bombs in order to maximize the chances of victory. That may result in sacrificing a greater number of civilians than necessary, and many will consider this to be morally wrong. Now imagine different engineers who, adopting deontology and following the moral principle of not killing people, program a UAV to autonomously act in a manner that minimizes casualties. This may lead to defeat on the battlefield, because minimizing casualties is not always advantageous to winning a war. From these examples, we can see that philosophical insights from utilitarianism and deontology may provide little practical guidance on how to program autonomous AI systems to act morally.

7 For an overview, see (Sinnott-Armstrong 2019) and (Alexander and Moore 2016).

8 For an empirical study on this, see (Cushman, Young, and Hauser 2006). For the results of a similar survey that involves an autonomous car instead of a trolley, see (Bonnefon, Shariff, and Rahwan 2016).

9 For an attempt to identify moral principles behind our moral intuition in different versions of the trolley problem and other similar cases, see (Thomson 1976).

10 Some moral philosophers doubt the value of our moral intuition in constructing a moral theory. See (Singer 2005), for example. But a moral theory that clashes with common moral intuition is unlikely to be sought out as a guide to making an ethical decision.

Ethicists seek abstract principles that can be generalized. For this reason, they are interested in borderline cases that reveal subtle differences in our moral intuition and varying moral theories. Their goal is to define what is moral and to investigate how moral reasoning works or should work. By contrast, engineers and programmers pursue practical solutions to real-life problems and look for guidelines that will help with implementing those solutions. Their focus is on creating a set of constraints and if-then statements that will allow a machine to identify and process morally relevant considerations, so that it can determine and execute an action that is not only rational but also ethical in the given situation.11

On the other hand, the goal of military commanders and soldiers is to end a conflict, bring peace, and facilitate restoring and establishing universally recognized human values such as freedom, equality, justice, and self-determination. In order to achieve this goal, they must make the best strategic decisions and take the most appropriate actions. In deciding on those actions, they are also responsible for abiding by the principles of jus in bello and for not abdicating their moral responsibility, protecting civilians and minimizing harm, violence, and destruction as much as possible.12 The goal of military commanders and soldiers, therefore, differs from that of moral philosophers or of the engineers who build autonomous weapons. They are obligated to make quick decisions in life-or-death situations while working with AI-powered military systems.

These different goals and interests explain why moral philosophers’ discussion of the trolley problem may be disappointing to AI programmers or to military commanders and soldiers. Ethics does not provide an easy answer to the question of how one should program moral decision-making into intelligent machines. Nor does it prescribe the right moral decision on a battlefield. But taking this as a shortcoming of ethics misses the point. The role of moral philosophy is not to make decision-making easier but to highlight and articulate the difficulty and complexity involved in it.

11. Note that this moral decision-making process can be modeled with a rule-based symbolic AI approach, a machine learning approach, or a combination of both. See Conitzer et al. 2017.

12. For the principles of jus in bello, see International Committee of the Red Cross 2015.

Ethical Challenges fromAutonomous AI Systems

The complexity of ethical questions means that dealing with the morality of an action by an autonomous AI system will require more than a clever engineering or programming solution. The fact that ethics does not eliminate the inherent ambiguity in many moral decisions should not lead to the dismissal of ethical challenges from autonomous AI systems. By injecting the capacity for autonomous decision-making into machines, AI can fundamentally transform any given field. For example, AI-powered military robots are not just another kind of weapon. When widely deployed, they can change the nature of war itself. Described below are some of the significant ethical challenges that autonomous AI systems such as military robots present. Note that in spite of these ethical concerns, autonomous AI systems are likely to continue to be developed and adopted in many areas as a way to increase efficiency and lower cost.

(a) Moral desensitization

AI-powered military robots are more capable than merely remotely operated weapons. They can identify a target and initiate an attack on their own. Due to their autonomy, military robots can significantly increase the distance between the party that kills and the party that gets killed (Sharkey 2012). This increase, however, may lead people to surrender their own moral responsibility to a machine, thereby resulting in the loss of humanity, which is a serious moral risk (Davis 2007). The more autonomous military robots become, the less responsibility humans will feel regarding their life-or-death decisions.

(b) Unintended outcome

The side that deploys AI-powered military robots is likely to suffer fewer casualties itself while inflicting more casualties on the enemy side. This may make the military more inclined to start a war. Ironically, when everyone thinks and acts this way, the number of wars and the overall amount of violence and destruction in the world will only increase.13

(c) Surrender of moral agency

AI-powered military robots may fail to distinguish innocents from combatants and kill the former. In such a case, can we be justified in letting robots take the lives of other human beings? Some may argue that only humans should decide to kill other humans, not machines (Davis 2007). Is it permissible for people to delegate such a decision to AI?

(d) Opacity in decision-making

Machine learning is used to build many AI systems today. Instead of prescribing a predetermined algorithm, a machine learning system goes through a so-called ‘training’ process to produce the final algorithm from a large amount of data. For example, a machine learning system may generate an algorithm that successfully recognizes cats in a photo after going through millions of photos that show cats in many different postures from various angles.14 But the resulting algorithm is a complex mathematical formula and not something that humans can easily decipher. This means that the inner workings of a machine learning AI system and its decision-making process are opaque to human understanding, even to those who built the system (Knight 2017). In cases where the actions of an AI system can have grave consequences, as with a military robot, such opacity becomes a serious problem.15

13. Kahn 2012 also argues that the resulting increase in the number of wars due to the use of military robots will be morally bad.

14. Google’s research team created an AI algorithm that learned how to recognize a cat in 2012. The neural network behind this algorithm had an array of 16,000 processors and more than one billion connections. Unlabeled random thumbnail images from 10 million YouTube videos allowed this algorithm to learn to identify cats by itself. See Markoff 2012 and Clark 2012.

15. This black-box nature of AI systems powered by machine learning has raised great concern among many AI researchers in recent years. It is problematic in all areas where these AI systems are used for decision-making, not just in military operations. The gravity of decisions made in a military operation makes this problem even more troublesome. Fortunately, some AI researchers, including those in the US Department of Defense, are actively working to make AI systems explainable. But until such research bears fruit and AI systems become fully explainable, their military use means accepting many unknown variables and unforeseeable consequences. See Turek n.d.


AI Applications for Libraries

Do the ethical concerns outlined above apply to libraries? To answer that, let us first take a look at how AI, particularly machine learning, may apply to library services and operations. AI-powered digital assistants are likely to mediate a library user’s information search, discovery, and retrieval activities in the near future.

In recent years, machine learning and deep learning have brought significant improvement to natural language processing (NLP), which deals with analyzing large amounts of natural language data to make interaction between people and machines in natural languages possible. For instance, Google Assistant’s new feature ‘Duplex’ was shown successfully making a phone reservation with restaurant staff in 2018 (Welch 2018). Google’s real-time translation capability for 44 different languages was introduced to Google Assistant-enabled Android and iOS phones in 2019 (Rincon 2019).

As digital assistants become capable of handling more sophisticated language tasks, their use as a flexible voice user interface will only increase. Such digital assistants will be able to directly interact with library systems and applications, automatically interpret a query, and return results that they deem to be most relevant. Those digital assistants can also be equipped to handle the library’s traditional reference or readers’ advisory service. Integrated into a humanoid robot body, they may even greet library patrons at the entrance and answer directional questions about the library building.

Cataloging, abstracting, and indexing are other areas where AI will be actively utilized. Currently, those tasks are performed by skilled professionals. But as AI applications become more sophisticated, we may see many of those tasks partially or fully automated and handed over to AI systems. Machine learning and deep learning can be used to extract key information from a large number of documents or from information-rich visual materials, such as maps and video recordings, and generate metadata or a summary.

Since machine learning is new to libraries, relatively few machine learning applications have been developed for libraries’ use, though they are likely to grow in number. Yewno, Quartolio, and Iris.ai are examples of commercial products developed with machine learning and deep learning techniques.16 Yewno Discover displays the connections between different concepts or works in library materials. Quartolio targets researchers looking to discover untapped research opportunities based upon a large amount of data that includes articles, clinical trials, patents, and notes. Similarly, Iris.ai helps researchers identify and review a large number of research papers and patents and extracts key information from them. Kira identifies, extracts, and analyzes text in contracts and other legal documents.17 None of these applications performs fully automated decision-making or incorporates a digital assistant feature, but this is an area on which information systems vendors are increasingly focusing their efforts.

Libraries themselves are also experimenting with AI to test its potential for library services and operations. Some are focusing on using AI, particularly the voice user interface aspect of the digital assistant, in order to improve existing services. The University of Oklahoma Libraries have been building an Alexa application to provide basic reference service to their students.18

16. See https://www.yewno.com/education, https://quartolio.com/, and https://iris.ai/.

17. See https://kirasystems.com/. Law firms are adopting similar products to automate and expedite their legal work, and law librarians are discussing how the use of AI may change their work. See Marr 2018 and Talley 2016.

18. University of Oklahoma Libraries are building an Alexa application that will provide some basic reference service to their students. Also, their PAIR registry attempts to compile all AI-related projects at libraries. See https://pair.libraries.ou.edu.


At the University of Pretoria Library in South Africa, a robot named ‘Libby’ already interacts with patrons by providing guidance, answering questions, conducting surveys, and displaying marketing videos (Mahlangu 2019).

Other libraries are applying AI to extract information from digital materials and automate metadata generation to enhance their discovery and use. The Library of Congress has worked on detecting features, such as railroads in maps, using a convolutional neural network model, and in 2019 issued a solicitation for a machine learning and deep learning pilot program that will maximize the use of its digital collections.19 Indiana University Libraries, AVP, University of Texas Austin School of Information, and the New York Public Library are jointly developing the Audiovisual Metadata Platform (AMP), using many AI tools in order to automatically generate metadata for audiovisual materials, which collection managers can use to supplement their archival description and processing workflows.20

19. See Blewer, Kim, and Phetteplace 2018 and Price 2019.

20. The AMP wiki is https://wiki.dlib.indiana.edu/pages/viewpage.action?pageId=531699941. The Audiovisual Metadata Platform Pilot Development (AMPPD) project was presented at Code4Lib 2020 (Averkamp and Hardesty 2020).

Some libraries are also testing out AI as a tool for evaluating services and operations. The University of Rochester Libraries applied deep learning to the library’s space assessment to determine the optimal staffing level and building hours. The University of Illinois Urbana-Champaign Libraries used machine learning to conduct sentiment analysis on their reference chat log (Blewer, Kim, and Phetteplace 2018).

Ethical Challenges from the Personalized and Automated Information Environment

Do these current and future AI applications for libraries pose ethical challenges similar to those that we discussed earlier? Since information query, discovery, and retrieval rarely involve life-or-death situations, the stakes certainly seem lower. But an AI-driven automated information environment does raise its own distinct ethical challenges.

(i) Intellectual isolation and bigotry hampering civic discourse

Many AI applications that assist with information-seeking activities promise a higher level of personalization. But a highly personalized information environment often traps people in their own so-called ‘filter bubble,’ as we have been increasingly seeing in today’s social media channels, news websites, and commercial search engines, where such personalization is provided by machine learning and deep learning.21 Sophisticated AI algorithms are already curating and pushing information feeds based upon the person’s past search and click behavior. The result is that information seekers are provided with information that conforms to and reinforces their existing beliefs and interests. Views that are novel or that contrast with their existing beliefs are suppressed and become invisible without users even realizing it.

21. See Pariser 2011.

Such lack of exposure to opposing views leads information users to intellectual isolation and even bigotry. Highly personalized information environments powered by AI can actively restrict the ways in which people develop balanced and informed opinions, thereby intensifying and perpetuating social discord and disrupting civic discourse. Under such conditions, prejudices, discrimination, and other unjust social practices are likely to increase, and this in turn will have a more negative impact on those with fewer privileges. Intellectual isolation and bigotry thus have a distinctly moral impact on society.

(ii) Weakening of cognitive agency and autonomy

We have seen earlier that AI-powered digital assistants are likely to mediate people’s information search, discovery, and retrieval activities in the near future. As those digital assistants become more capable, they will go beyond listing available information. They will further choose what they deem to be most relevant to users and proceed to recommend or autonomously execute the best course of action.22 Other AI-driven features, such as extracting key information or generating a summary of a large amount of information, are also likely to be included in future information systems, and they may deliver key information or summaries even before a request is made, based upon constant monitoring of the user’s activities.

In such a scenario, an information seeker’s cognitive agency is likely to be undermined. Crucial to cognitive agency is the mental capacity to critically review a variety of information, judge what is and is not relevant, and interpret how new information relates to other existing beliefs and opinions. If AI assumes those tasks, the opportunities for information seekers to exercise their own cognitive agency will surely decrease. Cognitive deskilling and the subsequent weakening of people’s agency in the AI-powered automated information environment present an ethical challenge because such agency is necessary for a person to be a fully functioning moral agent in society.23

(iii) Social impact of scholarship and research from flawed AI algorithms

Previously, we have seen that deep learning applications are opaque to human understanding. This lack of transparency and explainability raises the question of whether it is moral to rely on AI-powered military robots for life-or-death decisions. Does the AI-powered information environment have a similar problem?

Machine learning applications base their recommendations and predictions upon the patterns in past data. Their predictions and recommendations are in this sense inherently conservative. They also become outdated when they fail to reflect new social views and material conditions that no longer fit the past patterns. Furthermore, each data set is a social construct that reflects particular values and choices, such as who decided to collect the data and for what purpose, who labeled the data, what criteria or beliefs guided such labeling, and what taxonomies were used and why (Davis 2020). No data set can capture all variables and elements of the phenomenon that it describes. Moreover, data sets used for training machine learning and deep learning algorithms may not be representative samples of all relevant subgroups. In such a case, an algorithm trained on such a data set will produce skewed results. Creating a large data set is also costly. Consequently, developers often simply take the data sets available to them. Those data sets are likely to come with inherent limitations such as omissions, inaccuracies, errors, and hidden biases.

22. Needless to say, this is a highly simplified scenario. Those features can also be built into the information system itself rather than being delivered by a digital assistant.

23. Outside of the automated information environment, AI has a strong potential to engender moral deskilling. Vallor (2015) points out that automated weapons will lead to soldiers’ moral deskilling in the use of military force; new media practices of multitasking may result in deskilling in moral attention; and social robots can cause moral deskilling in practices of human caregiving.


AI algorithms trained with these flawed data sets can fail unexpectedly, revealing those limitations. For example, it has been reported that the success rate of a facial recognition algorithm plunges from 99% to 35% when the group of subjects changes from white men to dark-skinned women, because the algorithm was trained mostly with photographs of white men (Lohr 2018). Adopting such a faulty algorithm for any real-life use at a large scale would be entirely unethical. In the context of libraries, imagine using such a face-recognition algorithm to generate metadata for digitized historical photographs, or a similarly flawed audio transcription algorithm to transcribe archival audio recordings.

Just like those faulty algorithms, an AI-powered automated information environment can produce information, recommendations, and predictions affected by the similar limitations existing in many data sets. The more seamless such an information environment is, the more invisible those limitations become. Automated information systems from libraries may not be involved in decisions that have a direct and immediate impact on people’s lives, such as setting a bail amount or determining a Medicaid payment.24 But automated information systems that are widely adopted and used for research and scholarship will impact real-life policies and regulations in areas such as healthcare and the economy. Undiscovered flaws will undermine the validity of the scholarly output that utilized those automated information systems and can further inflict serious harm on certain groups of people through those policies and regulations.

24. See Tashea 2017 and Stanley 2017.

Moral Intelligence and Rethinking the Role of AI

In this chapter, I discussed four significant ethical challenges that automating decisions and actions with AI presents: (a) moral desensitization; (b) unintended outcomes; (c) surrender of moral agency; and (d) opacity in decision-making.25 I also examined somewhat different but equally significant ethical challenges in relation to the AI-powered automated information environment, which is likely to surround us in the future: (i) intellectual isolation and bigotry hampering civic discourse; (ii) weakening of cognitive agency and autonomy; and (iii) the social impact of scholarship and research based upon flawed AI algorithms.

25. This is by no means an exhaustive list. User privacy and potential surveillance are examples of other important ethical challenges, which I do not discuss here.

In the near future, libraries will be acquiring, building, customizing, and implementing many personalized and automated information systems. Given this, the challenges related to the AI-powered automated information environment are highly relevant to them. At present, libraries are at an early stage in developing AI applications and applying machine learning and deep learning techniques to improve library services, systems, and operations. But the general issues of hidden biases and the lack of explainability in machine learning and deep learning are already gaining attention in the library community.

As we have seen in the trolley problem, whether a certain action is moral is not a line that can be drawn with absolute clarity. It is entirely possible for fully functioning moral agents to make different judgments. In addition, there is the matter of the morality that our tools and systems themselves display, which is called “machine morality” in relation to AI systems.

Wallach and Allen (2009) argue that there are three distinct levels of machine morality: operational morality, functional morality, and full moral agency (26). Operational morality is found in systems that are low in both autonomy and ethical sensitivity. At this level of machine morality, a machine or a tool is given a mechanism that prevents its immoral use, but the mechanism is within the full control of the user. Such operational morality exists in a gun with a childproof safety mechanism, for example. A gun with a safety mechanism is neither autonomous nor sensitive to ethical concerns related to its use. By contrast, machines with functional morality do possess a certain level of autonomy and ethical sensitivity. This category includes AI systems with significant autonomy and little ethical sensitivity, as well as those with little autonomy and high ethical sensitivity. An autonomous drone would fall under the former type, while MedEthEx, an ethical decision-support AI recommendation system for clinicians, would be of the latter. Lastly, Wallach and Allen regard systems with high autonomy and high ethical sensitivity as having full moral agency, as much as humans do. This means that those systems would have a mental representation of values and the capacity for moral reasoning. Such machines can be held morally responsible for their actions.

We do not know whether AI will be able to produce such a machine with full moral agency. If the current drive to automate more and more human tasks for cost savings and efficiency at scale continues, however, most of the more sophisticated AI applications to come will be of the kind with functional morality, particularly the kind that combines a relatively high level of autonomy with a lower level of ethical sensitivity.

In the beginning of this chapter, I mentioned that the goal of AI is to create an artificial system, whether a piece of software or a machine with a physical body, that is as intelligent as a human in its performance, either broadly in all areas of human activity or narrowly in a specific activity. But what exactly does “as intelligent as a human” mean? If morality is an integral component of human-level intelligence, AI research needs to pay more attention to intelligence not only in accomplishing a goal but also in doing so ethically.26 In that light, it is meaningful to ask what level of autonomy and ethical sensitivity a given AI system is equipped with, and what level of machine morality is appropriate for its purpose.

In designing an AI system, it would be helpful to consider what level of autonomy and ethical sensitivity would be best suited to its purpose and whether it is feasible to provide that level of machine morality for the system in question. In general, the narrower the function or the domain of an AI system, the easier it will be to equip it with an appropriate level of autonomy and ethical sensitivity. In evaluating and designing an AI system, it will be important to test the actual outcome against the anticipated outcome in different types of cases in order to identify potential problems. System-wide audits to detect well-known biases, such as gender discrimination or racism, can serve as an effective strategy.27 Other undetected problems may surface only after the AI system is deployed. Having a mechanism to continually test an AI algorithm to identify those unnoticed problems and feeding the test results back into the algorithm for retraining will be another way to deal with algorithmic biases. Those who build AI systems will also benefit from consulting existing principles and guidelines such as FAT/ML’s “Principles for Accountable Algorithms and a Social Impact Statement for Algorithms.”28
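To make the idea of such an audit concrete, below is a minimal sketch in Python of one possible check: it compares a model’s accuracy across demographic subgroups and flags a large gap. The column names (‘group’, ‘actual’, ‘predicted’), the toy data, and the tolerated gap are all hypothetical illustrations, not a prescribed method; a real audit would examine many more metrics and cases.

# A minimal subgroup-audit sketch. Assumes a pandas DataFrame with
# hypothetical columns: 'group' (a demographic label), 'actual' (the
# true outcome), and 'predicted' (the model's output).
import pandas as pd

def audit_by_group(results, max_gap=0.05):
    """Report per-group accuracy and warn when one group trails another."""
    accuracy = (
        results.assign(correct=results["actual"] == results["predicted"])
        .groupby("group")["correct"]
        .mean()
    )
    gap = accuracy.max() - accuracy.min()
    if gap > max_gap:
        print(f"Warning: accuracy differs by {gap:.0%} across groups")
    return accuracy

# Toy example: the model does well for group 'a' and poorly for 'b'.
toy = pd.DataFrame({
    "group":     ["a", "a", "a", "b", "b", "b"],
    "actual":    [1, 0, 1, 1, 0, 1],
    "predicted": [1, 0, 1, 0, 1, 0],
})
print(audit_by_group(toy))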

We may also want to rethink how and where we apply AI. We and our society do not have to use AI to equip all our systems and machines with human- or superhuman-level performance. This is particularly so if the pursuit of such human- or superhuman-level performance is likely to increase unethical decisions that negatively impact a significant number of people. We do not have to task AI with always automating away human work and decisions as much as possible. What if we reframe AI’s role as helping people become more intelligent and more capable where they struggle or experience disadvantages, such as critical thinking, civic participation, healthy living, financial literacy, dyslexia, or hearing loss? What kind of AI-driven information systems and environments would be created if libraries approached AI with such intention from the beginning?

26. Here, I regard intelligence as the ability to accomplish complex goals, following Tegmark 2017. For more discussion on intelligence and goals, see Chapter 2 and Chapter 7 of Tegmark 2017.

27. These audits are far from foolproof, but the detection of hidden biases will be crucial in making AI algorithms more accountable and their decisions more ethical. A debiasing algorithm can also be used during the training stage of an AI algorithm to reduce hidden biases in training data. See Amini et al. 2019, Knight 2019b, and Courtland 2018.

28. See https://www.fatml.org/resources/principles-for-accountable-algorithms. Other principles and guidelines include “Ethics Guidelines for Trustworthy AI” (https://ec.europa.eu/digital-single-market/en/news/ethics-guidelines-trustworthy-ai) and “Algorithmic Impact Assessments: A Practical Framework For Public Agency Accountability” (https://ainowinstitute.org/aiareport2018.pdf).

References

Alexander, Larry, and Michael Moore. 2016. “Deontological Ethics.” In The Stanford Encyclopedia of Philosophy, edited by Edward N. Zalta, Winter 2016. Metaphysics Research Lab, Stanford University. https://plato.stanford.edu/archives/win2016/entries/ethics-deontological/.

Amini, Alexander, Ava P. Soleimany, Wilko Schwarting, Sangeeta N. Bhatia, and Daniela Rus. 2019. “Uncovering and Mitigating Algorithmic Bias through Learned Latent Structure.” In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 289–295. AIES ’19. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3306618.3314243.

Averkamp, Shawn, and Julie Hardesty. 2020. “AI Is Such a Tool: Keeping Your Machine Learning Outputs in Check.” Presented at the Code4Lib Conference, Pittsburgh, PA, March 11. https://2020.code4lib.org/talks/AI-is-such-a-tool-Keeping-your-machine-learning-outputs-in-check.

Blewer, Ashley, Bohyun Kim, and Eric Phetteplace. 2018. “Reflections on Code4Lib 2018.” ACRL TechConnect (blog). March 12, 2018. https://acrl.ala.org/techconnect/post/reflections-on-code4lib-2018/.

Boden, Margaret A. 2016. AI: Its Nature and Future. Oxford: Oxford University Press.

Bonnefon, Jean-François, Azim Shariff, and Iyad Rahwan. 2016. “The Social Dilemma of Autonomous Vehicles.” Science 352 (6293): 1573–76. https://doi.org/10.1126/science.aaf2654.

Clark, Liat. 2012. “Google’s Artificial Brain Learns to Find Cat Videos.” Wired, June 26, 2012. https://www.wired.com/2012/06/google-x-neural-network/.

Conitzer, Vincent, Walter Sinnott-Armstrong, Jana Schaich Borg, Yuan Deng, and Max Kramer. 2017. “Moral Decision Making Frameworks for Artificial Intelligence.” In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 4831–4835. AAAI ’17. San Francisco, California, USA: AAAI Press.

Courtland, Rachel. 2018. “Bias Detectives: The Researchers Striving to Make Algorithms Fair.” Nature 558 (7710): 357–60. https://doi.org/10.1038/d41586-018-05469-3.

Cushman, Fiery, Liane Young, and Marc Hauser. 2006. “The Role of Conscious Reasoning and Intuition in Moral Judgment: Testing Three Principles of Harm.” Psychological Science 17 (12): 1082–89.

Davis, Daniel L. 2007. “Who Decides: Man or Machine?” Armed Forces Journal, November. http://armedforcesjournal.com/who-decides-man-or-machine/.

Davis, Hannah. 2020. “A Dataset Is a Worldview.” Towards Data Science. March 5, 2020. https://towardsdatascience.com/a-dataset-is-a-worldview-5328216dd44d.

Foot, Philippa. 1967. “The Problem of Abortion and the Doctrine of Double Effect.” Oxford Review 5: 5–15.

Heaven, Will Douglas. 2020. “DeepMind’s AI Can Now Play All 57 Atari Games—but It’s Still Not Versatile Enough.” MIT Technology Review, April 1, 2020. https://www.technologyreview.com/2020/04/01/974997.

International Committee of the Red Cross. 2015. “What Are Jus Ad Bellum and Jus in Bello?” January 22, 2015. https://www.icrc.org/en/document/what-are-jus-ad-bellum-and-jus-bello-0.

Kahn, Leonard. 2012. “Military Robots and The Likelihood of Armed Combat.” In Robot Ethics: The Ethical and Social Implications of Robotics, edited by Patrick Lin, Keith Abney, and George A. Bekey, 274–92. Intelligent Robotics and Autonomous Agents. Cambridge, Mass.: MIT Press.

Knight, Will. 2017. “The Dark Secret at the Heart of AI.” MIT Technology Review, April 11, 2017. https://www.technologyreview.com/2017/04/11/5113.

———. 2019a. “Two Rival AI Approaches Combine to Let Machines Learn about the World like a Child.” MIT Technology Review, April 8, 2019. https://www.technologyreview.com/2019/04/08/103223.

———. 2019b. “AI Is Biased. Here’s How Scientists Are Trying to Fix It.” Wired, December 19, 2019. https://www.wired.com/story/ai-biased-how-scientists-trying-fix/.

Koch, Christof. 2016. “How the Computer Beat the Go Master.” Scientific American, March 19, 2016. https://www.scientificamerican.com/article/how-the-computer-beat-the-go-master/.

Lohr, Steve. 2018. “Facial Recognition Is Accurate, If You’re a White Guy.” New York Times, February 9, 2018. https://www.nytimes.com/2018/02/09/technology/facial-recognition-race-artificial-intelligence.html.

Mahlangu, Isaac. 2019. “Meet Libby - the New Robot Library Assistant at the University of Pretoria’s Hatfield Campus.” SowetanLIVE, June 4, 2019. https://www.sowetanlive.co.za/news/south-africa/2019-06-04-meet-libby-the-new-robot-library-assistant-at-the-university-of-pretorias-hatfield-campus/.

Markoff, John. 2012. “How Many Computers to Identify a Cat? 16,000.” New York Times, June 25, 2012.

Marr, Bernard. 2018. “How AI And Machine Learning Are Transforming Law Firms And The Legal Sector.” Forbes, May 23, 2018. https://www.forbes.com/sites/bernardmarr/2018/05/23/how-ai-and-machine-learning-are-transforming-law-firms-and-the-legal-sector/.

Pariser, Eli. 2011. The Filter Bubble: How the New Personalized Web Is Changing What We Read and How We Think. New York: Penguin Press.

Price, Gary. 2019. “The Library of Congress Posts Solicitation For a Machine Learning/Deep Learning Pilot Program to ‘Maximize the Use of Its Digital Collection.’” LJ InfoDOCKET, June 13, 2019. https://www.infodocket.com/2019/06/13/library-of-congress-posts-solicitation-for-a-machine-learning-deep-learning-pilot-program-to-maximize-the-use-of-its-digital-collection-library-is-looking-for-r/.

Rincon, Lilian. 2019. “Interpreter Mode Brings Real-Time Translation to Your Phone.” Google Blog (blog). December 12, 2019. https://www.blog.google/products/assistant/interpreter-mode-brings-real-time-translation-your-phone/.

Sharkey, Noel. 2012. “Killing Made Easy: From Joysticks to Politics.” In Robot Ethics: The Ethical and Social Implications of Robotics, edited by Patrick Lin, Keith Abney, and George A. Bekey, 111–28. Intelligent Robotics and Autonomous Agents. Cambridge, Mass.: MIT Press.

Singer, Peter. 2005. “Ethics and Intuitions.” The Journal of Ethics 9 (3/4): 331–52.

Sinnott-Armstrong, Walter. 2019. “Consequentialism.” In The Stanford Encyclopedia of Philosophy, edited by Edward N. Zalta, Summer 2019. Metaphysics Research Lab, Stanford University. https://plato.stanford.edu/archives/sum2019/entries/consequentialism/.

Stanley, Jay. 2017. “Pitfalls of Artificial Intelligence Decisionmaking Highlighted In Idaho ACLU Case.” American Civil Liberties Union (blog). June 2, 2017. https://www.aclu.org/blog/privacy-technology/pitfalls-artificial-intelligence-decisionmaking-highlighted-idaho-aclu-case.

Talley, Nancy B. 2016. “Imagining the Use of Intelligent Agents and Artificial Intelligence in Academic Law Libraries.” Law Library Journal 108 (3): 383–402.

Tashea, Jason. 2017. “Courts Are Using AI to Sentence Criminals. That Must Stop Now.” Wired, April 17, 2017. https://www.wired.com/2017/04/courts-using-ai-sentence-criminals-must-stop-now/.

Tegmark, Max. 2017. Life 3.0: Being Human in the Age of Artificial Intelligence. New York: Alfred Knopf.

Thomson, Judith Jarvis. 1976. “Killing, Letting Die, and the Trolley Problem.” The Monist 59 (2): 204–17.

Turek, Matt. n.d. “Explainable Artificial Intelligence.” Defense Advanced Research Projects Agency. https://www.darpa.mil/program/explainable-artificial-intelligence.

Vallor, Shannon. 2015. “Moral Deskilling and Upskilling in a New Machine Age: Reflections on the Ambiguous Future of Character.” Philosophy & Technology 28 (1): 107–24. https://doi.org/10.1007/s13347-014-0156-9.

Wallach, Wendell, and Colin Allen. 2009. Moral Machines: Teaching Robots Right from Wrong. Oxford: Oxford University Press.

Welch, Chris. 2018. “Google Just Gave a Stunning Demo of Assistant Making an Actual Phone Call.” The Verge, May 8, 2018. https://www.theverge.com/2018/5/8/17332070/google-assistant-makes-phone-call-demo-duplex-io-2018.

Part II

How-to’s and Case Studies: Machine Learning in Library Practice


Chapter 8

Building a Machine Learning Pipeline

Audrey Altman
Digital Public Library of America

As a new machine learning (ML) practitioner, it is important to develop a mindful approach to the craft. By mindful, I mean possessing the ability to think clearly about each individual piece of the process, and understanding how each piece fits into the larger whole. In my experience, there are many good tutorials available that will help you work with an individual tool, deploy a specific algorithm, or complete a single task. It is more difficult to find guidelines for building a holistic system that supports the entire ML workflow. My aim is to help you build just such a system, so that you are free to focus on inquiry and discovery rather than struggling with infrastructure and process. I write this as a software developer who has, at one time or another, been on the wrong end of all the recommendations presented here, and who hopes to save you from similar headaches. Many of the examples and design choices are drawn from my experiences at the Digital Public Library of America, where I have worked alongside a very talented team of developers. This is by no means an exhaustive text, but rather a bit of pragmatic advice and a jumping-off point for further research, designed to give you a clearer idea of which questions to ask throughout your practice.

This article reviews the basic machine learning workflow, discussing design considerations along the way. It offers recommendations for data storage, guidelines on selecting and working with ML algorithms, and questions to guide tool selection. Finally, it describes some challenges with scaling up. My hope is that the insight presented here, combined with your good judgment, will empower you to get started with the actual practice of designing and executing a machine learning project.



Algorithm selection

As you begin ingesting and preparing data, you’ll want to explore possible machine learning algorithms to run on your dataset. Choose an algorithm that fits your research question and data. If you’re not sure which algorithm to choose and are not constrained by time, experiment with several different options and see which one yields the best results. Start by determining what general type of learning algorithm you need, and proceed from there to research and select one that specifically addresses your research question.

In supervised learning, you train a model to predict an output condition based on given input conditions; for example, predicting whether or not a patient has some disease based on their symptoms, or the topic of a news article based on keywords in the text. In order for supervised learning to work, you need labeled training data, meaning data in which the outcome is already known. Examples include records of symptoms in patients who were known to have the disease (or not), or news articles that have already been assigned topics.

Classification and regression are both types of supervised learning. In a classification problem, you are predicting among a discrete number of possible outcomes. For example, “based on what I know about this book, will it make the New York Times Best Seller list?” is a classification problem because there are two discrete outcomes: yes or no. Classification algorithms include naive Bayes, decision trees, and k-nearest neighbors. Regression problems try to predict an outcome from a continuum of possibilities, e.g., “based on what I know about this book, what will its retail price be?” Regression algorithms include linear regression and regression trees.
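To ground the distinction, here is a minimal sketch using scikit-learn (one common toolkit; this chapter does not presume any particular library). The book features and outcomes are toy placeholders invented for illustration.

# Classification predicts a discrete outcome; regression predicts a
# value on a continuum. Toy features: [page count, author's prior
# sales in thousands].
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

X = [[320, 150], [180, 2], [410, 95], [220, 7]]

# Classification: will the book make the best seller list? (1 = yes)
best_seller = [1, 0, 1, 0]
clf = DecisionTreeClassifier().fit(X, best_seller)
print(clf.predict([[300, 120]]))   # a discrete answer: 1 or 0

# Regression: what will the book's retail price be?
price = [24.99, 14.99, 29.99, 17.99]
reg = LinearRegression().fit(X, price)
print(reg.predict([[300, 120]]))   # a point on a continuum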

In unsupervised learning, the ML algorithm discovers a new pattern. The training data is unlabeled, meaning there is no indication of how the data should be organized at the outset. A common example is clustering, in which the algorithm groups items together based on features it finds mathematically significant. Perhaps you have a collection of news articles (with no existing topic labels), and you want to discover common themes or topics that appear throughout the collection. The algorithm will not tell you what the themes or topics are, but will show which articles group together. It is then up to the researcher to work out the common thread.
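As a minimal illustration of the news article example, the sketch below (again assuming scikit-learn) converts a few toy snippets into TF-IDF features and clusters them with k-means. The resulting labels show only which articles group together, not what the groups mean.

# Unsupervised clustering of unlabeled text. The snippets are toy
# placeholders; a real collection would be far larger.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

articles = [
    "city council votes on new budget",
    "mayor proposes budget amendment",
    "team wins championship game in overtime",
    "star player injured before playoff game",
]

features = TfidfVectorizer().fit_transform(articles)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(labels)  # e.g., [0 0 1 1]; interpreting the clusters is up to you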

In addition to serving your research question, your algorithm should also be a good fit for your data. Specific considerations will vary for each dataset and algorithm, so make sure you know the strengths and weaknesses of your algorithm and how they relate to the unique qualities of your dataset. For example, algorithms differ in their abilities to handle datasets with a very large number of features, handle datasets with high variance, efficiently process very large datasets, and glean meaningful intelligence from very small datasets. Is it important that your algorithm be easy to explain? Some algorithms, such as neural nets, function as black boxes, and it is difficult to decipher how they arrive at their decisions. Other algorithms, such as decision trees, are easy to understand. Can you prepare your data for the algorithm with a reasonable amount of pre-processing? Can you find examples of success (or failure) from people using similar datasets with the same algorithm? Asking these sorts of questions will help you to choose an algorithm that works well for your data, and will also inform how you prepare your data for optimal use.

Finally, consider whether or not you are constrained by time, hardware, or available toolsets. Different algorithms require different amounts of time and memory to train and/or execute, and different ML tools offer implementations of different algorithms.


The machine learning pipeline

The metaphor of a pipeline is often used for a machine learning workflow. This metaphor captures the idea of data channeled through a series of sequential transformations. However, it is important to note that each stage in the process will need to be repeated and honed throughout the course of your project. Therefore, don’t think of yourself as building a single intelligent model, such as a decision tree or clustering algorithm. Instead, build a pipeline with pieces that can be swapped in and out as needed. Data flows through the pipeline and outputs a version of a decision tree, clustering algorithm, or other intelligent model. Throughout your process, you will tweak your pipeline, making many intelligent models. Eventually you will select the best model for your use case. To use another metaphor: don’t build a car, build an assembly line for making cars.

While the final output of a machine learning workflow is some sort of intelligent model, there are many factors that make repetition and iteration necessary. ML processes often involve subjective decisions, such as which data points to ignore, or which configurations to select for your algorithm. You will want to test different possibilities to see what works best. As you learn more about your dataset throughout the course of the project, you will go back and tweak parts of your process. You may discover biases in your data or algorithms that need to be addressed. If you are working collaboratively, you will be incorporating asynchronous feedback from members of your team. At some point, you may need to introduce new or revised data, or try a new tool or algorithm. It is also prudent to expect and plan for errors. Human errors are inevitable, and hardware errors, such as network timeouts or memory overloads, are common. For all of these reasons, you will be well served by a pipeline composed of modular, repeatable steps, each with discrete and stable output.

A modular pipeline supports a batch processing workflow, in which whole datasets undergo a series of transformations. During each step of the process, a large amount of data (possibly the entire dataset) is transformed all at once and then incrementally stored. This can be contrasted with a real-time workflow, in which individual records are transformed instantaneously (e.g., a librarian updates a single record in a library catalog), or a streaming workflow, in which a continuous flow of data is pushed through an entire pipeline, often without incremental storage along the way (e.g., performing analysis on a continuous stream of new tweets). Batch processing is common in the research and development phase of an ML project, and may also be a good choice for a production system.

When designing any step in the batch processing pipeline, assume that at some point you will need to repeat it, either exactly as is or with modifications. Documenting your process lets you compare the outputs of different variations and communicate the ways in which your choices impact the final results. If you’re writing code, version control software can help. If you’re doing more manual data manipulations, such as editing data in spreadsheets, you will need an intentional system of documenting exactly which transformations you are applying to your data. It is generally preferable to automate processes wherever possible so that you can repeat them with ease and consistency.

A concrete example from my own experience demonstrates the importance of a pipeline that supports repetition. In my first ever ML project, I worked with a set of XML library data converted to CSV. I did most of my data cleanup by hand using spreadsheet software, and was not careful about preserving the formulas for each step of the process; instead, I deleted and wrote over many important intermediate computations, saving only the final results. This whole process took me countless hours, and when an updated dataset became available, there was no way to reproduce my painstaking cleanup process. I was stuck with outdated data, and my final output was doomed to grow more and more irrelevant as time wore on. Since then, I have always written repeatable scripts for all my data cleanup tasks.

Each decision you make will have an impact on the final results, so it is important to keep clear documentation and to verify your assumptions and hypotheses wherever possible. Sometimes there will be explicit tests to perform; at other times, you may just need to look at the data: make a quick visualization, perform a simple calculation, or glance through a sample of records. Be cognizant of the potential to introduce error or bias. For example, you could remove a field that you don’t think is important, but that would, in fact, have a meaningful impact on the final result. All of these precautions will strengthen confidence in your final outcomes and make them intelligible to your collaborators and other audiences.

The pipeline for a machine learning project generally comprises five stages: data acquisition, data preparation, model training and testing, evaluation and analysis, and application of results.

Data acquisition

The first step is to acquire the data that you will be using for your machine learning project. You may need to combine data from several different sources. There are many ways to acquire data, including downloading files, querying a database or API, or scraping web pages. Depending on the size of the source data and how it is made available, this can be a quick and simple step or the most challenging bottleneck in your pipeline. However you get your initial data, it is generally a good idea to save a copy in the rawest possible form and treat that copy as immutable, at least during the initial phase of testing different algorithms or configurations. Having a raw, immutable copy of your initial dataset (or datasets) ensures that you can always go back to the beginning of your ML process and start over with exactly the same input. It will also save you from the possibility that the source data will change from beneath you, thereby compromising your ability to compare the outputs of different operations (for more on this, see the section on data storage). If possible, it’s often worthwhile to learn about how the original data was created, especially if you are getting data from multiple sources that differ in subtle ways.
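A minimal sketch of this step in Python might look like the following; the URL, directory layout, and file naming are assumptions for illustration only.

# Fetch source data and save a raw, timestamped snapshot that is
# never edited afterward.
import pathlib
import time
import requests

def acquire(url, out_dir="myProject/acquisitions"):
    """Download source data and store an immutable raw copy."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    stamp = time.strftime("%Y%m%d_%H%M%S")
    path = pathlib.Path(out_dir) / f"raw_{stamp}.xml"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(response.content)  # treat this file as read-only
    return path

# Example (hypothetical endpoint):
# acquire("https://example.org/api/records?format=xml")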

Data preparation

Data preparation involves cleaning data and transforming it into an appropriate format for subsequent machine learning tasks. This is often the part of the process that requires the most work, and you should expect to iterate over your data preparations many times, even after you’ve started training and testing models.

The first step of data preparation is to parse your acquired data and transform it into a common, usable schema. Acquired data often comes in file formats that are good for data sharing, such as XML, JSON, or CSV. You can parse these files into whatever schema makes sense to manage the various transformations you want to perform, but it can help to have a sense of where you are headed. Your eventual choice of data format will likely be dictated by your ML algorithms; likely candidates include multidimensional arrays, tensors, matrices, and DataFrames. Look ahead to specific functions in the specific libraries you plan to use, and see what type of input data is required. You don’t have to use these same formats during your data preparations, though it can simplify the process.


Data cleanup and transformation is an art. Data is messy, and the messier the data, the harder it is to analyze and uncover underlying patterns. Yet we are only human, and perfect data is far beyond our reach. To strike a workable balance, focus on those cleanup tasks that you know (or strongly suspect) will have a significant impact on the final product. Cleanup and transformation operations include removing punctuation or stopwords from textual data, standardizing date and number formats, replacing missing or dummy values with a meaningful default, and excluding data that is known to be erroneous or atypical. You will select relevant data points, and you may need to represent them in a new way: a birth date becomes an age range; a place name becomes geo-coordinates; a text document becomes a word density vector. There are many possible normalizations to perform, depending on your dataset and which algorithm(s) you plan to use. It’s not a bad idea to ensure that there’s a genuinely unique identifier for each record (even if you don’t see an immediate need for one). This is also a good time to reflect on any biases that might be inherent in your data, and whether or not you can adjust for them; even if you cannot, understanding how they might impact the ML process will help you conduct a more nuanced analysis and frame your final results. At the very least, you can record biases in the documentation so that future researchers will be aware of them and react accordingly. As you become more familiar with the data, you will likely hone your cleanup process and iterate through the steps multiple times.
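As one sketch of what a repeatable cleanup script can look like, the example below uses pandas; the column names and the specific transformations are hypothetical stand-ins for the kinds of operations described above.

# A repeatable data preparation step: read a raw snapshot, apply
# cleanup transformations, and write a new immutable snapshot.
import uuid
import pandas as pd

def prepare(raw_csv, clean_csv):
    df = pd.read_csv(raw_csv)
    df["subject"] = df["subject"].fillna("unknown")            # meaningful default
    df["date"] = pd.to_datetime(df["date"], errors="coerce")   # standardized format
    df["decade"] = (df["date"].dt.year // 10) * 10             # re-represented data point
    df["record_id"] = [uuid.uuid4().hex for _ in range(len(df))]  # unique identifier
    df.to_csv(clean_csv, index=False)  # a new file; the raw input stays untouched
    return df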

The more you can learn about the data, the better your preparations will be. During the data preparation phase, practitioners often make use of visualizations and query frameworks to picture their data holistically, identify patterns, and find errors or outliers. Some ML tools support these features out of the box, or are intentionally interoperable with external query and visualization tools. For a lightweight tool, consider spreadsheet or notebook software. Depending on your use case, it may be worthwhile to put your data into a temporary database or search index so that you can make use of a more sophisticated query interface.

Model testing and training

During the testing and training phase, you will build multiple models and determine which one gives you the best results. One of the main ways you will tune your model is by trying multiple combinations of hyperparameters. A hyperparameter is a value that you set before you run the learning process, which impacts how the learning process works. Hyperparameters control things like the number of learning cycles an algorithm will iterate through, the number of layers in a neural net, the characteristics of a cluster, or the number of decision trees in a forest. Often, you will also want to circle back to your data preparation steps to try different configurations, apply new enhancements, or address new problems and particularities that you’ve uncovered. The process is deceptively simple: try out different configurations until you get a good result. The challenge comes when you try to define what constitutes a good (or good-enough) result.

Measuring the quality of a machine learning model takes finesse. Start by asking: What would you expect to see if the model learned perfectly? Equally important, what would you expect to see if the model didn’t learn anything at all? You can often utilize randomness as a stand-in for no learning, e.g., “if a result was selected at random, the probability of the desired outcome would be X.” These two questions will help you to set benchmarks at both extremes of the realm of possible outcomes. Perfection is elusive, and the return on investment dwindles after a while, so be prepared to stop training once you’ve arrived at an acceptably good model.

In a supervised learning problem, the dataset is split into training and testing datasets. The algorithm uses the training data to “learn” a set of rules that it can subsequently apply to new, unseen data to predict the outcome. The testing dataset (also called a validation dataset) is used to test how well the model performs. Often, a third dataset is held out as well, reserved for final testing after the model has been trained. This third dataset provides an additional bulwark against bias and overfitting. Results are typically evaluated based on some statistical measurement that is directly relevant to your research question. In a classification problem, you might optimize for recall or precision. In a regression problem, you can use formulas such as the root-mean-square deviation to measure how well the regression line matches the actual data points. How you choose to optimize your model will depend on your specific context and priorities.
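A minimal evaluation sketch, again assuming scikit-learn, might hold out a test set and compare the trained model against the random baseline described earlier; the features and labels here are placeholders.

# Split the data, train a model, and benchmark it against random
# guessing. With real data you might optimize for precision or
# recall (see sklearn.metrics) rather than plain accuracy.
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier

X = [[0, 1], [1, 1], [0, 0], [1, 0], [0, 1], [1, 1], [0, 0], [1, 0]]
y = [1, 1, 0, 0, 1, 1, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

baseline = DummyClassifier(strategy="uniform", random_state=0).fit(X_train, y_train)
model = DecisionTreeClassifier().fit(X_train, y_train)

print("random baseline accuracy:", baseline.score(X_test, y_test))
print("trained model accuracy:  ", model.score(X_test, y_test))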

Testing an unsupervised model is not as straightforward, since there is no preconceived notion of correct and incorrect categorization. You can sometimes rely on a known pattern in the underlying dataset that you would reasonably expect to be reflected in a successful model. There may also be characteristics of the final model that indicate success. For example, if you are working with a clustering algorithm, models with dense, well-defined clusters are probably better than sparse clusters with vague boundaries. In unsupervised learning, you may want to hold back some portion of your data to perform an independent validation of your results, or you may use the entire dataset to build the model; it depends on what type of testing you want to perform.
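For clustering specifically, one common heuristic (a choice of mine, not prescribed by this chapter) is the silhouette score, which rewards dense, well-separated clusters. The toy points below form two obvious groups.

# Silhouette scores range from -1 to 1; higher values indicate
# denser, better-separated clusters.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

points = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],   # one tight group
          [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]   # another tight group

for k in (2, 3):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    print(k, "clusters, silhouette:", round(silhouette_score(points, labels), 3))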

Application of results

As the final step of your workflow, you will use your intelligent model to perform some task. Perhaps you will use it for scholarly analysis of a dataset, or perhaps you will integrate it into a software product. If it is the former, consider how to export any final data and preserve the artifacts of your project. If it is the latter, consider how the model, its outputs, and its continued maintenance will fit into existing systems and workflows. Planning for interoperability may influence decisions from tool selection to data formats and storage.

Immutable data storage

Immutable data storage can benefit the batch-processing ML pipeline, especially during the initial research and development phase. This type of data storage supports iteration and allows you to compare the results of many different experiments. Treating data as immutable means that after each significant change or set of changes to your data, you save a new snapshot of the dataset that is never edited or changed. It also allows you to be flexible and adaptive with your data model. Immutable data storage has become a popular choice for data-intensive or “big data” applications as a way to easily assemble large quantities of data, often from multiple sources, without having to spend time upfront crafting a strict data model. You may have heard the term “data lake” used to refer to such large, unstructured collections of data. This can be contrasted with a “data warehouse,” which usually indicates a highly structured, centralized repository such as a relational database.

To demonstrate how immutable storage supports iteration and experimentation, consider the following scenario: You start with an input file my_data.csv, and then perform some cleanup operation over the data, such as converting all measurements in miles to kilometers, rounded to the nearest whole number. If you were treating your data as mutable, you might overwrite the original contents of my_data.csv with the transformed values. The problem with this approach comes if you want to test some alteration of your cleanup operation. Say, for example, you wanted to round all your conversions to the nearest tenth instead. Since you no longer have your original data, you would have to start the entire ML process from the top. If you instead treated your data as immutable, you would keep my_data.csv in its original state, and save the output of your cleanup operation in a new file, say my_clean_data.csv. That way, you could return to my_data.csv as many times as you wished, try different operations on this data, and easily compare the results of these operations knowing the source data was exactly the same for each one. Think of each immutable dataset as a place in your process that you can safely reset to anytime you want to try something new or correct for some bias or failure.
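A minimal sketch of that scenario in Python: the conversion writes its output to a new file rather than overwriting my_data.csv, so every variant can be compared against the same untouched source. The ‘distance’ column name is hypothetical.

# Read distances in miles from an immutable input, write kilometers
# (rounded to `ndigits`) to a brand-new output file.
import csv

def convert_miles(in_path, out_path, ndigits=0):
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            row["distance"] = round(float(row["distance"]) * 1.60934, ndigits)
            writer.writerow(row)

# Two experiments, two outputs; my_data.csv itself is never touched:
# convert_miles("my_data.csv", "my_clean_data.csv", ndigits=0)
# convert_miles("my_data.csv", "my_clean_data_tenths.csv", ndigits=1)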

To illustrate the benefits of a flexible data model, consider a mutable data store, such as a relational database. Before you put any data into the database, you would first need to design a system of tables with set fields and datatypes, and the relationships between those tables. This can feel like putting the cart before the horse, especially if you are starting with a dataset with which you are not yet intimately familiar, and you want the ability to experiment with different algorithms, all of which might require slightly different transformations of the original dataset. Revisiting the example in the previous paragraph, you might initially have defined your distance datatype as an integer (when you were rounding to the nearest whole number), and would later have to change it to a floating point number (when you were rounding to the nearest tenth). Making this change would mean altering the database schema and migrating all of the existing data to the new type, which is a nontrivial task, especially if you later decide to revert back to the original type. By contrast, if you were working with immutable CSV files, it would be much easier to write out two files, one with each data type, and keep whichever one ultimately proved most effective.

Throughout your ML process, you can create several incremental datasets that are essentially read-only. There’s no one correct data storage format, but ideally you would use something simple and space-efficient with the capacity to interoperate with different tools, such as flat files (plain text files without extraneous markup, such as TXT, CSV, or Parquet). Even if your data is ultimately destined for a different kind of datastore, such as a relational database or triplestore, consider using simple, immutable storage as an intermediary to facilitate iteration and experimentation. If you’re concerned about overwhelming your local drive, cloud storage is a good option, especially if you can read and write directly from your programs or software services.

One final benefit of immutable storage relates to scale. Batch processing workflows and immutable data storage work well with distributed data processing frameworks, such as MapReduce and Spark. Therefore, if you need to scale your ML project using distributed processing, the integration will be more seamless (for more, see the section on scaling up).

Organizing Immutable Data

Organizing immutable data stores can be a challenge, especially with multiple users. A little planning can save you from losing track of your experiments and results. A well-ordered directory structure, informative and consistent file names, liberal use of timestamps, and disciplined note-taking are simple but effective strategies. For example, say you were acquiring MARCXML records from an API feed, parsing out subject terms, and building a clustering algorithm around these terms. Let us explore one possible way that you could organize your data outputs through each step of the machine learning pipeline.

To enforce a naming convention, create a helper method that generates the output path for each run of a particular data process. This output path includes the date and timestamp of the run—that way you won’t have to think about naming each individual file, and can avoid the phenomenon of a mess of files called my_clean_data.csv, my_cleaner_data.csv, my_final_cleanest_data.csv, etc. Your file path for the acquired data might be in the format:

myProject/acquisitions/marc_YYYYMMDD_HHMMSS.xml

In this case, “YYYYMMDD” represents the date and “HHMMSS” represents the timestamp. Your file path for prepared and cleaned data might be:

myProject/clean_datasets/subjects_YYYYMMDD_HHMMSS.csv

Finally, each clustering model you build could be saved using the file path pattern:

myProject/models/cluster_YYYYMMDD_HHMMSS

Following this general pattern, you can organize all of the outputs for your entire project. Using date and timestamps in the file name also enables easy sorting and retrieval of the most recent output.
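As one possible implementation, here is a minimal sketch of such a helper in Python; the directory layout follows the examples above, and the function name is my own invention:

    # A minimal sketch of a path-generating helper. The "extension" argument
    # varies by pipeline step (.xml, .csv, or nothing for a model directory).
    from datetime import datetime

    def output_path(directory, prefix, extension=""):
        """Return a timestamped path such as myProject/clean_datasets/subjects_20200315_141500.csv."""
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        return f"myProject/{directory}/{prefix}_{stamp}{extension}"

    acquisition_path = output_path("acquisitions", "marc", ".xml")
    clean_path = output_path("clean_datasets", "subjects", ".csv")
    model_path = output_path("models", "cluster")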

For each data output, you will want to maintain a record of the exact input, any special attributes of the process (e.g. “this time I rounded decimals to the nearest hundredth”), and metrics that will help you determine success or failure of the process. If you can generate this information automatically for each process, all the better for ensuring an accurate record. One strategy is to include a second helper method in your program that will generate and write out a companion file to each data output. The companion file contains information that will help evaluate results, detect errors, perform optimizations, and differentiate between any two data outputs.

In the example project, you could accompany the acquisition output with a text file detailing the exact API call used to fetch the data, the number of records acquired, and the runtime for the process. Keeping companion files as close as possible to their outputs helps prevent accidental separation, so save it to:

myProject/acquisitions/marc_YYYYMMDD_HHMMSS.txt

In this case, the date and timestamp should exactly match that of its companion XML file. When running processes that test and train models, you can include information in your companion file about hyperparameters and whatever metrics you are using to evaluate the quality of the model. In our example, the companion file to each cluster model may contain the file path for the cleaned input data, the number of clusters, and a measure of cluster variance.
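Continuing the sketch above, a companion-file helper might look like the following; the metadata keys are hypothetical, chosen to echo the example:

    # A minimal sketch of writing a companion file beside a data output.
    import json
    from pathlib import Path

    def write_companion(data_path, metadata):
        """Save metadata next to its data output, sharing the timestamped name."""
        companion = Path(data_path).with_suffix(".txt")
        companion.write_text(json.dumps(metadata, indent=2))

    write_companion(
        "myProject/acquisitions/marc_20200315_141500.xml",
        {"api_call": "https://example.org/api?set=marc",  # hypothetical API call
         "records_acquired": 1250,
         "runtime_seconds": 312})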

Working with machine learning algorithms

New technologies and software advances make machine learning more accessible to “lay” users, by which I mean those of us without advanced degrees in mathematics or data science. Yet, the algorithms are complex, and you need at least an intuitive understanding of how they work if you hope to implement them correctly. I use the following three questions as a guide for understanding an algorithm. Keep in mind that any one project will likely make use of several complex algorithms along the way. These questions help ensure that I have the information I truly need, and avoid getting bogged down with details best left to mathematicians.

• What do the inputs and outputs of the algorithm mean? There are two parts to answering this question. First is the data structure, e.g. “this is a vector with 300 integers.” Second is knowing what this data describes, e.g. “each vector represents a document, and each integer specifies the number of times a particular word appears in that document.” You also need to be aware of specific implementation details—perhaps the input needs to be normalized in some way, perhaps the output has been smoothed (a technique that compensates for noisy data or outliers). This may seem straightforward, but it can be a lot to keep track of once you’ve gone through several layers of processing and abstraction. (The short sketch following this list makes the document-vector example concrete.)

• What effect do different hyperparameters have on the algorithm? Part of the machine learning process is tuning hyperparameters, or trying out multiple configurations until you get satisfying results. Part of the frustration is that you can’t try every possible configuration, so you have to do some intelligent guesswork. Twiddling hyperparameters can feel enigmatic and unintuitive, since it can be difficult to predict their impact on the final outcome. The better you understand hyperparameters and their roles in the ML process, the more likely you are to make reasonable guesses and adjustments—though you should always be prepared for a surprise.

• Can you explain how this algorithm works to a lay person and why it’s beneficial to the project? There are two benefits to articulating a response to this question. First, it ensures that you really understand the algorithm yourself. And second, you will likely be called on to give this explanation to collaborators and other stakeholders. A good explanation will build excitement around the project, while a befuddling one could sow doubt or disinterest. It can be difficult to strike a balance between general summary and technical equations, since your stakeholders will likely include people with diverse backgrounds, so do your best and look for opportunities for people with different areas of expertise to help refine your team’s understanding of the algorithm.
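To make the first question concrete, here is a minimal sketch, assuming scikit-learn is installed and using two made-up documents, of what a vector of word counts actually looks like:

    # A tiny illustration of an algorithm's input: a document-term matrix,
    # one row (vector) per document, one integer count per word.
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat on the mat", "the dog sat on the log"]
    vectorizer = CountVectorizer()
    matrix = vectorizer.fit_transform(docs)

    print(sorted(vectorizer.vocabulary_))  # which word each column counts
    print(matrix.toarray())                # e.g. [[1 0 0 1 1 1 2], [0 1 1 0 1 1 2]]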

Learning more about the underlying math can help you make better, more nuanced decisions about how to deploy the algorithm, and is fascinating in its own right—but in most cases I have found that the above three questions provide a solid foundation for machine learning research.

Tool selection

Tool selection is an important part of your process and should be approached thoughtfully. A good approach is to articulate and prioritize the needs of your team, and make selections that meet these needs. I’ve listed some possible questions for consideration below, many of which you will recognize as general concerns for any tool selection process.

• What sorts of features and interfaces do they offer? If you require a specific algorithm, the ability to make data visualizations, or query interfaces, you can find tools to meet these specific needs.

• How well do tools interoperate with one another, or with other parts of your existing systems? One of the advantages of a well-designed pipeline is that it will enable you to swap out software components if the need arises. For example, if your data is in a format that is interoperable with many systems, it frees you from being tied down to any specific tool.

• How do the tools align with the skill sets and comfort levels of your team? For example, consider what coding languages your collaborators know, and whether or not they have the capacity to learn a new one. If you have someone who is already a wiz with a preferred spreadsheet program, see if you can export data into a compatible file format.

• Are the tools stable, well-documented, and well-supported? Machine learning is a fast-changing field, with new algorithms, services, and software features being developed all the time. Something new and exciting that hasn’t yet been road-tested may not be worth the risk if there is a more dependable alternative. Furthermore, there tends to be more scholarship, documented use cases, and tutorials for older, more widely-adopted tools.

• Are you concerned about speed and scale? Don’t get bogged down with these considerations if you’re just trying to get a working pilot off the ground, but it can help to at least be aware of how problems are likely to manifest as your volume of data increases, or as you integrate into time-sensitive workflows.

You and your team can work through these questions and articulate additional requirements relevant to your specific context.

Scaling up

Scaling up in machine learning generally means that you need to work with a larger volume of data, or that you need processes to execute faster. Recent advances in hardware and software make the execution of complex computations orders of magnitude faster and more efficient than they were even a decade ago, and you can often achieve quite a bit by working on a personal computer. Yet, time is valuable, and it can be difficult to iterate and experiment effectively when individual processes take too long to execute.

There are many ML software packages that can help you make efficient use of whatever hardware you have, including your personal computer. Some examples at the time of writing are Apache Spark, TensorFlow, Scikit-learn, and Microsoft Cognitive Toolkit, each with their own strengths and applications. In addition to providing libraries for building and testing models, these software packages optimize algorithmic performance, memory resources, data throughputs, and/or parallel computations. They can make a remarkable difference in both processing speed and the amount of data you can comfortably handle. There are also services that allow you to submit executable code and data to the cloud for processing, such as Google AI Platform.

Managing your own hardware upgrades is not without challenge. You may be lucky enough to have access to a high-powered computer capable of accelerated processing. A common example is a computer with GPUs (graphics processing units), which break complex processes into many small tasks and run them in parallel. However, these powerful machines can be prohibitively expensive. Another scaling technique is distributed or cluster computing, in which complex processes are distributed across multiple computers, often in the cloud. A cloud cluster can bring significant cost savings, but managing one requires specialized knowledge and the learning curve can be rather steep. It is also important to note that different algorithms require different scaling techniques. Some clustering algorithms, for example, scale well with GPUs but not with distributed computing.

Even with the right hardware and software, scaling up can be a tricky business. ML processes tend to have dramatic spikes in memory or network use, which can tax your systems. Not all ML algorithms scale well, causing memory use or execution time to grow exponentially as more data is added. Sometimes you have to add additional, complexity-reducing steps to your pipeline to handle data at scale. Some of the more common machine learning languages, such as Python and R, execute relatively slowly, putting the onus on developers to optimize operations for efficiency. In anticipation of these and other challenges, it is often a good idea to start with a scaled-down pilot or proof of concept, and not to underestimate the time and resources necessary to scale up from there.

Conclusion

New technologies make it possible for more researchers and developers to leverage the power of machine learning. Building an effective machine learning system means supporting the entire workflow, from data acquisition to final analysis. Practitioners must be mindful of how each implementation decision and subjective choice—from the way you structure and store your data to the algorithms you use to the ways you validate your results—will impact the efficiency of operations and the quality of learned intelligence. This article has offered some practical guidelines for building ML systems with modular, repeatable processes and intelligible, verifiable results. There are many resources available for further research, both online and in your libraries, and I encourage you to consult with subject specialists, data scientists, mathematicians, programmers, and data engineers. May your data be clean, your computations efficient, and your results profound.

Further Reading

I include here a few suggestions for further reading on key topics. I have also found that in the fast-changing world of machine learning technologies, blogs, internet communities, and online classes can be a great source of information that is current, introductory, and/or geared toward practitioners.

Tan, Pang-Ning, Michael Steinbach, and Vipin Kumar. 2005. Introduction to Data Mining. Boston: Pearson Addison Wesley. See chapter 2 for data preparation strategies. Later chapters introduce common classification and clustering algorithms.

Marz, Nathan, and James Warren. 2015. Big Data: Principles and Best Practices of Scalable Real-Time Data Systems. Shelter Island: Manning. “Part 1: Batch Layer” discusses immutable storage in depth.

Kleppmann, Martin. 2017. Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. Boston: O’Reilly. “Chapter 10: Batch Processing” is especially relevant if you are interested in scaling up.

Chapter 9

Fragility and Intelligibility of Deep Learning for Libraries

Michael Lesk
Rutgers University

Introduction

On February 7, 2018, Mounir Mahjoubi, then the “digital minister” of France (le secrétariat d’État chargé du Numérique), told the civil service to use only computer methods that could be understood (Mahjoubi 2018). To be precise, what he actually said to l’Assemblée Nationale was:

Aucun algorithme non explicable ne pourra être utilisé.

I gave this to Google Translate and asked for it in English. What I got (on October 13, 2019) was:

No algorithm that can not be explained can not be used.

That’s a long way from fluent English. As I count the “not” words, it’s actually reversed in meaning. But, what if I leave off the final period when I enter it in Google Translate? Then I get:

No non-explainable algorithm can be used

Quite different, and although only barely fluent, now the meaning is right. The difference was only the final punctuation on the sentence.1

This is an example of the fragility of an AI algorithm. The point is not that both translations are of doubtful quality. The point is that a seemingly insignificant change in the input produced such a difference in the output. In this case, the fragility was detected by accident.

1 In the months between my original queries in October 2019 and the final preparations for publication in November 2020, the algorithm has changed to produce the same translation with or without a period: “No non-explicable algorithm can be used.”


Machine learning systems have a set of data for training. For example, if you are interested in translation, and you have a large collection of text in both French and English, you might notice that the word truck in English appears where the word camion appears in French. And the system might “learn” this translation. It would then apply this in other examples; this is called generalization. Of course if you wish to translate French into British English, a preferred translation of camion is lorry. And if the context of your English truck is a US discussion of the wheels and axles underneath railway vehicles, the better French word is le bogie.

Deep learning enthusiasts believe that with enough examples, machine learning systems will be able to generalize correctly. There can be various kinds of failures: we can discuss both (a) problems in the scope of the training data and (b) problems in the kind of modeling done. If the system has sufficiently general input data so that it learns well enough to produce reliably correct results on examples it has not seen, we call it robust; robustness is the opposite of fragility. Fragility errors here can arise from many sources—for example, the training data may not be representative of the real problem (if you train a machine translation program solely on engineering documents, do not expect it to do well on theater reviews). Or, the data may not have the scope of the real problem: if you train for “boat” based on ocean liners, don’t be surprised if the program fails on canoes.

In addition, there are also modeling issues. Suppose you use a very simple model, such as a linear model, for data that is actually perhaps quadratic or exponential. This is called “underfitting” and may often arise when there is not enough training data. The reverse is also possible: there may be a lot of training data, including many noisy points, and the program may decide on a very complex model to cover all the noise in the training data. This is called “overfitting” and gives you an answer too dependent on noise and outliers in your data. For example, 1998 was an unusually warm year, but the decline in world temperature for the next few years suggests it was noise in the data, not a change in the development of climate.
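For a toy demonstration of both failure modes, consider the following sketch, which assumes Python with numpy and uses entirely invented data:

    # Underfitting vs. overfitting on made-up data that follow a quadratic
    # trend plus noise. We fit polynomials that are too simple (degree 1),
    # appropriate (degree 2), and too flexible (degree 9), then compare
    # errors on fitted points versus held-out points.
    import numpy as np

    rng = np.random.default_rng(42)
    x = np.linspace(0, 4, 40)
    y = 2 + 0.5 * x**2 + rng.normal(0, 0.5, x.size)

    train = np.arange(0, 40, 2)  # even-numbered points used for fitting
    test = np.arange(1, 40, 2)   # odd-numbered points held out

    for degree in (1, 2, 9):
        coefficients = np.polyfit(x[train], y[train], degree)
        predictions = np.polyval(coefficients, x)
        train_error = np.mean((predictions[train] - y[train]) ** 2)
        test_error = np.mean((predictions[test] - y[test]) ** 2)
        print(f"degree {degree}: train MSE {train_error:.3f}, test MSE {test_error:.3f}")

Typically the degree-1 fit misses the curve everywhere (underfitting), while the degree-9 fit chases the noise, scoring well on the points it was fit to but worse on the held-out ones (overfitting).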

Fragility is also a problem in image recognition (“AI Recognition” 2017). Currently the most common technique for image recognition research projects is the use of convolutional neural nets. Recently, several papers have looked at how trivial modifications to images may impact image classification. Here (figure 9.1) is an image taken from Su, Vargas, and Sakurai (2019). The original image class is in black and the classifier choice (and confidence) after adding a single unusual pixel are shown in blue, with the extraneous pixel in white. The images were deliberately processed at low resolution—hence the pixellation—to match the input requirement of a popular image classification program.

Figure 9.1: Examples of misclassification.

The authors experimented with algorithms to find the quickest single-pixel change that would deceive an image classifier. They were routinely able to fool the recognition software. In this example, the deception was deliberate; the researchers searched for the best place to change the image.

Bias and mistakes

We have seen a major change in the way we do machine learning, and there are real dangers involved. The current enthusiasm for neural nets risks the use of processes which cannot be understood, as Mahjoubi warned, and which can thus conceal methods we would not approve of, such as discrimination in lending or hiring. Cathy O’Neil has described this in her book Weapons of Math Destruction (2016).

There is much research today that seeks methods to explain what neural nets are doing. See Guidotti et al. (2018) for a survey. There is also a 2018 DARPA program on “Explainable AI.” Techniques used can include looking at the results over a range of input data and seeing if the neural net can be modeled by a decision tree, or modifying the input data to see which input elements have the greatest effect on the results, and then showing that to the user. For example, Mariusz Bojarski et al. describe a self-driving system that highlights what it thinks is important in what it is seeing (2017). However, this is generally research in progress, and it raises the question of whether we can trust the explanation generator.

Many popular magazines have discussed this problem; Forbes, for example, had an explanation of how the choice of datasets can produce a biased result without any deliberate attempt to do so (Taulli 2019). Similarly, the New York Times discussed the way groups of primarily young white men will build systems that focus on their data, and give wrong or discriminatory answers in more general situations (Tugend 2019). The MIT Media Lab hosts the Algorithmic Justice League, trying to stop organizations from building socially slanted systems. Similar thoughts come from groups like the Data and Society Research Institute or the AI Now Institute.

Again, the problems may be accidental or deliberate. The phrase “data poisoning” has been used to suggest malicious creation of training data or examples of data designed to deceive machine learning systems. There is now a DARPA research program, “Guaranteeing AI Robustness against Deception (GARD),” supporting research to learn how to stop trickery such as a demonstration of converting a traffic stop sign to a 45 mph speed limit with a few stickers (Eykholt et al. 2018). More generally, bias in systems deciding whether to grant loans may be discriminatory but nevertheless profitable.

Even if you want to detect AI mistakes, recognizing such problems is difficult. Often things will be wrong and we won’t know why. And even hypothetical (but perhaps erroneous) explanations can be very convincing; people easily believe plausible stories. I routinely give my students a paper that concludes that prior ownership of a cat prevents fatal myocardial infarctions; its result implies that cats are more protective than statin drugs (Qureshi et al. 2009). The students are very quick to come up with possibilities like “petting a cat is relaxing, relaxation reduces your blood pressure, and lower blood pressure decreases the risk of heart attacks.” Then I have to explain that the paper evaluates 32 possibilities (prior/current ownership × cats/dogs × 4 medical conditions × fatal/nonfatal) and you shouldn’t be surprised if you evaluate 32 chances and one is significant at the 0.05 level, which is only 1 in 20. (Indeed, with 32 independent tests at the 0.05 level, the chance of at least one spurious “significant” result is 1 − 0.95^32, or roughly 80%.) In this example, there is also the question of reverse causality: perhaps someone who is in ill health will decide he is too sick to take care of a pet, so that the poor health is not caused by the lack of a cat, but rather the poor health causes the absence of a cat.

Sometimes explanations can help, as in a machine learning program that was deliberately trained to distinguish images of wolves and dogs but was trained using pictures of wolves that always contained snow and pictures of dogs that never did (Ribeiro, Singh, and Guestrin 2016). Without explaining that, 10 of 27 subjects thought the classifier was trustworthy; after pointing out the snow only 3 of 27 subjects believed the system. Usually you don’t get such a clear presentation of a mis-trained system.

Recognition of problems

Can we tell when something is wrong? Here’s the result of a Google Photo merge of three other photos: two landscapes and a picture of somebody’s friend. The software was told to make a panorama and stitched the images together (Peng 2018). It looks like a joke, and even made it into a list of top jokes on reddit. The author’s point was that the panorama system didn’t understand basic composition: people are not the same scale as mountains.

Figure 9.2: Panoramic landscape.

Often, machine learning results are overstated. Google Flu Trends was acclaimed for several years and then turned out to be undependable (Lazer et al. 2014). A study that attempted to compare the performance of machine learning systems for medical diagnosis with actual doctors found that of over 20,000 papers analyzed, only a few dozen had data suitable for an evaluation (Liu et al. 2019). The results claimed comparable accuracy, but virtually none of the papers presented adequate data to support that conclusion.

Unusually promising results are sometimes the result of overfitting (Brownlee 2018); this is what was wrong with Google Flu Trends. A machine learning program can learn a large number of special cases and then find that the results do not generalize. In other cases problems can result when using “clean” data for training, and then encountering messier data in applications. Ideally, training and testing data should be from the same dataset and divided at random, but it can be tempting to start off with examples that are the result of initial and higher quality data collection.
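A random division is a one-liner in most toolkits; here is a minimal sketch, assuming scikit-learn and stand-in observations and labels:

    # Divide one dataset into training and testing portions at random,
    # holding out 20% of the observations for testing only.
    from sklearn.model_selection import train_test_split

    X = list(range(100))    # stand-in observations
    y = [n % 2 for n in X]  # stand-in labels

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    print(len(X_train), len(X_test))  # prints: 80 20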

Sometimes in the past we had a choice between modeling and data for predictions. Consider, for example, the problem of guessing what the weather will be tomorrow. We now do this based on a model of the atmosphere that uses the Navier-Stokes equations; we use supercomputers and derive tomorrow’s atmosphere from today’s (Christensen 2015). What did we do before we had supercomputers? Solving those equations by hand is impractical. One of the methods was “prediction by analogy”: find some day in the past whose weather was most similar to today. Suppose that day is Oct. 20, 1970. Then use October 21, 1970 as tomorrow’s prediction. Prediction by analogy doesn’t require you to have a model or use advanced mathematics. In this case, however, it doesn’t work as well—partly because we don’t have enough past days to choose from, and we only get new days at the rate of one per day.
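For illustration only, prediction by analogy reduces to a few lines of code; the sketch below assumes Python with numpy and an invented temperature history:

    # A toy "prediction by analogy": find the past day most similar to today
    # and predict that its following day repeats tomorrow.
    import numpy as np

    temps = np.array([61., 58., 64., 70., 66., 63., 59., 62., 65.])  # invented history
    today = temps[-1]

    past = temps[:-2]  # candidate analog days (each needs a known next day)
    analog = np.argmin(np.abs(past - today))
    print(f"closest analog: day {analog} ({past[analog]} degrees)")
    print(f"tomorrow's forecast: {temps[analog + 1]} degrees")

With only nine days of history the analog is a poor match, which is precisely the point of the estimate below.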

In fact, Huug van den Dool estimated the number of days of data needed to make accurate predictions as 10³⁰ years, which is far more than the age of the universe (Wilks 2008). The underlying problem is that the weather is very random. If your state lottery is properly run, it should be completely pointless to look at past winning numbers and try to guess the next one. The weather is not that random but it has too much variation to be solved easily by analogy. If your problem is very simple (tic-tac-toe) you could indeed write down each position and what the best next move is; there are only about 255,000 games.

To deal with more realistic problems, much of machine learning research is now focused on obtaining larger training sets. Instead of trying to learn more about the characteristics of a system that is being modeled, the effort is driven by the dictum, “more data beats better algorithms” (see Halevy, Norvig, and Pereira 2009). In a review of the history of speech recognition, Xuedong Huang, James Baker, and Raj Reddy write, “The power of these systems arises mainly from their ability to collect, process, and learn from very large datasets. The basic learning and decoding algorithms have not changed substantially in 40 years” (2014). Nevertheless, speech recognition has gone from frustration to useful products such as dictation software or home appliances.

Lacking a model, however, means that we won’t know the limits of the calculations being done. For example, if you have some data that looks quadratic, but you fit a linear model, any attempt at extrapolation is fraught with error. If you are using a “black box” system, you don’t know when this is happening. And, regrettably, many of the AI software systems are sold as black boxes where the purchasers and users do not have access to the process, even if they are imagined to be able to understand it.

What’s changing

Many AI researchers are sensitive to the risks, especially given the publicity over self-driving cars. As the hype over “deep learning” built up, writers discussed examples such as a Pittsburgh medical system that proposed to send patients with both pneumonia and asthma home, because the computer had not understood that patients with both problems were actually being sent to the ICU (Bornstein 2016; Caruana et al. 2015).


Many people work on ways of explaining or presenting neural net software (Harley 2015). Most important, perhaps, are new EU regulations that prohibit automated decision making that affects EU citizens, and provide a “right of explanation” (Metz 2016).

We recognize that systems which don’t rely on a mathematical model may be cheaper to build than ones where the coders understand what is going on. More serious is that they may be more accurate. The image in figure 9.3 is from the same article on understandability (Bornstein 2016).

Figure 9.3: Explainability.

If there really is a tradeoff between what will solve the problem and what can be explained, we know that many system builders will choose to solve the problem. And yet even having explanations may not be an answer; a key paper on interpretability discusses the complexities of meaning related to explanation, causality, and modeling (Lipton 2018).

Arend Hintze has noted that we do not always impose a demand for explanation on people. I can write that the New York Public Library main building is well proportioned and attractive without anyone expecting that I will recite its dimensions or the source of the marble used to construct it. And for some problems that’s fine: I don’t care how my camera decides on the focus distance to the subject. Where it matters, however, we often want explanations; the hard ethical problem, as noted before, is if better performance can be achieved in an inexplicable way.

Recommendations

2017 saw the publication of the “Asilomar AI principles” (2017). Two of these principles are:

• Safety: AI systems should be safe and secure throughout their operational lifetime, and verifiably so where applicable and feasible.

• Failure Transparency: If an AI system causes harm, it should be possible to ascertain why.

The problem is that the technology used to build many systems does not enable verifiability and explanation. Similarly the World Economic Forum calls for protection against discrimination but notes many ways in which technology can have unanticipated and undesirable effects as a result of machine learning (“How to Prevent” 2018).


Historically there has been and continues to be too much hype. An important image recognition task is distinguishing malignant and benign spots on mammograms. There have been promises for decades that computers would do this better than radiologists. Here are examples from 1995 (“computer-aided diagnosis can improve radiologists’ observational performance”) (Schmidt and Nishikawa) and 2009 (“The Bayesian network significantly exceeded the performance of interpreting radiologists”) (Burnside et al.). A typical recent AI paper to do this with convolutional neural nets reports 90% accuracy (Singh et al. 2020). To put this in perspective, the problem is complex, but some examples are more straightforward, and even pigeons can reach 85% (Levenson et al. 2015). A serious recent review is “Diagnostic Accuracy of Digital Screening Mammography With and Without Computer-Aided Detection” (Lehman et al. 2015). Very recently there was another claim that computers have surpassed radiologists (Walsh 2020); we will have to await evaluation. As with many claims of medical progress, replicability and evaluation are needed before doctors will be willing to believe them.

What should we do? Software testing generally is a decades-old discipline, and many basic principles of regression testing apply here also:

• Test data should cover the full range of expected input.

• Test data should also cover unexpected and even illegal input.

• Test data should include known past failures believed cleared up.

• Test data should exercise all parts of the program, and all important paths (coverage).

• Test data should include a set of data which is representative of the distribution of actual data, to be used for timing purposes.

It is difficult to apply these ideas in parts of the AI world. If the allowed input is speech, there is no exhaustive list of utterances which can be sampled. If a black-box commercial machine learning package is being used, there is no way to ask about coverage of any number of test cases. If a program is constantly learning from new data, there is no list of previously fixed failures to be collected that reflects the constantly changing program.

And obviously the circumstances of use matter. We may well, as a society, decide that forcing banks evaluating loan applications to use decision trees instead of deep learning is appropriate, so that we know whether illegal discrimination is going on, even if this raises the costs to the banks. We might also believe that the safest possible railway operation is important, even if the automated train doesn’t routinely explain how it balanced its choices of acceleration to achieve high punctuality and low risk.

What would I suggest? Organizationally:

• Have teams including both the computer scientists and the users.

• Collaborate with a statistician: they’ve seen a lot of these problems before.

• Work on easier problems.

As examples, I watched a group of zoologists with a group of computer scientists discussing how to improve accuracy at identifying animals in photographs. The discussion indicated that you needed hundreds of training examples at a minimum, if not thousands, since the animals do not typically walk up to the camera and pose for a full-frame shot. It was important to have both the people who understood the learning systems and the people who knew what the pictures were realistically like. The most amusing contribution by a statistician happened when a computer scientist offered a program that tried to recognize individual giraffes, and a zoologist complained that it only worked if you had a view of the right-hand side of the giraffe. Somebody who knew statistics perked up and said “it’s a 50% chance of recognizing the animal? I can do the math for that.” And it is simpler to do “is there any animal in the picture?” before asking “which animal is it?” and create two easier problems.

Technically:

• Try to interpolate rather than extrapolate: use the algorithm on points “inside” the training set (thinking in multiple dimensions).

• Lean towards feature detection and modeling rather than completely unsupervised learning.

• Emphasize continuous rather than discrete variables.

I suggest using methods that involve feature detection, since that tells you what the algorithm is relying on. For example, consider the Google Flu Trends failure; the public was not told what terms were used. As David Lazer noted, some of them were just “winter” terms (like ‘basketball’). If you know that, you might be skeptical. More significant are decisions like jail sentences or college admissions; that racial or religious discrimination played no part can be verified by knowing that the program did not use those attributes. Knowing what features were used can sometimes help the user: if you know that your loan application was downrated because of your credit score, it may be possible for you to pay off some bill to raise the score.

Sometimes you have to use categorical variables (what county do you live in?) but if you have a choice of how you phrase a variable, asking something like “how many minutes a day do you spend reading?” is likely to produce a better fit than asking people to choose “how much do you read: never, sometimes, a lot?” A machine learning algorithm may tell you how much of the variance each input variable explains; you can use that information to focus on the variables that are most important to your problem, and decide whether you think you are measuring them well enough.

Why not extrapolate? Sadly, as I write this in early April 2020, we are seeing all sorts of extrapolations of the COVID-19 epidemic, with expected US deaths ranging from 30,000 to 2 million, as people try to fit various functions (Gaussians, logistic regression, or whatever) with inadequately precise data and uncertain models. A simpler example is Mark Twain’s: “In the space of one hundred and seventy-six years the Lower Mississippi has shortened itself two hundred and forty-two miles. That is an average of a trifle over one mile and a third per year. Therefore, any calm person, who is not blind or idiotic, can see that in the ‘Old Oolitic Silurian Period,’ just a million years ago next November, the Lower Mississippi River was upwards of one million three hundred thousand miles long, and stuck out over the Gulf of Mexico like a fishing-rod. And by the same token any person can see that seven hundred and forty-two years from now the Lower Mississippi will be only a mile and three-quarters long, and Cairo and New Orleans will have joined their streets together, and be plodding comfortably along under a single mayor and a mutual board of aldermen” (1883).


Finally, note the advice of Edgar Allan Poe: “Believe nothing you hear, and only one half that you see.”

References

“AI Recognition Fooled by Single Pixel Change.” BBC News, November 3, 2017. https://www.bbc.com/news/technology-41845878.

“Asilomar AI Principles.” 2017. https://futureoflife.org/ai-principles/.

Bojarski, Mariusz, Larry Jackel, Ben Firner, and Urs Muller. 2017. “Explaining How End-to-End Deep Learning Steers a Self-Driving Car.” NVIDIA Developer Blog. https://devblogs.nvidia.com/explaining-deep-learning-self-driving-car/.

Bornstein, Aaron. 2016. “Is Artificial Intelligence Permanently Inscrutable?” Nautilus 40 (1). http://nautil.us/issue/40/learning/is-artificial-intelligence-permanently-inscrutable.

Brownlee, Jason. 2018. “The Model Performance Mismatch Problem (and What to Do about It).” Machine Learning Mastery. https://machinelearningmastery.com/the-model-performance-mismatch-problem/.

Burnside, Elizabeth S., Jessie Davis, Jagpreet Chhatwal, Oguzhan Alagoz, Mary J. Lindstrom, Berta M. Geller, Benjamin Littenberg, Katherine A. Shaffer, Charles E. Kahn, and C. David Page. 2009. “Probabilistic Computer Model Developed from Clinical Data in National Mammography Database Format to Classify Mammographic Findings.” Radiology 251 (3): 663–72.

Caruana, Rich, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. 2015. “Intelligible Models for Health Care: Predicting Pneumonia Risk and Hospital 30-day Readmission.” In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’15), 1721–30. New York: ACM Press. https://doi.org/10.1145/2783258.2788613.

Christensen, Hannah. 2015. “Banking on better forecasts: the new maths of weather prediction.” The Guardian, January 8, 2015. https://www.theguardian.com/science/alexs-adventures-in-numberland/2015/jan/08/banking-forecasts-maths-weather-prediction-stochastic-processes.

Eykholt, Kevin, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Florian Tramèr, Atul Prakash, Tadayoshi Kohno, and Dawn Song. 2018. “Physical Adversarial Examples for Object Detectors.” 12th USENIX Workshop on Offensive Technologies (WOOT 18).

Guidotti, Riccardo, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. 2018. “A Survey of Methods for Explaining Black Box Models.” ACM Computing Surveys 51 (5): 1–42.

Halevy, Alon, Peter Norvig, and Fernando Pereira. 2009. “The Unreasonable Effectiveness of Data.” IEEE Intelligent Systems 24 (2).

Harley, Adam W. 2015. “An Interactive Node-Link Visualization of Convolutional Neural Networks.” In Advances in Visual Computing, edited by George Bebis et al., 867–77. Lecture Notes in Computer Science. Cham: Springer International Publishing.

“How to Prevent Discriminatory Outcomes in Machine Learning.” 2018. White Paper from the Global Future Council on Human Rights 2016–2018, World Economic Forum. https://www.weforum.org/whitepapers/how-to-prevent-discriminatory-outcomes-in-machine-learning.


Huang, Xuedong, James Baker, and Raj Reddy. 2014. “A Historical Perspective of Speech Recognition.” Communications of the ACM 57 (1): 94–103.

Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science 343 (6176): 1203–1205.

Lehman, Constance, Robert Wellman, Diana Buist, Karl Kerlikowske, Anna Tosteson, and Diana Miglioretti. 2015. “Diagnostic Accuracy of Digital Screening Mammography with and without Computer-Aided Detection.” JAMA Internal Medicine 175 (11): 1828–1837.

Levenson, Richard M., Elizabeth A. Krupinski, Victor M. Navarro, and Edward A. Wasserman. 2015. “Pigeons (Columba livia) as Trainable Observers of Pathology and Radiology Breast Cancer Images.” PLoS One, November 18, 2015. https://doi.org/10.1371/journal.pone.0141357.

Lipton, Zachary. 2018. “The Mythos of Model Interpretability.” Communications of the ACM 61 (10): 36–43.

Liu, Xiaoxuan et al. 2019. “A Comparison of Deep Learning Performance against Health-Care Professionals in Detecting Diseases from Medical Imaging: a Systematic Review and Meta-Analysis.” Lancet Digital Health 1 (6): e271–97. https://www.sciencedirect.com/science/article/pii/S2589750019301232.

Mahjoubi, Mounir. 2018. “Assemblée nationale, XVe législature. Session ordinaire de 2017–2018.” Compte rendu intégral, Deuxième séance du mercredi 07 février 2018. http://www.assemblee-nationale.fr/15/cri/2017-2018/20180137.asp.

Metz, Cade. 2016. “Artificial Intelligence Is Setting Up the Internet for a Huge Clash with Europe.” Wired, July 11, 2016. https://www.wired.com/2016/07/artificial-intelligence-setting-internet-huge-clash-europe/.

O’Neil, Cathy. 2016. Weapons of Math Destruction. New York: Crown.

Peng, Tony. 2018. “2018 in review: 10 AI failures.” Medium, December 10, 2018. https://medium.com/syncedreview/2018-in-review-10-ai-failures-c18faadf5983.

Qureshi, A. I., M. Z. Memon, G. Vazquez, and M. F. Suri. 2009. “Cat Ownership and the Risk of Fatal Cardiovascular Diseases. Results from the Second National Health and Nutrition Examination Study Mortality Follow-up Study.” Journal of Vascular and Interventional Neurology 2 (1): 132–5. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3317329.

Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. 2016. “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16), 1135–1144. New York: ACM Press.

Schmidt, R. A., and R. M. Nishikawa. 1995. “Clinical Use of Digital Mammography: the Present and the Prospects.” Journal of Digital Imaging 8 (1 Suppl 1): 74–9.

Singh, Vivek Kumar et al. 2020. “Breast Tumor Segmentation and Shape Classification in Mammograms Using Generative Adversarial and Convolutional Neural Network.” Expert Systems with Applications 139.

Su, Jiawei, Danilo Vasconcellos Vargas, and Kouichi Sakurai. 2019. “One Pixel Attack for Fooling Deep Neural Networks.” IEEE Transactions on Evolutionary Computation 23 (5): 828–841.

Taulli, Tom. 2019. “How Bias Distorts AI (Artificial Intelligence).” Forbes, August 4, 2019. https://www.forbes.com/sites/tomtaulli/2019/08/04/bias-the-silent-killer-of-ai-artificial-intelligence/#1cc6f35d7d87.

Twain, Mark. 1883. Life on the Mississippi. Boston: J. R. Osgood & Co.


Tugend, Alina. 2019. “The Bias Embedded in Tech.” The New York Times, June 17, 2019, section F, 10.

Walsh, Fergus. 2020. “AI ‘outperforms’ doctors diagnosing breast cancer.” BBC News, January 2, 2020. https://www.bbc.com/news/health-50857759.

Wilks, Daniel S. 2008. Review of Empirical Methods in Short-Term Climate Prediction, by Huug van den Dool. Bulletin of the American Meteorological Society 89 (6): 887–88.

Chapter 10

Bringing Algorithms and Machine Learning Into Library Collections and Services

Eric Lease Morgan
University of Notre Dame

Seemingly revolutionary changes

At the time of their implementation, some changes in the practice of librarianship were deemed revolutionary, but now-a-days some of these same changes are deemed matter of fact. Take, for example, the catalog. During much of the Middle Ages, a catalog was more akin to a simple acquisitions list. By 1548 the first author, title, subject catalog was created (LOC 2017, 18). These catalogs morphed into books, books which could be mass produced and distributed. But the books were difficult to keep up to date, and they were expensive to print. As a consequence, in the early 1860s, the card catalog was invented by Ezra Abbot, and the catalog eventually became a massive set of drawers (82). Unfortunately, because of the way catalog cards are produced, it is not feasible to assign more than three or four subject headings to any given book. If one does, then the number of catalog cards quickly gets out of hand.

In the 1870s, the idea of sharing catalog cards between libraries became common, and the Library of Congress facilitated much of the distribution (LOC 2017, 87). In 1965 and with the advent of computers, the idea of sharing cataloging data as MARC (machine readable cataloging) became prevalent (Crawford 1989, 204). The data structure of a MARC record is indicative of the time. Intended to be distributed on reel-to-reel tape, the MARC record is a sequential data structure designed to be read from beginning to end, complete with checks and balances ensuring the record’s integrity. Despite the apparent flexibility of a digital data structure, the tradition of three or four subject headings per book still holds true. Now-a-days, the data from MARC records is used to fill databases, the databases’ content is indexed, and items from the library collection are located by searching the index. The evolution of the venerable library catalog has spanned centuries, each evolutionary change solving some problems but creating new ones.

With the advent of the Internet, a host of other changes are (still) happening in libraries. Some of them are seen as revolutionary, and only time will tell whether or not these changes will persevere. Examples include but are not limited to:

• the advocacy of alt-metrics and open access publications

• the continuing dichotomy of the virtual library and library as place

• the creation and maintenance of institutional repositories

• the existence of digital scholarship centers

• the increasing tendency to license instead of own content

Many of the traditional roles of libraries are not as important as they used to be. That does not mean the roles are unimportant, just not as important. Like many other professions, librarianship is exploring new ways to remain relevant when many of its core functions are needed by fewer people.

Working smarter, not harder

Beyond automation, librarianship has not exploited computer technology. Despite the fact that libraries have the world of knowledge at their fingertips, libraries do not operate very intelligently, where “intelligently” is an allusion to artificial intelligence.

Let’s enumerate the core functionalities of computers. First of all, computers…compute. They are given some sort of input, assign the input to a variable, apply any number of functions to the variable, and output the result. This process — computing — is akin to solving simple algebraic equations such as the area of a circle or a distance traveled. There are two factors of particular interest here. First, the input can be as simple as a number or a string (read: “a word”) or the input can be arbitrarily large combinations of both. Examples include:

• 42

• 1776

• xyzzy

• George Washington

• a MARC record

• the circulation history and academic characteristics of an individual

• the full text and bibliographic descriptions of all early American authors


What is really important is the possible scale of a computer’s input. Libraries have not taken advantage of that scale. Imagine how librarianship would change if the profession actively used the full text of its collections to enhance bibliographic description and resulting public service. Imagine how collection policies and patron needs could be better articulated if: 1) students, researchers, or scholars first opted-in to have their records analyzed, and 2) the totality of circulation histories and journal usage histories were thoroughly investigated in combination with patron characteristics and data from other libraries.

A second core functionality of computers is their ability to save, organize, and retrieve vast amounts of data. More specifically, computers save “data” — mere numbers and strings. But when the data is given context, such as a number denoted as date or a string denoted as a name, then the data is transformed into information. An example might include the birth year 1972 and the name of my pet, Blake. Given additional information, which may be compared and contrasted with other information, knowledge can be created — information put to use and understood. For example, Mary, my sister, was born in 1951 and is therefore 21 years older than Blake. Computers excel at saving, organizing, and retrieving data which leads to information and knowledge. The possibilities of computers dispensing wisdom — knowledge of a timeless nature — are left for another essay.

Like the scale of computer input, the library profession has not really exploited computers’ ability to save, organize, and retrieve data; on the whole, the library profession does not understand the concept of a “data structure.” For example, tab-delimited files, CSV (comma-separated value) files, relational database schema, XML files, JSON files, and the content of email messages or HTTP server responses are all examples of different types of data structures. Each has its own set of inherent strengths and weaknesses; there is no such thing as “One size fits all.” Through the use of data structures, computers store and retrieve information. Librarianship is about these same kinds of things, yet few librarians would be able to outline the differences between different data structures.

Again, data becomes information when it is given context. In the world of MARC, when a string (one or more “words”) is inserted into the 245 field of a MARC bibliographic record, then the string is denoted as a title. In this case, MARC is a “data structure” because different fields denote different contexts. There are fields for authors, subjects, notes, added entries, etc. This is all very well and good, especially considering that MARC was designed more than fifty years ago. But since then, many more scalable, flexible, and efficient data structures have been designed.

Relational databases are a good example. Relational databases build on a classic data structure known as the “table” — a matrix of rows and columns where each row is a record and each column is a field. Think “spreadsheet.” For example, each row may represent a book, with columns for authors, titles, dates, publishers, etc. The problem comes when a column needs to be repeatable. For example, a book may have multiple authors or more commonly, multiple subjects. In this case the idea of a table breaks down because it doesn’t make sense to have a column named subject-01, subject-02, and subject-03. As soon as you do that, you will want subject-04. Relational databases solve this problem. The solution is to first add a “key” — a unique value — to each row. Next, for fields with multiple values, create a new table where one of the columns is the key from the first table and the other column is a value, in this case, a subject heading. There are now two tables and they can be “joined” through the use of the key. Given such a data structure it is possible to add as many subjects as desired to any bibliographic item.
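As a minimal sketch of this two-table design, using Python’s built-in sqlite3 module and invented sample data:

    # Two tables joined through a key, so a book may have any number of subjects.
    import sqlite3

    con = sqlite3.connect(":memory:")
    cur = con.cursor()

    cur.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT)")
    cur.execute("CREATE TABLE subjects (book_id INTEGER, subject TEXT)")

    cur.execute("INSERT INTO books VALUES (1, 'Life on the Mississippi')")
    cur.executemany("INSERT INTO subjects VALUES (?, ?)",
                    [(1, "Mississippi River"), (1, "Steamboats"), (1, "Twain, Mark")])

    # Join the tables through the key.
    for row in cur.execute("""SELECT books.title, subjects.subject
                              FROM books JOIN subjects
                              ON books.id = subjects.book_id"""):
        print(row)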

But you say, “MARC can handle multiple subjects.” True, MARC can handle multiple subjects, but underneath, MARC is a data structure designed for when information was disseminated on tape. As such, it is a sequential data structure intended to be read from beginning to end. It is not a random access structure. What’s more, the MARC data structure is really divided into three substructures: 1) the leader, which is always twenty-four characters long, 2) the directory, which denotes where each bibliographic field exists, and 3) the bibliographic section where the bibliographic information is actually stored. It gets more complicated. The first five characters of the leader are expected to be a left-hand, zero-padded integer denoting the length of the record measured in bytes. A typical value may be 01999. Thus, the record is 1999 bytes long. Now, ask yourself, “What is the maximum size of a MARC record?” Despite the fact that librarianship embraces the idea of MARC, very few librarians really understand the structure of MARC data. MARC is a format for transmitting data from one place to another, not for organization.
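A few lines of Python make the arithmetic of the leader concrete; the leader string below is hypothetical:

    # Reading the record length from a MARC leader.
    leader = "01999nam a2200469 i 4500"  # a hypothetical 24-character leader

    record_length = int(leader[0:5])  # the first five characters, zero-padded
    print(record_length)              # prints: 1999

Five digits also answer the question above: a MARC record can be no longer than 99999 bytes.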

Moreover, libraries offer more than bibliographic information. There is information about people and organizations. Information about resource usage. Information about licensing. Information about resources that are not bibliographic, such as images or data sets. Etc. When these types of information present themselves, libraries fall back to the use of simple tables, which are usually not amenable to turning data into information. There are many different data structures. XML became popular about twenty years ago. Since then JSON has become prevalent. More than twenty years ago the idea of Linked Data was presented. All of these data structures have various strengths and weaknesses. None of them is perfect, and each addresses different needs, but they are all better than MARC when it comes to organizing data. Libraries understand the concept of manifesting data as information, but as a whole, libraries do not manifest the concept using computer technology.

Finally, another core functionality of computers is networking and communication. The advent of the Internet is a relatively recent phenomenon, and the ubiquitous nature of computers combined with other “smart” devices has facilitated literally billions of connections between computers (and people). Consequently the data computed upon and stored in one place can be transmitted almost instantly to another place, and the transmission is an exact copy. Again, like the process of computing and the process of storage, efficient computer communication builds upon itself with unforeseen consequences. For example, who predicted the demise of many centralized information authorities? With the advent of the Internet there is less of a need/desire for travel agents, movie reviewers, or dare I say it, libraries.

Yet again, libraries use the Internet, but do they actually exploit it? How many librarians are able to create a file, put it on the Web, and share the resulting URL? Granted, centralized computing departments and networking administrators put up road blocks to doing such things, but the sharing of data and information is at the core of librarianship. Putting a file on the ’Net, even temporarily, is something every librarian ought to know how (and be authorized) to do.

Despite the functionality of computers and their place in libraries over the past fifty to sixty years, computers have mostly been used to automate library tasks. MARC automated the process of printing catalog cards and eventually the creation of “discovery systems.” Libraries have used computers to automate the process of lending materials between themselves as well as to local learners, teachers, and scholars. Libraries use computers to store, organize, preserve, and disseminate the gray literature of our time, and we call these systems “institutional repositories.” In all of these cases, the automation has been a good thing because efficiencies were gained, but the use of computers has not gone far enough nor really evolved. Lending and usage statistics are not routinely harvested nor organized for the purposes of monitoring and predicting library patron needs/desires. The content of institutional repositories is usually born digital, but libraries have not exploited their full text nature nor created services going beyond rudimentary catalogs.

Computers can do so much more for libraries than mere automation. While I will never say computers are “smart,” their fundamental characteristics do appear intelligent, especially when used at scale. The scale of computing has significantly changed in the past ten years, and with this change the concept of “machine learning” has become more feasible. The following sections outline how libraries can go beyond automation, embrace machine learning, and truly evolve their ideas of collections and services.

Machine learning: what it is, possibilities, and use cases

Machine learning is a computing process used to make decisions and predictions. In the past, computer-aided decision-making and predictions were accomplished by articulating large sets of if-then statements and navigating down decision trees. The applications were extremely domain specific, and they weren't very scalable. Machine learning turns this process on its head. Instead of navigating down a tree, machine learning takes sets of previously made observations (think "decisions"), identifies patterns and anomalies in the observations, and saves the result as a mathematical model, which is really an n-dimensional array of vectors. Outside observations are then compared to the model, and depending on the resulting similarities or differences, decisions or predictions are drawn.

Using such a process, there are really only four different types of machine learning: classification, clustering, regression, and dimension reduction. Classification is a supervised machine learning process used to subdivide a set of observations into smaller sets which have been previously articulated. For example, suppose you had a few categories of restaurants such as American, French, Italian, or Chinese. Given a set of previously classified menus, one could create a model defining each category and then classify new, unseen menus. The classic classification example is the filtering of email. "Is this message 'spam' or 'ham'?" This chapter's appendix walks a person through the creation of a simplified classification system. It classifies texts based on authorship.

Clustering is almost always an unsupervised machine learning process which also creates smaller sets from a larger one, but clustering is not given a set of previously articulated categories. That is what makes it "unsupervised." Instead, the categories are created as an end result. Topic modeling is a popular example of clustering.
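To make the distinction concrete, here is a minimal sketch of unsupervised clustering, assuming scikit-learn (the same library used in this chapter's appendix) and four tiny, made-up documents; no categories are supplied in advance:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# four tiny, made-up documents; no labels are given
documents = [ 'pasta pizza espresso', 'dumplings noodles tea',
              'lasagna risotto espresso', 'noodles rice tea' ]

# count/tabulate the documents, and then cluster them into two sets
vectors  = TfidfVectorizer().fit_transform( documents )
clusters = KMeans( n_clusters=2, n_init=10, random_state=0 ).fit_predict( vectors )

# the categories (here, 0 and 1) are created as an end result
print( clusters )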

Regression predicts a numeric value based on sets of independent variables. For example, given independent variables like annual income, education level, size of family, age, gender, religion, and employment status, one might predict how much money a person may spend on a dependent variable such as charitable giving.
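A minimal regression sketch, again assuming scikit-learn and entirely made-up numbers:

from sklearn.linear_model import LinearRegression

# independent variables (annual income, age) and a dependent
# variable (dollars given to charity); the data are invented
features = [ [ 40000, 25 ], [ 60000, 40 ], [ 80000, 50 ], [ 100000, 65 ] ]
giving   = [ 400, 900, 1400, 2100 ]

# fit a model, and then predict giving for a previously unseen person
model = LinearRegression().fit( features, giving )
print( model.predict( [ [ 70000, 45 ] ] ) )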

Sometimes the number of characteristics of each observation is very large. Many times some of these characteristics do not play a significant role in decision-making or prediction. Dimension reduction is another machine learning process, and it is used to eliminate these less-than-useful characteristics from the observations. This process simplifies classification, clustering, or regression.
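For example, here is a sketch of dimension reduction using principal component analysis, one of several possible techniques; the numbers are made up:

from sklearn.decomposition import PCA

# each observation has four characteristics
observations = [ [ 2.5, 2.4, 0.1, 1.0 ],
                 [ 0.5, 0.7, 0.2, 0.9 ],
                 [ 2.2, 2.9, 0.1, 1.1 ],
                 [ 1.9, 2.2, 0.3, 1.0 ] ]

# reduce each observation to its two most informative dimensions
print( PCA( n_components=2 ).fit_transform( observations ) )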

Some possible use cases

There are many possible ways to enhance library collections and services through the use of machine learning. I'm not necessarily advocating the implementation of any of the following ideas, but they are possibilities. Each is grouped into the broadest of library functional departments:

• reference and public services

– given a set of grant proposals, suggest library resources be used in support of the grants

– given a set of licensed library resources and their usage, suggest other resources for use

– given a set of previously checked out materials, suggest other materials to be checked out

– given a set of reference interviews, create a chatbot to supplement reference services

– given the full text of a set of desirable journal articles, create a search strategy to be applied against any number of bibliographic indexes; answer the proverbial question, "Can you help me find more like this one?"

– given the full text of articles as well as their bibliographic descriptions, predict and describe the sorts of things a specific journal title accepts or whether a given draft is good enough for publication

– given the full text of reading materials assigned in a class, suggest library resources to support them

• technical services

– given a set of multimedia, enumerate characteristics of the media (number of faces, direction of angles, number and types of colors, etc.), and use the results to supplement bibliographic description

– given a set of previously cataloged items, determine whether or not the cataloging can be improved

– given full-text content harvested from just about anywhere, analyze the content in terms of natural language processing, and supplement bibliographic description

• collections

– given circulation histories, articulate more refined circulation patterns, and use the results to refine collection development policies

– given the full text of sets of theses and dissertations, predict where scholarship at your institution is growing, and use the results to more intelligently build your just-in-case collection; do the same thing with faculty publications

Implementing any of these possible use cases would necessarily be a collaborative effort. Implementation requires an array of expertise. Enumerated in no priority order, this expertise includes: subject/domain expertise (such as cataloging trends, circulation services, collection strategies, etc.), computer programming and data management skills (such as Python, R, relational databases, JSON, etc.), and statistical modeling (an understanding of the strengths and weaknesses of different machine learning algorithms). The team would then need to:

1. articulate and share a common goal for the work

2. amass the data to model

3. employ a feature extraction process (lower-case words, extract a value from a database, etc.)

4. vectorize the features

5. create and evaluate the resulting model

6. go to Step #2 until satisfied

7. put the model into practice

8. go to Step #1; this work is never done

For example, to bibliographically connect grant proposals to library resources, try this (a sketch follows the list):

1. use classification to sub-divide each of your bibliographic index descriptions

2. apply the resulting model to the full text of the grants

3. return a percentage score denoting the strength of each resulting classification

4. recommend the use of zero or more bibliographic indexes
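Here is a minimal sketch of that recipe, assuming the descriptions of each bibliographic index have been saved as .txt files in directories named after the indexes, and the grant proposals saved in a directory of their own (all hypothetical paths):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
import glob, os

# Step #1: model each index from its descriptions
data, labels = [], []
for directory in glob.glob( 'indexes/*' ) :
    for file in glob.glob( directory + '/*.txt' ) :
        data.append( open( file ).read() )
        labels.append( os.path.basename( directory ) )
vectorizer = TfidfVectorizer( stop_words='english' )
classifier = MultinomialNB().fit( vectorizer.fit_transform( data ), labels )

# Steps #2-4: score each grant against each index, and recommend
# zero or more indexes whose scores exceed an (arbitrary) threshold
for grant in glob.glob( 'grants/*.txt' ) :
    scores = classifier.predict_proba( vectorizer.transform( [ open( grant ).read() ] ) )[ 0 ]
    recommendations = [ ( label, round( score, 2 ) )
        for ( label, score ) in zip( classifier.classes_, scores ) if score > 0.25 ]
    print( os.path.basename( grant ), recommendations )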

To predict scholarship, try this (a sketch follows the list):

1. amass the full text and bibliographic descriptions of all theses and dissertations

2. topic model the full text

3. evaluate the resulting topics

4. go to Step #2 until satisfied

5. augment the model’s matrix of vectors with bibliographic description

6. pivot the matrix on any of the given bibliographic fields

7. plot the results to see possible trends over time, trends within disciplines, etc.

8. use the results to make decisions
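A sketch of that recipe, assuming a hypothetical metadata.csv file with "file" and "year" columns pointing at plain-text theses in a directory named theses, and assuming scikit-learn, pandas, and matplotlib are installed:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import pandas as pd
import matplotlib.pyplot as plt

# Step #1: amass the full text and bibliographic descriptions
metadata = pd.read_csv( 'metadata.csv' )
data = [ open( 'theses/' + file ).read() for file in metadata[ 'file' ] ]

# Steps #2-4: topic model the full text; re-run with different
# numbers of topics until the results seem to make sense
vectorizer = CountVectorizer( stop_words='english', max_features=5000 )
model = LatentDirichletAllocation( n_components=8, random_state=0 )
matrix = model.fit_transform( vectorizer.fit_transform( data ) )

# Steps #5-7: augment the matrix with bibliographic description
# (year), pivot on it, and plot the results to see trends over time
topics = pd.DataFrame( matrix, columns=[ 'topic %s' % i for i in range( 8 ) ] )
topics[ 'year' ] = metadata[ 'year' ]
topics.groupby( 'year' ).mean().plot()
plt.show()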

The content of the GitHub repository reproduced in this chapter's appendix describes how to do something very similar in method to the previous example.1

1 See https://github.com/ericleasemorgan/bringing-algorithms.


Some real-world use cases

Here at the University of Notre Dame's Navari Center for Digital Scholarship, we use machine learning in a number of ways. We cut our teeth on a system called Convocate.2 In this case we obtained a set of literature on the theme of human rights. Half of the set was written by researchers in non-governmental organizations. The other half was written by theologians. While both sets were on the same theme, the language of each was different. An excellent example is the use of the word "child." In the former set, children were included in documents about fathers and mothers. In the latter set, children often referred to the "Children of God." Consequently, queries referring to children were often misleading. To rectify this problem, a set of broad themes was articulated, such as Actors, Harms and Violations, Rights and Freedoms, and Principles and Values. We then used topic modeling to subdivide all of the paragraphs of all of the documents into smaller and smaller sets of paragraphs. We compared the resulting topics to the broad themes, and when we found correlations between the two, we classified the paragraphs accordingly. Because the process required a great deal of human intervention, and thus impeded subsequent updates, this process was not ideal, but we were learning and the resulting index is useful.

On a regular basis we find ourselves using a program called Topic Modeling Tool, which is a GUI/desktop application heavily based on the venerable MALLET suite of software.3 Given a set of plain text files and an integer, Topic Modeling Tool will create a weighted list of latent themes found in a corpus. Each theme is really a list of words which tend to cluster around each other, and these clusters are generated through the use of an algorithm called LDA (Latent Dirichlet Allocation). When it comes to topic modeling, there is no such thing as the correct number of topics. Just as in the traditional process of denoting what a corpus is about, there can be many distinct topics or there can be a few. Moreover, some of the topics may be large and others may be small. When using a topic modeler, it is important to iteratively configure and re-configure the input until the results seem to make sense.

Just like every other machine learning application, Topic Modeling Tool bases its "reasoning" on a matrix of vectors. Each row represents a document, and each column is a topic. At the intersection of a document row and a topic column is a score denoting how much the given document is "about" the calculated topic. It is then possible to sum each topic column and output a pie chart illustrating not only what the topics are, but how much of the corpus is about each topic. Such can be very insightful.

By adding metadata to the matrix of vectors, even more insights can be garnered. Suppose you have a set of plain text files. Suppose also you know the names of the authors of each file. You can then do topic modeling against your corpus, and when the modeling is complete you can add a new column to the matrix and call it authors. Next, you update the values in the authors column with author names. Finally, you "pivot" the matrix on the authors column to calculate the degree each author's works are "about" the calculated topics. This too can be quite insightful. Suppose you have works by authors A, B, C, and D. Suppose you have calculated topics I, II, III, and IV. By updating the matrix and pivoting the results, you might discover that author A discusses topic I almost exclusively, whereas author B discusses topics I, II, III, and IV in equal parts. This process works for just about any type of metadata: gender, genre, extent, dates, language, etc. What's more, Topic Modeling Tool makes this process almost trivial. To learn how, see the GitHub repository accompanying this chapter.4

2 See https://convocate.nd.edu.

3 See https://github.com/senderle/topic-modeling-tool for the Topic Modeling Tool. See http://mallet.cs.umass.edu for MALLET.
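The summing and pivoting just described take only a few lines of pandas (an assumption for illustration; Topic Modeling Tool itself works differently under the hood). The scores and author names below are invented:

import pandas as pd

# a toy document-topic matrix: one row per document, one column per topic
matrix = [ [ 0.90, 0.05, 0.03, 0.02 ],
           [ 0.25, 0.25, 0.25, 0.25 ],
           [ 0.10, 0.70, 0.10, 0.10 ] ]
topics = pd.DataFrame( matrix, columns=[ 'I', 'II', 'III', 'IV' ] )

# sum each topic column; plotting the sums as a pie chart is one
# more line with matplotlib installed
print( topics.sum() )

# add a metadata column, and then "pivot" (group) on it to see the
# degree each author's works are about each topic
topics[ 'authors' ] = [ 'A', 'B', 'A' ]
print( topics.groupby( 'authors' ).mean() )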

We have used classification techniques in at least a couple of ways. One project required the classification of press releases. Some press releases are deemed mandatory — declared necessary to publish. Other press releases are considered discretionary — published at the will of a company. The domain expert needed a set of 100,000 press releases classified into either mandatory or discretionary piles. We used a process very similar to the process outlined in this chapter's appendix. In the end, the domain expert believes the classification process was 86% correct, and this was good enough for them. In another project, we tried to identify articles about a particular yeast (Cryptococcus neoformans), despite the fact that the articles never mentioned the given yeast. This project failed because we were unable to generate an accuracy score greater than 70%. This was deemed not good enough.

We are developing a high performance computing system called the Distant Reader, which uses machine learning to do natural language processing against an arbitrarily large volume of text. Given just about any number or type of documents, the Distant Reader will:

1. amass the documents

2. convert the documents into plain text

3. do rudimentary counts and tabulations against the plain text

4. calculate statistically significant keywords against the plain text

5. extract narrative summaries against the plain text

6. use spaCy (a natural language processing library) to classify each and every feature of each and every sentence into parts-of-speech and/or named entities5 (see the sketch after this list)

7. save the results of Steps #1 through #6 as plain text and tab-delimited files

8. distill the tab-delimited files into an SQLite database

9. create both narrative as well as tabular reports against the database

10. create an archive (.zip file) of everything

11. return the archive to the student, researcher, or scholar
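Step #6 is easier to picture with a tiny sketch, assuming spaCy and its small English model have been installed (pip install spacy, then python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load( 'en_core_web_sm' )
doc = nlp( 'The Distant Reader was developed at the University of Notre Dame.' )

# each token carries a part-of-speech tag...
for token in doc : print( token.text, token.pos_ )

# ...and spans of tokens may be labeled as named entities
for entity in doc.ents : print( entity.text, entity.label_ )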

The student, researcher, or scholar can then analyze the contents of the .zip file to get a better understanding of its contents. This analysis ("reading") ranges from perusing the narrative reports, to using desktop tools to visualize the data, to exploiting command-line tools to investigate the data, to writing software which uses the data as input. The Distant Reader scales from a single scholarly report, to hundreds of book-length documents, to thousands of journal articles. Its purpose is to supplement the traditional reading process, and it uses machine learning techniques at its core.

4 https://github.com/ericleasemorgan/bringing-algorithms.

5 See https://spacy.io.


Summary and Conclusion

Computers and libraries are a natural fit. They both excel at the collection, organization, and dissemination of data, information, and knowledge. Compared to most professions, the practice of librarianship has used computers for a very long time. But, for the most part, the functionality of computers in libraries has not been fully exploited. Advances in machine learning coupled with the data/information found in libraries present an opportunity for both librarianship and the people whom libraries serve. Machine learning can be used to enhance library collections and services, and with a modest investment of time as well as resources, the profession can make it a reality.

Appendix: Train and Classify

This appendix lists two Python programs. The first (train.py) creates a model for the classification of plain text files. The second (classify.py) uses the output of the first to classify other plain text files. For your convenience, the scripts and some sample data ought to be available in a GitHub repository.6

The purpose of including these two scripts is to help demystify the process of machine learning.

6 https://github.com/ericleasemorgan/bringing-algorithms.

Train

The following Python script is a simple classification training application.

Given a file name and a list of directories containing .txt files, this script first reads all of the files' contents and the names of their directories into sets of data and labels (think "categories"). It then divides the data and labels into training and testing sets. Such is a best practice for these types of programs so the models can be evaluated for accuracy. Next, the script counts and tabulates ("vectorizes") the training data and creates a model using a variation of the Naive Bayes algorithm. The script then vectorizes the test data, uses the model to classify the test data, and compares the resulting classifications to the originally supplied labels. The result is an accuracy score, and generally speaking, a score greater than 75% is on the road to success. A score of 50% is no better than flipping a coin. Finally, the model is saved to a file for later use.

# train.py - given a file name and a list of directories
# containing .txt files, create a model for classifying
# similar items

# require the libraries/modules that will do the work
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
import glob, os, pickle, sys

# sanity check; make sure the program has been given input
if len( sys.argv ) < 4 :
    sys.stderr.write( 'Usage: ' + sys.argv[ 0 ] +
        " <model> <directory> <another directory>\n" )
    quit()

# get the name of the file where the model will be saved
model = sys.argv[ 1 ]

# get the rest of the input, the names of directories to process
directories = []
for i in range( 2, len( sys.argv ) ) :
    directories.append( sys.argv[ i ] )

# initialize the data to analyze and its associated labels
data   = []
labels = []

# loop through each given directory
for directory in directories :

    # find all the text files and get the directory's name
    files = glob.glob( directory + "/*.txt" )
    label = os.path.basename( directory )

    # process each file
    for file in files :

        # open the file
        with open( file, 'r' ) as handle :

            # add the contents of the file to the data
            data.append( handle.read() )

        # update the list of labels
        labels.append( label )

# divide the data/labels into training sets and testing sets;
# a best practice
data_train, data_test, labels_train, labels_test = train_test_split( data, labels )

# initialize a vectorizer, and then count/tabulate the
# training data
vectorizer = CountVectorizer( stop_words='english' )
data_train = vectorizer.fit_transform( data_train )

# initialize a classification model, and then use Naive Bayes
# to create a model
classifier = MultinomialNB()
classifier.fit( data_train, labels_train )

# count/tabulate the test data, and use the model to classify it
data_test       = vectorizer.transform( data_test )
classifications = classifier.predict( data_test )

# begin to test for accuracy
count = 0

# loop through each test classification
for i in range( len( classifications ) ) :

    # increment, conditionally
    if classifications[ i ] == labels_test[ i ] : count += 1

# calculate and output the accuracy score;
# above 75% begins to achieve success
print( "Accuracy: %s%%\n" % ( int( ( count * 1.0 ) / len( classifications ) * 100 ) ) )

# save the vectorizer and the classifier (the model)
# for future use, and done
with open( model, 'wb' ) as handle :
    pickle.dump( ( vectorizer, classifier ), handle )

exit()

Classify

The following Python script is a simple classification program.

Given the model created by the previous script (train.py) and a directory containing a set of .txt files, this script will output a suggested label ("classification") and a file name for each file in the given directory. This script automatically classifies a set of plain text files.

# classify.py - given a previously saved classification model and
# a directory of .txt files, classify a set of documents

# require the libraries/modules that will do the work
import glob, os, pickle, sys

# sanity check; make sure the program has been given input
if len( sys.argv ) != 3 :
    sys.stderr.write( 'Usage: ' + sys.argv[ 0 ] +
        " <model> <directory>\n" )
    quit()

# get input; get the model to read and the directory containing
# the .txt files
model     = sys.argv[ 1 ]
directory = sys.argv[ 2 ]

# read the model
with open( model, 'rb' ) as handle :
    ( vectorizer, classifier ) = pickle.load( handle )

# process each .txt file
for file in glob.glob( directory + "/*.txt" ) :

    # open, read, and classify the file
    with open( file, 'r' ) as handle :
        classification = classifier.predict(
            vectorizer.transform( [ handle.read() ] ) )

    # output the classification and the file's name
    print( "\t".join( ( classification[ 0 ],
        os.path.basename( file ) ) ) )

# done
exit()
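For the curious, the pair might be invoked like this, assuming two hypothetical directories of .txt files named after their authors and a third directory of unclassified texts:

python train.py model.bin ./austen ./melville
python classify.py model.bin ./unknown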


Chapter 11

Taking a Leap Forward: Machine Learning for New Limits

Patrice-Andre Prud'homme
Oklahoma State University

Introduction

Today, machines can analyze vast amounts of data and increasingly produce accurate results through the repetition of mathematical or computational procedures. With the increasing computing capabilities available to us today, artificial intelligence (AI) and machine learning applications have made a leap forward. These rapid technological changes are inevitably influencing our interpretation of what AI can do and how it can affect people's lives. Machine learning models that are developed on the basis of statistical patterns from observed data provide new opportunities to augment our knowledge of text, photographs, and other types of data in support of research and education. However, "the viability of machine learning and artificial intelligence is predicated on the representativeness and quality of the data that they are trained on," as Thomas Padilla, Interim Head, Knowledge Production at the University of Nevada Las Vegas, asserts (2019, 14). With that in mind, these technologies and methodologies could help augment the capacity of archives and libraries to leverage their creation-value and minimize their institutional memory loss while enhancing the interdisciplinary approach to research and scholarship.

In this essay, I begin by placing artificial intelligence and machine learning in context, then proceed by discussing why AI matters for archives and libraries, and describing the techniques used in a pilot automation project from the perspective of digital curation at Oklahoma State University Archives. Lastly, I end by challenging other areas in the library and adjacent fields to join in the dialogue, to develop a machine learning solution more broadly, and to explore opportunities that we can reap by reaching out to others who share a similar interest in connecting people to build knowledge.


Artificial Intelligence and Machine Learning: Why Do They Matter?

Artificial intelligence has seen a resurging interest in the recent past—in the news, in the literature, in academic libraries and archives, and in other fields, such as medical imaging, inspection of steel corrosion, and more. John McCarthy, American computer scientist, defined artificial intelligence as "the science and engineering of making intelligent machines, especially intelligent computer programs. It is related to the similar task of using computers to understand human intelligence, but AI does not have to confine itself to methods that are biologically observable" (2007, 2). This definition has since been extended to reflect a deeper understanding of AI today and what systems run by computers are now able to do. Dr. Carmel Kent notes that "AI feels like a moving target" as we still need to learn how it affects our lives (2019). Within the last decades, the amazing jump in computing capabilities has been quite transformative in that machines are increasingly able to ingest and analyze large amounts of data and more complex data to automatically produce models that can deliver faster and more accurate results.1 Their "power lies in the fact that machines can recognize patterns efficiently and routinely, at a scale and speed that humans cannot approach," writes Catherine Nicole Coleman, digital research architect for Stanford University (2017).

A Paradigm Shift for Archives and Libraries

Within the context of university archives, this paradigm shift has been transforming the way we interpret archival data. Artificial intelligence, and specifically machine learning as a subfield of AI, has direct applications through pattern recognition techniques that predict the labeling values for unlabeled data. As the software analytics company SAS argues, it is "the iterative aspect of machine learning [that] is important because as models are exposed to new data, they are able to independently adapt. They learn from previous computations to produce reliable, repeatable decisions and results" (n.d.).

Case in point, how can we use machine learning to train machines and apply facial and text recognition techniques to interpret the sheer number of photographs and texts in either analog or born-digital formats held in archives and libraries? Combining automatic processes to assist in supporting inventory management with a focus on descriptive metadata, a machine learning solution could help alleviate time-consuming and relatively expensive metadata tagging tasks, and thus scale the process more effectively using relatively small amounts of data. However, the traditional approach of machine learning would still require a significant time commitment by archivists and curators to identify essential features to make patterns usable for data training. By contrast, deep learning algorithms are able "to learn high-level features from data in an incremental manner. This eliminates the need of domain expertise and hard core feature extraction" (Mahapatra 2018).

Deep learning has regained popularity since the mid-2000s due to "fast development of high-performance parallel computing systems, such as GPU clusters" (Zhao 2019, 3213). Deep learning neural networks are more effective in feature detection as they are able to solve complex problems such as image classification with greater accuracy when trained with large datasets. The challenge is whether archives and libraries can afford to take advantage of greater computing capabilities to develop sophisticated techniques and make complex patterns from thousands of digital works. The sheer size of library and archive datasets, such as university photograph collections, presents challenges to properly using these new, sophisticated techniques. As Jason Griffey writes, "AI is only as good as its training data and the weighting that is given to the system as it learns to make decisions. If that data is biased, contains bad examples of decision-making, or is simply collected in such a way that it isn't representative of the entirety of the problem set[…], that system is going to produce broken, biased, and bad outputs" (2019, 8). How can cultural heritage institutions ensure that their machine learning algorithms avoid such bad outputs?

1 See SAS n.d. and Brennan 2019.

Implications for Machine Learning

Machine learning has the potential to enrich the value of digital collections by building upon experts' knowledge. It can also help identify resources that archivists and curators may never have the time for, and at the same time correct assumptions about heritage materials. It can generate the necessary added value to support the mission of archives and libraries in providing a public good. Annie Schweikert states that "artificial intelligence and machine learning tools are considered by many to be the next step in streamlining workflows and easing workloads" (2019, 6).

For images, how can archives build a data-labeling pipeline into their digital curation workflow that enables machine learning of collections? With the objective being to augment knowledge and create value, how can archives and libraries "bring the skills and knowledge of library staff, scholars, and students together to design an intelligent information system" (Coleman 2017)? Despite the opportunities to augment knowledge from facial recognition, models generated by machine learning algorithms should be scrutinized so long as it is unclear how choices are made in feature selection. Machine learning "has the potential to reveal things …that we did not know and did not want to know," as Charlie Harper asserts (2018). It can also have direct ethical implications, leading to biased interpretations for nefarious motives.

Machine Learning and Deep Learning on the Grounds of Generating Value

In the fall of 2018, Oklahoma State University Archives began to look more closely at a machine learning solution to facilitate metadata creation in support of curation, preservation, and discovery. Conceptually, we envisioned boosting the curation of digital assets, setting up policies to prioritize digital preservation and access for education and research, and enhancing the long-term value of those data. In this section, I describe the parameters of automation and machine learning used to support inventory work and experiment with face recognition models to add contextualization to digital objects. From a digital curation perspective, the objective is to explore ways to add value to digital objects for which little information is known, if any, in order to increase the visibility of archival collections.

What Started this Pilot Project?

Before proceeding, we needed to gain a deeper understanding of the large quantity of files held in the archives—both types of data and metadata. The challenge was that with so many files and so many formats, files had become duplicated, renamed, doctored, and scattered throughout directories to accommodate different types of projects over time, making them hard to sift due to sparse metadata tags that may have differed from one system to another. In short, how could we justify the value of these digital assets for curatorial purposes? How much could we rely on the established institutional memory within the archives? Lastly, could machine learning or deep learning applications help us build a greater capacity to augment knowledge? In order to optimize resources and systematically make sense of data, we needed to determine that machine learning could generate value, which in turn could help us more tightly integrate our digital initiatives with machine learning applications. Such applications would only be as effective as the data are good for training and the value we could derive from them.

Methodology and Plan of Action

First, we recruited two student interns to create a series of processes that would automatically populate a comprehensive inventory of all digital collections, including finding duplicate files by hashing. We generated the inventory by developing a process that could be universally adapted to all library digital collections, setting up a universal list of works and their associated metadata, with a focus on descriptive metadata, which in turn could support digital curation and discovery of archival materials—digitized analog materials and born-digital materials. We developed a universal policy for digital archival collections, which would allow us to incorporate all forms of metadata into a single format to remedy inconsistencies in existing metadata. This first phase was critical in the sense that it would condition the cleansing and organizing of data. We could then proceed with the design of a face recognition database, with the intent to trace individuals featured in the inventory works of the archives to the extent that our data were accurate. We utilized the Oklahoma State University Yearbook collections and other digital collections as authoritative references for other works, for the purpose of contextualization to augment our data capacity.
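Finding duplicate files by hashing, mentioned above, is one of the simpler pieces of such an inventory. A minimal sketch in Python, assuming a hypothetical top-level directory of digital collections:

import hashlib, os

# walk the file system, fingerprinting each file along the way
seen = {}
for ( root, directories, files ) in os.walk( 'digital-collections' ) :
    for name in files :
        path = os.path.join( root, name )
        with open( path, 'rb' ) as handle :
            digest = hashlib.sha256( handle.read() ).hexdigest()
        # two files with the same digest have identical contents
        if digest in seen : print( 'duplicate:', path, '==', seen[ digest ] )
        else : seen[ digest ] = path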

Second, we implemented our plan; worked closely with the Library Systems' team within a Windows-based environment; decided on Graphics Processing Unit (GPU) performance and cost, taking into consideration that training neural networks necessitates computing power; determined storage needs; and fulfilled other logistical requirements to begin the step-by-step process of establishing a pattern recognition database. We designed the database on known objects before introducing and comparing new data to contextualize each entry. With this framework, we would be able to add general metadata tags to a uniform storage system using deep learning technology.

Third, we applied Tesseract OCR to a series of archival image-text combinations from the archives to extract printed text from those images and photographs. "Tesseract 4 adds a new neural net (LSTM) [Long Short-Term Memory] based OCR engine which is focused on line recognition," while also recognizing character patterns ("Tesseract" n.d.). We were able to obtain successful output for the most part, with the exception of a few characters that were hard to detect due to pixelation and font types.
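From Python, this kind of extraction might look like the following sketch, assuming the tesseract binary and the pytesseract and Pillow libraries are installed; the file name is hypothetical:

from PIL import Image
import pytesseract

# read an image, and extract whatever printed text Tesseract can find
image = Image.open( 'photograph-with-caption.jpg' )
print( pytesseract.image_to_string( image ) )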

Fourth, we looked into object identifiers, keeping in mind that "When there are scarce or insufficient labeled data, pre-training is usually conducted" (Zhao 2019, 3215). Working through the inventory process, we knew that we would also need to label more data to grow our capacity. We chose to use ResNet 50, a smaller backbone version of Keras-RetinaNet, frequently used as a starting point for transfer learning. ResNet 152 was another implementation layer used, as shown in Figure 11.1, demonstrating the output of a training session, or epoch, for testing purposes.

Figure 11.1: ResNet 152 application using PASCAL VOC 2012

Figure 11.2: Face recognition API test

Keras is a deep learning network API (Application Programming Interface) that supports multiple back-end neural network computation engines (Heller 2019), and RetinaNet is a single, unified network consisting of a backbone network and two task-specific subnetworks used for object detection (Karaka 2019). We proceeded by first dumping a lot of pre-tagged information from pre-existing datasets into this neural network. We experimented with three open source datasets: PASCAL VOC 2012, a set including 20 object categories; Open Images Database (OID), a very large dataset annotated with image-level labels and object bounding boxes; and Microsoft COCO, a large-scale object detection, segmentation, and captioning dataset. With a few faces from the OID dataset, we could compare and see if a face was previously recognized. Expanding our process to data known from the archives collection, we determined facial areas, and more specifically, assigned bounding box regressions to feed into the facial recognition API, based on Keras code written in Python. The face recognition API is available via GitHub.2 It uses a method called Histogram of Oriented Gradients (HOG) encoding that makes the actual face recognition process much easier to implement for individuals, because the encodings are fairly unique for every person, as opposed to encoding images and trying to blindly figure out which parts are faces based on our label boxes. Figure 11.2 illustrates our test, confirming from two very different photographs the presence of Jessie Thatcher Bost, the first female graduate from Oklahoma A&M College in 1897.

2 See https://github.com/ageitgey/face_recognition.
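At its simplest, the face recognition API can be used as in this sketch, where the photograph file names are hypothetical:

import face_recognition

# encode the face in a known, labeled photograph
known = face_recognition.load_image_file( 'jessie-thatcher-bost.jpg' )
known_encoding = face_recognition.face_encodings( known )[ 0 ]

# encode each face found in an unidentified photograph, and compare
unknown = face_recognition.load_image_file( 'unidentified-group.jpg' )
for encoding in face_recognition.face_encodings( unknown ) :
    if face_recognition.compare_faces( [ known_encoding ], encoding )[ 0 ] :
        print( 'This may be Jessie Thatcher Bost.' )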

Ren et al. stated that it is important to construct a deep and convolutional per-region object classifier to obtain good accuracy using ResNets (2015). Going forward, we could use the tool "as is" despite the low tolerance for accuracy, or instead try to establish large datasets of faces by training on our own collections in hopes of improving accuracy. We proceeded with utilizing the Oklahoma State University Yearbook collections, comparing image sets with other photographs that may include these faces. We look forward to automating more of these processes.

A Conclusive First Experiment

We can say that our first experiment developing a machine learning solution on a known set of archival data resulted in positive output, while recognizing that it is still a work in progress. For example, the model we ran for the pilot is not natively supported on Windows, which hindered team collaboration. In light of these challenges, we think that our experiment was a step in the right direction of adding value to collections by bringing in a new layer of discovery for hidden or unidentified content.

Above all, this type of work relies greatly on transparency. As Schweikert notes, "Transparency is not a perk, but a key to the responsible adoption of machine learning solutions" (2019, 72). More broadly, issues of transparency and ethics in machine learning are important concerns in the collecting and handling of data. In order to boost adoption and get more buy-in with this new type of discovery layer, our team shared information intentionally about the process to help add credibility to the work and foster a more collaborative environment within the library. Also, the team developed a Graphical User Interface (GUI) to search the inventory within the archives and ultimately grow the solution beyond the department.

Challenges and Opportunities of Machine Learning

Challenges

In a National Library of Medicine blog post, Patti Brennan points out "that AI applications are only as good as the data upon which they are trained and built" (2019), and having these data ready for analysis is a must in order to yield accurate results. Scaling of input and output variables also plays an important role in performance improvement when using neural network models. Jerome Pesenti, Head of AI at Facebook, states that "When you scale deep learning, it tends to behave better and to be able to solve a broader task in a better way" (2019). Clifford Lynch affirms, "machine learning applications could substantially help archives make their collections more discoverable to the public, to the extent that memory organizations can develop the skills and workflows to apply them" (2019). This raises the question of whether archives can also afford to create the large amount of data from print heritage materials or refine their born-digital collections in order to build the capacity to sustain the use of deep-learning applications. Granted, while the increasing volume of born-digital materials could help leverage this data capacity somehow, it does not exclude the fact that all data will need to be ready prior to using deep learning. Since machine learning is only good so long as value is added, archives and libraries will need to think in terms of optimization as well, deciding when value-generated output is justified compared to the cost of computing infrastructure and skilled labor needs. Besides value, operations such as storing and ensuring access to these data are just as important considerations to making machine learning a feasible endeavor.

Prud’homme 133

Opportunities

Investment in resources is also needed for interpreting results, in that "results of an AI-powered analysis should only factor into the final decision; they should not be the final arbiter of that decision" (Brennan 2019). While this could be a challenge in itself, it can also be an opportunity when machine learning helps minimize institutional memory loss in archives and libraries (e.g., when long-time archivists and librarians leave the institution). Machine learning could supplement practices that are already in place—it may not necessarily replace people—and at the same time generate metadata for the access and discovery of collections that people may never have the time to get to otherwise. But we will still need to determine accuracy in results. As deep learning applications will only be as effective as the data, archives and libraries should expand their capacity by working with academic departments and partnering with university supercomputing centers or other high-performance computing environments across consortium aggregating networks. Such networks provide a computing environment with greater data capacity and more GPUs. Along similar lines, there are opportunities to build upon Carpentries workshops and the communities of practice that surround this type of interest.

These growing opportunities could help boost the use of machine learning and deep learning applications to minimize our knowledge gaps about local history and the surrounding community, bringing together different types of data scattered across organizations. This increased capacity for knowledge could grow through collaborative partnerships, connecting people, scholars, computer scientists, archivists, and librarians, to share their expertise through different types of projects. Such projects could emphasize the multi- and interdisciplinary academic approach to research, including digital humanities and other forms or models of digital scholarship.

Conclusion

Along with greater computing capabilities, artificial intelligence could be an opportunity for libraries and archives to boost the discovery of their digital collections by pushing text and image recognition machine learning techniques to new limits. Machine learning applications could help increase our knowledge of texts, photographs, and more, and determine their relevance within the context of research and education. They could minimize institutional memory loss, especially as long-time professionals are leaving the profession. However, these applications will only be as effective as the data are good for training and for the added value they generate.

At Oklahoma State University, we took a leap forward developing a machine learning solution to facilitate metadata creation in support of curation, preservation, and discovery. Our experiment with text extraction and face recognition models generated conclusive results within one academic year with two student interns. The team was satisfied with the final output, and so was the library as we reported on our work. Again, it is still a work in progress, and we look forward to taking another leap forward.

In sum, it will be organizations' responsibility to build their data capacity to sustain deep learning applications and justify their commitment of resources. Nonetheless, as Oklahoma State University's face recognition initiative suggests, these applications can augment archives' and libraries' support for multi- and interdisciplinary research and scholarship.


References

Brennan, Patti. 2019. "AI is Coming. Are Data Ready?" NLM Musings from the Mezzanine (blog). March 26, 2019. https://nlmdirector.nlm.nih.gov/2019/03/26/ai-is-coming-are-the-data-ready/.

Coleman, Catherine Nicole. 2017. "Artificial Intelligence and the Library of the Future, Revisited." Stanford Libraries (blog). November 3, 2017. https://library.stanford.edu/blogs/digital-library-blog/2017/11/artificial-intelligence-and-library-future-revisited.

"Face Recognition." n.d. Accessed November 30, 2019. https://github.com/ageitgey/face_recognition.

Griffey, Jason, ed. 2019. "Artificial Intelligence and Machine Learning in Libraries." Special issue, Library Technology Reports 55, no. 1 (January). https://journals.ala.org/index.php/ltr/issue/viewIssue/709/471.

Harper, Charlie. 2018. "Machine Learning and the Library or: How I Learned to Stop Worrying and Love My Robot Overlords." Code4Lib, no. 41 (August). https://journal.code4lib.org/articles/13671.

Heller, Martin. 2019. "What is Keras? The Deep Neural Network API Explained." InfoWorld (website). January 28, 2019. https://www.infoworld.com/article/3336192/what-is-keras-the-deep-neural-network-api-explained.html.

Karaka, Anil. 2019. "Object Detection with RetinaNet." Weights & Biases (website). July 18, 2019. https://www.wandb.com/articles/object-detection-with-retinanet.

Kent, Carmel. 2019. "Evidence Summary: Artificial Intelligence in Education." European EdTech Network. https://eetn.eu/knowledge/detail/Evidence-Summary-\/-Artificial-Intelligence-in-education.

Lynch, Clifford. 2019. "Machine Learning, Archives and Special Collections: A High Level View." International Council on Archives Blog. October 1, 2019. https://blog-ica.org/2019/10/02/machine-learning-archives-and-special-collections-a-high-level-view/.

Mahapatra, Sambit. 2018. "Why Deep Learning over Traditional Machine Learning?" Towards Data Science (website). March 21, 2018. https://towardsdatascience.com/why-deep-learning-is-needed-over-traditional-machine-learning-1b6a99177063.

McCarthy, John. 2007. "What is Artificial Intelligence?" Professor John McCarthy (website). Revised November 12, 2007. http://jmc.stanford.edu/articles/whatisai/whatisai.pdf.

Padilla, Thomas. 2019. Responsible Operations: Data Science, Machine Learning, and AI in Libraries. Dublin, OH: OCLC Research. https://doi.org/10.25333/xk7z-9g97.

Pesenti, Jerome. 2019. "Facebook's Head of AI Says the Field Will Soon 'Hit the Wall.'" Interview by Will Knight. Wired (website). December 4, 2019. https://www.wired.com/story/facebooks-ai-says-field-hit-wall/.

Ren, Shaoqing, Kaiming He, Ross Girshick, Xiangyu Zhang, and Jian Sun. 2015. "Object Detection Networks on Convolutional Feature Maps." IEEE Transactions on Pattern Analysis and Machine Intelligence 39, no. 7 (April).

SAS. n.d. "Machine Learning: What It Is and Why It Matters." Accessed December 17, 2019. https://www.sas.com/en_us/insights/analytics/machine-learning.html.

Schweikert, Annie. 2019. "Audiovisual Algorithms, New Techniques for Digital Processing." Master's thesis, New York University. https://www.nyu.edu/tisch/preservation/program/student_work/2019spring/19s_thesis_Schweikert.pdf.

"Tesseract OCR." n.d. Accessed December 11, 2019. https://github.com/tesseract-ocr/tesseract.

Zhao, Zhong-Qiu, Peng Zheng, Shou-tao Xu, and Xindong Wu. 2019. "Object Detection with Deep Learning: A Review." IEEE Transactions on Neural Networks and Learning Systems 30, no. 11: 3212–3232.

Chapter 12

Machine Learning + Data Creation in a Community Partnership for Archival Research

Jason Cohen
Berea College

Mario Nakazawa
Berea College

Introduction: Cultural Heritage and Archival Preservation in Eastern Kentucky

In this chapter, two researchers, Jason Cohen and Mario Nakazawa, describe the contexts for an archivally focused project that emerged from a partnership between the Pine Mountain Settlement School (PMSS)1 in Harlan County, Kentucky, and scholars and students at Berea College. In this process, we have entered into a critical dialogue with our sources and knowledge production that Roopika Risam calls for in "self-reflexive" investigations in the digital humanities (2015, para. 16). Risam's intervention, nevertheless, does not explicitly distinguish questions of class and the concomitant geographic constraints that often accompany the economic and social disadvantages of poverty (Ahmed et al. 2018). Our work demonstrates how class and geography are tied, even in digital archives, to the need for reflexive and diverse approaches to humanist materials. For instance, a recent invited contribution to Proceedings of the IEEE articulates a need for diversity in computing and technology without mentioning class or region as factors shaping these related issues of diversity (Stephan et al. 2012, 1752–5). Given these constraints, perhaps it is also pertinent to acknowledge that the machine learning application we describe in this chapter is itself not particularly novel in scope or method—we describe our data acquisition and preparation, and two parallel implementations of commercially available tools for facial recognition. What stands out as unique are the ethical and practical concerns tied to bringing unique archival materials out of their local contexts into a larger conversation about computer vision as a tool that helps liberate, and at the same time possibly endanger, a subaltern cultural heritage.

1 See http://pinemountainsettlementschool.com.

In that light, we enter our archival investigation into what Bruno Latour has productively named "actor-network theory" (2007, 11–13) because, as we suggest below, our actions were highly conditioned not only by the physical and social spaces our research occupies and where its events occur, but also because the nature of the historical artifacts themselves act powerfully to shape our work in these contexts. Moreover, the partnership model of curation and archiving that we pursued in this project complicates the very concept of agency, because the actions forming the project emerged from a continuing dialogue rather than any one decision or hierarchy. As we suggest later, a distributed model for decisions (Sabharwal 2015, 52–5) also revealed the limitations of using a participatory and identity-based model for archival development and management. Indeed, those historical artifacts will exert influence on this network of relations long after any one of us involved in the current project has ceased to pursue them. When we came to this project, we asked a version of a classic question that has arisen in a variety of forms beginning with very early efforts by Bell Laboratories, among others, to translate data structures to suit the often flexible needs of humanist data: "what aspects of life are formalizable?" (Weizenbaum 1976, 12). We discovered that while an ontology may represent a formalized relationship of an archive to a database or finding aid, it also asks questions about the ethical implications of what information and embedded relationships can be adequately formalized by an abstract schema.

The Promises and Realities of Technology After Coal in Eastern Kentucky

Despite the longstanding threats of having to adapt to a post-coal economy, Harlan County, Kentucky continues to rely on coal and the mountains from which that coal is extracted as two of the cornerstones that shape the identity of the territory as well as the people who call it home. The mountains of Eastern Kentucky, like much of Appalachia, are by turns beautiful and devastated, and both authors of this essay have found conversations with Eastern Kentucky's citizens about the role the mountains play and the traditions that emerge from them both insightful and, at times, heartbreaking. This dramatic landscape, with its drastic challenges, may not sound like a place likely to find uses for machine learning. You would not be alone in your assumption.

Standing far from urban centers of technology and mobility, Eastern Kentucky combines deeply structural problems of generational poverty with a hard-won understanding that, since the moment of the region's colonization, outsiders have taken resources and made uninformed decisions about what the region needs, or where it should turn in order to gain a better purchase on the narrative of American progress, self-improvement, and the unavoidable allures of development-driven capitalism. Suspicion of outsiders is endemic here. And unfortunately, economic and social conditions, such as the high workplace injury rates associated with mining and extraction-related industries, the effects of the pharmaceutical industry's abuse of prescription opioids to treat a wide array of medical pain symptoms without treating the underlying causal conditions, and the systematic dismantling of federal- and state-level social support programs, have become increasingly acute concerns today. But this trajectory is not new: when President Lyndon B. Johnson announced the beginning of the War on Poverty in 1964, he landed an hour away in Martin County, and subsequently drove through Harlan on a regional tour to inaugurate the initiative. Successive generations have sought to leave a mark, and all the while, the residents have been collecting their own local histories of their place. Our project, centered on recovering a latent social network of historical families represented by the images held in one local archive, mobilizes this tension between insiders' persistence and outsiders' interventions to think about how, as Bruno Latour puts it, we can "reassemble the social" while still respecting the local (2007, 191–2). PMSS occupies a unique position in this social and physical landscape: both local in its emplacement and attention, and a site of philanthropic work that attracted outside money as well as human and cultural capital, PMSS is at once of Harlan County and beyond it. As we suggest in the later sections of this essay, PMSS's position, both within local and straddling regional boundaries, complicates the network we identified. More than that, however, its split position complicates the relationships of power and filiation embedded in its historical social network.

While an economy centered on coal continues to define the Eastern Kentucky regional identity, a second history can be told about this place and its people, one centered on resilience, independence, simplicity, and beauty, both of the land and its people. This second history has made outsiders' recent appeals for the region to court technology as a potential solution for what comes "after coal" particularly attractive to a region that prides itself on its capacity to sustain, outlast, and overcome obstacles. While that techno-utopian vision offers another version of the self-aggrandizing Silicon Valley bootstraps success story J.D. Vance narrates in Hillbilly Elegy (2016), like Vance's story itself, those narratives most often get told by outsiders to outsiders using regional stereotypes as the grounds for a sales pitch. In reality, however, those efforts have largely proven difficult to sustain, and at times, become the sources of potentially explosive accusations of fraud and malfeasance. Recently, for instance, organizations including Mined Minds2 have been accused by residents aiming to prepare for a post-coal economy of misleading students, at least, and of fraud at worst. As with the timber, coal, and gas extraction industries that preceded these software development firms' aspirations, the promises of technology have not been kind to Eastern Kentucky, and in particular, as with those extraction industries that preceded them, the technological-industrial complex making its pitch in Kentucky's mountains has not returned resources to the region's residents whom the work was intended, at least nominally, to support (Hochschild 2018; Campbell 2019; Bailey 2017).

In this context of technology, culture, and the often controversial position machine learning occupies in generating obscure metrics for its classifiers that may embed bias, our project aims to activate its archival holdings and bring critical awareness to the question of how to actively engage with a paper archive of a local place as we venture further into our pervasively digital moment. The School operates today as a regional cultural heritage institution; it opened in 1913 as a residential school and operated as an educational institution until 1974, at which point it transformed itself into an environmental and cultural outreach institution focused on developing its local community and maintaining the richness of the region's cultural resources and heritage. Every year since 1974, PMSS has brought hundreds of students and citizens onto its campus to learn about nature and the landscape, traditional crafts and artistic practices, and musical and dance forms, among many other programs. Similarly, it has created a space for locals to come together for social events, community celebrations, and festival days, and at the same time, has become a destination for national-level events that create community from shared interests including foodways, wildflowers, traditional dance forms, and other wide-ranging attractions.

2 See http://www.minedminds.org/.

Project Background: Preserving Cultural Heritage in Harlan County

The archives of the Pine Mountain Settlement School emerge from its shifting history. The majority of its papers relate to its time as a traditional institution of education, including student records (which continue to be restricted for several reasons, including FERPA constraints, and personal and community interests in privacy), minutes of its board meetings (again, partially restricted), and financial and narrative accounts of its many activities across a year. The school's records are unique because they provide a snapshot, year by year and month by month, of the region's interests and challenges during key years of the 20th Century, spanning the First World War to Vietnam. In addition, they detail the relations the School maintained with a philanthropic base of donors who helped to support it and shape it, and beyond its local relations, place it into contact with a larger set of cultural interactions than a boarding school that relied on tuition or other profit-driven means to sustain its operations would. While the archival holdings continued to be informally developed by its directors and staff, who kept the official papers organized roughly by year, the archive itself sat largely neglected after 1974. Beginning around the turn of the millennium, a volunteer archivist named Helen Wykle began digitizing items one by one, and soon, hosted a curated selection of those digital surrogates along with interpretive and descriptive narration on a WordPress installation, The Pine Mountain Settlement School Collections.3 The PMSS Collections WordPress site has been continuously running and frequently updated by Wykle and the volunteer community members she has organized since 1999.4 Together with her collaborators and volunteers, Wykle has grown the WordPress site to over 2200 pages, including over 30,000 embedded images that include photographs and newspapers; scanned memos, meeting minutes, and other textual material (in JPG and PDF formats); HTML transcriptions and bibliographies hard-coded into the pages; scanned images of 3-D collections objects like textile looms or wood carving tools; partially scanned runs of serial publications; and other composite visual material. None of those objects was hosted within a regular and complete metadata hierarchy or ontology: no regular scheme of fields or file-naming convention was followed, no controlled vocabulary was maintained, no object-types were defined, no specific fields were required prior to posting, and perhaps unsurprisingly as a result, the search and retrieval functions of the site had deteriorated noticeably.

3 See https://pinemountainsettlement.net/.

4 Jason Cohen and Mario Nakazawa wish to extend a note of appreciation to Helen Hays Wykle, Geoff Marietta, the former director of PMSS, and Preston Jones, its current director, for welcoming us and enabling us to access the physical archives at PMSS from 2016–20.

In 2016, Jason Cohen approached PMSS with the idea of using its archives as the basis for curricular development at Berea College.5 Working in collaboration beginning in 2017, Mario Nakazawa and Cohen developed two courses in digital and computational humanities, led a team-directed study in augmented reality in coordination with Pine Mountain, contributed materials and methods for a new course in Appalachian Studies, and promoted the use of PMSS archival materials in several other extant courses in history and art history, among others. These new college courses each make use of PMSS historical documents as a shared core of visual and textual material in a digital and computational humanities concentration that clusters around critical archival and textual studies.6

3. See https://pinemountainsettlement.net/.
4. Jason Cohen and Mario Nakazawa wish to extend a note of appreciation to Helen Hays Wykle, Geoff Marietta, the former director of PMSS, and Preston Jones, its current director, for welcoming us and enabling us to access the physical archives at PMSS from 2016–20.
5. Jason Cohen would like to recognize the support this project received from the National Endowment for the Humanities' "Humanities Connections" grant. See grant number AK-255299-17, description online at https://securegrants.neh.gov/publicquery/main.aspx?f=1&gn=AK-255299-17.

The success of that initial collaboration and course development seeded the potential in 2019–2021 for a Whiting Public Engagement7 fellowship focused on developing middle and high school curricula for use in Kentucky public schools with PMSS archival materials. That Whiting-funded project has generated over 80 lessons keyed to Kentucky state standards; these lessons are currently in use at nine schools across eight school districts, and each school is using PMSS materials to highlight its own regional and local interests. The work we have done with these archives has thus far reached the classrooms of at least eleven different middle and high school teachers, and as a result, touched over 450 students in eastern and central Kentucky public schools.

We mention these numbers in order to demonstrate that our collaboration has been neither shallow nor fleeting. We have come to know these archives quite well, and because they are not adequately cataloged, the only way to get to know them is to spend time reading through the materials one page at a time. An ancillary consequence of this durable collaboration and partnership across the public-academic divide is the shared recognition early in 2019 that the PMSS archival database and its underlying data structure (a flat SQL database generated by the WordPress interface) would provide inadequate stability for records management and quality control in future development. In addition, we discovered that the interpretive materials and metadata associated with the WordPress installation were also insufficient for linked metadata across the objects in this expanding digital archive, for reasons discussed below.

As partners, we decided together to migrate to a ContentDM instance hosted by the Kentucky Virtual Library,8 a consortium to which Berea College belongs, and which is open to future membership from PMSS. That decision led a team of Berea College undergraduate and faculty researchers to scrape the data from the PMSS archive site and supplement the images and transcriptions it contains with available textual metadata drawn from the site.9 Alongside the WordPress instance as our reference, we were also granted access to a Dropbox account that hosted higher resolution versions of the images featured on the blog. The scraper pulled over 19,228 unique images (and located over 11,000 duplicate images in the process), 732 document transcriptions for scanned texts on the site, and 380 subject and person bibliographies, including Library of Congress Subject Headings that had been hard-coded into the site's HTML. We also extracted the unique object identifiers and labels associated with each image, which in WordPress are not associated with the image objects themselves. We used that data to populate the ContentDM instance and returned a sparse but stable skeleton for future archival development. In the process, we also learned significantly about how a future implementation of a controlled vocabulary, an image acquisition and processing pipeline, and object documentation standards should work in the next stages of our collaborative PMSS archival development.

6. In the original version of the collaboration, we had planned also to teach basic computer programming to high school students during a summer program that also would have used that same set of materials, but with the paired departures of the original co-PI as well as the former director, that plan has thus far remained unfulfilled.
7. See https://www.whiting.org/content/jason-cohen.
8. See https://kdl.kyvl.org/.
9. Jason Cohen wishes to thank Mario Nakazawa, Bethanie Williams, and Tradd Schmidt for undertaking this project with him. The GitHub repo for the PMSS scraper is hosted here: https://github.com/Tradd-Schmidt/PMSS_Scraper.
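The team's actual scraper is linked in note 9; as a minimal sketch of the scraping step described above, the following pulls embedded image URLs and hard-coded subject headings out of a WordPress post body. The URL, CSS selectors, and the "SUBJECT:" convention are assumptions made for illustration, not the PMSS site's real markup.

```python
# Hedged sketch of scraping images and hard-coded subject headings from a
# WordPress page. Selectors and the "SUBJECT:" prefix are assumed, not real.
import requests
from bs4 import BeautifulSoup

def scrape_page(url):
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    images, lcsh = [], []
    for section in soup.select("div.entry-content"):  # typical WordPress post body
        images += [img["src"] for img in section.find_all("img") if img.get("src")]
        for p in section.find_all("p"):
            text = p.get_text(strip=True)
            # Pretend subject headings are hard-coded as "SUBJECT: ..." lines.
            if text.startswith("SUBJECT:"):
                lcsh.append(text[len("SUBJECT:"):].strip())
    return {"url": url, "images": images, "lcsh": lcsh}

record = scrape_page("https://pinemountainsettlement.net/?page_id=12345")  # hypothetical page
print(len(record["images"]), "images;", record["lcsh"])
```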


As we developed and refined this new point of entry to the digital archives using the ContentDM hosting and framework, some of the ethical issues surrounding this local archive came more clearly into focus. A parallel set of questions arose in response in the first instance to J.D. Vance's work, and in the second, to outsiders' claims for technological solutions to the deterioration of local and cultural heritage. Because we were creating virtual archival surrogates for materials housed at Pine Mountain, for instance, questions arose from the PMSS board members related to privacy and use of historical materials. Further, the board was concerned that even historical materials could bear on families present in the community today. We found that while profession-wide responses to archival constraints are shaped predominantly by discussions of copyright and fair use, issues of personal privacy are often left tacit. This gap between legal use and public interests in privacy reveals how tasks executed using techniques in machine learning may impinge upon broader ethical constraints of public trust and civic obligation.10

Similarly, as the ownership of historical images suddenly extended to include present-day community members, and as these questions of access and serving a local public were inextricably bound up with interactions with members of that shared public whose family names and faces appear in the images we were making available, we began to consider the ways in which our archival work was tied to what Ryan Calo calls the "historical validation" of primary source materials (2017, 424–5). When an AI system recognizes an object, Calo remarks, that object is validated. But how should one handle the lack of a specific vocabulary within a given training set? One answer, of course, would be to train a new set—but that response is becoming increasingly prohibitive for smaller cultural heritage projects like ours: the time and computational power required to execute the training is non-negligible. In addition, training resources (such as data sets, algorithms, and platforms) are increasingly becoming monetized, and we do not have the margins to buy access to new data for training. As a consequence, questions stemming from how one labels material in a controlled vocabulary were also at issue. We encountered a failure in historical validation when, for instance, our AI system labeled a "spinning wheel" as a wheel, but did not detect its historical relationship to weaving and textiles. That validation was further obscured when the system also failed to categorize a second form of "spinning wheel," which refers locally to a home-made merry-go-round.11 In other words, not only did the system flatten a spinning wheel into a generic wheel, it also missed the regional homology between textile production and play, a cultural crux that reveals how this place envisions an intersection between work and recreation. By breaking the associations between two forms of "spinning wheel," our system erased a small but significant site of cultural inheritance. How, we asked, should one handle such instances of effacement? At one level, one would expect an archival system to be able to identify the primitive machine for spinning wool, flax, or other raw materials into usable thread for textiles, but what about the merry-go-round? And what should one do when a system neglects both of these meanings and reduces the object to the same status as a wheel on a tractor, car, or carriage?

Similarly, when competing naming conventions arise for landmarks, we were conscious to consider which name should be granted priority as the default designation, and we asked how one should designate a local or historical name, whether for a road, waterway, knob, or other feature, in relationship to a more widely accepted nomenclature such as state route designations or standardized toponyms. As we attempted to address the challenge of multiple naming conventions, we encountered some of the same challenges that archivists find in dealing with indigenous peoples and their textual, material, and physical artifacts.12 Following an example derived from the Passamaquoddy people, we implemented a small set of "traditional knowledge labels"13 to describe several forms of information, including (a) restrictions on images that should not be shown to strangers (to protect family privacy), (b) places that should remain undisclosed (for instance, wild ginseng, ramp, orchid, or morel mushroom patches), and (c) educational materials focused on "how it was done" as related to local skills and crafts that have more modern implementations, but for which the traditional practices have remained meaningful. This included cases such as Maypole dancing and festivals, which remain endowed with ritual significance. In the final analysis, neither the framework supplied by copyright and fair use nor the one supplied by data validation proved singularly adequate to our purposes, but they did provide guidelines from which our facial recognition project could proceed, as we discuss below.

10. The professional conversation in archive and collections management has not been as rich as the one emerging in AI contexts more broadly. For a recent discussion of the conflict in the roles of public trust and civic service that emerge from the context of the powers artificial intelligence holds for image recognition in policing applications, see Elizabeth Joh, "Artificial Intelligence and Policing: First Questions," Seattle University Law Review 41: 1139–44.
11. See "Spinning Wheel" in Cassidy 1985–2012.
12. One well-documented digital approach to handling indigenous archival materials includes the Mukurtu platform for indigenous cultural heritage: https://mukurtu.org/.
13. For the original traditional knowledge labels, see https://passamaquoddypeople.com/passamaquoddy-traditional-knowledge-labels.

Machine Learning in a Local Archive

These preliminary discussions of ethics and convention may seem unrelated to the focus this collection adopts toward machine learning and artificial intelligence in the archive. However, as we have begun to suggest, the data migration to ContentDM opened the door to machine learning for this project, and those initial steps framed the pitfalls that we continue to navigate as we move forward. As we suggested at the outset, the technical machine-learning task that we set for ourselves is not cutting-edge research as much as an application of existing technologies to a new aspect of archival investigation. We proposed (and succeeded with) an application of commercial facial recognition software to identify the persons in historic photographs in the PMSS archives. We subsequently proposed and are currently working to identify the photographs sharing common but unnamed faces, and in coordination with photographs of known people, to re-create the social network of this historic institution across slices of its history.

We describe the next steps briefly below, but let us tarry for a moment with the question of how the ethical concerns we navigated up to this point also influenced our approach to facial recognition. The first of those concerns has to do with commercial and public access to archival materials that, as we suggested above, include materials that are designated as restricted use in some way. We demonstrated to the local members at Pine Mountain how our use case and its constraints for digital archives fit with the current standards for the fair use of copyrighted materials based on the "substantive transformation" of reproduced objects (Levendowski 2018, 622–9). Since we are not making available large bodies of materials still protected by copyright, and since our use of select materials shifts the context within which they are presented, we were able to negotiate with PMSS to allow us to design a system for facial recognition using the ContentDM instance as our image source. What that negotiation did not consider, however, is when fair use does not provide a sufficiently high standard of control for the institution involved in the application of algorithms to institutional memory or its technological dependencies.

First, to test the facial recognition processes, we reached back to the most primitive and local version of facial recognition software that we could find, Google's retired platform, the Picasa Web Albums API, which was retired in May 2016 and fully deprecated as of March 2018 (Sabharwal 2016). We chose Picasa because it is a self-contained software application that operates using a locally hosted script and locally hosted images. Given its deprecated status and its location on a local machine, we were confident that no cloud services would be ingesting the images we fed into the system for our trial. This meant that we could test small data examples without fear of having to upload an entire corpus of material that could subsequently be incorporated into commercial facial recognition engines or pop up unexpectedly in search results. We thus began by upholding a high threshold for privacy and insisting on finding ways for PMSS to maintain control over these images within the grasp of its local directories.

The Picasa system created surprisingly good results within the scope we allowed it. It was highly successful at matching the small group of known faces we supplied as test materials. While it would be difficult to supply a numerical match rate, first because of this limited test set, and second because we have not expanded the test to a broad sample using another platform, we were anecdotally surprised at how robust Picasa's matching was in practice. For instance, Picasa matched the images of a single person's face, Celia Cathcart, from pictures of her as a teenager to images of her as a grandmother. It recognized Cathcart in a group of basketball players, and it also identified her face from side-view and off-center angles, as in a photograph of her looking down at her newborn child. The most immediate limitation of Picasa lies in its tagging, which required manual entry of every name and did not allow any automation.

Following the success of that hand-tagging and cross-image identification process, we discussed with our partners whether the next step, using Amazon Web Services' computer vision and facial recognition platform, Rekognition, would be acceptable. They agreed, and we ran the images through the AWS application, testing our results against samples pulled from our Picasa run to verify the results. Perhaps unsurprisingly, AWS Rekognition fared even better with those test cases. Using one photograph image, the AWS application identified all of the Picasa matches as well as three new images that had not previously been tagged with Cathcart's name. The same pattern held for other images in our sample group: Katherine Pettit was positively identified across more likenesses than had been previously tagged, and Alice Cobb was also positively tracked across images. This positive attribution also reveals a limitation of the metadata: while these three women we have named are important historical figures at PMSS, and while they are widely acknowledged in the archive and well-represented in the photographic record, not all of the photographs have been well-tagged or fully documented in the archive. The newly tagged images that we found would enrich the metadata available to the archive not because these images include surprising faces, but rather because the tagging has been inconsistent, and over time, previously known faces have become less easy to discern.
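The chapter does not publish its pipeline code, but a matching step of this kind can be wired up with boto3 roughly as follows. This is an illustrative sketch, not the project's implementation; the collection id, region, and file paths are hypothetical, and the collection must be created once beforehand with rekognition.create_collection(CollectionId="pmss-faces").

```python
# Hedged sketch: index confirmed faces of known people into a Rekognition
# collection, then propose matches for faces in untagged photographs.
import boto3

rekognition = boto3.client("rekognition", region_name="us-east-1")
COLLECTION = "pmss-faces"  # hypothetical collection id

def index_known_face(path, person_id):
    """Register one confirmed photograph of a known person."""
    with open(path, "rb") as f:
        rekognition.index_faces(
            CollectionId=COLLECTION,
            Image={"Bytes": f.read()},
            ExternalImageId=person_id,  # e.g. "Celia_Cathcart"
            MaxFaces=1,
        )

def candidate_matches(path, threshold=90):
    """Return (person_id, similarity) guesses for the largest face in an image."""
    with open(path, "rb") as f:
        resp = rekognition.search_faces_by_image(
            CollectionId=COLLECTION,
            Image={"Bytes": f.read()},
            FaceMatchThreshold=threshold,
        )
    return [(m["Face"]["ExternalImageId"], m["Similarity"])
            for m in resp["FaceMatches"]]
```

Keeping a human in the loop, as the PMSS community asked, would mean treating these outputs as candidates to verify rather than as authoritative tags.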

Like other recent discussions of private materials disclosed within systems trained for matching and similarity, we found that the ethics of private materials for this non-private purpose provoked strong reactions. While some of the reaction was positive, with community members happy to have more images of the School's founding director, Katherine Pettit, identified, those same community members were not comfortable with our role as researchers identifying people in the photographs in their community's archive, unsupervised. They wanted instead to verify each positive identification, a point that we agreed with, but which also hindered the process of moving through 19,000 images. They wanted to maintain authority, and while we saw our efforts as contributions to their goals of better describing their archival holdings, it turns out that the larger scope of automation we brought to the project was intimidating. While its legal status and direct ethics seemed settled before the beginning of the project, ultimately, this project contributed to a sense among some individuals at PMSS that they were losing control of their own archive.14 That fear of a loss of control led to another reckoning with the project, as we discuss in the next section.

14. See, for another example of the ethical quandaries that may be associated with legal applications of machine learning techniques, Ema et al. 2019.

What Machine Learning Cannot Learn: An Ethics of the Archive

It became clear, at the same moment we validated our test case, that our research goals and those of our partners had quickly diverged. We had discussed the scope and use of PMSS materials with our partners at PMSS and laid out in a formally drafted "Memorandum of Understanding" (MOU), adapted from the US Department of Justice (2008; 2017), our shared goals in the project. As we described in the MOU, both partners considered it mutually beneficial for the archive and its metadata to be able to identify faces of named as well as unnamed people. We aimed to capture single-person images as well as groups in order to enrich the archive with cross-links to other photographs or archival materials with a shared subject heading, and we hoped to increase the number of names included in object attributes. Despite those conversations and multiple revisions of the MOU draft, what we discovered was ultimately different than the path our planning had indicated. Instead of creating an historical social network using the five decades of photographs we had prepared, we found that the history of the social network and the family and kinship relationships detailed through those images was deeply personal for the community living in the region today. We found out the hard way that those kinships reflected economic changes in status and power, realignments among families and their communities, and new patterns in the social fabric formed by the warp of personal relationships and the weft of local institutions (schools, hospitals, and local governance). Revealing those changes was not always something that our partners wanted us to do, and these were not patterns we had sought to discover: they are simply there, embedded in the images and the relations among images.

These social changes in local alignments—tied in complex ways to marriages and separations, legal conflicts and resolutions, changes in ownership of residential and commercial interests, and other material reflections of that social fabric—remain highly charged and, for those continuing to live in the area, they revealed potentially unexpected parts of the lived realities and values of the place. As a result, even though we had an MOU that worked for the technical details of the project, we could not find common ground for how to handle the competing social and ethical values of the project.

As we problem-solved, we tried to describe new forms of restriction and to generate appropriately sensitive guidelines to handle future use and access, but it turned out that all of these approaches were threatening to the values of a tightly knit community. They, rightly, want to tell their story, and so many people have told it so poorly for so long that they wish to have sole access to the materials from which the narratives are assembled. As researchers interested in open access and stable platform management, we have disagreements with the scholarly and archival implications of this decision, but we ultimately respect the resolve and underlying values that accompany the difficult choices PMSS makes about its public audiences and the corresponding goals it maintains for its collections. Interestingly, Wykle has come to view our work with PMSS collections as another form of the material and cultural extraction that has dominated the region for generations. While we see our work in light of preservation and access as well as our lasting commitment to PMSS and the region, we have also come to recognize the powerful explanatory force that the idea of "extraction" carries for the communities in a region that has suffered many forms of extraction industries' negative effects. In acknowledging the limitations of our own efforts, we would posit that our case study offers a counter-example to works that suggest how AI systems can be designed automatically to meet the needs of their constituents (Winfield et al. 2019). We tried to use a design approach to address our research goals and our partner's needs, and it turned out that the dynamically constructed and evolving nature of those needs outstripped the capacity we could build into our available system of machine learning.

The divergence of our goals has led the collaboration to an impasse. Given that we had already outlined further steps in our initial documents that could not be satisfied after the partners identified their divergent intentions, the collaborative scope the partners initially described was not completely fulfilled. The divergence of goals became stark: as researchers interested in the relevance and sustainability of these archives, we were moving the collections toward a more accessible and comprehensive platform with open documentation and protocols for future development. By contrast, the PMSS staff were moving toward more stringent and local controls over access to the archives in order to limit dissemination. At this juncture, we had some negotiating to do. First, we made the ContentDM instance a password-protected and not publicly accessible (private) sandbox rather than a public instance of a virtual digital collection. As PMSS owns the material, they decided shortly thereafter to issue a take-down order for the ContentDM instance, and we complied. As the ContentDM materials were ultimately accessible in the public domain on their live site, this decision revealed how personal the challenges had become. Nothing included in the take-down order was unique or new material—rather, the ContentDM site simply provided a more accessible format for existing primary material on the WordPress site, stripped of its interpretive and secondary contexts.

If there is a silver lining, it lies in this context for use: the "academic divorce" we underwent by discontinuing our collaboration has made it possible for us to continue conducting research on the publicly available archival materials without being obligated to host a live and dynamic repository for further materials. As a result, we can test best approaches without having to worry about pushing them to a live production site. Within this constraint, we aim to continue re-creating the historical social network without compromising our partners' needs for privacy and control of their production site. The mutual decision to terminate further partnership activities based in archival development arose because of these differing paths forward. That decision meant that any further enrichment of the archival materials would not become publicly available, which we saw as a penalty against using the archive at a moment when archives need as much advocacy and visible support as possible.

Under these constraints of private accessibility, we have continued to work on the AWS Rekognition pipeline and have successfully identified all of the faces of named people featured in the archive, with face and name labels now associated with over 1900 unique images. Our next step, delayed to Spring 2021 as a result of the COVID-19 pandemic, includes the creation of an associative network that first identifies unnamed faces in each image using unique identifiers. The second element of that process will be to generate an historical social network using the co-occurrence among those faces as well as the faces of named people in the available images. Given that our metadata enrichment has already included date associations for most of the images, we are confident that we will be able to reconstruct historically specific networks for a given year or range of years, and moreover, that the association between dates and named people will help us to identify further members of the community who are not currently named in the photographs because of the small groups involved in activities and clubs, as well as the generally limited student and teacher populations during any given year.
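A hedged sketch of that second element follows, under the assumption that the face pipeline yields a mapping from image identifiers to face identifiers (named people plus anonymous ids); the mapping below is invented for illustration.

```python
# Building a co-occurrence social network from per-image face identifiers
# with networkx. The data is invented; in practice it would come from the
# face pipeline described above.
import itertools
import networkx as nx

faces_by_image = {
    "img_0001": ["Katherine_Pettit", "face_0173"],
    "img_0002": ["Katherine_Pettit", "Alice_Cobb", "face_0173"],
    "img_0003": ["Alice_Cobb", "face_0212"],
}

G = nx.Graph()
for faces in faces_by_image.values():
    for a, b in itertools.combinations(sorted(set(faces)), 2):
        # Edge weights count how many photographs each pair shares;
        # year-by-year graphs would first filter images by their dates.
        weight = G.get_edge_data(a, b, default={"weight": 0})["weight"]
        G.add_edge(a, b, weight=weight + 1)

print(G.number_of_nodes(), "people;", G.number_of_edges(), "ties")
```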

We are now far more sensitive to how the local concerns of this community shape our research methods and outcomes. The longer-term hope, one it is not at all clear we will be allowed to pursue, would be to use natural language processing tools on the archive's textual materials, particularly named entity recognition and word vectors, to search and match images where known names occur proximate to the names of unmatched faces. The present goal, however, remains to create a more replete and densely connected network of faces and the places they occupied when they were living in the gentle shadows of Pine Mountain. In order to abide by PMSS community wishes for privacy, we will be using anonymized aggregate results without identifying individuals in the photographs. While this method has the drawback of not being able to reveal the complexity of the historical relations at the granular level of individuals, it will allow us to report on the persistence or variation in network metrics, such as network density, centrality, path length, and betweenness measures, among others. In this way, we aim to be able to measure and report on the network and its changes over time without reporting on individuals. We arrived at an anonymizing method as a solution to the dissolved partnership by asking about the constraints of FERPA as well as by looking back at federal and commercial facial recognition practices. In each case, the dark side of these technological tools remains one associated with surveillance and, in the language of Eastern Kentucky, extraction. We mention this not only to be transparent about our recognition of these limitations, but also in the hopes of opening a new dialogue with our partners that might stem from generating interesting discoveries without compromising their sense of the local ownership of their archival materials. Nonetheless, in order to report on the most interesting aspects, the actual people and their local histories of place, the work to be done would remain more at a human level than at a technical one.
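Aggregate-only reporting of the kind described could look like the following networkx sketch, run on the graph G from the previous example. The function name and the selection of summaries are ours, chosen to match the measures named in the text.

```python
# Anonymized, aggregate-only reporting: summary statistics leave the
# analysis, individual identities do not.
import networkx as nx

def aggregate_report(G):
    betweenness = nx.betweenness_centrality(G)
    report = {
        "nodes": G.number_of_nodes(),
        "edges": G.number_of_edges(),
        "density": nx.density(G),
        "mean_betweenness": sum(betweenness.values()) / len(betweenness),
    }
    # Average path length is defined only on connected graphs, so take
    # the largest connected component.
    giant = G.subgraph(max(nx.connected_components(G), key=len))
    report["avg_path_length"] = nx.average_shortest_path_length(giant)
    return report  # no names or per-person scores are released
```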

Conclusion

In conclusion, our project describes a success that remains imbricated with a shortcoming in machine learning. The machine learning tasks and algorithms our project implemented serve a mimetic function in the distilled picture of the community they reflect. By matching historical faces to names, the project embraces a form of digital surrogacy: we have aimed to produce a meta-historical account of the present institution's social and cultural function as a site of social networking and local knowledge transmission. As Robyn Caplan and danah boyd have recently suggested, the "bureaucratic functions" these algorithms promote can be understood by the ways in which they structure users' behaviors (2018, 3). We would like to supplement Caplan and boyd's insight regarding the potential coercions involved in how data structures implicitly shape their contents as well as their users' behaviors. Not only do algorithms promote a kind of bureaucracy, to ends that may be positive and negative, and sometimes both at once, but further, those same structures may reflect or shape public behaviors and interactions beyond a single platform.

As we move between digital and public spheres, our work similarly shifts its scope. The research that we intended to have positive community effects was instead read by that very same set of people as an attempt to displace a community from the center of its own history. In other words, the bureaucratic functions embedded in PMSS as an institution saw our new approach to their storytelling as an unwanted and external intervention. As their response suggests, the internal and extant structures for governing their community, its stories, and the people who tell them, saw our contribution as an effort to co-opt their control. Where we thought we were offering new tools for capturing, discovering, and telling stories, they saw what Safiya Noble has recently characterized in a specifically racialized context as "algorithms of oppression" (2018). Here the oppression would be geographic, socio-economic, and cultural, rather than racial; nevertheless, the perception that one is being oppressed by systems set into place by agents working beyond one's own community remains a shared foundation in Noble's argument and in the unexpected reception of our project. As we move forward with our own project into unknown territories, in which our work products may never see the light of day because of the value conflicts bound up in making archival objects public and accessible, we have found a real and lasting respect for the institutional dependencies and emplacements within which we all do our work. We hope to channel some of those functions of emplacement to create new forms of accountability and restraint that will allow us to move forward, but at least for now, we have found with our project one limitation of machine learning, and it is not the machine.

References

Ahmed, Manan, Maira E. Álvarez, Sylvia A. Fernández, Alex Gil, Rachel Hendery, Moacir P. de Sá Pereira, and Roopika Risam. 2018. "Torn Apart / Separados." Group for Experimental Methods in Humanistic Research. https://xpmethod.plaintext.in/torn-apart/volume/2/.

Bailey, Ronald. 2017. "The Noble, Misguided Plan to Turn Coal Miners Into Coders." Reason, November 25, 2017. https://reason.com/2017/11/25/the-noble-misguided-plan-to-tu/.

Calo, Ryan. 2017. "Artificial Intelligence Policy: A Primer and Roadmap." University of California, Davis Law Review 51: 399–435.

Caplan, Robyn, and danah boyd. 2018. "Isomorphism through algorithm: Institutional dependencies in the case of Facebook." Big Data & Society (January–June): 1–12. https://doi.org/10.1177/2053951718757253.

Cassidy, Frederic G., et al., eds. 1985–2012. Dictionary of American Regional English. Cambridge, MA: Belknap Press. https://www.daredictionary.com.

Ema, Arisa, et al. 2019. "Clarifying Privacy, Property, and Power: Case Study on Value Conflict Between Communities." Proceedings of the IEEE 107, no. 3 (March): 575–80. https://doi.org/10.1109/JPROC.2018.2837045.

Harkins, Anthony, and Meredith McCarroll, eds. 2019. Appalachian Reckoning: A Region Responds to Hillbilly Elegy. Morgantown, WV: West Virginia University Press.

Hochschild, Arlie. 2018. "The Coders of Kentucky." The New York Times, September 21, 2018. https://www.nytimes.com/2018/09/21/opinion/sunday/silicon-valley-tech.html.

Joh, Elizabeth. 2018. "Artificial Intelligence and Policing: First Questions." Seattle University Law Review 41 (4): 1139–44.

Latour, Bruno. 2007. Reassembling the Social: An Introduction to Actor-Network-Theory. New York: Oxford University Press.

Levendowski, Amanda. 2018. "How Copyright Law Can Fix Artificial Intelligence's Implicit Bias Problem." Washington Law Review 93 (2): 579–630.

Mukurtu CMS. n.d. https://mukurtu.org/. Accessed December 12, 2019.

Noble, Safiya. 2018. Algorithms of Oppression: How Search Engines Reinforce Racism. New York: NYU Press.

Passamaquoddy People. n.d. "Passamaquoddy Traditional Knowledge Labels." https://passamaquoddypeople.com/passamaquoddy-traditional-knowledge-labels. Accessed December 12, 2019.

Risam, Roopika. 2015. "Beyond the Margins: Intersectionality and the Digital Humanities." DHQ: Digital Humanities Quarterly 9 (2). http://digitalhumanities.org/dhq/vol/9/2/000208/000208.html.

Robertson, Campbell. 2019. "They Were Promised Coding Jobs in Appalachia. Now They Say It Was a Fraud." The New York Times, May 12, 2019. https://www.nytimes.com/2019/05/12/us/mined-minds-west-virginia-coding.html.

Sabharwal, Anil. 2016. "Moving on from Picasa." Google Photos Blog. Last modified March 26, 2018. https://googlephotos.blogspot.com/2016/02/moving-on-from-picasa.html.

Sabharwal, Arjun. 2015. Digital Curation in the Digital Humanities: Preserving and Promoting Archival and Special Collections. Boston: Chandos.

Stephan, Karl D., Katina Michael, M.G. Michael, Laura Jacob, and Emily P. Anesta. 2012. "Social Implications of Technology: The Past, the Present, and the Future." Proceedings of the IEEE 100, Special Centennial Issue (May): 1752–1781. https://doi.org/10.1109/JPROC.2012.2189919.

United States Department of Justice. 2008. "Guidelines for a Memorandum of Understanding." https://www.justice.gov/sites/default/files/ovw/legacy/2008/10/21/sample-mou.pdf.

United States Department of Justice. 2017. "Sample Memorandum of Understanding." http://www.doj.state.or.us/wp-content/uploads/2017/08/mou_sample_guidelines.pdf.

Vance, J.D. 2016. Hillbilly Elegy: A Memoir of a Family and Culture in Crisis. New York: Harper.

Weizenbaum, Joseph. 1976. Computer Power and Human Reason: From Judgment to Calculation. New York: W.H. Freeman and Co.

Winfield, Alan F., Katina Michael, Jeremy Pitt, and Vanessa Evers. 2019. "Machine Ethics: The Design and Governance of Ethical AI and Autonomous Systems." Proceedings of the IEEE 107, no. 3 (March): 509–17. https://doi.org/10.1109/JPROC.2019.2900622.

Chapter 13

Towards a Chicago place name dataset: From back-of-the-book index to a labeled dataset

Ana Lucic
University of Illinois

John Shanahan
DePaul University

Introduction

Reading Chicago Reading1 is a grant-supported digital humanities project that takes as its object the "One Book One Chicago" (OBOC) program2 of the Chicago Public Library. Since fall 2001, One Book One Chicago has fostered community through reading and discussion. On its "Big Read" website, the Library of Congress includes information about One Book programs around the United States,3 and the American Library Association (ALA) also provides materials with which a library can build its own One Book program and, in this way, bring members of their communities together in a conversation.4 While community reading programs are not a new phenomenon and exist in various formats and sizes, the One Book One Chicago program is notable because of its size (the Chicago Public Library has 81 local branches) as well as its history (the program has been in continual existence for nearly 20 years). Although relatively common, book clubs and community-based reading programs are not regularly assessed as other library programming components are, nor are they subjects of long-term quantitative study.

1. The Reading Chicago Reading project (https://dh.depaul.press/reading-chicago/) gratefully acknowledges the support of the National Endowment for the Humanities Office of Digital Humanities, HathiTrust, and Lyrasis.
2. See https://www.chipublib.org/one-book-one-chicago/.
3. See http://read.gov/resources/.
4. See http://www.ala.org/tools/programming/onebook.

The following research questions have been guiding the Reading Chicago Reading project so far: can we predict the future circulation of a book using a predictive model based on prior circulation, community demographics, and text characteristics? How did different neighborhoods in a diverse but also segregated city respond to particular book choices? Have certain books been more popular than others around the city as measured by branch-level circulation, and can these changes in checkout totals be correlated with CPL outreach work? A related question is the focus of this paper: by associating place names with sentiment scores in Chicago-themed OBOC books, what trends emerge from spatial analysis? Results are still in progress and will be forthcoming in future papers. In the meantime, exploration of these questions, and our attempt to find solutions for some of them, enables us to reflect on some innovative services that libraries can offer. We will discuss this possibility in the last section of this paper.

Chicago as a place name

Thus far, the Reading Chicago Reading project has focused the bulk of its analysis on seven recent OBOC book selections and their respective "seasons" of public outreach programming:

• Fall of 2011: Saul Bellow’s The Adventures of AugieMarch

• Spring of 2012: Yiyun Li’sGold Boy, Emerald Girl

• Fall of 2012: Markus Zusak’s The Book Thief

• 2013–2014: Isabel Wilkerson’s TheWarmth of Other Suns

• 2014 – 2015: Michael Chabon’s The Amazing Adventures of Kavalier and Clay

• 2015 – 2016: Thomas Dyja’s The Third Coast

• 2016 – 2017: Barbara Kingsolver’s Animal VegetableMiracle: A Year of Food Life

All of the works listed above, spanning categories of fiction and non-fiction, are still in copyright. Of the seven works, three were categorized as Chicago-themed because they take place in the Chicago area in whole or in substantial part: Saul Bellow's The Adventures of Augie March, Isabel Wilkerson's The Warmth of Other Suns, and Thomas Dyja's The Third Coast.

As part of the ongoing work of the Reading Chicago Reading project, we used the secure data portal of the HathiTrust Research Consortium to access and pre-process the in-copyright novels in our set. The HathiTrust research portal permits the extraction of non-consumptive features of the works included in the digital library, even those that are still under copyright. Non-consumptive features do not violate copyright restrictions because they do not allow the regular reading ("consumption") or digital reconstruction of the full work in question. An example of a non-consumptive feature is part-of-speech information extracted in aggregate, with or without connection to its source words. Location words (i.e., place names) in the text are another example of a non-consumptive feature as long as we do not aim to extract locations with the surrounding context: that is, while the extraction of a location word alone from a work under copyright will not violate copyright law, the extraction of the location word with its surrounding context (a fixed-size "window" of words that surrounds the location word) might do so. Similarly, the sentiment of a sentence also falls under the category of a "non-consumptive" feature as long as we do not extract both the entire sentence and its sentiment score. Using these methods, it was possible to utilize the HathiTrust research portal to access and also extract the location words as well as the sentiment of individual sentences from copyrighted works. As later paragraphs will reveal, however, we also needed to verify the accuracy of these extractions, which was done manually by checking the extracted references against the actual text of the work.
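The project's extraction ran inside the HathiTrust secure portal with the Stanford tools; as an illustration of the principle that only (place, sentiment) pairs leave the analysis, here is a hedged sketch that substitutes spaCy for named entity recognition and NLTK's VADER for sentence sentiment, with a tiny gazetteer standing in for a LinkedGeoData-derived list.

```python
# Hedged sketch of non-consumptive (place, sentiment) extraction. The tools
# and gazetteer are stand-ins, not the project's actual stack. Sentences are
# discarded after scoring, as the non-consumptive rules require.
import spacy
from nltk.sentiment import SentimentIntensityAnalyzer  # needs the vader_lexicon data

nlp = spacy.load("en_core_web_sm")
sia = SentimentIntensityAnalyzer()
CHICAGO_GAZETTEER = {"Hyde Park", "State Street", "Wrigley Field"}  # assumed list

def place_sentiments(full_text):
    pairs = []
    for sent in nlp(full_text).sents:
        score = sia.polarity_scores(sent.text)["compound"]
        for ent in sent.ents:
            # Each place "inherits" the sentiment score of its sentence.
            if ent.label_ in ("GPE", "LOC", "FAC") and ent.text in CHICAGO_GAZETTEER:
                pairs.append((ent.text, score))
    return pairs  # exportable: no sentence text, no surrounding context
```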

This paper arises from the finding that the three OBOC books that are set largely in or are about Chicago circulated differently than the OBOC books that are not (i.e., Markus Zusak's The Book Thief, Yiyun Li's Gold Boy, Emerald Girl, Barbara Kingsolver's Animal, Vegetable, Miracle, and Michael Chabon's The Amazing Adventures of Kavalier and Clay). Since one of the findings was that some CPL branches had higher circulation for "Chicago" OBOC books than others in the program, we wanted to (1) determine which place names were featured in the three books and (2) quantify and examine the sentiment associated with these places. Although recognizing a well-defined place name in a text by automated means is no longer a difficult task thanks to the development of named entity recognizers such as the Stanford Named Entity Recognizer,5 OpenNLP,6 spaCy,7 and NLTK,8 recognizing whether a place name is a reference to a Chicago location is a harder task. If Chicago is the setting or one of the main topics of the book, then we can assume that a number of locations mentioned will also be Chicago place names. However, if information about the topicality or locality of the book is not known in advance, or if the plot in the book moves from location to location, then the task of verifying through automated methods whether a place name is a Chicago location is much harder.

5. See https://nlp.stanford.edu/software/CRF-NER.html.
6. See https://opennlp.apache.org/.
7. See https://spacy.io/.
8. See https://www.nltk.org/book/ch07.html.

With the help of LinkedGeoData9 we were able to obtain all of the Chicago place names identified by volunteers through the OpenStreetMap project10 and then download a listing that included Chicago buildings, theaters, restaurants, streets, and other prominent places. While this is very useful, we also realized that we were missing historical Chicago place names with this approach. At the same time, the way that place names are represented in a text will likely not always correspond to the way a place name is formally represented in a dictionary, database, or knowledge graph. For example, a sentence might simply use an anaphoric reference such as "that building" or "her home" instead of directly naming the entity known from other sentences. Moreover, there were many examples of generic place names: how many cities in the United States have a State Street, a Madison Street, or a 1st Avenue, and the like? A further hindrance was determining the type of place names we wanted to identify and collect from the text's total set of location word tokens: it soon became obvious that for the purposes of visualizing a place name on the map, general references to Chicago went beyond the scope of the maps we wanted to create. We became more interested in tracking references to specific Chicago place names that included buildings (historical and present), named areas of the city, monuments, streets, theatres, restaurants, and the like. Given that our total dataset for this task comprised just three books, we were able to manually sift through the automatically identified place names and verify whether they were indeed a Chicago place name or not. We also established the sentiment of each location-bearing sentence in the three books using the Stanford Sentiment Analyzer.11 Our guiding principle was that the specific place(s) mentioned in the sentence "inherit" the sentiment score of the entire sentence. This principle may not always be true, but our manual inspection of the sentiment assigned to sentences, and therefore to locations mentioned in the sentences, established that this was a fairly accurate estimate: the sentiment score of the entire sentence is at the very least connected to, or "resonates" with, the individual components of the sentence, including place names. While we did examine some samples, we did not conduct a qualitative analysis of the accuracy of the sentiment scores assigned to the corpus.

9. See http://linkedgeodata.org/About.
10. See https://www.openstreetmap.org/.
11. See https://nlp.stanford.edu/sentiment/.

Figure 13.1 documents an example of the results of our effort to integrate place names with the sentiment of the sentence.

Figure 13.1: Mapping place names associated with positive (top row) and very negative (bottom row) sentiment extracted from three OBOC books.

Particularly notable in Figure 13.1 is The Third Coast (right column), which shows a concentration of positively-associated Chicago place names in the northern parts of the city along the shore of Lake Michigan. Negative sentiment, by contrast, appears to be more concentrated in the central part of Chicago and also in the southern parts of the city.

The place names extracted from our three Chicago-setting OBOC books allowed us to focus on particular areas of the city such as Hyde Park on the South Side, which is mentioned in each of them. Larger circles correspond to a greater number of sentences that mention Hyde Park and are associated with a negative sentiment in both The Adventures of Augie March and The Warmth of Other Suns. As the maps in Figure 13.2 indicate, on the other hand, The Third Coast features sentences in which Hyde Park is mentioned in both positive and negative contexts.

Figure 13.2: Mapping of sentences that feature "Hyde Park," and their sentiment, from three OBOC program books.

These results prompt us to continue with this line of research and to procure a larger "control" set of texts with Chicago place names and sentiment scores. This would allow us to focus on specific places such as "Wrigley Field" or the once-famous but no longer existing "Mecca" apartment building (which stood at the intersection of 34th and State Street on the South Side and was immortalized in a 1968 poetry collection by Gwendolyn Brooks). With a robust place name dataset, we could analyze the context in which these place names were mentioned in other literature, in contemporary or historical newspapers (Chicago Tribune, Chicago Sun-Times, Chicago Defender), or in library and archival materials. Promising contextual elements would include the sentiment associated with the place name.

Our interest in creating a dataset of Chicago place names extracted from literature led us to The Chicago of Fiction, a vast annotated bibliography by James A. Kaser. Published in 2011, this work contains entries on more than 1,200 works published between 1852 and 1980 that feature Chicago. Kaser's book contains several indexes that can serve as sources of labeled data or instances in which Chicago locations are mentioned. Although we are still determining how many of the titles included in the annotated bibliography already exist in digital format or are accessible through the HathiTrust digital library, it is likely that a subset of the total can be accessed electronically. Even if the books do not exist in electronic format presently, it is still possible to use the index as a source of already-labeled data for Chicago place names. We anticipate that such a dataset would be of interest to researchers in Urban Studies, Literature, History, and Geography. A sufficiently large number of sentences featuring Chicago place names would enable us to proceed in the direction of a Chicago place name recognizer that can "learn" Chicago context or examine how much context is sufficient to establish whether, for instance, a "Madison Street" place name in a text is located in Chicago or elsewhere.
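One plausible first step, sketched below under stated assumptions, is to seed a recognizer with gazetteer patterns drawn from index entries using spaCy's EntityRuler. The three sample patterns and the CHI_PLACE label are ours; disambiguating generic names like "Madison Street" would still require a statistical model trained on labeled sentences of the kind the index can supply.

```python
# Hedged sketch: a rule-based Chicago place name tagger seeded from a
# (hypothetical) gazetteer of index entries, via spaCy's EntityRuler.
import spacy

nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "CHI_PLACE", "pattern": "Hyde Park"},
    {"label": "CHI_PLACE", "pattern": "Wrigley Field"},
    {"label": "CHI_PLACE", "pattern": [{"LOWER": "mecca"}, {"LOWER": "building"}]},
])

doc = nlp("The Mecca building stood a short ride from Hyde Park.")
print([(ent.text, ent.label_) for ent in doc.ents])
```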

How do libraries innovate? From print index to labeled data

Over the last decade, libraries have pioneered services related to the development and preservation of digital scholarship projects. Librarians frequently assist faculty and students with the development of digital humanities and digital scholarship projects. They point patrons to resources and portals where they can find data and help with licensing. Librarians also procure datasets, and some perform data cleaning and pre-processing tasks. And yet it is still not that common for librarians to participate in the creation of a dataset. A relatively recent initiative, however, Collections as Data,12 directly tackles the issue of treating research, library, and cultural heritage collections as data and providing access to them. This ongoing initiative aims to create 12 projects that can serve as a model to other libraries for making collections accessible as data.

The data that undergird the mechanisms of library workings—circulation records for physical and digital objects, metadata records, and the like—are not commonly available as datasets open to machine learning tasks. If they were, not only could libraries refer others to the already created and annotated physical and digital objects, but they could also participate in creating objects that are local to their settings. Creation and curation of such datasets could in turn help establish new relationships between area libraries and local communities. One can imagine a "data challenge," for instance, in which libraries assemble a community by building a dataset relevant to that community. Such an effort would need to be preceded by assessment of the data needs and interests of that particular community. In the case of a Chicago place name dataset challenge, efforts could revolve around local communities adding sentences to the dataset from literary sources. A second step might involve organizing a crowdsourced data challenge to build a place name recognizer model (e.g., a Chicago place name recognizer model) based on the sentences gathered.

One can also imagine turning metadata records into curated datasets that are shared with local communities and with teachers and university lecturers for use in the classroom. Once a dataset is built, scenarios can be invented for using it. This kind of work invites conversations with faculty members about their needs and about potential datasets that would be of particular interest. Creation of datasets based on unique materials at their disposal will enrich the palette of services already offered by libraries.

12. See https://collectionsasdata.github.io/part2whole/.


One of the main goals of the Reading Chicago Reading project was the creation of a model that can predict the circulation of a One Book One Chicago program book selection given parameters such as prior circulation for the book, its text characteristics, and the geographical locality of the work. We are not aware of other predictive models that integrate circulation records with text features extracted from the books in this way. Given that circulation records are not commonly integrated with other data sources when they are analyzed, linking different data sources with circulation records is another challenging opportunity that this paper envisions.
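The shape of such a model, though not the project's actual features or results, can be sketched in a few lines of scikit-learn; every number below is invented for illustration.

```python
# Hedged sketch of a circulation-prediction model: prior checkouts, a text
# characteristic, and a Chicago-theme flag predicting season checkouts.
# All data is hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Columns: prior annual checkouts, mean sentence sentiment, Chicago-set (0/1)
X = np.array([
    [420, 0.05, 1],
    [310, -0.10, 0],
    [505, 0.12, 1],
    [150, 0.02, 0],
    [275, -0.04, 0],
    [390, 0.08, 1],
])
y = np.array([480, 260, 610, 140, 250, 455])  # checkouts during the OBOC season

model = RandomForestRegressor(n_estimators=200, random_state=0)
print(cross_val_score(model, X, y, cv=3, scoring="r2"))
```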

Ultimately, libraries can play a dynamic role in both managing and creating data and datasets that can be shared with the members of local communities. Using back-of-the-book indexes as a source of labeled place name data is an approach that we have begun to prototype but that still requires further exploration and troubleshooting. While organizing a data challenge takes a lot of effort, a data challenge can be an effective way of reaching out to one's local community and identifying their data needs. To this end, we aim to make freely available our curated list of sentences and associated sentiment scores for Chicago place names in the three OBOC selections centered on Chicago. We will invite scholars and the general public to add more Chicago location sentences extracted from other literature. Our end goal is a labeled training dataset for the creation of a Chicago place name recognizer, which, we hope, will enable new avenues of research.

References

American Library Association. n.d. "One Book One Community." Programming & Exhibitions (website). Accessed May 31, 2020. http://www.ala.org/tools/programming/onebook.

Bird, Steven, Edward Loper, and Ewan Klein. 2009. Natural Language Processing with Python. Sebastopol, CA: O'Reilly Media Inc.

Chicago Public Library. n.d. "One Book One Chicago." Accessed May 31, 2020. https://www.chipublib.org/one-book-one-chicago/.

"Collections as Data: Part to Whole." n.d. Accessed May 31, 2020. https://collectionsasdata.github.io/part2whole/.

Finkel, Jenny Rose, Trond Grenager, and Christopher Manning. 2005. "Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling." In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), 363–370. https://www.aclweb.org/anthology/P05-1045/.

HathiTrust Digital Library. n.d. Accessed May 31, 2020. https://www.hathitrust.org/.

Kaser, James A. 2011. The Chicago of Fiction: A Resource Guide. Lanham: Scarecrow Press.

Library of Congress. n.d. "Local/Community Resources." Read.gov. Accessed May 31, 2020. http://read.gov/resources/.

LinkedGeoData. n.d. "About / News." Accessed May 31, 2020. http://linkedgeodata.org/About.

Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. "The Stanford CoreNLP Natural Language Processing Toolkit." In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 55–60. https://www.aclweb.org/anthology/P14-5010/.

OpenStreetMap. n.d. Accessed May 31, 2020. https://www.openstreetmap.org/.

Reading Chicago Reading. n.d. "About Reading Chicago Reading." Accessed May 31, 2020. https://dh.depaul.press/reading-chicago/about/.

Chapter 14

Can a Hammer Categorize Highly Technical Articles?

Samuel Hansen
University of Michigan

When everything looks like a nail...

I was sure I had the most brilliant research project idea for my course in Digital Scholarship techniques. I would use the Mathematical Subject Classification (MSC) values assigned to the publications in MathSciNet1 to create a temporal citation network which would allow me to visualize how new mathematical subfields were created and perhaps even predict them while they were still in their infancy. I thought it would be an easy enough project. I already knew how to analyze network data and the data I needed already existed; I just had to get my hands on it. I even sold a couple of my fellow coursemates on the idea and they agreed to work with me. Of course nothing is as easy as that, and numerous requests for data went without response. Even after I reached out to personal contacts at MathSciNet, we came to understand we would not be getting the MSC data the entire project relied upon. Not that we were going to let a little setback like not having the necessary data stop us.

1. See https://mathscinet.ams.org/.

After all, this was early 2018 and there had already been years of stories about how artificial intelligence, machine learning in particular, was going to revolutionize every aspect of our world (Kelly 2014; Clark 2015; Parloff 2016; Sangwani 2017; Tank 2017). All the coverage made it seem like AI was not only a tool with as many applications as a hammer, but that it also magically turned all problems into nails. While none of us were AI experts, we knew that machine learning was supposed to be good at classification and categorization. The promise seemed to be that if you had stacks of data, a machine learning algorithm could dive in, find the needles, and arrange them into neatly divided piles of similar sharpness and length. Not only that, but there were pre-built tools that made it so almost anyone could do it. For a group of people whose project was on life support because we could not get the categorization data we needed, machine learning began to look like our only potential savior. So, machine learning is what we used.

I will not go too deep into the actual process, but I will give a brief outline of the techniques we employed. Machine-learning-based categorization needs data to classify, which in our case were mathematics publications. While this can be done with titles and abstracts, we wanted to provide the machine with as much data as we could, so we decided to work with full-text articles. Since we were at the University of Wisconsin at the time, we were able to connect with the team behind GeoDeepDive,2 who have agreements with many publishers to provide the full text of articles for text and data mining research (“GeoDeepDive: Project Overview” n.d.). GeoDeepDive provided us with the full text of 22,397 mathematics articles, which we used as our corpus. In order to classify these articles, which were already pre-processed by GeoDeepDive with CoreNLP,3 we first used the Python package Gensim4 to process the articles into a Python-friendly format and to remove stopwords. Then we randomly sampled 1/3 of the corpus to create a topic model using the MALLET5 topic modeling tool. Finally, we applied the model to the remaining articles in our corpus. We then coded the words within the generated topics to subfields within mathematics and used those codes to assign each article a subfield category. In order to make sure our results were not just a one-off, we repeated this process multiple times and checked for variance in the results. There was none; the results were uniformly poor.
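For the curious, the pipeline can be sketched in a few lines of Python. This is not our original code, only a minimal reconstruction of the steps just described; it assumes gensim 3.x (whose wrappers module still included LdaMallet), a local MALLET installation, and a hypothetical list of full-text strings called docs, and the topic count of 50 is an illustrative guess rather than our actual setting.

import random
from gensim import corpora
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.models.wrappers import LdaMallet

# Tokenize each article and drop stopwords (`docs` is a hypothetical input).
tokenized = [[t for t in simple_preprocess(d) if t not in STOPWORDS] for d in docs]
dictionary = corpora.Dictionary(tokenized)
bows = [dictionary.doc2bow(doc) for doc in tokenized]

# Train a topic model on a random third of the corpus with MALLET ...
sample = random.sample(bows, len(bows) // 3)
model = LdaMallet("/path/to/mallet", corpus=sample, num_topics=50, id2word=dictionary)

# ... then apply it to every article, keeping the dominant topic,
# which we later hand-coded to a mathematical subfield.
dominant = [max(model[bow], key=lambda pair: pair[1])[0] for bow in bows]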

To say the results were uniformly poor might not be entirely fair. There were interesting aspects to the results of the topic modeling, but when it came to categorization they were useless. Of the subfield codes assigned to articles, only two were ever the dominant result for any given article: Graph Theory and Undefined, and even that does not tell the whole story, as Undefined was the runaway winner in the article classification race, with more than 70% of articles classified as Undefined in each run, including one in which it hit 95%. The topics generated by MALLET were often plagued by gibberish caused by equations in the mathematics articles, and there was at least one topic in each run that was filled with the names of months and locations. Add to this that the technical language of mathematics is filled with words that have non-technical definitions (for example, map or space), or words which have their own subfield-specific meanings (such as homomorphism or degree), both of which frustrate attempts to code a subfield. These issues help make it clear why so many articles ended up as “Undefined.” Even for the one subfield with a vocabulary unique enough for our topic model to partially identify, Graph Theory, the results were marginally positive at best. We were able to obtain Mathematical Subject Classification (MSC) values for around 10% of our corpus. When we compared the articles we categorized as Graph Theory to the articles which had been assigned the MSC value for Graph Theory (05Cxx), we found we had a textbook recall-versus-precision problem. We could either correctly categorize nearly all of the Graph Theory articles at the cost of a very high rate of false positives (high recall and low precision), or we could almost never incorrectly categorize an article as Graph Theory but miss over 30% that we should have categorized as Graph Theory (high precision and low recall).
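For readers who have not met these measures before, both are defined over counts of true positives (TP), false positives (FP), and false negatives (FN), and the F1 scores reported later in this chapter combine the two:

\[
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2\cdot\frac{\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}}
\]

A “microaveraged” F1 simply pools the TP, FP, and FN counts across all categories before computing the two ratios, so large categories dominate the score.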

Needless to say, we were not able to create the temporal subfield network I had imagined. While we could reasonably claim that we learned very interesting things about the language of mathematics and its subfields, we could not claim we even came close to automatically categorizing mathematics articles. When we had to report back on our work at the end of the course, our main result was that basic, off-the-shelf topic modeling does not work well when it comes to highly technical articles from subjects like mathematics. It was also a welcome lesson in not believing the hype of machine learning, even when a problem looks exactly like the kind machine learning was supposed to excel at solving. While we had a hammer and our problem looked like a nail, it seemed that the former was a ball peen and the latter a railroad tie. In the end, even in the land of hammers and nails, the tool has to match the task. Then again, we were dilettantes in the world of machine learning, and I believe our project is a good example of how machine learning is still a long way from being the magic tool that some, though not all (Rahimi and Recht 2017), have portrayed it to be. Let us look at what happens when smarter and more capable minds tackle the problem of classifying mathematics and other highly technical subjects using advanced machine learning techniques.

2 See https://geodeepdive.org/.
3 See https://stanfordnlp.github.io/CoreNLP/.
4 See https://radimrehurek.com/gensim/.
5 See http://mallet.cs.umass.edu/topics.php.



Finding the Right Hammer

To illustrate the quest to find the right hammer, I am going to focus on three different projects that tackled the automated categorization of highly technical content: two which, like ours, attempted to categorize mathematical content, and one that looked to categorize scholarly works in general. These three projects provide examples of many of the approaches and practices employed by experts in automated classification, and demonstrate the two main paths that these types of projects follow to accomplish their goals. Since we have been discussing mathematics, let us start with those two projects.

Both projects began because the participants were struggling to categorize mathematics publications so they would be properly indexed and searchable in digital mathematics databases: the Czech Digital Mathematics Library (DML-CZ)6 and NUMDAM7 in the case of Radim Řehůřek and Petr Sojka (Řehůřek and Sojka 2008), and Zentralblatt MATH (zbMath)8 in the case of Simon Barthel, Sascha Tönnies, and Wolf-Tilo Balke (Barthel, Tönnies, and Balke 2013). All of these databases rely on the aforementioned MSC9 to aid in indexing and retrieval, and so their goal was to automate the assignment of MSC values to lower the time and labor cost of requiring humans to do this task. The main differences between their tasks related to the number of documents they were working with (thousands for Řehůřek and Sojka, millions for Barthel, Tönnies, and Balke), how much of each work was available (full text for Řehůřek and Sojka; titles, authors, and abstracts for Barthel, Tönnies, and Balke), and the quality of the data (mostly OCR scans for Řehůřek and Sojka, mostly TeX for Barthel, Tönnies, and Balke). Even with these differences, both projects took a similar approach, and it is the first of the two main pathways toward classification I spoke of earlier: using a predetermined taxonomy and a set of pre-categorized data to build a machine learning categorizer.

In the end, while both projects determined that the use of Support Vector Machines (Gandhi 2018)10 provided the best categorization results, their implementations were different.

6 See https://dml.cz/.
7 See http://www.numdam.org/.
8 See https://zbmath.org/.
9 Mathematical Subject Classification (MSC) values in MathSciNet and zbMath are a particularly interesting categorization set to work with, as they are assigned and reviewed by a subject area expert editor and an active researcher in the same, or a closely related, subfield as the article's content before they are published. This multi-step process of review yields a built-in accuracy check for the categorization.
10 Support Vector Machines (SVMs) are machine learning models which are trained using a pre-classified corpus to split a vector space into a set of differentiated areas (or categories) and then attempt to classify new items by where in the vector space the trained model places them. For a more in-depth, technical explanation, see https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47.


The Řehůřek and Sojka SVMs were trained with terms weighted using augmented term frequency11 and a dynamic decision threshold12 selected using s-cut13 (Řehůřek and Sojka 2008, 549), and Barthel, Tönnies, and Balke's with term weighting using term frequency–inverse document frequency14 and Euclidean normalization15 (Barthel, Tönnies, and Balke 2013, 88), but the main difference was how they handled formulae. In particular, the Barthel, Tönnies, and Balke group split their corpus into words and formulae and mapped them to separate vectors, which were then merged together into a combined vector used for categorization. Řehůřek and Sojka did not differentiate between words and formulae in their corpus, though they did note that their OCR scans' poor handling of formulae could have hindered their results (Řehůřek and Sojka 2008, 555). In the end, not having the ability to handle formulae separately did not seem to matter, as Řehůřek and Sojka claimed microaveraged F1 scores of 89.03% (Řehůřek and Sojka 2008, 549) when classifying the top-level MSC category with their best performing SVM. When this is compared to the microaveraged F1 of 67.3% obtained by Barthel, Tönnies, and Balke (Barthel, Tönnies, and Balke 2013, 88), it would seem that either Řehůřek and Sojka's implementation of SVMs or their access to full text led to a clear advantage. This advantage becomes less clear when one takes into account that Řehůřek and Sojka were only working with top-level MSC categories for which they had at least 30 articles (60 in the case of their best result), and their limited corpus meant that many top-level MSC categories would not have been included. Looking at the work done by Barthel, Tönnies, and Balke makes it clear that these less common MSC categories, such as K-Theory or Potential Theory, for which they achieved microaveraged F1 measures of 18.2% and 24% respectively, have a large impact on the overall effectiveness of the automated categorization. Remember, this is only for the top level of MSC codes, and the work of Barthel, Tönnies, and Balke suggests results would only get worse when trying to apply the second and third levels for full MSC categorization to these less common categories. This leads me to believe that in the case of categorizing highly technical mathematical works to an existing taxonomy, people have come close to identifying the overall size of the machine learning hammer, but are still a long way away from finding the right match for the categorization nail.
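Neither paper's exact setup is reproduced here, but the general recipe (term weighting, Euclidean normalization, a linear SVM, microaveraged F1) is easy to sketch with scikit-learn. Everything in this sketch is an assumption for illustration: the variable names texts and labels, the 90/10 split, and the default hyperparameters.

# A hedged sketch of tf-idf weighting + L2 (Euclidean) normalization +
# a linear SVM, in the spirit of Barthel, Tönnies, and Balke's setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# `texts` holds title+abstract strings, `labels` top-level MSC codes (assumed).
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.1)

# TfidfVectorizer applies Euclidean normalization via norm="l2" (its default).
classifier = make_pipeline(TfidfVectorizer(norm="l2"), LinearSVC())
classifier.fit(X_train, y_train)

# Microaveraged F1 pools counts across every category, as in both papers.
print(f1_score(y_test, classifier.predict(X_test), average="micro"))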

Now let us shift from mathematics-specific categorization to subject categorization in general, and look at the work Microsoft has done assigning Fields of Study (FoS) in the Microsoft Academic Graph (MAG), which is used to create their Microsoft Academic article search product.16 While the MAG FoS project is also attempting to categorize articles for proper indexing and search, it represents the second path taken by automated categorization projects: using machine learning techniques both to create the taxonomy and to classify.


11 Augmented term frequency refers to the number of times a term occurs in the document divided by the number of times the most frequently occurring term appears in the document.
12 The decision threshold is the cut-off for how close to a category the SVM must determine an item to be in order for it to be assigned that category. Řehůřek and Sojka's work varied this threshold dynamically.
13 Score-based local optimization, or s-cut, allows a machine-learning model to set different thresholds for each category, with an emphasis on local (per-category) rather than global performance.
14 Term frequency–inverse document frequency weights a term according to how frequently it occurs across the corpus. A term which occurs rarely across the corpus but with a high frequency within a single document will have a higher weight when classifying the document in question.
15 A Euclidean norm provides the distance from the origin to a point in an n-dimensional space. It is calculated by taking the square root of the sum of the squares of all coordinate values.

16 See https://academic.microsoft.com/.
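Stated compactly, writing f(t, d) for the number of times term t occurs in document d, N for the number of documents, and n_t for the number of documents containing t, the weightings in footnotes 11, 14, and 15 take the standard forms (tf-idf has several common variants; this is the usual one):

\[
\mathrm{tf}_{\mathrm{aug}}(t,d) = \frac{f(t,d)}{\max_{t'} f(t',d)}, \qquad
\mathrm{tfidf}(t,d) = f(t,d)\cdot\log\frac{N}{n_t}, \qquad
\lVert \mathbf{x} \rVert_2 = \sqrt{\textstyle\sum_i x_i^2},
\]

with a document vector \(\mathbf{x}\) Euclidean-normalized by dividing each coordinate by \(\lVert \mathbf{x} \rVert_2\).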


Microsoft took a unique approach in the development of their taxonomy. Instead of relying on the corpus of articles in the MAG to develop it, they relied primarily on Wikipedia for its creation. They generated an initial seed by referencing the Science-Metrix classification scheme17 and a couple thousand FoS Wikipedia articles they identified internally. They then used an iterative process to identify more FoS in Wikipedia, based on whether candidate articles were linked to Wikipedia articles already identified as FoS and whether they represented valid entity types—e.g., an entity type of protein would be added and an entity type of person would be excluded (Shen, Ma, and Wang 2018, 3). This work allowed Microsoft to develop a list of more than 200,000 Fields of Study for use as categories in the MAG.

Microsoft then used machine learning techniques to apply these FoS to their corpus of over 140 million academic articles. The specific techniques are not as clear as they were with the previous examples, likely due to Microsoft protecting their methods from competitors, but the article their researchers published to the arXiv (Shen, Ma, and Wang 2018) and the write-up on the MAG website do make it clear they used vector-based convolutional neural networks which relied on Skip-gram (Mikolov et al. 2013) embeddings and bag-of-words/entities features to create their vectors (“Microsoft Academic Increases Power of Semantic Search by Adding More Fields of Study” 2018). One really interesting part of the machine learning method used by Microsoft was that it did not rely only on information from the article being categorized. It also utilized citation information about the article in the MAG (both citations to it and references from it), and used the FoS assigned to those citations and references to influence the FoS of the original article.
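Microsoft's actual implementation is not public, but the embedding ingredient is familiar. As a rough illustration only, skip-gram embeddings of the kind cited above can be trained with gensim (again assuming gensim 3.x), and a crude document vector built by averaging them; the real system feeds such embeddings into a convolutional network rather than averaging.

# Illustration only, not Microsoft's code: train skip-gram (sg=1) word
# embeddings and average them into a simple document vector.
import numpy as np
from gensim.models import Word2Vec

# `tokenized` is a list of token lists, one per article (an assumption).
w2v = Word2Vec(sentences=tokenized, sg=1, size=100, window=5, min_count=5)

def doc_vector(tokens):
    # Average the embeddings of every in-vocabulary token.
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(100)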

The identification of potential FoS and their assignment to articles was only a part of Microsoft's purpose. In order to fully index the MAG and make it searchable, they also wished to determine the relationships between the FoS; in other words, they wanted to build a hierarchical taxonomy. To achieve this they used the article categorizations and defined a Field of Study A as the parent of B if the articles categorized as B were close to a subset of the articles categorized as A (a more formal definition can be found in Shen, Ma, and Wang 2018, 4). This work, which created a six-level hierarchy, was mostly automated, but Microsoft did inspect and manually adjust the relationships between FoS on the highest two levels.
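The subsumption idea is simple enough to sketch. The function below is my paraphrase, with an arbitrary threshold and assumed nonempty article sets, not the exact definition given in Shen, Ma, and Wang (2018, 4):

# Hedged paraphrase of the parent-child heuristic: A is treated as a parent
# of B when most of B's articles also carry A, but not the other way around.
def is_parent(articles_a: set, articles_b: set, threshold: float = 0.8) -> bool:
    overlap = len(articles_a & articles_b)
    return (overlap / len(articles_b) > threshold
            and overlap / len(articles_a) <= threshold)

Run over all pairs of FoS, a rule like this yields a directed hierarchy, which Microsoft then inspected and hand-adjusted at the top two levels.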

To evaluate the quality of their FoS taxonomy and categorization work, Microsoft randomly sampled data at each of the three steps of the project and used human judges to assess their accuracy. These accuracy assessments are not as complete as those for the mathematics categorization, which were evaluated against whole pre-categorized data sets, but the projects are of very different scales, so different methods are appropriate. In the end Microsoft estimates the accuracy of the FoS at 94.75%, the article categorization at 81.2%, and the hierarchy at 78% (Shen, Ma, and Wang 2018, 5). Since MSC was created by humans there is no meaningful way to compare the FoS accuracy measurements, but the categorization accuracy falls somewhere between that of the two mathematics projects. This is a very impressive result, especially when the aforementioned scale is taken into account. Instead of trying to replace the work of humans categorizing mathematics articles indexed in a database, which for 2018 was 120,324 items in MathSciNet18 and 97,819 in zbMath,19 the FoS project is trying to replace the human categorization of all items indexed in MAG, which was 10,616,601 in 2018.20
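To put that scale difference in rough numbers:

\[
\frac{10{,}616{,}601}{120{,}324 + 97{,}819} \approx 48.7,
\]

that is, MAG indexed nearly fifty times as many items in 2018 as the two mathematics databases combined.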

17 See http://science-metrix.com/?q=en/classification.
18 See https://mathscinet.ams.org/mathscinet/search/publications.html?dr=pubyear&yrop=eq&arg3=2018.
19 See https://zbmath.org/?q=py%3A2018.
20 See https://academic.microsoft.com/publications/33923547.


Both zbMath and MathSciNet were capable of providing the human labor to do the work of assigning MSC values to the mathematics articles they indexed in 2018.21 Therefore using an automated categorization, which at best could only get the top level right with roughly 90% accuracy, was not the right approach. On the other hand, it seems clear that no one could feasibly provide the human labor to categorize all articles indexed by MAG in 2018, so an 80% accurate categorization is a significant accomplishment. To go back to the nail and hammer analogy, Microsoft may have used a sledgehammer, but they were hammering a rather giant nail.

Are You Sure it’s a Nail?

I started this chapter talking about how we have all been told that AI and machine learning were going to revolutionize everything in the world; that they were the hammers and all the world's problems were nails. I found that this was not the case when we tried to employ machine learning, in an admittedly rather naive fashion, to automatically categorize mathematical articles. From the other examples I included, it is also clear computational experts find the automatic categorization of highly technical content a hard problem to tackle, one where success is very much dependent on what it is being measured against. In the case of classifying mathematics, machine learning can do a decent job, but not one good enough to compete with humans. In the case of classifying everything, scale gives machines an edge, as long as you have the computational power and knowledge wielded by a company like Microsoft.

This collection is about the intersection of AI, machine learning, deep learning, and libraries. While there are definitely problems in libraries where these techniques will be the answer, I think it is important to pause and consider whether artificial intelligence techniques are the best approach before trying to use them. Libraries, even those like the one I work in, which are lucky enough to boast of incredibly talented IT departments, do not tend to have access to a large amount of unused computational power or numerous experts in bleeding-edge AI. They are also rather notoriously limited budget-wise, and would likely have to decide between existing budget items and developing an in-house machine learning program. Those realities, combined with the legitimate questions which can be raised about the efficacy of machine learning and AI with respect to the types of problems a library may encounter, such as categorizing the contents of highly technical articles, make me worry. While there will be many cases where using AI makes sense, I want to be sure libraries are asking themselves a lot of questions before starting to use it. Questions like: is this problem large enough in scale to substitute machines for human labor, given that machines will likely be less accurate? Or: will using machines to solve this problem cost us more in equipment and highly technical staff than our current solution, and has that calculation factored in the people and services a library may need to cut to afford them? Or: does the data we have to train a machine contain bias, and would it therefore produce a biased model which will only serve to perpetuate existing inequities and systemic oppression? Not to mention: is this really a problem, or are we just looking for a way to employ machine learning so we can say that we did? Where these questions have satisfactory answers, it will make sense for libraries to employ machine learning. I just want libraries to look really carefully at how they approach problems and solutions, to make sure that their problem is, in fact, a nail, and then to look even closer and make sure it is the type of nail a machine-learning hammer can hit.

21 When an article is indexed by MathSciNet it receives initial MSC values from a subject area editor, who then passes the article along to an external expert reviewer, who suggests new MSC values, completes partial values, and provides potential corrections to the MSC values assigned by the editors (“Mathematical Reviews Guide for Reviewers” 2015); the subject area editors then make the final determination in order to make sure internal styles are followed. zbMath follows a similar procedure.



References

Barthel, Simon, Sascha Tönnies, and Wolf-Tilo Balke. 2013. “Large-Scale Experiments for Mathematical Document Classification.” In Digital Libraries: Social Media and Community Networks, edited by Shalini R. Urs, Jin-Cheon Na, and George Buchanan, 83–92. Cham: Springer International Publishing.

Clark, Jack. 2015. “Why 2015 Was a Breakthrough Year in Artificial Intelligence.” Bloomberg, December 8, 2015. https://www.bloomberg.com/news/articles/2015-12-08/why-2015-was-a-breakthrough-year-in-artificial-intelligence.

Gandhi, Rohith. 2018. “Support Vector Machine—Introduction to Machine Learning Algorithms.” Medium, July 5, 2018. https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47.

“GeoDeepDive: Project Overview.” n.d. Accessed May 7, 2018. https://geodeepdive.org/about.html.

Kelly, Kevin. 2014. “The Three Breakthroughs That Have Finally Unleashed AI on the World.” Wired, October 27, 2014. https://www.wired.com/2014/10/future-of-artificial-intelligence/.

“Mathematical Reviews Guide for Reviewers.” 2015. American Mathematical Society. February 2015. https://mathscinet.ams.org/mresubs/guide-reviewers.html.

“Microsoft Academic Increases Power of Semantic Search by Adding More Fields of Study.” 2018. Microsoft Academic (blog), February 15, 2018. https://www.microsoft.com/en-us/research/project/academic/articles/microsoft-academic-increases-power-semantic-search-adding-fields-study/.

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” In Advances in Neural Information Processing Systems 26, edited by C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, 3111–3119. Curran Associates, Inc. http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf.

Parloff, Roger. 2016. “From 2016: Why Deep Learning Is Suddenly Changing Your Life.” Fortune, September 28, 2016. https://fortune.com/longform/ai-artificial-intelligence-deep-machine-learning/.

Rahimi, Ali, and Benjamin Recht. 2017. “Back When We Were Kids.” Presentation at the NIPS 2017 Conference. https://www.youtube.com/watch?v=Qi1Yry33TQE.

Řehůřek, Radim, and Petr Sojka. 2008. “Automated Classification and Categorization of Mathematical Knowledge.” In Intelligent Computer Mathematics, edited by Serge Autexier, John Campbell, Julio Rubio, Volker Sorge, Masakazu Suzuki, and Freek Wiedijk, 543–57. Berlin: Springer-Verlag.

Sangwani, Gaurav. 2017. “2017 Is the Year of Machine Learning. Here's Why.” Business Insider, January 13, 2017. https://www.businessinsider.in/2017-is-the-year-of-machine-learning-heres-why/articleshow/56514535.cms.


Shen, Zhihong, Hao Ma, and Kuansan Wang. 2018. “A Web-Scale System for Scientific Knowledge Exploration.” Paper presented at the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, July 2018. http://arxiv.org/abs/1805.12216.

Tank, Aytekin. 2017. “This Is the Year of the Machine Learning Revolution.” Entrepreneur, January 12, 2017. https://www.entrepreneur.com/article/287324.