
CALIFORNIA STATE UNIVERSITY, NORTHRIDGE

A Cloud-based Automated Speech Recognition System for Instructional Contents

A thesis submitted in partial fulfillment of the requirements

For the degree of Master of Science in Software Engineering

By

Timothy Spengler

May 2020


Copyright by Timothy Spengler 2020


The thesis of Timothy Spengler is approved:

Ani Nahapetian, Ph.D. Date

Kate Tipton Date

Li Liu, Ph.D., Chair Date

California State University, Northridge


Acknowledgements

I would like to acknowledge my committee chair, Dr. Liu for giving me the

opportunity to research and implement this thesis. Thank you for being an exceptional

mentor throughout this entire process and always providing thoughtful direction when I was

lost. Thank you to Dr. Nahapetian for agreeing to participate in this thesis and thank you to

Kate Tipton for providing thoughtful insight regarding accessibility. This thesis and the research behind it would not have been possible without the exceptional support of Amazon AWS, the CSUN Universal Design Center, and CSUN IT, which provided computing resources and technical support.


Dedication

First, I would like to dedicate this thesis to my family, especially to my

Grandmother and Grandfather for providing me the opportunity to pursue my master’s

degree and always encouraging the pursuit of knowledge. Mom and Dad, thank you for

giving me endless support throughout the years and giving me the opportunity to make you

proud. I am also dedicating this thesis to the CSUN Career Center. Ana, thank you for giving

me the opportunity to work in such an encouraging environment and expand my skill set. The

experience alone truly made these past two years worth it.


Table of Contents

Acknowledgements

Dedication

List of Figures

List of Abbreviations

Abstract

1. Introduction

1.1 Problem Statements

2. Background

2.1 Research

2.2 Related Work

3. Preprocessing Approach

3.1 Overview

3.2 Keywords

3.3 Keyword Filters

3.4 Automation

4. Prototype Implementation

4.1 Overview

4.2 System Architecture

4.2.1 AWS S3

4.2.2 Amazon API Gateway

4.2.3 AWS Lambda

4.2.4 AWS Comprehend

4.2.5 Twinword

4.3 Front-End

4.4 Back-End

4.5 Keyword Analysis

4.5.1 Frequency Score

4.5.2 Similarity Score

4.5.3 Confidence Score

5. Experiment

5.1 Experiment Overview

5.2 Experiment Setup

5.3 Trial #1

5.4 Trial #2

5.5 Discussion

6. Conclusion

7. References


List of Figures

Figure 4.2.1 Reference Architecture

Figure 4.3.1 Front-end Screenshot

Figure 4.4.1 System Workflow

Figure 5.1.1 Testing Workflow

Figure 5.2.1 Sample .txt File Format

Figure 5.3.1 Trial #1 Vocabulary Size Comparison

Figure 5.3.2 Trial #1 Accuracy Rate of Vocabularies

Figure 5.4.1 Trial #2 Vocabulary Size Comparison

Figure 5.4.2 Trial #2 Accuracy Rate of Vocabularies

Figure 5.5.1 Accuracy Rate of Vocabularies of all Trials


List of Abbreviations

I. API Application Programming Interface

II. ASR Automated Speech Recognition

III. AWS Amazon Web Services

IV. DHH Deaf and Hard of Hearing

V. GUI Graphical User Interface

VI. MIME Multipurpose Internet Mail Extensions

VII. MIT Massachusetts Institute of Technology

VIII. NLP Natural Language Processing

IX. RAKE Rapid Automatic Keyword Extraction

X. UI User Interface

XI. XML Extensible Markup Language


Abstract

A Cloud-based Automated Speech Recognition System for Instructional Contents

By

Timothy Spengler

Master of Science in Software Engineering

This thesis explores the possibilities of using cloud-based automated speech

recognition services for instructional captioning. The research investigates factors that

improve the performance of using automated speech recognition services to generate

captioning for instructional audio and video including keyword extraction and keyword

analysis. This thesis will define the criteria for implementation of a web-based application

prototype integrated with various cloud-based services which will allow lecturers to upload

lecture-related documents prior to their lecture, and automatically generate a custom

vocabulary to provide to their ASR service. In addition, this thesis defines keyword extraction implementations for various lecture-related documents and explores various keyword preprocessing techniques to generate a custom vocabulary to help improve the

accuracy of automated speech recognition services. Custom vocabularies generated by the

prototype implementations will be tested for improvement of accuracy ratings compared to

the baseline and to one another to provide insight on the most beneficial preprocessing

approach.


1. Introduction

Modern real-time captioning services are capable of interpreting speech into text with

a delay of just five seconds and are heavily utilized in education for deaf and hard of hearing

(DHH) students who are unable to hear or comprehend aural speech [1]. When it comes to

providing captioning services, DHH students can utilize a human-powered approach, in which a human captioner types simultaneously with the speaker, or an automated speech recognition service that automatically generates a transcription without the need for human supervision. As the demand for captioning services has increased in

educational settings, automated speech recognition (ASR) services have grown in popularity

as the primary choice for producing captions, but have a difficult time producing accurate

captions and transcriptions. Although using high-quality microphones and reducing external

sounds to a minimum noticeably improves the accuracy of transcription, these improvements

alone might not be all that is required to achieve a desired accuracy goal of at least 90% in

classrooms [9].

Many speakers use aided material in their presentations to help listeners grasp the

presentation; this can include a slide presentation such as PowerPoint or a paper handout. By

providing additional information, listeners can better understand what is important

throughout a presentation. Similarly, by providing ASR services with aided material prior to

the presentation, ASR services may be able to improve the accuracy rate during transcription

by knowing ahead of time what important topics and keywords are going to be discussed.

1.1 Problem Statements

The Americans with Disabilities Act was passed in 1990 but the graduation rate of

deaf and hard of hearing (DHH) students has remained stagnant at about 25% compared to


the graduation rate of 56% for all students [1]. When it comes to providing captioning

options for DHH students, limiting factors include cost, availability, and quality. The most

common approach today for real-time captioning is human powered with humans typing

simultaneously with a speaker, but this approach can be expensive and difficult to coordinate.

An alternative approach to human generated captioning is ASR, which can be cheaper and

potentially always readily available, but produces less accurate results in most educational

environments compared to human powered captions.

Amongst all students, notetaking is a fundamental learning activity many will

incorporate into their learning process when listening to a lecture. The benefits of notetaking

can include better organization, improved comprehension and better summarization of the

lecture material. Many students claim, “a tradeoff takes place between taking quality notes

and paying full attention to the lecturer”, which results in students spending much of their

mental energy in note taking as opposed to understanding the material that is being presented

[3]. When students receive multimedia transcripts of a lecture to study from, a notable

increase in mean quiz score was recorded at 38.8% when multimedia transcripts are available

compared to a mean quiz score of 23.8% when multimedia transcripts are not available [3].

Students reported the main positive feature of the multimedia transcripts was the ability to go

back and revisit what was discussed in lecture. Students also reported a major drawback of

the multimedia transcripts was the errors recorded throughout the transcript [3].

One of the major drawbacks with ASR services is the accuracy of the interpretation of

speech into visual text. Modern ASR services have an average accuracy rate of about 75%,

but with supplemental training, accuracy rate could reach 90% under ideal speaker conditions

[1]. When ASR services make errors, the meaning of the text can change drastically. In one


study, the ASR service changed ‘two-fold axis’ to ‘twenty-four Lexus’ which completely

alters the meaning of the text and results in confusion amongst the readers [1]. Human captioners are prone to mistakes as well, such as spelling mistakes or omitting words that were not understood during transcription [1]; typically, these mistakes do not

significantly change the meaning of the text [1].

As the coronavirus disease 2019 (COVID-19) outbreak spread to all parts of the

world, shelter-in-place orders were put into effect to protect the public. The

world has seen dramatic changes in daily life such as many educational environments

completely transitioning to online formats. As this transition has taken place, courses

intended for traditional classroom settings must be made accessible in the online environment and must ensure learning-based accommodations are being satisfied [11]. With video

presentations becoming the standard teaching method, institutions may require professors to have all course-related videos captioned, but professors may have difficulty getting

their videos captioned on short notice. Now more than ever, instructors need an available

resource that can caption their lectures or videos quickly and accurately.


2. Background

2.1 Research

The research for this master’s thesis will explore the various types of documents that are

used in academic settings and how they are utilized by different subjects. These documents

range from presentation slides, chapters from textbooks, PDF articles, etc. The choice to

explore these documents extends beyond the fact that they are some of the most popular

documents used in academia, but also because they adhere to a specific structure that will

allow specific portions of the document to be analyzed to help avoid irrelevant sections

throughout the document.

Keyword analysis techniques such as frequency, similarity, and confidence score will

be another portion of the research conducted to possibly help improve ASR transcription

accuracy. The direct impact each technique has on accuracy rates will be considered by

adjusting the minimum criteria each keyword must meet to be included in the custom

vocabulary document. A hybrid approach of the various techniques will be used to generate a

single custom vocabulary and will involve the adjustment of weights, providing adjustable

favorability to specific techniques within the hybrid approach.

PowerPoint is one of the most used technological tools in educational settings and is

utilized differently throughout various academic disciplines [7]. A study conducted by

Herting, Cladellas, and Tarrida analyzed and compared how PowerPoint is utilized by

educators depending on the subject that is being taught. PowerPoint slides can be textual,

visual, auxiliary, or a mix. Slides are considered textual when primarily using texts and

definitions, visual when preferences were given to graphics or tables, and auxiliary when

indices or headlines without explanation were favored [7]. The subjects analyzed fell into one


of three categories: natural sciences, medical sciences or social sciences. One of the main

premises of the study was to explore the types of slides commonly found in PowerPoint

presentations. The study found that natural sciences contained more visual slides than textual

slides while also containing minimal auxiliary slides [7]. Natural sciences also contained

more mixed slides than pure text slides, with text slides often containing a visual element on

the slide. Medical sciences were more visual than textual but contained more textual slides

than natural sciences. Medical sciences contained more pure text slides than mixed slides and

contained the least number of auxiliary slides. Social sciences contained more text slides than

visual but had the most equal distribution of slide types amongst the other sciences. Text

slides had an occurrence of 33.3% while visual and auxiliary slides each occurred 26.2% of the time [7].

The study found that natural sciences and social sciences had a more similar PowerPoint

pattern than the medical sciences.

2.2 Related Work

Much research has been performed in the area of keyword identification in

documents and other forms of content. Identification techniques are used for text

compression and summarization which were utilized in a research project known as

PitchKeywordExtraction [2]. PitchKeywordExtraction’s primary objective was to identify keywords through ASR services by pitch and tone; it would compare words that came after pauses or sudden changes in a speaker’s speech pattern against a text document of pre-listed

keywords. If a selected keyword was mapped to a keyword in the pre-listed keywords, the

word would be added to an official keyword document that is outputted at the end of the

speaker’s session. The hypothesis was based on research which proposed that keywords are the most informative parts of speech and are connected to speech signals for given and new


information during a lecture [2]. One of the possible real-world applications of the research

project includes mapping keywords to external resources such as encyclopedias, geographic

locations, etc.

Rapid automatic keyword extraction (RAKE) is a keyword extraction methodology

that identifies keywords contained in a single document without the use of manual

annotations [4]. RAKE is based on an observation that keywords frequently contain multiple

words but often do not contain standard punctuation or stop words such as “and”, “the”, and “of” [4]. RAKE begins by searching for candidate keywords in the document and

assigns keyword scores based on word frequency, word degree, and ratio of degree to

frequency. Once candidate keywords are scored, the highest-scoring one-third of keywords

are selected as keywords for the document. The RAKE methodology was applied to various

types of documents such as technical abstracts and news articles and applied different

weights to the criteria that generates keyword scores. In technical abstracts, accurate keyword

extraction was achieved by favoring longer keywords while in news articles shorter

keywords were favored more. This shows that different documents exhibit textual information differently, which may require the keyword analysis to be adjusted accordingly [4].
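To make the scoring scheme concrete, the following Python fragment sketches RAKE’s core idea. It is an illustration rather than the published algorithm in full: the stop-word list is truncated, and the final selection of the top-scoring one-third of candidates is omitted.

    import re

    # Truncated stop-word list for illustration; RAKE uses a full stop list.
    STOP_WORDS = {"and", "the", "of", "is", "at", "which", "on", "a", "to", "in"}

    def rake_scores(text):
        """Score candidate phrases the way RAKE does: split the text on stop
        words, then score each phrase by the sum of its words' ratios of
        degree (co-occurrence within candidates) to frequency."""
        words = re.findall(r"[a-z']+", text.lower())
        phrases, current = [], []
        for word in words:
            if word in STOP_WORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)

        freq, degree = {}, {}
        for phrase in phrases:
            for word in phrase:
                freq[word] = freq.get(word, 0) + 1
                degree[word] = degree.get(word, 0) + len(phrase)

        return {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}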

Amazon Comprehend Medical is a cloud-based natural language processing service

provided by Amazon Web Services (AWS) that utilizes machine learning to extract medical

information from unstructured data. Comprehend Medical can retrieve information relating

to medical condition, medication, dosage, strength, frequency, doctors’ notes, etc.

Comprehend Medical has use cases involving clinical decision support, revenue cycle

management, and clinical trial management. Comprehend Medical is derived from Amazon

Comprehend which is a more general-purpose natural language processing service. Many of


the use cases currently for Amazon Comprehend are business related and often involve

customer relations such as parsing customer reviews and feedback. Amazon Comprehend

was capable of being specialized for medical purposes; likewise, it may be possible for the service to

be specialized for educational or scholarly documents. These documents could include

presentation slides, PDF files, and tables. Potential use cases for educational settings could

include summarizing passages, highlighting key terms, and lecture topic identification.

In a study conducted by Saqib Alam and Nianmin Yao, preprocessing steps were

applied to remove “noisy data” from tweets in order to improve the accuracy of three

machine learning algorithms involved with sentiment analysis [8]. Sentiment analysis is

involved with classifying text as positive, neutral, or negative. Noisy data would often

include emoticons, extra letters applied to words, and stop-words such as “the”, “is”, “at”,

“which”, “on”, etc. The machine learning algorithms tested include the Naïve Bayes, maximum entropy, and support vector machine algorithms. Experimental results showed that accuracy improved significantly for the Naïve Bayes algorithm after applying the text preprocessing steps, improved slightly for maximum entropy, and did not improve for the support vector machine algorithm [8].


3. Preprocessing Approach

3.1 Overview

In educational settings, educators will often instruct students to read a chapter from a

textbook or review a document prior to a lecture. This allows students to be introduced to

information that is going to be discussed during the lecture, helping them become more

familiar with the material. Much like the students receiving material prior to a lecture, many

ASR services can receive custom vocabularies prior to transcription jobs. The custom

vocabularies allow the ASR service to be introduced to words that are likely to be spoken and

train the ASR’s model to more accurately identify those specific words.

Students can be provided lecture material in the form of books, articles, and files

while most ASR services can only be provided with a list of words in a digital format.

Although ASR services cannot consume material in the same way as humans, it is

possible to transform lecture material into a format that can be utilized by the ASR services.

By leveraging cloud-based services, lecture material can be analyzed for keywords that can

then be supplied to ASR services prior to transcription. This preprocessing approach may

help increase the accuracy of ASR services during transcription, and by leveraging cloud-

based solutions, can be made available at any time.

3.2 Keywords

When it comes to lectures of about 50 minutes, speakers on average will speak at a

rate of 140 words per minute and use about 2,421 unique words throughout the lecture [1].

This shows that it may be difficult to predict every word a speaker is going to use during their

lecture, but an attempt can be made to predict specific words that are related to the lecture subject. Keywords can be defined as a sequence of one or more words that


provide a compact representation of a document’s content [4]. Specifically, this approach

will be utilizing noun phrases as keywords. A noun phrase is defined as a noun

and the modifiers that distinguish it [5]. An example of a noun phrase could be “blue car”.

This example contains an adjective (“blue”), and a noun (“car”). Noun phrases are the

keywords of choice because it has been reported by a keyword ranking system known as

TextRank that keywords consisting only of nouns and adjectives have the biggest impact

amongst all keywords contained in a document [5]. Extraction of noun-phrases will provide

ASR services the highest chance to receive quality keywords that best describe the lecture

material associated with the lecture. Among the keywords that are extracted from the related

lecture material, additional filters are going to be applied to rank the extracted noun-phrases

amongst each other.

3.3 Keyword Filters

Keywords that are extracted from the document are going to be filtered through

various approaches which will rank keywords amongst themselves. The filters will serve an

additional purpose to error check keywords and help ensure they are meaningful keywords to

the subject that is being presented by a speaker. The filters that will be applied to the

keywords will be frequency, confidence, and similarity. The frequency filter will pertain to

the number of occurrences the keyword is found within the document it was extracted from.

The more frequently a keyword is found within the document, the higher the score the

keyword will receive. The confidence filter will check every keyword’s confidence score and remove keywords that do not meet the minimum confidence score. A confidence score is a value between zero and one and represents how likely it is that a keyword is, in fact, a keyword. The closer a keyword’s confidence score is to one, the higher the score the keyword will receive.


The similarity filter will compare extracted keywords to main topics found within the

document. Specifically, keywords will be compared to document titles and connected

subtitles found throughout the document. This filter will provide a similarity score between

zero and one and will represent how similar a keyword is to the subject or topic that is being

discussed. The higher the similarity score, the higher the rank the keyword will receive. The

frequency, confidence, and similarity filters will be compared with one another to provide

insight on which filters provide the highest quality custom vocabularies for ASR services.
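Taken together, the three filters reduce to a simple qualification check per keyword. The sketch below is a minimal illustration of that check, assuming the default thresholds discussed in Chapter 4; it is not the prototype’s actual source.

    def qualifies(frequency, confidence, similarity,
                  min_frequency=1, min_confidence=0.95, min_similarity=0.50):
        """Apply the frequency, confidence, and similarity filters to one keyword.

        `similarity` may be None when the keyword has no associated title or
        subtitle, in which case the similarity filter is skipped entirely."""
        if frequency < min_frequency or confidence < min_confidence:
            return False
        return similarity is None or similarity >= min_similarity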

3.4 Automation

A beneficial attribute of ASR services is their ability to be automated, meaning the

service does not require humans to generate accurate captions and transcripts. This provides

users with a reliable and available service that can be utilized at a moment’s notice. Many

existing approaches for keyword extraction are not automated and often require professional

curators who use a fixed taxonomy or rely on an author-provided list to extract

keywords from documents [4]. Professional curators not only have to be paid for the work

they perform, but also do not work 24 hours a day. By leveraging cloud-based services, an

automated system can be built that completely automates the keyword extraction process and

generates high quality custom vocabularies for users to supply to their ASR service. This

system will be available to all users through a web application, allowing for operation nearly 24 hours a day, 7 days a week.


4. Prototype Implementation

4.1 Overview

The web application prototype will be hosted on the AWS cloud, and will integrate

various cloud services into the system. These services will include Amazon S3, Amazon API Gateway, AWS Comprehend, and AWS Lambda, along with a third-party machine learning service called Twinword. The prototype’s main purpose is to generate a custom vocabulary by

extracting keywords from an uploaded document. The custom vocabulary will provide

support to ASR services by supplying a list of keywords to look out for during audio transcription, which may improve the accuracy of the ASR service.

The system architecture will explore what each individual service provides to the

system and how the components interact with one another. The system architecture can be

divided into two main sections: the front-end and the back-end. The front-end will introduce

the cloud services required for users to successfully upload documents into the system. The

back-end will explore the cloud services necessary to extract keywords from the uploaded

documents and generate custom vocabularies. Keywords will be ranked by various keyword

analysis techniques to generate higher quality custom vocabularies.


4.2 System Architecture

Figure 4.2.1 Reference Architecture

The system architecture of the web-based prototype will utilize a serverless

architecture pattern, allowing management of machine resources to be handled by the cloud

provider. This architecture pattern will help reduce costs by only having to pay for

computation resources actually used and will also provide the advantage of easier integration of other

cloud-based services into the workflow of the system. Since the system will be serverless,

this means the system will exist entirely in the cloud. AWS will be the cloud provider of

choice for the prototype application as it is one of the most established cloud providers

currently available in the market and provides numerous unique services within their

marketplace. The services to be utilized in the application will consist of Amazon API

Gateway, Amazon S3, AWS Lambda, Amazon Comprehend, and an application

programming interface (API) available in the AWS marketplace called Twinword. API

Gateway, AWS Lambda and S3 are the main infrastructure anchors within the system,


handling events as they are triggered throughout the system and coordinating data and

responses from one service to another. Amazon Comprehend and Twinword are machine

learning services that are going to analyze user uploaded data and provide the necessary

insight to generate a custom vocabulary. The generated custom vocabulary is going to be a

text file that can be supplied to an ASR service such as AWS Transcribe to help improve the

accuracy during live transcription.

4.2.1 AWS S3

Amazon S3 is a service provided by AWS that will be mainly utilized as a cloud

storage device to store data as it is uploaded to the system. The system workflow will utilize

two S3 buckets, one in the beginning which will store uploaded user documents, and another

S3 bucket at the end of the workflow where custom vocabulary text documents will be

stored. Although the system could utilize a single S3 bucket, a lack of organization of the

files coming back and forth could raise issues within the system, such as accidentally

triggering lambda functions as data comes back after traveling throughout the system. This

creates unnecessary complexities within the system which can easily be avoided by having

two separate S3 buckets.

The first S3 bucket, which will be referred to as the Document S3 Bucket, will handle all files that are uploaded by the user. The Document S3 Bucket will be responsible for triggering the

appropriate lambda functions depending on the type of file that was uploaded by the user. As

the file’s data is parsed and transformed, it will be placed into the other bucket which will be

referred to as the Vocabulary S3 Bucket. The Vocabulary S3 bucket will consist only of .txt

files and will be the last stop in the workflow. The prototype will not include a delivery


method triggered by the Vocabulary S3 bucket, but one can be integrated depending on the

desired delivery method, such as email.

4.2.2 Amazon API Gateway

The prototype will contain a RESTful API built with Amazon’s API Gateway and

will allow custom APIs to be built, maintained, and customized as the criteria of the system

changes. The prototype will support three files but can easily be extended to handle any type

of file to meet future system requirements. The prototype’s gateway will only handle POST

requests from the user as the system will only be taking in documents the user is uploading;

this means that end users will not be able to request data from the system.

The API Gateway serves as the front door for the system and will have specific

endpoints depending on the type of file the user is trying to upload. For the prototype, there will

be at least three endpoints, one for .docx files, .pptx files, and .epub files. Each endpoint will

have a specified AWS Lambda function ready to handle and place the file into the Document

S3 bucket.

4.2.3 AWS Lambda

AWS Lambda is going to contain all the business logic of the system, allowing

multiple services to communicate with one another as well as handle and transform data.

AWS Lambda provides the system with the ability to have no servers to manage and

continuously scale as the system requires. AWS Lambda keeps the cost of cloud computing to a minimum, charging only for each 100 ms that code is executing. AWS Lambda supports

many different programming languages such as Node.js, Java, Python, etc. The prototype of


the system will be using lambda functions written in Python 3.6 but can utilize any of the

supported programming languages as system requirements change.

The prototype’s lambda functions will be triggered when a user uploads a file to the

API Gateway and any time a file is placed in the Document S3 Bucket. The prototype will

have three lambda functions triggered by the API Gateway each corresponding to the file

type. The lambda functions will be responsible for renaming the file to a unique name,

converting the file from base64 and placing the file into the Document S3 bucket. Once the

file is placed in the S3 bucket, the appropriate lambda function corresponding to the file type

will trigger and begin parsing and transforming the files. The lambda function will parse the

extensible markup language (XML) files inside the uploaded documents, looking for titles, subtitles, and

paragraph text and begin sending the data to AWS Comprehend and Twinword API.

As data is returned from Comprehend and Twinword, the same lambda function will

begin handling the data based on the type of analysis that is specified. The lambda function

will have the ability to count the frequency of keywords in the document, analyze confidence

scores, and gather similarity scores. The lambda function will also contain the logic for the

minimum requirements for keywords to qualify for the custom vocabulary. Once all the data

has been processed and analyzed, the lambda function will place a single .txt file in the

Vocabulary S3 bucket. The .txt file will be formatted specifically to be consumed by ASR

services. The formatting of the .txt file will remove commas, numbers, or any characters not

supported by the ASR service from the file.
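As a concrete illustration, a cleanup step of this kind could look like the following sketch. It assumes the character rules described here and in Section 5.2: digits and unsupported symbols are stripped, and the spaces inside multi-word phrases become single hyphens so each phrase occupies one line of the .txt file.

    import re

    def to_vocabulary_line(phrase):
        """Format one key phrase for the custom vocabulary .txt file."""
        cleaned = re.sub(r"[^A-Za-z\s'.-]", "", phrase)  # drop digits and symbols
        return re.sub(r"\s+", "-", cleaned.strip())      # spaces become hyphens

    def write_vocabulary(phrases, path):
        """Write one qualified keyword per line, as the ASR service expects."""
        with open(path, "w") as f:
            for phrase in phrases:
                line = to_vocabulary_line(phrase)
                if line:
                    f.write(line + "\n")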

An important limitation to note about lambda functions is that there is a maximum

execution time of 15 minutes. The execution time of a lambda function will be dependent on


the size of the file that is going to be processed. The prototype has all lambda functions’

maximum execution time set to three minutes, well below the maximum time allowed. Three

minutes was chosen as the maximum execution time as opposed to fifteen minutes because if

the lambda function is still executing past three minutes, an error has likely occurred. If an

error has occurred in the system, it is important that the lambda function fails fast and recovers

quickly because the lambda function will be charged for the execution time regardless of whether it is

successful. A three-minute execution time also prevents the system from holding resources

hostage and blocking other files from being processed.

4.2.4 AWS Comprehend

Amazon Comprehend is a fully managed and continuously trained natural language

processing (NLP) service which utilizes machine learning techniques to discover

relationships and insights within text. The service is popular in business applications to

analyze customer feedback, identify key phrases, places, etc. The service is also heavily

utilized in the medical field to identify medical conditions, analyze patient health records

such as identifying relations between medications and treatments, etc. Comprehend excels at

identifying items of interest such as keywords, people, and places.

In the prototype, Amazon Comprehend analyzes all texts that are uploaded by the

user. Once the lambda functions extract the data from the files, it is sent to Comprehend.

Comprehend will analyze the data as a whole and return a list of keywords that were

identified in the data, along with an associated confidence score representing how confident the service is that the keyword is, in fact, a keyword. The confidence score provided by

Comprehend will be utilized in the analysis of the keywords and will be a contributing factor

for the keyword to be written to the custom vocabulary.


Comprehend has a limitation of how much data can be processed and analyzed in a

single request. If data is too large, an asynchronous request is required, and the keywords will

be returned when Comprehend is finished. This requires additional services such as queues

and lacks organization of associated keywords. In the prototype, the lambda functions will

send data as chunks in multiple requests which provides better organization of keywords for

future analysis and keeps the process synchronous throughout the workflow. This requires

the lambda functions to run longer, but still well within the three-minute timeout limit.
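A minimal sketch of this chunked, synchronous approach is shown below, using the boto3 client for Comprehend. It assumes each paragraph fits under the synchronous request size limit; the real lambda functions also track which title or subtitle each paragraph belongs to.

    import boto3

    comprehend = boto3.client("comprehend")

    def extract_keywords(paragraphs):
        """Send document text to Comprehend one paragraph at a time.

        Keeping requests synchronous and paragraph-sized preserves the
        association between key phrases and their source paragraph."""
        results = []
        for text in paragraphs:
            response = comprehend.detect_key_phrases(Text=text, LanguageCode="en")
            for phrase in response["KeyPhrases"]:
                # Each phrase carries the confidence score used as a filter later.
                results.append((phrase["Text"], phrase["Score"]))
        return results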

4.2.5 Twinword

Twinword is a third-party service available on the AWS Marketplace which provides

text analysis APIs capable of understanding and associating words in a similar way to

humans. Twinword offers a text similarity API capable of evaluating the similarity between

two words, sentences, or paragraphs. Twinword will allow the prototype to take the context

of keywords into account when assigning a rank to a keyword. The prototype will utilize

Twinword by supplying a keyword and its associated title or subtitle for comparison. The

API will take in both parameters and return a similarity score based on their relevance to one

another and will be a contributing factor for keywords to be included in the custom

vocabulary.
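A sketch of such a call is shown below. The endpoint URL, authentication header, and response field are assumptions based on Twinword’s public documentation and depend on how the API is subscribed to (Twinword’s own portal or the AWS Marketplace), so treat them as placeholders.

    import requests

    TWINWORD_URL = "https://api.twinword.com/api/text/similarity/latest/"  # placeholder
    API_KEY = "YOUR_TWINWORD_KEY"  # placeholder credential

    def similarity_score(keyword, heading):
        """Return a 0..1 similarity between a keyword and its title/subtitle,
        or None if the request fails."""
        response = requests.post(
            TWINWORD_URL,
            data={"text1": keyword, "text2": heading},
            headers={"X-Twaip-Key": API_KEY},
        )
        if response.status_code != 200:
            return None
        return response.json().get("similarity")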


4.3 Front-End

Figure 4.3.1 Front-end Screenshot

For users to be able to have their lecture-related documents analyzed and have a

custom vocabulary generated for their lectures, a graphical user interface (GUI) is required that

is accessible from a user’s internet browser. It is important the GUI is simple and requires

minimal effort from end users to upload their documents. The front-end needs to be available to

the end users at any time and reliably upload documents into the system with minimal errors.

The front-end of the system is going to be the mode of entry for all user uploaded files and

will have direct communication with the system’s API gateway. The front-end is going to be

built with the Angular 6 web framework and hosted in a S3 bucket as a static website which

can be accessed from any standard web browser.

Angular 6 is a powerful web framework written in TypeScript that contains many ready-to-go components that can create visually appealing GUIs. As the prototype evolves, the

Angular framework can integrate new components into the system with minimal effort. The

prototype utilizes a library called ng2-file-upload which handles all POST requests on the


front-end and transmits the data to the API gateway. By using the ng2-file-upload library,

minimal amounts of code are required to be written because the library handles all base64

encodings for each file.

The front-end is going to screen the file extension of each file that is going to be

uploaded into the prototype to help minimize the chances a file becomes corrupted. The file

extension type will determine which API endpoint the file will be sent to because each

document must be handled differently to keep data correct and prevent the file from

becoming corrupt. Each file extension is decoded according to its associated MIME type,

because if a .docx file is sent to the endpoint dedicated to the .pptx file type, the file will

become corrupted and the data will be lost. Once a user uploads a file to the system, the API

gateway will return a status code of 200, which will trigger a response on the front-end to let

the user know their data has been uploaded successfully.
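Although the real front-end is Angular, the upload contract is easy to illustrate from any HTTP client. The Python sketch below mimics what ng2-file-upload does: base64-encode the file and POST it to the endpoint matching its extension. The base URL and route names here are hypothetical.

    import base64
    import requests

    API_BASE = "https://example.execute-api.us-west-2.amazonaws.com/prod"  # placeholder

    def upload_document(path):
        """Base64-encode a local file and POST it to the matching endpoint."""
        extension = path.rsplit(".", 1)[-1].lower()
        assert extension in ("docx", "pptx", "epub"), "unsupported file type"
        with open(path, "rb") as f:
            body = base64.b64encode(f.read()).decode("ascii")
        response = requests.post(API_BASE + "/upload/" + extension, data=body)
        return response.status_code == 200  # 200 signals a successful upload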

4.4 Back-End

Figure 4.4.1 System Workflow


The back-end workflow begins with the API gateway receiving POST requests from

the front-end and routing each request to the proper lambda function. Each file extension has a dedicated

lambda function that is triggered when a post request is sent to the API gateway. Each

lambda function decodes the data, writes the data into a new file with a unique name, and

places the file into the Document S3 bucket. Uniquely naming the file is important because

existing files must not be overwritten as they are being placed into the Document bucket, and this

minimizes data loss throughout the system. The prototype will contain three lambda

functions that are responsible for placing files into the S3 bucket, each handling a specific

file extension.
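One of those ingest functions could look roughly like the sketch below, assuming the request body arrives base64-encoded from API Gateway; the bucket name is a placeholder.

    import base64
    import uuid

    import boto3

    s3 = boto3.client("s3")
    DOCUMENT_BUCKET = "document-bucket"  # placeholder bucket name

    def handler(event, context):
        """Decode the uploaded file, give it a unique name so existing files
        are never overwritten, and place it in the Document S3 bucket."""
        data = base64.b64decode(event["body"])
        key = str(uuid.uuid4()) + ".pptx"  # one handler per file extension
        s3.put_object(Bucket=DOCUMENT_BUCKET, Key=key, Body=data)
        return {"statusCode": 200, "body": key}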

Once the file is placed into the S3 bucket, another lambda function is triggered which

begins parsing the XML portion of the file. The XML inside the file plays an important role in the parsing and analysis of the document, as it contains a logical structure of the

page, capable of providing organizational insight into the document while removing the need to

physically scan the page using human eyes. The XML will contain specific tags for each file

type. These tags can represent titles, subtitles, lists, images, paragraphs, etc. The prototype

will focus specifically on tags that are related to titles, subtitles, and paragraph text, but can

be easily adjusted as requirements and analysis provides feedback on the importance of

specific tags in relation to extracting meaningful keywords contained in the files. The

prototype does not require XML files to contain title or subtitle tags for successful keyword

extraction and analysis but does require the XML file to contain a Text tag, as this will be the

input for keyword extraction and analysis.
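For .pptx files, one way to reach the title and text tags inside the XML is the python-pptx library, as in the sketch below; the prototype’s own parser may instead walk the raw XML, but the title/paragraph association it preserves is the same.

    from pptx import Presentation

    def extract_slide_text(path):
        """Yield (title, body_text) pairs for each slide of a .pptx file,
        preserving the association the similarity score relies on."""
        for slide in Presentation(path).slides:
            title_shape = slide.shapes.title  # None when the slide has no title
            title = title_shape.text if title_shape is not None else None
            title_id = title_shape.shape_id if title_shape is not None else None
            body = []
            for shape in slide.shapes:
                if shape.has_text_frame and shape.shape_id != title_id:
                    body.append(shape.text_frame.text)
            yield title, "\n".join(body)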

The lambda functions will send one block of regular text at a time to AWS Comprehend to

keep requests synchronous while maintaining organization of text with relation to associated


titles and subtitles. The titles and subtitles will be used heavily in the similarity score

implementation and it is vital that paragraphs are correctly categorized with their related

titles and subtitles. When keywords are returned from Comprehend, the lambda function will

begin analysis of the keywords using approaches such as the frequency score, similarity

score, and confidence score.

The same lambda functions are responsible for the analysis of keywords and phrases

by computing the scores of the various score implementations. The frequency score and confidence score will be calculated in the lambda function and do not require any data to be sent to

any external services. The similarity score will be calculated by sending individual keywords

to an external service known as Twinword API. The scores generated in the lambda functions

will be the determining factor in whether a keyword qualifies to be included in the custom

vocabulary. The output of the system is a single .txt file that will contain the qualified

keywords from the uploaded files. This file will be stored in the Vocabulary S3 bucket with

the same name that was assigned to the file in the Document S3 bucket. The prototype will

not send the output via email or back to the user on the front-end, but such delivery can be implemented with

another lambda function or notification service.

4.5 Keyword Analysis

Not all keywords will be equal in terms of their relevance to the lecture, but it is

troublesome to determine exactly what keywords are the most important. There are multiple

approaches to categorize the importance of a keyword, but it is important to determine what

categorization will help improve speech recognition accuracy most effectively. The prototype

will explore three different implementations to provide a better understanding on what

techniques will best improve ASR accuracy. The system will implement a keyword


frequency approach, a context approach, and a confidence score approach. As the prototype

provides feedback, hybrid approaches amongst the tested techniques could be developed and

analyzed to help improve the accuracies of the ASR services.

4.5.1 Frequency Score

The more times a keyword occurs within the text, the more relevance a keyword

might have towards the topic. The frequency approach might provide insight on the number

of times a word occurs within a text and the chances of the word being spoken when talking

about the same subject. A frequency score will be assigned to a keyword based on the

number of times the keyword occurs within a body of text. Every keyword will have a

minimum count of one, and the default minimum frequency score to qualify for the custom

vocabulary will be one. These values can be adjusted accordingly and can provide insight on

the importance of frequency of a keyword throughout a file. A keyword repeating often in

a lecture-related file might have a correlation with the number of times the keyword will be

spoken in a lecture. By including high-frequency keywords in the custom vocabulary, the

ASR services will be trained to recognize the high-frequency keywords and improve the

accuracy during transcription.
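A minimal sketch of the frequency score is shown below: case-insensitive substring counting, with every extracted keyword counted at least once. (In Trial #1, for example, the keyword “tweets” would count twelve occurrences.)

    def frequency_scores(keywords, document_text):
        """Map each extracted keyword to its occurrence count in the text."""
        text = document_text.lower()
        return {kw: max(1, text.count(kw.lower())) for kw in keywords}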

A major concern with allowing the frequency score to determine the rank of a

keyword is that important keywords might only occur once in a body of text. This approach

may prematurely dismiss valuable keywords because a keyword only occurring once within a

document does not diminish the importance of the keyword to the topic and may have high

relevance when it comes to understanding the material.


4.5.2 Similarity Score

It is important that keywords are relevant to the subject matter that is being presented

and it is worth exploring how context impacts ASR accuracy during transcription. A

similarity score will provide insight on the context of a keyword compared to other

information contained in the file. A minimum similarity score will be required for keywords

to be included in the custom vocabulary. The prototype will define the minimum similarity

score at .50 but can be adjusted accordingly.

The prototype will utilize the Twinword API by comparing keywords to their

associated title or subtitle. The similarity score will provide insight on the context of the

keyword within the paragraph, and how closely related the keyword is to the subject matter

that is being discussed. Subtitles will have priority over titles when generating a similarity

score because subtitles are found in closer proximity to the keywords. In addition, subtitles will

often be more specific to a section of the topic and may provide a more accurate similarity

score. If the keyword has no associated subtitle, the title of the file will be used for

generating a similarity score. Titles will have a farther proximity to the keywords than

subtitles but can still provide insight on the context of a keyword. If a keyword has no

associated title or subtitle, the similarity score will not be generated and will not be relevant

in determining if a keyword qualifies to be written to the custom vocabulary.
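The priority rule above reduces to a small helper, sketched here; `similarity_fn` stands for any 0..1 scorer, such as the Twinword wrapper sketched in Section 4.2.5.

    def keyword_similarity(keyword, title, subtitle, similarity_fn):
        """Compare a keyword against its closest heading.

        Subtitles take priority over titles; when neither exists the
        function returns None and the similarity filter is skipped."""
        heading = subtitle if subtitle else title
        if heading is None:
            return None
        return similarity_fn(keyword, heading)

    # Qualification then reduces to a threshold check (default .50):
    # score = keyword_similarity(kw, title, subtitle, similarity_score)
    # keep = score is None or score >= 0.50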

4.5.3 Confidence Score

As AWS Comprehend extracts keywords from text, an associated confidence score

will be assigned to the keyword, indicating how confident the service is that the keyword is a keyword. A confidence

score is determined by the service by taking semantic structure and context into account

during extraction. Having a higher confidence score indicates a higher probability of a


keyword being a keyword, thus the system will assign a higher rank to keywords with higher

confidence scores.

The prototype will have a minimum confidence score of .95 for keywords to qualify

for the custom vocabulary. This approach will filter out lower ranking keywords and keep

higher probabilistic keywords for the custom vocabulary. This minimum confidence score

can be adjusted accordingly, such as if custom vocabularies contain too many or too few

keywords.


5. Experiment

5.1 Experiment Overview

Figure 5.1.1 Testing Workflow

To provide insight into whether the preprocessing techniques will help increase ASR transcription and captioning accuracy, an experiment is going to be conducted to analyze the

transcription accuracy of video lectures by supplying an ASR service with custom

vocabularies derived from the lecture-related material. Each trial of the experiment is going

to require a lecture video with an audible speaker, lecture related material, and an ASR

service to process and transcribe the video. The lecture related material will be preprocessed

by the prototype and will generate various custom vocabularies to be utilized by ASR

services for transcription. The ASR service will transcribe the lecture video for each type of

custom vocabulary and the results from each transcription job will be compared with one

another.

MIT OpenCourseWare is a website created by MIT that hosts and publishes

educational materials freely and openly. Educational material can include full courses,

assignments, or old lectures that were recorded during previous academic years. Many videos

and associated lecture materials are available for download under the Creative Commons


license. MIT OpenCourseWare provides a text transcription of the lectures and will be the

template to compare the accuracy of the transcription jobs for each custom vocabulary. The

experiment is going to analyze a video lecture series called “The Analytics Edge” instructed

by Dimitris Bertsimas; specifically analyzing the lecture videos of section 5.2 titled “Turning

Tweets into Knowledge: An Introduction to Text Analysis” [10]. The lecture video is 17

minutes and 22 seconds in length and has two different speakers throughout the lecture. The

lecture is paired with 26 PowerPoint slides which the speakers reference throughout the

video.

5.2 Experiment Setup

The ASR service that is going to be transcribing the video lecture is Amazon

Transcribe. Amazon Transcribe is a cloud-based service that is capable of captioning and

transcribing live lectures or previously recorded videos. Before Amazon Transcribe can

begin the transcription job, the prototype must generate the custom vocabularies. The lecture

related PowerPoint file will be uploaded to the prototype for preprocessing and will generate

custom vocabularies based on the frequency, similarity, and confidence approaches. Each

preprocessing approach will analyze the same PowerPoint file and will generate a unique .txt

file that will be supplied to Amazon Transcribe prior to the transcription of the lecture video.

Amazon Transcribe will be provided with the lecture video in .mp4 format along with

a custom vocabulary in .txt format. Amazon Transcribe requires custom vocabularies to

conform to a specific format, requiring all key phrases to appear on their own individual line

with spaces between words to be replaced with a single hyphen. Apostrophes, periods, and

commas can appear in the .txt document while numbers and symbols are prohibited. Figure


5.2.1 is an example of a .txt file which adheres to Amazon Transcribe’s custom vocabulary

specifications.

Figure 5.2.1 Sample .txt File Format

Amazon Transcribe will first transcribe the lecture video without being supplied a

custom vocabulary. This will provide a baseline on how accurate the ASR service is without

utilizing any preprocessing approaches. After the baseline has been established, Amazon

Transcribe will begin three transcription jobs utilizing the custom vocabularies generated

from the preprocessing techniques. The transcriptions generated from each job will be

compared with the transcription provided by MIT OpenCourseWare, which will serve as the reference for determining the accuracy of each transcription job. If a sentence is not correctly transcribed compared to the reference transcription, an error will be logged; the proportion of sentences that are correctly transcribed relative to the reference transcription determines the accuracy of the transcription job.
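A simplified sketch of that sentence-level comparison follows. It assumes the two transcripts are already aligned sentence-for-sentence and counts a sentence as correct only when it matches the reference exactly after case and whitespace normalization; the actual grading of the trials may have been performed by hand.

    def transcription_accuracy(reference_sentences, hypothesis_sentences):
        """Fraction of reference sentences the ASR job transcribed correctly."""
        def normalize(sentence):
            return " ".join(sentence.lower().split())

        correct = sum(
            1
            for ref, hyp in zip(reference_sentences, hypothesis_sentences)
            if normalize(ref) == normalize(hyp)
        )
        return correct / len(reference_sentences)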

5.3 Trial #1

The first trial of the experiment is going to generate three custom vocabularies each

filtered with a different preprocessing approach. The preprocessing approaches each had a set

minimum requirement for possible keywords to be included in their respective vocabularies.

Each preprocessing approach produced vocabularies that varied greatly in size with the


confidence vocabulary producing the largest vocabulary, followed by the frequency

vocabulary, while the similarity vocabulary was the smallest.

Figure 5.3.1 Trial #1 Vocabulary Size Comparison

For keywords to qualify for the frequency vocabulary, the keyword was required to

have occurred at least twice throughout the document, and among the keywords extracted,

only 27 of the keywords occurred at least twice in the PowerPoint document. The frequency

approach in this trial may provide insight that keywords do not repeat that often throughout

the document. This could be related to the type of document that is being processed as

PowerPoint slides are not as detailed as a paragraph from a textbook or article. PowerPoint

slides often summarize a topic and reinforce the main points that the lecturer wishes to

emphasize.

The requirement for keywords to be included in the confidence score vocabulary is a

confidence score of at least .90. Among the keywords extracted, 230 keywords received a


confidence score of at least .90. This preprocessing approach produced the biggest

vocabulary size amongst the other approaches. 263 suspected keywords were identified in the

document, which means that 87.45% of the keywords had a confidence score of .90 or better.

Having an effective extraction process is important, but it is possible that .90 may be too

loose of a requirement for keywords to qualify as significantly more keywords are being

included in the confidence custom vocabulary compared to the frequency and similarity

vocabularies.

The similarity technique required a minimum similarity score of .10 to qualify for the

custom vocabulary. Among the keywords extracted, only 8 keywords qualified for the

custom vocabulary. A possible explanation as to why the similarity vocabulary size is much

lower than the other approaches is that the titles of each PowerPoint slide are not descriptive

enough to produce any similarity to the words that are being extracted. The similarity score

requirement of .10 is a low requirement which means many keywords received a zero for the

similarity score, when in fact they are relevant to the topic that is being discussed. Among the

words included in the similarity vocabulary, all keywords received a similarity score of 1, which is well above the minimum requirement of .10. This shows that keywords compared to their title tend to receive a score of either 1 or 0, which is the difference between qualifying or not qualifying for the similarity custom vocabulary.


Figure 5.3.2 Trial #1 Accuracy Rate of Vocabularies

The MIT-supplied transcription contained 169 sentences and served as the template

for determining accuracy of Amazon Transcribe. The baseline transcription job with no

custom vocabulary supplied had an accuracy rating of 63.33%, the lowest of all transcription

jobs. When the ASR service was provided a custom vocabulary, the accuracy rating had a

significant rise. The confidence score vocabulary produced the highest accuracy rating at

78.11%, followed by the frequency vocabulary with an accuracy rating of 72.78%, and the

similarity vocabulary with the second lowest accuracy rating of 69.82%.

Since all three vocabularies improved the accuracy rating of Amazon Transcribe by at

least 6%, the results suggest that supplying prefacing material to ASR services is better than

supplying nothing. The confidence vocabulary performed the best in the trial, with a defining

characteristic difference being its vocabulary size. This could suggest that bigger vocabularies are better than smaller vocabularies, but is the bigger vocabulary producing optimal results? While the confidence vocabulary is performing better than the frequency and

31

similarity vocabularies, too many keywords are being supplied to the ASR service and may

not be providing optimal training to the ASR service.

The similarity approach performed the worst of the three transcription jobs and supplied the ASR service with the smallest vocabulary, just 8 keywords. This suggests that this preprocessing approach produces too narrow a vocabulary and filters out significant keywords. One obstacle to improving the similarity approach is that the minimum similarity score requirement was already very low, at only .10. A possible explanation for why keywords failed to meet the minimum score is that the slide titles are not specific enough to the topic being discussed. For example, one slide in the lecture material, about cleaning up punctuation, is titled "Cleaning Up Irregularities." Although "Cleaning Up Irregularities" describes the overall purpose of the slide, it does not relate specifically to many of the keywords contained on the slide.

The frequency vocabulary performed second best of the three transcription jobs and supplied the ASR service with a medium-sized vocabulary containing 27 keywords. The most frequent keyword was "tweets," which occurred twelve times throughout the document. With the frequency vocabulary being the second largest and performing second best, this could further support the significance of a larger vocabulary. The minimum occurrence required for keywords to be included was two, leaving the frequency approach very limited room for extending its vocabulary.


5.4 Trial #2

The second trial of this experiment analyzed the same video lecture series as Trial #1, but the ASR service was only provided custom vocabularies generated from the confidence preprocessing approach. This experiment compared transcription jobs utilizing four additional custom vocabularies derived strictly from the confidence preprocessing approach, with minimum confidence score requirements of .90, .95, .99, and 1.0 respectively. The primary objective of this trial was to gain additional insight into the effect of vocabulary size on transcription accuracy and to identify a more optimal configuration for the confidence preprocessing approach.
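Generating the four vocabularies only requires sweeping the threshold over a single set of Comprehend scores. The sketch below reuses the hypothetical build_confidence_vocabulary helper from Trial #1; the input file path is a placeholder.

```python
# document_text: extracted lecture-slide text; the path is a placeholder.
with open("lecture_slides.txt") as f:
    document_text = f.read()

# Sweep the minimum confidence score and report each vocabulary's size.
thresholds = [0.90, 0.95, 0.99, 1.0]

vocabularies = {
    threshold: build_confidence_vocabulary(document_text, min_score=threshold)
    for threshold in thresholds
}

for threshold, vocabulary in sorted(vocabularies.items()):
    print(f"min score {threshold:.2f}: {len(vocabulary)} keywords")
```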

Figure 5.4.1 Trial #2 Vocabulary Size Comparison

As the minimum confidence score requirement becomes stricter, the preprocessing approach generates smaller vocabularies. 263 potential keywords were identified by the prototype, with 73.76% of them receiving a confidence score of at least .99. This may imply that the keyword identification service over-rates keywords, or it may simply mean that the service performs exceptionally well at identifying keywords from documents and scores each keyword appropriately. The vocabulary size drops off dramatically between the .99 and 1.0 vocabularies, shrinking by 81.96%. This reduction may provide the clearest view of how vocabulary size matters when a broad vocabulary is compared directly with one that contains only the highest-rated keywords.
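Working from the reported figures, the implied vocabulary sizes are roughly 0.7376 × 263 ≈ 194 keywords at the .99 threshold and 194 × (1 − 0.8196) ≈ 35 keywords at 1.0, which is consistent with the 36-keyword gap between the .90 and .99 vocabularies noted below.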

Figure 5.4.2 Trial #2 Accuracy Rate of Vocabularies

The MIT-supplied transcription contained 169 sentences and served as the template for determining the accuracy of each transcription job. The baseline transcription job, with no custom vocabulary supplied to the ASR service, had an accuracy rating of 63.33% and remained the lowest of all transcription jobs. The .90 custom vocabulary generated the best results in Trial #1 with an accuracy rating of 78.11% but performed third best in this trial, behind both the .95 Confidence Vocabulary and the .99 Confidence Vocabulary. The .99 Confidence Vocabulary performed second best in the trial with an accuracy rating of 79.28% while utilizing the third largest vocabulary. The .90 and .99 vocabularies yielded the two closest accuracy ratings, with a difference of only 1.17%, despite a vocabulary size difference of 36 keywords.

The .95 Confidence Vocabulary performed the best, with an accuracy rating of 82.24%, while having the second largest custom vocabulary. An interesting aspect to highlight is that the .95 Confidence Vocabulary sits between the .90 and .99 vocabularies in terms of vocabulary size yet outperforms both. This may provide insight into the impact of custom vocabulary size on an ASR service's accuracy rating: while vocabulary size has been shown to be an important trait for transcription accuracy, it is possible to oversaturate the ASR service with excess keywords, which can notably reduce transcription accuracy.

The 1.0 Confidence Vocabulary performed second worst in this trial, outperforming only the baseline transcription job. It had the smallest vocabulary, which reinforces the hypothesis that vocabulary size affects transcription accuracy. Its accuracy rating was 3.55% lower than that of the .99 Confidence Vocabulary. This comparison shows the impact of raising the minimum confidence score by just .01: the change made the vocabulary too exclusive, removing important keywords that were spoken throughout the lecture video.


5.5 Discussion

Figure 5.5.1 Accuracy Rate of Vocabularies of all Trials

The experiments conducted in Trial #1 and Trial #2 have shown that by providing an ASR service with keywords from lecture-related documents, the ASR service has a higher likelihood of producing more accurate transcriptions. Preprocessing lecture-related material is beneficial and can provide foresight to ASR services prior to transcription. Each preprocessing approach produced different results, with the Confidence approach performing the best at an accuracy rate of 82.24%. The Frequency approach achieved an accuracy rate of 72.78%, and the Similarity approach an accuracy rate of 69.82%. The Confidence approach not only performed the best among all the approaches but also produced the largest custom vocabulary, containing 230 keywords. The Similarity approach produced the smallest vocabulary, containing only 8 keywords.

The confidence preprocessing approach produced the largest vocabularies while also remaining the most adjustable of the approaches, allowing for optimization. It is the most adaptable: it can produce broad vocabularies using a confidence score of .99 or below while remaining capable of producing more limited vocabularies when the minimum confidence score is set to 1.0. The frequency approach in Trial #1 required a minimum occurrence count of two for keywords to qualify for the vocabulary. Increasing this minimum will only produce more specific vocabularies, while reducing the minimum occurrence count to one would simply include every keyword extracted from the document. The frequency approach may be more suitable for detailed documents, such as a chapter from a textbook, since slide presentations are generally broader and summarize material while a textbook chapter is more detailed.

A notable error that occurred throughout both Trial #1 and Trial #2 is that the ASR service would overcorrect words to match entries in the custom vocabulary. In Trial #1, during the similarity transcription job, the ASR service repeatedly expanded the spoken word "textual" to "textual data," the form written in the custom vocabulary. "Textual data" was a keyword in the custom vocabulary and was spoken during the lecture video, but as the lecture continued the speaker would often say only the word "textual" in other sentences. The ASR service would routinely render these as "textual-data" in the transcript. Although this overcorrection did not drastically alter the meaning of the sentences, it reduced the readability of the transcript.

For this experiment, it is important to highlight the external factors that may have influenced the results. A major one is the sound quality of the lecture video. Although the video may be clear to humans, it is possible it was not of the highest quality, which could affect the ASR service's ability to interpret the lecture. Other factors affecting the ASR service's ability to interpret the speaker include the speaker's natural tone and dialect and whether a microphone or audio system was used to amplify the speaker's voice. Awkward pauses and stutters can result in poor punctuation and grammar in the transcription, which can degrade the final transcript.

The similarity approach in Trial #1 had the worst performance of all transcription jobs supplied with a custom vocabulary, but it still improved upon the baseline transcription job that used no preprocessing. Although the vocabulary contained only 8 keywords, the similarity approach managed to improve the accuracy rating by 6.49%. This comparison demonstrates the potential of preprocessing approaches paired with ASR services. The approach may be more suitable for preprocessing other types of documents, but it still reinforces the trend that a small vocabulary is better than no vocabulary.


6. Conclusion

Much like a student who is introduced to material before a lecture to increase understanding, an ASR service can be provided with information about the lecture prior to transcribing to help produce more accurate transcriptions and captions. Preprocessing lecture material has proven to be a powerful approach for helping ASR services generate high-quality transcriptions and captions. Many ASR services are capable of specialized training for transcription jobs through custom vocabularies consisting of keywords supplied prior to transcribing or captioning. The AWS cloud infrastructure allows a system to be constructed that provides users with an automated process for generating custom vocabularies.
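As an illustration of that automated process, the sketch below chains the two services with boto3. The media URI, job name, and vocabulary name are placeholders, the threshold is the best-performing value from Trial #2, and vocabulary-creation polling and error handling are omitted; this is a minimal sketch, not the system's actual implementation.

```python
import boto3

comprehend = boto3.client("comprehend")
transcribe = boto3.client("transcribe")

def transcribe_with_custom_vocabulary(document_text, media_uri,
                                      vocabulary_name, job_name,
                                      min_score=0.95):
    # 1. Extract key phrases from the lecture document.
    phrases = comprehend.detect_key_phrases(
        Text=document_text, LanguageCode="en"
    )["KeyPhrases"]
    keywords = [p["Text"] for p in phrases if p["Score"] >= min_score]

    # 2. Build the custom vocabulary. Amazon Transcribe represents
    #    multi-word phrases with hyphens (e.g., "textual-data").
    transcribe.create_vocabulary(
        VocabularyName=vocabulary_name,
        LanguageCode="en-US",
        Phrases=[k.replace(" ", "-") for k in keywords],
    )

    # 3. Start the transcription job with the vocabulary attached.
    #    (In practice, wait for the vocabulary to reach READY first.)
    transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={"MediaFileUri": media_uri},
        MediaFormat="mp4",
        LanguageCode="en-US",
        Settings={"VocabularyName": vocabulary_name},
    )
```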

Cloud-based NLP tools such as AWS Comprehend are capable of accurately extracting meaningful keywords to support ASR services and can help automate the generation of custom vocabularies. Custom vocabularies should consist of as many high-quality keywords as possible without shrinking the vocabulary so far that important keywords are lost. The Confidence approach has shown that it is possible to produce quality custom vocabularies that steadily improve the accuracy rating of the ASR service. If a user's objective is to produce a more exclusive vocabulary consisting strictly of the most meaningful keywords, the Frequency or Similarity approach could be more appropriate.

The pandemic caused by COVID-19 has dramatically changed the way people interact with one another. The demand for captioning has increased as social distancing has been encouraged and educational environments have transitioned to online formats. ASR services are becoming more reliable and are readily available for instructors to use at any time. ASR services can produce transcriptions of live lectures and educational videos, providing students with an alternative resource for their academic studies. By leveraging cloud-based services, educational environments can quickly adapt to the changing times and better support students throughout their academic journey.


7. References

[1] Kushalnagar, Raja S., et al. “Accessibility Evaluation of Classroom Captions.” ACM Transactions on Accessible Computing, vol. 5, no. 3, 1 Mar. 2013, pp. 1–24. doi:10.1145/2543578.

[2] Lezhenin, Yurij, et al. “PitchKeywordExtractor: Prosody-Based Automatic Keyword Extraction for Speech Content.” Proceedings of the 2017 Federated Conference on Computer Science and Information Systems, 2017. doi:10.15439/2017f326.

[3] Ranchal, Rohit, et al. “Using Speech Recognition for Real-Time Captioning and Lecture Transcription in the Classroom.” IEEE Transactions on Learning Technologies, vol. 6, no. 4, Oct. 2013, pp. 299–311. doi:10.1109/tlt.2013.21.

[4] “Automatic Keyword Extraction from Individual Documents.” Text Mining: Applications and Theory, by Michael W. Berry and Jacob Kogan, Wiley, 2010, pp. 1–20.

[5] “Detect Key Phrases.” AWS Documentation, 2020, docs.aws.amazon.com/comprehend/latest/dg/how-key-phrases.html.

[6] Mihalcea, Rada, and Paul Tarau. “TextRank: Bringing Order into Texts.” Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 2004, web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf.

[7] Chávez Herting, D., Cladellas Pros, R., and Castelló Tarrida, A. “Patterns of PowerPoint Use in Higher Education: A Comparison between the Natural, Medical, and Social Sciences.” Innovative Higher Education, vol. 45, 2020, pp. 65–80. https://doi-org.libproxy.csun.edu/10.1007/s10755-019-09488-4.

[8] Alam, S., and Yao, N. “The Impact of Preprocessing Steps on the Accuracy of Machine Learning Algorithms in Sentiment Analysis.” Computational and Mathematical Organization Theory, vol. 25, 2019, pp. 319–335. https://doi-org.libproxy.csun.edu/10.1007/s10588-018-9266-8.

[9] Kheir, Richard, and Thomas Way. “Inclusion of Deaf Students in Computer Science Classes Using Real-Time Speech Transcription.” ACM Digital Library, June 2007, dl-acm-org.libproxy.csun.edu/doi/10.1145/1269900.1268860.

[10] Bertsimas, Dimitris, director. The Analytics Edge. Internet Archive, 1 Jan. 2017, archive.org/details/MIT15.071S17/MIT15_071S17_Session_5.1.01_300k.mp4.

[11] Smith, S. J., and Basham, J. D. “Designing Online Learning Opportunities for Students with Disabilities.” TEACHING Exceptional Children, vol. 46, no. 5, 2014, pp. 127–137. https://doi.org/10.1177/0040059914530102.