CALIFORNIA STATE UNIVERSITY, NORTHRIDGE
A Cloud-based Automated Speech Recognition System for Instructional Contents
A thesis submitted in partial fulfillment of the requirements
For the degree of Master of Science in Software Engineering
By
Timothy Spengler
May 2020
The thesis of Timothy Spengler is approved:
Ani Nahapetian, Ph.D. Date
Kate Tipton Date
Li Liu, Ph.D., Chair Date
California State University, Northridge
Acknowledgements
I would like to acknowledge my committee chair, Dr. Liu for giving me the
opportunity to research and implement this thesis. Thank you for being an exceptional
mentor throughout this entire process and always providing thoughtful direction when I was
lost. Thank you to Dr. Nahapetian for agreeing to participate in this thesis, and thank you to
Kate Tipton for providing thoughtful insight regarding accessibility. This thesis and the research
behind it would not have been possible without the exceptional support of Amazon AWS, the
CSUN Universal Design Center, and CSUN IT, which provided computing resources and
technical support.
Dedication
First, I would like to dedicate this thesis to my family, especially to my
Grandmother and Grandfather for providing me the opportunity to pursue my master’s
degree and always encouraging the pursuit of knowledge. Mom and Dad, thank you for
giving me endless support throughout the years and giving me the opportunity to make you
proud. I am also dedicating this thesis to the CSUN Career Center. Ana, thank you for giving
me the opportunity to work in such an encouraging environment and expand my skill set. The
experience alone truly made these past two years worth it.
Table of Contents
Acknowledgements ................................................................................................................... iv
Dedication ...................................................................................................................................v
List of Figures ......................................................................................................................... viii
List of Abbreviations................................................................................................................. ix
Abstract ......................................................................................................................................x
1. Introduction .........................................................................................................................1
1.1 Problem Statements.......................................................................................................1
2. Background ..........................................................................................................................4
2.1 Research .......................................................................................................................4
2.2 Related Work ................................................................................................................5
3. Preprocessing Approach .......................................................................................................8
3.1 Overview ......................................................................................................................8
3.2 Keywords ......................................................................................................................8
3.3 Keyword Filters ............................................................................................................9
3.4 Automation ................................................................................................................. 10
4. Prototype Implementation .................................................................................................. 11
4.1 Overview .................................................................................................................... 11
4.2 System Architecture .................................................................................................... 12
4.2.1 AWS S3 .................................................................................................................. 13
4.2.2 Amazon API Gateway ............................................................................................. 14
4.2.3 AWS Lambda .......................................................................................................... 14
4.2.4 AWS Comprehend .................................................................................................. 16
4.2.5 Twinword ................................................................................................................ 17
4.3 Front-End.................................................................................................................... 18
4.4 Back-End .................................................................................................................... 19
4.5 Keyword Analysis ....................................................................................................... 21
4.5.1 Frequency Score ...................................................................................................... 22
4.5.2 Similarity Score ....................................................................................................... 23
4.5.3 Confidence Score .................................................................................................... 23
5. Experiment ........................................................................................................................ 25
5.1 Experiment Overview ................................................................................................. 25
5.2 Experiment Setup ........................................................................................................ 26
5.3 Trial #1 ....................................................................................................................... 27
5.4 Trial #2 ....................................................................................................................... 32
5.5 Discussion................................................................................................................... 35
6. Conclusion ......................................................................................................................... 38
7. References ......................................................................................................................... 40
List of Figures
Figure 4.2.1 Reference Architecture .......................................................................................... 12
Figure 4.3.1 Front-end Screenshot ............................................................................................. 18
Figure 4.4.1 System Workflow .................................................................................................. 19
Figure 5.1.1 Testing Workflow.................................................................................................. 25
Figure 5.2.1 Sample .txt File Format ......................................................................................... 27
Figure 5.3.1 Trial #1 Vocabulary Size Comparison ................................................................... 28
Figure 5.3.2 Trial #1 Accuracy Rate of Vocabularies ................................................ 30
Figure 5.4.1 Trial #2 Vocabulary Size Comparison ................................................................... 32
Figure 5.4.2 Trial #2 Accuracy Rate of Vocabularies ................................................ 33
Figure 5.5.1 Accuracy Rate of Vocabularies of all Trials ........................................................... 35
List of Abbreviations
I. API Application Programming Interface
II. ASR Automated Speech Recognition
III. AWS Amazon Web Services
IV. DHH Deaf and Hard of Hearing
V. GUI Graphical User Interface
VI. MIME Multipurpose Internet Mail Extensions
VII. MIT Massachusetts Institute of Technology
VIII. NLP Natural Language Processing
IX. RAKE Rapid Automatic Keyword Extraction
X. UI User Interface
XI. XML Extensible Markup Language
Abstract
A Cloud-based Automated Speech Recognition System for Instructional Contents
By
Timothy Spengler
Master of Science in Software Engineering
This thesis explores the possibilities of using cloud-based automated speech
recognition services for instructional captioning. The research investigates factors that
improve the performance of using automated speech recognition services to generate
captioning for instructional audio and video including keyword extraction and keyword
analysis. This thesis will define the criteria for implementation of a web-based application
prototype integrated with various cloud-based services which will allow lecturers to upload
lecture-related documents prior to their lecture, and automatically generate a custom
vocabulary to provide to their ASR service. In addition, this thesis defines keyword
extraction implementations for various lecture-related documents, as well as explores various
keyword preprocessing techniques to generate a custom vocabulary to help improve the
accuracy of automated speech recognition services. Custom vocabularies generated by the
prototype implementations will be tested for improvement of accuracy ratings compared to
the baseline and to one another to provide insight on the most beneficial preprocessing
approach.
1. Introduction
Modern real-time captioning services are capable of interpreting speech into text with
a delay of just five seconds and are heavily utilized in education for deaf and hard of hearing
(DHH) students who are unable to hear or comprehend aural speech [1]. When it comes to
providing captioning services, DHH students can utilize a human-powered approach, in which
a human captioner types simultaneously with a speaker, or an automated speech recognition
service, which automatically generates a transcription without the
need for human supervision. As the demand for captioning services has increased in
educational settings, automated speech recognition (ASR) services have grown in popularity
as the primary choice for producing captions, but have a difficult time producing accurate
captions and transcriptions. Although using high-quality microphones and reducing external
sounds to a minimum noticeably improves the accuracy of transcription, these improvements
alone might not be all that is required to achieve a desired accuracy goal of at least 90% in
classrooms [9].
Many speakers use aided material in their presentations to help listeners grasp the
presentation; this can include a slide presentation such as PowerPoint or a paper handout. By
providing additional information, listeners can better understand what is important
throughout a presentation. Similarly, by providing ASR services with aided material prior to
the presentation, ASR services may be able to improve the accuracy rate during transcription
by knowing ahead of time what important topics and keywords are going to be discussed.
1.1 Problem Statements
The Americans with Disabilities Act was passed in 1990 but the graduation rate of
deaf and hard of hearing (DHH) students has remained stagnant at about 25% compared to
the graduation rate of 56% for all students [1]. When it comes to providing captioning
options for DHH students, limiting factors include cost, availability, and quality. The most
common approach today for real-time captioning is human powered with humans typing
simultaneously with a speaker, but this approach can be expensive and difficult to coordinate.
An alternative approach to human generated captioning is ASR, which can be cheaper and
potentially always readily available, but produces less accurate results in most educational
environments compared to human powered captions.
Amongst all students, notetaking is a fundamental learning activity that many will
incorporate into their learning process when listening to a lecture. The benefits of notetaking
can include better organization, improved comprehension and better summarization of the
lecture material. Many students claim, “a tradeoff takes place between taking quality notes
and paying full attention to the lecturer”, which results in students spending much of their
mental energy in note taking as opposed to understanding the material that is being presented
[3]. When students received multimedia transcripts of a lecture to study from, a notable
increase in mean quiz score was recorded: 38.8% with multimedia transcripts available
compared to 23.8% without [3].
Students reported the main positive feature of the multimedia transcripts was the ability to go
back and revisit what was discussed in lecture. Students also reported a major drawback of
the multimedia transcripts was the errors recorded throughout the transcript [3].
One of the major drawbacks with ASR services is the accuracy of the interpretation of
speech into visual text. Modern ASR services have an average accuracy rate of about 75%,
but with supplemental training, the accuracy rate could reach 90% under ideal speaker conditions
[1]. When ASR services make errors, the meaning of the text can change drastically. In one
study, the ASR service changed ‘two-fold axis’ to ‘twenty-four Lexus’ which completely
alters the meaning of the text and results in confusion amongst the readers [1]. Human
captioners are prone to mistakes as well, such as spelling mistakes or omitting words
that were not understood during transcription [1], but typically these mistakes do not
significantly change the meaning of the text [1].
As the coronavirus disease 2019 (COVID-19) outbreak spread to all parts of the
world, shelter-in-place orders were put into effect to protect citizens. The
world has seen dramatic changes in daily life such as many educational environments
completely transitioning to online formats. As this transition has taken place, courses
intended for traditional classroom settings must be made accessible in the online
environment while ensuring learning-based accommodations are satisfied [11]. With video
presentations becoming the standard teaching method, institutions may require professors to
have all course-related videos captioned, but professors may have difficulty getting
their videos captioned on short notice. Now more than ever, instructors need an available
resource that can caption their lectures or videos quickly and accurately.
2. Background
2.1 Research
The research for this master's thesis will explore the various types of documents that are
used in academic settings and how they are utilized by different subjects. These documents
range from presentation slides to textbook chapters to PDF articles. The choice to
explore these documents is not only that they are some of the most popular documents used
in academia, but also that they adhere to a specific structure that will allow specific
portions of the document to be analyzed, helping to avoid irrelevant sections throughout
the document.
Keyword analysis techniques such as frequency, similarity, and confidence score will
be another portion of the research conducted to possibly help improve ASR transcription
accuracy. The direct impact each technique has on accuracy rates will be considered by
adjusting minimum criteria each keyword will need to meet to be included in the custom
vocabulary document. A hybrid approach of the various techniques will be used to generate a
single custom vocabulary and will involve the adjustment of weights, providing adjustable
favorability to specific techniques within the hybrid approach.
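The hybrid approach described above can be illustrated as a weighted sum of the three normalized scores. The weight values and sample keywords below are illustrative placeholders, not values drawn from the experiments:

```python
def hybrid_score(frequency, similarity, confidence,
                 weights=(0.4, 0.3, 0.3)):
    """Blend the three keyword scores into a single rank.

    All scores are assumed to be normalized to [0, 1]; the default
    weights are arbitrary and would be tuned experimentally.
    """
    w_freq, w_sim, w_conf = weights
    return w_freq * frequency + w_sim * similarity + w_conf * confidence

# Rank hypothetical candidate keywords by the blended score.
candidates = {
    "custom vocabulary": (0.9, 0.8, 0.95),  # (frequency, similarity, confidence)
    "page layout": (0.2, 0.1, 0.70),
}
ranked = sorted(candidates, key=lambda k: hybrid_score(*candidates[k]),
                reverse=True)
```

Shifting weight toward one component provides the adjustable favorability described above; for example, `weights=(1.0, 0.0, 0.0)` reduces the hybrid score to a pure frequency ranking.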
PowerPoint is one of the most used technological tools in educational settings and is
utilized differently throughout various academic disciplines [7]. A study conducted by
Herting, Cladellas, and Tarrida analyzed and compared how PowerPoint is utilized by
educators depending on the subject that is being taught. PowerPoint slides can be textual,
visual, auxiliary, or a mix. Slides are considered textual when primarily using texts and
definitions, visual when preferences were given to graphics or tables, and auxiliary when
indices or headlines without explanation were favored [7]. The subjects analyzed fell into one
of three categories: natural sciences, medical sciences or social sciences. One of the main
premises of the study was to explore the types of slides commonly found in PowerPoint
presentations. The study found that natural sciences contained more visual slides than textual
slides while also containing minimal auxiliary slides [7]. Natural sciences also contained
more mixed slides than pure text slides, with text slides often containing a visual element on
the slide. Medical sciences were more visual than textual but contained more textual slides
than natural sciences. Medical sciences contained more pure text slides than mixed slides and
contained the least number of auxiliary slides. Social sciences contained more text slides than
visual but had the most even distribution of slide types amongst the three sciences. Text
slides had an occurrence of 33.3% while visual and auxiliary slides each occurred 26.2% of the time [7].
The study found that natural sciences and social sciences had a more similar PowerPoint
pattern than the medical sciences.
2.2 Related Work
Much research has been performed in the area of keyword identification in
documents and other forms of content. Identification techniques are used for text
compression and summarization which were utilized in a research project known as
PitchKeywordExtraction [2]. PitchKeywordExtraction's primary objective was to identify
keywords through ASR services by pitch and tone; it would compare words that came after
pauses or sudden changes in a speaker's speech pattern against a text document of pre-listed
keywords. If a selected keyword was mapped to a keyword in the pre-listed keywords, the
word would be added to an official keyword document that is outputted at the end of the
speaker's session. The hypothesis was based on research which proposed that keywords are
the most informative parts of speech and are connected to speech signals for given and new
information during a lecture [2]. One of the possible real-world applications of the research
project include mapping keywords to external resources such as encyclopedias, geographic
locations, etc.
Rapid automatic keyword extraction (RAKE) is a keyword extraction methodology
that identifies keywords contained in a single document without the use of manual
annotations [4]. RAKE is based on an observation that keywords frequently contain multiple
words but often do not contain standard punctuation or stop words such as “and”, “the” and,
“of” [4]. RAKE begins by using searching for candidate keywords in the document and
assigns keyword scores based on word frequency, word degree, and ratio of degree to
frequency. Once candidate keywords are scored, the highest one-third of scoring keywords
are selected as keywords for the document. The RAKE methodology was applied to various
types of documents such as technical abstracts and news articles and applied different
weights to the criteria that generate keyword scores. In technical abstracts, accurate keyword
extraction was achieved by favoring longer keywords while in news articles shorter
keywords were favored more. This shows how different documents exhibit text information
differently which may require the analysis of the keywords to be adjusted accordingly [4].
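As an illustration of the methodology in [4], a minimal RAKE-style extractor can be sketched in Python. The abbreviated stop-word list and the simple degree-over-frequency scoring are simplifications of the full algorithm, not a faithful reimplementation:

```python
import re

# Abbreviated stop-word list; RAKE uses a much larger one.
STOP_WORDS = {"and", "the", "of", "is", "a", "an", "in", "to", "by"}

def rake_keywords(text, keep_fraction=1 / 3):
    """Candidate phrases are maximal runs of words between stop words
    and punctuation; each word is scored by degree/frequency, and a
    phrase scores the sum of its member word scores."""
    phrases = []
    for fragment in re.split(r"[^a-z\s]", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOP_WORDS:
                if current:
                    phrases.append(tuple(current))
                    current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))

    # Word frequency and degree (co-occurrence within phrases).
    freq, degree = {}, {}
    for phrase in phrases:
        for word in phrase:
            freq[word] = freq.get(word, 0) + 1
            degree[word] = degree.get(word, 0) + len(phrase)

    scored = {p: sum(degree[w] / freq[w] for w in p) for p in set(phrases)}
    ranked = sorted(scored, key=scored.get, reverse=True)
    # Keep the top-scoring fraction of candidate phrases.
    return [" ".join(p) for p in ranked[:max(1, int(len(ranked) * keep_fraction))]]
```

Because every member word contributes its degree, longer phrases naturally score higher, which matches the observation above that favoring longer keywords suited technical abstracts.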
Amazon Comprehend Medical is a cloud-based natural language processing service
provided by Amazon Web Services (AWS) that utilizes machine learning to extract medical
information from unstructured data. Comprehend Medical can retrieve information relating
to medical condition, medication, dosage, strength, frequency, doctors’ notes, etc.
Comprehend Medical has use cases involving clinical decision support, revenue cycle
management, and clinical trial management. Comprehend Medical is derived from Amazon
Comprehend which is a more general-purpose natural language processing service. Many of
the use cases currently for Amazon Comprehend are business related and often involve
customer relations such as parsing customer reviews and feedback. Amazon Comprehend
was capable of being specialized for medical purposes, so it may be possible for the service to
be specialized for educational or scholarly documents. These documents could include
presentation slides, PDF files, and tables. Potential use cases for educational settings could
include summarizing passages, highlighting key terms, and lecture topic identification.
In a study conducted by Saqib Alam and Nianmin Yao, preprocessing steps were
applied to remove “noisy data” from tweets in order to improve the accuracy of three
machine learning algorithms involved with sentiment analysis [8]. Sentiment analysis is
involved with classifying text as positive, neutral, or negative. Noisy data would often
include emoticons, extra letters applied to words, and stop-words such as “the”, “is”, “at”,
“which”, “on”, etc. The machine learning algorithms tested include the Naïve Bayes,
maximum entropy, and support vector machine algorithms. Experimental results showed
that accuracy significantly improved for Naïve Bayes after applying the text preprocessing
steps, improved slightly for maximum entropy, and did not improve for support vector
machine algorithms [8].
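The noisy-data removal described in [8] can be sketched as follows; the stop-word list is abbreviated and the exact cleaning rules are assumptions, not the authors' implementation:

```python
import re

STOP_WORDS = {"the", "is", "at", "which", "on", "a", "an", "and"}

def clean_tweet(text):
    """Strip non-letter characters (emoticons, digits, punctuation),
    collapse characters repeated three or more times, and drop common
    stop words, returning the remaining tokens."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)    # emoticons, digits, punctuation
    text = re.sub(r"(.)\1{2,}", r"\1", text)  # 'soooo' -> 'so'
    return [w for w in text.split() if w not in STOP_WORDS]
```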
3. Preprocessing Approach
3.1 Overview
In educational settings, educators will often instruct students to read a chapter from a
textbook or review a document prior to a lecture. This allows students to be introduced to
information that is going to be discussed during the lecture allowing students to become more
familiar with the material. Much like the students receiving material prior to a lecture, many
ASR services can receive custom vocabularies prior to transcription jobs. The custom
vocabularies allow the ASR service to be introduced to words that are likely to be spoken and
train the ASR’s model to more accurately identify those specific words.
Students can be provided lecture material in the form of books, articles, and files
while most ASR services can only be provided with a list of words in a digital format.
Although ASR services cannot consume material in the same way as humans, it is
possible to transform lecture material into a format that can be utilized by the ASR services.
By leveraging cloud-based services, lecture material can be analyzed for keywords that can
then be supplied to ASR services prior to transcription. This preprocessing approach may
help increase the accuracy of ASR services during transcription, and by leveraging cloud-
based solutions, can be made available at any time.
3.2 Keywords
When it comes to lectures of about 50 minutes, speakers on average will speak at a
rate of 140 words per minute and use about 2,421 unique words throughout the lecture [1].
This shows that it may be difficult to predict every word a speaker is going to use during their
lecture, but an attempt can be made to predict specific words related to the lecture
subject. Keywords can be defined as a sequence of one or more words that
provide a compact representation of a document’s content [4]. Specifically, this approach
will be utilizing noun phrases as keywords. A noun phrase is defined as a noun
and the modifiers that distinguish it [5]. An example of a noun phrase could be “blue car”.
This example contains an adjective (“blue”), and a noun (“car”). Noun phrases are the
keywords of choice because work on a keyword ranking system known as TextRank
reported that keywords consisting only of nouns and adjectives have the biggest impact
amongst all keywords contained in a document [5]. Extraction of noun-phrases will provide
ASR services the highest chance to receive quality keywords that best describe the lecture
material associated with the lecture. Among the keywords that are extracted from the related
lecture material, additional filters are going to be applied to rank the extracted noun-phrases
amongst each other.
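A minimal noun-phrase chunker matching the (adjective)* (noun)+ pattern implied above might look like the following sketch. It assumes tokens have already been labeled by some part-of-speech tagger with simplified 'ADJ'/'NOUN' tags; the tagging step itself is outside the sketch:

```python
def noun_phrases(tagged_tokens):
    """Chunk a POS-tagged token sequence into noun phrases of the form
    (adjective)* (noun)+, e.g. [('blue','ADJ'), ('car','NOUN')] -> 'blue car'."""
    phrases, current, seen_noun = [], [], False
    for word, tag in tagged_tokens:
        if tag == "NOUN":
            current.append(word)
            seen_noun = True
        elif tag == "ADJ":
            if seen_noun:  # an adjective after a noun starts a new phrase
                phrases.append(" ".join(current))
                current, seen_noun = [], False
            current.append(word)
        else:
            if seen_noun:
                phrases.append(" ".join(current))
            current, seen_noun = [], False
    if seen_noun:
        phrases.append(" ".join(current))
    return phrases
```

Adjective runs that never reach a noun are discarded, so only complete noun phrases survive.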
3.3 Keyword Filters
Keywords that are extracted from the document are going to be filtered through
various approaches which will rank keywords amongst themselves. The filters will serve the
additional purpose of error-checking keywords and helping ensure they are meaningful to
the subject that is being presented by a speaker. The filters that will be applied to the
keywords will be frequency, confidence, and similarity. The frequency filter will pertain to
the number of occurrences the keyword is found within the document it was extracted from.
The more frequently a keyword is found within the document, the higher the score the
keyword will receive. The confidence filter will check every keyword's confidence score and
remove keywords that do not meet the minimum confidence score. A confidence score is a
value between zero and one and represents how likely a candidate is, in fact, a keyword. The
closer a keyword's confidence score is to one, the higher the score the keyword will receive.
The similarity filter will compare extracted keywords to main topics found within the
document. Specifically, keywords will be compared to document titles and connected
subtitles found throughout the document. This filter will provide a similarity score between
zero and one and will represent how similar a keyword is to the subject or topic that is being
discussed. The higher the similarity score, the higher the rank the keyword will receive. The
frequency, confidence, and similarity filters will be compared with one another to provide
insight on which filters provide the highest quality custom vocabularies for ASR services.
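The three filters can be sketched as a single ranking step. The dictionary fields and threshold values below are illustrative, not the values evaluated in the experiments:

```python
def filter_keywords(keywords, min_confidence=0.8, min_similarity=0.3):
    """Drop candidates below the confidence and similarity thresholds,
    then rank the survivors by frequency. Each keyword is a dict with
    'text', 'frequency', 'confidence', and 'similarity' fields."""
    kept = [k for k in keywords
            if k["confidence"] >= min_confidence
            and k["similarity"] >= min_similarity]
    return sorted(kept, key=lambda k: k["frequency"], reverse=True)
```

Raising `min_confidence` or `min_similarity` shrinks the custom vocabulary toward only the keywords most likely to matter, which is the trade-off the trials compare.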
3.4 Automation
A beneficial attribute of ASR services is their ability to be automated, meaning the
service does not require humans to generate accurate captions and transcripts. This provides
users with a reliable and available service that can be utilized at a moment's notice. Many
existing approaches for keyword extraction are not automated and often require professional
curators who use a fixed taxonomy or rely on an author's provided list to extract
keywords from documents [4]. Professional curators not only have to be paid for the work
they perform, but also do not work 24 hours a day. By leveraging cloud-based services, an
automated system can be built that completely automates the keyword extraction process and
generates high quality custom vocabularies for users to supply to their ASR service. This
system will be available to all users through a web application, allowing for operation
nearly 24 hours a day, 7 days a week.
4. Prototype Implementation
4.1 Overview
The web application prototype will be hosted on the AWS cloud, and will integrate
various cloud services into the system. These will include Amazon S3, Amazon API
Gateway, AWS Comprehend, and AWS Lambda, along with a third-party machine learning
service called Twinword. The prototype's main purpose is to generate a custom vocabulary by
extracting keywords from an uploaded document. The custom vocabulary will support ASR
services by supplying a list of keywords to watch for during audio transcription, which will
help improve the accuracy of the ASR service.
The system architecture will explore what each individual service provides to the
system and how the components interact with one another. The system architecture can be
divided into two main sections: the front-end and the back-end. The front-end will introduce
the cloud services required for users to successfully upload documents into the system. The
back-end will explore the cloud services necessary to extract keywords from the uploaded
documents and generate custom vocabularies. Keywords will be ranked by various keyword
analysis techniques to generate higher quality custom vocabularies.
4.2 System Architecture
Figure 4.2.1 Reference Architecture
The system architecture of the web-based prototype will utilize a serverless
architecture pattern, allowing management of machine resources to be handled by the cloud
provider. This architecture pattern will help reduce costs, since only consumed computation
resources are paid for, and will also make it easier to integrate other cloud-based services
into the workflow of the system. Since the system will be serverless,
this means the system will exist entirely in the cloud. AWS will be the cloud provider of
choice for the prototype application as it is one of the most established cloud providers
currently available in the market and provides numerous unique services within their
marketplace. The services to be utilized in the application will consist of Amazon API
Gateway, Amazon S3, AWS Lambda, Amazon Comprehend, and an application
programming interface (API) available in the AWS marketplace called Twinword. API
Gateway, AWS Lambda and S3 are the main infrastructure anchors within the system,
handling events as they are triggered throughout the system and coordinating data and
responses from one service to another. Amazon Comprehend and Twinword are machine
learning services that are going to analyze user uploaded data and provide the necessary
insight to generate a custom vocabulary. The generated custom vocabulary is going to be a
text file that can be supplied to an ASR service such as AWS Transcribe to help improve the
accuracy during live transcription.
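A sketch of how a generated vocabulary might be registered with Amazon Transcribe is shown below. Transcribe's list-format custom vocabularies separate the words of a multi-word phrase with hyphens; the function and vocabulary names here are hypothetical, and the `create_vocabulary` call requires AWS credentials, so it is shown but not executed:

```python
def to_transcribe_phrases(keywords):
    """Format extracted keywords for an Amazon Transcribe custom
    vocabulary: multi-word phrases use hyphens between words,
    e.g. 'speech recognition' -> 'speech-recognition'."""
    return [k.strip().replace(" ", "-") for k in keywords]

def register_vocabulary(name, keywords, language_code="en-US"):
    """Register the vocabulary with Amazon Transcribe.

    Requires AWS credentials and permissions; the vocabulary name
    passed in is an assumption of the caller."""
    import boto3  # imported here so the sketch runs without AWS installed
    client = boto3.client("transcribe")
    client.create_vocabulary(
        VocabularyName=name,
        LanguageCode=language_code,
        Phrases=to_transcribe_phrases(keywords),
    )
```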
4.2.1 AWS S3
Amazon S3 is a service provided by AWS that will be mainly utilized as a cloud
storage device to store data as it is uploaded to the system. The system workflow will utilize
two S3 buckets: one at the beginning, which will store uploaded user documents, and another
S3 bucket at the end of the workflow where custom vocabulary text documents will be
stored. Although the system could utilize a single S3 bucket, a lack of organization of the
files coming back and forth could raise issues within the system, such as accidentally
triggering lambda functions as data comes back after traveling throughout the system. This
creates unnecessary complexities within the system which can easily be avoided by having
two separate S3 buckets.
The first S3 bucket, which will be referred to as the Document S3 Bucket, will handle all
files that are uploaded by the user. The Document S3 Bucket will be responsible for triggering
the appropriate lambda functions depending on the type of file that was uploaded by the user. As
the file’s data is parsed and transformed, it will be placed into the other bucket which will be
referred to as the Vocabulary S3 Bucket. The Vocabulary S3 bucket will consist only of .txt
files and will be the last stop in the workflow. The prototype will not include a delivery
method triggered by the Vocabulary S3 bucket, but one can be integrated depending on the
desired delivery method, such as email.
4.2.2 Amazon API Gateway
The prototype will contain a RESTful API built with Amazon’s API Gateway and
will allow custom APIs to be built, maintained, and customized as the criteria of the system
changes. The prototype will support three file types but can easily be extended to handle any
type of file to meet future system requirements. The prototype's gateway will only handle POST
requests from the user as the system will only be taking in documents the user is uploading;
this means that end users will not be able to request data from the system.
The API Gateway serves as the front door for the system and will have specific
endpoints depending on the type of file the user is trying upload. For the prototype, there will
be at least three endpoints, one for .docx files, .pptx files, and .epub files. Each endpoint will
have a specified AWS Lambda function ready to handle and place the file into the Document
S3 bucket.
4.2.3 AWS Lambda
AWS Lambda is going to contain all the business logic of the system, allowing
multiple services to communicate with one another as well as handling and transforming data.
AWS Lambda gives the system no servers to manage and scales
continuously as the system requires. It also keeps cloud-computing costs to a
minimum, charging only for each 100 ms that code executes. AWS Lambda supports
many different programming languages, such as Node.js, Java, and Python. The prototype of
the system will use lambda functions written in Python 3.6 but can utilize any of the
supported programming languages as system requirements change.
The prototype’s lambda functions will be triggered when a user uploads a file to the
API Gateway and any time a file is placed in the Document S3 Bucket. The prototype will
have three lambda functions triggered by the API Gateway each corresponding to the file
type. The lambda functions will be responsible for renaming the file to a unique name,
converting the file from base64 and placing the file into the Document S3 bucket. Once the
file is placed in the S3 bucket, the appropriate lambda function corresponding to the file type
will trigger and begin parsing and transforming the file. The lambda function will parse the
file's Extensible Markup Language (XML) content looking for titles, subtitles, and
paragraph text, and begin sending the data to AWS Comprehend and the Twinword API.
As data is returned from Comprehend and Twinword, the same lambda function will
begin handling the data based on the type of analysis that is specified. The lambda function
will have the ability to count the frequency of keywords in the document, analyze confidence
scores, and gather similarity scores. The lambda function will also contain the logic for the
minimum requirements for keywords to qualify for the custom vocabulary. Once all the data
has been processed and analyzed, the lambda function will place a single .txt file in the
Vocabulary S3 bucket. The .txt file will be formatted specifically to be consumed by ASR
services; the formatting will remove commas, numbers, and any other characters the ASR
service does not support.
An important limitation to note about lambda functions is their maximum
execution time of 15 minutes. The execution time of a lambda function will depend on
the size of the file being processed. The prototype sets every lambda function's
maximum execution time to three minutes, well below the maximum allowed. Three
minutes was chosen instead of fifteen because if a lambda
function is still executing past three minutes, an error has likely occurred. When an
error occurs, it is important that the lambda function fail fast and recover
quickly, because execution time is billed regardless of whether the function
succeeds. A three-minute limit also prevents the system from holding resources
hostage and blocking other files from being processed.
4.2.4 AWS Comprehend
Amazon Comprehend is a fully managed and continuously trained natural language
processing (NLP) service which utilizes machine learning techniques to discover
relationships and insights within text. The service is popular in business applications to
analyze customer feedback, identify key phrases, places, etc. The service is also heavily
utilized in the medical field to identify medical conditions, analyze patient health records
such as identifying relations between medications and treatments, etc. Comprehend excels at
identifying items of interest such as keywords, people, and places.
In the prototype, Amazon Comprehend analyzes all text that is uploaded by the
user. Once the lambda functions extract the data from the files, it is sent to Comprehend.
Comprehend will analyze the data as a whole and return a list of keywords
identified in it, each with an associated confidence score indicating how certain the
service is that the keyword is, in fact, a keyword. The confidence score provided by
Comprehend will be utilized in the analysis of the keywords and will be a contributing factor
in whether a keyword is written to the custom vocabulary.
Comprehend limits how much data can be processed and analyzed in a
single request. If the data is too large, an asynchronous request is required, and the keywords
are returned only when Comprehend is finished. This requires additional services such as queues
and makes it harder to keep keywords organized with their source text. In the prototype, the lambda functions will
instead send data as chunks in multiple synchronous requests, which keeps keywords better organized for
later analysis and keeps the whole workflow synchronous. This requires
the lambda functions to run longer, but still well within the three-minute timeout limit.
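A minimal sketch of this chunking strategy is shown below, in Python as used by the prototype's lambda functions. The byte limit is a placeholder assumption (not a value stated in this thesis), and the Comprehend client is injected so that the real boto3 call (`detect_key_phrases`) is only exercised in production:

```python
MAX_CHUNK_BYTES = 4500  # assumed per-request limit; adjust to the service's actual quota

def chunk_text(text, limit=MAX_CHUNK_BYTES):
    """Split text on whitespace into chunks whose UTF-8 size stays under `limit`."""
    chunks, current, size = [], [], 0
    for word in text.split():
        wlen = len(word.encode("utf-8")) + 1  # +1 for the joining space
        if current and size + wlen > limit:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(word)
        size += wlen
    if current:
        chunks.append(" ".join(current))
    return chunks

def extract_key_phrases(client, text):
    """Send each chunk synchronously to Comprehend and collect (phrase, score) pairs.

    `client` would be a boto3 Comprehend client, e.g. boto3.client("comprehend").
    """
    results = []
    for chunk in chunk_text(text):
        resp = client.detect_key_phrases(Text=chunk, LanguageCode="en")
        results.extend((kp["Text"], kp["Score"]) for kp in resp["KeyPhrases"])
    return results
```

Because the chunks are sent one at a time, each response can be matched back to the title or subtitle its text came from, which is what keeps the workflow synchronous and organized.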
4.2.5 Twinword
Twinword is a third-party service available on the AWS Marketplace which provides
text analysis APIs capable of understanding and associating words in a similar way to
humans. Twinword offers a text similarity API capable of evaluating the similarity between
two words, sentences, or paragraphs. Twinword will allow the prototype to take the context
of keywords into account when assigning a rank to a keyword. The prototype will utilize
Twinword by supplying a keyword and its associated title or subtitle for comparison. The
API will take in both parameters and return a similarity score based on their relevance to one
another; this score will be a contributing factor in whether a keyword is included in the custom
vocabulary.
4.3 Front-End
Figure 4.3.1 Front-end Screenshot
For users to have their lecture-related documents analyzed and a
custom vocabulary generated for their lectures, a graphical user interface (GUI) is required that
is accessible from a user's internet browser. It is important that the GUI be simple and require
minimal effort from end users to upload their documents. The front-end needs to be available to
end users at any time and to reliably upload documents into the system with minimal errors.
The front-end of the system is going to be the mode of entry for all user-uploaded files and
will communicate directly with the system's API gateway. The front-end is going to be
built with the Angular 6 web framework and hosted in an S3 bucket as a static website, which
can be accessed from any standard web browser.
Angular 6 is a powerful web framework written in TypeScript that contains many
ready-to-go components for creating visually appealing GUIs. As the prototype evolves, the
Angular framework can integrate new components into the system with minimal effort. The
prototype utilizes a library called ng2-file-upload, which handles all POST requests on the
front-end and transmits the data to the API gateway. By using the ng2-file-upload library,
minimal code needs to be written because the library handles the base64
encoding for each file.
The front-end is going to screen the extension of each file uploaded
into the prototype to help minimize the chance that a file becomes corrupted. The file
extension determines which API endpoint the file is sent to, because each
document type must be handled differently to keep the data correct. Each file is decoded
according to the MIME type associated with its extension; if a .docx file were sent to the
endpoint dedicated to .pptx files, the file would become corrupted and the data lost. Once a
user uploads a file to the system, the API gateway will return a status code of 200, which
triggers a response on the front-end to let the user know their data has been uploaded
successfully.
4.4 Back-End
Figure 4.4.1 System Workflow
The back-end workflow begins with the API gateway receiving POST requests from
the front-end and routing each request to the proper lambda function. Each file extension has a dedicated
lambda function that is triggered when a POST request is sent to the API gateway. Each
lambda function decodes the data, writes it into a new file with a unique name, and
places the file into the Document S3 bucket. Uniquely naming the file is important so that
files are not overwritten as they are placed into the Document bucket, which
minimizes data loss throughout the system. The prototype will contain three lambda
functions responsible for placing files into the S3 bucket, each handling a specific
file extension.
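The decode-and-store step described above can be sketched as follows. This is a simplified illustration, not the prototype's actual handler: the bucket name is hypothetical, and the S3 call is shown only as a comment so the helpers remain self-contained:

```python
import base64
import uuid

def decode_upload(body_b64: str) -> bytes:
    """Decode the base64 payload forwarded by the API gateway back into raw file bytes."""
    return base64.b64decode(body_b64)

def unique_key(original_name: str) -> str:
    """Prefix the original file name with a UUID so concurrent uploads of
    identically named files never overwrite one another in the bucket."""
    return f"{uuid.uuid4()}-{original_name}"

# Inside the real lambda handler, the decoded bytes would then be stored, e.g.:
#   boto3.client("s3").put_object(Bucket=DOCUMENT_BUCKET,
#                                 Key=unique_key(name), Body=data)
```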
Once the file is placed into the S3 bucket, another lambda function is triggered which
begins parsing the XML portion of the file. The XML inside the file plays an important
role in the parsing and analysis of the document: it encodes the logical structure of each
page, providing organizational insight into the document without the need for a human to
visually scan the page. The XML will contain specific tags for each file
type. These tags can represent titles, subtitles, lists, images, paragraphs, etc. The prototype
will focus specifically on tags related to titles, subtitles, and paragraph text, but this can
easily be adjusted as requirements and analysis provide feedback on which tags matter most
for extracting meaningful keywords from the files. The prototype
does not require the XML to contain title or subtitle tags for successful keyword
extraction and analysis, but it does require the XML to contain a text tag, as this is the
input for keyword extraction and analysis.
The lambda functions will send one block of text at a time to AWS Comprehend to
keep requests synchronous while maintaining the association between each text and its
titles and subtitles. The titles and subtitles will be used heavily in the similarity score
implementation, and it is vital that paragraphs are correctly categorized with their related
titles and subtitles. When keywords are returned from Comprehend, the lambda function will
begin analyzing them using approaches such as the frequency score, similarity
score, and confidence score.
The same lambda functions are responsible for analyzing keywords and phrases
using the various score implementations. The frequency score and
confidence score are calculated within the lambda function and do not require any data to be sent to
external services. The similarity score is calculated by sending individual keywords
to an external service, the Twinword API. The scores generated in the lambda functions
will determine whether a keyword qualifies for inclusion in the custom
vocabulary. The output of the system is a single .txt file containing the qualified
keywords from the uploaded files. This file will be stored in the Vocabulary S3 bucket under
the same name assigned to the file in the Document S3 bucket. The prototype will
not send the output by email or back to the user in the front-end, but either could be implemented with
another lambda function or a notification service.
4.5 Keyword Analysis
Not all keywords are equally relevant to the lecture, but it is
difficult to determine exactly which keywords are the most important. There are multiple
approaches to categorizing the importance of a keyword, and it is important to determine
which categorization improves speech recognition accuracy most effectively. The prototype
will explore three different implementations to provide a better understanding of which
techniques best improve ASR accuracy: a keyword frequency approach, a context approach,
and a confidence score approach. As the prototype provides feedback, hybrid approaches
among the tested techniques could be developed and analyzed to further improve the
accuracy of the ASR services.
4.5.1 Frequency Score
The more often a keyword occurs within the text, the more relevant it
may be to the topic. The frequency approach may reveal a relationship between the number
of times a word occurs in a text and the chance of that word being spoken when talking
about the same subject. A frequency score will be assigned to each keyword based on the
number of times it occurs within a body of text. Every keyword will have a
minimum count of one, and the default minimum frequency score to qualify for the custom
vocabulary will be one. These values can be adjusted accordingly and can provide insight
into the importance of a keyword's frequency throughout a file. A keyword repeated often in a
lecture-related file may correlate with the number of times the keyword is
spoken in the lecture. By including high-frequency keywords in the custom vocabulary, the
ASR service is trained to recognize the high-frequency keywords, improving
accuracy during transcription.
A major concern with allowing the frequency score to determine the rank of a
keyword is that important keywords might occur only once in a body of text. This approach
may prematurely dismiss valuable keywords: a keyword occurring only once within a
document does not diminish its importance to the topic, and it may still be highly
relevant to understanding the material.
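The frequency scoring above reduces to a small counting step, sketched here with the prototype's adjustable default threshold:

```python
from collections import Counter

MIN_FREQUENCY = 1  # prototype default; raising it makes the vocabulary more selective

def frequency_qualified(keywords, minimum=MIN_FREQUENCY):
    """Count occurrences of each extracted keyword and keep those meeting the minimum."""
    counts = Counter(keywords)
    return {kw: n for kw, n in counts.items() if n >= minimum}
```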
4.5.2 Similarity Score
It is important that keywords are relevant to the subject matter that is being presented
and it is worth exploring how context impacts ASR accuracy during transcription. A
similarity score will provide insight on the context of a keyword compared to other
information contained in the file. A minimum similarity score will be required for keywords
to be included in the custom vocabulary. The prototype will set the minimum similarity
score at .50, but this can be adjusted accordingly.
The prototype will utilize the Twinword API by comparing keywords to their
associated title or subtitle. The similarity score will provide insight on the context of the
keyword within the paragraph, and how closely related the keyword is to the subject matter
that is being discussed. Subtitles will have priority over titles when generating a similarity
score because subtitles are found in closer proximity to the keywords. Subtitles are also
often more specific to a section of the topic and may yield a more accurate similarity
score. If a keyword has no associated subtitle, the title of the file will be used
instead. Titles are farther from the keywords than subtitles but can still provide
insight into a keyword's context. If a keyword has no associated title or subtitle, no
similarity score will be generated, and similarity will play no role in determining
whether that keyword qualifies for the custom vocabulary.
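The subtitle-over-title priority and the .50 cutoff can be sketched as follows; `similarity_fn` stands in for the Twinword API call, which is not reproduced here:

```python
MIN_SIMILARITY = 0.50  # prototype default; adjustable

def pick_context(subtitle, title):
    """Subtitles take priority over titles because they sit closer to the keyword."""
    return subtitle or title or None

def qualifies_by_similarity(keyword, subtitle, title, similarity_fn,
                            minimum=MIN_SIMILARITY):
    """True/False if the keyword's similarity to its context meets the minimum;
    None when no title or subtitle exists, signalling that no score was generated
    and similarity should not influence qualification."""
    context = pick_context(subtitle, title)
    if context is None:
        return None  # no score generated; similarity is not relevant
    return similarity_fn(keyword, context) >= minimum
```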
4.5.3 Confidence Score
As AWS Comprehend extracts keywords from text, each keyword is assigned a
confidence score indicating how certain the service is that the phrase really is a
keyword. The service determines the confidence score by taking semantic structure and
context into account during extraction. A higher confidence score indicates a higher
probability that a phrase is a keyword, so the system assigns a higher rank to keywords
with higher confidence scores.
The prototype will require a minimum confidence score of .95 for keywords to qualify
for the custom vocabulary. This approach filters out lower-ranking keywords and keeps
higher-probability keywords for the custom vocabulary. The minimum confidence score
can be adjusted accordingly, for example if custom vocabularies contain too many or too few
keywords.
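The confidence filter itself reduces to a one-line check over the (keyword, score) pairs returned during extraction, sketched here with the prototype's .95 default:

```python
MIN_CONFIDENCE = 0.95  # prototype default; adjust if vocabularies grow too large or too small

def confidence_qualified(scored_keywords, minimum=MIN_CONFIDENCE):
    """Keep only keywords whose extraction confidence meets the minimum.

    scored_keywords: iterable of (keyword, confidence) pairs.
    """
    return [kw for kw, score in scored_keywords if score >= minimum]
```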
5. Experiment
5.1 Experiment Overview
Figure 5.1.1 Testing Workflow
To provide insight into whether the preprocessing techniques help increase ASR
transcription and captioning accuracy, an experiment is going to be conducted analyzing the
transcription accuracy of video lectures when an ASR service is supplied with custom
vocabularies derived from lecture-related material. Each trial of the experiment requires
a lecture video with an audible speaker, lecture-related material, and an ASR
service to process and transcribe the video. The lecture-related material will be preprocessed
by the prototype to generate various custom vocabularies for the ASR service
to use during transcription. The ASR service will transcribe the lecture video with each type of
custom vocabulary, and the results from each transcription job will be compared with one
another.
MIT OpenCourseWare is a website created by MIT that hosts and publishes
educational materials freely and openly. These materials can include full courses,
assignments, or lectures recorded during previous academic years. Many videos
and associated lecture materials are available for download under a Creative Commons
license. MIT OpenCourseWare provides a text transcription of each lecture, which will serve
as the template against which the accuracy of the transcription jobs for each custom vocabulary
is compared. The experiment is going to analyze a video lecture series called "The Analytics Edge,"
taught by Dimitris Bertsimas; specifically, it analyzes the lecture video of section 5.2, titled "Turning
Tweets into Knowledge: An Introduction to Text Analysis" [10]. The lecture video is 17
minutes and 22 seconds in length and features two different speakers. The
lecture is paired with 26 PowerPoint slides which the speakers reference throughout the
video.
5.2 Experiment Setup
The ASR service that will transcribe the video lecture is Amazon
Transcribe, a cloud-based service capable of captioning and
transcribing live lectures or previously recorded videos. Before Amazon Transcribe can
begin a transcription job, the prototype must generate the custom vocabularies. The lecture-related
PowerPoint file will be uploaded to the prototype for preprocessing, generating
custom vocabularies based on the frequency, similarity, and confidence approaches. Each
preprocessing approach will analyze the same PowerPoint file and generate a unique .txt
file that is supplied to Amazon Transcribe prior to transcription of the lecture video.
Amazon Transcribe will be provided with the lecture video in .mp4 format along with
a custom vocabulary in .txt format. Amazon Transcribe requires custom vocabularies to
conform to a specific format: each key phrase must appear on its own line,
with spaces between words replaced by a single hyphen. Apostrophes, periods, and
commas may appear in the .txt document, while numbers and symbols are prohibited. Figure
5.2.1 is an example of a .txt file which adheres to Amazon Transcribe's custom vocabulary
specifications.
Figure 5.2.1 Sample .txt File Format
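A sketch of how such a file might be produced in Python follows. The exact allowed character set is an assumption based on the description above (letters, hyphens, apostrophes, and periods kept; digits and other symbols dropped), so the regular expression should be checked against the ASR service's actual rules:

```python
import re

def format_entry(phrase: str) -> str:
    """Normalize one key phrase: spaces become hyphens, and characters
    outside the assumed allowed set are stripped."""
    hyphenated = "-".join(phrase.split())
    return re.sub(r"[^A-Za-z'.\-]", "", hyphenated)

def build_vocabulary(phrases) -> str:
    """One normalized phrase per line, ready to save as the .txt vocabulary."""
    entries = (format_entry(p) for p in phrases)
    return "\n".join(e for e in entries if e)
```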
Amazon Transcribe will first transcribe the lecture video without being supplied a
custom vocabulary. This provides a baseline for how accurate the ASR service is without
any preprocessing approach. After the baseline has been established, Amazon
Transcribe will run three transcription jobs utilizing the custom vocabularies generated
by the preprocessing techniques. The transcription generated by each job will be
compared with the reference transcription provided by MIT OpenCourseWare to determine the
accuracy of that job. If a sentence is not correctly transcribed compared to the
reference transcription, an error is logged; the number of sentences correctly
transcribed determines the accuracy of the transcription job.
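The sentence-level grading described above can be sketched as follows. This assumes the two transcripts are already aligned sentence by sentence; the real comparison may also normalize punctuation before matching:

```python
def transcription_accuracy(reference, hypothesis):
    """Fraction of reference sentences reproduced exactly,
    after trivial whitespace and case normalization."""
    correct = sum(
        1 for ref, hyp in zip(reference, hypothesis)
        if ref.strip().lower() == hyp.strip().lower()
    )
    return correct / len(reference)
```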
5.3 Trial #1
The first trial of the experiment generates three custom vocabularies, each
filtered with a different preprocessing approach. Each preprocessing approach had a set
minimum requirement for possible keywords to be included in its respective vocabulary.
The approaches produced vocabularies that varied greatly in size, with the
confidence vocabulary being the largest, followed by the frequency
vocabulary, while the similarity vocabulary was the smallest.
Figure 5.3.1 Trial #1 Vocabulary Size Comparison
To qualify for the frequency vocabulary, a keyword was required to occur at least
twice throughout the document; among the keywords extracted, only 27 occurred at least
twice in the PowerPoint document. The frequency approach in this trial suggests that
keywords do not repeat very often throughout the document. This could be related to the
type of document being processed, as PowerPoint slides are not as detailed as a paragraph
from a textbook or article. PowerPoint slides often summarize a topic and reinforce the
main points that the lecturer wishes to emphasize.
The requirement for keywords to be included in the confidence score vocabulary was a
confidence score of at least .90. Among the keywords extracted, 230 received a
confidence score of at least .90, making this the largest vocabulary of the three
approaches. In total, 263 suspected keywords were identified in the
document, meaning 87.45% of keywords had a confidence score of .90 or better.
Having an effective extraction process is important, but .90 may be too
loose a requirement, as significantly more keywords are included
in the confidence custom vocabulary than in the frequency and similarity
vocabularies.
The similarity technique required a minimum similarity score of .10 to qualify for the
custom vocabulary. Among the keywords extracted, only 8 qualified. A possible explanation
for why the similarity vocabulary is so much smaller than the others is that the titles of
the PowerPoint slides are not descriptive enough to produce any similarity to the words
being extracted. A requirement of .10 is low, which means many keywords received a
similarity score of zero even though they are relevant to the topic being discussed. Among the
words included in the similarity vocabulary, every keyword received a similarity score of
1, well above the minimum requirement of .10. This suggests that keywords compared
to their title receive a score of either 1 or zero, and that score is the difference between
qualifying for the similarity custom vocabulary or not.
Figure 5.3.2 Trial #1 Accuracy Rate of Vocabularies
The MIT-supplied transcription contained 169 sentences and served as the template
for determining the accuracy of Amazon Transcribe. The baseline transcription job with no
custom vocabulary had an accuracy rating of 63.33%, the lowest of all transcription
jobs. When the ASR service was provided a custom vocabulary, the accuracy rating rose
significantly. The confidence score vocabulary produced the highest accuracy rating at
78.11%, followed by the frequency vocabulary at 72.78%, and the
similarity vocabulary with the second-lowest accuracy rating of 69.82%.
Since all three vocabularies improved the accuracy rating of Amazon Transcribe by at
least 6%, the results suggest that supplying prefacing material to an ASR service is better than
supplying nothing. The confidence vocabulary performed best in the trial, with its defining
difference being its vocabulary size. This could suggest that bigger
vocabularies are better than smaller ones, but is the biggest vocabulary producing
optimal results? While the confidence vocabulary outperformed the frequency and
similarity vocabularies, it may supply too many keywords to the ASR service and thus may
not provide optimal training.
The similarity approach performed the worst of the three transcription jobs and
supplied the ASR service with the smallest vocabulary, just 8 keywords. This suggests that
the approach produces too specific a vocabulary and filters out
significant keywords. An obstacle to improving the similarity approach is that the
minimum similarity score requirement was already very low, at only .10. A possible
explanation for why keywords are not meeting the minimum score is that the
titles are not specific enough to the topic being discussed. For example, one slide title in
the lecture material is "Cleaning Up Irregularities," referring to punctuation throughout the
slide. Although "Cleaning Up Irregularities" describes the overall purpose of the slide, it
does not relate specifically to many of the keywords contained on the slide.
The frequency vocabulary performed the second best of the three transcription
jobs and supplied the ASR service with a medium-size vocabulary containing 27 keywords.
The most frequent keyword in the vocabulary was "tweets," which occurred
twelve times throughout the document. With the frequency vocabulary having the second-largest
vocabulary and performing second best among the transcription jobs, this could
further support the significance of a larger vocabulary. The minimum occurrence required for
keywords to be included in the vocabulary was two, leaving the frequency approach
very limited room for extending its vocabulary.
5.4 Trial #2
The second trial of this experiment analyzes the same video lecture series
as Trial #1, but the ASR service is only provided custom vocabularies generated
by the confidence preprocessing approach. This trial compares
transcription jobs utilizing four custom vocabularies derived strictly from the
confidence preprocessing approach, with minimum confidence score requirements of .90,
.95, .99, and 1.0 respectively. The primary objective of this trial is to find
additional insight into the effect of vocabulary size on transcription accuracy and to identify a
more optimal implementation of the confidence preprocessing approach.
Figure 5.4.1 Trial #2 Vocabulary Size Comparison
As the minimum confidence score requirement becomes stricter, the preprocessing
approach generates smaller vocabularies. 263 potential keywords were identified by the
prototype, with 73.76% receiving a confidence score of at least .99. This
may imply that the keyword-identifying service over-rates keywords, or
possibly that it performs exceptionally well at identifying keywords from documents and
scores each keyword appropriately. Vocabulary size drops off dramatically between the
.99 and 1.0 vocabularies, falling by 81.96%. This reduction may provide
the most insight into the impact of vocabulary size when compared directly against
vocabularies containing only the most highly rated keywords.
Figure 5.4.2 Trial #2 Accuracy Rate of Vocabularies
The MIT-supplied transcription contained 169 sentences and served as the template
for determining the accuracy of each transcription job. The baseline transcription job, with no
custom vocabulary supplied to the ASR service, had an accuracy rating of 63.33% and
remained the lowest of all transcription jobs. The .90 custom vocabulary generated the best
results in Trial #1 with an accuracy rating of 78.11% but performed third best in this trial,
behind both the .95 Confidence Vocabulary and the .99 Confidence Vocabulary. The .99
Confidence Vocabulary performed second best in the trial with an accuracy rating of
79.28% while utilizing the third-largest vocabulary. The .90 and .99 vocabularies yielded
the two closest accuracy ratings, differing by only 1.17%, despite a
vocabulary size difference of 36 keywords.
The .95 Confidence Vocabulary performed the best, with an accuracy rating of
82.24%, while having the second-largest custom vocabulary. An interesting aspect to
highlight is that the .95 Confidence Vocabulary sits between the .90 and .99 vocabularies in
terms of vocabulary size yet outperforms both. This may provide insight
into the impact of custom vocabulary size on an ASR service's accuracy rating. While
vocabulary size has shown itself to be an important trait for transcription accuracy, it is possible to
oversaturate the ASR service with excess keywords, which can notably reduce transcription
accuracy.
The 1.0 Confidence Vocabulary performed second worst in this trial,
outperforming only the baseline transcription job. It had the
smallest vocabulary, which reinforces the hypothesis that size affects
transcription accuracy. Its accuracy rating was 3.55% lower than that of the .99
Confidence Vocabulary. This comparison offers major insight into the effect of adjusting the minimum
confidence score by only one percentage point: the change made the vocabulary
too specific, removing important keywords that were
spoken throughout the lecture video.
5.5 Discussion
Figure 5.5.1 Accuracy Rate of Vocabularies of all Trials
The experiments conducted in Trial #1 and Trial #2 have shown that providing
an ASR service with keywords from lecture-related documents makes the service more likely
to produce accurate transcriptions. Preprocessing lecture-related material
is beneficial and can provide foresight to ASR services prior to transcription. Each
preprocessing approach produced different results, with the Confidence approach performing
best at an accuracy rate of 82.24%. The Frequency approach achieved an
accuracy rate of 72.78%, and the Similarity approach 69.82%. The Confidence approach not only performed the best among all the approaches
but also produced the largest custom vocabulary, containing 230 keywords. The
Similarity approach produced the smallest vocabulary, containing only 8 keywords.
The confidence preprocessing approach produced the largest vocabularies while also
remaining the most adjustable of the approaches, allowing for optimization. It
can produce broad vocabularies by using a minimum confidence
score of .99 or below while remaining capable of producing more limited vocabularies when
the minimum confidence score is set to 1.0. The frequency approach in Trial #1 required a
minimum frequency count of two for keywords to qualify for the vocabulary. Increasing the
minimum requirement would only produce more specific vocabularies, while
reducing the minimum occurrence count to one would produce a vocabulary as large as
all possible keywords contained in the document. The frequency approach may be more
suitable for detailed documents, such as a chapter from a textbook, since slide presentations are
generally broader and summarize material while a textbook chapter is more detailed.
A notable error that occurred throughout both Trial #1 and Trial #2 is that the ASR
service would overcorrect words to match entries in the custom vocabulary exactly. In
Trial #1, during the similarity transcription job, the ASR service repeatedly corrected
the word “textual” to “textual data” as written in the custom vocabulary. “Textual
data” was a keyword in the custom vocabulary and was spoken during the lecture video,
but as the lecture continued the speaker used the word “textual” on its own in other
sentences; the ASR service would still change the transcript to “textual-data”.
Although this overcorrection did not drastically alter the meaning of the sentences, it
hurt the readability of the transcript.
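The hyphenated form reflects how multi-word phrases are written in a Transcribe custom vocabulary file, where spaces within a phrase are replaced with hyphens (so "textual data" is entered as "textual-data"). A minimal formatting helper (the function name is an assumption) might be:

```python
def to_vocabulary_phrase(phrase: str) -> str:
    """Format a key phrase for an AWS Transcribe custom vocabulary:
    words within a phrase are joined with hyphens instead of spaces."""
    return "-".join(phrase.strip().split())

print(to_vocabulary_phrase("textual data"))  # textual-data
```

Because the vocabulary entry itself is hyphenated, an overcorrecting match can surface that hyphenated form verbatim in the transcript, producing exactly the "textual-data" artifact described above.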
For this experiment, it is important to highlight the external factors that may have
influenced the results. A major one is the sound quality of the lecture video.
Although the audio may be clear to a human listener, it may not have been of the
highest quality, which could affect the ASR service’s ability to interpret the lecture.
Other factors concern the speaker: natural tone and dialect, and whether a microphone
or audio system was used to amplify the speaker’s voice. Awkward pauses and stutters
can also result in poor punctuation and grammar in the transcription, influencing the
final transcript.
The similarity approach in Trial #1 had the worst performance of all transcription
jobs supplied with a custom vocabulary, but it still improved on the baseline
transcription job that used no preprocessing approach. Although the vocabulary
contained only 8 keywords, the similarity approach improved the accuracy rating by
6.49%. This comparison demonstrates the potential of preprocessing approaches paired
with ASR services. The approach may be better suited to preprocessing other kinds of
documents, but it still reinforces the trend that a small vocabulary is better than no
vocabulary.
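If the 6.49% gain is read as an absolute difference in accuracy rate (an assumption here, since the baseline figure is not restated in this section), the no-vocabulary baseline works out to roughly:

```python
similarity_accuracy = 69.82   # similarity approach, Trial #1
improvement = 6.49            # reported gain over the no-vocabulary baseline
baseline_accuracy = round(similarity_accuracy - improvement, 2)
print(baseline_accuracy)  # 63.33
```

On that reading, even the smallest vocabulary lifts accuracy from roughly 63% to just under 70%.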
6. Conclusion
Much like how a student can be introduced to information before a lecture to help
increase the understanding of the material, an ASR service can be provided with information
about the lecture prior to transcribing, helping it produce more accurate
transcriptions and captions. Preprocessing lecture material has proven to be a
powerful way to support ASR services in generating high-quality transcriptions and
captions. Many ASR services support specialized training for transcription jobs
through custom vocabularies of keywords supplied before transcribing or captioning.
The AWS cloud infrastructure
allows for a system to be constructed to provide users with an automated process to generate
custom vocabularies.
Cloud-based natural language processing tools such as AWS Comprehend can accurately
extract meaningful keywords to support ASR services and can help automate the
generation of custom vocabularies. Custom vocabularies should contain as many keywords
as possible without sacrificing keyword quality. The Confidence approach has shown
that it is possible to produce quality custom vocabularies that steadily improve the
accuracy rating of the ASR service. If a user’s objective is a more exclusive
vocabulary consisting strictly of the most meaningful keywords, the Frequency or
Similarity approach may be more appropriate.
The pandemic caused by COVID-19 has dramatically changed the way people
interact with one another. The demand for captioning has increased as social distancing has
been encouraged and educational environments have transitioned into online formats. ASR
services are becoming more reliable and are readily available for instructors to use at any
time. ASR services can produce transcriptions of live lectures and educational videos, which
can provide students with an alternative resource for their academic studies. By leveraging
cloud-based services, educational environments can quickly adapt to the changing times and
better support students throughout their academic journey.