Applying Speech Recognition and Language Processing Methods to Transcribe and
Structure Physicians’ Audio Notes to a Standardized Clinical Report Format
by
Syed M. Faizan
Submitted in partial fulfilment of the requirements
for the degree of Master of Computer Science
at
Dalhousie University
Halifax, Nova Scotia
February 2020
© Copyright by Syed M. Faizan, 2020
TABLE OF CONTENTS
List of Tables ................................................................................................................... viii
List of Figures ......................................................................................................................x
Abstract .............................................................................................................................. xi
List of Abbreviations Used ............................................................................................... xii
Acknowledgments............................................................................................................ xiii
Chapter 1. Introduction ....................................................................................................1
1.1 Solution Approach................................................................................................ 1
1.2 Research Objectives ............................................................................................. 2
1.2.1 First Objective ................................................................................................2
1.2.2 Second Objective ...........................................................................................3
1.3 Research Scope .................................................................................................... 3
1.4 Thesis Contribution .............................................................................................. 3
1.5 Thesis Organization.............................................................................................. 4
Chapter 2. Background and Related Work ......................................................................5
2.1 Health Record ....................................................................................................... 5
2.1.1 Electronic Health Record (EHR) ...................................................................5
2.2 Clinical Report ..................................................................................................... 5
2.2.1 SOAP .............................................................................................................6
2.3 Clinical Documentation........................................................................................ 9
2.3.1 Challenges ......................................................................................................9
2.3.2 Modes ...........................................................................................................10
2.3.3 Workflows....................................................................................................11
2.3.4 Problems ......................................................................................................12
2.3.5 Summary ......................................................................................................13
2.4 Speech Recognition ............................................................................................ 14
2.4.1 Acoustic Model ............................................................................................15
2.4.2 Language Model ..........................................................................................17
2.4.3 Evaluation Metric.........................................................................................19
2.5 Speech Recognition in Clinical Documentation ................................................ 20
2.5.1 Review Objective .........................................................................................20
2.5.2 Literature Exploration ..................................................................................21
2.5.3 Analysis........................................................................................................21
2.5.4 Findings........................................................................................................22
2.6 Handling Noise in Speech Recognition ............................................................. 25
2.6.1 Feature-space Techniques ............................................................................26
2.6.2 Model-space Techniques .............................................................................27
2.6.3 Summary ......................................................................................................28
2.7 Domain-Specific Speech Recognition ............................................................... 29
2.7.1 Out-of-Domain Model Adaptation ..............................................................29
2.7.2 Domain Relevant Data Generation ..............................................................30
2.7.3 Summary ......................................................................................................31
2.8 Speech Recognition Systems ............................................................................. 32
2.8.1 Deep Speech.................................................................................................32
2.8.2 ESPnet ..........................................................................................................34
2.8.3 Wav2Letter++ ..............................................................................................35
2.8.4 CMUSphinx .................................................................................................36
2.8.5 Kaldi .............................................................................................................36
2.8.6 Julius ............................................................................................................36
2.8.7 Google Speech-to-Text ................................................................................36
2.8.8 Summary ......................................................................................................37
2.9 SOAP Classification ........................................................................................... 38
2.10 Conclusion .......................................................................................................... 39
Chapter 3. Research Methodology ................................................................................40
3.1 Introduction ........................................................................................................ 40
3.2 Layer 1: Transcription of Physicians’ Audio Notes ........................................... 40
3.2.1 Selection of Methods ...................................................................................40
3.2.2 Preliminary Analysis ....................................................................................42
3.2.3 Development of Solution .............................................................................43
3.2.4 Evaluation of solution ..................................................................................43
3.3 Layer 2: Autonomous Generation of Clinical Report ........................................ 44
3.3.1 Data Labeling ...............................................................................................44
3.3.2 Selection of Method .....................................................................................44
3.3.3 Development of Solution .............................................................................44
3.3.4 Evaluation of Solution .................................................................................44
3.4 Dataset ................................................................................................................ 44
3.4.1 Data Collection ............................................................................................45
3.4.2 Dataset Preparation ......................................................................................45
3.5 Research Environment ....................................................................................... 46
3.6 Conclusion .......................................................................................................... 47
Chapter 4. Speech Recognition – Methods ...................................................................48
4.1 Introduction ........................................................................................................ 48
4.2 Domain Adaptation ............................................................................................ 48
4.3 Preliminary Analysis .......................................................................................... 49
4.4 Acoustic Modeling ............................................................................................. 51
4.4.1 Audio Pre-Processing...................................................................................51
4.4.2 Fine-tuning ...................................................................................................53
4.5 Language Modeling............................................................................................ 56
4.5.1 Dataset Augmentation with Domain Relevant Data ....................................56
4.5.2 Developing Domain-Specific Language Models .........................................61
4.5.3 Enhancing Pre-Trained Language Model ....................................................63
4.5.4 Summary ......................................................................................................65
4.6 Discussion .......................................................................................................... 66
4.7 Conclusion .......................................................................................................... 66
Chapter 5. Speech Recognition – Evaluations...............................................................67
5.1 Introduction ........................................................................................................ 67
5.2 Setup ................................................................................................................... 67
5.2.1 Evaluation Phases ........................................................................................67
5.2.2 Cross-Validation ..........................................................................................68
5.2.3 Interpretation of Results ...............................................................................68
5.3 Evaluating Acoustic Modeling ........................................................................... 69
5.3.1 Analysis........................................................................................................70
5.4 Evaluating Language Modeling Methods .......................................................... 71
5.4.1 Analysis........................................................................................................72
5.5 Combined Evaluation ......................................................................................... 80
5.5.1 Analysis........................................................................................................80
5.6 Working Examples ............................................................................................. 82
5.6.1 Best Case ......................................................................................................82
5.6.2 Worst Case ...................................................................................................84
5.6.3 Random Cases ..............................................................................................86
5.7 Final Validation .................................................................................................. 91
5.8 Discussion .......................................................................................................... 92
5.9 Conclusion .......................................................................................................... 93
Chapter 6. Classification of Transcription into SOAP Categories ................................94
6.1 Introduction ........................................................................................................ 94
6.2 Methods .............................................................................................................. 94
6.2.1 Assigning SOAP categories to Dataset ........................................................94
6.2.2 Exemplar-based Concept Detection .............................................................98
6.2.3 Exemplar-based Sentence Classification ...................................................100
6.2.4 Summary ....................................................................................................104
6.3 Evaluation......................................................................................................... 105
6.3.1 Experimental Setup ....................................................................................105
6.3.2 Results ........................................................................................................107
6.4 Discussion ........................................................................................................ 124
6.5 Conclusion ........................................................................................................ 125
Chapter 7. Discussion ..................................................................................................126
7.1 Introduction ...................................................................................................... 126
7.2 Observations ..................................................................................................... 126
7.3 3-Layer Solution Framework ........................................................................... 127
7.4 Limitations ....................................................................................................... 128
7.5 Future Work ..................................................................................................... 129
7.6 Conclusion ........................................................................................................ 130
References ........................................................................................................................131
Appendix A. AP Scores for SOAP Classification .......................................................146
LIST OF TABLES
Table 2.1 Documentation Guidelines of SOAP categories adapted from [13] ................... 7
Table 3.1 Summary of Research Environment ................................................................. 46
Table 4.1 Preliminary Results: WER and CER of Cloud and Offline SR ........................ 49
Table 4.2 Examples mistakes by DeepSpeech during preliminary experiments. ............. 49
Table 4.3 Hyperparameters used for acoustic model fine-tuning ..................................... 53
Table 5.1 Training and testing error rates for Acoustic Model training ........................... 70
Table 5.2 Absolute and relative change in error rates after acoustic modeling ................ 70
Table 5.3 List of scenarios for Language Model evaluations ........................................... 71
Table 5.4 Training error rates from each evaluation scenario .......................................... 73
Table 5.5 Cross-validated error rates from each evaluation scenario ............................... 73
Table 5.6 Absolute and relative change in error rates compared to baseline ................... 74
Table 5.7 Variables to analyze each experimented method .............................................. 75
Table 5.8 Comparing error rates between original and augmented corpus ...................... 77
Table 5.9 Comparing error rates between new and enhanced pooled model ................... 78
Table 5.10 Comparing error rates between new and enhanced interpolated model ......... 78
Table 5.11 Error rates after testing with combined models. ............................................. 80
Table 5.12 Performance comparison of DeepSpeech (before and after) with Google ..... 81
Table 5.13 Errors in the best transcribed note .................................................................. 83
Table 5.14 Errors in the worst transcribed note ................................................................ 84
Table 5.15 Errors in the first examined note transcription ............................................... 86
Table 5.16 Errors in the second examined note transcription ........................................... 89
Table 5.17 Error rates on the validation dataset ............................................................... 91
Table 6.1 Patterns observed in subjective category sentences .......................................... 95
Table 6.2 Patterns observed in objective category sentences ........................................... 96
Table 6.3 Patterns observed in assessment category sentences ........................................ 96
Table 6.4 Patterns observed in plan category sentences ................................................... 97
Table 6.5 Cases based on Stop Word, Similarity Function and Exemplar type ............. 105
Table 6.6 Conversion of categorical variables to numeric ............................................. 107
Table 6.7 Confidence scores of sentences from the baseline condition ......................... 109
Table 6.8 Highest and Lowest AP scores for SOAP categories ..................................... 110
Table 6.9 Summary of regression analysis results .......................................................... 110
Table 6.10 Best performing values for maximum n-gram length for each case ............. 112
Table 6.11 Summary of optimal conditions for each SOAP category ............................ 117
Table 6.12 Confidence scores of sentences from case 1 using 8-gram limit .................. 118
Table 6.13 Confidence scores of sentences from case 3 using 11-gram limit ................ 119
Table 6.14 Confidence scores of sentences from case 1 using 1-gram limit .................. 121
Table 6.15 Confidence scores of sentences from case 7 using 6-gram limit .................. 123
LIST OF FIGURES
Figure 1.1 Solution approach for clinical documentation ................................................... 2
Figure 2.1 Structure of any SR system in its simplest form ............................................. 15
Figure 2.2 Query Tree for Literature search ..................................................................... 21
Figure 2.3 Structure of Deep Speech Model [77] ............................................................. 33
Figure 2.4 Processing stages in ESPnet recipe [75] .......................................................... 35
Figure 3.1 Steps in Research Methodology spanning over two layers ............................. 40
Figure 4.1 Steps of Pattern Extraction .............................................................................. 58
Figure 4.2 Steps of Knowledge Extraction ....................................................................... 59
Figure 4.3 Steps to Generate Augmented Corpus ............................................................. 61
Figure 5.1 Flow of experiments ........................................................................................ 67
Figure 5.2 Error loss in Acoustic Model training ............................................................. 69
Figure 5.3 Result summary of regression analysis using WER ........................................ 76
Figure 5.4 Result summary of regression analysis using CER ......................................... 76
Figure 6.1 Exemplar based algorithms proposed by Juckett [102] ................................... 98
Figure 6.2 Outline of word-class array [102].................................................................... 99
Figure 6.3 Pseudocode for Exemplar-based Sentence Classification Algorithm ........... 103
Figure 6.4 Precision-Recall curve of the baseline condition .......................................... 108
Figure 6.5 All AP scores over n-gram max length ......................................................... 111
Figure 6.6 Box plots of scores after removing stop words ............................................. 114
Figure 6.7 Comparing performance based on similarity function (Cosine vs Jaccard) .. 115
Figure 6.8 Comparing performance based on exemplar type (sentence vs n-gram) ...... 116
Figure 6.9 Precision-Recall Curve of best performing condition for Subjective ........... 117
Figure 6.10 Precision-Recall Curve of best performing condition for Objective ........... 120
Figure 6.11 Precision-Recall Curve of best performing condition for Assessment ....... 122
Figure 6.12 Precision-Recall Curve of best performing condition for Plan ................... 122
Figure 7.1 Design for 3-Layer Solution Framework ...................................................... 127
ABSTRACT
Clinical documentation often takes the form of an audio recording of the clinical encounter by the specialist, which is subsequently transcribed manually and added to the patient's medical record.
The current clinical documentation process is tedious, error-prone and time-consuming,
more so for specialists working in the emergency department given the rapid turnaround
of high-acuity patients. In this thesis, we investigate methods to automate the clinical
documentation processes for a pediatric emergency department, leading to the generation
of a SOAP report of the clinical encounter. Our approach involves (a) speech recognition
to transcribe the audio recording of the clinical encounter to a textual clinical encounter
report; and (b) identifying and classifying the sentences within the textual report in terms
of the standard SOAP format for clinical reports. For speech recognition, we worked with
the DeepSpeech application and used recurrent neural network and n-gram based methods,
augmented with medical terminologies and heuristics, to develop domain-specific acoustic
and language models. Our approach resulted in a 49.02% reduction in critical errors compared to the baseline acoustic and language models provided by DeepSpeech. For
generating a SOAP report from the clinical text, we extended an exemplar-based concept
detection algorithm to learn a sentence classifier to identify and annotate the clinical
sentences in terms of subjective, objective, assessment and plan. Our SOAP classifier
achieves a precision of 0.957 (subjective), 0.919 (objective), 0.626 (assessment) and 0.82
(plan).
LIST OF ABBREVIATIONS USED
EHR Electronic Health Record
EMR Electronic Medical Record
SR Speech Recognition
ANN Artificial Neural Network
DNN Deep Neural Network
RNN Recurrent Neural Network
CNN Convolutional Neural Network
WER Word Error Rate
CER Critical Error Rate
SOAP Subjective Objective Assessment Plan (Acronym)
NLP Natural Language Processing
CD Clinical Documentation
LM Language Model
AM Acoustic Model
MeSH Medical Subject Headings (Acronym)
ACKNOWLEDGMENTS
I am deeply thankful and blessed to have Dr. Syed Sibte Raza Abidi as my supervisor on
my journey of this research. I am grateful for his guidance, support, and advice that has
helped me a lot throughout the course of my degree. None of this would have been possible
without him. I also acknowledge the support of all members of the NICHE research group
for their suggestions and constructive feedback that have facilitated this work. I would like to thank Ali Daowd for his assistance with healthcare-specific questions, William Van Woensel and Patrice Roy for their expert technical opinions, and Brett Tylor for providing a dataset for this work. I would also like to thank Asil Naqvi for his constant moral support through all the difficult times.
Chapter 1. INTRODUCTION
Clinical reports are vital to providing quality healthcare. These reports help physicians recall previous episodes of care given to patients [1]. Moreover, they support quality healthcare by providing a medium for regular audits of the care delivery process [2]. Clinical documentation is the process in which physicians note down an encounter synopsis and generate clinical reports. It is a challenging (Section 2.3.1) and tedious process, which can sometimes take twice as much time as the patient interaction itself [3], [4]. Clinical documentation is done in three modes (Section 2.3.2): hand-written, type-written, and dictation; while adhering to two workflows (Section 2.3.3): front-end and back-end [6]. Hand-written reports are mostly illegible and unsatisfactory [5]. Type-writing and dictation are currently the most common modes of documentation among physicians. In general, the process of clinical documentation poses five major problems (Section 2.3.4) that arise from particular combinations of modes and workflows. These problems motivate this thesis to work on methods to automate the clinical documentation process.
1.1 Solution Approach
This thesis focuses on using the physician's dictated audio notes to autonomously generate structured clinical reports. Dictations are given as audio; therefore, Speech Recognition (SR) is used to transcribe the audio notes. The transcriptions are then analyzed for SOAP categories, which can be organized to form a standardized clinical report.
After a detailed review of the use of speech recognition in clinical documentation (Section 2.5), this thesis makes two observations. Firstly, there is a need for more accurate SR systems. Secondly, physicians prefer to dictate in freestyle; hence there is a need for research on the autonomous preparation of reports from transcribed notes. This thesis considers the documentation process from an end-to-end perspective, in which physicians are able to dictate in freestyle. Dictated audio notes should first be transcribed into text using a noise-robust, domain-specific SR system. Subsequently, the transcribed notes should be organized into a structured clinical report using autonomous language processing techniques. Accordingly, our approach forms two separate research objectives, whilst keeping the focus of this thesis on the individual methods within each objective. Figure 1.1 illustrates the solution design for this thesis.
Figure 1.1 Solution approach for clinical documentation
1.2 Research Objectives
This thesis is organized as a multi-layer solution pipeline, where each layer poses its own set of problems requiring active rounds of research. Two main research problems are identified, and addressing them is requisite to reaching a conclusion. These problems act as the objectives of this thesis and are defined in the subsections below.
1.2.1 First Objective
The first objective of this thesis is “to accurately recognize speech content from physicians’
dictated audio notes". To guide the achievement of this objective, this research poses a set of three research questions.
1. What are the challenges and shortcomings of SR technology in clinical environments?
2. What are the current advancements in SR that can address those challenges?
3. What are the steps required to develop robust SR systems in clinical environments?
The first research question is addressed in Section 2.5, which presents a detailed review of the use of SR in clinical documentation, its challenges and shortcomings, and the reasons for those shortcomings. For the second question, Sections 2.6 and 2.7 present the current advancements in SR technology that address the challenges of noise robustness and domain robustness. Finally, this thesis responds to the third question by presenting our methods in Chapter 4, which are evaluated in Chapter 5.
1.2.2 Second Objective
The second objective of this thesis is “to organize transcribed physicians’ dictated audio
notes into meaningful categories of the clinical report structure”. To achieve this objective,
we investigate algorithms to classify the transcribed text into SOAP categories.
1.3 Research Scope
The main scope of this thesis is limited to the research problems within the individual layers of our proposed solution pipeline. The feasibility and end-to-end performance of the solution pipeline are not examined in this work.
Clinical reports are of various types, since documentation is done in almost every healthcare activity. The scope of this thesis is also limited to the pediatric emergency department, which usually generates reports in the SOAP (Subjective, Objective, Assessment, and Plan) format; therefore, this thesis follows the SOAP structure.
In the area of speech recognition, this work considers single-channel audio input. For SR systems, the scope of this thesis is limited to open-source systems that operate offline.
1.4 Thesis Contribution
This thesis spans two domains of computer science and thus contributes to both. It reports five contributions, which are listed below.
1. This work approaches the process of clinical documentation from an end-to-end perspective. To the best of our knowledge, this perspective has not been approached before. This thesis takes freestyle dictations as input and generates a formatted clinical report as output. The operating scope of this solution is currently limited to emergency departments following the SOAP structure. Nevertheless, the solution has the potential to expand to other clinical departments and domains as well.
2. In this work, we demonstrate the use of Project DeepSpeech [7] in the healthcare domain.
3. We apply methods to adapt out-of-domain pre-trained DeepSpeech models and train healthcare domain-specific models without requiring a large healthcare dataset.
4. We propose a method to augment the healthcare domain-specific dataset by generating simulated domain-relevant data using the principles of the synonym replacement method.
5. Lastly, this thesis contributes by extending an exemplar-based concept detection algorithm to develop a sentence classifier for SOAP categories.
1.5 Thesis Organization
This thesis consists of seven chapters. Chapter 2 provides background on the individual concepts that are frequently referred to in this thesis. Chapter 3 presents the research methodology and the tools used for this research. Chapters 4, 5, and 6 demonstrate the work done for each layer of our solution model. Lastly, Chapter 7 presents a thorough discussion of the observations, limitations, and future work, and concludes the thesis.
Chapter 2. BACKGROUND AND RELATED WORK
2.1 Health Record
A patient's health record comprises a chronologically sequenced assortment of clinical documents, such as clinical reports, lab reports and x-rays, that are generated over time by the healthcare professionals providing care services to that patient. These records present the medical history and the events of care given within the healthcare provider's institution. The term health record is often used interchangeably with medical record or medical chart.
2.1.1 Electronic Health Record (EHR)
Electronic Health Records (EHRs) are software systems that manage patients' health records electronically. Like health records and medical records, the term EHR is often interchanged with Electronic Medical Record (EMR). However, there is a subtle difference between the two terms. An EMR has a limited working jurisdiction and operates only within one provider's institutional domain, while EHRs are designed to serve across multiple institutions, i.e., they are available to multiple providers as well as to researchers and policymakers [8], [9].
2.2 Clinical Report
A clinical report is a type of clinical document that is appended to a patient's health record. Clinical reports include details of a patient's clinical status and assessments, recorded by healthcare providers during a hospitalization visit or in outpatient care [10]. Clinical reports facilitate communication between healthcare providers [11]. These reports can be of various types, such as progress reports, visit reports and discharge summaries [12]. Clinical reports can be unstructured, semi-structured, or systematically structured [13]. Unstructured and semi-structured reports present information as free text, which blends relevant and vital information with insignificant details, makes retrieval of required information difficult, and often adversely increases cognitive load [14]. Structured reports, however, follow systematic guidelines to structure the information; thus, important information can be found quickly and with little effort. In this thesis, we work with systematic reports that follow a defined structure.
2.2.1 SOAP
SOAP is a universally accepted method that provides systematic guidelines for writing structured clinical reports [13], [15]. SOAP is the acronym for Subjective, Objective, Assessment, and Plan, which represent the categories within its structure. SOAP was first theorized over 50 years ago [16] to advise medical students on writing effective reports [17], and it was later widely accepted among healthcare professionals [16]. Table 2.1 summarizes the SOAP guidelines as listed by Sando [13]. SOAP-formatted reports offer various benefits over unstructured or semi-structured reports.
1. SOAP reports follow a defined structure, which lowers the mental effort needed to extract the required information efficiently and quickly [14].
2. SOAP reports are clear, concise, and accurate, and allow efficient communication between healthcare providers. For this reason, providers explicitly use these reports to give recommendations to each other [10], [17].
3. SOAP encourages providers to write complete reports by reminding them about specific tasks. In this way, SOAP ensures complete reports and enhances the quality of healthcare delivery [16].
4. SOAP reports systematically record the encounters between healthcare professionals and patients. Hence, they can also serve as an evaluation tool for accountability, billing, and legal documentation [1], [2], [18].
5. When care delivery requires multiple healthcare providers, effectively written SOAP reports can speed up the delivery process by eliminating the need for redundant history-taking episodes for each provider [18].
Table 2.1 Documentation Guidelines of SOAP categories adapted from [13]
Category Documentation Guidelines (Semantic concepts for each category)
Subjective
• Demographics
• Patient concerns and complaints
• Current health problems
• Current medications
• Current allergies
• Past medical history
• Family history
• Social history
Objective
• Drugs administered
• Physical signs and symptoms
• Vital signs
• Medication lists
• Laboratory data
Assessment
• Active problem list with an assessment of each problem
• Actual and potential problems that warrant surveillance
• Therapeutic appropriateness including route and method of
administration
• Goals of therapy for each problem.
• Degree of control for each disease stated
Plan
• Adjustments made to drug dosage, frequency, form or route of
administration
• Patient education and counseling provided
• Oral and written consultations to other health care providers
• Follow-up Plan
• Monitoring parameters
On the one hand, SOAP is praised for its easy-to-use guidelines. On the other hand, some studies highlight its limitations. Lin [19] questions whether swapping the structure of SOAP to APSO increases efficiency. Moreover, many studies propose extended versions of SOAP; however, those studies arguably only add more guidelines on top of the basic ones. In this thesis, we focus on the categories of SOAP but not their sequence; therefore, our produced reports can be sequenced in any order to satisfy any structural variation of SOAP. Each SOAP category is further explained in the subsections below.
• SUBJECTIVE
The Subjective category suggests documenting everything that comes from the patient's perspective and experiences [16]. It distinguishes what the patient thinks about the situation from what healthcare providers believe.
• OBJECTIVE
In the Objective category, providers are guided to include all observations, examination findings, and diagnostic tests; specifically, those findings that the provider can objectively confirm. Providing objective measures in a separate category allows readers to focus immediately on the relevant, clinically verifiable information without needing to consult the subjective details [18].
• ASSESSMENT
In the Assessment category, providers explicitly mention the diagnosis along with the rationale that leads to it. This category might also include the tradeoffs that were considered while reaching the diagnosis [16].
• PLAN
The Plan category documents any treatment plans or directions that the provider may have for the patient or for another provider. This section is also used to record secondary treatment plans that will be considered if the primary plan does not produce appropriate results [16].
2.3 Clinical Documentation
Clinical Documentation (CD) is the process in which healthcare providers generate clinical reports after every patient encounter. These encounters can serve any purpose and be of any length. Even minor patient-physician interactions must be noted accurately and promptly, with the same level of responsibility usually taken when preparing reports of surgeries and other lifesaving treatments. Clinical documentation is a fundamental skill [13], and generating accurate and complete clinical reports is a core responsibility of healthcare providers. The process of clinical documentation has multiple aspects, which are described in the following subsections.
2.3.1 Challenges
The goal of the clinical documentation process is to generate high-quality and practical
clinical reports to ensure efficient and effective healthcare delivery; however, achieving
this goal is a challenging task. Primarily, the process of clinical documentation poses five main challenges.
• ACCURACY
The first challenge of clinical documentation is to produce accurate clinical reports. This means that all the contents of a report must reflect the truth and that nothing in the report is wrong.
• COMPLETENESS
Completeness is the second challenge: reports must cover all aspects of the patient-provider encounter, and no detail should go unnoticed.
• TURNAROUND TIME
Turnaround time is the time that providers take to perform documentation-related tasks after finishing one encounter and before starting another. The third challenge of this process is to keep turnaround times as low as possible.
• PROVIDER THROUGHPUT
Throughput reflects the efficiency level of a provider. Documentation-related tasks can sometimes take more time and effort than the encounter itself and can overburden providers, which can lower the quality of healthcare delivery. Therefore, the fourth challenge of the clinical documentation process is to keep provider throughput as high as possible.
• COST
The fifth challenge is to reduce the overall cost. Two types of cost are linked with this challenge: setup cost and report generation cost. Setup cost is the cost of setting up any tools that assist with the process, while generation cost is the cost of generating each report.
2.3.2 Modes
Clinical documentation is usually done in three modes: hand-written, type-written, or
voice-dictated. Institutions typically adopt one of these modes, though many use multiple modes to create redundancy in the documentation. This section defines each mode in detail.
• HAND-WRITTEN
Classically, clinical reports are produced by hand on a piece of paper. It is the simplest of all modes, yet it carries a plethora of problems. Reports prepared by hand are often illegible and are frequently never consulted again by providers. Moreover, producing reports in this way takes considerable time. Handwritten reports are also prone to damage as they exist only in physical form, and making a copy is a hassle.
• TYPE-WRITTEN
The adoption of EHRs changed the way clinicians perform documentation tasks. EHRs require the digital entry of information within reports: healthcare providers use a keyboard and mouse to type reports directly into the EHR software. This reduces the clutter of illegible paper-based reports and makes comprehension easier. However, healthcare providers must now learn to use these new systems; without such training, there is little or no improvement in documentation times. A study has shown that even with EHRs, healthcare providers still spend 38.5% of their time creating documentation [4].
• VOICE-DICTATED
Voice dictations are not new for healthcare providers. In many institutions, providers are supported by dedicated transcriptionists who handle all the documentation-related tasks. Healthcare providers brief transcriptionists about the clinical encounter through dictations, and the transcriptionists then complete the reports. This practice shifts the burden of documentation away from providers, who are busy with other crucial tasks. In institutions where no transcriptionist is available, the practice of dictation still benefits providers. They record dictations after encounters and spend their time on more important tasks when needed. At the end of all encounters, or later when they have time, they complete the documentation tasks using the pre-recorded dictations.
A number of newer EHR systems also now provide voice-enabled options to enter information within reports. These EHR systems use speech recognition technology and allow providers to use their voice to enter information into the individual sections of clinical reports. An increasing number of institutions are now implementing voice-enabled EHR systems to facilitate providers; nevertheless, provider adoption of this documentation mode remains an open question.
2.3.3 Workflows
Documentation is usually done in two workflows: front-end and back-end. Both workflows start when the provider has finished dealing with the patient in an encounter and end when a report is generated and submitted. Details about both workflows are given below.
• FRONT-END
In the front-end workflow, the provider is responsible for the proper generation and
submission of clinical reports right after the encounter, and before starting any other
encounter.
• BACK-END
In the back-end workflow, the provider has the liberty to delay the report generation process to a time of their choosing, before a deadline set by the institution or another regulatory body. In this workflow, the provider can perform encounters back to back to clear patient queues and then generate all the clinical reports at once.
2.3.4 Problems
Meeting all five challenges of clinical documentation is a challenge in itself. Documentation modes and workflows usually focus on a subset of these challenges, and institutions often practice the modes and workflows that are closest to their requirements. In essence, no current approach meets all five challenges at the same time. Therefore, current documentation practices fail to meet some challenges and exhibit various problems. There are five major problems; some are caused by a specific workflow or mode, while others are due to the overall process. The problems are defined below.
• LOW QUALITY OF CLINICAL REPORTS
A major problem in the clinical documentation process is the low quality of clinical reports, which occurs primarily in institutions that overlook accuracy and completeness in favor of low turnaround times, high throughput, and low costs.
• HIGH TIME CONSUMPTION
Writing a clinical report takes time. In the case of dictated notes, physicians can postpone the preparation of reports to increase throughput, but this also increases report turnaround times. With current workflows and modes, this poses a major problem.
• LOSS OF MINOR CLINICAL ENCOUNTERS
Due to the tedious nature of the process, minor encounters are usually ignored, which
in turn poses a high risk of lower quality in healthcare delivery.
• HIGH COST
Many institutions employ dedicated transcriptionists and implement advanced EHR
systems to facilitate providers with the burden of the documentation process. Both
options are costly and require extensive funds.
• LACK OF OPPORTUNITY TO DEVELOP DATASETS FOR RESEARCH
This problem mainly arises from illegible hand-written reports on paper forms, as patient records become cluttered and managing them is a hassle [14]. The problem is also present in type-written reports to some extent: physicians develop the habit of copying and pasting, and incompleteness is rather common in these reports [20].
2.3.5 Summary
Clinical documentation is the backbone of quality healthcare; however, it often takes more
time than treating patients. There are five main challenges of clinical documentation:
accuracy, completeness, turnaround time, provider throughput, and cost. Each institution sets policies that focus on a subset of these challenges and practices a workflow and documentation mode that comply with those policies. There are two commonly practiced workflows: front-end and back-end; and three documentation modes: hand-written, type-written, and voice-dictated. Currently, no combination of documentation mode and workflow meets all the challenges; therefore, current documentation practices raise a number of problems. The five major problems of the clinical documentation process are 1)
low quality of reports, 2) high time consumption, 3) loss of minor clinical encounters, 4)
high cost, and 5) lack of opportunity to develop datasets for research.
2.4 Speech Recognition
This section provides background on the fundamental concepts of Speech Recognition (SR). Speech recognition refers to any machine-based method that takes speech as input and converts it into text. SR research is generally categorized into three areas: isolated word recognition, continuous speech recognition, and speech understanding [21]. Isolated word recognition methods are applied to speech inputs in which the speaker utters individual words separately. Continuous speech recognition methods are used when speech is continuous and natural, whereas speech understanding is used when we are more interested in the meaning of the speech than in the specific words. In this thesis, we explore methods from continuous speech recognition, as we work with natural and continuous speech.
Speech is recorded in the form of audio signals, which are long arrays of timed sound-intensity values. These values capture the acoustic features of the speech in the form of phonemes, the basic sound units of a language. Phonemes, when combined, form the basis for words and other lexical units within the speech. For effective recognition, SR requires a prior understanding of such acoustic behavior, along with knowledge of the grammatical and other rules that exist within the language. This information is provided to SR methods through acoustic and language models trained on speech data from the language. These models are the core components of any SR system. First, acoustic features from the input audio are matched against the acoustic model to identify language units (phonemes), which are then used to produce an estimated sequence of words using the rules from the language model. The robustness of these models is vital to recognition performance. Figure 2.1 illustrates the structure of an SR system in its simplest form.
Figure 2.1 Structure of any SR system in its simplest form
When a speech input is given, the objective of SR is to obtain the optimal word sequence for the given speech $X$, which is a form of the well-known maximum a posteriori (MAP) problem [22].

Equation 1

$$\hat{W} = \arg\max_{W} \, P_{\Lambda,\Gamma}(W \mid X)$$

In Equation 1, the optimal word sequence $\hat{W}$ is the word sequence $W$ that maximizes the likelihood of the given speech signal $X$ under an acoustic model $\Lambda$ and a language model $\Gamma$. Both models are components in the construction of SR systems and are further explained in the subsequent sections.
2.4.1 Acoustic Model
The acoustic model forms the basis of any SR system [23]. Its purpose is to estimate whether a particular phoneme is uttered in a given audio sequence. Acoustic models are statistical or machine learning models that map the relationship between audio signals and linguistic units. They are trained on hours of speech data to generalize these relationships effectively.
Audio signals are digitally recorded sound waves, which are made up of sequential samples. The sample rate of an audio file denotes the number of samples recorded in one second, and the resolution of a sample refers to the amount of memory (bits) used to record each sample. For example, a 10-second audio file with a sample rate of 16000 Hz and a resolution of 16 bits contains a total of 160,000 samples recorded in 16 bits each, which translates into a raw size of about 312 kilobytes (160,000 samples x 16 bits = 320,000 bytes) of memory. Audio signals are long arrays; therefore, they are segmented into equal intervals, called frames, for processing and model construction. Frames are small but overlapping windows within the audio file, and the frame size remains fixed throughout the SR system. As an example, for a 16000 Hz audio signal with a frame length of 256 samples, the first frame covers samples 0 through 255, the second frame starts at sample 128, and so on.
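To make the framing arithmetic concrete, the following Python sketch (an illustrative example, not taken from the thesis) segments a mono 16 kHz signal into 256-sample frames with a 128-sample hop, mirroring the example above.

import numpy as np

SAMPLE_RATE = 16000   # samples per second
FRAME_LEN = 256       # samples per frame
HOP_LEN = 128         # 50% overlap: the next frame starts 128 samples later

# A 10-second silent signal stands in for real audio:
# 10 s x 16000 Hz = 160,000 samples; at 16 bits (2 bytes) per sample,
# the raw size is 320,000 bytes (about 312 KB).
signal = np.zeros(10 * SAMPLE_RATE, dtype=np.int16)

def frame_signal(x, frame_len, hop_len):
    """Split a 1-D signal into overlapping frames (any trailing partial frame is dropped)."""
    n_frames = 1 + (len(x) - frame_len) // hop_len
    return np.stack([x[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

frames = frame_signal(signal, FRAME_LEN, HOP_LEN)
print(frames.shape)  # (1249, 256): frame 0 covers samples 0-255, frame 1 starts at sample 128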
SR systems extract various types of features from the frames. Commonly used features are the power spectrum, Mel-frequency cepstral coefficients (MFCCs), and delta features. In continuous speech recognition systems, the features extracted from the frames of the training audio files form the basis for constructing acoustic models. To perform recognition, a query audio file is segmented into frames, and its features are matched against the acoustic model to obtain the estimated sequence of linguistic units.
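One common way to compute such features in Python is with the librosa library, an assumed dependency here since the thesis does not name a feature-extraction toolkit; the file name below is a placeholder. The sketch computes frame-wise MFCCs (which librosa derives from a Mel-scaled power spectrum) and their delta features for a hypothetical recording.

import numpy as np
import librosa  # assumed third-party library, not specified by the thesis

# Load a mono audio file resampled to 16 kHz (path is a placeholder).
signal, sr = librosa.load("dictation.wav", sr=16000, mono=True)

# MFCCs: 13 coefficients per frame, using a 25 ms window (400 samples)
# and a 10 ms hop (160 samples).
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)

# Delta features capture how the MFCCs change between neighbouring frames.
delta = librosa.feature.delta(mfcc)

# Stack static and delta features into one matrix of shape (26, num_frames).
features = np.vstack([mfcc, delta])
print(features.shape)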
Since their inception, SR systems have used statistical acoustic models. These systems use Gaussian Mixture Models (GMMs) and Hidden Markov Models (HMMs) to recognize sequences of phonemes. GMMs detect the phonemes within each frame, and HMMs estimate the likelihood of the detected phonemes given the previously detected sequence. Statistical acoustic models then use expectation-maximization algorithms to obtain the optimal phoneme sequence from the audio signals.
Current advancements in machine learning have enabled researchers to develop end-to-end acoustic models that eliminate the need for the domain expertise required to train statistical models. End-to-end models take advantage of powerful machine learning techniques, such as Artificial Neural Networks (ANNs) and Deep Neural Networks (DNNs), to develop acoustic models. These models work on the same inputs, i.e., the acoustic features of audio signals; however, a number of current approaches produce word sequences more directly, as opposed to traditional models that produce phonemes. In such cases, characters are used as the output sequences.
Since the ultimate goal of SR is to obtain the optimal word sequence, SR systems take the output of the acoustic model, either a phoneme sequence from a statistical model or a character sequence from a machine learning model, and pass it through a decoder to arrive at the optimal word sequence. In the decoding stage, the output sequences are searched in lexicon dictionaries for matching words. In this step, most SR systems also exploit language models to refine the output of the acoustic models. Language models are detailed in the next subsection.
2.4.2 Language Model
The language model maps the relationships between the words of a given language. In speech recognition, it provides contextual information about words by assigning a probability distribution over the trained word sequences [24]. It is usually trained on large samples of text from a language. Language models facilitate SR systems in detecting connections between the words in a sentence with the help of a pronunciation dictionary [23]. Language models also introduce domain-specific vocabulary to the SR paradigm. They refine the output of SR systems; however, they are not always required for the recognition task. When language models are used in SR systems, their calculated likelihood probabilities are merged with the decoded word-sequence probabilities from the acoustic models, and the combined probabilities are used to obtain the optimal word sequence.
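To illustrate how these probabilities can be merged, the short Python sketch below is a simplified, hypothetical example (real decoders search over lattices or beams rather than a fixed list, and all scores here are made up): it scores a few candidate word sequences with a log-linear combination of acoustic and language model log-probabilities and selects the best one, in the spirit of Equation 1.

# Hypothetical candidate transcriptions with made-up log-probabilities.
# "acoustic" stands for log P(X | W) from the acoustic model and
# "language" for log P(W) from the language model.
candidates = {
    "patient presents with fever": {"acoustic": -41.2, "language": -12.5},
    "patient presence with fever": {"acoustic": -40.8, "language": -18.9},
    "patient presents with favor": {"acoustic": -42.0, "language": -16.1},
}

LM_WEIGHT = 1.5  # relative weight of the language model, tuned on held-out data

def combined_score(scores):
    """Log-linear combination of acoustic and language model scores."""
    return scores["acoustic"] + LM_WEIGHT * scores["language"]

best = max(candidates, key=lambda w: combined_score(candidates[w]))
print(best)  # -> "patient presents with fever"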
The most common method to construct language models is n-gram language modeling, where n is the order of the language model. An nth-order language model calculates the counts of 1-grams, 2-grams, ..., n-grams in a training corpus. These n-gram occurrence counts are used to calculate the likelihood of a word W given the sequence of n-1 preceding words [23].

Equation 2

$$P(W_n \mid W_1, W_2, W_3, \ldots, W_{n-1}) = \frac{C(W_1, W_2, W_3, \ldots, W_n)}{C(W_1, W_2, \ldots, W_{n-1})}$$

In Equation 2, $P(W_n \mid W_1, W_2, W_3, \ldots, W_{n-1})$ denotes the probability of the word $W_n$ given the sequence $W_1, W_2, W_3, \ldots, W_{n-1}$. $C(W_1, W_2, W_3, \ldots, W_n)$ is the count of that n-gram within the corpus, and $C(W_1, W_2, \ldots, W_{n-1})$ is the count of the corresponding (n-1)-gram.
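As a concrete illustration of Equation 2, the following Python sketch (a toy example, not taken from the thesis) builds bigram counts from a tiny corpus and estimates P(word | previous word) as the ratio of the bigram count to the count of its one-word history.

from collections import Counter

corpus = [
    "the patient has a fever",
    "the patient has a cough",
    "the patient is stable",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def bigram_prob(prev, word):
    """P(word | prev) = C(prev, word) / C(prev), i.e. Equation 2 with n = 2."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("patient", "has"))  # 2/3: "patient" occurs 3 times, "patient has" twice
print(bigram_prob("has", "a"))        # 2/2 = 1.0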
When dealing with probabilities, n-gram based language models are prone to problems such as the zero-probability problem and the out-of-vocabulary problem. The zero-probability problem arises when an n-gram is not present in the training corpus. For these unseen cases, language modeling techniques offer solutions such as smoothing and discounting. In smoothing techniques, the n-gram counts are adjusted to give some weight to unseen events; add-one smoothing and add-k smoothing are examples of smoothing techniques. In discounting techniques, if an n-gram is unseen, the counts of the corresponding (n-1)-gram are used with some discount. This strategy is also known as backing off.
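A minimal sketch of add-one (Laplace) smoothing over toy bigram counts follows; the counts below are assumed for illustration only, and V denotes the vocabulary size. Unseen bigrams now receive a small non-zero probability instead of zero.

from collections import Counter

# Toy counts of the kind produced in the previous sketch (assumed for illustration).
unigrams = Counter({"the": 3, "patient": 3, "has": 2, "a": 2, "fever": 1,
                    "cough": 1, "is": 1, "stable": 1})
bigrams = Counter({("the", "patient"): 3, ("patient", "has"): 2, ("has", "a"): 2,
                   ("a", "fever"): 1, ("a", "cough"): 1,
                   ("patient", "is"): 1, ("is", "stable"): 1})
V = len(unigrams)  # vocabulary size (8 words here)

def bigram_prob_add_one(prev, word):
    """Add-one (Laplace) smoothing: (C(prev, word) + 1) / (C(prev) + V)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(bigram_prob_add_one("patient", "has"))   # seen bigram: (2 + 1) / (3 + 8)
print(bigram_prob_add_one("patient", "cough")) # unseen bigram: (0 + 1) / (3 + 8), non-zero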
The out of vocabulary problem arises when a language model does not have a word in its
vocabulary. That means not even 1-gram is available for that word. Therefore, solutions to
zero probability problems do not work here. A common strategy to tackle the out of
vocabulary problem is to use pseudowords such as unknown <UNK>, start <S> and stop
</S>. The pseudoword <UNK> is appended with all seen n-1 grams to form a new n-gram,
whereas, the pseudowords <S> and </S> are appended in the start and end of each sentence.
They are then treated as normal words while calculating probabilities and training n-gram
language models.
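The sketch below illustrates this preprocessing step on a toy corpus (illustrative only): words seen fewer than a chosen threshold are mapped to <UNK>, and each sentence is padded with <S> and </S> before the n-gram counts are collected.

from collections import Counter

sentences = [
    "the patient has a fever",
    "the patient has a cough",
]

# Words seen fewer than MIN_COUNT times are replaced with <UNK>,
# so the model can assign probability to unseen words at test time.
MIN_COUNT = 2
word_counts = Counter(w for s in sentences for w in s.split())
vocab = {w for w, c in word_counts.items() if c >= MIN_COUNT}

def preprocess(sentence):
    """Map out-of-vocabulary words to <UNK> and add sentence boundary markers."""
    tokens = [w if w in vocab else "<UNK>" for w in sentence.split()]
    return ["<S>"] + tokens + ["</S>"]

for s in sentences:
    print(preprocess(s))
# ['<S>', 'the', 'patient', 'has', 'a', '<UNK>', '</S>'] ...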
2.4.3 Evaluation Metric
SR systems are evaluated by investigating the differences between their output text and the ground-truth text. If the SR system works as expected, the output text matches the ground truth. Where there are differences, they highlight the system's shortcomings and provide a basis for evaluating the system. The most commonly adopted evaluation metric for SR is the Word Error Rate (WER) [25].
• WORD ERROR RATE
Word Error Rate (WER) is an evaluation metric that reflects the ratio of mistakes. WER returns a non-negative value, where 0 means that everything was recognized perfectly and higher values indicate more mistakes; because insertions are counted, the value can exceed 1. Mistakes are counted as the insertions, deletions, and substitutions needed to align the output with the ground truth. When comparing the output text with the ground truth, a word that appears in the output but is not expected by the ground truth counts as an insertion, and the insertion count is incremented. A word that the ground truth expects but that is missing from the output counts as a deletion, and the deletion count is incremented. If, for a word in the output sequence, the ground truth expects a different word, it counts as a substitution, and the substitution count is incremented. After the comparison, WER sums all mistakes and divides them by the word count of the ground truth to obtain the error rate. WER is commonly expressed as a percentage [25], obtained by multiplying the rate by 100.
$$WER = \frac{I + D + S}{W}$$

where $I$ = number of insertions, $D$ = number of deletions, $S$ = number of substitutions, and $W$ = number of words in the ground truth.
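As a minimal sketch (not the evaluation scripts used in this thesis), WER can be computed with a word-level edit distance between the ground truth and the SR output; the total distance equals I + D + S:

```python
def word_error_rate(reference, hypothesis):
    """WER = (insertions + deletions + substitutions) / words in ground truth."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words (Levenshtein distance over words).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("patient denies fever and chills",
                      "patient denies a fever chills"))  # 2 edits / 5 words = 0.4
```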
2.5 Speech Recognition in Clinical Documentation
To overcome the challenges of clinical documentation, researchers have looked towards SR technology since its inception [22]. In theory, the idea of using speech to create clinical documentation is quite promising. However, multi-factored evaluations show that it is not simple to adopt. Changing the mode of report generation workflows can open a plethora of issues that come with the new mode of input. The introduction of the EHR reformed the process of clinical documentation when it offered typing as the form of input; however, this input method added to the already existing issues [26]. With EHRs, providers spend roughly twice as much time creating documents as they spend on care delivery [4]. Many healthcare providers are not proficient at operating computer systems, so they still prefer the classical methods of documentation. Since the EHR offers much more than a change of input mode, and it is not possible to use handwriting to input notes within an EHR, many studies have tried to use SR technology as an alternative way to enter notes within the EHR. When compared to traditional dictation and transcription methods, speech-assisted EHRs significantly reduce turnaround times and costs [27]. Consequently, as SR technology improves, an increasing number of institutions are adopting SR-enabled systems to generate clinical reports [27]. However, speech-assisted EHRs put the responsibility for managing note input and report creation upon healthcare providers, which limits provider adoption of this technology. Therefore, in this section, we investigate various aspects of SR adoption in documentation workflows and analyze the limitations and shortcomings of SR in the clinical documentation process.
2.5.1 Review Objective
The purpose of this review is to investigate prior efforts to use SR to overcome the challenges and problems of documentation. Therefore, we formed three main objectives.
1) To explore previous attempts at using speech as a tool for clinical documentation.
2) To scrutinize the methodology and evaluations of the explored attempts.
3) To identify the factors that limit the intended outcomes of the explored attempts.
2.5.2 Literature Exploration
This thesis deals with a problem in the medical domain; therefore, we primarily used
PubMed [28] as the source to look for literature. We identified a set of keywords that were
combined using ‘AND’ and ‘OR’ operators to retrieve the best matching articles. Figure
2.2 shows the query tree that was used to search PubMed to retrieve the relevant studies.
Figure 2.2 Query Tree for Literature search
2.5.3 Analysis
Healthcare providers use dictation in traditional documentation workflows; however, they are now required to use an EHR for report generation, and speech-enabled EHRs can add further to the challenges of the documentation process. Therefore, in this analysis, we go through the challenges of the clinical documentation process and analyze the impact of SR on each of them.
• ACCURACY AND COMPLETENESS
Accuracy and completeness reflect the overall quality of the report; therefore, studies
usually address these two challenges together. When providers type their reports into the EHR, they usually make 0.4 mistakes for every 100 words (a 0.4% error rate) in the back-end workflow [29]. Zhou conducted a study to test SR on the same workflow and noted a 7.8% increase in errors [29]. However, he added that reports managed to reach the same level of quality after a final review by the provider.
Studies by Hodgson and Goss address the use of SR in the front-end workflow. They show results similar to Zhou's, with report quality declining significantly [27], [30].
However, in a study where experiments were conducted in an ideal environment, free of real-life constraints and stress, 81% of providers reported an improvement in the overall quality of reports with the use of SR-enabled tools [26].
• TURNAROUND TIME AND PROVIDER THROUGHPUT
Providers spend most of their time generating reports [29]; this means that an efficient documentation process can reduce turnaround times and boost provider throughput. Studies report that most providers are convinced that SR improves efficiency and is easy to use [26], [27]. In back-end workflows, SR has been shown to reduce turnaround times and increase productivity [29]. However, when SR is used in front-end workflows, no significant time difference has been reported [30].
• COST
In the clinical documentation process, cost is the least-studied aspect in the academic domain. One reason could be that costs are primarily a concern of departments that are not connected with academics. In any case, cost is a decisive variable in the adoption of any solution, and it also plays a vital role in the adoption of SR in the clinical documentation process. These days, more and more institutions are leaning towards the adoption of SR-enabled solutions since they are thought to reduce costs [27]. A provider adoption survey from 2018 also concluded that SR-enabled systems can reduce monthly transcription costs by 81% [26]. Since we were not able to find any study reporting adverse findings regarding the cost of SR-enabled systems, there is no reason to doubt the above-stated opinions.
2.5.4 Findings
The approach of using SR for documentation is not new. Studies from 1981 [31] and 1987 [32] experimented with SR when this technology was still a developing concept and found that SR worsened the problem. SR technology has improved considerably since then, a trend reflected in studies done throughout the 1990s [33]–[39] and continuing through the 2000s [40]–[43]. A study done in 2010 [44] reported that about two-thirds of its participants felt that SR can improve report quality and reduce documentation times as well. Another study, from 2012 [45], supports the claim that SR now has the ability to reduce documentation times. Almost all of the recent studies [27], [46]–[49] affirm that SR can help overcome the problems of documentation. However, despite all this optimism, SR-enabled solutions still lack adoption among healthcare providers [26], [50]. Studies document four main reasons behind the low adoption of SR-enabled solutions.
• HIGH EXPECTATIONS FROM SR ENABLED SYSTEMS.
SR systems are expected to dramatically reduce mistakes and documentation times. However, current SR technology still struggles to maintain high accuracy due to real-world issues, such as noise and domain-specific vocabularies. Therefore, noise-robustness and domain-robustness are real challenges of SR that limit the confidence of healthcare providers in SR at this point.
• RESPONSIBILITY FOR CORRECTIONS.
SR systems make mistakes, and physicians are responsible for correcting them. Because of this, they sometimes spend even more time on corrections. When any machine-based system makes a mistake, humans are expected to correct it; however, the overall confidence in SR technology can be increased by building robust systems that do not make mistakes in the first place. The same challenges mentioned in the previous point apply here as well.
• CHANGE IN DICTATION STYLE.
SR is merely a tool for note entry in many reporting systems. Physicians are responsible for controlling the structure of reports, as they have to manually select the report section to which they wish to add notes. For example, for SOAP-structured reports, physicians have to select one SOAP category at a time to enter its content. This practice breaks their dictation pace because they have to tailor their dictation style to the structure of the report. Due to this hassle, physicians prefer recording conventional dictations, where they can record freestyle at the time of the encounter. This hassle also calls for autonomous features that are specialized for documentation tasks and can intelligently separate the content for the different sections of a clinical report from a freestyle recorded dictation.
• PHYSICIANS’ TRAINING
Physicians require training to use SR-enabled systems efficiently [51]. A 2010 study [44] reported positive responses when participants received adequate training, whereas participants who were not satisfied with the training reviewed SR-enabled systems negatively. Such training includes teaching physicians how to enable SR features within the system, along with other reporting features linked to SR.
2.6 Handling Noise in Speech Recognition
SR is an area of interest with more than three decades of active research [22]. In ideal environments, where noise and distortions do not interfere with speech signals, the latest developments have enabled SR systems to perform increasingly close to human speech recognition performance [52]–[54].
Noise refers to any element of the audio signal other than the speech signal [55]. In real-life environments filled with a high concentration of noise, SR systems lack the robustness to compete with humans [56]–[58]. This degradation is mainly due to the mismatch between the acoustic features of the input audio and those in the trained model [59]. Noise from the surroundings corrupts the acoustic features of speech with unwanted additive components. Traditionally, SR-enabled devices created an ideal environment using close-talking microphones and other acoustic adjustments. However, with the rise of large-scale hand-held mobile devices, it is no longer possible to provide SR with ideal speech inputs. It is inevitable for SR to work in challenging acoustic environments, and noise robustness is now a key challenge for SR to maintain its performance levels [60].
Noise robustness is an active research problem in the SR domain, and a large number of methods and techniques have been proposed to provide noise-robust solutions for SR. Several systematic reviews of noise-robust speech recognition [22], [23], [61] have presented the current advancements. In this review, we explain and analyze techniques for developing noise-robust systems for domain-specific environments. We are particularly interested in indoor environments with reverberant distortions and speech overlaps; nevertheless, this review is not limited to such environments. However, we only review techniques that deal with single-channel audio signals.
Noise-robustness techniques can be divided into two main groups: feature-space and
model-space [22]. Feature-space techniques consider the preprocessing of audio signals to
extract clean speech before passing it to SR components. Model-space techniques, on the
other hand, deal with the internal construction of noise-robust components. The subsequent
sections explain both groups further and provide an analysis of techniques within each
group.
2.6.1 Feature-space Techniques
SR works well with clean speech. A straightforward and classic solution is to process the
incoming audio to extract clean speech before passing it to the SR system. Feature-space
techniques apply this classic solution by using signal processing techniques to enhance the
acoustic features of speech. These techniques do not change the acoustic model or any
component within SR systems.
Enhancing acoustic features from noisy audio is more difficult for single-channel audio than for the multi-channel case [62], where techniques such as acoustic beamforming perform close to human transcription performance [63]. However, such accuracies are not seen with single-channel audio.
Spectral subtraction [64] and Wiener filtering [65] are classical techniques for filtering noise from audio signals. These methods do not depend upon training [22]; instead, they use statistical tools to estimate the noise spectrum and remove its contribution from the noisy speech spectrum. These techniques are acceptable when the noise is stationary throughout the recording, such as noise from wind, a fan, or anything else that emits a continuous stream of sound. However, their performance degrades in the presence of time-varying or convolutional noises such as reverberation and overlapping speech.
Without prior knowledge of noise or speech patterns, statistical models sometimes over-filter the speech signal, consequently clipping it and losing acoustic detail [66]. There are various learning-based techniques to tackle the problems of reverberation and speech overlap, such as Weighted Prediction Error (WPE) in the Short-Time Fourier Transform (STFT) domain [67], the de-noising auto-encoder [68], and a statistical-neural hybrid Cepstral Minimum Mean Squared Error–Deep Neural Network (CMMSE–DNN) based learning model [69]. However, these techniques shift away from the area of speech recognition and belong more to the area of signal processing, which is not the working domain of this thesis.
2.6.2 Model-space Techniques
Model-space techniques aim to develop noise-robust SR components, particularly acoustic
models. These techniques are linked directly with the objective function of acoustic
modeling to absorb the effects of noise and distortion [22]. These techniques have generally been shown to achieve higher accuracies than feature-space techniques; however, they use significantly more computational resources.
Noise adaptive training [70] and noise-aware training [61] are representative model-space techniques. In noise adaptive training, acoustic features infused with noise are relayed directly into the acoustic models, whereas in noise-aware training, an estimated noise model is generated to characterize the corruption in the training data [61], which then acts as a mask to filter noise from the test audio. Noise adaptive techniques report high performance gains [71]–[73]. Acoustic models trained using an ANN-based noise adaptive framework were exposed to the same levels of noise expected during the operational life of the SR system and showed performance boosts as high as 20% [74]. Seltzer [61] shows the use of a DNN for noise-aware training with a relative performance boost of 7.5%. Noise-based acoustic training works well on single-channel audio and captures most of the effects of additive noise. However, models trained using these techniques do not generalize [68] to environments whose noise signatures differ from those on which the models were trained.
The development of current end-to-end SR systems has defined a new state-of-the-art [75]–
[78]. These systems are built on the underlying principles of model-space techniques with the goal of tackling the noise of real environments. They train their acoustic models in a data-driven way without relying on any expert domain knowledge [79], such as noise estimates and phonetic dictionaries. Their end-to-end nature allows them to train their models directly on paired data, where each training audio clip is paired with its corresponding text. However, to achieve higher accuracies, they require vast amounts of training data.
2.6.3 Summary
The problem of noise-robustness in SR is addressed using two types of approaches. The first approach presents techniques that treat noisy audio before feeding it to SR systems. These techniques span unsupervised as well as supervised learning. However, they shift the problem to the signal processing domain, where the desired signal is enhanced and undesired signals are suppressed; variations of this problem include de-noising and noise removal. The second approach seeks the construction of robust SR components, particularly the acoustic model. Our review shows that noise adaptive training is currently the top-performing technique, on which various end-to-end SR systems are built. The review has also shown that end-to-end systems set the current state-of-the-art in SR technology.
2.7 Domain-Specific Speech Recognition
In speech recognition, the domain entails the combination of the acoustic domain and the language domain. The acoustic domain includes speakers, audio channels, and environmental noise. Current state-of-the-art end-to-end SR systems have made considerable progress in robustness over acoustic domains, though domain robustness is still a challenging problem [80]. The problem remains primarily because of domain-specific language, which includes specific words and their relations. Handling the rules of language is the integral responsibility of the language model within SR systems. However, this problem broadens when there is a shortage of domain-specific data and language modeling techniques struggle to train robust language models. Therefore, in this review, we analyze two major techniques for developing domain-specific language models when domain-relevant data are scarce.
2.7.1 Out-of-Domain Model Adaptation
Model adaptation refers to the use of language models trained on one language domain in a different target domain where there is little or no training data available. Various scenarios are linked to model adaptation, which we review one by one. A typical scenario is when the available language model comes from the general-purpose domain and requires adaptation to a specific target domain. In such cases, a direct adaptation of the out-of-domain language model can produce domain-agnostic behavior; therefore, it is better to perform some fine-tuning operations. Another scenario is when the source language model comes from a domain-specific environment that is different from the domain of the target environment. Iyer [81] experimented with multiple out-of-domain models that were combined to form a single model that improved performance on a general task. You [80] shows a 10.4% relative improvement in performance by using an all-rounder approach in which multiple domain-specific models were used to train a single language model applied to a general-purpose target.
Adapted models are combined with domain-specific models by either pooling or interpolation [82]. The pooling technique creates one dataset from multiple sources and then uses that dataset to build one combined model [83], whereas, in interpolation, multiple datasets train multiple language models, which are then combined using interpolation weights.
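As an illustration of the interpolation case (the weight below is a free parameter, typically tuned on held-out domain text rather than taken from [82]), two component models can be combined as:

$$P_{\text{interpolated}}(w \mid h) = \lambda \, P_{\text{in-domain}}(w \mid h) + (1 - \lambda) \, P_{\text{out-of-domain}}(w \mid h), \qquad 0 \le \lambda \le 1$$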
2.7.2 Domain Relevant Data Generation
Language models work well when sufficient domain-specific training data is available. Domain-relevant data generation techniques use various strategies to produce such domain-specific training data when there is a scarcity problem. There are three main types of strategies for data generation: conversion, extraction, and augmentation.
Data conversion strategies seek data sources that hold domain-relevant data in a different format or language and then apply techniques to convert such data into the desired format or language. As an example of data conversion, Horia [84] proposes a technique that exploits machine translation to convert domain-relevant data that exists in a different language into the language of the target domain.
Data extraction strategies are similar to conversion in that they explore domain-relevant data on other platforms; instead of converting it, however, they focus on extracting such data from the source platform to the target platform. Abhinav [85] has shown a relative improvement of 6% by using a data extraction technique to extract domain-relevant data from the web and train language models from the extracted data.
Data augmentation strategies offer a different way to generate domain-relevant data. They focus on augmenting the information available within existing data to generate new data points. Enhancing language models by augmenting domain-specific vocabulary and synonyms has shown a 5.41% absolute reduction in recognition error in the noisy environment of a tennis championship, which contains a wide variety of noises such as emotional outbursts, game noise, and crowd noise [86]. Another study shows considerable progress by using models created from web-extracted and augmented data combined through pooling [83].
2.7.3 Summary
Domain robustness is a challenge in speech recognition that arises from the combination of acoustic and language domains. Approaches to noise-robustness, particularly those of end-to-end SR systems, considerably address the challenges of the acoustic domain. However, language models still require attention to achieve robustness. We review two main techniques that deal with the robustness of language models. The first seeks adaptation of models from different domains to perform tasks on the target domain; these techniques complement the adaptation process with domain-specific fine-tuning and model enhancements to achieve higher performance. The second focuses on the generation of domain-relevant data for language models; it exploits various strategies, such as data conversion, extraction, and augmentation, to artificially generate domain-relevant data that can then be used to build domain-specific language models.
2.8 Speech Recognition Systems
This section provides an overview of multiple SR systems and toolkits. In this review, we
have a special focus on the state-of-the-art end-to-end open-source offline systems. In SR,
the term “end-to-end” refers to those systems that recognize text from the given speech
without the use of any phonetic representation, which used to be the core of traditional SR
systems [79]. Traditional speech recognition systems used to depend on significant human
expertise for the development of such representations. End-to-end speech recognition is a
recent development that was achieved due to the advancements in Deep Neural Network
(DNN). It was first presented in 2014 [87], reporting state-of-the-art accuracy on
the Wall Street Journal corpus. Later, a number of researchers expanded the end-to-end
concept and reported significant performance improvements.
The following SR systems and toolkits are reviewed.
1. Deep Speech
2. ESPnet
3. Wav2Letter++
4. CMUSphinx
5. Kaldi
6. Julius
7. Google Speech-to-Text
2.8.1 Deep Speech
Deep speech is a DNN-based end-to-end acoustic modeling technique that uses a Recurrent Neural Network (RNN) to construct a robust SR system. It is built with a focus on noise-robustness, enabling the RNN to consume thousands of hours of unaligned, transcribed audio to train its models. It performs character-level recognition. It was presented in 2014, claiming better performance than other commercial SR systems on both clean and noisy datasets [77]. It outperformed previously established SR systems by achieving 16% WER on the Switchboard corpus [88]. Later, in 2015, deep speech presented state-of-the-art performance in two vastly different languages, English and Mandarin, and showed that it can be adapted to perform recognition tasks in any other language [76].
Deep speech models are composed of 6 layers (an input layer plus 5 layers) (Figure 2.3), where the second-last layer is a bi-directional recurrent layer [89]. The output layer in deep speech predicts the character probabilities for the input audio sequence. The structure of deep speech is straightforward and uncomplicated compared to other RNN-based models [87]. Therefore, deep speech depends upon extensive training data to achieve high performance standards. However, due to its simple structure, the RNN model is prone to overfitting; thus, deep speech applies a dropout rate of between 5 and 10% to its layers, along with techniques to synthesize training data with artificial noise.
Figure 2.3 Structure of Deep Speech Model [77]
One advantage of deep speech is that it provides techniques to accelerate the training process by using parallelism to enable execution on GPUs. The first technique trains the model on many input examples in parallel. Secondly, it provides techniques to parallelize the overall model training process. Since recurrent layers have a sequential nature, they are challenging to parallelize; therefore, deep speech focuses on optimizing the execution of all other layers.
In November 2017, the Mozilla research lab's project DeepSpeech implemented the techniques described above and released the first stable version of an SR system in the open-source domain. Under the hood, DeepSpeech uses the TensorFlow [90] neural network toolkit. Training a deep speech acoustic model is a compute-expensive operation, as it requires a vast amount of data to learn the domain characteristics efficiently; however, the system can run inference in real time on a reasonably sized GPU thanks to the native GPU support that TensorFlow provides. To refine its output, DeepSpeech uses an n-gram based language model. Language model generation is not handled directly by DeepSpeech; rather, it depends on the implementation provided by the KenLM [91] language modeling toolkit.
2.8.2 ESPnet
ESPnet is a recent addition to the end-to-end SR systems, presented in 2018 [75]. It is based on a hybrid CTC/attention approach: it makes use of two vastly different approaches, Connectionist Temporal Classification (CTC) and the attention mechanism, in a hybrid manner to train its acoustic models. Although it is a newer addition, it has presented performance comparable to the state-of-the-art systems [75].
ESPnet uses an attention-based encoder-decoder network to perform recognition tasks. Attention-based networks do not require explicit acoustic and language models. However, ESPnet presents a mechanism to develop a hybrid network by fusing the scores from an attention decoder network with a CTC decoder. Moreover, it also provides a mechanism to make use of explicit language models. The final proposed recipe of ESPnet works in 6 (1+5) stages (Figure 2.4). Stage 0 prepares the audio data using data preparation scripts from the Kaldi toolkit [92], and Stage 1 again uses Kaldi to extract features from the training data. Stage 2 performs data preparation again, although this time the preparation is for training encoders and explicit language models. Stage 3 trains explicit language models, Stage 4 trains SR encoders, and, finally, Stage 5 performs recognition.
Figure 2.4 Processing stages in ESPnet recipe [75]
The first stable version of ESPnet was released in March 2019. It was built on PyTorch
[93], which is a neural network toolkit mainly for research purposes. Therefore, it can have
some issues in the case of scaling applications. Like deep speech, it is also dependent upon
huge datasets for efficient operation.
2.8.3 Wav2Letter++
Wav2Letter++ is the most recently presented end-to-end SR toolkit, with its first stable version released in early 2019 [94]. It is developed entirely in C++ by the Facebook research lab to maximize performance and speed. It reports the lowest error rates on the same datasets on which deep speech and ESPnet report their performance numbers. As it is based on the end-to-end concept, it also depends upon massive training data to train efficient acoustic models. However, as of October 2019, it did not offer any pre-trained models with the system.
Wav2Letter++ provides a platform to test various DNN-based SR recipes. Recipes in Wav2Letter++ consist of DNN architectures for acoustic modeling and decoding information for language models. Currently, the official repository of Wav2Letter++ provides recipes that use Convolutional Neural Network (CNN-DNN) based architectures [95], [96]. When using the architecture from deep speech, Wav2Letter++ reported 5% WER on the LibriSpeech clean dataset [94].
2.8.4 CMUSphinx
CMUSphinx is not an end-to-end system, as it is built upon traditional speech recognition pipelines. It was developed by Carnegie Mellon University (CMU) and was presented in 2004 [97]. We reviewed this system since it provides the most recent version of the traditional HMM-based acoustic model, along with a considerable community presence. Even recent studies compare it with leading commercial cloud-based SR systems [98]. However, with the recent advancements of the end-to-end SR concept, the popularity of CMUSphinx is fading.
2.8.5 Kaldi
Kaldi is a speech recognition toolkit, rather than a standalone system, that was presented in early 2011 for research purposes [92]. The main focus of Kaldi is acoustic model research. However, a number of recent end-to-end systems have borrowed various components from this toolkit [52], [75]. The Kaldi toolkit is built upon traditional HMM/GMM-based SR pipelines.
2.8.6 Julius
Julius was presented in 2009 as a large-vocabulary continuous speech recognition (LVCSR) system for both research and industrial applications [99]. It is also a traditional system using HMM-based acoustic models. Its main features are the ability to perform recognition in real time with low memory usage. However, it was last updated in 2014 and has not been updated since.
2.8.7 Google Speech-to-Text
Google speech-to-text [100] is a cloud-based speech recognition service that currently leads in real-life general-purpose speech recognition [98]. It offers services in 120 languages and supports real-time recognition. The cloud service is based on powerful deep neural network models, and it offers recognition via an easy-to-use API.
2.8.8 Summary
This review has covered six open-source and one cloud-based SR systems. Among the open-source systems, three are built upon the end-to-end concept, whereas the remaining three are built upon traditional SR pipelines. This review analyzes these systems on three factors: performance, ease of use, and scalability. We have shown that, in terms of performance and scalability, all end-to-end systems are cutting edge. However, training and using them in domain-specific environments is challenging due to their need for substantial training data. In terms of noise-robustness, the end-to-end systems are the best, along with the reviewed cloud-based system. Due to the dependence of traditional systems on accurate training sets, they usually do not catch up with end-to-end systems in real-world environments. When comparing the end-to-end systems with each other, DeepSpeech seems to be the most practical one to use at this point.
2.9 SOAP Classification
Classification is the procedure of assigning one or more classes to some entity. Sentence classification is the type of classification that falls under the umbrella of NLP and works on text sentences. It is commonly applied in both medical and non-medical domains. It stems from document classification, categorization, or segmentation, which work at the scope of whole documents. Some examples of classification in NLP are classifying newspaper articles by topic and identifying an email as spam. An example of sentence classification is the sentiment analysis of tweets.
This thesis uses sentence classification to assign SOAP categories to the transcription of provider dictations. In a clinical report, the subjective and objective sections bear most of the content, whereas the assessment and plan sections contain a markedly smaller number of sentences. This trend is evident in the dataset used in this thesis and is also affirmed by the class proportions reported by Mowery [101]. It creates an imbalanced dataset, which may lead to a biased analysis of performance. Another point to note while dealing with the SOAP format is that each category has a semantic definition, and a SOAP-formatted report expects its content to adhere to those definitions. In this regard, it is a good idea for a classifier to pick up those semantic patterns from the training data. The word sequence of a sentence plays a vital role in highlighting such patterns, which means that keeping the word sequence intact can positively impact classification performance.
SOAP classification has been attempted before by Mowery [101], where an SVM was used and showed positive gains; however, Mowery did not address the imbalanced nature of SOAP datasets and did not exploit word sequences in the solution. We were not able to find any other study on SOAP classification with which to compare the results of this study; Mowery also mentioned this unavailability. However, we reviewed other studies on similar problems in the medical domain. We adapted an exemplar-based concept detection algorithm defined by Juckett [102] in an effort to extract concepts from clinical text. This algorithm explicitly takes word sequences into account and aims to work efficiently on imbalanced datasets.
2.10 Conclusion
In this chapter, we define the clinical documentation process, along with its challenges,
modes, and workflows. We discuss the structure of SR systems along with some recent
developments that are happening in the SR research domain. A literature review presents
the use of SR in the healthcare domain with respect to the challenges of the documentation
process. We also present a review of seven top-of-the-line speech recognition systems. In the end, we review techniques to classify sentences into SOAP categories, where we present an exemplar-based concept detector algorithm with the idea of implementing it in a SOAP classifier.
Chapter 3. RESEARCH METHODOLOGY
3.1 Introduction
This chapter describes the methodology undertaken to address the research objectives of
this thesis. The process of research started after performing a systematic review of the
problem, which led us to define a solution approach that spans over two separate layers
and separate domains. The rationale for this multi-layered multi-domain approach is
provided in the introductory chapter (Section 1.1). This chapter now highlights all the steps
that are taken to form a conclusion within each layer. Figure 3.1 highlights the overall
methodology of this thesis.
Figure 3.1 Steps in Research Methodology spanning over two layers
3.2 Layer 1: Transcription of Physicians’ Audio Notes
The focus of this layer is to achieve higher recognition accuracy in noise-rich environments. It aims to apply the resulting solutions to transcribe physicians’ dictation audio files. This layer achieves its objective in four steps.
3.2.1 Selection of Methods
Noise-robustness and domain-robustness are two key challenges towards achieving the
first objective of this layer. Both challenges were reviewed in detail (Sections 2.6 and 2.7) to explore the top-of-the-line techniques. After a detailed review (Section 2.8), one Speech Recognition (SR) system was selected based on the ability to implement our explored techniques. The selection of the SR system and the robustness techniques went hand in hand to ensure compatibility between the selections.
• SPEECH RECOGNITION (SR) SYSTEM
We have selected project DeepSpeech [7] in our work. DeepSpeech is an open-source
speech recognition engine that is developed by Mozilla. Project DeepSpeech uses
acoustic models that are trained on the machine learning techniques of deep speech
[77] and language models that are trained using KenLM [91] language modeling
toolkit. When the DeepSpeech engine works with a deep speech acoustic model and a
KenLM language model, the whole combination becomes a complete speech
recognition system.
Name | Short Name | Description
Project DeepSpeech | DeepSpeech | Speech recognition engine; requires an acoustic model and a language model to perform recognition tasks.
Deep Speech Acoustic Model | Deep speech model, or acoustic model | An acoustic model developed using Deep Neural Network (DNN) based machine learning techniques.
KenLM Language Model | KenLM model, or language model | A language model trained using the KenLM toolkit.
Deep speech [77] also provides pre-trained acoustic and language models that have
learned from massive datasets. It also provides utilities to train custom models. We
selected release 0.5.1 for our experiments, which was the latest stable release at the time of this work.
• PRE-TRAINED MODELS
Pre-trained acoustic and language models provided by DeepSpeech are trained on over 3,000 hours of transcribed audio from the Fisher [103], LibriSpeech [104], and Switchboard [88] datasets. All datasets are in American English. Details of each dataset are given below.
Corpus Name | Corpus Size | Nature of Data | Source of Data
Fisher [103] | 2,000 hours | telephone conversations | 16,000 conversations
LibriSpeech [104] | 1,000 hours | audiobooks | audiobooks
Switchboard [88] | 250 hours | conversations | 2,500 conversations by 500 speakers
The pre-trained deep speech model was trained using the following hyperparameters.
Parameter | Value
Layer width | 2048
Dropout rate | 15%
Learning rate | 0.00001
Language model weight (CTC decoder) | 0.75
Word insertion weight (CTC decoder) | 1.85
The language model was trained up to 5-grams. The total size of the pre-trained language
model is over 4 gigabytes.
3.2.2 Preliminary Analysis
Cloud-based SR systems generally perform better than offline systems [105], but they do not offer the customization options that open-source offline SR systems offer. Moreover, in data-sensitive domains, such as healthcare, offline solutions are preferred due to privacy and related concerns. Therefore, in this work, we explore robustness techniques and apply them to an open-source offline SR system to maximize accuracy gains. However, we are interested in comparing the performance of the selected SR system with a top-of-the-line cloud SR system in both settings: before and after applying our solution. In this step, we experimented with DeepSpeech and Google Speech [100] using our full dataset.
3.2.3 Development of Solution
This step developed a solution to overcome the challenges of this layer. The solution is
motivated by the insights gathered from the preliminary analysis and is based on the
reviewed robustness techniques. It proposes methods to maximize accuracy gains using
both, acoustic and language models.
3.2.4 Evaluation of solution
SR systems are generally evaluated using the Word Error Rate (WER) [25], which gives equal weight to all mistakes. However, in a domain-specific environment, some mistakes can have more impact than others. For example, in the healthcare domain, a mistaken drug name or diagnosis is a critical hazard compared to most grammatical mistakes. Therefore, we developed a custom domain-specific evaluation metric, the Critical Error Rate (CER), to evaluate such errors. We evaluated our models on both of these evaluation metrics (WER and CER).
• CRITICAL ERROR RATE (CER)
Critical mistakes are defined as mistakes that involve medical concepts. CER counts the critical mistakes and takes their ratio over all critical concepts within the ground truth: it first counts all critical mistakes and then divides this count by the number of medical concepts in the ground truth. CER queries Metamap [106] for each word to determine whether it constitutes a critical mistake.
$$CER = \frac{C}{W}$$

where $C$ = number of critical mistakes and $W$ = number of critical concepts in the ground truth.
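A minimal sketch of this computation is shown below; the `is_medical_concept` predicate is a placeholder standing in for the Metamap lookup, and the word lists are hypothetical:

```python
def critical_error_rate(ground_truth_words, mistaken_words, is_medical_concept):
    """CER = critical mistakes / critical (medical) concepts in the ground truth."""
    critical_concepts = [w for w in ground_truth_words if is_medical_concept(w)]
    critical_mistakes = [w for w in mistaken_words if is_medical_concept(w)]
    if not critical_concepts:
        return 0.0
    return len(critical_mistakes) / len(critical_concepts)

# Toy illustration with a hard-coded concept list standing in for the Metamap query.
concepts = {"ibuprofen", "tracheal", "palpation"}
ground_truth = "given ibuprofen after tracheal palpation".split()
mistakes = ["ibuprofen", "tracheal"]  # words the SR system got wrong
print(critical_error_rate(ground_truth, mistakes, lambda w: w in concepts))  # 2 / 3
```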
3.3 Layer 2: Autonomous Generation of Clinical Report
A SOAP formatted clinical report contains sections representing all four SOAP categories;
therefore, the focus of this layer is to develop a classifier to categorize sentences from the
transcription of physicians’ notes into one of the SOAP categories. We achieved the
objectives for this layer in four steps.
3.3.1 Data Labeling
In the first step, we labeled the dataset for the classification task.
3.3.2 Selection of Method
In this step, an exemplar-based algorithm was selected, which is proposed by Juckett [102]
for concept detection within clinical texts.
3.3.3 Development of Solution
The selected algorithm detects concepts at the word level, whereas we aim to work at the
sentence level. Thus, in this step, we extended the selected algorithm to give a single
confidence score for each class for the given sentence. We also identified four key areas to
investigate and develop the solution.
3.3.4 Evaluation of Solution
In this step, we evaluated our solution based on the four independent variables that were
formed from the identified areas. For the evaluations, we used Average Precision (AP) as
the primary measure of performance. We did not evaluate the area under the ROC curves
due to the class imbalance in the dataset.
3.4 Dataset
To achieve the objectives of this research, the IWK health center in Halifax, Nova Scotia, Canada, provided a set of 105 physician-dictated notes, which were then processed into a structured dataset.
3.4.1 Data Collection
Audio clips of clinical dictations were provided by the Pediatric Emergency Department
of IWK health center. 20 out of 105 dictation audio clips contained a complete dictated
note, whereas the rest of the dictations were chunks of incomplete notes. Only 16 dictations
were coupled with gold standard text. In total, the dictations accumulate to 2 hours, 28 minutes, and 14 seconds of audio data. Out of all dictations, 100 were recorded by one person in different areas of the hospital, whereas the remaining five dictations were recorded by different persons.
3.4.2 Dataset Preparation
The dataset was prepared in two steps. In the first step, gold standards were generated and validated. For all dictation clips where no gold standard was provided, a manual transcription process was completed by three fellow researchers from our lab; one of them holds a medical degree, which helped us maintain the exact vocabulary. Every researcher transcribed all dictation clips; afterward, all three variations were analyzed to generate the gold standards. After generating the missing gold standards, all 105 dictations were manually validated.
In the second step, the dataset was structured. All dictation and transcription files were renamed into a numbered sequence, with a prefix given to each file. The prefix A was given to all audio clips and their corresponding text files. A dictation and its respective gold standard were given the same filename with different extensions (.wav for audio clips, .txt for text files), e.g., A001.wav and A001.txt.
After setting up all the files for the dataset, we segmented our dataset into two parts. The
first part included 100 dictations that were recorded by one person. We called this the original dataset and used it for experiments and analysis. The remaining five dictations form a validation dataset that we used exclusively to validate the final results of our experiments. We did not use the validation dataset while defining, experimenting with, or refining our methods.
3.5 Research Environment
This research is performed on two computer systems that were used in parallel. Table 3.1
provides a summary of both systems. System 1 was primarily used for review, documentation, and light scripting work where the Windows operating system was required. For all compute-intensive tasks, such as model training and fine-tuning, System 2 was used, which was equipped with a GPU.
Table 3.1 Summary of Research Environment
Specification | System 1 | System 2
Processor | Intel Core i7 | Intel Core i7
RAM | 16GB | 20GB
GPU | - | NVIDIA GeForce GTX 960 (2GB VRAM)
Storage | HDD | SSD
Operating System | Windows 10 Pro | Ubuntu 18.04 LTS
Metamap [106] was used to extract medical concepts from texts. In this thesis, the vocabulary was limited to MeSH (Medical Subject Headings) [107]. A Metamap server was initialized on System 2, where it was readily available over the network. We were more comfortable working with the Java implementation of the Metamap client; hence, all scripts that required interaction with Metamap were written in Java.
We used Python to experiment with DeepSpeech. We used System 2 for these experiments since DeepSpeech uses TensorFlow [90] libraries to run deep learning algorithms that facilitate execution on NVIDIA GPUs.
Cloud-based SR systems were tested on the Java platform. The exemplar-based NLP algorithm was also implemented in Java; its results were stored in files that were later analyzed by Python scripts using the scikit-learn library to calculate and plot the precision-recall graphs.
3.6 Conclusion
In this chapter, we present the steps we took in each layer of our solution to address both
of our objectives. We highlight the methods that are selected in each layer, along with the
reasons for such selections. We also mention details about the dataset and working
environment that we used throughout our research.
Chapter 4. SPEECH RECOGNITION – METHODS
4.1 Introduction
This chapter presents our methods to explore robust and domain-specific Speech
Recognition (SR) for scenarios when there is a shortage of domain-relevant data. In this
chapter, methods pertaining to both components, the acoustic model and the language model, are explored to achieve maximum recognition accuracy. Acoustic models are primarily
focused on achieving noise-robustness, as they are directly responsible for dealing with the
noise within the feature space. Language models, on the other hand, contribute more to
domain-robustness, as they learn from domain-specific language (vocabulary and speech
patterns).
4.2 Domain Adaptation
Domain adaptation is a sub-discipline of machine learning which “deals with scenarios in
which a model trained on a source distribution is used in the context of a different (but
related) target distribution” [108], [109]. In many realistic and low-resourced target
domains, where the collection of data is costly and sometimes impossible, training domain-
specific models becomes a challenge. In such scenarios, domain adaptation serves as a
highly practical option [110] to bootstrap the model development process [84].
Transfer learning and fine-tuning are two main methods of domain adaptation. In both
methods, models that are already trained on one problem domain are used as the starting
point to use in another problem domain. However, the difference lies in the way these
models are adapted. If the adapted model is shown some training data from a different
target domain to learn the domain specifics, this method is referred to as fine-tuning.
However, if some or all layers of the adapted model are taken to develop a new model, this
method is transfer learning.
4.3 Preliminary Analysis
In this section, we address some fundamental questions that require answers to drive this
research. DeepSpeech is a recognition engine that depends on acoustic and language
models to perform recognition tasks; however, we have not trained any domain-specific
models yet. Therefore, our first logical question is, ‘what will be the performance of
DeepSpeech on our dataset by simple domain adaptation of pre-trained (general purpose)
models?’; secondly, ‘what type of mistakes do pre-trained models do on a domain-specific
dataset?’; and thirdly, ‘how does this performance compare with top-of-the-line
commercial cloud-based SR systems like Google SR?’
Table 4.1 Preliminary Results: WER and CER of Cloud and Offline SR
System | CER | WER
DeepSpeech | 49.38% | 46.63%
Google | 20.18% | 21.69%
To answer these questions, all 100 audio clips from our original dataset were transcribed
using both SR systems/services (DeepSpeech and Google). Table 4.1 presents the error
rates for both systems. In general, we observed that DeepSpeech, with its pre-trained
models, made a mistake for almost every two words.
In comparison with the cloud-based SR (Google), DeepSpeech made 114.98% more general mistakes (WER) and 144.69% more critical mistakes (CER) in relative terms. In isolation, DeepSpeech made more critical mistakes than general mistakes, whereas Google was better at recognizing domain-critical vocabulary. These differences highlight the domain-agnostic behavior of the pre-trained models of DeepSpeech.
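For clarity, the relative differences quoted above follow directly from Table 4.1:

$$\frac{46.63\% - 21.69\%}{21.69\%} \approx 114.98\% \;\; \text{(WER)}, \qquad \frac{49.38\% - 20.18\%}{20.18\%} \approx 144.69\% \;\; \text{(CER)}$$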
Table 4.2 Example mistakes by DeepSpeech and Google during the preliminary experiments.
Ground Truth | DeepSpeech | Google
with the t max of | with a tax of | with a t maxx of
ibuprofen | age profane | ibuprofen
tracheal palpation | trail at a patient | I killed a palpation
The reason for the large number of critical mistakes by DeepSpeech is that its models are not trained on the vocabulary that is deemed critical in our domain. We present some examples of critical mistakes made by both systems/services in Table 4.2. From these examples, one can observe that DeepSpeech returned hypotheses that are most common in generally spoken language. For example, ‘t max’ was inferred as ‘tax’, and ‘tracheal’ was conceived as ‘trail’. This behavior reflects the dataset on which the pre-trained language model was trained, i.e., mostly conversational and general speech. Thus, we should investigate techniques to incorporate more domain-specific vocabulary within the language model.
Google was also unable to recognize many, if not most, of the critical concepts; however, we observed that such mistakes were fewer than those of DeepSpeech. Most of the time when DeepSpeech made a critical mistake, the audio happened to be clear. For example, in the case of ‘tracheal palpation’, Google recognized ‘palpation’ correctly while giving a completely out-of-context hypothesis for ‘tracheal’, but DeepSpeech did not recognize any part of that phrase, which is worse. We verified this specific instance manually and found that the distortion in this audio clip is below the average level of noise within our dataset. Therefore, for better inferences, there is also a need to explore methods to develop accurate acoustic models.
4.4 Acoustic Modeling
In preliminary experiments, the pre-trained acoustic model was not aware of the acoustic
environment of the target domain. Therefore, we explored methods to train domain-
relevant acoustic models. In this regard, we outlined two research questions. 1) what will
be the impact on performance if the pre-trained acoustic model incorporates the acoustic
environment of our target domain? and 2) how will DeepSpeech perform with a new
acoustic model that is trained solely on our dataset?
Our first question sought domain adaptation of the pre-trained model along with an
enhancement method to incorporate the target domain. There are two methods to do it in a
machine learning paradigm: transfer learning and fine-tuning. Both of these methods adapt
an existing model that is learned in a separate domain. These methods require less training
data from the target domain. In our case, the pre-trained acoustic model has already learned
from general conversations. Moreover, we do not require any change in the input or output
classes. Therefore, we considered using the fine-tuning method to incorporate our target
environment within the pre-trained model.
The second question requires the development of a new acoustic model using supervised
learning. DeepSpeech is an end-to-end system that relies on large datasets to generalize;
therefore, with our dataset of about two and a half hours of audio, we were not able to train
the model to the point of convergence. Even after training a new model for 500 epochs, we
did not observe any reduction in the error rates, whereas, in comparison, the pre-trained
model was trained for only 75 epochs. Hence, we did not pursue this question further.
4.4.1 Audio Pre-Processing
Audio clips in our dataset have an average length of 1:30 minutes, whereas the pre-trained model was trained on audio files of sentence length (4-5 seconds). The length of the audio has a direct correlation with the memory consumption while training deep speech acoustic models. Therefore, we needed to split the audio clips into shorter chunks to keep the memory allocation within the capacity of our hardware.
A series of considerations were acknowledged before splitting. Firstly, attention was given to the completeness of words in each split: any cut in the middle of the utterance of a word can make it impossible to recognize that word. Secondly, splits were made so as to keep the average length around the 5-second mark.
To break audio clips effectively while following these considerations, a silence-finding process was used. The idea is that if a cut is made on a silence that is long enough, the probability of breaking a word greatly diminishes. Audio clips were analyzed to find silences at various threshold values. A silence-finding process requires two thresholds: the maximum intensity level and the duration of silence. Both are defined below.
1) Maximum Intensity Level: This requires a value for the maximum sound intensity level below which everything is considered silence.
2) Duration of Silence: This requires a value denoting the time in milliseconds of continuous silence required to mark a valid silence period. A momentary drop of the sound intensity below the above-defined level does not necessarily indicate a true silence; it is quite possible that, while pronouncing a word, the sound intensity drops below that level for an instant. Hence, this threshold gives the option to select a duration such that only silences longer than the specified duration are marked as true silences.
For each audio clip, the silence-finding process was executed with varying threshold levels. After each execution, all splits were checked for word completion and average split length. If either consideration failed, the thresholds were readjusted and the process was re-executed. The whole process was performed manually for each audio clip to ensure that the speech content was preserved even after the splits.
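A minimal sketch of such a silence-based splitting step is shown below; it assumes the pydub library (the thesis does not name the tool actually used), and the threshold values are hypothetical starting points that would still be tuned per clip as described above:

```python
from pydub import AudioSegment
from pydub.silence import split_on_silence

def split_dictation(path, silence_thresh_dbfs=-40, min_silence_ms=500):
    """Split one dictation on sufficiently long silences."""
    audio = AudioSegment.from_wav(path)
    return split_on_silence(
        audio,
        min_silence_len=min_silence_ms,      # duration-of-silence threshold (ms)
        silence_thresh=silence_thresh_dbfs,  # maximum intensity level threshold (dBFS)
        keep_silence=100,                    # keep a short pad so words are not clipped
    )

# Example: split A001.wav and store the chunks as A001-01.wav, A001-02.wav, ...
chunks = split_dictation("A001.wav")
for i, chunk in enumerate(chunks, start=1):
    chunk.export(f"A001-{i:02d}.wav", format="wav")
```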
Once all the audio clips were broken into smaller splits, they were stored in a folder, with a sequence number appended to their filenames. For example, the first split of ‘A001.wav’ was given the filename ‘A001-01.wav’. All splits were then copied into a folder named ‘A001’. After storing all the splits, their corresponding gold standard transcriptions were also modified so that they contain the same number of lines as there are splits, where the nth line has the text for the nth split.
4.4.2 Fine-tuning
Project DeepSpeech provides checkpoints for its pre-trained deep speech model.
Checkpoints capture the exact value of all weights and parameters within a model and are
stored in a directory. The checkpointing directory within the project DeepSpeech contains
the latest values for which the deep speech model was exported. Therefore, we used the
provided checkpoints to perform fine-tuning.
Table 4.3 Hyperparameters used for acoustic model fine-tuning
Hyperparameter | Value
audio_sample_rate | 16000
train_files | Path/to/train.csv
alphabet_config_path | Path/to/alphabet.txt
export_dir | Path/to/export-directory
checkpoint_dir | Path/to/checkpointing-directory
train_batch_size | 5
n_hidden | 2048
epochs | 1
dropout_rate | 0.15
learning_rate | 0.0001
lm_alpha | 0.75
lm_beta | 1.85
The fine-tuning process is the same as model training in DeepSpeech. While training, if a
checkpointing directory is defined and the directory contains valid checkpoints, then
DeepSpeech starts fine-tuning the existing model; otherwise, it creates a new model. In our
scenario, we downloaded the checkpointing directory and started the training process from
that directory. Table 4.3 provides a list of hyperparameters that we used. Details about
these hyperparameters are as follows.
1. Training files.
The parameter train_files expects a path to a comma-separated values (CSV) file that contains three columns: wav_filename, wav_filesize, and transcript. The wav_filename column records the paths of the audio files we want to use for training the model, wav_filesize notes the size of each audio file in bytes, and transcript holds the ground truth for each audio file. Each row in the CSV represents one audio clip within the dataset; a minimal sketch of generating this file is shown after this list.
2. Alphabet file
The parameter alphabet_config_path expects a path to a text file that contains a list of
all distinct characters that are within the ground truth.
3. Export directory
The parameter export_dir is used to specify the directory in which we wish to export
the trained model. If this parameter is not supplied, then the trained model parameters
will be stored in the checkpointing directory only, without exporting the model for
inference use.
4. Batch size
The parameter train_batch_size sets the training batch size. The pre-trained deep speech model was trained with a batch size of 24 clips, when 12 gigabytes of GPU memory were available to the developers of the pre-trained model. With one-sixth of that memory available to us, and while using audio clips of similar lengths (after splitting), we faced memory exhaustion errors. Therefore, we explored the batch size iteratively, reducing it at each step, and were able to complete the training process successfully with a batch size of 5 audio clips.
5. Epochs
For our evaluations, we were interested in analyzing the error rates after each epoch of
fine-tuning. Therefore, we executed the training one epoch at a time and exported the
trained model after each epoch. After each export, the training was resumed from the
same checkpointing directory.
6. Other hyperparameters
Project DeepSpeech reports the hyperparameters that were used to train the pre-trained
model. All other hyperparameters were kept unchanged in this fine-tuning process.
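The preparation of the training CSV and the fine-tuning run can be sketched as follows. This is a minimal illustration, assuming the DeepSpeech.py training entry point of Project DeepSpeech, the flag names listed in Table 4.3, and a hypothetical transcripts mapping from split filenames to their gold standard lines; exact flags vary between DeepSpeech releases.

import csv
import subprocess
from pathlib import Path

def build_train_csv(splits_dir, transcripts, out_csv):
    """Write the three-column CSV (wav_filename, wav_filesize, transcript) that
    the train_files parameter expects, one row per audio split."""
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav in sorted(Path(splits_dir).rglob("*.wav")):
            writer.writerow([str(wav), wav.stat().st_size, transcripts[wav.name]])

# build_train_csv("splits", transcripts, "train.csv")

# One epoch of fine-tuning, resuming from the downloaded checkpoint directory;
# the model is exported after every epoch so each can be evaluated separately.
subprocess.run([
    "python", "DeepSpeech.py",
    "--audio_sample_rate", "16000",
    "--train_files", "train.csv",
    "--alphabet_config_path", "alphabet.txt",
    "--checkpoint_dir", "checkpoints/",
    "--export_dir", "export/",
    "--train_batch_size", "5",
    "--n_hidden", "2048",
    "--epochs", "1",
    "--dropout_rate", "0.15",
    "--learning_rate", "0.0001",
    "--lm_alpha", "0.75",
    "--lm_beta", "1.85",
], check=True)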
After the fine-tuning process, all generated models were evaluated; the details of these evaluations are provided in the next chapter.
4.5 Language Modeling
Language is a combination of words (vocabulary) and speech patterns (grammar). Specific
domains may have specialized languages; therefore, a domain-specific language model plays an important role in the domain robustness of any SR system. Ideally, language models
require training on vast volumes of domain-relevant data [111]. However, in realistic
situations, problems arise when there is little or no domain-specific data available [82]. We
also observed such problems in our preliminary analysis, when we used the pre-trained
model which is an out-of-domain model for our problem domain. Therefore, we explored
various methods to develop robust language models that are specific to our target domain.
The main concern is the insufficiency of domain-relevant training data. In the review, we
highlight two main techniques to develop domain-specific language models when there are
data scarcity problems. The first technique seeks domain adaptation of the out-of-domain
language model while enhancing the adapted model using domain-relevant data and using
various model combination methods. The second technique, on the other hand, focuses on
the generation of domain-relevant data. Therefore, in this work, we experimented with
methods from both techniques. We were also interested in the recognition performance without applying either technique; hence, we also developed standalone language models using only our original dataset. In summary, we worked on three key tasks to
investigate the impact of language modeling techniques.
1) Dataset Augmentation with Domain-Relevant Data.
2) Developing Domain-Specific Language Models.
3) Enhancing the Pre-Trained Language Model.
4.5.1 Dataset Augmentation with Domain Relevant Data
Our dataset includes clinical notes representing narrated cases of patients’ diagnosis and
treatment, along with rationales and reasons coming from the healthcare provider’s
perspective. These notes contain domain-specific concepts, for example, diseases,
diagnoses, and observations. However, with the limited number of notes, totalling 1,966 sentences, it is not possible to cover all the concepts within the domain. Therefore, to develop domain-specific and robust language models, we need to augment our dataset with additional domain-relevant information.
We reviewed three techniques for data augmentation: extraction, conversion, and generation. Extraction and conversion assume that domain-relevant data is available elsewhere, perhaps in a different format, language, or source. This assumption rarely holds for sensitive domains like healthcare, where such data is usually heavily protected and available only under scrutiny and restrictions. Therefore, the only practical option left is to use data generation techniques to simulate domain-relevant data.
Data generation techniques vary for each domain and require domain knowledge. In the NLP domain, synonym replacement [112] is a simple technique that is also the most natural choice for augmenting small datasets [113]. In its simplest form, synonym replacement generates new textual units by replacing words with their synonyms. Most implementations of this technique use a thesaurus such as WordNet [114] or thesaurus.com [115] to obtain the list of synonyms for replacement. Synonym replacement requires two decisions: which words in the text should be replaced, and which synonyms should be used for the replacement [113]. The words to replace are identified from the original dataset, whereas the synonyms are extracted from knowledge sources. Once both words and synonyms are identified, each word is permuted with its synonyms to generate new variations of the text.
In this work, we generated domain-relevant data based on the synonym replacement technique. To apply this technique in our domain, we treated as synonyms those medical concepts that belong to the same semantic class. We used MeSH [107], a thesaurus that lists medical concepts in a tree hierarchy, as the knowledge source to extract medical concepts within the healthcare domain. Previous work has used MeSH with the same technique for a data simplification task [116]. We used MetaMap [106], a concept recognizer that annotates medical concepts within texts, to identify medical concepts in our dataset. Our data generation method consists of three steps.
• STEP 1: PATTERN EXTRACTION
Each sentence within our dataset can have multiple medical concepts. We observe that
most of the time, these concepts semantically depend on each other, and a replacement can
semantically invalidate the sentence. As an example, in the sentence, “She was prescribed
a course of cephalexin to treat bacterial infection”, there are three medical concepts:
“prescribed”, “cephalexin” and “bacterial infection”. The concepts “cephalexin” and
“bacterial infection” belong to ‘drug’ and ‘disease’ classes respectively and are
semantically dependent upon each other. A replacement of the ‘disease’ concept “bacterial
infection” to another ‘disease’ concept “lung cancer” might semantically invalidate the
sentence. Therefore, we opted to extract patterns from the sentences so that a replacement does not invalidate the semantics. We define a pattern within a sentence as a span that contains exactly one medical concept together with the maximum number of neighboring words that are not medical concepts. In our example, the patterns are “She was prescribed a course of”, “a course of cephalexin to treat”, and “to treat bacterial infection”. Since our goal is to generate robust language models, we generated overlapping patterns to maximize the number of neighbors around the medical concepts.
Figure 4.1 Steps of Pattern Extraction
Pattern extraction starts by annotating a sentence using MetaMap, restricted to the MeSH vocabulary. These annotations mark the medical concepts within the sentence, and the number of annotations defines the number of patterns that can be extracted from the sentence. Each annotation is then used as the starting point of a new pattern: all the words to the left and right of that annotation that are not part of another annotation are added to the pattern. When the sentence boundary or the boundary of another annotation is reached, the pattern stops growing and is added to a global list of patterns. This step is repeated for all annotations to extract all possible patterns from the sentence, and the sentence-level work is repeated for all sentences in the dataset. Figure 4.1 illustrates the sentence-level pattern extraction along with an example.
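A minimal sketch of this rule is shown below, assuming the MetaMap annotations have already been reduced to (start token, end token, semantic class) triples over a whitespace-tokenized sentence; real MetaMap output must first be mapped onto token positions, and the class labels in the example are hypothetical except for ‘drug’ and ‘disease’.

def extract_patterns(tokens, annotations):
    """Return one pattern per annotation: the concept plus the maximal run of
    neighbouring tokens that do not belong to any other annotation."""
    covered = {}                             # token index -> index of its annotation
    for idx, (start, end, _cls) in enumerate(annotations):
        for t in range(start, end):
            covered[t] = idx

    patterns = []
    for idx, (start, end, cls) in enumerate(annotations):
        left = start
        while left > 0 and covered.get(left - 1, idx) == idx:
            left -= 1                        # grow left until another annotation or the boundary
        right = end
        while right < len(tokens) and covered.get(right, idx) == idx:
            right += 1                       # grow right likewise
        patterns.append({"text": " ".join(tokens[left:right]),
                         "concept": " ".join(tokens[start:end]),
                         "class": cls})
    return patterns

tokens = "She was prescribed a course of cephalexin to treat bacterial infection".split()
anns = [(2, 3, "other"), (6, 7, "drug"), (9, 11, "disease")]   # hypothetical spans
for p in extract_patterns(tokens, anns):
    print(p["text"])
# She was prescribed a course of
# a course of cephalexin to treat
# to treat bacterial infection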
• STEP 2: KNOWLEDGE EXTRACTION
In this step, we extract synonyms from the knowledge source. MeSH classifies each
medical concept into 16 root category classes. The concepts within these classes are
specific to the healthcare domain and are far from the general-purpose conversational
paradigm upon which the pre-trained models were trained. Even our dataset does not
account for all concepts from these classes. From preliminary experiments, we observed that most of the mistakes come from concepts in the ‘disease’ and ‘drug’ classes. Therefore, we decided to generate data covering only these two classes for now. In MeSH, ‘disease’ concepts are kept under the root node “Diseases[C]” and ‘drug’ concepts under the “Chemicals and Drugs[D]” node. We traversed all the child nodes under these two nodes and extracted all the concept terms for each child node.
Figure 4.2 Steps of Knowledge Extraction
In MeSH, each medical concept lists all of its candidate and qualifier terms, along with a separate list of preferred terms. Many of these terms are closely related to the concept but are not strictly synonymous with the terms in use by physicians. Moreover, the root nodes of MeSH include concepts from all domains of medical science, while for this work we are only interested in concepts applicable to pediatric emergency departments. As an example, the “Chemicals and Drugs[D]” node contains the concept “Propoxur”, an insecticide that is not related to pediatric care and is not applicable in our case. Therefore, such terms must be filtered out of our extracted lists to ensure that the remaining terms are semantically close to our working domain. To filter the lists, we added a validation step in which only those terms that are semantically close to our dataset are kept, and all other terms are dropped.
To validate the extracted concepts, we passed each concept term to MetaMap and examined the detected semantic type. If the semantic type matched one already present in our dataset and belonged to the same root node from which the term came, we considered the term valid. We collected all the valid terms for each of the two semantic classes. Figure 4.2 depicts the whole process of knowledge extraction. After extraction and validation, we obtained 24,071 concepts from the ‘disease’ class and 29,935 concepts from the ‘drug’ class.
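A rough sketch of the harvesting part of this step is shown below. It assumes the DescriptorRecord / TreeNumberList / TermList layout of the MeSH descriptor XML release (e.g. desc2020.xml) and leaves out the MetaMap-based validation of each term, which runs as a separate pass.

import xml.etree.ElementTree as ET

def harvest_mesh_terms(mesh_xml_path):
    """Collect candidate terms whose tree numbers fall under Diseases[C] or
    Chemicals and Drugs[D]; validation against MetaMap semantic types follows later."""
    diseases, drugs = set(), set()
    for _event, record in ET.iterparse(mesh_xml_path, events=("end",)):
        if record.tag != "DescriptorRecord":
            continue
        tree_numbers = [t.text for t in record.findall(".//TreeNumberList/TreeNumber")]
        terms = {s.text.lower() for s in record.findall(".//TermList/Term/String")}
        if any(t.startswith("C") for t in tree_numbers):
            diseases |= terms
        if any(t.startswith("D") for t in tree_numbers):
            drugs |= terms
        record.clear()                       # keep memory bounded on the large file
    return diseases, drugs

# diseases, drugs = harvest_mesh_terms("desc2020.xml")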
• STEP 3: AUGMENTATION
In this step, we generate simulated data using the patterns and medical concepts extracted in the previous steps and then combine the simulated data with our original text corpus. Our strategy is similar to synonym replacement, with the small difference that the concept list is selected depending on the class of the concept within each pattern; by construction, each pattern contains exactly one medical concept.
Figure 4.3 Steps to Generate Augmented Corpus
For each pattern in the patterns list, we check the class of its medical concept. If the concept belongs to the ‘drug’ class, we select the list of drugs for the augmentation; if it belongs to the ‘disease’ class, we select the list of diseases; patterns with concepts of any other class are not augmented. To augment a pattern with the selected list of concepts, each concept in the list replaces the existing concept in the pattern to form a simulated variation. Each new variation is then recorded in the list of simulated patterns (Figure 4.3). These simulated patterns are finally combined with the original text corpus to form an augmented corpus.
Our data augmentation task enabled us to work with two datasets: original and augmented. We use both datasets to evaluate the language modeling methods that we define in later tasks. Because the evaluations require data augmentation on multiple subsets of our original dataset, we developed a utility program in Java to automate the augmentation process. The utility takes the lists of concepts (drugs and diseases) and a text dataset, and outputs the augmented dataset.
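A simplified sketch of the augmentation itself is given below (the thesis implemented this as a Java utility; Python is used here only for consistency with the other sketches). It assumes the pattern dictionaries from the pattern-extraction sketch and the validated disease and drug lists from the knowledge-extraction step.

def augment_patterns(patterns, diseases, drugs):
    """Replace the single concept in each pattern with every concept of the same class."""
    simulated = []
    for p in patterns:
        if p["class"] == "disease":
            replacements = diseases
        elif p["class"] == "drug":
            replacements = drugs
        else:
            continue                          # other concept classes are not augmented
        for concept in replacements:
            simulated.append(p["text"].replace(p["concept"], concept))
    return simulated

# The augmented corpus is the original sentences plus these simulated variations.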
4.5.2 Developing Domain-Specific Language Models
In this task, we develop language models using only the domain-relevant data. For this purpose, we use both datasets, original and augmented; augmentation is done using the methods presented in the previous task. Language models are created using the KenLM toolkit.
KenLM provides methods to develop n-gram based language models in the ARPA [117] format. ARPA is a file format that lists all the calculated probabilities within an n-gram language model. ARPA formatted models are bulky and inefficient to use for inference; hence, for efficient execution, ARPA models must be converted into specific data structures [118]. KenLM provides functions to convert ARPA formatted models into PROBING and TRIE data structure based language models [91]. The PROBING data structure is faster but uses extensive amounts of memory for inference, whereas the TRIE data structure aims to lower memory consumption. As our compute environment has a limited amount of memory, we adopt the TRIE data structure for our language models. Project DeepSpeech also supports the TRIE structure by providing a native client that converts TRIE based language models into efficient binaries aimed at speedy execution.
In this task, we use four steps to develop language models. In the first step, we streamline the dataset by removing all leading and trailing whitespace from every sentence and converting the text to lower case. In the second step, we train a 5th-order ARPA language model on the streamlined dataset; we use a maximum n-gram order of 5 because the pre-trained model from DeepSpeech uses a maximum of 5-grams, so keeping the order the same allows valid comparisons. In the third step, the ARPA model is converted into a TRIE based model. Finally, the TRIE model is converted into a DeepSpeech-specific binary using the native client.
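A minimal sketch of these four steps is shown below, assuming the KenLM binaries lmplz and build_binary and the DeepSpeech native client generate_trie are available on the PATH; the exact arguments of generate_trie depend on the DeepSpeech release in use.

import subprocess
from pathlib import Path

def build_domain_lm(corpus_in, workdir="lm", alphabet="alphabet.txt"):
    Path(workdir).mkdir(parents=True, exist_ok=True)

    # Step 1: streamline the corpus (strip whitespace, lower-case)
    with open(corpus_in) as fin, open(f"{workdir}/corpus.txt", "w") as fout:
        for line in fin:
            line = line.strip().lower()
            if line:
                fout.write(line + "\n")

    # Step 2: train a 5th-order ARPA model with KenLM
    with open(f"{workdir}/corpus.txt") as fin, open(f"{workdir}/lm.arpa", "w") as fout:
        subprocess.run(["lmplz", "-o", "5"], stdin=fin, stdout=fout, check=True)

    # Step 3: convert the ARPA model into a TRIE-structured binary
    subprocess.run(["build_binary", "trie", f"{workdir}/lm.arpa",
                    f"{workdir}/lm.binary"], check=True)

    # Step 4: build the DeepSpeech-specific trie with the native client
    subprocess.run(["generate_trie", alphabet, f"{workdir}/lm.binary",
                    f"{workdir}/trie"], check=True)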
We develop these language models under two strategies. In the first strategy, we directly use the original dataset to develop the language models. In the second strategy, we apply the augmentation methods to our original dataset and then use the augmented dataset to develop the language models.
4.5.3 Enhancing Pre-Trained Language Model
In this task, we enhance the pre-trained language model by incorporating information from our domain-relevant dataset. There are two main reasons for pursuing this task in the development of robust language models. The first reason is the insufficiency of our domain-relevant dataset: we have only 127 kilobytes of text available in the original dataset, and although augmentation lets us generate about 2 gigabytes of text, the pre-trained DeepSpeech language model was trained on about 4 gigabytes of text, still double that amount. Therefore, we hypothesize that developing language models using text from both data sources can be a better solution. The second reason comes from the preliminary analysis, where DeepSpeech, with its pre-trained language model, made more critical mistakes than general mistakes. From the nature of the mistakes, we observed that the pre-trained language model lacks domain-specific vocabulary, evidently because it was trained on an out-of-domain data source. Therefore, we expect a reduction in critical errors from introducing domain-relevant vocabulary into the pre-trained model.
Enhancement of the language model entails combining information from other data sources (domains) with the existing language model. Language models can be combined by either of two methods: pooling and interpolation. In the pooling method, the corpus of the existing language model is pooled with the text from the other data sources, and the pooled corpus is used to train an enhanced language model. In the interpolation method, a separate language model is trained on the text from each data source; afterward, all language models, existing and newly trained, are merged using weights that are tuned on validation text fetched from the target domain.
The interpolation method has several variations [119]. The simplest is linear interpolation, in which all intermediary language models 𝑃𝑖 are combined linearly using tuned weights 𝜆𝑖 (Equation 3). These weights are tuned over the validation set such that they maximize the likelihood 𝑃(𝑊|𝐻) of the combined model.
Equation 3
P(W|H) = \sum_{i} P_i(W|H)\,\lambda_i
The KenLM [91] toolkit uses the log-linear interpolation method, a slight variation of linear interpolation in which the individual models are raised to the power of their weights and the powered models are multiplied together (Equation 4).
Equation 4
P(W|H) = \frac{1}{Z_{\lambda}(H)} \prod_{i} P_i(W|H)^{\lambda_i}
After merging the models, the product is divided by a normalizing term 𝑍𝜆(𝐻) to ensure that the sum of the probabilities over all words W equals 1. The normalizing term sums, over every word in the combined vocabulary, the product of the weighted model probabilities (Equation 5).
Equation 5
Z_{\lambda}(H) = \sum_{W} \prod_{i} P_i(W|H)^{\lambda_i}
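As a concrete illustration of Equations 4 and 5 (of the math only, not of the KenLM implementation), the toy sketch below combines two hypothetical unigram models over a three-word vocabulary log-linearly and normalizes the result.

vocab = ["fever", "cough", "rash"]
p1 = {"fever": 0.5, "cough": 0.3, "rash": 0.2}   # e.g. pre-trained model
p2 = {"fever": 0.2, "cough": 0.6, "rash": 0.2}   # e.g. in-domain model
lam = [0.4, 0.6]                                 # weights tuned on validation text

def weighted_product(word):
    return (p1[word] ** lam[0]) * (p2[word] ** lam[1])

Z = sum(weighted_product(w) for w in vocab)               # Equation 5
p_combined = {w: weighted_product(w) / Z for w in vocab}  # Equation 4

print(p_combined)
print(sum(p_combined.values()))   # 1.0, up to floating-point rounding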
To enhance the pre-trained language model, we experiment with both the pooling and the interpolation methods. Similar to the previous task, we use both datasets, original and augmented. In total, by combining methods and datasets, we develop language models under four strategies. In the first strategy, we use the original dataset to enhance the pre-trained language model with the pooling method. In the second strategy, the same dataset is used with the interpolation method. The third and fourth strategies mirror the first two in terms of methods but use the augmented dataset.
When we use the pooling method, the source corpus of the pre-trained model is pooled with the domain-specific text of our datasets, and the pooled text is used to train new language models. When we apply the interpolation method, the pre-trained model remains the first intermediary model, while we train separate models from our datasets and interpolate all intermediaries. All the models developed in this task use an n-gram order of 5, which denotes the maximum n-gram length in the language model.
4.5.4 Summary
In the language modeling phase, we work on three tasks. In the first task, we focus on the generation of artificial domain-relevant data using the principles of the synonym replacement method; we define a three-step method that takes a text corpus and, in turn, produces an augmented corpus. In the second and third tasks, we provide strategies to develop language models using both datasets, original and augmented. Together, these two tasks yield six strategies to train language models.
1) Training a new language model using the original dataset.
2) Training a new language model using the augmented dataset.
3) Enhancing the pre-trained language model using the original dataset using the pooling
method.
4) Enhancing the pre-trained language model using the augmented dataset using the
pooling method.
5) Enhancing the pre-trained language model using the original dataset using the
interpolation method.
6) Enhancing the pre-trained language model using the augmented dataset using the
interpolation method.
All of these strategies develop 5th-order n-gram language models.
4.6 Discussion
We consider our solution replicable in other domains where noise and domain robustness are required and insufficient relevant data is available. In the acoustic modeling method, we fine-tuned a general-purpose model; as this model is generic, the same method can be applied in any other domain. However, to replicate our language modeling methods, particularly the data augmentation, one will need domain-specific knowledge sources and concept detection tools to adapt these techniques successfully.
4.7 Conclusion
In this chapter, we presented the techniques we explored and the methods we developed to train robust and domain-specific models. Our methods focus on acoustically distorted and sensitive scenarios where domain-relevant data is scarce. We first performed a preliminary analysis using the pre-trained DeepSpeech models and then developed acoustic modeling and language modeling methods based on insights from that analysis. The acoustic modeling method sought domain adaptation of the pre-built model through a fine-tuning operation on our dataset. The language modeling methods, on the other hand, focused on the generation of domain-relevant data: a three-step method generates an augmented dataset, and six strategies then develop robust language models.
Chapter 5. SPEECH RECOGNITION – EVALUATIONS
5.1 Introduction
This chapter provides a detailed evaluation of the methods presented in the previous chapter. Evaluations were done in two phases. Figure 5.1 shows the flow of the experiments that we executed for our evaluations.
Figure 5.1 Flow of experiments
5.2 Setup
5.2.1 Evaluation Phases
Our solution defines methods to enhance both the acoustic and the language models within the speech recognition system; therefore, our evaluation strategy consisted of two phases.
1) In the first phase, the methods pertaining to each model were evaluated in isolation. When evaluating the acoustic modeling method (Section 5.3), we used the pre-trained language model for the experiments. Similarly, when evaluating the language modeling methods (Section 5.4), we used the pre-trained acoustic model. For the evaluations in this phase, we ran experiments with both training and testing data to analyze the performance when the models had seen everything, as compared to unseen environments. We did not use the testing results to tweak the models; therefore, we did not use a separate validation set in this phase.
2) In the second phase, we evaluated both models in combination. The best performing models from the first phase were selected for these combined evaluations; low performing methods were not re-evaluated, since the first phase had already provided those insights. Finally, in this phase, we used our validation set to present the final validation of our trained models.
5.2.2 Cross-Validation
We used 10-fold cross-validation throughout the evaluations. Since we used the same dataset to train both the acoustic and the language models, a conventional cross-validation approach, in which the dataset is shuffled and split into folds anew before each experiment, can cause problems, specifically in the second phase of evaluations where we intend to use the trained models in combination. Due to the limited dataset, we also could not afford to set aside a separate testing dataset. Therefore, to ensure reliable and valid evaluations, we performed the shuffling and splitting once, recorded all the folds, and supplied the same folds whenever an experiment used cross-validation.
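A minimal sketch of fixing the folds once is shown below, assuming the clips are identified by filename; scikit-learn's KFold is used only for the split, and the saved index lists are reloaded by every later experiment so that all phases see exactly the same partitions.

import json
from sklearn.model_selection import KFold

def make_folds(clip_names, n_splits=10, out_path="folds.json", seed=42):
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    folds = [{"train": [clip_names[i] for i in train_idx],
              "test": [clip_names[i] for i in test_idx]}
             for train_idx, test_idx in kf.split(clip_names)]
    with open(out_path, "w") as f:
        json.dump(folds, f, indent=2)
    return folds

# Later experiments load folds.json instead of re-shuffling the dataset.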
5.2.3 Interpretation of Results
In the evaluations, we analyze the impact of our methods by examining the difference in performance compared to baselines. As speech recognition systems are primarily evaluated on error rates, we consider the Word Error Rate (WER) along with our custom metric, the Critical Error Rate (CER), where a lower error rate means better performance. All crucial decisions, however, are made on the basis of CER. There are two ways to examine a difference in performance: the absolute change in error rates, and the relative change with respect to the baseline error rate. For example, a drop from 50% to 40% is an absolute change of 10 percentage points but a relative change of 20%. In our evaluations, we consider both, but attention is given to the relative performance differences.
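For reference, the sketch below computes WER as word-level edit distance divided by the number of reference words; CER follows the same spirit but is restricted to the critical concepts, and its exact definition is given earlier in the thesis.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edits needed to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("no known drug allergies", "no known drug energies"))   # 0.25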
5.3 Evaluating Acoustic Modeling
This section provides an evaluation of our acoustic modeling method, in which we fine-tuned the pre-trained acoustic model with our dataset. Fine-tuning was run for 8 epochs; we stopped at the 8th epoch due to a continuing increase in testing error rates. Cross-validation was used to calculate testing error rates: for each of the 10 folds, the pre-trained model was fine-tuned separately, and the error rates from all folds were averaged. This process was repeated at each epoch. Training error rates were calculated directly by using the whole dataset for fine-tuning and the same data to calculate errors. Figure 5.2 shows the trend of both error rates (vertical axis) across epochs (horizontal axis), and Table 5.1 lists the error rate values.
Figure 5.2 Error loss in Acoustic Model training
Table 5.1 Training and testing error rates for Acoustic Model training

Epoch | Training CER | CV Testing CER | Training WER | CV Testing WER
1 | 27.59% | 38.09% | 26.84% | 36.94%
2 | 19.62% | 32.31% | 19.43% | 32.04%
3 | 16.47% | 29.57% | 16.56% | 29.77%
4 | 13.06% | 27.42% | 13.53% | 27.83%
5 | 12.57% | 27.45% | 12.94% | 28.11%
6 | 10.84% | 27.22% | 11.47% | 28.01%
7 | 9.25% | 28.17% | 9.84% | 28.86%
8 | 9.37% | 28.35% | 9.75% | 29.28%
5.3.1 Analysis
Analysis of the error rates against the number of epochs shows that the model reached a local minimum on our dataset around the 4th to 6th epochs. The 6th epoch is the most efficient on the basis of CER (27.22%), while the 4th epoch is the most efficient on the basis of WER (27.83%). With the fine-tuning method, we achieved a 22.16% absolute and 44.88% relative reduction in CER, and an 18.8% absolute and 40.32% relative reduction in WER (Table 5.2). As the performance comparison was very close (error differences within a range of 0.2%), the models from the 4th to 6th epochs were selected for the second phase of evaluations.
Table 5.2 Absolute and relative change in error rates after acoustic modeling
CER WER
Pre-trained AM 49.38% 46.63%
Pre-trained AM + fine-tuning 27.22% 27.83%
└ Absolute change -22.16% -18.80%
└ Relative change -44.88% -40.32%
5.4 Evaluating Language Modeling Methods
This section provides an evaluation of our language modeling methods. We provided six strategies to train domain-relevant language models for improved recognition. In our evaluations, we organized these strategies, together with two reference settings, into 8 evaluation scenarios that cover all methods. Table 5.3 provides the details of each evaluation scenario.
Table 5.3 List of scenarios for Language Model evaluations

S. No. | Evaluation Scenario | Strategy Details
1 | Pre-trained LM | Baseline: performance from the preliminary experiment is considered.
2 | No LM | No LM is used in this scenario.
3 | New LM + Original in-domain corpus | New LM is trained using the text corpus from the original in-domain dataset.
4 | New LM + Augmented in-domain corpus | New LM is trained using the text corpus from the augmented in-domain dataset.
5 | Pre-trained LM + Original in-domain corpus + Pooling Method | Pre-trained LM is enhanced using the text corpus from the original in-domain dataset using the pooling method.
6 | Pre-trained LM + Original in-domain corpus + Interpolation Method | Pre-trained LM is enhanced using the text corpus from the original in-domain dataset using the interpolation method.
7 | Pre-trained LM + Augmented in-domain corpus + Pooling Method | Pre-trained LM is enhanced using the text corpus from the augmented in-domain dataset using the pooling method.
8 | Pre-trained LM + Augmented in-domain corpus + Interpolation Method | Pre-trained LM is enhanced using the text corpus from the augmented in-domain dataset using the interpolation method.
Scenario 1 was developed to provide a baseline for this evaluation; therefore, we did not
run any experiments in this scenario; instead, the results from preliminary experiments
were considered. Scenario 2 was developed to test the performance of DeepSpeech without
the use of any language model, as it is not a required component in speech recognition. For
the experiments within this scenario, all audio clips from our original dataset were used to
calculate error rates. In all other scenarios, language models were trained and tested using
training and testing datasets. While evaluating the training error rates, language models
were trained using the whole text corpus from the original dataset and all audio clips from
the same dataset were used to test those models. While evaluating testing error rates, we
used 10-fold cross-validation.
5.4.1 Analysis
The analysis of our results shows that, in scenario 2, DeepSpeech made mistakes more than half the time when no language model was used. Comparing the error rates with the baseline, the average CER increased by around 8% and the WER by around 10% when DeepSpeech did not use any language model, which is evidence of the importance of the language model in speech recognition. Moreover, when language models that had already seen the testing data during training were used, the recognition performance matched the top-of-the-line cloud-based SR (Google). This is not a valid comparison on which to build a conclusion; nevertheless, it highlights the ability of language models to significantly boost recognition performance. Table 5.4 shows the error rates for all scenarios when the training dataset was used to test models trained on the corpus from the same dataset. The table also shows the error rates from the first 2 scenarios for comparison, even though no language model was trained in those scenarios.
Table 5.4 Training error rates from each evaluation scenario
Table 5.5 Cross-validated error rates from each evaluation scenario
Cross-validated experiments show that all our methods improved DeepSpeech's performance, since we observe lower error rates compared to the baseline in every experimented scenario. We note a maximum absolute reduction of 4.81% in WER and 9.9% in CER, which translates into a 10.31% relative reduction in WER and 20.04% in CER. Table 5.5 presents a bar chart and the error rate values for each scenario, while Table 5.6 lists the absolute and relative differences compared to the baseline.
Table 5.6 Absolute and relative change in error rates compared to baseline

Evaluation Scenario | CER | Abs. Diff. (CER) | Rel. Diff. (CER) | WER | Abs. Diff. (WER) | Rel. Diff. (WER)
Baseline: Pre-Trained LM | 49.38% | 0.00% | 0.00% | 46.63% | 0.00% | 0.00%
New LM + Original corpus | 39.49% | -9.9% | -20.04% | 41.82% | -4.81% | -10.31%
New LM + Augmented corpus | 45.05% | -4.33% | -8.77% | 43.10% | -3.53% | -7.57%
Pre-trained LM + Original corpus + Pooled | 47.12% | -2.26% | -4.59% | 45.03% | -1.6% | -3.43%
Pre-trained LM + Original corpus + Interpolated | 42.40% | -6.98% | -14.13% | 43.28% | -3.35% | -7.18%
Pre-trained LM + Augmented corpus + Pooled | 47.04% | -2.35% | -4.75% | 44.56% | -2.07% | -4.43%
Pre-trained LM + Augmented corpus + Interpolated | 44.02% | -5.36% | -10.86% | 42.76% | -3.87% | -8.29%
As we presented multiple methods that are applied in combination within each evaluation scenario, we were interested in analyzing the impact of each individual method on the performance of DeepSpeech. Therefore, we defined four independent variables based on our methods, whose values describe each experimented scenario. Table 5.7 lists all variables and their values, along with the scenarios represented by each value combination. The value 1 denotes True ('Yes'), meaning that the respective variable was applied in the scenario, whereas the value 0 denotes False ('No').
Table 5.7 Variables to analyze each experimented method

Evaluation Scenario | Augmentation of Original corpus | Development of in-domain LM | Enhancement of Pre-Trained LM using Pooling | Enhancement of Pre-Trained LM using Interpolation
3 | 0 | 1 | 0 | 0
4 | 1 | 1 | 0 | 0
5 | 0 | 0 | 1 | 0
6 | 0 | 0 | 0 | 1
7 | 1 | 0 | 1 | 0
8 | 1 | 0 | 0 | 1
Based on these variables, we analyzed the contribution of each method to improving the performance of DeepSpeech. For this analysis, we applied linear regression, where the null hypothesis is that there is no change in performance across the experimented scenarios. Our analysis used only three of the variables, since we had to drop "Development of in-domain LM" due to its direct correlation with the two enhancement variables ("Enhancement of Pre-Trained LM using Pooling" and "Enhancement of Pre-Trained LM using Interpolation"). Scenarios 3 and 4 experimented with the methods based on the dropped variable; however, the values of the other variables still represent these scenarios (False for both enhancement variables translates into True for the dropped variable). We applied regression analysis to both error rates individually, keeping each as the dependent variable. Figure 5.3 shows the summary of the regression analysis when WER is the predicted variable, and Figure 5.4 shows the summary for CER. In total, 6 scenarios were compared, each with 100 observations; therefore, the summaries show an observation count of 600.
Figure 5.3 Result summary of regression analysis using WER
Figure 5.4 Result summary of regression analysis using CER
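A minimal sketch of this regression is shown below, assuming the per-clip results have been gathered into a pandas DataFrame with the three retained indicator variables and one column per error rate; statsmodels' OLS produces the kind of summary shown in Figures 5.3 and 5.4.

import statsmodels.api as sm

def run_regression(results_df, target):
    """results_df columns: augmentation, pooling, interpolation, wer, cer."""
    X = sm.add_constant(results_df[["augmentation", "pooling", "interpolation"]])
    model = sm.OLS(results_df[target], X).fit()
    print(model.summary())     # includes the F-test of the null hypothesis
    return model

# run_regression(results_df, "cer")
# run_regression(results_df, "wer")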
The ANOVA results from the regression analyses reject the null hypothesis when CER, the domain-specific error rate, is taken as the performance factor. Given the importance of critical errors, we argue that some of our methods have a significantly larger impact on recognition performance than others, even though there does not appear to be a significant difference in general word errors. However, these analyses could not single out any one method as providing a significant contribution to error reduction. Therefore, we investigated our results further to see the combined effects of our methods.
We first compared the scenarios that applied the augmentation method with those that used the original dataset. We observe that the augmentation method increased error rates compared to the language models created only from our original dataset corpus. However, in the scenarios where we enhanced the pre-trained language model, the augmentation method did show some reduction in error rates. We noted a maximum relative reduction of 1.19% in WER and 0.18% in CER; however, these reductions are not reliable, as we also observe a substantial increase (3.81% relative) in critical errors when enhancement is done with the interpolation method. Table 5.8 lists the absolute and relative differences between all scenarios using the original and augmented datasets.
Table 5.8 Comparing error rates between original and augmented corpus

CER
Corpus | New LM | Pre-Trained LM + Pooling | Pre-Trained LM + Interpolation
Original corpus | 39.49% | 47.12% | 42.40%
Augmented corpus | 45.05% | 47.04% | 44.02%
└ Absolute Diff. | 5.57% | -0.08% | 1.62%
└ Relative Diff. | 14.10% | -0.18% | 3.81%

WER
Corpus | New LM | Pre-Trained LM + Pooling | Pre-Trained LM + Interpolation
Original corpus | 41.82% | 45.03% | 43.28%
Augmented corpus | 43.10% | 44.56% | 42.76%
└ Absolute Diff. | 1.28% | -0.47% | -0.52%
└ Relative Diff. | 3.05% | -1.03% | -1.19%
Second, we compared the scenarios that train new language models with those that enhance the pre-trained model using the pooling method and those using the interpolation method. We note that enhancement, in general, is not effective in reducing error rates; however, when interpolation is used to enhance the pre-trained model, we observed a 2.29% relative reduction in CER and 0.78% in WER using the augmented corpus. Across all other scenarios, a new language model using only the original (un-augmented) corpus gave the lowest error rates. Table 5.9 and Table 5.10 list the absolute and relative differences in error rates when using the pooling and interpolation methods in comparison to the new domain-specific language modeling method.
Table 5.9 Comparing error rates between new and enhanced pooled model

Model | CER (Original Corpus) | CER (Augmented Corpus) | WER (Original Corpus) | WER (Augmented Corpus)
New LM | 39.49% | 45.05% | 41.82% | 43.10%
└ Enhanced LM using Pooling | 47.12% | 47.04% | 45.03% | 44.56%
└ Absolute Difference | 7.63% | 1.98% | 3.21% | 1.46%
└ Relative Difference | 19.33% | 4.40% | 7.67% | 3.40%
Table 5.10 Comparing error rates between new and enhanced interpolated model

Model | CER (Original Corpus) | CER (Augmented Corpus) | WER (Original Corpus) | WER (Augmented Corpus)
New LM | 39.49% | 45.05% | 41.82% | 43.10%
└ Enhanced LM using Interpolation | 42.40% | 44.02% | 43.28% | 42.76%
└ Absolute Difference | 2.92% | -1.03% | 1.46% | -0.33%
└ Relative Difference | 7.39% | -2.29% | 3.48% | -0.78%
To sum up our analysis, we observe three key performance behaviors of our methods. 1) The pooling method of enhancing the pre-trained model has the least impact on the reduction of error rates. 2) Applying the augmentation method to our original dataset did enhance the impact of the pooling method, yet it was not able to surpass the standalone impact of the interpolation method. 3) The development of simple domain-specific language models has the highest impact, although enhancement of the pre-trained model seems a more logical option, as it provides an opportunity to generate models that are trained on a larger set of information.
To understand these behaviors, we analyzed the construction of the clinical notes in our dataset, i.e., our target domain. In the notes, we observe many patterns that recur across almost the whole dataset. One pattern that usually occurs at the start of every note is, “This is a <age> <weeks/month/year> old <white/black/other ethnicity> <male/female> …”. Examples that follow this pattern are “This is a 2 months old white male …” and “This is a 4-year-old black female ...”. Because physicians mostly follow such patterns while dictating notes, a language model needs to learn them in order to perform efficient recognition: if the acoustic side produces a glitchy output, the language model can only refine that output if it is already aware of such patterns. The main purpose of language model training is, in fact, to pick up the recurring patterns in the training set. The pre-trained model, therefore, has already identified the patterns of the datasets on which it was trained, and when we try to enhance it, we are effectively trying to teach it new patterns. The pre-trained model was trained on about 4 gigabytes of text, whereas the text from our original dataset combined is not more than 150 kilobytes. We therefore consider that our smaller text corpus was unable to make a significant impact while enhancing the pre-trained model, simply because of its size. The property of the pooling method of giving equal weight to all data sources supports this idea, and it also explains the increased error reduction when the augmented dataset is used, as augmentation increases the size of the domain-relevant text corpus.
After this analysis, we selected the language models from all scenarios for the second phase of experiments, except those scenarios that implement the pooling method. We dropped the pooling method from further evaluations for two main reasons. First, it showed the least performance gains, as mentioned above. Second, pooling is one of the two model enhancement methods we experimented with, and we wanted to continue with only the better of the two.
5.5 Combined Evaluation
In this section, we provide our second phase of evaluation. In this phase, no new models were trained; instead, the best performing models selected from the previous evaluations were analyzed further. We selected multiple acoustic and language models and tested all their combinations on our original dataset using the same cross-validation folds on which the models were trained. Table 5.11 lists the error rates for all tested combinations in this evaluation phase.
Table 5.11 Error rates after testing with combined models

Language Model | CER, AM 4th Epoch | CER, AM 5th Epoch | CER, AM 6th Epoch | WER, AM 4th Epoch | WER, AM 5th Epoch | WER, AM 6th Epoch
Pre-trained LM + Original Corpus + Interpolated | 26.55% | 26.04% | 25.51% | 28.30% | 28.12% | 27.90%
Pre-trained LM + Augmented Corpus + Interpolated | 25.73% | 25.89% | 25.17% | 26.97% | 27.41% | 26.95%
New LM + Original Corpus | 24.03% | 24.28% | 24.05% | 27.18% | 27.51% | 27.35%
New LM + Augmented Corpus | 26.65% | 26.10% | 25.74% | 27.13% | 27.49% | 27.23%
5.5.1 Analysis
The combined evaluation showed that the lowest word error rate (WER: 26.95%) was achieved by the combination of the 6th epoch acoustic model and the pre-trained language model enhanced using the augmented corpus. However, this model combination was slightly behind with respect to critical error rates. The lowest critical error rate (CER: 24.03%) was achieved by the 4th epoch acoustic model in combination with a new language model trained using only the original dataset.
The analysis of results shows that the language modeling methods of dataset augmentation and model interpolation work best in combination to reduce general error rates. In terms of WER, the augmented interpolated language models performed better than the standalone augmented and standalone interpolated models, and were also the overall best performing models.
Table 5.12 Performance comparison of DeepSpeech (before and after) with Google
When analyzing critical errors, we found that the 4th epoch acoustic model made fewer critical mistakes when combined with new language models, and the 6th epoch acoustic model made fewer critical mistakes with interpolated language models. However, the lowest critical error rate was achieved by the language models newly trained on the original dataset; the augmented interpolated language models were second-lowest, with about a 4.7% relative difference.
In the end, we found that all of our acoustic and language modeling methods, in combination, were able to significantly improve the performance of DeepSpeech by reducing error rates from 46.63% to 26.95% WER and from 49.38% to 25.17% CER. These reductions translate into a 42.2% relative reduction in WER and 49.02% in CER. This improvement enables DeepSpeech to deliver performance that is relatively close to that of the top-of-the-line cloud-based SR (Google). Table 5.12 shows the error rates of Google, of DeepSpeech before applying our methods, and of DeepSpeech after applying our acoustic and language modeling methods.
5.6 Working Examples
In this section, we have provided examples of critical and word errors within the
transcribed notes that we observed with the best performing model combinations. In this
section, we present a best-case examining transcribed note with the least critical errors, a
worst-case examining a note with the most critical errors and two random cases. The two
model combinations that we are examining are 1) 4th epoch acoustic model and new
language model with the original corpus, and 2) 6th epoch acoustic model and enhanced
pre-trained language model with the augmented corpus.
5.6.1 Best Case
Among all 100 notes, the best-transcribed note achieved a CER as low as 6.25% with the 1st model combination, making only 5 critical mistakes out of 80 critical concepts. With the 2nd model combination, it achieved 7.5% CER with 6 critical mistakes. The speech in the corresponding audio note is clear, without any background noise; the dictation is at a normal pace and appears to have been recorded in a closed room in a relaxed environment. Almost all critical mistakes in this case were homonym-like confusions; for example, “enterovirus” was mistaken as “ether is” and “two older” as “told or”. Table 5.13 lists the transcriptions from both model combinations. All highlighted and bold words and phrases are errors: yellow highlights are general word errors, while green highlighted and underlined words are critical errors.
Table 5.13 Errors in the best transcribed note
Ground Truth AM: 4th Epoch LM: New LM + Original Corpus
AM: 6th Epoch LM: Pre-trained LM + Augmented Corpus + Interpolation
excerpt seven excerpt seven excerpt seven
past medical history past medical history past medical history
on birth history BLANK was full term
on birth history BLANK was full term
on birth history BLANK was full term
she had a previous admission for bronchiolitis to the p m u and this was for a r s v positive bronchiolitis on BLANK
she had a previous admission for bronchiolitis to the p m u and this was for an r s v positive bronchiolitis on BLANK
she had a previous admission for bronchiolitis to the p m u and this was for an r s v positive bronchiolitis on BLANK
she has been followed for peripheral pulmonary artery stenosis by cardiology
she has been followed per four peripheral pulmonary artery stenosis by cardiology
she has been followed for peripheral pulmonary artery talusin cardiology
she has also had most recently bronchiolitis with human rhinovirus enterovirus on BLANK
she has also had most recently bronchiolitis with human rhinovirus enterovirus on BLANK
she has also had most recently bronchiolitis with human rhinovirus ether is on BLANK
this required a short admission to i c u for high flow
is required a short emission to i c u for high flow
is required a short admission to i c u for high flow
medications vitamin d four hundred international units once daily
medications vitamin d for hundred international units once daily
medications vitamin d for a hundred international units once daily
allergies no known drug allergies immunizations up to date
allergies no known drug allergies immunizations up to date
allergies no known drug allergies immunizations up to date
family history BLANK so it is unclear of her family history
family history BLANK so as on clear of her family history
family history BLANK so desone clear of her family history
social history BLANK lives with mom and two older siblings BLANK and BLANK years old BLANK
social history BLANK less with mom and told or siblings BLANK and BLANK years old old BLANK
social history BLANK lysis with mom and told or siblings BLANK and BLANK years old BLANK
several family members had been recently
several family members have been recently
several family members have been recently
5.6.2 Worst Case
In our results, the worst transcription achieved a CER of 59.6% with the 2nd model combination and 54.8% with the 1st. We observed that the quality of the corresponding audio note is significantly lower than that of the other notes: there is high echo and reverberation, and the speaker appears to be away from the recording device. The note was also filled with a large amount of background noise, especially overlapping noise that corrupts the speech signal; we hear loud baby cries and parents trying to soothe the baby in the background by talking and playing with the baby. These interruptions are reflected in the mistakes. For example, in one place “abdominal pain” is mistaken as “a dona panadol”, which is not even a homonym of its gold standard. Another similar example is “upper respiratory tract infection” being transcribed as “offer reichstein”. These examples show that, due to overlapping noise, a lot of the speech signal is lost, resulting in a sloppy transcription. Table 5.14 presents the comparison of transcriptions for both model combinations; green highlights show critical mistakes.
Table 5.14 Errors in the worst transcribed note
Ground Truth AM: 4th Epoch LM: New LM + Original Corpus
AM: 6th Epoch LM: Pre-trained LM + Augmented Corpus + Interpolation
this is a four year almost
five year old female with
trisomy twenty one
he is a four year always
five year old female at ten
twenty one
he is a four year as five
year old female as try
twenty one
presenting with two to
three week history of upper
respiratory tract infection
that has lingered
awakening with a cough
and having the cough
during the day and
sometimes at night
is an in a to the three
weeks history as offer
rather infection of lines
waiting with her cough
and having (the) cough
during the day and some
time at night
is anion so to three weeks
history as offer reichstein
of linear waiting with call
and having (the) cough
during the day had some
time (at) tight
ventolin of no benefit
recently
that on no better we that on no batel rashes
she has had no fevers with
this
she (has) had no fevers
with this
(she has) at no fevers with
this
possible headache no other
pain
four headache nor her
pain
four endantadine
she has had abdominal
pain and vomiting a week
ago that resolved over
about a day and resulted in
multiple episodes of
vomiting
she has had a do paid
(and) vomiting a week no
that resolved over about a
vein (and) result is well
colitis following
she has had a dona
panadol in a week no the
resolved over about ada in
result as well colicines
following
she has been on p o
steroids for the last five
days as per the parents
usual approach to acute
asthma
she says on p o per i for
(the last) five days her the
parents usual got his used
for as no
she see n p o ser for (the
last) five days her the
parents usual proctocele
for as mopeg
paragraph past medical
history immunizations are
up to date
paragraph past medical
history immunizations are
(up to) date
ph past medical history
immunizations (are) up to
date
she has had a tetralogy of
fallot repaired trisomy
twenty one and she is on
valproic acid and
clobazam for seizure
disorder which is quiescent
for the last two years
ten had also or care
trisomy twenty one and she
(is) on a for acid episode
for (seizure disorder)
which is a two years
on fenoterol some or
paired trisomy twenty one
and she (is) on dolo acidic
anorectics two years
on exam no respiratory
stress normal air entry
with no wheeze airway
breathing and circulation
stable
on exam no respiratory
stress normal artery with
no wheezer we breathing
in certain stable
on exam no respiratory
stress normal reentry with
no leader we breathing and
circulation stable
pink soft belly normal left
ear normal right ear
pain saw a day normal as
ear nor rays
pain sagatal normal last
ear nor river
snotty nose but not
dramatically swollen
tonsils, slightly red, no pus.
not dramatically acute
not of note a iron in last or
or to widened no refused
to one week ago i saw so
so i group is probably a
mild mild being a certain
in wide per were more
operation the in are work
but i do an a b released
nodose but no oestro
intraorbital red neural
perforin was miosis so so
iridoids probably mild
iletin ceresan viper werner
osteoid an a b q resin
5.6.3 Random Cases
We chose two random notes, in between the best and worst cases presented above, to analyze the mistakes. Upon examining the corresponding audio recordings, we found that their noise levels fall between those of the two example cases: there was no overlapping noise, although we could hear some reverberation. On a few occasions, the physician's dictation pace increased dramatically, which directly increased the mistakes in the transcriptions. On one such occasion, the phrase “feet swept out from underneath him” was mistaken as “feet slept out from under eat him”.
The first random note we examined achieved 11.66% CER with the 1st model combination and 14.4% CER with the 2nd. Table 5.15 shows the transcriptions we received with both model combinations.
Table 5.15 Errors in the first examined note transcription
Ground Truth AM: 4th Epoch LM: New LM + Original Corpus
AM: 6th Epoch LM: Pre-trained LM + Augmented Corpus + Interpolation
this is an eighteen month old black female
this is a ten month old black female
this is a ten month old black female
presenting with a seizure like episode at home
presenting with a seizure like episode at home
presenting with a seizure like episode at home
child was perfectly well until three days ago when sniffles and cough became apparent
child was perfectly well until three days ago when sniffles and cough became apparent
child was perfectly well until three days ago when sniffles and cough became apparent
last evening at about eighteen hundred hours the child experienced a mild fever of thirty eight point five
last evening (at) about eighteen hundred hours the child experience on mild (of) fever thirty eight point five
last evening (at) about eighteen hundred hours the child experience a mild fever (of) thirty eight point five
this morning on rousing her from her bed and she stiffened exhibited five to fifteen seconds of shaking movements suggestive of
this morning on rising her for her bed if she stiffened excited five to fifteen seconds of of shaking movement suggestive of
this morning on the rising her for her bed if she stiffened excited five to fifteen seconds of of shaking movement
tonic clonic seizure and then settled into a deep sleep from which she roused fifteen minutes later
tonic clonic seizure and in settled into a deep sleep for which she was fifteen minutes later
suggestive of tonic clonic seizure and intensain to a deep sleep for which she resistin in the later
by the time she arrived at the i w k emergency she was awake and alert and cranky
in time she wrist the i w k emergency is she was awake and alert and cranky
date she arrived at the i w k emergency i she was awake and alert and cranky
parents have never seen similar seizure activity before
parents have never seen similar seizure activity before
parents have never seen similar seizure activity before
past medical history reveals immunizations up to date and normal birth and pregnancy history with no specialist and no regular medications and no known allergies
past medical history revealed immunizations up to date a normal birth and pregnancy history with no specialists and no regular medications and no known allergies
past medical history reveals immunizations up to date a normal birth and pregnancy history with no specialist and no regular medications and no known allergies
family history reveals febrile seizures in dad as a child
family history reveals febrile seizures in dad as a child
family history reveals febrile seizures in dead as a child
physical exam revealed a cranky child who settled with care
physical exam revealed a cranky child who settled with care
physical exam revealed a cranky child who settled with care
ear nose and throat exam revealed a red throat with no exudate suggestive of a viral pharyngitis
ear nose and throat exam revealed a red throat with no x date suggestive of a viral pharyngitis
ear nose and throat exam revealed a red throat with no exo date suggestive of a viral paris
ears were normal lungs were clear
ears were normal lungs were clear
ears were normal lungs were clear
cardiac exam was unremarkable
cardiac exam was unremarkable
cardiac exam was unremarkable
abdominal exam was normal
abdominal exam was normal
abdominal exam was normal
neurological exam revealed a child who following ibuprofen was playful and interactive and no focal finding on
neurological exam revealed a child who following ibuprofen was playful and interim and no focal findings on
neurological exam revealed a child who following ibuprofen was playful and interacting and no focal findings under local exam is cover
neurological exam was discovered
neurological exam is descend
new paragraph impression is that this is simple febrile seizure
new paragraph impression is that this is simple febrile seizure
new paragraph impression is that this is simple febrile seizure
parents were counseled and reassured
parents were counseled and reassured
parents were counseled and reassured
instructions were given with regards to the delivery of ibuprofen as needed for pain and fever but the parents were cautioned that this would not reduce the risk of further seizures.
instructions were given with regards to the delivery of ibuprofen as needed for pain and fever by the parents were cousin that this would not reduced the risk of for other seizures
instructions were given with regards to the delivery of i do provided for pain and fever but the parents were cation that this would not reduced the rose of further seizures
we stated the seizure risk at about one or two percent over the rest of this illness
we stayed the seizures (risk) at about one or two percent over the rest of the illness
we state the seizure risk at about one or two percent over the rest of the sinus
but cautioned them that further febrile seizures were likely between now and six years of age
but out on them that further febrile i sure were likely between now and six years of age
but cation them that further febrile tissue (were) likely between no and six years of age
her parents were advised to return to the emergency department with further seizures for further assessment
a parents were advised to return to the emergency department with rather seizures refer her assessment
a parents were advised to return to the emergency department with further seizures refer assessment
diagnosis viral upper respiratory tract infection with febrile seizure
diagnosis for upper respiratory tract infection with febrile seizures
diagnosis boro upper parry tract infection with febrile seizure
The second randomly selected note achieved 22.41% CER with the 1st model combination and 24.13% CER with the 2nd model combination. Table 5.16 shows its transcription.
Table 5.16 Errors in the second examined note transcription
Ground Truth | AM: 4th Epoch, LM: New LM + Original Corpus | AM: 6th Epoch, LM: Pre-trained LM + Augmented Corpus + Interpolation
excerpt thirty one | excerpt thirty one | excerpt thirty one
medications at admission | medications at admission | medications at admission
two hundred and fifty milligrams of amoxil t i d for ten days started on BLANK | to turn fifty milligrams of a lot p i d for ten days see on BLANK | to endrin fifty milligrams of a moban t i d for ten days are on BLANK
examination at admission | examination and admission | examination and admission
BLANK was afebrile and vitally stable | BLANK is a febrile advised this table | BLANK was a febrile and vita e table
head and neck exam was unremarkable cervical lymph nodes were palpable and less than one point five centimeters in diameter | head and neck exam was unremarkable service lymph nodes with health able and less than one point five centimeters and via | head and neck exam was unremarkable service lymph nodes but palpable and less than one point five centimeters and dia
cardiac exam was normal | cardiac exam was normal | carb exam was normal
respiratory exam showed increased work of breathing | respiratory exam showed increased work of everting | respiratory exam showed increased work of retin
intercostal indrawing on the right side | intercostal indrawing on the right side | intercostal in von on vit side
good air entry bilaterally and course crackles and wheezes throughout both lung fields | a good air entry bilaterally and course crackles one a throat business | a good air entry bilaterally and coarse crackles and tear out pulseless
her abdominal exam was unremarkable | her abdominal exam was unremarkable | her abdominal exam was unremarkable
allergies none | allergies non | allergies on
immunization status | immunizations status | immunizations status
After reviewing the individual cases, we observed that DeepSpeech was able to handle the domain-specific speech for the most part. However, we noted three factors that are prominently linked with the errors.
1) Reverberation
On average, our dataset has low levels of reverberation. However, the audio notes with high reverberation are consistently associated with high numbers of errors.
2) Overlapping Noise
Similar to reverberation, most of the notes in our dataset have moderate noise levels that do not interfere with the speech. However, the instances where background noise overlaps the speech skew the transcription performance to a large extent.
3) Dictation Pace
Within the notes, we observed that commonly occurring phrases are uttered at a high pace, while complicated sentences are uttered slowly. When the dictation pace increases, the transcription starts to show errors, mostly by skipping critical words.
5.7 Final Validation
In this section, we report the results of our acoustic and language models on a separate validation dataset. During dataset preparation (Section 3.4.2), we set aside a validation set that was not used in any of our prior experiments. In this final validation, we evaluated all of our selected models, each trained on the whole original dataset, against this validation set. Table 5.17 lists the error rates for all model combinations.
Table 5.17 Error rates on the validation dataset
Language Model ↓ \ Acoustic Model → | CER: 4th Epoch | CER: 5th Epoch | CER: 6th Epoch | WER: 4th Epoch | WER: 5th Epoch | WER: 6th Epoch
Pre-trained LM + Original Corpus + Interpolated | 32.09% | 30.52% | 30.41% | 35.99% | 34.39% | 34.65%
Pre-trained LM + Augmented Corpus + Interpolated | 33.24% | 32.04% | 33.30% | 35.20% | 34.23% | 35.02%
New LM + Original Corpus | 30.98% | 29.63% | 28.96% | 34.99% | 34.14% | 33.84%
New LM + Augmented Corpus | 34.30% | 33.89% | 34.61% | 35.70% | 35.34% | 35.79%
We observed similar behavior in the results of the final validation. With respect to WER, the lowest error rates came from the enhanced pre-trained models that use the augmented dataset and from the newly trained models on the original dataset; the lowest two error rates from these language models differed by only 0.09% in absolute terms. With respect to CER, however, we found that the newly trained models on the original dataset made up to 5% (relative) fewer critical mistakes than the enhanced pre-trained models.
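For reference, the error rates reported in this chapter follow the standard edit-distance definitions of WER and CER. The sketch below is a minimal illustration of those metrics, shown only to clarify how the percentages above are interpreted; it is not the scoring script used in our experiments.

```python
# Minimal sketch of WER/CER as normalized Levenshtein edit distance.
# Illustrative only; not the actual scoring script used in these experiments.
def edit_distance(ref, hyp):
    # Dynamic-programming Levenshtein distance over two token sequences.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1]

def wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

print(wer("cardiac exam was normal", "carb exam was normal"))  # 0.25 (1 substitution / 4 words)
```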
5.8 Discussion
Throughout the evaluations, we observed that language models newly trained on our original (un-augmented) dataset performed similarly to, and sometimes better than, the language models built with our augmentation and model enhancement methods, specifically with regard to critical errors. We attribute this to the extremely small size of our dataset, which impacts performance in two ways.
1) Language modeling is all about vocabulary and speech patterns. The small size of our dataset does not cover either aspect sufficiently to reflect the target domain. We addressed the vocabulary aspect with our dataset augmentation method; however, we were not able to introduce more domain-specific speech patterns.
2) The pre-trained model is trained on a significantly larger dataset than ours. Moreover, when our small dataset is broken down into folds for cross-validation, many scenarios remain unseen during testing. Those unseen domain-specific speech patterns are then compensated for by the prior knowledge of the pre-trained model, which does not reflect our target domain.
In addition to these two reasons, our limited implementation of the augmentation method also hinders the reduction of critical errors. Our definition of critical errors covers all domain-specific vocabulary, yet we only used two domain-specific classes (drugs and diseases) to drive the augmentation. Therefore, the concepts from the remaining domain-specific classes also remain unseen in the experiments with the testing folds.
The interpolation method, which develops a language model using information from multiple data sources, has the ability to adjust when the volume of data differs across sources. As there is a huge size difference between the data source of the pre-trained language model and our dataset, the interpolation method logically has the means to cope with this situation. However, interpolation requires a separate tuning set from the target domain to calculate the weight of each data source. Since our dataset is already limited, carving a tuning set out of it makes it even smaller. Therefore, in our experiments the interpolation method performed fairly close to the stand-alone models but was not able to surpass them. Had we had a slightly larger dataset, we consider that the interpolation method would have performed best.
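To make the interpolation idea concrete, the toy sketch below mixes two probability estimates for the same event and tunes the mixture weight on a held-out tuning set. The actual models in this work are n-gram language models built with a standard toolkit, so this is only a conceptual illustration of the underlying idea, not the procedure used in our experiments; the probability dictionaries are hypothetical placeholders.

```python
import math

# Conceptual sketch of linear language-model interpolation with a tuned weight.
# p_pretrained / p_domain are hypothetical dicts mapping an event id to its
# probability under each source model.
def interpolate(p_pretrained, p_domain, lam):
    """Mix two probability estimates for the same event with weight lam."""
    return lam * p_domain + (1.0 - lam) * p_pretrained

def tune_lambda(tuning_events, p_pretrained, p_domain):
    """Pick the mixture weight that maximizes log-likelihood on a tuning set."""
    best_lam, best_ll = 0.5, float("-inf")
    for lam in [i / 20 for i in range(1, 20)]:          # simple grid search
        ll = sum(math.log(interpolate(p_pretrained[e], p_domain[e], lam))
                 for e in tuning_events)
        if ll > best_ll:
            best_lam, best_ll = lam, ll
    return best_lam
```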
5.9 Conclusion
This chapter presents an evaluation of our defined methods. We have shown that by using the fine-tuning method on the pre-trained DeepSpeech acoustic model, WER decreased by a relative 40.32% and CER by 44.88% on our original dataset. After applying the data augmentation method to our dataset and using the interpolation method to enhance the pre-trained language model, we reduced the error rates further, by a relative 42.2% in WER and 49.02% in CER. In our evaluations, we observed that the pooling method for enhancing the pre-trained language model had the least impact on the reduction of error rates.
Furthermore, when we applied our language modeling methods in isolation, neither of
those methods had a reducing impact on error rates. However, when applied in
combination, we observed lower error rates when the pre-trained language model was
enhanced using the interpolation method and augmented dataset.
Chapter 6. CLASSIFICATION OF TRANSCRIPTION INTO
SOAP CATEGORIES
6.1 Introduction
This chapter presents our work on categorizing clinical transcriptions into a SOAP-structured clinical report. To achieve our second objective, we selected an exemplar-based concept detection algorithm [102] with the motivation to extend it for SOAP classification. This algorithm was chosen particularly because its word n-gram based approach fits our problem. In previous work, Mowery [101] also used an n-gram based approach and showed positive results; however, that work poses two major limitations, both of which this exemplar-based algorithm addresses. Firstly, it makes use of overlapping word n-grams that fully exploit the sentence sequences. Secondly, since it works on exemplar matching, it should not be strongly affected by class imbalance.
6.2 Methods
6.2.1 Assigning SOAP categories to Dataset
Our dataset did not provide prior assignments of sentences to SOAP categories; hence we performed the data labeling manually. We performed this task before working on the methods in order to analyze the class proportions of our dataset and to make informed decisions. To label the sentences, we referred to the definitions of the SOAP categories and related literature to identify the guiding principles for each category. Building on those principles, we skimmed through the dataset looking for commonly occurring patterns within each category. Based on the understanding we developed from these guiding principles and patterns, we heuristically assigned a SOAP label to each sentence in our dataset. The heuristics we applied for each of the SOAP categories are given below.
1) Subjective
Sentences that narrate what happened to the patient, told from the patient's perspective. Subjective sentences are mostly in the past tense, describing a scene involving the patient. There can also be some present-tense sentences in this category, limited to responses coming from the patient. The only exception we have noticed in this category is usually the first sentence, where the physician starts a note by describing the demographics of the patient. Table 6.1 lists the patterns that we observed in the sentences that we labeled as 'Subjective'.
Table 6.1 Patterns observed in subjective category sentences
Patterns | Comments
<Child/He/She> <is/was/has> ... | Since physicians narrate the patient's perspective, a pronoun such as 'Child', 'He' or 'She' usually starts the sentence, along with a present- or past-tense helping verb. A tale about the patient then follows.
family history ... | When describing the family history, physicians generally start with the phrase itself and then dictate the contents.
immunizations are <up to date> | Physicians mention immunizations in almost every note. In all the notes we examined, the content never changed; however, we think the phrase 'up to date' would change if a different case occurred.
past medical history <is non-contributory/reveals ...> | Similar to family history, physicians start talking about past history with this phrase. If such history does not contribute to the case, they mostly mention that.
this is a <age> <days/months/years> old <boy/girl/male/female> ... | This is the most frequent pattern in all notes and comes at the very start.
2) Objective
Sentences that describe the physician's observations of the patient. These sentences are written in either past or present tense and are usually short. They usually start with a laboratory test or examination type and then provide the result of that test or examination. Table 6.2 shows the patterns for which we labeled sentences as 'Objective'.
Table 6.2 Patterns observed in objective category sentences
Patterns | Comments
<EXAM NAME> <is/was> <normal/unremarkable/uncomplicated/...> | Objective sentences talk about laboratory and physical exam results. Mostly these exams are good, so physicians note that down.
<EXAM NAME> reveals ... | In case an exam reveals something, they also mention that.
On exam <child/he/she> <was/has> ... | For physical exams, physicians mostly simply use 'On exam'.
the rest of exam <was/is> ... | This pattern is also fairly visible in the notes.
<BODY PART> <is/are> <normal/clear/...> | While describing the results of the physical exam, physicians note down the state of each body part they examined.
3) Assessment
Sentences that describe the physician's assessment of the patient. These sentences usually contain the word 'diagnosis' along with a disease or medical condition. Table 6.3 lists some patterns for 'Assessment' sentences.
Table 6.3 Patterns observed in assessment category sentences
Patterns | Comments
diagnosis is <DIAGNOSIS> | The word 'diagnosis' is prominent in assessment category sentences.
<Medicine name and dosage> was prescribed | This is a loose pattern that does not follow a consistent style. In concept, it mentions the medications and treatments prescribed to the patient.
parents were <reassured/instructed> ... | This pattern covers the instructions and reassurances that the physician has given to the patient's parents, starting with phrases such as 'parents were instructed ...' and 'parents were reassured that ...'.
4) Plan
Sentences that describe the physician's plans for the patient's future treatment. These sentences are usually written in the future tense. Table 6.4 lists the patterns for 'Plan' category sentences.
Table 6.4 Patterns observed in plan category sentences
Patterns | Comments
... follow up ... | This is a loose pattern that looks for sentences with the phrase 'follow up'. One variation that we observed multiple times is "Follow up as necessary".
<child/parents> <was/were> advised to return if ... | This pattern captures the physician's advice to return if some condition occurs.
<Plans for future treatments> | This is also a loose pattern. Generally, it looks for future treatment options; physicians do not seem to follow a strict word sequence here.
All the sentences from our original dataset are extracted, and each sentence is manually
labeled according to the heuristics defined above. The labeled dataset is saved such that
features represent sentences and labels consist of one of the SOAP categories. In this
process, we observe that there is a class imbalance in our dataset. We note around 47%
Subjective, 35.3% Objective, 11% Assessment, and 6.7% Plan sentences in our dataset.
The imbalance property of our dataset matches the way clinical reports are usually written.
The same spread of classes is shown by Mowery [101] which also supports our observation.
6.2.2 Exemplar-based Concept Detection
Concept detection is an NLP problem concerned with extracting relevant information from text collections. Juckett presents an exemplar-based algorithm, named Fuzzy matching, to link text to semantically similar classes [102]; it maps each word within the text to probable class assignments. Figure 6.1 outlines the algorithm. Juckett then presents a second algorithm, Output array creation, that takes the output values from Fuzzy matching and creates an array of detected concepts.
Figure 6.1 Exemplar based algorithms proposed by Juckett [102]
The Fuzzy matching algorithm expects some relevant text, already categorized by human annotators, to be available for training exemplars. Exemplars are defined as the collection of unique text strings that belong to specific semantic classes. In this algorithm, exemplars are stored as a Bag of Words (BOW) and a Bag of Bi-characters (BOB). For each sentence in the training set, an exemplar pair of BOW and BOB is created: the BOW is the set of all words within the training sentence, and the BOB is created by removing spaces from the sentence and then extracting all overlapping character bi-grams. All generated exemplar BOW and BOB pairs are then stored along with the class assignments of the training sentences.
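As a concrete illustration of the exemplar representation described above, the sketch below builds the BOW/BOB pair for one labeled training sentence. It is our own minimal reading of the description, not Juckett's original code.

```python
def make_exemplar(sentence, label):
    """Build the (BOW, BOB) exemplar pair for one labeled training sentence."""
    words = sentence.lower().split()
    bow = set(words)                                         # Bag of Words
    joined = "".join(words)                                  # drop spaces before bi-gram extraction
    bob = {joined[i:i + 2] for i in range(len(joined) - 1)}  # Bag of Bi-characters
    return {"bow": bow, "bob": bob, "label": label, "length": len(words)}

exemplar = make_exemplar("past medical history is unremarkable", "Subjective")
```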
To perform concept detection, the query sentence is first filtered: punctuation and determiners are removed. The query is then converted into overlapping word n-grams up to the 5th order (1-gram, 2-gram, ..., 5-gram). Each word n-gram is converted into a BOW and BOB and stored, giving a list of BOW and BOB for all possible word n-grams (up to order 5) within the query. For each word n-gram, exemplars of length between n and 4n are selected. For each selected exemplar, the Jaccard Index [120] between the exemplar and the word n-gram is calculated for both the BOW and the BOB, and the larger of these two values is kept. After computing the similarity of the word n-gram to each exemplar, the top n values are selected, and the word n-gram and exemplar information for the selected values are stored.
Figure 6.2 Outline of word-class array [102]
After selecting the top n values, the Output array creation algorithm generates a word-by-class map. The word-by-class map is a two-dimensional array where rows are the words in the query text and columns are the classes. Each cell within the array holds the score of that word-class combination. From the values stored during fuzzy matching, all values are selected whose n-gram contains the word and whose exemplar belongs to the class; these values are then summed, and the result is stored as the word-class combination score. Figure 6.2 shows a word-class matrix as illustrated in [102].
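The sketch below ties these pieces together: query word n-grams, the n-to-4n exemplar length filter, the maximum of the BOW and BOB Jaccard scores, selection of the top n values, and accumulation into the word-by-class map. It is a simplified reading of the algorithm for illustration only (for brevity, duplicate query words are collapsed), not a faithful re-implementation of [102].

```python
def bow_bob(text):
    """Bag of Words and Bag of Bi-characters for a piece of text."""
    words = text.lower().split()
    joined = "".join(words)
    return set(words), {joined[i:i + 2] for i in range(len(joined) - 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

def score_query(query, exemplars, classes, max_n=5):
    """exemplars: list of (text, class_label) training strings."""
    words = query.lower().split()
    word_class = {w: {c: 0.0 for c in classes} for w in words}   # word-by-class map
    prepared = [(bow_bob(text), len(text.split()), label) for text, label in exemplars]
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            g_bow, g_bob = bow_bob(" ".join(gram))
            scored = []
            for (e_bow, e_bob), e_len, label in prepared:
                if not (n <= e_len <= 4 * n):                    # exemplar length filter n..4n
                    continue
                sim = max(jaccard(g_bow, e_bow), jaccard(g_bob, e_bob))
                scored.append((sim, label))
            # keep the top n similarity values for this n-gram
            for sim, label in sorted(scored, key=lambda x: x[0], reverse=True)[:n]:
                for w in gram:                                   # add the score to every word in the n-gram
                    word_class[w][label] += sim
    return word_class
```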
6.2.3 Exemplar-based Sentence Classification
The second objective of this thesis requires the classification of textual units from transcriptions into one of the SOAP categories; we consider the sentence as the textual unit. To achieve our objective, we develop an exemplar-based sentence classification algorithm on the principles of exemplar-based concept detection, extending it to work on whole sentences and to provide confidence scores for each class. We use this algorithm because of its ability to exploit word sequences. Moreover, since the algorithm works on exemplars, class imbalance does not have a large impact: as it selects the top n scores for each n-gram, it can perform optimally as long as there are at least n exemplars available for each class.
Juckett made many assumptions during the development of the concept detection algorithm [102]. Since we adapt and extend this algorithm for a different problem and situation, we identify four key areas that could become issues. We therefore investigate each of those areas to find a suitable solution and then treat each area as an independent variable to test the impact of our solution with respect to the original implementation. Details for each identified area are given below.
1) Stop Words
Stop word removal is a widely accepted and effective pre-processing step in various NLP problems [121]. However, the concept detection algorithm does not remove stop words in its processing steps, and Juckett [102] does not provide any justification for keeping them. Therefore, in this work, we consider the removal of stop words, since it has been shown to yield significant performance improvements in text categorization tasks [122]. We add a step to remove stop words and make it optional in our classification algorithm by exposing it as an input parameter.
2) Maximum word n-gram length limit
In the concept detection algorithm, the query sentence is converted into word n-grams of varying lengths from 1-gram to a maximum of 5-grams. Two reasons were given for limiting the maximum n-gram length to the 5th order. The first is that the dataset on which Juckett [102] experimented had shorter exemplars, where two-thirds of the exemplars were of word length five or less. The second reason is that each increase in word n-gram length increases the computation time exponentially. In our situation, the dataset has sentences with an average length of 13 words, and two-thirds of the sentences are about 17 words or shorter. By the same reasoning, we should look at 17-gram exemplars. However, since the magnitude of the difference is more than double, a limit of 17-grams could cause the algorithm to run for much longer periods. Therefore, we choose not to hard-code the maximum word n-gram length in the algorithm; instead, we expose the limit as an input parameter. There are two reasons for this approach. First, it makes it easier to evaluate with varying maximum lengths. Second, it enables adapting the algorithm to datasets with sentences of different average lengths. A small sketch of this parameterized n-gram generation is given below.
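The following is a minimal sketch of overlapping word n-gram generation with the limit exposed as a parameter rather than a hard-coded order of 5; the example sentence is only illustrative.

```python
def word_ngrams(sentence, max_n):
    """Overlapping word n-grams up to a configurable maximum order."""
    words = sentence.split()
    grams = []
    for n in range(1, min(max_n, len(words)) + 1):
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams

# word_ngrams("ears were normal", max_n=2)
# -> ['ears', 'were', 'normal', 'ears were', 'were normal']
```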
3) Exemplar type
The concept detection algorithm keeps full text strings in the exemplars. Juckett [102] does not report the considerations taken while defining exemplars. As we extend the algorithm to our problem domain, we recognize that keeping full text strings in exemplars can cause some problems.
The current approach creates one exemplar per text string from the training text. We have a small dataset in which sentence length ranges from 3 to 49 words. These text strings come from natural speech and can include long compound sentences. Creating one exemplar for each of these longer text strings misses the opportunity to create multiple exemplars. Moreover, due to their length, such text strings will be skipped in many comparisons with shorter word n-grams, as the algorithm only selects exemplars of length n to 4n for comparison with any word n-gram.
 | Size | Sentence
A | 5 | Past medical history is unremarkable
B | 5 | Immunizations are up to date
C | 11 | Past medical history shows immunizations are up to date and unremarkable
As an example, consider sentences A, B, and C above: sentence C contains the semantics of both A and B. However, if sentence C is used as an exemplar and sentences A and B are used as queries, they will get lower scores, since there will only be a partial match between the exemplar and the query. Furthermore, shorter word n-grams from the query text, specifically 1-grams and 2-grams, will get scores of 0, since the exemplar will not be selected for comparison as its length is greater than 4 (n=1, 4n=4) and 8 (n=2, 4n=8).
To solve these problems, we consider using word n-gram based exemplars. The text strings from the training set are converted into word n-grams and stored with their class assignment as exemplars. This way, each overlapping word n-gram (sub-string) from the training text string gets an opportunity to be represented. In our algorithm, we implement both types of exemplars and provide an input parameter to choose the exemplar type; a small sketch of the n-gram exemplar construction follows.
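The sketch below shows the word n-gram based exemplar construction just described: one exemplar per overlapping sub-string of the training sentence, rather than one exemplar per whole sentence. It is an illustrative sketch of the idea, not the exact implementation.

```python
def ngram_exemplars(sentence, label, max_n=5):
    """Create one exemplar per overlapping word n-gram of a training sentence,
    instead of a single whole-sentence exemplar."""
    words = sentence.lower().split()
    exemplars = []
    for n in range(1, min(max_n, len(words)) + 1):
        for i in range(len(words) - n + 1):
            exemplars.append((" ".join(words[i:i + n]), label))
    return exemplars

# Sentence C above now also yields sub-strings that match queries A and B,
# e.g. "past medical history" and "up to date", each stored as its own exemplar.
```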
4) Similarity function
The concept detection algorithm uses the Jaccard Index as the similarity measure, which does not take word counts or word sequence into account in the similarity calculation. Cosine similarity [123], on the other hand, is built upon the idea of tf-idf, which takes the count of each dimension into the calculation. Both of these similarity measures are used extensively in the information retrieval domain. However, a comparison of documents retrieved from Google search showed that cosine similarity is the most relevant metric for comparing text documents [124]. As we deal with natural speech in our dataset, we hypothesize that cosine similarity will improve classification performance in comparison with the Jaccard Index. Therefore, we implement both in our algorithm and provide an input parameter to select either one; both measures are sketched below.
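The two similarity options are sketched below. The cosine variant here uses raw term counts rather than a full tf-idf weighting, so it should be read as an illustration of the contrast (set overlap versus count-weighted vectors), not as the precise formulation used in our implementation.

```python
from collections import Counter
import math

def jaccard_sim(a_tokens, b_tokens):
    a, b = set(a_tokens), set(b_tokens)            # ignores how often a token occurs
    return len(a & b) / len(a | b) if (a | b) else 0.0

def cosine_sim(a_tokens, b_tokens):
    a, b = Counter(a_tokens), Counter(b_tokens)    # keeps per-token counts
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```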
After investigating all four key areas, we extend the concept detection algorithm into a classifier by applying a two-step process to the word-by-class map output. In the first step, a single score is calculated for each class by adding the scores of all words. In the second step, the scores of all classes are normalized to obtain confidence scores in the range of 0 to 1. The pseudocode for the extended algorithm is shown in Figure 6.3.
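Since Figure 6.3 gives the full pseudocode, the sketch below only illustrates the two-step extension itself. The normalization shown (scaling by the largest class score) is one plausible way to map scores into the 0 to 1 range and is an assumption of this sketch, not necessarily the exact scheme used.

```python
def confidence_scores(word_class_map):
    """Two-step collapse of a word-by-class map into per-class confidence scores."""
    # Step 1: sum the scores of all words for each class.
    totals = {}
    for class_scores in word_class_map.values():
        for cls, val in class_scores.items():
            totals[cls] = totals.get(cls, 0.0) + val
    # Step 2: normalize the class scores into the range 0..1
    # (here by dividing by the largest class score; an illustrative choice).
    peak = max(totals.values()) if totals else 0.0
    if peak == 0.0:
        return {cls: 0.0 for cls in totals}
    return {cls: val / peak for cls, val in totals.items()}
```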
6.2.4 Summary
To achieve our objective, we extend the exemplar-based concept detection algorithm for SOAP classification. Five changes are made in this regard. First, an input parameter is added to optionally remove stop words. Second, the maximum word n-gram length is made variable by taking a maximum length as input. Third, a word n-gram based exemplar type is developed, and an input parameter is provided to select the exemplar type. Fourth, an input parameter is provided to choose the similarity metric: either the Jaccard Index or cosine similarity. Fifth, the word-by-class map output is processed with a two-step procedure to calculate a single confidence score per class for the query.
6.3 Evaluation
6.3.1 Experimental Setup
We implemented our exemplar-based sentence classifier and then evaluated it on our dataset using four independent variables that correspond to the four proposed improvements. These variables are directly derived from the key areas that we identified and enhanced. One of the independent variables is the maximum word n-gram length limit, for which we did not fix any value; we therefore experimented with lengths from 1 to 15. All other independent variables have two conditions each, which gives us 8 cases for the experiments. Each case is given a number; Table 6.5 lists them all. Each case is run with every maximum word n-gram length limit, giving 120 total conditions to experiment with (8 cases x 15 maximum word n-gram length limits).
Table 6.5 Cases based on Stop Word, Similarity Function and Exemplar type
Cases | Stop Word | Similarity Function | Exemplar Type
1 | w/ Stop Words | Cosine | Sentence
2 | w/ Stop Words | Cosine | n-gram
3 | w/ Stop Words | Jaccard | Sentence
4 | w/ Stop Words | Jaccard | n-gram
5 | w/o Stop Words | Cosine | Sentence
6 | w/o Stop Words | Cosine | n-gram
7 | w/o Stop Words | Jaccard | Sentence
8 | w/o Stop Words | Jaccard | n-gram
For each condition, experiments are done in a one-vs-rest manner, which means that for each SOAP category, a separate binary classifier is trained and tested. For instance, when experimenting with the 'Subjective' category, all subjective sentences in the dataset are given a positive label (1) and all other sentences are given a negative label (0); the same is repeated for all other categories. Although our classifier can perform multi-class classification, we chose this strategy because the SOAP categories have a very thin line of separation, which can affect classification performance with our imbalanced dataset. As an example, the sentence "we prescribed him ..." should be an assessment, whereas, with a subtle change, "we will prescribe him ..." becomes a plan; yet in our dataset, both of these categories have limited sentences. We also note that Mowery [101] applied the same evaluation strategy for the same reasons.
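A minimal sketch of the one-vs-rest label construction is shown below; the two sentences are illustrative placeholders.

```python
def one_vs_rest_labels(labeled_sentences, positive_category):
    """Binary labels for one one-vs-rest run: 1 for the target SOAP category, 0 otherwise.

    labeled_sentences: list of (sentence, soap_category) pairs.
    """
    X = [sent for sent, _ in labeled_sentences]
    y = [1 if cat == positive_category else 0 for _, cat in labeled_sentences]
    return X, y

X, y = one_vs_rest_labels(
    [("lungs were clear", "Objective"), ("follow up as necessary", "Plan")],
    positive_category="Plan",
)  # y == [0, 1]
```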
We conducted all experiments with 5x5 cross-validation: the dataset was randomly shuffled, 5-fold cross-validation was performed, and this process was repeated four more times. For each SOAP category, 25 results were retrieved, which in total gives 100 results per condition. Ground truth, confidence score, and condition details were stored for all experiments.
The confidence scores from our classifier represent the predicted probability of having a positive label. In this work, we did not specify any mechanism for selecting a threshold for the binary prediction (positive or negative). Therefore, to evaluate our classifier and compare all the experimental conditions, we used the Precision-Recall Curve (PRC), since it provides a more accurate evaluation for imbalanced datasets than the Receiver Operating Characteristic (ROC) [125].
The Area Under the PRC (AUPRC) is summarized by Average Precision (AP), where higher AP means better classification performance. In this work, we used AP as the main performance measure, which also serves as the dependent variable for all our independent variables. We used regression testing to analyze the impact of each independent variable and, based on the results, selected the most suitable combination of independent variables. We then thoroughly analyzed the classification performance of the experimental condition corresponding to the selected variables. Three of our independent variables (stop words, similarity function, and exemplar type) are categorical, though each has only two values; therefore, we converted them into numerical variables for regression testing.
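The sketch below shows how the 5x5 cross-validation and the AP evaluation described above fit together, using scikit-learn utilities. The classify_confidence callable is a hypothetical stand-in for our exemplar-based classifier, and X and y are placeholders for the labeled sentences, so this is an outline of the evaluation loop rather than our exact experiment code.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold
from sklearn.metrics import average_precision_score, precision_recall_curve

def evaluate(X, y, classify_confidence, seed=0):
    """Run 5-fold cross-validation repeated 5 times and compute AP over all folds."""
    X, y = np.asarray(X), np.asarray(y)
    rkf = RepeatedKFold(n_splits=5, n_repeats=5, random_state=seed)
    truths, confidences = [], []
    for train_idx, test_idx in rkf.split(X):
        # Build exemplars from the training fold, score the held-out fold.
        scores = classify_confidence(X[train_idx], y[train_idx], X[test_idx])
        truths.extend(y[test_idx])
        confidences.extend(scores)
    ap = average_precision_score(truths, confidences)        # area under the PRC
    precision, recall, _ = precision_recall_curve(truths, confidences)
    return ap, precision, recall
```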
We analyzed results from all SOAP categories separately as well as in combination. To analyze the overall performance, we computed both micro- and macro-averages. In micro-averaging, precision and recall are calculated directly from the pooled results, whereas in macro-averaging, precision and recall are calculated first for individual categories and then averaged.
Table 6.6 Conversion of categorical variables to numeric
Independent Variable | Categorical Value | Numeric Value
Maximum word n-gram Length Limit | - | 1 ... 15
Stop Words | w/ Stop Words | 0
Stop Words | w/o Stop Words | 1
Similarity Function | Cosine | 0
Similarity Function | Jaccard | 1
Exemplar Type | Sentence | 0
Exemplar Type | n-gram | 1
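Using the numeric encoding of Table 6.6, the regression test for one SOAP category can be set up as an ordinary least squares fit of AP against the four independent variables, for example with statsmodels as sketched below. The values in the data frame are hypothetical placeholders, not results from our experiments.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical results table: one row per experimental condition, with the
# numeric encodings from Table 6.6 and a placeholder AP score.
df = pd.DataFrame({
    "max_ngram":  [1, 5, 10, 15, 1, 5, 10, 15],
    "stop_words": [0, 0, 0, 0, 1, 1, 1, 1],
    "similarity": [0, 1, 0, 1, 0, 1, 0, 1],
    "exemplar":   [0, 0, 1, 1, 0, 0, 1, 1],
    "ap":         [0.61, 0.63, 0.60, 0.58, 0.55, 0.56, 0.54, 0.52],
})
X = sm.add_constant(df[["max_ngram", "stop_words", "similarity", "exemplar"]])
model = sm.OLS(df["ap"], X).fit()
print(model.summary())  # reports a p-value and t-statistic per independent variable
```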
In this evaluation, we focus on six main questions.
1. What is the baseline performance of the classifier before improvements?
2. What is the optimal maximum word n-gram length limit for SOAP classification?
3. Does the removal of stop words enhance classification performance?
4. What is the best exemplar type for SOAP classification?
5. What is the best similarity function for SOAP classification?
6. What is the performance of the classifier using optimal conditions?
6.3.2 Results
We experimented with all 120 conditions and analyzed the results to answer the six
questions. In this section, we first analyze the baseline performance of the classifier before
applying any improvement. Then we analyze the impact of each improvement on the
classification performance and select the optimal values for each improvement category.
Finally, we analyze the performance of the classifier on the conditions that correspond to
the optimal values.
• BASELINE PERFORMANCE
We consider case 3 with a maximum word n-gram length limit of 5 as the baseline condition, as this condition is equivalent to the classifier without any of our improvements. In this condition, we observed an overall micro-averaged AP score of 0.886, with AP scores of 0.932 for the 'Subjective', 0.911 for the 'Objective', 0.563 for the 'Assessment', and 0.765 for the 'Plan' category. Figure 6.4 shows the AUPRC for the baseline condition.
Figure 6.4 Precision-Recall curve of the baseline condition
Table 6.7 compiles a list of example sentences along with their confidence scores from the baseline condition and their actual SOAP category. We highlight the top confidence scores, while an underline indicates that the highest confidence score was given to the actual SOAP category. The list shows 3 examples each from the 'Subjective' and 'Objective' categories, and 2 examples each from the 'Assessment' and 'Plan' categories.
Table 6.7 Confidence scores of sentences from the baseline condition
# | Sentence | Subjective | Objective | Assessment | Plan | Actual SOAP Category
1 | paragraph past medical history immunizations are up to date | 0.88791 | 0.0884 | 0.04866 | 0.05589 | Subjective
2 | immunizations up to date and no significant medical issues | 0.83285 | 0.14976 | 0.11717 | 0.11628 | Subjective
3 | child was already starting to show return to normal function with crying and purposeful movements | 0.46867 | 0.47539 | 0.41664 | 0.35019 | Subjective
4 | ear nose and throat exam revealed a runny nose mildly red throat and normal ears | 0.17412 | 0.8412 | 0.11327 | 0.13244 | Objective
5 | new paragraph family sorry physical exam revealed entirely normal neurological exam with no focal findings | 0.29106 | 0.69507 | 0.13999 | 0.14118 | Objective
6 | blood work was taken four hours after ingestion revealing non toxic acetaminophen levels | 0.59756 | 0.23034 | 0.41212 | 0.14035 | Objective
7 | a diagnosis was croup treatment | 0.3143 | 0.30527 | 0.71247 | 0.18112 | Assessment
8 | i note no allergies | 0.66861 | 0.32717 | 0.19232 | 0.07669 | Assessment
9 | child will follow up with family doctor | 0.23159 | 0.17391 | 0.12547 | 0.74356 | Plan
10 | a throat swab was sent today should it be positive we will call her and start her on antibiotics | 0.3523 | 0.46499 | 0.41026 | 0.3531 | Plan
• ANALYZING THE IMPACT OF IMPROVEMENTS
Across all experimental conditions, the best overall performance was achieved by case 1 with a maximum word n-gram length limit of the 5th order, which achieved the highest micro-averaged AP score of 0.897. However, we observed that for individual SOAP categories, different conditions performed better, and no single condition was optimal for all SOAP categories. Table 6.8 lists the highest and lowest AP scores we observed for each SOAP category along with the overall combined scores. The magnitude of the differences highlights that our suggested improvements have a major impact on performance, especially for the assessment and plan categories.
Table 6.8 Highest and Lowest AP scores for SOAP categories
SOAP Category | Highest AP | Lowest AP | Difference
Subjective | 0.957 | 0.881 | 0.076
Objective | 0.919 | 0.836 | 0.083
Assessment | 0.626 | 0.396 | 0.230
Plan | 0.820 | 0.705 | 0.115
Micro-Averaged | 0.897 | 0.812 | 0.085
Macro-Averaged | 0.808 | 0.716 | 0.092
Results from regression tests on all 120 conditions showed that all independent variables influenced classification performance, except the similarity function, which was consistently insignificant for all SOAP categories. Table 6.9 summarizes the results of the regression testing.
Table 6.9 Summary of regression analysis results
SOAP Category | Max n-gram Length (P-value, T Stat) | Stop Words (P-value, T Stat) | Similarity Function (P-value, T Stat) | Exemplar Type (P-value, T Stat)
Subjective | 1.125E-08, 6.155 | 1.554E-19, -10.944 | 0.512, 0.658 | 0.0725, -1.812
Objective | 2.988E-07, 5.444 | 3.503E-10, -6.869 | 0.189, 1.321 | 2.920E-27, -14.292
Assessment | 0.046, -2.019 | 1.036E-21, -11.874 | 0.375, -0.891 | 0.011, -2.584
Plan | 0.009, -2.625 | 0.002, 3.215 | 0.068, -1.841 | 0.039, -2.090
Micro-Average | 3.375E-06, 4.885 | 2.783E-32, -16.585 | 0.284, 1.077 | 2.596E-11, -7.386
Macro-Average | 0.984, -0.020 | 3.292E-22, -12.087 | 0.256, -1.141 | 2.490E-09, -6.469
• IMPACT OF MAXIMUM N-GRAM LENGTH LIMIT
Regression analysis showed that changing the maximum n-gram length limit significantly changed performance; however, the direction of change varied across SOAP categories. We observed that sentences from the 'Subjective' and 'Objective' categories were better classified with higher-order n-grams, whereas the 'Assessment' and 'Plan' categories favored shorter n-grams. Figure 6.5 draws a 6th-degree polynomial trend line after plotting all AP scores for all cases on the ordinate against the maximum n-gram length on the abscissa.
Figure 6.5 All AP scores over n-gram max length
The trend lines validate the regression analysis; however, they also highlight diminishing gains in AP for the 'Subjective' and 'Objective' categories as n-gram length increases. For the 'Assessment' category, we see a continuous decline of performance along n-gram lengths. The 'Plan' category, however, shows inconsistency: performance rapidly increased for the first few increments in n-gram length, then started to fall, though not by much. These opposite behaviors of the SOAP categories with regard to the maximum n-gram length explain the high P-value (0.984) for the macro-averaged scores.
All 8 cases mostly followed the same trend; however, we did observe some anomalies. In the 'Subjective' category, we marked 3 cases with slight differences: in cases 3 and 8, we saw another spike in scores at higher values of the maximum n-gram length, and in case 5 specifically, we observed that n-gram length had no significant impact. We made the same observation for the 'Objective' category in case 5. For the 'Assessment' category, cases 5 and 6 showed no impact from n-gram length, and case 8 was observed to favor larger n-grams. We observed similar behavior in the 'Plan' category, where cases 5 and 8 showed no impact, whereas case 6 favored larger n-grams. Overall, case 5 (without stop words, cosine similarity, sentence exemplars) appeared to be immune to n-gram length; we confirmed this observation by applying a t-test to the AP scores, which gave a P-value of 0.196. After analyzing the scores, we realized that there cannot be a single selection for the maximum n-gram length; therefore, we picked the best performing maximum n-gram length for each case (Table 6.10).
Table 6.10 Best performing values for maximum n-gram length for each case
SOAP Categories ↓ | All Cases | Case 1 | Case 2 | Case 3 | Case 4 | Case 5 | Case 6 | Case 7 | Case 8
Subjective | 8 | 8 | 9 | 14 | 6 | 11 | 9 | 14 | 15
Objective | 11 | 11 | 5 | 11 | 7 | 6 | 7 | 12 | 12
Assessment | 1 | 1 | 3 | 1 | 2 | 15 | 15 | 1 | 15
Plan | 6 | 2 | 3 | 2 | 2 | 14 | 8 | 6 | 8
Micro-Average | 5 | 5 | 5 | 14 | 7 | 15 | 9 | 10 | 15
Macro-Average | 3 | 3 | 3 | 6 | 2 | 15 | 15 | 6 | 15
• IMPACT OF STOP WORDS REMOVAL
It is evident from the regression analysis that stop word removal significantly changed the classification performance for all categories. However, for all categories except 'Plan', the direction of change was negative (negative t-statistic). This means that only the performance of the 'Plan' category increased by removing stop words, whereas all other categories performed better when stop words were kept in the sentences. Figure 6.6 shows box plots of the AP scores grouped by with and without stop words.
In Figure 6.6, all box plots are placed in the same sequence as shown in the legend on the left. From left to right, the first pair of plots represents the results for all SOAP categories combined, the second pair represents only the 'Subjective' category, the third pair the 'Objective' category, the fourth pair the 'Assessment' category, and the fifth pair the 'Plan' category. The blue-shaded plots (left in each pair) are made from the performance scores when stop words were kept in the sentences, while the green-shaded plots (right in each pair) are from the scores when stop words were removed. The x mark within each plot shows the mean, and the dots outside the plots show outliers. From these plots we can see that although 'Plan' sentences were better classified without stop words, this performance was not very reliable, since the plot spread is wider and there are some outliers.
Figure 6.6 Box plots of scores after removing stop words
• IMPACT OF SIMILARITY FUNCTIONS
We observed no significant change in performance due to the similarity function. In the regression analysis, P-values for all categories were consistently above 0.05, so the null hypothesis cannot be rejected. Figure 6.7 shows the box plots of the cases grouped by cosine similarity and the Jaccard index; the plots follow the same layout as in the previous figure.
Figure 6.7 Comparing performance based on similarity function (Cosine vs Jaccard)
• IMPACT OF EXEMPLAR TYPES
In our analysis, sentence-based exemplars were better than n-gram based exemplars for all SOAP categories. Figure 6.8 shows the box plots of the scores grouped by exemplar type: blue-shaded plots are for sentence-based exemplars and green-shaded plots for n-gram exemplars. These plots validate the results of the regression analysis, as the mean and median of each blue-shaded plot are higher than those of the corresponding green-shaded plot.
Figure 6.8 Comparing performance based on exemplar type (sentence vs n-gram)
• OPTIMAL PERFORMANCE
The regression analysis showed that classification performance varied across individual SOAP categories. Therefore, we inspected the optimal performance for each category separately. We selected the conditions with the highest AP score for each category and analyzed the improvements included in those conditions. We also calculated F1 scores for the selected conditions using varying threshold levels; the highest F1 score is reported for each category. Table 6.11 provides a summary of the optimal conditions for each SOAP category along with the magnitude of improvement compared to the baseline performance.
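The F1 scores mentioned above come from sweeping a decision threshold over the confidence scores; a minimal sketch of such a sweep is shown below, with the ground-truth labels and confidence scores as placeholders.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_f1(y_true, confidences, thresholds=None):
    """Sweep binary decision thresholds over the confidence scores and return
    the best F1 together with the threshold that produced it."""
    thresholds = np.linspace(0.05, 0.95, 19) if thresholds is None else thresholds
    confidences = np.asarray(confidences)
    results = [(f1_score(y_true, (confidences >= t).astype(int)), t) for t in thresholds]
    return max(results)

print(best_f1([1, 0, 1, 0], [0.9, 0.4, 0.7, 0.2]))  # e.g. (1.0, 0.5)
```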
Table 6.11 Summary of optimal conditions for each SOAP category
SOAP Category | Base AP Score | Max AP Score | AP Improved | Optimal Maximum word n-gram length limit | Best Case
Subjective | 0.932 | 0.957 | 0.025 | 8 | 1
Objective | 0.911 | 0.919 | 0.008 | 11 | 3
Assessment | 0.563 | 0.626 | 0.063 | 1 | 1
Plan | 0.765 | 0.82 | 0.055 | 6 | 7
For the 'Subjective' category, we achieved the best performance (AP: 0.957) with case 1 and a maximum word n-gram length limit of the 8th order. This means that when our classifier is set to break query sentences into word n-grams up to 8-grams, use sentence-based exemplars, and keep stop words, it classifies 'Subjective' sentences best. Since the similarity function has no significant effect, we selected the condition with cosine similarity, as its score was moderately better. Figure 6.9 shows the AUPRC for this condition, where we achieved an F1 score of 0.912.
Figure 6.9 Precision-Recall Curve of best performing condition for Subjective
Table 6.12 lists some example sentences with their actual SOAP category and the confidence scores obtained from our classifier under this condition. The top 6 examples show the behavior of this condition on 'Subjective' sentences, while the rest show its behavior on other categories. We highlight the top-scoring category, and an underline indicates that our classifier predicted the actual category for the given sentence.
Table 6.12 Confidence scores of sentences from case 1 using 8-gram limit
# | Sentence | Subjective | Objective | Assessment | Plan | Actual SOAP Category
1 | no concussive findings | 0.80509 | 0.296 | 0.25021 | 0 | Subjective
2 | immediately afterwards paramedics were called on arrival | 0.4717 | 0.3464 | 0.50901 | 0.23972 | Subjective
3 | no medications | 0.75963 | 0.24037 | 0.211 | 0 | Subjective
4 | past medical history reveals a healthy athletic male with no significant medical issues | 0.75253 | 0.25061 | 0.22631 | 0.17429 | Subjective
5 | by the time she arrived at the IWK emergency she was awake and alert and cranky | 0.51804 | 0.41937 | 0.33357 | 0.38002 | Subjective
6 | family history reveals febrile seizures in dad as a child | 0.51926 | 0.40717 | 0.30884 | 0.38605 | Subjective
7 | tracheal palpation did not elicit pain | 0.76971 | 0.39352 | 0.15284 | 0.24594 | Objective
8 | lungs were clear | 0.18333 | 0.77103 | 0.24431 | 0.19585 | Objective
9 | the likely choice of antibiotics will be amoxicillin | 0.36357 | 0.37519 | 0.42211 | 0.57157 | Plan
10 | diagnosis fracture radius ulna | 0 | 0.55951 | 0.60422 | 0 | Assessment
For the 'Objective' category, the best score (AP: 0.919) was achieved by the condition using case 3 and a maximum word n-gram length limit of 11, where we also achieved an F1 score of 0.846. Figure 6.10 illustrates the AUPRC for this condition. Table 6.13 lists some examples from this condition; the top 5 examples show the behavior of this condition on the 'Objective' category, and the rest show its performance on other SOAP categories.
Table 6.13 Confidence scores of sentences from case 3 using 11-gram limit
# | Sentence | Subjective | Objective | Assessment | Plan | Actual SOAP Category
1 | ear nose and throat exam revealed a red throat with no exudate suggestive of a viral pharyngitis | 0.27987 | 0.72597 | 0.18146 | 0.1777 | Objective
2 | blood work was taken four hours after ingestion revealing non toxic acetaminophen levels | 0.62871 | 0.20222 | 0.37966 | 0.14458 | Objective
3 | rest of the physical exam was without comment | 0.20491 | 0.83209 | 0.15358 | 0.24458 | Objective
4 | new paragraph family sorry physical exam revealed entirely normal neurological exam with no focal findings | 0.28433 | 0.71114 | 0.12414 | 0.12644 | Objective
5 | his neurovascular status in the effected limb is normal | 0 | 0.67539 | 0 | 0 | Objective
6 | past medical history reveals immunizations up to date and normal birth and pregnancy history with no specialist and no regular medications and no known allergies | 0.76841 | 0.22902 | 0.15446 | 0.13135 | Subjective
7 | diagnosis fracture radius ulna | 0 | 0.49184 | 0.69675 | 0 | Assessment
8 | chest xray revealed a dense left lower consolidation | 0.34847 | 0.69141 | 0.21597 | 0.11372 | Assessment
9 | child will follow up with family doctor | 0.29484 | 0.15295 | 0.0951 | 0.7286 | Plan
10 | parents were counseled and reassured | 0.35712 | 0.33844 | 0.61333 | 0.29793 | Plan
Figure 6.10 Precision-Recall Curve of best performing condition for Objective
For the 'Assessment' category, case 1 with a maximum word n-gram length limit of the 1st order excelled (AP: 0.626) among all conditions and achieved an F1 score of 0.63. Figure 6.11 depicts the AUPRC for this condition. Table 6.14 lists example sentences from this condition, showing the highlighted confidence scores along with the actual SOAP category. The top 5 examples show the behavior of this condition on 'Assessment' category sentences, while the rest show the behavior on other categories.
Table 6.14 Confidence scores of sentences from case 1 using 1-gram limit
# | Sentence | Subjective | Objective | Assessment | Plan | Actual SOAP Category
1 | a diagnosis was croup treatment | 0.39702 | 0.299 | 0.71162 | 0.18063 | Assessment
2 | diagnosis is viral respiratory tract infection | 0.41961 | 0.33547 | 0.64259 | 0.10616 | Assessment
3 | croup instructions were delivered | 0.41324 | 0.5422 | 0.54945 | 0.33527 | Assessment
4 | i note no allergies | 0.65644 | 0.35877 | 0.31561 | 0.14695 | Assessment
5 | diagnosis viral upper respiratory tract infection with febrile seizure | 0.44084 | 0.31599 | 0.52301 | 0.18636 | Assessment
6 | immediately afterwards paramedics were called on arrival | 0.4717 | 0.3464 | 0.50901 | 0.23972 | Subjective
7 | new paragraph impression is that this is simple febrile seizure | 0.49094 | 0.33401 | 0.5078 | 0.33077 | Objective
8 | post reduction film showed good placement of the bones | 0.5516 | 0.49871 | 0.49488 | 0.31146 | Objective
9 | child was hydrated and happy | 0.44425 | 0.55868 | 0.4596 | 0.40219 | Objective
10 | parents were counseled and reassured | 0.40641 | 0.38024 | 0.54527 | 0.38648 | Plan
Figure 6.11 Precision-Recall Curve of best performing condition for Assessment
Figure 6.12 Precision-Recall Curve of best performing condition for Plan
For the 'Plan' category, case 7 with a maximum word n-gram length limit of the 6th order was the best performing condition (AP: 0.82), achieving an F1 score of 0.805. Figure 6.12 presents the AUPRC for this condition. Table 6.15 shows example sentences for this condition; the top 6 examples show the behavior of this condition on 'Plan' category sentences, while the others show the behavior on other categories.
Table 6.15 Confidence scores of sentences from case 7 using 6-gram limit
# | Sentence | Subjective | Objective | Assessment | Plan | Actual SOAP Category
1 | follow up as necessary | 0 | 0 | 0 | 1 | Plan
2 | child will follow up with family doctor | 0.19748 | 0.27724 | 0.17992 | 0.73046 | Plan
3 | i have asked mom to follow up with me should she fail to improve over the next few days or should she worsen in any way | 0.39493 | 0.1183 | 0 | 0.70123 | Plan
4 | child was advised to return if worsening occurred or if she failed to improve significantly over the course of the next seven days | 0.50909 | 0.27323 | 0.12239 | 0.59619 | Plan
5 | child will follow up with family doctor if there is any other issues | 0.2562 | 0.2489 | 0.16498 | 0.68693 | Plan
6 | follow up in the next two to three days if she does not improve | 0.6324 | 0.17638 | 0.16543 | 0.49534 | Plan
7 | her parents are presented with her to the emergency department for further assessment | 0.47742 | 0.13096 | 0.36867 | 0.52188 | Subjective
8 | child was already starting to show return to normal function with crying and purposeful movements | 0.5657 | 0.35864 | 0.31614 | 0.47694 | Subjective
9 | she presented to a walk in clinic where she was told that she did not have a concussion | 0.53009 | 0.36812 | 0.36995 | 0.54721 | Subjective
10 | instructions were given with regards to the delivery of ibuprofen as needed for pain and fever but the parents were cautioned that this would not reduce the risk of further seizures | 0.43787 | 0.35454 | 0.3649 | 0.44165 | Objective
6.4 Discussion
In this evaluation, we used 238 sentences to evaluate our classifier, far fewer than the 4130 sentences that Mowery [101] used in their study; nevertheless, we achieved comparable performance in almost all SOAP categories.
The performance of our classifier improved significantly for 'Plan' category sentences when we applied the stop word removal operation. We believe this change is specifically due to the nature of 'Plan' sentences: in this category, physicians usually talk in the future tense, for example, "Child will follow up ..." and "We will perform ...". In this case, when stop words are removed, the grammatical structure of a sentence becomes more prominent and highlights keywords like 'will' and 'shall' that are the key identifiers of the future tense.
In our experiments, we also noticed a significant performance change with varying maximum n-gram lengths, ranging from the lowest overall AP score of 0.81 to the highest of 0.89. We also observed that each SOAP category behaves differently with different n-gram lengths. Juckett [102] implemented the concept detection algorithm with the maximum n-gram length equal to 5 for each class, owing to the small phrase lengths of their dataset. Following the same principle, we expected better performance with longer n-grams.
In this work, we also evaluated the Jaccard index as the similarity function in comparison to cosine similarity. We argued that because the Jaccard Index only considers the union of all lexicons and is unaffected by repeated words, cosine similarity should work better. However, our argument was not supported by the experimental results, which made it evident that the similarity function has no significant impact on performance.
6.5 Conclusion
This chapter provides an approach that uses an exemplar-based concept detection algorithm to develop a SOAP classifier. The classifier was implemented and experimented with four independent variables in a total of 120 conditions, where each condition was tested for each of the SOAP categories separately. These experiments were designed to give directional insight into our approach; due to the small dataset, they do not provide a definitive conclusion. With the provided dataset, we make two suggestions. The first is to use longer n-grams for better classification in the 'Subjective' and 'Objective' categories and shorter n-grams for the 'Assessment' and 'Plan' categories; the drawback of longer n-grams is that processing time increases exponentially with each increase in n-gram length. The second suggestion is to build separate classifiers for each SOAP category and to remove stop words specifically for the 'Plan' category classifier.
Chapter 7. DISCUSSION
7.1 Introduction
This chapter constructs a thorough discussion of our work on both of our research objectives. It highlights the observations that we made while performing this research and then presents a 3-layer solution framework based on those observations. Afterward, it reports the limitations of this work and suggests potential improvements that can be made in the future. Finally, the chapter concludes the thesis.
7.2 Observations
We demonstrated our work as a 2-layer solution model, with the goal of an end-to-end solution. However, when we examined the output of the first layer, we found no punctuation marks. The reason is that SR returns speech in free-text format; this text carries no lexical entities other than words, such as punctuation marks or paragraph indentation. On the other hand, Natural Language Processing (NLP) tasks usually work on text data that are rich with punctuation marks. This means that output from SR systems can cause problems for NLP methods; therefore, it is highly desirable to enrich SR output with punctuation marks before passing it to our final layer.
While working with the dictations, we also observed that some of them were recorded while patients were still in the room. For our experiments, this created issues because of the increased level of background noise. As we are working towards an end-to-end solution, we realized that an autonomous report generation solution would allow providers to record dictations anywhere and anytime. Most of the time, this would happen right after the encounter, possibly in front of the patient. In such scenarios, we believe that patients will hear what is said and thus get a chance to clarify any detail, which would help reduce dictation errors. Fratzke [51] observed the same pattern when they implemented a voice-assisted technology.
7.3 3-Layer Solution Framework
In this thesis, we observed that the output of the first layer is not fully compatible with the input of our second layer. Therefore, to develop a true end-to-end solution, we need to specify a processing layer between them whose purpose is to streamline the output of the first layer for the next layer.
Sentence boundaries are marked by a dot, also known as a period, which shows where a sentence ends in the text. In our final layer, we use the sentence as the unit of classification; therefore, transcriptions without punctuation marks will certainly cause issues. A quick review shows that various solutions are available for the punctuation prediction task. We found a punctuation predictor based on a bidirectional recurrent neural network model with an attention mechanism [126] that provides pre-trained models based on Wikipedia text. We tested this solution on our dataset and analyzed the results visually; we observed a few mistakes, though the output generally felt satisfactory. This brief review showed us that punctuation prediction is an established research problem; hence, we should give it a separate focus in our solution.
Figure 7.1 Design for 3-Layer Solution Framework
Based on all the observations we made in this work, we present a potential 3-layer solution framework for the problem of clinical documentation. All of the layers in this framework are connected as a pipeline. Figure 7.1 shows the design of the 3-layer solution framework, and each layer is defined below.
1. The first layer of this framework expects physicians' recorded audio clips. This layer converts the input audio into text using a noise- and domain-robust speech recognition system.
2. In the second layer, the transcriptions from the first layer are analyzed for punctuation. The transcriptions are enriched with predicted punctuation and passed to the third layer.
3. In the third layer, the punctuated transcriptions are classified to build the clinical report: sentences are extracted from the transcriptions and classified into the categories of the report structure. A conceptual sketch of this pipeline follows.
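The sketch below shows the three layers chained as a pipeline. All function names and bodies are hypothetical placeholders for the components discussed above; no existing API is implied.

```python
# Purely conceptual sketch of the 3-layer pipeline; every function here is a
# hypothetical placeholder, not an existing library or our implementation.
def transcribe(audio_path: str) -> str:
    """Layer 1: noise- and domain-robust speech recognition (placeholder)."""
    return "raw transcription without punctuation"

def restore_punctuation(raw_text: str) -> str:
    """Layer 2: punctuation prediction (placeholder)."""
    return raw_text + "."

def classify_soap(punctuated_text: str) -> dict:
    """Layer 3: sentence-level SOAP classification (placeholder)."""
    return {"Subjective": [], "Objective": [], "Assessment": [], "Plan": [punctuated_text]}

def generate_report(audio_path: str) -> dict:
    return classify_soap(restore_punctuation(transcribe(audio_path)))
```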
7.4 Limitations
The size of the dataset is the biggest limitation of this work, and we had to tweak our experiments accordingly. While experimenting with acoustic models, we initially wanted to train standalone models as well for comparison; however, a few training trials exposed the limitations, so we dropped that scenario from our experiments. Similarly, during the SOAP classification, we observed a lower level of overall classification performance. The algorithm that we extended and implemented in this layer was originally tested on a dataset five times the size of ours. Hence, we believe that the small dataset introduced a limitation into our work.
Another limitation of this work is that all of the dataset labeling tasks were done by us, without the help of any domain specialist. We perceived the need for a specialist at two points: first, when generating gold-standard transcriptions for the dictation audio, and second, when labeling the sentences into SOAP categories. While preparing the transcriptions, we took help from a fellow member of our research group who has a medical degree. However, we do not consider this person a domain specialist, since they never practiced in the domain.
7.5 Future Work
This work is primarily experimental; however, it sets a direction for further research. This thesis presents the solution in terms of research areas that combine to form a solution framework. In particular, the solution depends on three research areas: continuous speech recognition, punctuation prediction, and sentence classification. Any advance in any of these research areas has the potential to enhance the presented solution framework.
Open-source offline SR systems are improving at a rapid pace. While working on this thesis, a new open-source offline SR system, Wav2Letter++ [94], was released, reporting better performance than DeepSpeech [76] using similar models and datasets. Wav2Letter++ is developed to scale and to work in real-life environments. At the time of our experimentation, its pre-trained models were not released; however, they are available at the time of writing this thesis. Wav2Letter++ is also based on the concept of end-to-end SR, while providing features to use custom acoustic models of different architectures. Therefore, we consider this toolkit a natural next step towards the enhancement of this work.
At the time of selecting methods for our final layer, we did not explore sentence
classification solutions from other domains. One could argue that this is a limitation, but
we consider it an opportunity to move ahead in a logical manner. Since we were only able to
find one study that addressed the problem of segmenting sentences into SOAP categories, the
next logical step was to explore similar problems within the same (healthcare) domain.
Therefore, we consider that this work provides the basis to explore solution ideas from
other domains. We expect that examining supervised approaches to text classification for
this problem would be appropriate work in this direction; a minimal baseline of this kind is
sketched below.
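As an illustration only, the sketch below trains a generic supervised baseline (TF-IDF
features with logistic regression) on a few invented sentence-category pairs. It is not the
exemplar-based method evaluated in this thesis; a real study would use the labeled thesis
dataset.

# Minimal supervised baseline for SOAP sentence classification: TF-IDF features
# with logistic regression. Sentences and labels are invented for illustration;
# this is a generic baseline, not the exemplar-based method used in this work.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = [
    "patient complains of chest pain since last night",  # Subjective
    "blood pressure is 130 over 85",                     # Objective
    "findings are consistent with stable angina",        # Assessment
    "start aspirin and follow up in two weeks",          # Plan
]
labels = ["Subjective", "Objective", "Assessment", "Plan"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit(sentences, labels)

# Predicted category for a new, unseen sentence (illustrative output only).
print(model.predict(["patient reports shortness of breath"]))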
In the previous subsections, this chapter presented a 3-layer solution framework. We
consider that it provides the opportunity to research the problem of clinical documentation
in a more systematic way. However, the framework still needs active research and evaluation.
Therefore, refining and evaluating this 3-layer solution framework is itself a direction for
future work.
7.6 Conclusion
Clinical documentation is a challenging process that poses multiple problems. Researchers
have been experimenting with speech recognition since its inception to develop efficient
solutions. A review suggests that such solutions still lack adoption among healthcare
providers. One reason is the lack of accuracy in domain-specific, noisy environments.
Another is that providers still need to intervene and make manual corrections. Therefore,
this thesis considers the problem from an end-to-end perspective. It identifies two research
objectives and translates them into individual layers of a solution model.
The first layer seeks efficient transcription of dictations in domain-specific and
acoustically distorted environments. We selected the DeepSpeech speech recognition system
and performed preliminary experiments. We then presented methods to enhance the performance
of both the acoustic and the language models. Evaluations show that all our modeling methods
improve DeepSpeech's performance. We show that by adapting and fine-tuning the pre-trained
acoustic model, along with enhancing the pre-trained language model using a corpus augmented
from the domain-specific dataset, DeepSpeech can achieve performance levels very close to
the Google Speech API; a sketch of how such adapted models could be used at inference time
follows.
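The sketch below shows, under assumptions, how an adapted acoustic model and an enhanced
language-model scorer might be loaded for inference with the DeepSpeech Python package
(0.7-style API). The file names are placeholders, and the exact API may differ from the
DeepSpeech version used in our experiments.

# Sketch of running inference with an adapted DeepSpeech acoustic model and an
# enhanced external language-model scorer (DeepSpeech 0.7-style Python API).
# File names are placeholders; the dictation is assumed to be 16 kHz, 16-bit mono.
import wave
import numpy as np
import deepspeech

model = deepspeech.Model("adapted_acoustic_model.pbmm")  # fine-tuned acoustic model
model.enableExternalScorer("domain_enhanced_lm.scorer")  # scorer built from the
                                                         # augmented domain corpus

with wave.open("dictation_sample.wav", "rb") as wav:
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

print(model.stt(audio))  # raw (unpunctuated) transcription of the dictation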
The second layer aims to categorize transcriptions into SOAP categories. We selected an
exemplar-based concept detection algorithm and extended it to develop a sentence classifier.
We identified four areas of enhancement and implemented our classifier for SOAP
classification. Evaluations show that each SOAP category requires a separate setting for
optimal performance, and that no single set of independent variables delivers the best
performance for all categories; the sketch following this paragraph illustrates the general
idea of exemplar-based sentence classification.
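To make the idea concrete, the sketch below scores a sentence against invented exemplar
sentences of each SOAP category using Jaccard similarity over word n-grams and selects the
best-scoring category. It illustrates the general exemplar-based approach only, not the
exact algorithm and settings evaluated in Chapter 6.

# Illustrative exemplar-based sentence classification: a sentence is assigned to
# the SOAP category whose exemplar shares the most word n-grams with it (Jaccard
# similarity). Exemplar sentences and max_n are invented for this sketch.

def word_ngrams(text, max_n=3):
    """Return the set of all word n-grams of length 1..max_n."""
    words = text.lower().split()
    return {tuple(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

exemplars = {  # hypothetical exemplar sentences per SOAP category
    "Subjective": ["patient reports chest pain since last night"],
    "Objective": ["blood pressure is 130 over 85"],
    "Assessment": ["findings are consistent with stable angina"],
    "Plan": ["start aspirin and follow up in two weeks"],
}

def classify(sentence, max_n=3):
    grams = word_ngrams(sentence, max_n)
    scores = {category: max(jaccard(grams, word_ngrams(e, max_n)) for e in examples)
              for category, examples in exemplars.items()}
    return max(scores, key=scores.get)

print(classify("patient reports shortness of breath and chest pain"))  # "Subjective"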
In this work, we observed that a two-layer solution could not serve as an end-to-end
solution due to the lack of punctuation in the transcriptions. Therefore, this thesis
discusses a 3-layer solution framework that has the potential to become an end-to-end
solution.
REFERENCES
[1] T. K. Colicchio and J. J. Cimino, “Clinicians’ reasoning as reflected in electronic
clinical note-entry and reading/retrieval: a systematic review and qualitative
synthesis,” J. Am. Med. Informatics Assoc., vol. 26, no. 2, pp. 172–184, 2019.
[2] A. Mathioudakis, I. Rousalova, A. A. Gagnat, N. Saad, and G. Hardavella, “How to
keep good clinical records,” Breathe, vol. 12, no. 4, pp. 371–375, Dec. 2016.
[3] B. G. Arndt et al., “Tethered to the EHR: Primary care physician workload
assessment using EHR event log data and time-motion observations,” Ann. Fam.
Med., vol. 15, no. 5, pp. 419–426, 2017.
[4] C. Sinsky et al., “Allocation of physician time in ambulatory practice: A time and
motion study in 4 specialties,” Ann. Intern. Med., vol. 165, no. 11, pp. 753–760,
Dec. 2016.
[5] J. Holbrook and R. Aghababian, “A computerized audit of 15,009 emergency
department records,” Ann. Emerg. Med., vol. 19, no. 2, pp. 139–144, 1990.
[6] T. G. Poder, J. F. Fisette, and V. Déry, “Speech Recognition for Medical Dictation:
Overview in Quebec and Systematic Review,” J. Med. Syst., vol. 42, no. 5, p. 89,
May 2018.
[7] “mozilla/DeepSpeech: A TensorFlow implementation of Baidu’s DeepSpeech
architecture.” [Online]. Available: https://github.com/mozilla/DeepSpeech.
[Accessed: 20-Nov-2019].
[8] M. Amatayakul, “EHR versus EMR: what’s in a name?,” J. Healthc. Financ.
Manag. Assoc., vol. 63, no. 3, p. 24, 2009.
[9] W. Kondro, “CMAJ 2011 election survey: research.,” CMAJ, vol. 183, no. 8, pp.
E463--4, May 2011.
[10] J. H. Seo et al., “A pilot study on the evaluation of medical student documentation:
assessment of SOAP notes,” Korean J. Med. Educ., vol. 28, no. 2, pp. 237–241, Jun.
2016.
[11] M. Gardiner, “Clinical documentation.,” Texas dental journal, 2013. [Online].
Available: https://searchhealthit.techtarget.com/definition/clinical-documentation-
healthcare.
[12] D. R. Maines, Social Organization of Medical Work (Book)., vol. 8, no. 4. Chicago,
IL, US, IL, US: University of Chicago Press, 1986.
[13] K. R. Sando, E. Skoy, C. Bradley, J. Frenzel, J. Kirwin, and E. Urteaga, “Assessment
of SOAP note evaluation tools in colleges and schools of pharmacy,” Curr. Pharm.
Teach. Learn., vol. 9, no. 4, pp. 576–584, Jul. 2017.
[14] J. L. Belden, R. J. Koopman, S. J. Patil, N. J. Lowrance, G. F. Petroski, and J. B.
Smith, “Dynamic electronic health record note prototype: Seeing more by showing
less,” Journal of the American Board of Family Medicine, vol. 30, no. 6. American
Board of Family Medicine, pp. 691–700, 01-Nov-2017.
[15] M. R. Andrus et al., “Development and Validation of a Rubric to Evaluate Diabetes
SOAP Note Writing in APPE,” Am. J. Pharm. Educ., vol. 82, no. 9, p. 6725, 2018.
[16] L. A. Lenert, “Toward Medical Documentation That Enhances Situational
Awareness Learning,” AMIA ... Annu. Symp. proceedings. AMIA Symp., vol. 2016,
pp. 763–771, 2016.
[17] K. M. Lisenby et al., “Ambulatory care preceptors’ perceptions on SOAP note
writing in advanced pharmacy practice experiences (APPEs),” Curr. Pharm. Teach.
Learn., vol. 10, no. 12, pp. 1574–1578, Dec. 2018.
[18] P. F. Pearce, L. A. Ferguson, G. S. George, and C. A. Langford, “The essential
SOAP note in an EHR age,” Nurse Pract., vol. 41, no. 2, pp. 29–36, Feb. 2016.
[19] C. T. Lin, M. McKenzie, J. Pell, and L. Caplan, “Health care provider satisfaction
with a new electronic progress note format: SOAP vs APSO format,” JAMA Internal
Medicine, vol. 173, no. 2. American Medical Association, pp. 160–162, 28-Jan-
2013.
[20] E. L. Siegler and R. Adelman, “Copy and Paste: A Remediable Hazard of Electronic
Health Records,” American Journal of Medicine, vol. 122, no. 6, pp. 495–496, Jun-
2009.
[21] F. Jelinek, R. L. Mercer, and L. R. Bahl, “25 Continuous speech recognition:
Statistical methods,” in Handbook of Statistics, vol. 2, 1982, pp. 549–573.
[22] J. Li, L. Deng, Y. Gong, and R. Haeb-Umbach, “An overview of noise-robust
automatic speech recognition,” IEEE Trans. Audio, Speech Lang. Process., vol. 22,
no. 4, pp. 745–777, Apr. 2014.
[23] W. Ghai and N. Singh, “Literature Review on Automatic Speech Recognition,” Int.
J. Comput. Appl., vol. 41, no. 8, pp. 42–50, Mar. 2012.
[24] K. Jing and J. Xu, “A Survey on Neural Network Language Models.”
[25] A. J. Nathan and A. Scobell, “How China sees America,” Foreign Aff., vol. 91, no.
5, pp. 391–397, 2012.
[26] K. Saxena, R. Diamond, R. F. Conant, T. H. Mitchell, I. G. Gallopyn, and K. E.
Yakimow, “Provider Adoption of Speech Recognition and its Impact on
Satisfaction, Documentation Quality, Efficiency, and Cost in an Inpatient EHR.,”
AMIA Jt. Summits Transl. Sci. proceedings. AMIA Jt. Summits Transl. Sci., vol.
2017, pp. 186–195, 2018.
[27] F. R. Goss et al., “A clinician survey of using speech recognition for clinical
documentation in the electronic health record,” Int. J. Med. Inform., vol. 130, p.
103938, Oct. 2019.
[28] “Home - PubMed - NCBI.” [Online]. Available:
https://www.ncbi.nlm.nih.gov/pubmed/. [Accessed: 25-Nov-2019].
[29] L. Zhou et al., “Analysis of Errors in Dictated Clinical Documents Assisted by
Speech Recognition Software and Professional Transcriptionists,” JAMA Netw.
Open, vol. 1, no. 3, p. e180530, Jul. 2018.
[30] T. Hodgson, F. Magrabi, and E. Coiera, “Evaluating the efficiency and safety of
speech recognition within a commercial electronic health record system: A
replication study,” Appl. Clin. Inform., vol. 9, no. 2, pp. 326–335, 2018.
[31] B. W. Leeming, D. Porter, J. D. Jackson, H. L. Bleich, and M. Simon,
“Computerized radiologic reporting with voice data-entry,” Radiology, vol. 138, no.
5, pp. 585–588, Mar. 1981.
[32] A. H. Robbins et al., “Speech-controlled generation of radiology reports,”
Radiology, vol. 164, no. 2, pp. 569–573, Aug. 1987.
[33] A. Zafar, J. M. Overhage, and C. J. McDonald, “Continuous speech recognition for
clinicians,” J. Am. Med. Informatics Assoc., vol. 6, no. 3, pp. 195–204, 1999.
[34] M. M. Teel, R. Sokolowski, D. Rosenthal, and M. Belge, “Voice-enabled structured
medical reporting,” Conf. Hum. Factors Comput. Syst. - Proc., vol. 13, no. 1, pp.
595–602, 1998.
[35] K. Korn, “Voice recognition software for clinical use.,” J. Am. Acad. Nurse Pract.,
vol. 10, no. 11, pp. 515–517, 1998.
[36] R. Et, “Voice-enabled Reporting Systems Application of Information Technology,”
1997.
[37] O. A. Buch and N. P. Reddy, “Speaker verification for telemedical applications,”
Annu. Int. Conf. IEEE Eng. Med. Biol. - Proc., vol. 2, no. C, pp. 902–903, 1997.
[38] T. M. Murray, M. L. Kruse, J. A. Battcher, J. W. Woods, H. L. Edmonds, and M. P.
Palaheimo, “’Hands-off’--Voice activated automated anesthesia recordkeeping and
monitoring system (ARMS),” in Conference Proceedings - IEEE
SOUTHEASTCON, 1990, vol. 3, pp. 822–824.
[39] N. A. Linn, R. M. Rubenstein, A. E. Bowler, and J. L. Dixon, “Improving the quality
of emergency department documentation using the voice-activated word processor:
interim results.,” Proc. Annu. Symp. Comput. Appl. Med. Care, vol. 1992, no. 1, pp.
772–776, 1992.
[40] D. N. Mohr, D. W. Turner, G. R. Pond, J. S. Kamath, C. B. De Vos, and P. C.
Carpenter, “Speech recognition as a transcription aid: A randomized comparison
with standard transcription,” J. Am. Med. Informatics Assoc., vol. 10, no. 1, pp. 85–
93, 2003.
[41] Y. D. Derman, T. Arenovich, and J. Strauss, “Speech recognition software and
electronic psychiatric progress notes: Physicians’ ratings and preferences,” BMC
Med. Inform. Decis. Mak., vol. 10, no. 1, p. 44, Aug. 2010.
[42] G. C. David, A. C. Garcia, A. W. Rawls, and D. Chand, “Listening to what is said -
Transcribing what is heard: The impact of speech recognition technology (SRT) on
the practice of medical transcription (MT),” Sociol. Heal. Illn., vol. 31, no. 6, pp.
924–938, Sep. 2009.
[43] R. M. Issenman and I. H. Jaffer, “Use of voice recognition software in an outpatient
pediatric specialty practice,” Pediatrics, vol. 114, no. 3, pp. e290--e293, 2004.
[44] R. Hoyt and A. Yoshihashi, “Lessons learned from implementation of voice
recognition for documentation in the military electronic health record system.,”
Perspect. Health Inf. Manag., vol. 7, no. Winter, p. 1e, Jan. 2010.
[45] K. Patel and M. Harbord, “Digital dictation and voice transcription software
enhances outpatient clinic letter production: a crossover study,” Frontline
Gastroenterol., vol. 3, no. 3, pp. 162–165, Jul. 2012.
[46] T. Marukami, S. Tani, A. Matsuda, K. Takemoto, A. Shindo, and H. Inada, “A basic
study on application of voice recognition input to an electronic nursing record
system-evaluation of the function as an input interface,” J. Med. Syst., vol. 36, no.
3, pp. 1053–1058, Jun. 2012.
[47] M. Johnson et al., “A systematic review of speech recognition technology in health
care,” BMC Med. Inform. Decis. Mak., vol. 14, no. 1, p. 94, Dec. 2014.
[48] H. Suominen, L. Zhou, L. Hanlen, and G. Ferraro, “Benchmarking Clinical Speech
Recognition and Information Extraction: New Data, Methods, and Evaluations,”
JMIR Med. Informatics, vol. 3, no. 2, p. e19, Apr. 2015.
[49] S. Ajami, “Use of speech-to-text technology for documentation by healthcare
providers,” Natl. Med. J. India, vol. 29, no. 3, pp. 148–152, 2016.
[50] J. Fernandes, I. Brunton, G. Strudwick, S. Banik, and J. Strauss, “Physician
experience with speech recognition software in psychiatry: Usage and perspective,”
BMC Research Notes, vol. 11, no. 1. BioMed Central, p. 690, 01-Oct-2018.
[51] J. Fratzke, S. Tucker, H. Shedenhelm, J. Arnold, T. Belda, and M. Petera,
“Enhancing nursing practice by utilizing voice recognition for direct
documentation,” J. Nurs. Adm., vol. 44, no. 2, pp. 79–86, Feb. 2014.
[52] A. H. Abdelaziz, “Comparing Fusion Models for DNN-Based Audiovisual
Continuous Speech Recognition,” IEEE/ACM Trans. Audio Speech Lang. Process.,
vol. 26, no. 3, pp. 475–484, Mar. 2018.
[53] G. Saon, T. Sercu, S. Rennie, and H. K. J. Kuo, “The IBM 2016 English
conversational telephone speech recognition system,” in Proceedings of the Annual
Conference of the International Speech Communication Association,
INTERSPEECH, 2016, vol. 08-12-Sept, pp. 7–11.
[54] W. Xiong et al., “The microsoft 2016 conversational speech recognition system,”
ICASSP, IEEE Int. Conf. Acoust. Speech Signal Process. - Proc., pp. 5255–5259,
Sep. 2017.
[55] K. Garg and G. Jain, “A comparative study of noise reduction techniques for
automatic speech recognition systems,” in 2016 International Conference on
Advances in Computing, Communications and Informatics, ICACCI 2016, 2016, pp.
2098–2103.
[56] Y. Gong, “Speech recognition in noisy environments: A survey,” Speech Commun.,
vol. 16, no. 3, pp. 261–291, Apr. 1995.
[57] K. Kinoshita et al., “The reverb challenge: A common evaluation framework for
dereverberation and recognition of reverberant speech,” in IEEE Workshop on
Applications of Signal Processing to Audio and Acoustics, 2013, pp. 1–4.
[58] S. Thomas, S. Ganapathy, and H. Hermansky, “Recognition of reverberant speech
using frequency domain linear prediction,” IEEE Signal Process. Lett., vol. 15, pp.
681–684, 2008.
[59] A. Wisler, V. Berisha, A. Spanias, and J. Liss, “Noise robust dysarthric speech
classification using domain adaptation,” in 2016 Digital Media Industry and
Academic Forum, DMIAF 2016 - Proceedings, 2016, pp. 135–138.
[60] L. Deng et al., “Distributed speech processing in MiPad’s multimodal user
interface,” IEEE Trans. Speech Audio Process., vol. 10, no. 8, pp. 605–619, Nov.
2002.
[61] M. L. Seltzer, D. Yu, and Y. Wang, “An investigation of deep neural networks for
noise robust speech recognition,” in ICASSP, IEEE International Conference on
Acoustics, Speech and Signal Processing - Proceedings, 2013, pp. 7398–7402.
[62] H. Meutzner, S. Araki, M. Fujimoto, and T. Nakatani, “A generative-discriminative
hybrid approach to multi-channel noise reduction for robust automatic speech
recognition,” in ICASSP, IEEE International Conference on Acoustics, Speech and
Signal Processing - Proceedings, 2016, vol. 2016-May, pp. 5740–5744.
[63] T. Menne, R. Schluter, and H. Ney, “Speaker Adapted Beamforming for Multi-
Channel Automatic Speech Recognition,” 2018 IEEE Spok. Lang. Technol. Work.
SLT 2018 - Proc., pp. 535–541, Jun. 2019.
[64] S. F. Boll, “Suppression of Acoustic Noise in Speech Using Spectral Subtraction,”
IEEE Trans. Acoust., vol. 27, no. 2, pp. 113–120, Apr. 1979.
[65] J. S. Lim and A. V. Oppenheim, “Enhancement and Bandwidth Compression of
Noisy Speech,” Proc. IEEE, vol. 67, no. 12, pp. 1586–1604, 1979.
[66] R. Rehr and T. Gerkmann, “Cepstral noise subtraction for robust automatic speech
recognition,” in ICASSP, IEEE International Conference on Acoustics, Speech and
Signal Processing - Proceedings, 2015, vol. 2015-Augus, pp. 375–378.
[67] S. Park, Y. Jeong, M. S. Kim, and H. S. Kim, “Linear prediction-based
dereverberation with very deep convolutional neural networks for reverberant
speech recognition,” in International Conference on Electronics, Information and
Communication, ICEIC 2018, 2018, vol. 2018-Janua, no. 1, pp. 1–2.
[68] F. Weninger, S. Watanabe, Y. Tachioka, and B. Schuller, “Deep recurrent de-
noising auto-encoder and blind de-reverberation for reverberated speech
recognition,” in ICASSP, IEEE International Conference on Acoustics, Speech and
Signal Processing - Proceedings, 2014, no. 2, pp. 4623–4627.
[69] J. Li, Y. Huang, and Y. Gong, “Improved cepstra minimum-mean-square-error noise
reduction algorithm for robust speech recognition,” in ICASSP, IEEE International
Conference on Acoustics, Speech and Signal Processing - Proceedings, 2017, pp.
4865–4869.
[70] A. Narayanan and D. Wang, “Joint noise adaptive training for robust automatic
speech recognition,” in ICASSP, IEEE International Conference on Acoustics,
Speech and Signal Processing - Proceedings, 2014, pp. 2504–2508.
[71] A. Prodeus and K. Kukharicheva, “Training of automatic speech recognition system
on noised speech,” in 2016 IEEE 4th International Conference Methods and Systems
of Navigation and Motion Control, MSNMC 2016 - Proceedings, 2016, pp. 221–
223.
[72] J. Rajnoha, “Multi-Condition Training for Unknown Environment Adaptation in
Robust ASR Under Real Conditions,” Acta Polytech., vol. 49, no. 2, pp. 3–7, Jan.
2009.
[73] M. Fujimoto and H. Kawai, “Comparative evaluations of various factored deep
convolutional rnn architectures for noise robust speech recognition,” in ICASSP,
IEEE International Conference on Acoustics, Speech and Signal Processing -
Proceedings, 2018, vol. 2018-April, pp. 4829–4833.
[74] A. Prodeus and K. Kukharicheva, “Automatic speech recognition performance for
training on noised speech,” in 2nd International Conference on Advanced
Information and Communication Technologies, AICT 2017 - Proceedings, 2017, pp.
71–74.
[75] S. Watanabe et al., “ESPNet: End-to-end speech processing toolkit,” in Proceedings
of the Annual Conference of the International Speech Communication Association,
INTERSPEECH, 2018, vol. 2018-Septe, pp. 2207–2211.
[76] D. Amodei et al., “Deep speech 2: End-to-end speech recognition in English and
Mandarin,” 33rd Int. Conf. Mach. Learn. ICML 2016, vol. 1, pp. 312–321, Dec.
2016.
[77] A. Hannun et al., “Deep Speech: Scaling up end-to-end speech recognition,” Dec.
2014.
[78] D. Yu and J. Li, “Recent progresses in deep learning based acoustic models,”
IEEE/CAA J. Autom. Sin., vol. 4, no. 3, pp. 396–409, Jul. 2017.
[79] C. X. Qin, D. Qu, and L. H. Zhang, “Towards end-to-end speech recognition with
transfer learning,” Eurasip J. Audio, Speech, Music Process., vol. 2018, no. 1, p. 18,
Dec. 2018.
[80] Z. You, D. Su, and D. Yu, “Teach an All-rounder with Experts in Different
Domains,” in ICASSP, IEEE International Conference on Acoustics, Speech and
Signal Processing - Proceedings, 2019, vol. 2019-May, pp. 6425–6429.
[81] R. Lyer, M. Ostendorf, and H. Gish, “Using out-of-domain data to improve in-
domain language models,” IEEE Signal Process. Lett., vol. 4, no. 8, pp. 221–223,
Aug. 1997.
[82] D. Janiszek, R. De Mori, and F. Bechet, “Data augmentation and language model
adaptation,” in ICASSP, IEEE International Conference on Acoustics, Speech and
Signal Processing - Proceedings, 2001, vol. 1, pp. 549–552.
[83] M. Creutz, S. Virpioja, and A. Kovaleva, “Web augmentation of language models
for continuous speech recognition of SMS text messages,” in EACL 2009 - 12th
Conference of the European Chapter of the Association for Computational
Linguistics, Proceedings, 2009, pp. 157–165.
[84] H. Cucu, L. Besacier, C. Burileanu, and A. Buzo, “ASR domain adaptation methods
for low-resourced languages: Application to Romanian language,” in European
Signal Processing Conference, 2012, pp. 1648–1652.
[85] A. Sethy, S. Narayanan, and B. Ramabhadran, “Data Driven Approach for Language
Model Adaptation using Stepwise Relative Entropy Minimization,” in 2007 IEEE
International Conference on Acoustics, Speech and Signal Processing - ICASSP ’07,
2007, vol. 4, pp. IV-177-IV–180.
[86] A. K. Baughman, S. Hammer, and D. Provan, “Using language models for
improving speech recognition for U.S. Open Tennis Championships,” IBM J. Res.
Dev., vol. 61, no. 4/5, pp. 15:1-15:9, Jul. 2017.
[87] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent
neural networks,” in 31st International Conference on Machine Learning, ICML
2014, 2014, vol. 5, pp. 3771–3779.
[88] J. J. Godfrey, E. C. Holliman, and J. McDaniel, “SWITCHBOARD: Telephone
speech corpus for research and development,” in ICASSP, IEEE International
Conference on Acoustics, Speech and Signal Processing - Proceedings, 1992, vol.
1, pp. 517–520.
[89] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE
Trans. Signal Process., vol. 45, no. 11, pp. 2673–2681, 1997.
[90] “TensorFlow.” [Online]. Available: https://www.tensorflow.org/. [Accessed: 05-
Nov-2019].
[91] Kenneth Heafield, “KenLM: Faster and Smaller Language Model Queries,” in
Proceedings of the Sixth Workshop on Statistical Machine Translation, 2011, pp.
187--197.
[92] D. Povey et al., “The Kaldi Speech Recognition Toolkit,” in IEEE 2011 Workshop
on Automatic Speech Recognition and Understanding, Hilton Waikoloa Village, Big
Island, Hawaii, US, Cambridge: IEEE Signal Processing Society, 2011.
[93] “PyTorch.” [Online]. Available: https://pytorch.org/. [Accessed: 05-Nov-2019].
[94] V. Pratap et al., “Wav2Letter++: A Fast Open-source Speech Recognition System,”
in ICASSP, IEEE International Conference on Acoustics, Speech and Signal
Processing - Proceedings, 2019, vol. 2019-May, pp. 6460–6464.
[95] A. Hannun, A. Lee, Q. Xu, and R. Collobert, “Sequence-to-Sequence Speech
Recognition with Time-Depth Separable Convolutions.”
[96] T. Likhomanenko, G. Synnaeve, and R. Collobert, “Who Needs Words? Lexicon-Free
Speech Recognition,” preprint, 2019.
[97] R. Singh, W. Walker, and P. Wolf, “The CMU Sphinx-4 Speech Recognition
System,” Evaluation, pp. 2–5.
[98] V. Këpuska, “Comparing Speech Recognition Systems (Microsoft API, Google API
And CMU Sphinx),” Int. J. Eng. Res. Appl., vol. 07, no. 03, pp. 20–24, Mar. 2017.
[99] A. Lee and T. Kawahara, “Recent development of open-source speech recognition
engine Julius,” APSIPA ASC 2009 - Asia-Pacific Signal Inf. Process. Assoc. 2009
Annu. Summit Conf., pp. 131–137, 2009.
[100] Google Cloud STT, “Cloud Speech-to-Text - Speech Recognition | Cloud Speech-
to-Text API | Google Cloud,” 2018. [Online]. Available:
https://cloud.google.com/speech-to-text/. [Accessed: 05-Nov-2019].
[101] D. Mowery, J. Wiebe, S. Visweswaran, H. Harkema, and W. W. Chapman,
“Building an automated SOAP classifier for emergency department reports,” J.
Biomed. Inform., vol. 45, no. 1, pp. 71–81, Feb. 2012.
[102] D. A. Juckett, E. P. Kasten, F. N. Davis, and M. Gostine, “Concept detection using
text exemplars aligned with a specialized ontology,” Data and Knowledge
Engineering, vol. 119, North-Holland, pp. 22–35, 01-Jan-2019.
[103] C. Cieri, D. Miller, and K. Walker, “The fisher corpus: A resource for the next
generations of speech-to-text,” in Proceedings of the 4th International Conference
on Language Resources and Evaluation, LREC 2004, 2004, pp. 69–71.
[104] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus
based on public domain audio books,” in ICASSP, IEEE International Conference
on Acoustics, Speech and Signal Processing - Proceedings, 2015, vol. 2015-Augus,
pp. 5206–5210.
[105] E. Gazetić, “Comparison Between Cloud-based and Offline Speech Recognition
Systems,” Technische Universität München, 2018.
[106] A. R. Aronson and F. M. Lang, “An overview of MetaMap: Historical perspective
and recent advances,” J. Am. Med. Informatics Assoc., vol. 17, no. 3, pp. 229–236,
2010.
[107] “MeSH Browser.” [Online]. Available: http. [Accessed: 06-Nov-2019].
[108] “Deep Domain Adaptation In Computer Vision - Towards Data Science.” [Online].
Available: https://towardsdatascience.com/deep-domain-adaptation-in-computer-
vision-8da398d3167f. [Accessed: 16-Nov-2019].
[109] I. Redko, E. Morvant, A. Habrard, M. Sebban, and Y. Bennani, Advances in Domain
Adaptation Theory. ISTE Press - Elsevier, 2019.
[110] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan,
“A theory of learning from different domains,” Mach. Learn., vol. 79, no. 1–2, pp.
151–175, May 2010.
[111] M. Suzuki, N. Itoh, T. Nagano, G. Kurata, and S. Thomas, “Improvements to N-
gram Language Model Using Text Generated from Neural Language Model,” in
ICASSP, IEEE International Conference on Acoustics, Speech and Signal
Processing - Proceedings, 2019, vol. 2019-May, pp. 7245–7249.
[112] “These are the Easiest Data Augmentation Techniques in Natural Language
Processing you can think of — and they work.” [Online]. Available:
https://towardsdatascience.com/these-are-the-easiest-data-augmentation-
techniques-in-natural-language-processing-you-can-think-of-88e393fd610.
[Accessed: 13-Nov-2019].
[113] X. Zhang, J. Zhao, and Y. LeCun, “Character-level Convolutional Networks for
Text Classification,” Adv. Neural Inf. Process. Syst., vol. 2015-Janua, pp. 649–657,
Sep. 2015.
[114] “WordNet | A Lexical Database for English.” [Online]. Available:
https://wordnet.princeton.edu/. [Accessed: 18-Nov-2019].
[115] “Thesaurus.com | Synonyms and Antonyms of Words at Thesaurus.com.” [Online].
Available: https://www.thesaurus.com/. [Accessed: 18-Nov-2019].
[116] E. Abrahamsson, T. Forni, M. Skeppstedt, and M. Kvist, “Medical text
simplification using synonym replacement: Adapting assessment of word difficulty
to a compounding language.”
[117] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell,
D. Ollason, D. Povey, V. Valtchev, and P. Woodland, “The HTK Book,” 1995.
[118] “ARPA Language models – CMUSphinx Open Source Speech Recognition.”
[Online]. Available: https://cmusphinx.github.io/wiki/arpaformat/. [Accessed: 12-
Dec-2019].
[119] D. Klakow, “Log-linear interpolation of language models,” in ICSLP, 1998, no.
January, pp. 1–4.
[120] “Jaccard Index Definition | DeepAI.” [Online]. Available:
https://deepai.org/machine-learning-glossary-and-terms/jaccard-index. [Accessed:
19-Nov-2019].
[121] A. Schofield, M. Magnusson, and D. Mimno, “Pulling Out the Stops: Rethinking
Stopword Removal for Topic Models.”
[122] C. Silva and B. Ribeiro, “The Importance of Stop Word Removal on Recall Values
in Text Categorization,” in Proceedings of the International Joint Conference on
Neural Networks, 2003, vol. 3, pp. 1661–1666.
[123] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval - Chapter
6, page 203. New York, NY, USA: McGraw-Hill, Inc., 1986.
[124] V. Thada and V. Jaglan, “Comparison of Jaccard, Dice, Cosine Similarity
Coefficient To Find Best Fitness Value for Web Retrieved Documents Using
Genetic Algorithm.”
[125] T. Saito and M. Rehmsmeier, “The precision-recall plot is more informative than
the ROC plot when evaluating binary classifiers on imbalanced datasets,” PLoS One,
vol. 10, no. 3, p. e0118432, Mar. 2015.
[126] O. Tilk and T. Alumäe, “Bidirectional recurrent neural network with attention
mechanism for punctuation restoration,” Proc. Annu. Conf. Int. Speech Commun.
Assoc. INTERSPEECH, vol. 08-12-Sept, no. September, pp. 3047–3051, 2016.
Appendix A. AP SCORES FOR SOAP CLASSIFICATION
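Note: the micro- and macro-averaged AP scores in Tables A.1 and A.2, and the one-vs-all
scores in Tables A.3 to A.6, are of the kind that can be computed with scikit-learn's
average_precision_score. The short sketch below illustrates this with randomly generated
placeholder scores in place of our classifier's actual outputs.

# Sketch of computing micro-averaged, macro-averaged, and one-vs-all AP scores
# for the four SOAP categories. y_true is a binary indicator matrix of the true
# categories; y_score stands in for the classifier's per-category scores.
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([[1, 0, 0, 0],   # Subjective
                   [0, 1, 0, 0],   # Objective
                   [0, 0, 1, 0],   # Assessment
                   [0, 0, 0, 1]])  # Plan
y_score = np.random.rand(4, 4)     # placeholder scores, one column per category

micro_ap = average_precision_score(y_true, y_score, average="micro")
macro_ap = average_precision_score(y_true, y_score, average="macro")
subjective_vs_all = average_precision_score(y_true[:, 0], y_score[:, 0])
print(micro_ap, macro_ap, subjective_vs_all)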
Table A.1 Micro-Averaged AP-Scores of All Classes
max n-gram length \ Conditions
1 2 3 4 5 6 7 8
1 0.87152 0.84058 0.84252 0.85159 0.85459 0.81910 0.85184 0.81206
2 0.88061 0.86977 0.86825 0.87163 0.85796 0.84112 0.84968 0.84737
3 0.89237 0.86801 0.88100 0.87971 0.85346 0.84504 0.86390 0.85155
4 0.88720 0.87295 0.88759 0.87017 0.86087 0.85578 0.84566 0.84874
5 0.89675 0.88257 0.88595 0.87578 0.85288 0.83935 0.86702 0.86095
6 0.88218 0.87506 0.89358 0.87889 0.86509 0.84752 0.86476 0.84768
7 0.89209 0.87731 0.89252 0.88206 0.85856 0.85403 0.85644 0.85793
8 0.89512 0.87937 0.89425 0.88162 0.86106 0.85125 0.86177 0.84217
9 0.89086 0.88086 0.89328 0.87426 0.85684 0.85889 0.86656 0.85883
10 0.88856 0.88010 0.88967 0.87997 0.86259 0.84090 0.87013 0.85468
11 0.89134 0.87508 0.89125 0.87686 0.85885 0.85197 0.85369 0.84510
12 0.88846 0.87587 0.89400 0.88105 0.85362 0.84972 0.86796 0.85585
13 0.88870 0.86693 0.89375 0.87600 0.85575 0.83931 0.85644 0.85642
14 0.87369 0.87150 0.89542 0.87421 0.86041 0.85231 0.86727 0.84845
15 0.88625 0.87110 0.88969 0.87362 0.86681 0.85358 0.85403 0.86945
Table A.2 Macro-Averaged AP-Scores of All Classes
max n-gram length \ Conditions
1 2 3 4 5 6 7 8
1 0.79998 0.75426 0.76915 0.76028 0.76414 0.71638 0.77121 0.72274
2 0.80459 0.77561 0.79090 0.79243 0.75770 0.75826 0.76001 0.75260
3 0.80807 0.78539 0.79045 0.78538 0.76450 0.75281 0.76895 0.76315
4 0.79333 0.78249 0.78759 0.76358 0.76157 0.75335 0.75009 0.75843
5 0.80807 0.78237 0.79266 0.76794 0.75546 0.75181 0.75882 0.77015
6 0.78405 0.77931 0.80132 0.76689 0.76495 0.75031 0.77623 0.75212
7 0.79768 0.77712 0.79728 0.76820 0.76913 0.75521 0.74464 0.76786
8 0.79060 0.77587 0.79075 0.77641 0.76729 0.76235 0.76174 0.75969
9 0.79412 0.77728 0.78958 0.75259 0.76786 0.75242 0.74746 0.76732
10 0.78881 0.77820 0.78910 0.76923 0.76689 0.76032 0.76641 0.75721
11 0.79597 0.77398 0.79302 0.77094 0.76086 0.76354 0.76075 0.73879
12 0.78550 0.77857 0.79082 0.78332 0.74792 0.76637 0.76998 0.76448
13 0.79340 0.76907 0.79065 0.75373 0.75489 0.74406 0.75701 0.75932
14 0.77876 0.77184 0.79121 0.76499 0.76382 0.75211 0.75768 0.76184
15 0.78395 0.76465 0.78830 0.76070 0.77167 0.77614 0.74695 0.77150
Table A.3 AP-Scores of Two Classes (Subjective vs All)
max n-gram length \ Conditions
1 2 3 4 5 6 7 8
1 0.90501 0.89149 0.88107 0.90806 0.91505 0.89741 0.89351 0.88440
2 0.93336 0.92283 0.91246 0.92152 0.92710 0.89072 0.88780 0.91804
3 0.94239 0.91865 0.92768 0.94003 0.90400 0.90593 0.91811 0.90613
4 0.94333 0.93144 0.94016 0.93299 0.92555 0.92370 0.90864 0.91680
5 0.95273 0.94288 0.93150 0.94842 0.92308 0.89751 0.92782 0.92390
6 0.94767 0.94021 0.94799 0.95153 0.92594 0.90369 0.90920 0.91275
7 0.94967 0.94131 0.94568 0.94773 0.90604 0.92381 0.92287 0.93336
8 0.95740 0.93798 0.94737 0.95101 0.92411 0.91136 0.90881 0.90886
9 0.94873 0.95107 0.94773 0.94274 0.90559 0.93126 0.92834 0.92388
10 0.94585 0.94358 0.94162 0.94536 0.92514 0.90118 0.93022 0.91742
11 0.94941 0.94000 0.93928 0.94745 0.92916 0.91121 0.90954 0.92416
12 0.95327 0.93872 0.94868 0.94808 0.91769 0.90859 0.92494 0.91523
13 0.95026 0.93438 0.95207 0.94390 0.92459 0.90395 0.90351 0.92396
14 0.93729 0.92832 0.95703 0.95128 0.91915 0.91917 0.93349 0.92274
15 0.95010 0.93111 0.94280 0.94444 0.92890 0.90565 0.91919 0.93660
Table A.4 AP-Scores of Two Classes (Objective vs All)
max n-gram length \ Conditions
1 2 3 4 5 6 7 8
1 0.90375 0.85260 0.87347 0.85891 0.87562 0.83632 0.87708 0.84159
2 0.89769 0.88483 0.89230 0.88298 0.88828 0.87137 0.88961 0.87662
3 0.90800 0.87393 0.90164 0.88555 0.89864 0.86943 0.88837 0.87976
4 0.90251 0.88142 0.90932 0.88496 0.89250 0.88372 0.88241 0.88566
5 0.90749 0.89293 0.91095 0.87892 0.87744 0.87557 0.90670 0.87394
6 0.89953 0.87594 0.91214 0.88282 0.90129 0.87362 0.90426 0.87490
7 0.90831 0.88341 0.90971 0.89423 0.89659 0.88469 0.88819 0.87681
8 0.91126 0.89258 0.91615 0.88261 0.88810 0.87390 0.89979 0.86681
9 0.91129 0.88297 0.91712 0.89106 0.90063 0.87826 0.90311 0.87514
10 0.91113 0.88707 0.91147 0.89191 0.89916 0.87251 0.90402 0.87803
11 0.91130 0.89220 0.91876 0.88074 0.88890 0.87459 0.90308 0.87257
12 0.90712 0.88976 0.91710 0.88228 0.89269 0.87452 0.90712 0.88933
13 0.90656 0.87141 0.91521 0.89360 0.89582 0.87049 0.90602 0.88417
14 0.90707 0.88522 0.91463 0.87518 0.89538 0.88458 0.90084 0.86635
15 0.90746 0.88614 0.91403 0.88204 0.89858 0.87836 0.88369 0.88256
Table A.5 AP-Scores of Two Classes (Assessment vs All)
max n-gram length \ Conditions
1 2 3 4 5 6 7 8
1 0.62572 0.51981 0.58515 0.54963 0.48264 0.42715 0.50811 0.41564
2 0.58350 0.52195 0.56644 0.57512 0.43244 0.49223 0.47429 0.44123
3 0.59137 0.55334 0.55362 0.53872 0.44582 0.47400 0.47320 0.48630
4 0.53987 0.53424 0.54579 0.47043 0.43733 0.45967 0.40620 0.45350
5 0.60639 0.50124 0.56271 0.48171 0.43891 0.45345 0.42155 0.51026
6 0.51899 0.50746 0.56486 0.47294 0.43950 0.46482 0.47102 0.44875
7 0.55083 0.49840 0.57151 0.48657 0.48152 0.46541 0.42106 0.47560
8 0.53778 0.49373 0.55106 0.49727 0.45034 0.46598 0.46191 0.46540
9 0.55811 0.50911 0.53652 0.43471 0.45231 0.46446 0.44352 0.49112
10 0.55279 0.52788 0.53187 0.47275 0.43585 0.47011 0.46762 0.45519
11 0.56682 0.51310 0.55637 0.51224 0.43034 0.49273 0.44681 0.44773
12 0.54444 0.50459 0.54401 0.54557 0.43188 0.50355 0.44888 0.48709
13 0.56513 0.49285 0.53277 0.44201 0.39557 0.44036 0.45409 0.45822
14 0.52844 0.51922 0.53272 0.48407 0.42753 0.42670 0.42524 0.49405
15 0.54566 0.50524 0.53355 0.45283 0.49985 0.52671 0.44175 0.52557
Table A.6 AP-Scores of Two Classes (Plan vs All)
max n-gram length \ Conditions
1 2 3 4 5 6 7 8
1 0.76542 0.75312 0.73690 0.72453 0.78324 0.70465 0.80615 0.74932
2 0.80380 0.77281 0.79240 0.79008 0.78297 0.77872 0.78832 0.77451
3 0.79052 0.79562 0.77884 0.77720 0.80953 0.76187 0.79611 0.78040
4 0.78761 0.78287 0.75510 0.76593 0.79089 0.74632 0.80311 0.77775
5 0.76565 0.79241 0.76548 0.76269 0.78241 0.78070 0.77922 0.77251
6 0.77001 0.79364 0.78029 0.76025 0.79305 0.75910 0.82042 0.77209
7 0.78190 0.78535 0.76220 0.74425 0.79237 0.74693 0.74642 0.78567
8 0.75596 0.77918 0.74840 0.77476 0.80662 0.79814 0.77646 0.79768
9 0.75836 0.76597 0.75696 0.74186 0.81290 0.73569 0.71488 0.77912
10 0.74545 0.75426 0.77144 0.76691 0.80741 0.79749 0.76379 0.77821
11 0.75635 0.75062 0.75767 0.74334 0.79502 0.77564 0.78358 0.71071
12 0.73718 0.78119 0.75350 0.75734 0.74943 0.77882 0.79898 0.76625
13 0.75164 0.77764 0.76255 0.73542 0.80356 0.76144 0.76442 0.77093
14 0.74224 0.75461 0.76046 0.74943 0.81323 0.77798 0.77115 0.76423
15 0.73256 0.73611 0.76280 0.76349 0.75936 0.79384 0.74318 0.74126