
User-centered Visualizations of Transcription Uncertainty in AI-generated Subtitles of News Broadcast

Fredrik Karlsson Human-Computer Interaction

Uppsala University [email protected]

ABSTRACT

AI-generated subtitles have recently started to automate the process of subtitling with automatic speech recognition. However, people may not perceive that the transcription is based on probabilities and may entail errors. For news that is broadcast live, this may be controversial and cause misinterpretation. A user-centered design approach was taken, investigating three possible solutions for visualizing transcription uncertainties in real-time presentation. Based on the user needs, one proposed solution was used in a qualitative comparison with AI-generated subtitles without visualizations. The results suggest that visualization of uncertainties supports users’ interpretation of AI-generated subtitles and helps them identify possible errors. However, it does not improve transcription intelligibility. The results also suggest that unnoticed transcription errors during news broadcasts are perceived as critical and decrease trust in the news. Uncertainty visualizations may increase trust and prevent the risk of misinterpretation of important information.

Author Keywords

Automatic speech recognition; Uncertainty visualization; Subtitles.

INTRODUCTION

Subtitles are important for making videos understandable and accessible to as many individuals as possible. While subtitles were initially created to support people with hearing loss, audience analysis has shown that subtitles are now used for several different reasons. Ofcom [26] showed that only 20% of subtitle users in the UK suffered from hearing loss. People nowadays use subtitles for several reasons, such as when sound is muted or to support their comprehension [26]. Subtitles are especially important in the context of news, which usually is broadcast live and has to be subtitled in real time. Real-time subtitling usually takes years of practice to master, yet the quality is often deficient.
In recent years, AI-generated subtitles have started to automate the process of subtitling with techniques for Automatic Speech Recognition (ASR). As with manual real-time subtitling, AI-generated subtitles can be of deficient quality. ASR systems can have problems determining exactly what a person said and transcribing the correct word [27]. Several factors can affect ASR output, such as variation in people’s speech, dialects [27] and casual speech [3,27]. Other affecting factors are noisy environments [3] and background and overlapping sound [27]. ASR systems determine output based on probability measurements to match the spoken utterance [4,20,27,31]. Probability measurements entail uncertain data [29], and uncertainties in transcriptions could be controversial in the context of live news. The information conveyed through the subtitles must be understandable, credible and not misleading.

A fictive but realistic example illustrates how critical this issue can become. Sara is watching the news covering the current state of the corona pandemic. The news reporter says: “Quarantine is not over, you are not allowed to go out”, while the subtitle transcription provided by ASR reads: “Quarantine is now over, you are now allowed to go out”. Sara, who had the sound muted, misinterprets the information conveyed through the news.

This illustrates that transcription uncertainty may change the meaning of spoken information. However, the human process of understanding is not passive; our brain has the capacity to correct mistakes, build up possible meanings and predict following words [24]. Had the subtitle system presented, or even just hinted at, the transcription uncertainty, the misinterpretation might have been avoided. In previous studies, visualization of uncertainty has helped users’ interpretation of data and supported decision making. For instance, Kay et al [15] found that people could estimate bus arrivals more precisely when presented with uncertain data in a mobile transit application. However, given the complexity of uncertain data, visualizations may instead lead to confusion and decrease trust [8]. It is not clear whether uncertainty visualization will help or confuse people when used in AI-generated subtitles.

This paper studies user needs for visualizing uncertainty in AI-generated subtitles during news broadcasts. The goals of this research are to create a design that usefully visualizes uncertainty in AI-generated subtitles; to understand whether people perceive the visualizations to support transcription error detection; and to understand how visualizations influence trust in AI-generated subtitles when used in news broadcasts. The research is approached qualitatively in two phases. First, a user-centered design approach was used to investigate and design possible solutions for visualizing transcription uncertainties in real-time presentation. The design work was performed iteratively and informed by 9 participants. Second, a comparative user study was performed with 6 participants to investigate whether people perceive uncertainty visualization to support their interpretation and error detection. The study also investigated how trustworthy the AI-generated subtitles are perceived to be and whether visualizations influence that trust.

This work was submitted in partial fulfilment for the master of science degree in Human-Computer Interaction at Uppsala University, Sweden. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honoured. © 250620 Copyright is held by the author(s).

The three contributions aligned with these goals are as follows:

1. Propose visualization techniques for the context of AI-generated subtitles. This study presents two possible design concepts that usefully communicate transcription uncertainties in AI-generated subtitles.

2. The study shows that uncertainty visualizations support users’ interpretation of transcriptions and their error detection. Visualizations do not improve intelligibility but prevent the risk of misinterpretation.

3. The study shows that uncertainty visualization may increase trust in AI-generated subtitles during news broadcasts, since it visualizes possible errors and prevents the risk of misinterpretation of important information.

The results also provide a foundation for further work on uncertainty visualization in AI-generated subtitles. To my knowledge, this is the first study investigating the use of uncertainty visualization in the context of AI-generated subtitles. The qualitative approach elicits potential factors that could be studied further in quantitative and long-term studies.

BACKGROUND & RELATED LITERATURE

This section begins by presenting the basic fundamentals of ASR systems, focusing on components for probability estimation and types of transcription errors. Secondly, uncertainty and the recognition-primed decision model are briefly presented. Lastly, prior work on uncertainty visualization is presented to highlight the possibilities for effective communication of uncertainty.

Fundamentals of Automatic Speech Recognition

ASR refers to the process of identifying speech signals and reproducing the correct sequence of words [20,28]. ASR systems usually consist of four main parts: feature extraction, an Acoustic Model (AM), a Language Model (LM), and hypothesis search [20,31]. In feature extraction, the speech signal is first acoustically processed [20,31], being divided into short frames of sound to obtain the acoustic features [4,31]. The AM and LM take these acoustic features to calculate and estimate the probability of the word sequence [4,20,31].

Both the AM and LM give a probability score on the features; these scores are combined during the hypothesis search to determine and output the word sequence with the highest probability [4,20,27,31].

Acoustic models

Speech consists of short sequences of sounds (referred to as phonemes) that combine into words [4,27]. An AM uses mathematical models and knowledge about acoustics and phonetics to determine which word has the highest probability of corresponding to a speech signal [4,31]. Deep Neural Network (DNN) based variants are at present the most widely used for acoustic modeling [20]. A Recurrent Neural Network (RNN) is such a variant; it predicts which letters each short sound frame corresponds to [7,23]. All predicted letters are then put together into sequences that are likely to have been spoken [7,23]. RNNs do not operate only on input, however, but also on internal states. These act like a memory that influences future predictions [20,25], making the network a dynamic system where previous predictions may influence upcoming ones [20,28].

Language models

An LM determines the probability of a word corresponding to a speech signal based on a training corpus [4,28,31]. From the training corpus, an LM can determine how frequently words occur and how likely they are to occur in relation to other words [4,31]. By assigning scores to likely and unlikely sentences, it helps reduce error rates [4]. A training corpus usually consists of databases and digital collections of text [4], collected from sources such as books and news articles.

Hypothesis search

Based on the probabilities from the AM and LM, several possible hypotheses are evaluated to determine which is the most likely to have been said [4]. In the end, the word sequence with the highest probability of matching the input speech is output [4,20,27,31]. Even though ASR systems usually output the one-best sequence, multiple hypotheses are generated during the decoding process [14,21].
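As a rough sketch of the hypothesis search described above, each hypothesis can be scored by combining AM and LM log-probabilities and the highest-scoring one output. The field names, scores and the `lm_weight` parameter are illustrative assumptions, not details from the cited systems.

```python
def best_hypothesis(hypotheses, lm_weight=1.0):
    """Return the hypothesis with the highest combined AM + LM score."""
    def score(h):
        # Log-probabilities are added; lm_weight balances the two models.
        return h["am_logprob"] + lm_weight * h["lm_logprob"]
    return max(hypotheses, key=score)

# Two competing hypotheses for the same utterance (made-up numbers).
hypotheses = [
    {"text": "quarantine is not over", "am_logprob": -12.1, "lm_logprob": -8.4},
    {"text": "quarantine is now over", "am_logprob": -11.9, "lm_logprob": -9.0},
]
print(best_hypothesis(hypotheses)["text"])  # → quarantine is not over
```

In real systems the LM weight is tuned on held-out data, and the search is performed incrementally over a lattice rather than over a small explicit list.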
Confusion networks and lattice structures can be used to provide information about alternative hypotheses on sub-level words [14,21].

Transcription uncertainty and errors

DNN variants have been shown to be significantly more effective for ASR, reducing the overall error rate compared to previous models [11,20,31]. Transcription errors occurring with ASR are usually categorized into three types:

Substitution: Refers to when a different word is transcribed compared to what was actually said [4,5].

Deletion: Refers to missing words in the transcription, meaning that the system does not recognize a spoken word and nothing is transcribed [4,5].

Insertion: Refers to extra words, meaning that a word is transcribed when nothing was actually said [4,5].
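These three error types are exactly what the standard word error rate (WER) metric counts. A minimal sketch, not from the paper, of counting them with a Levenshtein alignment between a reference transcript and an ASR hypothesis:

```python
def count_errors(reference, hypothesis):
    """Return (total_edits, substitutions, deletions, insertions)."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] holds the best (edits, subs, dels, ins) aligning ref[:i] to hyp[:j].
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        e, s, d, ins = dp[i - 1][0]
        dp[i][0] = (e + 1, s, d + 1, ins)            # deletions only
    for j in range(1, len(hyp) + 1):
        e, s, d, ins = dp[0][j - 1]
        dp[0][j] = (e + 1, s, d, ins + 1)            # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                cand = [dp[i - 1][j - 1]]            # match, no edit
            else:
                e, s, d, ins = dp[i - 1][j - 1]
                cand = [(e + 1, s + 1, d, ins)]      # substitution
            e, s, d, ins = dp[i - 1][j]
            cand.append((e + 1, s, d + 1, ins))      # deletion
            e, s, d, ins = dp[i][j - 1]
            cand.append((e + 1, s, d, ins + 1))      # insertion
            dp[i][j] = min(cand)
    return dp[len(ref)][len(hyp)]

# The paper's example: "not" misrecognized as "now" is one substitution.
print(count_errors("quarantine is not over", "quarantine is now over"))
# → (1, 1, 0, 0); WER = total_edits / number of reference words = 1/4
```

WER is then the total number of edits divided by the number of words in the reference.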

Transcription errors are related to concepts of uncertainty. Missing or incorrect data is a type of uncertainty classified as completeness [29]. Skeels et al [29] found that people think the worst kind of missing information is the kind you do not even know is missing. Another type of uncertainty is inconsistency in data [29]. When concerns about incorrect data lead to uncertainty about the rest of the data, the uncertainty is a matter of credibility [29].

Recognition-primed decisions and uncertainty

Klein [17,18] has developed a model called Recognition-Primed Decisions (RPD), referring to situations where people have to make quick decisions in complex real-world settings. The RPD model relates to human decision-making and includes three strategies, depending on the problem the person is facing [17,18]. The first and simplest strategy is when the person reaches a decision instantly and implements it; the second is when the person has to evaluate the decision before implementing it. The third is the most complex and refers to when decisions are judged to have flaws or be inadequate, and modifications or other decisions need to be made. Traditional decision-making theories are about evaluating alternatives or assessing probabilities, while RPD is not only about decision making but also includes aspects of judgement and problem solving [17]. The process of problem solving involves four phases: interpretation of the problem, developing possible solutions, trying a solution, and evaluating the outcome [12]. Instead of analyzing alternatives, RPD is more focused on acting, making quick decisions and evaluating promising outcomes [17]. Klein argues that persons who make quick rational decisions base them on experience; otherwise, the person is required to process several alternative decisions [18].

How people handle uncertainty is referred to as uncertainty orientation [30] and may differ depending on personality. Uncertainty-oriented persons seek out information in order to gain understanding and are a “need-to-know” type, while certainty-oriented persons rather ignore uncertain data, with the exception of familiar situations that confirm their understanding of the world [30]. There are also reasons why people either deal with or avoid uncertainty, depending on its perceived advantages or disadvantages. People may desire to seek out and reveal uncertain information to solve specific goals and gain new knowledge [30]. People may prefer to avoid uncertainty since it is unpleasant, but also when it is ambiguous and conflicts with other information [30]. When uncertain information is perceived as negative, people will want to avoid it [30].

Uncertainty visualization

Support interpretation of data

Previous studies have shown that visualizing uncertainty can support users’ interpretation of data and support decision-making. Jung et al [13] found that presenting ambiguously uncertain information about the remaining battery life of an electric vehicle had beneficial interpretation effects. Instead of hiding the uncertainty, it led to improvements in driving experience and style, which became smarter and more adaptive to the remaining battery life. Similarly, Kay et al [15] show that discrete outcomes can help users interpret and estimate transit predictions in a mobile context. Since transit predictions are often presented as a point estimate, people are unaware of the related uncertainty in real-time estimations. Discrete outcomes resulted in more accurate decisions based on the uncertainty information [15]. Greis et al [9] studied conflicting sensor data, arguing that the amount of presented uncertainty information affects users’ interpretation. They found improvements in users’ estimation when sensors differed largely and visualizations were information-rich.

While evidence shows that uncertainty visualization may benefit users, it can also lead to increased cognitive load and confusion when people need to make quick decisions [8]. Lim and Dey [22] found that uncertainty visualization decreases users’ impression of applications that behave appropriately but with low certainty. Greis et al [8] argue that uncertainty visualization requires careful reasoning about the complexity of the uncertain data and what information is necessary to visualize.

Increase trust

Users are often exposed to uncertain data, which is often visualized as if it were accurate [2,16]. The effect is that users tend to rely on this data more than its correctness warrants [16]. People who notice uncertainty in data perceive it as untrustworthy [29]. Previous studies have shown that revealing uncertain data can increase trust. Kay et al [16] found that presenting the uncertainty associated with weight data can increase trust, since users otherwise tend to interpret single-point estimates as more accurate than they actually are. Similarly, the research by Jung et al [13] also led to increased trust by presenting uncertainty, and Ferreira, Fisher and König [6] found that people feel more confident in their work with sample data.

Visualization techniques

Uncertainty visualization research has shown common effects and outcomes of uncertainty visualization, although it is still unclear if suggested visualization techniques can be used outside the specific context for which they were developed. One visualization technique is using distribution [29], which Kay et al [15] used by presenting discrete outcomes. Two other common techniques are juxtaposition [2] and the use of colors [2,29]. Juxtaposition refers to visualizations of additional data that are provided nearby or side by side with the uncertain data [2]. Color components such as hue and saturation can be used in various ways to visualize uncertainties [2].
Hengl [10] used colors to visualize degree of uncertainty in a geographic information system.

DESIGN WORK

The literature set the starting point for the design work by charting a range of possibilities for visualization of uncertainty in AI-generated subtitles. A user-centered approach was used to create a design that communicates uncertainty in a useful way.

Method

The design work was performed through an iterative design process. The work was informed by ongoing feedback sessions with a total of 9 people, recruited through convenience sampling. Initial iterations focused on creating multiple potential concepts with sketches and were informed by 2 experts working with AI and subtitles. Subsequent iterations were performed with a total of 7 users (5 male and 2 female), all generally familiar with using subtitles. Some of them participated in several iterations, and each session was performed in person, one-by-one. Sessions used high-fidelity prototypes, letting users watch images or video mockups presented on either a computer screen or a TV. This was followed up in unstructured open discussions with questions and topics related to each iteration. Data was collected with notes, and since participants did not directly interact with the prototype, no observational data was collected.

Design iterations

Several sketches were initially created to explore possible ideas (see Appendix A – some decisive sketches). Given the nature of subtitle functionality, the realistic concepts related to probability and uncertainty were limited. One component to investigate further was the use of probabilities on each word, not only on uncertain words. By visualizing high probabilities, transcriptions could potentially be perceived in a positive sense rather than focusing on errors. Another interesting component to investigate was the use of probabilities on sub-level words. ASR systems generate multiple hypotheses during the decoding process [14,21] that could be presented in the subtitles. With inspiration from previous work, juxtaposition [2] and distribution [15,29] were promising visualization techniques to be used with these probability components.
Juxtaposition in combination with sub-level words could suggest alternative words, which could potentially improve transcription intelligibility. Distribution inspired visualizing probabilities in several states, not just for the most uncertain words. In the following iterations, the sketches increased in fidelity into still images (see Appendix B – some decisive mockups), focusing on how the possible concepts would be visualized. Color components proved to be important for highlighting probabilities and uncertainties. Font styles and error symbols were other highlighting techniques investigated, but they proved less suitable. Colors were more accessible since they were easier to distinguish than font styles, and they seemed more reasonable than error symbols. How colors would be visualized, and which colors to use in the different concepts, was also investigated to improve accessibility. For the juxtaposition technique, the position of the additional words was important for the concept to be accessible. Transcription based on sub-level words needed to be close to the rest of the transcription.

Eventually, the iterations led to video prototypes to investigate how users would perceive the visualization techniques in motion and with real-time subtitle functionality. One design issue encountered was that some visualization techniques were less accessible depending on the real-time subtitle functionality. For example, subtitles presented in a traditional manner, on two static rows, gave users less time to perceive the transcription on the bottom row, since it disappeared when the row was full. This was a problem if the last word carried a visualization. Subtitles presented with a rolling functionality instead gave the user more time to perceive the transcription. In the following, the proposed set of designs is presented together with test results from the last iteration, and the design choice for the evaluation is discussed.

Proposed designs and result

The iterative test and evaluation cycle yielded three design concepts that had all received positive feedback from users. The concepts differed in probability and uncertainty functionality but had the same real-time subtitle functionality and a similar visualization style in terms of font and color, a result of appropriate accessibility. The concepts and results are presented below.

Probability intervals

This concept highlights all transcribed words with a color depending on the word’s probability (Figure 1). With three probability intervals, the words with the highest probability are highlighted green, those in the middle yellow, and the most uncertain words red. The visualization technique has similarities to distribution [2] and discrete outcomes [15] in the way it visualizes the probability of each word.

Figure 1. The concept of Probability intervals
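The Probability intervals concept can be sketched as a simple mapping from per-word ASR confidence to a highlight color. The interval boundaries (0.9 and 0.7) are illustrative assumptions, not values from the study.

```python
def color_for(confidence):
    """Map a word's ASR confidence to one of three highlight colors."""
    if confidence >= 0.9:
        return "green"   # high probability
    if confidence >= 0.7:
        return "yellow"  # middle interval
    return "red"         # most uncertain

# Hypothetical per-word confidences for one subtitle line.
words = [("Quarantine", 0.97), ("is", 0.99), ("now", 0.55), ("over", 0.93)]
print([(w, color_for(c)) for w, c in words])
# → [('Quarantine', 'green'), ('is', 'green'), ('now', 'red'), ('over', 'green')]
```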

Early in the design phase, the experts were concerned about the increase in cognitive load this concept could cause, although it could support users’ interpretation of sentence structure through the way it visualizes probability. As the experts had anticipated, users experienced this concept as confusing. The highlighted colors drew their attention, making it difficult to focus on the text. Nor did the concept support their interpretation or understanding. Several colors were perceived as unnecessary, since a word is either correct or wrong and only one color is needed to highlight either case. The yellow words in particular made users think that a word was as likely to be correct as wrong, and therefore led to confusion rather than support. As a result, users began to focus on only one color, either the green or the red words. One user said that he skipped every word highlighted with yellow or red, focusing on understanding the transcription from the green words only.

Alternative words

Since ASR systems generate multiple hypotheses and probabilities on sub-level words during the decoding process [14,21], this concept is based on that functionality. Every word with high uncertainty is highlighted and presented together with the alternative word that has the second highest probability of being correct (Figure 2). The alternative word is presented at the same time, next to the uncertain word, inspired by the juxtaposition visualization technique [2]. The position of the alternative word was important for users to seamlessly perceive the word without losing focus on the rest of the transcription.

Figure 2. The concept of Alternative words
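The Alternative words concept can be sketched from a confusion-network-style output, where each word slot carries ranked (word, probability) alternatives: when the top word is uncertain, the runner-up is shown alongside it. The 0.7 threshold and the parenthesis rendering are illustrative assumptions, not the study's actual presentation.

```python
def render_slot(alternatives, threshold=0.7):
    """Render one word slot; append the runner-up when the top word is uncertain."""
    ranked = sorted(alternatives, key=lambda wp: wp[1], reverse=True)
    best_word, best_prob = ranked[0]
    if best_prob < threshold and len(ranked) > 1:
        return f"{best_word} ({ranked[1][0]})"  # show second-best next to it
    return best_word

# Hypothetical confusion-network slots for the paper's example sentence.
slots = [
    [("quarantine", 0.98)],
    [("is", 0.99)],
    [("now", 0.55), ("not", 0.41)],
    [("over", 0.95)],
]
print(" ".join(render_slot(s) for s in slots))
# → quarantine is now (not) over
```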

Users experienced this concept as advantageous in some cases, since it can present the correct word and make the transcription easier to understand. However, the alternative word was also a disadvantage. Since the ASR cannot be certain about the second word either, some users became more confused. When users were uncertain whether any of the words was correct, they started to think twice, which for some users resulted in losing track of the transcription. One user also mentioned that he perceived every alternative word as correct, even though this was not always the case. One user, who considered this the worst concept for supporting his interpretation, said that real-time subtitles change state too fast for him to have time to read the alternative words.

Highlighting uncertainty

This concept highlights only the most uncertain words (Figure 3). It is the most simplified concept, using only color as a visualization technique [2,29].

Figure 3. The concept of Highlighting uncertainty
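The Highlighting uncertainty concept reduces to marking only the words whose confidence falls below a threshold. In this sketch, asterisks stand in for the color highlight, and the 0.7 threshold is an illustrative assumption.

```python
def render(words, threshold=0.7):
    """Mark only low-confidence words; leave the rest of the line untouched."""
    return " ".join(f"*{w}*" if c < threshold else w for w, c in words)

# Hypothetical per-word confidences for one subtitle line.
words = [("Quarantine", 0.97), ("is", 0.99), ("now", 0.55), ("over", 0.93)]
print(render(words))
# → Quarantine is *now* over
```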

Users experienced these subtitles as the easiest to read and follow. While the other concepts took too much attention from the text, users experienced this one as less distracting when a word was highlighted. The visualizations were perceived as clear given their singular purpose, making users aware of possible errors. However, even with this concept, users experienced that the visualizations could be distracting for a short moment. Another concern raised by a user was that it does not offer any guidance or complement to support interpretation. However, the overall perception was that the transcription quality is high enough for users to interpret the sentences even when errors occur.

Design rationale

The results from the last design iteration and testing show that uncertainty visualizations can be seen as distracting and take users’ focus away from the text. The Highlighting uncertainty concept proved to be the least distracting, seen as the most straightforward concept with a single purpose and therefore not as demanding and distracting as the other concepts. Probability intervals provides three different sources of uncertainty information, which seemed to increase cognitive load. That users perceived the multiple visualizations as unnecessary, leading them to focus on one source, suggests that users only want to know whether a word is correct or wrong. That one user focused only on the green words in this concept also suggests that users may trust the visualizations to always be true. Similarly, in the Alternative words concept, one user said that all alternative words were correct. Alternative words proved to have both advantages and disadvantages. It supported users in decision-making and interpretation of the transcription, while also making it difficult to continue reading. When users could easily decide whether one of the words was correct, they easily interpreted the transcription. However, users quite often could not determine the correct word and lost focus on the text: first they considered whether any of the alternatives was correct, and then tried to figure out which word was the correct one. The results suggest that Alternative words may improve the transcription’s intelligibility and support interpretation. However, due to the low certainty on sub-level words, the alternative word was too often wrong, making users lose track of the rest of the transcription.
The concept of uncertainty is complex enough for humans as it is, and the results suggest that visualizations in subtitles should avoid introducing additional uncertainty for users.
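The Alternative words concept described above can be sketched as showing the next-best recognition hypothesis alongside a low-confidence word. This is a minimal illustrative sketch, not the study's implementation: the confidence values, n-best alternatives, threshold, and bracket rendering are all invented.

```python
# Sketch of the "Alternative words" concept: when a word's ASR confidence
# falls below a threshold, show the next-best hypothesis beside it.
# All values below are hypothetical.
def with_alternatives(words, threshold=0.85):
    """words: list of (best_word, confidence, alternative_word) tuples."""
    out = []
    for best, conf, alternative in words:
        out.append(f"{best} [{alternative}?]" if conf < threshold else best)
    return " ".join(out)

nbest = [("the", 0.99, "a"), ("suspect", 0.60, "subject"), ("was", 0.97, "is")]
print(with_alternatives(nbest))  # the suspect [subject?] was
```

As the design rationale notes, when both the best word and its alternative have low confidence, the reader must weigh two uncertain candidates at once, which is exactly where this concept became demanding.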

Since Highlighting uncertainty conveys the least amount of uncertainty information and is the easiest to keep reading with, this concept was chosen for the evaluation. While Alternative words may improve transcription intelligibility, this happens rarely because of the low certainty of the sub-level words; instead it takes focus away from the rest of the text. With higher certainty for sub-level words, Alternative words could potentially be more advantageous than the other concepts. Users expressed concerns about the lack of guidance in Highlighting uncertainty, but also said that the overall quality of the transcriptions was good enough that they could correct errors themselves. The visualizations were perceived as less demanding, making it easier to continue reading. This suggests that users read more of the transcription with Highlighting uncertainty and get a better understanding of the whole, which in turn makes it easier to correct errors.
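The chosen Highlighting uncertainty concept reduces to a simple threshold over per-word confidences. The sketch below is an assumption-laden illustration (the prototypes in this study were pre-rendered videos, and the threshold, confidence values, and marker style are invented), but it shows the core idea: mark only words the recognizer is unsure about.

```python
# Sketch of "Highlighting uncertainty": mark words whose ASR confidence
# falls below a threshold. Confidence values here are hypothetical; real
# ASR systems expose per-word confidence scores in various ways.
THRESHOLD = 0.85

def highlight_uncertain(words, threshold=THRESHOLD):
    """words: list of (word, confidence) tuples; uncertain words get *marked*."""
    return " ".join(
        f"*{w}*" if conf < threshold else w
        for w, conf in words
    )

hypothesis = [("the", 0.98), ("suspect", 0.62), ("was", 0.97), ("arrested", 0.91)]
print(highlight_uncertain(hypothesis))  # the *suspect* was arrested
```

The single-purpose nature of this rendering, one visual channel, one meaning ("this word may be wrong"), is what users in the design iterations found least demanding.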

EVALUATION
A user study was conducted in order to evaluate the use of uncertainty visualization in AI-generated subtitles during news broadcasts. In particular, qualitative data was collected to assess whether people perceive the visualizations to support their interpretation and error detection, and how the visualizations influence trust towards AI-generated subtitles used in news broadcasts. To do this, the evaluation compared two prototypes: one designed with uncertainty visualization and one without.

Method
The evaluation was performed by presenting the prototypes to the participants, followed by semi-structured interviews. A within-group design was used, meaning that all participants saw both the prototype with uncertainty visualization and the prototype without. To control for learning effects and fatigue, a Latin square design [19] was used, switching the order in which each participant saw the prototypes (see Appendix C - Participant list). By seeing both prototypes, participants would gain some experience from the first that might affect their perception of the second, especially since they were seeing this kind of visualization in AI-generated subtitles for the first time.
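With only two conditions, the Latin square counterbalancing described above reduces to alternating which prototype each participant sees first. A small sketch (condition names are illustrative, not the study's labels):

```python
# Sketch: counterbalance presentation order across participants.
# With two conditions, a 2x2 Latin square reduces to alternating
# A-first / B-first assignments. Condition names are illustrative.
def assign_orders(participants, conditions=("with_viz", "without_viz")):
    a, b = conditions
    return {
        p: (a, b) if i % 2 == 0 else (b, a)
        for i, p in enumerate(participants)
    }

orders = assign_orders(["P1", "P2", "P3", "P4", "P5", "P6"])
# P1, P3, P5 see the visualization prototype first; P2, P4, P6 see it second.
```

This balancing means any learning or fatigue effect from the first viewing is spread evenly across both prototypes.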

Before seeing each prototype, a brief introduction explained the concept of AI-generated subtitles and uncertainty visualization. Each test session started with the participant watching one video prototype, directly followed by questions to collect their immediate thoughts and experience. The same procedure was repeated for the second prototype. After both prototypes, a more comprehensive interview was conducted to capture their overall experience (see Appendix D - interview guide). No observational data was collected, since participants did not directly interact with the prototypes.

Test setup
Test sessions were performed through video calls due to the pandemic. To ensure a similar test environment for all participants, they were told to watch the videos alone, in a calm environment, on a TV screen, and to watch each video only once.

Design
Prototype 1 used uncertainty visualization with the Highlighting uncertainty concept created during the design phase. Prototype 2 was created as standard AI-generated subtitles, without uncertainty visualizations (see Appendix E - prototypes). The transcriptions in the prototypes were modified to ensure the same level of difficulty, with the same kinds and amounts of errors. The spoken language in the videos was the participants' native language, and since the study focused on interpretation of the subtitles, both videos were presented without sound.

Participants
Six participants were recruited for the evaluation (four male and two female), all generally familiar with using subtitles, which was a criterion for participating.

Results
Thematic analysis of the transcripts was used to identify high-level themes that emerged from the data [1]. The analysis was approached inductively, resulting in three high-level themes (Figure 4), each with sub-level themes (see Appendix F - Coding table). The results are presented and discussed by sub-level theme below.

Real-time subtitles affected the experience
Several participants experienced that the real-time subtitles moved fast and made reading difficult. P2 and P4 found them difficult to keep up with. P1 and P3 felt that this affected their understanding of the whole story, since they read text in other ways. P3 explained:

You concentrate on, “what will be next, what will be next, what will be the next word”. Therefore, it becomes very jerky and it becomes harder to understand the whole. Because I may not be one who reads word for word really, instead I am one who reads every third word, and sees the words surrounding them. Like speed reading.

P1 and P4 talked about standard subtitles that do not appear in real time. P4 expressed that when subtitles come in whole sentences and two rows, you have time to keep up and you understand the whole. P1 said something similar:

Because, when you read subtitles in movies you see all the text, you get to read at your own pace. But here you do not read at your own pace, because here it is written at the same time as you watch.

Uncertainty visualizations create additional uncertainty
All participants expressed that the uncertainty visualization was distracting and disturbing and preferred to watch without it. They described how it drew their attention towards the visualization and made them lose focus on the rest of the text. P1 explained:

Figure 4. Overview of the identified themes that emerged during thematic analysis. The clouds show the three high-level themes.

Since it is a marking in a flowing text, it is something that should get my attention, and when you want to get someone's attention, you want to, "here look at me". Then it becomes like you automatically stop and think, what is this.

P1, P2, P3 and P6 all explained that the loss of focus came from starting to wonder whether the word was correct or not, and what the correct word was. P3 expressed:

Then you start to wonder, “what should this be”, and then I stopped reading and started thinking on that word instead.

P1, P2, P3 and P6, referring to the loss of focus, found it more convenient to watch without the visualization, and therefore easier to interpret, understand and read the text. P6 expressed:

Then you can interpret a little yourself what is wrong. Then you read more of the whole. When it was yellow, you reacted more to it and that it should be wrong.

Expectations of subtitles
P1, P2, P4, P5 and P6 expect subtitle transcriptions to always be correct; otherwise they do not fulfill their purpose. P5 and P6 mentioned that some errors can be acceptable, P5 expressed it “as long as you understand the whole”. P4 explained that subtitles are like an extra source of information about what is happening:

It should support. If you do not understand what they say in the video, you would like to get a confirmation in the text of what was said... It should reflect reality as best as possible.

P1 expects the timing between subtitles and video to be correct, so that the text does not appear too early or too late. P3 and P4 discussed the importance of accessibility in subtitles: they should be clear to watch and easy to follow. P4 described it:

You should not lose focus, so you can still, like, see the text... Like people with poor eyesight and such. It should work for everyone… It should not disturb too much.

Transcription understanding
When asked about their understanding of the text, all participants expressed that they understood the overall story. Everyone also noted transcription errors in both prototypes. P1, P2, P5 and P6 perceived errors that were not visualized in prototype 1 (the video with uncertainty visualization). P2 further expressed that some visualizations were wrong and some correct. P6 perceived that smaller grammatical errors are harder to detect, while incorrectly transcribed words that stand out are easier. P4 perceived the correct words to be similar to the errors and therefore easy to interpret. P5 emphasized the difficulties that people with disabilities may have in understanding AI-generated subtitles. P5 described:

If you have different difficulties, such as writing or reading disabilities and such, it can probably be very difficult for them to understand what they mean [in the transcription].

P1, P2, P3 and P6 discussed familiarity with the topic as decisive for understanding the subtitles. Familiarity made it easier to understand the overall story; P3 and P6 described the weather forecast as something you have been familiar with since forever, so its vocabulary is easy to interpret. P3 also expressed how familiarity supports interpretation and why it may support error correction:

Where I have some knowledge, it is probably easier to understand the whole and each word that may be wrong… But if I am not particularly familiar with a subject, then it becomes more difficult to connect these words, "is it wrong or is it right”, it may be the right word even though it sounds wrong in my head.

AI is perceived as more precise than it actually is
P3, P4 and P6 all expressed trust towards the transcription. Although P6 was aware that the AI might be wrong with its uncertainty visualizations, P6 still believed that all visualizations were transcription errors. Continuing, P6 stated that a news story in prototype 1 (the video with uncertainty visualization) had at least 10 errors; the prototype had 8 transcription errors, 3 of which were visualized. P4 reflected on the transcription without uncertainty visualization and expressed that it is perceived as correct:

If there is nothing to indicate that it is wrong. Then I would at least think that this must be true if nothing says otherwise.

P3 also talked about this when discussing prototype 2 without uncertainty visualization:

It is just like you are used to with subtitles that everything is correct, which means that it can be a little wrong without you knowing it.

Perception of the amount of transcription errors
Despite both prototypes having the same number of errors, most participants perceived prototype 1 (the video with uncertainty visualization) as having the most transcription errors. P3 perceived a few more errors in prototype 1. P6 and P4 stated that the uncertainty visualizations were interpreted as transcription errors and that they saw additional errors beyond them. P6 described:

Maybe it was the yellow markings, some might not be errors. But I thought they were errors because they were yellow. Then I saw lots of other errors and then it feels like it was most errors in that video.

P1 was the only one to mention that prototype 2 (the video without visualization) might actually have more transcription errors. P1 reflected that the uncertainty visualizations may have led to less focus on the errors in prototype 1. P1 expressed it:

Since I had to look for the errors, I would probably say video 2 [prototype 2], but it might as well be more errors in video 1 because I got everything served and did not have to focus so much.

Uncertainty visualization supports error detection
A common thread shared by the participants was that uncertainty visualization makes you aware of potential errors. P5 commented that it is “good, because you can see the errors”. P2 experienced that it supports detecting minor grammatical errors, while substituted words “out of context” were easier to detect regardless, with or without visualizations. P6 stated, “Then I thought about the errors more” while discussing uncertainty visualization. In contrast, P3, referring to prototype 2 (the video without visualization), said that errors go unnoticed:

Those who could be wrong, it was not really noticed, it was not as clear. Which made me keep reading and didn't think as much about it.

P1 was concerned about the usefulness of uncertainty visualizations, pointing to the system's lack of support for error correction. P1 explained the uncertainty visualization as:

It doesn’t matter to me. I just know that "here it marks that it maybe is wrong. Okay, I note it, but I do not know what it should be and I could not go back and read it properly"... If I'm not really quick-thinking, it doesn’t matter.

Uncertainty visualization affects trust
P2, P3 and P4 discussed that uncertainty visualization should make the subtitles more trustworthy, since it makes you aware of possible errors. P2 expressed that “you notice if things have gone wrong and faster”, comparing with the version without uncertainty visualization: “No it is not certain that you react. If it is yellow, then you react immediately”. P3 had the same perception at first, but then changed opinion:

Because it [prototype 1, with visualizations] still gives a heads up if it could possibly be wrong, but at the same time you do not know if it is actually wrong or right. So it can fool you there, too. No, I'll probably take back what I said. No one is more reliable than the other, I think.

While P4 discussed that uncertainty visualization should make the subtitles more trustworthy, P4 also said: “I trust the one without more, because then it seems that nothing is wrong. There is nothing that says it is wrong”. P1 also found the version without more trustworthy, referring to the uncertainty visualization as distracting:

I get to look for errors myself and there I am following the context better and then I understand the whole better. And then I see “there it was wrong and there it was wrong”.

Trustworthiness depends on the errors, topic and TV program
Participants' trust towards AI-generated subtitles varied depending on factors related to the errors, the topic and the TV program. P2 stated that “since there were errors in them, they are not really trustworthy”, while P6 expressed that they may be trustworthy, but all the errors make the news look a little silly sometimes. P5 also mentioned the amount of errors as a factor: many errors, and especially grammatical errors, make the subtitles less trustworthy. P1 talked about deletion errors as a risk where the perception of the news could be affected. P1 explained:

For example, say a word is forgotten, such as "not". Take the murder of Palme as an example: there the person was not suspected of murder, but in this case the person was suspected of murder. So if the word "not" disappears, or a word is perceived in a very different way, I could interpret the whole news story and its context completely wrong.

Another factor the participants perceived as decisive, when it comes to trust, is how familiar you are with the topic. P2, P3 and P4 expressed that it was easier to understand and interpret the transcription when it was about a topic you have heard about before. P2 described specific topics from prototype 1 (video with uncertainty visualization):

If you understand what they are talking about then you can trust them. If you do not understand it, then it is not possible… For example, the murder of Palme, I know about those things since before, I know what they're talking about. It is worse for someone who has never heard of these things, then that would be a problem.

P4 talked about this from a general perspective, saying:

I trust the news, but as I said, what I am not so informed, or what you know very little about. There I wonder if I have interpreted correctly or not. Which may cause it to lose some reliability.

P3 expressed similar thoughts and added that “it is probably not great when it comes to news”, given the risk of conveying fake news. P2, P3, P4 and P5 considered the news an important source of information that must be interpreted correctly. They discussed that AI-generated subtitles used in movies and sports would be acceptable and trustworthy, since it would not be as critical if the transcription were misinterpreted. P2 discussed this topic:

It is more important when it comes to news than if it is a movie. Because then it does not matter, but if it is news it is important. Very important things can be said in it.

DISCUSSION
Perceived advantages and disadvantages
The results indicate that uncertainty visualizations are perceived as important, since they make viewers aware of possible errors and prevent misinterpretation. Regardless of these benefits, the results imply that users do not like uncertainty visualizations in the subtitles. Why uncertainty visualizations are perceived as having both advantages and disadvantages may relate to how people cope with uncertainty. People cope with uncertain information if it supports them in solving problems and gaining new knowledge [30]. Participants expressed that the visualizations supported error detection by marking possible errors and also reduce the risk of spreading fake news. Both factors manifest the perceived advantages: people gain new knowledge through the uncertainty visualizations, which partially solve a problem by reducing the risk of spreading fake news. The negative perception, on the other hand, was that the visualizations are disturbing and do not support interpretation. Several participants referred to the cognitive effort that the visualizations engender, forcing the viewer to reflect on the uncertain word and its possible alternatives in real time. This corresponds with why people would rather avoid uncertain information and perceive it as negative: uncertainty is unpleasant and something people prefer to avoid, especially when it is ambiguous and conflicts with other information [30]. This suggests that people perceive the visualizations as ambiguous, since they do not offer the correct transcription, while also taking focus away from the rest of the text and making reading uncomfortable. This reasoning explains why people perceived disadvantages with uncertainty visualization while also seeing advantages, since it supports problem solving.

Recognition-primed decisions in real-time subtitles
Why people do not actually like uncertainty visualization may have its cause in the real-time presentation. Participants experienced the pace as unpleasant, giving no time to reason about the visualization, which conflicts with the rest of the transcription. This suggests that the uncertainty visualization initiates a problem-solving process and the real-time presentation forces the person to solve the problem quickly. Quick decisions in complex situations relate to Klein's RPD model [17,18]. In situations where the uncertainty visualization was perceived as positive, the participant may only have entered the first RPD strategy state, quickly reaching a decision and implementing it. Hence, the uncertainty visualization was not perceived as disturbing and did not take focus from the reading.
However, when participants had to process the more complex RPD strategies, a quick decision was no longer possible; the problem-solving process had to re-evaluate several outcomes, making them more uncertain. When participants failed to resolve the transcription error, the visualization was instead associated with negative outcomes. This further resulted in loss of focus on the rest of the text, not supporting them in gaining new knowledge but instead creating an unpleasant situation that people want to avoid. This suggests why users preferred to watch without uncertainty visualization and perceived it as more convenient.

The utility of uncertainty visualizations
Although participants preferred watching without visualizations, the results show that several participants then perceived fewer transcription errors, indicating the usefulness of uncertainty visualizations in error detection. This suggests that uncertainty visualization does support users in their interpretation of AI-generated subtitles, since it helps them notice errors and take them into account in their interpretation. Participants seemed aware of this usefulness, expressing that the visualization notifies them of possible errors. However, it does not improve intelligibility for users: the visualization does not actually tell the user what the correct word is and provides no assistance towards improving intelligibility. This concerned one participant, who questioned its usefulness. The Alternative words design could potentially have alleviated this problem. The design study showed that it improved transcription intelligibility when the alternative word was correct, but this happened too rarely. Since the first transcription alternative already had low certainty, the second was even less certain, and hence both were very likely to be wrong. Participants expressed that they had to think about both words, which confused them more and took more focus from the reading. This suggests that revealing additional uncertain data of low certainty gives rise to more confusion. It corresponds to what Lim and Dey [22] found: systems with high certainty improve intelligibility, while low certainty decreases trust. Greis et al. [8] argue that careful reasoning is required about what uncertain data is necessary to reveal, due to its complexity, and the results of this study indicate that revealing more uncertain data creates additional complexity and user uncertainty. With improved certainty in ASR systems, revealing more uncertain data might support interpretation and improve intelligibility.

Uncertainty visualization influence on trust
The perceived trustworthiness of AI-generated subtitles varied mainly with the amount of errors. Participants expect transcriptions to always be correct, and errors decrease trust. Participants expressed that the news is less trustworthy if errors go unnoticed, and that it could be critical if important information were misinterpreted. Therefore, the majority of the participants considered uncertainty visualization more trustworthy, since it marks possible transcription errors.
Taking into account that participants actually perceived fewer errors without uncertainty visualizations, the risk of misinterpretation rises without them. This could lead to the news misleading people and to a risk of spreading fake news, resulting in decreased trust towards the news. With uncertainty visualizations, people are at least notified that a word might be incorrectly transcribed, and not as many errors would go unnoticed, indicating the importance of uncertainty visualization for trustworthy news. One participant perceived the version without visualizations as more trustworthy; this can be ascribed to the negative perceptions always associated with uncertain data, which affect the perception and trust of the visualizations. The same participant also reported finding fewer errors in the subtitles without uncertainty visualizations, which might be a factor in why they perceived the latter as more trustworthy, and indicates that users may believe they detect just as many errors without uncertainty visualizations. In line with a previous study on uncertainty visualization [16], the results suggest that users tend to trust AI-generated data as more precise than it actually is. Concerns were raised that you trust the transcription if nothing indicates errors.

Interpretation of transcription
Several participants expressed that familiarity with the topic made it easier to understand the whole and also to notice and correct errors. This correlates with Klein's [18] argument that experience makes it easier to take quick rational decisions. It suggests that participants only had to process the first RPD strategy in these cases. When they were not as familiar with the topic, they had to enter the more complex strategies, taking longer to solve the problem. As Klein [18] argues, less experience forces the decision maker to evaluate many alternative decisions. The errors easiest to detect without visualizations were words that somehow stood out, taken out of their context. This probably also relates to knowledge and familiarity with the sentences and words. However, it may also be due to the functionality of ASR systems, especially RNNs and their language models (LMs). An RNN uses previous predictions to influence upcoming predictions [20,28], and its LM bases predictions on likely sentence structures from the training corpus [4,31]. This generates high-quality transcriptions. Even when a transcription contains errors, its sentences look reliable, as if they were correct, because the system bases its predictions on likely sentences and previous predictions. This explains why it is difficult to distinguish and detect errors that do not stand out from their context, and it further implies why uncertainty visualization is important for notifying users of possible errors.

Limitations
This was the first time the participants were exposed to AI-generated subtitles in real-time presentation, which may have affected the results. Participants preferred to watch without visualizations, since the fast pace gave no time to reason about the visualizations.
Over time, people might get used to this presentation, and the visualizations might then be preferred due to their perceived advantages.

All test sessions during the evaluation were performed through video calls. Even though the procedure would not have been different had I met the participants face-to-face, this prevented full control of the test environment. To enable some control, participants were given setup criteria before the test (see Method for details). Given the pandemic ongoing during this study, this was the most convenient approach and close to the original plan.

CONCLUSION
This paper studied the use of uncertainty visualizations in AI-generated subtitles during news broadcasts. A user-centered design phase resulted in three proposed design concepts that visualize uncertainties in AI-generated subtitles. Providing alternative words, presumed to improve transcription intelligibility, gave rise to additional uncertainty, since the second alternative word was wrong too often. Of the three alternative visualization strategies, Highlighting uncertainties proved the least distracting, making it easier to continue reading and understand the whole. Based on user needs, this design concept was chosen for a qualitative comparison with standard AI-generated subtitles. The results suggest that uncertainty visualizations are perceived as having both advantages and disadvantages. They were perceived as disturbing and as taking focus away from the transcription; however, participants still understood what was said. Uncertainty visualizations were shown to support users' interpretation of transcriptions and their error detection. Advances in ASR generate trustworthy-looking sentence structures in which errors are difficult to distinguish, leading users to perceive the transcription as more precise than it actually is. Hence, uncertainty visualizations are important for helping people notice possible errors that would otherwise be difficult to distinguish. Errors that go unnoticed in the news may lead to misinterpretation of important information. I would argue that uncertainty visualization should be used in AI-generated subtitles during news broadcasts to prevent the risk of spreading fake news.

FUTURE WORK
An interesting extension to this study would be to look at long-term effects, letting people use the visualizations over a longer period of time and get used to them. Since people were exposed to this technology for the first time, the study carries a novelty effect for both the visualizations and the real-time presentation. The cognitive load that participants reported may be perceived differently after longer exposure to this technology.

The qualitative approach in this study provides a foundation for multiple quantitative follow-up studies. Since people seem to perceive fewer errors without visualizations, it would be interesting to confirm this through a user performance study, for example measuring success rate or the number of errors located in transcriptions. Another interesting aspect would be to study cognitive load, since several participants described the visualizations as demanding more focus. This could also be compared between visualization techniques: the results from the design phase showed that people perceived alternative words as more demanding, leading to loss of focus on the transcription. It would also be interesting to quantify trust and whether it differs between visualization techniques.

ACKNOWLEDGMENTS
This research was performed together with Sveriges Television AB (SVT), the Swedish public service television broadcaster. I would especially like to thank my supervisors and experts at SVT for their help and support in performing this research. I also thank my supervisor Mikael Laaksoharju for his support and guidance throughout this research. Lastly, thanks to all participants who helped and provided valuable feedback.

REFERENCES
[1] Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psychology. Qual. Res. Psychol. 3, 2 (January 2006), 77–101. DOI:https://doi.org/10.1191/1478088706qp063oa

[2] Ken Brodlie, Rodolfo Allendes Osorio, and Adriano Lopes. 2012. A Review of Uncertainty in Data Visualization. In Expanding the Frontiers of Visual Analytics and Visualization, John Dill, Rae Earnshaw, David Kasik, John Vince and Pak Chung Wong (eds.). Springer London, London, 81–109. DOI:https://doi.org/10.1007/978-1-4471-2804-5_6

[3] Li Deng and Xuedong Huang. 2004. Challenges in adopting speech recognition. Commun. ACM 47, 1 (January 2004), 69–75. DOI:https://doi.org/10.1145/962081.962108

[4] Gregor Donaj and Zdravko Kačič. 2017. Language Modeling for Automatic Speech Recognition of Inflective Languages. Springer International Publishing, Cham. DOI:https://doi.org/10.1007/978-3-319-41607-6

[5] Rahhal Errattahi, Asmaa El Hannani, and Hassan Ouahmane. 2018. Automatic Speech Recognition Errors Detection and Correction: A Review. Procedia Comput. Sci. 128, (2018), 32–37. DOI:https://doi.org/10.1016/j.procs.2018.03.005

[6] Nivan Ferreira, Danyel Fisher, and Arnd Christian Konig. 2014. Sample-oriented task-driven visualizations: allowing users to make better, more confident decisions. In Proceedings of the 32nd annual ACM conference on Human factors in computing systems - CHI ’14, ACM Press, Toronto, Ontario, Canada, 571–580. DOI:https://doi.org/10.1145/2556288.2557131

[7] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning (ICML ’06), Association for Computing Machinery, Pittsburgh, Pennsylvania, USA, 369–376. DOI:https://doi.org/10.1145/1143844.1143891

[8] Miriam Greis, Jessica Hullman, Michael Correll, Matthew Kay, and Orit Shaer. 2017. Designing for Uncertainty in HCI: When Does Uncertainty Help? In Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems - CHI EA ’17, ACM Press, Denver, Colorado, USA, 593–600. DOI:https://doi.org/10.1145/3027063.3027091

[9] Miriam Greis, Aditi Joshi, Ken Singer, Albrecht Schmidt, and Tonja Machulla. 2018. Uncertainty Visualization Influences how Humans Aggregate Discrepant Information. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems - CHI ’18, ACM Press, Montreal QC, Canada, 1–12. DOI:https://doi.org/10.1145/3173574.3174079

[10] Tomislav Hengl. 2003. Visualisation of uncertainty using the HSI colour model: computations with colours. Proc. 7th Int. Conf. GeoComputation (December 2003), 1.

[11] Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, and Brian Kingsbury. 2012. Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Process. Mag. 29, 6 (November 2012), 82–97. DOI:https://doi.org/10.1109/MSP.2012.2205597

[12] Nigel James Holt, Andrew Bremner, Ed Sutherland, Michael Vliek, Michael Passer, and Ronald Smith. 2012. Psychology: The Science of Mind and Behaviour. McGraw-Hill Education.

[13] Malte F. Jung, David Sirkin, Turgut M. Gür, and Martin Steinert. 2015. Displayed Uncertainty Improves Driving Experience and Behavior: The Case of Range Anxiety in an Electric Car. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems - CHI ’15, ACM Press, Seoul, Republic of Korea, 2201–2210. DOI:https://doi.org/10.1145/2702123.2702479

[14] Alexandros Kastanos, Anton Ragni, and Mark Gales. 2020. Confidence Estimation for Black Box Automatic Speech Recognition Systems Using Lattice Recurrent Neural Networks. arXiv:1910.11933 [cs, eess] (March 2020). Retrieved from http://arxiv.org/abs/1910.11933

[15] Matthew Kay, Tara Kola, Jessica R. Hullman, and Sean A. Munson. 2016. When (ish) is My Bus?: User-centered Visualizations of Uncertainty in Everyday, Mobile Predictive Systems. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, ACM, San Jose California USA, 5092–5103. DOI:https://doi.org/10.1145/2858036.2858558

[16] Matthew Kay, Dan Morris, mc schraefel, and Julie A. Kientz. 2013. There’s no such thing as gaining a pound: reconsidering the bathroom scale user interface. In Proceedings of the 2013 ACM international joint conference on Pervasive and ubiquitous computing - UbiComp ’13, ACM Press, Zurich, Switzerland, 401. DOI:https://doi.org/10.1145/2493432.2493456

[17] Gary Klein. 2008. Naturalistic Decision Making. Hum. Factors 50, 3 (June 2008), 456–460. DOI:https://doi.org/10.1518/001872008X288385

[18] Gary A. Klein (Ed.). 1993. Decision making in action: models and methods. Ablex Pub, Norwood, N.J.

[19] Jonathan Lazar, Jinjuan Heidi Feng, and Harry Hochheiser. 2017. Research Methods in Human Computer Interaction: Second Edition. Elsevier, Cambridge, MA.

[20] Jinyu Li, Li Deng, Reinhold Haeb-Umbach, and Yifan Gong. 2015. Robust Automatic Speech Recognition. Academic Press.

[21] Qiujia Li, Preben Ness, Anton Ragni, and Mark Gales. 2019. Bi-Directional Lattice Recurrent Neural Networks for Confidence Estimation. arXiv:1810.13024 [cs, eess] (February 2019). Retrieved from http://arxiv.org/abs/1810.13024

[22] Brian Y. Lim and Anind K. Dey. 2011. Investigating intelligibility for uncertain context-aware applications. In Proceedings of the 13th international conference on Ubiquitous computing - UbiComp ’11, ACM Press, Beijing, China, 415. DOI:https://doi.org/10.1145/2030112.2030168

[23] Yajie Miao, Mohammad Gowayyed, and Florian Metze. 2015. EESEN: End-to-End Speech Recognition using Deep RNN Models and WFST-based Decoding. arXiv:1507.08240 [cs] (October 2015). Retrieved March 13, 2020 from http://arxiv.org/abs/1507.08240

[24] Tim Morris. 2000. Speech Recognition. In Multimedia Systems: Delivering, Generating and Interacting with Multimedia, Tim Morris (ed.). Springer, London, 89–100.

[25] Ali Bou Nassif, Ismail Shahin, Imtinan Attili, Mohammad Azzeh, and Khaled Shaalan. 2019. Speech Recognition Using Deep Neural Networks: A Systematic Review. IEEE Access 7, (2019), 19143–19165. DOI:https://doi.org/10.1109/ACCESS.2019.2896880

[26] Ofcom. 2006. Television Access Services: Review of the Code and guidance. Retrieved from https://www.ofcom.org.uk/__data/assets/pdf_file/0016/42442/access.pdf

[27] Douglas O’Shaughnessy. 2008. Invited paper: Automatic speech recognition: History, methods and challenges. Pattern Recognit. 41, 10 (October 2008), 2965–2979. DOI:https://doi.org/10.1016/j.patcog.2008.05.008

[28] Stuart J. Russell and Peter Norvig. 2016. Artificial Intelligence: A Modern Approach (Third edition, Global edition ed.). Pearson, Boston.

[29] Meredith Skeels, Bongshin Lee, Greg Smith, and George G. Robertson. 2009. Revealing Uncertainty for Information Visualization. Inf. Vis. 9, 1 (May 2009), 70–81. DOI:https://doi.org/10.1057/ivs.2009.1

[30] Richard M. Sorrentino and Christopher R. J. Roney. 2000. The uncertain mind: individual differences in facing the unknown. Psychology Press, Philadelphia.

[31] Dong Yu and Li Deng. 2015. Automatic Speech Recognition. Springer London, London.

APPENDIX

Appendix A – Key design sketches

Figure 5 - Appendix A

Appendix B – Key design mockups

Figure 6 - Appendix B

Appendix C - Participant list

Participant   Prototype 1   Prototype 2
P1            First         Last
P2            Last          First
P3            First         Last
P4            Last          First
P5            First         Last
P6            Last          First

Appendix D - Interview guide

Questions after each video

• What were your first thoughts about the subtitles you just saw?

• Do you feel that you understood the text in the subtitles?

• Did you ever feel that you stopped reading and had to think about what the text said, or did you feel that you could read without making an effort?

Questions after both videos

• What are your expectations of subtitles?

o Did these subtitles meet your expectations?

• What did you think of subtitle version 1?

o Why do you think it was easy/difficult to understand?

• What did you think of subtitle version 2?

o Why do you think it was easy/difficult to understand?

• Which of the subtitles did you find easiest to understand? Why do you think that?

o Did you experience that one version had more errors than the other?

• Do you feel the subtitles are reliable? Why?

o Do you feel that one version is more reliable than the other? Why?

• Do you feel that your perception of the news was affected?

o Was your perception of how reliable the news was affected?

Appendix E - Prototypes

Prototype 1 – With uncertainty visualization

Prototype 2 – Standard AI-generated subtitles

Appendix F - Coding table

High-level theme   freq.   Sub-level theme                                                freq.
Decision-making    50      Perception of the amount of transcription errors                4
                           Uncertainty visualizations create additional uncertainty       32
                           Uncertainty visualizations support error detection             14
Cognitive load     48      Transcription understanding                                    30
                           Real-time subtitles affected the experience                    18
Trust              41      Uncertainty visualizations affect trust                         9
                           Trustworthiness depends on the errors, topic, and TV program   19
                           Expectations of subtitles                                      10
                           AI is perceived as more precise than it actually is             3