
Linköping University | Department of Computer and Information Science
Bachelor’s thesis, 16 ECTS | Datateknik
2021 | LIU-IDA/LITH-EX-G--21/048--SE

Randomness as a Cause of Test Flakiness
Slumpmässighet som en orsak till skakiga tester

Daniel Mastell & Jesper Mjörnman

Supervisor: Azeem Ahmad
Examiner: Ola Leifler

Linköpings universitet, SE-581 83 Linköping, +46 13 28 10 00, www.liu.se

Copyright (Upphovsrätt)

This document is made available on the Internet - or its possible future replacement - for a period of 25 years from the date of publication, provided that no exceptional circumstances arise. Access to the document implies permission for anyone to read, download, print single copies for personal use, and to use it unchanged for non-commercial research and for teaching. Transfer of the copyright at a later date cannot revoke this permission. All other uses of the document require the consent of the copyright owner. To guarantee authenticity, security and accessibility, solutions of a technical and administrative nature are in place. The author's moral rights include the right to be named as the author to the extent required by good practice when the document is used as described above, as well as protection against the document being altered or presented in a form or in a context that is offensive to the author's literary or artistic reputation or distinctive character. For additional information about Linköping University Electronic Press, see the publisher's website: http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Daniel Mastell & Jesper Mjörnman

Abstract

With today’s focus on Continuous Integration, test cases are used to ensure the software’s reliability when integrating and developing code. Test cases that behave in a non-deterministic manner are known as flaky tests, and they threaten the software’s reliability. Because of their non-deterministic nature, flaky tests can be troublesome to detect and correct. This causes companies to spend a great amount of resources on flaky tests, since they can reduce the quality of their products and services.

The aim of this thesis was to develop a usable tool that can automatically detect flakiness in the Randomness category. This was done by initially locating and rerunning flaky tests found in public Git repositories, and then scanning the resulting pytest logs from the tests that manifested flaky behaviour, noting indicators of how flakiness manifests in the Randomness category. From these findings we determined tracing to be a viable option for detecting Randomness as a cause of flakiness. The findings were implemented into our proposed tool FlakyReporter, which reruns flaky tests to determine if they pertain to the Randomness category. Our FlakyReporter tool was found to accurately categorise flaky tests into the Randomness category when tested against 25 different flaky tests. This indicates the viability of utilizing tracing as a method of categorizing flakiness.

Acknowledgments

The authors would like to thank the examiner Ola Leifler and the supervisor Azeem Ahmad for their guidance and help with the direction of our work. Thanks also to the students working in the same field for their help and discussions about the thesis.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
   1.1 Motivation
   1.2 Aim
   1.3 Research Questions
   1.4 Delimitations
       1.4.1 Flaky Tests
       1.4.2 Python
             1.4.2.1 Testing Framework

2 Background
   2.1 Flaky Test
   2.2 Continuous Integration
   2.3 Taxonomy of Flakiness
   2.4 Execution Tracing
   2.5 GitHub
   2.6 Python
   2.7 Unit Testing
       2.7.1 Unit Testing Framework
       2.7.2 Pytest

3 Related Works
   3.1 Empirical Studies
   3.2 Automatic Detection
   3.3 Automatic Fault Localization
   3.4 Automatic Flaky Categorization

4 Method
   4.1 Thesis Work Process
   4.2 Data Collection
       4.2.1 Log Files Generation
       4.2.2 Category Identification
   4.3 Data Analysis
       4.3.1 Frameworks
   4.4 Data Reporting
       4.4.1 Execution Traces
   4.5 FlakyReporter
       4.5.1 Rerun Flaky Test
       4.5.2 Trace Logs
       4.5.3 Execution Divergence
       4.5.4 Compare Return Values
       4.5.5 Compare Locals
       4.5.6 Compare Assertions
       4.5.7 Compare Partials
       4.5.8 Calculate Result
       4.5.9 Produce Report
   4.6 Evaluation

5 Results
   5.1 Rerunning Tests & Recreating Flakiness
   5.2 Analyzing Log Files
       5.2.1 Causes of Randomness
   5.3 Produced Report
   5.4 Evaluation
   5.5 Tracing

6 Discussion
   6.1 Results
       6.1.1 Flaky Tests
             6.1.1.1 Limitations
             6.1.1.2 Human Error
       6.1.2 Tracing
       6.1.3 FlakyReporter
   6.2 Method
       6.2.1 Creating a Report
   6.3 Source Criticism
   6.4 Replicability, Reliability and Validity
       6.4.1 The work in a wider context

7 Conclusion
   7.1 Tracing & Log Files
   7.2 Consequences
   7.3 Future Work

Bibliography

A Dataset of projects with flaky commits

List of Figures

2.1 Test Driven Development.

4.1 Workflow of thesis.
4.2 Github commit fixing Too Restrictive Range flakiness.
4.3 Found categories from public dataset and GitHub.
4.4 Found testing frameworks from projects in our created dataset.
4.5 Snapshot of two pytest fail messages from two iterations of the same test.
4.6 Flow of sys.settrace().
4.7 Overview of execution trace depth.
4.8 Flowchart of the method for analyzing trace logs.

5.1 Reduction of projects based on events.
5.2 Test case producing No Indications of Randomness.
5.3 Test case producing Many Indications of Randomness.
5.4 Iterations of produced report for test function.
5.5 Results from evaluation.

List of Tables

4.1 Test functions used for evaluating.

5.1 Categories and numbers of tests used for evaluation.
5.2 Results from running the tests with FlakyReporter.
5.3 Iterations of tests used for evaluation.

1 Introduction

1.1 Motivation

With today’s focus on Continuous Integration, a number of test cases are used to ensure that new code remains reliable together with the old. For each new edit in the code, all test cases must be rerun to ensure that the newly added code does not introduce any problems. Within these test cases there is a risk that flaky tests are created. Flaky tests are non-deterministic tests, meaning they fail to always produce the expected outcome. In some cases a test may run 200 times correctly before failing once, which creates uncertainty as to whether the test case or the code is the problem.

Companies consume both time and resources due to flaky tests, from finding the root cause to handling bugs created by negligence in test cases made for produced software [11, 15, 20]. According to Silva et al. [20] and Lam et al. [11], Google spends between 2-16% of its testing budget on rerunning tests that are suspected of being flaky. This is not unexpected and is further explained by Gruber et al. [8], who report that it takes approximately 170 reruns of a test case to determine with certainty whether a test is flaky or not.

To combat this, programs have been developed to automatically evaluate and identify whether a failed test is due to flakiness. For example, the program DeFlaker correctly identified 95.5% of all flaky tests included in a sample of 28,068 tests [2]. Even though this saves a lot of time on rerunning test cases, the problem of finding the root cause still remains.

1.2 Aim

Eck et al. [3] present a taxonomy of eleven categories of flakiness, namely Concurrency, Async Wait, Too Restrictive Range, Test Order Dependency, Test Case Timeout, Resource Leak, Platform Dependency, Float Precision, Test Suite Timeout, Time, and Randomness. From this taxonomy, Randomness is found to be a prevalent category of flakiness in Python test suites [8] and is therefore an important issue that needs fixing. Since the categories differ in how they display and manifest flakiness, we have decided to focus only on the Randomness category. Creating a technique for automatically detecting Randomness may also inspire further developments or demonstrate the prerequisites for the development of such techniques. Based on this, we propose a new technique, FlakyReporter, that classifies a flaky test’s likelihood of being flaky due to Randomness. It does this by rerunning a flaky test, tracing its execution and storing the information into trace logs which are then parsed and analyzed.

1.3 Research Questions

This presents the following research question:

• To what extent can tracing or log files be used to locate and identify a test being flaky inthe Randomness category?

1.4 Delimitations

1.4.1 Flaky Tests

Due to the scope of our work the developed application will assume that a test is suspected of being flaky and run it based on that assumption. Furthermore, only the category of Randomness, as defined by Eck et al. [3], will be investigated due to the same reasons already stated.

1.4.2 Python

Every language that is used in a continuous integration format is prone to flakiness. However, this paper will only focus on flakiness in the Python language. Python was selected as a language due to the available datasets and Git repositories online.

1.4.2.1 Testing Framework

Due to the flexibility and popularity of pytest, we have chosen to focus our efforts on this module only. By creating a plugin for pytest we strive towards producing traceback logs without creating too much overhead, which a more general approach would. The drawback is the inability to use the plugin for any framework other than pytest. Even though pytest supports running test suites from other frameworks, it does not support plugins for them. There might exist workarounds, but we have failed to locate a suitable one.


2 Background

2.1 Flaky Test

Flaky test is a term coined to describe tests that show non-determinism, meaning that test cases that fail to always give the same result are classified as flaky. Debugging a flaky test presents many complications for developers who have to fix the cause of flakiness [8, 12, 15, 27]. One such complication is that a test may behave differently on different hardware, which can make the test appear non-flaky on one setup and flaky on another. Since flaky tests are non-deterministic, recreating a failure is difficult and might require several reruns of the test before the failure manifests. As a result, some companies such as Google rerun a suspected flaky test ten times and mark it as flaky if and only if the ten runs result in at least one pass and one failure [2, 15]. By doing this, the cost of rerunning until a flaky test failure manifests is avoided. This does, however, not entirely determine whether a test is flaky or not, but it is a compromise, since investigating if a test is actually flaky consumes both time and resources.

Further difficulties in debugging a flaky test are introduced by how the root causes correlate. For example, a test that is flaky due to order dependency may cause other causes to be introduced, hiding the actual root cause. However, order dependency is not inherently flaky, i.e., test suites can exhibit order dependency without exhibiting any flakiness; in fact, it is not uncommon for tests to be order dependent while not exhibiting any flakiness.

Several articles have been written discussing this issue in various ways; some automatically detect flakiness [2, 13, 27], some automatically fix flakiness [19], and some discuss flakiness in relation to languages and developers [3, 8, 15, 23]. These articles provide insight, methods and, most importantly, datasets containing known flaky tests and their respective categories of flakiness, including Randomness. We gather tests from different categories to be used when validating our tool FlakyReporter, in order to determine how accurately it manages to identify a flaky test as Randomness or not.

2.2 Continuous Integration

Continuous Integration is a practice that is frequently used within software development. Development teams use a system for version handling where contributions from the team’s respective members are added together. When contributions are added, the most recent version of the software is built and its reliability is ensured by running tests. This method of developing is reliant on tests passing and is susceptible to flaky tests and their implications. For developers it is therefore very important to avoid creating, as well as to find and fix, flaky tests consistently and efficiently [11, 13, 19, 20].

2.3 Taxonomy of Flakiness

The eleven categories described (Concurrency, Async Wait, Too Restrictive Range, Test Order Dependency, Test Case Timeout, Resource Leak, Platform Dependency, Float Precision, Test Suite Timeout, Time, and Randomness) each introduce their own problems [3]. To be able to discern the different categories and to both determine and categorize found flaky tests, it is important to understand the difference between each category. Below, each of the eleven categories is briefly described based on the characterizations of causes presented by Eck et al. [3].

1. Concurrency classifies tests that are flaky due to synchronization issues, mostly originating from unsafe threading interactions. For example, it can be caused by race conditions where two threads compete for the same resource.

2. Async Wait is similar to Concurrency but is instead characterized by performing asynchronous calls without waiting for the result.

3. Too Restrictive Range is characterized by valid output values not being within the assertion range considered at test design time, failing these tests when such values show up.

4. Test Order Dependency classifies a test that is reliant on the outcome of previous tests and is the most problematic of all causes, as it is ambiguous and not inherently flaky. Flakiness due to this cause occurs most often when shared variables are handled badly, e.g. a previous test fails to reset the shared variables.

5. Test Case Timeout is when a test suffers from non-deterministic timeouts.

6. Resource Leak is characterized by improper management of external resources. Allocating memory and not releasing it, or not dereferencing a pointer, are examples of causes.

7. Platform Dependency is when a test’s flakiness is caused by its inability to run on a specific platform. This means that for some flaky tests, different hardware introduces flakiness.

8. Float Precision is when potential precision over- and underflow of floating point operations are not considered. This can be caused by rounding to a certain number of significant digits.

9. Test Suite Timeout is comparable to Test Case Timeout, but instead of one test causing flakiness, the execution of the whole test suite is the cause. This is typically caused by test suites growing over time while the maximum run-time value is not adjusted accordingly.

10. Time is caused by tests relying on reading the local system time, i.e., a test may fail due to changes to the local system time between two iterations.

11. Randomness specifies flakiness mainly caused by generating random numbers, where the test fails to handle all possible numbers as well as all edge cases (a minimal example is sketched after this list). It can also surface through other categories, such as Async Wait, which may cause Randomness to manifest itself.
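To make the Randomness category concrete, the sketch below shows a small, hypothetical pytest test (not taken from any studied repository) whose assertion is too optimistic about the spread of randomly drawn values; it passes on most runs but occasionally fails.

import random

def test_mean_of_random_samples():
    # Draws ten uniform samples and asserts that their mean stays close
    # to the expected value 0.5. Most runs pass, but an unlucky draw
    # pushes the mean outside the bound, making the test flaky.
    samples = [random.random() for _ in range(10)]
    mean = sum(samples) / len(samples)
    assert abs(mean - 0.5) < 0.2

Fixing such a test typically means seeding the generator or widening the asserted range, which is one reason Randomness and Too Restrictive Range can be hard to tell apart.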


2.4 Execution Tracing

Execution tracing, or simply tracing, is a form of logging where each line of execution is logged. This gives in-depth information about a program’s execution and enables further methods for debugging. One such technique is divergence, where two trace logs are compared to find the difference between them [27].

Tracing is a widely used debugging approach, both in a fully automated way and in a manual way where a developer debugs by following each line of execution [1]. Due to the overhead added by tracing (each line has to be analyzed and logged correctly), different algorithms have been presented to reduce it [17].

Earlier works further state that fault localization uses tracing in a majority of implementations [24, 25], and implementations of execution tracing have been shown to reliably find locations of interest by using the diff between runs [27]. These two arguments back the selection of tracing as an approach to determine Randomness as the cause of flakiness.

Kraft et al. [10] describe tracing as a technique that detects and stores relevant events during run-time, which are used in a later stage for off-line analysis. By storing the relevant events we gain the exact lines of execution for each iteration, all locals stored at any given line and all function calls made during execution. In the same fashion as described in their study, we create log files containing the relevant execution traces, which are then analyzed in a later stage.

2.5 GitHub

GitHub is an online code hosting platform for version control and collaboration. It allows teams to collaborate on and maintain projects from anywhere. Millions of developers and companies use GitHub to develop, maintain and ship their software [9]. This paper utilizes the Explore functionality of GitHub, which lets you browse through all available public projects. Publicly available projects allow anyone to download the source code and contribute to the development. Retrieving the source code allows for running their tests, which in turn might yield manifestations of flaky tests essential for conducting this research.

2.6 Python

Python is an interpreted high-level language which supports a variety of easy to use official and unofficial packages [5]. The language is often associated with machine learning, web development, embedded systems, data analysis and scripting. Although it is a simple language that allows for swift production of software code quantity-wise, it lacks in execution speed compared to a pre-compiled language like C++.

2.7 Unit Testing

Unit Testing is a software testing method where individual and isolated units of software code are tested in order to validate their functionality and reliability [18]. A unit in regards to Unit Testing could be any given part of the code that requires testing, for example a function or an object. This method of testing is often used in continuous integration to ensure the reliability of the code under development. Unit Testing is central to Test-Driven Development, the process that enables developers to produce code and continuously test its functionality, thus providing code with increased quality. This gives developers the opportunity to detect non-reliable or non-functional code earlier on, which in turn leads to a possible reduction of resources spent identifying and fixing it in the future. Unit Testing is also beneficial when refactoring legacy code, while ensuring the previous functionality. However, it is important to be aware that Unit Testing is only as good as its practitioner and cannot be expected to catch every flaw of a program. To accomplish this while also keeping the syntax of testing concise, developers make use of what is called a Testing Framework.

Figure 2.1: Test Driven Development.

2.7.1 Unit Testing Framework

Unit testing frameworks allow for concise implementation of tests and often have support for both logging and supplying feedback when tests fail. Python supports several testing frameworks, such as its built-in unittest package, which supplies the developer with simple creation of test cases but lacks in-depth logging compared to other testing frameworks. Another one is pytest, a unit testing framework that supports adding and creating plugins.

2.7.2 Pytest

pytest is a Python testing framework which features detailed failure logging, auto-discovery of test modules, modular fixtures, a plugin architecture and the ability to run unittest and nose suites [22]. It is often selected by developers due to its simplicity of creating tests and its support for creating and installing plugins. Due to this, and its ability to run most of the other Python testing frameworks, we utilize its plugin functionality in this thesis.
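As a minimal sketch of the plugin mechanism (an illustrative conftest.py hook, not the plugin developed in this thesis), pytest automatically picks up hook implementations such as pytest_runtest_logreport and calls them for every test phase:

# conftest.py - pytest discovers hook implementations placed in this file.

def pytest_runtest_logreport(report):
    # Called for the setup, call and teardown phases of each test;
    # here only the outcome of the test body itself is recorded.
    if report.when == "call":
        with open("outcomes.log", "a") as log:
            log.write("{}: {}\n".format(report.nodeid, report.outcome))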


3 Related Works

Several articles on automatically detecting flaky tests have been written, presenting both the test datasets used and the software developed. These works do, however, less often concern finding the root cause automatically, leaving a gap in the research.

In this section, relevant and earlier work will be presented. The presented works are split into their respective sections based on their area of contribution.

3.1 Empirical Studies

Eck et al. [3] describe flaky tests at a more thorough level: what exactly flaky tests are and how they, in many ways, negatively affect produced software. To combat the problems of debugging and finding root causes they introduce a taxonomy of eleven potential root causes of flakiness. They also explain the most common ways of solving the given root causes. These root causes are widely accepted by researchers and will, like in other works, be used in this paper to categorize causes of flakiness.

Common causes, manifestations and useful strategies for avoiding, identifying and solving flakiness of tests are discussed by Luo et al. [15]. By examining flaky tests from open source projects, collecting data and analyzing how developers have solved flaky tests throughout the project history, they present common strategies for solving flakiness. The most prevalent causes examined are of the async wait, concurrency and order dependency categories, which are described in Eck et al. [3]; these are identified in their study as the most prevalent categories of flakiness. It is also explained how order dependency may cause unpredictable behavior and cause different kinds of flakiness to manifest. This can make it more difficult to examine the root cause, as the order dependency can be hidden behind a different cause of flakiness created from it. Luo et al. [15] manage to present reliable strategies for fixing root causes defined in the taxonomy, as well as relevant information for identifying and categorizing common root causes. Their findings will be used to identify root causes and to determine what information a developer might need to understand and fix them.

Gruber et al. [8] have in their empirical study examined flaky tests from available GitHub repositories, comparing tests written in Python with tests written in Java with respect to the causes and quantities of flakiness. From their initial findings they state that the causes differ while the quantity is mostly the same, implying that different languages suffer from a similar amount of flaky tests but for different causes. They highlight that 59% of test cases written in Python are flaky due to order dependency, which is not the case for tests written in Java. Their study has produced a public dataset of public Git repositories that have some commit with flaky tests. All these detected tests are fully classified into their respective category (defined in [3]). This dataset is used for research, testing and development in this paper.

3.2 Automatic Detection

Bell et al. [2] present relevant research which discusses the possibility of automatically deducing whether a suspected test is flaky or not. They also claim that testing for flakiness by reruns is both ineffective and costly, highlighting their tool DeFlaker which avoids doing too many reruns. DeFlaker works by conducting an analysis of the difference between the previous and the new release (code-wise). DeFlaker detects the relevant changes and how they might affect the suspected flaky test. These identified changes are selected to be tracked by byte-code injection, tracking both statement and class coverage. By recording the outcome of suspected tests and printing a report, it helps in debugging and determining if a test failed due to flakiness. DeFlaker is, however, susceptible to false positives, as it ignores changes to code that uses reflection and rather overestimates the amount of flakiness. Bell et al. [2] argue that this is preferred over false negatives, which would allow flaky tests to slip through the detector. DeFlaker does not support any non-Java files, which includes settings and data files that could introduce flakiness. By using DeFlaker in conjunction with rerunning, they argue that time and resources are saved and a higher precision for determining flakiness is achieved. Due to the existence of applications that automatically detect if a test is flaky, this paper will instead only focus on root causes for tests proven to be flaky, meaning our method assumes that a test is defined as flaky before any categorization and analysis is done.

iDFlakies is a framework developed by Lam et al. [13] for detecting flakiness caused by order dependency. The method consists of first checking if the original order passes, followed by rerunning the test suite a given number of rounds with any of their five presented configurations, of which random-class-method is the best performing one. These configurations reorder the test order and/or class methods. The last step in the identification is rerunning the failing order and comparing it to a rerun of the original order, which in turn indicates whether a test is flaky due to order dependency or not. Similarly, iFixFlakies is another framework handling the same issue of detecting order dependency [19]. The method of detecting flakiness is similar, but iFixFlakies supports automatically patching the test with what they call helpers. These helpers are identified functions that reset the data that causes order dependency. The helper functions are then patched into the test code as a function call to the helper before the flaky test, which ensures that the affected test is isolated. Both frameworks present relevant information about detecting flakiness caused by order dependency. Their techniques are helpful, as eliminating order dependency from possible root causes is the first step in detecting other causes.


3.3 Automatic Fault Localization

Ye et al. [26] propose a method of fault localization by intersection of control-flow based execution traces. Their method traces test executions, partitioning the resulting logs into two different sets, TR_p and TR_f, containing traces for passed and failed test iterations respectively. From these logs they compute the intersection of TR_f and report all points of the program that are run in every failing test case. These are then ranked based on their suspiciousness of causing bugs. From this point onward, only the suspicious points of the program need to be inspected to identify the root cause. Wang et al. [24] further explain the usage of execution tracing for fault localization. They too propose a scheme for defining a test's suspiciousness, in order to sort which points in the execution are likely to induce faults. Their approach is, however, tailored to object-oriented languages. To manage tracing of an object-oriented program in a simple way, they only trace what they refer to as blocks, where each block represents a code block that was or was not executed in the trace. They further state that the selection of test cases is important, since redundant test cases may negatively impact the effectiveness of fault localization. The selected test cases are then used in conjunction with the block trace to determine the suspiciousness of each block. For each block in the program they compute the percentage of passing test cases that execute the block relative to the failing test cases that execute it, i.e. the higher the percentage of failing cases, the higher the likelihood that the block is faulty. The tracing methods presented follow a similar pattern of producing trace logs, comparing passing and failing logs, determining suspiciousness and estimating which part of the code is faulty. For finding flakiness, this method is applicable, as shown by Ziftci and Cavalcanti [27]. The main issue with tracing is its overhead, which in large test suites can be significant depending on the scope and method of testing. Wang et al. [24] further provide indicators that tracing is a viable approach for locating flakiness. This thesis makes use of their method of creating and comparing logs: we aim to determine whether a test is flaky due to Randomness by utilizing tracing, following similar steps of creating and comparing trace logs.

Ziftci and Cavalcanti [27] refer to the methods of Wong et al. [25] in order to solve the problem of locating where flakiness is created. Ziftci and Cavalcanti [27] describe how Google has developed methods to locate where the root causes exist with 82% certainty. The proposed tool compares the execution trace of each failing run to each passing run, finding their divergence. By using the divergence and a flakiness score, it is possible to locate where the root cause is created. This greatly helps developers find out what the cause of the problem is, as the location of said problem is presented. Their study further takes into account what information developers want in order to be able to fix the flakiness. Since developers often want to understand and fix the problem themselves, it is important to give them the information needed to achieve that. To manage this, they have developed a report tool that presents the important information. Since their method only detects the location of flakiness, we expand on their methods to be able to automatically detect the Randomness category. This is done by introducing an identifier for what type of root cause is present and a method of reporting the findings based on the feedback from their testing. We mainly utilize the concept of finding the divergence in executed code to locate and determine flakiness. We exclude the usage of a longest prefix, and instead only locate the first differing executed line. If any such diverging line is located, the code leading to that point is further analyzed to locate any indications of Randomness. The technique is also extended to differentiate between returns from function calls, differing locals etc. We further use the same method, modified and tailored to locating differing values, returns and assertions instead of lines of execution; here we instead locate the divergence in values between iterations.


3.4 Automatic Flaky Categorization

Frameworks developed for automatically detecting root causes are not common, but one such framework, named RootFinder, has been developed by Lam et al. [11]. Their framework is produced for Microsoft projects and is made in and for C# projects. Like other frameworks created for detecting and categorizing, it works in several steps. Firstly, CloudBuild detects flaky tests by rerunning any failed test to see if it passes. If the test passes it is classified as flaky, as the test fails and passes seemingly at random. When the detection is done, CloudBuild stores information about all tests in a database. Each flaky test in the database has its information read and all dependencies are collected to allow for running on any local machine independent of CloudBuild. Following this, they use a tool called Torch to create an instrumented version of all dependencies. This tool is used to allow for more logging of various runtime properties during test execution. They then rerun all flaky tests 100 times to produce both passing and failing logs for each test. When the logs are collected they are analyzed by using their proposed application, RootFinder. Each log file is first processed independently against certain predicates (Relative, Absolute, Exception, Order, Slow and Fast), creating logs of their outcome as well. These predicates determine if the behaviour at a certain instruction is deemed "interesting". When this first step is done, RootFinder compares the newly created predicate logs of each passing and failing run to identify predicates that are true/false in all passing executions, but the contrary in all failing ones. To then determine the category of flakiness they use, in conjunction with other methods, keyword searches. Some keywords strongly indicate the most probable category of root cause. Similarly, we aim to utilize tracing to create trace logs which are then analyzed when passing and failing iterations are found, although we do not use any predicate functionality since we focus on only one category. We also utilize keyword matching to further ascertain the category of the flaky test.
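As a small, hedged illustration of such keyword matching (the keyword list below is purely illustrative; it is neither RootFinder's nor the exact list used by FlakyReporter), a log or trace file can be scanned for terms that hint at the Randomness category:

import re

# Illustrative keywords only; a real categorizer would use a curated list.
RANDOMNESS_KEYWORDS = ("random", "randint", "shuffle", "seed", "uuid")

def count_randomness_keywords(log_text):
    # Counts case-insensitive keyword occurrences in the given text,
    # giving a rough signal to be combined with other evidence.
    pattern = re.compile("|".join(RANDOMNESS_KEYWORDS), re.IGNORECASE)
    return len(pattern.findall(log_text))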


4 Method

In the following section, the method used to create a dataset containing flaky tests and the methodology used to answer the research question will be presented. All testing, creation of log files and development was done on Ubuntu 20.04.2 LTS using Python 3 (3.8.5).

4.1 Thesis Work Process

The work was done in three major steps, represented in the workflow of figure 4.1. Firstly, we located projects containing flaky commits and downloaded these to try and manifest their flaky behaviour.

In step 0, we focused on recreating flakiness in found flaky commits by running them with pytest. The flaky commits were of different categories and were used to create a dataset containing the commits that managed to manifest flakiness.

In step 1 we determined how to implement FlakyReporter. From the resulting log files we first categorized each one into their respective categories. Flaky tests categorized as Randomness were analyzed to locate identifiers of the Randomness category. From our findings we proceeded to implement the code of our FlakyReporter.

In step 2 we evaluated our proposed FlakyReporter. From a set of flaky tests used for evaluation, we let FlakyReporter run and collect the produced trace logs. Divergence was run on the collected data and the probability of Randomness was calculated based on the divergence data and the found identifiers. The final step of FlakyReporter is to produce a report, which we analyzed in order to summarize the accuracy of our tool.


[Workflow diagram with three stages. Step 0 (Log Files Generation): locate flaky tests from Python projects, collect flaky commits within projects, run each test 5 000 times locally or until 2 iterations fail, categorize and identify located flaky tests, create dataset. Step 1 (Data Analysis): analyze the resulting log files from pytest output, find identifiers for the Randomness category, implement the tool FlakyReporter. Step 2 (Evaluation of FlakyReporter): collect and read trace logs, run divergence and locate identifiers, calculate probability of Randomness, produce report, calculate accuracy and precision.]

Figure 4.1: Workflow of thesis.

4.2 Data Collection

To be able to determine the most efficient approach for locating root causes, and to have a dataset for testing and development of potential methods, flaky tests from all categories had to be found. To accommodate these requirements, we created a dataset of found flaky tests.

The created flaky dataset [16] consists of flaky tests from public GitHub repositories, found in the same manner as Luo et al. [15]: searching for keywords in popular repositories. All repositories were found either from public datasets containing research on flakiness, such as [7], or from searching for the keyword Python on GitHub. This search covers all publicly available Python repositories, of which the most popular ones (most starred) were selected. In these repositories, searches for flak, flaky and intermit were made, finding all commits that mention and often fix flakiness. From these commits, the parent commit is used for the dataset. Both open and closed issues that contain any of the keywords are also examined, as they often provide additional information about the flakiness. If an issue is fixed and closed, its corresponding fix merge is presented and that commit is then added to the dataset.

The third-party flaky plugin for pytest makes it possible to mark tests as flaky by decorating them with @flaky. Tests marked in this way are either rerun a set number of times if they fail, or their failures are ignored in the result of the test suite. By removing the @flaky marking, we ensure that the test is run and that its result is not ignored, which ensures that the test has the capability to display flaky behaviour when run. The drawback of this method of collecting flaky tests is the requirement of manually locating the flakiness. In comparison to found commits, which give some information about the root cause from the commit message, removing the @flaky tag gives no such information. This creates more work, as locating the cause of a test's flakiness is time consuming, as pointed out by earlier research [2, 3, 8, 11, 15]. Due to this, the dataset consists mainly of identified flaky tests from commits and other publicly available datasets.
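For reference, marking a test with the flaky plugin looks roughly as follows (a sketch assuming the flaky package is installed; the parameter values and test name are illustrative):

from flaky import flaky

@flaky(max_runs=3, min_passes=1)
def test_sometimes_unstable():
    # With the decorator, pytest reruns this test up to three times and
    # reports a pass if at least one run succeeds. Removing the decorator
    # lets the flaky behaviour show up in the test results.
    ...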

When a flaky test is found, it is analyzed in order to be categorized into the taxonomy of Eck et al. [3], see section 4.2.2. This is also stored in the dataset, which stores test name, category, repository name and the SHA256 of the commit. The publicly available datasets [7, 21] already contain fully categorized flaky tests from different repositories with their respective test name and commit SHA256, which are in turn imported into our dataset [16].


4.2.1 Log Files Generation

To create a backlog of data from test logs and prove that a test is flaky, all suspected commits were run a number of times.

The amount of reruns used for each test suite was decided to be 5 000, or until 2 flaky test executions occurred. This was determined after the first test case we ran, which required more than 1 000 iterations. The large amount of reruns was made possible because most of the used projects contained small test suites that executed in a small time frame. From these reruns, tests that failed more than once were deemed flaky. Any test that did not fail during these 5 000 runs was either rerun for a set amount or deemed not flaky, depending on the time it took to execute: test suites that took less than 5 minutes to execute were rerun again, while those that took more than 5 minutes were deemed not flaky.

The tests that failed to manifest any flakiness in this amount of runs were discarded as not flaky. Since all used repositories use either unittest, pytest or another framework that pytest supports, a simple script was created (see Listing 4.1). The script runs the test suite a given amount of iterations and redirects the verbose output of each pytest execution to a log file log_{i}, where i is the iteration. If the test run fails, the log file is renamed to failed_log_{i}.

#!/bin/bash

mkdir -p logs

for i in $(seq 1 $1)
do
    touch logs/log_${i}
    python3 -m pytest -v &> logs/log_${i}
    if tail -n 1 logs/log_${i} | grep -c "failed"; then
        mv logs/log_${i} logs/failed_log_${i}
    fi
done

Listing 4.1: Script for creating logs.
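The early-stop policy described above (5 000 iterations, or stop once two iterations have failed) is not part of Listing 4.1 itself. A Python sketch of such a wrapper, assuming pytest is invoked as a subprocess from the project root, could look like this:

import pathlib
import subprocess

def rerun_test_suite(max_runs=5000, log_dir="logs"):
    # Runs the test suite repeatedly, saving each verbose pytest log and
    # stopping early once two failing iterations have been observed.
    pathlib.Path(log_dir).mkdir(exist_ok=True)
    failures = 0
    for i in range(1, max_runs + 1):
        result = subprocess.run(["python3", "-m", "pytest", "-v"],
                                capture_output=True, text=True)
        failed = result.returncode != 0
        name = "failed_log_{}".format(i) if failed else "log_{}".format(i)
        pathlib.Path(log_dir, name).write_text(result.stdout + result.stderr)
        failures += failed
        if failures >= 2:
            break
    return failures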

4.2.2 Category Identification

The tests and projects that showcased flakiness were categorized into their respective category. The categorization was based on the descriptions of Eck et al. [3]. Our approach consisted of reading commits, code and our created test logs.

Reading the commits made to fix a test's flakiness, in conjunction with reading the failing logs, was the method used for determining the category of a flaky test. Root causes were often discernible from the solution the developers found. For example, some flakiness was caused by comparing two unsorted lists, which caused flakiness due to their differing ordering. This was fixed by sorting both lists before comparing them, which indicated that Randomness or Too Restrictive Range were candidate categories and that the root cause was the list ordering and their comparison (see figure 4.2). The "-" sign in red marks the old line of code and the "+" sign in green marks the new line of code replacing the old one. The figure presents a change where the previously unordered list instead gets sorted to solve the cause of flakiness.


Figure 4.2: Github commit fixing Too Restrictive Range flakiness.

Tests that suffered from flakiness due to more abstract reasons, e.g. Concurrency or Async Wait, were harder to determine. This can be attributed to their reliance on timings, where a timeout may be due to Async Wait, which in turn makes the test harder to categorize.

Other categories were often simpler to determine. Randomness, for example, was easy to determine by how most of the failing and passing runs differed in values. Listing 4.2 shows one passing and one failing iteration of the same test. As can be seen at lines 5 & 12, the same assert compares two different sets of numbers (lines 6 & 13). The first iteration passes since avse30_1 is larger than avse60, while the second iteration fails since avse30_1 is smaller than avse60. Most tests that exhibit flakiness due to Randomness tend to produce a similar behaviour, where failing and passing traces contain differing variables within the same coverage. Furthermore, differing assertion values between iterations with the same result, i.e. passing or failing, is an indicator of flakiness in the Randomness category.

1  # Iteration 1
2  .
3  .
4  .
5  assert avse30_1 > avse60
6  > (0.1557815400014442 > 0.11455372861553298)
7
8  # Iteration 2
9  .
10 .
11 .
12 assert avse30_1 > avse60
13 > (0.11540383618068703 > 0.12648936212227796)

Listing 4.2: Two iterations of a "Randomness" flaky test.

The initially defined categories, which were based on the empirical study by Gruber et al. [8], were also cross-checked against the findings from the created test logs. In cases where the used repositories came from their public dataset [7], the category found by analyzing the test logs was compared to the result in the dataset. This ensured both identifying and classifying the flaky tests correctly into their respective category.

4.3 Data Analysis

Of the found and analyzed repositories, only a small amount were not relevant for our classification. The commits present in the public dataset by Gruber et al. [7] were located and categorized by hand in their study [8]. A few of the commits in their dataset did not manifest any flakiness even though 5 000 or more iterations were run, which can be attributed to human error or to our test environment differing from theirs. Furthermore, some projects could not be installed or had outdated dependencies, resulting in them not building. The commits and projects that did not work or did not manifest any flakiness were disregarded from our created dataset [16]. See appendix A for the projects that managed to execute their respective test suite.


[Bar chart of the number of occurrences per flakiness category: Async Wait, Concurrency, IO, Network, Platform Dependency, Randomness, Resource Leak, Test Case Timeout, Test Order Dependency, Time, Too Restrictive Range, Precision, Unknown Category and Unordered Collection, with Randomness and Network as the most frequent categories.]

Figure 4.3: Found categories from public dataset and GitHub.

Figure 4.3 is a representation of all of the flaky commits used in this paper, both from our manual search on GitHub and from the flaky commits used by Gruber et al. [7]. Since we base our categorization on the taxonomy by Eck et al. [3], while Gruber et al. [7] use their own, figure 4.3 contains the naming schemes of both. This is both for transparency reasons and because of the high level of complexity in categorizing flakiness. For this reason there are categories that are more similar to each other than to the rest.

It is of great importance to have acquired flaky tests from categories other than Randomness, since these will also be useful to test our method against. A method that determines flakiness within Randomness must not wrongfully report Randomness when supplied with tests from other categories.

Async Wait and Concurrency could in some cases be categorized as Network, which may be the reason why Network as a category is more prevalent in the dataset than in our own search through GitHub repositories. Another occurrence of this is that Randomness and Too Restrictive Range are difficult to distinguish from one another, resulting in most such cases being classified as Randomness.


4.3.1 Frameworks

From the projects that were able to run (see appendix A), we found a split between unittest and pytest. As this work only focuses on pytest, we gathered 51 projects using pytest and 41 projects using the unittest framework, as can be seen in figure 4.4. The individual number of flaky tests using either unittest or pytest is not presented, only which framework a specific test suite, or project, is using.

[Bar chart of the number of projects per testing framework: pytest 51, unittest 41.]

Figure 4.4: Found testing frameworks from projects in our created dataset.

Since the method relies on the narrowed scope of pytest, 12 out of the 51 pytest projects were used to develop the method capable of locating the root causes of Randomness, while the remaining 39 were used to determine the accuracy of the method after its development phase. Keeping the remainder separated from the projects used for development enabled us to obtain a validation with a certain precision. Using only 12 projects as a basis for creating the method was deemed not to impact the end product in a negative way, since Randomness tends to manifest itself similarly throughout different projects and test cases.

4.4 Data Reporting

Through analyzing our created log files, we found that tracing seemed to be a viable approach for locating flakiness. Tracing allows us to keep track of variables, return values and function calls. From what we found, we had to closely follow the execution between runs to accurately determine if the flakiness is due to Randomness. The aim is to locate returns, assertions and variables that vary in value between runs, both passing and failing. Figure 4.5 contains an example of two failing assertions that contain different values (highlighted), separated by a number of equal signs. Differing assertion values between iterations has been noted to be an indicator of flakiness due to Randomness. Both iterations contain assertions of the form lhs in rhs, where both lhs and rhs differ between the iterations. An example of this can be seen at the start of both assertion errors: [1111,204 not found in 11110,284 ...] and [1111,805 not found in 1111,805 ...], which differ in several locations. Note that the entire AssertionError is not shown in the figure as it contains too much text. Alternating values between iterations is a characteristic of the Randomness category, where assertions, locals and return values differ between iterations with the same result. Tracing allows us to follow all differing values and assertions from both passing and failing logs, which helps indicate random behaviour such as the one in the figure.


Figure 4.5: Snapshot of two pytest fail messages from two iterations of the same test.

4.4.1 Execution Traces

The function sys.settrace(fn) [6] enables the user to create custom tracing functionality, which in turn enables a more pinpointed approach. Python also includes the inspect package [4], which handles frames and other inspect objects; these are used to gain the actual information of the trace. The frame retrieved in the trace function represents the current frame of execution. This works by sys.settrace first registering a global trace, which invokes a callback returning the local trace, or frame containing all relevant data. Figure 4.6 explains the process in a clearer manner.

[Diagram: sys.settrace() registers the global trace, which invokes a callback that returns the local trace/frame.]

Figure 4.6: Flow of sys.settrace().
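The registration flow of figure 4.6 can be summarised in a few lines. The sketch below is simplified (the function name test_target is hypothetical): the value returned by the global trace function becomes the local trace for that frame.

import sys

def local_trace(frame, event, arg):
    # Invoked for 'line', 'return' and 'exception' events inside the frame.
    if event == "line":
        print(frame.f_lineno, dict(frame.f_locals))
    return local_trace

def global_trace(frame, event, arg):
    # Invoked on every 'call' event; returning local_trace registers it
    # as the local trace function for that frame only.
    if frame.f_code.co_name == "test_target":
        return local_trace
    return None

sys.settrace(global_trace)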

Using sys.settrace(fn) allows for excluding irrelevant files from tracing, by only tracing the correct function name in the correct file. In comparison, Python's trace module has no support for defining which files and which functions to trace, but instead traces every call. Since pytest calls multiple functions and classes when doing a test run, this produces several redundant trace logs during its execution.

As stated earlier, this approach provides the functionality of excluding non-interesting tests, which can be seen in Listing 4.3, lines 6-7. There it is defined that only selected target functions from a specific file will be traced.


1  def _trace_func(frame, event, arg) -> None:
2      co = frame.f_code
3      func_name = co.co_name
4      filename = co.co_filename
5
6      if func_name in target_func \
7              and filename in trace_list:
8          line_no = frame.f_lineno
9          f_locals = frame.f_locals
10         trace_line = {
11             'event': event,
12             'func_name': func_name,
13             'line_str': linecache.getline(filename, line_no).rstrip(),
14             'line_no': line_no
15         }
16         .
17         .
18         .

Listing 4.3: Simple trace function example

The call linecache.getline(...).rstrip() provides the string representation of the code being executed. The string representation is paired with its potential locals by fetching them from the current frame. Line 9 in listing 4.3 shows how the frame's locals are fetched. However, the locals only appear on the following line, since the current line has yet to be executed.

The tracing is done in an event-based manner where each trace call provides an event, a frame and arguments. The frame provides the current top frame on the stack, which in turn provides meta information about the currently executing object. Events like call and return happen when a function gets called or returns, the latter including the return values. Other events include the line event, which represents a line being executed; this event is used for collecting information about executed lines of code. The tracing functionality also includes fetching the frame of the parent caller, which enables us to only allow tracing if a certain parent called the function. Doing this sets the depth of tracing allowed, which currently is a depth of 1. Figure 4.7 illustrates the depth traced, where every arrow represents a call/return. The crossed-out arrows are calls that are not traced and are thus excluded from the resulting trace logs. Figure 4.7 also displays how the Test Function performs several calls to pytest, since it is the module controlling the testing environment. The Python library contains several different files native to the Python environment; from the analysis done on the found flaky tests we found that these files are often uninteresting or can be ignored. We further noted that tracing into multiple levels of function calls, i.e. calls from a call, is mostly redundant. The first Called Function, at depth 1, represents any user-defined function not native to Python or pytest. This function gets fully traced, i.e. lines executed, locals and return value. Any function calls from this depth onward are not traced. By not tracing at greater depth, we effectively reduce the potential overhead while still maintaining the relevant information needed to deduce Randomness. It is, however, possible to extend the traced depth, which is easy to accomplish in our implementation and can be tailored to any new findings that may require more depth.


Figure 4.7: Overview of execution trace depth.

Listing 4.4 describes the two events, call and return, as well as how the parent is used to determine whether a call should be traced. Line 23 stores the return value, which resides in the arg parameter of the trace call.

1   .
2   .
3   .
4
5   elif event == 'call':
6       try:
7           if parent.co_name in self.logs:
8               if 'call' not in self.logs[parent.co_name]:
9                   self.logs[parent.co_name]['call'] = dict()
10      .
11      .
12      .
13
14      except Exception as e:
15          print('Trace call exception, {}\nin file {}'.format(e, filename))
16  elif event == 'return':
17      try:
18          if parent.co_name in self.logs:
19              self.logs[parent.co_name]\
20                  ['call']\
21                  [frame.f_back.f_lineno]\
22                  [func_name]\
23                  ['return'] = arg
24      except Exception as e:
25          print('Trace return exception, {}\nin file {}'.format(e, filename))
26
27  .
28  .
29  .

Listing 4.4: Trace events: call and return.


Logging the executed lines allows for a divergence method similar to Ziftci and Cavalcanti [27], determining whether the same part of the code gets executed in both failing and passing runs. If both passing and failing runs execute the same code, the variable that causes the test to fail is tracked through all execution logs. Depending on how the variable changes value between runs, the failure can be attributed to Randomness: if the value that caused the test to fail varies in both passing and failing logs, the flakiness is most likely due to randomness.

4.5 FlakyReporter

FlakyReporter calculates the probability of a test being flaky due to Randomness and produces an interactive .html report in several steps. It first reruns the target flaky test while tracing its execution, creating trace logs. When the log files have been created for the target test function, they are parsed and analyzed, and the test's suspiciousness is calculated in several steps. In the following sections we explain how the category is determined and how the location of the possible root cause is found. This is done in several steps, as can be seen in figure 4.8. First, the executed lines are checked for divergence. If no divergence is found, we compare the common identifiers of Randomness: returns, locals and assertions. These three categories tend to reveal randomness by differing between runs with the same outcome; for example, a function return changing value for each passing iteration indicates that the test suffers from randomness. If a divergence is found, we only compare returns and locals. We ignore assertions in that case because differences in the executed lines between passing and failing runs strongly affect the final assertion; moreover, when a divergence is found we never reach the final assertion, as we do not continue past the first divergent line. We therefore classify this as partial data, since it does not contain data from the full trace log. This is further explained in section 4.5.7. In the following subsections we explain the steps, described in figure 4.8, taken to produce the report of the probability that the test is flaky due to Randomness.

Figure 4.8: Flowchart of the method for analyzing trace logs.


4.5.1 Rerun Flaky Test

FlakyReporter starts by rerunning the test files with pytest, using execution tracing to store information about the currently executed lines.

The number of reruns is defined by the user before execution starts. The tool traces the execution of the target test function and stores it, together with the pytest results, locally in text files.
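As an illustration, a rerun loop of this kind could be sketched with pytest.main and a small recording plugin; the plugin class, the node id and the output handling below are our own illustrative assumptions, not FlakyReporter's actual code.

import pytest

class OutcomeRecorder:
    """Minimal pytest plugin that records pass/fail per rerun."""
    def __init__(self):
        self.outcomes = []

    def pytest_runtest_logreport(self, report):
        # 'call' is the phase in which the test body itself runs.
        if report.when == "call":
            self.outcomes.append("passed" if report.passed else "failed")

def rerun_test(node_id: str, iterations: int):
    recorder = OutcomeRecorder()
    for _ in range(iterations):
        # -q keeps pytest's own output short; each iteration is an in-process run here,
        # whereas a real tool may prefer a fresh process per rerun.
        pytest.main(["-q", node_id], plugins=[recorder])
    return recorder.outcomes

# Example usage (hypothetical node id):
# outcomes = rerun_test("tests/test_initial.py::test_random_test", 200)
# print(outcomes.count("passed"), "passed,", outcomes.count("failed"), "failed")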

4.5.2 Trace Logs

The initial step of FlakyReporter generates trace logs. The trace logs consist of information gathered by tracing together with the pytest results. The logs are written in the format lineno - string < locals, Call-> func : fname, or C-> lineno - string < locals. A log may contain any number of iterations, i.e. instead of having ten files for the same function, only one file is used where each iteration is separated by 20 equal signs and a newline (see Listing 4.5 below).

1  ___line 11 def test_random_test():
2  ___line 12     rand = create_random()
3  Call -> create_random : .../test_initial.py
4  C-> ___line 3 def create_random():
5  C-> ___line 4     rand = random.randint(0, 10)
6  C-> ___line 5     return rand
7  C-> ret 4
8  < (rand = 4)
9  ___line 13 if rand == 0:
10 ___line 15     rand2 = create_random()
11 Call -> create_random : .../test_initial.py
12 C-> ___line 3 def create_random():
13 C-> ___line 4     rand = random.randint(0, 10)
14 C-> ___line 5     return rand
15 C-> ret 6
16 < (rand = 4)
17 < (rand2 = 6)
18 ___line 16 assert rand2 >= rand
19 > (6 >= 4)
20 ====================
21
22 ___line 11 def test_random_test():
23 ___line 12     rand = create_random()
24 Call -> create_random : .../test_initial.py
25 C-> ___line 3 def create_random():
26 C-> ___line 4     rand = random.randint(0, 10)
27 C-> ___line 5     return rand
28 C-> ret 3
29 < (rand = 3)
30 ___line 13 if rand == 0:
31 ___line 15     rand2 = create_random()
32 Call -> create_random : .../test_initial.py
33 C-> ___line 3 def create_random():
34 C-> ___line 4     rand = random.randint(0, 10)
35 C-> ___line 5     return rand
36 C-> ret 8
37 < (rand = 3)
38 < (rand2 = 8)
39 ___line 16 assert rand2 >= rand
40 > (8 >= 3)
41 ====================
42 .
43 .
44 .

Listing 4.5: Trace log of a passing execution.


The "<" sign represents the locals. This can be seen in line 8; where "< (rand = 4)" rep-resents the assignment of rand. The ">" sign represents the assertion value, or comparison.On line 19; "> (6 >= 4)" represents a passing assertion where "assert rand2 >= rand"is the same as "assert 8 >= 7". At line 3 a call is made to the function create_randomwhich resides in the file test_initial.py. The following "C->" lines represents the tracewithin the called function. The "ret 4" at line 7, represents the return value of the calledfunction.

The trace logs are then parsed back into the tool, which reads and stores them in a format usable for the remaining steps in the process of producing a report.
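A sketch of how such a log could be read back, simply by matching the line prefixes shown in Listing 4.5, is given below; the returned structure is an illustrative assumption rather than the tool's internal format.

def parse_trace_log(path):
    """Split a trace log into iterations and classify each line by its prefix."""
    iterations, current = [], []
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.rstrip("\n")
            if line.startswith("=" * 20):            # iteration separator
                iterations.append(current)
                current = []
            elif line.startswith("Call ->"):          # call into another function
                current.append(("call", line[len("Call ->"):].strip()))
            elif line.startswith("C->"):              # line traced inside the callee
                current.append(("called_line", line[len("C->"):].strip()))
            elif line.startswith("<"):                # locals after a line has executed
                current.append(("locals", line[1:].strip()))
            elif line.startswith(">"):                # assertion comparison
                current.append(("assertion", line[1:].strip()))
            elif line.startswith("___line"):          # executed line in the test itself
                current.append(("line", line[len("___line"):].strip()))
    if current:
        iterations.append(current)
    return iterations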

4.5.3 Execution Divergence

As can be seen in figure 4.8, the first step after reading the trace logs, when determining Randomness, is a divergence analysis on the executed lines. Our divergence method is implemented similarly to Ziftci and Cavalcanti [27]: the first diverging line is found and stops further line analysis. Any new, earlier divergent line found becomes the new stopping point for further comparisons, reducing the actual commonly executed lines of code over all logs. As in their method, we store the commonly executed code for passing/failing runs until any diverging code is found. We also add more data and store all locals, assertions and function calls as well as their return values. This is then used in the later stages to calculate the result, i.e. the probability that the flakiness pertains to the Randomness category. Listing 4.6 shows an example, taken from Ziftci and Cavalcanti [27] and re-written in Python. The gray area represents the commons (commonly executed code), while green and red represent what the passing and failing iteration, respectively, diverge in.

1 def divergent_function():
2     value = RandInt(0, 1)
3     if value == 0:
4         return False
5     else:
6         return True

Listing 4.6: Example of divergence. The green line and the red line represent the passing and failing iterations and the code that differs between them. Gray represents the commons.

The function diverges between the return statements, as the False return indicates a failed run. This should split the result evenly between passing and failing iterations. Ziftci and Cavalcanti [27] argue that the location of the first detected divergence is the fault location. We conform to their argument and skip further line analysis when any divergence is found. Instead, the next passing log is compared to the failing log to gather more data about locals, returns and, in some cases, new divergences.

As stated before, the divergence algorithm also fetches data about each traced function call, the locals and the final assertion producing either failure or pass. These are stored for each iteration, meaning that if we have 100 iterations, n passing and m failing, we get returns, locals and assertions from every iteration. Their use is further explained in their respective subsections. As can be seen in listing 4.6, no return value is collected since no function is called. The locals gathered differ between passing and failing and are stored as value: 1 for passing and value: 0 for failing. The assertions collected in this example are the two different return statements. Pytest provides explanations of why a test failed or passed, and these are used as the collected data. In the case of our example, the explanation of a failing assertion would be False == True.

Below, in algorithm 4.1, is the pseudo code of our implementation of a divergence algorithm based on the one presented by Ziftci and Cavalcanti [27]. All tests in Tp, Tf ∈ T are checked, where each test in Tp is checked against each test in Tf. That is, each passing test tp ∈ Tp is checked against each failing test tf ∈ Tf, and if any executed line linep in tp differs from the corresponding line linef in tf, a divergence is found. The different variables represent their respective type of data, where div contains the diverging lines and all common lines between failing and passing iterations.

1  div, locals, returns, assertions ← ∅
2  foreach (tf, tp) in (Tf, Tp) ∈ T do
3      commons ← ∅
4      foreach line in tf, tp do
5          if linep != linef do
6              div ← linep + linef + commons
7              break
8          end
9          commons ← commons ∪ line
10         returns ← returns ∪ getReturns(linecf, linecp)
11         locals ← locals ∪ getLocals(linef, linep)
12     end
13     assertions ← assertions ∪ getAssertions(tf, tp)
14 end

Algorithm 4.1: Divergence algorithm.
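A Python sketch of the core of Algorithm 4.1 is shown below; it only implements the line-by-line comparison and the pairing of iterations, while the collection of returns, locals and assertions (getReturns, getLocals, getAssertions) is omitted, and the toy data at the end is purely illustrative.

from itertools import product

def find_divergence(failing_lines, passing_lines):
    """Compare one failing and one passing iteration line by line.

    Returns (commons, divergence) where divergence is None if the
    executed lines never differ.
    """
    commons = []
    for line_f, line_p in zip(failing_lines, passing_lines):
        if line_f != line_p:
            return commons, (line_p, line_f)
        commons.append(line_f)
    return commons, None

def divergence_analysis(failing_runs, passing_runs):
    """Pair every failing iteration with every passing one, as in Algorithm 4.1."""
    divergences = []
    for t_f, t_p in product(failing_runs, passing_runs):
        commons, div = find_divergence(t_f, t_p)
        if div is not None:
            divergences.append({"commons": commons, "diverging": div})
    return divergences

# Hypothetical toy data: the failing run takes the other branch of an if-statement.
passing = [["rand = create_random()", "if rand == 0:", "rand2 = create_random()"]]
failing = [["rand = create_random()", "if rand == 0:", "raise ValueError"]]
print(divergence_analysis(failing, passing))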

One thing done in the background of the divergence algorithm is locating keywords on each line read. While each line is checked for any difference between the passing and failing execution, it is also checked for keywords. A keyword is any word that may reference random, and to support future random functions and libraries we support a keywords.txt file. This file contains, one per line, every keyword that might reference a random function. Listing 4.8 displays a short example list of keywords that reference random functionality. Each line, until any divergence is found, is scanned for words containing any of the keywords in the list, as sketched after the listing below. Found keywords are later used to further argue for Randomness being the category in the Calculate Result step.

1 rand
2 Rand
3 randint
4 RandInt
5 random
6 Random
7 uniform
8 Uniform

Listing 4.8: Example of a short keyword list.
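A minimal sketch of the keyword scan could look as follows, assuming a keywords.txt with one keyword per line as in Listing 4.8; the function names are illustrative.

def load_keywords(path="keywords.txt"):
    """Read the keyword list, one keyword per line, ignoring blanks."""
    with open(path, encoding="utf-8") as fh:
        return [kw.strip() for kw in fh if kw.strip()]

def scan_line_for_keywords(line, keywords):
    """Return every keyword that occurs as a substring of an executed line."""
    return [kw for kw in keywords if kw in line]

# Example: the scan flags both 'randint' and 'random' on this traced line.
# print(scan_line_for_keywords("rand = random.randint(0, 10)", ["randint", "random"]))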

4.5.4 Compare Return Values

All return values reached by our divergence method are stored and compared. In each iteration where a call happens before any divergent line is found, the function _store_returns(self, ...) is executed, which stores all unique returns. Since all returns are read as strings, we store the string representation of the returned value. This allows us to read and store user-defined objects, which would not work otherwise, and it further allows us to store each returned value uniquely.

The return value is stored together with its resulting run and the line number where the call occurred. The function arguments contain the called function name, the line number where the function was called, the passing return value and the failing return value. The number of occurrences of a passing return value is incremented. This provides data on the proportion of all iterations that produced the same specific return value. It further provides the number of differing return values between passing and failing iterations, as the number of keys in the dictionary corresponds to the number of distinct return values.


The stored data is then compared to determine the number of differing return values relative to the number of iterations. This is done by dividing the number of keys by the number of iterations, independently for passing and failing runs. The aim is to see whether the differing return values are distributed in a similar way between passing and failing runs. We further check whether any return value from a failing iteration is present in any passing iteration. If both passing and failing logs contain differing return values, it indicates randomness. However, if any return value exists in both passing and failing runs, it does not indicate that Randomness is the cause of flakiness; i.e. if there exists a failed return fr ∈ Fr and a passed return pr ∈ Pr with fr = pr, then Returns ↛ Randomness. Nor do differing returns inherently imply Randomness as the cause of flakiness. From these findings we calculate a numeric representation of how impactful the differing return values might be, which is later used to calculate the final probability that the flakiness pertains to Randomness.
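The comparison described above could be sketched as follows, assuming the return values of one called function are available as one string per iteration; the indicates_randomness flag is a simplification of the weighting actually used.

def return_value_indicators(passing_returns, failing_returns):
    """Compare the spread of return values between passing and failing runs.

    passing_returns / failing_returns: one string representation of a
    function's return value per iteration.
    """
    distinct_pass = set(passing_returns)
    distinct_fail = set(failing_returns)

    # Proportion of distinct values over the number of iterations, per outcome.
    pass_diversity = len(distinct_pass) / max(len(passing_returns), 1)
    fail_diversity = len(distinct_fail) / max(len(failing_returns), 1)

    # A value occurring in both passing and failing runs weakens the
    # randomness indication for this particular return value.
    overlap = distinct_pass & distinct_fail

    return {
        "pass_diversity": pass_diversity,
        "fail_diversity": fail_diversity,
        "overlapping_values": overlap,
        "indicates_randomness": pass_diversity > 0 and fail_diversity > 0 and not overlap,
    }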

4.5.5 Compare Locals

The term Locals refers to all local variables in a function. Every line that contains a stored local is written to the log files, which are then read during our divergence check. Each local stays present on every following line until the end of the variable's lifetime. Listing 4.9 contains two variables, i.e. two locals. At line 2, the execution assigns a value to a local, but this value is not visible until the line has executed; therefore the value of bar does not appear as a local until line 3. The same goes for line 3, where baz is initialized. Since bar is still in use when baz is initialized, both of them are the locals present at line 5, as line 3 needs to have executed before baz is set as a local. Until a return is reached, the locals stay the same or increase in number if none of them is disposed of.

1 def foo():
2     bar = 5      # no locals at this line
3     baz = 120    # {bar: 5} is a local at this line
4
5     if bar == 5: # {bar: 5, baz: 120} are locals at this line
6         return True
7     else:
8         return False

Listing 4.9: Example of locals.

The locals are compared in the same manner as the return values: each local is compared across all iterations to locate locals whose values differ between runs. This is done for both passing and failing logs, where the number of occurrences of each local value is stored. The locals are also compared to the failing and passing assertion statements, such that if a local that differs in value between runs is used in a failing assertion, it is a further indication of randomness.

4.5.6 Compare Assertions

Assertions are the most prevalent indicators and characterizations of Randomness, since they alter in value between passing and failing runs in most tests suffering from Randomness. This was exemplified in listing 4.2, which presents the assertion results of two iterations and how they differ in value.

All assertions executed in both passing and failing runs are stored, and the number of occurrences of each value is counted. For assertions, no value in a failing log will be identical to any passing one, since the test fails in that specific assertion. As such, we only aim to find how much the assertions differ between iterations, for both passing and failing runs. The comparison is made by comparing the difference in the number of distinct assertions. If the failing runs do not differ in their assertions but the passing runs do, it does not inherently imply Randomness. However, if the passing runs do not differ and the failing runs differ between runs, it does imply Randomness. That said, most cases of Randomness tend to display differing values in both passing and failing assertions.

The assertions are compared by checking the number of distinct assertion values over the full set of passing or failing test iterations. By doing this we get a better estimation of the proportion of differing assertion values, as the number of failing runs is usually much smaller than the number of passing runs. The greater the proportion of differing assertion values, the greater the probability that the flakiness is due to randomness.
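As a sketch, the proportion of distinct assertion values could be computed as below, assuming one pytest assertion explanation per iteration; how this value is weighted into the final score is left out.

def assertion_diversity(passing_assertions, failing_assertions):
    """Proportion of distinct assertion values over all iterations of each kind.

    Each argument is a list with one assertion explanation per iteration,
    e.g. the comparison string '6 >= 4'.
    """
    def diversity(values):
        return len(set(values)) / len(values) if values else 0.0

    return {
        "passing": diversity(passing_assertions),
        "failing": diversity(failing_assertions),
    }

# Example: three passing runs with three different comparisons -> diversity 1.0.
# print(assertion_diversity(["6 >= 4", "9 >= 2", "7 >= 1"], ["3 >= 8"]))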

4.5.7 Compare Partials

If a divergent line is located, the full data for the comparisons is not available. Instead we run the same comparison method with only partial information: we collect the locals and returns that are available before the divergent line and compare those instead of the full test execution. An implication of partial comparisons is the lack of assertion statements; since we stop at the first divergent line, the final assertion is never reached during comparison.

The partial comparisons are performed in the same manner as the ones already described, i.e. Compare Return Values and Compare Locals. The only difference is the amount of data and, consequently, the accuracy of determining Randomness as the root cause.

4.5.8 Calculate Result

The probability of Randomness is calculated in a simple manner where each indicator of random behaviour adds points to the variable rnd_probability, which represents the probability of Randomness. All calculations and the resulting score are done in the background and are never presented to the user. This means that we never present a specific score; instead we present the indicators of Randomness that were found together with a broader categorization:

• No Indications of Randomness, is the result when no indications have been found.

• Few Indications of Randomness, is when the number of indications is very small. This result can be seen as the margin of error between No Indications of Randomness and Some Indications of Randomness.

• Some Indications of Randomness, is when there are enough indicators for the test to possibly be flaky due to Randomness.

• Many Indications of Randomness, is when the number of indications found strongly implies flakiness due to Randomness.

Each part (locals, assertions, returns and keywords) is "measured" to give an estimation of the flaky category. For locals, we calculate the number of differing local values divided by the number of iterations and average the result over passing and failing runs. The formula can be seen in equation 4.1, where N_f and N_p are the number of iterations for failing and passing runs respectively. The values l_i^f, l_i^p ∈ L reference the set of suspicious values a local is assigned at a given iteration i. The impact constant defines how impactful a certain element is: random variables used in the failing assertion are more impactful and therefore get a higher impact value. This is used to balance the probability value of any test function, creating a more reliable and easier-to-modify probability.

probability = probability + \frac{\frac{\sum_{i=0}^{N_f} l_i^f}{N_f} + \frac{\sum_{i=0}^{N_p} l_i^p}{N_p} + \mathit{impact}}{2} \qquad (4.1)


The returns are calculated in the same way as the locals: the same principle as in equation 4.1 applies, but a set of returns is compared and calculated instead of a set of locals. The assertions also follow the same formula but differ in that they do not use a set of different assertions but one set of different values. Assertions also differ in that they are only calculated when no divergence is located, since no assertion can truly be examined when a divergence is found. The final result is calculated as described in equation 4.2, where probability is the resulting probability from equation 4.1.

result = 1 - \left|\frac{1}{probability}\right| \qquad (4.2)

The final result, described in equation 4.2, is therefore a number between 0 and 1. We define the range 0 ≤ result ≤ 0.1 to be No Indications of Randomness, 0.1 < result ≤ 0.4 to be Few Indications of Randomness, 0.4 < result ≤ 0.7 to be Some Indications of Randomness and 0.7 < result ≤ 1.0 to be Many Indications of Randomness.
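A sketch of how equations 4.1 and 4.2 and the thresholds above could be combined is given below; the per-iteration counts and the impact constant are placeholders for the indicator data described earlier, not the exact values used by FlakyReporter.

def add_indicator(probability, failing_counts, passing_counts, impact):
    """Equation 4.1 for one indicator (locals, returns or assertions).

    failing_counts / passing_counts: number of suspicious values observed
    at each failing / passing iteration (illustrative representation).
    """
    n_f, n_p = len(failing_counts), len(passing_counts)
    fail_term = sum(failing_counts) / n_f if n_f else 0.0
    pass_term = sum(passing_counts) / n_p if n_p else 0.0
    return probability + (fail_term + pass_term + impact) / 2

def final_result(probability):
    """Equation 4.2 followed by the category thresholds."""
    result = 1 - abs(1 / probability) if probability else 0.0
    if result <= 0.1:
        return result, "No Indications of Randomness"
    if result <= 0.4:
        return result, "Few Indications of Randomness"
    if result <= 0.7:
        return result, "Some Indications of Randomness"
    return result, "Many Indications of Randomness"

# Example with made-up counts: two indicators contribute to the score.
# p = add_indicator(0.0, failing_counts=[1, 1], passing_counts=[1, 0, 1], impact=1.0)
# p = add_indicator(p, failing_counts=[2, 1], passing_counts=[1, 1, 1], impact=2.0)
# print(final_result(p))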

4.5.9 Produce Report

The result is stored and presented to the reader through a report.html file. Meta data from the local environment and the total number of iterations run are also added to the report. The report is formatted for interactive viewing by adding expandable sections. This allows for better readability and flexibility and lets the user select what information to view. Ziftci and Cavalcanti [27] state the importance of readability of any information relevant to the developer fixing the issue. The choice of .html is also inspired by their report creation.

4.6 Evaluation

To evaluate our proposed method we tested it against our created dataset, where the selected tests are ones not used directly during development. Each test selected for evaluation is run until four iterations have failed. The logs are created with our FlakyReporter, and accuracy is checked using the resulting probability and the resulting report.

Since our method produces four different results, we define a false positive in terms of Many Indications of Randomness and Some Indications of Randomness: when a flaky test not in the Randomness category produces the Many Indications of Randomness result, it is considered a false positive. For the Some Indications of Randomness result, we define it as a false positive depending on the data collected and how it corresponds to Randomness; for each collected indicator we manually inspect how well it was determined and whether it should be considered a false positive. Similarly, a false negative is defined in terms of Few Indications of Randomness and No Indications of Randomness.

Each test is run 200 times, creating a total of 200 trace logs which are then parsed and analyzed. The resulting outcome of each test is checked and recorded to verify the number of correct and false results generated by our FlakyReporter. Some tests in our dataset were unable to produce flakiness when run with FlakyReporter, which reduced the amount of data available for validation.

Table 4.1 contains the test functions used for evaluating our method. Due to the lack of data found and the difficulties with manifesting flakiness in known flaky tests, the data used for testing is limited. Of the 39 test functions that were set aside for validation, only some managed to manifest any flakiness when run with our FlakyReporter tool. This makes the number of flaky tests usable for evaluation even smaller.


Project Name | Function Name | Commit (SHA-1) | Category
Keras | test_TimeDistributed | 75470e380ff92fa52d600aa3ec0cb0be06773cc1 | Randomness
Keras | test_backends | ea5cb74414286de3bdeb8b752a161e006b9286fa | Randomness
analytic_shrinkage | test_demean | 752a0f90696b1ca27a8b84ada3616cdfc98004a9 | Randomness
crfmnes | test_run_d80_ellipsoid | 5fc9625c86135fb5e8824f867a08cc410d3638bd | Randomness
pytsmp | test_calculate_distance_profile_constant_sequence | 15d58a39e016100fb44cdcc9f6115fa7736eb2bb | Randomness
packeteer | test_variable_string | 2c6e0e8683462e17ff095a89f13a10daf29ef9e0 | Randomness
langsplit | test_partial_autodetect | d30d47bd6f73852dc716f1e79225d8d49dc39daa | Randomness
brian2modelfitting | test_ask_skopt | 7f01463bf5be47cc82e26f033819c8a82cefac2b | Randomness
spreading_dye_sampler | test_for_each_cell | 4282f7609959a31d1b2a4832f3ed643b15c46cb6 | Randomness
wellpathpy | test_resample_onto_unchanged_md | f29b14a9a1c46e77ea98b141e3cdc43674654ea2 | Randomness
filtered-intervaltree | test_compare_addition | 37da431ed36dc9e7936ebd330869f142538ef79d | Randomness
Neuraxle | test_automl_sequential_wrapper | 20c6e5713198345b43bf6899355f1b5cf65eb02c | Randomness
WattTime | test_get_detailed_grid_data | 1f2761a752c6cc241c20e854a7a247349b57be36 | Network
dcard-spider | test_valid_api_forum | ac64cbcfe7ef6be7e554cef422b1a8a6bb968d46 | Network
geocoder | test_komoot_multi_result | 39b9999ec70e61da9fa52fe9fe82a261ad70fa8b | Network
gwosc | test_fetch_event_json_version | 06b56ce506fda3af4857c8a1aae7bb60fb1925e9 | Network
pykew | test_lookup | 557c1852ad7a4366467e344e7a46b42549ce8694 | Network
sida | test_send | 77635d2b08c87612330db8612b70ee85c0d57268 | Network
svp | test_data | fdf64445d8038b4df5ab2181eb6306d56e3b5162 | Network
svp | test_response | fdf64445d8038b4df5ab2181eb6306d56e3b5162 | Network
devtracker | test_end_time_total | ea892d6d48aa5d4627b429469b59ae3f0ce7f10f | Time
cronjob | test_cronrule | a2fd9289f44696c5c06ece9cec8dc5315300eecf | Time
cvbase | test_track_parallell_progress_list | e48d2f76aaeea6472b2175c37780f7b90ae7f1b6 | Concurrency
vault-dev | test_cleanup_on_gc | ccec5a246e4c05358a57fb68d821a16910d2a125 | Async Wait
pipm | test_append_last_line | a9e0f89e80addea18fd3efdc7b4f6eabd6b6988a | IO

Table 4.1: Test functions used for evaluating.


5 Results

5.1 Rerunning Tests & Recreating Flakiness

When recreating the flakiness of a test we encountered several issues and difficulties. Many complications manifested through dependencies and hardware issues, where some test suites did not manage to run at all and some failed on every test.

The approach of finding commits and running the test suite on that specific commit presented several complications. Several projects containing flaky commits could not be built, due to dependencies not cooperating, being outdated, or similar issues. This extends to projects that could run but did not support running older commits: some of these commits were several years old and their dependencies are completely outdated. Downgrading our environment to run these tests was deemed infeasible. The dataset was therefore reduced from 111 projects containing flaky commits down to 92 projects. The inability to run our FlakyReporter on testing frameworks other than pytest resulted in further reductions. The final dataset consisted of 51 usable projects. Appendix A lists all projects found and whether we managed to execute them.

Manifesting flakiness in the suspected tests was also an issue. The dataset [7], created by Gruber et al. [8], lacked instructions about the testing environment used during their empirical study. Because of this, some tests that may have run on their setup could not run on ours. This complicated confirming flakiness, as some tests failed to run at all and some showed no sign of flakiness after 5 000 iterations. This further confirms the assumption that reproducing flakiness is time consuming and problematic due to the non-deterministic behaviour of flaky tests, as stated by earlier works [3, 14, 15, 27]. We further found that tests that managed to display flakiness once did not always do so a second time. When rerunning tests with FlakyReporter for validation, we found that tests we knew to be flaky, because they had displayed flakiness earlier in our study, suddenly did not display flakiness. The consequence was that the dataset of 51 projects was further reduced to 37 projects that could be used for development and verification. Figure 5.1 shows how the number of projects was reduced by these events: the x-axis consists of the events that led to a reduction in the dataset and the y-axis shows the number of projects.


Figure 5.1: Reduction of projects based on events. (Bar chart; x-axis: Initial dataset, Removed non-executable, Selected pytest only, Manifested with FlakyReporter; y-axis: Flaky Projects.)

As we used different categories for our validation, we needed to locate and categorize flaky tests of differing categories. We found that the categories also seemed to produce flakiness differently: Randomness managed to manifest flakiness more easily than, for example, Async Wait. This may be due to the inherently random behaviour of Randomness and its corresponding tests, since locals and assertions are elements of the test code that may be random in every run. This may make it easier for the test to reach the point where a faulty assertion is produced.

The number of reruns needed was also found to be higher than what was stated by Gruber et al. [8]. They mentioned that they required on average 170 reruns before any flakiness was found. However, this was not nearly enough in many of our cases, as we found many test cases to require more than 1 000 iterations before any flakiness appeared. From our experience of rerunning tests, we found that a similar number of reruns was needed for the same test cases when done on the same OS, Python and pytest version. This indicates that using the same hardware and software as Gruber et al. [8] would improve the accuracy of their statement of 170 reruns. Although this held for most test cases, we found three test cases that did not manage to display flakiness for either of us.


5.2 Analyzing Log Files

Our findings when targeting Randomness show that passing and failing tests tend to execute the same code but fail on the final assertion. Variables and return values from function calls were also noted to vary between runs. By following how suspicious variables differ in value across all runs, passing and failing, we can detect whether the flakiness is due to Randomness or not.

We further found that Randomness is not only manifested inside the test code, but may be manifested in any function call made, i.e. in the tested code. Depending on the functions called, randomness may not show up in the test code at all but only in the function call. Because of this, to determine Randomness as a category and locate the root cause, we must also look into the function calls made, their variables and their returned values. Furthermore, we found that the cause of flakiness is mostly due to poorly used random functions, where no seed is used and/or the test fails to take the full range of random values into account.

Keyword searches proved to be efficient when combined with other methods, such as tracing, since only looking for keywords fails to prove Randomness. In some cases, variables that were assigned a random value did not impact the final result; for one test function, test_AveragedFunction, we noted that the locals assigned random values had no impact on the outcome. We found keyword search alone to be lacking as the deciding factor, since a keyword does not inherently mean flakiness, but combined with our tracing method it added useful information to the final report.

5.2.1 Causes of Randomness

From our study of logs and flaky tests in the Randomness category, we found different manifestations of root causes. Mainly, random number generators were the root cause, but other causes were not uncommon. Some root causes could be argued not to pertain to the Randomness category even though Gruber et al. [8] categorized them as such. One example is system time. Time, which is a separate category, seemed to be categorized as Randomness when variables were assigned values based on system time; these assignments produce random behaviour that strongly indicates Randomness. The context of the test could be seen as the decider of whether it belongs to Time or Randomness, as some tests seemed to rely on system time to produce random variables while others produced random variables by mistake.

5.3 Produced Report

The information contained in the produced .html report is explained in this section. Result presents the probability that the test is flaky due to the Randomness category. This can be one of four strings: No Indications of Randomness, Few Indications of Randomness, Some Indications of Randomness and Many Indications of Randomness. No Indications of Randomness represents a result where no or very few identifiers for Randomness were found. Some Indications of Randomness represents a result where a few identifiers for Randomness were found; this result lacks the fully deciding identifier of Randomness but has several identifiers worth taking notice of. Many Indications of Randomness represents the result where several identifiers, or the most impactful identifiers, were found; it is presented when any Randomness is found to impact the final assertions in any way.

Figure 5.2 shows the No Indications of Randomness result, while figure 5.3 shows the Many Indications of Randomness result. When any keyword or divergence is found, buttons are presented that can be pressed to view more information about the findings.


Figure 5.2: Test case producing No Indications of Randomness.

Figure 5.3: Test case producing Many Indications of Randomness.

Meta contains all relevant meta data, such as operating system, Python version, etc. Another section of meta data is Number of Iterations, which presents the number of passing and failing iterations, see figure 5.4. There will always be at least one failed iteration, as the application requires a failed log to function.

Figure 5.4: Iterations of produced report for test function.

5.4 Evaluation

In order to evaluate the method and determine its reliability, identified flaky tests from both Randomness and non-Randomness categories were used. Those within Randomness provided accuracy in terms of how many were correctly identified and how many were missed, whereas the flaky tests from other categories were used to determine the risk of wrongly identifying tests as Randomness.

Figure 5.5 presents the findings from our evaluation. Our method mostly identifies correctly whether a flaky test is due to Randomness or not. The accuracy varies with the data given by the trace logs, but it is mostly accurate. We can also see that the method is more prone to false positives than to false negatives, although both are rare compared to correct results. Since we would rather have false positives than false negatives, the results indicate that we accomplished this. The false positives also seem to come from functions gathering random values whose flakiness stems from another root cause that does not pertain to Randomness. Table 5.1 presents the categories and the number of test cases in each category used for evaluating FlakyReporter.


Figure 5.5: Results from evaluation. (Bar chart; x-axis: Result — Correct, False Positive, False Negative; y-axis: Occurrences — 21 correct, 3 false positives, 1 false negative.)

Category | Flaky Tests
Randomness | 12
Network | 8
Async Wait | 1
Concurrency | 1
Time | 2
IO | 1

Table 5.1: Categories and numbers of tests used for evaluation.

Table 5.2 presents the category types and their respective results, either correct or false. The Non-Randomness category type consists of all categories not pertaining to Randomness. The Correct column gives the number of times FlakyReporter correctly identified whether the test was flaky due to Randomness; False gives the number of times it failed to do so. Our method works best for Randomness and Network, which Gruber et al. [8] found to be the most commonly occurring causes of flakiness in Python projects; FlakyReporter correctly determines the majority of these. Determining how well it works on categories other than Randomness and Network would require a larger dataset before any reliable conclusions can be drawn.

Category Type | Correct | False
Randomness | 11 | 1
Non-Randomness | 10 | 3

Table 5.2: Results from running the tests with FlakyReporter.


Table 5.3 contains the number of passing and failing iterations for the different test functions. Each test has 200 iterations in total but differs in the number of passing and failing iterations.

Function Name | Passed Iterations | Failed Iterations
test_TimeDistributed | 198 | 2
test_backends | 199 | 1
test_demean | 198 | 2
test_run_d80_ellipsoid | 197 | 3
test_calculate_distance_profile_constant_sequence | 194 | 6
test_variable_string | 196 | 4
test_partial_autodetect | 170 | 30
test_ask_skopt | 199 | 1
test_for_each_cell | 197 | 3
test_resample_onto_unchanged_md | 38 | 162
test_compare_addition | 179 | 21
test_automl_sequential_wrapper | 173 | 27
test_get_detailed_grid_data | 199 | 1
test_valid_api_forum | 114 | 76
test_komoot_multi_result | 188 | 12
test_fetch_event_json_version | 199 | 1
test_lookup | 197 | 3
test_send | 174 | 26
test_data | 196 | 4
test_response | 197 | 3
test_end_time_total | 199 | 1
test_cronrule | 193 | 7
test_track_parallell_progress_list | 199 | 1
test_cleanup_on_gc | 2 | 198
test_append_last_line | 199 | 1

Table 5.3: Iterations of tests used for evaluation.

5.5 Tracing

From our initial analysis we found that determining Randomness through tracing is a reliable approach. It also allows combining different methods, such as keyword search, to complement its shortcomings. We found that implementing tracing was simple, especially when running it in conjunction with pytest. Our findings support our selection of tracing as a method for detecting Randomness in test flakiness, as it manages to accurately determine whether the flakiness pertains to Randomness or not. That said, we found two drawbacks in our implementation:

- Overhead - tracing produces a lot of overhead if implemented badly. Our implementation avoids most overhead by only tracing code called by the test function; we further avoid tracing into any Python library code or pytest code.

- Rerunning - rerunning is needed to create the logs used to deduce flakiness. The total time spent on rerunning is reduced by not rerunning the whole test suite.

Even though the drawbacks mostly impact performance negatively, this is mitigated in several ways to avoid the worst impact, as explained in section 4.4.1. Since the drawbacks mainly affect performance, which we did not consider as important as producing a reliable result, we decided that the pros outweighed the cons.


6 Discussion

6.1 Results

Collecting data for use in development, verification and validation was extremely troublesome. We had difficulties finding, installing and recreating flakiness from Python projects, which ultimately impacted both our development speed and our validation.

The aim of this thesis, to create a method and application that correctly identifies Randomness as the cause of a test's flakiness, proved to be complicated. However, we managed to create the basis of a method that works as a proof of concept.

6.1.1 Flaky Tests

Confirming the flakiness was necessary, since we needed to produce and analyze the log files from running the flaky tests.

Reproducing flakiness and analyzing the log files proved to be not only extremely time consuming but also quite difficult, since confirming the flakiness of tests from these projects often relied on multiple dependencies with requirements that soon began to clash with each other. The non-deterministic way in which flaky tests manifest themselves often requires several thousand iterations, in some cases well beyond 5 000, which proved to be a task in itself when executing on local hardware setups.

We noticed quite early on that only the most starred repositories on GitHub were aware of flakiness existing within their projects, leaving tracks behind for us in the shape of comments and possible fixes in their commits. Already at this stage, we began grasping just how limited the available resources were. Had it not been for the already categorized dataset from Gruber et al. [8], there would not have been enough data to build and evaluate a method upon, since time was also limited. Once enough identifiers were found that separated the Randomness category from the rest, we could start deducing, with some likelihood, whether a test could be prone to Randomness or not.


6.1.1.1 Limitations

Our method is expected to be supplied with a flaky test, already identified as such, which is then to be determined whether it is flaky due to Randomness.

The current scope is limited to flaky tests written in Python that use pytest as their testing framework. Even though pytest itself allows executing test files originally written for the unittest framework, this conflicts with the rest of our implementation. An unexpected limitation appeared when the trace logs contained utf-8 encoded text.

These limitations result in a narrowed scope of usability; however, this thesis may be seen as a proof of concept. With an extended time window it would be possible to reduce these limitations and thus increase its usability.

6.1.1.2 Human Error

Everything that requires human interaction is by definition also prone to human error. Even though such a problem is hard to discover in the aftermath, it is important to be aware of, since human error might also impact replicability.

The gathering of flaky tests and their categorization are both heavily prone to human error. This thesis has required replication on our side, since we could not rely on already categorized flaky tests that had been confirmed by others before us. Thus, in order to verify that tests are flaky, we needed the flakiness to manifest on our side as well. By manifesting flakiness ourselves we reduce the likelihood of human error in the data available to us and also confirm its replicability. If replication was not possible, the test was disregarded as not usable.

6.1.2 Tracing

We found that tracing provides the information needed to conclude the impact of randomness on a test and its cause of flakiness. It further provides the means to clearly locate code snippets of interest. Similarly to earlier works utilizing tracing, such as [24, 25, 27], it proved to be very helpful in providing useful data to analyze. We fully believe that tracing is a good approach for locating flakiness in the Randomness category.

Even though tracing is very useful, it suffers from overhead and from issues when selecting the tracing depth (explained in 4.4.1). The overhead is an issue for larger projects when tracing is used; we handled this by reducing the amount of tracing done, only tracing the function stated by the user. The issue with reduced trace depth is that it may miss important execution information that either differs between runs or introduces randomness. From our experience, however, this is a small issue, as the results would mostly not differ, and the extra information gained from deeper tracing is not enough to make it worth the drawbacks.

6.1.3 FlakyReporter

Our tool FlakyReporter manages to run all test functions that utilize the pytest framework. An issue is its inability to function with non-pytest frameworks: other testing frameworks do not support loading pytest plugins, even if the specific framework can be run by pytest. The tool also relies on a specific version of pytest, which narrows its usable scope. Depending on the continued development of pytest, and whether their experimental assertion hook gets fully implemented or not, users may need to downgrade their pytest version to run our tool.

We encountered some difficulties when the created trace logs used utf-8 encoding, which caused several issues when reading and analyzing the logs. Due to our time limitation and the few occurrences of utf-8 encoded logs, we deemed fixing this to be too much work within this thesis. For further development, this can be solved by ensuring correct encoding and decoding, so that neither utf-8 nor ASCII causes issues.

FlakyReporter was developed more as a proof of concept than as a usable tool. Even though we aimed to make it relevant for real use, we have not done any studies on its usability. This also extends to the implementation of the tool: it is not impossible to implement our technique for other testing frameworks or other languages, but it does imply that some aspects, such as assertion results, are prone to be lacking or missing. Our initial thought was to implement it without using pytest, because pytest plugins cannot be loaded by other frameworks. However, we found it very difficult, if not impossible, to implement it in a way that gave us the information we needed, which was one of the reasons why we opted to use pytest.

6.2 Method

Creating an appropriate method for a given problem is not an easy task; it is a strategy for how to yield the desired result. New methods often take inspiration from previous research, even if the desired result is new. This could almost be viewed as recursive, with both alterations and improvements. Thus, a method is always prone to the introduction of flaws from start to end, and our method is no exception.

6.2.1 Creating a Report

Our way of calculating the probability that a flaky test is flaky due to Randomness is somewhat lacking. It is an aggressive method that is prone to producing false positives for categories that show signs of randomness. We consider false positives preferable to false negatives, and therefore consider this form of calculating the probability to be a good enough method in our case.

The content of the report (see section 4.5.9) may also be lacking or hard to understand. This is a result of us choosing to validate our method of detecting flakiness in the Randomness category rather than its applicability for developers debugging the flaky test, which left us with too little feedback to determine its applicability and usability. Meeting with a company early on in the thesis project gave us the impression that they wanted a tool that would indicate suspicious parts and potential fixes for the flakiness. With this in mind, we opted to create a report that leaves the fixing to the developer by giving indications of suspicious parts of the code.

6.3 Source Criticism

The developed method relies on the settrace function from the sys module in Python. Settrace does not belong to the language definition of Python but is part of the implementation platform, and therefore risks not existing in all versions of Python [6]. We considered this acceptable since we could save an existing version locally.

The dataset from M. Gruber [7] is present throughout the thesis in its entirety. If there is any error in this dataset it is possible that we have inherited it, especially given the difficulty of categorizing flaky tests. However, we have manually gone through and replicated the flakiness within these tests to reduce and limit possible errors. This dataset has also been vital for this thesis due to the limited data available. Sjöbom [21] is a similar case where the findings are used, but replicating and validating them proved equally difficult.

The findings by Eck et al. [3] are also used throughout the thesis. Since they are included in most other articles investigating flakiness, we find their categories and findings to be reliable. Their categories are widely comparable to those of Gruber et al. [8].


6.4 Replicability, Reliability and Validity

Since the focus of this thesis is on a specific category and the method is developed towards detecting that specific category, its use might be somewhat narrow. The basis of our method, the use of tracing and parsing of logs, can be considered to have a greater scope and may be helpful when developing techniques for detecting other categories.

Due to the nature of flaky tests, reproducing identical results may only be partly possible, similarly to how we only managed to partly recreate flakiness in what was deemed flaky by Gruber et al. [8] and Sjöbom [21]. We managed to induce flakiness in most cases but failed in some, and this pattern will continue for anyone wanting to recreate our study. In this thesis we try to provide all information needed to recreate and reevaluate our tool and its findings, so as to mitigate this issue.

As stated, we had difficulties finding flaky tests and recreating flakiness in the ones we found. This resulted in a reduced dataset for validation, and our evaluation of FlakyReporter suffers from a shortage of data tested against it. To further establish our tool as reliable, we would require a larger dataset for evaluating its accuracy. Furthermore, to determine the usability of our tool, it would need to be evaluated and used by different developers. The final result of how accurate and usable our tool is therefore remains uncertain for the reasons stated above.

6.4.1 The work in a wider context

This work is available open source, to be used by anyone, through our GitHub repository [16]. It therefore serves as a transparent tool which can benefit anyone wanting to use it. All produced files and gathered information from running the tool are saved in the local environment and are never extracted.

This thesis could be used by software development teams to produce an end product or service of increased quality, since locating and identifying the root causes of flakiness would allow software development teams to strengthen the reliability of the expected outcome. When it comes to security, one could argue that a system is only as secure as its weakest link, which could well be a product failing due to flakiness.


7 Conclusion

With this thesis we aimed to answer the following research question:

RQ: To what extent can tracing or log files be used to locate and identify a test being flaky in the Randomness category?

7.1 Tracing & Log Files

Tracing is a useful technique for analyzing the behaviour and functionality of code. Creating logs from the traced execution results in log files containing relevant data for debugging and for locating code snippets of interest. The created trace logs can in turn be analyzed to locate faulty code or parts of interest pertaining to Randomness flakiness. By following the execution and reading the variables, returns, assertions and keywords from the trace logs, our FlakyReporter correctly identified 84% of the cases used in the evaluation. This indicates that it is possible to determine how likely the flakiness is to be in the Randomness category by using tracing and log files.

RQ: Tracing can be used to gather information about the execution, such as variables and return statements, into trace logs which can in turn be analyzed. By checking the trace logs for indicators of the Randomness category we can determine the probability that the flakiness is within the Randomness category.

7.2 Consequences

The results from this work consist of:

1. A tool that can determine if a test is flaky due to Randomness

2. Indications that tracing can be used to determine flakiness in the Randomness category, which in turn can be helpful for other categories.

Our tool manages to determine whether the flakiness pertains to the Randomness category when used on a test determined to be flaky. It also serves as evidence that tracing may be used when evaluating the category of flakiness in a flaky test. We believe that this will help future work researching methods for automatically categorizing flaky tests. It may also help developers suffering from flaky tests who want a tool to guide them in finding the locations in their tests where the flaky behaviour is created.

We believe that future work using and building on our findings may result in better and more reliable applications and methods, not only for detecting Randomness but also for other categories of flakiness.

7.3 Future Work

With our thesis we have shown that it is possible to determine flakiness in the Randomness category by using tracing. As our work serves more as a proof of concept, further work can build on our findings and lessons learned. Future work could include more accurate scanning of variables; whereas we analyze everything in a more "top layer" approach, deeper scanning of variables would lead to fewer false positives and negatives. Tracing at greater depth may also be of interest for future work, where the functions traced are greater in number and the resulting data collected may allow studying other categories.

Further validation is also needed. Our testing dataset is small, and an evaluation of our tool with a larger dataset would further establish its accuracy and usability.

A study where developers test the tool and give continuous feedback on the content of the generated report would also be beneficial. This would help make the tool more useful for developers and their needs when solving the cause of flakiness. The produced report could also be improved to contain more information, better descriptions of that information, and a technique for filtering out irrelevant information, providing better readability and comprehensibility.


Bibliography

[1] S. Alouneh, S. Abed, B. J. Mohd, and A. Al-Khasawneh. "Relational database approach for execution trace analysis". In: 2012 International Conference on Computer, Information and Telecommunication Systems (CITS). 2012, pp. 1–4. DOI: 10.1109/CITS.2012.6220394.

[2] J. Bell, O. Legunsen, M. Hilton, L. Eloussi, T. Yung, and D. Marinov. "DeFlaker: Automatically Detecting Flaky Tests". In: 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE). 2018, pp. 433–444. DOI: 10.1145/3180155.3180164.

[3] M. Eck, F. Palomba, M. Castelluccio, and A. Bacchelli. "Understanding Flaky Tests: The Developer's Perspective". In: Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ESEC/FSE 2019. Tallinn, Estonia: Association for Computing Machinery, 2019, pp. 830–840. ISBN: 9781450355728. DOI: 10.1145/3338906.3338945.

[4] Python Software Foundation. inspect. URL: https://docs.python.org/3/library/inspect.html.

[5] Python Software Foundation. Python. URL: https://www.python.org.

[6] Python Software Foundation. sys.settrace. URL: https://docs.python.org/3/library/sys.html.

[7] M. Gruber. An Empirical Study of Flaky Tests in Python. Zenodo, Jan. 2021. DOI: 10.5281/zenodo.4450435. URL: https://doi.org/10.5281/zenodo.4450435.

[8] M. Gruber, S. Lukasczyk, F. Kroiß, and G. Fraser. An Empirical Study of Flaky Tests in Python. 2021. arXiv: 2101.09077 [cs.SE].

[9] GitHub Inc. GitHub. URL: https://github.com/.

[10] J. Kraft, A. Wall, and H. Kienle. "Trace Recording for Embedded Systems: Lessons Learned from Five Industrial Projects". In: Runtime Verification. Ed. by H. Barringer, Y. Falcone, B. Finkbeiner, K. Havelund, I. Lee, G. Pace, G. Rosu, O. Sokolsky, and N. Tillmann. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, pp. 315–329. ISBN: 978-3-642-16612-9.

[11] W. Lam, P. Godefroid, S. Nath, A. Santhiar, and S. Thummalapenta. "Root Causing Flaky Tests in a Large-Scale Industrial Setting". In: ISSTA 2019. Beijing, China: Association for Computing Machinery, 2019, pp. 101–111. ISBN: 9781450362245. DOI: 10.1145/3293882.3330570.


[12] W. Lam, K. Muslu, H. Sajnani, and S. Thummalapenta. "A Study on the Lifecycle of Flaky Tests". In: Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. ICSE '20. Seoul, South Korea: Association for Computing Machinery, 2020, pp. 1471–1482. ISBN: 9781450371216. DOI: 10.1145/3377811.3381749.

[13] W. Lam, R. Oei, A. Shi, D. Marinov, and T. Xie. "iDFlakies: A Framework for Detecting and Partially Classifying Flaky Tests". In: 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST). 2019, pp. 312–322. DOI: 10.1109/ICST.2019.00038.

[14] W. Lam, S. Winter, A. Astorga, V. Stodden, and D. Marinov. "Understanding Reproducibility and Characteristics of Flaky Tests Through Test Reruns in Java Projects". In: 2020 IEEE 31st International Symposium on Software Reliability Engineering (ISSRE). 2020, pp. 403–413. DOI: 10.1109/ISSRE5003.2020.00045.

[15] Q. Luo, F. Hariri, L. Eloussi, and D. Marinov. "An Empirical Analysis of Flaky Tests". In: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. FSE 2014. Hong Kong, China: Association for Computing Machinery, 2014, pp. 643–653. ISBN: 9781450330565. DOI: 10.1145/2635868.2635920.

[16] D. Mastell and J. Mjörnman. FlakyReporter. URL: https://github.com/JesperMjornman/FlakyReporter.

[17] J. Mertz and I. Nunes. "On the Practical Feasibility of Software Monitoring: A Framework for Low-Impact Execution Tracing". In: SEAMS '19. Montreal, Quebec, Canada: IEEE Press, 2019, pp. 169–180. DOI: 10.1109/SEAMS.2019.00030. URL: https://doi.org/10.1109/SEAMS.2019.00030.

[18] M. Moe and K. Oo. "Evaluation of Quality, Productivity, and Defect by applying Test-Driven Development to perform Unit Tests". In: 2020 IEEE 9th Global Conference on Consumer Electronics (GCCE). 2020, pp. 435–436. DOI: 10.1109/GCCE50665.2020.9291950.

[19] A. Shi, W. Lam, R. Oei, T. Xie, and D. Marinov. "iFixFlakies: A Framework for Automatically Fixing Order-Dependent Flaky Tests". In: Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ESEC/FSE 2019. Tallinn, Estonia: Association for Computing Machinery, 2019, pp. 545–555. ISBN: 9781450355728. DOI: 10.1145/3338906.3338925.

[20] D. Silva, L. Teixeira, and M. d'Amorim. "Shake It! Detecting Flaky Tests Caused by Concurrency with Shaker". In: 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME). 2020, pp. 301–311. DOI: 10.1109/ICSME46990.2020.00037.

[21] Anders Sjöbom. "Studying Test Flakiness in Python Projects: Original Findings for Machine Learning". MA thesis. KTH, School of Electrical Engineering and Computer Science (EECS), 2019, p. 58.

[22] pytest-dev team. pytest. URL: https://docs.pytest.org/en/6.2.x/.

[23] B. Vancsics, T. Gergely, and Á. Beszédes. "Simulating the Effect of Test Flakiness on Fault Localization Effectiveness". In: 2020 IEEE Workshop on Validation, Analysis and Evolution of Software Tests (VST). 2020, pp. 28–35. DOI: 10.1109/VST50071.2020.9051636.

[24] X. Wang, Q. Gu, X. Zhang, X. Chen, and D. Chen. "Fault Localization Based on Multi-level Similarity of Execution Traces". In: 2009 16th Asia-Pacific Software Engineering Conference. 2009, pp. 399–405. DOI: 10.1109/APSEC.2009.45.

[25] W. E. Wong, R. Gao, Y. Li, R. Abreu, and F. Wotawa. "A Survey on Software Fault Localization". In: IEEE Transactions on Software Engineering 42.8 (2016), pp. 707–740. DOI: 10.1109/TSE.2016.2521368.


[26] Ye Gang, Li Xianjun, Li Zhongwen, and Yin Jie. "Fault localization with intersection of control-flow based execution traces". In: 2011 3rd International Conference on Computer Research and Development. Vol. 1. 2011, pp. 430–434. DOI: 10.1109/ICCRD.2011.5764051.

[27] C. Ziftci and D. Cavalcanti. "De-Flake Your Tests: Automatically Locating Root Causes of Flaky Tests in Code At Google". In: 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME). 2020, pp. 736–745. DOI: 10.1109/ICSME46990.2020.00083.


A Dataset of projects with flaky commits

Project Link Able to run
acl-anthology https://github.com/icoxfog417/acl-anthology No
aiohttp https://github.com/aio-libs/aiohttp No
amplimap https://github.com/koelling/amplimap No
analytic_shrinkage https://github.com/matzhaugen/analytic_shrinkage Yes
arlib https://github.com/gongliyu/arlib Yes
ashpool https://github.com/cktc/ashpool Yes
BeautifulSauce https://github.com/nateraw/BeautifulSauce Yes
beem https://github.com/holgern/beem Yes
blindpie https://github.com/alessiovierti/blindpie No
capsicum https://github.com/m-mizutani/capsicum No
celery https://github.com/celery/celery Yes
cenpy https://github.com/ljwolf/cenpy Yes
CHESS https://github.com/nishaq503/CHESS Yes
cmdkit https://github.com/glentner/CmdKit No
core https://github.com/home-assistant/core/ No
crfmnes https://github.com/nmasahiro/crfmnes Yes
cronjob https://github.com/havefun-plus/cronjob Yes
cuckoopy https://github.com/rajathagasthya/cuckoopy Yes
cvbase https://github.com/hellock/cvbase Yes
data-genie.git https://github.com/mkeshav/data-genie.git Yes
data-partitioner https://github.com/brahle/data_partitioner Yes
dcard-spider https://github.com/leVirve/dcard-spider Yes
DESlib https://github.com/Menelau/DESlib No
devtracker https://github.com/ConSou/devtracker Yes
dicom-standard https://github.com/innolitics/dicom-standard Yes
dict https://github.com/wufeifei/dict Yes
discovery-client https://github.com/tomwei7/discovery-client Yes
docker_compose https://github.com/docker/compose/ No
domain-validation https://github.com/ElliotVilhelm/python-domain-validation Yes
ebook_homebrew https://github.com/tubone24/ebook_homebrew No


filtered-intervaltree https://github.com/lostinplace/filtered-intervaltree Yes
fontMath https://github.com/robotools/fontMath Yes
forkliftpy https://github.com/castelao/supportdata Yes
fuzzylogic https://github.com/amogorkon/fuzzylogic Yes
gamble.git https://github.com/jpetrucciani/gamble.git Yes
GDAX-Python https://github.com/danpaquin/GDAX-Python Yes
geocoder https://github.com/DenisCarriere/geocoder Yes
GoogleFreeTrans https://github.com/ziliwang/GoogleFreeTrans Yes
gwosc https://github.com/gwpy/gwosc Yes
helpers https://github.com/infosmith/helpers Yes
hotwing-core https://github.com/jasonhamilton/hotwing-core Yes
humansort https://github.com/coreygirard/humansort Yes
Intervallum https://github.com/wol4aravio/Intervallum Yes
JavPy https://github.com/TheodoreKrypton/JavPy Yes
jgutils https://github.com/jerodg/jgutils Yes
jupyterhub https://github.com/jupyterhub/jupyterhub/ Yes
keras https://github.com/keras-team/keras/ Yes
langsplit https://github.com/mindey/langsplit Yes
lineflow https://github.com/yasufumy/lineflow Yes
matplotlib https://github.com/matplotlib/matplotlib No
msp2db https://github.com/computational-metabolomics/msp2db Yes
music-dl https://github.com/0xHJK/music-dl Yes
mxnet-octave-conv https://github.com/CyberZHG/mxnet-octave-conv Yes
nba_scraper https://github.com/mcbarlowe/nba_scraper Yes
ndradex https://github.com/astropenguin/ndradex Yes
Neuraxle https://github.com/Neuraxio/Neuraxle Yes
noipy https://github.com/povieira/noipy Yes
noisyopt https://github.com/andim/noisyopt Yes
ontobio https://github.com/biolink/ontobio Yes
orderly-web-deploy https://github.com/vimc/orderly-web-deploy Yes
packeteer https://github.com/lungdart/packeteer Yes
pandas https://github.com/pandas-dev/pandas/ No
pipm https://github.com/jnoortheen/pipm Yes
piripherals https://github.com/quantenschaum/piripherals Yes
ploceidae https://github.com/MATTHEWFRAZER/ploceidae Yes
py_ev https://github.com/JBielan/py_ev Yes
py-finstmt https://github.com/whoopnip/py-finstmt Yes
pyais https://github.com/M0r13n/pyais Yes
pyblnet https://github.com/nielstron/pyblnet No
pykew https://github.com/RBGKew/pykew Yes
pylabeledrf https://github.com/dessimozlab/pylabeledrf Yes
pympesa https://github.com/TralahM/pympesa Yes
pysnooper https://github.com/cool-RR/PySnooper/ Yes
python_filmaffinity https://github.com/sergiormb/python_filmaffinity Yes
python-cuckoo https://github.com/Leechael/python-cuckoo Yes
python-matlab-functions https://github.com/fzyukio/python-matlab-functions Yes
pytsmp https://github.com/kithomak/pytsmp Yes
pyupload https://github.com/yukinotenshi/pyupload Yes
pyvesync https://github.com/markperdue/pyvesync Yes
pywsl https://github.com/t-sakai-kure/pywsl Yes
PyXenaValkyrie https://github.com/xenadevel/PyXenaValkyrie Yes
RanCat https://github.com/mattjegan/rancat Yes


ruTS https://github.com/SergeyShk/ruTS No
salt https://github.com/saltstack/salt No
schemas https://github.com/CIMAC-CIDC/schemas Yes
scikit-procrustes https://github.com/melissawm/skprocrustes Yes
selenium-wire https://github.com/wkeeling/selenium-wire Yes
service-manager https://github.com/hmrc/service-manager Yes
sf-sdk https://github.com/block-cat/sf-sdk Yes
sida https://github.com/liusida/sida Yes
solver https://github.com/thoth-station/solver No
sports.py https://github.com/evansloan/sports.py Yes
spreading_dye_sampler https://github.com/NLeSC/spreading_dye_sampler Yes
SSLyze https://github.com/nabla-c0d3/sslyze Yes
svp https://github.com/joshmgrant/svp Yes
threadedprocess https://github.com/nilp0inter/threadedprocess Yes
token-io https://github.com/overcat/token-io No
torch-layer-normalization https://github.com/CyberZHG/torch-layer-normalization Yes
torchgan https://github.com/torchgan/torchgan Yes
tornado https://github.com/tornadoweb/tornado No
trainline-python https://github.com/tducret/trainline-python Yes
truffleHog https://github.com/MechanicalRock/truffleHog Yes
vault-dev https://github.com/vimc/vault-dev Yes
Verifone https://github.com/vilkasgroup/Verifone Yes
vimms https://github.com/sdrogers/vimms Yes
WattTime https://github.com/stoltzmaniac/WattTime Yes
wellpathpy https://github.com/Zabamund/wellpathpy Yes
wiki-futures https://github.com/AndrewRPorter/wiki-futures Yes
words-to-regular-expression https://github.com/radeklat/words-to-regular-expression Yes
xphyle https://github.com/jdidion/xphyle Yes
zulip https://github.com/zulip/zulip/ No
