
Master Thesis Report
Comparison between an imperative and functional solution executed on the cloud

Java vs Scala for matrix multiplication performed on Amazon EC2

Filippa D. Lidman

June 26, 2022

Filippa Dahlgren Lidman
Internal supervisor: Patrik Eklund
External supervisors: Emil Westin, Jonas Sjödin
Partner: Omegapoint Stockholm AB
Spring 2022
Master Thesis, 30 credits
Master of Science in Computing Science and Software Engineering, 300 credits

Abstract

This thesis presents a comparison of matrix multiplication in the functional and imperative paradigms. The main focus was on the performance of the different implementations executed on the EC2 cloud service. The imperative solution was implemented in Java, and the functional solution was implemented in Scala. This thesis also investigates general differences in lines of written code and the amount of Java bytecode for each implementation. The performance of the two implementations was evaluated through benchmarks, both when running on a local computer and when executed on the cloud. In conclusion, a functional implementation in Scala is preferable when computing matrix multiplication in terms of memory usage and minimum execution time, both on the cloud and on a local computer. The Scala implementation also consisted of less written code, although the Java solution generates a smaller estimated amount of Java bytecode.

Acknowledgements

Felicia Dahlgren Lidman & Johannes Larson - Food providers and life support

Martin Hedberg - A wanker anchor

Evelinn Carlsson & Lina Granberg - My non-lesbian lovers

Amanda Ryman - My study buddy and unknowing literature provider

Emil Söderlind - A Teflon pan

Emil Westin & Jonas Sjödin - External thesis supervisors at Omegapoint

Patrik Eklund - Thesis supervisor

Everybody else - Thank you to every proof-reader, supporter, friend and random reader who has magically found this thesis.

I hope all of you enjoy this thesis, because I have struggled enough to get it done.

Contents

1 Introduction
2 Background
   2.1 Imperative programming
      2.1.1 Example
   2.2 Functional programming
      2.2.1 Example
      2.2.2 Compared to imperative
   2.3 Cloud Computing
      2.3.1 Virtualization and elasticity
      2.3.2 High-performance computing on the cloud
3 Objective
4 Theory
   4.1 Imperative and functional programming languages
      4.1.1 Java
      4.1.2 Scala
   4.2 Matrix multiplication
      4.2.1 Algorithm used
      4.2.2 Recursive
      4.2.3 Strassen's
      4.2.4 SUMMA
5 Evaluation
   5.1 General aspects
      5.1.1 Amount of written code
      5.1.2 Amount of Java bytecode
   5.2 Performance
      5.2.1 Benchmarks
      5.2.2 Java virtual machine (JVM)
      5.2.3 Baseline performance
      5.2.4 Performance on the cloud
         5.2.4.1 Experimental setup
         5.2.4.2 Implementation choices
         5.2.4.3 Running a test
6 Result
   6.1 General aspects
   6.2 Baseline
   6.3 Execution on the cloud
      6.3.1 Cloud execution compared to baseline - Java
      6.3.2 Cloud execution compared to baseline - Scala
7 Discussion
   7.1 Limitations
   7.2 General aspects
   7.3 Performance
   7.4 Ethical and social aspects
      7.4.1 Amazon
      7.4.2 Functional programming
8 Conclusion
   8.1 Future Work

List of Figures

1 Overview of cloud structure
2 Duke, official Java mascot
3 Official logo for the Scala programming language
4 Example of the structure used in the matrix multiplication application
5 Example of running a test on the cloud
6 Minimum execution time for baseline execution of the different solutions
7 Logarithm of the minimum execution time for baseline
8 Average memory usage for baseline execution for the different implementations
9 Minimum execution time for each cloud instance setup
10 Average memory usage for cloud execution for the different instances
11 Minimum execution time for each Java execution
12 Average memory usage for each Java execution
13 Execution time for each Scala execution
14 Memory usage for each Scala execution

List of Tables

1 The different instance types that were used in different executions of the cloud setup
2 Result of amount of written code
3 Result of the estimated amount of Java bytecode

List of Algorithms

1 Pseudocode for Strassen's algorithm
2 Pseudocode for SUMMA

List of Listings

1 Insertion sort in Java
2 Insertion sort with tail-recursion in Scala
3 Example of a message sent from the main point
4 Example of a message sent from a node


1 Introduction

The functional programming paradigm has been around for many years. Over the last decade, interest in this paradigm has grown, especially in high-performance computing [36]. Probably prompted by the growth of typed functional languages [32] and the evolution of multi-paradigm languages such as Go and Rust, the functional paradigm is more widely used today. For many years, functional programming was regarded as more challenging to understand and slower in execution than imperative programming languages [32]. However, the insight that not every part of an application needs to be fast in execution, in combination with the continued evolution of the paradigm, has resulted in the positive use of functional programming.

Another relevant phenomenon in the IT industry is the use of the cloud. It has been shown that companies gain from locating or "migrating" applications to the cloud [30, 29]. Instead of being responsible for a large amount of hardware, a company can rent such hardware through a cloud platform [30, 29]. In high-performance computing, cloud migration may increase the performance of applications [28]. An explanation for the increase in performance is connected to the attributes of a cloud service in comparison to a regular physical cluster: cloud clusters generally allow for more advanced hardware than a company may afford on its own. However, the increased efficiency of running an application on the cloud may be restricted by the application's behaviour [28]. This restriction leads us to the question: what kind of application behaviour may restrict performance when executed on the cloud?

This thesis aims to compare an imperative and a functional program when executed on the cloud. Different aspects of the implementations were compared to evaluate whether application behaviour is related to the choice of paradigm. The considered aspects of behaviour were concurrency, memory usage and time efficiency. In addition, the amount of code needed for the different solutions and the overhead time added when executing on the cloud were compared.

This thesis was done in collaboration with Omegapoint in Umeå, a consulting firm focused on cybersecurity and secure development of IT systems. Omegapoint contributed access to a cloud platform for practical testing during this project.


2 Background

This section describes the main aspects of this thesis's two programming paradigms, namely functional and imperative. In addition, cloud computing is explained, since it is an important aspect that will be explored during the testing of the performance of the two paradigms.

2.1 Imperative programming

Imperative programming is the dominant programming paradigm in commercial software today and has been for many years. It is called "imperative" since the structure of the languages in this paradigm is based on commands [38], specifically assignment commands that operate on variables, where variables are an abstraction of memory fields. The connection between the structure of imperative code and machine instructions puts the imperative paradigm in close relation to machine architectures, which can result in efficient programming languages, at least in theory [38]. The demand for efficient code, in the context of both execution time and time to produce, is the main reason for the dominance of this paradigm. However, the fundamental reason for the use of imperative programming is the way variables imitate natural objects, and that an imperative program can imitate processes that alter the state of such objects. In other words, it is an intuitive programming paradigm to use when writing code [38].

The imperative paradigm includes the sub-paradigm of object-oriented programming languages. Object-oriented programming is structured to ensure encapsulation of parts of the code in a way that general imperative programming does not [38]. An example of this is how in Java, an object-oriented programming language, methods can be declared public, protected or private depending on where the method is supposed to be used. A private method cannot be used outside its class, but a public method can. In comparison, the non-object-oriented programming language C has no similar enforcement, placing no restrictions on functions beyond the file level.

2.1.1 Example

As an example of both an object-oriented approach and of imperative programming, the code in Listing 1 shows an implementation of insertion sort in Java. The code changes the state of the array, which is noticeable outside of the insertionSort method; an array represents memory fields for integers. As an example of the object orientation, the private printArray method is only available within the InsertionSortExample class. In contrast, the public insertionSort method can be called from outside the class.


public class InsertionSortExample {

    public static void insertionSort(int[] array) {
        int length = array.length;
        for (int i = 1; i < length; i++) {
            // Insert array[i] into the already-sorted prefix array[0..i-1]
            int element = array[i];
            int j = i - 1;
            while (j > -1 && array[j] > element) {
                array[j + 1] = array[j];
                j--;
            }
            array[j + 1] = element;
        }
        printArray(array);
    }

    // Private: only usable within InsertionSortExample
    private static void printArray(int[] array) {
        for (int i : array) {
            System.out.print(i + " ");
        }
    }
}

Listing 1: Insertion sort in Java.

2.2 Functional programming

Functional programming is a stateless way of programming based on mathematical functions [32]. The functional paradigm first became widely used when the programming language Lisp was introduced. Lisp is one of the oldest functional programming languages, even though some imperative features were later required for efficiency [32]. Today, more languages of the paradigm are in practical use. This broader practical use depends on the maturing of the functional languages and the development of typed functional programming, e.g. ML, Haskell and F#. Functional programming is mainly used in the areas of database processing, financial modelling, statistical analysis and bioinformatics [32]. The reason for using functional programming in these areas is that it is better adapted to mathematical constructs, since the paradigm is based on mathematical functions.

2.2.1 Example

As an example of functional programming, the code in Listing 2 implements insertion sort using tail recursion in Scala. In functional programming, it is common to use recursion instead of loops, and to use lists instead of the arrays usually used in imperative programming. Because of the stateless programming style, the original list is not altered. The list cannot be altered outside of the insertionSort function, so the printing of the sorted list is done within the function.


class ListExample {

  def insertionSort(list: List[Int]): List[Int] = {
    val addList: List[Int] = sort(list, Nil)
    printList(addList)
  }

  // Insert element into the sorted list, carrying an accumulator of visited heads
  def sortHelper(element: Int, list: List[Int], acc: List[Int]): List[Int] = {
    if (list.isEmpty || element < list.head)
      acc.reverse ++ (element :: list)
    else
      sortHelper(element, list.tail, list.head :: acc)
  }

  def sort(list: List[Int], acc: List[Int]): List[Int] = {
    if (list.isEmpty) acc
    else sort(list.tail, sortHelper(list.head, acc, Nil))
  }

  def printList(list: List[Int]): List[Int] = {
    list match {
      case Nil => list
      case head :: tail =>
        print(" " + head + " ")
        printList(tail)
    }
  }
}

Listing 2: Insertion sort with tail-recursion in Scala

2.2.2 Compared to imperative

The imperative paradigm gives instructions on how the computation is to be done, while the functional paradigm is focused on what needs to be computed. Functional programming is stateless, while imperative programming is focused on changing a program's state [32, 38].

In a more thorough comparison, functional languages have simpler syntactic and semantic structures [32]. However, it can be challenging at first for a person who only has experience in imperative programming to understand functional programming [25]. Functional code is generally compact in the areas where it is best suited, while the code remains comprehensible. A functional implementation can be 10 percent of the amount of code needed for an imperative implementation of the same problem. Yet, in areas less suited for functional programming, the functional implementation may consist of 25 percent more code than the imperative solution [32]. The combination of a simple syntactic and semantic structure, and use in areas where the amount of code can be short and efficient, leads to good general readability of functional languages [25]. On the other hand, observing the examples of insertion sort above, this may not apply (see Listing 2). The examples show similar amounts of code, which can be connected to the sorting algorithm not favouring the functional paradigm, as it lacks a mathematical aspect. Similarly, it may also be evident that the sorting algorithm is better adapted to imperative programming, because the imperative example is more intuitive for this problem (see Listing 1).

Compiled imperative code is generally executed more efficiently than functional code, since functional code needs to be interpreted and often has lazy evaluation (an expression is not evaluated until it is first used). In contrast, imperative code is known to be closer to the machine architecture [32, 38].

When handling concurrency, imperative languages require a lot of attention. It takes more focus on design to structure an imperative program for concurrency, and at the end of the day, it can still be difficult to implement correctly [38]. In imperative programming, when working with concurrency, the program is divided into tasks that later need to be synchronised correctly with regard to non-local and global variables. Functional programming, in contrast, has functions that are inherently easy to run on multiple threads [32], and shared variables do not have to be taken into account.

2.3 Cloud Computing

Using and migrating to the cloud has become more popular during the last decade for several reasons. One of the main reasons is that companies do not have to manage the resources of a physical cluster for more extensive applications when migrating to the cloud, which saves both space and electricity. It may also be more cost-effective to deploy at a cloud provider [28], and it is more environmentally friendly when less electricity is needed [30]. Requiring fewer resources, in combination with the possibility to scale horizontally and to have elasticity among the resources, is very appealing to many companies that handle more extensive computations, large applications or large amounts of data [29].

Cloud computing can be seen as a service or as a group of services [29]. The principle of the cloud is to provide services for applications, storage and computational power [30, 29, 7]. To guarantee these services, cloud providers "rent out" their physical hardware via the internet as a service to users, with the result that a user's device can still access a wide range of resources even if the device itself is not particularly advanced [7]. For an example, see Figure 1, where each service is represented by several servers hosting virtual machines (VMs).

Figure 1: Overview of cloud structure (based on an image from [7]). The grey boxes indicate servers hosting VMs.

2.3.1 Virtualization and elasticity

Virtualization is a technique often used in cloud computing to enable elasticity [29]. Elasticity is a requirement of an official cloud, and it defines horizontal scalability, where the computational resources scale out at high demand and scale in at low demand [30]. Elasticity can, as mentioned, be achieved by virtualization. Virtualization occurs when a pool of virtual machines is located on a physical machine, enabling them to share the hardware resources (system virtualization [7], hardware virtualization [29]). The infrastructure created by virtualization allows cloud users to be provided with one or more virtual machines, sharing the physical resources in a pool with other users, scaling access to the resources up or down depending on demand [7].

2.3.2 High-performance computing on the cloud

As mentioned in Section 2.3, the cloud enables users to access great computational resources through the internet. Throughout the last decade, there have been many evaluations of running high-performance computing (HPC) applications on the cloud [15, 28, 33, 10, 13]. Cloud technology is still evolving, but it is starting to become clear that computing on the cloud may be as efficient as computing on physical clusters [28, 13]. However, difficulties with executing HPC applications on the cloud, in comparison to a physical cluster, may remain. As mentioned by Roloff et al. [28], efficiency when executing an application on the cloud can depend on the behaviour of the application. Jackson et al. [15] report some performance deficiencies when executing communication-intensive HPC applications on the Amazon EC2 cloud in comparison to executing them on a physical cluster. It is also mentioned that the variability of using a cloud can significantly affect applications, because of the virtualization and the inability to alter hardware settings of the cloud. However, Hassanid, Aiatulla & Luksch [13] describe in their report, published four years later in 2014, that these issues had begun to improve on Amazon EC2. Likewise, they mention that HPC-optimised clouds were starting to evolve as well. Tomic, Ogrizovic & Car [33] declare many deficiencies of performing high-performance computing on the cloud in their article. Nevertheless, considering the work of Hassanid, Aiatulla & Luksch [13], some of these issues are beginning to disappear, especially considering that many cloud providers today offer specific high-performance computing services on their clouds, for example Amazon [2]. The increase in high-performance-focused services should indicate that cloud providers are adjusting to encourage running HPC applications on the cloud.


3 Objective

In this thesis, the main question is: "How do an imperative and a functional solution differ when solving the same problem executed on the cloud?" Answering this involves two different comparisons: comparing different aspects of the solutions in general, and comparing how the solutions behave when executed on the cloud. The general aspects are connected to the structure and foundation of the different programming languages used to implement an application in this thesis, while the execution aspects evaluate the performance of the different implementations.

Aspects evaluated between the paradigm solutions are as follows.

General aspects:

• How does the amount of written code differ between the implementations?

• How does the amount of bytecode differ between the implementations?

Aspects connected to the execution on the cloud:

• Performance estimation: how do the different implementations perform on the cloud regarding running time and latency?

• How much does the overhead differ in comparison to when the implementations are executed on a local device?

• What is the amount of memory usage for specific segments of each implementation?


4 Theory

This section explains the programming languages chosen for each paradigm and describes the original problem of matrix multiplication, as well as which algorithms are used for the implementation.

4.1 Imperative and functional programming languages

The two programming languages representing the different paradigms in this thesis are Java and Scala, for the imperative and the functional paradigm respectively. These two programming languages were chosen based on the fact that each is well adapted to its paradigm: Java was created for the purpose of imperative and object-oriented programming, while Scala was adapted for functional programming. The use of two different programming languages would generally result in differences in compilation. These two languages were specifically chosen because they both run on the Java virtual machine (JVM), which means that they are compiled in a similar way. This enables a more even comparison when Java is programmed in a purely imperative, object-oriented manner and Scala is programmed in a purely functional manner. Another reason for comparing these programming languages is that the JVM allows for optimization opportunities.

4.1.1 Java

Java is a general-purpose programming language that is class-based, object-oriented, concurrent and strongly typed. Usually, the Java programming language is compiled to the bytecode instruction set and binary format required by the JVM [21]. As mentioned in Section 2.1, object orientation is an imperative sub-paradigm, which means that Java programming is dependent on the state of the computation. At the same time, the object-oriented aspect ensures an encapsulating structure of the program. In Java, the object-oriented structure is based on classes and objects, while also handling interfaces and abstract classes. Even though Java is seen as an object-oriented language, it has gained, and is still gaining, functionality similar to functional languages. Examples of functional programming functionality are pattern matching [23] and polymorphism [20]. The Java implementation of this project is structured in a strictly imperative and object-oriented manner.

As previously mentioned, Java is developed to be concurrent. Java allows for using threads through the java.util.concurrent packages, which contain the high-level concurrency object Executors, used for different thread-structured implementations [22]. Through the executor interfaces, Futures, thread pools and Fork/Join can be used to structure a threaded program. Java also supports many concurrency objects that ensure thread-safe programs, such as locks, concurrent and synchronised data structures, and atomic variables [22].
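To illustrate the executor-and-future pattern described above, the following is a minimal sketch (written in Scala, which interoperates directly with the java.util.concurrent API discussed here; the summation task is a hypothetical stand-in for real work):

import java.util.concurrent.{Callable, Executors}

object ExecutorSketch {
  def main(args: Array[String]): Unit = {
    // A fixed-size pool of four worker threads, created via the Executors factory
    val pool = Executors.newFixedThreadPool(4)

    // Submitting a Callable returns a Future that will hold the result
    val future = pool.submit(new Callable[Long] {
      override def call(): Long = (1L to 1000000L).sum
    })

    println(future.get()) // get() blocks until the task has completed
    pool.shutdown()       // release the pool's threads
  }
}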

Figure 2: Duke, official Java mascot.


4.1.2 Scala

Scala was created in 2001. Scala is an abbreviation of "scalable language", and it allows you to write complex systems with large amounts of functionality [12]. Scala combines object-oriented and functional programming in a high-level language, which makes it a multi-paradigm language. Scala is built to run on the JVM and has two main characteristics [12]. The first is its advanced type system, which makes it superior to most industrial programming languages, though this also makes Scala more complex. The second is that Scala uses static typing to make the code safer; in this way, errors are checked during compilation.

Scala has two types of approaches to parallelization: using the parallel collections or using the actors library [34]. The parallel collections in Scala are semi-explicit. The keyword par can be used to call a parallel version of a data structure (list.par vs list), and there are also some parallel-specific data structures related to the sequential library (ParArray vs Array) [27]. This approach gives all the thread and grain-size (the size of the task given to each thread or process) control to the framework. The actor library has a feature called Futures, which abstracts send and receive primitives for an object that stores results not yet computed. It works in the same way as the Java Futures interface. The Futures interface allows a computation to be performed concurrently at a later time, with the result collected on demand [34]. It can be implemented as a skeleton data structure, for example a map or list skeleton. The actor variant allows more control over the number of threads and the grain size used in the parallelization. Hence, this variant allows for more implementation design than the parallel collections approach [34].
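A minimal sketch of the semi-explicit parallel collections, assuming Scala 2.12, where they ship with the standard library (in Scala 2.13 and later they live in the separate scala-parallel-collections module):

object ParCollectionsSketch {
  def main(args: Array[String]): Unit = {
    val xs = (1 to 1000000).toVector

    val sequential = xs.map(_ * 2L).sum     // evaluated on a single thread
    val parallel   = xs.par.map(_ * 2L).sum // framework picks threads and grain size

    assert(sequential == parallel) // same result, different evaluation strategy
  }
}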

Figure 3: Official logo for the Scala programming language.

Scala, in comparison to Java, was initially more adapted to functional programming. Yet, Java has added more functional-like characteristics and methods to the language in later years. Similarly, Scala may be optimized by converting functions or operations to Java structures. With this in mind, each implementation is adapted to the paradigm the language was designed for in the first place, even though the language may offer more functionality. Disregarding functionality not relevant to the paradigm does not exclude the object-oriented structure used for both implementations: that structure is used as "best practice" in both programming languages and should not affect the performance or the general aspects.

4.2 Matrix multiplication

In this thesis, the Java and Scala implementations handle the problem of matrix multiplication. Matrix multiplication is used in many different areas of computer science. It is one of the most fundamental operations in linear algebra and an essential part of many different algorithms [9]. For example, matrix multiplication today plays an important part in graphics and machine learning.

Regular matrix multiplication, as known from linear algebra, multiplies the elements of two matrices to evaluate a resulting matrix, $C = A \times B$ [6]. The elements in each matrix can be described as $A = (a_{ij})$ and $B = (b_{ij})$, where $i, j = 1, 2, \ldots, n$. An element $c_{ij}$ of $C$ is then calculated as

$$c_{ij} = \sum_{k=1}^{n} a_{ik} \times b_{kj} \tag{1}$$

where matrices $A$ and $B$ have size $n \times n$. Alternatively, $A$ is an $m_A \times n$ matrix and $B$ is an $n \times m_B$ matrix, where $m_A \neq m_B$. In this thesis, however, the implementation only handles square matrix multiplication.
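As a small worked instance of Equation 1 with $n = 2$ (an illustrative example; the same matrices appear in the message examples of Listings 3 and 4):

$$\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \times \begin{bmatrix} 0 & 1 \\ 2 & 4 \end{bmatrix} = \begin{bmatrix} 1 \cdot 0 + 2 \cdot 2 & 1 \cdot 1 + 2 \cdot 4 \\ 3 \cdot 0 + 4 \cdot 2 & 3 \cdot 1 + 4 \cdot 4 \end{bmatrix} = \begin{bmatrix} 4 & 9 \\ 8 & 19 \end{bmatrix}$$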

There are many algorithms for matrix multiplication, and there have been many improvements throughout the years. A commonly known efficient algorithm for matrix multiplication is Strassen's algorithm from 1969. Strassen's algorithm has a time complexity of $O(n^{2.81})$ [1], while the straightforward recursive algorithm for matrix multiplication has a time complexity of $O(n^3)$ [6]. Strassen's algorithm allows the matrices A and B to be divided and computed so that one less multiplication is needed than in the recursive approach. Strassen's algorithm does this by using a series of additions and subtractions (see Section 4.2.3), which explains the decreased time complexity [6]. There is also Winograd's algorithm, which reduces the work by using only 15 additions and subtractions instead of Strassen's 18 [9]. Although Winograd's algorithm needs fewer operations than Strassen's, it makes the algorithm as a whole somewhat more complex.

In later years, multiple further improvements to matrix multiplication algorithms have been made, resulting in a powerful toolbox of techniques for this problem. The latest improved algorithm was published in 2021 by Josh Alman and Virginia Vassilevska Williams [1], with a time complexity of $O(n^{2.37286})$. Nevertheless, these improved algorithms may not be better in practice when parallelization, cache awareness and implementation complexity are considered.

For implementation purposes, there exist further algorithms, such as Cannon's and SUMMA [35], that take parallelization into account. However, these algorithms may need more resources in the form of cores, in contrast to Strassen's and Winograd's, which are more dependent on memory efficiency for implementation purposes.

4.2.1 Algorithm used

An appropriate approach for this thesis, considering these factors, was introduced by Nguyen et al. [18], who describe a way to combine Strassen's and Cannon's algorithms. This generalized algorithm uses the effectiveness of both algorithms to compensate for some of their drawbacks. The time complexity of the combined algorithm depends on the recursion level of Strassen's algorithm [18]. It is also mentioned by Nguyen et al. [18] that this type of algorithm is suitable for combining Strassen's or Winograd's algorithm with Cannon's, SUMMA or any similar algorithm.

Considering that the combination algorithm handles parallelization, recursion and arithmetic operations, it was believed that it would give an interesting implementation comparison for this thesis. The mathematical aspect of matrix multiplication may seem advantageous for the functional Scala implementation, but the algorithm's generally imperative structure and parallel aspect were assumed not to leave the Java implementation at a disadvantage. Nguyen et al. [18] introduced the combination algorithm via a combination of Strassen's and Cannon's algorithms. However, the SUMMA algorithm was chosen for this thesis, since SUMMA is more adaptive than Cannon's algorithm: Cannon's algorithm needs a large number of processors arranged in a structured manner for peak performance, whereas the SUMMA algorithm can perform just as well with fewer, unstructured processors, and also handles non-uniform splitting [35]. Strassen's algorithm was chosen because it is efficient performance-wise yet not too advanced to implement. In contrast, Winograd's algorithm is not as applicable to the recursive structure used in the combination algorithm as Strassen's algorithm. The more recursively applicable structure of Strassen's algorithm makes it the reasonable choice in this thesis, considering the time limitation for implementation. There are also still newly published articles discussing the use of Strassen's algorithm and improvements of it for implementation purposes [9], indicating that it is often more applicable for practical use than Winograd's algorithm.

Summed up, three different segments of different matrix multiplication algorithms were used in this thesis. First, the main division of the matrix was done in the same way as in the recursive algorithm. For the node matrix computations, a combination of Strassen's and the SUMMA algorithm was implemented in Java and Scala, based on the idea described by Nguyen et al. [18].

4.2.2 Recursive

The recursive algorithm [6] is a divide-and-conquer algorithm. The algorithm itself is not used in this thesis; rather, its approach to dividing the matrix is used to get an even division into sub-matrices. The division is done in the following manner. Matrices A, B and C are of size $n \times n$, where $A_{ij}$, with $i, j = 0$ or $1$, indicates an $n/2 \times n/2$ segment of the A, B or C matrix.

$$A \times B =
\begin{bmatrix} A_{00} & A_{01} \\ A_{10} & A_{11} \end{bmatrix}
\times
\begin{bmatrix} B_{00} & B_{01} \\ B_{10} & B_{11} \end{bmatrix}
=
\begin{bmatrix} C_{00} & C_{01} \\ C_{10} & C_{11} \end{bmatrix}
= C \tag{2}$$

To compute matrix C, the segments of the C matrix can be computed by dividing the computations:

$$\begin{aligned}
C_{00} &= A_{00} \times B_{00} + A_{01} \times B_{10} \\
C_{01} &= A_{00} \times B_{01} + A_{01} \times B_{11} \\
C_{10} &= A_{10} \times B_{00} + A_{11} \times B_{10} \\
C_{11} &= A_{10} \times B_{01} + A_{11} \times B_{11}
\end{aligned} \tag{3}$$
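A minimal Scala sketch of this quadrant split, under the assumptions that n is even and that matrices are represented as Array[Array[Double]] (the helper names are hypothetical):

object SplitSketch {
  type Matrix = Array[Array[Double]]

  // Extract the n/2 x n/2 quadrant whose top-left corner is (row, col)
  def quadrant(m: Matrix, row: Int, col: Int): Matrix = {
    val h = m.length / 2
    Array.tabulate(h, h)((i, j) => m(row + i)(col + j))
  }

  // Returns (A00, A01, A10, A11) as in Equation 2
  def split(m: Matrix): (Matrix, Matrix, Matrix, Matrix) = {
    val h = m.length / 2
    (quadrant(m, 0, 0), quadrant(m, 0, h),
     quadrant(m, h, 0), quadrant(m, h, h))
  }
}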

4.2.3 Strassen’s

Strassen's algorithm consists of 7 matrix multiplications and 18 matrix additions/subtractions [14, 6]. The following calculations describe the sub-matrix computations and the resulting matrix operation segments, where matrices A, B and C are divided as seen in Equation 2. The 10 sub-matrix computations are:

$$\begin{aligned}
S_0 &= A_{00} + A_{01} \\
S_1 &= A_{10} + A_{11} \\
S_2 &= A_{00} + A_{11} \\
S_3 &= A_{01} - A_{11} \\
S_4 &= A_{00} - A_{10} \\
S_5 &= B_{00} + B_{01} \\
S_6 &= B_{10} + B_{11} \\
S_7 &= B_{00} + B_{11} \\
S_8 &= B_{01} - B_{11} \\
S_9 &= B_{10} - B_{00}
\end{aligned} \tag{4}$$

The seven sub-matrix multiplications, with the help of the matrices from Equation 4, where each multiplication is done recursively or through Algorithm 1:

$$\begin{aligned}
P_0 &= A_{00} \times S_8 \\
P_1 &= S_0 \times B_{11} \\
P_2 &= S_1 \times B_{00} \\
P_3 &= A_{11} \times S_9 \\
P_4 &= S_2 \times S_7 \\
P_5 &= S_3 \times S_6 \\
P_6 &= S_4 \times S_5
\end{aligned} \tag{5}$$

The resulting matrix C is computed through the following calculations:

$$\begin{aligned}
C_{00} &= P_4 + P_3 - P_1 + P_5 \\
C_{01} &= P_0 + P_1 \\
C_{10} &= P_2 + P_3 \\
C_{11} &= P_4 + P_0 - P_2 - P_6
\end{aligned} \tag{6}$$

The pseudocode used for the implementation was:

Algorithm 1 Pseudocode for Strassen's algorithm, where A and B are two n × n matrices that are to be multiplied. P_i indicates a P matrix from Equation 5.

    function Strassen(A, B)
        let C be a new n × n matrix
        Divide A, B and C into n/2 × n/2 sub-matrices (see Equation 2)
        Create the 10 sub-matrices according to the additions and subtractions in Equation 4
        for i = 0 ... 6 do
            if n is not small enough and n/2 is divisible by 2 then
                Call Strassen recursively for the given P_i computation
            else
                Call the SUMMA function for P_i
            end if
        end for
        for each n/2 × n/2 C sub-matrix do
            Add and subtract P matrices according to Equation 6
        end for
        return C
    end function

The first loop assesses the size of the matrix. If the matrix is as small as or smaller than the given limit, SUMMA matrix multiplication is called for the given P_i matrix; otherwise, Strassen's algorithm is called recursively for the P_i matrix. The second loop indicates that, when these matrices have been individually calculated, the calculated P_0 ... P_6 matrices are combined by the rules specified in Equation 6.
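The following is a compact Scala sketch of this recursion skeleton, not the thesis implementation: matrices are assumed to be Array[Array[Double]], a naive multiplication stands in for the SUMMA base case, and the cutoff test simplifies the size check in the pseudocode above:

object StrassenSketch {
  type Matrix = Array[Array[Double]]

  def add(a: Matrix, b: Matrix): Matrix =
    Array.tabulate(a.length, a.length)((i, j) => a(i)(j) + b(i)(j))

  def sub(a: Matrix, b: Matrix): Matrix =
    Array.tabulate(a.length, a.length)((i, j) => a(i)(j) - b(i)(j))

  // Naive multiplication as the base case (stands in for the SUMMA call)
  def baseMultiply(a: Matrix, b: Matrix): Matrix = {
    val n = a.length
    Array.tabulate(n, n)((i, j) => (0 until n).map(k => a(i)(k) * b(k)(j)).sum)
  }

  def quadrant(m: Matrix, r: Int, c: Int): Matrix = {
    val h = m.length / 2
    Array.tabulate(h, h)((i, j) => m(r + i)(c + j))
  }

  // Reassemble the four quadrants into one matrix
  def combine(c00: Matrix, c01: Matrix, c10: Matrix, c11: Matrix): Matrix = {
    val h = c00.length
    Array.tabulate(2 * h, 2 * h) { (i, j) =>
      if (i < h && j < h) c00(i)(j)
      else if (i < h) c01(i)(j - h)
      else if (j < h) c10(i - h)(j)
      else c11(i - h)(j - h)
    }
  }

  def strassen(a: Matrix, b: Matrix, cutoff: Int): Matrix = {
    val n = a.length
    if (n <= cutoff || n % 2 != 0) baseMultiply(a, b)
    else {
      val h = n / 2
      val (a00, a01, a10, a11) =
        (quadrant(a, 0, 0), quadrant(a, 0, h), quadrant(a, h, 0), quadrant(a, h, h))
      val (b00, b01, b10, b11) =
        (quadrant(b, 0, 0), quadrant(b, 0, h), quadrant(b, h, 0), quadrant(b, h, h))
      // The seven products of Equation 5, built from the S matrices of Equation 4
      val p0 = strassen(a00, sub(b01, b11), cutoff)           // A00 x S8
      val p1 = strassen(add(a00, a01), b11, cutoff)           // S0 x B11
      val p2 = strassen(add(a10, a11), b00, cutoff)           // S1 x B00
      val p3 = strassen(a11, sub(b10, b00), cutoff)           // A11 x S9
      val p4 = strassen(add(a00, a11), add(b00, b11), cutoff) // S2 x S7
      val p5 = strassen(sub(a01, a11), add(b10, b11), cutoff) // S3 x S6
      val p6 = strassen(sub(a00, a10), add(b00, b01), cutoff) // S4 x S5
      // Combination rules of Equation 6
      combine(
        add(sub(add(p4, p3), p1), p5), // C00 = P4 + P3 - P1 + P5
        add(p0, p1),                   // C01 = P0 + P1
        add(p2, p3),                   // C10 = P2 + P3
        sub(sub(add(p4, p0), p2), p6)  // C11 = P4 + P0 - P2 - P6
      )
    }
  }
}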

4.2.4 SUMMA

The SUMMA algorithm as described in [11] is the one used in this thesis. Algorithm 2 describes how the matrix C is calculated via parallel work. The number of parallel tasks depends on the number of processors per VM; it is also possible to execute the algorithm non-parallel. The algorithm is based on transposing the columns of matrix B, following $C = AB^T$, so that an element $c_{ij}$ of C is calculated as

$$c_{ij} = \sum_{k=1}^{n} a_{ik} \times b^{T}_{jk} \tag{7}$$

The resulting matrix C is the same as in Equation 1, though the way of calculating it differs. The SUMMA algorithm demands more consideration when dividing and combining the result in comparison with Cannon's [35]. Despite this, the SUMMA algorithm has shown better performance in practice [11].

Algorithm 2 Pseudocode for SUMMA. A_ij and B_ij indicate two n × n matrices, with i and j indicating the sub-matrix segment as seen in Equation 2.

    function SUMMA(A_ij, B_ij)
        let C_ij = 0
        for l = 1 ... n do
            Divide columns b_lj among the threads
            In each thread: form c_il = A_ij × b^T_lj
            Summarise over all c_il
        end for
        return C
    end function

The algorithm takes a given matrix and divides its columns depending on the number of threads. Here, a check occurs of whether the matrix size is evenly divisible among the threads; if not, the program divides the work unevenly among the threads, and the thread that finishes first calculates an extra task. After each thread is done calculating, the results are summarised into a matrix of the original size, which is returned.
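A minimal sketch of this column-wise division of work using Scala Futures; the matrix representation and helper names are assumptions, and B is taken to be stored transposed as described above:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object SummaSketch {
  type Matrix = Array[Array[Double]]

  // One task: compute the given columns of C, reading rows of the transposed B
  def columnsTask(a: Matrix, bT: Matrix, cols: Seq[Int]): Seq[(Int, Array[Double])] =
    cols.map { j =>
      j -> a.map(row => row.indices.map(k => row(k) * bT(j)(k)).sum)
    }

  def summa(a: Matrix, bT: Matrix, threads: Int): Matrix = {
    val n = a.length
    // Divide the n columns into blocks; a trailing block may be smaller (uneven work)
    val work = (0 until n).grouped(math.max(1, n / threads)).toSeq
    // One future per column block; the execution context schedules them onto threads
    val futures = work.map(cols => Future(columnsTask(a, bT, cols)))
    val pieces = Await.result(Future.sequence(futures), Duration.Inf)

    // Summarise the pieces back into a full-size result matrix
    val c = Array.ofDim[Double](n, n)
    for ((j, column) <- pieces.flatten; i <- 0 until n) c(i)(j) = column(i)
    c
  }
}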


5 Evaluation

This section describes different general aspects and how the performance of the implementations was evaluated.

5.1 General aspects

Two general aspects of the different implementations were evaluated: the amount of written code and the estimated amount of Java bytecode.

5.1.1 Amount of written code

When assessing the amount of written code for each implementation, a line calculator plugin for VSCode was used [37]. The plugin was applied to the classes relevant for the different cloud-adapted matrix multiplication implementations, in other words, the code needed for the node representation (see Figure 4). The interest in the amount of written code is connected to what was mentioned in Section 2.

5.1.2 Amount of Java bytecode

Assessing the amount of Java bytecode for each solution was done by running a counter script. The script estimated the Java bytecode through the javap -c file.class command on each .class file connected to each implementation [24]. The command prints "the instructions that comprise the Java bytecodes", which gives an estimate of the amount of Java bytecode when counting the number of lines. Each line was evaluated as one or more bytes depending on which instruction was represented on that particular line.

The interest in the amount of Java bytecode stands in contrast to the amount of written code. The amount of Java bytecode may signify ease of compilation: a solution creating a larger amount of Java bytecode may need more time to process the written code into executable code.
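A sketch of what such a counter could look like in Scala (assuming javap is on the PATH; the heuristic of matching lines of the form "offset: opcode" is an assumption about the disassembly format):

import scala.sys.process._

object BytecodeCounter {
  def main(args: Array[String]): Unit = {
    // args: paths to the .class files of one implementation
    val total = args.map { classFile =>
      val disassembly = Seq("javap", "-c", classFile).!!
      // Count lines that carry a bytecode instruction, e.g. "   4: iload_1"
      disassembly.linesIterator.count(_.trim.matches("""\d+:\s+\w+.*"""))
    }.sum
    println(s"Estimated bytecode instruction lines: $total")
  }
}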

5.2 Performance

During the evaluation of the performance of the two implementations, benchmarks were implemented to gather metrics. The benchmarks were gathered both from a baseline execution and when the implementations were run on the cloud. Since the tests assessing performance were running on the Java virtual machine, a warm-up was performed in both cases.

5.2.1 Benchmarks

The following benchmarks were collected when executing the different matrix multiplication implementations. Some aspects were measured based on benchmarks found in articles evaluating parallel applications or applications running on the cloud [10, 13], and some were based on evaluations done for matrix multiplication implementations [8, 14]:

• Execution time


• Memory usage for a segment of the program

The benchmarks were implemented manually in the running code and gathered in .csv files. This choice was based on the fact that it was difficult to benchmark memory usage through existing benchmark libraries such as the Java Microbenchmark Harness (JMH) [19] and its Scala counterpart [31].
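A minimal sketch of such a manual benchmark in Scala; the CSV layout and helper names are assumptions, and the memory reading uses the JVM Runtime API, consistent with the forced garbage collection described in Section 5.2.2:

import java.io.{FileWriter, PrintWriter}

object ManualBenchmark {
  // Measure wall-clock time and heap growth around a single computation
  def measure[A](body: => A): (A, Double, Double) = {
    val rt = Runtime.getRuntime
    System.gc() // force a collection so old garbage is not counted
    val memBefore = rt.totalMemory() - rt.freeMemory()
    val start = System.nanoTime()
    val result = body
    val seconds = (System.nanoTime() - start) / 1e9
    val usedMb = (rt.totalMemory() - rt.freeMemory() - memBefore) / 1e6
    (result, seconds, usedMb)
  }

  // Append one measurement as a CSV row: matrix size, seconds, memory in MB
  def appendCsv(file: String, size: Int, seconds: Double, memoryMb: Double): Unit = {
    val out = new PrintWriter(new FileWriter(file, true)) // append mode
    try out.println(s"$size,$seconds,$memoryMb")
    finally out.close()
  }
}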

5.2.2 Java virtual machine (JVM)

Java and Scala code is dynamically compiled by the JVM, which was taken into consideration when the performance was measured [26]. The JVM interprets and optimises the code before converting it to machine code.

A warm-up was required before evaluating the performance of the executed code, because of the dynamic compilation of the JVM. Before evaluating, all optimizations and conversions should be done, which often requires a few executions beforehand. Because of the cloud execution, the warm-up was done through a manual implementation: before code was executed on each node (see Figure 4), a warm-up method was called. This method consisted of executing the matrix multiplication functionality multiple times.
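A sketch of what such a warm-up method could look like; the multiply parameter and the number of rounds are hypothetical stand-ins:

object WarmUp {
  type Matrix = Array[Array[Double]]

  // Run the code under test a few times so the JIT compiler has
  // compiled and optimised the hot paths before measurement starts
  def warmUp(multiply: (Matrix, Matrix) => Matrix,
             a: Matrix, b: Matrix, rounds: Int = 10): Unit =
    (1 to rounds).foreach(_ => multiply(a, b))
}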

Another aspect to take into consideration when running benchmarks on the JVM is garbage collection. To ensure that the garbage collector's time consumption did not interfere too much with the benchmark, a forced garbage collection was performed before running the benchmarks, and multiple executions were used to average out the effects of the garbage collection [26].

5.2.3 Baseline performance

The baseline performance was evaluated by executing the matrix multiplication implementations repeatedly with different matrix sizes. The benchmarks mentioned above were collected, and the warm-up was done before every execution. The matrix multiplication algorithm used is the combination of Strassen's and the SUMMA algorithm previously mentioned in Section 4.2.1; Algorithm 1 best describes the overview of the algorithm, complemented by Algorithm 2. The baseline executions were done on a machine with the following specification:

• Operating System : Windows 10 (64-bit)

• CPU : Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz

• RAM : 8.00 (7.85 usable) GB

5.2.4 Performance on the cloud

When investigating the performance on the cloud, a more intricate setup was used, compared to the straightforward execution used for the baseline performance.

5.2.4.1 Experimental setup

The matrix multiplication application on the cloud followed the structure shown in Figure 4.


The application was structured as a main point communicating with nodes via a queue. The main point was responsible for dividing the matrix into segments, where each node (VM) was responsible for computing a sub-matrix and later sending the computed segment back to the main point, along with benchmarks of the execution in each node.

The matrix multiplication algorithm used for calculating the sub-matrices is the combination of Strassen's and the SUMMA algorithm mentioned in Section 4.2.1. Algorithm 1 best describes the overview of the node algorithm, complemented by Algorithm 2. The division of the main matrix at the main point was done as described in Section 4.2.2. The recursion depth of the Strassen algorithm was governed by a cutoff of 1/6 of the node matrix size, i.e. when a matrix reached a size of 1/6 of the given original matrix size, the SUMMA algorithm was called (see the else statement in Algorithm 1). The benchmarks measured the execution time and memory usage for each sub-matrix calculation.

Figure 4: Example of the general structure used in the matrix multiplication application. The main point is responsible for creating the sub-matrix computation nodes and dividing the matrix among the instances. The node instances later send back a resulting matrix and performance metrics to the main point.

The cloud service used in this thesis was EC2, provided by Amazon Web Services (AWS) [2]. Figure 4 shows the need for 5 instances on the cloud for this specific setup: 1 main point and 4 nodes. The instances used can be specified by a software image (AMI) and a virtual server type (instance type). In this thesis, all the instances had the same AMI, while the instance type varied. The instance types used during the executions are shown in Table 1. The AMI used was: Amazon Linux 2 AMI (HVM) - Kernel 5.10, SSD Volume Type.

Instance type   vCPU   Memory settings
c5.large        2      5 GiB
c5.xlarge       4      8 GiB
c5.2xlarge      8      16 GiB

Table 1: The different instance types that were used in different executions of the cloud setup.


5.2.4.2 Implementation choices

Since Amazon's EC2 cloud on AWS was used, Amazon's message queue, Amazon Simple Queue Service (SQS), was also used for this implementation [3]. SQS is an efficient message queue for this type of architecture [4]. In addition, the message queue allows for a decoupled structure in a distributed system, which in this case was optimal given the need to switch between Scala and Java code for testing, and it was easy to use when communicating between EC2 instances.

The message broker was implemented using two queues: one for communication from the main point to the nodes, and one for communication from a node to the main point. This choice was made to avoid any trouble with receiving a wrongful message and needing to send it back; this implementation choice does not affect the testing in any way.

The message structure used in the implementation was a JSON object containing information about a matrix, which segment of the matrix it represents, and other relevant information. An example of the two different messages is illustrated in Listings 3 and 4. The size field in Listing 3 only indicates the width of the matrix, since the implementations were restricted to handling only square matrices.

1 {2 "segment" : "C11"3 "size" : 24 "matrix A" : [[1.0 ,2.0][3.0 ,4.0]]5 "matrix B" : [[0.0 ,1.0][2.0 ,4.0]]6 }7

Listing 3: Example of a message sent from the main point.

1 {2 "segment" : "C11"3 "result" : [[4.0 ,9.0][8.0 ,19.0]]4 "time" : 3.8365 "memory" : 75.326 }7

Listing 4: Example of a message sent from a node

Parallelization in both implementations was based on the use of Futures. This choice was inspired by information gathered from the work of Totoo, Deligiannis and Loidl [34] and Bloch [5], as well as the fact that a Futures library exists in both programming languages.
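A minimal sketch of this Futures-based parallelization on the Scala side; the multiply function and the segment pairs are hypothetical stand-ins for the node computations:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object FuturesSketch {
  type Matrix = Array[Array[Double]]

  // Start one asynchronous task per sub-matrix pair and wait for all results
  def multiplyAll(segments: Seq[(Matrix, Matrix)],
                  multiply: (Matrix, Matrix) => Matrix): Seq[Matrix] = {
    val tasks: Seq[Future[Matrix]] =
      segments.map { case (a, b) => Future(multiply(a, b)) }
    Await.result(Future.sequence(tasks), Duration.Inf)
  }
}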

5.2.4.3 Running a test

When a test was executed, the main point instance was created through the AWS website, where the instance triggered a start script which initialized the VM and started the main program. After that, the main point program was responsible for starting the other needed node instances. An example of running a test is shown in Figure 5.


Figure 5: Example of running a test on the cloud with the setup described in Figure 4.


6 Result

6.1 General aspects

As shown in Table 2, the Java application consists of more written code than the Scala application. On the other hand, the Java solution generated less Java bytecode than the Scala solution, as observed in Table 3.

Implementation language   Lines of written code
Java                      567
Scala                     267

Table 2: Amount of lines of written code for the two implementations.

Implementation language   Lines of Java bytecode
Java                      1610
Scala                     2629

Table 3: Estimated amount of bytecode, based on the class files of each implementation.

6.2 Baseline

Observing Figure 6, it is clear that the Java implementation needs more time when multiplying larger matrices. Figure 7 further shows that the Java solution had a more significant increase in execution time when the matrix size increased. In comparison, the Scala solution did not show as distinct an increase for larger matrices, as illustrated in Figure 7.

Figure 6: Minimum execution time (seconds) for baseline execution of the various solutions calculating a matrix of a given size (MB).


Figure 7: Logarithm of the minimum execution time (seconds) for baseline execution of the different solutions calculating a matrix of a given size (MB).

Figure 8 shows irregular memory usage when computing smaller matrices for both implementations. However, the implementations reached a point in matrix size where the Java solution escalates in memory use. The Scala graph seems unreliable, since declining memory usage is not logical when multiplying larger matrices.

Figure 8: Average memory usage for baseline execution when calculating a matrix of a given size (MB) for the different implementations.


6.3 Execution on the cloud

The execution time for the different instance setups is shown in Figure 9. Comparing the instances, it is clear that the Scala implementation had a fast execution time independent of the number of cores used, while the Java solution had a significantly slower minimum execution time. However, the Java solution shows a lower execution time as the number of cores increases.

Figure 9: Minimum execution time (seconds) for each cloud instance setup when calculating a matrix of a given size (MB). The number on the right indicates the number of cores available for the given instance.

The average memory usage for the different solutions when executed on the cloud, shown in Figure 10, indicates that less memory was used when running the Scala implementation. The Java implementation had a significant increase in used memory as the number of cores increased. Comparing the instances, it is clear that the memory used by the Scala implementation also increases when the number of available cores increases, although this increase is not as significant as for the Java solution.


Figure 10: Average memory usage (GB) when calculating a matrix of a given size (MB). The number on the right indicates the number of cores available for the given instance.

6.3.1 Cloud execution compared to baseline - Java

A comparison of the cloud execution times for the Java implementation and the baseline execution time is shown in Figure 11. It clearly shows that the minimum baseline execution time was faster than the minimum execution time for the cloud executions involving instances with two and four cores. The cloud execution for the instances with eight cores had a minimum execution time similar to the baseline execution for larger matrices. This indicates that the added overhead time of executing on the cloud will not be noticed when using instances with eight cores or more.


Figure 11: Minimum execution time (seconds) for each Java execution calculating a matrix of a given size (MB).

Figure 12 reinforces the impression given by Figure 10. It is clear in Figure 12 that the two-core cloud execution and the baseline execution consumed similar amounts of memory, while the other cloud executions required more memory for larger matrix sizes.

Figure 12: Average memory usage (GB) when calculating a matrix of a given size (MB) for each Java execution.

6.3.2 Cloud execution compared to baseline - Scala

As shown in Figure 13, the Scala baseline had the fastest minimum execution time of the different executions, indicating that there is an overhead when executing Scala on the cloud. However, an increased number of cores when running on the cloud instances seems to lower the overhead for the Scala implementation. At the same time, the overhead in question is minimal.

Figure 13: Minimum execution time (seconds) for each Scala execution when calculating a matrix of a given size (MB).

Figure 14 shows the average memory used for the different Scala executions. The baseline execution had the most irregular memory consumption for smaller matrices, whereas it is unreliable for large matrix sizes. For the other executions, it is clear that the Scala implementation, like the Java implementation (Figure 12), uses more memory when more cores are available. However, the Scala solution's memory use increases incrementally, while the Java solution uses similar amounts of memory for 4 and 8 cores on the cloud.


Figure 14: Average memory usage (GB) when calculating a matrix of a given size (MB) for each Scala execution.


7 Discussion

This section discusses the results of the general aspects, the performance aspects and the limitations that occurred during the practical testing. Ethical and social aspects connected to this thesis are also mentioned.

7.1 Limitations

The greatest limitation of this thesis was the limited amount of data gathered for the tests executed on the cloud. Constructing the setup for testing the performance on the cloud took more time than expected, and this lack of time for testing resulted in less collected data. This limit should be taken into consideration when reading the discussion of the performance results; more data needs to be gathered to ensure a more dependable result.

Another limitation is that the amount of Java bytecode is only an estimate. The estimate is based on the fact that each counted instruction line corresponds to at least one byte, while the instructions on a line can be up to eight bytes. However, investigating and translating each line to its correct amount of Java bytecode demands more time for research and implementation, leaving it outside the scope of this thesis.

7.2 General aspects

Comparing the amount of written code, it is clear that the Scala code is more compact than the Java code. At the same time, it is also evident that the Java implementation generates less Java bytecode than the Scala implementation. These differences are most likely explained by the fact that the Scala implementation handles a more intricate solution, while the Java solution is more straightforward, at the same time as the mathematical aspect of the program yields compact Scala code. This reasoning is connected to the comparison in Section 2. Perhaps the resulting Java bytecode is larger for the Scala solution due to the complexity of the solution, as well as the fact that Scala is a programming language built upon the Java language: the Java implementation needs less translation to be compiled to Java bytecode, while the Scala implementation needs more translation and conversion. The difference in the amount of Java bytecode was also evident when the solutions were compiled; the Scala solution demanded more time than the Java solution when compiling to a runnable jar file.

In terms of readability, one may argue that it is in the eye of the beholder. I found both solutions equally easy to understand, considering that both implementations were equally difficult to debug. However, I might add that the Scala solution for matrix multiplication was easier to implement than the Java solution, which should connect to the fact that a mathematical function is more adaptable to functional programming than to imperative programming. I state this while having experience in both paradigms.

Implementing the parallel features of both implementations had the advantage of both being fairly simple solutions. Nonetheless, I believe this was the part of the code that differed the most between the implementations, both in ease of implementation and in amount of code. I had not used Futures in either of the programming languages before, so in terms of ease of implementation, I can say for certain that I find the functional parallelization more challenging to understand intuitively. On the other hand, the functional implementation resulted in less written code and less need for refactoring from a non-parallel solution. Yet, I felt that the Java implementation was easier to comprehend, even though the parallel solution consists of two classes in Java while the Scala solution consists of only one method.

7.3 Performance

It is evident from the results shown in Figures 6, 8, 9 and 10 that a Scala implementation is the favourable choice when performing matrix multiplication, both in terms of execution time and memory usage, and regardless of whether it is executed on the cloud or on a local laptop. These factors may be connected to the parallelization choices used in the Scala implementation, and are most certainly connected to the mathematical nature of the Scala programming language in combination with the matrix computation.

In hindsight, it is apparent that the sizeable mathematical aspect of the chosen implementation problem should favour the functional paradigm. However, in the beginning, it was not clear that the implementations were to use the same library for communication. For several reasons, using the same Java libraries in both implementations for the queue communication was a forced choice, which resulted in the implementations only differing in terms of the matrix computation.
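
Sharing the communication layer is straightforward because Scala code can call Java libraries directly. The sketch below assumes the AWS SDK for Java (v1) client for SQS [3]; the queue URL, account id and message body are placeholders:

    import com.amazonaws.services.sqs.AmazonSQSClientBuilder
    import com.amazonaws.services.sqs.model.SendMessageRequest

    object QueueSketch {
      def main(args: Array[String]): Unit = {
        // The Java client classes compile unchanged under Scala, so both
        // implementations can share one and the same communication layer.
        val sqs = AmazonSQSClientBuilder.defaultClient()
        // Placeholder URL: the account id and queue name are hypothetical.
        val queueUrl = "https://sqs.eu-north-1.amazonaws.com/123456789012/matrix-tasks"
        sqs.sendMessage(new SendMessageRequest(queueUrl, "multiply block 0"))
      }
    }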

When observing Figure 9, one can discuss the idea of increasing the number of cores used by the instances even further. The Java implementation is faster in terms of execution time when more cores are available. If the Java solution were executed on more cores, it might eventually reach an execution time similar to Scala's. However, this would mean that the Java implementation needs more resources to reach the execution time the Scala implementation has by default, making it a poor choice of paradigm for the matrix-multiplication application when the functional choice needs fewer resources for the same computation.

The more interesting result of the Java implementation concerns memory usage. In Figure 10, it is clear that the Java implementation uses significantly more memory than the Scala implementation. On the other hand, when observing Figure 12 and Figure 14, the Scala solution used more memory for each increase in the number of cores per instance, while the Java solution has similar memory usage for four and eight cores per instance. The difference between the paradigm solutions is still distinct, with Scala having the better memory handling. At the same time, the Java solution shows a tendency to reach a maximum limit for memory usage independent of the number of cores available. However, in terms of memory, the Java implementation might show a better result if more data were gathered, since this result is based on the average memory used.

When considering Figures 8 and 14, the baseline Scala memory usage is unreliable, as the solution should not need less memory for bigger matrices. This could be connected to the garbage collection of the Java virtual machine acting up: the memory discarded and the memory used may get tangled up and thereby give the illusion of less memory usage for larger matrices. This is only speculation, however; I am not sure why it behaves this way. There is more than one matrix calculated along the straight line, and the data points come from multiple test matrices executed multiple times. When I noticed the strange behaviour, I redid the test, but this is the result I gathered.
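
One way to move beyond this speculation would be to rerun the benchmark with the JVM's garbage-collection logging enabled and correlate the collections with the measured memory. Both flags below are standard HotSpot options; the jar file name is a placeholder:

    # JDK 8 and earlier:
    java -verbose:gc -jar scala-matrix.jar

    # JDK 9 and later (unified logging):
    java -Xlog:gc -jar scala-matrix.jar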

7.4 Ethical and social aspects

This section discusses the ethical and social aspects connected to this thesis.


7.4.1 Amazon

The most prominent ethical aspect of this thesis is the use of Amazon Web Services (AWS). Amazon as a company has been involved in many controversies over the years. One considerable debate concerns the working environment at the warehouses, where employees do not have the opportunity to join a union and work under terrible conditions for low salaries [16, 17]. Controversies like these lead to the question of whether private persons and companies should use AWS as a service when it contributes to the profit of a company known for its problems and harmful effects on people and society.

AWS is an Amazon-provided service that many companies use daily, mainly because it contains excellent resources for IT development, both for high-performance applications and for web application development. As mentioned in Section 2, it is very popular today to use cloud services to support applications, and AWS is one of the biggest services for this. So it comes naturally for people to choose this service, often without thinking about or taking the company's controversies into account. I am one of them. When I chose to work with AWS, I mainly focused on which cloud platform had the best support for high-performance computations and disregarded everything else. In hindsight, I would have made another choice because I disagree with what Amazon stands for.

A cloud service that could be used instead of AWS is Azure. Azure is developed by Microsoft and was considered at the beginning of this project. As a cloud platform, Azure has support for aspects similar to those used for the testing in this thesis. The reason it was not chosen was the research I did beforehand, which indicated that AWS supported HPC applications better than Azure, though it should not make a difference considering the scale of this thesis.

7.4.2 Functional programming

A social aspect connected to this thesis is the choice of paradigm and people's opinions on the matter. When reading about, hearing about and discussing functional programming with others, I have noticed that many people take a firm stand for or against functional programming. This has resulted in companies either only using functional programming or refusing to use it, which in my opinion is not a good approach to programming. The paradigm choice should be based on what kind of application is to be used or implemented, not on personal opinion. It is, of course, an advantage to use a programming paradigm or programming language you like and are accustomed to. However, the paradigm choice should not be based solely on what you think and feel. Today I perceive that functional programming is chosen or rejected because of pride or prejudice, but no one's prejudice should stand in the way of choosing or not choosing functional programming.


8 Conclusion

The assumption that the matrix-multiplication implementation would not favour the functional paradigm was wrong. Functional programming is significantly better at mathematical computations performed on the cloud in terms of execution time and memory usage. The functional solution may also be favourable in terms of parallelization and the amount of written code. On the other hand, the imperative implementation generated less Java bytecode and showed a better decrease in execution time when used on the cloud compared to the baseline execution. The functional Scala solution had both faster execution time and less memory usage for the baseline execution than when executed on the cloud. Nonetheless, the difference between the Scala baseline and cloud executions was not significant and does not matter in comparison with the difference between the Scala solution and the Java solution.

In other words, application behaviour may be strongly connected to the choice of paradigm. So dare to use the functional paradigm where it excels, whether executing in the cloud or on a local computer, even if functional programming may seem challenging at first.

8.1 Future Work

Scala had an advantage in the mathematical aspects of the chosen implementation. This advantage gives a natural approach to further investigating the difference in cloud performance between Scala and Java: implementing an application where Scala, as a functional language, is at a disadvantage. For example, one could verify whether the claim that non-mathematical implementations perform better in imperative programming still holds when executing on the cloud. Such an investigation would be fascinating to compare with the result of this thesis.

The work of this thesis could be investigated further with the use of more nodes. For example, in comparison with Figure 4, one may implement a setup using eight nodes instead of four, dividing the work even further. Such an experiment would additionally indicate whether the division and size of the work on the cloud affect the performance of the different implementations.

In connection with the paper mentioned in Section 2 about HPC on the cloud [28], it would also be interesting to investigate the performance of these two implementations on a different cloud platform, to verify whether cloud platforms support application behaviours differently.


References

[1] Josh Alman and Virginia Vassilevska Williams. "A Refined Laser Method and Faster Matrix Multiplication". In: Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 522–539. doi: 10.1137/1.9781611976465.32. eprint: https://epubs.siam.org/doi/pdf/10.1137/1.9781611976465.32. url: https://epubs.siam.org/doi/abs/10.1137/1.9781611976465.32.

[2] AWS. Amazon EC2. Accessed (2022-02-28). 2022. url: https://aws.amazon.com/ec2/.

[3] AWS. Amazon Simple Queue Service. Accessed (2022-02-28). 2022. url: https://aws.amazon.com/sqs/.

[4] AWS. Benefits of Message Queues. Accessed (2022-02-28). 2022. url: https://aws.amazon.com/message-queue/benefits/.

[5] Joshua Bloch. Effective Java. 3rd edition. Pearson Education, 2018. isbn: 9780134685991.

[6] Thomas H. Cormen et al. Introduction to algorithms. 3rd ed. Cambridge, Massachusetts: MIT Press, 2009. isbn: 9780262033848.

[7] G.F. Coulouris. Distributed systems: concepts and design. 5th ed. Harlow, Essex: Pearson Education, 2012. isbn: 9780273760597.

[8] Paolo D'Alberto, Marco Bodrato, and Alexandru Nicolau. "Exploiting Parallelism in Matrix-Computation Kernels for Symmetric Multiprocessor Systems: Matrix-Multiplication and Matrix-Addition Algorithm Optimizations by Software Pipelining and Threads Allocation". In: ACM Trans. Math. Softw. 38 (Nov. 2011), p. 2. doi: 10.1145/2049662.2049664.

[9] Bogdan Dumitrescu. "Improving and estimating the accuracy of Strassen's algorithm". In: Numerische Mathematik 79 (1998), pp. 485–499. doi: 10.1007/s002110050348.

[10] Jaliya Ekanayake et al. "High-Performance Parallel Computing with Cloud and Cloud Technologies". In: July 2010, pp. 275–308. isbn: 978-1-4398-0315-8. doi: 10.1201/EBK1439803158-c12.

[11] Robert A. van de Geijn and Jerrell Watts. "SUMMA: scalable universal matrix multiplication algorithm". In: Concurr. Pract. Exp. 9 (1997), pp. 255–274.

[12] Mads Hartmann and Ruslan Shevchenko. Professional Scala: Combine Object-Oriented and Functional Programming to Build High-performance Applications. Birmingham: Packt Publishing, 2018. isbn: 9781789534702.

[13] Rashid Hassani, Md Aiatullah, and Peter Luksch. "Improving HPC Application Performance in Public Cloud". In: IERI Procedia 10 (2014). International Conference on Future Information Engineering (FIE 2014), pp. 169–176. issn: 2212-6678. doi: 10.1016/j.ieri.2014.09.072. url: https://www.sciencedirect.com/science/article/pii/S2212667814001208.

[14] Jianyu Huang et al. "Strassen's Algorithm Reloaded". In: Nov. 2016, pp. 690–701. doi: 10.1109/SC.2016.58.

[15] Keith R. Jackson et al. "Performance Analysis of High Performance Computing Applications on the Amazon Web Services Cloud". In: 2010 IEEE Second International Conference on Cloud Computing Technology and Science. Indianapolis, Indiana, USA, 2010, pp. 159–168. doi: 10.1109/CloudCom.2010.69.

[16] Jodi Kantor, Karen Weise, and Grace Ashford. "Inside Amazon's Worst Human Resources Problem". In: The New York Times (Oct. 24, 2021). url: https://archive.ph/20211025032625/https://www.nytimes.com/2021/10/24/technology/amazon-employee-leave-errors.html (visited on 05/23/2022).


[17] Bryan Menegus. "Amazon's Aggressive Anti-Union Tactics Revealed in Leaked 45-Minute Video". In: Gizmodo (Sept. 26, 2018). url: https://gizmodo.com/amazons-aggressive-anti-union-tactics-revealed-in-leake-1829305201 (visited on 05/23/2022).

[18] D.K. Nguyen et al. "A general scalable implementation of fast matrix multiplication algorithms on distributed memory computers". In: Sixth International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing and First ACIS International Workshop on Self-Assembling Wireless Network. 2005, pp. 116–122. doi: 10.1109/SNPD-SAWN.2005.2.

[19] Github: openjdk. Java Micro-benchmark Harness (JMH). Accessed (2022-02-08). 2021. url: https://github.com/openjdk/jmh.

[20] Oracle. Java developer guide. Accessed (2022-02-03). url: https://docs.oracle.com/en/database/oracle/oracle-database/12.2/jjdev/Java-overview.html#GUID-17B81887-C338-4489-924D-FDDF2468DEA7.

[21] Oracle. Java Documentation. Accessed (2022-02-03). 2022. url: https://docs.oracle.com/javase/8/docs/technotes/guides/language/index.html.

[22] Oracle. Java Documentation: The Java Tutorials - Concurrency. Accessed (2022-02-07). 2021. url: https://docs.oracle.com/en/database/oracle/oracle-database/12.2/jjdev/Java-overview.html#GUID-17B81887-C338-4489-924D-FDDF2468DEA7.

[23] Oracle. Java SE specification. Accessed (2022-02-03). url: https://docs.oracle.com/javase/specs/index.html.

[24] Oracle. javap - The Java Class File Disassembler. Accessed (2022-05-09). 2020. url: https://docs.oracle.com/javase/7/docs/technotes/tools/windows/javap.html.

[25] Victor Pankratius, Felix Schmidt, and Gilda Garretón. "Combining functional and imperative programming for multicore software: An empirical study evaluating Scala and Java". In: 2012 34th International Conference on Software Engineering (ICSE) (2012), pp. 123–133.

[26] Aleksandar Prokopec and Heather Miller. Scala: Parallel Collections - Measuring Performance. Accessed (2022-02-04). url: https://docs.scala-lang.org/overviews/parallel-collections/performance.html.

[27] Aleksandar Prokopec and Heather Miller. Scala: Parallel Collections - Overview. Accessed (2022-02-07). url: https://docs.scala-lang.org/overviews/parallel-collections/overview.html.

[28] Eduardo Roloff et al. "High Performance Computing in the cloud: Deployment, performance and cost efficiency". In: Dec. 2012, pp. 371–378. doi: 10.1109/CloudCom.2012.6427549.

[29] D. Rountree, I. Castrillo, and H. Jiang. The basics of cloud computing: understanding the fundamentals of cloud computing in theory and practice. Ebook. Amsterdam: Elsevier Science & Technology Books, 2014. isbn: 0124055214.

[30] Naya Ruparelia. Cloud computing. Ebook. Cambridge, Massachusetts: The MIT Press, 2008. isbn: 0262334127.

[31] Github: scala. Scala library benchmarks. Accessed (2022-02-08). 2021. url: https://github.com/scala/scala/tree/2.12.x/test/benchmarks.

[32] Robert W. Sebesta. Concepts of programming languages. 11th edition. Harlow, Essex: Pearson Education, 2016. isbn: 9781292100555.

[33] Drasko Tomic, Dario Ogrizović, and Zlatan Car. "Cloud solutions for high performance computing: Oxymoron or realm?" In: Tehnički vjesnik 20 (Feb. 2013), pp. 177–182.


[34] Prabhat Totoo, Pantazis Deligiannis, and Hans-Wolfgang Loidl. "Haskell vs. F# vs. Scala: A High-Level Language Features and Parallelism Support Comparison". In: Proceedings of the 1st ACM SIGPLAN Workshop on Functional High-Performance Computing. FHPC '12. Copenhagen, Denmark: Association for Computing Machinery, 2012, pp. 49–60. isbn: 9781450315777. doi: 10.1145/2364474.2364483.

[35] Miroslav Tuma. "Parallel matrix computations (Gentle intro into HPC)". In: (Accessed 2022-02-18). June 2021, pp. 85–89. url: https://www2.karlin.mff.cuni.cz/~mirektuma/ps/pp.pdf.

[36] Wojciech Turek et al. "Special issue on Parallel and distributed computing based on the functional programming paradigm". In: Concurrency and Computation: Practice and Experience 30.22 (2018), e4842. doi: 10.1002/cpe.4842. eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/cpe.4842. url: https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.4842.

[37] Kentaro Ushiyama. VScode Counter. Accessed (2022-04-19). 2020. url: https://github.com/uctakeoff/vscode-counter.

[38] D.A. Watt, W. Findlay, and J. Hughes. Programming language concepts and paradigms. New York: Prentice-Hall, 1990.
