Scheduler for the Stream Processing Frameworks on Hadoop Clusters

Rigorous Thesis

Mgr. Petr Škoda

Advisor: Doc. RNDr. Pavel Smrž, Ph.D.

Brno University of Technology

Brno, November 2014

Abstract

Big data processing is a hot topic of today’s computer world. One of the key paradigms behind it is MapReduce—a parallel and massively distributed model inspired by the map and reduce functions commonly used in functional programming. Due to its simplicity and the general availability of standard implementations, the paradigm has been massively adopted on current computer clusters. Yet, MapReduce is not optimal for all big data problems.

My work focuses on the area of an alternative paradigm—stream processing—which has multiple advantages over MapReduce, e.g., it avoids persistent data storing when not required. The research aims at overcoming deficiencies of existing stream processing frameworks that prevent their wider adoption.

In particular, the work deals with scheduling problems of stream processing applications on heterogeneous clusters. Heterogeneity is a typical characteristic of today’s large data centers (caused by incremental upgrades and combinations of computing architectures, including specialized hardware such as GPUs or FPGAs), and advanced scheduling mechanisms can significantly increase the efficiency of their utilization.

The state-of-the-art research and development of stream processing and advanced methods of related scheduling techniques are discussed in this document. Special attention is paid to benchmark-based scheduling for distributed stream processing, which also forms the core of my previous work and of the proposed research towards my doctoral thesis.

Finally, the concept of a novel heterogeneity-aware scheduler is presented, first in an intuitive way and then discussed more deeply on a theoretical basis. The prototype of the scheduler is then described and promising results of basic experiments are shown.


Contents

Chapter 1 Introduction 4

Chapter 2 State of the Art 7

2.1 Stream Processing .......................................................................................................7

2.1.1 Overview ........................................................................................................7

2.1.2 Components and Architecture .......................................................................9

2.1.3 Fault Tolerance ............................................................................................ 12

2.2 Resource Allocation and Scheduling ......................................................................... 13

2.2.1 Metrics .......................................................................................................... 14

2.2.2 Apache Mesos and Apache YARN .............................................................. 15

2.2.3 Interesting Approaches to Scheduling ......................................................... 19

2.3 Benchmark-based Scheduling .................................................................................... 20

2.3.1 Benchmarking .............................................................................................. 20

2.3.2 Speculative Execution in Hadoop Revisited ................................................ 21

2.3.3 Real Benchmarks .......................................................................................... 22

2.4 Beyond the State of the Art ...................................................................................... 23

Chapter 3 Heterogeneity Aware Scheduler 25

3.1 Motivation ................................................................................................................. 25

3.2 Intuitive Example ...................................................................................................... 25

3.2.1 The Application—Popular Stories ............................................................... 26

3.2.2 Standard Scheduling .................................................................................... 27

3.2.3 Advanced Scheduling ................................................................................... 28

3.3 Scheduler ................................................................................................................... 28

3.3.1 Performance of Program to Hardware Class Combinations ........................ 29

3.3.2 Benchmarking and Reschedules ................................................................... 30

3.3.3 Heterogeneity Aware Scheduling ................................................................. 31

3.4 Experiments and Results ........................................................................................... 31

3.4.1 Prototype scheduler implementation ........................................................... 31

3.4.2 Experiments ................................................................................................. 32

3.4.3 Results .......................................................................................................... 32

Chapter 4 Previous Work 34

4.1 ReReSearch ................................................................................................................ 34

4.2 mOSAIC .................................................................................................................... 35

4.3 JUNIPER+ ................................................................................................................ 35

Chapter 5 Ongoing Work 37

Chapter 6 Conclusion 39


Chapter 1

Introduction

As the Internet grows bigger, the amount of data that can be gathered, stored, and processed constantly increases. The first companies that handled Internet-based big data were the search providers such as Yahoo! and later Google. Large amounts of data were obviously processed in some other industrial branches and scientific disciplines as well, but the main leaders in the development of new, cheaper, faster, more flexible, and more available technologies were the Internet-based companies1. Later on, besides the search providers, the first so-called Web 2.0 companies arose; these were primarily blogs and afterward social networks.

From the technological point of view, the first big data was handled by many special-purpose tools, usually implemented only for one type of computation. The large amounts of raw data, e.g., crawled documents, web request logs, etc., were processed this way to compute various kinds of derived data such as inverted indices, various representations of the graph structure of Web documents, summaries of the number of pages crawled per host, and the set of most frequent queries in a given day. Most such computations were conceptually straightforward; however, the input data was usually large and the computations had to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspired to obscure the original simple computation with large amounts of complex code dealing with these secondary issues. A solution, the MapReduce programming model, was developed and introduced by Google. [1]

MapReduce is an abstraction layer that allows expressing simple computations while hiding the messy details of parallelization, fault tolerance, data distribution, and load balancing, which leads to an ability to process large data sets in a massively parallel manner. The programming model is based on the following simple concepts: (i) iteration over the input; (ii) computation of key/value pairs from each piece of input; (iii) grouping of all intermediate values by key; (iv) iteration over the resulting groups; (v) reduction of each group [1, 2]. Today, the MapReduce programming model is widely adopted and is one of the most common ways to process large amounts of data. The adoption is mainly led by Apache Hadoop, an open source implementation of the MapReduce programming model, which has grown into a stack of tools for parallel big data processing.
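To make the five concepts concrete, the following minimal sketch expresses a word count in plain Java; it is deliberately not tied to the Hadoop API, and all names in it are illustrative. The map step emits key/value pairs from each input line, the pairs are grouped by key, and the reduce step folds each group into a count.

    import java.util.*;

    public class WordCountSketch {
        // (i)-(ii) map: iterate over a piece of input and emit (word, 1) pairs
        static List<Map.Entry<String, Integer>> map(String line) {
            List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
            return pairs;
        }

        // (v) reduce: fold the values of one group into a single count
        static int reduce(List<Integer> counts) {
            int sum = 0;
            for (int c : counts) sum += c;
            return sum;
        }

        public static void main(String[] args) {
            List<String> input = Arrays.asList("to be or not to be", "to stream or to batch");

            // (iii) group all intermediate values by key
            Map<String, List<Integer>> groups = new TreeMap<>();
            for (String line : input) {
                for (Map.Entry<String, Integer> pair : map(line)) {
                    groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
                }
            }

            // (iv) iterate over the resulting groups and reduce each of them
            for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
                System.out.println(group.getKey() + "\t" + reduce(group.getValue()));
            }
        }
    }

In a real MapReduce framework, the map and reduce steps run on many machines and the grouping step is the distributed shuffle; the control flow, however, is exactly the one sketched above.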

The Hadoop computations powered by MapReduce have a few drawbacks, one of them being the barrier synchronization between the map and reduce phases. It means that all map tasks have to be completed before any reduce task can start—the computation model is strongly bound to the data, its storage, and location. This behavior makes MapReduce a batch processing tool rather than an online one.

1 The main difference between the Internet-based companies and other companies (including scientific applications) in the context of big data is that the former started with simple and easily parallelizable computations and over time evolved towards more complex ones, whereas the latter have always needed complex and hard-to-parallelize computations.


To overcome the batch behavior, different ideas were introduced—for example, [3] describes hierarchical and incremental reduction; in [4], support for pipelining is proposed; and the suitability of MapReduce for one-pass analytics is analyzed in [5].

If we get back to big data, there are multiple interesting aspects that have affected the development around its processing. We can mention, for instance, the constant growth of smartphones’ popularity hand in hand with the availability of mobile data plans; the great success of social networks of all kinds; and the constantly decreasing costs of different sensors that can be utilized in industrial applications or logistics. What all the previously named aspects have in common is that they constantly produce large amounts of data (especially when many such entities are present) and, simultaneously, there is a growing need to evaluate these data as soon as possible. But still, while MapReduce and its most popular implementation, Apache Hadoop, already have an irreplaceable role at almost every company dealing with big data, MapReduce is not suitable for online processing. This is the place for new ideas and paradigms.

We observed that the batch nature of MapReduce, despite the mentioned efforts, recently caused a kind of hunger for technologies better suited for online processing of stream data, for which the original concept does not fit perfectly. At the same time, it is still required to keep the easy parallelization, fault tolerance, data distribution, and load balancing—the benefits well known from Hadoop with MapReduce. This all leads to the rise of the new easily parallelizable stream processing systems.

Although stream processing is currently a target of interest of many stakeholders in the area of big data, its integration into data processing platforms (e.g., Hadoop) is still unsatisfying. A similar situation dominates the field of advanced scheduling algorithms. There is a lack of suitable algorithms that can effectively employ the capabilities of heterogeneous clusters. The situation is even more alarming when the usage of static code acceleration (e.g., FPGA) and GPU acceleration is considered. In this area, scheduling algorithms currently available for stream processing applications display many blank spaces.

My work deals with scheduling of long running stream processing applications on large shared heterogeneous clusters of computers. The leading idea is to combine benchmark-based scheduling with other scheduling approaches. Moreover, the work is interested mainly in Apache Hadoop 2.0 clusters, because many businesses have already deployed big batch processing solutions on top of Hadoop clusters, and stream processing as such is just another piece of the big data puzzle they have to handle. The same fact recently led the Hadoop team to separate the scheduler (now called YARN) from the rest of the Hadoop tools. It means that the deployment of various distributed systems on top of Hadoop clusters is now much easier. At the same time, the new YARN scheduler allows sharing the clusters among users with fairness in resource scheduling without any difficulties. An enhanced scheduler for stream processing applications on heterogeneous clusters, the result of my ongoing work, can thus be incorporated into the Hadoop ecosystem and exploit all its advantages.

The rest of this document has the following structure. In Chapter 2, the state of the art in the fields of stream processing, resource allocation and scheduling in general, and benchmark-based scheduling is outlined. Chapter 3 then describes the motivation for the novel scheduler with an intuitive example, the theoretical basis of the scheduler, the prototype implementation, and the experiments with the prototype, including promising results in comparison to Apache Storm’s “standard” scheduler. Meanwhile, Chapter 4 summarizes my previous affiliation and work. Lastly, in Chapter 5, the objectives of my future work are described.


Chapter 2

State of the Art

In the previous chapter, a brief history of big data and its development over time with respect to stream processing was described. The oncoming chapter, State of the Art, first focuses on various aspects of stream processing, after which the scheduling of stream processing jobs on different resources is discussed. Later, benchmarking approaches, a crucial topic for my ongoing work, are introduced. Finally, the problems of the current state of the art are highlighted and discussed.

2.1 Stream Processing

It was already mentioned that the classical MapReduce is not a silver bullet for online data processing because of its batch-like manner. Yet, there are attempts to transform existing MapReduce solutions into streams. Some examples were already mentioned in the Introduction, and we will discuss them in more detail together with older approaches to stream processing. Later, some novel ideas and projects in the field of stream processing will be presented.

2.1.1 Overview

Over the past decade, stream processing has been the subject of vivid research. With regard to their scalability, the existing approaches can essentially be subdivided into three categories: centralized, distributed, and massively-parallel stream processors. A short overview of each category will now be given. The overview is based on the extensive summary of stream processing systems from [6].

Centralized

Initially, several centralized systems for stream processing were proposed, such as Aurora [7] and STREAM [8, 9]. Aurora is a database management system (DBMS) for continuous queries that are constructed by connecting a set of predefined operators into a directed acyclic graph (DAG). The stream processing engine schedules the execution of the operators and uses load-shedding, i.e., dropping intermediate tuples, to meet the QoS goals. At the endpoints of the graph, user-defined QoS functions are used to specify the desired latency and the tuples to be dropped. STREAM presents additional strategies for applying load-shedding, such as probabilistic exclusion of tuples. While these systems have useful properties, such as respecting the latency requirements, they run on a single host and do not scale well with rising data rates and numbers of data sources.


Distributed

Later systems such as Aurora*/Medusa [10] support distributed processing of data streams. An Aurora* system is a set of Aurora nodes that cooperate via an overlay network within the same administrative domain. In Aurora*, the nodes can freely relocate load by a decentralized, pairwise exchange of the Aurora stream operators. Medusa integrates many participants, such as several sites running Aurora* systems from different administrative domains, into a single federated system. Borealis [11] extends Aurora*/Medusa and introduces, along with its other features, a sophisticated QoS optimization model where the effects of load-shedding on QoS can be computed at every point in the data flow. This enables the optimizer to find better strategies for load-shedding.

Massively-Parallel

The third category of stream processing systems is made up by the massively-parallel data processing systems. In contrast to the previous two categories, these systems have been designed to run on hundreds or even thousands of nodes and to efficiently transfer large data volumes between them. Traditionally, these systems have been used to process finite blocks of data stored on distributed file systems. However, many of the newer systems such as Dryad [12], Hyracks [13], CIEL [14], DAGuE [15], or the Nephele framework [16] allow assembling complex parallel data flow graphs and constructing pipelines between the individual parts of the flow. Therefore, these parallel data flow systems are in general also suitable for streaming applications.

Current Development

Recently, based on the popularity of MapReduce and the wide adoption of Hadoop, a series of systems exploiting the ideas of the MapReduce paradigm in the context of stream processing was introduced. The first work in this area was arguably Hadoop Online, described in [4]. As was already mentioned in the Introduction, the developers of Hadoop Online extended the original Hadoop system by the ability to stream intermediate results from the map to the reduce tasks as well as the possibility to pipeline data across different MapReduce jobs. To facilitate these new features, they extended the semantics of the classic reduce function by time-based sliding windows. Li et al. [17] picked up this idea and further improved the suitability of Hadoop-based systems for continuous streams by replacing the sort-merge implementation for partitioning with a new hash-based approach.

The Muppet system [18] also focuses on the parallel processing of continuous stream data while preserving a MapReduce-like programming abstraction. However, the authors decided to replace the reduce function with a more generic update function to allow for greater flexibility when processing intermediate data with identical keys. Muppet also aims to support near-real-time processing latencies.

The systems S4 [19] and Storm [20, 21] can also be classified as massively-parallel data processing systems with a clear emphasis on low latency. Their programming abstraction is no longer MapReduce, but it allows the developers to assemble arbitrarily complex DAGs of processing tasks. For example, Twitter Storm does not use intermediate queues to pass the data items from one task to the other; instead, data items are passed directly between the tasks using batch messages on the network level to achieve a good balance between latency and throughput.


In the end, it is important to note that along with the stream processing paradigm, we can lately observe another movement in the field of low latency computations, based on fast message brokers such as Apache Kafka [22] or Apache Flume. Although message brokers are a well-known concept, the new wave of this technology focuses on different objectives, which makes it more suitable for high throughput processing.

Traditional enterprise messaging systems, e.g., IBM WebSphere MQ or JMS-specification-compliant brokers, emphasize strong delivery guarantees, mostly with pushing data to consumers, and ignore throughput, which makes them very robust and often slow. The new wave of message delivery systems, on the other hand, accentuates throughput and lets consumers pull data as they need. This opens a way for new distributed applications with high throughput and straightforward design constructed on top of these queues. In my opinion, such message delivery systems are another kind of stream processing systems with less limited communication schemas, i.e., DAGs are welcomed but not required.

This was a brief overview of “historical” and recent approaches to stream processing. Because this work deals mainly with resource allocation and scheduling, from now on we will discuss the parallel systems only.

2.1.2 Components and Architecture

In this section, the focus will be given to S4 and Storm, because an overview of a wider range of systems would be excessively long and, at the same time, we can consider these two to be representative examples of modern stream processing architectures.

Across many massively parallel systems, a kind of master-worker pattern is very common. The master node usually receives data or tasks and distributes them over the network of worker nodes. A good example could again be MapReduce—its master process, after receiving a job descriptor, starts mappers and reducers on different machines; at the same time, the master process is responsible for the fault tolerance and liveness of the worker nodes (see Fig. 1) [23, 1]. For parallel stream systems, where the streams of data are often running for a long time, the placement of tasks can be less frequent, but the concept of master and worker nodes stays unchanged. One interesting difference is that a MapReduce job eventually finishes, whereas stream processing topologies run forever (or until you explicitly shut them off) [21].

Fig. 1 Master and worker nodes (mappers and reducers) in MapReduce, reproduced from [23]


When we look deeper into the structure of parallel systems, we can see that the computations are divided into jobs. The arrangement of a job can be described by a graph with vertices representing the job’s individual tasks and edges denoting communication channels between them. For example, from a high-level perspective, the graph representation of a typical MapReduce job would consist of a set of Map vertices connected to a set of Reduce vertices. Some frameworks have generalized the MapReduce model to arbitrary directed acyclic graphs (DAGs); some even allow graph structures containing loops [14]. [6]

The stream processing systems are very similar to other parallel systems. The computation is mostly described as a DAG. There are slight differences in the communication layer: some systems use queues and intermediate message brokers, while others use straight point-to-point (process-to-process) communication. Let me now describe the example topologies of S4 and Storm.

S4

Fig. 2 S4: The Word Count Example, reproduced from [19]

In S4, the main computational unit is called a processing element (PE). The communication between PEs is based on events. Every PE consumes just those events which correspond to the value on which it is keyed (one of the parameters of each PE is a keyed attribute) or all events of their type (for PEs without a keyed attribute). A PE may produce output events. Note that a PE is instantiated for each value of the key attribute (so multiple PEs of the same kind can be placed for load balancing). This instantiation is performed by the platform. For example, in the word counting application (Fig. 2), WordCountPE is instantiated for each word in the input. When a new word is seen in an event, S4 creates a new instance of the PE corresponding to that word. S4 provides several built-in PEs for standard tasks such as aggregate, join, and others. Unneeded PEs can be automatically removed during computation according to their Time-To-Live2. [19]

A higher level in the hierarchy of S4 is the Processing Node (PN), which is an abstraction for a physical node in the cluster. PNs are hosts for PEs; they are responsible for listening to events, executing operations on the incoming events, dispatching events with the assistance of the communication layer, and emitting output events. S4 routes each event to PNs based on a hash function of the values of all known keyed attributes in that event. A single event may be routed to multiple PNs. An event listener in the PN passes incoming events to the processing element container that invokes the appropriate PEs in the appropriate order. [19]

2 If no events for that PE object arrive within a specified period of time, the PE becomes eligible for removal.

Apache Storm

In Storm, the main computational unit where the user code runs is called Bolt. Besides that, Storm has another actor named Spout that serves as a source of data to the whole computational setup, as it can read data from a queue3 or other sources (new spout types can be easily implemented). There can be multiple Spouts in one topology. The data inside of the Storm topology are sent as tuples; an unbounded sequence of tuples is called a stream. A bolt consumes any number of input streams, does some processing, and possibly emits new streams. Complex stream transformations, such as computing a stream of trending topics from a stream of tweets, require multiple steps and thus multiple bolts. Bolts can do anything from running functions, filtering tuples, and doing streaming aggregations and streaming joins, to talking to databases, and more. [21]

3 Currently, only the Kestrel queue is supported out of the box.

Fig. 3 Storm: The Rolling Top Words Topology, reproduced from [24]

Networks of spouts and bolts are packaged into a “topology”, which is the top-level abstraction that is submitted to the Storm cluster for execution. The topology is a graph of stream transformations where each node is a spout or a bolt. Edges in the graph indicate which bolts are subscribing to which streams. When a spout or bolt emits a tuple to a stream, it sends the tuple to every bolt that subscribed to that stream. Links between the nodes of the topology indicate how tuples should be passed around, e.g., if there is a link between Spout A and Bolt B, a link from Spout A to Bolt C, and a link from Bolt B to Bolt C,


then every time Spout A emits a tuple, it will send it to both Bolt B and Bolt C. All of Bolt B’s output tuples will go to Bolt C as well. Each node in the Storm topology executes in parallel. The user can specify how many parallel instances are wanted for each node, and Storm will then spawn that number of threads across the cluster to do the execution. [21]
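To make the topology abstraction concrete, the wiring of the example above could look roughly as follows with the Storm Java API of that generation (package backtype.storm). SpoutA, BoltB, and BoltC stand for user-defined spout and bolt implementations that are not shown here, and the parallelism hints and worker count are arbitrary.

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.topology.TopologyBuilder;

    public class ExampleTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();

            // Spout A is the source of the stream (SpoutA is a user-defined IRichSpout).
            builder.setSpout("spout-a", new SpoutA(), 2);

            // Bolt B subscribes to Spout A's stream.
            builder.setBolt("bolt-b", new BoltB(), 4).shuffleGrouping("spout-a");

            // Bolt C subscribes to both Spout A's and Bolt B's streams.
            builder.setBolt("bolt-c", new BoltC(), 4)
                   .shuffleGrouping("spout-a")
                   .shuffleGrouping("bolt-b");

            Config conf = new Config();
            conf.setNumWorkers(4); // number of worker processes spread across the cluster

            StormSubmitter.submitTopology("example-topology", conf, builder.createTopology());
        }
    }

Every tuple emitted by spout-a is delivered to both subscribed bolts, and every tuple emitted by bolt-b is delivered to bolt-c, exactly as described in the prose above.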

2.1.3 Fault Tolerance

A system designed to run on a large number of commodity computers must be capable of detecting and reacting to possible faults that might occur during its regular operation. Hadoop, for example, uses a naïve strategy based on storing all intermediate results to durable storage before making progress. While such a naive strategy is a safe approach, it can be extremely expensive, especially for systems with high throughputs of data. At the same time, saving data to persistent storage is one of the main factors causing the undeniable batch nature of MapReduce-based systems. The form of fault tolerance handling can significantly affect the overall performance of a distributed system, because in very large clusters, i.e., thousands of nodes, failures are quite probable and they affect a significant portion of the jobs running on such a cluster [25].

If we take a look at the systems previously mentioned in section 2.1.1, we can say that CIEL [14], Nephele [6], Dryad [12], DAGuE [26, 15], Hyracks [13], and Storm [21] are designed to be fully fault tolerant, while S4 [19] is only partially fault tolerant. S4 is intended for dedicated homogeneous clusters and for computations where small errors are acceptable (e.g., click-through rate computation on streamed data); thus, faults of nodes are handled just by moving processes to a standby server. The state of the processes, which is stored in local memory, is lost during the handoff. On the new server, the state is regenerated using the input streams, yet some data may be lost.

The systems with full fault tolerance usually support transactional processing of streamed data—it is confirmed that the computation on a piece of input data was successful and complete (in case of failures, the computation is restarted for the given data). Nephele, unlike the others, saves checkpointing data to a distributed file system, which can be classified as a naïve approach. The rest of the systems named before use techniques based on replaying a small portion of the computation, starting with intermediate results backtracked from the DAG of the computation. In the worst case, when no intermediate results are available, the computation for the given source data is restarted from the beginning for all jobs impacted by failures.

For example, CIEL in the case of a job failure automatically re-executes the job. However, if it has failed because its inputs were stored on a failed worker, the task is no longer runnable. In that case, CIEL recursively re-executes predecessor tasks until all of the failed task’s dependencies are resolved. To achieve this, the master invalidates the locations in the object table for each missing input and lazily re-evaluates the missing inputs. Other tasks that depend on data from the failed worker will also fail, and these are similarly re-executed by the master. [14]

Storm, on the other hand, traces the “tree of tuples”. Whenever a new tuple is created in a Spout or Bolt, the tree of computations for the originating tuple is updated, so it is always evident which parts of the DAG were accomplished. All vertices (Bolts) of the tuple’s tree must then be finished by an acknowledgement4. When the acknowledgement of some tuple times out, the computation is restarted from the originating Spout, where the original tuples must be kept until the full acknowledgement of the tuple’s tree.
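The following bolt sketch shows the two operations behind this mechanism in the Storm Java API: anchoring every emitted tuple to its input tuple, and acknowledging the input once it has been fully processed. The field name "sentence" and the splitting logic are only illustrative.

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    import java.util.Map;

    public class SplitSentenceBolt extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple input) {
            String sentence = input.getStringByField("sentence");
            for (String word : sentence.split(" ")) {
                // Anchoring: the new tuple is attached to the tree of the input tuple,
                // so Storm can replay from the spout if any descendant fails or times out.
                collector.emit(input, new Values(word));
            }
            // Acknowledge the input tuple once all derived tuples have been emitted.
            collector.ack(input);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }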

2.2 Resource Allocation and Scheduling

In the context of High-Performance Computing (HPC) and shared clusters, resource allocation and scheduling are slightly overlapping terms. Resource allocation deals mainly with the problem of gathering and assigning resources to different requestors. A resource can be a computing node, CPU time, memory, HDD bandwidth, network bandwidth, or other less common kinds such as GPU time or FPGA time. A requestor of resources is, in terms of HPC, typically a framework that needs to run a job on the shared cluster of computers5. The complexity of resource allocation is illustrated in Fig. 4. Scheduling, on the other hand, cares about the question: “when to place which tasks on which previously obtained resources.” A typical scenario is: the user commits his job to a framework, the framework asks for resources from the resource allocation system, and after receiving resources, the framework’s scheduler takes care of placing the tasks on the resources over the time of the job’s computation. [27, 28, 29, 30]

Fig. 4 Resource Allocation Strategies (RAS) in cloud computing, reproduced from [29]

HPC cluster architectures are ideally suited to workloads where work can be divided into independent pieces and distributed accordingly. Many workloads, however, can only be parallelized to some extent—they typically contain dependencies that require some serial execution. At the same time, these workloads may also be broken into pieces of significantly different sizes. While producing a schedule using a policy such as First In First Out (FIFO) may be trivial, scheduling over a big cluster to meet quality-of-service requirements or a level of optimality in the presence of dependencies can be difficult [27]. Optimal scheduling in the general case is an NP-complete problem. Therefore, optimal scheduling is intractable at the scale of large clusters where heterogeneity and network delays are also present. Instead, heuristic [31, 32] or machine learning [33, 34, 35] scheduling policies are better suited. [36]

4 Bolts at the ends of the DAG typically save some results to persistent storage or pass results out of the Storm infrastructure.
5 Obviously, the job in the framework is usually requested by a user or a service.

Other interests lie in Quality of Service (QoS): both the users and the providers want to have guarantees that the computation starts or is done in some reasonable time interval. Poor QoS can be caused by poor scheduling decisions leading to low throughput or unacceptably long task waiting times. Metrics are essential for providers to be able to monitor the QoS they are delivering, and these metrics should be appropriate to the needs of their users. Providing QoS is challenging because the workloads are usually unpredictable and the utilization of already running jobs varies over time. [36]

2.2.1 Metrics

In order to objectively compare schedulers, metrics are required. Different metrics are more or less relevant to the different stakeholders in the system. Three metrics that represent each of the industrial stakeholder perspectives are selected here. The metrics relevant to the system administrators are those related to utilization, and the metrics that represent the users’ point of view correspond to the responsiveness and fairness metrics. Another commonly used category of metrics is relative metrics. Relative metrics compare schedulers by counting the number of the “best” schedules (by another metric) over a number of scenarios in a problem space. An extensive analysis of these and other metrics can be found in [36], on which this brief description is based; a short computational sketch of the main metrics is given after the list.

• Utilization—measures how much of the maximum potential of the platform is actually being used. It is desirable to avoid having idle resources if there is a non-empty work queue. Utilization can be measured by a number of different metrics. A workload makespan (1), which is defined as the time at which all the work in the workload was completed, is widely used. Even though it is not suitable for measuring responsiveness or fairness, the workload makespan is an accurate measure of utilization in the case of static scheduling problems. For dynamic scheduling, the Average Utilization (2) [37] can be used when it is calculated over some time interval. These intervals can be monitored over time (like daily or weekly average utilization). Note that Gcores is the number of processing units.

$T_{\mathrm{makespan}} = \max_{j \in J} t^{\mathrm{finish}}_{j}$  (1)

$U_{\mathrm{avg}} = \frac{\sum_{j \in J} w_{j}}{G_{\mathrm{cores}} \cdot (t_{2} - t_{1})}$  (2)

where $w_{j}$ is the core time consumed by job $j$ within the measured interval $[t_{1}, t_{2}]$.

• Responsiveness—compares how successful the scheduler is in keeping the job latency low. There will always be a minimum time that a job will take to execute, and this is determined by its critical path. However, the time spent queueing or during network transfers will influence the responsiveness of a job. Responsiveness metrics can be a tool for measuring how successful the scheduler is under periods of heavy load. Cumulative Completion (3) is a metric that rewards early completion of work, and therefore good average responsiveness [31]. This metric calculates the sum of completed job execution times at each time tick in the execution, thus it provides better insight into how the computation was done than the utilization metrics. For dynamic scheduling, the metric must be extended so that it works over a time window. A much more accurate metric for responsiveness is the Schedule Length Ratio (4) [38], as it compares the ideal response time of the critical path on one core (JCP) with the actual response time (Jresponse). This metric is thus independent of the total execution time or the parallelism available in the job.

$\mathit{CC} = \sum_{t} \; \sum_{j \,:\, t^{\mathrm{finish}}_{j} \le t} w_{j}$  (3)

where the outer sum runs over the time ticks $t$ of the execution.

$\mathit{SLR} = \frac{J_{\mathrm{response}}}{J_{CP}}$  (4)

• Fairness is only little mentioned in the literature; the best source is [39]. Burkimsher et al. in [36] then explain that there can be a tradeoff between fairness and utilization in a non-pre-emptive system. Consequently, metrics are needed to quantify the level of fairness and to ensure that the tradeoff is managed appropriately. There may be an underlying assumption that by raising utilization, responsiveness is maximized; therefore, fairness will be near optimal as well. The average value of the Schedule Length Ratio metric can be used to gauge the responsiveness that a scheduler is able to achieve with a given workload. Tight clustering of the responsiveness values may indicate a fair distribution of grid resources to jobs. The importance of fairness is moreover connected to QoS.

• Relative—a metric where, for a given problem instance, the performance of all the considered schedulers is compared against a given metric—usually the workload makespan. This is repeated over a number of problem instances. The “best” scheduler is then considered to be the one that had the highest number of wins over the problem space. Relative metrics can often be useful for real-world scheduling problems, because finding the optimal schedule is computationally intractable.
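The following self-contained sketch computes the makespan (1), average utilization (2), and schedule length ratio (4) from a list of finished jobs, following the verbal definitions above; the Job record and the numbers in main are illustrative, and the exact formulations in [36, 37, 38] may differ in details.

    import java.util.Arrays;
    import java.util.List;

    public class SchedulerMetrics {
        /** Illustrative record of one finished job; times are in seconds. */
        static class Job {
            final double start, finish, coreTime, criticalPath;
            Job(double start, double finish, double coreTime, double criticalPath) {
                this.start = start; this.finish = finish;
                this.coreTime = coreTime;          // total busy core time consumed by the job
                this.criticalPath = criticalPath;  // ideal response time of the critical path on one core
            }
        }

        /** (1) Makespan: time at which all work was completed (workload assumed to start at t = 0). */
        static double makespan(List<Job> jobs) {
            return jobs.stream().mapToDouble(j -> j.finish).max().orElse(0.0);
        }

        /** (2) Average utilization: consumed core time over available core time (Gcores x interval). */
        static double averageUtilization(List<Job> jobs, int gCores, double intervalLength) {
            double busy = jobs.stream().mapToDouble(j -> j.coreTime).sum();
            return busy / (gCores * intervalLength);
        }

        /** (4) Schedule Length Ratio: actual response time over critical-path time (lower is better). */
        static double scheduleLengthRatio(Job j) {
            return (j.finish - j.start) / j.criticalPath;
        }

        public static void main(String[] args) {
            List<Job> jobs = Arrays.asList(
                    new Job(0, 120, 400, 60),
                    new Job(10, 90, 150, 40));
            System.out.println("makespan    = " + makespan(jobs));
            System.out.println("utilization = " + averageUtilization(jobs, 8, makespan(jobs)));
            for (Job j : jobs) System.out.println("SLR         = " + scheduleLengthRatio(j));
        }
    }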

2.2.2 Apache Mesos and Apache YARN

As was discussed earlier, Hadoop is already one of the crucial technologies for big data processing on large clusters. More importantly, it became the place within an organization where the engineers and researchers have an instantaneous and almost unrestricted access to vast amounts of computational resources and stores of company data. This is caused by MapReduce’s tight binding to the storage6. With this wide adoption, the developer community extended the MapReduce programming model beyond the capabilities of the cluster management substrate. A common pattern submits “map-only” jobs to spawn arbitrary processes in the cluster. The examples of (ab)uses include forking web servers and gang-scheduled computation of iterative workloads. Developers, in order to leverage the physical resources, often resorted to clever workarounds to sidestep the limits of the MapReduce API. The main reason for those misuses of Hadoop and MapReduce was the monolithic architecture of the resource management functions and the programming model. [40]

6 MapReduce model counts on distributed storage of data on nodes that are used for computations over that data.


Apache Mesos and Apache YARN are discussed in this section mainly because they are both targeted at MapReduce clusters shared across different frameworks such as MPI, Dryad, or Storm. Other interesting projects dealing with clusters for multiple frameworks are Omega [41], Corona [42], or Cosmos [43]. However, these systems are more focused on the specific needs of the companies that created them (Google, Facebook, Microsoft) than on universal scenarios. A better comparison can be found in [40].

Apache Mesos

Mesos is a thin, fault-tolerant resource sharing layer that enables fine-grained sharing across diverse cluster computing frameworks by giving the frameworks a common interface for accessing cluster resources. It is written in C++ and it has a pretty small codebase of about 10 000 lines.

Because the cluster frameworks are both highly diverse and rapidly evolving, an overriding design philosophy of Mesos was to define a minimal interface that enables efficient resource sharing across frameworks, and otherwise push control of task scheduling and execution to the frameworks. Firstly, it allows frameworks to implement diverse approaches to various problems in the cluster (e.g., achieving data locality, dealing with faults) and to evolve these solutions independently. Secondly, it keeps Mesos simple and minimizes the rate of change required of the system, which makes it easier to keep Mesos scalable and robust. [44]

Mesos consists of a master process that manages slave daemons running on each cluster node, and frameworks that run tasks on these slaves (see Fig. 5). The master implements fine-grained sharing across frameworks using resource offers. Each resource offer is a list of free resources on multiple slaves. The offers are generated according to the organizational policy (these are defined in pluggable allocation modules, Fig. 5). The frameworks running on Mesos consist of two components: a scheduler that registers with the master to be offered resources, and an executor process that is launched on the slave nodes to run the framework’s tasks. Each framework can accept or reject an offer and ask for another one. Frameworks cannot send detailed requests to Mesos; instead, they can state which offers are unacceptable for them by defining simple filters or whitelists of acceptable nodes. The rejection, filter, and whitelist mechanisms enable frameworks to support arbitrarily complex resource constraints while keeping Mesos simple and scalable. [44]

Fig. 5 To the left, Mesos architecture diagram with two running frameworks (Hadoop and MPI); to the right, an example of a resource offer; reproduced from [44]
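The offer-based interaction can be illustrated by the following simplified sketch; it deliberately does not use the real Mesos API (the Offer and Driver types here are hypothetical stand-ins) and only models how a framework scheduler walks through the offered resources, launches a task on an offer that fits, and declines the rest.

    import java.util.Arrays;
    import java.util.List;

    /** Illustrative model of Mesos-style offer handling (not the real Mesos API). */
    public class OfferBasedSchedulerSketch {
        static class Offer {
            final String slaveId; final double cpus; final double memMb;
            Offer(String slaveId, double cpus, double memMb) {
                this.slaveId = slaveId; this.cpus = cpus; this.memMb = memMb;
            }
        }
        interface Driver {
            void launchTask(String slaveId, String taskId, double cpus, double memMb);
            void declineOffer(Offer offer);
        }

        private final double taskCpus = 1.0;   // per-task demand (illustrative)
        private final double taskMemMb = 512;
        private int pendingTasks = 10;

        /** Called whenever the master sends resource offers to this framework. */
        public void resourceOffers(Driver driver, List<Offer> offers) {
            for (Offer offer : offers) {
                if (pendingTasks > 0 && offer.cpus >= taskCpus && offer.memMb >= taskMemMb) {
                    driver.launchTask(offer.slaveId, "task-" + pendingTasks--, taskCpus, taskMemMb);
                } else {
                    driver.declineOffer(offer); // the master will re-offer these resources elsewhere
                }
            }
        }

        public static void main(String[] args) {
            Driver driver = new Driver() {
                public void launchTask(String slaveId, String taskId, double cpus, double memMb) {
                    System.out.println("launch " + taskId + " on " + slaveId);
                }
                public void declineOffer(Offer offer) {
                    System.out.println("decline offer from " + offer.slaveId);
                }
            };
            new OfferBasedSchedulerSketch().resourceOffers(driver, Arrays.asList(
                    new Offer("slave-1", 4.0, 8192), new Offer("slave-2", 0.5, 256)));
        }
    }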


Resource allocation in Mesos is handled in pluggable modules, so that organizations can tailor allocation to their needs. Out of the box, there are two allocation modules: one that performs fair sharing based on a generalization of max-min fairness for multiple resources, and one that implements strict priorities. Similar policies are used in Hadoop and Dryad. In case of need, e.g., when many long tasks are blocking the whole cluster, Mesos can revoke (kill) them—this is “preemption”. [44]

Performance isolation between frameworks on the same node is done by leveraging existing mechanisms of the underlying OS, namely Linux Containers or Solaris Projects; these can limit the CPU, memory, network bandwidth, and (in newer Linux kernels) I/O usage of a process tree. [44]

Apache YARN

The YARN scheduler is a result of Hadoop’s evolution. Originally, Hadoop used a master-slave architecture with one JobTracker and multiple TaskTrackers, one on each worker node. A drawback of this solution was a very tight connection to the MapReduce programming model, where running non-MapReduce jobs was always a kind of compromise. To overcome this, Hadoop 2.0 is delivered with its own scheduler. Therefore, YARN lies on the same level as Mesos. The new scheduler in Hadoop allows extending the whole ecosystem with different frameworks and, at the same time, it opens Hadoop clusters to third-party frameworks needed by different users.

YARN lifts some functions into a platform layer responsible for resource management, leaving coordination of logical execution plans to a host of framework implementations (Fig. 6). By separating these responsibilities of the JobTracker’s charter, the central allocator can use an abstract description of tenants’ requirements but remains unaware of the semantics of each allocation. That responsibility is delegated to an ApplicationMaster (AM), which coordinates the logical plan of a single job by requesting resources from the ResourceManager (RM), generating a physical plan from the resources it receives, and coordinating the execution of that plan around faults. [40]

Fig. 6 YARN architecture (the system components in blue, and two applications running in yellow and pink); reproduced from [40]

Typically, an AM will need to collect the resources (CPUs, RAM, disks, etc.) available on multiple nodes to complete a job. To obtain containers, the AM issues resource requests to the RM. The form of these requests includes a specification of locality preferences and properties of the containers. The RM will attempt to satisfy the resource requests coming from each application according to availability and scheduling policies. When a resource is allocated on behalf of an AM, the RM generates a lease for the resource, which is pulled by a subsequent AM heartbeat. [40]

The ResourceManager (RM) matches a global model of the cluster state against the list of resource requirements reported by running applications. This makes it possible to tightly enforce global scheduling properties (different schedulers in YARN focus on different global properties such as capacity or fairness), but it requires the scheduler to obtain an accurate understanding of the applications’ resource requirements. The request vector of the AM contains: the number of containers, the resources per container7 (currently the amount of RAM and the number of CPUs), locality preferences (with node-level, rack-level, and global granularity), and the priority of requests within the application. The RM can request some resources back from the AMs, which, on the other hand, have some flexibility in fulfilling such preemption—for example, the AM can migrate some tasks or checkpoint them prior to returning the resource. The ResourceManager is not responsible for coordinating application execution or task fault tolerance, as these are responsibilities of the framework’s ApplicationMaster. [40]

The ApplicationMaster is the process that coordinates the framework’s execution in the cluster, but it is itself run in the cluster just like any other container. A component of the RM negotiates for the container to spawn this bootstrap process. After building a model of the framework’s requirements, the AM encodes preferences and constraints in a heartbeat message to the RM. In response to subsequent heartbeats, the AM will receive a container lease on bundles of resources bound to a particular node in the cluster. Based on the containers it receives from the RM, the AM may update its execution plan to accommodate perceived abundance or scarcity. The AM optimizes data locality by re-requesting containers with diminished weight for nodes that are not desirable as hosts for tasks. [40]
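A minimal sketch of this request-and-heartbeat loop, assuming the Hadoop 2.x AMRMClient Java API, is shown below; container launching via NMClient, error handling, and unregistration are omitted, and the host name, locality hint, and container size are illustrative.

    import java.util.List;

    import org.apache.hadoop.yarn.api.records.Container;
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class AppMasterSketch {
        public static void main(String[] args) throws Exception {
            AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
            rmClient.init(new YarnConfiguration());
            rmClient.start();

            // Register the ApplicationMaster with the ResourceManager.
            rmClient.registerApplicationMaster("appmaster-host", 0, "");

            // Ask for containers: capability, node locality preference, and priority.
            Resource capability = Resource.newInstance(1024, 1);        // 1 GB RAM, 1 vcore
            Priority priority = Priority.newInstance(0);
            String[] nodes = {"node-with-data.example.org"};            // illustrative locality hint
            rmClient.addContainerRequest(new ContainerRequest(capability, nodes, null, priority));

            // Heartbeat to the RM; allocated container leases arrive in the response.
            while (true) {
                List<Container> allocated = rmClient.allocate(0.1f).getAllocatedContainers();
                for (Container container : allocated) {
                    // Here the AM would build its physical plan and launch tasks via NMClient.
                    System.out.println("Got container " + container.getId() + " on "
                            + container.getNodeId());
                }
                if (!allocated.isEmpty()) break;
                Thread.sleep(1000);
            }
        }
    }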

The NodeManager (NM) is the “worker” daemon in YARN. It authenticates container leases, manages containers’ dependencies, monitors their execution, and provides a set of services to containers. Operators configure it to report the memory, CPU, and other resources available at the node and allocated for YARN. The NM mainly takes care of the node’s liveness, the availability of all necessary packages, and the cleanup after each task, including the killing of the task on the RM’s or the AM’s demand. [40]

Conclusion

While both Mesos and YARN have schedulers at two levels, there are two very significant differences. First, Mesos is an offer-based resource manager, whereas YARN has a request-based approach. YARN allows the AM to ask for resources based on various criteria including locations and allows the requester to modify its future requests based on what was given and on the current usage. YARN’s approach is better suited for location-based allocation, even though the results of Mesos are considered good in [44]. Second, instead of a per-job intra-framework scheduler, Mesos leverages a pool of central schedulers (e.g., classic Hadoop or MPI). YARN enables late binding of containers to tasks, where each individual job can perform local optimizations, and seems more amenable to rolling upgrades (since each job can run on a different version of the framework). On the other side, a per-job ApplicationMaster might result in greater overhead than Mesos’ approach. [40, 44]

7 The resource vector is designed to be extensible.


2.2.3 Interesting Approaches to Scheduling

Besides the common schedulers like random, FIFO, capacity, or fair, there are some unusual ones. These are mostly targeted at some specific behavior of frameworks, but they can still show interesting ways to improve scheduling and resource allocation decisions.

Hadoop’s standard scheduler is equipped with a simple speculative re-execution of tasks that may be late or have failed. This is based on monitoring the tasks’ progress. Because heterogeneous clusters have proven this to be problematic, Zaharia et al. [45] propose the LATE scheduler for speculative execution. LATE re-executes tasks according to their probable finish time (so the performance of the current node is included in the estimate). At the same time, speculatively executed tasks are not scheduled to nodes known to be slow. LATE’s approach has proved to be more effective on heterogeneous clusters than Hadoop’s standard scheduler.

Another unusual scheduling approach is used in the DAGuE distributed DAG engine [15]. The engine takes the definitions of tasks in the DAG and tries to optimize the performance of the distributed system as a whole by scheduling tasks with large amounts of shared data as close to each other as possible. Two tasks can even share the same memory with respect to the NUMA8 architecture if multiple cores sharing the same memory are available in the cluster. DAGuE transparently shares data across processors or even through the network across nodes. An application in the DAGuE system is defined by a description of all tasks and the DAG in a special language and by the C code for each task. The distributed scheduler analyzes the DAG and chooses which vertices of the DAG to run on which resources. Work stealing is also permitted. The DAGuE engine performed better and scaled better on Cholesky factorization than classical approaches, e.g., ScaLAPACK9.

A somewhat similar approach to data transfer optimization is taken by the Adaptive Online Scheduler for Storm [46]. Two adaptive schedulers are proposed: a topology-based scheduler and a traffic-based scheduler. The topology-based scheduler examines the topology of the stream processing application and does its best to place receivers of data streams on the same physical node as the emitters of that data. This is performed at the time of task scheduling. While the best placement of a particular task is given by the topology itself, the scheduler works with resources assigned by the resource allocator, so the decision is always a kind of compromise10. The traffic-based scheduler, moreover, takes into account the actual traffic between each emitter and receiver. Placement on the same physical node is conditioned by higher traffic between the tasks. The traffic must be monitored throughout the run of the application, and scheduling decisions are improved over time by rescheduling tasks. It is worth noting that the traffic-based scheduler also watches the overall computational load of the node to avoid non-network bottlenecks when rescheduling. Performance results of both adaptive schedulers are slightly better than the results of the default Storm scheduler.

8 Non-uniform memory access.
9 ScaLAPACK—Scalable Linear Algebra PACKage, an MPI-based framework for linear algebra, http://www.netlib.org/scalapack/
10 It is obvious that from the perspective of the topology-based scheduler, the best placement of tasks is the whole topology on one physical node. This is caused by a naïve decision perspective based only on network bandwidth optimization.
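The traffic-based idea can be illustrated by the following simplified greedy sketch; it is not the algorithm from [46], and all names and numbers are illustrative. Task pairs are sorted by the observed tuple rate between them, and the heaviest pairs are co-located as long as node slots allow, with the overall node load respected through a slot limit.

    import java.util.*;

    /** Greedy co-location of heavily communicating tasks (illustrative sketch);
     *  assumes there are enough slots in total for all tasks. */
    public class TrafficAwarePlacementSketch {
        static class Edge {
            final String from, to; final double tuplesPerSec;
            Edge(String from, String to, double tuplesPerSec) {
                this.from = from; this.to = to; this.tuplesPerSec = tuplesPerSec;
            }
        }

        static Map<String, Integer> place(List<Edge> traffic, int nodeCount, int slotsPerNode) {
            Map<String, Integer> assignment = new HashMap<>();
            int[] used = new int[nodeCount];
            // Heaviest edges first, so the busiest pairs get co-located.
            traffic.sort((a, b) -> Double.compare(b.tuplesPerSec, a.tuplesPerSec));
            for (Edge e : traffic) {
                Integer nodeFrom = assignment.get(e.from);
                Integer nodeTo = assignment.get(e.to);
                if (nodeFrom == null && nodeTo == null) {
                    int node = leastLoaded(used);
                    if (used[node] + 2 <= slotsPerNode) {
                        // Both endpoints of a heavy edge fit on the same node.
                        assignment.put(e.from, node); assignment.put(e.to, node); used[node] += 2;
                    } else {
                        assignment.put(e.from, node); used[node]++;
                        int other = leastLoaded(used);
                        assignment.put(e.to, other); used[other]++;
                    }
                } else if (nodeFrom == null) {
                    // Join the already placed partner if it still has a free slot.
                    int node = used[nodeTo] < slotsPerNode ? nodeTo : leastLoaded(used);
                    assignment.put(e.from, node); used[node]++;
                } else if (nodeTo == null) {
                    int node = used[nodeFrom] < slotsPerNode ? nodeFrom : leastLoaded(used);
                    assignment.put(e.to, node); used[node]++;
                }
            }
            return assignment;
        }

        static int leastLoaded(int[] used) {
            int best = 0;
            for (int i = 1; i < used.length; i++) if (used[i] < used[best]) best = i;
            return best;
        }

        public static void main(String[] args) {
            List<Edge> traffic = new ArrayList<>(Arrays.asList(
                    new Edge("spout-a", "bolt-b", 5000),
                    new Edge("bolt-b", "bolt-c", 4000),
                    new Edge("spout-a", "bolt-c", 100)));
            System.out.println(place(traffic, 2, 2)); // prints the node index chosen for each task
        }
    }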


2.3 Benchmark-based Scheduling

For heterogeneous clusters, nevertheless, the crucial approach is scheduling supported by benchmarking. Over time, clusters are not homogeneous anymore; instead, they contain different pieces of hardware. This is often caused by the fact that clusters are gradually enlarged with new nodes that are always the best on the market. Even stable-size clusters are usually upgraded, often rack by rack. This way, clusters become made of some older and some newer machines with differences in available memory but especially in CPU performance [45]. The situation gets even worse with the growing interest in GPU computations, where the cluster may contain only a limited number of nodes equipped with powerful graphics cards.

Even though the schedulers use different techniques to divide CPU performance (mostly just one core per task or container), the differences in the underlying hardware may be significant (the performance of one core differs across CPU generations and obviously also depends on frequency). Heterogeneity causes many of the classical scheduling algorithms to be less accurate, because the same computation takes a different time on equal containers in the cluster; see, e.g., the bad performance of the standard Hadoop scheduler on heterogeneous clusters described in section 2.2.3.

At the same time, different capabilities of nodes may, if correctly employed, bring even better results than homogeneous node deployments. Ren et al. in [47] discuss the benefits of heterogeneous processors (e.g., a processor with one or two fast cores and six or more slow cores). Besides improved power efficiency, this processor architecture allows for better response times under high loads on interactive systems such as websites, because multiple slow processors simply can serve more low-CPU-intensive tasks than one fast CPU. Analogously, more nodes with slower CPUs should be faster for higher amounts of low-CPU-intensive computations over fast streams of data than fewer nodes with fast CPUs.

The main idea of benchmark-based scheduling is twofold. First, the capabilities of each node of the cluster in terms of CPU performance, network bandwidth, or permanent storage bandwidth are investigated. Afterward, the scheduler is able to make better decisions and can consider various tradeoffs based on knowledge of each node’s capabilities, e.g., to schedule a task far from its data because all nodes with the necessary data are CPU over-utilized but have free network bandwidth.
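A minimal sketch of this idea follows; all class names and benchmark numbers are illustrative, and the benchmark scores are assumed to be normalized to comparable scales. Each node carries benchmarked capability scores, each task carries weights expressing which capability matters most for it, and the scheduler picks the free node with the best weighted score.

    import java.util.Arrays;
    import java.util.List;

    /** Illustrative benchmark-driven placement: pick the node whose measured
     *  capabilities best match the task's resource profile. */
    public class BenchmarkAwarePlacementSketch {
        static class NodeProfile {
            final String name;
            final double cpuScore, diskScore, netScore;  // benchmark results, relative to a reference node
            int freeSlots;
            NodeProfile(String name, double cpuScore, double diskScore, double netScore, int freeSlots) {
                this.name = name; this.cpuScore = cpuScore; this.diskScore = diskScore;
                this.netScore = netScore; this.freeSlots = freeSlots;
            }
        }
        static class TaskProfile {
            // Relative importance of CPU, disk, and network for this task (sums to 1).
            final double cpuWeight, diskWeight, netWeight;
            TaskProfile(double cpuWeight, double diskWeight, double netWeight) {
                this.cpuWeight = cpuWeight; this.diskWeight = diskWeight; this.netWeight = netWeight;
            }
        }

        /** Returns the best free node for the task, or null if the cluster is full. */
        static NodeProfile pickNode(TaskProfile task, List<NodeProfile> nodes) {
            NodeProfile best = null;
            double bestScore = -1;
            for (NodeProfile n : nodes) {
                if (n.freeSlots == 0) continue;
                double score = task.cpuWeight * n.cpuScore
                             + task.diskWeight * n.diskScore
                             + task.netWeight * n.netScore;
                if (score > bestScore) { bestScore = score; best = n; }
            }
            if (best != null) best.freeSlots--;
            return best;
        }

        public static void main(String[] args) {
            List<NodeProfile> cluster = Arrays.asList(
                    new NodeProfile("old-node", 1.0, 1.0, 1.0, 4),
                    new NodeProfile("new-node", 2.5, 4.0, 1.0, 4));
            TaskProfile cpuHeavy = new TaskProfile(0.8, 0.1, 0.1);
            System.out.println("CPU-heavy task goes to: " + pickNode(cpuHeavy, cluster).name);
        }
    }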

Now, we will go through an introduction to benchmarking methods, where a basic overview of benchmarking and, more specifically, of the execution time estimation problem will be given. After that, works dealing with the described execution time estimation methods will be discussed.

2.3.1 Benchmarking

In the context of heterogeneity, the execution times of the same tasks are different across various nodes of the same cluster. For better scheduling decisions, it would be useful to predict the execution time with respect to a specific node. Although it is a complex problem, because different programs have different bottlenecks (i.e., some programs need mostly a fast CPU while others rather require fast permanent storage), there are some suitable solutions. We are talking about the execution time estimation problem.


In the literature, there are three major classes of solutions to the execution time estimation problem: code analysis [48], analytic benchmarking/code profiling [49, 50, 51], and statistical prediction [52, 53].

Code Analysis

In code analysis, an execution time estimate is found through analysis of the source code of the task. A given code analysis technique is typically limited to a specific code type or a limited class of architectures. Thus, these methods are not very applicable to a broad definition of heterogeneous computing. [54]

Analytic Benchmarking/Code Profiling

Analytic benchmarking defines a number of primitive code types. On each machine, benchmarks, which determine the performance of the machine for each code type, are obtained. Code profiling then attempts to determine the composition of a task in terms of the same code types. The analytic benchmarking data and the code profiling data are then combined to produce an execution time estimate.

Analytic benchmarking/code profiling has two disadvantages. First, it lacks a proven mechanism for producing an execution time estimate from the benchmarking and profiling data over a wide range of algorithms and architectures. Second, it cannot easily compensate for variations in the input data set. However, analytic benchmarking is a powerful comparative tool in that it can determine the relative performance differences between machines. [54]

Statistical Prediction Algorithms

Statistical prediction algorithms, the third class of execution time estimation algorithms, make predictions using past observations. A set of past observations is kept for each machine and used to make new execution time predictions. The matching and scheduling algorithm uses these predictions (and other information) to choose a machine to execute the task. While the task executes on the chosen machine, its execution time is measured, and this measurement is subsequently added to the set of previous observations. Therefore, as the number of observations increases, the estimates produced by a statistical algorithm improve.

Statistical methods have the advantage that they are able to compensate for parameters of the input data (such as the problem size) and do not need any direct knowledge of the internal design of the algorithm or the machine. However, statistical techniques lack an internal method of sharing observations between machines. By allowing observations to be shared between machines, the execution time estimate on a machine with few observations can be improved by using observations from machines with similar performance characteristics. [54]
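To make the idea concrete, the following sketch keeps a set of past execution-time observations per machine and predicts a new execution time as their mean. It is an illustrative simplification only (class and method names are my own, not taken from any cited system); the cited works additionally regress on input-data parameters such as the problem size instead of plain averaging.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal statistical execution-time predictor: one observation set per machine. */
public class ExecutionTimePredictor {

    private final Map<String, List<Double>> observations = new HashMap<>();

    /** Record a measured execution time (in ms) on the given machine. */
    public void addObservation(String machine, double executionTimeMs) {
        observations.computeIfAbsent(machine, m -> new ArrayList<>()).add(executionTimeMs);
    }

    /** Predict the execution time on a machine as the mean of its past observations. */
    public double predict(String machine) {
        List<Double> samples = observations.getOrDefault(machine, Collections.emptyList());
        if (samples.isEmpty()) {
            throw new IllegalStateException("no observations yet for " + machine);
        }
        return samples.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }
}
```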

2.3.2 Speculative Execution in Hadoop Revisited

Getting back to the problem of Hadoop's speculative execution (2.2.3 Interesting Approaches), we can see other solutions to this problem. Bortnikov et al. [25] present a machine learning approach for predicting the slowdowns of MapReduce tasks. On Yahoo!'s large cluster it was identified that the slowdowns of individual tasks are highly correlated with overall job latencies. The correlation is not direct, since mappers and reducers are typically executed in multiple waves. However, significant task slowdowns tend to indicate bottlenecks in the job execution as well. Measurements on production data show that as many as 90 % of speculative tasks are useless: they are launched too late to finish before the original stragglers and end up being killed by the system. The proposed slowdown predictor gives, for a given task-node pair ⟨t, n⟩, an estimation of the slowdown. In this context, both the task and the node are modeled as feature vectors assembled from MR-level and system metrics.

From the perspective of benchmarking, a statistical prediction algorithm is used. As the MapReduce concept divides jobs into plenty of map tasks and a few reduce tasks, many of which are essentially the same (at least they run the same code), patterns of similar tasks on different nodes can be found even across various jobs executed on the same cluster.

Intuitively, two similar tasks are identified by the same "signature". As an approximation, sibling mappers or reducers in the same job with roughly the same input and output size are considered similar. More precisely, each task has a characterizing profile; it can be considered a vector of attributes. Two tasks are similar if their profiles are similar. Along with the profiles of tasks, the nodes have profiles too; nodes' profiles are based on the hardware configuration. Finally, machine learning is employed to build the prediction model from accumulated logs and the profiles of the tasks and the hardware.
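As a small illustration of the profile-similarity idea (my own simplification; [25] uses richer MR-level and system metrics and a learned model rather than a fixed measure), two tasks can be treated as similar when the cosine similarity of their attribute vectors exceeds a threshold:

```java
/** Illustrative profile similarity: cosine similarity of two attribute vectors. */
public final class ProfileSimilarity {

    public static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Two task profiles (e.g., input size, output size, CPU time) are "similar" above a threshold. */
    public static boolean similar(double[] profileA, double[] profileB, double threshold) {
        return cosine(profileA, profileB) >= threshold;
    }
}
```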

2.3.3 Real Benchmarks

Although the previous method falls into the category of benchmarking, it does not perform any tests on the hardware itself. Instead, the nature of MapReduce tasks' similarity is exploited. A different approach was taken by Gupta et al. in ThroughputScheduler [55]. With the ThroughputScheduler, after each hardware or topology change the cluster is probed by a series of tests (note that it is assumed that such changes are quite rare, so the testing can be done offline). Then, while running MapReduce jobs, the profiles of their tasks are learned online, without any disturbance of the computation. The Bayesian active learning scheme is split into two phases: learning and exploitation.

Fig. 7 Algorithm SELECTBESTJOB(Node N, ListOfJobs), reproduced from [55]

To enable online learning, task completion time samples from actual production jobs are collected. With every new time sample, the belief about the resource profile of the tasks is updated. Statistical methods are utilized to decide the level of certainty, and when some predefined11 level is exceeded, the exploitation phase of the learned knowledge can begin.

In the exploitation phase, the new scheduler called SelectBestJob then, according to the capabilities of each node and the learned resource profile of the job's tasks, enhances the performance of MapReduce on the heterogeneous cluster (Fig. 7). The main idea is to select from the queue the most compatible jobs for a specific machine and run them first (Fig. 8). For example, consider Nodes A and B, where Node B is almost 7.5 times faster than Node A in terms of CPU but only 2.5 times faster in terms of disk. Hence, intuitively, disk intense jobs are better scheduled on Node A, since the relatively higher CPU performance of Node B is better used for CPU intense jobs (if there are any). To account for this relativity of optimal resource matching, both jobs and machines are normalized so that their total requirements and capabilities sum to one for each resource (e.g., computing and storage).

11 The level is actually fine tuned but is universal over all jobs and cluster configurations.
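The normalization step can be sketched as follows. This is a simplified illustration of the idea described in [55], not the authors' code; the resource vectors and the mismatch score are my own choices. Both the job's requirements and the node's capabilities are scaled to sum to one, and the job whose normalized requirement vector best matches the node's normalized capability vector is preferred.

```java
import java.util.Arrays;

/** Illustrative normalization of resource vectors (e.g., {cpu, disk}) to sum to one. */
public final class ResourceMatching {

    static double[] normalize(double[] v) {
        double sum = Arrays.stream(v).sum();
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] / sum;
        }
        return out;
    }

    /** Lower score = requirements better aligned with capabilities (L1 distance of normalized vectors). */
    static double mismatch(double[] jobRequirements, double[] nodeCapabilities) {
        double[] job = normalize(jobRequirements);
        double[] node = normalize(nodeCapabilities);
        double score = 0.0;
        for (int i = 0; i < job.length; i++) {
            score += Math.abs(job[i] - node[i]);
        }
        return score;
    }

    public static void main(String[] args) {
        // Node A: slow CPU; Node B: CPU ~7.5x and disk ~2.5x faster (the example from the text).
        double[] nodeA = {1.0, 1.0};
        double[] nodeB = {7.5, 2.5};
        double[] diskIntenseJob = {1.0, 4.0};   // needs mostly disk
        double[] cpuIntenseJob = {4.0, 1.0};    // needs mostly CPU
        System.out.println("disk job on A: " + mismatch(diskIntenseJob, nodeA)); // smaller = better fit
        System.out.println("disk job on B: " + mismatch(diskIntenseJob, nodeB));
        System.out.println("cpu job on B:  " + mismatch(cpuIntenseJob, nodeB));
    }
}
```

Running the sketch reproduces the intuition above: the disk intense job matches Node A better, while the CPU intense job matches Node B.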

Fig. 8 Algorithm THROUGHPUTSCHEDULER(Cluster, Request), reproduced from [55]

Although the performance of ThroughputScheduler is, especially on heterogeneous clusters, significantly better than the original fair scheduler's, the problem is the same as with the statistical prediction algorithm from 2.3.2 Speculative Execution in Hadoop Revisited: the solution is only suitable for MapReduce jobs.

2.4 Beyond the State of the Art

The current state of the art can be characterized as a slow adoption of new principles required by the processing of massive data streams. Stream processing systems are getting into the awareness of organizations that need to process big data, and it can be said that stream processing is cutting an increasing piece of the market where MapReduce-based data processing was used in the past. At the same time, the reality of large clusters (even those based on clouds) shows that these are not homogeneous any more (see 2.2.3, 2.3 and 3.1). Benchmark-based scheduling approaches are a remedy for the heterogeneity of clusters.

Stream processing is still a relative newcomer among widely operated paradigms for big data processing; the most widespread paradigm is MapReduce. The current situation around stream processing systems is insufficient because all existing systems are primarily designed for operation on dedicated clusters, and the only way to use them on a shared cluster is their incorporation with scheduling systems like Mesos or YARN (see 2.2.2). Although connectors for these schedulers are often available, the problem remains, since stream processing is most suitable for a segment of jobs that are currently processed using the MapReduce paradigm, and MapReduce is mostly run without any other scheduling substrate. One possible solution is being brought by the new concept of Hadoop 2.0 (the most widely used implementation of the MapReduce paradigm), where the YARN scheduler is opened for other applications.

There are multiple efforts to utilize the heterogeneity of clusters, but most of them are targeted at MapReduce where, in the Hadoop implementation, the problem of non-homogeneous clusters is already significant (see the introduction of 2.3). Solutions to this problem of MapReduce are based on different ways of tasks' speculative execution (see 2.2.3 and 2.3.2) and benchmarking (see 2.3.3), but none of them is directly suitable for the stream processing paradigm.

On the other hand, the long running nature of stream jobs together with their resource requirement heterogeneity can be exploited for better utilization of heterogeneous clusters. The current situation, however, does not allow for such a scenario because there is a lack of advanced schedulers that are able to fulfill this potential of stream applications. Further research of benchmark-based scheduling algorithms has to be made. At the same time, many concepts already known from other fields of big data processing can be a good starting point to resource aware scheduling capable of effective utilization of heterogeneous hardware.

Existing advanced scheduling approaches for MapReduce and stream processing provide different degrees of effectiveness (see 2.2.3). Each one is suitable for a specialized set of problems where it may bring some advantage over standard schedulers. Unfortunately, these ideas are fragmented over multiple different systems (e.g., Hadoop or DAGuE) and there is no attempt to combine at least some of them into a single and thus more flexible scheduler. Benchmark-based scheduling for stream processing systems, for example, may profit from a combination with the topology-based or traffic-based scheduler from 2.2.3. The reason is that the sets of problems partly overlap, i.e., all mentioned schedulers can utilize data from performance monitoring together with information about resource capabilities found by benchmarking.

Deficiencies and possible improvements discussed in this section are the fundamentals for the objectives of my work. The following chapters describe in more detail the motivation of this work and show the ways leading to a prototype of the heterogeneity aware scheduler for stream processing frameworks.


Chapter 3

Heterogeneity Aware Scheduler

3.1 Motivation

In recent years, the popularity of MapReduce and Hadoop has risen to new levels. Almost any subject dealing with so called "big data" uses Hadoop for data processing. The popularity it received was mostly caused by the simple model of parallelization that provides massively distributed processing of data with an easy and understandable interface. The interface is not limited only to the MapReduce programming model; there are various tools on top of MapReduce such as Apache Pig (scripting language) or Apache Hive (SQL like query language). For increasingly diverse companies, Hadoop has become the data and computational agorá—the de facto place where the data and computational resources are shared and accessed [40].

Recent trends have shown that batch processing of persistently stored data via Hadoop is not sufficient for a vast number of applications. The reasons are obvious: storing the data is often unneeded and it prolongs both the time of computation over the data itself, and the time from data arrival to the results of the computation. Stream processing is the solution. At the same time, it is natural to incorporate stream processing into an existing Hadoop ecosystem because, firstly, many clusters are already running Hadoop and will have to run Hadoop in the future, and secondly, the interaction of data obtained by stream processing with other data currently stored or processed in Hadoop clusters12 is inevitable.

First steps to make the connection between Hadoop and stream processing have recently been taken. There are third party schedulers making Hadoop clusters available to other data processing frameworks; Hadoop itself now has a dedicated scheduler called YARN, again extensible for new frameworks.

Although stream processing is on the rise, there is no progress in scheduling approaches to support streams; especially the situation concerning heterogeneous clusters is still unsatisfying and new solutions have to be proposed. An adoption of benchmark-based scheduling for wider classes of problems, such as stream processing, can be the right cure.

3.2 Intuitive Example

To better illustrate the scheduling problems anticipated in the current state of the art in the area of stream processing on heterogeneous clusters, this section presents an example application and its possible behavior. We chose Apache Storm as the stream processing substrate because of its partial integration into Apache Hadoop and its simple and straightforward processing model. Part of the application was used for basic experiments with scheduling.

12 On distributed storages like the HDFS or the HBase managed by the Hadoop ecosystem.

The use case is built on top of existing Java libraries and components from various research projects of our faculty. The application itself is not intended to produce relevant results; it is just a concept demonstrating different problems of the current state of the art. At the same time, the application will not be fine tuned for precise results as they are not important.

3.2.1 The Application—Popular Stories

The example application analyzes websites found by watching thousands of RSS feeds in order to find photos and texts identifying the most popular connections between persons and related entities. The result of the analysis is a list of triples of a person's name, keywords, and a set of photos. The meaning of the triple is: a person recently frequently mentioned in the context of the keywords (e.g., events, objects, persons, etc.) and a set of photos. The application holds a list of triples with the most often seen person, keywords, and pictures for some period of time. This way, current trends of persons related to keywords with relevant photos can be obtained.

Fig. 9 depicts the topology of the example application. Circles express components of the application: spouts and bolts known from Apache Storm (see 2.1.2). Each bolt or spout can be scaled into multiple instances and deployed on different cluster nodes. Components communicate in a one-way manner through the streams displayed as arrows in the figure.

The operation of the application can be described in the following way: Computation starts at the URL generator spout where the URLs are extracted from RSS feeds. After that, the Downloader gets the HTML, styles, and pictures of the websites. The complete website is passed to the Analyzer bolt, which searches for person names and for keywords and pictures in the context of the names found. Resulting name-keyword pairs are stored as a tops list in Memory store NK, where the list is updated each time a new pair arrives and excessively old pairs are removed from the computation of the list—a window of a time period is held13. Changes in the tops list are passed to Memory store NKP. Name-picture pairs emitted by the Analyzer are processed in the Image feature extractor, so that indexable features of each image are sent to Memory store NP, where the tops list of the most popular person and image pairs is held. Memory stores employ Apache Lucene14 indexes, especially for the ability to easily search the features of images15. Changes in the tops list of Memory store NP are again emitted to Memory store NKP, where the consolidated tops list of names with keywords and pictures is maintained. Changes in the consolidated tops list of triples are persistently saved for further querying.

13 In-memory stores in the topology are scalable so that even a large time window can be easily observed if enough nodes with memory are available; data partitioning is applied by name.
14 http://lucene.apache.org/core/
15 Images are compared by their features so that similar images are found even when they are of different sizes, formats, or compressions.


Fig. 9 Popular Stories With Pictures—example application: topology according to the Storm terminology (S—Spout, generator of data for the topology; B—Bolts, data processing components; streams of tuples described in light grey)

3.2.2 Standard Scheduling

Schedulers make their decisions at a specific level of abstraction (i.e., they do not schedule tasks to a computation node as a whole); usually some kind of units of equal or scalable size are used. The YARN scheduler, for example, uses Containers with various numbers of cores and various amounts of memory. Apache Storm, on the other hand, uses equally sized Slots (one slot per CPU core). In each slot, multiple spouts or bolts of the same topology may run16. Storm's default scheduler then, using a round-robin strategy, deploys bolts and spouts (collectively called executors) in such a way that each node in the topology has an almost equal number of executors running in each slot. At the same time, the rule of an almost equal number of executors is maintained even when multiple topologies are running on the same cluster.
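The behavior of the default even scheduler can be approximated by a plain round-robin assignment; the sketch below is illustrative only and is not Storm's actual implementation (names are mine):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative round-robin placement of executors over slots, roughly as an even scheduler does. */
public final class RoundRobinPlacement {

    /** Assigns executors to slots so that slot occupancy stays almost equal. */
    public static Map<String, List<String>> place(List<String> executors, List<String> slots) {
        Map<String, List<String>> assignment = new HashMap<>();
        for (int i = 0; i < executors.size(); i++) {
            String slot = slots.get(i % slots.size());
            assignment.computeIfAbsent(slot, s -> new ArrayList<>()).add(executors.get(i));
        }
        return assignment;
    }
}
```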

Considering a heterogeneous Storm cluster, which is already running multiple topologies, the work of the scheduler may, for the described example use case, result in the scenario depicted in Fig. 10 on the left. This heterogeneous cluster consists of four nodes with different hardware configurations (see Fig. 10), so the number of slots available at each node differs. Taking into account presumable requirements of individual executors17, it is obvious that the scheduling decision was relatively wrong and resulted in inefficient utilization of the cluster.

16 Number of separate slots for each spout and bolt type is defined by developer in topology definition.
17 URL generator and Downloader have low CPU requirements; Analyzer requires fast CPU; Image feature extractor can utilize GPU using OpenCL; In-memory store requires large amount of memory.


URL—URL generator; Dx—Downloader; Ax—Analyzer; IEx—Image feature extractor; MSx—In-memory store
Fig. 10 Standard vs. Advanced scheduling—possible impacts of advanced scheduling on a heterogeneous cluster of four nodes (Fast CPU node, 8 slots; Slow CPU node, 6 slots; High memory node, 8 slots; GPU node, 4 slots); two bolts were removed from the topology thanks to utilization of more suitable nodes for particular bolts (note that the High memory and Fast CPU nodes are switched in the diagram to the right)

3.2.3 Advanced Scheduling

If we consider a kind of simple benchmark-based scheduler, which for a new Storm topology runs a number of benchmarks with each executor running each time on a different node, the suitability of nodes for particular executors can be revealed. The scheduler can then deploy executors to slots (with respect to availability) that are running on nodes with the most suitable resource profile. Our example use case application can be deployed on the same cluster with fewer executors18 this way and probably even with higher throughput (see Fig. 10).
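A minimal sketch of such benchmark-driven placement might look as follows. It is an illustrative pseudo-implementation, not the prototype's code; the suitability scores are assumed to come from the benchmarking runs, and all names are mine.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative greedy placement: each executor goes to a free slot on its most suitable node type. */
public final class BenchmarkBasedPlacement {

    /** suitability.get(executorType).get(nodeType) = throughput measured during benchmarking runs. */
    public static Map<String, String> place(List<String> executorTypes,
                                             Map<String, Map<String, Double>> suitability,
                                             Map<String, Integer> freeSlotsPerNodeType) {
        Map<String, String> placement = new HashMap<>();
        for (String executor : executorTypes) {
            String bestNodeType = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (Map.Entry<String, Double> e : suitability.get(executor).entrySet()) {
                boolean hasSlot = freeSlotsPerNodeType.getOrDefault(e.getKey(), 0) > 0;
                if (hasSlot && e.getValue() > bestScore) {
                    bestScore = e.getValue();
                    bestNodeType = e.getKey();
                }
            }
            if (bestNodeType != null) {
                placement.put(executor, bestNodeType);
                freeSlotsPerNodeType.merge(bestNodeType, -1, Integer::sum);
            }
        }
        return placement;
    }
}
```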

Other scheduling results with possibly even more effective utilization of the cluster may be achieved by combining benchmark-based scheduling with traffic-based scheduling (see 2.2.3). Such a scheduler may also consider tradeoffs between bandwidth availability on particular nodes and the availability of otherwise more suitable nodes.

3.3 Scheduler

Following the intuitive ideas from the previous section, we can analyze the problem of scheduling based on the suitability of a hardware type for a component of the application in different ways. In this section, we will go through some basics and discussions of the concept of the heterogeneity aware scheduler for stream processing frameworks.

For the heterogeneity aware scheduler, one important part of the problem is the benchmarking of the nodes. Because clusters are usually built of only a few categories of hardware, we can narrow the problem to benchmarking each category of hardware—a hardware class. The other part of the heterogeneity scheduler's problem lies in finding a suitable "hardware class to program" combination; here the statistical prediction algorithms of execution time from 2.3.1 can be utilized. This way, the scheduled program (executor type) itself serves as the benchmark suite.

18 In-memory stores were deployed on a node with a multiple times higher amount of memory and Image feature extractors were deployed on a node with two GPUs, so it was possible to remove some parallelism.

3.3.1 Performance of Program to Hardware Class Combinations

The suitability of an executor type for a particular hardware class may be, as was discussed in section 2.3.1, determined in various ways. Whereas the code analysis approach is very unsuitable because of the wide variety of hardware and programming languages used for stream processing big data computations, analytic benchmarking or code profiling may bring some useful results. The disadvantage of the latter two methods is again the necessity to use an extensive library of code primitives for each programming language and hardware type. On the other hand, statistical prediction algorithms are very suitable for stream processing applications and clusters because the problem space for the benchmarking is relatively small even for big clusters and complicated applications. Details of the benchmarking are discussed in the next section.

Statistical prediction algorithms exploit a set of past observations of programs' execution times to predict the future execution times. With more observations, the predicted execution time may be more precise. Conversely, if the nature of the computed data slowly changes over time, then too old observations may reduce the precision.

Although stream applications work with continuous data and programs are considered to be potentially infinite (so are the data), we can measure the execution times for each small unit of data because stream processing frameworks typically do computations on some kind of tuples containing semantically related data or on equally sized blocks of data. The measurements of execution times have to be stored for each hardware class to program type combination. This way we can predict the execution times for any node in the cluster and any program instance of the stream application.

Because the knowledge of the best suitable hardware class is not sufficient for scheduling whenever there are no more slots available of a particular hardware class, an order of suitability of each hardware class must be defined according to the set of observed execution times executionTime for each hardware class h and program type p. Formula (5) describes a set of average execution time observations T, where H is the set of all hardware classes, P is the set of all application's program types, and recordCount is the number of observations for a given "hardware class/program type" pair.

\[ T = \left\{\, t_{\langle h,p\rangle} ;\; t_{\langle h,p\rangle} = \operatorname{avg}\!\left(\mathit{executionTime}_{\langle h,p,\,i_{\langle h,p\rangle}\rangle}\right) ;\; h \in H ;\; p \in P ;\; i \in \mathbb{N} ;\; i_{\langle h,p\rangle} < \mathit{recordCount}_{\langle h,p\rangle} \,\right\} \quad (5) \]

The set of performance losses L between each pair of hardware classes for a particular program type is defined using the difference between the average execution time observations for the given hardware classes and program type (6). The performance loss defined in L is, however, unsuitable for comparison between program types because of differences in the execution frequency of distinct program types. To overcome this, we can use a kind of normalization over the count of executions. The normalized set of performance losses M is then defined in (7).


\[ L = \left\{\, l_{\langle h_1,h_2,p\rangle} ;\; l = t_{\langle h_1,p\rangle} - t_{\langle h_2,p\rangle} ;\; t \in T ;\; h_1, h_2 \in H ;\; h_1 \neq h_2 ;\; p \in P \,\right\} \quad (6) \]

\[ M = \left\{\, m_{\langle h_1,h_2,p\rangle} ;\; m = \left( \frac{\mathit{recordCount}_{\langle h_1,p\rangle}}{\max\left(\mathit{recordCount}_{\langle q,r\rangle}\right)} + 1 \right) \times l_{\langle h_1,h_2,p\rangle} ;\; l \in L ;\; h_1, h_2 \in H ;\; h_1 \neq h_2 ;\; q \in H ;\; r \in P \,\right\} \quad (7) \]
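A direct, illustrative translation of (5) to (7) into code could look as follows. The data structures and names are mine; the point is only to show how T, L and M are derived from the recorded observations.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative computation of the sets T (5), L (6) and M (7) from recorded execution times. */
public final class PerformanceLossModel {

    /** key = "hardwareClass|programType", value = recorded per-tuple execution times. */
    public static Map<String, Double> averageTimes(Map<String, List<Double>> observations) {
        Map<String, Double> t = new HashMap<>();
        observations.forEach((key, samples) ->
                t.put(key, samples.stream().mapToDouble(Double::doubleValue).average().orElse(0.0)));
        return t;
    }

    /** Performance loss (6): difference of average times of two hardware classes for one program type. */
    public static double loss(Map<String, Double> t, String h1, String h2, String p) {
        return t.get(h1 + "|" + p) - t.get(h2 + "|" + p);
    }

    /** Normalized loss (7): loss weighted by the relative number of recorded executions of the program. */
    public static double normalizedLoss(Map<String, Double> t, Map<String, Integer> recordCount,
                                        String h1, String h2, String p) {
        int maxCount = recordCount.values().stream().mapToInt(Integer::intValue).max().orElse(1);
        double weight = (double) recordCount.get(h1 + "|" + p) / maxCount + 1.0;
        return weight * loss(t, h1, h2, p);
    }
}
```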

3.3.2 Benchmarking and Reschedules

Using part of the ideas presented by Bortnikov et al. [25] (see 2.3.2), we start by identifying the same "programs" running in the stream processing application. As was described in 2.1.2 Components and Architecture, a stream application is composed of a set of executor types where each type of executor runs in a specific number of instances. This gives us an interesting similarity of processes running in stream processing clusters.

Taking into account large computations with big data, we can say that there are hundreds or thousands of instances of each executor type. At the same time, applications are usually built from at most dozens of executor types. From the hardware perspective, we can analogously say that we usually have only a few hardware classes and hundreds or thousands of nodes of each hardware class. In this situation, the problem space for benchmarking every executor type on every hardware class is relatively small.

\[ \mathit{minRescheduleCount} = \max\left( \left\lceil \frac{\mathit{executorTypes}}{\min\left(\mathit{HWclassSlotCounts}\right)} \right\rceil ,\ \left\lceil \frac{\mathit{HWclasses}}{\min\left(\mathit{executorInstanceCounts}\right)} \right\rceil \right) \quad (8) \]

The exact count of reschedules needed to do complete benchmarking of a particular cluster and a given application is complicated to determine. However, for a basic insight into the number of necessary reschedules, we can use a simplified calculation19 (8). The formula uses the number of executor types (executorTypes), the number of hardware classes (HWclasses), the set of available slot counts for each hardware class (HWclassSlotCounts), and the set of numbers of instances of each executor type (executorInstanceCounts). In Fig. 11, an example of the dependency between executor types, slot counts, and the number of needed reschedules is shown. The results may then be generalized as follows: We may need only one or two reschedules whenever the number of executor types is slightly lower than the slot count of the rarest hardware class and, at the same time, the number of hardware classes is slightly lower than the count of instances of the rarest executor.

Number of executor types  10   9   8   7   6   5   4   3   2   1
Slot count                 1   2   3   4   5   6   7   8   9  10
Reschedules               10   5   3   2   2   1   1   1   1   1

Fig. 11 Dependency of the number of needed reschedules on the rarest hardware class's slot count and the number of program types

19 This formula works well as long as there are only a few hardware classes with a small amount of slots and, at the same time, only a few executor types with a small amount of instances.


In this section, we showed that the computational model of stream processing applications is very suitable for benchmarking on big clusters with a few hardware classes. At the same time, we can say that the benchmarking, despite having to be performed for every stream application (and for each of its programs), may be very effective with even a very small number of reschedules.
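For illustration, formula (8) can be evaluated directly; the small sketch below (names are mine) reproduces the dependency shown in Fig. 11 for a single hardware class and a single, large instance count:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative evaluation of formula (8) for the minimal number of benchmarking reschedules. */
public final class RescheduleCount {

    public static int minRescheduleCount(int executorTypes, int hwClasses,
                                         List<Integer> hwClassSlotCounts,
                                         List<Integer> executorInstanceCounts) {
        int bySlots = ceilDiv(executorTypes, Collections.min(hwClassSlotCounts));
        int byInstances = ceilDiv(hwClasses, Collections.min(executorInstanceCounts));
        return Math.max(bySlots, byInstances);
    }

    private static int ceilDiv(int a, int b) {
        return (a + b - 1) / b;
    }

    public static void main(String[] args) {
        // Reproduces Fig. 11: e.g., 8 executor types and 3 slots in the rarest class -> 3 reschedules.
        for (int types = 10, slots = 1; types >= 1; types--, slots++) {
            System.out.printf("types=%d slots=%d -> %d%n",
                    types, slots, minRescheduleCount(types, 1, Arrays.asList(slots), Arrays.asList(100)));
        }
    }
}
```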

3.3.3 Heterogeneity Aware Scheduling

Based on the information gathered first by benchmarking and later by the production use of the stream processing application, and using the formulas discussed in 3.3.1, we can define the most suitable hardware class for each program type (9). The formula looks for the hardware class u which gives, for the program type, the biggest normalized performance loss when not scheduled correctly, see (7). The set of hardware classes H is here reduced by the set of hardware classes that have no more free slots because of previous program instance placements (denoted F in (9)).

\[ u = h ;\quad m_{\langle h,h_2,p\rangle} = \max\left( m_{\langle h_1,h_2,p\rangle} \right) ;\; h_1 \in H - F ;\; h \in H ;\; p \in P \quad (9) \]

\[ v = p ;\quad m_{\langle h_1,h_2,p\rangle} = \max\left( m_{\langle h_1,h_2,p\rangle} \right) ;\; p \in P - Z ;\; h_1, h_2 \in H \quad (10) \]

The scheduler goes through a loop where it: (I) first gets the program type v with currently the biggest performance loss among the others (10), taking p from the set of unscheduled program types (Z is the set of currently scheduled program types); (II) second, places the required number of instances using the most suitable hardware class for that program type (9). This is the fundamental principle of the proposed novel scheduler. The principle still has a few shortcomings. For example, the scheduler should place the instances of the same program type not only according to the hardware class but according to the physical node too. This should be done in such a way that an equal number of the program type instances runs on each available physical node of the given hardware class. This principle has proven to be efficient, for example, in Apache Storm's even scheduler (and it was observed during my experiments with scheduling too).
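The two-step loop can be sketched as follows. The sketch is illustrative only: slot accounting is simplified to maps, the normalized losses m from (7) are reduced to one score per hardware class, and all names are my own.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative main loop of the heterogeneity aware scheduler, following formulas (9) and (10). */
public final class HeterogeneityAwareLoop {

    /** normalizedLoss.get(programType).get(hardwareClass) = normalized loss m when misplaced. */
    public static Map<String, String> schedule(Map<String, Map<String, Double>> normalizedLoss,
                                               Map<String, Integer> freeSlots,
                                               Map<String, Integer> requiredInstances) {
        Map<String, String> placement = new HashMap<>();   // programType -> hardwareClass
        Set<String> scheduled = new HashSet<>();            // the set Z of already scheduled types

        while (scheduled.size() < normalizedLoss.size()) {
            // (I) pick the unscheduled program type with the biggest performance loss, see (10)
            String program = null;
            double worstLoss = Double.NEGATIVE_INFINITY;
            for (String p : normalizedLoss.keySet()) {
                if (scheduled.contains(p)) continue;
                double loss = Collections.max(normalizedLoss.get(p).values());
                if (loss > worstLoss) { worstLoss = loss; program = p; }
            }
            // (II) place it on the most suitable hardware class that still has free slots, see (9)
            String bestClass = null;
            double bestLoss = Double.NEGATIVE_INFINITY;
            for (Map.Entry<String, Double> e : normalizedLoss.get(program).entrySet()) {
                if (freeSlots.getOrDefault(e.getKey(), 0) >= requiredInstances.get(program)
                        && e.getValue() > bestLoss) {
                    bestLoss = e.getValue();
                    bestClass = e.getKey();
                }
            }
            if (bestClass != null) {
                placement.put(program, bestClass);
                freeSlots.merge(bestClass, -requiredInstances.get(program), Integer::sum);
            }
            scheduled.add(program);
        }
        return placement;
    }
}
```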

3.4 Experiments and Results

To evaluate the concept of the heterogeneity aware scheduler described in the previous sections, a prototype has been implemented as a pluggable scheduler for Apache Storm, the stream processing framework for massively parallel computations. The reasons for choosing Apache Storm were already described in 3.2 Intuitive Example.

3.4.1 Prototype Scheduler Implementation

The heterogeneity aware scheduler prototype is implemented partly inside Apache Storm (the scheduling component and the analysis component) and partly inside the example application itself (the monitoring component). Monitoring data produced by each application component instance is currently saved in a centralized relational database, which brings the ability to write straightforward statistical queries over the data (but, on the other hand, this concept is not suitable for big environments). The scheduling component has straightforward access to Storm's APIs for executor placement, removal and status, so scheduling can easily be made by executor type and host name. The analyzer component then operates inside the Storm cluster supervisor system called Nimbus, the same place where the scheduler resides. The analyzer component queries the database with monitoring data and provides an API for per "hardware class"—"application component" performance data and known placements to the scheduling component.

The prototype implementation of the scheduler is suitable for experiments with heterogeneous clusters in the sense of different hardware used across the cluster nodes (e.g., different CPUs, GPUs, amounts of memory, or static acceleration). It allows the application to be periodically rescheduled over the heterogeneous cluster and the performance of application components on different hardware classes to be observed.
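For orientation, a bare skeleton of a pluggable Storm scheduler of that era is shown below. It is a minimal sketch assuming the backtype.storm.scheduler API of Storm 0.8/0.9 (the class is registered through the storm.scheduler property in the Nimbus configuration); the body only marks where the heterogeneity aware placement logic would plug in, and delegating leftovers to the default EvenScheduler is a commonly used pattern rather than the prototype's actual code.

```java
import java.util.List;
import java.util.Map;

import backtype.storm.scheduler.Cluster;
import backtype.storm.scheduler.EvenScheduler;
import backtype.storm.scheduler.IScheduler;
import backtype.storm.scheduler.Topologies;
import backtype.storm.scheduler.TopologyDetails;
import backtype.storm.scheduler.WorkerSlot;

/** Skeleton of a pluggable heterogeneity aware scheduler for Apache Storm (0.8/0.9 era API). */
public class HeterogeneityAwareScheduler implements IScheduler {

    @Override
    public void prepare(Map conf) {
        // e.g., open the connection to the monitoring database holding the benchmark observations
    }

    @Override
    public void schedule(Topologies topologies, Cluster cluster) {
        for (TopologyDetails topology : topologies.getTopologies()) {
            if (!cluster.needsScheduling(topology)) {
                continue;
            }
            List<WorkerSlot> freeSlots = cluster.getAvailableSlots();
            // Here the analyzer component would be queried for per "hardware class"/"executor type"
            // performance data, and executors would be assigned to the most suitable of the freeSlots
            // via cluster.assign(slot, topology.getId(), executors).
        }
        // Anything left unassigned falls back to Storm's default even scheduler.
        new EvenScheduler().schedule(topologies, cluster);
    }
}
```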

3.4.2 Experiments

The experiments were made on a small cluster of 7 machines with three different hardware classes. Two hardware classes are of the same CPU generation, namely "class 1" is Intel Xeon (12 cores, 3 nodes) and "class 2" is Intel i7 (8 cores, 2 nodes). The last, "class 3", is a three years old Intel Xeon (12 cores, 2 nodes). Other parameters of the hardware classes such as the amount of RAM or the presence of GPU/FPGA are not important because the testing application currently does not employ them. The number of Storm slots was set based on the number of cores of each machine. I used multiple cluster configurations with different numbers of machines of each class during the experiments. The example application "Popular Stories" with 6 different components in various degrees of parallelism served as the test suite. The experiment was run on the same set of data 10 times for each scheduler type (schedulers were run 10 times one after another); the performance scheduler was using observations (benchmarks) from all previous schedules.

Different components of the "Popular Stories" application have different demands on performance and process different amounts of tuples over time, so I evaluated the performance improvement in two ways: 1) based on the number of tuples computed by the whole application in a time interval, and 2) based on the number of tuples computed by each component in a time interval. The performance is compared between the worst possible schedule, the standard schedule, and the best schedule, where the worst and best possible schedules are based on the profiling and benchmarking made by the heterogeneity aware scheduler prototype and the standard schedule is made by Storm's "Even scheduler".

3.4.3 Results

The results of the experiments showed that the performance gain of the best scheduling based on profiling and benchmarking depends on various factors, of which the most important one is the structure of the heterogeneous cluster. On a homogeneous cluster, all three scheduling techniques would give almost the same results because neither the worst scheduling nor the best scheduling can utilize the differences of hardware classes. Having a cluster with a small number of "slower" nodes brings some difference between the scheduling techniques, but the difference is still small.

At the same time, the architecture and demands of the application affect the difference between the scheduling approaches too. Heterogeneity of the application's components allows the scheduler to utilize differences in hardware for better performance of the whole system. Generally, with increasing heterogeneity of the application's components, the performance gain of the benchmarking based scheduler over the worst and the standard scheduling grows.

Component W tuples S tuples B tuples S-W gain B-S gain B-W gain

FeedUrlSpout 5706 6207 6214 108.78 % 100.11 % 108.90 %

FeedReaderBolt 2500 2500 2500 100.00 % 100.00 % 100.00 %

DownloaderBolt 547 580 630 106.03 % 108.62 % 115.17 %

AnalyzerBolt 12748 15765 18774 123.67 % 119.09 % 147.27 %

ExtractFeaturesBolt 2605 3363 3992 129.10 % 118.70 % 153.24 %

IndexBolt 2601 3444 3957 132.41 % 114.90 % 152.13 %

Total 26707 31859 36067 116.66 % 110.24 % 129.45 %

W - worst scheduler, S - standard scheduler, B - performance and benchmark based scheduler

S-W gain - gain of standard scheduler over worst scheduler

Fig. 12 Experiment with part of the "Popular Stories" application on the heterogeneous cluster; averages from 10 runs of each scheduler on the same data

In my experiments, the suitability of profiling and benchmarking based scheduling was proven on an application without components that could utilize special hardware such as GPUs or FPGAs. Different components of our application only had different demands on the CPU and overall node performance (e.g., memory speed). In the experiment, the performance gain of the new scheduler in terms of all tuples processed by the whole application was 13 % over Storm's standard scheduler and 35 % over the "worst scheduler". The average gain measured per component type was then 10.2 % over the standard scheduler and 29.5 % over the hypothetical "worst scheduler". For more detailed statistics from multiple tests on different data see Fig. 12. The results of the new scheduler are satisfying but they still strongly depend on the heterogeneity of the cluster and of the application, as described earlier. Finally, I assume that applications employing GPUs or FPGAs for some of their components will benefit even more.


Chapter 4

Previous Work

My previous work was concerned with cloud computing and parallelization of legacy applications dealing with natural language processing; these are areas tightly connected to my current work. As a participant of the mOSAIC multi-cloud API project, I was developing a connector for the Apache CloudStack cloud middleware and I have implemented a cloud based version of an application dealing with processing of scientific papers in a near real time manner [56, 57] that is part of the ReReSearch project. This work is closely connected to the JUNIPER+ project, in which I am currently participating. The JUNIPER+ project is interested in real time computing on large clusters. Our group is responsible for the improvement of scheduling techniques leading to effective utilization of shared clusters.

4.1 ReReSearch

ReReSearch is a project that partially determined my interest in stream processing. It is an experimental "umbrella" project developed by our team, which aims at building a knowledge base and derived personalized portals about research. The key entities it operates on include researchers, teams and universities, papers, reports and deliverables, books, journals, proceedings and various collections, conferences, workshops and seminars, projects and funding agencies. Information on all of these entities has to be interconnected in order to be useful. To gather the data, the system first identifies relevant sources on the Web and then downloads and processes specific web pages and papers.

Fig. 13 ReReSearch—getting the data into the database

To transform the data from an unstructured form into a structured one, information extraction methods are applied. Various people have contributed to building the individual components; my role has been to define the workflow and implement a parallel version of the crucial step of the process—the information extraction from scientific papers.


Papers are collected by special crawlers that search the Web for pages possibly containing links to the papers (e.g., online proceedings or lists of publications linked from homepages of authors that the system already "knows"). The printable formats (PDF, PS, DOC, etc.) are transformed into a semi-structured text (text structuring information such as sections or list items can be recovered), see Fig. 13. The process of information extraction from the texts on one CPU core may take from a few seconds to several minutes, depending on the complexity and the length of the document [56].

Scientific documents gathered by crawlers form a kind of stream with variable bandwidth. My first solution for processing these documents in the desired way was powered by the faculty N1GE grid, but this approach was better suited to processing batches of documents. The responsiveness, in terms of the time from the crawler's discovery of a document to providing the extracted information to the users of the ReReSearch system, was quite slow. The need for other methods was obvious.

In contrast to the example application mentioned in section 3.2, ReReSearch has the potential to become a full application that will benefit from effective utilization of heterogeneous clusters by stream processing applications. Based on these assumptions, my work on the ReReSearch project will continue in the direction of real stream processing.

4.2 mOSAIC

mOSAIC is a language- and platform-agnostic API for exploitation of multi-cloud resources and, at the same time, a portable platform for utilization of cloud services based on the API and cloud usage patterns. It allows one version of an application to be deployed to any supported cloud provider. Moreover, it aims at auto-scalability of applications (i.e., elasticity).

My participation in the project was in three main areas: the example application using the mOSAIC capabilities, the connector for the Apache CloudStack middleware, and the cooperation on the development of Eclipse tools for easy integration of mOSAIC based applications. In the context of my ongoing work, the example application for mOSAIC is interesting. The batch nature of processing on the N1GE engine led us to port the information extraction part of ReReSearch to the cloud using the mOSAIC platform. The newly created system was more flexible in processing streams of scientific documents of various bandwidths. During the mOSAIC project, the need for better schedulers was recognized and I started the research leading to this work. The participation in the mOSAIC project resulted in 2 papers [56, 57].

4.3 JUNIPER+

The JUNIPER project proposes a Real-Time Java based platform built on top of real-time technologies. The objectives are to create the Real-Time Scalable Java Platform and a supporting Methodology and Tool-Flow that can be configured to build and support a range of high-performance Big Data application domains, enabling real-time constraints to be met. Along with that, the project deals with FPGA acceleration of pieces of Java code and improvements of scheduling on heterogeneous clusters.

Our team is developing a real-time scheduling advisor and a new scheduler that will take care of advanced scheduling in clusters with real-time support. The scheduling advisor, moreover, will help developers to enhance the design of their systems according to an available infrastructure. Our team will also participate in the definition of a modeling language for distributed real-time systems.

My ongoing work is closely connected to the JUNIPER+ project. The results of my work will be an important piece of our team's contribution to this project. For example, the use case scenario proposed in section 3.2 has already become the core of the testbed used by our team in the JUNIPER+ project to reveal the actual attributes of stream processing applications on heterogeneous clusters. The work on the JUNIPER+ project has already resulted in one paper [58] and another submitted paper.


Chapter 5

Ongoing Work

1. Investigate the ways leading to effective utilization of heterogeneous clusters

Heterogeneity of large clusters and cloud setups is inevitable over time. The main reasons are the need for growth over time and gradual upgrades of hardware. Simultaneously, the right treatment of computing resources with different capabilities can potentially bring better performance in terms of an increased number of completed jobs per time period and thus a shorter time from queued to finished for each job.

2. Examine the potential of benchmark-based scheduling in combination with other scheduling approaches

Other scheduling approaches have already proven to be beneficial for stream processing applications. One of them is the traffic-based scheduler that optimizes the placement of stream processing tasks to physical nodes in such a way that the tasks with the biggest amount of inter-task communication are placed on the same node (i.e., part of the bandwidth is only intra-node, so the network is less utilized and communication is faster). A combination of these approaches promises even better efficiency of heterogeneous clusters.

3. Propose new approaches to resource allocation and scheduling of stream processing tasks on heterogeneous clusters

All previously described goals lead to the definition of new scheduling algorithms for stream processing applications. My work aims to propose and prototype a new scheduler enhancing the performance of stream processing applications on heterogeneous clusters.

4. Explore how the recommendations of scheduler can be presented to users

Data gathered for scheduling decisions can furthermore be used for analysis of the stream processing application architecture. For example, a wrong decomposition of the system can be revealed this way. Besides that, the right scaling of the stream application's components in correlation with the available hardware resources can be tuned based on the gathered data and on previous scheduling decisions. The interest of my work in this area lies in searching for methods that can present useful knowledge detected during the operation of the stream processing system.

5. Evaluate the new scheduling approaches on appropriate use cases

The new scheduler prototyped during my work will be compared to standard schedulers utilized for stream processing applications. The efficiency can be checked by implementing some subset of algorithms from the DEBS 2014 Grand Challenge20 and by running its testing dataset employing the new scheduler. Later, the effectiveness will be verified on two industrial use cases: a financial application evaluating the safety of credit card transactions in real time, and a social based application for the evaluation of trending topics in a forum and for disclosing the reasons for liking or disliking a specific product. The objective of this evaluation is to prove that the prototype of the scheduler can enhance the performance of these use cases.

20 http://www.cse.iitb.ac.in/debs2014/?page_id=42


Chapter 6

Conclusion

This document overviews my current knowledge and understanding of the topic of scheduling of stream processing systems on heterogeneous clusters and proposes a plan for the future research work towards the doctoral thesis. The State of the Art section briefly introduces the history and basic notions of stream processing, scheduling, and benchmarking. Argumentation about the present unsatisfying situation of stream processing regarding the integration into the big data ecosystems, e.g., Hadoop, about the lack of advanced scheduling algorithms, and about the problems accompanying heterogeneous clusters is discussed in the Beyond the State of the Art section. Separate sections look into solutions exploiting benchmark-based scheduling in combination with other advanced scheduling techniques and into the description of areas that currently lack deeper research. Besides that, my previous work and affiliation, lying in the multi-cloud middleware mOSAIC and in processing of scientific papers in grid and cloud (ReReSearch), was presented. At the same time, the course of my doctoral work, which is strongly interconnected with the JUNIPER+ project, was pointed out multiple times.

The main part of the work then brings the motivations for my research with an intuitive example of the potential lying in the exploitation of the heterogeneity of clusters. Basic ideas and dependencies of benchmark based scheduling are then shown and connected with the state of the art.

Finally, initial experiments are presented and discussed. These show that the novel heterogeneity aware scheduler improved scheduling decisions on a particular use case and a small heterogeneous cluster by more than 10 % over Apache Storm's standard scheduler and by almost 30 % over the hypothetical "very bad" scheduler21.

Future work mainly aims at further improvements of the novel scheduler's performance in various scenarios including larger clusters, bigger data, and different stream processing applications, then at addressing the issues related to eventual decentralization of the scheduler implementation, at accommodation of the scheduler to an MPI based stream processing framework, and at finding out the possibilities of the novel scheduler's interconnection with different scheduling approaches such as the traffic-based scheduler. Another objective is to find ways to make the performance of the benchmarking phase better suitable for production use of the scheduled application, so that the application can be deployed to production faster.

21 A "very bad" or "worst case" scheduler is a special scheduler utilizing the same knowledge as the heterogeneity aware scheduler to achieve the worst schedule; it is used for comparison with the standard and the heterogeneity aware scheduler.


References

[1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of The ACM, vol. 51, no. 1, pp. 107--113, 2008.

[2] R. Lämmel, "Googles MapReduce programming model - Revisited," Science of Computer Programming, vol. 70, no. 1, pp. 1--30, 2008.

[3] M. Elteir, H. Lin and W.-c. Feng, "Enhancing MapReduce via Asynchronous Data Processing," 2010.

[4] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy and R. Sears, "MapReduce Online," 2010.

[5] E. Mazur, B. Li, Y. Diao and P. Shenoy, "Towards Scalable One-Pass Analytics Using MapReduce," pp. 1102--1111, 2011.

[6] B. Lohrmann, D. Warneke and O. Kao, "Nephele streaming: stream processing under QoS constraints at scale," Cluster Computing, pp. 1-18, 2013.

[7] D. J. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker and N. Tatbul, "Aurora: a new model and architecture for data stream management," The Vldb Journal, vol. 12, no. 2, pp. 120--139, 2003.

[8] S. Babu and J. Widom, "Continuous queries over data streams," Sigmod Record, vol. 30, no. 3, pp. 109--120, 2001.

[9] R. Motwani, J. Widom, A. Arasu, B. Babcock, S. Babu, M. Datar, G. S. Manku, C. Olston, J. Rosenstein and R. Varma, "Query Processing, Approximation, and Resource Management in a Data Stream Management System," 2003.

[10] M. Cherniack, H. Balakrishnan, M. Balazinska, D. Carney, U. Çetintemel and Y. Xing, "Scalable Distributed Stream Processing," 2003.

[11] D. J. Abadi, Y. Ahmad, M. Balazinska, U. Çetintemel, M. Cherniack, J.-h. Hwang, W. Lindner, A. Maskey, A. Rasin, E. Ryvkina, N. Tatbul and Y. Xing, "The Design of the Borealis Stream Processing Engine," 2005.

[12] M. Isard, M. Budiu, Y. Yu, A. Birrell and D. Fetterly, "Dryad: distributed data-parallel programs from sequential building blocks," 2007.

[13] V. R. Borkar, M. J. Carey, R. Grover, N. Onose and R. Vernica, "Hyracks: A flexible and extensible foundation for data-intensive computing," 2011.

[14] D. G. Murray, M. Schwarzkopf, C. Smowton, S. Smith and A. Madhavapeddy, "CIEL: a universal execution engine for distributed data-flow computing," 2011.

[15] G. Bosilca, A. Bouteiller, A. Danalis, T. Herault, P. Lemarinier and J. Dongarra, "DAGuE: A generic distributed DAG engine for High Performance Computing," Parallel Computing.

[16] D. Warneke and O. Kao, "Exploiting Dynamic Resource Allocation for Efficient Parallel Data Processing in the Cloud," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 6, pp. 985--997, 2011.

[17] B. Li, E. Mazur, Y. Diao, A. McGregor and P. Shenoy, "A platform for scalable one-pass analytics using MapReduce," pp. 985--996, 2011.

[18] W. Lam, L. Liu, S. Prasad, A. Rajaraman, Z. Vacheri and A. Doan, "Muppet: MapReduce-style Processing of Fast Data," Proc. VLDB Endow., vol. 5, pp. 1814--1825, aug 2012.

[19] L. Neumeyer, B. Robbins, A. Nair and A. Kesari, "S4: Distributed Stream Computing Platform," 2010.

[20] N. Marz, "Storm - Distributed and fault-tolerant realtime computation," Twitter, 2012.

[21] N. Marz, "nathanmarz/storm - GitHub," [Online]. Available: https://github.com/nathanmarz/storm. [Accessed 1 January 2014].

[22] J. Kreps, N. Narkhede and J. Rao, "Kafka: A distributed messaging system for log processing," in Proceedings of the NetDB, 2011.

[23] F. Marozzo, D. Talia and P. Trunfio, "A Framework for Managing MapReduce Applications in Dynamic Distributed Environments," 2011.

[24] M. Noll, "Implementing Real-Time Trending Topics With a Distributed Rolling Count Algorithm in Storm," 18 January 2013. [Online]. Available: http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/. [Accessed 1 January 2014].

[25] E. Bortnikov, A. Frank, E. Hillel and S. Rao, "Predicting Execution Bottlenecks in Map-reduce Clusters," 2012.

[26] G. Bosilca, "Dealing with the Scale Challenge," Innovative Computing Laboratory , [Online]. Available: http://computing.ornl.gov/SC11/documents/Bosilca_OpenMPI_SC11.pdf. [Accessed 2 January 2014].

[27] C. Boeres and A. P. Nascimento, "Dynamic self-scheduling for parallel applications with task dependencies," 2008.

[28] X. Lu and Z. Gu, "A load-adaptive cloud resource scheduling model based on ant colony algorithm," 2011.

[29] R. Sridaran, "A Survey on Resource Allocation Strategies in Cloud Computing," 2012.

[30] S. Krawczyk and K. Bubendorfer, "Grid Resource Allocation: Allocation Mechanisms and Utilisation Patterns," 2008.

[31] T. D. Braun, H. J. Siegel, N. Beck, L. L. Bölöni, M. Maheswaran, A. I. Reuther, J. P. Robertson, M. D. Theys, B. Yao, D. A. Hensgen and R. F. Freund, "A Comparison of Eleven Static Heuristics for Mapping a Class of Independent Tasks onto Heterogeneous Distributed Computing Systems," Journal of Parallel and Distributed Computing, vol. 61, no. 6, pp. 810--837, 2001.

[32] A. A. Khan, C. Mccreary and M. S. Jones, "A Comparison of Multiprocessor Scheduling Heuristics," 1994.

[33] G. Tesauro, N. K. Jongt, R. Das and M. N. Bennani, "A Hybrid Reinforcement Learning Approach to Autonomic Resource Allocation," 2006.

[34] D. Vengerov, "A reinforcement learning approach to dynamic resource allocation," Engineering Applications of Artificial Intelligence, vol. 20, no. 3, pp. 383--390, 2007.

[35] B. C. Csáji, "ADAPTIVE RESOURCE CONTROL Machine Learning Approaches to Resource Allocation in Uncertain and Changing Environments," 2008.

[36] A. Burkimsher, I. Bate and L. S. Indrusiak, "A Survey of Scheduling Metrics and an Improved Ordering Policy for List Schedulers Operating on Workloads with Dependencies and a Wide Variation in Execution Times," Future Gener. Comput. Syst., vol. 29, pp. 2009--2025, oct 2013.

[37] M. A. Iverson and F. Özgüner, "Hierarchical, competitive scheduling of multiple DAGs in a dynamic heterogeneous environment," Distributed Systems Engineering, vol. 6, no. 3, pp. 112--120, 1999.

[38] H. Topcuoglu, S. Hariri and M.-y. Wu, "Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing," IEEE Transactions on Parallel and Distributed Systems, vol. 13, no. 3, pp. 260--274, 2002.


[39] R. Jain, D.-m. Chiu and W. Hawe, "A Quantitative Measure Of Fairness And Discrimination For Resource Allocation In Shared Computer Systems," Computing Research Repository, vol. cs.NI/9809, 1998.

[40] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. OMalley, S. Radia, B. Reed and E. Baldeschwieler, "Apache Hadoop YARN: Yet Another Resource Negotiator," 2013.

[41] M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek and J. Wilkes, "Omega: flexible, scalable schedulers for large compute clusters," 2013.

[42] Facebook Engineering Team, "Under the Hood: Scheduling MapReduce jobs more efficiently with Corona," 2012. [Online]. Available: https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920. [Accessed 1 January 2014].

[43] R. Chaiken, B. Jenkins, P.-Ã. Larson, B. Ramsey, D. Shakib, S. Weaver and J. Zhou, "SCOPE: easy and efficient parallel processing of massive data sets," Proceedings of The Vldb Endowment, vol. 1, no. 2, pp. 1265--1276, 2008.

[44] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker and I. Stoica, "Mesos: A platform for fine-grained resource sharing in the data center," 2010.

[45] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz and I. Stoica, "Improving MapReduce Performance in Heterogeneous Environments," 2008.

[46] L. Aniello, R. Baldoni and L. Querzoni, "Adaptive Online Scheduling in Storm," 2013.

[47] S. Ren, Y. He, S. Elnikety and K. S. McKinley, "Exploiting Processor Heterogeneity in Interactive Services," 2013.

[48] B. Reistad and D. K. Gifford, "Static Dependent Costs for Estimating Execution Time," ACM Sigplan Lisp Pointers, vol. VII, no. 3, pp. 65--78, 1994.

[49] A. Khokhar, V. K. Prasanna, M. Shaaban and C.-L. Wang, "Heterogeneous Supercomputing: Problems and Issues," 1992.

[50] T. Yang and A. Gerasoulis, "DSC: Scheduling Parallel Tasks on an Unbounded Number of Processors," IEEE Transactions on Parallel and Distributed Systems, vol. 5, no. 9, pp. 951--967, 1994.

[51] J. Yang, I. Ahmad and A. Ghafoor, "Estimation of Execution times on Heterogeneous Supercomputer Architectures," 1993.

[52] M. A. Iverson, F. Özgüner and G. J. Follen, "Run-Time Statistical Estimation of Task Execution Times for Heterogeneous Distributed Computing," 1996.

[53] M. V. Devarakonda and R. K. Iyer, "Predictability of Process Resource Usage: A Measurement-Based Study on UNIX," IEEE Transactions on Software Engineering, vol. 15, no. 12, pp. 1579--1586, 1989.

[54] M. A. Iverson, F. Özgüner and L. C. Potter, "Statistical Prediction of Task Execution Times Through Analytic Benchmarking for Scheduling in a Heterogeneous Environment," IEEE Transactions on Computers, vol. 48, no. 12, pp. 99--111, 1999.

[55] S. Gupta, C. Fritz, R. Price, R. Hoover, J. de Kleer and C. Witteveen, "ThroughputScheduler: learning to schedule on heterogeneous Hadoop clusters," 2013.

[56] P. Škoda, S. Šperka and P. Smrž, "Extracting Information from Scientific Papers in the Cloud," in Sixth International Conference on Complex, Intelligent and Software Intensive Systems (CISIS), Palermo, 2012.

[57] S. Šperka, P. Škoda and P. Smrž, "Cloudification of Legacy Information Extraction System," in Proceedings of WoSS 4, Bled, 2012.


[58] M. Rychlý, P. Škoda and P. Smrž, "Scheduling Decisions in Stream Processing on Heterogeneous Clusters," in Proceedings of the CISIS 2014, Birmingham, 2014.


List of Papers

P. Škoda, S. Šperka and P. Smrž, "Extracting Information from Scientific Papers in the Cloud," in Sixth International Conference on Complex, Intelligent and Software Intensive Systems (CISIS), Palermo, 2012. Contribution: 45 %

S. Šperka, P. Škoda and P. Smrž, "Cloudification of Legacy Information Extraction System," in Proceedings of WoSS 4, Bled, 2012. Contribution: 49 %

M. Rychlý, P. Škoda and P. Smrž, "Scheduling Decisions in Stream Processing on Heterogeneous Clusters," in Proceedings of the CISIS 2014, Birmingham, 2014. Contribution: 45 %

M. Rychlý, P. Škoda and P. Smrž, "Heterogeneity-Aware Scheduler for Stream Processing Frameworks," International Journal of Big Data Intelligence Special Issue, 2014 (submitted, waiting for acceptance). Contribution: 45 %


Reprints of Selected Papers

Scheduling Decisions in Stream Processing on Heterogeneous Clusters

Marek Rychlý
Department of Information Systems, Faculty of Information Technology
Brno University of Technology, Brno, Czech Republic
Email: [email protected]

Petr Škoda, Pavel Smrž
Department of Computer Graphics and Multimedia, Faculty of Information Technology
Brno University of Technology, IT4Innovations Centre of Excellence
Brno, Czech Republic
Email: {iskoda,smrz}@fit.vutbr.cz

Abstract—Stream processing is a paradigm evolving in response to the well-known limitations of the widely adopted MapReduce paradigm for big data processing, a hot topic of today's computer world. Moreover, in the field of computation facilities, heterogeneity of data processing clusters, intended or unintended, is starting to be relatively common. This paper deals with scheduling problems and decisions in stream processing on heterogeneous clusters. It brings an overview of the current state of the art of stream processing on heterogeneous clusters with a focus on resource allocation and scheduling. Basic scheduling decisions are discussed and demonstrated on naive scheduling of a sample application. The paper presents a proposal of a novel scheduler for stream processing frameworks on heterogeneous clusters, which employs design-time knowledge as well as benchmarking techniques to achieve optimal resource-aware deployment of applications over the clusters and eventually better overall utilization of the cluster.

Keywords—scheduling; resource-awareness; benchmarking; heterogeneous clusters; stream processing; Apache Storm.

I. INTRODUCTION

As the Internet grows bigger, the amount of data that can be gathered, stored, and processed constantly increases. Traditional approaches to processing of big data, e.g., the data of crawled documents, web request logs, etc., involve mainly batch processing techniques on very large shared clusters running in parallel across hundreds of commodity hardware nodes. For the static nature of such datasets, the batch processing appears to be a suitable technique, both in terms of data distribution and task scheduling, and distributed batch processing frameworks, e.g., the frameworks that implement the MapReduce programming paradigm [1], have proved to be very popular.

However, the traditional approaches developed for the pro-cessing of static datasets cannot provide low latency responsesneeded for continuous and real-time stream processing whennew data is constantly arriving even as the old data is beingprocessed. In the data stream model, some or all of the inputdata that are to be processed are not available in a static dataset,but rather arrive as one or more continuous data streams [2].Traditional distributed processing frameworks like MapReduceare not well suited to process data streams due to their batch-orientation. The response times of those systems are typicallygreater than 30 seconds while real-time processing requiresresponse times in the (sub)seconds range [3].

To address distributed stream processing, several platformsfor data or event stream processing systems have been proposed,

e.g., S4 and Storm [4], [5]. In this paper, we build upon one ofthese distributed stream processing platforms, namely Storm.Storm defines distributed processing in terms of streams ofdata messages flowing from data sources (referred to as spouts)through a directed acyclic graph (referred to as a topology) ofinterconnected data processors (referred to as bolts). A singleStorm topology consists of spouts that inject streams of datainto the topology and bolts that process and modify the data.

Contrary to the distributed batch processing approach, re-source allocation and scheduling in distributed stream process-ing is much more difficult due to dynamic nature of input datastreams. In both cases, the resource allocation deals mainlywith a problem of gathering and assigning resources to thedifferent requesters while scheduling cares about which tasksand when to place on which previously obtained resources [6].

In the case of distributed batch processing, both resourcesallocation and tasks scheduling can be done prior to the process-ing of a batch of jobs based on knowledge of data and tasks forprocessing and of a distributed environment. Moreover, duringbatch processing, required resources are often simply allocatedstatically from the beginning to the end of the processing.

In the case of distributed stream processing, which istypically continuous, dynamic nature of input data and unlimitedprocessing time require dynamic allocation of shared resourcesand real-time scheduling of tasks based on actual intensity ofinput data flow, actual quality of the data, and actual workloadof a distributed environment. For example, resource allocationand task scheduling in Storm involves real-time decision makingconsidering how to replicate bolts and spread them across nodesof a cluster to achieve required scalability and fault tolerance.

This paper deals with problems of scheduling in distributeddata stream processing on heterogeneous clusters. The paperis organized as follows. In Section II, stream processing onheterogeneous clusters is discussed in detail, with focus onresource allocation and task scheduling, and related work andexisting approaches are analysed. In Section III, a use case ofdistributed stream processing is presented. Section IV dealswith scheduling decisions in the use case. Based on the analysisof the scheduling decisions, Section V proposes a concept of anovel scheduling advisor for distributed stream processing onheterogeneous clusters. Since this paper presents an ongoingresearch, Section VI discusses future work on the schedulingadvisor. Finally, Section VII provides conclusions.

II. STREAM PROCESSING ON HETEROGENEOUS CLUSTERS

In homogeneous computing environments, all nodes have identical performance and capacity. Resources can be allocated evenly across all available nodes and effective task scheduling is determined by the quantity of the nodes, not by their individual quality. Typically, resource allocation and scheduling in homogeneous computing environments balance the workload across all the nodes, which should have identical workload.

Contrary to the homogeneous computing environments,there are different types of nodes in a heterogeneous cluster withvarious computing performance and capacity. High-performancenodes can complete the processing of identical data fasterthan low-performance nodes. Moreover, the performance ofthe nodes depends on the character of computation and onthe character of input data. For example, graphic-intensivecomputations will run faster on nodes that are equipped withpowerful GPUs while memory-intensive computation will runfaster on nodes with large amount of RAM or disk space.To balance workload in a heterogeneous cluster optimally,a scheduler has to (1) know performance characteristics forindividual types of nodes employed in the cluster for differenttypes of computations and to (2) know or to be able to analysecomputation characteristics of incoming tasks and input data.

The first requirement, i.e., the performance characteristicsfor individual types of employed nodes, means the awarenessof infrastructure and topology of a cluster including detailedspecification of its individual nodes. In the most cases, thisinformation is provided at the cluster design-time by its adminis-trators and architects. Moreover, the performance characteristicsfor individual nodes employed in a cluster can be adjusted atthe cluster’s run-time based on historical data of performancemonitoring and their statistical analysis of processing differenttypes of computations and data by different types of nodes.

The second requirement is the knowledge or the abilityto analyse computation characteristics of incoming tasks andinput data. In batch processing, tasks and data in a batch canbe annotated or analysed in advance, i.e., before the batch isexecuted, and acquired knowledge can be utilized in optimalallocation of resources and efficient task scheduling. In streamprocessing, the second requirement is much more difficult tomeet due to continuous flow and unpredictable variability ofthe input data which make thorough analysis of computationcharacteristics of the input data and incoming tasks impossible,especially with real-time limitations in their processing.

To address the above mentioned issues of stream processing in heterogeneous clusters with optimal performance, user-defined tasks processing (at least some of) the input data have to help the scheduler. For example, an application may include user-defined helper-tasks tagging input data at run-time by their expected computation characteristics for better scheduling1. Moreover, individual tasks of a stream application should be tagged at design-time according to their required computation resources and real-time constraints on the processing to help with their future scheduling. Implementation of the mentioned tagging of tasks at design-time should be part of modelling (a meta-model) of topology and infrastructure of such applications.

1 e.g., parts of variable-bit-rate video streams with temporary high bit-rate will be tagged for processing by special nodes with powerful video decoders, while average bit-rate parts can be processed by common nodes
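The design-time tagging of tasks could, for instance, be expressed as plain Java annotations on the processor classes. The following is only an illustrative sketch: the @ResourceProfile annotation and its fields are hypothetical and are not part of Apache Storm or of the meta-model proposed here; a scheduler could read such tags via reflection when placing processors.

    // Hypothetical design-time resource tag (not a Storm API); the field names
    // below are assumptions made for illustration only.
    import java.lang.annotation.ElementType;
    import java.lang.annotation.Retention;
    import java.lang.annotation.RetentionPolicy;
    import java.lang.annotation.Target;

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface ResourceProfile {
        boolean needsGpu() default false;       // prefer GPU-equipped nodes
        int minMemoryMb() default 512;          // lower bound on node memory
        boolean cpuIntensive() default false;   // prefer fast-CPU nodes
    }

    // Example: a bolt doing image feature extraction prefers GPU nodes.
    @ResourceProfile(needsGpu = true, minMemoryMb = 1024)
    class ImageFeatureExtractorBolt { /* bolt implementation omitted */ }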

With the knowledge of the performance characteristics forindividual types of nodes employed in a cluster and with theknowledge or the ability to analyse computation characteristicsof incoming tasks and input data, a scheduler has enoughinformation for balancing workload of the cluster nodes andoptimizing throughput of an application. Related schedulingdecisions, e.g., rebalancing of the workload, are usually done pe-riodically with an optimal frequency. An intensive rebalancingof the workload across the nodes can cause high overhead whilean occasional rebalancing may not utilize all nodes optimally.

A. Related Work

Over the past decade, stream processing has been the sub-ject of a vivid research. Existing approaches can essentiallybe categorised by scalability into centralized, distributed, andmassively-parallel stream processors. In this section, we willfocus mainly on distributed and massively-parallel stream pro-cessors but also on their successors exploiting ideas of theMapReduce paradigm in the context of the stream processing.

In distributed stream processors, related work is mainly based on Aurora* [7], which has been introduced for scalable distributed processing of data streams. An Aurora* system is a set of Aurora* nodes that cooperate via an overlay network within the same administrative domain. The nodes can freely relocate load by decentralized, pairwise exchange of the Aurora stream operators. Sites running Aurora* systems from different administrative domains can be integrated into a single federated system by Medusa [7]. Borealis [8] has introduced a refined QoS optimization model for Aurora*/Medusa where the effects of load shedding on QoS can be computed at every point in the data flow, which enables better strategies for load shedding.

Massively-parallel data processing systems, in contrast tothe distributed (and also centralised) stream processors, havebeen designed to run on and efficiently transfer large datavolumes between hundreds or even thousands of nodes. Tradi-tionally, those systems have been used to process finite blocks ofdata stored on distributed file systems. However, newer systemssuch as Dryad [9], Hyracks [10], CIEL [11], DAGuE [12], orNephele framework [13] allow to assemble complex paralleldata flow graphs and to construct pipelines between individualparts of the flow. Therefore, these parallel data flow systemsin general are also suitable for the streaming applications.

The latest related work is based mainly on the MapReduceparadigm or its concepts in the context of stream processing.At first, Hadoop Online [14] extended the original Hadoopby ability to stream intermediate results from map to reducetasks as well as the possibility to pipeline data across differentMapReduce jobs. To facilitate these new features, the semanticsof the classic reduce function has been extended by time-based sliding windows. Li et al. [15] picked up this idea andfurther improved the suitability of Hadoop-based systems forcontinuous streams by replacing the sort-merge implementationfor partitioning by a new hash-based technique. The Muppetsystem [16], on the other hand, replaced the reduce functionof MapReduce by a more generic and flexible update function.

S4 [4] and Apache Storm [5], which is used in this paper, can also be classified as massively-parallel data processing systems with a clear emphasis on low latency. They are not based on MapReduce but allow developers to assemble an arbitrarily complex directed acyclic graph (DAG) of processing tasks. For example, Storm does not use intermediate queues to pass data items between tasks. Instead, data items are passed directly between the tasks using batch messages on the network level to achieve a good balance between latency and throughput.

The distributed and massively-parallel stream processorsmentioned above usually do not explicitly solve adaptive re-source allocation and task scheduling in heterogeneous en-vironments. For example, in [17], authors demonstrate howAurora*/Medusa handles time-varying load spikes and pro-vides high availability in the face of network partitions. Theyconcluded that Medusa with the Borealis extension does notdistribute load optimally but it guarantees acceptable allocations;i.e., either no participant operates above its capacity, or, ifthe system as a whole is overloaded, then all participantsoperate at or above capacity. The similar conclusions can bedone also in the case of the previously mentioned massively-parallel data processing systems. For example, DAGuE does nottarget heterogeneous clusters which utilize commodity hardwarenodes but can handle intra-node heterogeneity of clusters ofsupercomputers where a runtime DAGuE scheduler decides atruntime which tasks to run on which resources [12].

Another already mentioned massively-parallel stream pro-cessing system, Dryad [9], is equipped with a robust schedulerwhich takes care of nodes liveness and rescheduling of failedjobs and tracks execution speed of different instances of eachprocessor. When one of these instances underperforms the oth-ers, a new instance is scheduled in order to prevent slowdownsof the computation. Dryad scheduler works in greedy mode, itdoes not consider sharing of cluster among multiple systems.

Finally, in the case of approaches based on the MapReduceparadigm or its concepts, resource allocation and schedulingof stream processing on heterogeneous clusters is necessarydue to utilization of commodity hardware nodes. In streamprocessing, data placement and distribution are given by a user-defined topology (e.g., by pipelines in Hadoop Online [14],or by a DAG of interconnected spouts and bolts in ApacheStorm [5]). Therefore, approaches to the adaptive resourceallocation and scheduling have to discuss initial distributionand periodic rebalancing of workload (i.e., tasks, not data)across nodes according to different processing performance andspecialisation of individual nodes in a heterogeneous cluster.

For instance, S4 [4] uses Apache ZooKeeper to coordinateall operations and for communication between nodes. Initially,a user defines in ZooKeeper which nodes should be used forparticular tasks of a computation. Then, S4 employs some nodesas backups for possible node failure and for load balancing.

Adaptive scheduling in Apache Storm has been addressed in [18] by two generic schedulers that adapt their behaviour according to a topology and a run-time communication pattern of an application. Experiments showed an improvement in latency of event processing in comparison with the default Storm scheduler; however, the schedulers do not take into account the requirements discussed in the beginning of Section II, i.e., the explicit knowledge of performance characteristics for individual types of nodes employed in a cluster for different types of computations and the ability to analyse computation characteristics of incoming tasks and input data. By implementation of these requirements, efficiency of the scheduling can be improved.

III. USE CASE

To demonstrate the scheduling problems anticipated in the current state of the art of stream processing on heterogeneous clusters, a sample application is presented in this section. The application "Popular Stories" implements a use case of processing of a continuous stream of web-pages from thousands of RSS feeds. It analyses the web-pages in order to find texts and photos identifying the most popular connections between persons and related keywords and pictures. The result is a list of triples (a person's name, a list of keywords, and a set of photos) with the meaning: a person recently frequently mentioned in the context of the keywords (e.g., events, objects, persons, etc.) and the photos. The application holds a list of triples with the most often seen persons, keywords, and pictures for some period of time. This way, current trends of persons related to keywords with relevant photos can be obtained2.

The application utilizes Java libraries and components from various research projects and Apache Storm as a stream processing framework, including its partial integration into Apache Hadoop as a data distribution platform. Figure 1 depicts the spout and bolt components of the application and its topology, as known from Apache Storm. The components can be scaled into multiple instances and deployed on different cluster nodes.

The stream processing starts with the URL generator spout, which extracts URLs of web-pages from RSS feeds. After that, Downloader gets the (X)HTML source, styles, and pictures of each web-page and encapsulates them into a stand-alone message. The message is passed to the Analyzer bolt, which searches the web-page for person names and for keywords and pictures in the context of the names found. The resulting pairs of name-keyword are stored in the tops lists in In-memory store NK, which is updated each time a new pair arrives and excessively old pairs are removed from computation of the list. In other words, a window of a time period is held for the tops list. All changes in the tops list are passed to In-memory store NKP.

Moreover, pairs of name-picture emitted by Analyzer are processed in Image feature extractor to get indexable features of each image, which later allows to detect different instances of the same pictures (e.g., the same photo in different resolution or with different cropping). The image features are sent to In-memory store NP where the tops list of the most popular persons and related unique image pairs is held. The memory stores employ the search engine Apache Lucene3 with the distributed Hadoop-based storage Katta4 for Lucene indexes to detect different instances of the same pictures as mentioned above. All modifications in the tops list of In-memory store NP are emitted to In-memory store NKP, which maintains a consolidated tops list of persons with related keywords and pictures. This tops list is persistent and available for further querying.
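As a rough illustration, the topology described above could be wired in Storm roughly as follows. This is a sketch only: the spout and bolt class names, parallelism hints, and stream identifiers ("name-keyword", "name-picture") are assumptions made for this example and do not come from the actual implementation.

    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;

    public class PopularStoriesTopologySketch {
        public static TopologyBuilder build() {
            TopologyBuilder builder = new TopologyBuilder();
            // Spout extracting URLs of web-pages from RSS feeds.
            builder.setSpout("url-generator", new UrlGeneratorSpout(), 1);
            builder.setBolt("downloader", new DownloaderBolt(), 4)
                   .shuffleGrouping("url-generator");
            builder.setBolt("analyzer", new AnalyzerBolt(), 4)
                   .shuffleGrouping("downloader");
            // Name-keyword pairs go to the NK store; name-picture pairs go to the
            // image feature extraction. The stream ids are assumed for this sketch.
            builder.setBolt("store-nk", new InMemoryStoreBolt("NK"), 2)
                   .fieldsGrouping("analyzer", "name-keyword", new Fields("name"));
            builder.setBolt("image-extractor", new ImageFeatureExtractorBolt(), 2)
                   .shuffleGrouping("analyzer", "name-picture");
            builder.setBolt("store-np", new InMemoryStoreBolt("NP"), 2)
                   .fieldsGrouping("image-extractor", new Fields("name"));
            // The consolidated NKP store receives all changes from both stores.
            builder.setBolt("store-nkp", new InMemoryStoreBolt("NKP"), 1)
                   .globalGrouping("store-nk")
                   .globalGrouping("store-np");
            return builder;
        }
    }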

Individual components of the application described above, both spouts and bolts, utilize various types of resources to perform various types of processing. More specifically, URL generator and Downloader have low CPU requirements, Analyzer requires a fast CPU, Image feature extractor can use a GPU using OpenCL, and In-memory stores require a large amount of memory. Therefore, the application should utilize a heterogeneous cluster with adaptive resource allocation and scheduling.

2 The application has been developed for demonstration purposes only, particularly to evaluate the scheduling advisor described in the paper. However, it may be used also in practice, e.g., for visualisation of popular news on persons (with related keywords and photos) from news feeds on the Internet.

3 https://lucene.apache.org/
4 http://katta.sourceforge.net/

Figure 1. A Storm topology of the sample application ("S"-nodes are Storm spouts generating data and "B"-nodes are Storm bolts processing the data).

IV. SCHEDULING DECISIONS IN THE STREAM PROCESSING

Schedulers make their decisions on a particular level of abstraction. They do not try to schedule all live tasks to all possible cluster nodes but just deal with units of equal or scalable size. For example, the YARN scheduler uses Containers with various amounts of cores and memory, and Apache Storm uses Slots of equal size (one slot per CPU core) where, in each slot, multiple spouts or bolts of the same topology may run.

One of the important and commonly adopted scheduler decisions is data locality. For instance, the main idea of MapReduce is to perform computations by the nodes where the required data are saved to prevent intensive data loading to and removing from a cluster. Data locality decisions from the stream processing perspective are different because processors usually do not operate on data already stored in a processing cluster but rather on streams coming from remote sources. Thus, in stream processing, we consider the data locality to be an approach to minimal communication costs, which results, for example, in scheduling of the most communicating processor instances together to the same node or the same rack.
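One possible reading of this communication-cost criterion is sketched below: pairs of processor instances are sorted by observed tuple traffic and the heaviest pairs are greedily placed on the same node. This is neither the Storm scheduler nor the advisor proposed later; the traffic statistics, the pair-key encoding, and the slot counts are assumptions of the example.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Greedy co-location of the most communicating processor pairs (sketch only).
    class TrafficAwarePlacement {

        // trafficPerPair: observed tuples/s keyed by "procA|procB";
        // freeSlots: free slots per node (all nodes expected as keys).
        static Map<String, String> place(Map<String, Long> trafficPerPair,
                                         Map<String, Integer> freeSlots) {
            Map<String, String> nodeOf = new HashMap<>();
            List<Map.Entry<String, Long>> pairs = new ArrayList<>(trafficPerPair.entrySet());
            pairs.sort((x, y) -> Long.compare(y.getValue(), x.getValue()));

            for (Map.Entry<String, Long> pair : pairs) {
                String[] procs = pair.getKey().split("\\|");
                List<String> unplaced = new ArrayList<>();
                String target = null;
                for (String proc : procs) {
                    if (nodeOf.containsKey(proc)) target = nodeOf.get(proc);
                    else unplaced.add(proc);
                }
                if (unplaced.isEmpty()) continue;
                // Prefer the partner's node; fall back to the emptiest node that fits.
                if (target == null || freeSlots.get(target) < unplaced.size()) {
                    for (Map.Entry<String, Integer> node : freeSlots.entrySet()) {
                        if (node.getValue() >= unplaced.size()
                                && (target == null || node.getValue() > freeSlots.get(target))) {
                            target = node.getKey();
                        }
                    }
                }
                if (target == null || freeSlots.get(target) < unplaced.size()) continue;
                for (String proc : unplaced) {
                    nodeOf.put(proc, target);
                    freeSlots.put(target, freeSlots.get(target) - 1);
                }
            }
            return nodeOf;
        }
    }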

The optimal placement of tasks across cluster nodes may, moreover, depend on other requirements beyond the communication costs mentioned above. Typically, we are talking about CPU performance or overall node performance that makes the processing faster. For example, the performance optimization may lie in detection of tasks which are exceedingly slow in comparison to the others with the same signature. More sophisticated approaches are based on various kinds of benchmarks performed on each node in a cluster while the placement of a task is decided with respect to its detected performance on a particular node or a class of nodes. Furthermore, the presence of resources, e.g., GPU or FPGA, can be taken into account.

There are two essential kinds of scheduling decisions: offline decisions and online decisions. The former are based on the knowledge the scheduler has before any task is placed and running. In the context of stream processing, this knowledge is mostly the topology, and the offline decisions can, for example, consider communication channels between nodes. Online decisions are made with information gathered during the actual execution of an application, i.e., after or during the initial placement of its tasks over the cluster nodes. So the counterpart of the offline topology-based communication decision is a decision derived from the real bandwidths required between running processor instances [18]. In effect, most of the scheduling decisions in stream processing are made online or based on historical online data.

A. Storm Default Scheduler

The Storm default scheduler uses a simple round-robin strategy. It deploys bolts and spouts (collectively called processors) so that each node in a topology has an almost equal number of processors running in each slot, even for multiple topologies running in the same cluster. When tasks are scheduled, the round-robin scheduler simply counts all available slots on each node and puts processor instances to be scheduled one at a time to each node while keeping the order of nodes constant.
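A much simplified sketch of this round-robin behaviour is shown below. The actual Storm implementation works on executors, supervisors, and worker slots, but the dealing-out principle is the same; the slot names in the example are invented.

    import java.util.Arrays;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Simplified round-robin placement: executors are dealt one at a time to the
    // available slots, keeping the slot order constant (sketch, not Storm code).
    class RoundRobinSketch {

        static Map<Integer, String> assign(List<Integer> executors, List<String> slots) {
            Map<Integer, String> placement = new LinkedHashMap<>();
            int i = 0;
            for (Integer executor : executors) {
                placement.put(executor, slots.get(i++ % slots.size()));
            }
            return placement;
        }

        public static void main(String[] args) {
            List<String> slots = Arrays.asList("fast-cpu:1", "fast-cpu:2", "gpu:1",
                                               "high-mem:1", "slow-cpu:1");
            // Seven executor ids spread evenly over five slots.
            System.out.println(assign(Arrays.asList(1, 2, 3, 4, 5, 6, 7), slots));
        }
    }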

In a shared heterogeneous Storm cluster running multiple topologies of different stream processing applications, the round-robin strategy may, for the sample application described in Section III, result in the scenario depicted in Figure 2. The depicted cluster consists of four nodes with different hardware configurations, i.e., fast CPU, slow CPU, lots of memory, and GPU equipped (see Figure 2), so the number of slots available at each node differs, but the same portion of each node is utilized as the consequence of round-robin scheduling. Moreover, the default scheduler did not respect the different requirements of the processors. The Analyzers requiring CPU performance were placed on the node with lots of memory, while the memory-greedy In-memory stores were scheduled to the nodes with a powerful GPU and a slow CPU, which led to the need for a higher level of parallelism of "MS NP". The fast CPU node then runs the undemanding Downloaders and the URL generator. Finally, the Image feature extractors were placed on the slow CPU node and the high memory node. The scheduling decision was therefore clearly suboptimal and results in inefficient utilization of the cluster.

URL—URL generator; Dx—Downloader; Ax—Analyzer; IEx—Image feature extractor; MSx—In-memory store

Figure 2. Possible results of the Storm default round-robin scheduler.

V. PROPOSING SCHEDULING ADVISOR

The proposed scheduling advisor targets offline decisions derived from results of performance test sets of each resource type in combination with a particular component (processor). Therefore, every application should be benchmarked on a particular cluster prior to its run in production.

The benchmarking will run the application with production-like data and, after an initial random or round-robin placement of processors over nodes, it will reschedule processors so that each processor is benchmarked on each class of hardware nodes. The performance of processors will be measured as the number of tuples processed in a time period. Finally, with data from the benchmarks, scheduling in production will minimize the overall loss of performance of a deployment on particular resources in comparison to the performance of the ideal deployment, i.e., the one where each processor runs on the node with the top performance measured in the benchmarking phase.
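The loss-minimising placement could, for instance, be approximated by the greedy heuristic sketched below: processors whose measured throughput differs the most between node classes are placed first, each on the best node class that still has a free slot. The data structures and the heuristic itself are assumptions for illustration, not the advisor's actual algorithm.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Greedy approximation of the loss-minimising deployment (sketch only).
    class BenchmarkBasedPlacement {

        // throughput: benchmarked tuples/s per processor and node class;
        // freeSlots: free slots per node class. Returns a node class per processor.
        static Map<String, String> place(Map<String, Map<String, Double>> throughput,
                                         Map<String, Integer> freeSlots) {
            Map<String, String> placement = new HashMap<>();
            // Most placement-sensitive processors (largest best-worst spread) first.
            List<String> processors = new ArrayList<>(throughput.keySet());
            processors.sort((a, b) -> Double.compare(spread(throughput.get(b)),
                                                     spread(throughput.get(a))));
            for (String processor : processors) {
                String best = null;
                double bestRate = -1.0;
                for (Map.Entry<String, Double> e : throughput.get(processor).entrySet()) {
                    if (freeSlots.getOrDefault(e.getKey(), 0) > 0 && e.getValue() > bestRate) {
                        best = e.getKey();
                        bestRate = e.getValue();
                    }
                }
                if (best == null) continue;   // no free slot on any measured node class
                placement.put(processor, best);
                freeSlots.put(best, freeSlots.get(best) - 1);
            }
            return placement;
        }

        // Difference between the best and the worst measured throughput.
        static double spread(Map<String, Double> perNodeClass) {
            return Collections.max(perNodeClass.values()) - Collections.min(perNodeClass.values());
        }
    }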

Later, the scheduler can also utilize performance data captured during the production run. These data will be taken into consideration as a reflection of possible changes of the processed data, and new scheduling decisions will prefer them over the performance information from the benchmarking phase. Moreover, with employment of production performance data, an application can be deployed initially using the round-robin strategy and then gradually rescheduled in reasonable intervals. The first few reschedules have to be random to gather initial differences in performance per processor and node class. Then, the scheduler can deploy some of the processors to the currently best known nodes and other processors to nodes with yet unknown performance.
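This gradual, partly random rescheduling resembles an explore/exploit policy; a minimal sketch is given below. The exploration probability is an assumed tunable and the method is illustrative only.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;
    import java.util.Set;

    // Occasionally send a processor to a node class with unknown performance so
    // that production measurements can complement the benchmarks (sketch only).
    class ExploringPlacement {
        private final Random random = new Random();
        private final double exploreProbability = 0.2;   // assumed tunable

        String chooseNodeClass(Map<String, Double> knownThroughput, Set<String> nodeClasses) {
            List<String> unknown = new ArrayList<>(nodeClasses);
            unknown.removeAll(knownThroughput.keySet());
            if (!unknown.isEmpty()
                    && (knownThroughput.isEmpty() || random.nextDouble() < exploreProbability)) {
                return unknown.get(random.nextInt(unknown.size()));          // explore
            }
            return Collections.max(knownThroughput.entrySet(),               // exploit
                                   Map.Entry.comparingByValue()).getKey();
        }
    }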

URL—URL generator; Dx—Downloader; Ax—Analyzer; IEx—Image feature extractor; MSx—In-memory store

Figure 3. The advanced scheduling in a heterogeneous cluster (High memory and Fast CPU nodes are mutually swapped in comparison with Figure 2).

However, when omitting the benchmarking phase, a new application without historical performance data may temporarily underperform and more instances of its processors may be needed to increase parallelism. On the other hand, without the benchmarking phase, the new application can be deployed without delay and utilize even nodes that have not yet been benchmarked (e.g., new nodes or nodes occupied by other applications during the benchmark phase on a shared cluster).

A. Scheduling of the Example Use Case Application

The proposed scheduler tries to deploy processors to available slots that are running on nodes with the most suitable resource profile. Therefore, the scheduler will deploy fewer instances of the processors than the Storm default scheduler in the same cluster, probably even with higher throughput. In the case of the sample application, the deployment by the proposed scheduler may be as depicted in Figure 3. In-memory stores were deployed on the node with a high amount of memory and Image feature extractors were deployed on the node with two GPUs, so it was possible to reduce parallelism. Undemanding Downloaders were placed on the Slow CPU node and Analyzers utilize the Fast CPU node. Possibly even more effective scheduling may be achieved by a combination of the pre-production and production benchmarking discussed in Section V. Then, the scheduling decisions can be based on actual bandwidths between processors with consideration of trade-offs between bandwidth availability on particular nodes shared among multiple applications and availability of more suitable nodes from the perspective of performance.

VI. DISCUSSION AND FUTURE WORK

Since this paper presents ongoing work, in this section, wediscuss preliminary results and outline possible further improve-ments. Ongoing work mainly deals with three topics: imple-mentation of the scheduling advisor and the sample application,evaluation of the proposed approach, and its improvement basedon results of the evaluation.

At the time of writing this paper, the implementation of the application described in the paper was in progress and the scheduling advisor was in the phase of design. The scheduling advisor will be realised as a Storm scheduler implementing the IScheduler interface provided by the Storm API. Besides the functionality available via the Storm API, the scheduler will utilize the Apache Ambari project to monitor the Hadoop platform where a Storm cluster and other services will be running (e.g., Zookeeper coordinating various daemons within the Storm cluster, YARN managing resources of the cluster, or Hive distributed storage of large datasets generated by performance, workload, and configuration monitoring of the cluster). Apache Ambari can provide the scheduler with information on the status of host systems and of individual services that run on them, including the status of individual jobs running on those services.
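A skeleton of such a pluggable scheduler is sketched below. The interface and class names follow the Storm 0.9.x scheduler API (backtype.storm.scheduler); everything inside the method bodies is only indicated by comments and does not reflect the actual implementation of the advisor.

    import java.util.Map;

    import backtype.storm.scheduler.Cluster;
    import backtype.storm.scheduler.IScheduler;
    import backtype.storm.scheduler.Topologies;
    import backtype.storm.scheduler.TopologyDetails;

    // Skeleton of the scheduling advisor plugged in through Storm's IScheduler.
    public class SchedulingAdvisor implements IScheduler {

        @Override
        public void prepare(Map conf) {
            // Load benchmark results and monitoring endpoints (e.g., Ambari) from conf.
        }

        @Override
        public void schedule(Topologies topologies, Cluster cluster) {
            for (TopologyDetails topology : cluster.needsSchedulingTopologies(topologies)) {
                // 1. Look up benchmarked throughput per processor and node class.
                // 2. Compute a loss-minimising assignment (cf. the earlier sketch).
                // 3. Assign executors to free worker slots via cluster.assign(...).
            }
        }
    }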

After prototype implementation of the application and thescheduling advisor, we plan to perform thorough evaluation, todetermine performance boost and workload distribution statis-tics and to compare these values for the scheduling advisor, thedefault Storm scheduler, and for generic schedulers proposed in[18]. The evaluation will also include performance monitoringand analysis of the overhead introduced by the schedulers (e.g.,by monitoring and by reallocation of resources or reschedulingof processors), both per node and for the whole Storm cluster.

Finally, there are several possible improvements and open issues which have yet to be addressed. These range from improvement of the scheduling algorithm performance, which is important for real-time processing (e.g., by using a hidden Markov chain based prediction algorithm for predicting input stream intensity and characteristics), through problems connected with automatic scaling of components (elasticity), to the issue of total decentralisation of the scheduler and all its components, which will otherwise be "a single point of failure".

VII. CONCLUSION

This paper described problems of adaptive scheduling ofstream processing applications on heterogeneous clusters andpresented ongoing research towards the novel scheduling ad-visor. In the paper, we outlined general requirements to thescheduling in stream processing on heterogeneous clustersand analysed the state-of-the-art approaches introduced in therelated works. We also described the sample application ofstream processing in heterogeneous clusters, analysed schedul-ing decisions, and proposed the novel scheduler for the ApacheStorm distributed stream processing platform based on theknowledge acquired in the previous phases.

The sample application and the proposed scheduler are stillwork-in-progress. We are currently implementing the applica-tion and a first prototype of the scheduler to be able to performan evaluation of the proposed approach in practice. Our futurework mainly aims at possible improvements of the schedulerperformance, which is important for real-time processing, ataddressing the problems connected with automatic scaling ofprocessing components (i.e., their elasticity), and at addressingthe issues related to eventual decentralisation of the schedulerimplementation.

ACKNOWLEDGEMENT

This work was supported by the BUT FIT grant FIT-S-14-2299, the European Regional Development Fund in theproject CZ.1.05/1.1.00/02.0070 “The IT4Innovations Centre of

Excellence”, and by the EU 7FP ICT project no. 318763 “JavaPlatform For High Performance And Real-Time Large ScaleData Management” (JUNIPER).

REFERENCES

[1] J. Dean and S. Ghemawat, “Mapreduce: simplified data processing onlarge clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113,2008.

[2] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, “Modelsand issues in data stream systems,” in Proceedings of the twenty-firstACM SIGMOD-SIGACT-SIGART symposium on Principles of databasesystems. ACM, 2002, pp. 1–16.

[3] A. Brito, A. Martin, T. Knauth, S. Creutz, D. Becker, S. Weigert,and C. Fetzer, “Scalable and low-latency data processing with streammapreduce,” in 2011 IEEE Third International Conference on CloudComputing Technology and Science (CloudCom). IEEE, 2011, pp.48–58.

[4] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari, “S4: Distributedstream computing platform,” in 2010 IEEE International Conference onData Mining Workshops (ICDMW). IEEE, 2010, pp. 170–177.

[5] N. Marz, “Apache Storm,” https://git-wip-us.apache.org/repos/asf?p=incubator-storm.git, 2014, Git repository.

[6] V. Vinothina, S. Rajagopal, and P. Ganapathi, “A survey on resource allo-cation strategies in cloud computing,” International Journal of AdvancedComputer Science and Applications, vol. 3, no. 6, 2012.

[7] M. Cherniack, H. Balakrishnan, M. Balazinska, D. Carney, U. Cetintemel,Y. Xing, and S. B. Zdonik, “Scalable distributed stream processing,” inCIDR, vol. 3, 2003, pp. 257–268.

[8] D. J. Abadi, Y. Ahmad, M. Balazinska, U. Cetintemel, M. Cherniack,J.-H. Hwang, W. Lindner, A. S. Maskey, A. Rasin, E. Ryvkina, N. Tatbul,Y. Xing, , and S. Zdonik, “The design of the borealis stream processingengine,” in CIDR, vol. 5, 2005, pp. 277–289.

[9] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad: distributeddata-parallel programs from sequential building blocks,” ACM SIGOPSOperating Systems Review, vol. 41, no. 3, pp. 59–72, 2007.

[10] V. Borkar, M. Carey, R. Grover, N. Onose, and R. Vernica, “Hyracks:A flexible and extensible foundation for data-intensive computing,” in2011 IEEE 27th International Conference on Data Engineering (ICDE).IEEE, 2011, pp. 1151–1162.

[11] D. G. Murray, M. Schwarzkopf, C. Smowton, S. Smith, A. Mad-havapeddy, and S. Hand, “CIEL: a universal execution engine fordistributed data-flow computing,” in Proceedings of the 8th USENIX Con-ference on Networked Systems Design and Implementation. USENIXAssociation, 2011.

[12] G. Bosilca, A. Bouteiller, A. Danalis, T. Herault, P. Lemarinier, andJ. Dongarra, “DAGuE: A generic distributed dag engine for high per-formance computing,” Parallel Computing, vol. 38, no. 1, pp. 37–51,2012.

[13] D. Warneke and O. Kao, “Exploiting dynamic resource allocation forefficient parallel data processing in the cloud,” IEEE Transactions onParallel and Distributed Systems, vol. 22, no. 6, pp. 985–997, 2011.

[14] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, andR. Sears, “Mapreduce online,” in Proceedings of the 7th USENIX Con-ference on Networked Systems Design and Implementation. USENIXAssociation, 2010.

[15] B. Li, E. Mazur, Y. Diao, A. McGregor, and P. Shenoy, “A platform forscalable one-pass analytics using MapReduce,” in Proceedings of the2011 ACM SIGMOD International Conference on Management of data.ACM, 2011, pp. 985–996.

[16] W. Lam, L. Liu, S. Prasad, A. Rajaraman, Z. Vacheri, and A. Doan,“Muppet: MapReduce-style processing of fast data,” Proceedings of theVLDB Endowment, vol. 5, no. 12, pp. 1814–1825, 2012.

[17] M. Balazinska, H. Balakrishnan, and M. Stonebraker, “Load managementand high availability in the Medusa distributed stream processing system,”in Proceedings of the 2004 ACM SIGMOD international conference onManagement of data. ACM, 2004, pp. 929–930.

[18] L. Aniello, R. Baldoni, and L. Querzoni, “Adaptive online schedulingin Storm,” in Proceedings of the 7th ACM international conference onDistributed event-based systems. ACM, 2013, pp. 207–218.

Heterogeneity-Aware Scheduler for Stream ProcessingFrameworks

Abstract: MapReduce and its well-known implementation Hadoop have some widely discussed limitations. This led to the evolution of new big data paradigms, of which the hottest one nowadays is the stream processing paradigm. Along with that, ever-growing data-centres deal more often with intended and unintended heterogeneity of their clusters. This article focuses on the problems and decisions of stream processing application scheduling on heterogeneous clusters. An overview of the current state of the art of stream processing on heterogeneous clusters with a focus on resource allocation and scheduling is given. Common scheduling approaches used with stream processing frameworks are discussed and their disadvantages in a heterogeneous environment are demonstrated on a simple stream application. Finally, the article presents a novel heterogeneity-aware scheduler for stream processing frameworks based on design-time knowledge as well as benchmarking techniques, which leads to a near optimal resource-aware deployment over the cluster nodes and thus better utilization of the cluster itself.

Keywords: scheduling; resource-awareness; benchmarking; heterogeneousclusters; stream processing; Apache Storm.

Biographical notes:

1 Introduction

As the Internet grows bigger, the amount of data that can be gathered, stored, and processedconstantly increases. Traditional approaches to processing of big data, e.g., the data ofcrawled documents, web request logs, etc., involves mainly batch processing techniques onvery large shared clusters running in parallel across hundreds of commodity hardware nodes.For the static nature of such datasets, the batch processing appears to be a suitable technique,both in terms of data distribution and task scheduling, and distributed batch processingframeworks, e.g., the frameworks that implement the MapReduce programming paradigmDean and Ghemawat (2008), have proved to be very popular.

However, the traditional approaches developed for the processing of static datasetscannot provide low latency responses needed for continuous and real-time stream processingwhen new data is constantly arriving even as the old data is being processed. In the datastream model, some or all of the input data that are to be processed are not available ina static dataset but rather arrive as one or more continuous data streams Babcock et al.(2002). Traditional distributed processing frameworks like MapReduce are not well suitedto process data streams due to their batch-orientation. The response times of those systemsare typically greater than 30 seconds while real-time processing requires response times inthe (sub)seconds range Brito et al. (2011).

To address distributed stream processing, several platforms for data or event streamprocessing systems have been proposed, e.g., S4 and Storm Neumeyer et al. (2010); Marz(2014). In this article, we build upon one of these distributed stream processing platforms,namely Storm. Storm defines distributed processing in terms of streams of data messages

Copyright © 2012 Inderscience Enterprises Ltd.


flowing from data sources (referred to as spouts) through a directed acyclic graph (referredto as a topology) of interconnected data processors (referred to as bolts). A single Stormtopology consists of spouts that inject streams of data into the topology and bolts that processand modify the data.

Contrary to the distributed batch processing approach, resource allocation and schedulingin distributed stream processing is much more difficult due to dynamic nature of input datastreams. In both cases, the resource allocation deals mainly with a problem of gathering andassigning resources to the different requesters while scheduling cares about which tasks andwhen to place on which previously obtained resources Vinothina et al. (2012).

In the case of distributed batch processing, both resources allocation and tasks schedulingcan be done prior to the processing of a batch of jobs based on knowledge of data andtasks for processing and of a distributed environment. Moreover, during batch processing,required resources are often simply allocated statically from the beginning to the end of theprocessing.

In the case of distributed stream processing, which is typically continuous, dynamicnature of input data and unlimited processing time require dynamic allocation of sharedresources and real-time scheduling of tasks based on actual intensity of input data flow, actualquality of the data, and actual workload of a distributed environment. For example, resourceallocation and task scheduling in Storm involves real-time decision making considering howto replicate bolts and spread them across nodes of a cluster to achieve required scalabilityand fault tolerance.

This article deals with problems of scheduling in distributed data stream processing onheterogeneous clusters. The article is organized as follows. In Section 2, stream processingon heterogeneous clusters is discussed in detail, with focus on resource allocation and taskscheduling, and related work and existing approaches are analysed. In Section 3, a use caseof distributed stream processing is presented. Section 4 deals with scheduling decisionsin the use case. Based on the analysis of the scheduling decisions, Section 5 proposes aconcept of a novel scheduling advisor for distributed stream processing on heterogeneousclusters. The implementation and evaluation of the proposed scheduling advisor is describedin Section 6. Furthermore, Section 7 describes possible utilisation of the scheduling advisorin development process of applications for distributed stream processing. Finally, Section 8outlines future work on the scheduling advisor and provides conclusions of the article.

2 Stream Processing on Heterogeneous Clusters

In homogeneous computing environments, all nodes have identical performance and capacity.Resources can be allocated evenly across all available nodes and effective task schedulingis determined by quantity of the nodes, not by their individual quality. Typically, resourceallocation and scheduling in the homogeneous computing environments balance of workloadacross all the nodes which should have identical workload.

Contrary to the homogeneous computing environments, there are different types ofnodes in a heterogeneous cluster with various computing performance and capacity. High-performance nodes can complete the processing of identical data faster than low-performancenodes. Moreover, the performance of the nodes depends on the character of computationand on the character of input data. For example, graphic-intensive computations will runfaster on nodes that are equipped with powerful GPUs while memory-intensive computationwill run faster on nodes with large amount of RAM or disk space. To balance workload in a


heterogeneous cluster optimally, a scheduler has to (1) know performance characteristics forindividual types of nodes employed in the cluster for different types of computations and to(2) know or to be able to analyse computation characteristics of incoming tasks and inputdata.

The first requirement, i.e., the performance characteristics for individual types ofemployed nodes, means the awareness of infrastructure and topology of a cluster includingdetailed specification of its individual nodes. In the most cases, this information is providedat the cluster design-time by its administrators and architects. Moreover, the performancecharacteristics for individual nodes employed in a cluster can be adjusted at the cluster’srun-time based on historical data of performance monitoring and their statistical analysis ofprocessing different types of computations and data by different types of nodes.

The second requirement is the knowledge or the ability to analyse computationcharacteristics of incoming tasks and input data. In batch processing, tasks and data ina batch can be annotated or analysed in advance, i.e., before the batch is executed, andacquired knowledge can be utilized in optimal allocation of resources and efficient taskscheduling. In stream processing, the second requirement is much more difficult to meetdue to continuous flow and unpredictable variability of the input data which make thoroughanalysis of computation characteristics of the input data and incoming tasks impossible,especially with real-time limitations in their processing.

To address the above mentioned issues of stream processing in heterogeneous clusterswith optimal performance, user-defined tasks processing (at least some of) the input datahas to help the scheduler. For example, an application may include user-defined helper-tasks tagging input data at run-time by their expected computation characteristics forbetter scheduling1. Moreover, individual tasks of a stream application should be tagged atdesign-time according to their required computation resources and real-time constraintson the processing to help with their future scheduling. Implementation of the mentionedtagging of tasks at design-time should be part of modelling (a meta-model) of topology andinfrastructure of such applications.

With the knowledge of the performance characteristics for individual types of nodesemployed in a cluster and with the knowledge or the ability to analyse the computationcharacteristics of incoming tasks and input data, a scheduler has enough information forbalancing workload of the cluster nodes and optimizing throughput of an application. Relatedscheduling decisions, e.g., rebalancing of the workload, are usually done periodically withan optimal frequency. An intensive rebalancing of the workload across the nodes can causehigh overhead while an occasional rebalancing may not utilize all nodes optimally.

2.1 Related Work

Over the past decade, stream processing has been the subject of a vivid research. Existingapproaches can essentially be categorised by scalability into centralized, distributed, andmassively-parallel stream processors. In this section, we will focus mainly on distributedand massively-parallel stream processors but also on their successors exploiting ideas of theMapReduce paradigm in the context of the stream processing.

In distributed stream processors, related work is mainly based on Aurora* Cherniacket al. (2003), which has been introduced for scalable distributed processing of data streams.An Aurora* system is a set of Aurora* nodes that cooperate via an overlay network within thesame administrative domain. The nodes can freely relocate load by decentralized, pairwiseexchange of the Aurora stream operators. Sites running Aurora* systems from different


administrative domains can be integrated into a single federated system by Medusa Cherniacket al. (2003). Borealis Abadi et al. (2005) has introduced a refined QoS optimization modelfor Aurora*/Medusa where the effects of load shedding on QoS can be computed at everypoint in the data flow, which enables better strategies for load shedding.

Massively-parallel data processing systems, in contrast to the distributed (and alsocentralised) stream processors, have been designed to run on and efficiently transfer largedata volumes between hundreds or even thousands of nodes. Traditionally, those systemshave been used to process finite blocks of data stored on distributed file systems. However,newer systems such as Dryad Isard et al. (2007), Hyracks Borkar et al. (2011), CIEL Murrayet al. (2011), DAGuE Bosilca et al. (2012), or Nephele framework Warneke and Kao (2011)allow to assemble complex parallel data flow graphs and to construct pipelines betweenindividual parts of the flow. Therefore, these parallel data flow systems in general are alsosuitable for the streaming applications.

The latest related work is based mainly on the MapReduce paradigm or its concepts inthe context of stream processing. At first, Hadoop On-line Condie et al. (2010) extendedthe original Hadoop by ability to stream intermediate results from map to reduce tasks aswell as the possibility to pipeline data across different MapReduce jobs. To facilitate thesenew features, the semantics of the classic reduce function has been extended by time-basedsliding windows. Li et al. Li et al. (2011) picked up this idea and further improved thesuitability of Hadoop-based systems for continuous streams by replacing the sort-mergeimplementation for partitioning by a new hash-based technique. The Muppet system Lamet al. (2012), on the other hand, replaced the reduce function of MapReduce by a moregeneric and flexible update function.

S4 Neumeyer et al. (2010) and Apache Storm Marz (2014), which is used in this article,can also be classified as massively-parallel data processing systems with a clear emphasis onlow latency. They are not based on MapReduce but allows developers to assemble arbitrarilya complex directed acyclic graph (DAG) of processing tasks. For example, Storm does notuse intermediate queues to pass data items between tasks. Instead, data items are passeddirectly between the tasks using batch messages on the network level to achieve a goodbalance between latency and throughput.

The distributed and massively-parallel stream processors mentioned above usually donot explicitly solve adaptive resource allocation and task scheduling in heterogeneousenvironments. For example, in Balazinska et al. (2004), authors demonstrate howAurora*/Medusa handles time-varying load spikes and provides high availability in the faceof network partitions. They concluded that Medusa with the Borealis extension does notdistribute load optimally but it guarantees acceptable allocations; i.e., either no participantoperates above its capacity, or, if the system as a whole is overloaded, then all participantsoperate at or above capacity. The similar conclusions can be done also in the case of thepreviously mentioned massively-parallel data processing systems. For example, DAGuEdoes not target heterogeneous clusters which utilize commodity hardware nodes but canhandle intra-node heterogeneity of clusters of supercomputers where a runtime DAGuEscheduler decides at runtime which tasks to run on which resources Bosilca et al. (2012).

Another already mentioned massively-parallel stream processing system, Dryad Isardet al. (2007), is equipped with a robust scheduler which takes care of nodes liveness andrescheduling of failed jobs and tracks execution speed of different instances of each processor.When one of these instances under-performs the others, a new instance is scheduled in orderto prevent slowdowns of the computation. Dryad scheduler works in greedy mode, it doesnot consider sharing of cluster among multiple systems.


Finally, in the case of approaches based on the MapReduce paradigm or its concepts,resource allocation and scheduling of stream processing on heterogeneous clusters isnecessary due to utilization of commodity hardware nodes. In stream processing, dataplacement and distribution are given by a user-defined topology (e.g., by pipelines in HadoopOn-line Condie et al. (2010), or by a DAG of interconnected spouts and bolts in Apache StormMarz (2014)). Therefore, approaches to the adaptive resource allocation and schedulinghave to discuss initial distribution and periodic rebalancing of workload (i.e., tasks, not data)across nodes according to different processing performance and specialisation of individualnodes in a heterogeneous cluster.

For instance, S4 Neumeyer et al. (2010) uses Apache ZooKeeper to coordinate alloperations and for communication between nodes. Initially, a user defines in ZooKeeperwhich nodes should be used for particular tasks of a computation. Then, S4 employs somenodes as backups for possible node failure and for load balancing.

Adaptive scheduling in Apache Storm has been addressed in Aniello et al. (2013) bytwo generic schedulers that adapt their behaviour according to a topology and a run-timecommunication pattern of an application. Experiments shown improvement in latency ofevent processing in comparison with the default Storm scheduler, however, the schedulersdo not take into account the requirements discussed in the beginning of Section 2, i.e., theexplicit knowledge of performance characteristics for individual types of nodes employedin a cluster for different types of computations and the ability to analyse computationcharacteristics of incoming tasks and input data. By implementation of these requirements,efficiency of the scheduling can be improved.

3 Use Case

To demonstrate the scheduling problems anticipated in the current state of the art of streamprocessing on heterogeneous clusters, an sample application is presented in this section. Theapplication “Popular Stories” implements a use case of processing of continuous stream ofweb-pages from thousands of RSS feeds. It analyses the web-pages in order to find textsand photos identifying the most popular connections between persons and related keywordsand pictures. The result is a list of triples (a person’s name, a list of keywords, and a set ofphotos) with meaning: a person recently frequently mentioned in context of the keywords(e.g., events, objects, persons, etc.) and the photos. The application holds a list of tripleswith the most often seen persons, keywords, and pictures for some period of time. This way,current trends of persons related to keywords with relevant photos can be obtained2.

The application utilizes Java libraries and components from various research projectsand Apache Storm as a stream processing framework. Figure 1 depicts spouts and boltscomponents of the application and its topology, as known from Apache Storm. Thecomponents can be scaled into multiple instances and deployed on different cluster nodes.

The stream processing starts by the URL generator spout, which extracts URLs ofweb-pages from RSS feeds. After that, Downloader gets the (X)HTML source, styles, andpictures of each web-page and encapsulates them into a stand-alone message. The message ispassed to the Analyzer bolt, which searches the web-page for person names and for keywordsand pictures in context of the names found. The resulting pairs of name-keyword are storedin the tops lists in In-memory store NK, which is updated each time a new pair arrives andexcessively old pairs are removed from computation of the list. In other words, the window


of a time period is held for the tops list. All changes in the tops list are passed to In-memory store NKP.

Figure 1 A Storm topology of the sample application ("S"-nodes are Storm spouts generating data and "B"-nodes are Storm bolts processing the data).

Moreover, pairs of name-picture emitted by Analyzer are processed in Image featureextractor to get indexable features of each image, which later allows to detect differentinstances of the same pictures (e.g., the same photo in different resolution or with differentcropping). The image features are sent to In-memory store NP where the tops list of the mostpopular persons and related unique images pairs is held. The memory stores employ searchengine Apache Lucene3 with distributed Hadoop-based storage Katta4 for Lucene indexesto detect different instances of the same pictures as mentioned above. All modifications inthe tops list of In-memory store NP are emitted to In-memory store NKP, which maintainsa consolidated tops list of persons with related keywords and pictures. This tops list ispersistent and available for further querying.

Individual components of the application described above, both spouts and bolts, utilizevarious types of resources to perform various types of processing. More specifically, URLgenerator and Downloader have low CPU requirements, Analyzer requires fast CPU, Imagefeature extractor can use GPU using OpenCL, and In-memory stores require large amountof memory. Therefore, the application may utilize a heterogeneous cluster with adaptiveresource allocation and scheduling.


4 Scheduling Decisions in Stream Processing

Schedulers make their decisions at a particular level of abstraction. They do not try to schedule all live tasks to all possible cluster nodes but deal with units of equal or scalable size. For example, the YARN scheduler uses Containers with various amounts of cores and memory, and Apache Storm uses Slots of equal size (one slot per CPU core) where, in each slot, multiple spouts or bolts of the same topology may run.

One of the important and commonly adopted scheduler decisions is data locality. For instance, the main idea of MapReduce is to perform computations on the nodes where the required data are stored, to prevent intensive loading of data to, and removal of data from, a cluster. Data locality decisions from the stream processing perspective are different because processors usually do not operate on data already stored in a processing cluster but rather on streams coming from remote sources. Thus, in stream processing, we consider data locality to be an approach to minimizing communication costs, which results, for example, in scheduling the most intensively communicating processor instances together on the same node or the same rack.

The optimal placement of tasks across cluster nodes may, moreover, depend on other requirements beyond the communication costs mentioned above. Typically, these concern CPU performance or overall node performance that makes the processing faster. For example, the performance optimization may lie in the detection of tasks which are exceedingly slow in comparison to others with the same signature. More sophisticated approaches are based on various kinds of benchmarks performed on each node in a cluster, while the placement of a task is decided with respect to its measured performance on a particular node or a class of nodes. Furthermore, the presence of special resources, e.g., a GPU or an FPGA, can be taken into account.

There are two essential kinds of scheduling decisions: off-line decisions and on-line decisions. The former are based on the knowledge the scheduler has before any task is placed and running. In the context of stream processing, this knowledge is mostly the topology, and the off-line decisions can, for example, consider communication channels between nodes. On-line decisions are made with information gathered during the actual execution of an application, i.e., after or during the initial placement of its tasks over the cluster nodes. The counterpart of the off-line, topology-based communication decision is thus a decision derived from the real bandwidths required between running processor instances (Aniello et al., 2013). In effect, most of the scheduling decisions in stream processing are made on-line or are based on historical on-line data.

4.1 Storm’s Default Scheduler

Storm's default scheduler uses a simple round-robin strategy. It deploys bolts and spouts (collectively called processors) so that each node in a topology has an almost equal number of processors running in each slot, even for multiple topologies running in the same cluster. When tasks are scheduled, the round-robin scheduler simply counts all available slots on each node and assigns the processor instances to be scheduled one at a time to each node while keeping the order of nodes constant.
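
The following simplified sketch illustrates this round-robin policy. It is our own illustration, not Storm's actual code; the Node type and the node and executor names are hypothetical placeholders loosely matching the sample cluster of Figure 2.

import java.util.*;

/** Simplified illustration of round-robin placement of executors over nodes (not Storm's actual code). */
class RoundRobinPlacement {

    /** A hypothetical cluster node with a fixed number of equally sized slots. */
    record Node(String name, int slots) {}

    /**
     * Assigns executors to nodes one at a time, keeping the order of nodes constant,
     * so every node receives an almost equal number of executors regardless of its hardware.
     */
    static Map<String, List<String>> assign(List<Node> nodes, List<String> executors) {
        Map<String, List<String>> placement = new LinkedHashMap<>();
        nodes.forEach(n -> placement.put(n.name(), new ArrayList<>()));
        for (int i = 0; i < executors.size(); i++) {
            Node target = nodes.get(i % nodes.size());      // next node in the fixed order
            placement.get(target.name()).add(executors.get(i));
        }
        return placement;
    }

    public static void main(String[] args) {
        List<Node> cluster = List.of(new Node("fast-cpu", 8), new Node("gpu", 4),
                                     new Node("slow-cpu", 6), new Node("high-mem", 8));
        List<String> executors = List.of("URL", "D1", "D2", "A1", "A2", "A3",
                                         "IE1", "IE2", "IE3", "MS-NK", "MS-NP", "MS-NKP");
        System.out.println(assign(cluster, executors));     // hardware differences are ignored
    }
}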

In a shared heterogeneous Storm cluster running multiple topologies of different stream processing applications, the round-robin strategy may, for the sample application described in Section 3, result in the scenario depicted in Figure 2. The depicted cluster consists of four nodes with different hardware configurations, i.e., a fast CPU, a slow CPU, a large amount of memory, and a GPU (see Figure 2), so the number of slots available at each node differs, but the same portion of each node is utilized as a consequence of the round-robin scheduling.


Figure 2 Possible results of the Storm default round-robin scheduler on four nodes: Fast CPU node (8 slots), GPU node (4 slots), Slow CPU node (6 slots), and High memory node (8 slots). URL—URL generator; Dx—Downloader; Ax—Analyzer; IEx—Image feature extractor; MSx—In-memory store.

Moreover, the default scheduler did not respect the different requirements of the processors. The Analyzers requiring CPU performance were placed on the node with lots of memory, while the memory-hungry In-memory stores were scheduled to the nodes with a powerful GPU and a slow CPU, which led to the need for a higher level of parallelism of "MS NP". The fast CPU node then runs the undemanding Downloaders and the URL generator. Finally, the Image feature extractors were placed on the slow CPU node and the high memory node. The scheduling decision is therefore clearly suboptimal and results in inefficient utilization of the cluster.

5 Proposed Scheduling Advisor

The proposed scheduling advisor targets off-line decisions derived from the results of performance test sets for each resource type in combination with a particular component (processor). Therefore, every application should be benchmarked on a particular cluster prior to its run in production.

The benchmarking will run the application with production-like data and, after an initial random or round-robin placement of processors over the nodes, it will reschedule the processors so that each processor is benchmarked on each class of hardware nodes. The performance of processors will be measured as the number of tuples processed in a time period. Finally, with the data from the benchmarks, scheduling in production will minimize the overall loss of performance of the deployment on particular resources in comparison to the performance of the ideal deployment, i.e., the one where each processor runs on the node with the top performance measured in the benchmarking phase.
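
One possible formalization of this objective (our own notation, not taken from the text): let perf(p, c) denote the benchmarked throughput, in tuples per time unit, of processor p on hardware class c, and let a(p) be the class to which an assignment a maps p. The scheduler then looks for the assignment minimizing the accumulated loss against the ideal deployment, subject to the slot capacities of the nodes:

\mathrm{loss}(a) = \sum_{p \in P} \Bigl( \max_{c \in C} \mathrm{perf}(p, c) - \mathrm{perf}\bigl(p, a(p)\bigr) \Bigr), \qquad a^{*} = \operatorname*{arg\,min}_{a : P \to C} \mathrm{loss}(a).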

Later, the scheduler can also utilize performance data captured during the production run. These data will be taken into consideration as a reflection of possible changes in the processed data, and new scheduling decisions will prefer them over the performance information from the benchmarking phase.


Figure 3 The advanced scheduling in a heterogeneous cluster (the High memory and Fast CPU nodes are mutually swapped in comparison with Figure 2); nodes: High memory node (8 slots), GPU node (4 slots), Fast CPU node (8 slots), Slow CPU node (6 slots). URL—URL generator; Dx—Downloader; Ax—Analyzer; IEx—Image feature extractor; MSx—In-memory store.

Moreover, with the employment of production performance data, an application can be deployed initially using round-robin and then gradually rescheduled at reasonable intervals. The first few reschedules have to be random to gather the initial differences in performance per processor and node class. Then, the scheduler can deploy some of the processors to the currently best known nodes and other processors to nodes with yet unknown performance. However, when omitting the benchmarking phase, a new application without historical performance data may temporarily under-perform, and more instances of its processors may be needed to increase parallelism. On the other hand, without the benchmarking phase, the new application can be deployed without delay and can utilize even nodes that have not yet been benchmarked (e.g., new nodes or nodes occupied by other applications during the benchmarking phase on a shared cluster).
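
This gradual rescheduling can be read as a simple explore/exploit loop. The fragment below is our own illustrative sketch, not code from the prototype; the class and method names are hypothetical. Processors with measurements for every hardware class are sent to their best known class, the rest to a class that still lacks data.

import java.util.*;

/** Illustrative explore/exploit placement of processors over hardware classes (hypothetical sketch). */
class GradualRescheduler {

    /** Averaged measured throughput (tuples per interval) per processor and hardware class. */
    private final Map<String, Map<String, Double>> measured = new HashMap<>();
    private final List<String> hwClasses;
    private final Random random = new Random();

    GradualRescheduler(List<String> hwClasses) { this.hwClasses = hwClasses; }

    /** Records a new measurement, naively smoothing repeated observations. */
    void record(String processor, String hwClass, double tuplesPerInterval) {
        measured.computeIfAbsent(processor, p -> new HashMap<>())
                .merge(hwClass, tuplesPerInterval, (oldVal, newVal) -> (oldVal + newVal) / 2.0);
    }

    /** Chooses the hardware class for the processor in the next rescheduling interval. */
    String nextPlacement(String processor) {
        Map<String, Double> known = measured.getOrDefault(processor, Map.of());
        List<String> unexplored = hwClasses.stream().filter(c -> !known.containsKey(c)).toList();
        if (!unexplored.isEmpty()) {
            // Explore: place the processor on a class with yet unknown performance.
            return unexplored.get(random.nextInt(unexplored.size()));
        }
        // Exploit: place the processor on its currently best known class.
        return Collections.max(known.entrySet(), Map.Entry.comparingByValue()).getKey();
    }
}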

5.1 Scheduling of the Example Use Case Application

The proposed scheduler tries to deploy processors to available slots that are running on nodes with the most suitable resource profile. Therefore, the scheduler will deploy fewer instances of the processors than Storm's default scheduler in the same cluster, probably even with a higher throughput. In the case of the sample application, the deployment produced by the proposed scheduler may be as depicted in Figure 3. The In-memory stores were deployed on the node with a large amount of memory and the Image feature extractors were deployed on the node with two GPUs, so it was possible to reduce their parallelism. The undemanding Downloaders were placed on the slow CPU node and the Analyzers utilize the fast CPU node. Possibly even more effective scheduling may be achieved by a combination of the pre-production and production benchmarking discussed in Section 5. Then, the scheduling decisions can be based on the actual bandwidths between processors, considering trade-offs between the bandwidth availability on particular nodes shared among multiple applications and the availability of more suitable nodes in terms of performance.


Figure 4 Architecture of the scheduling advisor (hardware platform, application model, Juniper performance monitoring; the scheduling advisor comprises the macro-scheduling part with its monitoring, analysis, and scheduling components, and the advisor part with schedulability analysis and simulation).


6 Implementation and Evaluation of the Scheduling Advisor

The scheduling advisor has been developed within the Juniper project5 as a part of a Java platform supporting high-performance applications for real-time access and processing of streaming and stored data.

The scheduling advisor consists of two main components (see Figure 4): the macro-scheduling component and the advisor component. The macro-scheduling component takes care of scheduling decisions made on a particular hardware platform in the production or pre-production deployment of a Juniper application. The advisor component, on the other hand, analyses performance data gathered during the production or pre-production deployment, combines them with information from the modelling provided by developers, and shows possible shortcomings in the application design. As this article deals primarily with scheduling, the advisor component mentioned above will be omitted and the rest of this section discusses only the macro-scheduling component.

6.1 The Macro-scheduling Component

The macro-scheduling component is further divided into three subcomponents (see Figure 4): the monitoring component, the analysis component, and the scheduling component.

• Monitoring component: Gathers data about the performance of individual instances of application components deployed on various hardware configurations. More precisely, it traces the execution times of program instances, which is the most important metric for the scheduling advisor prototype.

• Analysis component: Computes the performance characteristics of application components running on individual hardware classes (i.e., pairs [component, HW-class]) based on the data gathered by the monitoring component (a minimal sketch of such a data structure follows this list). Along with that, this component produces the first output of the scheduling advisor, namely profiling results of an application and benchmarking results for its individual components over the various deployments of the application.

• Scheduling component: Based on the data from the analysis component, the scheduling component prepares new deployments of application components over the hardware platform, either to provide more data for the analysis component or to improve the overall performance of the application as a whole. The scheduling component produces the second output of the scheduling advisor, the best possible deployment of the application on a particular platform.
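
The sketch below (our illustration; the names are not taken from the prototype) shows how the analysis component could aggregate monitoring samples per [component, HW-class] pair and expose the best known class per component to the scheduling component.

import java.util.*;

/** Illustrative aggregation of monitoring data into per-[component, HW-class] characteristics. */
class PerformanceTable {

    /** component -> (hardware class -> measured tuples-per-interval samples) */
    private final Map<String, Map<String, List<Double>>> samples = new HashMap<>();

    void addSample(String component, String hwClass, double tuplesPerInterval) {
        samples.computeIfAbsent(component, c -> new HashMap<>())
               .computeIfAbsent(hwClass, h -> new ArrayList<>())
               .add(tuplesPerInterval);
    }

    /** Mean throughput of a component on a hardware class, or NaN when no data exist. */
    double meanThroughput(String component, String hwClass) {
        return samples.getOrDefault(component, Map.of())
                      .getOrDefault(hwClass, List.of())
                      .stream().mapToDouble(Double::doubleValue).average().orElse(Double.NaN);
    }

    /** Hardware class with the best mean throughput for the given component, if any data exist. */
    Optional<String> bestClass(String component) {
        return samples.getOrDefault(component, Map.of()).keySet().stream()
                      .max(Comparator.comparingDouble(c -> meanThroughput(component, c)));
    }
}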

The macro-scheduling component of the scheduling advisor prototype consumes three inputs in its different subcomponents. The first input is a deployment package of a particular implementation of a Juniper application; it is utilised by the scheduling component, which takes care of the actual deployment of the application components over the Juniper platform. The second input is a description of a particular hardware platform, which is employed by the scheduling component and the analysis component. These two components need to know the hardware classes of the nodes in the hardware platform, and the counts of nodes belonging to particular hardware classes, to correctly observe and use the performance data of different application components. The third input is a defined degree of parallelism of each application component; it is utilized by the scheduling component to correctly deploy the Juniper application.

6.2 Evaluation of the Scheduling Advisor Prototype

To evaluate the concept of the scheduling advisor described in this article, the prototype has been implemented as a pluggable scheduler for Apache Storm. Apache Storm was chosen as the substrate for the scheduling advisor prototype because its model of computation is a subset of the Juniper platform's computation model with a focus on stream processing and low latency, and, at the same time, it offers a way to easily implement our example application "Popular Stories" described in Section 3.

Different schedulers make their decisions at a specific level of abstraction (i.e., they do not schedule tasks to computation nodes as a whole); usually, units of equal or scalable size are used. Apache Storm uses Slots of equal size (one slot per CPU core), as mentioned in Section 4. In each slot, multiple executors (i.e., multiple components of a Juniper application) of the same topology may run. Storm's default scheduler then, using a round-robin strategy, deploys executors so that each node in a topology has an almost equal number of executors running in each slot. The rule of an almost equal number of executors is maintained even when multiple topologies are running in the same cluster.
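
As an illustration of the plugin mechanism, the skeleton below shows how a custom scheduler hooks into Storm; the interface and class names follow the backtype.storm.scheduler package of Storm 0.9.x as we recall it, so treat the snippet as a hedged sketch rather than the prototype's actual code. The slot-selection helper is a hypothetical placeholder for the benchmark-driven logic.

import java.util.List;
import java.util.Map;

import backtype.storm.scheduler.Cluster;
import backtype.storm.scheduler.ExecutorDetails;
import backtype.storm.scheduler.IScheduler;
import backtype.storm.scheduler.Topologies;
import backtype.storm.scheduler.TopologyDetails;
import backtype.storm.scheduler.WorkerSlot;

/** Sketch of a heterogeneity-aware pluggable scheduler for Apache Storm (assumed 0.9.x API). */
public class AdvisorScheduler implements IScheduler {

    @Override
    public void prepare(Map conf) {
        // Load the hardware platform description (host name -> hardware class) from the configuration.
    }

    @Override
    public void schedule(Topologies topologies, Cluster cluster) {
        for (TopologyDetails topology : cluster.needsSchedulingTopologies(topologies)) {
            Map<String, List<ExecutorDetails>> pending =
                    cluster.getNeedsSchedulingComponentToExecutors(topology);
            List<WorkerSlot> freeSlots = cluster.getAvailableSlots();

            for (Map.Entry<String, List<ExecutorDetails>> entry : pending.entrySet()) {
                // Hypothetical helper: choose a free slot on a node whose hardware class
                // showed the best benchmarked throughput for this component.
                WorkerSlot slot = pickBestSlotFor(entry.getKey(), freeSlots);
                if (slot != null) {
                    cluster.assign(slot, topology.getId(), entry.getValue());
                    freeSlots.remove(slot);
                }
            }
        }
    }

    private WorkerSlot pickBestSlotFor(String componentId, List<WorkerSlot> freeSlots) {
        // Placeholder: consult the analysis component's per-[component, HW-class] data.
        return freeSlots.isEmpty() ? null : freeSlots.get(0);
    }
}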

The scheduling advisor prototype is implemented partly inside Apache Storm (the scheduling component and the analysis component) and partly inside the example application itself (the monitoring component). The monitoring data produced by each application component instance are currently saved in a centralized relational database, which allows easy statistical querying over the data.


As a centralized database is not suitable for distributed environments, in future versions of the scheduling advisor the monitoring data will be stored, together with the rest of the performance data gathered from the platform, in the Juniper platform's monitoring system.

The scheduling component has straightforward access to Storm's APIs for executor placement, removal, and status, so scheduling can easily be done by executor type and host name (host names are mapped to hardware classes using the hardware platform description). The analysis component operates inside Storm's cluster master daemon, called Nimbus, the same place where the scheduler resides. The analysis component queries the database with monitoring data and provides an API exposing per [hardware class, application component] performance data and known placements to the scheduling component.

The prototype implementation of the scheduling advisor is suitable for experiments with heterogeneous clusters in the sense of different hardware used across the cluster nodes (e.g., different CPUs, GPUs, amounts of memory, or static acceleration). It allows the application to be periodically rescheduled over the heterogeneous cluster and the performance of application components on different hardware classes to be observed.

Our experiments were made on a small cluster of 7 machines with three different hardware classes. Two hardware classes are of the same CPU generation, namely "class 1" is an Intel Xeon (12 cores, 3 nodes) and "class 2" is an Intel i7 (8 cores, 2 nodes). The last, "class 3", is a three-year-old Intel Xeon (12 cores, 2 nodes). Other parameters of the hardware classes, such as the amount of RAM or the presence of a GPU/FPGA, are not important because the testing application currently does not employ them. The number of Storm slots was set based on the number of cores of each machine. We used multiple cluster configurations with different numbers of machines of each class during the experiments. Our example application "Popular Stories" with 6 different components in various degrees of parallelism served as the test suite.

Different components of the "Popular Stories" application have different demands on performance and process different amounts of tuples over time, so we evaluated the performance improvement in two ways: 1) based on the number of tuples computed by the whole application in a time interval, and 2) based on the number of tuples computed by each component in the time interval. The performance is compared between the worst possible schedule, the standard schedule, and the best schedule, where the worst and the best possible schedules are based on the profiling and benchmarking made by the scheduling advisor prototype and the standard schedule is made by Storm's "Even scheduler".
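
The gains reported in Figure 5 are plain throughput ratios between two schedules measured over the same time interval; for example, the B-S gain of the whole application follows from the totals in the table:

\mathrm{gain}_{B\text{-}S} = \frac{\mathrm{tuples}_{B}}{\mathrm{tuples}_{S}} \cdot 100\,\%, \qquad \frac{309418}{297135} \cdot 100\,\% \approx 104.13\,\%.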

The results of the experiments showed that the performance gain of the best scheduling based on profiling and benchmarking depends on various factors, of which the most important one is the structure of the heterogeneous cluster. On a homogeneous cluster, all three scheduling techniques would give almost the same results because neither the worst nor the best scheduling can exploit differences between hardware classes. A cluster with a small number of "slower" nodes brings some difference between the scheduling techniques, but the difference is still small. Finally, on a cluster with only a few nodes of greater performance, the best scheduling based on profiling and benchmarking brings the greatest difference.

At the same time, the architecture and demands of the application affect the difference between the scheduling approaches too. Heterogeneity of the application's components allows the scheduler to exploit differences in hardware for better performance of the whole system. Generally, with increasing heterogeneity of the application's components, the performance gain of the benchmarking-based scheduler over the worst and the standard scheduling grows.


Component            W tuples   S tuples   B tuples   S-W gain   B-S gain   B-W gain
AnalyzerBolt           135096     163993     164714   121.39 %   100.44 %   121.92 %
DownloaderBolt           1396       1499       1494   107.38 %    99.67 %   107.02 %
ExtractFeaturesBolt     39745      41867      47991   105.34 %   114.63 %   120.75 %
FeedReaderBolt           1580       1576       1576    99.75 %   100.00 %    99.75 %
FeedUrlSpout            45711      46334      45654   101.36 %    98.53 %    99.88 %
IndexBolt               39744      41866      47989   105.34 %   114.63 %   120.75 %
Total                  263272     297135     309418   112.86 %   104.13 %   117.53 %

W - worst scheduler, S - standard scheduler, B - performance and benchmark based scheduler
S-W gain - gain of standard scheduler over worst scheduler; B-S gain - gain of benchmark-based scheduler over standard scheduler; B-W gain - gain of benchmark-based scheduler over worst scheduler

Figure 5 Scheduler performance comparison – cumulative results of multiple tests based on the number of tuples processed in a time interval.

In our experiments, the suitability of profiling and benchmarking based scheduling was demonstrated on an application without components that could utilize special hardware such as a GPU or an FPGA. The different components of our application only had different demands on the CPU and on overall node performance (e.g., memory speed). The performance gain of our scheduler in terms of all tuples processed by the whole application was 4.1 % over Storm's standard scheduler and 17.5 % over the "worst scheduler". The average gain measured per component type was then 6.8 % over the standard scheduler and 11.7 % over the "worst scheduler". For more detailed statistics from multiple tests on different data see Figure 5. The results of our scheduler are satisfying, but they still strongly depend on the heterogeneity of the cluster and of the application, as described earlier. Finally, we assume that applications employing GPUs or FPGAs for some of their components will benefit even more.

7 Utilisation of the Scheduling Advisor in the Development Process

The outputs of the macro-scheduling component, which were described in the previous section, are produced at run-time, processed by the advisor component, and utilised to improve the assessed Juniper applications at their design-time. The development and deployment of a Juniper application requires the cooperation of different roles of responsible participants who can utilise the advisor's outputs in the development process. These roles are namely: (1) an analyst, who describes the required functionality of a Juniper application including its time-based constraints which have to be met at runtime; (2) an architect, who designs and describes the application's architecture in detail with respect to the Juniper platform; (3) a developer, who implements the application as a distributed system of concurrently running components; and (4) a system administrator, who deploys the application to a particular cluster, with certain performance characteristics, running the Juniper platform.

The development process of a Juniper application and the utilization of the scheduling advisor in the development and deployment of the application by the above mentioned roles is depicted in Figure 6.


Figure 6 The sequence of the individual steps in development and deployment of a Juniper application with utilisation of the scheduling advisor (model application, implement application, define platform, run schedulability analysis, define degree of parallelism, deploy application, profile application, optimize deployment; the lanes of the diagram distinguish the developer, the administrator, and the scheduling advisor).

After the modelling and implementation of the application by an analyst and/or a developer, a system administrator defines the run-time platform where the application will be executed. Then, the developer performs the schedulability analysis to determine an initial deployment of the application. The application deployment has to meet the computation resource utilization and the real-time constraints defined at design-time; otherwise, if it is not possible to deploy the application suitably, the application has to be remodelled or reimplemented, or the platform redefined, so that another schedulability analysis results in a suitable deployment. Finally, the developer defines the degree of parallelism of the application's components (i.e., the numbers of instances of individual components) and the application is deployed.

In the next step, the deployed application is executed and profiled by the scheduling advisor, concurrently with benchmarking of the run-time platform. The results of the profiling and benchmarking are utilised by the scheduling advisor to optimize the deployment with the current degree of parallelism, as described in Section 5. Moreover, the profiling and benchmarking results can be used by the developer in a repeated schedulability analysis to propose a better initial deployment with a new degree of parallelism.

8 Conclusion

This article described the problems of adaptive scheduling of stream processing applications on heterogeneous clusters and presented ongoing research towards a novel scheduling advisor. In the article, we outlined general requirements for scheduling in stream processing on heterogeneous clusters and analysed the state-of-the-art approaches introduced in the related works. We also described a sample application of stream processing in heterogeneous clusters, analysed the scheduling decisions, and proposed a novel scheduler for the Apache Storm distributed stream processing platform based on the knowledge acquired in the previous phases.

The sample application and the proposed scheduler are still work in progress. We are performing an evaluation of the proposed approach in practice. Our future work mainly aims at possible improvements of the scheduler performance, which is important for real-time processing, at addressing the problems connected with the automatic scaling of processing components (i.e., their elasticity), and at addressing the issues related to an eventual decentralisation of the scheduler implementation.

References

Abadi, D. J., Ahmad, Y., Balazinska, M., Çetintemel, U., Cherniack, M., Hwang, J.-H., Lindner, W., Maskey, A. S., Rasin, A., Ryvkina, E., Tatbul, N., Xing, Y., and Zdonik, S. (2005). The design of the Borealis stream processing engine. In CIDR, volume 5, pages 277–289.

Aniello, L., Baldoni, R., and Querzoni, L. (2013). Adaptive online scheduling in Storm. In Proceedings of the 7th ACM International Conference on Distributed Event-Based Systems, pages 207–218. ACM.

Babcock, B., Babu, S., Datar, M., Motwani, R., and Widom, J. (2002). Models and issues in data stream systems. In Proceedings of the Twenty-First ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 1–16. ACM.

Balazinska, M., Balakrishnan, H., and Stonebraker, M. (2004). Load management and high availability in the Medusa distributed stream processing system. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pages 929–930. ACM.

Borkar, V., Carey, M., Grover, R., Onose, N., and Vernica, R. (2011). Hyracks: A flexible and extensible foundation for data-intensive computing. In 2011 IEEE 27th International Conference on Data Engineering (ICDE), pages 1151–1162. IEEE.

Bosilca, G., Bouteiller, A., Danalis, A., Herault, T., Lemarinier, P., and Dongarra, J. (2012). DAGuE: A generic distributed DAG engine for high performance computing. Parallel Computing, 38(1):37–51.

Brito, A., Martin, A., Knauth, T., Creutz, S., Becker, D., Weigert, S., and Fetzer, C. (2011). Scalable and low-latency data processing with Stream MapReduce. In 2011 IEEE Third International Conference on Cloud Computing Technology and Science (CloudCom), pages 48–58. IEEE.

Cherniack, M., Balakrishnan, H., Balazinska, M., Carney, D., Cetintemel, U., Xing, Y., and Zdonik, S. B. (2003). Scalable distributed stream processing. In CIDR, volume 3, pages 257–268.

Condie, T., Conway, N., Alvaro, P., Hellerstein, J. M., Elmeleegy, K., and Sears, R. (2010). MapReduce online. In Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation. USENIX Association.

Dean, J. and Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113.

Isard, M., Budiu, M., Yu, Y., Birrell, A., and Fetterly, D. (2007). Dryad: Distributed data-parallel programs from sequential building blocks. ACM SIGOPS Operating Systems Review, 41(3):59–72.

Lam, W., Liu, L., Prasad, S., Rajaraman, A., Vacheri, Z., and Doan, A. (2012). Muppet: MapReduce-style processing of fast data. Proceedings of the VLDB Endowment, 5(12):1814–1825.

Li, B., Mazur, E., Diao, Y., McGregor, A., and Shenoy, P. (2011). A platform for scalable one-pass analytics using MapReduce. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, pages 985–996. ACM.

Marz, N. (2014). Apache Storm. https://git-wip-us.apache.org/repos/asf?p=incubator-storm.git. Git repository.

Murray, D. G., Schwarzkopf, M., Smowton, C., Smith, S., Madhavapeddy, A., and Hand, S. (2011). CIEL: A universal execution engine for distributed data-flow computing. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation. USENIX Association.

Neumeyer, L., Robbins, B., Nair, A., and Kesari, A. (2010). S4: Distributed stream computing platform. In 2010 IEEE International Conference on Data Mining Workshops (ICDMW), pages 170–177. IEEE.

Vinothina, V., Rajagopal, S., and Ganapathi, P. (2012). A survey on resource allocation strategies in cloud computing. International Journal of Advanced Computer Science and Applications, 3(6).

Warneke, D. and Kao, O. (2011). Exploiting dynamic resource allocation for efficient parallel data processing in the cloud. IEEE Transactions on Parallel and Distributed Systems, 22(6):985–997.

Notes

1 E.g., parts of variable-bit-rate video streams with a temporarily high bit rate will be tagged for processing by special nodes with powerful video decoders, while average-bit-rate parts can be processed by common nodes.

2 The application has been developed for demonstration purposes only, particularly to evaluate the scheduling advisor described in the article. However, it may also be used in practice, e.g., for the visualisation of popular news on persons (with related keywords and photos) from news feeds on the Internet.

3 https://lucene.apache.org/
4 http://katta.sourceforge.net/
5 http://www.juniper-project.org/

Informatica 36 (2012) 501–505

Cloudification of Legacy Information Extraction System

Svatopluk Šperka, Petr Škoda, Pavel Smrž
Brno University of Technology, Faculty of Information Technology, IT4Innovations Centre of Excellence
Bozetechova 2, 612 66 Brno, Czech Republic
{isperka, iskoda, smrz}@fit.vutbr.cz

Keywords: cloud, legacy systems, ReReSearch, mOSAIC, information extraction

This paper presents the process of cloudification of a legacy information extraction system targeting scientific papers, i.e., its porting into the Cloud. The original system is described along with the architectural and implementation decisions made in order to gain performance and to conform to the philosophy of the mOSAIC cloud middleware that is used as a basis for the implementation.

1 Introduction

With the consistently growing popularity and availability of cloud computing, the focus of developers and enterprises turns also to legacy applications. Legacy systems are important because they are often crucial parts of organisations' workflows and it would be error prone and/or cost ineffective to redevelop them. In the same way, scientists do not have enough resources or time to redevelop numbers of legacy applications. The reasons for getting legacy applications into a Cloud are mostly the same as for getting new ones there, e.g., great horizontal scalability, better cost efficiency, and easy access to a vast amount of computational resources on demand [1][2][3].

At our faculty we develop a system for building an accessible knowledge base about scientific research called ReReSearch. This system consists of many old modules that we need to run in an adequate number of instances in order to achieve the required overall performance [4]. We can say that these modules are legacy because the cost of porting them to another language is unacceptable: they contain fine-tuned, complex natural language processing algorithms. The current way of parallelisation is the usage of the faculty's N1GE-based grid, but it brings various limitations and problems. To overcome this, we decided to port some modules to a Cloud. The first ported module, which is a crucial part of ReReSearch, is the scientific paper information extraction module [4].

This paper addresses a process of porting the information extraction module written in Python language to the mOSAIC cloudware solution. We start by describing the ReReSearch and mOSAIC systems in sections 2 and 3, and continue by analysis of possible porting approaches in section 4. Then we describe the design of communication interface of the Cloud application in section 5 and the implementation of mOSAIC-based solution in section 6. Finally, we conclude by discussing further steps in development.

2 ReReSearch system

ReReSearch is an experimental project being developed by our faculty's Natural Language Processing group which aims at building a knowledge base and derived personalized portals about research. The key entities it operates on include researchers, teams and universities, papers, reports and deliverables, books, journals, proceedings and various collections, conferences, workshops and seminars, projects and funding agencies. Information on all of these entities has to be interconnected in order to be useful. To gather the data, the system first identifies relevant sources on the Web and then downloads and processes specific web pages and papers. In order to transform data from an unstructured form into a structured one, information extraction methods are applied.

Papers are collected by special crawlers that search the Web for pages possibly containing links to scientific papers (e.g., online proceedings or lists of publications linked from the homepages of authors that the system already "knows"). Printable formats (PDF, PS, DOC, etc.) are transformed into a semi-structured text, i.e., text-structuring information such as sections or list items can be recovered. Then the information extraction from the text itself is performed. This process takes from a few seconds to several minutes depending on the complexity and length of the document.

It is hard or practically impossible to exactly predict the number of scientific papers that the system can acquire in a given time, and it is essential to process all the acquired documents as fast as possible in order to keep the database up to date. Currently, we employ the N1 Grid Engine (N1GE, formerly SGE) to process large batches of papers [4].

On N1GE we compute serialized extractions of ten documents per job per node (scheduling jobs with only one document would result in the grid's poor performance because of the scheduling overhead). The whole extraction process of one document, from a printable format to structured metadata, is then carried out on one grid node.


The executables of the metadata extractor, which is written in Python, are run from a fileserver, so we do not need to install the extractor on every node. After a successful extraction, the resulting metadata are copied back to the fileserver and the ReReSearch control system takes care of uploading them into the database.

There are some drawbacks to the current metadata extraction solution using the faculty grid. Mainly, the grid is not dedicated to our research group, so it is often highly utilized by other groups. Secondly, the time from finding a document by the ReReSearch system to obtaining the resulting metadata is rather long (i.e., we must wait for ten documents before enqueueing them in the grid's queue and then all of them must be extracted at once). Lastly, the faculty grid is primarily intended for new research, so it should not be occupied by projects in their deployment phase.

3 mOSAIC cloud middleware

It was essential for us to avoid cloud provider lock-in; ReReSearch is a representative of systems that can profit from portability and deployment across clouds and cloud providers. The mOSAIC middleware provides such benefits. It is a language- and platform-agnostic API for the usage of multi-Cloud resources and, at the same time, a portable platform for the utilization of Cloud services based on the API and Cloud usage patterns [5]. It allows one version of an application to be deployed to any supported provider. Moreover, it aims at auto-scalability of applications [6], i.e., elasticity.

The mOSAIC API has several layers. The Connector API abstracts the types of cloud resources commonly offered by cloud providers, such as message queues, key-value stores, or distributed file systems. The Driver API is at the bottom of the mOSAIC API hierarchy; it sits on top of a native API for a particular resource and enables a uniform access protocol. This layer offers plugins, where each plugin enables access to one implementation of a resource type. The intermediate Interoperability API ensures the language independence of the API and thus mediates between the Connector API implementation and a compatible driver implementation. The Connector API and the Cloudlet API, the APIs offered by the platform to application developers, are currently only available in the Java programming language, but Python and Erlang versions are being developed.

Cloud components that implement parts of an application's functionality are called cloudlets. They use the Cloudlet API, which handles the life-cycle of cloudlets and enables initialization, configuration, migration, and obtaining bindings to the used resources [6]. There are one or several containers within which one or more cloudlet instances execute. The number of instances is under the control of the container and is managed in order to provide scalability [7], i.e., the mOSAIC platform controls the scaling up and scaling down of an application. In order to ensure scalability, components should communicate indirectly, usually by using message queues as an intermediary, so that more cloudlet instances can dispatch messages from the same queue.

4 Identifying architecture

The scientific paper information extractor is easily parallelisable just by running many instances simultaneously, because there is no dependency between the extractions of individual documents. But there are still several approaches to performing such a computation in a cloud environment. These differ mainly in the amount of necessary changes to the original application.

4.1 Minimal changes in original application

It is possible to prepare a software package, very similar to the current N1GE one, to be run in a Cloud. The mOSAIC cloudlet would then execute the whole process from a printable version to metadata for each document.

The main advantage of this solution is its simplicity. There is practically no workflow to be programmed in mOSAIC, everything is run as one cloudlet, and there is no communication between components. Scaling is achieved by running more computing nodes with more cloudlet instances. However, there is no space for further optimization or speed-up. The situation is better than on N1GE because each document is processed alone (no need to process ten documents at once as on N1GE), but this solution still does not bring any parallelization within the scope of one document.

A further step could be to separate the metadata extraction from the plaintext transformation. The processing of one document actually consists of two different parts: the transformation from a printable format to plaintext, and the information extraction process itself. By separating these two parts we would get two cloudlets connected by a queue. The mOSAIC workflow then consists of the document-to-plaintext transformation and the information extraction.

This design is still very straightforward because even in the N1GE solution there are two subprograms, one for each part, and a text file is passed from the document-to-text transformer to the information extractor. On the other hand, this solution does not bring many improvements over the first one in terms of parallelization.

4.2 Decomposition of original application

Finally, we can dig deeper into the original application and identify its inner structure. The information extractor consists of multiple metadata extractors, e.g., a language identifier, an email extractor, an entity extractor, or a document splitter (see Figure 1). In mOSAIC there is then a cloudlet for each extractor and the communication between them is done through queues. The original extractors must be interfaced so that we can run them separately in cloudlets.

In order to do that, we need to precisely analyse the original information extractor, write a new wrapper for each extractor, and design new interfaces between the newly separated parts. The advantage, however, is that with separated extractors we can easily parallelise the information extraction process of one document.


Extractor cloudlets can then run on many nodes, and the time needed to extract one document can approach the time of the slowest extractor, more precisely the time of the longest chain of extractors that depend on each other and thus must be pipelined. Moreover, we can scale up the number of extractor cloudlets with respect to their average speed, i.e., the slow extractors will have slightly more instances than the fast ones.

Figure 1. Metadata extractor—from document source to metadata.

Because the processing speed of each document is important for the ReReSearch system, we decided on the last option. This also gives us the ability to easily add new extractors and to run only the necessary extractors for different kinds of documents.

5 Communication Interface

In order to be able to communicate with the system, it needs to be given an interface to the outside world. Conceptually, there are two communication use cases the interface must satisfy: requesting extractions and monitoring them.

As there are only these two trivial use cases, a lightweight communication mechanism is appropriate. A RESTful interface based on the HTTP protocol fulfills this requirement perfectly. Moreover, it is easy to implement and interact with, given the abundance of tools, libraries, and implementations.

The two mentioned use cases map to two HTTP methods. For requesting an extraction, there is the POST method available at a URL of the form http://<IP>/documents, where IP is the IP address of the machine where the gateway is running. The body of the request must contain a JSON object (the Content-Type header of the request must therefore be "application/json") with the field url whose value is the URL of the document to process.

The request can either be successful or not. In case of success, the response code is 202 (described as "The request has been accepted but the processing has not been completed" [8]). The content type of the body is again "application/json" and the JSON object is of the form:

{
  "url": "<document_url>",
  "url-hash": "<url_hash>",
  "status-url": "http://<IP>/documents/<url_hash>/status"
}

The url-hash field contains a hash of the URL computed by the SHA-256 function. The status-url field contains the URL where monitoring information will be available.
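
For illustration, such an extraction request could be issued from Java as follows. This is a hedged sketch using only the endpoint and JSON fields described above; the gateway address and document URL are placeholders.

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Minimal client issuing an extraction request to the HTTP gateway (illustrative sketch). */
public class ExtractionClient {
    public static void main(String[] args) throws IOException, InterruptedException {
        String gateway = "http://192.0.2.10";                        // placeholder gateway address
        String body = "{\"url\": \"http://example.org/paper.pdf\"}"; // document to process

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(gateway + "/documents"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());

        // A 202 response carries the JSON object with the url, url-hash, and status-url fields.
        System.out.println(response.statusCode() + ": " + response.body());
    }
}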

Results of extractions themselves are actively pushed to the control system through its HTTP gateway because it is necessary for the application to be a seamless part of the ReReSearch system.

5.1.1 Monitoring

Monitoring of the extraction of a particular document can be done by obtaining status information. It is available at the status-url contained in the JSON object of the HTTP response to the extraction request. One can obtain it by issuing a GET request to that URL. The successful response has code 200 and the enclosed JSON object contains fields named after the cloudlets, each containing an object with the following fields:

• start-timestamp—time when cloudlet started processing the document,

• end-timestamp—time when cloudlet finished processing the document,

• error—contains the error if one occurred in this component.

The monitoring response has code 404 if no information is available about such a document.
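
A hypothetical status response for a document already handled by two cloudlets could then look as follows; the cloudlet names and timestamps are invented for illustration and only the field names follow the list above.

{
  "SourceFileProcessor": {
    "start-timestamp": "2012-06-01T10:15:02Z",
    "end-timestamp": "2012-06-01T10:15:09Z",
    "error": null
  },
  "LanguageDetectorExtractor": {
    "start-timestamp": "2012-06-01T10:15:09Z",
    "end-timestamp": null,
    "error": null
  }
}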

Figure 2. The architecture of the cloudified application (components include the ReReSearch Control System, an HTTP Gateway and HTTPg Handler, the Downloader, Source File Processor, Wrap Document Extractor, Get XML Extractor, Language Detector Extractor and further extractor cloudlets, the Result Sender, a Key-Value Store, and the queues connecting them; the legend distinguishes cloud resources, cloudlets, and mOSAIC components).

6 Implementation

Given that we were forced to use the legacy Python software implementing the extraction algorithms, two implementation tasks were solved in parallel after designing the application's architecture: dissecting the Python software so that it could be wrapped into cloudlets, and developing the wrapping cloudlets along with the related code built upon the mOSAIC platform.

As we said above, the Python software was split into parts so that each logical task (text extraction from PDF, language detection, keyword or citation extraction, etc.) could be executed autonomously. This way, each task usually runs for a reasonable amount of time, in the order of seconds. Such tasks are more suitable for the mOSAIC platform than longer-running ones. It also enables us to parallelise, in the future, those tasks that are not dependent on each other. Each task is implemented by a single Python script with a unified input/output interface: they all accept and produce JSON objects. The Source File Processor, which wraps the document-to-text transformer, is not considered an extractor in the code because it has to process a binary file, such as a PDF, read from the standard input. It is therefore different from the other extractors and has to be treated specifically in the Java code. Its output, however, is a JSON object that is already accepted by the extractors.

A cloudlet wrapping the extractors was created such that it can be parameterised upon creation to execute a particular Python script. This parameter is specified in the cloudlet descriptor, which is a configuration file necessary to start the cloudlet along with its code (a JAR archive in this case).

After finishing the work on the Python part, it had to be packaged, along with all its numerous dependencies and a specific Python interpreter version, as a TazPkg package of the SliTaz Linux distribution upon which mOS2 is based. By creating such a package and pushing it into the mOSAIC package repository, we became able to deploy all the necessary software into newly created mOS-based virtual machines. Then we could proceed to integrating both layers of the application.

The integration phase consisted of making the Extractors and also the Source File Processor spawn a new process executing a particular Python script on the cloudlet's input. The ProcessBuilder and Process Java API classes were used to achieve the execution, as the mOSAIC API does not yet provide its own API for spawning processes under its control.
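
A minimal sketch of this wrapping is shown below. It assumes the unified JSON stdin/stdout interface described above; the script and file names are hypothetical and the actual cloudlet code differs.

import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Illustrative wrapper spawning a Python extractor and exchanging JSON over stdin/stdout. */
public class PythonExtractorWrapper {

    /** Runs the given extractor script, feeds it a JSON document, and returns its JSON output. */
    public static String runExtractor(String scriptPath, String inputJson)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder("python", scriptPath);
        builder.redirectErrorStream(true);                  // merge stderr into stdout for simplicity
        Process process = builder.start();

        try (OutputStream stdin = process.getOutputStream()) {
            stdin.write(inputJson.getBytes(StandardCharsets.UTF_8));
        }
        String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        process.waitFor();
        return output;                                      // JSON object produced by the extractor
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical script name; real extractors follow the same JSON-in/JSON-out convention.
        String result = runExtractor("language_detector.py",
                Files.readString(Paths.get("document.json")));
        System.out.println(result);
    }
}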

7 Conclusion

We have overviewed our approach to porting a legacy information extraction application to the cloud using the mOSAIC cloudware, along with the designed architecture and some implementation details. To this day, we have finished the main part of the development and the application has been successfully tested on the mOSAIC Portable Testbed Cluster3. Next, it is necessary to test it on a real cloud solution to see the performance of the current solution and also its behaviour when connected to the ReReSearch control system. In case of unsatisfactory results, the application will have to be further improved, as in the case of any other application.

2 mOS is an operating system developed to power VM instances running the mOSAIC platform.
3 PTC is a small-sized cluster for testing and development of mOSAIC applications on local computers.

Acknowledgment

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement number 256910 (mOSAIC Cloud) and by the IT4Innovations Centre of Excellence project, Registration number CZ.1.05/1.1.00/02.0070, supported by Operational Programme "Research and Development for Innovations" funded by Structural Funds of the European Union and the state budget of the Czech Republic.

References

[1] D. Yu, J. Wang, B. Hu, J. Liu, X. Zhang, K. He, L. Zhang, A Practical Architecture of Cloudification of Legacy Applications, Kingdee International Software Group Co. Ltd., China, 2011.

[2] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph and R. H. Katz, Above the Clouds: A Berkeley View of Cloud Computing, University of California at Berkeley, Technical Report UCB/EECS-2009-28, 10 February 2009.

[3] Y. Liu, Q. Wang, M. Zhuang, Y. Zhu, Reengineering Legacy Systems with RESTful Web Service, International Computer Software and Applications Conference – COMPSAC, pp. 785–790, 2008.

[4] P. Škoda, S. Šperka, P. Smrž, Extracting Information from Scientific Papers in the Cloud, Complex, Intelligent and Software Intensive Systems (CISIS), 2012 Sixth International Conference, 2012.

[5] B. Di Martino, D. Petcu, R. Cossu, P. Goncalves, T. Máhr, M. Loichate, Building a Mosaic of Clouds, Euro-Par 2010 Parallel Processing Workshops, LNCS, vol. 6586, pp. 571–578, Springer, 2011.

[6] D. Petcu, C. Crăciun, M. Neagul, M. Rak, I. L. Larrarte, Building an Interoperability API for Sky Computing, Proc. 2011 International Conference on High Performance Computing and Simulation, Workshop on Cloud Computing Interoperability and Services (InterCloud 2011), IEEE CS, pp. 405–412, 2011.

[7] G. Macariu, mOSAIC API: Java Programming Guide, eAustria Research Institute, 2011.

[8] R. Fielding et al.: Hypertext Transfer Protocol—HTTP/1.1, section 10, RFC 2616, 1999. [Online]. Available: http://www.w3.org/Protocols/rfc2616/rfc2616.html.