
Cloud PARTE: Elastic Complex Event Processing based on Mobile Actors

Janwillem Swalens, Thierry Renaux, Lode Hoste, Stefan Marr, Wolfgang De Meuter

Software Languages Lab, Vrije Universiteit Brussel, Belgium
{jswalens,trenaux,lhoste,smarr,wdmeuter}@vub.ac.be

Abstract

Traffic monitoring or crowd management systems produce large amounts of data in the form of events that need to be processed to detect relevant incidents. Rule-based pattern recognition is a promising approach for these applications; however, increasing amounts of data as well as large and complex rule sets demand ever more processing power and memory. In order to scale such applications, a rule-based pattern detection system needs to be distributable over multiple machines. Today's approaches, however, focus on a static distribution of rules or do not support reasoning over the full set of events.

We propose Cloud PARTE, a complex event detection system that implements the Rete algorithm on top of mobile actors. These actors can migrate between machines to respond to changes in the work load distribution. Cloud PARTE is an extension of PARTE and offers the first rule engine specifically tailored for continuous complex event detection that is able to benefit from elastic systems as provided by cloud computing platforms. It supports fully automatic load balancing and online rules with access to the entire event pool.

Categories and Subject Descriptors D.1.3 [Programming Techniques]: Concurrent programming; D.3.4 [Programming Techniques]: Processors; I.5.5 [Pattern Recognition]: Implementation

General Terms Algorithms, Design, Performance

Keywords mobile actors, Rete, complex event processing, online reasoning, load balancing


1. Introduction

Large-scale processing of complex events gains new relevance with the big data movement and an increasing availability of real-time data, for instance for enterprise applications, or societal and security-related scenarios. Companies want to leverage real-time big data for new products and improved services [1], while traffic and crowd monitoring systems could be used to prevent traffic jams or guide rescue services in emergency situations.

Such applications require flexible and maintainable event processing systems that can be adapted easily, for instance to recognize new emergency scenarios or business cases. One approach is to use machine learning techniques to analyze, filter, and categorize events. However, these approaches are often black boxes that do not give feedback on the reasons for a certain classification [12, 13]. Moreover, most machine learning techniques require training data, which is seldom available for exceptional and emergency situations.

For scenarios where training data is lacking or hard to gather, query- and rule-based complex event detection (CED) systems are more suitable choices [11]. Furthermore, rule-based systems have the advantage of providing software engineering abstractions to express intent, avoiding cumbersome and error-prone ad-hoc implementations, and allowing expert programmers to intervene in the correlation of events by explicitly encoding application logic. For instance, the forming of a crowd can be detected based on simple rules that correlate the location and movement of people in public places. Similarly, upcoming traffic jams can be detected with rules that detect deviations from the typical speed of cars.

While rule-based systems are appealing, the applications listed above need to process large amounts of data efficiently in order to reach the responsiveness required for use cases that are based on live feedback. One application that could be broadly described as "crowd monitoring" is the real-time analysis of a soccer game, as proposed by the ACM DEBS 2013 Grand Challenge.1

1 The ACM DEBS 2013 Grand Challenge, DEBS, access date: August 15, 2013. http://www.orgs.ttu.edu/debs2013/index.php?goto=cfchallengedetails

While most participants of this competition prototyped their systems with rule-based complex event detection systems such as Esper,2 the final systems used highly custom solutions, severely lacking software engineering abstractions and maintainability, and with strong limitations on the rules. Today's rule-based systems do not yet provide the necessary efficiency to handle such use cases with the required performance.

In order to facilitate big data applications that are supposed to provide live feedback, complex event detection engines and pattern recognition systems need to be able to deal with data sets that are larger than the main memory of a single machine. As a first step, we propose a distributed rule-based CED system called Cloud PARTE, which builds on PARTE [17], a Rete-based CED system. To our knowledge, Cloud PARTE is the first Rete-based CED system that provides the following characteristics:

Distributed, yet unified knowledge base. The Rete nodes are automatically distributed over multiple machines connected in a LAN, yet the semantics of having a single unified Rete network are maintained.

Scalable. Performance degradation from an increasing amount of work can be countered without programming effort, by increasing the number of machines.

Elastic. The system adapts at run time to increases or decreases in the work load. New machines can be added to or removed from a running program, and the workers cooperatively redistribute the work.

Dynamically load balanced. The system adapts at run time to changes in the distribution of the work load throughout the rule set. As different subrules become memory- or computationally intensive, the allocation of processing and storage resources is redistributed, across machine boundaries.

2. Context and Requirements

2.1 Traffic Monitoring

One of the potential applications for big data complex event processing is road traffic monitoring. Modern highways are monitored for a wide variety of reasons. Use cases include tracking specific cars for the collection of road tolls, monitoring the overall traffic flow to guide drivers to avoid traffic jams, or the recognition of potentially dangerous situations in order to alert rescue services.

While systems for these applications exist today, they are usually not easily extensible, and most of the applications do not provide live feedback. One challenge is to combine the data of various data sources and handle the resulting amounts of data efficiently.

2 Esper, EsperTech Event Stream Intelligence, access date: August 15, 2013. http://esper.codehaus.org/

(defrule StationaryCar
  (Car (camera ?c) (plate ?plt) (time ?t1))
  (Car (camera ?c) (plate ?plt) (time ?t2))
  (test (and (> (- ?t2 ?t1) (seconds 30))
             (< (- ?t2 ?t1) (seconds 60))))
  (Camera (id ?c) (highway ?hw) (position ?p))
  =>
  (assert (StationaryCar (camera ?c)
                         (plate ?plt) (time ?t1)
                         (highway ?hw) (position ?p))))

Listing 1: Example rule to describe a stationary car.

Examples of data sources are sensors such as inductive loops,3 which are permanently embedded into roads to count passing cars, or radar stations used to determine the speed of specific cars. Together with cameras, they can be used to record speed limit violations. Cameras can further be used in combination with on-board units to determine the exact toll for cars, as for instance in Germany.4 Assuming that these data sources can be combined for large and traffic-intensive regions or countries in order to facilitate new traffic monitoring applications, the resulting amount of data outgrows what a single server system can handle.

In order to enable a wide range of such applications, rule-based approaches are promising. Forgy [6] proposed the Rete algorithm as an efficient implementation technique. Renaux et al. [17] demonstrated how such an approach could be used for an efficient implementation on multicore systems. Imagining an application that uses various input sources to detect traffic jams and accidents, a simple rule can be specified as in the example given in Listing 1. The rule describes how to detect cars that are not moving. The example assumes that the system receives events about cars that include an identifier for the traffic camera recording the car, the license plate information, as well as the time stamp of the event. We assume that a car can be considered stationary if it has been recorded at least twice by the same camera within a 30 to 60 second timespan. Furthermore, we assume that the system has detailed information about the cameras and their positions on a specific highway. Such information can then be used to trigger a higher-level event StationaryCar that can be further processed and used to investigate whether it is an emergency situation or a traffic jam. Unlike existing techniques measuring the flow rate of cars, declarative rules allow reasoning over the causes, e. g., by observing abnormal movement of cars.

Modeling complex traffic monitoring systems based on such an approach requires efficient execution engines. The scenario itself will result in a large number of rules to describe the scenarios that are to be detected, and even larger amounts of events coming in from many sensors at high frequencies. Furthermore, these rules need to be applied to continuous event streams to provide live feedback. Using the Rete algorithm for the rule processing allows for efficient execution; however, it has not been widely used for such large-scale, online scenarios.

3 ZELT vehicle monitoring, Traffic Technology Ltd, access date: August 15, 2013. http://www.traffictechnology.co.uk/vehicle-monitoring/zelt-loop-detection
4 Toll Collect, a system for positioning, monitoring, and billing trucks on German motorways, access date: August 15, 2013. http://www.toll-collect.de/

2.2 Requirements

In the previous section, we outlined traffic monitoring as an example scenario in more detail, but the domain of complex event detection has a much wider range of applications where live feedback enables individuals and companies to react better and faster to a changing environment. In general, correlating various different data sources provides a much richer context that can be leveraged [1]. What these applications have in common is that they follow the big data trend, i. e., they are based on the availability of large amounts of data exceeding the capacities of a single machine. This big data is then mined for correlations and patterns.

For the scenarios we are interested in, data sources produce constant streams of events that need to be processed online to enable live feedback. Rete engines such as Lana-Match [2] enable the use of multiple systems for distributed pattern matching, but they have significant drawbacks such as high latencies, because all matches need to be committed centrally using a transactional concurrency system. Typically, they partition a fact base in order to fit into the memory of a single machine, and then allow parallel execution of the matching process. This approach imposes the severe restriction that individual rules may not outgrow the capabilities of a single system, since the granularity of distribution is at the rule level.

The main requirements for continuous complex event processing with big data are as follows. First of all, such a system needs to perform pattern matching online, while the events are taking place. Next, it needs to provide support for rules that reason over the whole data set of events, i. e., it needs a unified knowledge base that does not make data distribution explicit and does not require any a priori knowledge about how data is stored and processed. Moreover, such a system needs to scale: the use of additional machines needs to enable it to compensate for an increase in incoming events. In addition to scalability, it needs to adapt to changing work loads elastically: it should be possible to add additional machines to handle increasing work loads at runtime. Finally, it should dynamically balance load between machines: the system needs to be able to redistribute data and computation dynamically to avoid overloading machines and to avoid performance bottlenecks. These requirements guide the design of Cloud PARTE, which is discussed in the following section.

[Figure: two Car input nodes feed a unification node on ?c and ?plt, followed by a test node; a Camera input node joins through a second unification node on ?c, which feeds the terminal node.]

Figure 1: Rete network for the rule defined in Listing 1.

3. Cloud PARTE

Our system is an extension of PARTE, an actor-based, parallel CED system by Renaux et al. [17]. The pattern detection in PARTE builds on the Rete algorithm proposed by Forgy [6]. The Rete algorithm converts a list of rules, such as the one in Listing 1, into a network of nodes, such as shown in Figure 1. The left-hand side of each rule is converted into a set of nodes that perform tests. Incoming events flow through the network as tokens, which are filtered by the nodes through which they pass. Nodes can perform tests on one token, e. g., compare them to a constant, or on two tokens, e. g., test for a unification between two events. In case of a complete match of the left-hand side of a rule, a token will reach the terminal node, which represents the right-hand side of the matched rule.
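To make the token flow concrete, Listing 2 gives a minimal C++ sketch (ours, not code from the PARTE sources; the token layout and all names are invented for illustration) of a two-input unification node: each side of the join remembers the tokens it has seen, and a newly arriving token is compared against the opposite memory.

#include <string>
#include <utility>
#include <vector>

// A token carries the variable bindings accumulated so far,
// e.g. {"?c" -> "cam42", "?plt" -> "ABC-123"}.
struct Token {
    std::vector<std::pair<std::string, std::string>> bindings;

    // Look up a variable binding; returns nullptr if unbound.
    const std::string* lookup(const std::string& var) const {
        for (const auto& b : bindings)
            if (b.first == var) return &b.second;
        return nullptr;
    }
};

// Two-input node: tokens arriving on the left are unified against the
// memory of tokens that arrived on the right (and vice versa); only
// pairs that agree on all shared variables are propagated.
struct JoinNode {
    std::vector<Token> leftMemory, rightMemory;
    std::vector<std::string> sharedVars;  // e.g. {"?c", "?plt"}

    bool unifies(const Token& l, const Token& r) const {
        for (const auto& v : sharedVars) {
            const std::string* lv = l.lookup(v);
            const std::string* rv = r.lookup(v);
            if (lv && rv && *lv != *rv) return false;
        }
        return true;
    }

    void onLeft(const Token& t) {
        leftMemory.push_back(t);  // state-saving: remember the token
        for (const auto& r : rightMemory)
            if (unifies(t, r)) propagate(t, r);
    }
    // onRight is symmetric. propagate() would merge the bindings and
    // forward the combined token to the successor node(s).
    void propagate(const Token&, const Token&) { /* forward downstream */ }
};

Listing 2: Illustrative sketch of a two-input unification node.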

Rete follows a state-saving approach to forward chaining, making it highly efficient for stable fact bases. Since our work is tailored to events, which are stable by design,5 the Rete algorithm offers a good solution.

Renaux et al. [17] used the strong similarity between a Rete network and a graph of actors communicating via message passing in the design of PARTE by reifying all Rete nodes as actors. In our work, Cloud PARTE, we extend the previous system with the means to scale up beyond the bounds of a single machine. Cloud PARTE makes it possible to distribute the detection of complex events over multiple machines, dynamically balancing the load across the available machines. To this effect, the most resource-intensive parts, the actors, are made mobile such that they can be moved around to balance resource usage.

3.1 Architecture

Cloud PARTE shares much of its architecture with its predecessor. Unlike PARTE, which used a custom scheduler and a custom implementation of the actor model to offer soft real-time guarantees in a shared memory context, Cloud PARTE is built on top of the Theron6 actor library.

Overview From a high level, Cloud PARTE processes events coming from various input sources. A predefined rule set is compiled into a Rete network where each node is represented by an actor to enable parallel execution.

5 Events, after occurring, never change. They have to be stored immutably while useful, and afterwards they expire.
6 Theron, a lightweight C++ concurrency library based on the Actor Model, access date: August 15, 2013. http://www.theron-library.com


Figure 2: High-level view on Cloud PARTE. Input events from a variety of event sources are processed based on a rule set that is compiled into a Rete network. The nodes of the Rete network are actors which are distributed over a number of worker machines. When a rule matches an event pattern, a higher-level event is generated that can be processed by a user interface or a general publish/subscribe system.

These nodes, i. e., actors, can be distributed over a number of worker machines. When a rule triggers the match of a certain event pattern, a high-level event is generated, which can then be used by a user interface or a general publish/subscribe system to react appropriately to it. Figure 2 shows a sketch of the overall system.

Distribution The distribution scheme of Cloud PARTE is illustrated in Figure 3. The nodes of the Rete network, represented in the figure as circles, are reified as actors. Multiple actors execute concurrently but maintain a single thread of control internally. The actors may be distributed over multiple machines. Each machine runs a single Cloud PARTE process that can contain many actors, but every actor can only be in one process at any given time. Thus, there is a 1 : 1 mapping between machines and processes, and a 1 : n mapping between processes and actors.

Communication In a directed acyclic graph created by the Rete algorithm, every node has one or more predecessors, and zero, one, or multiple successors. In our system, these can be located on the same or on different machines. Actors are identified by unique names, which are used by the Theron framework to route messages to the correct receiver. When the rules are compiled into a Rete network at program startup, actors are given their name and are informed of the names of their predecessors and successors.

Figure 3: The distribution scheme employed by the system. The boxes represent machines: the single master machine contains the master actor and the agenda actor, and every worker machine contains a part of the Rete network and a manager actor.

7 Crossroads I/O, a socket library providing platform-agnostic asynchronous message-sending across different network protocols, access date: August 15, 2013. http://www.crossroads.io

Theron uses the Crossroads I/O library7 to send messages over a network. While this library uses TCP/IP as its communication protocol, guaranteeing delivery of messages, it drops messages when the receiver's buffer is full. Unlike in some Belief-Desire-Intention multiagent systems such as Jason [5], Cloud PARTE's semantics do not allow actors to fail or refuse to react to messages. Our system hence employs a best-effort approach to overcome this issue, by acknowledging the receipt of a message with a confirmation message, similar to TCP. For this purpose, every Cloud PARTE process contains one "RemoteSender" actor, which is responsible for the confirmation protocol and propagates deserialized messages to the correct destination by local8 message-sends. Despite being a main implementation issue, we do not consider this a scientific challenge, since it could be overcome using existing techniques.
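Listing 3 sketches, in C++, the bookkeeping such a confirmation protocol implies (our illustration, not Cloud PARTE's actual RemoteSender; all names and the exact policies are assumptions): sequence numbers identify messages, acknowledgements clear them, duplicates are filtered on the receiving side, and unacknowledged messages are resent after a timeout.

#include <chrono>
#include <cstdint>
#include <map>
#include <set>
#include <string>

using Clock = std::chrono::steady_clock;

// Hypothetical bookkeeping for a best-effort delivery layer.
struct ReliableChannel {
    struct Pending { std::string payload; Clock::time_point sentAt; };

    uint64_t nextSeq = 0;
    std::map<uint64_t, Pending> unacked;  // sender side
    std::set<uint64_t> seen;              // receiver side

    uint64_t send(const std::string& payload) {
        uint64_t seq = nextSeq++;
        unacked[seq] = {payload, Clock::now()};
        // ... put (seq, payload) on the wire here ...
        return seq;
    }

    void onAck(uint64_t seq) { unacked.erase(seq); }

    // Returns true if the message is new; false for a duplicate caused
    // by a resend whose original did arrive after all.
    bool onReceive(uint64_t seq) {
        bool isNew = seen.insert(seq).second;
        // ... always send an ack for seq back to the sender ...
        return isNew;
    }

    // Called periodically: resend anything unacknowledged for too long.
    template <typename ResendFn>
    void checkTimeouts(ResendFn resend, std::chrono::milliseconds timeout) {
        auto now = Clock::now();
        for (auto& [seq, p] : unacked)
            if (now - p.sentAt > timeout) { resend(seq, p.payload); p.sentAt = now; }
    }
};

Listing 3: Sketch of acknowledgement-based best-effort delivery.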

Coordination While Cloud PARTE is distributed, it is not decentralized. A master actor oversees the entire system and knows where each node is at every moment (assuming no failure happened; failure resilience is considered future work). During initialization, the master actor parses the user-defined rules, creates a Rete network out of them, decides which machines the nodes should move to, and informs those machines' manager actors of the nodes they need to spawn. One manager actor exists per process. They each oversee the part of the Rete network located on the machine they are running on, and can spawn and kill Rete nodes and the RemoteSender within their process. They keep track of the nodes living in their process, and can for instance determine whether the destination of a message-send is local.

The master actor's process further contains the agenda actor. The agenda sequentializes side effects performed from within the Rete network. More specifically, whenever a complete match for a rule is found, any callback function specified as the consequence of the rule is to be called through the foreign function interface. This execution should not be performed locally in the Rete node that detected the match. Instead, a message requesting the execution is sent to the agenda actor, which executes it on the master machine.

8 We refer to actors in the same process as local actors, and to actors in other processes as remote actors.

3.2 Distributed Execution Model

The Theron library implements the actor model by providing an Actor class that can be extended to create new types of actors. Within a process, a Theron instance contains many actors that run concurrently on a thread pool. On each machine in a LAN, one process is started, and their Theron instances are connected. Inside an actor, no parallelism is used.
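Listing 4 shows how a Rete node might be expressed on top of Theron. This is a sketch of ours assuming the Theron 6 API (framework-bound actors, named construction, and RegisterHandler); the Token type and all other names are illustrative, not taken from the Cloud PARTE sources.

#include <Theron/Theron.h>

// Stand-in for PARTE's real token/message type.
struct Token { int eventId; };

// A filter node as a Theron actor: one handler, one turn per message.
class FilterNode : public Theron::Actor {
public:
    FilterNode(Theron::Framework &framework, const char *name,
               const Theron::Address successor)
        : Theron::Actor(framework, name), mSuccessor(successor) {
        RegisterHandler(this, &FilterNode::Handle);
    }

private:
    // Test the token and, if it passes, forward it to the successor.
    // Theron routes by address whether the successor is local or remote.
    void Handle(const Token &token, const Theron::Address /*from*/) {
        if (token.eventId % 2 == 0)   // stand-in for a real constant test
            Send(token, mSuccessor);
    }

    const Theron::Address mSuccessor;
};

Listing 4: Sketch of a Rete node as a Theron actor.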

A frequent issue in such concurrent systems is contention for shared resources. Even with the actor model taking care of low-level data races, access to actors' inboxes remains a point of contention. Where PARTE offered lock-free inboxes, Theron's need to potentially serialize messages across a LAN requires it to forgo this optimization. Within the Rete graph, though, every node has only one or two predecessors. Only the master actor, the agenda, and the RemoteSender are heavily contended. Though these pose potential bottlenecks, our experiments showed they are able to keep up with the rest of the system. Furthermore, the RemoteSender is only present to overcome a technical issue of the Theron library. The only real, inherent bottleneck lies in the communication over the LAN. Reducing this network traffic is part of our future work. One approach we envision is to distribute the Rete graph in such a way that strongly connected subgraphs in the Rete network tend to be physically close. This would reduce the need for inter-process communication, and with that the contention for the LAN.

A second issue systems like Cloud PARTE face is efficient serialization. When a message is sent to a local actor, Cloud PARTE makes use of the shared memory space to skip serialization. When the receiving actor is remote, however, pointer indirections would be invalid, so the message is serialized to a flattened, uniform representation. When choosing the serialization format, two considerations were made: the compactness of the serialization, and the speed at which the (de)serialization can take place. These have an impact on, respectively, the required bandwidth and the incurred latency of message sends. In our experiments, JSON9 proved sufficiently compact and fast, though we consider improving the serialization future work.
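Listing 5 sketches this send path (our illustration; the delivery helpers named in the comments are hypothetical): local sends hand the message over through shared memory, while remote sends first flatten it to a pointer-free JSON string.

#include <sstream>
#include <string>

struct Token { int eventId; long timestamp; };

// Flatten a token to a pointer-free JSON string for the wire.
std::string toJson(const Token &t) {
    std::ostringstream out;
    out << "{\"eventId\":" << t.eventId
        << ",\"timestamp\":" << t.timestamp << "}";
    return out.str();
}

void send(const Token &t, bool receiverIsLocal) {
    if (receiverIsLocal) {
        // Zero-serialization hand-off: enqueue in the local inbox
        // through shared memory (deliverLocal is hypothetical).
        // deliverLocal(&t);
    } else {
        std::string wire = toJson(t);
        // Hand the flat representation to the transport layer
        // (deliverRemote is hypothetical).
        // deliverRemote(wire);
        (void)wire;
    }
}

Listing 5: Sketch of the local versus remote send path.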

Despite its support for distribution, Theron does not offer support for code mobility, i. e., the ability to dynamically change the bindings between code fragments and the location where they are executed [7]. Since we require actors to migrate between machines, we built mobility on top of the existing library. Since an actor maintains no runtime stack across turn borders, it can be migrated by sending its inbox and its internal state.

9 JSON (JavaScript Object Notation) is a text-based standardized language used for data interchange. It is human-readable, which eases debugging, but does not lead to the most compact representation possible.

[Figure: sequence diagram of the move protocol between the master, the managers (M) of the workers, and the Rete node actors (N): 1. request to move; 2(a) stop sending, buffer instead; 2(b) prepare move: inform predecessors; the node is serialized, deleted locally, and sent (4. serialized node); the destination deserializes and recreates it (5.); 6(a) resume sending; 6(b) move completed; buffered messages (if any) are then sent.]

Figure 4: The procedure to move a node to a different machine. Every worker machine contains one manager actor (M) and several Rete node actors (N). In this figure, a node on worker 2 moves to worker 3. The node on worker 1 is a predecessor of the moving node: when it receives message 2(a) it starts buffering its messages destined for the moving actor; when it receives message 6(a) it sends the buffered messages to the now-moved node.

We devised the scheme depicted in Figure 4 to deal with messages sent to an actor while it is moving. In a first phase, the actor to move informs its predecessors, which from that moment on start buffering their messages for the migrating actor. Once all predecessors are buffering, the actor is serialized and destroyed, and the serialized representation is sent to the destination machine's manager. There, it is deserialized and reconstructed, and the predecessors are notified so that they can send their buffered messages.
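Listing 6 sketches the two manager-side halves of this protocol (our illustration; all names are hypothetical). Because an actor holds no runtime stack between turns, a name, the serialized internal state, and the pending inbox fully describe it.

#include <string>
#include <vector>

// Everything needed to reconstruct an actor elsewhere.
struct SerializedActor {
    std::string name;                // stays valid after the move
    std::string state;               // flattened internal state
    std::vector<std::string> inbox;  // messages still queued
};

// Source manager: runs once all predecessors have confirmed that they
// are buffering messages for the moving actor (steps 2-3 of Figure 4).
SerializedActor serializeAndDestroy(/* local actor handle */) {
    SerializedActor snapshot;
    // ... flatten the actor's memory-node contents and queued inbox
    // into snapshot, then destroy the local actor ...
    return snapshot;
}

// Destination manager (step 5): recreate the actor under its old name,
// then notify the predecessors (6a) and the master (6b) so buffered
// messages can be flushed to the new location.
void deserializeAndSpawn(const SerializedActor &snapshot) {
    // ... spawn a new actor from snapshot.name / snapshot.state,
    // re-enqueue snapshot.inbox ...
    (void)snapshot;
}

Listing 6: Sketch of the manager-side migration steps.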

3.3 Load Detection and Balancing

Cloud PARTE uses a global, centralized load balancing strategy [19]. The manager of each process measures the load of its local actors at a regular interval and relays that information to the master actor. The master then makes load balancing decisions based on this global information.

Non-trivial heuristics on which to base the load balancing decisions are outside the scope of this paper and considered future work. Our experiments described later in this paper were conducted with a simple heuristic that defines load as a threshold on the number of unprocessed messages in an actor's inbox. Machines are in turn labeled as overloaded or underloaded based on the number of overloaded actors they contain. Whenever at least one overloaded and one underloaded machine are found, the master actor selects an overloaded actor on the most overloaded machine and migrates it to a randomly selected underloaded machine.
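Listing 7 renders this heuristic as the master might apply it. This is a sketch of ours: the threshold values, the choice of the fullest inbox as the victim, and the definition of "underloaded" as having no overloaded actors are all assumptions not stated in the paper.

#include <algorithm>
#include <cstdlib>
#include <utility>
#include <vector>

struct ActorLoad   { int actorId; std::size_t inboxLength; };
struct MachineLoad { int machineId; std::vector<ActorLoad> actors; };

const std::size_t kActorThreshold   = 100;  // unprocessed messages (assumed)
const std::size_t kMachineThreshold = 3;    // overloaded actors (assumed)

std::size_t overloadedActors(const MachineLoad &m) {
    return static_cast<std::size_t>(std::count_if(
        m.actors.begin(), m.actors.end(),
        [](const ActorLoad &a) { return a.inboxLength > kActorThreshold; }));
}

// Returns {actorId, destinationMachineId}, or {-1, -1} if no move is due.
std::pair<int, int> pickMigration(const std::vector<MachineLoad> &machines) {
    const MachineLoad *worst = nullptr;
    std::vector<const MachineLoad *> underloaded;
    for (const auto &m : machines) {
        std::size_t n = overloadedActors(m);
        if (n >= kMachineThreshold) {
            if (!worst || n > overloadedActors(*worst)) worst = &m;
        } else if (n == 0) {
            underloaded.push_back(&m);
        }
    }
    if (!worst || underloaded.empty()) return {-1, -1};
    // An overloaded actor on the most overloaded machine (here: the one
    // with the fullest inbox) moves to a random underloaded machine.
    const ActorLoad &victim = *std::max_element(
        worst->actors.begin(), worst->actors.end(),
        [](const ActorLoad &a, const ActorLoad &b) {
            return a.inboxLength < b.inboxLength;
        });
    const MachineLoad *dest = underloaded[std::rand() % underloaded.size()];
    return {victim.actorId, dest->machineId};
}

Listing 7: Sketch of the threshold-based migration heuristic.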

The movement of actors is fully transparent: the user does not need to worry about the location of actors, or when and where to move them. The load balancing algorithm determines this automatically. The predecessors of the moved actor need not do any lookup to 'find' the actor at its new location: the unique name of the actor remains valid.

4. Evaluation

4.1 Methodology and Setup

For the evaluation of Cloud PARTE's performance, we conducted a series of microbenchmarks. They ran on a network of commodity computers, each containing a quad-core processor10 and 8 GB of RAM, and running the Ubuntu 12.04.2 operating system with a Linux 3.2.0-43-generic kernel. All code was compiled using gcc 4.6.3 with optimization flag -O3. The network is rated at 1000 Mbit/s. Based on the methodology proposed by Georges et al. [8], every configuration is executed at least 30 times, and is executed additionally until either a confidence level of 95% is achieved or the configuration has been executed 75 times.
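Listing 8 sketches this run-count policy (our reading of the methodology of Georges et al. [8]; the 5% interval-width tolerance is an assumption, as the text only states the 95% confidence level and the 30/75 run bounds).

#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Keep benchmarking until the 95% confidence interval of the mean run
// time is tight enough, with at least 30 and at most 75 runs.
bool enoughRuns(const std::vector<double> &runtimes) {
    std::size_t n = runtimes.size();
    if (n < 30) return false;
    if (n >= 75) return true;

    double mean = std::accumulate(runtimes.begin(), runtimes.end(), 0.0) / n;
    double sq = 0.0;
    for (double x : runtimes) sq += (x - mean) * (x - mean);
    double stddev = std::sqrt(sq / (n - 1));

    // With n >= 30 the normal approximation applies: z(0.975) = 1.96.
    double halfWidth = 1.96 * stddev / std::sqrt(static_cast<double>(n));
    return halfWidth <= 0.05 * mean;  // assumed 5% tolerance
}

Listing 8: Sketch of the run-until-confident measurement loop.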

The experiments are parametrized by the following three dimensions:

• The number of machines used. To measure the scalability of the system, we gradually increased the number of machines and measured the effect on the throughput. In an ideal case, doubling the number of machines doubles the throughput.

• With/without confirmation messages. Enabling confirmation messages is expected to cause a decrease in performance, because 1) for every received message a confirmation is sent and received, 2) for every received message a check for duplication is performed, and 3) for every sent message it is regularly checked whether a confirmation has already been received or whether a resend is warranted. To demonstrate that we incur a performance hit from the inherently hard problem of verified message delivery in a distributed system, we compare the version of Cloud PARTE utilizing the RemoteSender with a version assuming failure-free communication channels.

• With/without dynamic load balancing. The load balancing scheme introduced in subsection 3.3 is expected to increase efficiency. We measured this by comparing the system with dynamic load balancing to a version of the system where dynamic load balancing was disabled.

The benchmarks we performed to evaluate Cloud PARTE are a combination of benchmarks defined earlier by Renaux et al. [17] for the evaluation of PARTE, and a new benchmark measuring the newly added aspects, namely distribution and dynamic load balancing. The existing benchmarks can be categorized as 1) a number of microbenchmarks employing only pipeline parallelism while conducting simple, complex, or heavyweight tests; 2) a number of microbenchmarks executing multiple of those pipelines in parallel; 3) a microbenchmark executing a tree-shaped Rete network that is mostly unification- and communication-heavy; and 4) a microbenchmark where multiple rules require a lot of processing, but only one rule triggers the benchmark's end, such that non-FIFO scheduling may cause a superlinear improvement in detection latency.

The new benchmark activates the rules in phases, as demonstrated in Figure 5, simulating a scenario such as the one depicted in Figure 6 where a group of tracked cars moves through the view of multiple sensors. In general, a scenario where different parts of the Rete network become computation- and/or memory-intensive during the run time of the CED system is what caused the need for load balancing in Cloud PARTE. Hence, this benchmark validates the ability to adapt to changing workloads.

10 Specifically, an Intel Core i5-3570, a processor with a 64-bit architecture running at a clock frequency of 3.40 GHz, with 6 MB of on-chip cache.

Figure 5: Partitioning of the benchmark into phases: 100,000 events of 20 types are asserted in 10 phases, gradually progressing through these types. Each phase asserts events from three to five different but consecutive event types.

Figure 6: A scenario that could lead to a situation such as depicted in Figure 5: a number of cars move through an area, within view of multiple cameras. As they move, they go out of view of one camera and move into view of another.
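Listing 9 sketches a generator for the phased workload of Figure 5. The event count, type count, and phase count are taken from the paper; the window of exactly three consecutive types per phase and the stride of two are our assumptions within the stated three-to-five range.

#include <array>
#include <cstdio>

int main() {
    const int kEvents = 100000, kPhases = 10, kTypes = 20;
    const int perPhase = kEvents / kPhases;
    std::array<int, kTypes> count{};
    for (int i = 0; i < kEvents; ++i) {
        int phase = i / perPhase;   // 0..9
        int first = 2 * phase;      // window slides by two types per phase
        int type  = first + (i % 3);// three consecutive types per phase
        if (type >= kTypes) type = kTypes - 1;
        ++count[type];              // here: assert the event into the engine
    }
    for (int t = 0; t < kTypes; ++t)
        std::printf("type %2d: %d events\n", t + 1, count[t]);
}

Listing 9: Sketch of a generator for the phased benchmark workload.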

For every configuration, i. e., for every run of the benchmark with a certain set of parameters chosen, we measured one of two variables:

Run time The time between inserting the first event of the benchmark into the system and successfully having processed all of them. In other words, the experiments measure the time it takes for all of the events to percolate through the Rete network.

Benchmark              Run time (ms)                            Slowdown
                       PARTE              Cloud PARTE
101 simple tests           21.7 ±   1.4      391.4 ±    10.0      18.04
501 simple tests           70.6 ±   1.1    2,053.6 ±    53.9      29.10
101 complex tests       3,145.0 ±   4.3   16,810.4 ±   375.0       5.35
. . . with variables    7,302.1 ±   3.0   38,111.1 ± 1,117.2       5.22
10×101 simple tests       212.8 ±   0.6   12,875.7 ±   550.8      60.49
10×8 heavy tests        4,907.8 ±   2.8   11,497.0 ±   580.9       2.34
16 heavy tests            104.4 ±   1.0      379.9 ±     8.1       3.64
32 heavy tests            199.4 ±   1.0      750.6 ±    13.5       3.76
64 heavy tests            389.2 ±   1.0    1,432.3 ±    31.5       3.68
128 heavy tests           772.7 ±   1.8    2,631.9 ±    81.2       3.41
Joining tree               29.0 ±   0.4      173.9 ±     2.5       5.99
Search                 25,526.9 ± 264.3   49,979.1 ± 6,497.7       1.96

Table 1: Comparison of Cloud PARTE's and PARTE's run times, and the 95% confidence intervals of the results.

Throughput The number of events that can be processed by the system per time unit. The throughput is measured indirectly, by dividing the number of events generated during the benchmark by the total run time.

Results are visualized in beanplots [14], a variation on boxplots. The width of a 'bean' at a given height indicates the density of measured results with that run time or throughput. The horizontal lines indicate the arithmetic mean of the results, and the dot marks the median.

4.2 Efficiency

The first experiment we conducted consists of a comparison of Cloud PARTE to PARTE, using the benchmarks created for PARTE, run on one machine. Cloud PARTE was set up with one worker process running on the same machine as the master process. As expected, Cloud PARTE did considerably worse, requiring inter-process communication for I/O and constantly depending on locking access to inboxes, whereas PARTE uses nonblocking data structures and heavily exploits the shared memory architecture. Furthermore, in PARTE benchmarks end when a match is found, whereas in Cloud PARTE a trip to the master process is required first. Still, these benchmarks serve as a useful baseline, demonstrating Cloud PARTE's overhead.

Figure 7a shows the situation for the representative case of the "101 complex tests" benchmark. Cloud PARTE is on average 5.35 times slower than PARTE. The worst situation arose with the "10 times 101 simple tests in parallel" benchmark, an exceptionally computation-light and hence scheduling- and communication-heavy benchmark. As Figure 7b shows, an average slowdown of 60.49 was measured. The best case, where Cloud PARTE is only 1.96 times slower, can be found in the "search" benchmark, whose results are plotted in Figure 7c. For completeness, Table 1 shows the run times of all benchmarks on both systems. Note how PARTE, being tailored towards soft real-time behavior, has much less variation in its run times, even when normalized to absolute run time.

[Figure 7 consists of three beanplots of run time in seconds, each comparing PARTE and Cloud PARTE: (a) the "101 complex tests" benchmark, (b) the "10 × 101 simple tests in parallel" benchmark, and (c) the "search" benchmark.]

Figure 7: A comparison of the run time of PARTE and Cloud PARTE when executing a subset of the benchmarks on a single machine.

4.3 Static Scalability

The second experiment assesses the scalability of Cloud PARTE by measuring the throughput as the number of machines is increased. Two variants of Cloud PARTE are compared, one using confirmation messages and one without.

In this benchmark, a rule duplicates every incoming event to fifteen pipelines conducting sixteen heavy tests each on the event, each generating a complex event on completion, signaling a seventeenth and final rule, which unifies all complex events and ends the benchmark when all events have been processed. As an initial setup, this Rete network is deployed over eight machines and run. Subsequently, the number of machines is halved (to four) by merging two machines' processes into one; this is repeated two more times (for setups with two and one machine). For these tests, dynamic load balancing was disabled.

Table 2 shows the results. The version with confirmation messages performs, as can be expected, worse than the one without them: on eight machines, the extra communication overhead of the confirmation messages nearly halves the speed of the system. The table shows that moving from a setup with one machine to one with two machines does not provide much benefit: in the case of one machine all communication is local, while with two machines (de)serialization and network communication become necessary. However, the throughput nearly doubles when increasing the number of machines from two to four, and from four to eight. While ideal speedup is not achieved, from four machines on, Cloud PARTE achieves a higher throughput than PARTE, demonstrating that an automatically load balanced distribution based on mobile actors enables scaling of inference engines beyond the capabilities of a single machine.

Figure 8: Beanplots indicating the distribution of measurements of the throughput (in messages per second) of PARTE, Cloud PARTE without confirmation messages, and Cloud PARTE with confirmation messages, as the number of machines increases. The green dashed lines indicate ideal speedup.

Number of    PARTE          Cloud PARTE
processes                   no conf.       confirmation
1            266 (1.00x)    131 (0.49x)    141 (0.53x)
2            —              199 (0.75x)    161 (0.60x)
4            —              393 (1.48x)    351 (1.32x)
8            —              792 (2.98x)    407 (1.53x)

Table 2: Median throughput and ratios of the benchmark of Figure 8.

4.4 Dynamic Scalability

In the third experiment, the benefits of dynamic load balancing and elasticity are demonstrated. To this end, three scenarios are compared: first, one in which dynamic load balancing is disabled; second, a scenario that allows dynamic load balancing; and lastly, a configuration in which an additional empty machine is present. In all scenarios, the Rete network is initially distributed over two machines at startup by a hard-coded distribution scheme. The second scenario shows the benefit of dynamic load balancing, whereas the third shows the elasticity provided by the system.

The most relevant benchmark is hence the new 'phased' benchmark introduced in subsection 4.1 and depicted in Figure 5, which consists of twenty rules, one for each event type. The initial distribution of all twenty rules on the first machine corresponds to a reasonable real-life situation, since one may not know in advance how the entities whose events are being pattern matched will behave.

The results plotted in Figure 9 show that both the dynamic load balancing and the elasticity of Cloud PARTE can offer measurable improvements. The median run time for the base case where load balancing is disabled is 110.36 seconds, while the median time for the load balanced version is 98.45 seconds, i. e., 10.8% faster. The addition of a third machine decreases the run time even further to 91.69 seconds, a 16.9% decrease compared to the original.

Figure 9: Comparison of run time with manual distribution of the Rete network over 2 machines and (a) no load balancing, (b) load balancing, (c) load balancing and an extra third machine. The blue +s indicate outliers.

By logging the migrations triggered by the load balancer, we confirmed that the performance increase indeed correlates with the correct migration of overloaded nodes. The heuristics used by the load balancing algorithm described in subsection 3.3 are little more than a proof of concept, and improving them is considered future work. Still, the algorithm tends to move nodes around such that the active nodes of the current phase are balanced over the machines. Occasionally, it moves too few or too many nodes, since the number of nodes that have already moved is simply not part of the equation the load balancer uses. In those cases, the run time peaks.

4.5 Discussion

We conclude from our performance evaluation that mobile actors, as employed in Cloud PARTE, succeed in offering an improvement over the state of the art in complex event detection. While distribution obviously introduces an overhead, this overhead can be overcome by increasing the number of machines. An additional benefit is gained from load balancing, even with the simple heuristics and a straightforward migration system that only allows one node to be on the move at a time and uses the suboptimal JSON format for serialization.

The elasticity of Cloud PARTE effectively manages to provide 'free' additional speedup by simply plugging in another machine. Currently, all machines have to be present at startup time of Cloud PARTE, though this is but a technical issue: there is no conceptual limitation prohibiting machines from joining when the system is already running. Even support for unloading machines and subsequently retiring them from the system is just some engineering effort away.

A number of open problems remain, which require more research to be resolved. These are listed in section 6.

5. Related Work

5.1 Parallel and Distributed Rete

The Rete algorithm is widely used, and several parallel and distributed variants exist already.

The parallelization approach taken by Gupta et al. [10] is similar to ours: every node of the Rete network is considered a 'task' and is given an internal thread of control. Based on the data-flow-like organization of the Rete network, multiple tokens will be flowing through the network simultaneously, and as a result multiple nodes can be active at the same time. Their approach is more general, as it is not limited to event processing, but they therefore have to forgo conceptual optimizations that rely on temporal reasoning: e. g., in Cloud PARTE events expire automatically, whereas in their approach facts need to be explicitly retracted.

The approach of Kelly and Seviora [15] increases the granularity of the parallelization: instead of considering individual Rete nodes as tasks, single token-token comparisons are parallelized. Thus, nodes of the Rete network are split into several copies, one copy for every token in their associated memory node. Each of these copies is a separate task. The level of parallelism increases, but the amount of communication increases as well; as such, this approach is suitable for shared memory systems but not for distributed systems.

A different approach to distributing the Rete algorithm is taken by Aref and Tayyib [2]: in their Lana-Match algorithm, the Rete network is duplicated among a number of worker processors. Changes to the working memory made by the workers are synchronized by a single, centralized master, using an optimistic algorithm which backtracks in case of conflicting updates. The optimistic algorithm assumes each rule will only change a very small part of the fact base and that conflicts are rare. Furthermore, in Lana-Match the centralized master processor will be processing all changes to the working memory, and can therefore become a bottleneck, while in Cloud PARTE the master only handles incoming data and the working memory is spread over the workers. In Cloud PARTE, the event base is spread over the worker machines, but appears as a single unit.

Lastly, yet other systems use a hierarchical blackboard, for example the Parallel Real-time Artificial Intelligence System (PRAIS) by Goldstein [9] and the Hierarchically Organized Parallel Expert System (HOPES) by Dai et al. [3]. In these systems, each knowledge source (KS) contains a rule, and multiple KSs can simultaneously process information. This information is shared through a blackboard, a global database that contains all information about the problem currently in the system. KSs incrementally update the blackboard, leading to a solution to the problem. This approach is more coarse-grained than Cloud PARTE, leading to a lower level of parallelism. These systems also do not focus on scalability and elasticity.

5.2 Mobile Actors and Dynamic Load Balancing

The use of mobile actors as a mechanism for load balancing exists in other systems. For instance, Lange et al. [16] introduce "Aglets", a Java implementation of mobile actors (also called mobile agents). Fuggetta et al. [7] describe a distributed information system, in which information is dispersed throughout a network and code moves to be 'closer' to the data it uses. In these systems, the actors are proactive: they decide when to move and to where, for instance when it is impossible or unfeasible to move the information (e. g., because of its huge size). In our system, movements are reactive: the load balancer decides when an actor should move, to provide each actor with the computational resources it needs.

SALSA [18] is an actor-based language in which actors are distributed over multiple machines. Actors are serialized for migration, as in our system, and references between actors remain valid through the use of a universal naming scheme; this, too, is similar to our system. In SALSA, the code of a universal actor needs to be present on every machine; in our system this is partly the case: the code to create any of the five primitive node types of the Rete algorithm needs to be present on each machine, but the code supplied by the user as part of a rule is migrated along with the actor that contains it (and is reparsed when the actor is resumed). Desell et al. [4] describe how load balancing is added to SALSA. A major difference between Cloud PARTE and SALSA is that, in Cloud PARTE, the load balancing is centralized (a master machine gathers load information and makes load balancing decisions), while SALSA supports several decentralized load balancing techniques. In these techniques, lightly loaded machines attempt to steal work from other machines. These ideas are applicable to Cloud PARTE and will be investigated in the future.

Lastly, in the blackboard systems discussed in the previous subsection (PRAIS [9] and HOPES [3]), each actor contains one rule, while in Cloud PARTE each actor contains only a part of a rule (one node of the Rete network). As a result, load balancing in these blackboard systems would consist of moving a complete rule, while in Cloud PARTE parts of rules can be executed on different machines and move independently. Because of this finer granularity, Cloud PARTE allows a higher degree of parallelism and greater load balancing flexibility.

6. Conclusions and Future Work

Cloud PARTE is a complex event detection system that distributes the Rete algorithm over multiple machines, using mobile actors in order to respond to changing work loads. It is designed for applications such as traffic monitoring or crowd management, where large amounts of data need to be processed and it is therefore necessary to use multiple machines. Cloud PARTE does not require programmer effort in the distribution of rules, but despite the transparency of the load balancing and elasticity, it retains the semantics of a single unified Rete network.

Variations in the work load make dynamic load balancing necessary. Cloud PARTE uses mobile actors, i. e., actors that can migrate between machines at run time, to provide load balancing. This also enables elasticity: a machine can be added at run time and the load will be redistributed.

Based on a set of micro- and synthetic benchmarks, we show that Cloud PARTE is a scalable and elastic system that can handle increasing amounts of work, and that dynamic load balancing provides an additional benefit in distributing the work evenly over the available machines.

Future work will address the large communication overheads and improve the load balancing algorithm to take more factors into account, such as putting closely connected Rete nodes on the same machine, or estimating the cost and potential gain of a migration. Furthermore, migration currently is a cooperative effort, requiring both the migrating actor and its predecessors to work in close collaboration. Future research should focus on decoupling the actors, up to the point of adding fault tolerance to the system, such that failure of individual machines can be overcome without undermining the entire system. Finally, we envision extensions to the expressiveness of the querying system in the form of dynamic queries of the knowledge base, and support for negation and existential quantification in the rule language.

References

[1] S. S. Adams, S. Bhattacharya, B. Friedlander, J. Gerken, D. Kimelman, J. Kraemer, H. Ossher, J. Richards, D. Ungar, and M. Wegman. Enterprise context: A rich source of requirements for context-oriented programming. In Proceedings of the 5th International Workshop on Context-Oriented Programming, COP '13, pages 3:1–3:7, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2040-5.

[2] M. M. Aref and M. A. Tayyib. Lana-Match algorithm: a parallel version of the Rete-Match algorithm. Parallel Computing, 24(5-6):763–775, June 1998. ISSN 0167-8191.

[3] H. Dai, T. Anderson, and F. Monds. On the implementation issues of a parallel expert system. Information and Software Technology, 34(11):739–755, November 1992. ISSN 0950-5849.

[4] T. Desell, K. E. Maghraoui, and C. Varela. Load balancing of autonomous actors over dynamic networks. In Proceedings of the 37th Annual Hawaii International Conference on System Sciences, number 9 in HICSS '04, pages 1–10, 2004. ISBN 0769520561.

[5] A. Fernández Díaz, C. Benac Earle, and L.-Å. Fredlund. Adding distribution and fault tolerance to Jason. In Proceedings of the 2nd edition on Programming Systems, Languages and Applications based on Actors, Agents, and Decentralized Control Abstractions, AGERE! '12, pages 95–106, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1630-9.

[6] C. L. Forgy. Rete: A fast algorithm for the many pattern/many object pattern match problem. Artificial Intelligence, 19(1):17–37, 1982. ISSN 0004-3702.

[7] A. Fuggetta, G. Picco, and G. Vigna. Understanding code mobility. IEEE Transactions on Software Engineering, 24(5):342–361, May 1998. ISSN 0098-5589.

[8] A. Georges, D. Buytaert, and L. Eeckhout. Statistically rigorous Java performance evaluation. In Proceedings of the 22nd Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems and Applications, OOPSLA '07, pages 57–76. ACM, 2007. ISBN 978-1-59593-786-5.

[9] D. Goldstein. Extensions to the Parallel Real-Time Artificial Intelligence System (PRAIS) for fault-tolerant heterogeneous cycle-stealing reasoning. In Proceedings of the 2nd Annual CLIPS ('C' Language Integrated Production System) Conference, NASA Johnson Space Center, pages 287–293, Houston, September 1991.

[10] A. Gupta, C. Forgy, A. Newell, and R. Wedig. Parallel algorithms and architectures for rule-based systems. In Proceedings of the 13th Annual International Symposium on Computer Architecture, ISCA '86, pages 28–37. IEEE Computer Society Press, 1986. ISBN 0-8186-0719-X.

[11] L. Hoste, B. Dumas, and B. Signer. Mudra: A unified multimodal interaction framework. In Proceedings of ICMI 2011, 13th International Conference on Multimodal Interaction, Alicante, Spain, November 2011.

[12] L. Hoste, B. De Rooms, and B. Signer. Declarative gesture spotting using inferred and refined control points. In Proceedings of ICPRAM 2013, 2nd International Conference on Pattern Recognition Applications and Methods, Barcelona, Spain, February 2013.

[13] M. W. Kadous. Learning comprehensible descriptions of multivariate time series. In Proceedings of ICML 1999, Bled, Slovenia, June 1999.

[14] P. Kampstra. Beanplot: A boxplot alternative for visual comparison of distributions. Journal of Statistical Software, Code Snippets, 28(1):1–9, October 2008. ISSN 1548-7660.

[15] M. A. Kelly and R. E. Seviora. A multiprocessor architecture for production system matching. In Proceedings of the National Conference on Artificial Intelligence, pages 36–41, 1987.

[16] D. Lange, M. Oshima, G. Karjoth, and K. Kosaka. Aglets: Programming mobile agents in Java. In Proceedings of the International Conference on Worldwide Computing and Its Applications, pages 253–266, 1997.

[17] T. Renaux, L. Hoste, S. Marr, and W. De Meuter. Parallel gesture recognition with soft real-time guarantees. In Proceedings of the 2nd edition on Programming Systems, Languages and Applications based on Actors, Agents, and Decentralized Control Abstractions, SPLASH '12 Workshops, pages 35–46, New York, NY, USA, October 2012. ACM. ISBN 978-1-4503-1630-9.

[18] C. Varela and G. Agha. Programming dynamically reconfigurable open systems with SALSA. ACM SIGPLAN Notices, 36(12):20, December 2001. ISSN 0362-1340.

[19] M. J. Zaki, W. Li, and S. Parthasarathy. Customized dynamic load balancing for a network of workstations. Technical report, The University of Rochester, Rochester, New York, USA, 1995.