Topology aware gossip overlays


Technical Report RT/XX/2007

Topology Aware Gossip Overlays∗

João Leitão
INESC-ID/IST
[email protected]

José Pereira
University of Minho
[email protected]

Luís Rodrigues
INESC-ID/IST
[email protected]

December 2007

Abstract

Gossip, or epidemic, protocols have emerged as a highly scalable and resilient approach to implement several application level services such as reliable multicast, data aggregation, publish-subscribe, among others. All these protocols organize nodes in an unstructured random overlay network. In many cases, it is interesting to bias the random overlay such that it better matches the underlying network topology, for instance to reduce the stretch in the overlay routing.

In this paper we propose a new scheme that allows an unstructured gossip overlay network to bias its topology according to some efficiency metric, for instance to better match the topology of the underlying network. Our scheme is completely decentralized and, unlike previous approaches, preserves a number of interesting properties of the original (non-biased) overlay, such as the node degree, small diameter, and low clustering.


1. Introduction

Gossip, or epidemic, protocols have emerged as a highly scalable and resilient approach to implement several application level services such as reliable multicast [2, 18, 6, 9, 21], data aggregation [13, 16], publish-subscribe [5], among others [26, 10, 22]. A gossip protocol operates as follows: in order to broadcast a message, a node selects t nodes at random from the system (t is a configuration parameter called fanout) and sends the message to them. Upon receiving a message for the first time, each node simply repeats this procedure.
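The procedure above can be illustrated with a minimal simulation sketch. It assumes full membership and reliable links, and the names gossip_broadcast and delivered_percentage are hypothetical helpers introduced only for this illustration; they are not part of any protocol discussed in this paper.

```python
import random

def gossip_broadcast(n, fanout, source=0, seed=None):
    """Simulate one gossip broadcast among n nodes with full membership.

    Returns the set of nodes that received the message at least once.
    """
    rng = random.Random(seed)
    delivered = {source}
    pending = [source]  # nodes that still have to forward the message
    while pending:
        node = pending.pop()
        others = [p for p in range(n) if p != node]
        # Each node forwards to `fanout` peers chosen uniformly at random.
        for target in rng.sample(others, min(fanout, len(others))):
            if target not in delivered:  # only a first reception triggers forwarding
                delivered.add(target)
                pending.append(target)
    return delivered

def delivered_percentage(delivered, n):
    """Percentage of the n nodes that received the message."""
    return 100.0 * len(delivered) / n
```

For instance, delivered_percentage(gossip_broadcast(1000, 4, seed=1), 1000) gives the percentage of nodes reached in one simulated run; varying the fanout shows how the coverage changes.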

This approach has several advantages: i) it is simple to implement; ii) it shares the load evenly across all nodes in the system, making gossip protocols highly scalable (in fact, the load imposed on each node only has to grow logarithmically with the size of the system in order to ensure atomic broadcast with high probability [2, 7]); and finally, iii) its inherent redundancy makes gossip protocols highly resilient to node and link failures.

Some gossip protocols have been designed to operate with full membership information [4, 2], by maintaining locally at each node a list with the identifier¹ of every other node in the system. However, such an approach is not scalable, not only due to the large size of the membership list but mainly due to the cost of keeping such a large amount of information up-to-date in dynamic systems. For better scalability, nodes may rely on a peer sampling service [12], which is a membership protocol that operates with the goal of maintaining locally, at each node, a small random subset (called a partial view) of the full membership list; in this case, nodes use their local partial views to select peers with whom they exchange messages.

Partial views establish neighboring associations among nodes that define an overlay network which can be used for data gossip. Typically, the peer sampling service aims at maintaining a random partial view of the system [8, 28, 21], which should ensure that a selection of peers from local partial views is equivalent to a random selection of peers across the full membership. Therefore, the resulting overlay has a random unstructured topology. Although this randomness is desirable, it prevents the underlying network topology from being taken into consideration by the peer sampling service, which usually leads to scenarios where many overlay links are suboptimal with regard to a given network efficiency criterion such as bandwidth or latency. Unfortunately, the inefficiency of the overlay has a direct negative impact on the performance of the applications that use the overlay (such as an application level reliable broadcast service). This has been recognized as a relevant research topic for gossip-based protocols [1].

∗This work was partially supported by project "P-SON: Probabilistically Structured Overlay Networks" (POSC/EIA/60941/2004).

¹Typically, an identifier is a tuple (ip, port) that allows a node to be reached.

In this paper, we propose a new scheme that allows an unstructured overlay network to bias its topology according to some given efficiency criterion, for instance, to better match the topology of the underlying network. Still, this bias is applied without compromising key properties of random overlay networks (such as the node degree, small diameter, and low clustering), which are essential to ensure the efficiency and reliability of applications, including gossip-based broadcast protocols.

The basis for our proposal is the combined use of two distinct partial views. In [21] we presented the HyParView protocol, a membership protocol that illustrated how to achieve a high resilience to faults (as high as 80% of simultaneous node failures) in a gossip-based broadcast protocol using a low fanout value, by combining a small sized active view and a larger passive view. In this paper, we show how the same fundamental architecture can be used to optimize the overlay network used for message dissemination.

Our scheme is completely decentralized and, unlike prior work, it presents the following set of characteristics: i) it operates only with local information and, in each optimization round, the coordination among nodes is limited to at most 4 nodes in the system; moreover, nodes have no a priori knowledge concerning their desired locations in the final topology; ii) it strives to preserve the degree of the nodes that participate in an optimization round, which contributes to preserving the connectivity of the overlay; iii) every modification performed in the overlay ensures that the global cost of the overlay decreases; this is feasible because we do not run the risk of getting trapped in local cost minima; iv) it preserves a number of interesting properties of the overlay, such as a low clustering coefficient and a low overlay diameter; v) it is highly flexible, as we rely on a companion oracle to estimate the link cost and, therefore, our algorithm can bias the network according to different criteria just by using the appropriate oracle; finally, vi) it does not require the oracle to be precise (i.e., oracles can be unreliable).

The rest of this paper is organized as follows. In Section 2 we give some detail concerning relevant aspects of unstructured overlay networks which are at the core of our proposal, and give a brief overview of the HyParView protocol. Section 3 introduces our topology aware adaptive protocol, explaining the rationale for our architecture and the proposed algorithm. Section 4 presents experimental work that evaluates our approach both in steady state and in scenarios where large percentages of nodes can fail simultaneously. Finally, Section 5 presents related work and Section 6 concludes the paper and gives some directions for future work.

2. Unstructured Overlay Networks

In this section we enumerate several aspects of unstructured overlay networks and their applications for reliable broadcast. At the end of the section, for self-containment, we present an overview of the HyParView protocol, which serves as the basis for our topology aware protocol to be described in Section 3.

2.1. Requirements

In order to support fast message dissemination and a high level of resilience to node failures, the overlay networks defined by the partial views must have several properties². Some of the most important properties are the following:

Connectivity: The overlay is connected if there is at least one path that allows every node to reach every other node in the overlay. The overlay should remain connected despite failures that might occur. If this requirement is not met, isolated nodes will not receive broadcast messages.

Degree Distribution: The degree of a node is the number of edges of a node or, in other words, the number of neighbors that a given node has³. The degree of a node is both a measure of its reachability on the overlay and a measure of its contribution to keeping the overlay connected. If the probability of failure is uniformly distributed in the node space, for improved fault-tolerance all nodes should have the same degree value. Nodes that have a small degree will become more easily disconnected from the overlay as the number of faults increases, and the failure of nodes with high degree may have an undesired impact on the overall connectivity of the overlay.

²The reader should note that some of these properties are intrinsically related to graph properties, as an overlay network can be seen as a graph, where nodes are represented by vertices, and links, or neighboring relations, are represented by edges. Depending on the nature of these relations, graphs can be directed or undirected.

³To be precise, partial views usually establish asymmetric neighboring relations, therefore the degree is viewed as two distinct components: in-degree and out-degree. However, because HyParView has symmetric active views, we do not consider these components as being distinct.

Average Path Length: A path between two nodes in the overlay is the set of edges that a message has to cross to go from one node to the other. We define the average path length as the average of all shortest paths between all pairs of nodes in the overlay. This is closely related to the overlay diameter. To promote the overlay efficiency when broadcasting messages, the average path length between nodes should be as small as possible. Notice that large values of average path length have two negative implications: i) the number of hops required for messages to reach all nodes increases, with a negative impact on the broadcast latency, and ii) the broadcast process becomes more prone to failures, as the time window for failures increases (i.e. the number of steps required to fully disseminate a message increases).

Clustering Coefficient: The clustering coefficient of a node is the number of edges between that node's neighbors divided by the maximum possible number of edges across those neighbors. This metric indicates the density of neighbor relations across the neighbors of a given node, and its value lies between 0 and 1. The clustering coefficient of a graph is the average of the clustering coefficients across all nodes. The clustering coefficient of a graph should be as small as possible, and failure to meet this requirement also has several negative implications: i) the number of redundant messages received by nodes when disseminating data increases, especially in the first steps of the dissemination process; ii) the diameter of the overlay increases, which in turn makes the average path length increase; and finally iii) it decreases the fault resilience of the overlay, as areas of the overlay which exhibit high values of clustering will more easily become disconnected.

Average Link Cost: The link cost captures the target optimization metric and its value is provided by an oracle available at its edge nodes. We assume that the link cost is inversely proportional to the utility of the link to the performance of the overlay. In other words, we assume that a low cost link is desired over a high cost link. The average link cost of an overlay is the average cost of all links that form the overlay. The link cost can be associated with a concrete (underlay) network metric such as link latency. However, the link cost can also be associated with some higher level metric; for instance, in a file sharing peer-to-peer system it could be a measure of the semantic similarity between the data stored at the edges of a link.

The interested reader can find a more detailed discussion of these and other properties of random overlays in [21] and [19].
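As an illustration of the definitions in this section, the following sketch computes the clustering coefficient, the average path length, and the average link cost of a small undirected overlay. The representation (a dictionary mapping each node to the set of its neighbors) and all function names are assumptions of this sketch; the cost function is assumed to be symmetric and the overlay is assumed to be connected, with at least two nodes and no self-loops.

```python
from collections import deque
from itertools import combinations

def clustering_coefficient(graph, node):
    """Edges among the neighbors of `node` over the maximum possible number."""
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbors, 2) if b in graph[a])
    return links / (k * (k - 1) / 2)

def graph_clustering(graph):
    """Average of the clustering coefficients across all nodes."""
    return sum(clustering_coefficient(graph, n) for n in graph) / len(graph)

def average_path_length(graph):
    """Average hop count of the shortest paths between all reachable pairs."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

def average_link_cost(graph, cost):
    """Average of cost(u, v) over all undirected links (cost must be symmetric)."""
    edges = {frozenset((u, v)) for u in graph for v in graph[u]}
    return sum(cost(*edge) for edge in edges) / len(edges)
```

For example, in the overlay {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}} nodes 0 and 1 have a clustering coefficient of 1, node 2 has 1/3, and node 3 has 0.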

2.2. Metrics

Several metrics can be used to measure the performance of a gossip-based broadcast protocol operating on top of a random overlay network. In this paper we are mainly concerned with offering dependability for applications requiring reliable broadcast. Therefore, we focus on the offered reliability, which has the following definition:

Reliability: Gossip reliability is defined as the percentage of correct nodes that deliver a given broadcast message. A reliability of 100% means that the protocol was able to deliver a given message to all active nodes or, in other words, that the message resulted in an atomic broadcast as defined in [18].

2.3. HyParView

The Hybrid Partial View membership protocol, or simply HyParView [21, 19], is a fully decentralized membership protocol that builds and maintains an unstructured (random) overlay network. The protocol was designed to support highly efficient and reliable application level broadcast protocols, with special interest in scenarios where large percentages of nodes may fail simultaneously.

Unlike other membership protocols, HyParView maintains two distinct partial views which are maintained using different strategies and for different purposes⁴.

A small symmetric active view with (fanout+1) size, which is used mainly to disseminate broadcast messages, is maintained using a reactive strategy, which means that these partial views only change in response to some external event that affects the overlay (e.g. a node joining or leaving). A TCP connection is maintained to each neighbor in these partial views, which allows the selection of a smaller fanout by assuming that the links do not omit messages. Moreover, TCP is used as an unreliable failure detector, which facilitates the implementation of the reactive maintenance strategy.

⁴In fact, the protocol is said to be Hybrid because it combines these different strategies.

Each node also maintains a larger passive view, usually k times larger than the active view, where k is a constant that is related to the fault tolerance level of the protocol. The passive view is maintained by a cyclic strategy: periodically, each node performs a shuffle operation with one random node in the overlay that results in an update of its passive view. This partial view is used for fault tolerance, as it works like a backup list of nodes that are used in attempts to fill the active view when some of the nodes in it are suspected of having failed.

HyParView was conceived to support a flooding broadcast strategy. This strategy is feasible and effective because the active views are very small. Furthermore, a flooding strategy speeds up the failure detection mechanism of TCP, as every node in the overlay is tested with each broadcast message. In the current paper, we focus on the impact of adaptive topology aware overlay construction on the performance of the flooding broadcast strategy. Still, the reader should note that HyParView can also be used to support tree-based broadcast schemes, as described in [20]; such tree-based broadcast schemes would also benefit from the topology aware overlay construction proposed in this paper.

3. Topology Aware Gossip Overlays

3.1. Architecture

We assume that each node maintains two distinct and disjoint partial views, similarly to the HyParView protocol, where one is a small sized symmetric active view and the other a larger cyclic passive view. As in HyParView, the active view is used mainly for communication among peers and, for the same reasons, TCP connections are maintained between neighbors in this view.

We also assume that all nodes have access to a local oracle. Oracles are components that export a getCost(Peer p) method, which returns the link cost between the invoking node and the given target node p in the system (since there is a single link to each neighbor, in the paper we use link cost and node cost interchangeably when referring to the output of the oracle). The implementation of the oracle is outside the scope of this paper. However, previous work [14, 15] addresses the use of inexpensive oracles for calculating neighbor proximity based on IP-based clustering (for instance, using a match of common IP prefixes to calculate a measure of proximity between two peers). The reader should notice that the oracles are not required to be perfect, in the sense that the provided costs do not need to be 100% accurate. Later in the paper, in Section 4.3, we study the impact of unreliable oracles on our approach.
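To make the oracle abstraction concrete, the sketch below shows one possible oracle based on the length of the common IP prefix, in the spirit of the IP-based clustering mentioned above. It is only an illustration under stated assumptions: the class name PrefixOracle, the restriction to IPv4, and the choice of returning the number of differing low-order bits as the cost are all hypothetical and are not taken from [14, 15]. The method is named get_cost to follow Python conventions and plays the role of getCost(Peer p).

```python
import ipaddress

class PrefixOracle:
    """Hypothetical oracle: the longer the shared IPv4 prefix, the lower the cost."""

    def __init__(self, own_ip):
        self.own = int(ipaddress.IPv4Address(own_ip))

    def get_cost(self, peer_ip):
        """Return a cost in [0, 32]: 32 minus the length of the common prefix."""
        peer = int(ipaddress.IPv4Address(peer_ip))
        # The position of the highest differing bit gives the number of
        # bits that are NOT part of the common prefix.
        return (self.own ^ peer).bit_length()
```

With this sketch, PrefixOracle("10.0.0.1").get_cost("10.0.0.2") is 2 (the addresses share a 30-bit prefix), while a peer in 192.168.0.0/16 has cost 32. Note that this cost is symmetric, which matches the assumption made later in the paper.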

3.2. Rationale

The rationale of our approach is as follows. As in HyParView, we maintain a small active view and a larger passive view. However, unlike HyParView, our topology aware adaptive protocol continuously attempts to improve the overlay defined by the active views according to some efficiency metric embedded in the external oracle. Periodically, each node starts optimization rounds in which it attempts to switch one member of its active view with one (better) neighbor from its passive view. In the optimization protocol, a node uses its local oracle to obtain an estimate of the link cost to some randomly selected peers of its passive view. The number of nodes scanned in each set of optimization rounds is a protocol parameter denoted Passive Scan Length, or simply PSL. This parameter limits the maximum number of optimization rounds started by each node each time it runs the optimization protocol.

As in the original HyParView protocol, the passive view is not biased. The reader should notice that the passive view is continuously updated during the system operation, so that it reflects the changes in the global membership (e.g. nodes that leave the system are eventually purged from all passive views, and nodes that join the system eventually appear in some of the passive views). Therefore, passive views are a continuous source of potential nodes to be upgraded to the active view, in order to impose the desired bias on the overlay topology.

Our scheme strives to preserve the connectivity of the overlay. This has two implications: i) nodes only make an effort to optimize their active views when they have a full active view (i.e., no bias is applied to active views until the connectivity of the nodes is ensured); furthermore, each node attempts to maintain some unbiased neighbors, as we explain in the next section; ii) we try to preserve the degree of nodes that participate in an optimization procedure, given that the node degree has a significant impact on the connectivity of the overlay. To ensure this, each optimization round involves 4 nodes in the system, as we will detail later in the text.


Algorithm 1. Improving Procedure

Data:
    activeView   // fixed size sorted list
    passiveView  // fixed size list

 1: every ∆T do
 2:     if isfull(activeView) then
 3:         candidates ← randomSample(passiveView, PSL)
 4:         for i := UN ; i < sizeof(activeView) ; i := i + 1
 5:             o ← activeView[i]
 6:             while candidates ≠ {} do
 7:                 c ← removeFirst(candidates)
 8:                 if isBetter(o, c) then
 9:                     Send(OPTIMIZATION, c, o, myself)
10:                     break

11: upon Receive(OPTIMIZATIONREPLY, answer, o, d, c) do
12:     if answer then
13:         if o ∈ activeView then
14:             if d ≠ null then
15:                 Send(DISCONNECTWAIT, o, myself)
16:             else
17:                 Send(DISCONNECT, o, myself)
18:             activeView ← activeView \ {o}
19:         passiveView ← passiveView \ {c}
20:         activeView ← activeView ∪ {c}

21: upon Receive(OPTIMIZATION, o, peer) do
22:     if !isfull(activeView) then
23:         activeView ← activeView ∪ {peer}
24:         Send(OPTIMIZATIONREPLY, true, o, null, myself)
25:     else
26:         d ← activeView[UN]
27:         Send(REPLACE, d, o, peer, myself)

28: upon Receive(REPLACEREPLY, answer, i, o, d) do
29:     if answer then
30:         activeView ← activeView \ {d}
31:         activeView ← activeView ∪ {i}
32:     Send(OPTIMIZATIONREPLY, answer, o, d, myself)

33: upon Receive(REPLACE, o, i, c) do
34:     if !isBetter(c, o) then
35:         Send(REPLACEREPLY, c, false, i, o, myself)
36:     else
37:         Send(SWITCH, o, i, c, myself)

38: upon Receive(SWITCHREPLY, answer, i, c, o) do
39:     if answer then
40:         activeView ← activeView \ {c}
41:         activeView ← activeView ∪ {o}
42:     Send(REPLACEREPLY, answer, i, myself)

43: upon Receive(SWITCH, i, c, d) do
44:     if i ∈ activeView or received(DISCONNECTWAIT from i) then
45:         Send(DISCONNECTWAIT, i, myself)
46:         activeView ← activeView \ {i}
47:         activeView ← activeView ∪ {d}
48:     Send(SWITCHREPLY, answer, d, i, c, myself)

49: isBetter(old, new)
50:     return Oracle.getCost(old) > Oracle.getCost(new)

3.2.1 Unbiased Neighbors

By blindly imposing a bias on the topology of the overlay, one easily breaks some of the key desirable properties of a random overlay, such as the low clustering coefficient and low average path length. The negative effect of such a bias can be even more pronounced in an architecture such as ours, which relies on small active views. To prevent this flaw, we do not apply the bias to all members of the active view. Instead, each node should maintain both "high-cost" (unbiased) and "low-cost" (biased) neighbors. The number of "high-cost" neighbors each node keeps is a protocol parameter called Unbiased Neighbors, or simply UN.

Unfortunately, it is not trivial to decide which peers have a "high-cost", given that nodes are not expected to have global knowledge of the system, not only regarding membership information but also regarding global metrics, such as the average link cost in the overlay. To circumvent this limitation, we keep the active view of each node sorted by link cost (the first element of each active view is the neighbor with the largest link cost). Therefore, a node never attempts to apply the bias to the first UN members of its active view.

Results reported in [27] indicate that, in order to ensure a low average path length, each node may only be required to maintain 1 "high-cost" neighbor. In this paper, we also evaluate the effect of the UN parameter on other key properties, such as the clustering coefficient of the overlay.

3.3. Algorithm

The algorithm executed at each optimization round is depicted in Algorithm 1. The reader should notice that the algorithm presented has been simplified for clarity; for instance, we omitted some insertions of nodes into passive views and the mechanisms required to ensure the symmetry of active views.

As stated before, an optimization round usually involves 4 nodes of the system, and each round is composed of 4 steps, one for each node that participates in the optimization. Moreover, we use the following definitions to identify each of the participating nodes:

Node i (initiator): is the node that starts the optimization round.


Node o (old): is a node from i’s active view which is replaced during the optimization process.

Node c (candidate): is a node from i’s passive view which is upgraded to the active view.

Node d (disconnected): is the node in the candidate’s active view which it disconnects in order to be able to accept i.

3.3.1 Step 1

Step 1 is executed at node i (Algorithm 1, lines 1−10) and its purpose is to contact one, or more, potential candidates to participate in a set of optimization rounds⁵.

This step starts with the random selection of, at most, PSL nodes from i's passive view. This random sample is a set of candidates for executing the optimization round. To check if a target node is a suitable candidate, i iterates over its active view, consulting the oracle to compare the cost of its neighbors with the cost of the target (Algorithm 1, lines 49−50). When a suitable candidate c is found, which presents a possibility for improving on a given neighbor o, i sends to c an OPTIMIZATION message, stating its interest in exchanging o for c in its active view. The reception of that message will trigger the execution of Step 2 in node c.
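The candidate selection of Step 1 can be sketched as follows. This is a simplified, single-process rendering of lines 1−10 of Algorithm 1 in which, instead of sending OPTIMIZATION messages, the function returns the (o, c) pairs for which a message would be sent. The function name start_optimization and its parameters (active_size for the full active view size, un for UN, psl for PSL, get_cost standing in for the oracle) are assumptions of this sketch.

```python
import random

def start_optimization(active_view, passive_view, get_cost,
                       active_size, un, psl, rng=random):
    """Return the (old, candidate) pairs that node i would propose."""
    # No bias is applied until the active view is full.
    if len(active_view) < active_size or not passive_view:
        return []
    # The active view is kept sorted by decreasing link cost, so the
    # first `un` entries are the protected high-cost neighbors.
    active = sorted(active_view, key=get_cost, reverse=True)
    candidates = rng.sample(list(passive_view), min(psl, len(passive_view)))
    proposals = []
    for old in active[un:]:
        while candidates:
            c = candidates.pop(0)
            if get_cost(old) > get_cost(c):  # isBetter(old, c)
                proposals.append((old, c))   # Send(OPTIMIZATION, c, old, myself)
                break
    return proposals
```

As in Algorithm 1, each candidate is examined at most once, so a single execution proposes at most PSL exchanges.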

This step ends with the reception of an OPTIMIZATIONREPLY message from node c (Algorithm 1, lines 11−20), or with the suspicion of failure of node c. If node c accepts the exchange, then i will add c to the active view. If o is still in its active view⁶, i will send a DISCONNECTWAIT or DISCONNECT message to o. The difference between these messages is simple: DISCONNECT only removes the sender from the active view (as described in [21]), while DISCONNECTWAIT also notifies the node that it should keep that slot in the active view free (until an internal timeout expires), as it will be used in step 4. Node i chooses which message to send based on information received from c, specifically, whether c had to remove some node from its active view in order to insert i.

3.3.2 Step 2

Step 2 is initiated at node c with the reception of an OPTIMIZATION message from node i (Algorithm 1, lines 21−27) and ends when c replies to i with an OPTIMIZATIONREPLY message.

If c does not have a full active view, it immediately replies to i by sending an OPTIMIZATIONREPLY message accepting the exchange, and notifying i that no other node was involved in the optimization. In this case i will disconnect itself from o and insert c in its active view. Note that in this particular scenario our algorithm does not preserve the degree of node o, although it preserves the number of links in the overlay. However, this is an uncommon scenario given that, usually, more than 99% of nodes in the system have full active views.

On the other hand, if c has a full active view, c has to select some neighbor d from its active view to exchange for i. Therefore c sends to d a REPLACE message, stating its desire to remove d from its active view; this message also indicates to d that it can connect to o in exchange. The REPLACE message also carries the identification of the initiator of the optimization procedure. In order to promote the decrease of the overlay's global cost, the selection of d is deterministic: d is the neighbor of c with the highest cost (excluding, naturally, the first UN protected members).
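The deterministic selection of d can be sketched in a couple of lines, assuming the active view is sorted by decreasing link cost so that the first UN positions hold the protected neighbors. The function name select_disconnect is hypothetical, and returning None when every neighbor is protected is a choice of this sketch rather than something prescribed by the protocol.

```python
def select_disconnect(active_view, get_cost, un):
    """Return the highest-cost neighbor that is not among the first `un`."""
    ordered = sorted(active_view, key=get_cost, reverse=True)
    # Positions 0 .. un-1 are the protected (unbiased) neighbors.
    return ordered[un] if un < len(ordered) else None
```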

In the latter case, to conclude this step, c has to receive a REPLACEREPLY message from node d (Algorithm 1, lines 28−32) or suspect that d has failed (in which case, node c acts as if it had a free slot in the active view from the beginning of this step). If d accepts the exchange, c will remove d from its active view and replace it with i. If d declines the exchange, c does not change its own views. In either case, c will notify i of d's answer using the OPTIMIZATIONREPLY message.

3.3.3 Step 3

This step begins with the reception at node d of a REPLACE message (Algorithm 1, lines 33−37) and ends when a REPLACEREPLY message is sent back to node c.

A REPLACE message explicitly requests node d to exchange node c with node o in its active view. Node d consults the oracle to assess if o has a lower cost than c. Note that our algorithm only requires 2 of the 4 nodes involved in an optimization round to consult the oracle in order to assess the merit of the proposed peer exchange. This is enough because we assume that link costs are symmetric and, effectively, we are exchanging 2 existing links in the overlay for 2 others: the link between i and o is exchanged for a link between i and c, and the link between d and c for a link between d and o. Therefore, only nodes i and d need to assess the gain resulting from these exchanges.

⁵More than one optimization round might be triggered in the context of this step.

⁶Note that o might have already disconnected from i as a result of the execution of step 4.

Algorithm 2: Alternative for the isBetter procedure

isBetter(old, new)
    cOld := Oracle.getCost(old)
    cNew := Oracle.getCost(new)
    return cOld > cNew ∧ (cOld − cNew) / cOld >= THRESHOLD

Naturally, if node d verifies that there is no gain in the exchange of c for o, d aborts the exchange by notifying c. Otherwise, it sends a SWITCH message to node o, asking o to replace node i with d in its active view. Finally, the answer received from o in a SWITCHREPLY (Algorithm 1, lines 38−42) is forwarded to c.

3.3.4 Step 4

This step is executed by node o upon the reception of a SWITCH message (Algorithm 1, lines 43−48) and ends when o sends a SWITCHREPLY to d. This step is required only to ensure the symmetry of active views. The behavior of node o in this step is deterministic. For clarity, in Algorithm 1 we only depict 2 of the constraints that are checked before accepting the exchange. The complete list of constraints that have to be checked can be found in [21]. After checking all constraints, node o sends a DISCONNECTWAIT message to i and adds d to its active view. This concludes an optimization round.
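Seen from the outside, a successful optimization round replaces the links (i, o) and (c, d) with the links (i, c) and (d, o). The sketch below models this net effect on a global set of undirected links, abstracting away all message exchanges, timeouts, and failures. The function names are hypothetical, and the check that refuses to create a link that already exists is an assumption of this sketch, not a constraint stated in the text.

```python
def link(a, b):
    """Undirected link between two nodes."""
    return frozenset((a, b))

def optimization_round(links, cost, i, o, c, d):
    """Try to swap links (i,o) and (c,d) for (i,c) and (d,o), in place.

    Returns True when the swap was applied.
    """
    if link(i, o) not in links or link(c, d) not in links:
        return False
    # Assumption of this sketch: never create a link that already exists.
    if link(i, c) in links or link(d, o) in links:
        return False
    # Only i and d consult the (symmetric) cost function.
    if not cost(i, o) > cost(i, c):
        return False
    if not cost(d, c) > cost(d, o):
        return False
    links -= {link(i, o), link(c, d)}
    links |= {link(i, c), link(d, o)}
    return True

def degree(links, node):
    """Number of links that have `node` as an endpoint."""
    return sum(1 for l in links if node in l)

def total_cost(links, cost):
    """Global cost of the overlay (cost must be symmetric)."""
    return sum(cost(*l) for l in links)
```

Because both removed links are strictly more expensive than the links that replace them, the global cost strictly decreases, and each of the 4 nodes loses one link and gains one, so every degree is preserved.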

3.3.5 Cost

A complete optimization round requires the serial exchange of seven messages although, in most cases, each node involved in the optimization only has, at most, to send and receive two messages. Given that the optimization of the overlay can be executed as a background activity, the cost of the adaptive mechanisms can be easily tuned to become negligible when compared with the (application) data traffic.

3.4. Threshold Parameter

Oracles might not be perfect, in the sense that they might provide information that is not fully accurate. Namely, two nodes may obtain different costs for the same link when they consult their local oracles. For instance, it is stated in [15] that the longest IP prefix and latency have an approximate correlation of −0.85. In cases where oracles are not perfect, nodes have to make decisions with inaccurate information. We propose a simple scheme to improve the behavior of the protocol in such scenarios. Basically, we change the isBetter evaluation function to include some hysteresis: a link new is only considered as having a lower cost than another link old if the difference between the costs (obtained through the oracle) offers a gain above a given Threshold, which is a new protocol parameter introduced to address the inaccuracy of oracles. Algorithm 2 depicts the required changes to the isBetter function.
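A direct Python rendering of the modified evaluation function is given below as a sketch. It takes the two costs as arguments instead of querying an oracle, and it assumes that costs are non-negative, so that the division is only evaluated when the old cost is strictly positive; the function name is_better and the default threshold of 0 are choices of this sketch.

```python
def is_better(c_old, c_new, threshold=0.0):
    """True if c_new improves on c_old by at least `threshold` (relative gain).

    Assumes non-negative costs: c_old > c_new >= 0 implies c_old > 0,
    so the division below is safe thanks to short-circuit evaluation.
    """
    return c_old > c_new and (c_old - c_new) / c_old >= threshold
```

For example, with a threshold of 0.1 a candidate with cost 95 does not replace a neighbor with cost 100 (relative gain of 0.05), whereas a candidate with cost 80 does (relative gain of 0.2). With a threshold of 0 the function reduces to the original isBetter.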

4. Evaluation

4.1. Experimental Setting

We conducted an extensive experimental evaluation of our algorithm in the peersim simulator [25] using its cycle based engine. All evaluation was conducted in a system composed of 10,000 nodes. Simulations were run on top of a network model composed of 13,037 routers generated by Inet-3.0 with its default parameters. In order to calculate the cost between neighbors, we calculated the shortest paths in the network between 10,000 virtual nodes inserted over it. The cost between two nodes is the sum of the distances of the links between routers which compose these shortest paths.

All results reported in this section are an average of the results of 5 runs. Each of these runs used one of 5 random network topologies generated as described above.

In order to evaluate our approach, we improved the HyParView protocol with our topology aware adaptive protocol. We used the configuration parameters for HyParView reported in [21]. The most relevant configuration parameters are the active view length, which was set to 5, and the passive view length, which was set to 30.
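The cost between two nodes, as defined above, is a weighted shortest-path distance over the router graph. The following sketch computes it with Dijkstra's algorithm; it does not reproduce the peersim or Inet-3.0 tooling used in the experiments, and the graph representation (a dictionary mapping each router to a dictionary of neighbor distances) and the function name path_cost are assumptions of this sketch.

```python
import heapq

def path_cost(router_graph, source, target):
    """Sum of link distances along the shortest path from source to target."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in router_graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")  # target is unreachable
```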


We tested our approach for different Unbiased Neighbors values, ranging from 0 to 5 (all); the last configuration corresponds to the operation of the original HyParView protocol (given that no bias is applied to any member). Moreover, in all simulations we used the following parameters:

Period between optimizations was set to 2 simulation cycles. In each cycle each node sends a SHUFFLE message and, on average, receives another SHUFFLE message; therefore, in each cycle each node participates, on average, in two shuffle processes. Setting the period between optimizations to two cycles ensures that, between executions of optimization steps, the passive view of nodes is updated, increasing the possibility of selecting new nodes to evaluate.

Passive Scan Length (PSL) was set to 2, so each time a node executes step 1 of the optimization algorithm, it measures, at most, 2 nodes from its passive view. This also limits to 2 the number of nodes that a node exchanges in a single round. Setting PSL to a small value achieves two goals: i) it promotes some stability in the overlay, as we avoid exchanging the majority of nodes in the active view of a single node in a single optimization execution, and ii) it lowers the cost of the overall optimization process.

4.2. Stable Environment

We evaluated our approach in a stable environment where no failures were present. These simulations were run for 250 cycles. Nodes were added to the overlay using the Join mechanism provided by the HyParView protocol, as described in [19], and had access to local perfect oracles (i.e., their precision was 100%). In this section we report results concerning some overlay properties and the impact on broadcast reliability.

4.2.1 Overlay Properties

Figure 1. Average Link Cost

Figure 1 shows the average link cost in the overlay. As expected, the original HyParView protocol (Unbiased = 5) shows a constant link cost in steady state. When extended with our topology aware adaptive protocol, it lowers its average link cost by approximately 5% when maintaining 4 unbiased neighbors, by 25% when keeping 1 unbiased neighbor, and by 32% when no unbiased member is maintained in the active view. Notice also that, although the optimization process works continuously, 50 simulation cycles are enough to obtain a visible improvement in the average link cost.

Figure 2. Clustering Coefficient

Figure 3. Average Path Length

Figure 2 depicts results for the clustering coefficient. As expected, if the bias is applied to all members of the active view, the clustering coefficient of the overlay increases. Maintaining a single unbiased neighbor is enough to partially mitigate this effect; however, as can be observed after 150 simulation cycles, the clustering coefficient of the network with a single unbiased member is still above that of the original HyParView protocol. Interestingly, when 2 to 4 unbiased neighbors are maintained in the active view, the clustering coefficient drops to values below those obtained by HyParView, which maintains 5 random neighbors. This phenomenon can be explained as follows. By maintaining active views sorted by cost, the selected “unbiased” neighbors are those with a larger cost. In other words, our algorithm, at no extra cost, promotes the maintenance of “long distance” nodes in each view. This effect is also visible in Figure 3, where we show the average path length values.
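For reference, the two overlay metrics discussed above can be computed as in the following Python sketch. It assumes the overlay is given as a symmetric adjacency map built from the active views and uses the standard definitions (average of per-node local clustering coefficients, and mean hop count over reachable node pairs); the exact formulas used in the simulations may differ.

```python
from collections import deque
from itertools import combinations
from typing import Dict, Set

Overlay = Dict[str, Set[str]]  # node -> active view, assumed symmetric


def clustering_coefficient(overlay: Overlay) -> float:
    """Average over nodes of the fraction of neighbor pairs that are linked."""
    total = 0.0
    for neighbors in overlay.values():
        k = len(neighbors)
        if k < 2:
            continue  # nodes with fewer than two neighbors contribute 0
        linked = sum(1 for a, b in combinations(neighbors, 2) if b in overlay[a])
        total += 2.0 * linked / (k * (k - 1))
    return total / len(overlay)


def average_path_length(overlay: Overlay) -> float:
    """Mean number of hops between reachable node pairs (BFS from every node)."""
    total_hops, pairs = 0, 0
    for source in overlay:
        hops = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in overlay[node]:
                if neighbor not in hops:
                    hops[neighbor] = hops[node] + 1
                    queue.append(neighbor)
        total_hops += sum(hops.values())
        pairs += len(hops) - 1
    return total_hops / pairs if pairs else 0.0


# Toy overlay: a triangle (a, b, c) with a pendant node d attached to c.
overlay: Overlay = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
```

In the toy overlay the triangle yields a high clustering coefficient; replacing one triangle edge with a link to a distant node would lower clustering and shorten paths, which is the effect attributed above to the "long distance" neighbors.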

4.2.2 Broadcast Reliability

To evaluate the reliability of the flooding broadcast protocol on top of the overlay, we select, in each simulation cycle, a random node in the system to broadcast a message. After the dissemination process is complete, we evaluate the reliability of the broadcast by observing the percentage of active nodes that received that message. The reliability obtained for all configurations of the protocol was 100% in steady state. This shows that the overlay maintained by the protocol did not become disconnected due to the operation of our topology aware adaptive protocol.
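The reliability measurement can be sketched as a flood over the active views followed by a count of the nodes reached. The sketch below is a simplification under stated assumptions: the overlay is a static symmetric adjacency map, the source is an active node, and no view repair takes place during dissemination.

```python
from collections import deque
from typing import Dict, Set

Overlay = Dict[str, Set[str]]  # node -> active view, assumed symmetric


def flood_reliability(overlay: Overlay, source: str, alive: Set[str]) -> float:
    """Fraction of alive nodes that receive a message flooded from `source`.

    Assumption: `source` is alive and failed nodes neither receive nor forward.
    """
    delivered = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in overlay[node]:
            if neighbor in alive and neighbor not in delivered:
                delivered.add(neighbor)
                queue.append(neighbor)
    return len(delivered) / len(alive)


# Toy overlay: a ring of four nodes.
ring: Overlay = {
    "a": {"b", "d"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c", "a"},
}
```

A reliability of 1.0 for every source is equivalent to the alive part of the overlay being connected, which is why this metric also serves as a connectivity check.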

4.3. Effect of Unreliable Oracles

We ran the same experiments as above, but now using unreliable oracles. Each time the oracle is consulted, it gives an inaccurate answer with a probability that ranges from .1 to .3 (i.e., up to 30% of the answers are inaccurate). When the oracle gives an inaccurate answer, the answer has an error of up to 50% (i.e., when an oracle is not accurate, it reports a link cost with a random value that ranges from 50% to 150% of the real cost). Figure 4 depicts the results obtained. It can be seen that such unreliable oracles do not have a visible effect on the optimization of the overlay. These results are due to two distinct reasons: i) for an optimization round to be completed, two distinct nodes have to accept the exchange of links by comparing results provided by their local oracles; with oracles that present an error rate of at most .3, the probability of both nodes observing consistent erroneous measures is still very low, and ii) even with an error rate of .3, the number of inaccurate oracle results is not enough to dominate the network.
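The error model of the unreliable oracle can be sketched as follows. The text only states that an inaccurate answer is a random value between 50% and 150% of the real cost; the uniform distribution, the function name, and the parameter names below are assumptions made for illustration.

```python
import random


def noisy_cost(real_cost: float, error_rate: float, rng: random.Random,
               max_error: float = 0.5) -> float:
    """Return the cost reported by an unreliable oracle.

    With probability `error_rate` the answer is inaccurate and is drawn
    (uniformly, by assumption) from [1 - max_error, 1 + max_error] times
    the real cost; otherwise the real cost is returned unchanged.
    """
    if rng.random() < error_rate:
        return real_cost * rng.uniform(1.0 - max_error, 1.0 + max_error)
    return real_cost


rng = random.Random(42)  # fixed seed so that runs are repeatable
samples = [noisy_cost(100.0, 0.3, rng) for _ in range(1000)]
```

Combining such an oracle with a Threshold in isBetter filters out many of the exchanges that would otherwise be triggered by measurement noise alone.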

Figure 4. Average Link Cost (average link cost vs. cycle, for error rates 0, 0.1, 0.2, and 0.3)

4.4. Massive Failures

In this section we provide results concerning the reliability of the flooding broadcast protocol, on top of our overlay network, after massive node failures, ranging from the simultaneous failure of 10% to 95% of all nodes. These faults were induced in the system after 250 simulation cycles, to ensure that the overlay had time to converge to a biased version. After the induction of failures, simulations were run for 250 more cycles; in each cycle a random (active) node was selected to broadcast a message and the reliability was measured. Figure 5 shows the average reliability obtained while broadcasting 250 messages after massive failures. All protocol instances present similar values for reliability. This happens due to the passive view maintained by HyParView. Notice that, although we use passive views to bias the topology of the overlay, the properties of the passive view are not affected; thus, the healing properties of passive views are preserved under the operation of our algorithm.

One could expect that maintaining a single unbiased neighbor would decrease the resilience of the overlay to node failures. In fact, one could imagine that, if the unbiased member fails, our scheme would make it difficult for this member to be replaced by another unbiased member. However, our algorithm only attempts to apply some bias when the active view is complete. Therefore, when a node crashes and needs to be replaced, the replacement is picked at random from the passive view. This policy, combined with the use of a sorted active view, explains the effect described in Section 4.2.1, in which our unbiased neighbors become biased towards high costs.

5. Related Work

Narada [3] includes self-organizing and self-improving protocols to construct and maintain overlay networks. Similar to our scheme, their approach is based on a utility function that is applied periodically to samples of peers in the system, and that is employed to make local decisions concerning the addition and removal of links in the overlay. However, because Narada is targeted at small and medium scale systems, it operates using full membership information and therefore has scalability restrictions.


Figure 5. Average reliability of 250 messages after failures

The work presented in [17] also aims at increasing the efficiency of gossip dissemination by reducing the overhead imposed on network links by communication between distant random peers, given that, in an epidemic dissemination process, the same message might traverse a given network link several times. However, the approach taken by the authors is based on the creation of a hierarchy among peers in the system, taking into account the structure of the underlying network, for instance, by grouping peers by network domains (e.g., Local Area Networks, subnets, or even Autonomous Systems). In contrast, our approach requires neither knowledge of the physical location of nodes nor the maintenance of a hierarchy among peers.

T-Man [11] is a generic topology management scheme for overlay networks that is able to evolve a given overlay topology into a desired topology, such as a torus, a ring, or some user-defined topology. This is achieved by relying on a single ranking function that describes the desired topology, in the sense that it enables each node in the system to extract clues to its optimal position and desired neighbors in the overlay network. These ranking functions might be highly complex and, moreover, may require some global knowledge of the system. In our approach we rely only on link costs, which can be obtained through local oracles that are not required to be accurate.

The Localiser algorithm [23] also aims at optimizing unstructured overlays according to a proximity criterion while promoting the balancing of node degree. Localiser is based on a Metropolis scheme [24] whereby nodes iteratively strive to minimize an energy function by swapping connections among peers. Although this algorithm strives to balance the degree of nodes in the system, unlike our approach it does not ensure that the degree of the nodes that participate in a link exchange is maintained. Furthermore, the Localiser algorithm runs the risk of falling into local minima of the energy function, because the pool of peers available to generate overlay optimizations is limited. Therefore, nodes sometimes have to exchange links in a way that increases their energy function. Moreover, Localiser is not concerned with preserving some key properties of the overlay, such as low clustering and small diameter.

GoCast [27] is the work that most resembles our own. GoCast builds an overlay that is optimized to maintain both random (distant) neighbors and close neighbors, while balancing node degree in such a way that the degree of nodes converges to a given pre-established value D and varies only between D−2 and D+2. However, the strategy GoCast uses to maintain and select peers is more complex than our own. For instance, it uses separate protocols to maintain random and close neighbors, whereas our approach is fully integrated. Moreover, by leveraging a passive view we are able not only to improve our overlay but also to sustain larger numbers of concurrent node failures (above 90%), whereas the values reported in [27] are of only 25%. Finally, GoCast is designed to improve unstructured overlay networks only with respect to low latency links, with measures based on the RTT between nodes. Although in this paper we also focus on improving latency between peers, our approach is more flexible and can be applied to improve other network efficiency criteria or even application criteria, an advantage that comes from the use of oracles.


6. Conclusion and Future Work

In this paper, we proposed a new algorithm that allows an unstructured overlay network to bias its topology according to some given efficiency criteria. In particular, we have shown that this approach can be used to obtain a better match between the overlay topology and the topology of the underlying network, without compromising key properties of random overlay networks.

Our proposal leverages an architecture that we introduced in previous work [21], which advocates the use of two distinct partial views. In that paper, we have shown that this approach achieves high reliability and high resilience to faults (as high as 80% of simultaneous node failures) in a gossip-based broadcast protocol. The challenge addressed in the present paper was to reduce the cost of the overlay links without losing these good properties.

The experimental evaluation has shown that our approach is in fact able to apply a bias to an unstructured overlay network topology, without compromising key properties of these overlays, such as the low clustering coefficient and the low average path length. Moreover, our approach is able to sustain high reliability even in scenarios where large percentages of nodes, as large as 80%, may fail simultaneously. Finally, our optimization strategy is feasible even when the oracles have a probability of returning inaccurate values as high as .3.

As future work we plan to experiment with other interesting oracles, such as oracles that reflect how similar the content stored by each node is. These oracles could be used to build new resource location protocols on top of (biased) unstructured overlays.

References

[1] K. Birman. The promise, and limitations, of gossip protocols. SIGOPS Oper. Syst. Rev., 41(5):8–13, 2007.
[2] K. Birman, M. Hayden, O. Ozkasap, Z. Xiao, M. Budiu, and Y. Minsky. Bimodal multicast. ACM Transactions on Computer Systems, 17(2), May 1999.
[3] Y.-H. Chu, S. Rao, S. Seshan, and H. Zhang. A case for end system multicast. IEEE Journal on Selected Areas in Communications, 20(8):1456–1471, Oct 2002.
[4] A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, and D. Terry. Epidemic algorithms for replicated database maintenance. In PODC ’87: Proceedings of the sixth annual ACM Symposium on Principles of distributed computing, pages 1–12, New York, NY, USA, 1987. ACM Press.
[5] P. T. Eugster and R. Guerraoui. Probabilistic Multicast. In DSN 2002, 2002.
[6] P. T. Eugster, R. Guerraoui, S. B. Handurukande, P. Kouznetsov, and A.-M. Kermarrec. Lightweight probabilistic broadcast. ACM Trans. Comput. Syst., 21(4):341–374, 2003.
[7] P. T. Eugster, R. Guerraoui, A.-M. Kermarrec, and L. Massoulie. From Epidemics to Distributed Computing. IEEE Computer, 37(5):60–67, 2004.
[8] A. J. Ganesh, A.-M. Kermarrec, and L. Massoulie. SCAMP: Peer-to-peer lightweight membership service for large-scale group communication. In Networked Group Communication, pages 44–55, 2001.
[9] M. Hayden and K. Birman. Probabilistic broadcast. Technical report, Ithaca, NY, USA, 1996.
[10] M. Jelasity and A.-M. Kermarrec. Ordered slicing of very large-scale overlay networks. In Peer-to-Peer Computing, 2006. P2P 2006. Sixth IEEE International Conference on, pages 117–124, 06-08 Sept. 2006.
[11] M. Jelasity and O. Babaoglu. T-man: Gossip-based overlay topology management. In The Fourth International Workshop on Engineering Self-Organizing Applications (ESOA’06), Hakodate, Japan, May 2006.
[12] M. Jelasity, R. Guerraoui, A.-M. Kermarrec, and M. van Steen. The peer sampling service: experimental evaluation of unstructured gossip-based implementations. In Middleware ’04: Proceedings of the 5th ACM/IFIP/USENIX international conference on Middleware, pages 79–98, New York, NY, USA, 2004. Springer-Verlag New York, Inc.
[13] M. Jelasity and A. Montresor. Epidemic-style proactive aggregation in large overlay networks. In Proceedings of The 24th International Conference on Distributed Computing Systems (ICDCS 2004), pages 102–109, Tokyo, Japan, 2004. IEEE Computer Society.
[14] P. Karwaczynski. Fabric: Synergistic proximity neighbour selection method. In P2P ’07: Proceedings of the Seventh IEEE International Conference on Peer-to-Peer Computing (P2P 2007), pages 229–230, Washington, DC, USA, 2007. IEEE Computer Society.
[15] P. Karwaczynski, D. Konieczny, J. Mocnik, and M. Novak. Dual proximity neighbour selection method for peer-to-peer-based discovery service. In SAC ’07: Proceedings of the 2007 ACM symposium on Applied computing, pages 590–591, New York, NY, USA, 2007. ACM.
[16] D. Kempe, A. Dobra, and J. Gehrke. Gossip-based computation of aggregate information. In FOCS ’03: Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science, page 482, Washington, DC, USA, 2003. IEEE Computer Society.
[17] I. Gupta, A.-M. Kermarrec, and A. J. Ganesh. Efficient and adaptive epidemic-style protocols for reliable and scalable multicast. IEEE Trans. Parallel Distrib. Syst., 17(7):593–605, 2006.
[18] A.-M. Kermarrec, L. Massoulie, and A. J. Ganesh. Probabilistic reliable dissemination in large-scale systems. IEEE Trans. Parallel Distrib. Syst., 14(3):248–258, 2003.
[19] J. Leitao. Gossip-based broadcast protocols. Master’s thesis, University of Lisbon, 2007.
[20] J. Leitao, J. Pereira, and L. Rodrigues. Epidemic broadcast trees. In Proceedings of the 26th IEEE International Symposium on Reliable Distributed Systems (SRDS’2007), pages 301–310, Beijing, China, Oct. 2007.
[21] J. Leitao, J. Pereira, and L. Rodrigues. HyParView: A membership protocol for reliable gossip-based broadcast. In DSN ’07: Proc. of the 37th Annual IEEE/IFIP Intl. Conf. on Dependable Systems and Networks, pages 419–429, Edinburgh, UK, 2007. IEEE Computer Society.
[22] H. C. Li, A. Clement, E. L. Wong, J. Napper, I. Roy, L. Alvisi, and M. Dahlin. Bar gossip. In USENIX’06: Proceedings of the 7th conference on USENIX Symposium on Operating Systems Design and Implementation, pages 14–14, Berkeley, CA, USA, 2006. USENIX Association.
[23] L. Massoulie, A.-M. Kermarrec, and A. J. Ganesh. Network awareness and failure resilience in self-organising overlay networks. srds, 00:47, 2003.
[24] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equations of state calculations by fast computing machine. J. Chem. Phys., 21:1087–1091, 1953.
[25] Peersim p2p simulator. http://peersim.sourceforge.net/.
[26] R. V. Renesse, Y. Minsky, and M. Hayden. A gossip-style failure detection service. Technical report, Ithaca, NY, USA, 1998.
[27] C. Tang and C. Ward. GoCast: Gossip-enhanced overlay multicast for fast and dependable group communication. In DSN ’05: Proc. of the 2005 Intl. Conf. on Dependable Systems and Networks (DSN’05), pages 140–149, Washington, DC, USA, 2005. IEEE Computer Society.
[28] S. Voulgaris, D. Gavidia, and M. Steen. Cyclon: Inexpensive membership management for unstructured p2p overlays. Journal of Network and Systems Management, 13(2):197–217, June 2005.