End-to-end consensus using end-to-end channels*

Matthias Wiesmann† Xavier Défago
Japan Advanced Institute of Science and Technology

Asahidai 1-1, Nomi-shi, Ishikawa 923-1292, Japan
E-mail: {wiesmann|defago}@jaist.ac.jp

Abstract

End-to-end consensus ensures delivery of the same value to the application layer running in distributed processes. Deliveries that have not been acknowledged by the application before a failure are delivered again. End-to-end primitives are important for applications that need to enforce persistency. We present an algorithm that solves the end-to-end consensus problem. Our approach is to build end-to-end consensus using a new type of communication channel: end-to-end channels.

1. Introduction

Group communication has been proposed for some time as an approach to help the design of complex distributed systems. Group communication offers rich primitives that help build distributed applications. While those primitives address fundamental problems that are difficult to solve in environments prone to component failures, their actual usefulness depends on how well-adapted they are to the needs of applications [9], and to what extent they help enforce application-level properties [5].

No end-to-end guarantees. One important shortcoming in the way current group communication primitives are specified is that they offer no support for end-to-end properties. Consider the following application: each process proposes one value to a consensus primitive, and writes the resulting decision to stable storage. If the application crashes between the delivery of the decision by the primitive and the termination of the write operation, the value will be lost. Upon recovery the primitive will not deliver the decision again (it already has), and the application did not have time to write it to stable storage. While this situation poses no problem in a model where crashes are permanent (the crash-stop model),

*Work supported by MEXT Grant-in-Aid for Scientific Research on Priority Areas (Nr. 18049032)

†Swiss National Science Foundation Fellowship PA002-104979

it raises real issues in a model where processes can eventually recover from a crash (the crash-recovery model). Because the decision of the consensus is delivered to the application only once, the data is effectively lost if it has been delivered but not written to stable storage. Atomicity is an end-to-end property and can therefore not be fully implemented in the lower layers [16]. Recently there have been proposals for primitives with end-to-end properties [12], but the properties are implemented at a high level and imply complex interfaces.

The lack of end-to-end properties in group communication primitives means that those primitives cannot, in most cases, be used directly. If the application programmer is to actually use group communication, she will need to add additional functionality in order to ensure end-to-end properties, for instance log all messages and deliver them again to a recovering node [10]. This defeats the purpose of group communication toolkits, and leads to redundant and inefficient code. In order to support applications with persistency requirements, group communication needs to offer primitives with end-to-end properties. Such properties mean that the application controls when the atomicity properties are ensured, i.e., when the delivery of a message is complete. If delivery is not complete, it must be initiated again by the group communication primitive. The lack of appropriate support for atomicity makes it difficult to use group communication in settings where the application actually requires such properties, like for instance replicated databases [18].

Contributions. In this paper, we present a set of group communication primitives with end-to-end properties. We show how those properties can be implemented by defining the system from the ground up with end-to-end properties. This is done by proposing a more advanced type of communication channel that offers some basic persistency and atomicity properties: end-to-end channels. Using those channels, we present algorithms for basic end-to-end group communication primitives: reliable broadcast and consensus. We show how this model relates to the traditional crash-recovery model. The approach that we present can be extended to

©IEEE. Appeared in Proc. of the Pacific Rim Int. Symp. on Dependable Computing (PRDC'06)

further primitives like total order broadcast, while offering simpler interfaces than the one presented in [12].

End-to-end channels. While it is our goal to offer group communication primitives with end-to-end properties, these properties should be addressed at the right level. End-to-end issues are not specific to group communication; they already arise in simple point-to-point communications. Consider a client that sends a message to a server to be written to disk. The client does not only expect reliable delivery of the message to the server, but also its processing. So it makes sense to define a point-to-point channel with end-to-end properties. The main motivation for a layered architecture is to give an abstract view of a problem and isolate upper layers from the complexity of implementation. Group communication should offer an abstraction layer to the application. In turn, the group communication layer should use the right abstractions. While attention has been given to the group communication layer as a means to simplify applications, we feel more work is needed to improve the abstractions used to simplify the group communication layer itself. One of the main abstractions used by the group communication layer is the notion of channel.

Numerous models for distributed systems have been proposed, each with different communication channels. In many cases, those channels try to somehow represent the existing communication infrastructure. Sometimes the communication layer is assumed to offer reliable point-to-point channels, sometimes the channels are lossy. Those two choices offer some similarity with TCP/IP and UDP/IP channels, but the relationship is neither explicit nor exact. Reliable channels implicitly assume infinite buffer space and no flow control; this is quite different from TCP/IP. UDP/IP guarantees neither delivery nor order and supports only messages with a limited length. UDP/IP is therefore difficult to use directly. Because of this mismatch between available communication channels and the properties required by group communication, any real toolkit will need to implement a channel abstraction. If one is to use a channel abstraction, then it should fulfill the task of an abstraction layer: isolate the upper layer from the specifics of the lower layer.

The traditional reliable channel abstraction isolates the upper layer from transmission and connection failures, but not from crashes of the end-points (processes). Because of this, the same algorithm needs to be redefined depending on how process crashes are considered [17]. For instance, the consensus algorithm based on the rotating coordinator is described differently in the crash-stop [4] and the crash-recovery model [1]. In the crash-recovery model, in order to tolerate certain failures, algorithms need access to stable storage. Thus, the same algorithm needs to be defined three times (no recovery, recovery with no stable storage, recovery with stable storage). One cannot simply take the crash-

recovery model algorithm and apply it to all three cases, since stable storage makes little sense in the crash-stop model. We present an algorithm that solves the end-to-end consensus problem, and is defined in a way that abstracts the actual failure model, because the failure model is hidden within the channel abstraction.

Our goal is to implement a reliable channel for the failure model considered. If the model permits crashes and recoveries, the channel must handle those failures. Using this channel, algorithms can be defined so that they work in any model. The channel is not modeled on low-level networking interfaces, but is instead closer to the high-level interfaces used by applications: message queues.

Instead of explicitly using stable storage, algorithms have a minimal state; messages are logged and replayed as needed. Message logging is a well-known technique for taking checkpoints of a distributed system [8, 6, 3]. Messages that were not fully processed are delivered again after a crash, while messages that were fully processed are simply discarded. By moving the notion of stable storage from the algorithm into the channel abstraction, we get a more general model. Message persistency can thus be implemented using local stable storage, a fault-tolerant service, or by replicating the message information among the nodes.

Advantages. End-to-end channels offer three advantages: a) end-to-end channels can be used to build algorithms with end-to-end properties, b) end-to-end channels abstract the actual failure model, and c) end-to-end channels can abstract very different network models, in particular those where there is no direct connection between sender and receiver. This includes store-and-forward architectures, or epidemic networks. As end-to-end channels can handle both connection and process failures, they are well-suited for ad-hoc or peer-to-peer networks. Furthermore, as they integrate message acknowledgment, end-to-end channels handle the message life-cycle explicitly, and thus help with practical issues such as flow control or message garbage collection.

Structure. This paper is structured as follows: Section 2 presents the models and notations of this paper. Section 3 presents the channel. Section 4 presents an algorithm that implements reliable broadcast using end-to-end channels. Section 5 presents an algorithm that solves the consensus problem using end-to-end channels. Both algorithms work in the crash-recovery model. Section 6 concludes the paper.

2. Model

A distributed system is modeled as a set Π of n processes: {p1, . . . , pn}. At any time, a process can be up or down. When a process is up it executes its algorithm normally. When a process is down it takes no action. The transition from the state down to up is called a recovery, the transition from the state up to down is called a crash. Initially, all processes are up.

We distinguish three kinds of processes: red, yellow, and green processes. Red processes are processes that either crash and never recover or are forever unstable. Yellow processes crash and recover, but eventually stay forever up. Green processes never crash.

Processes communicate by sending and receiving semantic messages [14]. A semantic message might be sent as multiple different concrete messages. For instance, because of failures, a message might be retransmitted, or a message might be relayed by some intermediate nodes. In such cases, while there is one logical message, there are multiple concrete messages.

We call a semantic message a class, and use the term "message" to denote concrete messages. All concrete messages that represent the same semantic message are said to be part of the same class. A class K is therefore a set of concrete messages K = {m^K_1, m^K_2, . . . , m^K_n}. If message m is part of class K we denote this m ∈ K. We say that two messages m1 and m2 are class equivalent if they both belong to some class K. We denote this relationship with the notation m1 ≃ m2; this relationship is reflexive and transitive.

3. End-to-End Channels

3.1. Intuition

Figure 1 shows the core idea of end-to-end channels. The channel layer offers interfaces to control end-to-end delivery. Each layer uses the end-to-end interface of the layer below to offer end-to-end interfaces to the layer above. The general properties of the channels are the following:

End-to-End: The channel offers an interface to notify the channel that the message was delivered and processed.

Reliable: The channel offers reliable transmission, i.e., it ensures that messages are not lost, even in the event of failure.

Deterministic: The behavior of the channel is deterministic.

The interface for end-to-end delivery makes message reception acknowledgment explicit. Each layer notifies the layer below when message delivery is complete and atomicity has been ensured. Reliability means that, if a message is sent on a channel, the message will be delivered even if failures occur. The channel can tolerate message losses, but also the crash of the endpoints: the sender and receiver processes. If the sender crashes, recovers, and sends a message again, then the duplicate message will be handled. If the receiver crashes while handling a message, the message will be delivered again.

Figure 1. End-to-end layers (two nodes, each with an application on top of a group communication layer, a point-to-point layer, a transport layer, and the operating system/physical network; the middleware layers are connected by an end-to-end channel, and each layer exposes an end-to-end interface to the layer above).

The reliability and end-to-end properties of the channel make it possible to build distributed algorithms that can tolerate process crashes without logging. Such algorithms have no persistent state and restart in case of crash. In order to ensure that the execution after a crash is consistent, the algorithm must be deterministic. Data read from channels before and after crashes should be consistent.

Simply delivering messages in a fixed order is not sufficient to ensure determinism [15]. Real algorithms rely on external events like timers or oracles to make decisions. A non-deterministic channel might cause the algorithm to behave in a certain way before a crash, and another way after recovery. A deterministic channel is needed to help ensure that execution before and after a crash is the same.

Take for instance an atomic commitment protocol. A process p could take the decision to abort the transaction and as a result send a message m1 to process q. Process p then crashes, recovers, replays the algorithm, and takes another decision (commit). It then sends message m2 to q. Obviously, process q should get only one message, either m1 or m2: both are concrete messages of the same semantic message, i.e., both m1 and m2 are part of the same class defined for decision messages.

As it is impossible to know beforehand how many concrete versions of a given semantic message will be sent, the only deterministic choice is to deliver only the first version of the message. For this purpose we rely on the notion of message classes as defined in Section 2. Only the first message of a given class is delivered. This solves the issue of duplicate messages in a deterministic way.

3.2. Channel operations

An end-to-end channel is used to send messages between processes. The channel supports two basic operations: the sender can send messages, and the receiver can read them. Sending message m to process q is done by using the send(m, q) primitive. Reading message class K is done with the read(K) primitive. The channel only transmits one message per message class. Because of this, the sender sends messages, but the receiver reads a certain class and will get one message for this class. If process q requests a message for class K and obtains m, we say that q reads m from class K.

We consider that both sending and receiving are non-atomic operations: they take a certain amount of time and a crash can occur during their execution. Sending is defined by the two events send(m, p) and send-done(m, p). Those events mark the beginning and the end of the operation, respectively. Delivery of message m is only ensured if send-done(m, p) occurred. Reading is defined by the two events read(K) and read-done(K).

If send(m) has occurred, we say that message m was sent. If read(K) occurred and returned m, we say that message m was read from class K. If send-done(m) occurred, we say that m was successfully sent. If read-done(K) occurred, we say that K was successfully read; we also say that class K was acknowledged. If the send or read operation is not successful, we say that the operation is partial, that is, a partial send or a partial read. If a class K has been acknowledged, operation read(K) will return a special value ⊥. Value ⊥ is not part of any message class, i.e., ∀K, ⊥ ∉ K. A read(K) operation on a class where no message is yet available will block. Channels have one additional primitive: available(K). This primitive returns true if read(K) can be executed without blocking and false otherwise.
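To make this interface concrete, here is a minimal receiver-side sketch in Python (our own illustration, not the paper's implementation): the ACK sentinel plays the role of the special value ⊥, only the first message of each class is retained, read blocks until a message or an acknowledgment for the class is available, and read_done makes the class return ⊥ forever. A real channel would persist this state across crashes rather than keep it in memory.

    import threading

    ACK = object()  # stands for the special value ⊥ (class acknowledged)

    class EndToEndChannel:
        """Receiver-side sketch of an end-to-end channel (in-memory only)."""

        def __init__(self):
            self._slots = {}                 # class id -> first message of the class, or ACK
            self._cond = threading.Condition()

        def deliver(self, class_id, message):
            """Called by the transport: keep only the first message of a class."""
            with self._cond:
                if class_id not in self._slots:
                    self._slots[class_id] = message
                self._cond.notify_all()

        def available(self, class_id) -> bool:
            with self._cond:
                return class_id in self._slots

        def read(self, class_id):
            """Block until a message (or ACK, i.e. ⊥) is available for the class."""
            with self._cond:
                while class_id not in self._slots:
                    self._cond.wait()
                return self._slots[class_id]

        def read_done(self, class_id):
            """Acknowledge the class: from now on read() returns ACK (⊥)."""
            with self._cond:
                self._slots[class_id] = ACK
                self._cond.notify_all()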

3.3. Channel Properties

End-to-end channels have the following properties:

Class correctness: If read(K) returns m then m ∈ K.

No creation: If read(K) returns m then m was sent by some process.

Determinism: If read(K) returns m and a subsequent read(K) returns m′ then m = m′ or m′ = ⊥.

Class atomicity: If a class K has been read successfully (read-done(K) occurred) then read(K) returns ⊥.

Class no loss: If process p successfully sends m to q and q is a non-red process that did not execute read-done(K), then eventually executing read(K) on q will read m′ such that m′ ≃ m.

The class correctness property ensures that when a class is read, only messages of this class are returned. The no creation property ensures that only messages that were sent can be read. The determinism property ensures that the same message is always read for class K. The class atomicity property ensures that if a class was acknowledged, then no message from this class will be readable. The no loss property ensures that messages are delivered if the system is stable.

procedure rb-send(m) ; ! Broadcast m
begin
  S ← ∅ ; ! Set of stable destinations
  m_rbcast ← (rbcast, m) ; ! Encapsulate message.
  foreach q ∈ {p1 . . . pn} do
    send(m_rbcast, q) ; ! Class is K^K_rbcast
  when send-done(m_rbcast, q) do
    S ← S ∪ {q} ; ! Message m is acknowledged for process q
    if S = Π then
      rb-send-done(m) ; ! Acknowledge to all destinations
end

procedure rb-read(K) ; ! Read the message for class K
begin
  m_rbcast ← read(K^K_rbcast) ;
  if m_rbcast = ⊥ then
    return ⊥ ; ! R-bcast already read successfully
  else
    (rbcast, m) ← m_rbcast ; ! Extract m
    S_f ← {q ∈ Π | q < p} ; ! All processes with larger id
    foreach q ∈ S_f do
      send(m′, q) ; ! We forward the message to ensure atomicity
    S_c ← ∅ ; ! Set of confirmed destinations
    when send-done(m′, q) do
      S_c ← S_c ∪ {q} ; ! Message m is acknowledged for process q
      if S_c = S_f then
        return m ; ! Msg. was forwarded, delivery is now allowed.
end

procedure rb-read-done(K) ;
begin
  read-done(K^K_rbcast) ; ! Successful read, ack. K^K_rbcast
end

Figure 2. Reliable Broadcast

One must note that a class K can be acknowledged before any message m ∈ K was ever read. This is a useful feature for building deterministic algorithms. Assume that a process p waits for a message m for some time; after this time, process p takes action a1. Process p then crashes and recovers. Message m might now be available and thus process p can take action a2, leading to a completely different behavior. If, after timing out, process p acknowledges the class K of message m (m ∈ K), it ensures that m will never be read in the future, even if a crash occurs and the algorithm is executed again.
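The pre-acknowledgment pattern just described can be written against a channel object exposing available, read, and read_done (as in the earlier sketch of Section 3.2); the helper name and the polling loop below are our own choices.

    import time

    def read_or_give_up(channel, class_id, timeout_s):
        """Wait for a message of class_id for at most timeout_s seconds.
        On timeout, acknowledge the class so that the same choice (give up)
        is also made if the algorithm is replayed after a crash."""
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            if channel.available(class_id):
                return channel.read(class_id)   # the message arrived in time
            time.sleep(0.01)                    # simple polling; a real system would block with a timeout
        channel.read_done(class_id)             # acknowledge K: m can never be read later, even after a crash
        return None                             # the timeout action (a1) is now the only possible outcome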

4. Reliable Broadcast

In this section, we present the specification for an end-to-end reliable broadcast along with an implementation that relies on end-to-end channels. For a discussion on reliable broadcast in the crash-recovery model, see [2].


4.1. Operations and Properties

Reliable broadcast is specified by two primitives: rb-send(m) and rb-read(K). The first primitive sends a message m to all processes pi ∈ Π. The second reads message class K. Both operations are non-atomic: rb-send-done(m) denotes the end of the sending operation and rb-read-done(K) the end of the read operation. If rb-read-done(K) occurred, we say that class K was rb-acknowledged.

Class correctness: If rb-read(K) returns m then m ∈ K.

No creation: If rb-read(K) returns m then some process p executed rb-send(m).

Determinism: If rb-read(K) at process p returns m and a subsequent rb-read(K) at process p returns m′ then m = m′ or m′ = ⊥.

Message atomicity: If rb-read-done(K) occurred at process p then rb-read(K) always returns ⊥ at p.

Class atomicity: If rb-read(K) returns m at process p, then for all processes q that are non-red and did not execute rb-read-done(K), rb-read(K) will return m′ such that m′ ≃ m.

Class no loss: If rb-send-done(m) occurred at process p, then for all processes q that are non-red and did not execute rb-read-done(K), rb-read(K) will return m′ such that m′ ≃ m.

4.2. Implementation

There is a naive implementation that simply consists in delaying rb-send-done(m) until message m is stable at all receiving processes. But enforcing atomicity poses a problem: once the sender knows that the message is stable on all receivers, a reliable broadcast is still needed to notify all receivers that they should deliver.

The algorithm described in Figure 2 is basically an adaptation of the classical reliable broadcast algorithm that uses end-to-end channels. When a message is received for the first time, it is forwarded to all destinations.

For each message class K transmitted using the reliable broadcast, a new class K^K_rbcast is defined. Filtering of duplicate messages is not needed since this issue is handled by the end-to-end channel: duplicate messages will be in the same class. Note that the atomicity is only specified for classes; different processes might read different messages of a given class K. In particular, if process p sends message m ∈ K, crashes and sends m′ ∈ K, process q1 might read m for K, while another process q2 might read m′ for K. A similar issue arises if two processes broadcast messages in the same class.
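From the application's side, the primitives of Section 4.1 are used in a read/process/acknowledge cycle. A small sketch (our illustration; rb stands for an object exposing rb_read and rb_read_done, and None stands for ⊥):

    def handle_broadcast(rb, class_id, process):
        """Read one reliable-broadcast class, process it, then acknowledge it.
        If the application crashes before rb_read_done, a message of this class
        is delivered again on recovery; after rb_read_done it never is."""
        m = rb.rb_read(class_id)
        if m is None:              # ⊥: the class was already acknowledged earlier
            return
        process(m)                 # e.g., write the value to stable storage
        rb.rb_read_done(class_id)  # end-to-end delivery of this class is now complete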

5. End-to-End Consensus

In this section we present an algorithm that solves end-to-end consensus using end-to-end channels. For an overview of the consensus problem see [7]. The group communication layer does not have access to stable storage; it can only send and receive messages using end-to-end channels. The basic idea of this algorithm is to rely on a classical rotating coordinator algorithm [4] and use the end-to-end channels as a roll-back mechanism. When a process crashes, the end-to-end channel will again deliver all messages that have not been acknowledged. So, roughly speaking, after a crash the process "replays" the messages and re-executes the consensus algorithm. This is an extension of the approach of Mostéfaoui et al. [13]. The main problem is to ensure that the algorithm takes, after a crash, the same decisions it took before.

5.1. Operations and Properties

The end-to-end consensus algorithm is defined by two primitives: propose(v) and decide(v). Both primitives are non-atomic and are ended by the events propose-done(v) and decide-done(v). The end-to-end consensus algorithm implements the following specification:

Uniform Validity: If a process decides v, then v was proposed by some process.

Uniform Agreement: No two processes decide differently.

Uniform Integrity: Every process successfully decides at most once.

Termination: Every non-red process eventually decides.

The algorithm terminates if a non-red process successfully proposed and there is a majority of non-red processes.

5.2. Failure Detector

The end-to-end consensus algorithm relies on a failure detector of class ◊S_u [1]. This failure detector is imperfect in the sense that it can suspect processes that are not crashed, or trust processes that are crashed. The failure detector has the following properties:

Weak Accuracy: Eventually, one non-red process is not suspected by any non-red process.

Strong Completeness: Eventually all red processes are suspected by every non-red process.

Notice that there is no guarantee that the failure detector will output the same status both before and after a crash. Process p might suspect q before a crash, but when executing the algorithm again during recovery, p might no longer suspect q. So care must be taken to take the same decisions despite changes in the state of the failure detector. The issue of different readings of the failure detector before and after a crash is unrelated to the unreliable nature of the failure detector. The same issue would arise with a perfect failure detector.

procedure rotate ; ! Rotating Coordinator Consensus
begin
  rp ← 1 ; ! Start at round 1.
  while decided = false do
    rp ← rp + 1 ; ! New round
    cp ← (rp mod n) + 1 ; ! Calculate coordinator
    call phase1 ; call phase2 ; call phase3 ; call phase4 ;
end

Figure 3. Rotate Task

5.3. Message Classes

The end-to-end consensus algorithm relies on five types of messages. The init messages are simply used to make the initial proposal persistent. They contain the initial proposition value and belong to class K_init. The estimate messages contain the estimate of process p for round r and belong to class K^{r,p}_estimate. The prop messages contain the proposal from the coordinator for round r and belong to class K^r_prop. The vote messages contain the vote of each process p for round r and belong to class K^{r,p}_vote. Those messages are sometimes called acknowledgment messages; in this paper we use a different name to avoid confusion with application-level acknowledgment messages. The decided messages contain the final decision of the consensus and belong to class K_decided.
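In an implementation, these classes can simply be encoded as tuples used as channel keys. A possible encoding (our own, not prescribed by the paper):

    # One key per message class, mirroring the five classes above.
    K_INIT    = ("init",)                                # K_init
    K_DECIDED = ("decided",)                             # K_decided

    def k_estimate(r, p): return ("estimate", r, p)      # estimate of process p in round r
    def k_prop(r):        return ("prop", r)             # coordinator's proposal in round r
    def k_vote(r, p):     return ("vote", r, p)          # vote of process p in round r

    # Example: the class of the estimate sent by process 3 in round 7.
    assert k_estimate(7, 3) == ("estimate", 7, 3)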

5.4. Algorithm

The application interface is defined by three procedures: values are proposed by the application using the propose procedure (Figure 4), the application gets values using the decide procedure, and it confirms delivery using the decide-done procedure (Figure 5). Recovery is handled using the recovery procedure (Figure 10). The end-to-end consensus algorithm follows the general idea of the rotating coordinator algorithm introduced by Chandra et al. [4], and uses the same four-phase structure. Figure 3 describes the rotating coordinator task.

Phase 1: Each process sends its estimate to the coordinator of the round (Figure 6).

Phase 2: This phase (Figure 7) is executed by the coordinator of the round. The coordinator process waits until it either received an estimate message or suspected a majority of processes. Each message contains one estimate. The coordinator selects the estimate message with the highest timestamp. We call this estimate the winner. In order to make the same choice after a crash, all estimate messages except the one from the winner are acknowledged. In case of a crash, the coordinator will only be able to read again the value proposed by the winner of the execution before the crash. This estimate is sent to all processes in a prop message. Once the sending of the prop message is done, the estimate message from the winner is acknowledged (at this point, nothing in this round needs to be done again).

procedure propose(v) ; ! Submit proposal value
begin
  send((init, v), p) ; ! Send to self. Class K_init   {1}
  when send-done((init, v), p) do
    propose-done(v) ; ! Proposal is acknowledged
    estimate_p ← v ; decided ← false ; ts_p ← 0 ;   {2}
    new task rotate ; ! Start rotating coordinator
end

Figure 4. Propose

procedure decide ; ! Get decision
begin
  m ← rb-read(K_decided) ; ! Read decision
  if m = ⊥ then
    return ⊥ ; ! Consensus already successfully decided
  else
    (decided, r, v) ← m ; ! Extract value v from message m
    decided ← true ; return v ; ! Algorithm can stop.
end

procedure decide-done ; ! Decision is stable
begin
  foreach i ∈ 1 . . . rp do
    q ← (i mod n) + 1 ; ! Find coordinator q
    read-done(K^{r,q}_estimate) ; ! Acknowledge all estimates
  rb-read-done(K_decided) ; ! Acknowledge decision
  read-done(K_init) ; ! Acknowledge initial value
end

Figure 5. Decide and Decide-done

Phase 3: Every process waits for the prop message from the coordinator (Figure 8). If the prop message is received before the coordinator is suspected, the process sends a vote message with a true value. If the coordinator is suspected, the prop message that did not arrive is acknowledged; this way, the proposed value cannot be read after a crash. The process then sends a vote message with a false value.

Phase 4: In this phase (Figure 9), the coordinator waits for vote messages from a majority of processes. It then acknowledges all vote messages that it did not receive, in order to take the same decision after a crash. If the coordinator collected a majority of vote messages with true values, it sends a reliable broadcast with the final decision.

5.5. Correctness

We first show that our algorithm is equivalent to the one presented by Chandra et al. [4] in the crash-stop model. We then show that a process p that crashes and then recovers cannot be distinguished by other processes from one that did not crash. Process p will reach the same state and the other processes will receive the same sequences of messages as if p did not crash. Consequently, any run where process p crashes and recovers is equivalent to a run in which p did not crash. We then show that the algorithm terminates correctly if at least one process successfully proposed a value.

procedure phase1 ; ! All propose estimate
begin
  m ← (estimate_p, ts_p) ; ! Encapsulate estimate in message
  send(m, cp) ; ! Class is K^{rp,p}_estimate
end

Figure 6. Phase 1

Lemma 1 In the end-to-end consensus algorithm, if processes that crash never recover, then operations read(K) and rb-read(K) never return the value ⊥.

PROOF. If processes that crash never recover, each phase of the end-to-end consensus algorithm is only executed once for each round r. In each phase, the class depends on the round number, so message classes from previous rounds are not read again. In each phase, message classes are always acknowledged after they have been read. A read(K) or rb-read(K) can only return the value ⊥ if K was acknowledged before the read operation. □ (Lemma 1)

Lemma 2 The algorithm in Section 5.4 solves the consensus problem if processes do not recover and there is a majority of green processes.

PROOF. If processes that crash never recover (i.e., there are no yellow processes), then processes are either green or red. Red processes can be either unstable or crashed forever, but in the crash-stop model unstable processes cannot exist. If we exclude recoveries and unstable processes, the failure detector ◊S_u is equivalent to ◊S, that is, it enforces eventual weak accuracy and strong completeness. By design, the algorithm presented in Section 5.4 only differs from the one of Chandra and Toueg [4] in two ways: class acknowledgment and handling of the special value ⊥. Lemma 1 states that the value ⊥ is never returned when processes do not recover. The only effect of class acknowledgment is to change the return value of read operations. Therefore, the algorithm is functionally equivalent to the one presented by Chandra et al. [4], which is proven to solve consensus with a majority of correct (green) processes in the asynchronous model. □ (Lemma 2)

Lemma 3 If a process p crashes in Phase 1 during round r and process p re-executed round r−1 correctly, then process p will re-execute Phase 1 and reach the same state and produce the same outputs as before the crash.

procedure phase2 ; ! Coordinator selects estimatebegin

if p = cp thenestSet ' /0 ; estimatec '$ ; tsc '+1 ; win '$;3while |estSet*"Su| < n do4

foreach q /" estSet doif available(Kr,q

estimate) thenm ' read(Kr,q

estimate ) ; ! Read all messagesif m ,=$ then(estimateq, tsq) ' m ;if tsq > tsc then5

tsc ' tsq ; estimatec ' estimateq ; win ' q ;

estSet ' estSet*{q};

foreach q ,= win doread-done(Kr,q

estimate) ; ! Ack. all except winner6

if tsc ,=$ then7foreach q " ! do

send((prop, tsc,estimatec), q); ! Class Krprop

when send-done((prop, tsc,estimatec), q) doread-done(Kr,win

estimate) ; ! Acknowledge the winner.8

end

Figure 7. Phase 2

PROOF. Phase 1 has only one output: the estimate message. There are two cases: the first round (round 1) and subsequent rounds. If process p crashed in Phase 1 of round 1, then message init was successfully sent (line 1). Upon recovery, the recover procedure will restore the initial estimate_p value of the algorithm (line 19). If the crash occurs during round r > 1, the content of all variables is decided by the execution of the previous round. As round r−1 was correctly recovered, all values (estimate_p, rp, and ts_p) are correct. In both cases, the estimate message might be sent twice, but the determinism clause of the end-to-end channel ensures that only the first version of the message will be read. □ (Lemma 3)

Lemma 4 If process p crashes in Phase 2 during round r and process p re-executed round r−1 correctly, then upon recovery process p will re-execute the algorithm and reach the same state and produce the same outputs as before the crash.

PROOF. Because of Lemma 3, process p will reach the correct state at the end of Phase 1. The output of Phase 2 is the value of estimate_p that is sent in the prop message (line 7). If process p is not the coordinator of round r, then this phase does nothing. Let us assume that process p is the coordinator. There are three possibilities. a) Process p was not able to send a prop message. In this case the loop will restart, potentially with another outcome, but as there was no output, it does not matter. b) Process p received enough messages (line 4) to exit the loop and was able to send a prop message, but the prop message was not acknowledged. The value of estimate_p was built from the set of messages received, the decisive one being the message from process win. Upon recovery, only estimate_win will be readable, all other messages having been acknowledged (line 6). So the same values will be selected and sent with the prop message to all processes. c) Process p was able to send the prop message successfully to all other processes. All estimate messages have been acknowledged, so this phase does nothing. In all cases, the algorithm will reach the correct state. □ (Lemma 4)

procedure phase3 ; ! All wait for proposal and confirm
begin   {9}
  accept ← false ; ! By default refuse
  wait until available(K^r_prop) ∨ cp ∈ ◊S_u ;
  if available(K^r_prop) then
    m ← read(K^r_prop) ;   {10}
    if m ≠ ⊥ then
      (prop, ts_c, estimate_c) ← m ;
      estimate_p ← estimate_c ; ts_p ← rp ; ! Upd. est. & timestamp   {11}
      accept ← true ; ! Vote is true
  else
    read-done(K^r_prop) ;   {12}
  send((vote, accept), cp) ; ! Class K^{rp,p}_vote   {13}
end

Figure 8. Phase 3

Lemma 5 If process p crashes in Phase 3 during round r and process p re-executed round r−1 correctly, then upon recovery it will re-execute the algorithm and reach the state reached in Phase 3 of round r.

PROOF. Because of Lemma 4, process p will reach the correct state at the end of Phase 2 of round r. Phase 3 has only one output: the value contained in the vote message, but it also changes the state of the variables estimate_p and ts_p. There are three crash scenarios: a) Process p crashes while waiting for the prop message (line 9). In this case there is no output and no internal variable changed; the algorithm can re-execute in any way. b) Process p receives a prop message and crashes. When the algorithm is re-executed, the same message m will be read. This will result in the same state change (line 11). The algorithm might send the vote message multiple times (line 13), but the determinism clause ensures that the message is only read once by the coordinator. c) Process p suspects the coordinator, sends a vote message with value false (line 13) and crashes. Process p acknowledged the prop message (line 12), so the execution after the crash cannot read an eventual prop message, and the state will not be changed. A duplicate vote message might be sent, but the determinism clause ensures that only the first message will be read by the coordinator. In all cases, the algorithm will reach the correct state. □ (Lemma 5)

procedure phase4 ; ! Coordinator tries to decide
begin
  if p = cp then
    yesSet ← ∅ ; ! Received the proposal
    noSet ← ∅ ; ! Did not receive the proposal
    clrSet ← ∅ ; ! Ack. processes
    while (|yesSet ∪ noSet| < ⌈n/2⌉) ∧ (|clrSet| < ⌈n/2⌉) do   {14}
      if available(K^{rp,q}_vote) then
        m ← read(K^{rp,q}_vote) ;
        if m ≠ ⊥ then
          (vote, accept) ← m ; ! Extract vote from message.
          if accept = true then
            yesSet ← yesSet ∪ {q} ; ! Add q to yes set
          else
            noSet ← noSet ∪ {q} ; ! Add q to no set
        else
          clrSet ← clrSet ∪ {q} ; ! Add q to clear set
    ∀q ∉ (yesSet ∪ noSet) read-done(K^{rp,q}_vote) ;   {15}
    if |yesSet| ≥ ⌈n/2⌉ then
      m ← (decided, rp, estimate_p) ; ! Encapsulate estimate in m
      rb-send(m) ; ! Class K_decided.   {16}
      when rb-send-done(m) do
        ∀q ∈ yesSet read-done(K^{rp,q}_vote) ; ! Decision sent   {17}
    else
      ∀q ∈ yesSet read-done(K^{rp,q}_vote) ; ! No majority reached   {18}
end

Figure 9. Phase 4

Lemma 6 If a process p crashes in Phase 4 during round r and process p re-executed round r−1 correctly, then upon recovery it will re-execute the algorithm and reach the state reached in Phase 4 of round r.

PROOF. Because of Lemma 5, process p will reach the correct state at the end of Phase 3. If p is not the coordinator for this round, this phase does nothing. Let us consider that p is the coordinator for this round. Phase 4 has only one output: the decided message, which may be sent or not. There are four crash possibilities: a) Process p crashes while waiting for a vote message (line 14). This case results in no output, so execution can resume in any way. b) Process p sends the decided message and crashes. Upon recovery, p will read the same set of messages (yesSet ∪ noSet). Messages not within this set have been acknowledged (line 15). So the algorithm will reach the same decision (to send the message or not). As the same set is used to build the decided message, the same value is sent. We need to ensure this because reliable broadcast only ensures class atomicity (Section 4.1). c) Process p successfully sent the decided message and executed line 17. Reading from all processes will return the value ⊥, yesSet ∪ noSet will be empty, and this phase will do nothing. d) Process p did not get a majority of vote messages with value true, it did not send a decided message, and executed line 18. On recovery, there will be a majority of messages in clrSet and this phase will do nothing. Even if a crash occurs during line 15, the loop at line 14 will always exit: there will always be a majority of processes either in yesSet ∪ noSet or in clrSet. In all cases, the algorithm will reach the correct state. □ (Lemma 6)

procedure recover ; ! Process recovers
begin
  if available(K_init) then
    (init, estimate_p) ← read(K_init) ; ! Recover initial value.   {19}
    ts_p ← 0 ; ! Initial value's timestamp is 0   {20}
  else
    estimate_p ← ⊥ ; ! Use default value   {21}
    ts_p ← −1 ; ! This proposal should never win   {22}
  decided ← false ; ! Algorithm has not decided
  new task rotate ;
end

Figure 10. Recovery procedure

Lemma 7 Yellow processes that have successfully proposed behave like green processes.

PROOF. Let us assume that process p crashed in Phase i of round r. Lemmas 3 to 6 ensure that the execution of each round results in the same outputs (i.e., state transitions and sent messages). If some messages are sent multiple times, the determinism clause of the channels hides this. □ (Lemma 7)

Theorem 1 The algorithm solves consensus if there is a majority of yellow processes that successfully proposed.

PROOF. Lemma 7 ensures that yellow processes that have successfully proposed behave like green processes. Unstable red processes will eventually be permanently suspected by the failure detector and thus handled like crashed processes. Therefore the algorithm behaves as with a majority of green processes. □ (Theorem 1)

Lemma 8 The value ⊥ will never be sent in the "prop" message by the coordinator.

PROOF. If estimate_p = ⊥, then ts_p = −1 (line 22). The coordinator only sends a prop message if win ≠ ⊥ (line 5). Variable win is only affected if the coordinator receives an estimate with a timestamp larger than the initial value of −1 (line 5). □ (Lemma 8)

Theorem 2 The algorithm solves the problem if at least one yellow process successfully proposed a value, and if a majority of processes are yellow.

PROOF. If one process successfully proposed, then at least one process p has estimate_p ≠ ⊥. This value will eventually reach the coordinator of some round r. Because of Lemma 8, the coordinator cannot select an invalid (⊥) value, so eventually this value will be sent in the prop message and decided on. □ (Theorem 2)

5.6. Discussion

The end-to-end consensus presented in this paper solves the issue presented in Section 1. Application-level atomicity can be enforced in the following way: the consensus primitive delivers value v to the application using the decide(v) interface. The application checks whether it is recovering after a crash and whether v was already delivered before the crash. If this is not the case, v is written to disk and the primitive decide-done(v) is called. This ensures that value v will not be delivered again in the future.
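A sketch of this pattern as seen by the application (our illustration; consensus stands for an object exposing decide and decide_done as in Section 5.1, None stands for ⊥, and the file at storage_path is the application's stable storage):

    import os

    def deliver_decision(consensus, storage_path):
        """Application-level atomicity: acknowledge the decision only after it
        has been written durably to disk."""
        v = consensus.decide()
        if v is None:                            # ⊥: decision already delivered and acknowledged
            return
        if not os.path.exists(storage_path):     # recovering: the value may already be on disk
            with open(storage_path, "w") as f:
                f.write(repr(v))
                f.flush()
                os.fsync(f.fileno())             # make the write durable before acknowledging
        consensus.decide_done()                  # the decision will not be delivered again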

The properties of the consensus algorithm presented here depend on the properties of the underlying end-to-end channels. If one considers the asynchronous model, the properties are the same as those of the consensus algorithms presented by Aguilera et al. [1] and by Lamport [11]: if a majority of processes is yellow, consensus will terminate. The main difference is that our algorithm ensures end-to-end properties, i.e., the decision is delivered successfully once.

The natural way to implement end-to-end channels is to store the message on stable storage on the receiver and consider the send successful when the write was done and acknowledged. Still, end-to-end channels can be implemented in different ways. In their paper, Aguilera et al. [1] also present an algorithm that solves consensus with a majority of yellow processes and one green process, without stable storage. End-to-end channels need some form of persistency to be implemented, but replication can be used instead of stable storage.

For instance, end-to-end channels can be implemented using quorum replication. The state of each class K on receiver process p consists of one replicated data item (K, p). Each item can only contain three values: ∅ (no message for this channel), a message m, or ⊥, which marks the fact that the message class was acknowledged. The state of an item can only change from ∅ to either m or ⊥, and from m to ⊥. Because the possible values are strictly ordered and only two processes (sender and receiver) can write each item, there is no need for version numbers, and both reads and writes can be done in one operation on each copy. Channel operations are implemented in the following way (a sketch follows the list):

send(m): multicast m to a write quorum of Qw processes. Each process in the quorum will only store m if the existing value is ∅.

read(K): the value for class K is requested from a read quorum of Qr processes. If no process has any value, block and wait for a value to be available. If one process has the value ⊥ then return ⊥. If no process has the value ⊥ and at least one process has message m, return message m.

read-done(K): multicast the value ⊥ to a write quorum of Qw processes.
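A single-process sketch of this quorum scheme (our illustration: the list replicas stands for the n copies of one item (K, p), which in a real system would live on different nodes and be reached by multicast):

    EMPTY, BOTTOM = "EMPTY", "BOTTOM"   # BOTTOM stands for ⊥ (class acknowledged)

    class QuorumItem:
        """Replicated state of one item (K, p): each copy is EMPTY, a message, or BOTTOM."""

        def __init__(self, n):
            self.replicas = [EMPTY] * n

        def send(self, m, write_quorum):
            """Store m on a write quorum; a copy only accepts m if it is still EMPTY."""
            for i in write_quorum:
                if self.replicas[i] == EMPTY:
                    self.replicas[i] = m

        def read(self, read_quorum):
            """BOTTOM wins over any message, a message wins over EMPTY."""
            values = [self.replicas[i] for i in read_quorum]
            if BOTTOM in values:
                return BOTTOM
            for v in values:
                if v != EMPTY:
                    return v
            return None          # no value yet: the caller blocks and retries

        def read_done(self, write_quorum):
            """Acknowledge the class: overwrite a write quorum with BOTTOM."""
            for i in write_quorum:
                self.replicas[i] = BOTTOM

    # With Qr = 1 and Qw = n (read any single copy, write all copies), the item
    # behaves correctly as long as one copy survives, as discussed below.
    item = QuorumItem(n=4)
    item.send("m", write_quorum=range(4))      # Qw = n
    assert item.read(read_quorum=[2]) == "m"   # Qr = 1
    item.read_done(write_quorum=range(4))
    assert item.read(read_quorum=[0]) == BOTTOM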

If we set the quorums to be Qr = 1 and Qw = n, then the channels will be correct as long as one process remains green. While we get the same properties as the algorithm of Aguilera et al., our protocol sends many more messages. In order to implement end-to-end atomicity, the system needs to have access to some form of stable storage. At the very least, the protocol needs to store on stable storage one bit of information for each message m and for each receiver q: whether message m was delivered on q. Because of this, we see that end-to-end consensus cannot be implemented without some persistency mechanism, either stable storage or replication. It is interesting to note that using different quorum configurations gives a consensus protocol with different fault-tolerance properties.

6. Conclusion

We presented an algorithm that implements end-to-end consensus based on end-to-end channels. The algorithm offers end-to-end properties in addition to traditional consensus properties.

There is a clear trade-off between strong fault-tolerance properties and performance in the way the end-to-end channels are implemented: stronger properties require either stable storage operations or the exchange of messages between processes. The algorithm is designed in a way that is independent of the underlying model. It only relies on two abstractions: the failure detector and the end-to-end channel. The algorithm can be used in a large variety of models without any change, as long as the failure detector and the end-to-end channel abstraction are available.

The end-to-end channel can be adapted to the requirements of group communication and the applications. The technique used in this paper to implement an end-to-end consensus algorithm can be extended to other group communication primitives, like atomic broadcast. This approach is very well-suited for modular group communication and systems that support protocol composition.

Primitives with end-to-end properties can be used for applications that have end-to-end requirements, such as for instance replicated databases. End-to-end channels can also be used by applications that need to recover some part of their state after a crash.

References

[1] M. K. Aguilera, W. Chen, and S. Toueg. Failure detection and consensus in the crash-recovery model. Distributed Computing, 13(2):99–125, 2000.

[2] R. Boichat and R. Guerraoui. Reliable broadcast in the crash-recovery model. In Proc. of the 19th Symp. on Reliable Distributed Systems (SRDS), pages 32–41. IEEE, 2000.

[3] A. Bouteiller, P. Lemarinier, G. Krawezik, and F. Cappello. Coordinated checkpoint versus message log for fault tolerant MPI. In Proc. of the Int. Conf. on Cluster Computing (CLUSTER), pages 242–250. IEEE, 2003.

[4] T. D. Chandra and S. Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 43(2):225–267, 1996.

[5] D. R. Cheriton and D. Skeen. Understanding the limitations of causally and totally ordered communication. In B. Liskov, editor, Proc. of the 14th Symp. on Operating Systems Principles, pages 44–57. ACM, 1993.

[6] E. N. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys, 34(3):375–408, 2002.

[7] R. Guerraoui, M. Hurfin, A. Mostefaoui, R. Oliveira, M. Raynal, and A. Schiper. Consensus in asynchronous distributed systems: A concise guided tour. In Advances in Distributed Systems, volume 1752 of LNCS, pages 33–47. Springer, 2000.

[8] D. B. Johnson and W. Zwaenepoel. Recovery in distributed systems using asynchronous message logging and checkpointing. In Proc. of the 7th Annual Symp. on Principles of Distributed Computing (PODC), pages 171–181. ACM, 1988.

[9] S. Johnson, F. Jahanian, S. Ghosh, B. Vanvoorst, and N. Weininger. Experiences with group communication middleware. In Proc. of the Int. Conf. on Dependable Systems and Networks (DSN), pages 37–42. IEEE, 2000.

[10] B. Kemme, A. Bartoli, and Ö. Babaoglu. Online reconfiguration in replicated databases based on group communication. In Proc. of the Int. Conf. on Dependable Systems and Networks (DSN), pages 117–127. IEEE, 2001.

[11] L. Lamport. The part-time parliament. ACM Transactions on Computer Systems, 16(2):133–169, 1998.

[12] S. Mena and A. Schiper. A new look at atomic broadcast in the asynchronous crash-recovery model. In Proc. of the 24th Symp. on Reliable Distributed Systems (SRDS), pages 202–214. IEEE, 2005.

[13] A. Mostefaoui, M. Hurfin, and M. Raynal. Consensus in asynchronous systems where processes can crash and recover. In Proc. of the 17th Symp. on Reliable Distributed Systems (SRDS), pages 280–287. IEEE, 1998.

[14] J. Pereira, L. Rodrigues, and R. Oliveira. Semantically reliable multicast: Definition, implementation and performance evaluation. IEEE Transactions on Computers, 52(2):150–165, February 2003.

[15] S. Poledna. Fault-Tolerant Real-Time Systems: The Problem of Replica Determinism, volume 345 of Engineering and Computer Science. Kluwer, 1995.

[16] J. H. Saltzer, D. P. Reed, and D. D. Clark. End-to-end arguments in system design. ACM Transactions on Computer Systems, 2(4):277–288, November 1984.

[17] A. Schiper. Group communication: where are we today and future challenges. In Proc. of the 3rd Int. Symp. on Network Computing and Applications (NCA 2004), pages 109–117. IEEE, 2004.

[18] M. Wiesmann and A. Schiper. Beyond 1-safety and 2-safety for replicated databases: Group-safety. In Proc. of the 9th Int. Conf. on Extending Database Technology (EDBT 2004), volume 2992 of LNCS. Springer, March 2004.
