Fast message ordering and membership using a logical token-passing ring


Y. Amir, L. E. Moser, P. M. Melliar-Smith, D. A. Agarwal, P. Ciarfella
Department of Electrical and Computer Engineering
University of California, Santa Barbara, CA 93106

Abstract

Many protocols exist to support the maintenance of consistency of data in fault-tolerant distributed systems; these protocols are quite expensive and thus have not been widely adopted. The Totem protocol supports consistent concurrent operations by placing a total order on broadcast messages. This total order is achieved by including a sequence number in a token circulated around a logical ring that is imposed on a set of processors in a broadcast domain. A membership algorithm handles reconfiguration, including restarting of a failed processor and remerging of a partitioned network. Effective flow-control allows the protocol to achieve message ordering rates two to three times higher than the best prior protocols.

1 Introduction

As distributed systems become more widely used, improved performance and more reliable operation will be required. Of primary importance are the problems of maintaining the consistency of data and of coordinating the activities of processors in the network. These problems are difficult when asynchronism, fault-tolerance and performance must be taken into account.

Protocols exist that address these problems in fault-tolerant asynchronous distributed systems. These protocols are, however, quite expensive in the number of messages broadcast and/or the computation required. Recent designs for fault-tolerant distributed systems [5, 13, 17] employ the idea of placing a partial order or a total order on broadcast messages.
This approach simplifies the design of the application but depends heavily on efficient fault-tolerant protocols to keep the overhead reasonable.

The Totem protocol provides fast reliable totally ordered delivery of messages in a broadcast domain in which every message is transmitted to all processors in the domain. Reliable totally ordered delivery is achieved by circulating a token around a logical ring imposed on the broadcast domain. Only the processor in possession of the token can broadcast a message to the other processors on the ring. Under low loads, token-passing protocols, such as Totem, can exhibit longer latency from generation to ordering of a message. Under high loads, the token-based flow-control strategy of Totem provides about three times the throughput of, and lower latency than, prior total ordering protocols.

The Totem protocol also determines processor membership of the logical ring and handles token loss, processor failure and restart, and partitioning and remerging of the ring. Unless the token is lost, processors can continue to broadcast messages during reconfiguration up to the point at which the token for the old ring is replaced by the token for the new ring. Also important are the provisions for virtual synchrony and extended virtual synchrony, which ensure that messages are not lost or incorrectly ordered as a result of reconfiguration.

This work is based on prior experience with the Trans and Total protocols [13, 15, 16], which were effective but computationally expensive and led to the design of the Totem protocol [1, 12]. Ideas from the Trans protocol also influenced the development of the Transis system [2, 3]. The work presented here extends the membership algorithm of Transis to handle ring formation and token generation and combines that algorithm with the single-ring total ordering algorithm of Totem to obtain a highly efficient fault-tolerant total ordering protocol.

Footnote: The address of Y. Amir is Computer Science Department, The Hebrew University of Jerusalem, Israel. The address of P. Ciarfella is Digital Equipment Corporation, Networks Engineering, Littleton, MA.

2 Related Work

An early reliable broadcast and ordering protocol by Chang and Maxemchuk [6] uses a token-passing strategy. In their protocol the processor that holds the token acknowledges messages, whereas in Totem the processor that holds the token broadcasts messages. Chang and Maxemchuk have also provided a membership and token recovery algorithm. Typically, between two and three messages are required to order a message in an optimally loaded system.
While the latency is reasonable at low loads, it increases at high loads and in the presence of a failed processor.

More closely related to Totem is the TPM protocol of Rajagopalan and McKinley [18]. Like Totem, TPM uses a token on a logical ring for broadcasting and retransmission of messages. They have also provided a membership and token regeneration protocol. In the absence of processor failure and network partitioning, the TPM protocol achieves safe delivery (defined in Section 3) but requires, on average, two and one-half token rotations. TPM addresses network partitioning, but delivers more messages than are allowed by causal delivery and virtual synchrony.

The ISIS system of Birman and Joseph [4] is based on the idea of ordering messages multicast to process groups. Recent versions [5] have adopted a token-passing protocol, similar to that of Chang and Maxemchuk, for total ordering. ISIS has focused on the application program interface and provides several types of services of increasing cost, reliability and synchronization: BCAST or unordered messages, CBCAST or causally ordered messages, and ABCAST or totally ordered messages. ISIS has also established the idea of virtual synchrony as important to maintaining consistency in fault-tolerant distributed systems.

Peterson, Buchholz and Schlichting [17] have devised the Psync protocol, which constructs a partial order on messages that can be converted into a total order. In contrast, the Totem protocol constructs a total order on messages directly without constructing a partial order first. Mishra, Peterson and Schlichting [14] have developed a membership algorithm based on the partial order of Psync.

An early use of synchronous behavior in the communication system to support fault-tolerant distributed applications was the work of Cristian, Aghili, Strong and Dolev [8]. They fabricated an atomic broadcast from unreliable message communication using loosely synchronized clocks and timeouts with an upper bound on message transit time. Cristian [7] has also developed a membership algorithm using a strategy similar to that used for atomic broadcast. The Totem protocol likewise uses a timeout mechanism to detect token loss and processor failure.

The approach of Kopetz et al. [9] is even more synchronous than that of Cristian, being specifically directed towards real-time applications. The result is higher performance and less flexibility.
Kopetz [10] has used this synchronous approach to provide membership services in addition to reliable ordered message delivery. The Totem protocol provides equally high performance with greater flexibility.

Related to our flow-control strategy are the sliding-window strategy and the token rotation time limit of FDDI. The use of a token and a window for flow control on a broadcast medium has not been investigated in prior work known to us.

3 Services

The Totem protocol provides reliable totally ordered message delivery and membership services to higher layers of the communication protocol hierarchy, referred to here as the application. The protocol accommodates message loss (including token loss), processor failure and restart, and network partitioning and remerging. Malicious faults are assumed not to occur. In addition, we assume that the probability that a processor receives a broadcast message is greater than some positive constant. We use the terms receive and deliver as follows. A processor receives a message that was broadcast by a processor in the broadcast domain and a processor delivers messages in order to the application.

3.1 Reliable Ordered Delivery Services

The Totem protocol provides two types of reliable ordered message delivery, agreed and safe. Both of these services provide total ordering of messages. Safe delivery requires, in addition, that the processor delivering the message know that all other processors in the configuration have received and will deliver the message. Safe delivery is the best that any communication subsystem can provide and may be required by the application. However, safe delivery incurs a larger latency than agreed delivery. In Totem, the mean latency from generation of a message to agreed delivery is approximately half the token rotation time, while the mean latency to safe delivery is approximately two token rotation times.

Reliable ordered delivery services can be categorized into five levels of service: basic, fifo, causal, agreed and safe. The basic level of service guarantees that messages are delivered to the application (without regard for ordering). Every processor that receives a basic message can deliver that message immediately. The fifo level of service ensures that messages originated by a given processor are delivered in the order in which they were originated. However, messages from different processors can be arbitrarily interleaved. The causal level of service, which is taken from Lamport [11], is the reflexive transitive closure of the relation:

• Message m precedes message m′ if processor p delivers m before p sends m′.
• Message m precedes message m′ if processor p sends m before p sends m′.

We now provide the definitions of agreed and safe delivery.

Definition.
A message m is delivered by a processor p in agreed order in configuration C if and only if (1) p is a member of C, p has received m, and m was originated by a member of C or of a configuration that precedes C, (2) p does not deliver two different messages at the same position in agreed order in C nor does it deliver one message at two different positions in agreed order in C, (3) p has delivered all messages that precede m in causal order in C, (4) p has delivered all messages that precede m in agreed order in C, and (5) for any other processor q in C, if p delivers m before n in agreed order in C, then q does not deliver n before m in agreed order in C.

Definition. A message m is delivered by a processor p in safe order in configuration C if and only if it is delivered by p in agreed order in C and if p knows that all processors in C have received and will deliver m.

It is easy to see that safe service is also agreed service, that agreed service is also causal service, that causal service is also fifo service and that fifo service is also basic service.
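
Condition (4) of the agreed-order definition can be illustrated with a small delivery buffer. The following is only a sketch, assuming that positions in the agreed order are consecutive integer sequence numbers (as in the ordering algorithm of Section 4); the class name `AgreedDeliveryQueue` and its methods are hypothetical and not part of Totem.

```python
class AgreedDeliveryQueue:
    """Buffers received messages and releases them in gap-free sequence order."""

    def __init__(self, first_seq):
        self.next_seq = first_seq   # position of the next message to deliver
        self.buffer = {}            # seq -> message, received but not yet deliverable

    def receive(self, seq, message):
        """Record a received message; return the messages that become deliverable."""
        if seq < self.next_seq or seq in self.buffer:
            return []               # duplicate or already delivered: deliver nothing
        self.buffer[seq] = message
        delivered = []
        # Deliver only while every earlier position has already been delivered.
        while self.next_seq in self.buffer:
            delivered.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1
        return delivered
```

A message received out of order is held back until the gap before it is filled, which is the distinction between receiving and delivering drawn in Section 3.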

Many systems provide some of these levels of service. In particular, the ISIS system [5] provides CBCAST and ABCAST service, corresponding approximately to causal and agreed service, respectively. ISIS also provides an all-stable service, corresponding approximately to safe service. The Transis system [2] provides causal, agreed and safe levels of service.

The use of one of the lower levels of service, i.e. basic, fifo or causal rather than agreed, has been justified for other protocols by the earlier delivery achieved. In Totem, a processor can deliver a message broadcast on the ring in agreed order immediately upon receipt, provided that it has already delivered all messages on the ring that precede that message in the order. Safe delivery provides a higher level of service than agreed delivery, and does incur a latency penalty. Thus, we provide agreed and safe delivery and no others.

3.2 Membership Services

In a distributed system processors may fail and restart and the network may partition and remerge. Thus, the need exists for a service that maintains consensus on the set of processors currently connected and working. Some distributed applications may need to know about configuration changes. This information can be provided by configuration change messages, which are delivered in order along with broadcast messages. Such a message is regarded as the terminal message of an existing configuration and the initial message of a new configuration. The configuration change messages define a partial order on the configurations. These messages support our implementation of virtual synchrony, a concept introduced by Birman [5]. We define virtual synchrony as follows:

Definition.
Virtual synchrony requires that if processors p and q are both members of consecutive configurations C and D, then p and q determine exactly the same agreed order of messages in C.

Virtual synchrony suffices for systems like ISIS that do not allow network partitioning and only consider processors that fail and never recover or regard repaired processors as new processors. If a configuration C partitions into configurations D and E, the existing concept of virtual synchrony might allow configurations D and E to deliver messages in inconsistent orders. Such behavior can be observed in existing systems and causes problems for database designers. Similar problems can arise if a processor fails. In effect, immediately prior to the failure, the processor becomes isolated and may order messages inconsistently with the other processors. Consequently, we extend the definition of virtual synchrony to constrain the behavior of processors that are members of non-consecutive configurations.

Definition. Extended virtual synchrony is virtual synchrony with the additional requirement that, if processors p and q both order messages m and n, then they determine the same relative order of m and n.

Note that messages m and n are not necessarily ordered by p and q in the same configuration or even in consecutive configurations.
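
The additional requirement of extended virtual synchrony can be stated as a check over delivery logs. The sketch below is illustrative only: it assumes each processor's log is a list of distinct message identifiers in the order that processor ordered them, and the function names are hypothetical.

```python
from itertools import combinations


def same_relative_order(log_p, log_q):
    """True if every pair of messages ordered by both p and q has the same relative order."""
    position_in_q = {m: i for i, m in enumerate(log_q)}
    # Messages ordered by both processors, listed in p's order.
    shared = [m for m in log_p if m in position_in_q]
    # q agrees with p exactly when q's positions increase along p's order.
    return all(position_in_q[a] < position_in_q[b]
               for a, b in zip(shared, shared[1:]))


def satisfies_ordering_requirement(logs):
    """Check the pairwise requirement for a dict mapping processor -> delivery log."""
    return all(same_relative_order(logs[p], logs[q])
               for p, q in combinations(logs, 2))
```

Note that the check never asks which configuration a message was ordered in, mirroring the remark that m and n need not be ordered in the same or in consecutive configurations.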

As a configuration changes due to processor failure or network partitioning, it is inevitable that some messages may be received by some processors but not by others. Care is required to ensure that delivery of messages still meets the requirements of causal, safe and agreed orders. In Totem, a processor delivers two configuration change messages to the application. These messages not only inform the application of the changes but also define the configurations for which virtual synchrony holds.

4 The Total Ordering Algorithm

The Totem single-ring ordering algorithm operates on a broadcast domain that consists of a finite number of processors. Each processor has a unique identifier, which does not change if a processor fails and restarts. A configuration or ring consists of an ordered set of processors within the broadcast domain. Each ring has a unique ring identifier consisting of the smallest identifier of any processor on the ring and the sequence number of the first message broadcast on that ring; at any point in time a processor is a member of only one ring. Each message has an identifier consisting of the identifier of the processor originating the message, the identifier of the ring on which the message was originated, and a sequence number.

For the total ordering algorithm described in this section we assume that the token is never lost, that the ring does not become partitioned and that processor failures do not occur; however, messages may be lost. In Section 5 we relax these assumptions and extend the algorithm to handle token loss, processor failure and restart, and partitioning and remerging of the ring.

The total ordering of messages is achieved by using a single sequence of message sequence numbers for all processors on the ring and by including the sequence number of the last message broadcast in the token. The token contains the following fields:

• type: Regular.
• seq: The highest sequence number of any message that has been broadcast on the ring, i.e. a high-water mark.
The seq is initially one less than the sequence number of the first message on the ring.
• aru: A sequence number (all-received-up-to) such that all processors on the ring have received all messages up to (and including) the message with this sequence number, i.e. a low-water mark. It is used to control the discarding of messages that have been received by all processors on the ring and that will, therefore, not need to be retransmitted. The aru is initially one less than the sequence number of the first message broadcast on the ring.
• rtr: A retransmission request list, containing one or more retransmission requests.
• flg: A flag that is used during reconfiguration to achieve extended virtual synchrony. The flg is initially set to false.

• fcc: The number (flow control count) of messages actually broadcast by all processors on the ring in the last rotation of the token, including retransmissions.

Each processor maintains a local variable my_aru containing the sequence number of the message such that it has received all messages with sequence numbers at most equal to that sequence number. Initially, a processor sets my_aru to be one less than the sequence number of the first message on the ring. As the processor receives messages, it updates my_aru. Each processor maintains a list of messages that it has received; messages that are safe can be discarded from this list.

On receipt of the token, a processor completes the processing of messages in its input buffer, broadcasts messages, updates the token and transmits it to the next processor on the ring. For each new message it broadcasts, a processor increments the seq field of the token and sets the sequence number of the new message to this seq.

Whether broadcasting a message or not, a processor compares the aru field of the token with my_aru and, if my_aru is smaller, replaces aru with my_aru. If the processor previously lowered the aru and the token returned with the same value as it set, then it sets aru equal to my_aru. If seq and aru are equal, then it increments aru and my_aru in step with seq.

If the seq field of the token indicates that messages have been broadcast that it has not yet received, a processor sets or augments the rtr field. If the processor has received messages that appear in the rtr field then, for each such message, it generates an independent random variable to determine whether it should retransmit that message before broadcasting new messages. (This randomization increases overall system reliability.) When it retransmits a message, the processor removes the sequence number of that message from the rtr field.

The flg field is used to achieve extended virtual synchrony, as described in Section 5.
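
The token-handling rules above can be collected into one routine. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `RegularToken`, `Processor`, `on_token`, `lowered_aru_to` and `should_retransmit` are hypothetical, the retransmission probability of 0.5 is an arbitrary placeholder, the fcc field is carried but not updated, and the marker is cleared whenever the aru on the token was not set by this processor (an assumption the text does not spell out).

```python
import random
from dataclasses import dataclass, field


@dataclass
class RegularToken:
    seq: int                                  # high-water mark
    aru: int                                  # low-water mark (all-received-up-to)
    rtr: set = field(default_factory=set)     # requested sequence numbers
    flg: bool = False                         # used only during reconfiguration
    fcc: int = 0                              # flow control count, unused here


class Processor:
    def __init__(self, first_seq):
        self.my_aru = first_seq - 1           # one less than the first message
        self.received = {}                    # seq -> payload
        self.pending = []                     # new payloads waiting for the token
        self.lowered_aru_to = None            # value this processor last set in aru

    def receive(self, seq, payload):
        self.received[seq] = payload
        while self.my_aru + 1 in self.received:
            self.my_aru += 1

    def on_token(self, token, should_retransmit=lambda seq: random.random() < 0.5):
        broadcasts = []
        arrived_aru = token.aru
        in_step = token.seq == token.aru
        # Request every message the token says exists but we have not received.
        for seq in range(self.my_aru + 1, token.seq + 1):
            if seq not in self.received:
                token.rtr.add(seq)
        # Retransmit requested messages we hold, subject to a random choice.
        for seq in sorted(token.rtr):
            if seq in self.received and should_retransmit(seq):
                broadcasts.append((seq, self.received[seq]))
                token.rtr.discard(seq)
        # Broadcast new messages, stamping each with the incremented seq.
        for payload in self.pending:
            token.seq += 1
            self.receive(token.seq, payload)
            broadcasts.append((token.seq, payload))
            if in_step:
                token.aru = token.seq         # aru advances in step with seq
        self.pending = []
        # Lower aru if we are behind, or restore it if we set the arriving value.
        set_by_me = self.lowered_aru_to is not None and arrived_aru == self.lowered_aru_to
        if self.my_aru < token.aru or set_by_me:
            token.aru = self.my_aru
            self.lowered_aru_to = self.my_aru if token.aru < token.seq else None
        else:
            self.lowered_aru_to = None
        return broadcasts
```

Passing a deterministic `should_retransmit` (for example `lambda seq: True`) makes the routine repeatable for testing; the list returned stands in for the messages the processor would broadcast before forwarding the token.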
The fcc field provides the flow control for the protocol, as described in Section 6.

If a processor has received and delivered every message with sequence number less than that of message m, then it can deliver m in agreed order. If a processor can deliver m in agreed order and if on two successive rotations, it releases the token with an aru no less than the sequence number of m, then it can determine that m is safe. The proof of correctness of the basic total ordering algorithm of Totem has been omitted.

5 The Membership Algorithm

The membership algorithm presented here is used in conjunction with the total ordering algorithm of Section 4. The algorithm handles all aspects of processor membership, including processor failure and restart, token loss, and network partitioning and remerging.

The algorithm differs from other membership algorithms in several important ways. The algorithm uses a single representative for each ring that is being merged; this representative negotiates the membership of the new ring on behalf of the other processors on the old ring and should not be regarded as a leader or master of the old or new rings. While the new ring is being formed, the old ring is used as long as possible to broadcast messages. Before it is installed, the new ring is used to recover messages for the old ring that must be delivered to achieve extended virtual synchrony.

The membership algorithm uses three types of special messages:

• Attempt Join message broadcast by a representative initiating the membership algorithm to form a new ring from two or more rings.
• Join message broadcast by a representative proposing a set of representatives of old rings and also a set of failed representatives (a subset of the set of representatives). The proposed new ring will be formed by the representatives in the first set but not in the second.
• Configuration Change message that enumerates the membership of a new ring or of an interim configuration consisting of processors on both old and new rings.

The Attempt Join and Join messages have no sequence numbers and are not delivered to the application. The Configuration Change messages are used to ensure extended virtual synchrony. They are generated locally at each processor and are delivered directly to the application without being broadcast.

In forming a new ring the representative of that ring generates a Form token. The Form token, which differs from the regular token of the total ordering algorithm, contains the following fields:

• type: Form.
• form_token_id: The Form token identifier, which consists of the identifier of the representative of the new ring and a Form token sequence number. This sequence number is necessary to ensure that there is a unique Form token for the proposed new ring.
It is determined by the representative and incremented with each Form token the representative generates.
• join_list: A sorted list of the representatives' identifiers.
• join_index: The index of the representative in the join_list that has just released the Form token. This index is used to connect the old rings together into a new ring (see Figure 2).
• memb_list: A list containing all the identifiers of all the members of the new ring according to their position on the new ring. For each of these members, this list also contains its old ring identifier and my_aru.
• high_seq: The larger of the highest sequence number of a message received from any of the old rings and of the highest first sequence number of any new ring containing a member of this ring. This will be the first sequence number of the new ring.
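
The Form token fields listed above map naturally onto a record type. The following sketch is an assumption-laden illustration: the `Member` record, the helper `add_member`, and the simplified high_seq update (a running maximum over a single per-member value supplied by the caller) are hypothetical, and `new_ring_id` follows the description in Section 5.1.3 of the new ring identifier as the representative's identifier paired with high_seq.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    processor_id: int
    old_ring_id: tuple      # identifier of the ring the processor is leaving
    my_aru: int             # the processor's my_aru on that old ring


@dataclass
class FormToken:
    form_token_id: tuple    # (representative id, Form token sequence number)
    join_list: list         # sorted identifiers of the representatives
    join_index: int = 0     # index of the representative that last released the token
    memb_list: list = field(default_factory=list)   # members in new-ring order
    high_seq: int = 0
    type: str = "Form"


def add_member(token, member, highest_seq_seen):
    """Update the token as it passes a member of the proposed new ring."""
    token.memb_list.append(member)
    # Simplification: keep a running maximum of one value per member.
    token.high_seq = max(token.high_seq, highest_seq_seen)


def new_ring_id(token):
    """Ring identifier of the proposed ring: (representative id, high_seq)."""
    representative_id, _ = token.form_token_id
    return (representative_id, token.high_seq)
```

Keeping the old ring identifier and my_aru per member is what later lets each processor work out, locally, which messages of its old ring may still be missing at other members.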

These fields are updated as the Form token circulates around the new ring, except for the type field, the form_token_id field and the join_list, which are set by the representative of the new ring.

5.1 The State Diagram

The membership algorithm is defined in terms of the state diagram shown in Figure 1. The definitions of events and states are given below.

5.1.1 Definition of Events

There are five membership events, namely:

• Receiving a foreign message, which can be a
  - Regular message broadcast by a processor that is not a member of the ring, or
  - Attempt Join message, or
  - Join message.
• Receiving a Form token. On the first receipt of the Form token a processor of the proposed new ring updates the Form token; on the second receipt it obtains the updated information that the other processors supplied.
• Token loss timeout. This timeout indicates that a processor did not receive either the token or a message from a processor on the ring in that amount of time.
• Gather timeout. This timeout is used to bound the time for gathering representatives to form a new ring.
• Commit timeout. This timeout indicates that a processor participating in the formation of a new ring failed to determine that consensus had been reached.

5.1.2 Definition of States

There are five states, namely:

• Operational state. This is the regular state of the ring in which the basic total ordering algorithm operates with no membership changes.
• Gather state. In the Gather state the representatives that will constitute the new ring are collected. This is done by gathering as many Attempt Join and Join messages as possible before the Gather timeout expires.
• Commit state. In the Commit state the representatives attempt to reach consensus on the set of representatives whose rings will be combined to form the proposed new ring.
• Form state. In the Form state the path of the token of the proposed new ring is determined and information about the members of that ring is exchanged.
• Recover state.
In the Recover state messages that have to be delivered for the old rings are recovered to ensure that extended virtual synchrony is achieved. When extended virtual synchrony is achieved, the proposed new ring is installed as the new ring.
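
The events and states defined above can be written down as enumerations with a transition lookup. The sketch below is deliberately partial: it encodes only transitions that the prose of Section 5.1.3 states explicitly, it is not a transcription of Figure 1, and the names `State`, `Event` and `next_state` are hypothetical. Pairs that the sketch does not cover return `None` rather than guessing.

```python
from enum import Enum, auto


class State(Enum):
    OPERATIONAL = auto()
    GATHER = auto()
    COMMIT = auto()
    FORM = auto()
    RECOVER = auto()


class Event(Enum):
    FOREIGN_MESSAGE = auto()
    FORM_TOKEN = auto()
    TOKEN_LOSS_TIMEOUT = auto()
    GATHER_TIMEOUT = auto()
    COMMIT_TIMEOUT = auto()


# Transitions that hold regardless of whether the processor is a representative.
TRANSITIONS = {
    (State.GATHER, Event.GATHER_TIMEOUT): State.COMMIT,   # broadcast Join, then commit
    (State.COMMIT, Event.COMMIT_TIMEOUT): State.COMMIT,   # revise Join, restart timeout
    (State.COMMIT, Event.FORM_TOKEN): State.FORM,         # forward the Form token
    (State.FORM, Event.FORM_TOKEN): State.RECOVER,        # token returned with full data
}


def next_state(state, event, is_representative=False):
    """Return the next state, or None if this sketch does not cover the pair."""
    if state is State.OPERATIONAL and event is Event.FOREIGN_MESSAGE:
        # Only a representative reacts; other processors ignore foreign messages.
        return State.GATHER if is_representative else State.OPERATIONAL
    if state is State.OPERATIONAL and event is Event.FORM_TOKEN and not is_representative:
        return State.FORM
    return TRANSITIONS.get((state, event))
```

The return from Recover to Operational is omitted because it is triggered by a condition (extended virtual synchrony has been achieved) rather than by one of the five events.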

5.1.3 Formation of a New Ring

We first explain the membership algorithm without considering the effects of further processor failure or token loss during operation of the algorithm. Those effects are examined in Section 5.1.5.

The membership algorithm is invoked when a token loss is detected or when a foreign message is received by a processor on the ring. A processor that starts or restarts first forms a singleton configuration containing only itself and then broadcasts an Attempt Join message. A non-representative in the Operational state ignores foreign messages.

In the Operational state the ordering of messages proceeds according to the basic total ordering algorithm of Totem. When a foreign message is received by a representative, the representative initiates an Attempt Join message, advertising its intention to form a bigger ring. It then shifts to the Gather state.

The Gather state allows time for the representative to collect together as many representatives as possible to form a new ring. The representative remains in the Gather state until the Gather timeout expires. It then broadcasts a Join message, containing the identifiers of the representatives it has collected in the Gather state, and shifts to the Commit state.

In the Commit state the representatives reach consensus about the set of representatives that will participate in the formation of the new ring. Each time that a representative broadcasts a Join message, it is committed to the set of representatives and the set of failed representatives in that message. If another representative broadcasts a Join message that contains at least one representative in either of those sets that is not in the corresponding sets of the first Join message, then the first Join message is effectively cancelled. (This is essential because the second representative will never agree on the sets of representatives in the first Join message.)
The first representative will then broadcast, and commit to, a new Join message that contains the union of the sets of representatives and the union of the sets of failed representatives in the two Join messages.

A consensus is reached when there exists a set of representatives and a set of failed representatives, enumerated in a Join message, such that each of the non-failed representatives has broadcast a Join message with those two sets of representatives. When such a representative receives all those Join messages, it knows that consensus has been reached. If the Commit timeout expires, no consensus has been reached. In this case the representative inserts into the set of failed representatives all representatives from which it has not received the required Join message. The representative then broadcasts a revised Join message, restarts the Commit timeout, and tries again to form a new ring.

The representative for the proposed new ring, chosen deterministically from among the representatives when consensus is reached, generates a Form token. The Form token circulates through all members of the proposed new ring along a cycle determined by the increasing order of identifiers of the representatives (see Figure 2). Every member of the proposed new ring shifts to the Form state as it forwards the Form token. This includes the non-representatives, which shift from the Operational state to the Form state, as shown by the dashed line in Figure 1. Having entered the Form state, a processor ignores new messages broadcast on the old ring and consumes the regular token for the old ring if it subsequently receives it.

Figure 1: The state diagram for the membership algorithm. (States shown: Operational, Gather, Commit, Form, Recover.)

After one rotation of the Form token, the representative of the new ring knows all of the information needed for the new ring and shifts to the Recover state. After two rotations of the Form token, all other processors on the proposed new ring have this information and have also shifted to the Recover state. This information includes the new ring identifier, which consists of the identifier of the representative of the new ring and high_seq, as well as the my_aru and old ring identifier of any processor that is a member of the proposed new ring. From this information each processor can calculate an interim configuration and a low my_aru for the processors in that interim configuration.

On receiving the Form token after its second rotation, the representative of the new ring consumes the Form token and transmits the regular token for the new ring in its place. At this point the new ring is formed but not yet installed, and the operation needed to achieve extended virtual synchrony begins.

5.1.4 Achieving Extended Virtual Synchrony

The objective of extended virtual synchrony is to provide strong guarantees about message delivery to the application, even in the presence of failures. We must guarantee that

• All messages that are delivered to the application are delivered in causal order.

• The messages delivered in agreed order in configuration C include all messages originated by processors in C up through the message delivered with the largest sequence number.
• The messages delivered in safe order in configuration C are delivered to the application in every processor of C unless that processor fails.
• All messages originated by a processor are delivered to the application in that processor unless that processor fails.

These guarantees allow an application, such as a fault-tolerant distributed database, to operate correctly despite configuration changes.

The extended virtual synchrony algorithm consists of six steps:

1. The processors of the new ring that were members of the same old ring exchange messages to ensure that they all have the same set of messages broadcast on the old ring but not yet delivered.
2. Each processor delivers to the application those messages that can be delivered in agreed or safe order on the old ring.
3. The processor delivers the first Configuration Change message, which enumerates the processors in a smaller interim configuration consisting of the processors on both the old and new rings.
4. The processor delivers to the application further messages that could not be delivered in agreed or safe order on the old ring but that can be delivered in agreed or safe order in the smaller interim configuration.
5. The processor delivers a second Configuration Change message, which enumerates the processors on the new ring.
6. The processor then shifts to the Operational state, and broadcasts and delivers messages for the new ring.

Figure 2: Determination of the cycle along which the token will circulate for the proposed new ring. The representatives of the old rings are shown shaded.

To implement the first step of the extended virtual synchrony algorithm, each processor that is a member of both an old ring and the new ring determines the lowest my_aru of any processor from its old ring that is also a member of the new ring. It then broadcasts on the new ring every message for the old ring that it has received and that has a sequence number greater than the lowest my_aru. This ensures that each processor receives as many messages from its old ring as possible.

Each such message is broadcast with the old ring identifier and an old ring sequence number, as well as with a new ring identifier and a new ring sequence number. The new ring sequence numbers are used to ensure that messages are received, and the old ring sequence numbers are used to order messages as messages of the old ring. Messages from an old ring are not delivered to the application by any processor that was not a member of the old ring.

Completion of this message exchange is determined by the recovery flag flg in the token, which is initialized to false by the representative of the new ring, and by a variable my_install_seq in each processor. A processor changes flg from false to true if it has more messages to retransmit when it releases the token. A processor changes flg from true to false if it set flg to true and now has no further messages to retransmit.

When it receives the token on two successive rotations with flg set to false, a processor sets my_install_seq to seq.
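
The first recovery step and the flg bookkeeping described above can be sketched as follows. This is an illustration under assumptions, not the paper's code: members are represented as `(processor_id, old_ring_id, my_aru)` tuples, and the names `messages_to_rebroadcast`, `RecoveryFlagTracker` and `have_more` are hypothetical.

```python
def messages_to_rebroadcast(received, my_old_ring, memb_list):
    """Old-ring sequence numbers this processor should rebroadcast on the new ring.

    received  -- dict mapping old-ring sequence number -> message held locally
    memb_list -- iterable of (processor_id, old_ring_id, my_aru) for new-ring members
    """
    # Lowest my_aru among new-ring members that come from the same old ring.
    lowest = min(aru for _, ring, aru in memb_list if ring == my_old_ring)
    return sorted(seq for seq in received if seq > lowest)


class RecoveryFlagTracker:
    """Tracks the recovery flag and my_install_seq for one processor."""

    def __init__(self):
        self.set_by_me = False          # True while this processor holds flg at true
        self.false_arrivals = 0         # successive token arrivals with flg false
        self.my_install_seq = None

    def on_token(self, flg, seq, have_more):
        """Process a token arrival; return the flg value to release the token with."""
        self.false_arrivals = 0 if flg else self.false_arrivals + 1
        if self.false_arrivals >= 2 and self.my_install_seq is None:
            self.my_install_seq = seq   # two successive rotations with flg false
        if not flg and have_more:
            flg, self.set_by_me = True, True
        elif flg and self.set_by_me and not have_more:
            flg, self.set_by_me = False, False
        return flg
```

Only the processor that raised the flag lowers it again, so the flag stays true for as long as any processor still has old-ring messages to retransmit.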
When my_aru is at least equal to my_install_seq, the processor has received all of the messages of the old ring that have been broadcast on the new ring.

If any of the messages to be delivered in the interim configuration require safe delivery, then the processor must not deliver the first Configuration Change message until it knows that all processors on the ring have determined a value for my_install_seq and have received all messages with lower sequence numbers. This is guaranteed when it has seen the aru field at least equal to my_install_seq in two successive token rotations and the flg field set to false three times.

The processor then moves to the second step. It sorts the messages for the old ring that were broadcast on the new ring into the order of their sequence numbers on the old ring, and delivers messages in order until it encounters a gap in the sorted sequence or a message requiring safe delivery that is not safe on the old ring. It then delivers the first of the two Configuration Change messages, which enumerates the processors in the interim configuration.

Following the first Configuration Change message, the processor delivers in order all remaining messages that were originated on the old ring by processors in the interim configuration or that need to be delivered to satisfy safe delivery. The processor then delivers the second Configuration Change message, which enumerates the processors on the new ring, and shifts to the Operational state.

Note that a processor delivers all messages for the old ring before it broadcasts or delivers any message for the new ring. The decision to shift to the Operational state and the set of old ring messages to be delivered is a local decision. Some processors may be in the Operational state broadcasting messages for the new ring, while others are still in the Recover state obtaining retransmissions of messages for the old ring. Note, however, that no safe message can be delivered in the new ring before all of the processors on the new ring install the new ring.

5.1.5 Token Loss and Processor Failure

We do not distinguish between processor failure and token loss because a failed processor cannot forward the token to the next processor on the ring and thus the consequence of processor failure is token loss.
If the token reaches a processor that has determined that the token is lost, the processor consumes the token.

The simplest token loss event occurs in the Operational state. On expiration of the token loss timeout, a processor regards itself as a representative, representing only itself but retaining its existing ring_id, and proceeds to the Gather state.
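As a minimal sketch of the Operational-state case just described, the following Python fragment models the timeout transition and the consumption of a late token. The class layout and the names `represented`, `token_lost`, `on_token_loss_timeout` and `should_forward` are hypothetical; the paper does not give this interface, and only the Operational and Gather states are modeled.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Set


class State(Enum):
    OPERATIONAL = auto()
    GATHER = auto()


@dataclass
class Processor:
    my_id: str
    ring_id: int
    represented: Set[str]            # processors this one speaks for (hypothetical field)
    state: State = State.OPERATIONAL
    token_lost: bool = False

    def on_token_loss_timeout(self) -> None:
        """Token loss timeout expired while Operational."""
        if self.state is not State.OPERATIONAL:
            return  # other states are not modeled in this sketch
        self.token_lost = True
        self.represented = {self.my_id}  # representative of itself only
        self.state = State.GATHER        # ring_id is deliberately left unchanged

    def should_forward(self) -> bool:
        """A late token is consumed (not forwarded) once loss has been declared."""
        return not self.token_lost
```

Keeping ring_id unchanged in the transition reflects the statement that the processor retains its existing ring identifier when it enters the Gather state.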

Token loss can also occur to a representative in the Gather or Commit states. If, in either of these states, the token of the representative's existing ring is lost, the representative continues the operation of the membership algorithm, retaining its existing ring identifier but representing only itself.

Loss of the Form token or of the regular token for the new ring can also occur in the Form or Recover states. In these states the old ring is no longer operational and the new ring has not yet been installed, and the processor returns to the Gather state. To ensure termination, the algorithm guarantees, by placing a processor in the set of failed processors, that the next membership on which consensus is reached in the Commit state is smaller than the previous membership. An attempt is made to exclude the processor most likely to have lost the token.

If the token is lost during the Recover state, some processors may have installed the new ring while others have not. During the subsequent recovery the processors that installed the new ring must regard that ring as their old ring. Processors that have not installed the new ring must regard the old ring as their old ring. Thus, each processor must preserve its old ring identifier until it is able to install a new ring.

When a processor delivers a message as safe in an interim configuration, it guarantees that every other member of the interim configuration will also deliver that message unless that other processor fails. If the token is lost in the Recover state, some processors may not install the new configuration and thus will not deliver messages in the interim configuration. These processors will proceed in due course to form another new ring with an associated interim configuration. There may be processors in the original interim configuration that are not in this subsequent interim configuration.
Because of the safe delivery requirement, messages from those processors must still be delivered in this subsequent interim configuration.

Consequently, for safe delivery, if a processor has lost the token late in the Recover state, then it must retain messages that it would have delivered in the first interim configuration because some other processor may have already installed the new ring. The processor must ensure that when it is able to install a new ring such messages are delivered in the interim configuration between the old ring and the installed new ring. The proof of correctness of the membership algorithm has been omitted.

6 Flow Control

A basic characteristic of reliable broadcast and multicast protocols is that the rate of broadcasting messages cannot exceed the rate at which the slowest processor can receive messages. At faster rates of broadcasting, the input buffer of the slowest receiver will become full and messages will be lost. Retransmission of these messages will result in an increase in message traffic and a reduction in the useful transmission rate.

Because modern transmission media, such as FDDI, are capable of very high transmission rates and because many processors may contribute to the transmission rate, a broadcast protocol without an effective flow-control mechanism will easily overwhelm the slowest receiver. The flow-control mechanism should allow high transmission rates when all processors are able to process messages at that rate. However, to prevent buffers from overflowing and messages from being lost, the mechanism must dynamically restrict the flow of messages to the rate at which messages can be processed.

For point-to-point protocols, positive acknowledgment mechanisms such as the sliding-window mechanism have been refined to provide excellent flow control. However, positive acknowledgment protocols are unattractive when messages are broadcast because of the excessive number of acknowledgments. Rate-controlled protocols have attracted much attention recently, particularly for applications that require high transmission rates. The disadvantage of rate control for broadcast protocols is that the aggregate rate of all transmitters must be controlled. For applications with bursty communication patterns, the maximum transmission rate must be set to a low value to guarantee that all processors receive messages at or below that rate.

The response time of the flow-control mechanism is determined by the size of the message buffer in each processor. Even if all processing stops, a processor can continue to receive messages and thus other processors can continue to broadcast messages until that processor's buffer becomes full. To ensure that buffer overflow does not occur, during the interval between the time at which a processor's buffer becomes empty and the next flow-control message (the token, in the Totem protocol) transmitted by that processor, the other processors must not broadcast more messages than that processor's buffer can contain. Consequently, each processor must transmit flow-control information at least once during any sequence of messages that is sufficient to fill its buffer.

This requirement is met by the Totem protocol.
As each processor receives the token, it must empty its input buffer before passing on the token. If a processor is unable to process messages at the rate at which they are broadcast, many messages may still be in its input buffer when it receives the token. This processor must not transmit the token until it has processed those messages and emptied its input buffer. Thus, the rate of broadcasting is reduced to the rate at which messages can be processed by the slowest processor.

Our flow-control algorithm depends on the fcc field of the token, which keeps a count of the number of messages broadcast by all processors on the previous rotation of the token, and on the window_size, the maximum number of messages that all processors are allowed to broadcast in any token rotation. The number of messages that a processor is allowed to broadcast in this token rotation is bounded by the difference window_size - fcc and by the maximum number of messages it is allowed to broadcast in any token rotation.

On receipt of the token, a processor first processes all messages in its input buffer and broadcasts messages in its output buffer, up to the maximum number permitted by the flow-control mechanism. The processor then sets fcc to fcc plus the number of messages it broadcast during this token visit minus the number of messages it broadcast during the previous token visit. The processor then transmits the token to the next processor on the ring.

Assuming that the window size is no larger than the buffer size for any of the processors, it is easy to demonstrate that this protocol guarantees that no input buffer can overflow. Experience with the Totem protocol and other broadcast and multicast protocols has shown that effective flow control, capable of preventing message loss due to buffer overflow even at high message rates, is essential to the attainment of high throughput. Existing broadcast protocols must be throttled at relatively low rates of broadcasting to avoid high rates of message loss, particularly when the traffic is bursty. The circulating token flow-control mechanism of the Totem protocol has been demonstrated to ensure negligible rates of message loss even for very high rates of broadcasting.

7 Implementation and Performance

The Totem single-ring total ordering and membership protocol has been implemented as 1500 lines of C code on a network of Sun 4/IPC workstations connected by an Ethernet. The implementation uses the standard UDP broadcast interface within the Unix operating system (SunOS 4.1.1). One UDP socket is used for all broadcast messages, and a separate UDP socket is used by each processor to receive the token from its predecessor on the ring. The version of the code currently operating detects token loss and reforms the ring. Failed processors are removed from the ring, new and restarted processors can join the ring, and rings can partition and remerge.

[Figure 3 plot: five Sun 4/IPC processors, Ethernet. Horizontal axis: message size in bytes (0 to 1400). Vertical axes: utilization of the medium (0.0 to 0.8) and messages ordered per second (500 to 1300).]
Figure 3: The number of ordered broadcast messages per second and the utilization of the Ethernet for various message sizes for a network of five Sun 4/IPC processors running the Totem protocol.

Measurements from this first implementation show excellent performance.
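Before turning to the measurements, the fcc bookkeeping of Section 6 can be sketched in a few lines of Python. This follows the rule as worded in the text (bound of window_size - fcc, then fcc updated by this visit's count minus the previous visit's count). The names `FlowToken`, `Sender`, `max_per_visit`, `backlog` and `sent_last_visit` are assumptions made for illustration and are not taken from the Totem implementation.

```python
from dataclasses import dataclass


@dataclass
class FlowToken:
    fcc: int = 0  # messages broadcast by all processors on the previous rotation


@dataclass
class Sender:
    max_per_visit: int        # per-processor cap for any token rotation
    backlog: int = 0          # messages waiting in the output buffer (assumed counter)
    sent_last_visit: int = 0  # what this processor broadcast on its previous visit

    def on_token(self, token: FlowToken, window_size: int) -> int:
        # Bounded by window_size - fcc and by the per-processor maximum.
        allowed = max(0, min(window_size - token.fcc, self.max_per_visit))
        sent = min(self.backlog, allowed)
        self.backlog -= sent
        # fcc := fcc + (sent this visit) - (sent on the previous visit)
        token.fcc += sent - self.sent_last_visit
        self.sent_last_visit = sent
        return sent
```

With this update rule, fcc always equals the sum of every processor's most recent per-visit count, so it never exceeds window_size as long as each processor respects the bound.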
Figure 3 shows the number of messages ordered per second for messages of various sizes. These measurements were made on a network of five Sun 4/IPC workstations with the window size set to the maximum value for which message loss is negligible. Note that with 1000 byte messages, 830 messages were ordered per second. Reducing the message size to 600 bytes resulted in over 1000 ordered messages per second. The highest prior rates for asynchronous fault-tolerant ordered broadcast messages of which we are aware are about 250 to 300 messages per second for the Transis protocol.

A concern about token-passing protocols, such as Totem, is that the token-passing overhead reduces the data transmission rate available for messages. Figure 3 shows the useful utilization (excluding transmission of the token) of the Ethernet achieved by the Totem protocol. With large messages, a utilization of about 70% is achieved, which may be compared to the approximately 65% utilization that can be achieved by TCP transmitting messages point-to-point with the same equipment.

While Figure 3 depicts the maximum data transmission rates measured using Totem over the Ethernet, Figure 4 considers performance at lower, more typical loads with Poisson arrivals of messages and 1000 byte messages. It shows the token rotation time and the latency from generation of a message until it is delivered in agreed order by all processors on the ring. At low loads (e.g. 400 ordered messages per second, which is much more than the maximum throughput for prior protocols), the latency achieved by Totem is under 10 milliseconds. Even at 50% useful utilization of the Ethernet (625 messages per second), the latency is still only about 13 milliseconds. Note that the latency is close to half the token rotation time except at very high loads where queueing delays dominate.

[Figure 4 plot: five Sun 4/IPC processors, Ethernet, 1000 byte messages. Horizontal axis: messages ordered per second (0 to 1000). Vertical axis: milliseconds (0 to 80). Curves: token rotation, latency to agreed delivery, latency to safe delivery.]
Figure 4: The mean token rotation time and the mean latency from generation of a message until it is delivered in agreed order by all processors on the ring as a function of load.

8 Conclusion

The single-ring total ordering protocol of Totem provides fault-tolerant agreed and safe delivery of messages within a broadcast domain.
The use of a token circulating around a logical ring provides excellent flow control and permits substantially higher throughput than existing total ordering protocols. There is no need to provide a weaker message ordering service, such as partially ordered causal delivery, because totally ordered agreed delivery can be provided at no extra cost.

The membership algorithm of Totem handles processor failure and recovery and also network partitioning and remerging. The concept of virtual synchrony has been extended to ensure consistent actions by processors that fail and are repaired and in networks that partition and remerge.

Continuing work on Totem is exploiting the single-ring protocol to provide more general services. Agreed and safe delivery services will be provided to process groups, as will membership services, across rings interconnected by gateways. It remains to be investigated whether the exceptional performance of the Totem single-ring protocol can be sustained in wide-area networks.

Acknowledgments

This work was supported by NSF grant NCR-9016361. We acknowledge the contribution of Danny Dolev, Shlomo Kramer and Dalia Malki who, with Yair Amir, developed the membership protocol for Transis from which our membership protocol is derived.

References

[1] D. A. Agarwal, P. M. Melliar-Smith, and L. E. Moser. Totem: A protocol for message ordering in a wide-area network. In Proceedings of the First ISMM International Conference on Computer Communications and Networks, pages 1-5, San Diego, CA, June 1992.
[2] Y. Amir, D. Dolev, S. Kramer, and D. Malki. Transis: a communication sub-system for high availability. In Proceedings of the 22nd Annual International Symposium on Fault-Tolerant Computing, pages 76-84, Boston, MA, July 1992.
[3] Y. Amir, D. Dolev, S. Kramer, and D. Malki. Membership algorithms in broadcast domains. In Proceedings of the 6th International Workshop on Distributed Algorithms, Lecture Notes in Computer Science 647, pages 292-312, Haifa, Israel, November 1992.
[4] K. P. Birman and T. A. Joseph. Reliable communication in the presence of failures. ACM Transactions on Computer Systems, 5(1):47-76, February 1987.
[5] K. P. Birman, A. Schiper, and P. Stephenson. Lightweight causal and atomic group multicast. ACM Transactions on Computer Systems, 9(3):272-314, August 1991.
[6] J. M. Chang and N. F. Maxemchuk. Reliable broadcast protocols. ACM Transactions on Computer Systems, 2(3):251-273, August 1984.
[7] F. Cristian. Reaching agreement on processor group membership in synchronous distributed systems. Distributed Computing, 4(4):175-187, April 1987.
[8] F. Cristian, H. Aghili, R. Strong, and D. Dolev. Atomic broadcast: From simple message diffusion to byzantine agreement. In Proceedings of the IEEE Symposium on Fault Tolerant Computing, pages 200-206, Ann Arbor, Michigan, June 1985.
[9] H. Kopetz, A. Damm, C. Koza, M. Mulazzani, W. Schwabl, C. Senft, and R. Zainlinger. Distributed fault-tolerant real-time systems: The Mars approach. IEEE Micro, pages 25-40, February 1989.
[10] H. Kopetz, G. Grünsteidl, and J. Reisinger. Fault-tolerant membership service in a synchronous distributed real-time system. In Proceedings of the International Working Conference on Dependable Computing for Critical Applications, pages 167-174, August 1989.
[11] L. Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, pages 558-565, July 1978.
[12] P. M. Melliar-Smith, L. E. Moser, and D. A. Agarwal. Ring-based ordering protocols. In Proceedings of the International Conference on Information Engineering, pages 882-891, Singapore, December 2-5 1991.
[13] P. M. Melliar-Smith, L. E. Moser, and V. Agrawala. Broadcast protocols for distributed systems. IEEE Transactions on Parallel and Distributed Systems, 1(1):17-25, 1990.
[14] S. Mishra, L. L. Peterson, and R. D. Schlichting. A membership protocol based on partial order. In Proceedings of the International Working Conference on Dependable Computing for Critical Applications, Dependable Computing and Fault-Tolerant Systems 6, Springer-Verlag, pages 309-331, Tucson, AZ, February 1991.
[15] L. E. Moser, P. M. Melliar-Smith, and V. Agrawala. Membership algorithms for asynchronous distributed systems. In Proceedings of the IEEE 11th International Conference on Distributed Computing Systems, pages 480-488, Arlington, TX, May 1991.
[16] L. E. Moser, P. M. Melliar-Smith, and V. Agrawala. Asynchronous fault-tolerant total ordering algorithms. To appear in SIAM Journal of Computing.
[17] L. L. Peterson, N. C. Buchholz, and R. D. Schlichting. Preserving and using context information in interprocess communication. ACM Transactions on Computer Systems, 7(3):217-246, August 1989.
[18] B. Rajagopalan and P. K. McKinley. A token-based protocol for reliable, ordered multicast communication. In Proceedings of the 8th IEEE Symposium on Reliable Distributed Systems, pages 84-93, Seattle, WA, October 1989.