Clock Synchronization with Faults and Recoveries

(Extended Abstract)

Boaz Barak*
The Weizmann Institute of Science
[email protected]

Shai Halevi
IBM Watson Research Center
[email protected]

Amir Herzberg
IBM Haifa Research Lab
[email protected]

Dalit Naor*
IBM Almaden Research Center
[email protected]

*Work done while at IBM Haifa.

ABSTRACT

We present a convergence-function based clock synchronization algorithm, which is simple, efficient and fault-tolerant. The algorithm is tolerant of failures and allows recoveries, as long as less than a third of the processors are faulty `at the same time'. Arbitrary (Byzantine) faults are tolerated, without requiring awareness of failure or recovery. In contrast, previous clock synchronization algorithms limited the total number of faults throughout the execution, which is not realistic, or assumed fault detection.

The use of our algorithm ensures secure and reliable time services, a requirement of many distributed systems and algorithms. In particular, secure time is a fundamental assumption of proactive secure mechanisms, which are also designed to allow recovery from (arbitrary) faults. Therefore, our work is crucial to realize these mechanisms securely.

Keywords
Clock synchronization, Mobile adversary, Proactive systems

1. INTRODUCTION

Accurate and synchronized clocks are extremely useful for coordinating activities between cooperating processors, and are therefore essential to many distributed algorithms and systems. Although computers usually contain some hardware-based clock, most of these are imprecise and have substantial drift, as highly precise clocks are expensive and cumbersome. Furthermore, even hardware-based clocks are prone to faults and/or malicious resetting. Hence, a clock synchronization algorithm is needed that lets processors adjust their clocks to overcome the effects of drifts and failures. Such an algorithm maintains in each processor a logical clock, based on the local physical clock and on messages exchanged with other processors. The algorithm must deal with communication delay uncertainties, clock imprecision and drift, as well as link and processor faults.

In many systems, the main need for clock synchronization is to deal with faults rather than with drift, as drift rates are quite small (some works, e.g. [11, 12], actually ignore drifts). It should be noted, however, that clock synchronization is an on-going task that never terminates, so it is not realistic to limit the total number of faults during the system's lifetime. The contribution of this work is the ability to tolerate an unbounded number of faults during the execution, as long as `not too many processors are faulty at once'. This is done by allowing processors which are no longer faulty to synchronize their clocks with those of the operational processors.

Our protocol withstands arbitrary (or Byzantine) faults, where affected processors may deviate from their specified algorithm in an arbitrary manner, potentially cooperating maliciously to disrupt the goal of the algorithm or system. It is obviously critical to tolerate such faults if the system is to be secure against attackers, i.e. for the design of secure systems. Indeed, many secure systems assume the use of synchronized clocks, and while usually the effect of drifts can be ignored, this assumption may become a weak spot exploited by an attacker who maliciously changes clocks. Therefore, solutions frequently try not to rely on synchronized clocks, e.g. using freshness instead in authentication protocols (e.g. Kerberos [22]). However, this is not always achievable, as synchronized clocks are often essential for efficiency or functionality.

In fact, some security tasks require securely synchronized clocks by their very definition, for example time-stamping [14] and e-commerce applications such as payments and bids with expiration dates. Therefore, secure time services are an integral part of secure systems such as DCE [25], and there is on-going work in the IETF to standardize a secure version of the Internet's Network Time Protocol [28]. (Note that existing `secure time' protocols simply authenticate clock synchronization messages, and it is easy to see that they may not withstand a malicious attack, even if the authentication is secure.)

The original motivation for this work came from the need to implement secure clock synchronization for a proactive security toolkit [1]: proactive security allows arbitrary faults in any processor, as long as no more than f processors are faulty during any (fixed-length) period. Namely, proactive security makes use of processors which were faulty and later recovered. It is important to notice that in some settings it may be possible for a malicious attacker to avoid detection, so a solution is needed that works even when there is no indication that a processor failed or recovered. To achieve that, algorithms for proactive security periodically perform some `corrective/maintenance' action. For example, they may replace secret keys which may have been exposed to the attacker. Clearly, the security and reliability of such periodical protocols depend on securely synchronized clocks, to ensure that the maintenance protocols are indeed performed periodically. There is a substantial amount of research on proactive security, including basic services such as agreement [24], secret sharing [23, 17], signatures (e.g. [16]) and pseudo-randomness [4, 5]; see the survey in [3]. However, all of the results so far assumed that clocks are synchronized. Our work therefore provides a missing foundation to these and future proactive security works.

1.1 Relations to prior work

There is a very large body of research on clock synchronization, much of it focusing on fault-tolerance. Below we focus on the most relevant works.

A number of works focus on handling processor faults, but ignore drifts. Dolev and Welch [11, 12] analyzed clock synchronization under a hybrid faults model, with recovery from an arbitrary initial state of all processors (self stabilization), as well as napping (stop) failures in any number of processors in [11], or Byzantine faults in up to a third of the processors in [12]. Both works assume a synchronous model and synchronize logical clocks: the goal is that all clocks show the same number at each pulse. Our results are not directly comparable, since it is not clear if our algorithm is self stabilizing. (In our analysis we assume that the system is initialized correctly.) On the other hand, we allow Byzantine faults in a third of the processors during any period, work in an asynchronous setting, allow drift, and synchronize to real time.

The model of time-adaptive self stabilization suggested by Kutten and Patt-Shamir [18] is closer to ours; there, the goal is to recover from arbitrary faults at f processors in time which is a function of f. We notice this is a weaker model than ours in the sense that it assumes periods of no faults. A time-adaptive, self-stabilizing clock synchronization protocol, under an asynchronous model, was presented by Herman [15]. This protocol is not comparable to ours as it does not allow drifts and does not synchronize to real time.

Among the works dealing with both processor faults and drifts, most assume that once a processor has failed it never recovers, and that there is a bound f on the number of failed processors throughout the lifetime of the system. Many such works are based on local convergence functions. An early overview of this approach can be found in Schneider's report [26]. A very partial list of results along this line includes [13, 7, 8, 9, 21, 2, 20]. The Network Time Protocol, designed by Mills [21], allows recoveries, but without analysis and proof. Furthermore, while authenticated versions of [21] were proposed, so far these do not attempt recovery from malicious faults.

Our algorithm uses a convergence function similar to that of Fetzer and Cristian [9] (which, in turn, is a refinement of that of Welch and Lynch [20]). However, it seems that one of the design goals of the solution in [9] is incompatible with processor recoveries. Specifically, [9] tries to minimize the change made to the clocks in each synchronization operation. Using such a small correction may delay the recovery of a processor with a clock very far from the correct one (with [9] such recovery may never complete). This problem accounts for the difference between our convergence function and the one in [9]. In the choice between a small maximum correction value and a fast recovery time, we chose the latter.

Another aspect in which [9] is optimal is the maximum logical drift (see Definition 3 in Section 2.3). In their solution the logical drift equals the hardware drift, whereas in our solution there is an additive factor of O(2^−K), where K is the number of synchronization operations performed in every time period. (Roughly, we assume that less than a third of the processors are faulty in each time period, and require that several synchronization operations take place in each such period.) As our model approaches that of [9] (i.e., as the length of the time period approaches infinity), this added factor to the logical drift approaches zero. We conjecture that `optimal' logical drift cannot be achieved in the mobile faults model.

Another difference between our algorithm and several traditional convergence-function based clock synchronization algorithms is that many such solutions proceed in rounds, where correct processors approximately agree on the time when a round starts. At the end of each round, each processor reads all the clocks and adjusts its own clock accordingly. In contrast to this, our protocol (and also NTP [21]) does not proceed in rounds. We believe that implementing round synchronization across a large network such as the Internet could be difficult.

A previous work to address faults and recoveries for clock synchronization is due to Dolev, Halpern, Simons and Strong [10]. In that work it is assumed that faults are detected. In practice, faults are often undetected, especially malicious faults, where the attacker makes every effort to avoid detection of the attack. Handling undetected faults and recoveries is critical for (proactive) security, and is not trivial, as a recovering processor may have its clock set to a value `just a bit' outside the permitted range. The solution in [10] relies on signatures rather than authenticated links, and therefore also limits the power of the attacker by assuming it cannot collect too many `bad' signatures (assumption A4 in [10]).

The algorithm of Dolev et al. [10] is based on broadcast, and requires that all processors sign and forward messages from all other processors. This has several practical disadvantages compared to local convergence-function based algorithms such as the one in the present paper. Some of these disadvantages, which mostly result from its `global' nature, are discussed by Fetzer and Cristian in [13]. Additional practical disadvantages of broadcast-based algorithms include sensitivity to transient delays, inability to take advantage of realistic knowledge regarding delays, and the overhead and delay resulting from depending on broadcasts reaching the entire network (e.g. the Internet). On the other hand, being a broadcast-based algorithm, Dolev et al. [10] require only a majority of the processors to be correct (we need two thirds). Also, [10] only requires that the subnet of non-faulty processors be connected, rather than demanding a direct link between any two processors. (But implementing the broadcast used by [10] has substantial overhead and requires two thirds of the processors to be correct and connected.)

1.2 Informal Statement of the Requirements

A clock synchronization algorithm which handles faults and recoveries should satisfy:

Synchronization: Guarantee that at all times, the clock values of the non-faulty processors are close to each other. (Note that a trivial solution of setting all local clocks to a constant value achieves the synchronization goal; the accuracy requirement below prevents this.)

Accuracy: Guarantee that the clock rates of non-faulty processors are close to that of the real-time clock. One reason for this requirement is that in practice, the set of processors is not an island and will sometimes need to communicate and coordinate with processors from the "outside world".

Recovery: Guarantee that once a processor is no longer faulty, this processor recovers the correct clock value and rejoins the "good processors" within a fixed amount of time.

We present a formalization of this model and goals, and a simple algorithm which satisfies these requirements. We analyze our algorithm in a model where an attacker can temporarily corrupt processors, but not communication links. It may be possible to refine our analysis to show that the same algorithm can be used even if an attacker can corrupt both processors and links, as long as not too many of either are corrupted "at the same time".

2. FORMAL MODEL

2.1 Network and Clocks

Our network model is a fully connected communication graph of n processors, where each processor has its own local clock. We denote the processor names by 1, 2, ..., n and assume that each processor knows its name and the names of all its neighbors.

In addition to the processors, the network model also contains an adversary, who may occasionally corrupt processors in the network for a limited time. Throughout the discussion below we assume some bound ρ on the clock drift between "good processors", and a bound δ on the time that it takes to send a message between two good processors. We refer to ρ as the drift bound and to δ as the message delivery bound.

We envision the network in an environment with real time. A convenient way of thinking about the real time is as just another clock, which also ticks more or less at the same rate as the processors' clocks. For the purpose of analysis, it is convenient to view the local clock of a processor p as consisting of two components. One is an unresettable hardware clock Hp, and the other is an adjustment variable adjp, which can be reset by the processor. The clock value of p at real time τ, denoted Cp(τ), is the sum of its hardware clock and adjustment factor at this time, Cp(τ) = Hp(τ) + adjp (these are the same notations as in [26]). (In general adjp does not have to be a discrete variable, and it could also depend on τ; we do not use that generality in this paper.) We stress that Hp and adjp are merely a mathematical convenience, and that the processors (and adversary) do not really have access to their values. Formally, the only operations that processor p can perform on Hp and adjp are reading the value Hp(τ) + adjp and adding an arbitrary factor to adjp. Other than these changes, the value of Hp changes continuously with τ (and the value of adjp remains fixed).

Definition 1 (Clocks). The hardware clock of a processor p is a smooth, monotonically increasing function, denoted Hp(τ). The adjustment factor of p is a discrete function adjp(τ) (which only changes when p adds a value to its adjustment variable). The local clock of p is defined as

    Cp(τ) := Hp(τ) + adjp(τ)    (1)

We assume an upper bound ρ on the drift rate between processors' hardware clocks and the real time. Namely, for any τ1 < τ2, and for every processor p in the network, it holds that

    (τ2 − τ1)/(1 + ρ) ≤ Hp(τ2) − Hp(τ1) ≤ (τ2 − τ1)·(1 + ρ)    (2)

We note that in practice, ρ is usually fairly small (on the order of 10^−6).
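To make the clock model of Definition 1 concrete, here is a minimal Python sketch (ours, purely illustrative; the class and method names are not from the paper). It models a local clock as an unresettable hardware clock plus an adjustment variable, and exposes only the two operations allowed above: reading Hp(τ) + adjp and adding an arbitrary factor to adjp.

    import time

    class LocalClock:
        """Illustrative model of Definition 1: C_p(tau) = H_p(tau) + adj_p."""

        def __init__(self, hardware_clock=time.monotonic):
            # The hardware clock is monotonically increasing and cannot be reset.
            self._hardware = hardware_clock
            # The adjustment variable: the only part the processor may write.
            self._adj = 0.0

        def read(self):
            # The only read operation: H_p(tau) + adj_p.
            return self._hardware() + self._adj

        def adjust(self, delta):
            # The only write operation: add an arbitrary factor to adj_p.
            self._adj += delta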

2.2 Adversary Model

As we said above, our network model comes with an adversary, who can occasionally break into a processor, resetting its local clock to an arbitrary value. After a while, the adversary may choose to leave that processor, and then we would like this processor to recover its clock value.

We envision an adversary who can see (but not modify) all the communication in the network, and can also break into processors and leave them at will. When breaking into a processor p, the adversary learns the current internal state of that processor. Furthermore, from this point and until it leaves p, the adversary may send messages for p, and may also modify the internal state of p, including its adjustment variable adjp. Once the adversary leaves a processor p, it has no more access to p's internal state. We say that p is faulty (or controlled by the adversary) if the adversary broke into p and did not leave it yet.

We assume reliable and authenticated communication between processors p and q that are not faulty. More precisely, let δ denote the message delivery bound. Then for any processors p and q not faulty during [τ, τ + δ], if p sends a message to q at time τ, then q receives exactly the same message from p during [τ, τ + δ]. Furthermore, if a non-faulty processor q receives a message from processor p at time τ, then either p has sent exactly this message to q during [τ − δ, τ], or else it was faulty at some time during this interval. (This formulation of "good links" does not completely rule out replay of old messages. This does not pose a problem for our application, however.)

The power of the adversary in this model is measured by the number of processors that it can control within a time interval of a certain length. This limitation is reasonable because otherwise, even an adversary that can control only one processor at a time can corrupt all the clocks in the system by moving fast enough from processor to processor.

Definition 2 (Limited Adversary). Let Δ > 0 and f ∈ {1, 2, ..., n} be fixed. An adversary is f-limited (with respect to Δ) if during any time interval [τ, τ + Δ], it controls at most f processors.

We refer to Δ as the time period and to f as the number of faulty processors.

Notice that Definition 2 implies in particular that an f-limited adversary who controls f processors and wants to break into another one must leave one of its current processors at least Δ time units before it can break into the new one. In the rest of the paper we assume that n ≥ 3f + 1.
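As a small illustration of Definition 2 (our own sketch, not part of the paper), the following Python function checks, on a grid of window start times, whether a given corruption schedule is f-limited with respect to Δ. The data layout and parameter names are assumptions of ours.

    def is_f_limited(corruptions, f, delta_period, horizon, step):
        """Approximate check of Definition 2 on a grid of window start times.

        corruptions: dict mapping processor -> list of (start, end) intervals
        during which the adversary controls that processor.
        """
        t = 0.0
        while t <= horizon:
            window_lo, window_hi = t, t + delta_period
            # Count processors controlled at some point within this window.
            controlled = sum(
                1 for intervals in corruptions.values()
                if any(s < window_hi and e > window_lo for (s, e) in intervals)
            )
            if controlled > f:
                return False
            t += step
        return True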

2.3 Clock Synchronization Protocols

Intuitively, the purpose of a clock synchronization algorithm is to ensure that processors' local clocks remain close to each other and close to the real time, and that faulty processors become synchronized again quickly after the adversary leaves them. It is clear, however, that no protocol can achieve instantaneous recovery, and we must allow processors some time to recover. Typically we want this recovery time to be no more than Δ, so that by the time the adversary breaks into the new processors, the ones that it left have already recovered.

Definition 3 (Clock Synchronization). Consider a clock synchronization protocol Π that is executed in a network with drift rate ρ and message delivery bound δ, and in the presence of an f-limited adversary with respect to time period Δ.

i. We say that Π ensures synchronization with maximal deviation Ψ if at any time τ and for any two processors p, q not faulty during [τ − Δ, τ], it holds that |Cp(τ) − Cq(τ)| ≤ Ψ.

ii. We say that Π ensures accuracy with maximal drift ρ̃ and maximal discontinuity ψ if, whenever p is not faulty during an interval [τ1 − Δ, τ2] (with τ1 ≤ τ2), it holds that

    Cp(τ2) − Cp(τ1) ≥ (τ2 − τ1)/(1 + ρ̃) − ψ

and

    Cp(τ2) − Cp(τ1) ≤ (τ2 − τ1)·(1 + ρ̃) + ψ    (3)
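The two conditions of Definition 3 can be transcribed directly into checks on a trace of sampled clock values. The following Python sketch is ours and purely illustrative; the trace format is an assumption, and the checks apply only to the sampled times.

    # samples: list of (tau, {processor: clock_value}) for non-faulty processors only.

    def check_synchronization(samples, psi):
        # Condition (i): non-faulty clocks stay within psi of each other.
        return all(max(clocks.values()) - min(clocks.values()) <= psi
                   for _, clocks in samples)

    def check_accuracy(samples, p, rho_tilde, psi):
        # Condition (ii): over any sampled interval, processor p's clock advances
        # at a rate close to real time, up to drift rho_tilde and discontinuity psi.
        for i, (t1, c1) in enumerate(samples):
            for t2, c2 in samples[i:]:
                gain = c2[p] - c1[p]
                if not ((t2 - t1) / (1 + rho_tilde) - psi <= gain
                        <= (t2 - t1) * (1 + rho_tilde) + psi):
                    return False
        return True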

3. A CLOCK SYNCHRONIZATION PROTOCOL

As in most (practical) clock synchronization protocols, the most basic operation in our protocol is the estimation by a processor of its peers' clocks. We therefore begin in Subsection 3.1 by discussing the requirements from a clock estimation procedure and describing a simple (known) procedure for doing that. Then, in Subsection 3.2 we describe the clock synchronization protocol itself. In this description we abstract the clock estimation procedure, and view it as a "black box" that provides only the properties that were discussed before. Finally, in Subsection 3.3 we elaborate on some aspects of our protocol, and compare it with similar synchronization protocols for other models.

3.1 Clock Estimation

Our protocol's basic building block is a subroutine in which a processor p estimates the clock value of another processor q. The (natural) requirements from such a procedure are:

Accuracy. The value returned from this procedure should not be too far from the actual clock value of processor q.

Bounded error. Along with the estimated clock value, p also gets some upper bound on the error of that estimation.

For technical reasons it is also more convenient to have this procedure return the distance between the local clocks of p and q, rather than the clock value of q itself. Hence we define a clock estimation procedure as a two-party protocol, such that when a processor p invokes this protocol, trying to estimate the clock value of another processor q, the protocol returns two values (dq, aq) (for distance and accuracy). These values should be interpreted as "since the procedure was invoked, there was a point at which the difference Cq − Cp was about dq, up to an error of aq". Formally, we have

Definition 4. We say that a clock estimation routine has reading error ε and timeout MaxWait if, whenever a processor p is non-faulty during the time interval [τ, τ + MaxWait] and it calls this routine at time τ to estimate the clock of q, then the routine returns at time τ' ≤ τ + MaxWait with some values (d, a). Moreover, if q was also non-faulty during the interval [τ, τ'], then the values (d, a) satisfy the following:

• a ≤ ε, and

• there was a time τ'' ∈ [τ, τ'] at which Cq(τ'') − Cp(τ'') ∈ [d − a, d + a].

We now describe a simple clock estimation algorithm. The requestor p sends a message to q, who returns a reply to p containing the time according to the clock of q (when sending the reply). If p does not receive a reply within MaxWait = 2δ time (where δ is the message delivery bound), p aborts the estimation and sets dq = 0 and aq = ∞. Otherwise, if p sends its "ping" message to q at local time S, and receives an answer C at local time R, it sets dq = C − (R + S)/2 and aq = (R − S)/2.

Intuitively, p estimates that at its local time (R + S)/2, q's time was C. If the network were totally symmetric, that is, the time for the message to arrive was identical on the way from p to q and on the way back from q to p, and p's clock progressed between S and R at a constant rate, then the estimation would be totally accurate. In any case, if q returned an answer C, then at some time between p's local time S and p's local time R, q had the value C, so the estimation of the offset cannot miss by more than (R − S)/2.

This simple procedure can be "optimized" in several ways. A common method, which is used in practice to decrease the error in estimating the peer's clock (at the expense of worse timeliness), is to repeatedly ping the other processor and choose the estimation given by the ping with the least round-trip time. This is used, for example, in the NTP protocol [21].

Also, to reduce network load it may be possible to piggyback clock-querying messages on other messages, or to perform them in a different thread which will spread them across a time interval. Of course, if we implement the latter idea in the mobile adversary setting, a clock synchronization protocol should periodically check that this thread exists and restart it otherwise (to protect against the adversary killing that thread). We note that when implemented this way, we cannot guarantee the conditions of Definition 4 anymore, since the separate thread may return an old cached value which was measured before the call to the clock estimation procedure. (Hence, the analysis in this paper cannot be applied "right out of the box" to the case where the time estimation is done in a separate thread.)
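A minimal Python sketch of this ping-based estimation is given below. It is illustrative only: the transport primitive send_and_wait_reply and the parameter delta (the message delivery bound δ) are assumptions of ours, and the arithmetic follows the formulas dq = C − (R + S)/2 and aq = (R − S)/2 above.

    import math

    def estimate_offset(q, local_clock, send_and_wait_reply, delta):
        """Estimate C_q - C_p; returns (d_q, a_q) as in Section 3.1.

        send_and_wait_reply(q, timeout) is an assumed transport primitive that
        pings q and returns q's clock value C, or None on timeout.
        """
        max_wait = 2 * delta            # timeout bound MaxWait = 2*delta
        s = local_clock()               # local send time S
        c = send_and_wait_reply(q, timeout=max_wait)
        r = local_clock()               # local receive time R
        if c is None or r - s > max_wait:
            return 0.0, math.inf        # abort: d_q = 0, a_q = infinity
        d_q = c - (r + s) / 2.0         # estimated offset C_q - C_p
        a_q = (r - s) / 2.0             # error bound: half the round-trip time
        return d_q, a_q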

3.2 The Protocol

Sync is our clock synchronization protocol. It uses a clock estimation procedure such as the one described in Section 3.1, which we denote by estimateOffset, with the timeout bound denoted by MaxWait and maximal error ε. Other parameters in this protocol are the (local) time SyncInt between two executions of the synchronization protocol, and a parameter WayOff, which is used by a processor to gauge the distance between its clock and the clocks of the other processors. These parameters are (approximately) computed from the network model parameters ρ, δ and Δ. The constraints that these parameters should satisfy are:

• SyncInt ≥ 2·MaxWait ≥ 4δ

• WayOff ≥ Ψ + ε, where Ψ is the maximum deviation we want to achieve (and we have Ψ > 16ε).

These settings are further discussed in the analysis (Section 4.2) and in Section 3.3.

The Sync protocol is described in Figure 1. The basic idea is that each processor p uses estimateOffset to get an estimate for the clocks of its peers. Then p eliminates the f smallest and f largest values, and uses the remaining values to adjust its own clock. Roughly, p computes a "low value" Cm which is the f+1'st smallest estimate, and a "high value" CM which is the f+1'st largest estimate. If p's own clock Cp is more than WayOff away from the interval [Cm, CM], then p knows that its clock is too far from the clocks of the "good processors", so it ignores its own clock and resets it to (Cm + CM)/2. Otherwise, p's clock is "not too far" from the other processors, so we would like to limit the change to it. In this case, instead of completely ignoring the old clock value, p resets its clock to (min(Cm, Cp) + max(CM, Cp))/2. (So if p's clock was below Cm or above CM, it will only move half-way towards these values.)

The details of the Sync protocol are slightly different, though, specifically in the way that the "low value" and "high value" are chosen. Processor p first uses the error bounds to generate overestimates and underestimates for these clock values, and then computes the "low value" Cm as the f+1'st smallest overestimate, and the "high value" CM as the f+1'st largest underestimate. In the analysis we also assume that all the clock estimations are done in parallel, and that the time it takes to make the local computations is negligible, so a run of Sync takes at most MaxWait time on the local clock. (This is not really crucial, but it saves the introduction of an extra parameter in the analysis.)

Figure 1: Algorithm Sync for processor p

    Parameters: SyncInt  // time between synchronizations
                WayOff   // bound for clocks which are very far from the rest

    1.  Every SyncInt time units call sync()
    3.  function sync() {
    4.      For each q ∈ {1, ..., n} do
    5.          (dq, aq) ← estimateOffset(q)
    6.          dq_hi ← dq + aq            // overestimate of dq
    7.          dq_lo ← dq − aq            // underestimate of dq
    8.      m ← the f+1'st smallest dq_hi
    9.      M ← the f+1'st largest dq_lo
    10.     If m ≥ −WayOff and M ≤ WayOff
    11.         then adjp ← adjp + (min(m, 0) + max(M, 0))/2
    12.         else adjp ← adjp + (m + M)/2
    13. }
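The convergence step of Figure 1 can be sketched in Python as follows (our own illustration; estimate_offset is assumed to behave as the procedure of Section 3.1, and f and way_off are deployment parameters). It returns the correction to add to adjp rather than modifying a clock directly.

    def sync_step(peers, estimate_offset, f, way_off):
        """One execution of Sync (Figure 1); returns the adjustment to add to adj_p."""
        over, under = [], []
        for q in peers:
            d_q, a_q = estimate_offset(q)
            over.append(d_q + a_q)      # overestimate of the offset to q
            under.append(d_q - a_q)     # underestimate of the offset to q
        # m: the (f+1)'st smallest overestimate; M: the (f+1)'st largest underestimate.
        m = sorted(over)[f]
        M = sorted(under, reverse=True)[f]
        if m >= -way_off and M <= way_off:
            # Own clock is not too far off: move only half-way towards [m, M].
            return (min(m, 0.0) + max(M, 0.0)) / 2.0
        # Own clock is way off: ignore it and jump to the middle of [m, M].
        return (m + M) / 2.0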

3.3 Discussion

Our Sync protocol follows the general framework of "convergence function synchronization algorithms" (see [26]), where the next clock value of a processor p is computed from its estimates for the clock values of other processors, using a fixed, simple convergence function. (In the current algorithm and analysis, a processor needs to estimate the clocks of all other processors; we expect that this can be improved, so that a processor will only need to estimate the clocks of its local neighbors.)

No rounds. As mentioned in Section 1.1, one notable difference between our protocol and other protocols that have been proposed in the literature is that many convergence function protocols (for example [8, 9]) proceed in rounds, where each processor keeps a different logical clock for each round. (A round is the time between two consecutive synchronization protocols.) In these protocols, if a processor is asked for a "round-i" clock when this processor is already in its i+1'st round, it would return the value of its clock "as if it didn't do the last synchronization protocol". In contrast, in our Sync protocol a processor p always responds with its current clock value. This makes the analysis of the protocol a little more complicated, but it greatly simplifies the implementation, especially in the mobile adversary setting (since variables such as the current round number, last round's clock, and the time to begin the next round would have to be recovered after a break-in).

Known values. Another practical advantage of our protocol is that it does not require knowing the values of parameters such as the message delivery bound δ, the hardware drift ρ, or the maximum deviation Ψ, which may be hard to measure in practice (in fact, they may even change during the course of the execution). We only use these values in the analysis of the protocol. In practice, all the algorithm parameters which do depend on these values (like MaxWait, SyncInt and WayOff) may overestimate them by a multiplicative factor without much harm (i.e. without introducing such a factor to the maximum deviation, logical drift or recovery time actually achieved).

When to perform Sync? In our protocol, a processor executes the Sync protocol every SyncInt time units of local time, and we do not make any assumptions about the relative times of Sync executions in different processors. A common way to implement this is to set up an alarm at the end of each execution, and to start the new execution when this alarm goes off. In the mobile adversary setting, one must make sure that this alarm is recovered after a break-in. We note that our analysis does not depend on the processors executing a Sync exactly every SyncInt time units. Rather, all we need is that during a time interval of (1 + ρ)(SyncInt + MaxWait) real time, each processor completes at least one and at most two Sync's.

4. ANALYSIS

Let T denote some value such that every non-faulty processor completes at least one and at most two full Sync's during any interval of length T. Specifically, setting T = (1 + ρ)SyncInt + 2MaxWait is appropriate for this purpose (where SyncInt is the time that is specified in the protocol, MaxWait is a bound on the execution time of a single Sync, and ρ is the drift rate).

4.1 Main Theorem

The following main theorem characterizes the performance achieved by our protocol:

Theorem 5. Let T be as defined above, let K := ⌊Δ/T⌋, and assume that K ≥ 5. Then

i. The Sync protocol fulfills the synchronization requirement with maximum deviation Ψ = 16ε + 18ρT + 4C, where C = (17ε + 18ρT)/2^(K−3).

ii. The Sync protocol fulfills the accuracy requirement with logical drift ρ̃ = ρ + C/(2T) and discontinuity ψ = ε + C/2.

We note that the theorem shows a tradeoff between the rate at which the Sync protocol is performed (as a function of Δ) and how optimal its performance is. That is, if we choose T to be small compared to Δ (for instance T = Δ/20) then C is very small, and so we get almost perfect accuracy (ρ̃ ≈ ρ) and the significant term in the maximum deviation bound is 16ε.
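As a quick numerical illustration of this tradeoff, the following Python snippet evaluates the expressions of Theorem 5 for some made-up parameter values (a 5 ms reading error, drift rate 10^−6, and Δ of one hour with T = Δ/20); none of these numbers come from the paper.

    import math

    eps, rho, delta_period = 0.005, 1e-6, 3600.0   # hypothetical values (seconds)
    T = delta_period / 20
    K = math.floor(delta_period / T)               # K = 20 >= 5
    C = (17 * eps + 18 * rho * T) / 2 ** (K - 3)   # C as in Theorem 5(i)
    psi_dev = 16 * eps + 18 * rho * T + 4 * C      # maximum deviation
    rho_tilde = rho + C / (2 * T)                  # logical drift, Theorem 5(ii)
    discontinuity = eps + C / 2
    # C is tiny here, so psi_dev is dominated by 16*eps and rho_tilde is close to rho.
    print(psi_dev, rho_tilde, discontinuity)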

4.2 Clock Bias

For the purpose of analysis, it will be more convenient to consider the bias of the clocks, rather than the clock values themselves. The bias of processor p at time τ is the difference between its logical clock and the real time, and is denoted by Bp(τ). Namely,

    Bp(τ) := Cp(τ) − τ    (4)

When the real time τ is implied by the context, we often omit it from the notation and write just Bp instead of Bp(τ).

In the analysis below we view the protocol Sync as affecting the biases of processors, rather than their clock values. In particular, in an execution of Sync by processor p, we can view dq as an estimate for Bq − Bp rather than an estimate for Cq − Cp, and we can view the modification of adjp in the last step as a modification of Bp. We can therefore rewrite the protocol in terms of biases rather than clock values, as in Figure 2.

Figure 2: Algorithm Sync for processor p, bias formulation

    Parameters: SyncInt  // time between synchronizations
                WayOff   // bound for clocks which are very far from the rest

    1.  Every SyncInt time units call sync()
    3.  function sync() {
    4.      For each q ∈ {1, ..., n} do
    5.          (dq, aq) ← estimateOffset(q)
    6.          Bq_hi ← Bp + dq + aq       // overestimate of Bq
    7.          Bq_lo ← Bp + dq − aq       // underestimate of Bq
    8.      B(m)p ← the f+1'st smallest Bq_hi
    9.      B(M)p ← the f+1'st largest Bq_lo
    10.     If Bp − B(m)p ≤ WayOff and B(M)p − Bp ≤ WayOff
    11.         then Bp ← (min(B(m)p, Bp) + max(B(M)p, Bp))/2
    12.         else Bp ← (B(m)p + B(M)p)/2
    13. }

We note that by referencing Bp in the protocol, we mean Bp(τ) where τ is the real time at which this reference takes place. We stress that the protocol cannot be implemented as it is described in Figure 2, since a processor p does not know its bias Bp. Rather, the above description is just an alternative view of the "real protocol" that is described in Figure 1.

4.3 Proof Overview of the Main Theorem

Below we provide only an informal overview of the proof. A few more details (including a useful piece of syntax and statements of the technical lemmas) can be found in Appendix A. A complete proof will be included in the full version of the paper. For simplicity, in this overview we only look at the case with no drifts and no clock-reading errors, namely ρ = ε = 0. (Note that in this case we always have aq = 0, so in Steps 6-7 of the protocol we get Bq_hi = Bq_lo = Bp + dq.)

The analysis looks at consecutive time intervals I0, I1, ..., each of length T, and proceeds by induction over these intervals. For each interval Ii we prove "in spirit" the following claims:

i. The bias values of the "good processors" get closer together: if they were at some distance from each other at the beginning of Ii, they will be at 7/8 of that distance at the end of it.

ii. The bias values of the "recovering processors" get (much) closer to those of the good processors: if a recovering processor was at some distance from the "range of good processors" at the beginning of Ii, it will be at most half that distance from that range at the end of Ii.

It therefore follows that after a few such intervals, the bias of a "recovering processor" will be at most Ψ away from those of the "good processors".

To prove the above claims, our main technical lemma considers a given interval Ii, and assumes that there is a set G of at least n − f processors, which are all non-faulty throughout Ii, and all have bias values in some small range at the beginning of Ii (w.l.o.g., this can be the range [−D, D]). Then, we prove the following three properties:

Property 1. We first show that the biases of the processors in G remain in the range [−D, D] throughout the interval Ii. This is so because in every execution of Sync, a processor p ∈ G always gets biases in that range from all other processors in G. Since G contains more than 2f processors, both B(m)p and B(M)p are in that range, and so p's bias remains in that range also after it completes the Sync protocol. Also, it follows from the same argument that we always have Bp − B(m)p ≤ WayOff and B(M)p − Bp ≤ WayOff, so processor p never ignores its own current bias in Step 11 of the protocol.

Property 2. Next we consider processors p ∈ G whose initial bias values are low (say, below the median for G). Since p executes at most two Sync's during the time interval Ii, and in each Sync it takes the average of its own current bias and another bias below D, the bias of p remains bounded strictly below D. Specifically, one can show that the resulting bias values cannot be larger than (Z + 3D)/4 (where Z is the initial median value). Similarly, for the processors q ∈ G with high initial bias values, the bias values remain bounded strictly above −D, specifically at least (Z − 3D)/4.

Property 3. Last, we use the result of the previous steps to show that at the end of the interval, the bias of every processor in G is between (Z − 7D)/8 and (Z + 7D)/8. (Hence, in the case of no errors or drifts, the size of the interval that includes all the processors in G shrinks from 2D to 7D/4.) To see this, recall that by the result of the previous step, whenever a processor p ∈ G executes a Sync, it gets bias values which are bounded by (Z + 3D)/4 from all the processors with low initial biases, and so its low estimate B(m)p must also be smaller than (Z + 3D)/4. Similarly, it gets bias values which are bounded from below by (Z − 3D)/4 from all the processors with high initial biases, and so its high estimate B(M)p must be larger than (Z − 3D)/4. Since the bias of p after its Sync protocol is computed as (min(B(m)p, Bp) + max(B(M)p, Bp))/2, and since Bp is in the range [−D, D] by the result of the first step, the result of this step follows.

Moreover, a similar argument can be applied even to a processor outside G, whose initial bias is not in the range [−D, D]. Specifically, we can show that if at the beginning of interval Ii a non-faulty processor p has high bias, say D + α for some α > 0, then at the end of the interval the bias of p is at most (Z + 7D)/8 + α/2. Hence, the distance between p and the "good range" shrinks from α to α/2.

A formal analysis, including the effects of drifts and reading errors, will be included in the full version of the paper.
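To give a feel for the averaging argument, here is a small, self-contained Python simulation (ours, not from the paper) of the bias-formulation rule in the idealized case ρ = ε = 0, with exact estimates and all processors synchronizing simultaneously. Under these simplifications the spread shrinks much faster than the 7/8-per-interval bound, which also accounts for asynchrony and drift; the parameter values below are made up.

    def sync_biases(biases, f, way_off):
        """Apply one round of the Figure 2 rule to every processor (rho = eps = 0)."""
        new = []
        for b_p in biases:
            est = sorted(biases)                 # exact estimates of all biases
            b_m = est[f]                         # (f+1)'st smallest
            b_M = est[len(est) - 1 - f]          # (f+1)'st largest
            if b_p - b_m <= way_off and b_M - b_p <= way_off:
                new.append((min(b_m, b_p) + max(b_M, b_p)) / 2)
            else:
                new.append((b_m + b_M) / 2)
        return new

    biases = [-1.0, -0.5, 0.3, 1.0, 25.0]        # last one is a recovering processor
    for _ in range(6):
        biases = sync_biases(biases, f=1, way_off=3.0)
        print([round(b, 3) for b in biases])     # spread of the good biases shrinks,
                                                 # and the outlier is pulled back in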

5. FUTURE DIRECTIONS

Our results require that at most a third of the processors are faulty during each period. Previous clock synchronization protocols assuming authenticated channels (as we do) were able to require only a majority of non-faulty processors [19, 27]. It is interesting to close this gap. In [10] there is another, weaker requirement: only that the subnetwork containing the non-faulty processors remain connected (but [10] also assumes signatures). It may be possible to prove a variant of this for our protocol; in particular, it would be interesting to show that it is sufficient that the non-faulty processors form a sufficiently connected subgraph. If this holds, it will also justify limiting the clock synchronization links to a limited number of neighbors for each processor, which is one of the practical advantages of convergence based clock synchronization.

(It should be noted that (3f+1)-connectivity is not sufficient for our protocol. One can construct a graph on 6f+2 nodes which is (3f+1)-connected, and yet our protocol does not work for it. This graph consists of two cliques of 3f+1 nodes, and in addition the i'th node of one clique is connected to the i'th node of the other. Now, this graph is clearly (3f+1)-connected, but our protocol cannot guarantee that the clocks in one clique do not drift apart from those in the other.)

Additional work will be required to explore the practical potential of our protocol. In particular, practical protocols such as the Network Time Protocol [21] involve many mechanisms which may provide better results in typical cases, such as feedback to estimate and compensate for clock drift. Such improvements may be needed in our protocol (while making sure to retain security!), as well as other refinements in the protocol or analysis to provide better bounds and results in typical scenarios.

The Synchronization and Accuracy requirements we defined only talk about the behavior of the protocol when the adversary is suitably limited. It may also be interesting to ask what happens with stronger adversaries. Specifically, what happens if the adversary was "too powerful" for a while, and now it is back to being f-limited? An alternative way of asking the same question is what happens when the adversary is limited, but the initial clock values of the processors are arbitrary. Along the lines of [11, 12], it is desirable to improve the protocol and/or analysis to also guarantee self stabilization, which means that the network eventually converges to a state where the non-faulty processors are synchronized.

6. REFERENCES

[1] B. Barak, A. Herzberg, D. Naor and E. Shai, The Proactive Security Toolkit and Applications, Proceedings of the 6th ACM Conference on Computer and Communication Security, Nov. 1999, Singapore, pages 18-27.

[2] A. Bouzelat and Z. Mammeri, Simple reading, implicit rejection and average function for fault-tolerant physical clock synchronization, Proceedings of the 23rd EuroMicro Conference, Sep. 1997, pages 524-531.

[3] R. Canetti, R. Gennaro, A. Herzberg and D. Naor, Proactive Security: Long-term protection against break-ins, CryptoBytes, Vol. 3, No. 1, Spring 1997.

[4] R. Canetti and A. Herzberg, Maintaining Security in the Presence of Transient Faults, Proceedings of Crypto '94, pages 425-438, August 1994.

[5] C.S. Chow and A. Herzberg, Network Randomization Protocol: A proactive pseudo-random generator, Proceedings of the 5th Usenix Unix Security Symposium, Salt Lake City, Utah, June 1995, pages 55-63.

[6] R. Canetti, S. Halevi and A. Herzberg, Maintaining authenticated communication in the presence of break-ins, Journal of Cryptology, to appear. Preliminary version in Proceedings of the 16th Annual ACM Symposium on Principles of Distributed Computing, pages 15-24, ACM Press, 1997.

[7] F. Cristian, Probabilistic Clock Synchronization, Distributed Computing, Vol. 3, pp. 146-158, 1989.

[8] F. Cristian and C. Fetzer, Probabilistic Internal Clock Synchronization, Proceedings of the 13th Symposium on Reliable Distributed Systems, Oct. 1994, Dana Point, CA, pages 22-31.

[9] F. Cristian and C. Fetzer, An Optimal Internal Clock Synchronization Algorithm, Proceedings of the 10th Annual IEEE Conference on Computer Assurance, June 1995, Gaithersburg, MD, pages 187-196.

[10] D. Dolev, J.Y. Halpern, B. Simons and R. Strong, Dynamic Fault-Tolerant Clock Synchronization, JACM, Vol. 42, No. 1, Jan. 1995, pp. 143-185.

[11] S. Dolev and J. Welch, Wait-free clock synchronization, Proceedings of the 12th Annual ACM Symposium on Principles of Distributed Computing (PODC), pages 97-108, ACM Press, 1993.

[12] S. Dolev and J. Welch, Self-Stabilizing Clock Synchronization in the Presence of Byzantine Faults, Proceedings of the 2nd Workshop on Self-Stabilizing Systems, May 1995.

[13] C. Fetzer and F. Cristian, Lower bounds for convergence function based clock synchronization, 14th PODC, Aug. 1995, Ottawa, Canada, pp. 137-142.

[14] S. Haber and W.S. Stornetta, How to Time-Stamp a Digital Document, Journal of Cryptology, 1991, Vol. 3, pages 99-111.

[15] T. Herman, Phase Clocks for Transient Fault Repair, Technical Report 99-08, University of Iowa Department of Computer Science, 1999.

[16] A. Herzberg, M. Jakobsson, S. Jarecki, H. Krawczyk and M. Yung, Proactive public key and signature systems, Proceedings of the 4th ACM Conference on Computer and Communication Security, 1997.

[17] A. Herzberg, S. Jarecki, H. Krawczyk and M. Yung, Proactive Secret Sharing, or: How to cope with perpetual leakage, Proceedings of Crypto '95, August 1995, pp. 339-352.

[18] S. Kutten and B. Patt-Shamir, Time-adaptive self stabilization, Proceedings of the 16th Annual ACM Symposium on Principles of Distributed Computing (PODC), pages 149-158, 1997.

[19] L. Lamport and P.M. Melliar-Smith, Synchronizing clocks in the presence of faults, JACM, Vol. 32, No. 1, Jan. 1985, pp. 52-78.

[20] J. Lundelius-Welch and N. Lynch, A new fault-tolerant algorithm for clock synchronization, Information and Computation, Vol. 77, No. 1, pages 1-36, 1988.

[21] D.L. Mills, Internet time synchronization: the Network Time Protocol, IEEE Trans. Communications, October 1991, pp. 1482-1493.

[22] B.C. Neuman and T. Ts'o, Kerberos: An Authentication Service for Computer Networks, IEEE Communications Magazine, Sep. 1994, Vol. 32, No. 9, pp. 33-38.

[23] R. Ostrovsky and M. Yung, How to withstand mobile virus attacks, Proceedings of PODC, 1991, pp. 51-61.

[24] R. Reischuk, A new solution to the Byzantine generals problem, Information and Control, pages 23-42, 1985.

[25] W. Rosenberry, D. Kenney and G. Fisher, Understanding DCE, Chapter 7: DCE Time Service: Synchronizing Network Time, O'Reilly, Oct. 1992.

[26] F.B. Schneider, Understanding Protocols for Byzantine Clock Synchronization, Technical Report TR87-859, CS Department, Cornell University, 1987.

[27] T.K. Srikanth and S. Toueg, Optimal clock synchronization, JACM, Vol. 34, No. 3, July 1987, pp. 626-645.

[28] IETF Secure Time (Stime) Working Group, http://www.stime.org/.

APPENDIX

A. SOME DETAILS

Below we describe the syntax that we use to formally prove Theorem 5, and state explicitly the technical lemmas that we prove.

A.1 Bias values and envelopes

It is useful to think of the bias values of processors as if they are drawn on a two-dimensional plane, where one axis is real time values and the other axis is bias values. (We call this the (τ, β)-plane.)

Note that since the drift of processors is bounded by ρ, the bias of a processor that does not reset its clock during a time interval of length τ cannot change by more than ρτ. Hence, if we know that the bias of a processor at time τ is in the interval [a, b], and that this processor did not reset its clock between time τ and τ', then its bias at time τ' must be in the interval [a − ρ(τ' − τ), b + ρ(τ' − τ)]. This motivates the following definition of an envelope in the (τ, β)-plane.

Definition 6. An envelope in the (τ, β)-plane is a region of the form

    E = {(τ, β) | τ ≥ τ0, a − ρ(τ − τ0) ≤ β ≤ b + ρ(τ − τ0)}

for some real values a ≤ b and τ0. (We also allow a = −∞ or b = +∞.)

We sometimes write E = Env{τ0, [a, b]} when we want to stress the parameters τ0, a and b.

See Figure 3 for examples of a few envelopes. Some notations that we use throughout the proof are summarized below.

• If E is an envelope, we denote by E(τ') the interval on the β-axis that corresponds to the intersection of E and the line τ = τ'. For instance, in the notation above E(τ0) = [a, b]. More formally, if E = Env{τ0, [a, b]}, then for all τ ≥ τ0 we have E(τ) = [a − ρ(τ − τ0), b + ρ(τ − τ0)]. We also denote by |E(τ)| the size of the interval E(τ) (so for example we have |E(τ0)| = b − a).

• We say that the bias of processor p is in the envelope E during the interval [τ1, τ2] if Bp(τ) ∈ E(τ) for any τ ∈ [τ1, τ2]. We say that the bias of p is not above E during [τ1, τ2] if for any τ ∈ [τ1, τ2] we have Bp(τ) ≤ max E(τ). Similarly, we say that the bias of p is not below E during [τ1, τ2] if for any τ ∈ [τ1, τ2], Bp(τ) ≥ min E(τ).

• If E = Env{τ0, [a, b]} is an envelope and c is a non-negative number, then we denote by E + c the envelope obtained from E by extending it by c on both sides. That is, Env{τ0, [a, b]} + c = Env{τ0, [a − c, b + c]}.

• If E = Env{τ0, [a, b]} and E' = Env{τ0, [a', b']} are two envelopes, then the average of E and E' is the envelope avg(E, E') = Env{τ0, [(a + a')/2, (b + b')/2]}. Note that if at some time τ we have one bias value β in E and another bias value β' in E', then the average of these bias values, (β + β')/2, is in avg(E, E'). (Similarly, if we only know that β, β' are not above E, E', respectively, then it follows that their average is not above avg(E, E'), and the same for "not below".)
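The envelope bookkeeping used in the proof is mechanical, and the following small Python class (ours, for illustration only) captures Definition 6 together with the notations E(τ), |E(τ)|, E + c and avg(E, E').

    from dataclasses import dataclass

    @dataclass
    class Envelope:
        """Env{tau0, [a, b]} with drift rate rho, as in Definition 6."""
        tau0: float
        a: float
        b: float
        rho: float

        def at(self, tau):
            # E(tau): the bias interval at real time tau >= tau0.
            assert tau >= self.tau0
            return (self.a - self.rho * (tau - self.tau0),
                    self.b + self.rho * (tau - self.tau0))

        def width(self, tau):
            lo, hi = self.at(tau)
            return hi - lo                       # |E(tau)|

        def widen(self, c):
            # E + c: extend by c on both sides.
            return Envelope(self.tau0, self.a - c, self.b + c, self.rho)

        def average(self, other):
            # avg(E, E'): average the endpoints (same tau0 and rho assumed).
            return Envelope(self.tau0, (self.a + other.a) / 2,
                            (self.b + other.b) / 2, self.rho)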

Figure 3: The envelopes E and E' of Lemma 7. [The figure, not reproduced here, plots the (τ, β)-plane over the times τ0 − MaxWait, τ0, τ0 + T − MaxWait and τ0 + T, showing the envelope E of width 2D at τ0 and the narrower envelope E' inside it.]

A.2 Main technical lemma

Recall that we denote T := (1 + ρ)SyncInt + 2MaxWait. Below we assume that T ≤ 2(1 − ρ)SyncInt. Note that for an interval of length T, the following properties hold:

• In any interval of length T, any non-faulty processor performs between one and two Syncs.

• In any interval of length T − MaxWait, any non-faulty processor completes at least one full Sync (that is, a Sync that starts and ends within the interval).

We assume that the parameter WayOff is set to WayOff = 16ε + 18ρT + ε. Our main technical lemma is as follows.

Lemma 7. Let G be a set of at least n − f processors, all of which are non-faulty during some real time interval [τ0 − MaxWait, τ0 + T]. Also, let D > 8ε be a real number. If there exists an envelope E such that |E(τ0)| ≤ 2D, and the biases of G's members are in E throughout the interval [τ0 − MaxWait, τ0], then

i. The biases of G's members remain in E throughout the interval [τ0, τ0 + T].

ii. Furthermore, there exists an envelope E', with |E'(τ0)| = 7D/4 + 2ε, such that E' ⊆ E and the biases of all members of G are in the envelope E' during the interval [τ0 + T − MaxWait, τ0 + T].

iii. If p is non-faulty in [τ0, τ0 + T] and Bp(τ0) is in E + α for some α ≥ 0, then p's bias is in the envelope E' + α/2 during the interval [τ0 + T − MaxWait, τ0 + T].

The envelopes E and E' are depicted in Figure 3. Part (i) of the lemma guarantees that the biases of all good members remain within the drift bound. Part (ii) shows that the deviation among the good members actually shrinks (as 7D/4 + 2ε < 2D) by the end of the interval T. Finally, part (iii) refers to recovering processors and shows that their distance from the good members halves in each interval T. This lemma is proved along the lines outlined in Section 4.3.

A.3 The inductive step

We let D = 8ε + 8ρT + 2C. Note that the maximum deviation we prove is Ψ = 2D + 2ρT. We look at the time intervals I0 = [0, T + MaxWait] and Ii = [iT − MaxWait, (i+1)T] for i ≥ 1. For all i = 0, 1, ..., denote by Gi the set of processors that were non-faulty in the interval of length Δ that ends at time (i+1)T. (Intuitively, we want to prove that all these processors are synchronized by time (i+1)T.) By our assumption on the power of the adversary, we know that the size of each Gi is at least n − f. Note also that since Δ > T + MaxWait, in particular all the processors in Gi are non-faulty during the interval Ii. We prove inductively the following claim:

Claim 8. There are envelopes E0, E1, E2, ... such that

i. |Ei(iT)| ≤ 2D, and Ei is "almost contained" in Ei−1; specifically, Ei ⊆ Ei−1 + C/2.

ii. Ei contains, during the interval Ii, the biases of all the processors in Gi.

iii. Let j < i, and denote α = max(WayOff/2^(i−j) − C/2, 0). Let p be a processor that is non-faulty in [jT, iT]. Then p's bias is in the envelope Ei + α during the interval [iT − MaxWait, iT].

Claim 8 immediately implies Theorem 5.