Programming Language Support for Writing Fault-Tolerant Distributed Software


Richard D. Schlichting and Vicraj T. Thomas

Abstract

Good programming language support can simplify the task of writing fault-tolerant distributed software. Here, an approach to providing such support is described in which a general high-level distributed programming language is augmented with mechanisms for fault tolerance. Unlike approaches based on sequential languages or specialized languages oriented towards a given fault-tolerance technique, this approach gives the programmer a high level of abstraction, while still maintaining flexibility and execution efficiency. The paper first describes a programming model that captures the important characteristics that should be supported by a programming language of this type. It then presents a realization of this approach in the form of FT-SR, a programming language that augments the SR distributed programming language with features for replication, recovery, and failure notification. In addition to outlining these extensions, an example program consisting of a data manager and its associated stable storage is given. Finally, an implementation of the language that uses the x-kernel and runs standalone on a network of Sun workstations is discussed. The overall structure and several of the algorithms used in the runtime are interesting in their own right.

1 Introduction

Programmers faced with choosing a programming language for writing fault-tolerant distributed software—that is, software that must continue to provide service in a multicomputer system despite failures in the underlying computing platform—often have few alternatives. At one end of the spectrum are relatively low-level choices such as assembly language or C, often coupled with a fault-tolerance library such as ISIS [1]. Such an approach can result in good execution efficiency, yet forces the programmer to deal with the complexities of distributed execution and fault-tolerance in a language that is fundamentally sequential. At the other end of the spectrum are high-level languages specifically intended for constructing fault-tolerant applications using a given technique. Examples here include Argus [2] and Plits [3], which support a programming model based on atomic actions. Such languages simplify the problems considerably, yet can be overly constraining if the programmer desires to use fault-tolerance techniques other than the one supported by the language [4]. The net result is that neither option provides the ideal combination of features.

(This work was supported in part by the National Science Foundation under grant CCR-9003161 and the Office of Naval Research under grant N00014-91-J-1015.)

In this paper, we advocate an intermediate approach based on taking a general high-level concurrent or distributed programming language such as Ada [5], CSP [6], or SR [7] and augmenting it with additional mechanisms to facilitate fault-tolerance. Starting with a language of this type offers a number of advantages. For example, unlike a low-level approach, such languages allow the programmer to deal with multiple processes and interprocess communication at a high level of abstraction, thereby simplifying the programming process. Moreover, given a well-designed set of fault-tolerance extensions, such a language can give the programmer a greater degree of flexibility than is found in current higher-level alternatives. Such flexibility allows, for instance, the use of multiple fault-tolerance techniques, something that can be important in certain types of software. In short, if done right, this approach can offer a language that preserves many of the positive attributes of both sets of alternatives.

The specific purpose of this paper is to elaborate on this approach, in two ways. First, we present a programming model based on the notion of fail-stop modules that captures the characteristics needed for a language oriented towards writing fault-tolerant distributed software. Second, we describe a realization of this approach in the form of FT-SR, a programming language based on augmenting the SR distributed programming language with additional mechanisms for fault-tolerance. FT-SR has been implemented using the x-kernel, an operating system designed for experimenting with communication protocols [8], and runs standalone on a network of Sun workstations. The implementation structure and several of the algorithms used in the runtime system are also interesting in their own right. We restrict our attention in this paper to failures suffered by processors with fail-silent semantics—that is, where the only failures are assumed to be a complete cessation of execution activity—although the approach generalizes to other failure models as well.

2 Fail-Stop Modules and Program Design

A fail-stop (or FS) module is an abstract unit of encapsulation. Such a module contains one or more threads of execution, which implement a collection of operations that are exported and made available for invocation by other FS modules. When such an invocation occurs, the operation normally executes to completion as an atomic unit, despite failures and concurrent execution. The failure resilience of an FS module is increased either by composing modules to form complex FS modules, or by using recovery techniques within the simple module itself. Replicating a module N times on separate processors to create a high-level abstract module that can survive N-1 failures is an example of the former [9], while including a recovery protocol that reads a checkpointed state from stable storage [10] is an example of the latter.

The other key aspect of FS modules is failure notification. Notification is generated whenever a failure exhausts the redundancy of a (simple or complex) FS module, resulting in complete failure of the abstraction being implemented. The notification can then be fielded by other modules that use the failed module so that they can react to the loss of functionality. For example, if N-fold replication is used to construct a complex FS module, notification would be generated should a failure destroy the Nth copy, assuming no recovery. We refer to a failure that exhausts redundancy in this way as a catastrophic failure. Notification is also generated if a module is explicitly destroyed by programmer action. Note that the analogy to fail-stop processors [11] implied by the term “fail-stop modules” is strong: in both cases, either the abstraction is maintained (processor or module) or notification is provided. FS modules are also similar in some respects to the “Ideal Fault-Tolerant Components” described in [12].
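The fail-stop contract can be sketched in a few lines: an invocation on a replicated module either completes, or a catastrophic-failure notification is raised once the last replica is gone. The following Python sketch is only an illustration of this semantics; the class and exception names are invented, and FT-SR provides this behavior in the language runtime rather than as a library.

```python
class CatastrophicFailure(Exception):
    """Notification raised when a module's redundancy is exhausted."""

class FailStopModule:
    """Toy model of an FS module built from N replicas of one operation.
    Each replica is a callable; a RuntimeError models a processor failure."""
    def __init__(self, replicas):
        self.replicas = list(replicas)

    def invoke(self, *args):
        while self.replicas:
            replica = self.replicas[0]
            try:
                return replica(*args)      # operation completes as a unit
            except RuntimeError:
                self.replicas.pop(0)       # replica failed; fall back to the next
        raise CatastrophicFailure("all replicas failed")
```

A module built from three replicas survives two failures; only when the last replica fails do callers see the notification, matching the all-or-notification guarantee described above.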

FS modules form the building blocks out of which a fault-tolerant distributed program can be constructed. As an example, consider the simple distributed banking system shown in Figure 1. Each box represents an FS module, with the dependencies between modules represented by arrows [13]. User accounts are assumed to be partitioned across two processors, with each data manager module managing the collection of accounts on its machine. The user interacts with the transaction manager, which in turn uses the data managers and a stable storage module to implement transactions using, for example, the two-phase commit protocol [14] and logging. The data managers export operations to read and write user accounts, and to implement the two-phase commit protocol. The stable storage modules are used to store the user data and to maintain key values for recovery purposes. The lock managers are used to control concurrent access.

Figure 1: Fault-tolerant system structured using FS modules
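The coordination the transaction manager performs over the data managers can be sketched as a standard two-phase commit. This Python fragment is a minimal sketch only: the method names (prepare_to_commit, commit, abort) and the log argument are hypothetical stand-ins for the operations the paper's modules export.

```python
def two_phase_commit(decision_log, data_managers):
    """Two-phase commit sketch: collect votes, log the decision, propagate it."""
    # Phase 1: ask every data manager to prepare; each votes True or False.
    votes = [dm.prepare_to_commit() for dm in data_managers]
    decision = "commit" if all(votes) else "abort"
    decision_log.append(decision)   # in a real system, forced to stable storage
    # Phase 2: propagate the decision to every participant.
    for dm in data_managers:
        dm.commit() if decision == "commit" else dm.abort()
    return decision
```

Logging the decision before phase 2 is what lets a recovering transaction manager re-drive commit for transactions that had already passed the commit point, as described later in Section 3.3.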

To increase the overall system dependability, the constituent FS modules would be constructed using fault-tolerance techniques. For example, the transaction and data managers might use recovery protocols to ensure data consistency following failure. Similarly, stable storage might be replicated. The failure notification aspect of FS modules can be used to allow modules to react to the failures of modules upon which they depend. If such a failure cannot be tolerated, it may, in turn, be propagated up the dependency graph. At the top level, this would be seen as the catastrophic failure of the transaction manager and hence, the system. This might occur, for example, should the redundant copies of the stable storage module all fail concurrently.

The failure notification and composability aspects of FS modules are what makes this programming model so useful. Ideally, a fault-tolerant program behaves as an FS module: commands to the program are executed completely or a failure notification is generated. This assures users that, absent notification, their commands have been correctly processed. Such a program is much easier to develop if each of its components is in turn implemented by FS modules. Since component failures are detectable, other components do not have to implement complicated failure detection schemes or deal with erroneous results. These components may in turn be implemented by other FS modules, with this process continuing until the simplest components are implemented by simple FS modules. At each level, the guarantees made by FS modules simplify the composition process.

3 The FT-SR Language

The programming model presented in the previous section provides a framework and rationale to guide the design of fault-tolerance extensions for a high-level distributed programming language. Here, we present FT-SR, the result of following this design process for SR. To support the model, the language has provisions for encapsulation based on SR resources, resource replication, recovery protocols, and both synchronous and asynchronous failure notification. Familiarity with SR is assumed, although many of its constructs should be intuitive; details can be found in [7, 15].

3.1 Simple FS Modules

Most distributed programming languages, including SR, have module constructs that provide many of the properties needed to realize a simple FS module. In SR, these modules are called resources. Each resource is populated by a varying number of processes that implement operations that are exported for invocation from other resources. As an example, consider the simple lock manager resource shown in Figure 2. This resource contains a single process that exports two operations, get_lock and rel_lock. If a client invokes the get_lock operation and the lock is available, a lock id is returned and the client can proceed. If the lock is unavailable, the client is blocked at the first guard of the input statement, a multiway receive with semantics similar to Ada's select statement. get_lock takes as its argument the capability of the invoking client, which is used as an identifier.

    resource lock_manager
      op get_lock(cap client) returns int
      op rel_lock(cap client; int)
    body lock_manager
      var ...variable declarations...
      process lock_server
        do true ->
          in get_lock(client_cap) and lock_available() ->
            ...mark lock_id as being held by client_cap...
            return lock_id
          [] rel_lock(client_cap, lock_id) ->
            ...release lock...
            return
          ni
        od
      end lock_server
    end lock_manager

Figure 2: Lock Manager resource

Given that resources export operations and contain multiple processes, the only aspect of simple FS modules that SR does not support directly is failure notification. Accordingly, FT-SR includes provisions for both generating and fielding such notifications. The language runtime is responsible for generating notifications when processor failures are detected, so further discussion of that part is deferred to Section 4. For fielding notifications, FT-SR supports two different models. The first is synchronous with respect to a call; in this case, the notification is fielded by an optional backup operation specified in the calling statement. The second is asynchronous; in this case, the programmer specifies a resource to be monitored and an operation to be invoked should the monitored resource fail.

To understand the need for these two kinds of failure notification, consider what might happen if the lock manager shown in Figure 2 or any of its clients fail. If the lock manager fails, all clients that are blocked on its input statement will remain blocked forever. To handle this situation, clients can use the synchronous failure notification facility to unblock themselves and take some recovery action.

    resource client
      op ...
      op ...
    body client()
      var lock_id: int
      op mgr_failed(cap client) returns int
      ...
      lock_id := call {lock_mgr_cap.get_lock, mgr_failed}(myresource())
      ...
      proc mgr_failed(client_cap) returns lock_err
        return LOCK_ERR
      end mgr_failed
    end client

Figure 3: Outline of Lock Manager client

Figure 3 shows the outline of a client structured in this way. Bracketed with the normal invocation is the capability for a backup operation, mgr_failed. The backup is invoked should the original call fail, i.e., if the lock manager fails to reply within a certain amount of time (see Section 4 for details). In this example, the backup operation is implemented locally, although it could just as easily have been implemented in another resource. Note that the backup is called with the same arguments as the original operation, implying that the two operations must be type compatible.
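The call-with-backup construct can be modeled in a few lines. In this hedged Python sketch, failure of the callee is represented by an exception rather than by the runtime's timeout-based detection, and all the names are illustrative rather than part of any real API.

```python
def call_with_backup(operation, backup, args):
    """Sketch of FT-SR's synchronous notification: attempt the call, and if
    the callee appears to have failed, invoke the type-compatible backup
    operation with the same arguments."""
    try:
        return operation(*args)
    except (TimeoutError, ConnectionError):
        # The original call failed to complete; field the notification
        # by running the backup with identical arguments.
        return backup(*args)
```

Because the backup receives the same arguments as the original call, it can, as in Figure 3, simply return an error value that unblocks the caller.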

Consider now the inverse situation, where a client fails while holding a lock. The server can use the FT-SR asynchronous failure notification facility to detect such a failure and release the lock, as shown in Figure 4. Here, monitor is used to enable monitoring of the client instance specified by the capability client_cap. If the client is down when the statement is executed, or should it subsequently fail, rel_lock will be implicitly invoked by the language runtime system with client_cap and lock_id as arguments. Monitoring is terminated by monitorend or by another monitor statement that specifies the same resource.
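The monitor/monitorend bookkeeping amounts to a table mapping each monitored resource to a pending handler invocation. The sketch below models that table in Python; the class and method names are invented for illustration, since in FT-SR this machinery lives inside the language runtime.

```python
class MonitorTable:
    """Toy model of asynchronous failure notification: monitor registers a
    handler for a resource, monitorend cancels it, and a detected failure
    fires the handler with its saved arguments."""
    def __init__(self):
        self.handlers = {}   # resource id -> (handler, saved args)

    def monitor(self, resource_id, handler, *args):
        # A later monitor on the same resource replaces the earlier one.
        self.handlers[resource_id] = (handler, args)

    def monitorend(self, resource_id):
        self.handlers.pop(resource_id, None)

    def report_failure(self, resource_id):
        """Invoked when the failure of resource_id is detected."""
        if resource_id in self.handlers:
            handler, args = self.handlers.pop(resource_id)
            handler(*args)   # e.g. rel_lock(client_cap, lock_id) in Figure 4
```

In the lock manager of Figure 4, the saved handler would be rel_lock with the failed client's capability and lock id, so a crashed client's lock is released automatically.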

    resource lock_manager
      op get_lock(cap client) returns int
      op rel_lock(cap client; int)
    body lock_manager
      var ...variable declarations...
      process lock_server
        do true ->
          in get_lock(client_cap) and lock_available() ->
            ...mark lock_id as being held by client_cap...
            monitor client_cap send rel_lock(client_cap, lock_id)
            return lock_id
          [] rel_lock(client_cap, lock_id) ->
            ...release lock if held by client_cap...
            monitorend client_cap
            return
          ni
        od
      end lock_server
    end lock_manager

Figure 4: Lock Manager with client monitoring

3.2 Composition and other Fault-Tolerance Mechanisms

FT-SR provides mechanisms for using recovery techniques within simple FS modules and for composing simple FS modules using replication. The replication facility allows multiple copies of a resource to be created, with the language and runtime providing the illusion that the collection is a single resource instance exporting the same set of operations. The SR create statement has been generalized to allow for the creation of such replicated resources, which we call a resource group. For example, the statement

    lock_mgr_cap := create (i := 1 to N) lock_manager() on vm_caps[i]

creates a resource group with N identical instances of the resource lock_manager on the SR virtual machines specified by the array vm_caps. The value returned is a resource capability that provides access to the operations implemented by the new resource group. In particular, this capability is a resource group capability that allows multicast invocation of any of the group's exported operations. In other words, using this capability in a call or a send causes the invocation to be multicast to each of the individual resource instances that make up the group.

A multicast invocation provides certain guarantees. One is that all such invocations are delivered to the runtime of each resource instance in a consistent total order, although the program may vary this if desired. This means, for example, that if two operations implemented by alternatives of an input statement are enabled simultaneously, the order in which they will be executed is consistent across all functioning replicas unless explicitly overridden. Moreover, the multicast is also done atomically, so that either all functioning replicas receive the invocation or none do. This combination of properties means that a multicast invocation is equivalent to an atomic broadcast, a facility that has proven useful for constructing many types of fault-tolerant distributed systems [16, 17, 18, 19].
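One common way to obtain such a consistent total order is to stamp each invocation with a sequence number and have every replica deliver in stamp order, even if messages arrive out of order. The sketch below illustrates only that ordering property; it is not the protocol the FT-SR runtime actually uses (Section 4 discusses the implementation), and the class names are invented.

```python
class Sequencer:
    """Assigns a global sequence number to each multicast invocation."""
    def __init__(self):
        self.next_seq = 0

    def stamp(self, invocation):
        self.next_seq += 1
        return (self.next_seq, invocation)

class Replica:
    """Buffers out-of-order arrivals and delivers strictly in stamp order."""
    def __init__(self):
        self.pending = {}        # seq -> invocation, awaiting delivery
        self.delivered = []
        self.next_expected = 1

    def receive(self, stamped):
        seq, invocation = stamped
        self.pending[seq] = invocation
        # Deliver every consecutive invocation we now hold.
        while self.next_expected in self.pending:
            self.delivered.append(self.pending.pop(self.next_expected))
            self.next_expected += 1
```

Two replicas that receive the same stamped invocations in different network orders still deliver them in the same order, which is what makes replicated input statements behave consistently.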

Provisions are also made for coordinating outgoing invocations generated within a resource group. There are two kinds of invocations that can be generated by a group member. The first is a private invocation, which a member uses to communicate with a resource instance individually, without coordination with other group members. This can be used, for example, to allow each replica to have its own set of private resources. The other is a group invocation, which a group uses to generate a single outgoing invocation on behalf of the entire group.

To distinguish between these two kinds of communication, FT-SR supports capability variables of type private_cap. Invocations made using a private capability are considered private communication and are not coordinated with invocations from other group members. Invocations using regular capability variables are, however, group invocations that generate exactly one invocation. The invocation is actually transmitted when one of the members reaches the statement, with later instances being suppressed by the language runtime system. Note that either type of invocation will be a multicast invocation if the capability is a resource group capability.
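The suppression of duplicate group invocations reduces to remembering which invocations have already been transmitted. This Python sketch shows the idea only; the invocation-id scheme is a hypothetical stand-in for however the FT-SR runtime identifies "the same statement reached by different replicas".

```python
class GroupSender:
    """Sketch of exactly-once group invocation: the first replica to reach
    the statement transmits it; duplicates from co-members are suppressed."""
    def __init__(self):
        self.seen = set()   # invocation ids already transmitted
        self.sent = []      # what actually went out on the wire

    def group_invoke(self, invocation_id, payload):
        if invocation_id in self.seen:
            return          # a co-member already issued this invocation
        self.seen.add(invocation_id)
        self.sent.append(payload)
```

With deterministic replicas, all members generate the same invocation id at the same logical point, so exactly one copy of each group invocation leaves the group.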

FT-SR also provides the programmer with the ability to restart a failed resource instance on a functioning virtual machine. The recovery code to be executed in this situation is denoted by the keywords recovery and end. Restart can be either explicit or implicit. An explicit restart is done by

    restart lock_mgr_cap() on vm_cap

which restarts the resource indicated by lock_mgr_cap and executes any specified recovery code. An entire resource group can be restarted using syntax similar to the create statement. In both cases, the restarted resource instance is, in fact, a re-creation of the failed instance and not a new instance. This means, for example, that its operations can be invoked using any capability values obtained prior to the failure.

Implicit restart is indicated by specifying backup virtual machines when a resource or resource group is created. For example, the final clause of

    create lock_mgr() on vm_cap backups on vm_caps_array

specifies that the lock manager be restarted on one of the backup virtual machines in vm_caps_array should the original instance fail. The backups on clause may also be used in conjunction with the group create statement; in this case, a group member is automatically restarted on a backup virtual machine should it fail. This facility allows a resource group to automatically regain its original level of redundancy following a failure.

Another issue concerning restart is determining when the runtime of the recovering resource instance begins accepting outside invocations. In general, the resource is in an indeterminate state while performing recovery, so messages are only accepted after the recovery code has completed. The one exception is if the recovering instance itself initiates an invocation during recovery; in this case, invocations are accepted starting from the time that particular invocation terminates. This facilitates a system organization in which the recovering instance retrieves state variables from other resources during recovery.

3.3 Distributed Banking System Example

As an example of how the FT-SR collection of mechanisms can be used in concert to construct a fault-tolerant application, consider the data manager and stable storage modules from the distributed banking example outlined in Section 2. This example also illustrates the ease with which different fault-tolerance techniques can be used within the same program.

The data manager controls concurrency and provides atomic access to data items on stable storage. For simplicity, we assume that all data items are of the same type and are referred to by a logical address. Stable storage is read by invoking its read operation, which takes as arguments the address of the block to be read, the number of bytes, and a buffer in which the values read are to be returned. Data is written to stable storage by invoking an analogous write operation.

    resource dataManager
      imports globalDefs, lockManager, stableStore
      op startTransaction(tid: int; dataAddrs: addrList; numDataItems: int)
      op read(tid: int; dataAddrs: addrList; data: dataList; numDataItems: int)
      op write(tid: int; dataAddrs: addressList; data: dataList; numDataItems: int)
      op prepareToCommit(tid: int), commit(tid: int), abort(tid: int)
    body dataManager(dmId: int; lmcap: cap lockManager; ss: cap stableStore)
      type transInfoRec = rec(tid: int;
                              transStatus: int;
                              dataAddrs: addressList;
                              currentPointers: intArray;
                              memCopy: ptr dataArray;
                              numItems: int)
      var statusTable[1:MAX_TRANS]: transInfoRec; statusTableMutex: semaphore
      initial
        # initialize statusTable
        ...
        monitor(ss) send failHandler()
        monitor(lmcap) send failHandler()
      end initial
      ...code for startTransaction, prepareToCommit, commit, abort, read/write...
      proc failHandler()
        destroy myresource()
      end failHandler
      recovery
        ss.read(statusTable, sizeof(statusTable), statusTable)
        transManager.dmUp(dmId)
      end recovery
    end dataManager

Figure 5: Outline of dataManager resource

Figure 5 shows an outline of such a data manager. As can be seen from its specification, the data manager imports stable storage and lock manager resources, and exports six operations. startTransaction is invoked by the transaction manager to access data held by the data manager; its arguments are a transaction identifier tid and a list of addresses of the data items used during the transaction. read and write are used to access and modify objects. prepareToCommit and commit are invoked in succession upon completion to, first, commit any modifications made to the data items by the transaction and, second, complete the transaction. abort is used to abandon any modifications and terminate the transaction; it can be invoked at any time up to the time commit is first invoked. All these operations are implemented as SR procs, which means that each invocation results in the creation of a new thread to service that invocation. Finally, the data manager contains initial and recovery code, as well as a failure handler proc that deals with the failure of the lockManager and stableStore resources.

The data manager depends on the stable storage and lock manager resources to implement its operations correctly and so needs to be informed when they fail catastrophically. The data manager does this by establishing an asynchronous failure handler, failHandler, using the monitor statement. When invoked, failHandler terminates the data manager resource, thereby causing the failure to be propagated to the transaction manager.

The failure of the data manager itself is handled by recovery code that retrieves the current contents of key variables from stable storage. It is the responsibility of the transaction manager to deal with transactions that were in progress at the time of the failure; those for which commit had not yet been invoked are aborted, while commit is reissued for the others. To handle this, the recovery code sends a message to the transaction manager notifying it of the recovery.

Stable storage is implemented in our example by creating a storage resource and replicating it to increase failure resilience, as shown in Figure 6. Replica failures are dealt with by restarting the resource on another machine; this is done automatically by specifying backup virtual machines when stableStore is created (see Figure 7). A replica's recovery code starts by requesting the current state from the other group members. All replicas respond to this request; the first is received, while the others remain queued at the recvState operation until the replica is either destroyed or fails. The newly restarted replica begins processing queued messages upon finishing recovery. Since messages are queued from the point sendState is invoked, subsequent messages can be applied to the state normally to re-establish consistency.
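The essence of this recovery protocol is that updates queued from the moment the state request is sent can be replayed on top of the adopted state to bring the replica back in step. The Python sketch below models that idea only; the class shape and the (address, value) update format are invented for illustration.

```python
import queue

class ReplicaRecovery:
    """Sketch of the stableStore recovery scheme: adopt the first peer's
    state snapshot, then replay the updates queued since the request."""
    def __init__(self):
        self.queued = queue.Queue()   # messages queued during recovery
        self.store = None

    def recover(self, peer_states, updates_during_recovery):
        # Updates multicast to the group after sendState was issued are
        # queued rather than lost while this replica is recovering.
        for update in updates_during_recovery:
            self.queued.put(update)
        # The first reply supplies the snapshot; later replies stay queued.
        self.store = dict(peer_states[0])
        # Replaying the queued updates re-establishes consistency.
        while not self.queued.empty():
            address, value = self.queued.get()
            self.store[address] = value
```

Because every replica applies the same totally ordered updates, the snapshot plus the replayed queue leaves the restarted replica identical to its peers.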

    resource stableStore
      import globalDefs
      op read(address: int; numBytes: int; buffer: charArray)
      op write(address: int; numBytes: int; buffer: charArray)
      op sendState(sscap: cap stableStore)
      op recvState(objectStore: objList)
    body stableStore
      var store[MEMSIZE]: char
      process ss
        do true ->
          in read(address, numBytes, buffer) ->
            buffer[1:numBytes] := store[address:address+numBytes-1]
          [] write(address, numBytes, buffer) ->
            store[address:address+numBytes-1] := buffer[1:numBytes]
          [] sendState(rescap) -> send rescap.recvState(store)
          ni
        od
      end ss
      recovery
        send mygroup().sendState(myresource())
        receive recvState(store)
        send ss
      end recovery
    end stableStore

Figure 6: stableStore resource

The main resource that starts up the entire system is shown in Figure 7. Resource main creates a virtual machine on each of three physical machines. Two replicas of the stable storage module are then created, with the third virtual machine being used as a backup machine. The two data managers are then created, followed by the transaction manager.

This banking example has been implemented and tested. In addition, a number of other examples have been programmed using FT-SR to test its appropriateness for writing a variety of fault-tolerant distributed programs, as well as the larger thesis that high-level distributed programming languages are suitable for software of this type. These include a fault-tolerant version of the Dining Philosophers problem that shows how a single monitor statement can be used to implement a group membership service [20, 21], and a distributed word game that exploits multiple processors for increased performance as well as fault-tolerance. A description of all these examples together with complete code can be found in [22].

3.4 Language Design Issues

    resource main
      imports transManager, dataManager, stableStore, lockManager
    body main
      var virtMachines[3]: cap vm        # array of virtual machine capabilities
      var dataSS[2], tmSS: cap stableStore   # capabilities to stable stores
      var lm: cap lockManager; dm[2]: cap dataManager  # capabilities to lock and data managers

      virtMachines[1] := create vm() on "host1"
      virtMachines[2] := create vm() on "host2"
      virtMachines[3] := create vm() on "host3"   # backup machine

      # create stable storage for use by the data managers and the transaction manager
      dataSS[1] := create (i := 1 to 2) stableStore() on virtMachines[i]
                          backups on virtMachines[3]
      dataSS[2] := create (i := 1 to 2) stableStore() on virtMachines[i]
                          backups on virtMachines[3]
      tmSS := create (i := 1 to 2) stableStore() on virtMachines[i]
                     backups on virtMachines[3]

      # create lock manager, data managers, and transaction manager
      lm := create lockManager() on virtMachines[2]
      fa i := 1 to 2 ->
        dm[i] := create dataManager(i, lm, dataSS[i]) on virtMachines[i]
      af
      tm := create transManager(dm[1], dm[2], tmSS) on virtMachines[1]
    end main

Figure 7: System startup in resource main

The fault-tolerance mechanisms of FT-SR are designed with two important considerations in mind. The first is that the mechanisms be orthogonal, so that any interplay between these mechanisms not result in unexpected behavior. The second is that, whenever possible, these mechanisms use or form natural extensions to existing SR mechanisms. These considerations preserve the semantic integrity of the language and at the same time keep it relatively simple and, therefore, easy to understand and use. We illustrate these points with several examples.

FT-SR provides mechanisms for monitoring, failure handling, restarts, and replication, all of which can be meaningfully combined to achieve different effects. For example, both the monitor statement and backup operations work with groups just as they do with resources. In either case, a failure notification is generated when no resource or resource group member is available to handle invocations. Similarly, the restart statement can be used to restart entire groups, group members, or individual resources, with the same rules for execution of recovery code and acceptance of new invocations being used in each case.

Another example is that an operation implemented by a resource group can be used in

the same way as one implemented by a single resource, since the two capability values are

indistinguishable. In particular, group operations may be specified as failure handlers in

monitor statements or as backup operations in call statements, as well as normal invocations.

The parallels between resource groups and resources also extends to invocations from a

group; it is impossible to tell if an invocation originated from a group or an individual

resource.

The second aspect of good language design is that wherever possible, the fault-tolerance

mechanisms of FT-SR are integrated into existing SR mechanisms. For example, the group

create statement is a natural extension of the SR resource create statement, both in terms of

its syntax and semantics. Furthermore, a failure handler is essentially an operation that is

invoked as a result of a failure and is therefore expressed using existing language mechanisms.

Emphasizing these two aspects of language design has numerous advantages. The or-

thogonality of the FT-SR mechanisms allows a small set of mechanisms to be combined in

different ways to achieve different effects; the lack of restrictions or special cases governing

this combination eliminates any programming pitfalls that can snare a novice programmer.

The use of existing SR mechanisms keeps the language small and easy to learn, while allow-

ing the fault-tolerance aspects of the language to be blended with its concurrency aspects.

All these considerations lead to a logically and aesthetically integrated language design.

4 Implementation and Performance

4.1 Overview

The FT-SR implementation consists of two major components: a compiler and a runtime

system. Both are written in C and borrow from the existing implementation of SR where

possible. In fact, the FT-SR compiler is almost identical to the SR compiler, which is to be

expected since FT-SR is syntactically close to SR. The compiler is based on lex and yacc,

and consists of about 16,000 lines of code. It generates C code, which is in turn compiled by

a C compiler and linked with the FT-SR runtime system.

The FT-SR runtime system, which is significantly different from that of SR, provides

primitives for creating, destroying and monitoring resources and resource groups, handling

failures, restarting failed resources, invoking and servicing operations, and a variety of other

miscellaneous functions. It consists of 9600 lines of code and is implemented using ver-

sion 3.1 of the x-kernel. The major advantage of such a bare machine implementation is that

it facilitates experimentation with realistic fault-tolerant software systems when compared

to systems built, for example, on top of Unix. In addition, the x-kernel provides a flexible

infrastructure for composing communication protocols, something that has proven to be very

useful in building the variety of protocols required for the FT-SR runtime system.

Figure 8 shows the organization of the FT-SR runtime system on a single processor. As

shown, each FT-SR virtual machine exists in a separate x-kernel user address space. In

addition to the user program, a virtual machine contains those parts of the runtime system

that create and destroy resources, route invocations to operations on resources, and manage

intra-virtual machine communication. This user-resident part accounts for about 85% of the

runtime system, and the kernel-resident part for the remaining 15%.

The important runtime system modules and communication paths are also illustrated

in Figure 8. The Communication Manager consists of multiple communication protocols

that provide point-to-point and broadcast communication services between processors. The

VM Manager is responsible for creating and destroying virtual machines, and for providing

communication services between virtual machines. The Processor Failure Detector (PFD) is a

failure detector protocol; it monitors processors and notifies the VM manager when a failure

occurs. In user space, the Resource Manager is responsible for creating, destroying and

restarting resources, while the Group Manager is responsible for the analogous operations on

groups, as well as intergroup communication. The Resource Failure Detector (RFD) detects

resource failures.

4.2 Novel Features

Three interesting algorithms used within the FT-SR implementation are described in this

section. The first is related to group communication and is interesting because it uses a

variation of the primary replica approach to sequence invocations to a group. The second is

related to group reconfiguration and is interesting because no expensive election protocols

are used. Both these algorithms exploit a system parameter max sf—the maximum number

of simultaneous failures to be tolerated—to optimize performance. The third algorithm is

the failure detection and notification algorithm. It is interesting because it is implemented by

three modules at different levels of the system, with each module using the services provided

[Figure 8 is a block diagram of one processor: in user space, each virtual machine (VM 1, VM 2) contains the User Program, Group Manager, Invocation Manager, Resource Manager, and RFD; in kernel space sit the VM Manager, Communication Manager, and PFD.]

Figure 8: Organization of FT-SR runtime system

by the one below it.

Group Communication. Perhaps the most interesting aspect of replication is the algorithm

used to implement multicast invocations. The technique we use is similar to [23, 24], where

one replica is a primary through which all messages are funneled. Another max sf replicas

are designated as primary-group members, with the remaining being considered ordinary

members. Upon receiving a message, the primary adds a sequence number and multicasts it

to all replicas. Upon receipt, (only) primary-group members send acknowledgments. Once

the primary has received these max sf acknowledgements, it sends an acknowledgement to the original sender

of the message; this action is appropriate since the receipt of this many acknowledgements

guarantees that at least one replica will have the message even should max sf failures actually

occur. The primary is also involved in outgoing group invocations. In such situations, the

runtime system suppresses the invocation from all non-primary group members. When the

primary receives an acknowledgement that its invocation has been received, it relays that

information to the other group members.
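As a rough sketch of this protocol (in Python rather than the C runtime, modeling messages as direct function calls; all class and variable names here are ours, not the runtime's), the primary's side looks like this:

```python
# Sketch of the primary-funneled multicast. The primary sequences each
# message, sends it to all replicas, and acknowledges the original sender
# only after max_sf primary-group acks, so the message survives max_sf
# simultaneous failures. All names are ours, not the FT-SR runtime's.

MAX_SF = 2  # maximum number of simultaneous failures to tolerate

class Replica:
    def __init__(self, rid):
        self.rid = rid
        self.log = []          # (seq, msg) pairs delivered to this replica

    def deliver(self, seq, msg):
        self.log.append((seq, msg))
        return True            # stands in for an acknowledgement message

class Primary(Replica):
    def __init__(self, rid, primary_group, ordinary):
        super().__init__(rid)
        self.primary_group = primary_group  # MAX_SF replicas that must ack
        self.ordinary = ordinary            # remaining (non-acking) replicas
        self.next_seq = 0

    def multicast(self, msg):
        """Sequence msg and send it everywhere; return the sequence number
        once MAX_SF primary-group acknowledgements have arrived."""
        seq = self.next_seq
        self.next_seq += 1
        self.deliver(seq, msg)              # primary keeps a copy too
        acks = sum(r.deliver(seq, msg) for r in self.primary_group)
        for r in self.ordinary:             # ordinary members do not ack
            r.deliver(seq, msg)
        assert acks >= MAX_SF               # now safe to ack the sender
        return seq

pg = [Replica(1), Replica(2)]
ord_members = [Replica(3), Replica(4)]
primary = Primary(0, pg, ord_members)
primary.multicast("op: deposit(100)")
primary.multicast("op: withdraw(30)")
# every replica now holds both messages, in the same order
```

Because only the primary-group members acknowledge, the acknowledgement traffic seen by the primary is bounded by max sf regardless of group size, which is the source of the constant invocation cost reported below.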

From       To         Group Size   Invocation Time (msec)
resource   group      1             3.24
resource   group      2             6.84
resource   group      3             6.84
resource   group      3             8.35   (max sf = 2)
group      resource   3             7.19
group      group      3            14

Table 1: Times (in msec) for invocation involving groups

Table 1 shows the cost of invocations to and from resource groups. As can be seen, for

groups larger than max sf + 1, the cost of an invocation to the group is independent of group

size, a direct result of the above algorithm. This is especially significant given that a max sf

of one is sufficient for most systems [25]. This gives FT-SR a considerable advantage over

systems such as ISIS where the cost of an invocation grows linearly with the size of the

group.

Group Reconfiguration after Failure. The Group Manager at each site is responsible for

determining the primary and the members of the primary-group set. Specifically, it maintains

a list of all group members and whether it is the primary, a primary-group member, or an

ordinary member. This list is ordered consistently at all sites based on the order in which the

replicas were specified in the group create statement. This ordering ensures that all Group

Managers will independently pick the same primary and assign the same set of replicas to

the primary-group set.

The Group Managers are also responsible for dealing with the failure of group members.

If the primary fails, the first member of the primary-group is designated as the new primary.

This action or the failure of a primary-group member will cause the size of the primary-

group to fall below max sf, so an appropriate number of ordinary members are added to

the primary-group to restore its original size. No special action is needed when an ordinary

member fails. If backup virtual machines were specified for the group when it was created and

such machines are available, failed replicas are restarted automatically. Restarted replicas

join the group as ordinary members.
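Because every Group Manager holds the same create-order list, the role assignment after a failure can be viewed as a pure function of that list and the set of failed members. A Python sketch (our names, not the runtime's; replica restart is ignored here) of this deterministic rule:

```python
# Sketch of the election-free role assignment: every site applies the same
# rule to the same ordered member list, so all sites independently agree on
# the primary and the primary-group set. Names are ours, not the runtime's.

MAX_SF = 2  # size of the primary-group set

def assign_roles(members_in_create_order, failed):
    """Return {member: role} for survivors: the first survivor is primary,
    the next MAX_SF survivors form the primary group, the rest are ordinary."""
    alive = [m for m in members_in_create_order if m not in failed]
    roles = {}
    for i, m in enumerate(alive):
        if i == 0:
            roles[m] = "primary"
        elif i <= MAX_SF:
            roles[m] = "primary-group"
        else:
            roles[m] = "ordinary"
    return roles

group = ["r1", "r2", "r3", "r4", "r5"]  # order from the group create statement
print(assign_roles(group, failed=set()))
# After the primary r1 fails, r2 (the first primary-group member) takes
# over, and r4 is promoted into the primary group to restore its size:
print(assign_roles(group, failed={"r1"}))
```

Since the rule is deterministic and depends only on shared state, no messages need be exchanged to reconfigure, which is why no election protocol is required.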

Failure Detection and Notification. Failure detection in FT-SR is done at three levels: at

the processor level by the PFD, at the virtual machine level by the VM Manager, and at the

resource level by the RFD. Each PFD monitors the other processors and notifies the local

VM manager of any failures. The VM manager then maps these processor failures to virtual

machine failures and notifies the RFD. The RFD in turn maps virtual machine failures to

resource failures and passes this information on to any other runtime system module that

requested failure notification. To detect termination of a resource that is explicitly destroyed,

the RFD sends a message to its peer on the appropriate virtual machine asking to be notified

when the resource is destroyed. Similarly, a VM Manager can ask another VM Manager to

send a failure notification when a virtual machine is explicitly destroyed.
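The cascade from processor failures down to resource-level notifications can be sketched as follows (Python, with direct calls standing in for the PFD's actual monitoring protocol; the classes, mappings, and callback scheme are our illustrative assumptions):

```python
# Sketch of the three-level failure-notification cascade: processor ->
# virtual machine -> resource. Each level knows only the mapping one step
# down, mirroring PFD / VM Manager / RFD. All names here are ours.

class RFD:
    def __init__(self, resources_on_vm):
        self.resources_on_vm = resources_on_vm   # vm -> [resource, ...]
        self.subscribers = []                    # callbacks wanting notification

    def vm_failed(self, vm):
        """Map a VM failure to resource failures and notify subscribers."""
        for res in self.resources_on_vm.get(vm, []):
            for notify in self.subscribers:
                notify(res)

class VMManager:
    def __init__(self, vms_on_processor, rfd):
        self.vms_on_processor = vms_on_processor  # processor -> [vm, ...]
        self.rfd = rfd

    def processor_failed(self, proc):
        """Map a processor failure to VM failures and pass them to the RFD."""
        for vm in self.vms_on_processor.get(proc, []):
            self.rfd.vm_failed(vm)

# The PFD itself would monitor peer processors; here we invoke its
# upcall directly to exercise the cascade.
failed_resources = []
rfd = RFD({"vm1": ["dm1", "tmSS"], "vm2": ["dm2"]})
rfd.subscribers.append(failed_resources.append)
vmm = VMManager({"procA": ["vm1"], "procB": ["vm2"]}, rfd)

vmm.processor_failed("procA")   # PFD upcall: processor A has crashed
# failed_resources is now ["dm1", "tmSS"]
```

The explicit-destroy cases described above fit the same structure: a peer-to-peer request installs an extra subscriber at the appropriate level rather than changing the cascade itself.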

5 Conclusions

Numerous programming languages with support for fault-tolerance have been developed,

some as entirely new languages, some as extensions to existing languages and systems, and

some as libraries to existing languages. Examples of new languages include Argus [2],

Aeolus [26] and Plits [3]. Examples of extensions include Fault-Tolerant Concurrent C

(FTCC) [27], HOPS [28], and languages described in [29], [30] and [31]. Finally, fault-

tolerance library support is provided by Arjuna [32] for C++, and Avalon [33] for C++,

Common Lisp and Ada.

A distinguishing feature of these languages is the programming model they support. For

example, the transaction model is supported by Aeolus, Argus, Avalon, HOPS, Plits, and

Arjuna, while the replicated state machine approach [9] is supported by HOPS and FTCC. FT-

SR differs from all the above languages in supporting a model based on FS modules, which

allows any of these other approaches to be programmed easily. Another difference is that FT-

SR’s design as a set of extensions to a high-level distributed programming language greatly

enhances its usability. It simplifies the construction of fault-tolerant distributed programs by

allowing for the seamless integration of the distribution and fault-tolerance aspects of these

programs.

Despite these efforts, developing enhanced language support for fault tolerance is, in

some sense, a neglected area compared with the numerous efforts to develop new system

libraries or network protocols. However, our view is that research in this area has the

potential to render significant benefits. A language that offers a high-level realization of important

fault-tolerance abstractions frees programmers from the need to learn implementation

details or how a particular library can be used in a given context. The advantages of a single,

coherent package for expressing the program should not be underestimated either, especially

one based on a high-level distributed programming language that already offers a framework

for writing multi-process programs.

This paper has presented such a language-based approach to writing fault-tolerant dis-

tributed programs. Although the specifics of our approach are based on extending the SR

language, the FS module programming model and design principles could be applied equally

well to any similar language. It is also important when designing a language for such appli-

cations to pay sufficient attention to the implementation, especially the design of an efficient

runtime system. Although confirming experiments are continuing, our expectation is that the

user will pay little, if any, performance penalty for the advantages of a high-level language.

Acknowledgments

Thanks to G. Andrews, H. Bal, M. Hiltunen, D. Mosberger-Tang, R. Olsson, and the anony-

mous referees for reading earlier versions of this paper and providing valuable feedback.

References

[1] K. Birman, A. Schiper, and P. Stephenson, “Lightweight causal and atomic group multicast,” ACM Trans. Computer Systems, vol. 9, pp. 272–314, Aug 1991.

[2] B. Liskov, “The Argus language and system,” in Distributed Systems: Methods and Tools for Specification, LNCS, Vol. 190 (M. Paul and H. Siegert, eds.), ch. 7, pp. 343–430, Berlin: Springer-Verlag, 1985.

[3] C. Ellis, J. Feldman, and J. Heliotis, “Language constructs and support systems for distributed computing,” in ACM Symp. on Prin. of Dist. Comp., pp. 1–9, Aug 1982.

[4] H. Bal, “A comparative study of five parallel programming languages,” in Proc. EurOpen Conf. on Open Dist. Systems, May 1991.

[5] U.S. Dept. of Defense, Reference Manual for the Ada Programming Language. Washington D.C., 1983.

[6] C. A. R. Hoare, “Communicating sequential processes,” Commun. ACM, vol. 21, pp. 666–677, Aug 1978.

[7] G. R. Andrews and R. A. Olsson, The SR Programming Language: Concurrency in Practice. Benjamin/Cummings, 1993.

[8] N. Hutchinson and L. L. Peterson, “The x-Kernel: An architecture for implementing network protocols,” IEEE Trans. Softw. Eng., vol. 17, pp. 64–76, Jan 1991.

[9] F. Schneider, “Implementing fault-tolerant services using the state machine approach: A tutorial,” ACM Computing Surveys, vol. 22, pp. 299–319, Dec 1990.

[10] B. Lampson, “Atomic transactions,” in Distributed Systems—Architecture and Implementation (B. Lampson, M. Paul, and H. Siegert, eds.), ch. 11, pp. 246–265, Springer-Verlag, 1981.

[11] R. Schlichting and F. Schneider, “Fail-stop processors: An approach to designing fault-tolerant computing systems,” ACM Trans. Computer Systems, vol. 1, pp. 222–238, Aug 1983.

[12] P. Lee and T. Anderson, Fault Tolerance: Principles and Practice. Vienna: Springer-Verlag, second ed., 1990.

[13] F. Cristian, “Understanding fault-tolerant distributed systems,” Commun. ACM, vol. 34, pp. 56–78, Feb 1991.

[14] J. Gray, “Notes on data base operating systems,” in Operating Systems, An Advanced Course (R. Bayer, R. Graham, and G. Seegmuller, eds.), ch. 3.F, pp. 393–481, Springer-Verlag, 1979.

[15] G. Andrews et al., “An overview of the SR language and implementation,” ACM Trans. Prog. Lang. and Systems, vol. 10, pp. 51–86, Jan 1988.

[16] F. Cristian, H. Aghili, R. Strong, and D. Dolev, “Atomic broadcast: From simple message diffusion to Byzantine agreement,” in Proc. 15th Fault-Tolerant Computing Symp., pp. 200–206, June 1985.

[17] H. Kopetz et al., “Distributed fault-tolerant real-time systems: The Mars approach,” IEEE Micro, vol. 9, pp. 25–40, Feb 1989.

[18] P. Melliar-Smith, L. Moser, and V. Agrawala, “Broadcast protocols for distributed systems,” IEEE Trans. on Parallel and Distributed Systems, vol. 1, pp. 17–25, Jan 1990.

[19] D. Powell, ed., Delta-4: A Generic Architecture for Dependable Computing. Springer-Verlag, 1991.

[20] F. Cristian, “Reaching agreement on processor-group membership in synchronous distributed systems,” Distributed Computing, vol. 4, pp. 175–187, 1991.

[21] H. Kopetz, G. Grunsteidl, and J. Reisinger, “Fault-tolerant membership service in a synchronous distributed real-time system,” in Dependable Computing for Critical Applications (A. Avizienis and J.-C. Laprie, eds.), pp. 411–429, Wien: Springer-Verlag, 1991.

[22] V. Thomas, FT-SR: A Programming Language for Constructing Fault-Tolerant Distributed Systems. PhD thesis, Dept. of CS, Univ. of Arizona, 1993.

[23] J. Chang and N. Maxemchuk, “Reliable broadcast protocols,” ACM Trans. Computer Systems, vol. 2, pp. 251–273, Aug 1984.

[24] M. F. Kaashoek, A. Tanenbaum, S. Hummel, and H. Bal, “An efficient reliable broadcast protocol,” Operating Systems Review, vol. 23, pp. 5–19, Oct 1989.

[25] J. Gray, “Why do computers stop and what can be done about it,” in Proc. 5th Symp. on Reliability in Dist. Software and Database Systems, pp. 3–12, Jan 1986.

[26] R. LeBlanc and C. T. Wilkes, “Systems programming with objects and actions,” in Proc. 5th Conf. on Distributed Computing Systems, (Denver), pp. 132–139, May 1985.

[27] R. Cmelik, N. Gehani, and W. D. Roome, “Fault Tolerant Concurrent C: A tool for writing fault tolerant distributed programs,” in Proc. 18th Fault-Tolerant Computing Symp., pp. 55–61, June 1988.

[28] H. Madduri, “Fault-tolerant distributed computing,” Scientific Honeyweller, vol. Winter 1986–87, pp. 1–10, 1986.

[29] J. Knight and J. Urquhart, “On the implementation and use of Ada on fault-tolerant distributed systems,” IEEE Trans. Softw. Eng., vol. SE-13, pp. 553–563, May 1987.

[30] M. F. Kaashoek, R. Michiels, H. Bal, and A. Tanenbaum, “Transparent fault-tolerance in parallel Orca programs,” in Proc. USENIX Symp. on Exper. with Distributed and Multiprocessor Systems, pp. 297–311, Mar 1992.

[31] R. Schlichting, F. Cristian, and T. Purdin, “A linguistic approach to failure-handling in distributed systems,” in Dependable Computing for Critical Applications (A. Avizienis and J.-C. Laprie, eds.), pp. 387–409, Wien: Springer-Verlag, 1991.

[32] S. Shrivastava, G. Dixon, and G. Parrington, “An overview of the Arjuna distributed programming system,” IEEE Software, vol. 8, pp. 66–73, Jan 1991.

[33] M. Herlihy and J. Wing, “Avalon: Language support for reliable distributed systems,” in Proc. 17th Fault-Tolerant Computing Symp., pp. 89–94, July 1987.