Programming Language Support for Writing Fault-Tolerant Distributed Software
Richard D. Schlichting and Vicraj T. Thomas
Abstract
Good programming language support can simplify the task of writing fault-tolerant
distributed software. Here, an approach to providing such support is described in which
a general high-level distributed programming language is augmented with mechanisms
for fault tolerance. Unlike approaches based on sequential languages or specialized
languages oriented towards a given fault-tolerance technique, this approach gives the
programmer a high level of abstraction, while still maintaining flexibility and execution
efficiency. The paper first describes a programming model that captures the important
characteristics that should be supported by a programming language of this type. It then
presents a realization of this approach in the form of FT-SR, a programming language
that augments the SR distributed programming language with features for replication,
recovery, and failure notification. In addition to outlining these extensions, an example
program consisting of a data manager and its associated stable storage is given. Finally,
an implementation of the language that uses the x-kernel and runs standalone on a
network of Sun workstations is discussed. The overall structure and several of the
algorithms used in the runtime are interesting in their own right.
1 Introduction
Programmers faced with choosing a programming language for writing fault-tolerant dis-
tributed software—that is, software that must continue to provide service in a multicomputer
system despite failures in the underlying computing platform—often have few alternatives.
At one end of the spectrum are relatively low-level choices such as assembly language or C,
often coupled with a fault-tolerance library such as ISIS [1]. Such an approach can result
in good execution efficiency, yet forces the programmer to deal with the complexities of
distributed execution and fault-tolerance in a language that is fundamentally sequential. At

*This work was supported in part by the National Science Foundation under grant CCR-9003161
and the Office of Naval Research under grant N00014-91-J-1015.
the other end of the spectrum are high-level languages specifically intended for constructing
fault-tolerant applications using a given technique. Examples here include Argus [2] and
Plits [3], which support a programming model based on atomic actions. Such languages
simplify the problems considerably, yet can be overly constraining if the programmer desires
to use fault-tolerance techniques other than the one supported by the language [4]. The net
result is that neither option provides the ideal combination of features.
In this paper, we advocate an intermediate approach based on taking a general high-level
concurrent or distributed programming language such as Ada [5], CSP [6], or SR [7] and
augmenting it with additional mechanisms to facilitate fault-tolerance. Starting with a lan-
guage of this type offers a number of advantages. For example, unlike a low-level approach,
such languages allow the programmer to deal with multiple processes and interprocess com-
munication at a high level of abstraction, thereby simplifying the programming process.
Moreover, given a well-designed set of fault-tolerance extensions, such a language can give
the programmer a greater degree of flexibility than is found in current higher-level alterna-
tives. Such flexibility allows, for instance, the use of multiple fault-tolerance techniques,
something that can be important in certain types of software. In short, if done right, this
approach can offer a language that preserves many of the positive attributes of both sets of
alternatives.
The specific purpose of this paper is to elaborate on this approach, in two ways. First,
we present a programming model based on the notion of fail-stop modules that captures
the characteristics needed for a language oriented towards writing fault-tolerant distributed
software. Second, we describe a realization of this approach in the form of FT-SR, a
programming language based on augmenting the SR distributed programming language with
additional mechanisms for fault-tolerance. FT-SR has been implemented using the x-kernel,
an operating system designed for experimenting with communication protocols [8], and runs
standalone on a network of Sun workstations. The implementation structure and several of
the algorithms used in the runtime system are also interesting in their own right. We restrict
our attention in this paper to failures suffered by processors with fail-silent semantics—that
is, where the only failures are assumed to be a complete cessation of execution activity—
although the approach generalizes to other failure models as well.
2 Fail-Stop Modules and Program Design
A fail-stop (or FS) module is an abstract unit of encapsulation. Such a module contains one
or more threads of execution, which implement a collection of operations that are exported
and made available for invocation by other FS modules. When such an invocation occurs, the
operation normally executes to completion as an atomic unit, despite failures and concurrent
execution. The failure resilience of an FS module is increased either by composing modules
to form complex FS modules, or by using recovery techniques within the simple module
itself. Replicating a module N times on separate processors to create a high-level abstract
module that can survive N-1 failures is an example of the former [9], while including a
recovery protocol that reads a checkpointed state from stable storage [10] is an example of
the latter.
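The replication case can be sketched in Python. This is an illustrative model only, not FT-SR code; the class and operation names are hypothetical. A complex FS module built from N replicas masks up to N-1 replica failures and raises a catastrophic-failure notification only when the last replica is gone.

```python
class CatastrophicFailure(Exception):
    """Notification generated when a module's redundancy is exhausted."""

class Replica:
    """A simple FS module instance (hypothetical)."""
    def __init__(self):
        self.failed = False
        self.state = {}

    def deposit(self, acct, amount):
        if self.failed:
            raise RuntimeError("replica down")
        self.state[acct] = self.state.get(acct, 0) + amount
        return self.state[acct]

class ReplicatedFSModule:
    """A complex FS module composed of N replicas; survives N-1 failures."""
    def __init__(self, n):
        self.replicas = [Replica() for _ in range(n)]

    def invoke(self, op, *args):
        alive = [r for r in self.replicas if not r.failed]
        if not alive:
            # Redundancy exhausted: the abstraction fails catastrophically.
            raise CatastrophicFailure("all replicas failed")
        # Apply the operation at every functioning replica (state-machine style).
        results = [getattr(r, op)(*args) for r in alive]
        return results[0]

mod = ReplicatedFSModule(3)
mod.invoke("deposit", "acct1", 100)
mod.replicas[0].failed = True
mod.replicas[1].failed = True
balance = mod.invoke("deposit", "acct1", 50)   # masked: one replica still functions
```

The invocation after two failures still succeeds; only a third failure would surface as a notification to the module's users.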
The other key aspect of FS modules is failure notification. Notification is generated
whenever a failure exhausts the redundancy of a (simple or complex) FS module, resulting in
complete failure of the abstraction being implemented. The notification can then be fielded
by other modules that use the failed module so that they can react to the loss of functionality.
For example, if N-fold replication is used to construct a complex FS module, notification
would be generated should a failure destroy the Nth copy, assuming no recovery. We refer to
a failure that exhausts redundancy in this way as a catastrophic failure. Notification is also
generated if a module is explicitly destroyed by programmer action. Note that the analogy
to fail-stop processors [11] implied by the term “fail-stop modules” is strong: in both cases,
either the abstraction is maintained (processor or module) or notification is provided. FS
modules are also similar in some respects to the “Ideal Fault-Tolerant Components” described
in [12].
FS modules form the building blocks out of which a fault-tolerant distributed program
can be constructed. As an example, consider the simple distributed banking system shown in
Figure 1. Each box represents an FS module, with the dependencies between modules
represented by arrows [13].

[Figure: a Transaction Manager at the top invokes Data Manager, Lock Manager, and
Stable Storage modules on Host 1 and Host 2; arrows are labeled with the exported
operations, including startTransaction, prepareToCommit, commit, abort, read/write,
lock/unlock, deposit/withdraw, and transfer.]

Figure 1: Fault-tolerant system structured using FS modules

User accounts are assumed to be partitioned across two processors,
with each data manager module managing the collection of accounts on its machine. The
user interacts with the transaction manager, which in turn uses the data managers and a stable
storage module to implement transactions using, for example, the two-phase commit protocol
[14] and logging. The data managers export operations to read and write user accounts, and
to implement the two-phase commit protocol. The stable storage modules are used to store
the user data and to maintain key values for recovery purposes. The lock managers are used
to control concurrent access.
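The transaction manager's use of two-phase commit can be sketched as follows. This is a minimal illustrative model in Python, not the paper's FT-SR code; all names (`DataManager`, `transfer`, and so on) are hypothetical, and durability, logging, and locking are omitted.

```python
class DataManager:
    """Participant in two-phase commit (hypothetical sketch)."""
    def __init__(self):
        self.data, self.pending = {}, {}

    def write(self, tid, key, value):
        self.pending.setdefault(tid, {})[key] = value   # tentative write

    def prepare_to_commit(self, tid):
        # Vote yes only if the tentative writes for this transaction exist.
        return tid in self.pending

    def commit(self, tid):
        self.data.update(self.pending.pop(tid, {}))     # make writes visible

    def abort(self, tid):
        self.pending.pop(tid, None)                     # discard tentative writes

def transfer(tid, dm_from, dm_to, acct_from, acct_to, amount, balances):
    """Transaction-manager side of a transfer, using two-phase commit."""
    dm_from.write(tid, acct_from, balances[acct_from] - amount)
    dm_to.write(tid, acct_to, balances[acct_to] + amount)
    # Phase 1: every participant must vote to commit.
    if all(dm.prepare_to_commit(tid) for dm in (dm_from, dm_to)):
        # Phase 2: commit everywhere.
        for dm in (dm_from, dm_to):
            dm.commit(tid)
        return True
    for dm in (dm_from, dm_to):
        dm.abort(tid)
    return False

dm1, dm2 = DataManager(), DataManager()
ok = transfer(1, dm1, dm2, "a", "b", 25, {"a": 100, "b": 0})
```

Either both data managers apply the transfer or neither does, which is exactly the all-or-nothing behavior the FS-module model asks of the transaction manager.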
To increase the overall system dependability, the constituent FS modules would be
constructed using fault-tolerance techniques. For example, the transaction and data managers
might use recovery protocols to ensure data consistency following failure. Similarly, stable
storage might be replicated. The failure notification aspect of FS modules can be used to
allow modules to react to the failures of modules upon which they depend. If such a failure
cannot be tolerated, it may, in turn, be propagated up the dependency graph. At the top
level, this would be seen as the catastrophic failure of the transaction manager and hence,
the system. This might occur, for example, should the redundant copies of the stable storage
module all fail concurrently.
The failure notification and composability aspects of FS modules are what makes this
programming model so useful. Ideally, a fault-tolerant program behaves as an FS module:
commands to the program are executed completely or a failure notification is generated.
This assures users that, absent notification, their commands have been correctly processed.
Such a program is much easier to develop if each of its components are in turn implemented
by FS modules. Since component failures are detectable, other components do not have
to implement complicated failure detection schemes or deal with erroneous results. These
components may in turn be implemented by other FS modules, with this process continuing
until the simplest components are implemented by simple FS modules. At each level, the
guarantees made by FS modules simplify the composition process.
3 The FT-SR Language
The programming model presented in the previous section provides a framework and rationale
to guide the design of fault-tolerance extensions for a high-level distributed programming
language. Here, we present FT-SR, the result of following this design process for SR. To
support the model, the language has provisions for encapsulation based on SR resources,
resource replication, recovery protocols, and both synchronous and asynchronous failure
notification. Familiarity with SR is assumed, although many of its constructs should be
intuitive; details can be found in [7, 15].
3.1 Simple FS Modules
Most distributed programming languages, including SR, have module constructs that provide
many of the properties needed to realize a simple FS module. In SR, these modules are called
resources. Each resource is populated by a varying number of processes that implement
operations that are exported for invocation from other resources. As an example, consider
resource lock_manager
  op get_lock(cap client) returns int
  op rel_lock(cap client; int)
body lock_manager
  var ...variable declarations...
  process lock_server
    do true ->
      in get_lock(client_cap) and lock_available() ->
        ...mark lock_id as being held by client_cap...
        return lock_id
      [] rel_lock(client_cap, lock_id) ->
        ...release lock...
        return
      ni
    od
  end lock_server
end lock_manager
Figure 2: Lock Manager resource
the simple lock manager resource shown in Figure 2. This resource contains a single process
that exports two operations, get_lock and rel_lock. If a client invokes the get_lock
operation and the lock is available, a lock id is returned and the client can proceed. If the
lock is unavailable, the client is blocked at the first guard of the input statement, a multiway
receive with semantics similar to Ada’s Select statement. get_lock takes as its argument
the capability of the invoking client, which is used as an identifier.
Given that resources export operations and contain multiple processes, the only aspect
of simple FS modules that SR does not support directly is failure notification. Accordingly,
FT-SR includes provisions for both generating and fielding such notifications. The language
runtime is responsible for generating notifications when processor failures are detected, so
further discussion of that part is deferred to Section 4. For fielding notifications, FT-SR
supports two different models. The first is synchronous with respect to a call; in this case, the
notification is fielded by an optional backup operation specified in the calling statement. The
second is asynchronous; in this case, the programmer specifies a resource to be monitored
and an operation to be invoked should the monitored resource fail.
To understand the need for these two kinds of failure notification, consider what might
happen if the lock manager shown in Figure 2 or any of its clients fail. If the lock manager
fails, all clients that are blocked on its input statement will remain blocked forever. To
resource client
  op ...
  op ...
body client()
  var lock_id: int
  op mgr_failed(cap client) returns int
  ...
  lock_id := call {lock_mgr_cap.get_lock, mgr_failed}(myresource())
  ...
  proc mgr_failed(client_cap) returns lock_err
    return LOCK_ERR
  end mgr_failed
end client
Figure 3: Outline of Lock Manager client
handle this situation, clients can use the synchronous failure notification facility to unblock
themselves and take some recovery action.
Figure 3 shows the outline of a client structured in this way. Bracketed with the normal
invocation is the capability for a backup operation, mgr_failed. The backup is invoked
should the original call fail, i.e., if the lock manager fails to reply within a certain amount
of time (see Section 4 for details). In this example, the backup operation is implemented
locally, although it could just as easily have been implemented in another resource. Note that
the backup is called with the same arguments as the original operation, implying that the two
operations must be type compatible.
Consider now the inverse situation where a client fails while holding a lock. The server
can use the FT-SR asynchronous failure notification facility to detect such a failure and
release the lock, as shown in Figure 4. Here, monitor is used to enable monitoring of
the client instance specified by the capability client_cap. If the client is down when the
statement is executed or should it subsequently fail, rel_lock will be implicitly invoked by
the language runtime system with client_cap and lock_id as arguments. Monitoring
is terminated by monitorend or by another monitor statement that specifies the same
resource.
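The two notification styles can be modeled outside SR as well. The Python sketch below is a hypothetical analogue, not FT-SR itself (FT-SR expresses these as language constructs, not library calls): `call_with_backup` mimics a synchronous notification fielded by a backup operation, and `monitor`/`notify_failure` mimic the asynchronous monitor facility.

```python
class FailedResource(Exception):
    """Stands in for the runtime's failure detection (e.g. a timed-out call)."""

monitors = {}   # resource name -> handler the runtime invokes on failure

def monitor(resource, handler):
    """Asynchronous model: register an operation to run if `resource` fails."""
    monitors[resource] = handler

def notify_failure(resource):
    """Runtime side: deliver the asynchronous notification, if one is registered."""
    handler = monitors.pop(resource, None)
    if handler:
        handler(resource)

def call_with_backup(op, backup, *args):
    """Synchronous model: if the call fails, the backup operation is invoked
    with the same arguments (so the two must be type compatible)."""
    try:
        return op(*args)
    except FailedResource:
        return backup(*args)

# Client side: the lock manager is down, so the backup operation fields the failure.
LOCK_ERR = -1
def get_lock(client):
    raise FailedResource("lock manager down")
def mgr_failed(client):
    return LOCK_ERR

result = call_with_backup(get_lock, mgr_failed, "client-1")

# Server side: monitor a client so its lock is released if the client fails.
released = []
monitor("client-1", lambda r: released.append(r))
notify_failure("client-1")     # the client crashes; the handler runs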
resource lock_manager
  op get_lock(cap client) returns int
  op rel_lock(int)
body lock_manager
  var ...variable declarations...
  process lock_server
    do true ->
      in get_lock(client_cap) and lock_available() ->
        ...mark lock_id as being held by client_cap...
        monitor client_cap send rel_lock(client_cap, lock_id)
        return lock_id
      [] rel_lock(client_cap, lock_id) ->
        ...release lock if held by client_cap...
        monitorend client_cap
        return
      ni
    od
  end lock_server
end lock_manager
Figure 4: Lock Manager with client monitoring
3.2 Composition and other Fault-Tolerance Mechanisms
FT-SR provides mechanisms for using recovery techniques within simple FS modules and
for composing simple FS modules using replication. The replication facility allows multiple
copies of a resource to be created, with the language and runtime providing the illusion that
the collection is a single resource instance exporting the same set of operations. The SR
create statement has been generalized to allow for the creation of such replicated resources,
which we call a resource group. For example, the statement
lock_mgr_cap := create (i := 1 to N) lock_manager() on vm_caps[i]
creates a resource group with N identical instances of the resource lock_manager on the
SR virtual machines specified by the array vm_caps. The value returned is a resource
capability that provides access to the operations implemented by the new resource group. In
particular, this capability is a resource group capability that allows multicast invocation of
any of the group’s exported operations. In other words, using this capability in a call or a
send causes the invocation to be multicast to each of the individual resource instances that
make up the group.
A multicast invocation provides certain guarantees. One is that all such invocations
are delivered to the runtime of each resource instance in a consistent total order, although
the program may vary this if desired. This means, for example, that if two operations
implemented by alternatives of an input statement are enabled simultaneously, the order in
which they will be executed is consistent across all functioning replicas unless explicitly
overridden. Moreover, the multicast is also done atomically, so that either all functioning
replicas receive the invocation or none do. This combination of properties means that a
multicast invocation is equivalent to an atomic broadcast, a facility that has proven useful
for constructing many types of fault-tolerant distributed systems [16, 17, 18, 19].
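One conventional way to obtain such a consistent total order is a sequencer that stamps every multicast with a global sequence number; replicas then deliver in stamp order with no gaps. This is an illustrative sketch under that assumption, not necessarily FT-SR's own algorithm (Section 4 notes the runtime uses a variation of the primary-replica approach), and atomic delivery to all functioning replicas is simply assumed here.

```python
import itertools

class Sequencer:
    """Assigns a global sequence number to each multicast invocation."""
    def __init__(self):
        self.counter = itertools.count()

    def multicast(self, group, invocation):
        seq = next(self.counter)
        # Atomicity assumed: every functioning replica receives the message.
        for replica in group:
            replica.deliver(seq, invocation)

class Replica:
    """Buffers out-of-order messages and delivers them in sequence order."""
    def __init__(self):
        self.pending = {}
        self.next_seq = 0
        self.log = []

    def deliver(self, seq, invocation):
        self.pending[seq] = invocation
        while self.next_seq in self.pending:   # deliver in order, no gaps
            self.log.append(self.pending.pop(self.next_seq))
            self.next_seq += 1

group = [Replica(), Replica(), Replica()]
seq = Sequencer()
seq.multicast(group, "get_lock(a)")
seq.multicast(group, "rel_lock(a)")
```

Because every replica applies the same invocations in the same order, nondeterministic choices such as which input-statement alternative runs first resolve identically everywhere.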
Provisions are also made for coordinating outgoing invocations generated within a re-
source group. There are two kinds of invocations that can be generated by a group member.
The first is a private invocation, which a member uses to communicate with a resource
instance individually without coordination with other group members. This can be used, for
example, to allow each replica to have its own set of private resources. The other is a group
invocation, which a group uses to generate a single outgoing invocation on behalf of the
entire group.
To distinguish between these two kinds of communication, FT-SR supports capability
variables of type private_cap. Invocations made using a private capability are considered
private communication and are not coordinated with invocations from other group members.
Invocations using regular capability variables are, however, group invocations that generate
exactly one invocation. The invocation is actually transmitted when one of the members
reaches the statement, with later instances being suppressed by the language runtime system.
Note that either type of invocation will be a multicast invocation if the capability is a resource
group capability.
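The "exactly one invocation" rule can be sketched as duplicate suppression keyed on a deterministic invocation id: the first group member to reach the statement transmits, and the runtime suppresses the later, identical attempts. The Python below is a hypothetical model of that runtime behavior, not FT-SR code.

```python
sent = {}   # invocation id -> result, recorded by the (modeled) runtime

def group_invoke(member_id, invocation_id, target, *args):
    """First member to reach the statement transmits; duplicates are suppressed."""
    if invocation_id in sent:
        return sent[invocation_id]      # later instance: suppressed, same result
    result = target(*args)
    sent[invocation_id] = result
    return result

calls = []
def stable_store_write(addr, data):
    """Hypothetical target operation; records each actual transmission."""
    calls.append((addr, data))
    return "ok"

# All three replicas of a group execute the same outgoing send statement,
# tagged with the same deterministic invocation id:
for member in range(3):
    group_invoke(member, ("write", 1), stable_store_write, 0x10, "balance=75")
```

Only one physical invocation leaves the group, yet every member observes the same result, which is what lets a replicated resource interact with the outside world as if it were a single instance.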
FT-SR also provides the programmer with the ability to restart a failed resource instance
on a functioning virtual machine. The recovery code to be executed in this situation is
denoted by the keywords recovery and end. Restart can be either explicit or implicit. An
explicit restart is done by
restart lock_mgr_cap() on vm_cap
which restarts the resource indicated by lock_mgr_cap and executes any specified recovery
code. An entire resource group can be restarted using syntax similar to the create statement.
In both cases, the restarted resource instance is, in fact, a re-creation of the failed instance
and not a new instance. This means, for example, that its operations can be invoked using
any capability values obtained prior to the failure.
Implicit restart is indicated by specifying backup virtual machines when a resource or
resource group is created. For example, the final clause of
create lock_mgr() on vm_cap backups on vm_caps_array
specifies that the lock manager be restarted on one of the backup virtual machines in
vm_caps_array should the original instance fail. The backups on clause may also
be used in conjunction with the group create statement; in this case, a group member is
automatically restarted on a backup virtual machine should it fail. This facility allows a
resource group to automatically regain its original level of redundancy following a failure.
Another issue concerning restart is determining when the runtime of the recovering
resource instance begins accepting outside invocations. In general, the resource is in an
indeterminate state while performing recovery, so messages are only accepted after the
recovery code has completed. The one exception is if the recovering instance itself initiates
an invocation during recovery; in this case, invocations are accepted starting from the time
that particular invocation terminates. This facilitates a system organization in which the
recovering instance retrieves state variables from other resources during recovery.
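This acceptance rule can be sketched directly: invocations that arrive while a resource is recovering are queued, and service begins only once the recovery code completes. The Python model below is illustrative only (the class and method names are hypothetical), and it shows just the common case, not the exception for invocations initiated during recovery.

```python
from collections import deque

class RecoveringResource:
    """Queues incoming invocations until recovery code has completed."""
    def __init__(self):
        self.recovering = True
        self.queue = deque()
        self.state = None
        self.served = []

    def invoke(self, msg):
        if self.recovering:
            self.queue.append(msg)   # held: the state is still indeterminate
            return
        self.served.append(msg)

    def recover(self, checkpoint):
        self.state = checkpoint      # e.g. state read back from stable storage
        self.recovering = False
        while self.queue:            # drain messages queued during recovery
            self.served.append(self.queue.popleft())

r = RecoveringResource()
r.invoke("read(acct1)")              # arrives mid-recovery: queued, not lost
r.recover({"acct1": 75})
r.invoke("write(acct1, 80)")         # served normally after recovery
```

Queuing rather than rejecting means callers see at worst a delay, never an answer computed from an inconsistent intermediate state.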
3.3 Distributed Banking System Example
As an example of how the FT-SR collection of mechanisms can be used in concert to
construct a fault-tolerant application, consider the manager and stable storage modules from
the distributed banking example outlined in Section 2. This example also illustrates the ease
with which different fault-tolerance techniques can be used within the same program.
The data manager controls concurrency and provides atomic access to data items on stable
storage. For simplicity, we assume that all data items are of the same type and are referred
resource dataManager
  imports globalDefs, lockManager, stableStore
  op startTransaction(tid: int; dataAddrs: addrList; numDataItems: int)
  op read(tid: int; dataAddrs: addrList; data: dataList; numDataItems: int)
  op write(tid: int; dataAddrs: addressList; data: dataList; numDataItems: int)
  op prepareToCommit(tid: int), commit(tid: int), abort(tid: int)
body dataManager(dmId: int; lmcap: cap lockManager; ss: cap stableStore)
  type transInfoRec = rec(tid: int;
                          transStatus: int;
                          dataAddrs: addressList;
                          currentPointers: intArray;
                          memCopy: ptr dataArray;
                          numItems: int)
  var statusTable[1:MAX_TRANS]: transInfoRec; statusTableMutex: semaphore

  initial
    # initialize statusTable
    ...
    monitor(ss) send failHandler()
    monitor(lmcap) send failHandler()
  end initial

  ...code for startTransaction, prepareToCommit, commit, abort, read/write...

  proc failHandler()
    destroy myresource()
  end failHandler

  recovery
    ss.read(statusTable, sizeof(statusTable), statusTable);
    transManager.dmUp(dmId);
  end recovery
end dataManager
Figure 5: Outline of dataManager resource
to by a logical address. Stable storage is read by invoking its read operation, which takes
as arguments the address of the block to be read, the number of bytes, and a buffer in which
the values read are to be returned. Data is written to stable storage by invoking an analogous
write operation.
Figure 5 shows an outline of such a data manager. As can be seen from its specification,
the data manager imports stable storage and lock manager resources, and exports six oper-
ations. startTransaction is invoked by the transaction manager to access data held
by the data manager; its arguments are a transaction identifier tid and a list of addresses of
the data items used during the transaction. read and write are used to access and modify
objects. prepareToCommit and commit are invoked in succession upon completion
to first, commit any modifications made to the data items by the transaction, and second,
complete the transaction. abort is used to abandon any modifications and terminate the
transaction; it can be invoked at any time up to the time commit is first invoked. All
these operations are implemented as SR procs, which means that invocations result in the
creation of a new thread to service that invocation. Finally, the data manager contains ini-
tial and recovery code, as well as a failure handler proc that deals with the failure of the
lockManager and stableStore resources.
The data manager depends on the stable storage and lock manager resources to implement
its operations correctly and so, needs to be informed when they fail catastrophically. The data
manager does this by establishing an asynchronous failure handler failHandler using the
monitor statement. When invoked, failHandler terminates the data manager resource,
thereby causing the failure to be propagated to the transaction manager.
The failure of the data manager itself is handled by recovery code that retrieves the current
contents of key variables from stable storage. It is the responsibility of the transaction
manager to deal with transactions that were in progress at the time of the failure; those
for which commit had not yet been invoked are aborted, while commit is reissued for
the others. To handle this, the recovery code sends a message to the transaction manager
notifying it of the recovery.
Stable storage is implemented in our example by creating a storage resource and replicat-
ing it to increase failure resilience, as shown in Figure 6. Replica failures are dealt with by
restarting the resource on another machine; this is done automatically by specifying backup
virtual machines when stableStore is created (see Figure 7). A replica’s recovery code
starts by requesting the current state from the other group members. All replicas respond
to this request; the first is received, while the others remain queued at the recvState
operation until the replica is either destroyed or fails. The newly restarted replica begins
processing queued messages upon finishing recovery. Since messages are queued from the
point sendState is invoked, subsequent messages can be applied to the state normally to
re-establish consistency.
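The state-transfer step in this recovery scheme can be modeled as follows. This Python sketch is illustrative, not the SR code of Figure 6: the restarted replica asks the group for its state, takes the first reply, and then re-applies whatever writes queued up while recovery was in progress.

```python
class RestartedReplica:
    """Models a stable-storage replica recovering by state transfer (hypothetical)."""
    def __init__(self):
        self.mailbox = []   # writes multicast to the group during recovery queue here
        self.store = None

    def recover(self, peers):
        # Ask every functioning group member for its current state; all reply,
        # but only the first reply is consumed (the rest would stay queued).
        replies = [peer.send_state() for peer in peers]
        self.store = dict(replies[0])
        # Messages queued since the state request are now applied in order,
        # bringing this replica back into consistency with the group.
        for addr, value in self.mailbox:
            self.store[addr] = value
        self.mailbox = []

class Peer:
    """A functioning group member serving a state request."""
    def __init__(self, store):
        self.store = store
    def send_state(self):
        return self.store

new = RestartedReplica()
new.mailbox.append((0x20, "y"))                  # a write arriving mid-recovery
new.recover([Peer({0x10: "x"}), Peer({0x10: "x"})])
```

Because the queued writes are totally ordered with respect to the state snapshot, replaying them yields the same store the other replicas hold.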
The main resource that starts up the entire system is shown in Figure 7. Resource main
creates a virtual machine on each of three physical machines. Two replicas of the stable
storage module are then created, with the third virtual machine being used as a backup
resource stableStore
  import globalDefs
  op read(address: int; numBytes: int; buffer: charArray)
  op write(address: int; numBytes: int; buffer: charArray)
  op sendState(sscap: cap stableStore)
  op recvState(objectStore: objList)
body stableStore
  var store[MEMSIZE]: char

  process ss
    do true ->
      in read(address, numBytes, buffer) ->
        buffer[1:numBytes] := store[address:address+numBytes-1]
      [] write(address, numBytes, buffer) ->
        store[address:address+numBytes-1] := buffer[1:numBytes]
      [] sendState(rescap) ->
        send rescap.recvState(store)
      ni
    od
  end ss

  recovery
    send mygroup().sendState(myresource())
    receive recvState(store); send ss
  end recovery
end stableStore
Figure 6: stableStore resource
machine. The two data managers are then created followed by the transaction manager.
This banking example has been implemented and tested. In addition, a number of other
examples have been programmed using FT-SR to test its appropriateness for writing a variety
of fault-tolerant distributed programs, as well as the larger thesis that high-level distributed
programming languages are suitable for software of this type. These include a fault-tolerant
version of the Dining Philosophers problem that shows how a single monitor statement can
be used to implement a group membership service [20, 21], and a distributed word game
that exploits multiple processors for increased performance as well as fault-tolerance. A
description of all these examples together with complete code can be found in [22].
3.4 Language Design Issues
The fault-tolerance mechanisms of FT-SR are designed with two important considerations
in mind. The first is that the mechanisms be orthogonal, so that any interplay between these
mechanisms does not result in unexpected behavior. The second is that, whenever possible, these
mechanisms use or form natural extensions to existing SR mechanisms. These considerations
resource main
  imports transManager, dataManager, stableStore, lockManager
body main
  var virtMachines[3]: cap vm                      # array of virtual machine capabilities
      dataSS[2], tmSS: cap stableStore             # capabilities to stable stores
      lm: cap lockManager; dm[2]: cap dataManager  # capabilities to lock and data managers

  virtMachines[1] := create vm() on "host1"
  virtMachines[2] := create vm() on "host2"
  virtMachines[3] := create vm() on "host3"        # backup machine

  # create stable storage for use by the data managers and the transaction manager
  dataSS[1] := create (i := 1 to 2) stableStore() on virtMachines[i]
               backups on virtMachines[3]
  dataSS[2] := create (i := 1 to 2) stableStore() on virtMachines[i]
               backups on virtMachines[3]
  tmSS := create (i := 1 to 2) stableStore() on virtMachines[i]
          backups on virtMachines[3]

  # create lock manager, data managers, and transaction manager
  lm := create lockManager() on virtMachines[2]
  fa i := 1 to 2 ->
    dm[i] := create dataManager(i, lm, dataSS[i]) on virtMachines[i]
  af
  tm := create transManager(dm[1], dm[2], tmSS) on virtMachines[1]
end main
Figure 7: System startup in resource main
preserve the semantic integrity of the language and at the same time keep it relatively simple
and therefore, easy to understand and use. We illustrate these points with several examples.
FT-SR provides mechanisms for monitoring, failure handling, restarts, and replication,
all of which can be meaningfully combined to achieve different effects. For example, both
the monitor statement and backup operations work with groups just as they do with resources.
In either case, a failure notification is generated when no resource or resource group member
is available to handle invocations. Similarly, the restart statement can be used to restart
entire groups, group members, or individual resources, with the same rules for execution of
recovery code and acceptance of new invocations being used in each case.
Another example is that an operation implemented by a resource group can be used in
the same way as one implemented by a single resource, since the two capability values are
indistinguishable. In particular, group operations may be specified as failure handlers in
monitor statements or as backup operations in call statements, as well as normal invocations.
The parallels between resource groups and resources also extend to invocations from a
group; it is impossible to tell if an invocation originated from a group or an individual
resource.
The second aspect of good language design is that wherever possible, the fault-tolerance
mechanisms of FT-SR are integrated into existing SR mechanisms. For example, the group
create statement is a natural extension of the SR resource create statement, both in terms of
its syntax and semantics. Furthermore, a failure handler is essentially an operation that is
invoked as a result of a failure and is therefore expressed using existing language mechanisms.
Emphasizing these two aspects of language design has numerous advantages. The or-
thogonality of the FT-SR mechanisms allows a small set of mechanisms to be combined in
different ways to achieve different effects; the lack of restrictions or special cases governing
this combination eliminates any programming pitfalls that can snare a novice programmer.
The use of existing SR mechanisms keeps the language small and easy to learn, while allow-
ing the fault-tolerance aspects of the language to be blended with its concurrency aspects.
All these considerations lead to a logically and aesthetically integrated language design.
4 Implementation and Performance
4.1 Overview
The FT-SR implementation consists of two major components: a compiler and a runtime
system. Both are written in C and borrow from the existing implementation of SR where
possible. In fact, the FT-SR compiler is almost identical to the SR compiler, which is to be
expected since FT-SR is syntactically close to SR. The compiler is based on lex and yacc,
and consists of about 16,000 lines of code. It generates C code, which is in turn compiled by
a C compiler and linked with the FT-SR runtime system.
The FT-SR runtime system, which is significantly different from that of SR, provides
primitives for creating, destroying and monitoring resources and resource groups, handling
failures, restarting failed resources, invoking and servicing operations, and a variety of other
miscellaneous functions. It consists of 9600 lines of code and is implemented using ver-
sion 3.1 of the x-kernel. The major advantage of such a bare machine implementation is that
it facilitates experimentation with realistic fault-tolerant software systems when compared
to systems built, for example, on top of Unix. In addition, the x-kernel provides a flexible
infrastructure for composing communication protocols, something that has proven to be very
useful in building the variety of protocols required for the FT-SR runtime system.
Figure 8 shows the organization of the FT-SR runtime system on a single processor. As
shown, each FT-SR virtual machine exists in a separate x-kernel user address space. In
addition to the user program, a virtual machine contains those parts of the runtime system
that create and destroy resources, route invocations to operations on resources, and manage
intra-virtual machine communication. This user-resident part accounts for about 85% of the runtime system, with the kernel-resident part making up the remaining 15%.
The important runtime system modules and communication paths are also illustrated
in Figure 8. The Communication Manager consists of multiple communication protocols
that provide point-to-point and broadcast communication services between processors. The
VM Manager is responsible for creating and destroying virtual machines, and for providing
communication services between virtual machines. The Processor Failure Detector (PFD) is a
failure detector protocol; it monitors processors and notifies the VM manager when a failure
occurs. In user space, the Resource Manager is responsible for creating, destroying and
restarting resources, while the Group Manager is responsible for the analogous operations on
groups, as well as intergroup communication. The Resource Failure Detector (RFD) detects
resource failures.
4.2 Novel Features
Three interesting algorithms used within the FT-SR implementation are described in this section. The first is related to group communication and is interesting because it uses a variation of the primary-replica approach to sequence invocations to a group. The second is related to group reconfiguration and is interesting because no expensive election protocols are used. Both of these algorithms exploit a system parameter, max_sf (the maximum number of simultaneous failures to be tolerated), to optimize performance. The third algorithm is the failure detection and notification algorithm. It is interesting because it is implemented by
three modules at different levels of the system, with each module using the services provided by the one below it.

[Figure 8: Organization of the FT-SR runtime system. Each virtual machine occupies a separate user address space containing the user program, Group Manager, Invocation Manager, Resource Manager, and RFD; the VM Manager, Communication Manager, and PFD reside in kernel space.]
Group Communication. Perhaps the most interesting aspect of replication is the algorithm used to implement multicast invocations. The technique we use is similar to those of [23, 24], in which one replica is a primary through which all messages are funneled. Another max_sf replicas are designated as primary-group members, with the remaining replicas considered ordinary members. Upon receiving a message, the primary adds a sequence number and multicasts it to all replicas. Upon receipt, (only) primary-group members send acknowledgements. Once the primary gets these max_sf acknowledgements, it sends an acknowledgement to the original sender of the message; this action is appropriate since the receipt of this many acknowledgements guarantees that at least one replica will have the message even should max_sf failures actually occur. The primary is also involved in outgoing group invocations. In such situations, the runtime system suppresses the invocation from all non-primary group members. When the primary receives an acknowledgement that its invocation has been received, it relays that information to the other group members.
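The sequencing protocol described above can be sketched as follows. This is a single-process illustration under stated assumptions: the class names, the synchronous deliver call, and the ack-counting loop are hypothetical stand-ins for the actual FT-SR runtime interfaces and message transport.

```python
class Replica:
    def __init__(self, name, in_primary_group):
        self.name = name
        self.in_primary_group = in_primary_group
        self.log = []  # (seq, msg) pairs delivered so far

    def deliver(self, seq, msg):
        self.log.append((seq, msg))
        # Only primary-group members acknowledge, so ack traffic is
        # bounded by max_sf regardless of total group size.
        return self.in_primary_group


class Primary:
    def __init__(self, replicas, max_sf):
        self.replicas = replicas  # creation order; replicas[0] is the primary
        self.max_sf = max_sf
        self.next_seq = 0

    def multicast(self, msg):
        seq = self.next_seq
        self.next_seq += 1
        acks = sum(1 for r in self.replicas if r.deliver(seq, msg))
        # max_sf acknowledgements guarantee that at least one replica
        # holds the message even if max_sf replicas fail simultaneously.
        assert acks >= self.max_sf
        return seq  # at this point the original sender can be acknowledged


max_sf = 2
group = [Replica("r%d" % i, in_primary_group=(1 <= i <= max_sf))
         for i in range(5)]
primary = Primary(group, max_sf)
primary.multicast("update")   # every replica logs (0, "update")
```

Because only the primary assigns sequence numbers and only max_sf members acknowledge, the per-invocation cost stays constant as ordinary members are added, which is the behavior observed in Table 1.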
From      To        Group Size       Invocation Time (msec)
resource  group     1                3.24
resource  group     2                6.84
resource  group     3                6.84
resource  group     3 (max_sf = 2)   8.35
group     resource  3                7.19
group     group     3                14

Table 1: Times (in msec) for invocations involving groups
Table 1 shows the cost of invocations to and from resource groups. As can be seen, for groups larger than max_sf + 1, the cost of an invocation to the group is independent of group size, a direct result of the above algorithm. This is especially significant given that a max_sf of one is sufficient for most systems [25]. It gives FT-SR a considerable advantage over systems such as ISIS, where the cost of an invocation grows linearly with the size of the group.
Group Reconfiguration after Failure. The Group Manager at each site is responsible for determining the primary and the members of the primary-group set. Specifically, it maintains a list of all group members, recording for each whether it is the primary, a primary-group member, or an ordinary member. This list is ordered consistently at all sites based on the order in which the replicas were specified in the group create statement. This ordering ensures that all Group Managers will independently pick the same primary and assign the same set of replicas to the primary-group set.
The Group Managers are also responsible for dealing with the failure of group members. If the primary fails, the first member of the primary-group is designated as the new primary. This action, or the failure of a primary-group member, will cause the size of the primary-group to fall below max_sf, so an appropriate number of ordinary members are added to the primary-group to restore its original size. No special action is needed when an ordinary member fails. If backup virtual machines were specified for the group when it was created and such machines are available, failed replicas are restarted automatically. Restarted replicas join the group as ordinary members.
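Because every Group Manager holds the same creation-ordered member list, reconfiguration reduces to a deterministic recomputation that each site performs independently, with no election protocol. A minimal sketch of that recomputation, using hypothetical names:

```python
def reconfigure(members, max_sf, failed):
    """members: the creation-ordered replica list shared by all Group
    Managers; failed: the set of replicas reported failed. Every site
    computes the same result from the same inputs, so no election or
    coordination among sites is needed."""
    survivors = [m for m in members if m not in failed]
    primary = survivors[0]                   # first survivor becomes primary
    primary_group = survivors[1:1 + max_sf]  # next max_sf survivors; ordinary
                                             # members are promoted as needed
                                             # to refill the primary-group
    return primary, primary_group


members = ["a", "b", "c", "d", "e"]   # "a" is primary; ["b", "c"] form the
                                      # primary-group for max_sf = 2
primary, pg = reconfigure(members, 2, failed={"a", "b"})
# "c", the first surviving primary-group member, becomes the new primary,
# and ordinary members "d" and "e" are promoted into the primary-group.
```

The deterministic ordering is what makes this safe: any two sites that agree on the membership list and the failure set necessarily agree on the new primary and primary-group.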
Failure Detection and Notification. Failure detection in FT-SR is done at three levels: at
the processor level by the PFD, at the virtual machine level by the VM Manager, and at the
resource level by the RFD. Each PFD monitors the other processors and notifies the local
VM manager of any failures. The VM manager then maps these processor failures to virtual
machine failures and notifies the RFD. The RFD in turn maps virtual machine failures to
resource failures and passes this information on to any other runtime system module that
requested failure notification. To detect termination of a resource that is explicitly destroyed,
the RFD sends a message to its peer on the appropriate virtual machine asking to be notified
when the resource is destroyed. Similarly, a VM Manager can ask another VM Manager to
send a failure notification when a virtual machine is explicitly destroyed.
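The three-level cascade described above can be sketched as follows. The dictionary shapes (processor-to-VM and VM-to-resource maps) and the callback registration interface are assumptions made for illustration, not the actual runtime data structures.

```python
class VMManager:
    def __init__(self, vms_on, rfd):
        self.vms_on = vms_on          # processor -> set of virtual machines
        self.rfd = rfd

    def processor_failed(self, proc):
        # Called by the PFD: map a processor failure to VM failures.
        for vm in self.vms_on.get(proc, ()):
            self.rfd.vm_failed(vm)


class RFD:
    def __init__(self, resources_in):
        self.resources_in = resources_in  # virtual machine -> set of resources
        self.watchers = {}                # resource -> list of callbacks

    def notify_on_failure(self, resource, callback):
        self.watchers.setdefault(resource, []).append(callback)

    def vm_failed(self, vm):
        # Called by the VM Manager: map a VM failure to resource failures
        # and notify every module that registered interest.
        for res in self.resources_in.get(vm, ()):
            for cb in self.watchers.get(res, []):
                cb(res)


failed = []
rfd = RFD({"vm1": {"stable_store", "data_mgr"}})
vmm = VMManager({"proc3": {"vm1"}}, rfd)
rfd.notify_on_failure("data_mgr", failed.append)
vmm.processor_failed("proc3")   # cascades down to the resource-level callback
```

Layering the mapping this way means each level needs to understand only the failure granularity directly below it, which is why explicit destruction of a VM or resource can be folded into the same notification path.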
5 Conclusions
Numerous programming languages with support for fault-tolerance have been developed,
some as entirely new languages, some as extensions to existing languages and systems, and
some as libraries to existing languages. Examples of new languages include Argus [2],
Aeolus [26] and Plits [3]. Examples of extensions include Fault-Tolerant Concurrent C
(FTCC) [27], HOPS [28], and languages described in [29], [30] and [31]. Finally, fault-
tolerance library support is provided by Arjuna [32] for C++, and Avalon [33] for C++,
Common Lisp and Ada.
A distinguishing feature of these languages is the programming model they support. For
example, the transaction model is supported by Aeolus, Argus, Avalon, HOPS, Plits, and
Arjuna, while the replicated state machine approach [9] is supported by HOPS and FTCC. FT-
SR differs from all the above languages in supporting a model based on FS modules, which
allows any of these other approaches to be programmed easily. Another difference is that FT-
SR’s design as a set of extensions to a high-level distributed programming language greatly
enhances its usability. It simplifies the construction of fault-tolerant distributed programs by
allowing for the seamless integration of the distribution and fault-tolerance aspects of these
programs.
Despite these efforts, developing enhanced language support for fault tolerance is, in
some sense, a neglected area compared with the numerous efforts to develop new system
libraries or network protocols. However, our view is that research in this area has the
potential to render significant benefits. A high-level realization of important fault-tolerance abstractions frees programmers from the need to learn implementation details or how a particular library can be used in a given context. The advantages of a single,
coherent package for expressing the program should not be underestimated either, especially
one based on a high-level distributed programming language that already offers a framework
for writing multi-process programs.
This paper has presented such a language-based approach to writing fault-tolerant dis-
tributed programs. Although the specifics of our approach are based on extending the SR
language, the FS module programming model and design principles could be applied equally
well to any similar language. It is also important when designing a language for such appli-
cations to pay sufficient attention to the implementation, especially the design of an efficient
runtime system. Although confirming experiments are continuing, our expectation is that the
user will pay little, if any, performance penalty for the advantages of a high-level language.
Acknowledgments
Thanks to G. Andrews, H. Bal, M. Hiltunen, D. Mosberger-Tang, R. Olsson, and the anony-
mous referees for reading earlier versions of this paper and providing valuable feedback.
References

[1] K. Birman, A. Schiper, and P. Stephenson, "Lightweight causal and atomic group multicast," ACM Trans. Computer Systems, vol. 9, pp. 272–314, Aug 1991.

[2] B. Liskov, "The Argus language and system," in Distributed Systems: Methods and Tools for Specification, LNCS, Vol. 190 (M. Paul and H. Siegert, eds.), ch. 7, pp. 343–430, Berlin: Springer-Verlag, 1985.

[3] C. Ellis, J. Feldman, and J. Heliotis, "Language constructs and support systems for distributed computing," in ACM Symp. on Prin. of Dist. Comp., pp. 1–9, Aug 1982.

[4] H. Bal, "A comparative study of five parallel programming languages," in Proc. EurOpen Conf. on Open Dist. Systems, May 1991.

[5] U. S. Dept. of Defense, Reference Manual for the Ada Programming Language. Washington D.C., 1983.

[6] C. A. R. Hoare, "Communicating sequential processes," Commun. ACM, vol. 21, pp. 666–677, Aug 1978.

[7] G. R. Andrews and R. A. Olsson, The SR Programming Language: Concurrency in Practice. Benjamin/Cummings, 1993.

[8] N. Hutchinson and L. L. Peterson, "The x-Kernel: An architecture for implementing network protocols," IEEE Trans. Softw. Eng., vol. 17, pp. 64–76, Jan 1991.

[9] F. Schneider, "Implementing fault-tolerant services using the state machine approach: A tutorial," ACM Computing Surveys, vol. 22, pp. 299–319, Dec 1990.

[10] B. Lampson, "Atomic transactions," in Distributed Systems—Architecture and Implementation (B. Lampson, M. Paul, and H. Seigert, eds.), ch. 11, pp. 246–265, Springer-Verlag, 1981.

[11] R. Schlichting and F. Schneider, "Fail-stop processors: An approach to designing fault-tolerant computing systems," ACM Trans. Computer Systems, vol. 1, pp. 222–238, Aug 1983.

[12] P. Lee and T. Anderson, Fault Tolerance: Principles and Practice. Vienna: Springer-Verlag, second ed., 1990.

[13] F. Cristian, "Understanding fault-tolerant distributed systems," Commun. ACM, vol. 34, pp. 56–78, Feb 1991.

[14] J. Gray, "Notes on data base operating systems," in Operating Systems, An Advanced Course (R. Bayer, R. Graham, and G. Seegmuller, eds.), ch. 3.F, pp. 393–481, Springer-Verlag, 1979.

[15] G. Andrews et al., "An overview of the SR language and implementation," ACM Trans. Prog. Lang. and Systems, vol. 10, pp. 51–86, Jan 1988.

[16] F. Cristian, H. Aghili, R. Strong, and D. Dolev, "Atomic broadcast: From simple message diffusion to Byzantine agreement," in Proc. 15th Fault-Tolerant Computing Symp., pp. 200–206, June 1985.

[17] H. Kopetz et al., "Distributed fault-tolerant real-time systems: The Mars approach," IEEE Micro, vol. 9, pp. 25–40, Feb 1989.

[18] P. Melliar-Smith, L. Moser, and V. Agrawala, "Broadcast protocols for distributed systems," IEEE Trans. on Parallel and Distributed Systems, vol. 1, pp. 17–25, Jan 1990.

[19] D. Powell, ed., Delta-4: A Generic Architecture for Dependable Computing. Springer-Verlag, 1991.

[20] F. Cristian, "Reaching agreement on processor-group membership in synchronous distributed systems," Distributed Computing, vol. 4, pp. 175–187, 1991.

[21] H. Kopetz, G. Grunsteidl, and J. Reisinger, "Fault-tolerant membership service in a synchronous distributed real-time system," in Dependable Computing for Critical Applications (A. Avizienis and J.-C. Laprie, eds.), pp. 411–429, Wien: Springer-Verlag, 1991.

[22] V. Thomas, FT-SR: A Programming Language for Constructing Fault-Tolerant Distributed Systems. PhD thesis, Dept. of CS, Univ. of Arizona, 1993.

[23] J. Chang and N. Maxemchuk, "Reliable broadcast protocols," ACM Trans. Computer Systems, vol. 2, pp. 251–273, Aug 1984.

[24] M. F. Kaashoek, A. Tanenbaum, S. Hummel, and H. Bal, "An efficient reliable broadcast protocol," Operating Systems Review, vol. 23, pp. 5–19, Oct 1989.

[25] J. Gray, "Why do computers stop and what can be done about it," in Proc. 5th Symp. on Reliability in Dist. Software and Database Systems, pp. 3–12, Jan 1986.

[26] R. LeBlanc and C. T. Wilkes, "Systems programming with objects and actions," in Proc. 5th Conf. on Distributed Computing Systems, (Denver), pp. 132–139, May 1985.

[27] R. Cmelik, N. Gehani, and W. D. Roome, "Fault Tolerant Concurrent C: A tool for writing fault tolerant distributed programs," in Proc. 18th Fault-Tolerant Computing Symp., pp. 55–61, June 1988.

[28] H. Madduri, "Fault-tolerant distributed computing," Scientific Honeyweller, vol. Winter 1986-87, pp. 1–10, 1986.

[29] J. Knight and J. Urquhart, "On the implementation and use of Ada on fault-tolerant distributed systems," IEEE Trans. Softw. Eng., vol. SE-13, pp. 553–563, May 1987.

[30] M. F. Kaashoek, R. Michiels, H. Bal, and A. Tanenbaum, "Transparent fault-tolerance in parallel Orca programs," in Proc. USENIX Symp. on Exper. with Distributed and Multiprocessor Systems, pp. 297–311, Mar 1992.

[31] R. Schlichting, F. Cristian, and T. Purdin, "A linguistic approach to failure-handling in distributed systems," in Dependable Computing for Critical Applications (A. Avizienis and J.-C. Laprie, eds.), pp. 387–409, Wien: Springer-Verlag, 1991.

[32] S. Shrivastava, G. Dixon, and G. Parrington, "An overview of the Arjuna distributed programming system," IEEE Software, vol. 8, pp. 66–73, Jan 1991.

[33] M. Herlihy and J. Wing, "Avalon: Language support for reliable distributed systems," in Proc. 17th Fault-Tolerant Computing Symp., pp. 89–94, July 1987.