Recovery scopes, recovery groups, and fine-grained recovery in enterprise storage controllers with multi-core processors
S. Seshadri
L. Liu
L. Chiu
In this paper we extend a previously published approach to error recovery in enterprise
storage controllers with multi-core processors. Our approach first involves the
partitioning of the set of tasks in the runtime of the controller software into clusters
(recovery scopes) of dependent tasks. Then, these recovery scopes are mapped into a
set of recovery groups, on which the scheduling of tasks, both during the recovery
process and normal operation, is based. This recovery-aware scheduling (RAS)
replaces the performance-based scheduling of the storage controller. Through
simulation and benchmark experiments, we find that: 1) the performance of RAS
appears to be critically dependent on the values of recovery-related parameters; and
2) our fine-grained recovery approach promises to enhance the storage system
availability while keeping the additional overhead, and the resulting degradation in
performance, under control.
INTRODUCTION
In this paper, we extend earlier work1
that intro-
duced the concepts of recovery groups and recovery-
aware scheduling (RAS) and described their use in
the fine-grained recovery of controller software in
enterprise storage systems. The scheduling mecha-
nism is based on serializing a set of recovery-
dependent tasks (which belong to the same recovery
group) to reduce ripple effects of software failures
and speed up the recovery process.
Our approach involves first the partitioning of the
set of tasks in the runtime of the controller software
into clusters (recovery scopes) of dependent tasks.
Then, these recovery scopes are mapped into a set of
recovery groups, on which the scheduling of tasks is
based. The approach in Reference 1 can be viewed
as a special case when the mapping of recovery
scopes to recovery groups is 1:1.
©Copyright 2009 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of the paper must be obtained from the Editor. 0018-8670/09/$5.00 © 2009 IBM
IBM J. RES. & DEV. VOL. 53 NO. 2 PAPER 9 2009 SESHADRI ET AL. 9 : 1
Experiments with a state-of-the-art enterprise stor-
age controller have shown that the clustering of
tasks into recovery scopes (dependent tasks be-
longing to the same scope) leads to a large number
of scopes over which tasks are unevenly distributed;
most scopes contain a small number of tasks,
whereas a small number of scopes contain a large
number of tasks. Clearly, tracking dependencies at a
coarse granularity may result in a recovery scope
with many dependent tasks whose activations have
to be serialized. This is likely to decrease the
opportunities for parallel execution on the multi-
core architecture and thus is likely to increase
processing overhead during recovery. At the same
time, tracking dependencies at too fine a granularity
increases the overhead of managing a large number
of recovery scopes, resulting in a performance
penalty. Mapping of recovery scopes to recovery
groups is intended to trade off intergroup task
dependencies versus the degree of parallel process-
ing in the RAS mechanism.
Based on our analysis, we present guidelines for
determining the recovery scopes, the recovery
groups, and the mapping of recovery scopes to
recovery groups for use in the RAS. We imple-
mented this approach in a realistic environment by
using an enterprise-class storage controller with
minimal changes to its software. Handling of
various failures in the system can be implemented
incrementally. We show that by selecting appropri-
ate values for the recovery-sensitive system param-
eters, it may be possible to speed up the recovery of
storage controllers and achieve good performance at
the same time.
In this paper, we make the following contributions:
• The higher the level of multi-threading (i.e., the number of cores in the processor), the higher the number of recovery groups needed for effective recovery. Thus, as the multi-threading capability increases, it is beneficial to track finer-granularity recovery scopes through more recovery groups.
• Under operating conditions in which failure rates (mean time between failures) and recovery rates (mean time to recovery) make it very unlikely that the recovery mechanism has to deal with more than one error at a time, the number of recovery groups does not depend on those rates.
• When mapping recovery scopes to recovery groups, it is beneficial to distribute the workload as evenly as possible among the groups.
• Even if the number of recovery groups and the mapping of recovery scopes to these groups are not optimal, the performance during recovery of the storage system with RAS surpasses the performance of the system with the standard (performance-oriented) scheduling.
Our work is largely inspired by previous work in the
areas of fault-tolerant software and high-availability
storage systems. Techniques for software fault
tolerance can be classified into fault-treatment
techniques and error-handling techniques. The
fault-treatment techniques aim at avoiding the
activation of faults through environmental diversity,
such as by rebooting the entire system2
or by micro-
rebooting components of the system,3
through
periodic rejuvenation of the software,4,5
or by
retrying the operation in a different environment.6
Error-handling techniques aim at handling the error
triggered by a fault. They include the storing of
system states (i.e., establishing checkpoints) and
recovery by reverting to an earlier, correct-state
(roll-back),7
application-specific techniques such as
exception handling8
and recovery blocks,9
and more
recent techniques such as failure-oblivious comput-
ing.10
Because the software of enterprise storage systems
has evolved from legacy storage systems, it is highly
desirable to minimize the software development
work for implementing the techniques listed above.
Under tight coupling between the components of a
storage system, implementing the component micro-
reboot technique or carrying out the periodic
rejuvenation approach are both challenging. Rx6
is
an interesting approach to recovery that involves
retrying operations in a modified environment. It
requires, however, making checkpoints of the
system state in order to allow roll-backs, and given
the high volume of requests (tasks) in the runtime of
the storage controller and the complex operational
semantics of such requests, the approach may not be
feasible in our environment. Whereas localized
recovery techniques, such as transactional recovery
in database management systems,11
and applica-
tion-specific recovery mechanisms, such as recovery
blocks9
and exception handling,8
have a proven
track record, their performance in a multi-thread
environment, in which interacting tasks are execut-
ing concurrently, is unknown.
The recovery-aware approach described in Refer-
ence 1 involves three different RAS algorithms, each
representing a distinct way to trade off between
recovery time and system performance. Although
vast amounts of prior work have been dedicated to
scheduling algorithms,12,13
to the best of our
knowledge, only the work in Reference 1 is focused
on recovery.
Some work in the area of virtualization aims to
improve availability by isolating each virtual ma-
chine (VM) from failures occurring in other VMs.14
Because the software of an enterprise storage
controller is not easily partitioned into components
that can run in different VMs, such an approach may
not be feasible in our environment. The redundant
array of inexpensive disks (RAID) approach is aimed
at improving storage system availability at the
device level.15
Our approach, on the other hand, is
focused on the availability of the embedded soft-
ware (firmware) in the storage controllers of
enterprise storage systems and thus can be viewed
as complementary to the RAID approach.
The rest of this paper is organized as follows. In the
next section we introduce recovery scopes as
groupings that associate runtime recovery tasks if
they are dependent on each other. We show that the
association could be resource-based, component-
based, or request-based. In the next section, we
summarize the material in Reference 1 on RAS. In
the section that follows we develop mathematical
models that capture the way in which various
system parameter values affect recovery perfor-
mance. In our analysis we focus on degree of
multiprocessing, scheduling discipline, failure and
recovery rates, and workload characteristics. In the
section that follows we describe our experimental
results. Then, we discuss our results. In the last
section we provide a short summary of our results.
RECOVERY SCOPES
The recovery-aware approach presented in Refer-
ence 1 consists of two stages: first, partitioning the
tasks in the runtime of the controller software into
recovery groups, and then, scheduling the runtime
workload based on these recovery groups. In this
paper we extend this recovery-aware approach to
one consisting of three stages. In the first stage we
partition the runtime tasks into sets of interdepen-
dent tasks, which we refer to as recovery scopes. In
the second stage we map these recovery scopes to a
set of recovery groups, the entities on which
workload scheduling is based. The last stage is the
RAS. In this section we discuss recovery scopes and
describe the various ways in which we can define
them.
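The three-stage structure just described can be read as two mappings composed with a scheduler. A minimal sketch follows, in which every task, scope, and group name is a hypothetical placeholder, not taken from the controller firmware:

```python
# Hypothetical sketch of the three-stage structure: tasks are partitioned
# into recovery scopes, and scopes are mapped (possibly many-to-one) onto
# recovery groups, the units on which the scheduler (RAS) operates.

# Stage 1: partition runtime tasks into recovery scopes of dependent tasks.
task_to_scope = {
    "enqueue_adapter_A": "scope_adapter_A",   # resource-based
    "dequeue_adapter_A": "scope_adapter_A",
    "cache_destage":     "scope_cache",       # component-based
    "cache_fill":        "scope_cache",
    "scsi_read_req42":   "scope_req42",       # request-based
}

# Stage 2: map recovery scopes onto a smaller set of recovery groups.
# Reference 1 is the special case where this mapping is 1:1.
scope_to_group = {
    "scope_adapter_A": 0,
    "scope_cache":     1,
    "scope_req42":     1,   # several scopes may share one group
}

def recovery_group(task):
    """Stage 3 (RAS) schedules on the group this function returns."""
    return scope_to_group[task_to_scope[task]]
```

Collapsing several scopes into one group, as in the second mapping above, trades parallelism for lower scheduling overhead.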
Performing fine-grained recovery from errors in
controller software in a way that exploits the
available multi-threading involves the discovery of
task interdependencies. The tasks in a recovery
scope interact in complex ways. In order to avoid
deadlocks and return the system to a consistent state
when a task encounters an exception, the recovery
process may involve additional tasks. Although
explicit dependencies may be specified by the
programmer during program development, these
dependencies may be too coarse. Moreover, some
dependencies may have been overlooked because of
their dynamic nature and the intrinsic complexity of
the system. To compensate for these conditions, one
could track dependencies dynamically and continu-
ously and use the information to refine, over time,
the developer-defined recovery scopes. The criteria
for classification of tasks into recovery scopes
depend on the nature of the application and the
faults to be handled. In our work, we have identified
three possible task classifications into recovery
scopes: resource-based, component-based, and re-
quest-based.
Resource-based classification
Tasks accessing the same resources (such as device
drivers or metadata) may be classified under the
same recovery scope. This classification would be
effective in dealing with resource-based failures. For
example, consider a ‘‘queue full’’ condition in
storage controllers that occurs when an adapter
whose queue is full refuses to accept additional
requests. Under these circumstances, the error and
the subsequent recovery action would probably
affect only the tasks attempting to enqueue requests
to the same adapter.
To identify resource-based recovery dependencies,
one could observe the pattern of lock acquisitions.
Intuitively, tasks that access the same resource are
likely to acquire common locks. Lock acquisitions
patterns can potentially be used to further refine
resource-based dependencies at runtime by utilizing
the temporal aspect of dependencies (e.g., tasks
whose lock operations are minutes apart need not be
considered dependent).
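A minimal sketch of this idea, assuming a hypothetical trace format of (timestamp, task, lock) lock-acquisition events: tasks that acquire the same lock within a short window are merged into one scope via union-find, and acquisitions far apart in time are ignored, per the temporal refinement above.

```python
# Sketch of trace-based discovery of resource-based recovery scopes:
# tasks whose acquisitions of the same lock fall within a time window
# are placed in the same scope. Trace format and window are assumptions.
from collections import defaultdict

def discover_scopes(trace, window=1.0):
    """trace: list of (timestamp, task_id, lock_id) lock-acquisition events.
    Returns a mapping task_id -> scope representative (union-find root)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Group events per lock, then merge tasks whose acquisitions of the
    # same lock are at most `window` seconds apart (tasks whose lock
    # operations are minutes apart are not considered dependent).
    per_lock = defaultdict(list)
    for t, task, lock in sorted(trace):
        per_lock[lock].append((t, task))
    for events in per_lock.values():
        for (t1, a), (t2, b) in zip(events, events[1:]):
            if t2 - t1 <= window:
                union(a, b)
    return {task: find(task) for _, task, _ in trace}
```

The same clustering could equally be run offline over collected traces (approach 2) to produce a static assignment of tasks to scopes.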
Resource-based dependencies can be identified in
one of three ways: 1) static code analysis; 2)
analysis of traces collected at runtime from the
execution environment with the given workload; or
3) discovery of these dependencies at runtime. The
disadvantage of the dynamic approach (approach 3) is that
the dependencies (such as lock acquisitions) manifest
themselves only after the thread is dispatched.
Assigning recovery scopes after dispatch would be
pointless, unless the tasks can be immediately
suspended and again enqueued in the appropriate
recovery group queue. Such an approach would
result in a performance penalty caused by the high
amount of context switching. We therefore recom-
mend a static assignment of tasks into recovery
scopes using either approach 1 or 2. However,
logging lock acquisitions at runtime may still be
necessary in order to keep track of resource
ownership and perform cleanup of resources in the
event of a failure. The disadvantage of a static
approach compared to a dynamic approach is that it
is unable to use the temporal aspect of dependencies
and is likely to overlook certain dependencies that
are only detectable at runtime. We should point out,
however, that a suboptimal classification of tasks
into recovery scopes does not affect the consistency
of the results, but only the performance penalty
during failure recovery. This is demonstrated later,
in our experimental results section.
Component-based classification
Even in the absence of well-defined operational
boundaries between functional components, certain
failures may require resetting state or performing
recovery actions for tasks associated with a partic-
ular functional component. Consider a scenario
involving the cache component in which tasks
require a temporary data structure known as a
control block for successful completion. Further-
more, a failure occurs when the system runs out of
control blocks. One recovery strategy in this
situation might be to search the list of control blocks
to identify instances that have not been freed up
correctly. Another strategy would be to retry the
operation at a later time, in order to work around
concurrency issues. However, it is very likely that
other tasks associated with the cache component
will encounter the same error if executed before the
issue is resolved. Moreover, recovery work that
involves modifying data structures or consistency
checking may require further dispatching of tasks
associated with the cache component. In these
conditions, a component-based classification of
tasks may be effective in identifying dependencies
during recovery. In component-based classification,
all tasks associated with the same functional
component belong to the same recovery scope.
Request-based classification
Consider a situation in which a read/write request
fails because the request contains an invalid
address. In this case the request will fail. Moreover,
all tasks associated with the request will be either
aborted or will participate in the recovery process.
In this situation, we may choose a recovery strategy
that involves the necessary cleanup actions and then
aborting the request. Then the scope of recovery
consists of all tasks, across all components, that are
associated with this request.
Depending upon the nature of failures that fine-
grained recovery is expected to handle, one class or
a valid combination of the above classifications may
be used to define recovery scopes. As previously
stated, recovery scopes could be either specified (as
in the case of component- or request-based classi-
fications) or discovered (as in the case of resource-
based classification). We refer the readers to
Reference 1 for examples and further discussion on
dynamically associating recovery actions with fail-
ure points in the code.
RECOVERY-AWARE SCHEDULING
In this section we summarize the recovery-aware
scheduling (RAS) from Reference 1 (in which RAS is
referred to as recovery-conscious scheduling). The
key idea of RAS is to ensure bounded recovery time
by efficient allocation of resources to recovery
groups. RAS enforces some serialization of recovery-
dependent tasks in order to reduce the ripple effect
of failure and ensure resource availability during a
localized recovery process.
Reference 1 presents three RAS algorithms, each
using a different method of mapping recovery
groups to processing resources: static, partially
dynamic, and dynamic. Each mapping technique
represents different trade-offs between system
availability and system performance under normal
operation.
Static scheduling of recovery groups determines the
mapping of recovery groups to processors at
compile time and is effective in situations where
task dependencies during recovery are well under-
stood and the workloads are stable. With this
scheme, tasks are dispatched only on processors
associated with the recovery group to which they
belong.
Dynamic scheduling of recovery groups to process-
ing resource pools represents the other end of the
spectrum. This scheme works effectively, even in
the presence of frequently changing workloads.
With dynamic RAS, all processors are mapped to all
recovery groups. The scheduler then uses a starva-
tion-avoiding scheme such as round-robin to iterate
through the groups and dispatch work. However, a
recoverability constraint is specified for each group.
A recoverability constraint prescribes the maximum
number of concurrently executing tasks permissible
for that group. To achieve acceptable utilization, the
constraint is selectively violated when no task
satisfying the constraint is found while resources are
idle.
Between the two ends of the spectrum is partially
dynamic scheduling, which involves partially static
scheduling for those recovery groups whose re-
source demand is stable and well understood and
dynamic scheduling for the remaining recovery
groups.
In spite of implementing fine-grained recovery and
identifying recovery dependencies between tasks,
without careful design it is possible that more
dependent tasks are dispatched before a recovery
process can be completed. This would result in a
longer recovery interval or an inconsistent system
state. A dangerous situation may also arise when
many of the concurrently executing threads turn out
to be dependent, especially since tasks often arrive
in batches. Then the recovery process could
consume a large fraction of system resources, which
may stall the entire system.
To overcome these problems, RAS incorporates two
countermeasures, one proactive and one reactive.
The proactive one comes into play during normal
operation, when RAS attempts to minimize the
number of dependent tasks executing concurrently
by dispatching, at any point in time, tasks from
different recovery groups while closely adhering to
recoverability constraints. The reactive technique
comes into play during failure recovery, when,
based on recovery dependencies information af-
forded by recovery groups, RAS suspends the
dispatching of tasks from those recovery groups
whose tasks are currently undergoing recovery.
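The two countermeasures can be sketched as a dispatch loop over per-group queues. The class, method names, and data structures below are illustrative assumptions, not the controller's actual implementation:

```python
# Sketch of a dynamic RAS dispatch loop: round-robin over recovery-group
# queues, honoring each group's recoverability constraint (maximum number
# of concurrently executing tasks) and suspending groups under recovery.
from collections import deque

class DynamicRAS:
    def __init__(self, num_groups, constraints):
        self.queues = [deque() for _ in range(num_groups)]
        self.gamma = constraints          # per-group concurrency limit
        self.running = [0] * num_groups   # tasks currently executing
        self.recovering = [False] * num_groups
        self.cursor = 0                   # round-robin position

    def enqueue(self, group, task):
        self.queues[group].append(task)

    def dispatch(self):
        """Pick the next task honoring constraints; None if none eligible."""
        n = len(self.queues)
        # Proactive: prefer tasks that satisfy the recoverability constraint.
        for step in range(n):
            g = (self.cursor + step) % n
            if (self.queues[g] and not self.recovering[g]
                    and self.running[g] < self.gamma[g]):
                self.cursor = (g + 1) % n
                self.running[g] += 1
                return g, self.queues[g].popleft()
        # No constraint-satisfying task while resources are idle:
        # selectively violate the constraint (dynamic RAS), still
        # skipping groups that are undergoing recovery (reactive).
        for step in range(n):
            g = (self.cursor + step) % n
            if self.queues[g] and not self.recovering[g]:
                self.cursor = (g + 1) % n
                self.running[g] += 1
                return g, self.queues[g].popleft()
        return None

    def complete(self, group):
        self.running[group] -= 1

    def fail(self, group):
        """Reactive countermeasure: stop dispatching a recovering group."""
        self.recovering[group] = True
```

Static and partially dynamic RAS differ only in how the queue-to-processor mapping is fixed; the constraint check and suspension logic are the same.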
We measure the effectiveness of RAS against
traditional performance-oriented scheduling (POS).
POS, either with a single, global queue or multiple
load-balanced queues, does not include recovery-
dependency in its criteria for resource allocation.
Recovery groups and performance during recovery
The number of recovery groups in the system and
the constraints on these recovery groups are critical
factors in determining the system recovery time and
thus the fault resiliency of the storage system (a
system is resilient if it can continue to operate when
a failure occurs and if it can recover from such
failures quickly). Having a large number of recovery
groups allows fine-grained dispatching of work and
thus the opportunity of improved recovery perfor-
mance through higher-level use of multiprocessing
in the multi-core processor. Depending on how tasks
are assigned to recovery groups, the performance
during normal operation may also be impacted. In
general, increasing the number of recovery groups
beyond a system-dependent threshold may cause
scheduling overhead that may outweigh the benefit
of decreased lock contention.
To enforce recoverability constraints effectively, we
must map scopes appropriately to groups. The
simple approach is to make each recovery scope a
recovery group. Given the uneven distribution of
tasks over recovery scopes, this may result in higher
scheduling overhead as the scheduler polls the large
number of recovery groups for work and most
groups have no pending work. It may also offer little
benefit in terms of shorter recovery time. In this
section we develop mathematical models that
capture the way in which various system parameter
values affect recovery performance. Our analysis
focuses on the degree of multiprocessing, scheduling
discipline, failure and recovery rates, and workload
characteristics.
Impact of recovery groups on recovery rate
The number of outstanding tasks belonging to a
single recovery group, and hence the degree of
serialization, has a direct bearing on the time-to-
recovery of the system. For example, in the worst
case—in which all tasks running at the time of
failure belong to the same recovery group—massive
system-wide recovery will have to be initiated.
Intuitively, the recovery time increases with in-
creasing degree of multiprocessing and with a
decreasing number of recovery groups.
Based on the definition of recovery scopes, we
assume that when a task t belonging to the kth
recovery scope fails, all tasks belonging to the scope
that are executing concurrently with the failed task t
need to undergo recovery.
Let $\lambda_k$ represent the failure rate and $\mu_k$ the repair rate for failures in the kth recovery scope. The number of processors or cores in the system is represented by the variable $m$, and let $a_k(i)$ represent the probability that $i$ outstanding tasks belonging to the kth recovery scope are executing concurrently at the time of failure.
We assume that the recovery process executes
serially, even for concurrently executing threads, in
order to restore the system to a consistent state. As a
result, the time to complete system recovery is a
product of the number of recovering processes and
the individual task recovery time. Then the mean
time to complete system recovery is given by:
$$\mathrm{MTTR} = a_k(1)\,\frac{1}{\mu_k} + a_k(2)\,\frac{2}{\mu_k} + a_k(3)\,\frac{3}{\mu_k} + \cdots + a_k(m)\,\frac{m}{\mu_k}$$
Let $c_k$ represent the probability that a task belongs to recovery scope $k$. Then, using the Poisson approximation for the binomial probability mass function, the probability that there are $i$ outstanding tasks belonging to the kth recovery scope is given by:

$$a_k(i) = b(i; m, c_k) = \frac{e^{-c_k m}\,(c_k m)^i}{i!}$$
With performance-oriented scheduling (POS), there
is no notion of bounding the recovery process.
Interdependent tasks belonging to the same recov-
ery scope can potentially be executing on all
processors. As a result up to m dependent tasks may
be executing concurrently at the time of failure.
Under these circumstances the system mean-time-to-recovery (MTTR) for POS, given that the failure occurred in the kth recovery group, denoted by $\mathrm{MTTR}_{\mathrm{POS}|k}$, is:

$$\mathrm{MTTR}_{\mathrm{POS}|k} = \sum_{i=1}^{m} \frac{e^{-c_k m}\,(c_k m)^i}{i!}\cdot\frac{i}{\mu_k}$$
On the other hand, RAS enforces constraints on recovery groups, thereby ensuring some degree of serialization of dependent tasks. Let us assume that the constraint on the maximum number of concurrent tasks of the recovery group containing the kth recovery scope is given by $\gamma_k$. Then the system mean-time-to-recovery (MTTR) for RAS, given that the failure occurred in the kth recovery group, denoted by $\mathrm{MTTR}_{\mathrm{RAS}|k}$, is:

$$\mathrm{MTTR}_{\mathrm{RAS}|k} = \sum_{i=1}^{\gamma_k} \frac{e^{-c_k \gamma_k}\,(c_k \gamma_k)^i}{i!}\cdot\frac{i}{\mu_k}$$
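As a numerical sketch of the two MTTR expressions, with illustrative parameter values (the paper does not report specific values here):

```python
# Evaluate the Poisson-approximated MTTR: the probability of i concurrent
# scope-k tasks, weighted by the serial recovery time i/mu_k.
import math

def mttr(mean, limit, mu_k):
    """Sum over i = 1..limit of the Poisson(mean) pmf at i, times i/mu_k."""
    return sum(math.exp(-mean) * mean**i / math.factorial(i) * i / mu_k
               for i in range(1, limit + 1))

m = 8          # cores (illustrative)
c_k = 0.25     # probability a task belongs to scope k (illustrative)
gamma_k = 2    # RAS concurrency constraint on scope k's group
mu_k = 10.0    # per-task repair rate

mttr_pos = mttr(c_k * m, m, mu_k)              # POS: up to m dependent tasks
mttr_ras = mttr(c_k * gamma_k, gamma_k, mu_k)  # RAS: bounded by gamma_k

# RAS bounds the extent of recovery, so its conditional MTTR is smaller.
assert mttr_ras < mttr_pos
```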
However, with dynamic RAS, a more flexible mapping of resources to recovery groups is employed in order to reduce resource idling and improve utilization. Under this scheme, in the event that there are spare idle resources even after all tasks have been dispatched according to recoverability constraints, the constraints are selectively violated, keeping in mind the high-performance requirements of the system. Let the number of active recovery groups in the system be denoted by $R$, and let $\gamma_k$ be the constraint specified on the maximum number of concurrent tasks for the group containing the kth recovery scope. Without loss of generality, we assume that there are idle resources only when $\sum_{i=1}^{R} \gamma_i < m$. For the sake of simplicity, let us assume that the available spare resources $m - \sum_{i=1}^{R} \gamma_i$ are allocated evenly among all groups. Then the worst-case violation of a constraint $\gamma_k$, denoted $\bar{\gamma}_k$, is given by:

$$\bar{\gamma}_k = \gamma_k + \frac{m - \sum_{i=1}^{R} \gamma_i}{R}$$

Thus, the system recovery time with dynamic RAS is obtained by replacing the constraint $\gamma_k$ by $\bar{\gamma}_k$ in the expression for the system recovery time for RAS ($\mathrm{MTTR}_{\mathrm{RAS}|k}$).
Clearly, the system availability under POS is affected by the failure rate $\lambda_k$ and the repair rate $\mu_k$ for failures in the kth recovery scope, the number $m$ of processors or cores in the system, and the probability $c_k$ that a task belongs to recovery scope $k$. In contrast, with RAS, availability is also influenced by additional parameters, such as the number $R$ of active recovery groups in the system and the constraint $\gamma_k$ on the maximum number of concurrent tasks of the group containing the kth recovery scope.
Impact of RAS queues on system performance
In this section we present an analysis that shows the impact of recovery groups on system performance, and based on these results we describe criteria for selecting the number of recovery groups for efficient scheduling. Each recovery group
is mapped to a single scheduler queue and the
serialization constraint imposed on the group
applies to all scopes that are mapped to the group.
While evaluating system performance, we must take
into consideration both the good-path (i.e., normal
operation) and bad-path (during failure recovery)
performance. Good path performance is primarily
impacted by the efficiency of the scheduler. On the
other hand, bad-path performance will be impacted
by the extent of failure and recovery (i.e. the degree
of serialization) and the availability of resources for
normal operation during local recovery.
Variation of service rate with RAS queues
We model the variation of service rate with the
number of queues as a hypoexponential distribution
with two phases, where the first phase describes the
scenario in which the service rate increases with the
number of queues due to reduced lock contention.
The second phase models the scenario where the
increase in the number of queues causes the service
rate to drop due to the additional scheduling
overhead.
In order to study the impact of recovery awareness
on the performance of the system, we model both
POS and RAS with varying degrees of multipro-
cessing and during good-path and bad-path opera-
tion. In order to model utilization, response time,
and throughput we adopt the models for M/M/m
queuing systems.16
Consider a system where tasks arrive as a Poisson process with rate $\lambda_a$, and service times for all cores are independent, identically distributed random variables. Let the mean service rate as a function of the number of scheduler queues for POS be denoted by $\mu_{\mathrm{POS}}$, and let the mean service rate as a function of the number of recovery groups for RAS be denoted by $\mu_{\mathrm{RAS}}$. We assume that the service times include the time required to dequeue tasks from the job queues and to iterate through queues (for RAS). Let $m$ denote the total number of cores in the system.
Good-path performance
During good-path operation, all system resources are available and storage controller performance is limited only by scheduler efficiency. Accordingly, the average number of jobs, $N$, in the system is given by:

$$E[N] = m\rho + \frac{\rho\,(m\rho)^m}{m!}\cdot\frac{p_0}{(1-\rho)^2}$$

where $p_0$, the steady-state probability that there are no jobs in the system, is given by:

$$p_0 = \left[\,\sum_{k=0}^{m-1}\frac{(m\rho)^k}{k!} + \frac{(m\rho)^m}{m!}\cdot\frac{1}{1-\rho}\right]^{-1}$$
For POS, the traffic intensity $\rho$ is given by $\rho_{\mathrm{POS}} = \lambda_a/(m\,\mu_{\mathrm{POS}})$, and that for RAS is given by $\rho_{\mathrm{RAS}} = \lambda_a/(m\,\mu_{\mathrm{RAS}})$. $E_{\mathrm{POS}}[N]$ and $E_{\mathrm{RAS}}[N]$ are obtained by substituting $\rho_{\mathrm{POS}}$ and $\rho_{\mathrm{RAS}}$, respectively, for $\rho$ in the expressions for $E[N]$ and $p_0$. In each case, based on Little's formula,17 the average response time for POS ($E_{\mathrm{POS}}[R]$) and RAS ($E_{\mathrm{RAS}}[R]$) is given by:

$$E_{\mathrm{POS}}[R] = \frac{E_{\mathrm{POS}}[N]}{\lambda_a} \quad\text{and}\quad E_{\mathrm{RAS}}[R] = \frac{E_{\mathrm{RAS}}[N]}{\lambda_a}$$
Assuming that our system utilizes a non-preemptive model where individual tasks complete execution within the service time allocated to them on system cores, the system throughput $T$ can be modeled as follows:

$$E_{\mathrm{POS}}[T] = \mu_{\mathrm{POS}}\,U_0^{\mathrm{POS}} \quad\text{and}\quad E_{\mathrm{RAS}}[T] = \mu_{\mathrm{RAS}}\,U_0^{\mathrm{RAS}}$$

where $U_0$, the utilization of the system, is given by $U_0 = 1 - p_0$, and the values for utilization with POS ($U_0^{\mathrm{POS}}$) and RAS ($U_0^{\mathrm{RAS}}$) are obtained by substituting the appropriate values for $p_0$.
Bad-path performance
In order to model system performance during bad-
path operation we assume that the amount of
system resources consumed by the recovery process
is proportional to the extent (i.e., the number of
outstanding tasks undergoing recovery) of the
recovery process.
As described previously, with POS, the extent of the recovery process is unbounded and can potentially span all the available cores in the system. As in the analysis of system availability, assume that a task $t$ belonging to the kth recovery scope encounters a failure, causing all executing tasks belonging to the kth recovery group to undergo recovery. Let $f_{\mathrm{POS}}^k$ and $f_{\mathrm{RAS}}^k$ denote the extent of the failure recovery for POS and RAS, respectively, and let $m_{\mathrm{POS}}$ and $m_{\mathrm{RAS}}$ denote the expected number of cores available for normal operation during failure recovery. Then, as
explained in the case of the impact on recovery rate,
$$m_{\mathrm{POS}} = m - f_{\mathrm{POS}}^k = m - \sum_{i=0}^{m}\frac{e^{-c_k m}\,(c_k m)^i}{i!}\cdot i$$

$$m_{\mathrm{RAS}} = m - f_{\mathrm{RAS}}^k = m - \gamma_k$$
Then the expected response time and throughput during bad-path operation, $E'_{\mathrm{POS}}[R]$, $E'_{\mathrm{POS}}[T]$ and $E'_{\mathrm{RAS}}[R]$, $E'_{\mathrm{RAS}}[T]$ for POS and RAS, respectively, can be computed by substituting $m_{\mathrm{POS}}$ and $m_{\mathrm{RAS}}$, respectively, for $m$ in the original expressions.
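A small sketch of the two effective-core expressions, again with illustrative values:

```python
# Expected number of cores left for normal operation while scope-k tasks
# recover: POS uses the Poisson-approximated extent, RAS the constraint.
import math

def cores_pos(m, c_k):
    """m_POS = m minus the expected number of scope-k tasks caught by the failure."""
    extent = sum(math.exp(-c_k * m) * (c_k * m)**i / math.factorial(i) * i
                 for i in range(0, m + 1))
    return m - extent

def cores_ras(m, gamma_k):
    """m_RAS = m - gamma_k: the recovery extent is bounded by the constraint."""
    return m - gamma_k

m, c_k, gamma_k = 8, 0.25, 2   # illustrative values
m_pos = cores_pos(m, c_k)
m_ras = cores_ras(m, gamma_k)
```

Note that $m_{\mathrm{RAS}}$ is a worst-case bound (the full constraint is consumed), whereas $m_{\mathrm{POS}}$ is an expectation over the Poisson extent.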
EXPERIMENTAL RESULTS
In this section we present results of simulation
experiments and of laboratory experiments involv-
ing a modified enterprise storage controller with
RAS.
Storage system overview
Storage controllers are embedded systems that add
intelligence to storage and provide functionalities
such as RAID, I/O routing, error detection, and
recovery. Failures in storage controllers are typically
more complex and more expensive to recover from
if they are not handled appropriately.
Figure 1 illustrates the architecture of an enterprise-
class storage system, which includes a storage
controller and a set of disk devices. In order to avoid
a single point of failure, the storage controller
components could be designed with redundancy.
The controller consists of a processor complex with
an N-way (N-core) processor, a nonvolatile storage
(NVS) that acts as a fast-write cache, a data cache,
and a program memory. (These three memory
components are represented in Figure 1 by the block
labeled ‘‘Memory.’’) The firmware of the storage
controller provides the management functionalities
for the storage subsystem and also controls the
cache. The storage controller firmware typically
consists of a number of interacting components
(SCSI [Small Computer System Interface] command
processor, cache manager, and device manager),
each of which performs work through a large
number of asynchronous, short-running threads
(processing intervals on the order of microseconds).
We refer to each of these threads as a task. The
program memory is accessible to all the processors
within the complex and holds the job queues
through which functional components in the firm-
ware perform work for host I/O requests. Each
processor runs an independent scheduler and any of
the N processors may execute the jobs available in
the queues. Tasks (e.g., processing a SCSI com-
mand, reading data into cache memory, destaging
data from cache) are placed onto the job queues by
the components and then dispatched to run on one
of the available processors. Tasks interact through
shared data structures in memory and through
message passing.
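The job-queue scheme just described can be sketched as a simple FIFO in shared program memory. All names here (task_t, jobq_t, enqueue, dequeue) are illustrative, not the controller's actual symbols, and a real implementation would protect the queue with a lock since all N per-core schedulers share it.

```c
#include <stddef.h>

/* A short-running unit of work (e.g., processing a SCSI command). */
typedef struct task {
    void (*run)(struct task *); /* work to perform when dispatched  */
    struct task *next;
} task_t;

/* FIFO job queue held in program memory shared by all cores. */
typedef struct {
    task_t *head, *tail;
} jobq_t;

/* Components place tasks onto the queue... */
void enqueue(jobq_t *q, task_t *t) {
    t->next = NULL;
    if (q->tail) q->tail->next = t; else q->head = t;
    q->tail = t;
}

/* ...and any core's scheduler may take the next one. */
task_t *dequeue(jobq_t *q) {
    task_t *t = q->head;
    if (t) {
        q->head = t->next;
        if (!q->head) q->tail = NULL;
    }
    return t;
}
```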
Experimental setup

We prototyped our approach to fine-grained recov-
ery by modifying the firmware of a commercial
enterprise-class storage system (sensitive informa-
tion is left out). The storage system consists of a
storage controller with two 8-way server processor
complexes, memory for I/O caching, persistent
memory (NVS) for write caching, multiple fibre
channel protocol (FCP), Fibre Connection (FICON*)
or Enterprise System Connection (ESCON*) adapters
connected by a redundant high-bandwidth (2-Gbyte)
interconnect, fibre channel disk drives, and man-
agement consoles. The system is designed to achieve
both response time and throughput objectives. The
embedded storage controller software is similar to
the model presented in this paper. The system has a
number of interacting components that dispatch a
large number of short-running tasks.
The RAS itself was implemented in approximately
1000 lines of code (a line of code corresponds to a
single C-language statement). Task-level recovery can be implemented incrementally, by adding each failure situation to be handled one at a time.
Currently, our implementation specifies system-level recovery as the default action, except for the task-level recovery cases that have been implemented.

Figure 1. Storage system architecture: hosts attach through host adapters to the storage controller (a processor complex with an N-way processor and memory), which connects through device adapters to the disks.

A naive approach to program development
for task-level recovery would produce code whose
size would be directly proportional to the number of
‘‘panics’’ or failures to be handled. Implementing a
single task-level recovery case usually involves only
several tens of lines of code.
The simulator, which was written in C, allows us to
specify system configuration features (such as
number of processors and scheduling algorithm),
scheduling strategy (proactive, reactive, or proac-
tive/reactive), recovery scope (resource-based,
component-based, or request-based) and fault-gen-
eration parameters (failure rate, failure type, recov-
ery rate). The simulator is driven by a workload
trace that specifies the tasks dispatched, task
execution startup and completion times, and lock
acquisition and release times.
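A workload-trace record of the kind the simulator consumes would carry dispatch, start/completion, and lock acquire/release events. The field layout and the CSV encoding below are assumptions; the paper does not specify the trace format.

```c
#include <stdio.h>

/* Kinds of events the simulator's trace records, per the text. */
enum { EV_DISPATCH, EV_START, EV_COMPLETE, EV_LOCK_ACQ, EV_LOCK_REL };

typedef struct {
    double   t_us;    /* event timestamp in microseconds        */
    int      kind;    /* one of the EV_* codes                  */
    unsigned task_id;
    unsigned lock_id; /* meaningful for lock events only        */
} trace_ev;

/* Parse one assumed "time,kind,task,lock" line; returns 1 on success. */
int parse_ev(const char *line, trace_ev *ev) {
    return sscanf(line, "%lf,%d,%u,%u",
                  &ev->t_us, &ev->kind, &ev->task_id, &ev->lock_id) == 4;
}
```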
For prototype experiments we used the z/OS* Cache
Standard workload and traces of the cache standard
workload for the simulation experiments. The z/OS
Cache Standard workload [18, 19] is considered representative of online transaction processing in a z/OS
environment. The workload has a read-to-write ratio
of 3, read hit ratio of 0.735, destage rate (rate of
transfer from cache to disk) of 11.6 percent, and an
average transfer size of 4 Kbyte. The setup for the
Cache Standard workload in the prototype imple-
mentation was CPU-bound.
We measure throughput and response times in our
prototype experiments and scheduler efficiency (as
measured by the number of task dispatches per unit
time) in the simulation experiments. For the
prototype experiments, we identified 16 component-
based recovery scopes. Each recovery scope corre-
sponded to a functional component such as a host
adapter, device manager, or cache manager.
In order to understand the impact on system
performance when localized recovery is underway,
we inject faults into the workload. We choose a
candidate task belonging to recovery scope 5 and
introduce faults at a fixed rate. The time required for
recovery is specified by the recovery rate. During
localized recovery, all tasks belonging to the same
recovery scope that are currently executing in the
system and that are dispatched during the recovery
process also experience a delay for the duration of
the recovery time. For example, in our implemen-
tation, a recovery time of 20 ms and a failure rate of
1 in every 10K dispatches, for tasks belonging to
component 5, introduce an overhead of 5 percent to
aggregate execution time per minute of component 5
execution on average. The recoverability constraint
for dynamic RAS was set to 1. Note that in the case
of dynamic RAS, the constraint would selectively be
violated only if no task satisfying the constraint was
found.
Effect of fine-grained recovery on system performance

In order to infuse recovery-awareness into
allocation of resources, we need to keep track of
recovery-time dependencies while recognizing that
when these dependencies are tracked at too fine a
granularity, the overhead of managing a large
number of recovery scopes under normal conditions
may result in a severe performance penalty.
Therefore, we need to evaluate the performance
impact of tracking fine-grained recovery scopes
under normal operating conditions.
For these experiments, the tasks were redistributed
between recovery scopes based on different granu-
larity of dependency tracking and each of these
scopes was mapped to a recovery group. In effect,
the scope-to-group mapping was 1:1 only in the case
of 512 recovery groups and many:1 in all other
cases. Using this mapping, we study the perfor-
mance impact of fine-grained recovery under dif-
ferent degrees of multiprocessing and scheduling
policies. Specifically, we compare scheduler perfor-
mance during normal operation and during failure
recovery, using the number of task dispatches per
unit time as the metric.
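The many-to-one mapping used in these experiments can be illustrated with a simple modulo hash. This particular hash is an assumption on our part; the paper states only that the mapping is 1:1 at 512 groups and many:1 otherwise.

```c
/* Map one of the 512 recovery scopes onto one of num_groups
 * recovery-group queues. With num_groups == 512 the mapping is 1:1;
 * with fewer groups several scopes share a queue. */
int scope_to_group(int scope_id, int num_groups) {
    return scope_id % num_groups;
}
```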
Figure 2(A) shows several plots of the average
number of dispatches per minute during normal
operation as a function of the number of recovery
groups. Plots for the dynamic RAS scheme with 2, 4,
and 8 cores are shown together with a plot for POS
with 8 cores. In the case of POS, the workload is
uniformly distributed among the queues. Recall that
each recovery group is managed using a separate
scheduler queue. With RAS, in all three cases, the
number of dispatches initially increases (as much as
16 percent, 14 percent, and 65 percent in the case of
8, 4, and 2 cores, respectively) and then decreases
(as much as 21 percent, 30 percent, and 45 percent
in the case of 8, 4, and 2 cores, respectively). The
IBM J. RES. & DEV. VOL. 53 NO. 2 PAPER 9 2009 SESHADRI ET AL. 9 : 9
high-performance peak is achieved with 64 groups
in the case of 8 cores, 32 groups with 4 cores, and 8
groups with 2 cores. This shows that the optimal
number of recovery groups depends on the multi-
processing level.
Although RAS initially benefits from the increased
concurrency afforded by additional scheduling
queues, as the number of queues increases, due to
the uneven distribution of workload between
recovery groups, scheduling efficiency decreases.
Depending upon the degree of multiprocessing,
beyond a certain granularity, recovery-awareness
and keeping track of fine-grained recovery scopes
may degrade system performance.
Whereas POS also exhibits decreasing efficiency
with a large number of queues, the degradation in
performance is less steep due to the uniform
distribution of workload among queues. Thus, the
number of recovery groups chosen and the mapping
of tasks to recovery groups should take into
consideration workload distribution among groups
and try to achieve load-balancing.
In order to emphasize the importance of the right choice of the number of recovery groups for performance, we
next compare scheduler performance under dy-
namic RAS and a load-balanced, performance-
oriented scheduler with varying degrees of multi-
processing. We use 16 recovery groups for the
dynamic RAS scheduler and 16 queues, with
uniform workload distribution for the POS schedul-
er. Figure 2(B) shows the average number of
dispatches per minute in both cases with varying
degrees of multiprocessing (number of cores). With
this hand-picked choice of number of recovery
groups, we see that the system can achieve
performance that is very close to a performance-
oriented architecture, even while tracking recovery
dependencies across varying degrees of multipro-
cessing.
Effect of fine-grained recovery on system
availability
The benefit from tracking recovery dependencies is
realized during failure recovery. Figure 3(A) com-
pares scheduler performance during normal opera-
tion with that during failure recovery for a system
with 8 cores. By availability, we refer to service
availability and also the ability of the service to meet
performance expectations during failure-recovery.
We measure this using scheduler performance
during failure-recovery. Failure was emulated by
injecting faults into a chosen component at the rate
of once in every 10,000 dispatches of the tasks
belonging to that component.
First, the graph shows that, during failure-recovery,
for a low number of recovery groups (i.e., a coarse
granularity of recovery tracking), the benefit from
recovery awareness is low—although still higher
than the performance-oriented case. However, at the
right granularity, recovery awareness can make a significant improvement in scheduler performance. In this case, at a group size of 16, recovery awareness can effect a 23 percent improvement in scheduler performance.

Figure 2. (A) Performance versus number of recovery groups: average dispatches per minute for dynamic RAS with 2, 4, and 8 cores and for POS with 8 cores; (B) performance versus number of cores with 16 recovery groups (RAS) and 16 queues (POS).
Next, consider the group sizes 4 and 32 where POS
almost matches the performance of RAS. Figures
3(B) through 3(D) represent the number of dis-
patches per minute over a duration of 30 minutes.
The graphs show that even at group sizes of 4 and
32, where POS matches RAS in average number of
task dispatches per minute, POS results in serious
fluctuations of scheduler performance. At some
instances, the number of dispatches with POS drops
to as low as 65 percent of that with RAS. Recall that
POS distributes workload equally among all pro-
cessors without considering recovery dependencies.
Therefore, during failure, many tasks dependent on
the failing task may be executing concurrently. As a
result, in spite of fine-grained recovery, the entire
recovery process takes longer, resulting in a drop in
performance due to unavailability of resources for
normally operating tasks. We can argue that, even with some inaccuracy in the selection of the number of recovery groups, there is a conclusive advantage over POS during failure recovery in being able to track recovery dependencies. Moreover, even when fine-grained recovery is implemented, tracking recovery dependencies remains crucial to sustaining performance during failure recovery.
Figure 3. (A) Performance versus number of recovery groups: good path and bad path (recovery time = 100 ms); (B) bad-path performance over time (4 groups/queues); (C) bad-path performance over time (16 groups/queues); (D) bad-path performance over time (32 groups/queues).
Sensitivity to recovery and failure rate
Figure 4(A) shows the variation of scheduler
performance for different recovery rates for tasks
belonging to component 5. The failure rate was fixed
at 1 in every 10,000 dispatches of tasks belonging to
component 5. The figure tells us that the choice of the number of recovery groups is nearly independent of the recovery rate: if x recovery groups is a better choice than y at one recovery rate, the same is almost always true at every other recovery rate.
Figure 4(B) shows the variation of scheduler
performance with different failure rates. The recov-
ery time for a single failed task was set to 100 ms,
and a failure was injected into tasks belonging to
component 5. As with the case of recovery rate, the
figure shows that the choice of number of recovery
groups is nearly independent of failure rate. Also,
Figures 4(A) and (B) show that the system
performance is far more sensitive to the recovery
rate than the failure rate. For example, an 80 percent
improvement in recovery rate improves system
performance by 27 percent on average, as compared
to a 100 percent improvement in failure rate
effecting a 13 percent improvement in performance.
Prototype experiments
By conducting experiments with our prototype
implementation using the Cache Standard workload,
we observe that, compared to POS, our recovery-
aware approach improved system throughput by
16.3 percent and response time by 22.9 percent
during failure recovery. The throughput with POS
was observed to be 107,000 I/O per second and
87,800 I/O per second during good-path and bad-
path, respectively, and with RAS 105,000 I/O per
second during both good-path and bad-path. Simi-
larly, the response time with POS was observed to
be 13.3 ms and 16.6 ms during good-path and bad-
path, respectively, and with RAS 13.5 ms during
both good-path and bad-path.
Figure 5(A) shows the average number of task
dispatches per minute over 30 minutes with a
varying number of recovery groups under the Cache
Standard workload. The figure also shows the
scheduler performance for the same configuration
using the simulation. As the figure shows, the
number of dispatches initially increases (although
modestly) with the increase in the number of
groups. For instance, when the number of groups
increases from 1 to 16, the number of dispatches
increases by nearly 13 percent (9 percent in the
simulation) and from 1 to 4, the number of
dispatches increases by 10 percent (8 percent in the
simulation). This experiment was used to validate
the simulator and establish the preferred number of
recovery groups as 16 for further experimentation
with the prototype.
Figure 4. (A) Performance versus number of recovery groups for various recovery rates; (B) performance versus number of recovery groups for various failure rates.

Figure 5(B) shows the average number of task dispatches per minute per recovery group and in total under POS and RAS with the Cache Standard workload. Of the 16 recovery groups, only 8 have active tasks. The figure shows the number of dispatches under normal operation (which are
nearly identical for POS and RAS) and those under
bad-path for RAS and POS. Under bad-path opera-
tion, the number of dispatches with POS drops by
nearly 14.4 percent, while the number of dispatches
for RAS drops by only 3 percent, which corresponds
to a 16.3 percent improvement in throughput and
22.9 percent improvement in response time of RAS
over POS, in the average case. In the worst case POS
may cause complete system unavailability.
SUMMARY

While our experiments provide some insights into
the selection of parameters such as recovery groups,
clearly these decisions are largely impacted by the
nature of the software. Below, we present certain
guidelines for the selection of these parameters that
must be validated for the particular instance of
software and system configuration. A possible
procedure to perform this validation is to evaluate
the impact of various parameters using simulation-
based studies and workload traces as shown in this
paper.
The number of recovery scopes in the system is a
characteristic of the software and the dependencies
between tasks. Once the granularity of recovery has
been identified, and the dependency information has
been specified (with explicit dependencies being
specified initially and the system identifying implicit
dependencies over certain duration of observation),
the recovery scopes are specified. During runtime,
tasks are enqueued based on the recovery scope that
has been identified for the task. The scheduler
efficiency now depends on the number of recovery
groups that need to be iterated through at runtime.
This choice of recovery groups depends on the
degree of multiprocessing (number of cores), and
the mapping of recovery scopes to recovery groups
depends on the distribution of tasks among recovery
scopes. Thus the guidelines for selection of recov-
ery-aware parameters can be summarized as fol-
lows:
• The optimal number of recovery groups depends on the degree of multiprocessing. However, choosing the number of groups to be larger than the number of cores can help improve performance, for example by reducing contention for job-queue locks.

• The choice of the number of recovery groups and the mapping of tasks to recovery groups should take into consideration the workload distribution between the groups, so as to achieve load balancing and to avoid idle cycling of the scheduler through empty queues looking for work. The information required to perform load balancing can be acquired by studying the workload for the distribution of tasks between recovery scopes and their arrival rates.

• Even with some inaccuracy in the selection of the number of recovery groups, RAS holds a conclusive advantage over POS during failure recovery by being able to track recovery dependencies. This gives the developer some flexibility in choosing the number of recovery groups.
Figure 5. (A) Performance over time, Cache Standard workload (prototype versus simulation with 1, 4, and 16 queues); (B) performance per recovery group: good path and bad path (POS, RAS).

DISCUSSION

We observe that as the number of recovery scopes in the system increases, for a finer granularity of
recovery tasks, the system availability improves as if
a natural resiliency develops in the system. How-
ever, the rate at which the resource availability
improves tends to decrease as the number of
recovery scopes continues to increase. The benefit
from recovery groups at a given granularity is
determined by distribution of tasks among recovery
groups, task recovery time, failure rate, and degree
of multiprocessing. However, the choice of the
number of recovery groups mainly depends on the
degree of multiprocessing, distribution of workload,
and the relationship between the failure and the
recovery-dependency tracked. Depending on these
parameters, we can predict the expected recovery
time in the event of a failure. By appropriate
selection of the number of recovery groups and the
recovery scope-to-group mapping, we can derive the
maximum benefit from the recovery-aware frame-
work.
Effectiveness of recovery-aware scheduling
An important question is whether we can account
for environmental problems, such as detecting and
avoiding faulty adapters, to avoid meaningless
rescheduling. RAS, to some extent, is meant to avoid
exactly such circumstances, such as continuing to
send tasks to a faulty adapter repeatedly. The
concept of recovery groups is meant to serialize
dependencies by which more faulty tasks are not
dispatched while recovery is being attempted. Two
types of RAS—proactive and reactive—were intro-
duced and discussed in Reference 1. The goal of
proactive scheduling is to enhance availability and
reduce the impact of failure by bounding the
number of outstanding tasks per recovery group,
even during normal operation. Reactive scheduling,
on the other hand, takes over after a failure has
occurred. Reactive scheduling suspends the dispatch
of the tasks belonging to the group undergoing
recovery until localized recovery is completed. If a
problematic adapter results in the failure of a task
belonging to a recovery group, dispatch of tasks
from that recovery group will be suspended until
recovery is completed. Instead, the resources be-
longing to that recovery group may be used for
recovery, or be utilized by other failure-free recov-
ery groups.
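The two RAS modes described above can be condensed into a single dispatch gate. The structure and names below are ours, not the product's; the proactive bound corresponds to the recoverability constraint, which the experiments set to 1.

```c
#include <stdbool.h>

/* Per-recovery-group dispatch state (illustrative). */
typedef struct {
    bool in_recovery;  /* set when a task of this group fails        */
    int  outstanding;  /* tasks of this group currently executing    */
    int  bound;        /* proactive bound on outstanding tasks       */
} rgroup_t;

/* Combined gate: reactive suspension while the group recovers,
 * proactive bounding of concurrency during normal operation. */
bool may_dispatch(const rgroup_t *g) {
    if (g->in_recovery) return false;   /* reactive: suspend dispatch */
    return g->outstanding < g->bound;   /* proactive: enforce bound   */
}
```

When the gate returns false, the core is free to serve other, failure-free recovery groups, which is how resources are redirected during localized recovery.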
Specifying the recovery handler at program development time
There are good reasons to require that the recovery
handler be specified at program development time.
First, we acknowledge that writing error-recovery
code is a complex task. As a first step, our
framework provides guidance for identifying de-
pendencies and also ensures that recovery handlers
are nonintrusive and have minimal impact on good-
path execution. We are working on machine-
learning techniques to identify and incorporate
micro-feedback into the identification of failure
scenarios and recovery strategies.
Second, our analysis shows that due to the
complexity of the system, not all failures can be
recovered using fine-grained recovery. The recovery
strategies are often determined by the semantics of
the failure and the nature of the tasks that
encountered the failure. For example, a straightfor-
ward recovery strategy for a failure during a
background task that is not critical can be simply to
ignore the failure, whereas the strategy may be
different for a critical task that requires on-time
completion. Such task-specific, semantic-based re-
covery handlers are best defined by the developers
of these tasks. Due to the inherent complexity of the
system and semantics involved, a fully automated
approach to determining the recovery strategies,
without programmer assistance, is difficult and less
effective. In the event that the developer specifies an
ineffective recovery strategy, such that the problem
causing the failure is not resolved, we recommend
setting a recovery threshold that specifies the
number of times that the micro-recovery should be
attempted before falling back to system-level re-
covery. If the failure is not prevented by the micro-
recovery mechanism and the failure threshold has
been reached, then system-level recovery will be
performed.
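The recommended escalation policy can be stated compactly. The names and the placement of the attempt counter below are illustrative assumptions.

```c
/* Escalation decision: retry micro-recovery up to a per-failure
 * threshold, then fall back to system-level recovery. */
typedef enum { REC_MICRO, REC_SYSTEM } rec_action_t;

rec_action_t choose_recovery(int attempts_so_far, int threshold) {
    return (attempts_so_far < threshold) ? REC_MICRO : REC_SYSTEM;
}
```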
Finally, we note that, in the case of identifying
recovery dependencies, our approach combines
programmer-specified information with system-ini-
tiated learning through logs. Whereas the program-
mer is given the tool to specify dependencies based
on experience, the system uses the programmer-
specified dependencies as a starting point and
continues to refine these dependencies throughout
the life of the system. In other words, the recovery-
aware framework does not rely on the completeness
or the correctness of developer-specified dependen-
cies. One of our ongoing projects is to develop an
access-log-based architecture that uses the informa-
tion provided by lock accesses as a guideline to
understanding system state changes. Based on such
9 : 14 SESHADRI ET AL. IBM J. RES. & DEV. VOL. 53 NO. 2 PAPER 9 2009
interactions between concurrent tasks, the archi-
tecture dynamically identifies dependencies at run-
time and alerts the developer to events like ‘‘dirty
reads’’ of shared state (i.e., reading data that has
been modified by another transaction but not yet
committed). This log-based architecture will relieve
the developer from the burden of tracking resources
such as shared buffers and locks and tracking read-
write conflicts on shared state.
CONCLUSION
In this paper, we have presented an extension to the
recovery-aware framework in Reference 1 and
applied it to the fine-grained recovery of enterprise
storage systems with multi-core processors. We
introduced the concept of recovery scope and used it
to partition the set of runtime tasks into groups of
interdependent tasks. We then demonstrated how
the mapping of recovery scopes to recovery groups
can be used in scheduling of tasks during recovery
from faults.
We focused on developing effective mappings of
dependent tasks to processor resources through
careful tuning of recovery-sensitive parameters. We
presented a formal model to capture the mapping of
recovery scopes to recovery groups, which can be
used in the scheduling of tasks during recovery.
We have implemented our proposed recovery-aware
framework by modifying the controller software in
an enterprise storage system. Through our analysis
and experimentation, we have shown that through
careful tuning of the system configuration and the
recovery-sensitive parameters, it is possible to
improve the system performance during recovery
and thus improve system resiliency to faults.
ACKNOWLEDGMENTS

This work was partially funded by an NSF CISE grant
and an IBM SUR grant. We acknowledge with
gratitude Cornel Constantinescu, Subashini
Balachandran, Clem Dickey, Paul Muench, David
Whitworth, Andrew Lin, Juan Ruiz (J. J.), Brian
Hatfield, Chiahong Chen, and Joseph Hyde for
helping us perform experimental evaluations and
interpret the data. We thank K. K. Rao, David
Chambliss, Brian Henderson, and the other members
of the Storage Systems group at the IBM Almaden
Research Center for valuable feedback and for
providing the resources to perform our experiments.
We also thank Prof. Karsten Schwan for valuable
insights. Finally, we thank the anonymous referees
for their comments and the editorial staff for helping
us to improve the quality of our paper.
*Trademark, service mark, or registered trademark of International Business Machines Corporation in the United States, other countries, or both.
REFERENCES

1. S. Seshadri, L. Chiu, C. Constantinescu, S. Balachandran, C. Dickey, L. Liu, and P. Muench, ''Enhancing Storage System Availability on Multi-core Architectures with Recovery Conscious Scheduling,'' Proceedings of the Sixth USENIX Conference on File and Storage Technologies (FAST 2008), San Jose, CA (February 26–29, 2008), pp. 143–158.

2. J. Gray, ''Why Do Computers Stop and What Can Be Done About It?'' Proceedings of the Fifth Symposium on Reliability in Distributed Software and Database Systems, Los Angeles, IEEE Computer Society Press (January 13–15, 1986), pp. 3–12.

3. G. Candea, S. Kawamoto, Y. Fujiki, G. Friedman, and A. Fox, ''Microreboot—A Technique for Cheap Recovery,'' Proceedings of the Sixth Symposium on Operating Systems Design and Implementation (OSDI '04), San Francisco, USENIX Association (December 6–8, 2004), pp. 31–44.

4. Y. Huang, C. Kintala, N. Kolettis, and N. D. Fulton, ''Software Rejuvenation: Analysis, Module and Applications,'' Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing (FTCS-25), Pasadena, IEEE Computer Society (June 27–30, 1995), pp. 381–390.

5. S. Garg, A. Puliafito, M. Telek, and K. S. Trivedi, ''On the Analysis of Software Rejuvenation Policies,'' Proceedings of the Twelfth Annual Conference on Computer Assurance (COMPASS '97), Gaithersburg, MD (June 18–20, 1997), pp. 88–96.

6. F. Qin, J. Tucek, J. Sundaresan, and Y. Zhou, ''Rx: Treating Bugs as Allergies—A Safe Method to Survive Software Failure,'' Proceedings of the 20th ACM Symposium on Operating Systems Principles (SOSP 2005), Brighton, UK (October 23–26, 2005), ACM, New York, pp. 235–248.

7. J. Gray and A. Reuter, Transaction Processing: Concepts and Techniques (The Morgan Kaufmann Series in Data Management Systems), Morgan Kaufmann, Burlington, MA (1992).

8. S. Sidiroglou, O. Laadan, A. D. Keromytis, and J. Nieh, ''Using Rescue Points to Navigate Software Recovery,'' Proceedings of the 2007 IEEE Symposium on Security and Privacy (SP '07), Oakland, CA, IEEE (May 20–23, 2007), pp. 273–280.

9. B. Randell, ''System Structure for Software Fault Tolerance,'' IEEE Transactions on Software Engineering 1, No. 2, 221–232 (1975).

10. M. Rinard, C. Cadar, D. Dumitran, D. M. Roy, T. Leu, and W. S. Beebee Jr., ''Enhancing Server Availability and Security Through Failure-Oblivious Computing,'' Proceedings of the Sixth Symposium on Operating Systems Design and Implementation (OSDI '04), San Francisco, USENIX Association (December 6–8, 2004), pp. 303–316.

11. C. Mohan, D. Haderle, B. Lindsay, H. Pirahesh, and P. Schwarz, ''ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging,'' ACM Transactions on Database Systems 17, No. 1, 94–162 (1992).

12. M. Karlsson, C. Karamanolis, and X. Zhu, ''Triage: Performance Differentiation for Storage Systems Using Adaptive Control,'' ACM Transactions on Storage 1, No. 4, 457–480 (2005).

13. A. Gulati, A. Merchant, and P. J. Varman, ''pClock: An Arrival Curve Based Approach for QoS Guarantees in Shared Storage Systems,'' Proceedings of the 2007 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS 2007), San Diego, CA, ACM (June 12–16, 2007), pp. 13–24.

14. B. Jansen, H. V. Ramasamy, M. Schunter, and A. Tanner, ''Architecting Dependable and Secure Systems Using Virtualization,'' Proceedings of the Workshop on Software Architecting for Dependable Systems (WADS 2007), Edinburgh, Lecture Notes in Computer Science 5135, Springer (June 27, 2007), pp. 124–149.

15. M. Sivathanu, V. Prabhakaran, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau, ''Improving Storage System Availability with D-GRAID,'' ACM Transactions on Storage 1, No. 2, 133–170 (2005).

16. K. S. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications, Prentice Hall PTR, Upper Saddle River, NJ (2004).

17. S. Stidham Jr., ''A Last Word on L = λW,'' Operations Research 22, No. 2, 417–421 (1974).

18. IBM z/Architecture Principles of Operation, SA22-7832-06, IBM Corporation, 2008, http://publibz.boulder.ibm.com/epubs/pdf/dz9zr006.pdf.

19. L. LaFrese, ''IBM TotalStorage Enterprise Storage Server Model 800: New Features in LIC Level 2.3.0,'' Performance White Paper, IBM Corporation (2003), http://www.ibm.com/systems/uk/resources/systems_uk_storage_disk_ess_performance.pdf.
Accepted for publication Sept. 17, 2008.
Sangeetha SeshadriGeorgia Institute of Technology, 266 Ferst Dr., Atlanta, GA30332-0765 ([email protected]). Ms. Seshadri receiveda B.E. degree in computer science and an M.Sc. degree inmathematics from the Birla Institute of Technology andScience, Pilani, India, in 2002. She is currently workingtoward a Ph.D. degree at the College of Computing, GeorgiaInstitute of Technology, under the guidance of Prof. Ling Liu.Her research interests include large-scale storage systems andservices and distributed middleware overlay systems, as wellas techniques and architectures for improving the availability,scalability, and performance of such systems.
Ling Liu
Georgia Institute of Technology, 266 Ferst Dr., Atlanta, GA 30332-0765 ([email protected]). Dr. Liu, an Associate Professor in the College of Computing at Georgia Institute of Technology, directs the research programs in the Distributed Data Intensive Systems Lab. The research covers various aspects of data intensive systems, ranging from mobile computing and event stream processing to Internet data management, storage systems, and service-oriented architectures. It is currently directed toward building large-scale Internet systems and services, with a focus on performance, security, privacy, and energy efficiency. Her research group has produced a number of open source software systems, among which the most popular ones are WebCQ, XWRAPElite, and PeerCrawl. She has published more than 200 journal and conference articles and is currently on the editorial board of several journals, including IEEE Transactions on Services Computing, IEEE Transactions on Knowledge and Data Engineering, International Journal of Very Large Database Systems, International Journal of Peer-to-Peer Networking and Applications, International Journal of Web Services Research, and Wireless Networks Journal. She is a recipient of the best paper award of ICDCS 2003, the best paper award of WWW 2004, the 2005 Pat Goldberg Memorial Best Paper Award from IBM Research, the best data engineering paper award of the International Conference on Software Engineering and Data Engineering 2008, and a recipient of IBM faculty awards in 2003 and 2006 through 2008. Her research is sponsored primarily by the National Science Foundation, DARPA, the Air Force Office of Scientific Research, and IBM.
Lawrence Chiu
IBM Almaden Research Center, 650 Harry Rd., San Jose, CA 95120 ([email protected]). Mr. Chiu received his M.S. degree in computer engineering from the University of Southern California in 1991 and his M.S. degree in technology commercialization from the McCombs School of Business, University of Texas, Austin, in 2003. From 2000 to 2003, he initiated and managed the effort to develop and deliver the first IBM storage virtualization product, SAN Volume Controller. He currently manages the Scalable Storage Systems group at the Almaden Research Center, focusing on highly scalable and highly available enterprise storage systems and solid-state disk architecture. &
9 : 16 SESHADRI ET AL. IBM J. RES. & DEV. VOL. 53 NO. 2 PAPER 9 2009