Recovery scopes, recovery groups, and fine-grained recovery in enterprise storage controllers with multi-core processors


S. Seshadri

L. Liu

L. Chiu

In this paper we extend a previously published approach to error recovery in enterprise storage controllers with multi-core processors. Our approach first involves the partitioning of the set of tasks in the runtime of the controller software into clusters (recovery scopes) of dependent tasks. Then, these recovery scopes are mapped into a set of recovery groups, on which the scheduling of tasks, both during the recovery process and normal operation, is based. This recovery-aware scheduling (RAS) replaces the performance-based scheduling of the storage controller. Through simulation and benchmark experiments, we find that: 1) the performance of RAS appears to be critically dependent on the values of recovery-related parameters; and 2) our fine-grained recovery approach promises to enhance the storage system availability while keeping the additional overhead, and the resulting degradation in performance, under control.

INTRODUCTION

In this paper, we extend earlier work [1] that introduced the concepts of recovery groups and recovery-aware scheduling (RAS) and described their use in the fine-grained recovery of controller software in enterprise storage systems. The scheduling mechanism is based on serializing a set of recovery-dependent tasks (which belong to the same recovery group) to reduce ripple effects of software failures and speed up the recovery process.

Our approach involves first the partitioning of the set of tasks in the runtime of the controller software into clusters (recovery scopes) of dependent tasks. Then, these recovery scopes are mapped into a set of recovery groups, on which the scheduling of tasks is based. The approach in Reference 1 can be viewed as a special case in which the mapping of recovery scopes to recovery groups is 1:1.


Experiments with a state-of-the-art enterprise storage controller have shown that the clustering of tasks into recovery scopes (dependent tasks belonging to the same scope) leads to a large number of scopes over which tasks are unevenly distributed; most scopes contain a small number of tasks, whereas a small number of scopes contain a large number of tasks. Clearly, tracking dependencies at a coarse granularity may result in a recovery scope with many dependent tasks whose activations have to be serialized. This is likely to decrease the opportunities for parallel execution on the multi-core architecture and thus is likely to increase processing overhead during recovery. At the same time, tracking dependencies at too fine a granularity increases the overhead of managing a large number of recovery scopes, resulting in a performance penalty. Mapping of recovery scopes to recovery groups is intended to trade off intergroup task dependencies versus the degree of parallel processing in the RAS mechanism.

Based on our analysis, we present guidelines for determining the recovery scopes, the recovery groups, and the mapping of recovery scopes to recovery groups for use in the RAS. We implemented this approach in a realistic environment by using an enterprise-class storage controller with minimal changes to its software. Handling of various failures in the system can be implemented incrementally. We show that by selecting appropriate values for the recovery-sensitive system parameters, it may be possible to speed up the recovery of storage controllers and achieve good performance at the same time.

In this paper, we make the following contributions:

• The higher the level of multi-threading (i.e., the number of cores in the processor), the higher the number of recovery groups needed for effective recovery. Thus, as the multi-threading capability increases, it is beneficial to track finer-granularity recovery scopes through more recovery groups.
• Under operating conditions in which failure rates (mean time between failures) and recovery rates (mean time to recovery) make it very unlikely that the recovery mechanism has to deal with more than one error at a time, the number of recovery groups does not depend on those rates.
• When mapping recovery scopes to recovery groups, it is beneficial to distribute the workload as evenly as possible among the groups.
• Even if the number of recovery groups and the mapping of recovery scopes to these groups are not optimal, the performance during recovery of the storage system with RAS surpasses the performance of the system with the standard (performance-oriented) scheduling.

Our work is largely inspired by previous work in the areas of fault-tolerant software and high-availability storage systems. Techniques for software fault tolerance can be classified into fault-treatment techniques and error-handling techniques. The fault-treatment techniques aim at avoiding the activation of faults through environmental diversity, such as by rebooting the entire system [2] or by micro-rebooting components of the system [3], through periodic rejuvenation of the software [4,5], or by retrying the operation in a different environment [6]. Error-handling techniques aim at handling the error triggered by a fault. They include the storing of system states (i.e., establishing checkpoints) and recovery by reverting to an earlier, correct state (roll-back) [7], application-specific techniques such as exception handling [8] and recovery blocks [9], and more recent techniques such as failure-oblivious computing [10].

Because the software of enterprise storage systems has evolved from legacy storage systems, it is highly desirable to minimize the software development work for implementing the techniques listed above. Under tight coupling between the components of a storage system, implementing the component micro-reboot technique or carrying out the periodic rejuvenation approach are both challenging. Rx [6] is an interesting approach to recovery that involves retrying operations in a modified environment. It requires, however, making checkpoints of the system state in order to allow roll-backs, and given the high volume of requests (tasks) in the runtime of the storage controller and the complex operational semantics of such requests, the approach may not be feasible in our environment. Whereas localized recovery techniques, such as transactional recovery in database management systems [11], and application-specific recovery mechanisms, such as recovery blocks [9] and exception handling [8], have a proven track record, their performance in a multi-threaded environment, in which interacting tasks are executing concurrently, is unknown.

The recovery-aware approach described in Reference 1 involves three different RAS algorithms, each representing a distinct way to trade off between recovery time and system performance. Although vast amounts of prior work have been dedicated to scheduling algorithms [12,13], to the best of our knowledge, only the work in Reference 1 is focused on recovery.

Some work in the area of virtualization aims to improve availability by isolating each virtual machine (VM) from failures occurring in other VMs [14]. Because the software of an enterprise storage controller is not easily partitioned into components that can run in different VMs, such an approach may not be feasible in our environment. The redundant array of inexpensive disks (RAID) approach is aimed at improving storage system availability at the device level [15]. Our approach, on the other hand, is focused on the availability of the embedded software (firmware) in the storage controllers of enterprise storage systems and thus can be viewed as complementary to the RAID approach.

The rest of this paper is organized as follows. In the next section we introduce recovery scopes as groupings that associate runtime recovery tasks if they are dependent on each other. We show that the association could be resource-based, component-based, or request-based. In the section after that, we summarize the material in Reference 1 on RAS. In the section that follows we develop mathematical models that capture the way in which various system parameter values affect recovery performance. In our analysis we focus on the degree of multiprocessing, scheduling discipline, failure and recovery rates, and workload characteristics. In the section that follows we describe our experimental results. Then, we discuss our results. In the last section we provide a short summary of our results.

RECOVERY SCOPES

The recovery-aware approach presented in Reference 1 consists of two stages: first, partitioning the tasks in the runtime of the controller software into recovery groups, and then, scheduling the runtime workload based on these recovery groups. In this paper we extend this recovery-aware approach to one consisting of three stages. In the first stage we partition the runtime tasks into sets of interdependent tasks, which we refer to as recovery scopes. In the second stage we map these recovery scopes to a set of recovery groups, the entities on which workload scheduling is based. The last stage is the RAS. In this section we discuss recovery scopes and describe the various ways in which we can define them.
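To make the scope and group abstractions concrete, the following C sketch shows one plausible shape for the structures involved. It is illustrative only: the field names, limits, and queue layout are our own assumptions, not the controller's implementation. Each task carries a recovery-scope identifier, and a many:1 table maps scopes to recovery groups; each group owns a scheduler queue and the recoverability constraint used by the RAS described later.

    /* Illustrative sketch (assumed layout, not the product code) of tasks,
     * recovery scopes, and the many:1 scope-to-group mapping. */
    #include <stddef.h>

    #define MAX_SCOPES 512            /* hypothetical limits */
    #define MAX_GROUPS 64

    struct task {
        void (*run)(void *arg);       /* task body */
        void *arg;
        int scope_id;                 /* recovery scope of this task */
        struct task *next;            /* intrusive queue link */
    };

    struct recovery_group {
        struct task *queue_head, *queue_tail;  /* per-group scheduler queue */
        int constraint;               /* max concurrent tasks of this group */
        int running;                  /* tasks currently dispatched */
        int in_recovery;              /* nonzero while localized recovery runs */
    };

    static int scope_to_group[MAX_SCOPES];            /* many:1 mapping; the */
    static struct recovery_group groups[MAX_GROUPS];  /* 1:1 case is Ref. 1  */

    /* Enqueue a task on the queue of the group its scope maps to. */
    static void enqueue_task(struct task *t)
    {
        struct recovery_group *g = &groups[scope_to_group[t->scope_id]];
        t->next = NULL;
        if (g->queue_tail) g->queue_tail->next = t; else g->queue_head = t;
        g->queue_tail = t;
    }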

Performing fine-grained recovery from errors in controller software in a way that exploits the available multi-threading involves the discovery of task interdependencies. The tasks in a recovery scope interact in complex ways. In order to avoid deadlocks and return the system to a consistent state when a task encounters an exception, the recovery process may involve additional tasks. Although explicit dependencies may be specified by the programmer during program development, these dependencies may be too coarse. Moreover, some dependencies may have been overlooked because of their dynamic nature and the intrinsic complexity of the system. To compensate for these conditions, one could track dependencies dynamically and continuously and use the information to refine, over time, the developer-defined recovery scopes. The criteria for classification of tasks into recovery scopes depend on the nature of the application and the faults to be handled. In our work, we have identified three possible task classifications into recovery scopes: resource-based, component-based, and request-based.

Resource-based classification
Tasks accessing the same resources (such as device drivers or metadata) may be classified under the same recovery scope. This classification would be effective in dealing with resource-based failures. For example, consider a "queue full" condition in storage controllers that occurs when an adapter whose queue is full refuses to accept additional requests. Under these circumstances, the error and the subsequent recovery action would probably affect only the tasks attempting to enqueue requests to the same adapter.

To identify resource-based recovery dependencies, one could observe the pattern of lock acquisitions. Intuitively, tasks that access the same resource are likely to acquire common locks. Lock-acquisition patterns can potentially be used to further refine resource-based dependencies at runtime by utilizing the temporal aspect of dependencies (e.g., tasks whose lock operations are minutes apart need not be considered dependent).

Resource-based dependencies can be identified in one of three ways: 1) static code analysis; 2) analysis of traces collected at runtime from the execution environment with the given workload; or 3) discovery of these dependencies at runtime. The disadvantage of the dynamic approach (3) is that dependencies (such as lock acquisitions) manifest themselves only after the thread is dispatched. Assigning recovery scopes after dispatch would be pointless, unless the tasks can be immediately suspended and again enqueued in the appropriate recovery group queue. Such an approach would result in a performance penalty caused by the high amount of context switching. We therefore recommend a static assignment of tasks into recovery scopes using either approach 1 or 2. However, logging lock acquisitions at runtime may still be necessary in order to keep track of resource ownership and perform cleanup of resources in the event of a failure. The disadvantage of a static approach compared to a dynamic approach is that it is unable to use the temporal aspect of dependencies and is likely to overlook certain dependencies that are only detectable at runtime. We should point out, however, that a suboptimal classification of tasks into recovery scopes does not affect the consistency of the results, but only the performance penalty during failure recovery. This is demonstrated later, in our experimental results section.
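As an illustration of approach (2), the sketch below derives resource-based scopes offline from a lock-acquisition trace: tasks that acquire the same lock within a time window are merged into one scope using a union-find structure. The record layout, limits, and window value are assumptions made for the example.

    /* Offline trace analysis sketch: cluster tasks into recovery scopes by
     * shared, temporally close lock acquisitions (union-find). */
    #define MAX_TASKS 4096
    #define MAX_LOCKS 1024
    #define WINDOW_US 60000000LL   /* ignore acquisitions > 1 minute apart */

    static int parent[MAX_TASKS];            /* union-find forest over tasks */
    static int last_task[MAX_LOCKS];         /* last task seen on each lock  */
    static long long last_time[MAX_LOCKS];   /* time of that acquisition     */

    static int find(int x) { return parent[x] == x ? x : (parent[x] = find(parent[x])); }
    static void merge(int a, int b) { parent[find(a)] = find(b); }

    void scopes_init(void)
    {
        for (int i = 0; i < MAX_TASKS; i++) parent[i] = i;
        for (int i = 0; i < MAX_LOCKS; i++) last_task[i] = -1;
    }

    /* Replay one trace record: task id, lock id, acquisition time (us). */
    void on_lock_acquire(int task, int lock, long long t_us)
    {
        if (last_task[lock] >= 0 && t_us - last_time[lock] <= WINDOW_US)
            merge(task, last_task[lock]);    /* temporally close: same scope */
        last_task[lock] = task;
        last_time[lock] = t_us;
    }
    /* After the trace is replayed, find(t) identifies task t's scope. */

The text's observation that lock operations minutes apart need not be considered dependent motivates the window; after replay, each union-find root corresponds to one resource-based recovery scope.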

Component-based classification

Even in the absence of well-defined operational boundaries between functional components, certain failures may require resetting state or performing recovery actions for tasks associated with a particular functional component. Consider a scenario involving the cache component in which tasks require a temporary data structure, known as a control block, for successful completion, and suppose a failure occurs when the system runs out of control blocks. One recovery strategy in this situation might be to search the list of control blocks to identify instances that have not been freed up correctly. Another strategy would be to retry the operation at a later time, in order to work around concurrency issues. However, it is very likely that other tasks associated with the cache component will encounter the same error if executed before the issue is resolved. Moreover, recovery work that involves modifying data structures or consistency checking may require further dispatching of tasks associated with the cache component. In these conditions, a component-based classification of tasks may be effective in identifying dependencies during recovery. In component-based classification, all tasks associated with the same functional component belong to the same recovery scope.

Request-based classification

Consider a situation in which a read/write request fails because it contains an invalid address. In this case, all tasks associated with the request will either be aborted or will participate in the recovery process. In this situation, we may choose a recovery strategy that involves performing the necessary cleanup actions and then aborting the request. Then the scope of recovery consists of all tasks, across all components, that are associated with this request.

Depending upon the nature of the failures that fine-grained recovery is expected to handle, one class or a valid combination of the above classifications may be used to define recovery scopes. As previously stated, recovery scopes could be either specified (as in the case of component- or request-based classifications) or discovered (as in the case of resource-based classification). We refer the readers to Reference 1 for examples and further discussion on dynamically associating recovery actions with failure points in the code.

RECOVERY-AWARE SCHEDULING

In this section we summarize the recovery-aware scheduling (RAS) from Reference 1 (in which RAS is referred to as recovery-conscious scheduling). The key idea of RAS is to ensure bounded recovery time by efficient allocation of resources to recovery groups. RAS enforces some serialization of recovery-dependent tasks in order to reduce the ripple effect of failure and ensure resource availability during a localized recovery process.

Reference 1 presents three RAS algorithms, each using a different method of mapping recovery groups to processing resources: static, partially dynamic, and dynamic. Each mapping technique represents a different trade-off between system availability and system performance under normal operation.

Static scheduling of recovery groups determines the mapping of recovery groups to processors at compile time and is effective in situations where task dependencies during recovery are well understood and the workloads are stable. With this scheme, tasks are dispatched only on processors associated with the recovery group to which they belong.

Dynamic scheduling of recovery groups to processing resource pools represents the other end of the spectrum. This scheme works effectively even in the presence of frequently changing workloads. With dynamic RAS, all processors are mapped to all recovery groups. The scheduler then uses a starvation-avoiding scheme such as round-robin to iterate through the groups and dispatch work. However, a recoverability constraint is specified for each group. A recoverability constraint prescribes the maximum number of concurrently executing tasks permissible for that group. To achieve acceptable utilization, the constraint is selectively violated when no task satisfying the constraint is found while resources are idle.

Between the two ends of the spectrum is partially dynamic scheduling, which involves static scheduling for those recovery groups whose resource demand is stable and well understood and dynamic scheduling for the remaining recovery groups.
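To illustrate, here is a hedged C sketch of the dynamic dispatch rule, reusing the recovery_group structure from the earlier sketch. The two-pass policy follows the text: respect recoverability constraints first, and violate one selectively only when a full pass finds no eligible task while the core would otherwise idle. The function name and cursor mechanics are our own.

    /* Pop the head task of group g (assumes g->queue_head != NULL). */
    static struct task *pop(struct recovery_group *g)
    {
        struct task *t = g->queue_head;
        g->queue_head = t->next;
        if (!g->queue_head) g->queue_tail = NULL;
        g->running++;
        return t;
    }

    /* Dynamic RAS dispatch: round-robin over all recovery groups. */
    struct task *dispatch_next(int ngroups)
    {
        static int cursor = 0;

        /* Pass 1: honor constraints; skip groups undergoing recovery. */
        for (int i = 0; i < ngroups; i++) {
            struct recovery_group *g = &groups[(cursor + i) % ngroups];
            if (g->queue_head && !g->in_recovery && g->running < g->constraint) {
                cursor = (cursor + i + 1) % ngroups;
                return pop(g);
            }
        }
        /* Pass 2: nothing satisfies a constraint but the core is idle, so
         * selectively violate a constraint rather than waste the cycle. */
        for (int i = 0; i < ngroups; i++) {
            struct recovery_group *g = &groups[(cursor + i) % ngroups];
            if (g->queue_head && !g->in_recovery) {
                cursor = (cursor + i + 1) % ngroups;
                return pop(g);
            }
        }
        return NULL;   /* no runnable work */
    }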

In spite of implementing fine-grained recovery and identifying recovery dependencies between tasks, without careful design it is possible that more dependent tasks are dispatched before a recovery process can be completed. This would result in a longer recovery interval or an inconsistent system state. A dangerous situation may also arise when many of the concurrently executing threads turn out to be dependent, especially since tasks often arrive in batches. Then the recovery process could consume a large fraction of system resources, which may stall the entire system.

To overcome these problems, RAS incorporates two countermeasures, one proactive and one reactive. The proactive one comes into play during normal operation, when RAS attempts to minimize the number of dependent tasks executing concurrently by dispatching, at any point in time, tasks from different recovery groups while closely adhering to recoverability constraints. The reactive technique comes into play during failure recovery, when, based on the recovery-dependency information afforded by recovery groups, RAS suspends the dispatching of tasks from those recovery groups whose tasks are currently undergoing recovery.
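In code, the reactive countermeasure can be as simple as a per-group flag consulted by the dispatcher; the sketch below (hook names are hypothetical) marks a group while its localized recovery runs, so that the dispatch_next() sketch above stops feeding it work, and clears the mark on completion.

    /* Reactive countermeasure: quarantine a group during its recovery. */
    void on_task_failure(struct recovery_group *g)
    {
        g->in_recovery = 1;    /* dispatch_next() now skips this group */
        /* ... run the recovery handler for the failed task and its
         * concurrently executing recovery-dependent peers ... */
    }

    void on_recovery_complete(struct recovery_group *g)
    {
        g->in_recovery = 0;    /* group becomes schedulable again */
    }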

We measure the effectiveness of RAS against traditional performance-oriented scheduling (POS). POS, whether with a single, global queue or with multiple load-balanced queues, does not include recovery dependency in its criteria for resource allocation.

Recovery groups and performance during recovery

The number of recovery groups in the system and the constraints on these recovery groups are critical factors in determining the system recovery time and thus the fault resiliency of the storage system (a system is resilient if it can continue to operate when a failure occurs and if it can recover from such failures quickly). Having a large number of recovery groups allows fine-grained dispatching of work and thus the opportunity of improved recovery performance through greater use of multiprocessing in the multi-core processor. Depending on how tasks are assigned to recovery groups, the performance during normal operation may also be impacted. In general, increasing the number of recovery groups beyond a system-dependent threshold may cause scheduling overhead that outweighs the benefit of decreased lock contention.

To enforce recoverability constraints effectively, we must map scopes appropriately to groups. The simple approach is to make each recovery scope a recovery group. Given the uneven distribution of tasks over recovery scopes, this may result in higher scheduling overhead, as the scheduler polls the large number of recovery groups for work while most groups have no pending work. It may also offer little benefit in terms of shorter recovery time. In this section we develop mathematical models that capture the way in which various system parameter values affect recovery performance. Our analysis focuses on the degree of multiprocessing, scheduling discipline, failure and recovery rates, and workload characteristics.

Impact of recovery groups on recovery rate
The number of outstanding tasks belonging to a single recovery group, and hence the degree of serialization, has a direct bearing on the time-to-recovery of the system. For example, in the worst case, in which all tasks running at the time of failure belong to the same recovery group, massive system-wide recovery will have to be initiated. Intuitively, the recovery time increases with an increasing degree of multiprocessing and with a decreasing number of recovery groups.

Based on the definition of recovery scopes, we assume that when a task t belonging to the kth recovery scope fails, all tasks belonging to the scope that are executing concurrently with the failed task t need to undergo recovery.

Let $\lambda_k$ represent the failure rate and $\mu_k$ the repair rate for failures in the $k$th recovery scope. The number of processors or cores in the system is represented by the variable $m$, and $a_k(i)$ represents the probability that $i$ outstanding tasks belonging to the $k$th recovery scope are executing concurrently at the time of failure.

We assume that the recovery process executes serially, even for concurrently executing threads, in order to restore the system to a consistent state. As a result, the time to complete system recovery is a product of the number of recovering processes and the individual task recovery time. Then the mean time to complete system recovery is given by:

$$\ell = a_k(1)\,\frac{1}{\mu_k} + a_k(2)\,\frac{2}{\mu_k} + a_k(3)\,\frac{3}{\mu_k} + \cdots + a_k(m)\,\frac{m}{\mu_k}$$

Let $\gamma_k$ represent the probability that a task belongs to recovery scope $k$. Then, using the Poisson approximation to the binomial probability mass function, the probability that there are $i$ outstanding tasks belonging to the $k$th recovery scope is given by:

$$a_k(i) = b(i;\, m,\, \gamma_k) \approx \frac{e^{-\gamma_k m}\,(\gamma_k m)^i}{i!}$$

With performance-oriented scheduling (POS), there is no notion of bounding the recovery process. Interdependent tasks belonging to the same recovery scope can potentially be executing on all processors. As a result, up to $m$ dependent tasks may be executing concurrently at the time of failure. Under these circumstances, the system mean time to recovery (MTTR) for POS, given that the failure occurred in the $k$th recovery group, denoted by $\mathrm{MTTR}_{\mathrm{POS}|k}$, is:

$$\mathrm{MTTR}_{\mathrm{POS}|k} = \sum_{i=1}^{m} \frac{e^{-\gamma_k m}\,(\gamma_k m)^i}{i!} \times \frac{i}{\mu_k}$$

On the other hand, RAS enforces constraints on recovery groups, thereby ensuring some degree of serialization of dependent tasks. Let us assume that the constraint on the maximum number of concurrent tasks of the recovery group containing the $k$th recovery scope is given by $c_k$. Then the system mean time to recovery (MTTR) for RAS, given that the failure occurred in the $k$th recovery group, denoted by $\mathrm{MTTR}_{\mathrm{RAS}|k}$, is:

$$\mathrm{MTTR}_{\mathrm{RAS}|k} = \sum_{i=1}^{c_k} \frac{e^{-\gamma_k c_k}\,(\gamma_k c_k)^i}{i!} \times \frac{i}{\mu_k}$$

However, with dynamic RAS, a more flexible mapping of resources to recovery groups is employed in order to reduce resource idling and improve utilization. Under this scheme, in the event that there are spare idle resources even after all tasks have been dispatched according to recoverability constraints, the constraints are selectively violated, keeping in mind the high-performance requirements of the system. Let the number of active recovery groups in the system be denoted by $R$, and let $c_k$ be the constraint specified on the maximum number of concurrent tasks for the group containing the $k$th recovery scope. Without loss of generality, we assume that there are idle resources only when $\sum_{i=1}^{R} c_i < m$. For the sake of simplicity, let us assume that the available spare resources $m - \sum_{i=1}^{R} c_i$ are allocated evenly among all groups. Then the worst-case violation of a constraint $c_k$, denoted $\bar{c}_k$, is given by:

$$\bar{c}_k = c_k + \frac{m - \sum_{i=1}^{R} c_i}{R}$$

Thus, the system recovery time with dynamic RAS is obtained by replacing the constraint $c_k$ by $\bar{c}_k$ in the expression for the system recovery time for RAS ($\mathrm{MTTR}_{\mathrm{RAS}|k}$).
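As a quick numerical check of these expressions, the following C sketch evaluates $\mathrm{MTTR}_{\mathrm{POS}|k}$ and $\mathrm{MTTR}_{\mathrm{RAS}|k}$ under the Poisson approximation. All parameter values are illustrative assumptions, not measurements from the paper.

    /* Evaluate sum over i of Poisson(i; lambda) * i / mu for i = 1..imax,
     * the common form of the MTTR expressions above. */
    #include <math.h>
    #include <stdio.h>

    static double mttr(double lambda, double mu, int imax)
    {
        double sum = 0.0, p = exp(-lambda);   /* p = Poisson pmf at i = 0 */
        for (int i = 1; i <= imax; i++) {
            p *= lambda / i;                  /* advance pmf to i */
            sum += p * i / mu;
        }
        return sum;
    }

    int main(void)
    {
        double gamma_k = 0.2;    /* assumed P(task belongs to scope k)   */
        double mu_k    = 50.0;   /* assumed repair rate: 20 ms per task  */
        int    m       = 8;      /* cores                                */
        int    c_k     = 2;      /* assumed RAS constraint for the group */

        printf("MTTR(POS|k) = %.4f s\n", mttr(gamma_k * m,   mu_k, m));
        printf("MTTR(RAS|k) = %.4f s\n", mttr(gamma_k * c_k, mu_k, c_k));
        return 0;
    }

With these numbers, the POS sum ranges over up to $m = 8$ dependent tasks with mean $\gamma_k m = 1.6$, whereas RAS caps the extent at $c_k = 2$ with mean $0.4$, illustrating why bounding concurrency shortens the expected recovery time.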

Clearly, the system availability under POS is affected by the failure rate $\lambda_k$, the repair rate $\mu_k$ for failures in the $k$th recovery scope, the number $m$ of processors or cores in the system, and the probability $\gamma_k$ that a task belongs to recovery scope $k$. In contrast, with RAS, availability is also influenced by additional parameters, such as the number $R$ of active recovery groups in the system and the constraint $c_k$ on the maximum number of concurrent tasks of the group containing the $k$th recovery scope.

Impact of RAS queues on system performance
In this section we present analysis that shows the impact of recovery groups on system performance, and based on these results we describe criteria for selecting the number of recovery groups for efficient scheduling. Each recovery group is mapped to a single scheduler queue, and the serialization constraint imposed on the group applies to all scopes that are mapped to the group.

While evaluating system performance, we must take into consideration both the good-path (i.e., normal operation) and bad-path (during failure recovery) performance. Good-path performance is primarily impacted by the efficiency of the scheduler. On the other hand, bad-path performance is impacted by the extent of failure and recovery (i.e., the degree of serialization) and by the availability of resources for normal operation during local recovery.

Variation of service rate with RAS queues

We model the variation of service rate with the number of queues as a hypoexponential distribution with two phases, where the first phase describes the scenario in which the service rate increases with the number of queues due to reduced lock contention. The second phase models the scenario in which the increase in the number of queues causes the service rate to drop due to the additional scheduling overhead.

In order to study the impact of recovery awareness on the performance of the system, we model both POS and RAS with varying degrees of multiprocessing and during good-path and bad-path operation. In order to model utilization, response time, and throughput, we adopt the models for M/M/m queuing systems [16].

Consider a system where tasks arrive as a Poisson process with rate $\lambda_a$, and service times for all cores are independent, identically distributed random variables. Let the mean service rate as a function of the number of scheduler queues for POS be denoted by $\mu_{\mathrm{POS}}$, and let the mean service rate as a function of the number of recovery groups for RAS be denoted by $\mu_{\mathrm{RAS}}$. We assume that the service times include the time required to dequeue tasks from the job queues and to iterate through queues (for RAS). Let $m$ denote the total number of cores in the system.

Good-path performance

During good-path operation, all system resources are available and storage controller performance is limited only by scheduler efficiency. Accordingly, the average number of jobs, $N$, in the system is given by:

$$E[N] = m\rho + \rho\,\frac{(m\rho)^m}{m!}\,\frac{p_0}{(1-\rho)^2}$$

where $p_0$, the steady-state probability that there are no jobs in the system, is given by:

$$p_0 = \left[\,\sum_{k=0}^{m-1} \frac{(m\rho)^k}{k!} + \frac{(m\rho)^m}{m!}\,\frac{1}{1-\rho}\,\right]^{-1}$$

For POS, the traffic intensity $\rho$ is given by $\rho_{\mathrm{POS}} = \lambda_a/(m\,\mu_{\mathrm{POS}})$, and that for RAS by $\rho_{\mathrm{RAS}} = \lambda_a/(m\,\mu_{\mathrm{RAS}})$. $E_{\mathrm{POS}}[N]$ and $E_{\mathrm{RAS}}[N]$ are obtained by substituting $\rho_{\mathrm{POS}}$ and $\rho_{\mathrm{RAS}}$, respectively, for $\rho$ in the expressions for $E[N]$ and $p_0$. In each case, based on Little's formula [17], the average response time for POS ($E_{\mathrm{POS}}[R]$) and RAS ($E_{\mathrm{RAS}}[R]$) is given by:

$$E_{\mathrm{POS}}[R] = \frac{E_{\mathrm{POS}}[N]}{\lambda_a} \qquad\text{and}\qquad E_{\mathrm{RAS}}[R] = \frac{E_{\mathrm{RAS}}[N]}{\lambda_a}$$

Assuming that our system utilizes a non-preemptive model in which individual tasks complete execution within the service time allocated to them on system cores, the system throughput $T$ can be modeled as follows:

$$E_{\mathrm{POS}}[T] = \mu_{\mathrm{POS}}\,U_0^{\mathrm{POS}} \qquad\text{and}\qquad E_{\mathrm{RAS}}[T] = \mu_{\mathrm{RAS}}\,U_0^{\mathrm{RAS}}$$

where $U_0$, the utilization of the system, is given by $U_0 = 1 - p_0$, and the values for utilization with POS ($U_0^{\mathrm{POS}}$) and RAS ($U_0^{\mathrm{RAS}}$) are obtained by substituting the appropriate values for $p_0$.

Bad-path performance

In order to model system performance during bad-path operation, we assume that the amount of system resources consumed by the recovery process is proportional to the extent (i.e., the number of outstanding tasks undergoing recovery) of the recovery process.

As described previously, with POS, the extent of the recovery process is unbounded and can potentially span all the available cores in the system. As in the analysis of system availability, assume that a task t belonging to the $k$th recovery scope encounters a failure, causing all executing tasks belonging to the $k$th recovery group to undergo recovery. Let $f_k^{\mathrm{POS}}$ and $f_k^{\mathrm{RAS}}$ denote the extent of the failure recovery for POS and RAS, respectively, and let $m_{\mathrm{POS}}$ and $m_{\mathrm{RAS}}$ denote the expected number of cores available for normal operation during failure recovery. Then, as explained in the case of the impact on recovery rate,

$$m_{\mathrm{POS}} = m - f_k^{\mathrm{POS}} = m - \sum_{i=0}^{m} \frac{e^{-\gamma_k m}\,(\gamma_k m)^i}{i!} \times i$$

$$m_{\mathrm{RAS}} = m - f_k^{\mathrm{RAS}} = m - c_k$$

Then the expected response times and throughputs during bad-path operation, $E'_{\mathrm{POS}}[R]$ and $E'_{\mathrm{POS}}[T]$ for POS and $E'_{\mathrm{RAS}}[R]$ and $E'_{\mathrm{RAS}}[T]$ for RAS, can be computed by substituting $m_{\mathrm{POS}}$ and $m_{\mathrm{RAS}}$, respectively, for $m$ in the original expressions.
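The following C sketch evaluates the M/M/m expressions above for good-path and bad-path operation; the arrival rate, service rate, and constraint values are illustrative assumptions chosen only to keep the queue stable ($\rho < 1$).

    /* M/M/m model of the text: p0, E[N], and E[R] via Little's formula. */
    #include <math.h>
    #include <stdio.h>

    static double p0(int m, double rho)
    {
        double sum = 0.0, term = 1.0;              /* term = (m*rho)^k / k! */
        for (int k = 0; k < m; k++) { sum += term; term *= m * rho / (k + 1); }
        return 1.0 / (sum + term / (1.0 - rho));   /* term is (m*rho)^m / m! */
    }

    static double resp_time(int m, double lambda_a, double mu)
    {
        double rho  = lambda_a / (m * mu);
        double term = pow(m * rho, m) / tgamma(m + 1.0);   /* (m rho)^m / m! */
        double en   = m * rho + rho * term * p0(m, rho) / pow(1.0 - rho, 2.0);
        return en / lambda_a;            /* Little: E[R] = E[N] / lambda_a */
    }

    int main(void)
    {
        int    m        = 8;       /* cores                                 */
        double lambda_a = 4000.0;  /* assumed task arrivals per second      */
        double mu       = 800.0;   /* assumed per-core service rate         */
        int    c_k      = 2;       /* assumed RAS recoverability constraint */

        printf("good path E[R] = %.6f s\n", resp_time(m, lambda_a, mu));
        /* Bad path: substitute the cores left for normal work; for RAS the
         * model uses m_RAS = m - c_k. */
        printf("bad path  E[R] = %.6f s\n", resp_time(m - c_k, lambda_a, mu));
        return 0;
    }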

EXPERIMENTAL RESULTS

In this section we present the results of simulation experiments and of laboratory experiments involving a modified enterprise storage controller with RAS.

Storage system overview

Storage controllers are embedded systems that add intelligence to storage and provide functionalities such as RAID, I/O routing, error detection, and recovery. Failures in storage controllers are typically complex and, if they are not handled appropriately, expensive to recover from.

Figure 1 illustrates the architecture of an enterprise-class storage system, which includes a storage controller and a set of disk devices. In order to avoid a single point of failure, the storage controller components could be designed with redundancy. The controller consists of a processor complex with an N-way (N-core) processor, a nonvolatile storage (NVS) that acts as a fast-write cache, a data cache, and a program memory. (These three memory components are represented in Figure 1 by the block labeled "Memory.") The firmware of the storage controller provides the management functionalities for the storage subsystem and also controls the cache. The storage controller firmware typically consists of a number of interacting components (SCSI [Small Computer System Interface] command processor, cache manager, and device manager), each of which performs work through a large number of asynchronous, short-running threads (processing intervals on the order of microseconds). We refer to each of these threads as a task. The program memory is accessible to all the processors within the complex and holds the job queues through which functional components in the firmware perform work for host I/O requests. Each processor runs an independent scheduler, and any of the N processors may execute the jobs available in the queues. Tasks (e.g., processing a SCSI command, reading data into cache memory, destaging data from cache) are placed onto the job queues by the components and then dispatched to run on one of the available processors. Tasks interact through shared data structures in memory and through message passing.

Experimental setup
We prototyped our approach to fine-grained recovery by modifying the firmware of a commercial enterprise-class storage system (sensitive information is left out). The storage system consists of a storage controller with two 8-way server processor complexes, memory for I/O caching, persistent memory (NVS) for write caching, multiple fibre channel protocol (FCP), Fiber Connectivity (FICON*), or Enterprise System Connection (ESCON*) adapters connected by a redundant high-bandwidth (2-Gbyte) interconnect, fibre channel disk drives, and management consoles. The system is designed to achieve both response time and throughput objectives. The embedded storage controller software is similar to the model presented in this paper. The system has a number of interacting components that dispatch a large number of short-running tasks.

The RAS itself was implemented in approximately 1000 lines of code (a line of code corresponds to a single C-language statement). Task-level recovery can be implemented incrementally, by adding each failure situation to be handled one at a time. Currently, our implementation specifies system-level recovery as the default action, except for the task-level recovery cases that have been implemented. A naive approach to program development for task-level recovery would produce code whose size would be directly proportional to the number of "panics" or failures to be handled. Implementing a single task-level recovery case usually involves only several tens of lines of code.

[Figure 1: Storage system architecture. Hosts connect through host adapters to the storage controller's processor complex (N-way processor and memory); device adapters connect the controller to the disks.]

The simulator, which was written in C, allows us to specify system configuration features (such as the number of processors and the scheduling algorithm), scheduling strategy (proactive, reactive, or proactive/reactive), recovery scope (resource-based, component-based, or request-based), and fault-generation parameters (failure rate, failure type, recovery rate). The simulator is driven by a workload trace that specifies the tasks dispatched, task execution startup and completion times, and lock acquisition and release times.

For the prototype experiments we used the z/OS* Cache Standard workload, and for the simulation experiments we used traces of the Cache Standard workload. The z/OS Cache Standard workload [18,19] is considered representative of online transaction processing in a z/OS environment. The workload has a read-to-write ratio of 3, a read hit ratio of 0.735, a destage rate (rate of transfer from cache to disk) of 11.6 percent, and an average transfer size of 4 Kbyte. The setup for the Cache Standard workload in the prototype implementation was CPU-bound.

We measure throughput and response times in our prototype experiments, and scheduler efficiency (as measured by the number of task dispatches per unit time) in the simulation experiments. For the prototype experiments, we identified 16 component-based recovery scopes. Each recovery scope corresponded to a functional component such as a host adapter, device manager, or cache manager.

In order to understand the impact on system performance when localized recovery is underway, we inject faults into the workload. We choose a candidate task belonging to recovery scope 5 and introduce faults at a fixed rate. The time required for recovery is specified by the recovery rate. During localized recovery, all tasks belonging to the same recovery scope that are currently executing in the system, or that are dispatched during the recovery process, also experience a delay for the duration of the recovery time. For example, in our implementation, a recovery time of 20 ms and a failure rate of 1 in every 10K dispatches, for tasks belonging to component 5, introduce an average overhead of 5 percent to the aggregate execution time of component 5 per minute of its execution. The recoverability constraint for dynamic RAS was set to 1. Note that in the case of dynamic RAS, the constraint would be selectively violated only if no task satisfying the constraint was found.
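For concreteness, a fault-injection hook of this kind might look like the following C sketch. The constants mirror the text (scope 5, one failure per 10K dispatches, 20-ms recovery), while the function and field names are our own assumptions rather than the simulator's actual interface.

    /* Deterministic fault injection: fail every Nth dispatch of tasks in
     * the target scope; constants follow the experiment description. */
    #define TARGET_SCOPE     5        /* component-based scope under test */
    #define FAILURE_INTERVAL 10000    /* 1 failure per 10K dispatches     */
    #define RECOVERY_MS      20       /* per-failure recovery time        */

    static long dispatch_count;

    /* Called just before a task body runs; returns 1 to inject a fault. */
    int maybe_inject_fault(const struct task *t)
    {
        if (t->scope_id != TARGET_SCOPE)
            return 0;
        if (++dispatch_count % FAILURE_INTERVAL != 0)
            return 0;
        /* The scheduler's reactive path then delays, by RECOVERY_MS, all
         * tasks of the scope that are running or arrive during recovery. */
        return 1;
    }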

Effect of fine-grained recovery on system performance
In order to infuse recovery awareness into the allocation of resources, we need to keep track of recovery-time dependencies while recognizing that, when these dependencies are tracked at too fine a granularity, the overhead of managing a large number of recovery scopes under normal conditions may result in a severe performance penalty. Therefore, we need to evaluate the performance impact of tracking fine-grained recovery scopes under normal operating conditions.

For these experiments, the tasks were redistributed between recovery scopes based on different granularities of dependency tracking, and each of these scopes was mapped to a recovery group. In effect, the scope-to-group mapping was 1:1 only in the case of 512 recovery groups and many:1 in all other cases. Using this mapping, we study the performance impact of fine-grained recovery under different degrees of multiprocessing and scheduling policies. Specifically, we compare scheduler performance during normal operation and during failure recovery, using the number of task dispatches per unit time as the metric.

Figure 2(A) shows several plots of the average number of dispatches per minute during normal operation as a function of the number of recovery groups. Plots for the dynamic RAS scheme with 2, 4, and 8 cores are shown together with a plot for POS with 8 cores. In the case of POS, the workload is uniformly distributed among the queues. Recall that each recovery group is managed using a separate scheduler queue. With RAS, in all three cases, the number of dispatches initially increases (by as much as 16 percent, 14 percent, and 65 percent in the case of 8, 4, and 2 cores, respectively) and then decreases (by as much as 21 percent, 30 percent, and 45 percent in the case of 8, 4, and 2 cores, respectively). The performance peak is achieved with 64 groups in the case of 8 cores, 32 groups with 4 cores, and 8 groups with 2 cores. This shows that the optimal number of recovery groups depends on the multiprocessing level.

Although RAS initially benefits from the increased concurrency afforded by additional scheduling queues, as the number of queues increases, scheduling efficiency decreases due to the uneven distribution of workload between recovery groups. Depending upon the degree of multiprocessing, beyond a certain granularity, recovery awareness and keeping track of fine-grained recovery scopes may degrade system performance.

Whereas POS also exhibits decreasing efficiency with a large number of queues, the degradation in performance is less steep due to the uniform distribution of workload among queues. Thus, the number of recovery groups chosen and the mapping of tasks to recovery groups should take into consideration the workload distribution among groups and try to achieve load balancing.

In order to emphasize the importance of the right choice of the number of recovery groups, we next compare scheduler performance under dynamic RAS and under a load-balanced, performance-oriented scheduler with varying degrees of multiprocessing. We use 16 recovery groups for the dynamic RAS scheduler and 16 queues, with uniform workload distribution, for the POS scheduler. Figure 2(B) shows the average number of dispatches per minute in both cases with varying degrees of multiprocessing (number of cores). With this hand-picked choice of the number of recovery groups, we see that the system can achieve performance that is very close to that of a performance-oriented architecture, even while tracking recovery dependencies across varying degrees of multiprocessing.

Effect of fine-grained recovery on system availability
The benefit from tracking recovery dependencies is realized during failure recovery. Figure 3(A) compares scheduler performance during normal operation with that during failure recovery for a system with 8 cores. By availability, we refer to service availability and also to the ability of the service to meet performance expectations during failure recovery. We measure this using scheduler performance during failure recovery. Failure was emulated by injecting faults into a chosen component at the rate of once in every 10,000 dispatches of the tasks belonging to that component.

First, the graph shows that, during failure recovery, for a low number of recovery groups (i.e., a coarse granularity of recovery tracking), the benefit from recovery awareness is low, although still higher than in the performance-oriented case. However, at the right granularity, recovery awareness can make a significant improvement in scheduler performance. In this case, at a group size of 16, recovery awareness can effect a 23 percent improvement in scheduler performance.

[Figure 2: (A) Performance versus number of recovery groups; (B) performance versus number of cores (16 groups). Both panels plot average dispatches per minute for dynamic RAS (2, 4, and 8 cores) and for POS.]

Next, consider the group sizes 4 and 32, where POS almost matches the performance of RAS. Figures 3(B) through 3(D) show the number of dispatches per minute over a duration of 30 minutes. The graphs show that even at group sizes of 4 and 32, where POS matches RAS in the average number of task dispatches per minute, POS results in serious fluctuations in scheduler performance. At some instances, the number of dispatches with POS drops to as low as 65 percent of that with RAS. Recall that POS distributes the workload equally among all processors without considering recovery dependencies. Therefore, during failure, many tasks dependent on the failing task may be executing concurrently. As a result, in spite of fine-grained recovery, the entire recovery process takes longer, resulting in a drop in performance due to the unavailability of resources for normally operating tasks. We can argue that even with some inaccuracy in the selection of the number of recovery groups, being able to track recovery dependencies gives a conclusive advantage over POS during failure recovery. Also, in spite of implementing fine-grained recovery, tracking recovery dependencies to improve performance during failure recovery is crucial.

[Figure 3: (A) Performance versus number of recovery groups: good path, bad path (bad-path curves use a recovery time of 100 ms); (B) bad-path performance over time (4 queues); (C) bad-path performance over time (16 queues); (D) bad-path performance over time (32 queues).]

Sensitivity to recovery and failure rate

Figure 4(A) shows the variation of scheduler performance for different recovery rates for tasks belonging to component 5. The failure rate was fixed at 1 in every 10,000 dispatches of tasks belonging to component 5. The figure tells us that the choice of the number of recovery groups is nearly independent of the recovery rate: if x recovery groups is a better choice than y for a certain recovery rate, the same is almost always true for all other recovery rates.

Figure 4(B) shows the variation of scheduler performance with different failure rates. The recovery time for a single failed task was set to 100 ms, and a failure was injected into tasks belonging to component 5. As with the recovery rate, the figure shows that the choice of the number of recovery groups is nearly independent of the failure rate. Also, Figures 4(A) and 4(B) show that system performance is far more sensitive to the recovery rate than to the failure rate. For example, an 80 percent improvement in recovery rate improves system performance by 27 percent on average, whereas a 100 percent improvement in failure rate effects only a 13 percent improvement in performance.

Prototype experiments

By conducting experiments with our prototype implementation using the Cache Standard workload, we observe that, compared to POS, our recovery-aware approach improved system throughput by 16.3 percent and response time by 22.9 percent during failure recovery. The throughput with POS was observed to be 107,000 I/O per second during good-path and 87,800 I/O per second during bad-path operation, and with RAS 105,000 I/O per second during both good-path and bad-path operation. Similarly, the response time with POS was observed to be 13.3 ms during good-path and 16.6 ms during bad-path operation, and with RAS 13.5 ms during both good-path and bad-path operation.

Figure 5(A) shows the average number of task dispatches per minute over 30 minutes with a varying number of recovery groups under the Cache Standard workload. The figure also shows the scheduler performance for the same configuration using the simulation. As the figure shows, the number of dispatches initially increases (although modestly) with the increase in the number of groups. For instance, when the number of groups increases from 1 to 16, the number of dispatches increases by nearly 13 percent (9 percent in the simulation), and from 1 to 4, the number of dispatches increases by 10 percent (8 percent in the simulation). This experiment was used to validate the simulator and to establish the preferred number of recovery groups as 16 for further experimentation with the prototype.

Figure 5(B) shows the average number of task dispatches per minute, per recovery group and in total, under POS and RAS with the Cache Standard workload. Of the 16 recovery groups, only 8 have active tasks. The figure shows the number of dispatches under normal operation (which are nearly identical for POS and RAS) and those under bad-path operation for RAS and POS. Under bad-path operation, the number of dispatches with POS drops by nearly 14.4 percent, while the number of dispatches with RAS drops by only 3 percent, which corresponds to a 16.3 percent improvement in throughput and a 22.9 percent improvement in response time of RAS over POS in the average case. In the worst case, POS may cause complete system unavailability.

[Figure 4: (A) Performance versus number of recovery groups for various recovery rates (20, 40, 80, and 100 ms); (B) performance versus number of recovery groups for various failure rates (1 in 5K, 10K, 15K, and 20K dispatches).]

SUMMARY
While our experiments provide some insights into the selection of parameters such as recovery groups, clearly these decisions are largely impacted by the nature of the software. Below, we present certain guidelines for the selection of these parameters, which must be validated for the particular instance of software and system configuration. A possible procedure for performing this validation is to evaluate the impact of the various parameters using simulation-based studies and workload traces, as shown in this paper.

The number of recovery scopes in the system is a characteristic of the software and the dependencies between tasks. Once the granularity of recovery has been identified and the dependency information has been specified (with explicit dependencies being specified initially and the system identifying implicit dependencies over a certain duration of observation), the recovery scopes are specified. During runtime, tasks are enqueued based on the recovery scope that has been identified for the task. The scheduler efficiency then depends on the number of recovery groups that need to be iterated through at runtime. This choice of recovery groups depends on the degree of multiprocessing (number of cores), and the mapping of recovery scopes to recovery groups depends on the distribution of tasks among recovery scopes. Thus, the guidelines for the selection of recovery-aware parameters can be summarized as follows:

• The optimal number of recovery groups depends on the degree of multiprocessing. However, choosing the number of groups to be larger than the number of cores can help improve performance, for example by reducing contention for job queue locks.
• The choice of the number of recovery groups and the mapping of tasks to recovery groups should take into consideration the workload distribution between the groups and try to achieve load balancing and avoid idle cycling of the scheduler through empty queues looking for work. The information required to perform load balancing can be acquired by studying the workload for the distribution of tasks between recovery scopes and their arrival rates (a load-balancing mapping sketch follows this list).
• Even with some inaccuracy in the selection of the number of recovery groups, RAS retains a conclusive advantage over POS during failure recovery by being able to track recovery dependencies. This gives the developer some flexibility in choosing the number of recovery groups.
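As referenced in the second guideline, one simple way to obtain a load-balanced scope-to-group mapping is greedy bin packing: assign scopes, heaviest first, to the currently least-loaded group. The C sketch below is illustrative; the weights would come from workload traces of per-scope task counts and arrival rates, and the signature is our own.

    /* Greedy load-balanced mapping of recovery scopes to recovery groups.
     * scope_weight[] must be sorted in decreasing order of task weight. */
    void map_scopes_to_groups(const double *scope_weight, int nscopes,
                              double *group_load, int ngroups,
                              int *scope_to_group)
    {
        for (int g = 0; g < ngroups; g++) group_load[g] = 0.0;
        for (int s = 0; s < nscopes; s++) {
            int best = 0;                       /* least-loaded group so far */
            for (int g = 1; g < ngroups; g++)
                if (group_load[g] < group_load[best]) best = g;
            scope_to_group[s] = best;
            group_load[best] += scope_weight[s];
        }
    }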

DISCUSSION

We observe that as the number of recovery scopes in the system increases, for a finer granularity of recovery tasks, the system availability improves, as if a natural resiliency develops in the system. However, the rate at which the resource availability improves tends to decrease as the number of recovery scopes continues to increase. The benefit from recovery groups at a given granularity is determined by the distribution of tasks among recovery groups, the task recovery time, the failure rate, and the degree of multiprocessing. However, the choice of the number of recovery groups mainly depends on the degree of multiprocessing, the distribution of the workload, and the relationship between the failure and the recovery dependency tracked. Depending on these parameters, we can predict the expected recovery time in the event of a failure. By appropriate selection of the number of recovery groups and the recovery scope-to-group mapping, we can derive the maximum benefit from the recovery-aware framework.

[Figure 5: (A) Performance over time, Cache Standard workload (prototype versus simulation for 1, 4, and 16 queues); (B) good-path and bad-path dispatches per recovery group (groups 0 to 6, 9, 15) and in total, for POS and dynamic RAS.]

Effectiveness of recovery-aware scheduling

An important question is whether we can account for environmental problems, such as detecting and avoiding faulty adapters, to avoid meaningless rescheduling. RAS, to some extent, is meant to avoid exactly such circumstances, such as repeatedly sending tasks to a faulty adapter. The concept of recovery groups serializes dependent tasks so that further faulty tasks are not dispatched while recovery is being attempted. Two types of RAS, proactive and reactive, were introduced and discussed in Reference 1. The goal of proactive scheduling is to enhance availability and reduce the impact of failure by bounding the number of outstanding tasks per recovery group, even during normal operation. Reactive scheduling, on the other hand, takes over after a failure has occurred: it suspends the dispatch of tasks belonging to the group undergoing recovery until localized recovery is completed. If a problematic adapter results in the failure of a task belonging to a recovery group, the dispatch of tasks from that recovery group is suspended until recovery is completed. Instead, the resources belonging to that recovery group may be used for recovery, or be utilized by other failure-free recovery groups.
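A compact way to see the two policies side by side is the following C sketch; the structure, the names (may_dispatch, on_task_failure), and the bound MAX_OUTSTANDING are illustrative assumptions, not the controller's actual interfaces.

#include <stdbool.h>

#define NUM_GROUPS      8
#define MAX_OUTSTANDING 32      /* proactive per-group bound (assumed) */

struct group_state {
    int  outstanding;           /* tasks dispatched but not completed */
    bool in_recovery;           /* reactive flag: group is recovering */
};

static struct group_state groups[NUM_GROUPS];

/* Gate checked before every dispatch. Proactive RAS caps the
 * outstanding tasks per group even on the good path, so a later
 * failure has a bounded amount of in-flight work behind it;
 * reactive RAS suspends the group outright once it is recovering. */
static bool may_dispatch(int g)
{
    if (groups[g].in_recovery)
        return false;                    /* reactive: group suspended */
    return groups[g].outstanding < MAX_OUTSTANDING;  /* proactive cap */
}

/* A task in group g failed, e.g. against a problematic adapter:
 * stop feeding the faulty path; other groups keep the cores busy. */
static void on_task_failure(int g)
{
    groups[g].in_recovery = true;
    /* ... launch localized recovery for group g here ... */
}

static void on_recovery_complete(int g)
{
    groups[g].in_recovery = false;
    groups[g].outstanding = 0;           /* failed work was drained */
}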

Specifying the recovery handler at program development time

There are good reasons to require that the recovery handler be specified at program development time. First, we acknowledge that writing error-recovery code is a complex task. As a first step, our framework provides guidance for identifying dependencies and also ensures that recovery handlers are nonintrusive and have minimal impact on good-path execution. We are working on machine-learning techniques to identify and incorporate micro-feedback into the identification of failure scenarios and recovery strategies.

Second, our analysis shows that due to the complexity of the system, not all failures can be recovered using fine-grained recovery. The recovery strategies are often determined by the semantics of the failure and the nature of the tasks that encountered the failure. For example, a straightforward recovery strategy for a failure during a noncritical background task may be simply to ignore the failure, whereas the strategy may be different for a critical task that requires on-time completion. Such task-specific, semantics-based recovery handlers are best defined by the developers of these tasks. Because of the inherent complexity of the system and the semantics involved, a fully automated approach to determining the recovery strategies, without programmer assistance, is difficult and less effective. In the event that the developer specifies an ineffective recovery strategy, such that the problem causing the failure is not resolved, we recommend setting a recovery threshold that specifies the number of times that micro-recovery should be attempted before falling back to system-level recovery. If the failure is not resolved by the micro-recovery mechanism and the failure threshold has been reached, then system-level recovery is performed.
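The threshold mechanism reduces to a small retry loop. In the C sketch below, run_micro_recovery_handler stands in for the developer-specified handler and run_system_level_recovery for the existing coarse-grained path; both names, and the limit of three attempts, are hypothetical.

#include <stdbool.h>

#define MICRO_RETRY_LIMIT 3     /* recovery threshold (assumed value) */

struct failure;                 /* opaque failure descriptor */

/* Developer-specified micro-recovery handler; returns true if the
 * problem causing the failure was actually resolved. (Hypothetical.) */
bool run_micro_recovery_handler(struct failure *f);

/* Existing coarse-grained recovery path. (Hypothetical.) */
void run_system_level_recovery(void);

/* Try micro-recovery up to the threshold, then escalate. */
static void recover(struct failure *f)
{
    for (int attempt = 0; attempt < MICRO_RETRY_LIMIT; attempt++) {
        if (run_micro_recovery_handler(f))
            return;              /* localized recovery succeeded */
    }
    run_system_level_recovery(); /* ineffective handler: fall back */
}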

Finally, we note that, in the case of identifying recovery dependencies, our approach combines programmer-specified information with system-initiated learning through logs. Whereas the programmer is given a tool to specify dependencies based on experience, the system uses the programmer-specified dependencies as a starting point and continues to refine them throughout the life of the system. In other words, the recovery-aware framework does not rely on the completeness or the correctness of developer-specified dependencies. One of our ongoing projects is to develop an access-log-based architecture that uses the information provided by lock accesses as a guideline to understanding system state changes. Based on such interactions between concurrent tasks, the architecture dynamically identifies dependencies at runtime and alerts the developer to events such as "dirty reads" of shared state (i.e., reading data that has been modified by another transaction but not yet committed). This log-based architecture will relieve the developer of the burden of tracking resources such as shared buffers and locks and of tracking read-write conflicts on shared state.
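To make the "dirty read" alert concrete, here is a deliberately minimal C sketch of the idea: lock accesses update a last-writer record per shared object, and a read of an uncommitted write is flagged. Because the access-log architecture is still under development, everything below (the struct and the three hooks) is our illustration only.

#include <stdbool.h>
#include <stdio.h>

/* Per-shared-object record derived from lock-access logging. */
struct shared_obj {
    int  last_writer;   /* task id of most recent writer, -1 if none */
    bool committed;     /* has that writer's change been committed?  */
};

/* Hook called when the log shows task `reader` reading `o`:
 * reading state modified by another task that has not yet
 * committed is the "dirty read" we want to surface. */
static void log_read(struct shared_obj *o, int reader)
{
    if (o->last_writer >= 0 && o->last_writer != reader && !o->committed)
        fprintf(stderr,
                "dirty read: task %d read state uncommitted by task %d\n",
                reader, o->last_writer);
}

/* Hook called when the log shows task `writer` modifying `o`. */
static void log_write(struct shared_obj *o, int writer)
{
    o->last_writer = writer;
    o->committed = false;
}

/* Hook called when `o`'s writer commits its change. */
static void log_commit(struct shared_obj *o)
{
    o->committed = true;
}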

CONCLUSION

In this paper, we have presented an extension to the

recovery-aware framework in Reference 1 and

applied it to the fine-grained recovery of enterprise

storage systems with multi-core processors. We

introduced the concept of recovery scope and used it

to partition the set of runtime tasks into groups of

interdependent tasks. We then demonstrated how

the mapping of recovery scopes to recovery groups

can be used in the scheduling of tasks during recovery from faults.

We focused on developing effective mappings of

dependent tasks to processor resources through

careful tuning of recovery-sensitive parameters. We

presented a formal model to capture the mapping of

recovery scopes to recovery groups, which can be

used in the scheduling of tasks during recovery.

We have implemented our proposed recovery-aware

framework by modifying the controller software in

an enterprise storage system. Through our analysis

and experimentation, we have shown that, with careful tuning of the system configuration and the recovery-sensitive parameters, it is possible to improve system performance during recovery and thus to improve the system's resiliency to faults.

ACKNOWLEDGMENTS

This work was partially funded by an NSF CISE grant

and an IBM SUR grant. We acknowledge with

gratitude Cornel Constantinescu, Subashini

Balachandran, Clem Dickey, Paul Muench, David

Whitworth, Andrew Lin, Juan Ruiz (J. J.), Brian

Hatfield, Chiahong Chen, and Joseph Hyde for

helping us perform experimental evaluations and

interpret the data. We thank K. K. Rao, David

Chambliss, Brian Henderson, and the other members

of the Storage Systems group at the IBM Almaden

Research Center for valuable feedback and for

providing the resources to perform our experiments.

We also thank Prof. Karsten Schwan for valuable

insights. Finally, we thank the anonymous referees

for their comments and the editorial staff for helping

us to improve the quality of our paper.

*Trademark, service mark, or registered trademark of International Business Machines Corporation in the United States, other countries, or both.

REFERENCES

1. S. Seshadri, L. Chiu, C. Constantinescu, S. Balachandran, C. Dickey, L. Liu, and P. Muench, "Enhancing Storage System Availability on Multi-core Architectures with Recovery Conscious Scheduling," Proceedings of the Sixth USENIX Conference on File and Storage Technologies (FAST 2008), San Jose, CA (February 26–29, 2008), pp. 143–158.

2. J. Gray, "Why Do Computers Stop and What Can Be Done About It?" Proceedings of the Fifth Symposium on Reliability in Distributed Software and Database Systems, Los Angeles, IEEE Computer Society Press (January 13–15, 1986), pp. 3–12.

3. G. Candea, S. Kawamoto, Y. Fujiki, G. Friedman, and A. Fox, "Microreboot—A Technique for Cheap Recovery," Proceedings of the Sixth Symposium on Operating Systems Design and Implementation (OSDI '04), San Francisco, USENIX Association (December 6–8, 2004), pp. 31–44.

4. Y. Huang, C. Kintala, N. Kolettis, and N. D. Fulton, "Software Rejuvenation: Analysis, Module and Applications," Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing (FTCS-25), Pasadena, IEEE Computer Society (June 27–30, 1995), pp. 381–390.

5. S. Garg, A. Puliafito, M. Telek, and K. S. Trivedi, "On the Analysis of Software Rejuvenation Policies," Proceedings of the Twelfth Annual Conference on Computer Assurance (COMPASS '97), Gaithersburg, MD (June 18–20, 1997), pp. 88–96.

6. F. Qin, J. Tucek, J. Sundaresan, and Y. Zhou, "Rx: Treating Bugs as Allergies—A Safe Method to Survive Software Failure," Proceedings of the 20th ACM Symposium on Operating Systems Principles (SOSP 2005), Brighton, UK (October 23–26, 2005), ACM, New York, pp. 235–248.

7. J. Gray and A. Reuter, Transaction Processing: Concepts and Techniques (The Morgan Kaufmann Series in Data Management Systems), Morgan Kaufmann, Burlington, MA (1992).

8. S. Sidiroglou, O. Laadan, A. D. Keromytis, and J. Nieh, "Using Rescue Points to Navigate Software Recovery," Proceedings of the 2007 IEEE Symposium on Security and Privacy (SP '07), Oakland, CA, IEEE (May 20–23, 2007), pp. 273–280.

9. B. Randell, "System Structure for Software Fault Tolerance," IEEE Transactions on Software Engineering 1, No. 2, 221–232 (1975).

10. M. Rinard, C. Cadar, D. Dumitran, D. M. Roy, T. Leu, and W. S. Beebee Jr., "Enhancing Server Availability and Security Through Failure-Oblivious Computing," Proceedings of the Sixth Symposium on Operating Systems Design and Implementation (OSDI '04), San Francisco, USENIX Association (December 6–8, 2004), pp. 303–316.

11. C. Mohan, D. Haderle, B. Lindsay, H. Pirahesh, and P. Schwarz, "ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging," ACM Transactions on Database Systems 17, No. 1, 94–162 (1992).

12. M. Karlsson, C. Karamanolis, and X. Zhu, "Triage: Performance Differentiation for Storage Systems Using Adaptive Control," ACM Transactions on Storage 1, No. 4, 457–480 (2005).

13. A. Gulati, A. Merchant, and P. J. Varman, "pClock: An Arrival Curve Based Approach for QoS Guarantees in Shared Storage Systems," Proceedings of the 2007 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS 2007), San Diego, CA, ACM (June 12–16, 2007), pp. 13–24.

14. B. Jansen, H. V. Ramasamy, M. Schunter, and A. Tanner, "Architecting Dependable and Secure Systems Using Virtualization," Proceedings of the Workshop on Software Architecting for Dependable Systems (WADS 2007), Edinburgh, Lecture Notes in Computer Science 5135, Springer (June 27, 2007), pp. 124–149.

15. M. Sivathanu, V. Prabhakaran, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau, "Improving Storage System Availability with D-GRAID," ACM Transactions on Storage 1, No. 2, 133–170 (2005).

16. K. S. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications, Prentice Hall PTR, Upper Saddle River, NJ (2004).

17. S. Stidham Jr., "A Last Word on L = λW," Operations Research 22, No. 2, 417–421 (1974).

18. IBM z/Architecture Principles of Operation, SA22-7832-06, IBM Corporation, 2008, http://publibz.boulder.ibm.com/epubs/pdf/dz9zr006.pdf.

19. L. LaFrese, "IBM TotalStorage Enterprise Storage Server Model 800: New features in LIC Level 2.3.0," Performance White Paper, IBM Corporation (2003), http://www.ibm.com/systems/uk/resources/systems_uk_storage_disk_ess_performance.pdf.

Accepted for publication Sept. 17, 2008.

Sangeetha Seshadri
Georgia Institute of Technology, 266 Ferst Dr., Atlanta, GA 30332-0765 ([email protected]). Ms. Seshadri received a B.E. degree in computer science and an M.Sc. degree in mathematics from the Birla Institute of Technology and Science, Pilani, India, in 2002. She is currently working toward a Ph.D. degree at the College of Computing, Georgia Institute of Technology, under the guidance of Prof. Ling Liu. Her research interests include large-scale storage systems and services and distributed middleware overlay systems, as well as techniques and architectures for improving the availability, scalability, and performance of such systems.

Ling Liu
Georgia Institute of Technology, 266 Ferst Dr., Atlanta, GA 30332-0765 ([email protected]). Dr. Liu, an Associate Professor in the College of Computing at Georgia Institute of Technology, directs the research programs in the Distributed Data Intensive Systems Lab. The research covers various aspects of data-intensive systems, ranging from mobile computing and event stream processing to Internet data management, storage systems, and service-oriented architectures. It is currently directed toward building large-scale Internet systems and services, with a focus on performance, security, privacy, and energy efficiency. Her research group has produced a number of open source software systems, among which the most popular ones are WebCQ, XWRAPElite, and PeerCrawl. She has published more than 200 journal and conference articles and is currently on the editorial board of several journals, including IEEE Transactions on Service Computing, IEEE Transactions on Knowledge and Data Engineering, International Journal of Very Large Database Systems, International Journal of Peer-to-Peer Networking and Applications, International Journal of Web Services Research, and Wireless Network Journal. She is a recipient of the best paper award of ICDCS 2003, the best paper award of WWW 2004, the 2005 Pat Goldberg Memorial Best Paper Award from IBM Research, the best data engineering paper award of the International Conference on Software Engineering and Data Engineering 2008, and IBM faculty awards in 2003 and 2006 through 2008. Her research is sponsored primarily by the National Science Foundation, DARPA, the Air Force Office of Scientific Research, and IBM.

Lawrence Chiu
IBM Almaden Research Center, 650 Harry Rd., San Jose, CA 95120 ([email protected]). Mr. Chiu received his M.S. degree in computer engineering from the University of Southern California in 1991 and his M.S. degree in technology commercialization from the McCombs School of Business, University of Texas, Austin, in 2003. From 2000 to 2003, he initiated and managed the effort to develop and deliver the first IBM storage virtualization product, SAN Volume Controller. He currently manages the Scalable Storage Systems group at the Almaden Research Center, focusing on highly scalable and highly available enterprise storage systems and solid-state disk architecture. &
